Fast and memory-efficient scRNA-seq k-means clustering with various distances

Daniel N. Baker; Nathan Dyjack; Vladimir Braverman; Stephanie C. Hicks; Benjamin Langmead

doi:10.1145/3459930.3469523

Bloomberg School of Public Health

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with the sparse count data that emerge from typical scRNA-seq experiments, as well as with dense data produced by dimensionality reduction. Minicore's novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, and it also supports a wider class of measures, such as Jensen-Shannon divergence, Kullback-Leibler divergence, and the Bhattacharyya distance, which can be applied directly to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among them. We show that a minicore pipeline consisting of k-means++, localsearch++ and mini-batch k-means can cluster a 4-million cell dataset in minutes, using less than 10 GiB of RAM. This memory efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels. Availability: The open source library is at https://github.com/dnbaker/minicore. Code used for experiments is at https://github.com/dnbaker/minicore-experiments.
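To make the idea of k-means++ center finding under different distance measures concrete, the following is a minimal sketch in plain NumPy. It is not minicore's API and does not reproduce its sparse-matrix handling, SIMD vectorization, or weighted reservoir sampling. The function names, the dense input, and the pseudocount value used as a prior are all assumptions made for illustration.

```python
import numpy as np


def normalize(counts, prior=1e-3):
    """Convert count rows to probability rows.

    The pseudocount `prior` is an illustrative assumption; it keeps every
    entry positive so the logarithms below are always defined.
    """
    x = np.asarray(counts, dtype=np.float64) + prior
    return x / x.sum(axis=-1, keepdims=True)


def squared_euclidean(p, q):
    return np.sum((p - q) ** 2, axis=-1)


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for strictly positive rows."""
    return np.sum(p * np.log(p / q), axis=-1)


def jensen_shannon_divergence(p, q):
    """Symmetrized, bounded relative of KL; at most ln(2) in nats."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def bhattacharyya_distance(p, q):
    """Negative log of the Bhattacharyya coefficient sum(sqrt(p * q))."""
    bc = np.sum(np.sqrt(p * q), axis=-1)
    return -np.log(np.clip(bc, 1e-300, 1.0))


def kmeanspp_seed(points, k, distance, rng):
    """Pick k row indices of `points` by k-means++-style seeding.

    Each new center is sampled with probability proportional to a point's
    current cost, i.e. its dissimilarity to the nearest center chosen so far.
    Returns the chosen indices and the final per-point costs.
    """
    n = points.shape[0]
    first = int(rng.integers(n))
    centers = [first]
    # Clamp at zero to absorb tiny negative values from floating-point rounding.
    cost = np.maximum(distance(points, points[first]), 0.0)
    for _ in range(1, k):
        total = cost.sum()
        # Fall back to uniform sampling if every point already has zero cost.
        probs = cost / total if total > 0 else np.full(n, 1.0 / n)
        nxt = int(rng.choice(n, p=probs))
        centers.append(nxt)
        new_cost = np.maximum(distance(points, points[nxt]), 0.0)
        cost = np.minimum(cost, new_cost)
    return np.array(centers), cost
```

Under these assumptions, a call such as `kmeanspp_seed(normalize(counts), 10, jensen_shannon_divergence, np.random.default_rng(0))` returns ten seed indices and the per-cell cost under that measure. Swapping the `distance` argument is the only change needed to compare measures, which mirrors the abstract's point that the choice of measure is a parameter of the clustering rather than a separate algorithm.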

Original language	English (US)
Title of host publication	Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021
Publisher	Association for Computing Machinery, Inc
ISBN (Electronic)	9781450384506
DOIs	https://doi.org/10.1145/3459930.3469523
State	Published - Jan 18 2021
Event	12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021 - Virtual, Online, United States Duration: Aug 1 2021 → Aug 4 2021

Publication series

Name	Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021

Conference

Conference	12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021
Country/Territory	United States
City	Virtual, Online
Period	8/1/21 → 8/4/21

Keywords

SIMD
clustering
importance sampling
single cell

ASJC Scopus subject areas

Software
Health Informatics
Biomedical Engineering
Computer Science Applications

Cite this

Baker, D. N., Dyjack, N., Braverman, V., Hicks, S. C., & Langmead, B. (2021). Fast and memory-efficient scRNA-seq k-means clustering with various distances. In Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021 (Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021). Association for Computing Machinery, Inc. https://doi.org/10.1145/3459930.3469523

@inproceedings{a0198bbc742e41e896d4472b7b7b7f56,

title = "Fast and memory-efficient scRNA-seq k-means clustering with various distances",

keywords = "SIMD, clustering, importance sampling, single cell",

author = "Baker, {Daniel N.} and Nathan Dyjack and Vladimir Braverman and Hicks, {Stephanie C.} and Benjamin Langmead",

note = "Funding Information: DNB and BL were supported by NIH/NIGMS grants R01GM118568 and R35GM139602 to BL. SCH and ND were supported by NIH/NHGRI R00HG009007 to SCH. This work was also supported by CZF2019-002443 (SCH) from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation. Publisher Copyright: {\textcopyright} 2021 Owner/Author.; 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021 ; Conference date: 01-08-2021 Through 04-08-2021",

year = "2021",

month = jan,

day = "18",

doi = "10.1145/3459930.3469523",

language = "English (US)",

series = "Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021",

publisher = "Association for Computing Machinery, Inc",

booktitle = "Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021",

}

TY - GEN

T1 - Fast and memory-efficient scRNA-seq k-means clustering with various distances

AU - Baker, Daniel N.

AU - Dyjack, Nathan

AU - Braverman, Vladimir

AU - Hicks, Stephanie C.

AU - Langmead, Benjamin

N1 - Funding Information: DNB and BL were supported by NIH/NIGMS grants R01GM118568 and R35GM139602 to BL. SCH and ND were supported by NIH/NHGRI R00HG009007 to SCH. This work was also supported by CZF2019-002443 (SCH) from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation. Publisher Copyright: © 2021 Owner/Author.

PY - 2021/1/18

Y1 - 2021/1/18

KW - SIMD

KW - clustering

KW - importance sampling

KW - single cell

UR - http://www.scopus.com/inward/record.url?scp=85112395227&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85112395227&partnerID=8YFLogxK

U2 - 10.1145/3459930.3469523

DO - 10.1145/3459930.3469523

M3 - Conference contribution

C2 - 34778889

AN - SCOPUS:85112395227

T3 - Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021

BT - Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021

PB - Association for Computing Machinery, Inc

T2 - 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021

Y2 - 1 August 2021 through 4 August 2021

ER -
