TY - JOUR
T1 - mbkmeans
T2 - Fast clustering for single cell data using mini-batch k-means
AU - Hicks, Stephanie C.
AU - Liu, Ruoxi
AU - Ni, Yuwei
AU - Purdom, Elizabeth
AU - Risso, Davide
N1 - Funding Information:
This work has been supported by the National Institutes of Health grant R00HG009007 to SCH and by the NIH BRAIN Initiative grant U19MH114830 (EP). This work was also supported by DAF2018-183201 (SCH, RL, YN, EP, DR) and CZF2019-002443 (SCH, RL, DR) from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation. EP was supported by an ENS-CFM Data Science Chair. DR was supported by Programma per Giovani Ricercatori Rita Levi Montalcini granted by the Italian Ministry of Education, University, and Research. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Publisher Copyright:
© 2021 Hicks et al.
PY - 2021/1/26
Y1 - 2021/1/26
N2 - Single-cell RNA-Sequencing (scRNA-seq) is the most widely used high-throughput technology to measure genome-wide gene expression at the single-cell level. One of the most common analyses of scRNA-seq data detects distinct subpopulations of cells through the use of unsupervised clustering algorithms. However, recent advances in scRNA-seq technologies result in current datasets ranging from thousands to millions of cells. Popular clustering algorithms, such as k-means, typically require the data to be loaded entirely into memory and therefore can be slow or impossible to run with large datasets. To address this problem, we developed the mbkmeans R/Bioconductor package, an open-source implementation of the mini-batch k-means algorithm. Our package allows for on-disk data representations, such as the common HDF5 file format widely used for single-cell data, that do not require all the data to be loaded into memory at one time. We demonstrate the performance of the mbkmeans package using large datasets, including one with 1.3 million cells. We also highlight and compare the computing performance of mbkmeans against the standard implementation of k-means and other popular single-cell clustering methods. Our software package is available in Bioconductor at https://bioconductor.org/packages/mbkmeans.
AB - Single-cell RNA-Sequencing (scRNA-seq) is the most widely used high-throughput technology to measure genome-wide gene expression at the single-cell level. One of the most common analyses of scRNA-seq data detects distinct subpopulations of cells through the use of unsupervised clustering algorithms. However, recent advances in scRNA-seq technologies result in current datasets ranging from thousands to millions of cells. Popular clustering algorithms, such as k-means, typically require the data to be loaded entirely into memory and therefore can be slow or impossible to run with large datasets. To address this problem, we developed the mbkmeans R/Bioconductor package, an open-source implementation of the mini-batch k-means algorithm. Our package allows for on-disk data representations, such as the common HDF5 file format widely used for single-cell data, that do not require all the data to be loaded into memory at one time. We demonstrate the performance of the mbkmeans package using large datasets, including one with 1.3 million cells. We also highlight and compare the computing performance of mbkmeans against the standard implementation of k-means and other popular single-cell clustering methods. Our software package is available in Bioconductor at https://bioconductor.org/packages/mbkmeans.
UR - http://www.scopus.com/inward/record.url?scp=85101135232&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85101135232&partnerID=8YFLogxK
U2 - 10.1371/JOURNAL.PCBI.1008625
DO - 10.1371/JOURNAL.PCBI.1008625
M3 - Article
C2 - 33497379
AN - SCOPUS:85101135232
SN - 1553-734X
VL - 17
JO - PLoS Computational Biology
JF - PLoS Computational Biology
IS - 1
M1 - e1008625
ER -