TY - GEN
T1 - Rapid parallel genome indexing with MapReduce
AU - Menon, Rohith K.
AU - Bhat, Goutham P.
AU - Schatz, Michael C.
PY - 2011
Y1 - 2011
N2 - Sequence alignment is one of the most important applications in computational biology, and is used for such diverse tasks as identifying homologous proteins, analyzing gene expression, mapping variations between individuals, or assembling de novo the genome of organism. Except for trivial cases involving just a small number of short sequences, virtually all other sequence alignment tasks rely on a precomputed index of the sequence to accelerate the alignment. Two of the most important index structures are the suffix array, which consists of the lexicographically sorted list of suffixes of a genome, and the closely related Burrows-Wheeler Transform (BWT), which is a permutation of the genome based on the suffix array. Constructing these structures on large sequences, such as the human genome, requires several hours of serial computation and must be performed for each genome, or genome assembly, to be analyzed. Here we present a novel parallel algorithm for constructing the suffix array and the BWT of a sequence leveraging the unique features of the MapReduce parallel programming model. We demonstrate the performance of the algorithm by greatly accelerating suffix array and BWT construction on five significant genomes using as many as 120 cores leased from the Amazon Elastic Compute Cloud (EC2), reducing the end-to-end runtime from hours to mere minutes. The source code is available under an open source GPL License at: http://code.google.com/p/genome-indexing/
AB - Sequence alignment is one of the most important applications in computational biology, and is used for such diverse tasks as identifying homologous proteins, analyzing gene expression, mapping variations between individuals, or assembling de novo the genome of organism. Except for trivial cases involving just a small number of short sequences, virtually all other sequence alignment tasks rely on a precomputed index of the sequence to accelerate the alignment. Two of the most important index structures are the suffix array, which consists of the lexicographically sorted list of suffixes of a genome, and the closely related Burrows-Wheeler Transform (BWT), which is a permutation of the genome based on the suffix array. Constructing these structures on large sequences, such as the human genome, requires several hours of serial computation and must be performed for each genome, or genome assembly, to be analyzed. Here we present a novel parallel algorithm for constructing the suffix array and the BWT of a sequence leveraging the unique features of the MapReduce parallel programming model. We demonstrate the performance of the algorithm by greatly accelerating suffix array and BWT construction on five significant genomes using as many as 120 cores leased from the Amazon Elastic Compute Cloud (EC2), reducing the end-to-end runtime from hours to mere minutes. The source code is available under an open source GPL License at: http://code.google.com/p/genome-indexing/
KW - burrows-wheeler transform
KW - cloud computing
KW - DNA sequence analysis
KW - MapReduce
KW - parallel suffix array construction
UR - http://www.scopus.com/inward/record.url?scp=79961053764&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79961053764&partnerID=8YFLogxK
U2 - 10.1145/1996092.1996104
DO - 10.1145/1996092.1996104
M3 - Conference contribution
AN - SCOPUS:79961053764
SN - 9781450307000
T3 - MapReduce'11 - Proceedings of the 2nd International Workshop on MapReduce and Its Applications
SP - 51
EP - 58
BT - MapReduce'11 - Proceedings of the 2nd International Workshop on MapReduce and Its Applications
T2 - 2nd International Workshop on MapReduce and Its Applications, MapReduce'11, Co-located with 20th International ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC 2011
Y2 - 8 June 2011 through 8 June 2011
ER -