Sapling: Accelerating suffix array queries with learned data models

Melanie Kirsche, Arun Das, Michael C. Schatz

Research output: Contribution to journalArticlepeer-review

Abstract

Motivation: As genomic data becomes more abundant, efficient algorithms and data structures for sequence alignment become increasingly important. The suffix array is a widely used data structure to accelerate alignment, but the binary search algorithm used to query, it requires widespread memory accesses, causing a large number of cache misses on large datasets. Results: Here, we present Sapling, an algorithm for sequence alignment, which uses a learned data model to augment the suffix array and enable faster queries. We investigate different types of data models, providing an analysis of different neural network models as well as providing an open-source aligner with a compact, practical piecewise linear model. We show that Sapling outperforms both an optimized binary search approach and multiple widely used read aligners on a diverse collection of genomes, including human, bacteria and plants, speeding up the algorithm by more than a factor of two while adding <1% to the suffix array's memory footprint.

Original languageEnglish (US)
Pages (from-to)744-749
Number of pages6
JournalBioinformatics
Volume37
Issue number6
DOIs
StatePublished - Mar 15 2021
Externally publishedYes

ASJC Scopus subject areas

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

Fingerprint

Dive into the research topics of 'Sapling: Accelerating suffix array queries with learned data models'. Together they form a unique fingerprint.

Cite this