DOT-K: A distributed online top-K elements algorithm using extreme value statistics

Nicholas Carey, Tamás Budavári, Yanif Ahmad, Alexander Szalay

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Extremely large (peta-scale) data collections are generally partitioned into millions of containers (disks/volumes/files) which are essentially unmovable due to their aggregate size. They are stored over a large distributed cloud of machines, with computing co-located with the data. Given this data layout, even simple tasks are difficult to perform and naive algorithms can easily become quite expensive. We present a one pass, communications-efficient technique useful for both estimating upper order quantiles and selecting the largest k elements across a highly distributed dataset or stream. Our novel approach draws its foundations from Extreme Value Statistics (EVS) to reason about the statistical relationships between the tail distributions of dataset partitions. The tail of each partition is fitted by the Generalized Pareto Distribution, which captures threshold exceedances. The obtained parameters are communicated to a central coordinator and used to estimate quantiles, or solve for a threshold above which there are approximately k elements. We discuss the computational and bandwidth costs of the algorithm, and demonstrate the accuracy of the method on both a variety of synthetic datasets and a PageRank dataset.

Original languageEnglish (US)
Title of host publicationProceedings of the 2016 IEEE 12th International Conference on e-Science, e-Science 2016
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages130-136
Number of pages7
ISBN (Electronic)9781509042722
DOIs
StatePublished - Mar 3 2017
Event12th IEEE International Conference on e-Science, e-Science 2016 - Baltimore, United States
Duration: Oct 23 2016Oct 27 2016

Publication series

NameProceedings of the 2016 IEEE 12th International Conference on e-Science, e-Science 2016

Conference

Conference12th IEEE International Conference on e-Science, e-Science 2016
Country/TerritoryUnited States
CityBaltimore
Period10/23/1610/27/16

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Environmental Science (miscellaneous)
  • Medicine (miscellaneous)
  • Social Sciences (miscellaneous)
  • Agricultural and Biological Sciences (miscellaneous)
  • Computer Science Applications

Fingerprint

Dive into the research topics of 'DOT-K: A distributed online top-K elements algorithm using extreme value statistics'. Together they form a unique fingerprint.

Cite this