TY - GEN
T1 - DOT-K
T2 - 12th IEEE International Conference on e-Science, e-Science 2016
AU - Carey, Nicholas
AU - Budavári, Tamás
AU - Ahmad, Yanif
AU - Szalay, Alexander
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2017/3/3
Y1 - 2017/3/3
N2 - Extremely large (peta-scale) data collections are generally partitioned into millions of containers (disks/volumes/files) which are essentially unmovable due to their aggregate size. They are stored over a large distributed cloud of machines, with computing co-located with the data. Given this data layout, even simple tasks are difficult to perform and naive algorithms can easily become quite expensive. We present a one pass, communications-efficient technique useful for both estimating upper order quantiles and selecting the largest k elements across a highly distributed dataset or stream. Our novel approach draws its foundations from Extreme Value Statistics (EVS) to reason about the statistical relationships between the tail distributions of dataset partitions. The tail of each partition is fitted by the Generalized Pareto Distribution, which captures threshold exceedances. The obtained parameters are communicated to a central coordinator and used to estimate quantiles, or solve for a threshold above which there are approximately k elements. We discuss the computational and bandwidth costs of the algorithm, and demonstrate the accuracy of the method on both a variety of synthetic datasets and a PageRank dataset.
AB - Extremely large (peta-scale) data collections are generally partitioned into millions of containers (disks/volumes/files) which are essentially unmovable due to their aggregate size. They are stored over a large distributed cloud of machines, with computing co-located with the data. Given this data layout, even simple tasks are difficult to perform and naive algorithms can easily become quite expensive. We present a one pass, communications-efficient technique useful for both estimating upper order quantiles and selecting the largest k elements across a highly distributed dataset or stream. Our novel approach draws its foundations from Extreme Value Statistics (EVS) to reason about the statistical relationships between the tail distributions of dataset partitions. The tail of each partition is fitted by the Generalized Pareto Distribution, which captures threshold exceedances. The obtained parameters are communicated to a central coordinator and used to estimate quantiles, or solve for a threshold above which there are approximately k elements. We discuss the computational and bandwidth costs of the algorithm, and demonstrate the accuracy of the method on both a variety of synthetic datasets and a PageRank dataset.
UR - http://www.scopus.com/inward/record.url?scp=85016734899&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85016734899&partnerID=8YFLogxK
U2 - 10.1109/eScience.2016.7870893
DO - 10.1109/eScience.2016.7870893
M3 - Conference contribution
AN - SCOPUS:85016734899
T3 - Proceedings of the 2016 IEEE 12th International Conference on e-Science, e-Science 2016
SP - 130
EP - 136
BT - Proceedings of the 2016 IEEE 12th International Conference on e-Science, e-Science 2016
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 23 October 2016 through 27 October 2016
ER -