TY - GEN
T1 - Just-in-time analytics on large file systems
AU - Huang, H. Howie
AU - Zhang, Nan
AU - Wang, Wei
AU - Das, Gautam
AU - Szalay, Alexander S.
N1 - Funding Information:
We thank the anonymous reviewers and our shepherd John Bent for their excellent comments that helped improve the quality of this paper. We also thank Hong Jiang and Yifeng Zhu for their help on replaying the NFS trace, and Ron Chiang for his help on the artwork. This work was supported by the NSF grants OCI-0937875, OCI-0937947, IIS-0845644, CCF-0852674, CNS-0852673, and CNS-0915834.
PY - 2011
Y1 - 2011
AB - As file systems reach the petabytes scale, users and administrators are increasingly interested in acquiring high-level analytical information for file management and analysis. Two particularly important tasks are the processing of aggregate and top-k queries which, unfortunately, cannot be quickly answered by hierarchical file systems such as ext3 and NTFS. Existing pre-processing based solutions, e.g., file system crawling and index building, consume a significant amount of time and space (for generating and maintaining the indexes) which in many cases cannot be justified by the infrequent usage of such solutions. In this paper, we advocate that user interests can often be sufficiently satisfied by approximate - i.e., statistically accurate - answers. We develop Glance, a just-in-time sampling-based system which, after consuming a small number of disk accesses, is capable of producing extremely accurate answers for a broad class of aggregate and top-k queries over a file system without the requirement of any prior knowledge. We use a number of real-world file systems to demonstrate the efficiency, accuracy and scalability of Glance.
UR - http://www.scopus.com/inward/record.url?scp=85077077980&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85077077980&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85077077980
T3 - Proceedings of FAST 2011: 9th USENIX Conference on File and Storage Technologies
SP - 217
EP - 230
BT - Proceedings of FAST 2011
PB - USENIX Association
T2 - 9th USENIX Conference on File and Storage Technologies, FAST 2011
Y2 - 15 February 2011 through 17 February 2011
ER -