Traditional file systems, such as NTFS and ext3, often struggle to support complex analytics due to limited metadata and the absence of global views. Performing aggregate or top-k queries typically requires exhaustive scanning or pre-built indexes, which are time-consuming and resource-intensive. These limitations hinder efficient data management and analysis, especially when dealing with large-scale or unfamiliar file systems.
GW researchers have developed a just-in-time, sampling-based system that enables efficient and accurate analytics over large file systems without prior knowledge of their contents or extensive pre-processing. Using only a small number of disk accesses, the system runs two algorithms, FS_Agg for aggregate queries and FS_TopK for top-k queries, to produce statistically accurate estimates. The approach is file-system agnostic, scales to billions of files, and requires no disk crawling or index building, offering a practical solution for real-time analytics in dynamic and expansive data environments.
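To illustrate the idea behind sampling-based aggregate estimation, the sketch below shows a generic "random descent" estimator over a directory tree: each descent walks from the root to a leaf, choosing one subdirectory uniformly at random at each level, and weights the files it encounters by the inverse probability of reaching them. This is a minimal, hypothetical illustration of the general technique, not the actual FS_Agg implementation; the in-memory tree structure and function names are assumptions, and a real system would issue disk accesses against a live file system.

```python
import random

# Hypothetical in-memory directory tree (a stand-in for a real file
# system): each node records how many files it directly contains
# ("files") and maps subdirectory names to child nodes ("dirs").

def random_descent(tree):
    """One random root-to-leaf descent.

    At each directory visited, the files found there are weighted by
    the inverse of the probability of having reached that directory,
    so the returned value is an unbiased estimate of the total file
    count (a Horvitz-Thompson style estimator).
    """
    estimate = 0.0
    prob = 1.0  # probability of having reached the current directory
    node = tree
    while True:
        estimate += node["files"] / prob      # inverse-probability weighting
        subdirs = node["dirs"]
        if not subdirs:
            return estimate
        name = random.choice(sorted(subdirs)) # pick one branch uniformly
        prob /= len(subdirs)                  # update the path probability
        node = subdirs[name]

def estimate_total_files(tree, num_descents=1000):
    """Average many independent descents to reduce variance."""
    return sum(random_descent(tree) for _ in range(num_descents)) / num_descents
```

Each descent touches only one root-to-leaf path, so the cost per sample is proportional to the tree depth rather than its size, which is what makes the approach practical at the scale of billions of files.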
Figure: A block diagram of the system architecture in accordance with a preferred embodiment of the invention
Advantages:
Applications: