Just-In-Time Analytics on Large File and Storage Systems

Traditional file systems, such as NTFS and ext3, often struggle to support complex analytics due to limited metadata and the absence of global views. Performing aggregate or top-k queries typically requires exhaustive scanning or pre-built indexes, which are time-consuming and resource-intensive. These limitations hinder efficient data management and analysis, especially when dealing with large-scale or unfamiliar file systems.

GW researchers have developed a novel just-in-time, sampling-based system that enables efficient and accurate analytics over large file systems without prior knowledge of their contents or extensive pre-processing. Using only a small number of disk accesses, the system employs two algorithms—FS_Agg for aggregate queries and FS_TopK for top-k queries—to produce statistically accurate estimates. The approach is file-system agnostic, scales to billions of files, and eliminates the need for disk crawling or index building, offering a practical solution for real-time analytics in dynamic and expansive data environments.
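To illustrate the general idea behind sampling-based aggregate estimation over a directory tree, the sketch below performs repeated random descents from a root directory and forms an unbiased estimate of the total file count by weighting each visited directory with the inverse of its probability of being reached. This is a minimal, illustrative sketch of the technique class only, not the patented FS_Agg implementation; the function name and structure are assumptions.

```python
import os
import random

def random_descent_estimate(root, trials=200, seed=0):
    """Estimate the total number of files under `root` via random descents.

    Illustrative sketch only: at each directory we pick one subdirectory
    uniformly at random and scale the contribution of each visited
    directory by 1 / Pr[reaching it], which keeps the estimator unbiased.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimate, weight, cur = 0.0, 1.0, root
        while True:
            try:
                entries = list(os.scandir(cur))
            except OSError:
                break  # unreadable directory: stop this descent
            files = sum(1 for e in entries
                        if e.is_file(follow_symlinks=False))
            subdirs = [e.path for e in entries
                       if e.is_dir(follow_symlinks=False)]
            estimate += weight * files  # weight = 1/Pr[reaching cur]
            if not subdirs:
                break
            weight *= len(subdirs)  # descent picks 1 of len(subdirs) children
            cur = rng.choice(subdirs)
        total += estimate
    return total / trials
```

Averaging many descents trades accuracy for disk accesses: each descent touches only one directory per tree level, so even a few hundred descents visit a small fraction of a large file system.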

Figure: A block diagram of the system architecture in accordance with a preferred embodiment of the invention


Advantages:

  • Approximately 90% accuracy in query results while visiting only about 20% of the directories a full file-system crawl would touch.
  • Scalable to file systems containing up to one billion files and millions of directories.
  • Eliminates the need for pre-built indexes or prior metadata collection.


Applications:

  • Real-time analytics over unfamiliar or hidden file systems without prior knowledge.
  • Efficient data management and archiving in large-scale storage environments.
  • Enhancement of mobile interfaces for accessing hidden databases through context-sensitive suggestions.


Patent Information: