"Big Data" is the term for the exponential growth of data on the Internet, and traditional methods for searching large datasets with fuzzy criteria often rely on brute-force techniques, which are inefficient and time-consuming. These approaches typically scan the data sequentially, creating performance bottlenecks as data volumes grow. While approximate matching methods such as FuzzyFind offer some improvement, they can still fall short in scalability and speed on extensive datasets.
GW researchers have developed a novel system and method that leverages the Pigeonhole Principle to accelerate approximate searching in large data environments. Both the search query and the data items are partitioned into multiple binary vector segments, each longer than 23 bits, and the system sets a threshold for permissible mismatches, computed as a function of the number of segments and the mismatches allowed per segment, typically measured by Hamming distance. By the Pigeonhole Principle, any item that falls within the overall mismatch budget must agree with the query to within the per-segment allowance in at least one segment, so non-matching items can be discarded by cheap per-segment lookups rather than exhaustive comparisons. This leads to faster search times and improved performance in big data applications. The technique is particularly effective when integrated with existing approximate matching capabilities, such as those of the FuzzyFind method, thereby amplifying their effectiveness.
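To make the filtering idea concrete, the following is a minimal Python sketch of the general pigeonhole approach described above, assuming the simplest case in which the per-segment mismatch allowance is zero (exact per-segment lookups). The segment count, mismatch budget, and names such as PigeonholeIndex are illustrative assumptions, not part of the GW system, which can additionally apply an approximate matcher such as FuzzyFind within each segment.

```python
# Sketch: pigeonhole-based approximate search over fixed-length bit vectors.
# If a match may differ in at most MAX_MISMATCHES bits and MAX_MISMATCHES is
# less than SEGMENTS, then at least one segment of a true match is identical
# to the corresponding query segment, so exact per-segment lookups cannot
# miss it. Candidates are then verified by full Hamming distance.

from collections import defaultdict

SEGMENTS = 4          # number of segments per vector (illustrative choice)
MAX_MISMATCHES = 3    # total Hamming-distance budget (illustrative choice)

def split(bits: int, length: int, segments: int):
    """Split an integer bit vector of `length` bits into equal segments."""
    seg_len = length // segments
    mask = (1 << seg_len) - 1
    return [(bits >> (i * seg_len)) & mask for i in range(segments)]

def hamming(a: int, b: int) -> int:
    """Hamming distance between two equal-length bit vectors."""
    return bin(a ^ b).count("1")

class PigeonholeIndex:
    """Hash-table index keyed by each segment of every stored item."""

    def __init__(self, length: int):
        self.length = length
        self.tables = [defaultdict(list) for _ in range(SEGMENTS)]
        self.items = []

    def add(self, bits: int):
        idx = len(self.items)
        self.items.append(bits)
        for i, seg in enumerate(split(bits, self.length, SEGMENTS)):
            self.tables[i][seg].append(idx)

    def search(self, query: int):
        # Collect items sharing at least one segment with the query,
        # then verify each candidate against the full mismatch budget.
        candidates = set()
        for i, seg in enumerate(split(query, self.length, SEGMENTS)):
            candidates.update(self.tables[i].get(seg, []))
        return [self.items[c] for c in candidates
                if hamming(self.items[c], query) <= MAX_MISMATCHES]

# Usage example with 32-bit items (arbitrary test values).
if __name__ == "__main__":
    index = PigeonholeIndex(length=32)
    for item in (0xDEADBEEF, 0xCAFEBABE, 0x12345678):
        index.add(item)
    # A query with two bits flipped is still found via the unchanged segments.
    print([hex(m) for m in index.search(0xDEADBEEF ^ 0b101)])
```

The per-segment tables let the search touch only items that could possibly satisfy the budget, which is where the speedup over sequential scanning comes from; a per-segment approximate matcher would widen each lookup instead of requiring exact segment equality.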
Fig. Software structure for Pigeonhole Search
Advantages:
Applications: