Fainder: A Fast and Accurate Index for Distribution-Aware Dataset Search

Lennart Behme, Sainyam Galhotra, Kaustubh Beedkar, Volker Markl

July 2024

Abstract

Efficient data discovery is crucial in the era of data-driven decision-making. However, current practices face significant challenges due to the intricacies of identifying datasets with specific distributional characteristics, such as percentiles, when data repositories are decentralized. Traditional keyword-based search methods are insufficient for these complex requirements, often resulting in suboptimal dataset search results. To address these challenges, this paper presents Fainder, a fast and accurate index for “percentile predicates” on histogram-based data summaries, which streamlines the search process for datasets with specific distributional requirements. Fainder can be constructed on heterogeneous histogram collections and employs binary search in conjunction with multi-step pruning techniques to efficiently identify search results for percentile predicates. Thereby, it simplifies data provisioning and improves the effectiveness of dataset discovery. Empirical evaluation of our solution on three large-scale data repositories shows that Fainder is effective for distribution-aware search and provides order-of-magnitude efficiency gains over baselines.
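To make the notion of a percentile predicate concrete, here is a minimal Python sketch of how such a predicate could be checked against histogram summaries by scanning them one at a time. It is an illustration only, not Fainder's index or its pruning strategy. The predicate form ("at least a fraction p of the values lie below a threshold"), the function names, the sample histograms, and the assumption that values are spread uniformly inside each bin are all choices made for this example, not taken from the paper.

```python
from bisect import bisect_right
from itertools import accumulate


def fraction_below(edges, counts, threshold):
    """Estimate the fraction of values below `threshold` from a histogram.

    `edges` holds len(counts) + 1 strictly increasing bin boundaries.
    Values are assumed to be uniformly spread inside each bin
    (an assumption of this sketch, not a claim about Fainder).
    """
    total = sum(counts)
    if total == 0 or threshold <= edges[0]:
        return 0.0
    if threshold >= edges[-1]:
        return 1.0
    # Binary search for the bin that contains the threshold.
    i = bisect_right(edges, threshold) - 1
    # cumulative[i] is the number of values in all bins before bin i.
    cumulative = [0, *accumulate(counts)]
    width = edges[i + 1] - edges[i]
    partial = counts[i] * (threshold - edges[i]) / width
    return (cumulative[i] + partial) / total


def matches(edges, counts, p, threshold):
    """Hypothetical percentile predicate: at least `p` of the values lie below `threshold`."""
    return fraction_below(edges, counts, threshold) >= p


def scan(collection, p, threshold):
    """Baseline linear scan over a dict of name -> (edges, counts)."""
    return [
        name
        for name, (edges, counts) in collection.items()
        if matches(edges, counts, p, threshold)
    ]


# Made-up histograms with different bin layouts (heterogeneous collection).
collection = {
    "ages": ([0, 20, 40, 60, 80], [10, 30, 40, 20]),
    "incomes": ([0, 50, 100], [90, 10]),
}
result = scan(collection, p=0.75, threshold=50)
```

A scan like this touches every histogram for every query; the abstract describes an index that avoids that cost with binary search and multi-step pruning.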

Type

Conference paper

Publication

In Proceedings of the VLDB Endowment (PVLDB), VLDB 2024
