HNDS-R: Improving Data Integration Techniques

Project Abstract/Summary

This research project will develop methodologies to address the critical challenges researchers face when merging datasets without unique identifiers. As datasets have become more abundant and diverse, researchers are seeking ways to combine data from multiple sources to tackle important societal questions. A frequent obstacle to linking diverse datasets is the lack of a unique identifier, such as a social security number. This leads to uncertainty in linking records across datasets and computational complexity due to the need for multiple comparisons without prior knowledge of the correspondence between records. This project will develop easy-to-use, computationally efficient, and accurate methods of merging datasets without a unique identifier. Simulations and real-world data will be used to validate the new methods, and open-source software will be developed. The project will provide dedicated training and support for undergraduate and graduate students, including students from underrepresented groups. Overall, the outcomes of this project will advance theoretical understanding, improve research infrastructure, and provide societal benefits by making research findings widely accessible.

This research project will address the complex challenge of merging large datasets without unique identifiers by improving probabilistic data integration (record linkage) methods. The central scientific questions involve how to accurately classify record pairs using minimal labeled data and how to enhance computational efficiency when merging large datasets. The approach will involve combining probabilistic modeling with active learning to utilize both labeled and unlabeled data more efficiently. Additionally, it will include the development of new hashing techniques and the use of parallel computing to ensure scalability for datasets containing millions of records. The research will be validated through a comprehensive set of simulation studies and two empirical applications. The newly developed methods will extend beyond their initial application, addressing a wide array of data integration challenges across various scientific disciplines. Open-source software will be distributed, enabling researchers and practitioners to easily incorporate these advanced techniques into their work. The software and accompanying resources will promote broader adoption and utilization of the developed methodologies, enhancing the overall impact and accessibility of the research.
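
To make the idea of probabilistic record linkage concrete, the following is a minimal sketch of the classical Fellegi-Sunter scoring approach, not the project's own method, which the abstract does not specify. The field names, the m- and u-probabilities, the threshold, and the blocking rule (first letter of the last name, used here as a simple stand-in for the hashing techniques mentioned above) are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict

# Hypothetical per-field probabilities (assumed values, not estimated from data):
# m = P(field agrees | the two records are a true match)
# u = P(field agrees | the two records are not a match)
M_U = {"first": (0.95, 0.05), "last": (0.95, 0.02), "year": (0.98, 0.10)}


def block_key(record):
    """Cheap blocking key: first letter of the last name."""
    return record["last"][:1].lower()


def match_weight(a, b):
    """Sum of Fellegi-Sunter log2 likelihood-ratio weights over all fields."""
    total = 0.0
    for field, (m, u) in M_U.items():
        if a[field] == b[field]:
            total += math.log2(m / u)              # agreement raises the score
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


def link(left, right, threshold=4.0):
    """Return (left_index, right_index, weight) for pairs at or above threshold."""
    # Index the right-hand dataset by blocking key so that only records
    # sharing a key are compared, instead of all len(left) * len(right) pairs.
    buckets = defaultdict(list)
    for j, rec in enumerate(right):
        buckets[block_key(rec)].append(j)

    links = []
    for i, rec in enumerate(left):
        for j in buckets.get(block_key(rec), []):
            weight = match_weight(rec, right[j])
            if weight >= threshold:
                links.append((i, j, weight))
    return links


# Toy data with no shared unique identifier (invented for this example).
left = [
    {"first": "ana", "last": "lopez", "year": 1990},
    {"first": "john", "last": "smith", "year": 1985},
]
right = [
    {"first": "john", "last": "smith", "year": 1985},
    {"first": "anna", "last": "lopez", "year": 1990},
    {"first": "mary", "last": "jones", "year": 1972},
]

links = link(left, right)
for i, j, weight in links:
    print(i, j, round(weight, 2))
```

In this toy run the exact "john smith" pair scores well above the threshold, while the "ana"/"anna" pair clears it only narrowly because the first names disagree. Borderline pairs of this kind are the ones where a small amount of labeled data, as in the active learning approach the abstract describes, would be most informative.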

This award reflects NSF’s statutory mission and has been deemed worthy of support through evaluation using the Foundation’s intellectual merit and broader impacts review criteria.

Principal Investigator

Ted Enamorado – Washington University, Saint Louis, MO

Co-Principal Investigators

Funders

National Science Foundation

Funding Amount

$233,955.00

Project Start Date

08/01/2024

Project End Date

07/31/2026

Source: National Science Foundation

Please be advised that recent changes in federal funding schemes may have impacted the project’s scope and status.

Updated: April 2025