CAREER: Flexible Record Linkage through Realistic Modeling of Dependent, Missing, and Updating Data

Project Abstract/Summary

This CAREER research project will develop flexible Bayesian record linkage models to enable more accurate linking of data sets without unique identifiers. Unleashing the potential of increasingly ubiquitous data to solve grand problems from conservation biology to demography and population estimation often requires linking multiple data sources. Record linkage is the process of resolving duplicates in partially overlapping sets of records from noisy data sources without a unique identifier. A statistical model-based approach is attractive in that it allows for uncertainty quantification in the linkage. Current approaches to record linkage, however, struggle to account for data with dependence, data with missing values, and updating data, challenges that this research will address. This novel methodology will be applied to publicly available and Census Bureau data from the 19th century with identifying information about freedom seekers, captive individuals in the coastwise slave trade, and Black Americans in the Antebellum South. Official records of these individuals are fractured into many sources, each only giving a small glimpse in time. Applying the developed flexible Bayesian record linkage model, this research will reconstruct a fuller account of their lives through linking multiple noisy data sources. Building on these results, the investigator will develop a real-time updating web application to search the linked historical records. Educational activities supporting the web application include an international workshop on record linkage and historical data, a community field day, and a PhD course on statistical tools for historical demographic data. In addition, graduate students will be mentored, and software will be made publicly available.

Current methodology makes three unrealistic assumptions: 1) linking fields are independent, 2) missing values are missing completely at random (MCAR), and 3) data are static. The first assumption simply does not reflect the reality of many linkage data sets. Therefore, this research will advance Bayesian record linkage through development of a flexible prior to incorporate dependence between fields in the data using copulas. Next, the research will address missing data in the linkage variables. Missing information in historical data is common, owing to factors such as the loss of data over time, hand collection of data, and lower literacy rates. Current methods for handling missing data assume MCAR and marginalize out the effects of missingness. In contrast, this investigation will incorporate records that contain missing data in the final model via a flexible nonparametric prior. Finally, in developing the web application, the research will allow for dynamic data by developing a real-time updating querying framework for record linkage achieved through a novel filtering approach. Bayesian record linkage is an emerging field with many exciting avenues of exploration. This research will address three such avenues, resulting in more accurate linkage in complex data scenarios. The new methods will be applicable beyond the current application to a broad range of record linkage problems across the sciences.
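
To make the first two assumptions concrete, the following is a minimal sketch of how a match score is commonly computed when linking fields are treated as independent and missing values are skipped, in the style of the classical Fellegi-Sunter framework. It is not the project's model. The field names, the probability values, and the function name match_weight are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical per-field probabilities (illustrative values, not from the project):
#   M_PROBS[f] = P(field f agrees | both records refer to the same individual)
#   U_PROBS[f] = P(field f agrees | records refer to different individuals)
M_PROBS = {"first_name": 0.90, "last_name": 0.95, "birth_year": 0.80}
U_PROBS = {"first_name": 0.05, "last_name": 0.02, "birth_year": 0.10}


def match_weight(agreement, m_probs=M_PROBS, u_probs=U_PROBS):
    """Log-likelihood ratio for one record pair under field independence.

    `agreement` maps a field name to True (agree), False (disagree),
    or None (value missing in at least one record).
    """
    weight = 0.0
    for field, agrees in agreement.items():
        if agrees is None:
            # Skipping a missing field is only justified if values are
            # missing completely at random (assumption 2).
            continue
        m, u = m_probs[field], u_probs[field]
        # Summing per-field log ratios is exactly the independence
        # assumption (assumption 1): the joint likelihood factorizes.
        if agrees:
            weight += math.log(m / u)
        else:
            weight += math.log((1 - m) / (1 - u))
    return weight


if __name__ == "__main__":
    pair = {"first_name": True, "last_name": True, "birth_year": None}
    print(match_weight(pair))
```

Because the score is a plain sum over fields, correlated fields contribute evidence as if each were a separate observation, and a missing field contributes nothing regardless of why it is missing. These are the two simplifications that the copula-based prior and the nonparametric missing-data prior described above are intended to relax.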

This award reflects NSF’s statutory mission and has been deemed worthy of support through evaluation using the Foundation’s intellectual merit and broader impacts review criteria.

Principal Investigator

Andrea Kaplan – Colorado State University, Fort Collins, CO

Co-Principal Investigators

Funders

National Science Foundation

Funding Amount

$193,038.00

Project Start Date

05/15/2024

Project End Date

04/30/2029

Will the project remain active for the next two years?

The project has more than two years remaining

Source: National Science Foundation

Please be advised that recent changes in federal funding schemes may have impacted the project’s scope and status.

Updated: April 2025