New RTI SynthPop™ dataset offers groundbreaking research opportunities

Comprehensive, privacy-preserving dataset mirrors U.S. population to allow study of community- and neighborhood-level effects


RESEARCH TRIANGLE PARK, N.C. — Researchers at RTI International, a nonprofit research institute, have published a paper titled “A National Synthetic Populations Dataset for the United States,” which describes a new 2019 U.S. population dataset and its open-source code base, rti_synth_pop. The new dataset provides a statistically accurate and geospatially explicit representation of households and individuals, enabling researchers to study community- and neighborhood-level effects while protecting individual privacy.

“Our new synthetic population dataset allows researchers to design and test hypotheses that would otherwise not be possible. We think of the individual records in the dataset as highly extensible digital twins, nearly one twin for every person in the U.S.,” said James Rineer, director of geospatial science and technology at RTI. “Incorporation of synthetic data like this, combined with artificial intelligence and machine learning, is already providing cost efficiencies on the front end of research as well as providing built-in scalability on the back end.”

The dataset contains records representing more than 120 million households and 303 million individuals across the U.S. It was created from publicly available U.S. Census Bureau American Community Survey (ACS) 5-year estimates for 2015–2019, with Iterative Proportional Fitting (IPF) used to ensure statistical accuracy. The researchers then used a population density grid to spatially allocate households. Validation showed strong correlation with the original census variables, matching almost perfectly in many categories with a Pearson correlation coefficient (r) of 0.99.
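For readers unfamiliar with IPF, the general idea can be illustrated with a minimal Python sketch. This is not the rti_synth_pop implementation: the function name `ipf_2d`, the seed table, and the target totals below are all hypothetical, chosen only to show how a seed table is rescaled until its row and column sums match known marginal totals, and how a Pearson correlation can then be computed as a simple check.

```python
import numpy as np


def ipf_2d(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Rescale a 2-D seed table until its margins match the targets.

    Illustrative only. Assumes every seed cell is positive and that the
    row and column targets sum to the same grand total.
    """
    table = np.asarray(seed, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)

    for _ in range(max_iter):
        # Row step: scale each row so its sum equals the row target.
        table *= (row_targets / table.sum(axis=1))[:, None]
        # Column step: scale each column so its sum equals the column target.
        table *= (col_targets / table.sum(axis=0))[None, :]
        # Columns now match exactly; stop once the rows also match.
        if np.allclose(table.sum(axis=1), row_targets, rtol=0.0, atol=tol):
            break
    return table


# Hypothetical sample cross-tabulation (e.g. household size by income band).
seed = [[10, 20, 15],
        [30, 40, 10],
        [5, 25, 20]]
# Hypothetical published marginal totals; both sets sum to 150.
row_targets = [40, 60, 50]
col_targets = [35, 65, 50]

fitted = ipf_2d(seed, row_targets, col_targets)

# Simple validation: Pearson's r between fitted and target margins.
fitted_margins = np.concatenate([fitted.sum(axis=1), fitted.sum(axis=0)])
target_margins = np.concatenate([row_targets, col_targets]).astype(float)
pearson_r = np.corrcoef(fitted_margins, target_margins)[0, 1]
print(fitted.round(2))
print(pearson_r)
```

The fitted table keeps the relationships between cells found in the seed while reproducing the target totals. In this toy case the correlation is computed on the same margins the table was fitted to, so it only confirms convergence; it is not a substitute for the validation described in the paper.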

“Innovations in synthetic datasets are foundational to advancing secure statistical linkage between various demographic datasets required in almost all social science research today,” said Georgiy Bobashev, Ph.D., Senior Fellow in RTI’s Center for Data Science and AI.

“There are a near-infinite number of ways to add attributes and behaviors to these records,” added Nick Kruskamp, Ph.D., a research geospatial scientist at RTI who maintains the open-source code base. “The bulk of our work is focused on further augmenting these data for inclusion in simulation models, including ongoing work modeling the opioid epidemic in the U.S., alcohol use disorders, infectious diseases, chronic illnesses including diabetes, and planning and response for natural disasters.”

The dataset's applications span various fields, including epidemiology, social science and economics. Its high level of granularity allows for more realistic modeling and predictive studies using methods such as microsimulation, agent-based modeling and linkage of multiple datasets.

The new dataset, publicly available for research and collaboration, was presented in a paper published in Scientific Data, a peer-reviewed, open-access journal.

View the full paper (via Scientific Data)

Access the code (via GitHub)

Learn more about RTI SynthPop™