Artificial intelligence-assisted data extraction with a large language model: A study within reviews

Gerald Gartlehner; Shannon Kugley; Karen Crotty; Meera Viswanathan; Andreea Dobrescu; Barbara Nussbaumer-Streit; Graham Booth; Jonathan R Treadwell; Jung Min Han; Jesse Wagner; Eric A Apaydin; Erin L Coppola; Margaret Maglione; Rainer Hilscher; Robert Chew; Meagan Pilar; Bryan Swanton; Leila C Kahwati

Gartlehner, G., Kugley, S., Crotty, K., Viswanathan, M., Dobrescu, A., Nussbaumer-Streit, B., Booth, G., Treadwell, J. R., Han, J. M., Wagner, J., Apaydin, E. A., Coppola, E. L., Maglione, M., Hilscher, R., Chew, R., Pilar, M., Swanton, B., & Kahwati, L. C. (2025). Artificial intelligence-assisted data extraction with a large language model: A study within reviews. Annals of Internal Medicine, 178(12). https://doi.org/10.7326/ANNALS-25-00739

Abstract

BACKGROUND: Data extraction is a critical but error-prone and labor-intensive task in evidence synthesis. Unlike other artificial intelligence (AI) technologies, large language models (LLMs) do not require labeled training data for data extraction.

OBJECTIVE: To compare an AI-assisted versus a traditional, human-only data extraction process.

DESIGN: Study within reviews (SWAR) using a prospective, parallel-group comparison with blinded data adjudicators.

SETTING: Workflow validation within 6 ongoing systematic reviews of interventions under real-world conditions.

INTERVENTION: Initial data extraction using an LLM (Claude, versions 2.1, 3.0 Opus, and 3.5 Sonnet) verified by a human reviewer.

MEASUREMENTS: Concordance, time on task, accuracy, sensitivity, positive predictive value, and error analysis.

RESULTS: The 6 systematic reviews in the SWAR yielded 9341 data elements from 63 studies. Concordance between the 2 methods was 77.2% (95% CI, 76.3% to 78.0%). Compared with the reference standard, the AI-assisted approach had an accuracy of 91.0% (CI, 90.4% to 91.6%) and the human-only approach an accuracy of 89.0% (CI, 88.3% to 89.6%). Sensitivities were 89.4% (CI, 88.6% to 90.1%) and 86.5% (CI, 85.7% to 87.3%), respectively, with positive predictive values of 99.2% (CI, 99.0% to 99.4%) and 98.9% (CI, 98.6% to 99.1%). Incorrect data were extracted in 9.0% (CI, 8.4% to 9.6%) of AI-assisted cases and 11.0% (CI, 10.4% to 11.7%) of human-only cases, with corresponding proportions of major errors of 2.5% (CI, 2.2% to 2.8%) versus 2.7% (CI, 2.4% to 3.1%). Missed data items were the most frequent error type in both approaches. The AI-assisted method reduced data extraction time by a median of 41 minutes per study.
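The reported confidence intervals can be roughly reproduced from the figures above. The following is a minimal sketch, not the authors' method: it assumes that all 9341 data elements form the denominator and that a simple normal-approximation (Wald) interval is adequate. The abstract does not state how the intervals were calculated, and the function name `proportion_ci` is hypothetical.

```python
import math


def proportion_ci(p, n, z=1.96):
    """Wald (normal-approximation) interval for a proportion p observed over n items.

    Assumption: the abstract does not say which interval method was used,
    so small differences from the published bounds are expected.
    """
    se = math.sqrt(p * (1 - p) / n)  # standard error of the proportion
    return p - z * se, p + z * se


# Assumed denominator: the 9341 data elements reported in the abstract.
N_ELEMENTS = 9341

# Accuracy point estimates as reported in the abstract.
for label, p in [("AI-assisted accuracy", 0.910), ("Human-only accuracy", 0.890)]:
    lower, upper = proportion_ci(p, N_ELEMENTS)
    print(f"{label}: {p:.1%} (95% CI, {lower:.1%} to {upper:.1%})")
```

Under these assumptions the computed bounds fall within about 0.1 percentage point of the published intervals. The error proportions are the complements of the accuracies (100% minus 91.0% gives 9.0%; 100% minus 89.0% gives 11.0%).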

LIMITATIONS: Assessing concordance and classifying errors required subjective judgment. Consistently tracking time on task was challenging.

CONCLUSION: Data extraction assisted by AI may offer a viable, more efficient alternative to human-only methods.

PRIMARY FUNDING SOURCE: Agency for Healthcare Research and Quality and RTI International.
