Artificial intelligence-assisted data extraction with a large language model
A study within reviews
Gartlehner, G., Kugley, S., Crotty, K., Viswanathan, M., Dobrescu, A., Nussbaumer-Streit, B., Booth, G., Treadwell, J. R., Han, J. M., Wagner, J., Apaydin, E. A., Coppola, E. L., Maglione, M., Hilscher, R., Chew, R., Pilar, M., Swanton, B., & Kahwati, L. C. (2025). Artificial intelligence-assisted data extraction with a large language model: A study within reviews. Annals of Internal Medicine. Advance online publication. https://doi.org/10.7326/ANNALS-25-00739
BACKGROUND: Data extraction is a critical but error-prone and labor-intensive task in evidence synthesis. Unlike other artificial intelligence (AI) technologies, large language models (LLMs) do not require labeled training data for data extraction.
OBJECTIVE: To compare an AI-assisted versus a traditional, human-only data extraction process.
DESIGN: Study within reviews (SWAR) using a prospective, parallel-group comparison with blinded data adjudicators.
SETTING: Workflow validation within 6 ongoing systematic reviews of interventions under real-world conditions.
INTERVENTION: Initial data extraction using an LLM (Claude, versions 2.1, 3.0 Opus, and 3.5 Sonnet) verified by a human reviewer.
MEASUREMENTS: Concordance, time on task, accuracy, sensitivity, positive predictive value, and error analysis.
RESULTS: The 6 systematic reviews in the SWAR yielded 9341 data elements from 63 studies. Concordance between the 2 methods was 77.2% (95% CI, 76.3% to 78.0%). Compared with the reference standard, the AI-assisted approach had an accuracy of 91.0% (CI, 90.4% to 91.6%) and the human-only approach an accuracy of 89.0% (CI, 88.3% to 89.6%). Sensitivities were 89.4% (CI, 88.6% to 90.1%) and 86.5% (CI, 85.7% to 87.3%), respectively, with positive predictive values of 99.2% (CI, 99.0% to 99.4%) and 98.9% (CI, 98.6% to 99.1%). Incorrect data were extracted in 9.0% (CI, 8.4% to 9.6%) of AI-assisted cases and 11.0% (CI, 10.4% to 11.7%) of human-only cases, with corresponding proportions of major errors of 2.5% (CI, 2.2% to 2.8%) versus 2.7% (CI, 2.4% to 3.1%). Missed data items were the most frequent error type in both approaches. The AI-assisted method reduced data extraction time by a median of 41 minutes per study.
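As a rough illustration of how a proportion such as accuracy and its 95% CI could be computed, here is a minimal sketch. It rests on assumptions not stated in the abstract: that the CI is a Wilson score interval, that the denominator is all 9341 data elements, and that about 8500 of them were correct (back-calculated from the reported 91.0%). The function name `wilson_interval` is hypothetical, and the study's actual statistical method may differ.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, total: int, level: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed method, not confirmed by the source)."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need 0 <= successes <= total and total > 0")
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Assumed counts: 9341 data elements, roughly 91.0% of them correct.
total_elements = 9341
assumed_correct = round(0.910 * total_elements)

accuracy = assumed_correct / total_elements
low, high = wilson_interval(assumed_correct, total_elements)
print(f"accuracy = {accuracy:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

Under these assumed counts the interval lands close to the reported 90.4% to 91.6%. Sensitivity and positive predictive value could be handled the same way by swapping in the appropriate numerator and denominator, which the abstract does not report.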
LIMITATIONS: Assessing concordance and classifying errors required subjective judgment. Consistently tracking time on task was challenging.
CONCLUSION: Data extraction assisted by AI may offer a viable, more efficient alternative to human-only methods.
PRIMARY FUNDING SOURCE: Agency for Healthcare Research and Quality and RTI International.