Re-engineering a machine learning phenotype to adapt to the changing COVID-19 landscape: A machine learning modelling study from the N3C and RECOVER consortia

Miles Crosskey; Tomas McIntee; Sandy Preiss; M Daniel Brannock; John M Baratta; Yun Jae Yoo; Emily Catherine Hadley; Frank Blancero; Rob Chew; Johanna Loomba; Abhishek Bhatia; Christopher G. Chute; Melissa Haendel; Richard A Moffitt; Emily R. Pfaff

Crosskey, M., McIntee, T., Preiss, S., Brannock, M. D., Baratta, J. M., Yoo, Y. J., Hadley, E. C., Blancero, F., Chew, R., Loomba, J., Bhatia, A., Chute, C. G., Haendel, M., Moffitt, R. A., & Pfaff, E. R. (2025). Re-engineering a machine learning phenotype to adapt to the changing COVID-19 landscape: A machine learning modelling study from the N3C and RECOVER consortia. The Lancet Digital Health, 7(8), Article 100887. https://doi.org/10.1016/j.landig.2025.100887

Abstract

Background
In 2021, we used the National COVID Cohort Collaborative (N3C) as part of the National Institutes of Health RECOVER Initiative to develop a machine learning pipeline to identify patients with a high probability of having post-acute sequelae of SARS-CoV-2 infection or long COVID. However, the increased home testing, missing documentation, and reinfections that characterise the pandemic beyond 2022 necessitated the re-engineering of our original model to account for these changes in the COVID-19 research landscape.

Methods
Trained on 72 745 patient records (36 238 with long COVID and 36 507 with no evidence of long COVID), our updated XGBoost model gathered data for each patient in overlapping 100-day periods that progressed through time and issued a probability of long COVID for each 100-day period. We ran the model on patients in N3C (n=5 875 065) who met at least one of the following criteria from Jan 1, 2020, to June 22, 2023: a U07·1 (COVID-19) diagnosis code; a positive SARS-CoV-2 test; a U09·9 (post-acute sequelae of SARS-CoV-2 infection) diagnosis code; a prescription for nirmatrelvir–ritonavir or remdesivir; or an M35·81 (multisystem inflammatory syndrome in children [MIS-C]) diagnosis code. Each patient was given a model score that predicted long COVID status for each 100-day window in which they were aged ≥18 years. If a patient had known acute COVID-19 during any 100-day window (including reinfections), we censored the data from 7 days before the diagnosis or positive test date to 28 days after. We ran the model on controls selected from pre-2020 data to assess the likelihood of false positives.
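The windowing and censoring rules described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the 100-day window length and the censoring span of 7 days before to 28 days after come from the abstract, while the 50-day step between windows, the inclusive endpoints, and all function names are assumptions made for the example.

```python
from datetime import date, timedelta

WINDOW_DAYS = 100    # window length stated in the abstract
STEP_DAYS = 50       # ASSUMPTION: the abstract does not state how far windows overlap
CENSOR_BEFORE = 7    # days censored before a known acute COVID-19 date (abstract)
CENSOR_AFTER = 28    # days censored after that date (abstract)


def overlapping_windows(start, end, window_days=WINDOW_DAYS, step_days=STEP_DAYS):
    """Return inclusive (window_start, window_end) pairs that fit within start..end."""
    windows = []
    current = start
    while current + timedelta(days=window_days - 1) <= end:
        windows.append((current, current + timedelta(days=window_days - 1)))
        current += timedelta(days=step_days)
    return windows


def censored_interval(infection_date):
    """Return the inclusive date range censored around one acute infection."""
    return (
        infection_date - timedelta(days=CENSOR_BEFORE),
        infection_date + timedelta(days=CENSOR_AFTER),
    )


def keep_event(event_date, window, infection_dates):
    """True if the event lies in the window and outside every censored interval."""
    window_start, window_end = window
    if not (window_start <= event_date <= window_end):
        return False
    for infection_date in infection_dates:
        lo, hi = censored_interval(infection_date)
        if lo <= event_date <= hi:
            return False
    return True


# Hypothetical patient with one known infection (including reinfections would
# simply add more dates to this list).
windows = overlapping_windows(date(2020, 1, 1), date(2020, 12, 31))
infections = [date(2020, 3, 1)]
print(windows[0])
print(keep_event(date(2020, 3, 10), windows[0], infections))  # inside censored span
print(keep_event(date(2020, 4, 5), windows[0], infections))   # after censored span
```

Because each window is scored on its own, a patient with no recorded index date still receives a score for every window in which they have data, which is what allows the approach to work without an anchoring infection date.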

Findings
The updated model had an area under the receiver operating characteristic curve of 0·90. Precision and recall could be adjusted according to a given use case, depending on whether greater sensitivity or specificity was warranted. Using our model, we estimate the overall prevalence of long COVID among the COVID-19-positive cohort within the N3C repository to be 10·4%.
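To illustrate how precision and recall trade off as the score cutoff moves, here is a small sketch. The scores and labels are synthetic values invented for the example, not N3C data or model output, and the function name and cutoffs are assumptions.

```python
def precision_recall(scores, labels, threshold):
    """Compute precision and recall when scores >= threshold are called positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# SYNTHETIC example data: model scores and true long COVID labels (1 = yes).
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

for threshold in (0.35, 0.50, 0.75):
    precision, recall = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

A higher cutoff suits uses where false positives are costly, such as selecting patients for a study cohort, while a lower cutoff suits screening uses where missing cases matters more.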

Interpretation
By eschewing the COVID-19 index date as an anchor point for analysis, we can assess the probability of long COVID among patients who might have tested at home, who had suspected (but untested) COVID-19, or who had multiple SARS-CoV-2 reinfections. We view this exercise as a model for maintaining and updating any machine learning pipeline used for clinical research and operations.
