Exploring identification of social determinants of health (SDOH) data in NHLBI BioData Catalyst® (BDC) using embedding-based methods
Krishnamurthy, M., Dave, L. R., Slade, T. S., Marcial, L. H., Montavon, J. M., Tyndall, B., Ortiz, J. E., & Thessen, A. (2025). Exploring identification of social determinants of health (SDOH) data in NHLBI BioData Catalyst® (BDC) using embedding-based methods. Zenodo. Advance online publication. https://doi.org/10.5281/zenodo.15270779
In alignment with the Make America Healthy Again initiative to promote policies that improve public health, this work focuses on identifying survey questions and answers in datasets hosted within the NHLBI BioData Catalyst® (BDC) ecosystem so that they are easier to search in the ecosystem’s cohort-building tool, BDC Powered by PIC-SURE (BDC-PIC-SURE). Leveraging data standards developed by the Gravity Project, we systematically evaluated and ranked 113 high-value datasets within BDC based on the variables represented in the Gravity Project domains. The four datasets with the highest representation of variables in the Gravity Project domains were then manually annotated using Simple Knowledge Organization System (SKOS) relations to match survey questions and answers with the Gravity Project elements. We used this manually annotated dataset as a “gold standard” to test a proof-of-concept annotation tool that uses embedding-based approaches to match survey-based data with the Gravity Project value set. Performance varied by domain: employment status performed best (F1 score = 1.0) and financial insecurity performed worst (F1 score = 0.42). Some domains, such as financial insecurity, material hardship, and medical cost burden, overlapped substantially and were challenging for human annotators to differentiate. Future work includes refining this workflow by comparing the performance of different embedding algorithms, examining performance on categorical versus continuous variables, determining bins for semantic similarity scores (high, medium, low), and exploring the use of other vocabularies or the annotation of data with multiple domains.
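To make the evaluation described above more concrete, the following is a minimal sketch of embedding-based matching scored with a per-domain F1 against gold labels. It is not the authors' tool. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and the domain descriptions, survey questions, and gold labels are invented for illustration only; they are not taken from BDC or from the Gravity Project value set.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def best_domain(question, domain_descriptions):
    """Return the domain whose description is most similar to the question."""
    q = embed(question)
    sims = {d: cosine(q, embed(desc)) for d, desc in domain_descriptions.items()}
    return max(sims, key=sims.get)

def f1_per_domain(gold, predicted):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for each domain."""
    result = {}
    for domain in set(gold) | set(predicted):
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == domain and p == domain)
        fp = sum(1 for g, p in pairs if g != domain and p == domain)
        fn = sum(1 for g, p in pairs if g == domain and p != domain)
        denom = 2 * tp + fp + fn
        result[domain] = 2 * tp / denom if denom else 0.0
    return result

# Hypothetical domain descriptions (assumptions, not Gravity Project text).
domains = {
    "employment status": "current employment status job work",
    "financial insecurity": "difficulty paying bills money income",
    "medical cost burden": "difficulty paying medical bills cost care",
}

# Hypothetical survey questions with hand-assigned gold labels.
questions = [
    ("what is your current employment status", "employment status"),
    ("do you have difficulty paying medical bills", "financial insecurity"),
    ("not enough money or income this month", "financial insecurity"),
    ("is the cost of medical care a problem", "medical cost burden"),
]

gold = [label for _, label in questions]
predicted = [best_domain(text, domains) for text, _ in questions]
scores = f1_per_domain(gold, predicted)

for domain in sorted(scores):
    print(f"{domain}: F1 = {scores[domain]:.2f}")
```

In this toy data, the second question is worded so that it resembles two domains at once, which mirrors the overlap between financial insecurity and medical cost burden noted in the abstract: a single ambiguous item lowers the F1 of both domains while leaving the unambiguous domain unaffected.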