The Artificial Intelligence (AI) landscape looks very different today than it did one year ago. With the dissemination of Large Language Models (LLMs) through popularized generative AI tools such as ChatGPT, automating cumbersome and time-consuming tasks now feels far more feasible to practitioners.
Over the past year, RTI International data scientists and subject matter experts in various public sector domains evaluated a range of client-driven AI use cases to respond to a new and evolving research landscape. We found that generative AI can help provide initial drafts of plain language text derived from technical documents, perform qualitative coding at human levels of accuracy, assist with developing machine learning models in scenarios with sensitive or uncommon data, and more. Below we discuss six of these generative AI use cases.
Developing Plain Language Content Faster to Disseminate Messaging Equitably
Generative AI Use Case 1: Assess the ability of LLMs to convert complex texts into simpler, more accessible language.
Nearly 130 million adults in the United States read at or below a 6th-grade level, highlighting a significant challenge in accessing and comprehending essential information. In response, the Plain Writing Act of 2010 required federal agencies to communicate with the public in clear, accessible language.
To assess this use case, we formulated a structured prompt that included the original text, a detailed description of the intended audience, and a comprehensive set of Plain Language Writing principles. This prompt was then submitted to the LLM to rewrite the input text in plain language.
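As an illustration, here is a minimal Python sketch of how such a structured prompt might be assembled; the audience description and plain language principles shown are placeholders, not the exact wording used in our prompt.

```python
# Minimal sketch of assembling a structured plain-language rewriting prompt.
# The audience description and principles below are illustrative placeholders,
# not the exact wording used in this work.

PLAIN_LANGUAGE_PRINCIPLES = [
    "Write for a 6th-grade reading level or below.",
    "Use short sentences and common, everyday words.",
    "Prefer active voice and address the reader directly.",
    "Define any technical or medical terms that must be kept.",
    "Organize content with headings and short sections.",
]

def build_rewrite_prompt(original_text: str, audience: str) -> str:
    """Combine the source text, audience description, and principles into one prompt."""
    principles = "\n".join(f"- {p}" for p in PLAIN_LANGUAGE_PRINCIPLES)
    return (
        "Rewrite the text below in plain language for the following audience.\n\n"
        f"Audience: {audience}\n\n"
        f"Plain language principles to follow:\n{principles}\n\n"
        f"Original text:\n{original_text}\n\n"
        "Return only the rewritten text."
    )

prompt = build_rewrite_prompt(
    original_text="Hypertension is a chronic elevation of arterial blood pressure...",
    audience="Adults with no medical background reading at a 6th-grade level",
)
print(prompt)  # This string would then be sent to the LLM of choice.
```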
Our evaluation methodology encompassed both qualitative and quantitative measures. Qualitative evaluators assessed the content using CDC’s Clear Communication Index and considered accuracy, language, tone, and other qualitative aspects. Quantitative readability measures, such as McLaughlin’s SMOG, the Flesch-Kincaid Grade Level, and Flesch Reading Ease, complemented this review.
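For readers who want to reproduce the quantitative side of this evaluation, the readability scores named above can be computed with the open-source textstat package; the snippet below is a minimal example using placeholder text.

```python
# Minimal sketch of the quantitative readability checks, using the open-source
# textstat package (pip install textstat); the example text is illustrative.
import textstat

rewritten = (
    "High blood pressure means your heart works harder than it should. "
    "It often has no warning signs. "
    "A doctor can check your blood pressure at a regular visit."
)

print("SMOG index:          ", textstat.smog_index(rewritten))
print("Flesch-Kincaid grade:", textstat.flesch_kincaid_grade(rewritten))
print("Flesch reading ease: ", textstat.flesch_reading_ease(rewritten))
```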
The assessment identified notable strengths and areas for improvement. Strengths included effectively condensing large volumes of content, organizing information into easily readable sections, maintaining the meaning of the original content, and writing in an active voice. Areas for improvement included using more inclusive language, preserving content while simplifying the language, clearly defining medical and health care terminology, correctly interpreting the input content, and maintaining consistency across multiple rewriting tasks.
Evaluators assessed whether the resulting plain language text was user-friendly and aligned with our commitment to high-quality standards. To streamline this process, we developed a custom web application, facilitating the implementation of plain language writing requirements across various projects. This web application has been deployed to writing teams currently piloting its use in ongoing projects with plain language mandates. By providing a rewritten plain language draft through the tool, we improved the efficiency of the overall writing workflow, enabling teams to focus on final revisions to the draft text.
Integrating this plain language writing approach enables us to meet the mandates of the Plain Writing Act by communicating more effectively and inclusively, thereby ensuring that written information reaches all members of the public.
Enhancing Accuracy and Speed for Providing Technical Assistance
Generative AI Use Case 2: Develop a generative AI-based tool that streamlines the analysis of, and information extraction from, complex regulatory source documents.
Navigating federal health regulations for technical assistance (TA) inquiries is often challenging due to the dense and complex nature of source documents. This complexity can lead to increased workload for analysts, prolonged response times, and a reliance on extensive training for accurate interpretation.
This innovation shows the potential to accelerate response times, increase accuracy and fidelity of answers, and possibly ease the burden of onboarding new analysts.
Our team developed a two-phase approach for applying LLMs to TA inquiries (a generic sketch follows the list):
- In phase 1, the LLM serves as a document analysis and search tool in response to the inquiry, identifying relevant passages to answer the question within the complex source documentation.
- In phase 2, a structured prompt guides the LLM in crafting a tailored response to the inquiry using the relevant passages identified during phase 1.
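The sketch below illustrates this two-phase pattern in generic terms. It is not SmartSearch’s implementation: phase 1 is approximated here with simple TF-IDF retrieval as a stand-in for document search, and phase 2 simply assembles the response-drafting prompt that would be sent to an LLM; the passages and inquiry are illustrative.

```python
# Generic sketch of the two-phase pattern described above; the passages below
# are illustrative placeholders, not real regulatory text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [
    "Section 4.2: Providers must submit encounter data within 30 days...",
    "Section 7.1: Annual reporting requirements for managed care plans...",
    "Section 9.3: Grievance and appeals timelines for enrollees...",
]

def phase1_retrieve(inquiry: str, docs: list[str], k: int = 2) -> list[str]:
    """Phase 1: rank passages by similarity to the inquiry and keep the top k."""
    vectorizer = TfidfVectorizer().fit(docs + [inquiry])
    doc_vecs = vectorizer.transform(docs)
    inquiry_vec = vectorizer.transform([inquiry])
    scores = cosine_similarity(inquiry_vec, doc_vecs)[0]
    ranked = sorted(zip(scores, docs), reverse=True)
    return [p for _, p in ranked[:k]]

def phase2_prompt(inquiry: str, relevant: list[str]) -> str:
    """Phase 2: structured prompt that grounds the drafted response in the passages."""
    context = "\n\n".join(relevant)
    return (
        "You are drafting a technical assistance response.\n"
        f"Inquiry: {inquiry}\n\n"
        f"Relevant source passages:\n{context}\n\n"
        "Draft a response that answers the inquiry using only these passages, "
        "and cite the section numbers you relied on."
    )

inquiry = "What is the deadline for submitting encounter data?"
print(phase2_prompt(inquiry, phase1_retrieve(inquiry, passages)))
```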
This approach resulted in an application called SmartSearch, built for ease of use by the TA team. In a small-scale study, the TA team evaluated SmartSearch’s responses, finding that 53% of the LLM-provided answers were correct and required very few edits, and 78% were usable as first drafts.
SmartSearch is primed to enhance our TA team’s capabilities, streamlining response drafting by extracting information from relevant document citations. This innovation not only increases the efficiency of experienced analysts in locating specific information, but it also expedites the learning process for new analysts by familiarizing them with reference materials. Beyond the initial health services use case, this approach can be extended to TA tasks in other domains where timely information retrieval against complex source documentation is required.
Expediting Qualitative Coding with LLM-Assisted Content Analysis
Generative AI Use Case 3: Explore using LLMs to reduce the time it takes for deductive coding while retaining the flexibility of traditional content analysis.
Deductive coding is a key practice in qualitative research within the social sciences and public health domains. In this practice, researchers manually discern prevailing themes within documents, aiding in work like understanding the current discourse on electronic nicotine delivery systems across social media. However, deductive coding is resource-intensive, prone to errors, and demands significant human labor to sift through and reliably categorize a large body of unstructured text documents.
We developed an approach termed LLM-Assisted Content Analysis (LACA) to measure the impact of LLMs on deductive coding. We evaluated LACA on a publicly available deductive coding dataset and conducted empirical benchmarks on four distinct datasets to assess OpenAI’s GPT-3.5’s performance across various deductive coding tasks. Our results showed that GPT-3.5 often achieves deductive coding results comparable to human coders. Additionally, we asked the LLM to return an explanation for each coding decision, which was helpful in evaluating model performance and building trust in its predictions.
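The exact prompts and codebooks are documented in the preprint referenced below; the simplified sketch here only shows the general shape of a deductive coding prompt that requests both a code and an explanation, using a placeholder codebook and post text.

```python
# Simplified illustration of a deductive coding prompt that asks the LLM for both
# a code and an explanation. The codebook and post text are placeholders; the
# exact prompts and codebooks used for LACA are documented in the preprint.

codebook = {
    "health_effects": "Post discusses health effects of e-cigarette use.",
    "marketing": "Post promotes or advertises an e-cigarette product.",
    "policy": "Post discusses regulation or policy around e-cigarettes.",
    "other": "Post does not fit any of the above codes.",
}

def build_coding_prompt(text: str) -> str:
    """Ask the model to apply one code from the codebook and justify the choice."""
    definitions = "\n".join(f"- {code}: {desc}" for code, desc in codebook.items())
    return (
        "Apply exactly one code from the codebook below to the text.\n\n"
        f"Codebook:\n{definitions}\n\n"
        f"Text: {text}\n\n"
        "Respond with two lines:\n"
        "code: <one code from the codebook>\n"
        "explanation: <one sentence explaining the decision>"
    )

print(build_coding_prompt("Switching to vaping helped me quit smoking last year."))
```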
Overall, our findings have implications for future deductive coding and related research methods. LACA and LLMs show potential for expediting deductive coding without compromising accuracy and are supportive tools for accelerating the latter stages of deductive coding that tend to be more manually taxing and less fulfilling for researchers.
In June 2023, we released a preprint, “LLM-Assisted Content Analysis: Using Large Language Models to Support Deductive Coding,” which documents the LACA methodology, reporting standard, prompts, and codebooks.
Automating Information Extraction for Systematic Reviews and Other Evidence Synthesis
Generative AI Use Case 4: Improve the efficiency and accuracy of data extraction in evidence synthesis using LLMs.
In the field of evidence synthesis, data extraction is a critical yet arduous process, prone to human error and requiring substantial time investment. Previous research has shown that data extraction by a single investigator, with verification by a second investigator, takes 107 minutes on average. In addition, evidence suggests that manual data extraction error rates can reach 63% in systematic reviews.
Previous attempts to harness machine learning to streamline data extraction in systematic reviews have failed to achieve sufficient accuracy and usability.
In a proof-of-concept study, we evaluated the performance of the Claude 2 LLM in extracting data elements from published studies, comparing it with the human-driven data extraction methodology commonly used in systematic reviews. This analysis used a sample of 10 open-access, English-language publications of randomized controlled trials included in a single systematic review. We selected 16 distinct types of data (totaling 160 data elements across the studies), each posing varying difficulty levels for extraction.
We supplied the text of each article to the LLM and then formulated prompts to extract specific data elements of interest. The model processed the prompts and attempted extractions from the text, which we compared against the reference standard from a prior systematic review by our team. Across the 160 data elements, the process achieved an overall accuracy rate of 96.3% with high test-retest reliability. It produced only 6 errors, the most common being missed data items. Perhaps most importantly, the web-browser interface of the LLM proved user-friendly, making it easy for researchers without a technical background to contribute to the information extraction task. Additionally, no data preprocessing or topic-specific training datasets were required: the process was as simple as uploading the relevant PDFs into the user interface and engaging with the LLM from there.
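Conceptually, scoring the extractions amounts to comparing each extracted element against the reference standard. The toy example below, with made-up element names and values, shows one minimal way to compute accuracy and surface missed items.

```python
# Minimal sketch of scoring LLM-extracted data elements against a reference
# standard; the element names and values are illustrative.

reference = {"sample_size": "248", "mean_age": "61.4", "primary_outcome": "HbA1c"}
extracted = {"sample_size": "248", "mean_age": None, "primary_outcome": "HbA1c"}

correct = sum(extracted.get(k) == v for k, v in reference.items())
accuracy = correct / len(reference)
missed = [k for k, v in extracted.items() if v is None]

print(f"Accuracy: {accuracy:.1%}")      # 66.7% in this toy example
print(f"Missed data items: {missed}")   # the most common error type we observed
```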
Findings from the proof-of-concept study suggest that leveraging LLMs could substantially enhance the efficiency and accuracy of data extraction processes required for evidence syntheses. Learn more about the methodology and results in the articles, “Data Extraction for Evidence Synthesis Using a Large Language Model: A Proof-of-Concept Study” and “Performance of Two Large Language Models for Data Extraction in Evidence Synthesis: Brief Report,” in Research Synthesis Methods.
Automating Text Cluster Naming with Generative AI
Generative AI Use Case 5: Assess the ability of generative AI to automate the cluster-naming process.
Text clustering enables the discovery of natural groupings in collections of texts, making it a common and powerful analytical tool. For example, text clustering can be used for thematic analysis of open-text survey responses; however, describing and naming text clusters is essential to making use of the results. In practice, this step usually requires a subject matter expert to read texts from each cluster. This manual labor creates a bottleneck that negates the efficiency gains promised by text clustering.
Through experimentation with OpenAI’s GPT-3.5 on two benchmark datasets, we found that the LLM-generated cluster names were of comparable—and at times superior—quality to those generated by human subject matter experts. Notably, our research found that providing the model with complete texts was unnecessary. Strategies using only keywords extracted from documents and clusters demonstrated quality performance, which can provide an alternative to supplying the full text when concerns surrounding data security and privacy are present.
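The snippet below sketches that keywords-only strategy: top TF-IDF terms stand in for a cluster’s documents, and only those keywords are placed in the naming prompt. The documents, cluster, and prompt wording are illustrative rather than drawn from our studies.

```python
# Sketch of the keywords-only naming strategy: pull top TF-IDF terms for one
# cluster and ask the LLM to propose a name from keywords alone, without
# sending full texts. Documents and cluster assignment are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

cluster_docs = [
    "The after-school tutoring program improved reading scores.",
    "Students in the literacy intervention read more fluently.",
    "Reading comprehension gains were largest for third graders.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(cluster_docs)
mean_scores = np.asarray(tfidf.mean(axis=0)).ravel()
terms = vectorizer.get_feature_names_out()
top_keywords = [terms[i] for i in mean_scores.argsort()[::-1][:8]]

prompt = (
    "The following keywords describe one cluster of survey responses:\n"
    f"{', '.join(top_keywords)}\n\n"
    "Propose a short, descriptive name (5 words or fewer) for this cluster."
)
print(prompt)  # send to the LLM; only keywords leave the analysis environment
```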
This approach promises substantial time and cost savings for projects that rely on text clustering methodologies. One of the most promising applications is open-text survey analysis, where subject matter experts can focus on thematic analysis and on refining the names provided by automation, rather than beginning the process from scratch. Beyond survey data, our team is extending this approach to constructing a data-driven taxonomy of research topics for a large federally funded education study. With more than 2,000 distinct research-focused clusters identified, automating the cluster-naming process will be pivotal to interpreting such a large quantity of clusters meaningfully. To learn more about our automated cluster naming work, see our presentation at the 2023 Government Advances in Statistical Programming conference.
Bridging Data Gaps with LLM-Generated Synthetic Data
Generative AI Use Case 6: Use LLMs to generate synthetic data to train machine learning models in text classification tasks.
Developing effective machine learning models is often hampered by a lack of sufficient labeled data. This data scarcity limits the models' accuracy and applicability in real-world scenarios.
We found that models trained on synthetic data performed comparably to those trained on real-world data, effectively overcoming the challenge of data scarcity and enhancing the practical use of machine learning in various applications.
The process included developing prompts for generating optimized synthetic labeled data, training traditional machine learning models to perform various text classification tasks, and evaluating those trained models on datasets of human-labeled data. We also compared these models against zero-shot prompts fed to ChatGPT. A zero-shot prompt asks a model to perform a task without providing any prior examples, relying on the model’s training data and pre-existing knowledge to generate a response (a minimal illustration follows).
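For concreteness, a zero-shot classification prompt can be as simple as the sketch below; the label set and offense description are illustrative placeholders rather than the actual taxonomies used in the use cases listed next.

```python
# Minimal illustration of a zero-shot classification prompt: the model is given
# only the task description and label set, with no labeled examples. The labels
# and input text are illustrative placeholders.

labels = ["assault", "burglary", "fraud", "vandalism", "other"]

def zero_shot_prompt(text: str) -> str:
    """Build a zero-shot prompt that constrains the answer to one label."""
    return (
        "Classify the offense description into exactly one of these categories: "
        f"{', '.join(labels)}.\n\n"
        f"Offense description: {text}\n\n"
        "Respond with the category name only."
    )

print(zero_shot_prompt("Suspect forced entry through a rear window and took electronics."))
```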
Specifically, we explored three generative AI use cases with this methodology:
- classifying criminal offense texts into a standardized taxonomy of 22 categories,
- classifying 911 calls/computer-aided dispatch descriptions into a taxonomy of 8 categories, and
- classifying an open-ended survey item for identifying instances of an adverse event requiring follow-up.
In the first two AI use cases, models trained on synthetic data demonstrated performance comparable to ChatGPT performing classifications in a zero-shot setting, achieving F1 scores of 0.71 and 0.81. In the third use case, adverse event detection, we used ChatGPT to supplement existing data by generating synthetic examples of the adverse events we were interested in detecting (e.g., domestic violence, abuse). Because such events are very rare, traditional machine learning methods struggle to classify them accurately due to the imbalance in the outcome; a sketch of this augmentation approach follows.
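The sketch below illustrates the general augmentation idea under stated assumptions: LLM-generated synthetic positives are appended to an imbalanced training set before fitting a traditional classifier, which is then scored against human-labeled data. All texts and labels are placeholders.

```python
# Sketch of the rare-event augmentation idea: synthetic positive examples
# (generated by an LLM) are added to an imbalanced training set before fitting
# a traditional classifier. All texts here are short illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Real (imbalanced) training data: the adverse event of interest is rare.
train_texts = [
    "Respondent reports feeling safe at home.",          # 0
    "No concerns noted during the follow-up call.",      # 0
    "Participant described ongoing fear of a partner.",  # 1 (rare positive)
]
train_labels = [0, 0, 1]

# Synthetic positives generated by the LLM (placeholders shown here).
synthetic_texts = [
    "Respondent disclosed repeated threats from a family member.",
    "Participant said they were afraid to return home after an argument.",
]
synthetic_labels = [1, 1]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts + synthetic_texts, train_labels + synthetic_labels)

# Evaluate on human-labeled data held out from training (also placeholders).
test_texts = ["Caller reported being threatened at home.", "Everything is fine."]
test_labels = [1, 0]
print("F1:", f1_score(test_labels, model.predict(test_texts)))
```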
As a part of this project, we also developed an internal tool called EZPrompt, which streamlines prompt engineering, submits multiple LLM requests, and interacts with the ChatGPT API. The EZPrompt Python package has been adopted by our team and extended to diverse domains and use cases. Synthetic data generation plays to the strengths of LLMs because they can produce plausible text across diverse domains. Beyond overcoming data scarcity, synthetic data enables the use of machine learning when original data are inaccessible due to security or computational constraints.
Diverse Applications, Common Lessons: Learning from Generative AI Use Cases Across Domains
Generative AI has proven versatile across various use cases, yet its effective deployment requires rigor and careful consideration. In exploring the application of generative AI to address client needs, we found the following key cross-domain takeaways around the capabilities and limitations of generative AI:
- Human review of AI output is still necessary. At the end of the day, LLMs are generative models producing text content. It is important to perform quality control and always manually review LLM output to verify that it accomplishes the goal of the task, and to modify the content as necessary before distributing it. Data scientists need to review output to facilitate effective prompt engineering, and subject matter experts need to review output to verify its accuracy and fidelity for any given task.
- Develop a structured evaluation process for generative AI use. It is never too early to ask, “How will we know if this works?” and “How do we know if it’s improving?” Adopting quantitative performance metrics from machine learning makes it straightforward to evaluate LLM performance against humans or traditional machine learning approaches. When quantitative evaluation is less straightforward (e.g., summarizing text), developing clear standards and benchmarks for the task, such as the number of responses usable as a first draft, is a good way to assess the performance and potential efficiency gains from generative AI.
- Generative AI has applications across numerous domains. Generative AI's adaptability is evident in the diverse applications evaluated. It simplifies complex texts for wider accessibility, aids in social science research, enhances TA workflows, and optimizes data science tasks. These varied uses across domains underscore the broad applicability of generative AI tools.
How to Use Generative AI in the Future
Although we achieved promising results across the diverse tasks presented here, we are only beginning to explore the landscape of use cases and nuances involved in generative AI use. Over the past year, the generative AI field has changed drastically with the introduction of competing tools from companies like OpenAI, Google, and Anthropic, along with the release of open-source models. We will remain agile in this constantly shifting environment and will work with our clients to explore additional use cases for generative AI, including the following:
- evaluating open-source models and securely self-hosting LLMs, including creating secure environments for LLM use on sensitive, non-public data;
- exploring approaches like Retrieval Augmented Generation to reduce hallucinations; and
- ensuring that use of LLMs aligns with Responsible AI principles and practices.
Learn more about RTI’s AI capabilities and data science services.