Large language model-generated clinical practice guideline for appendicitis.
Division
South Atlantic
Hospital
Grand Strand Medical Center
Document Type
Manuscript
Publication Date
4-18-2025
Keywords
Appendicitis, ChatGPT, Clinical practice guideline, Generative AI, Large language models, Surgery
Disciplines
Diagnosis | Health Information Technology | Medicine and Health Sciences | Surgery
Abstract
BACKGROUND: Clinical practice guidelines provide important evidence-based recommendations to optimize patient care, but their development is labor-intensive and time-consuming. Large language models (LLMs) have shown promise in supporting academic writing and the development of systematic reviews, but their ability to assist with guideline development has not been explored. In this study, we tested the capacity of LLMs to support each stage of guideline development, using the latest Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) guideline on the surgical management of appendicitis as a comparison.
METHODS: Prompts were engineered to trigger LLMs to perform each task of guideline development, using key questions and PICOs derived from the SAGES guideline. ChatGPT-4, Google Gemini, Consensus, and Perplexity were queried on February 21, 2024. LLM performance was evaluated qualitatively, with narrative descriptions of each task's output. The Appraisal of Guidelines for Research and Evaluation in Surgery (AGREE-S) instrument was used to quantitatively assess the quality of the LLM-derived guideline compared to the existing SAGES guideline.
RESULTS: Popular LLMs were able to generate a search syntax, perform data analysis, and follow the GRADE approach and Evidence-to-Decision framework to produce guideline recommendations. These LLMs were unable to independently perform a systematic literature search or reliably perform screening, data extraction, or risk of bias assessment at the time of testing. AGREE-S appraisal produced a total score of 119 for the LLM-derived guideline and 156 for the SAGES guideline. In 19 of the 24 domains, the two guidelines scored within two points of each other.
CONCLUSIONS: LLMs demonstrate potential to assist with certain steps of guideline development, which may reduce the time and resource burden associated with these tasks. As new models are developed, the role of LLMs in guideline development will continue to evolve. Ongoing research and multidisciplinary collaboration are needed to support the safe and effective integration of LLMs in each step of guideline development.
Publisher or Conference
Surgical Endoscopy
Recommended Citation
Boyle A, Huo B, Sylla P, et al. Large language model-generated clinical practice guideline for appendicitis. Surg Endosc. Published online April 18, 2025. doi:10.1007/s00464-025-11723-3