2.3 - Theme 2. Harnessing Artificial Intelligence, Technology and Digital Innovations in Guideline Development and Implementation
Wednesday, September 17, 2025 | 11:00 AM - 12:30 PM
Speaker
Miss Bingyi Wang
China
Lanzhou University
Consistency and Limitations of Large Language Models in Clinical Decision-Making: A Systematic Review Against Practice Guidelines
Abstract
Objective: To systematically evaluate the alignment of large language models (LLMs) with clinical practice guidelines across diverse medical specialties and identify methodological gaps in existing validation studies.
Methods: Following PRISMA 2020 guidelines, we searched PubMed, Embase, Web of Science, and preprints (up to March 7, 2025) for empirical studies comparing LLM-generated recommendations with authoritative guidelines. Data extraction focused on disease domains (ICD-11 classification), LLM versions, evaluation metrics, and methodological rigor. Two reviewers independently screened studies and assessed quality using the QUADAS-2 tool.
Results: Among 183 included studies (n=51 unique first-author groups in the preliminary dataset), musculoskeletal disorders (28.4%) and oncology (22.9%) were the most represented specialties, followed by urology (15.3%), gastroenterology (12.0%), and cardiovascular diseases (9.8%). The pooled accuracy of LLMs across all domains was 54.7% (95% CI: 48.2–60.9%, I²=89.2%). Discipline-specific accuracy estimates will be presented at the conference.
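The abstract does not state the pooling method; as a minimal sketch, a pooled accuracy with an I² statistic like the one reported could be obtained with a DerSimonian–Laird random-effects model on logit-transformed proportions. All counts below are illustrative, not the review's data.

```python
import numpy as np

def pool_logit_proportions(events, totals):
    """DerSimonian-Laird random-effects pooling of proportions on the
    logit scale; returns pooled estimate, 95% CI, and I^2 (%)."""
    events, totals = np.asarray(events, float), np.asarray(totals, float)
    p = events / totals
    y = np.log(p / (1 - p))                  # logit-transformed accuracy
    v = 1 / events + 1 / (totals - events)   # within-study variance (delta method)
    w = 1 / v
    y_fixed = np.sum(w * y) / np.sum(w)
    Q = np.sum(w * (y - y_fixed) ** 2)       # Cochran's Q
    df = len(y) - 1
    I2 = max(0.0, (Q - df) / Q) * 100        # heterogeneity as a percentage
    tau2 = max(0.0, (Q - df) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_re = 1 / (v + tau2)                    # random-effects weights
    mu = np.sum(w_re * y) / np.sum(w_re)
    se = np.sqrt(1 / np.sum(w_re))
    inv = lambda x: 1 / (1 + np.exp(-x))     # back-transform to a proportion
    return inv(mu), (inv(mu - 1.96 * se), inv(mu + 1.96 * se)), I2

# Illustrative per-study counts: correct LLM answers / guideline items tested
acc, ci, i2 = pool_logit_proportions([42, 18, 60], [70, 40, 95])
print(f"pooled accuracy {acc:.1%}, 95% CI {ci[0]:.1%}-{ci[1]:.1%}, I2 {i2:.0f}%")
```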
Conclusions: While LLMs demonstrate potential for standardized decision support, their reliability remains context-dependent and constrained by dynamic guideline updates. This review highlights an urgent need for standardized evaluation frameworks addressing version control, temporal validity, and domain-specific calibration. Future research should prioritize transparency in model training data, ethical safeguards against misinformation, and prospective validation in clinical workflows.
Paper Number
473
Biography
Bingyi Wang, a master's degree candidate, possesses extensive expertise in conducting systematic reviews, encompassing retrieval, screening, and information extraction. She has delivered lectures in training courses on several steps of the systematic review process. To date, she has published five articles, including one systematic review, and has participated in the production of several systematic reviews.
Dr. Bei Pan
School of Basic Medical Sciences, Lanzhou University
Development and Evaluation of an Automated Prompt Generator for AI-Assisted Guideline Evidence Synthesis
Abstract
Background:
Clinical practice guideline (CPG) development faces significant challenges in balancing the need for rapid evidence updates with resource-intensive systematic review processes. Data extraction and risk-of-bias (ROB) assessment are particularly time-consuming steps. While large language models (LLMs) show promise in these tasks, creating specific prompts for varied guideline development needs remains a challenge.
Objective:
To develop and evaluate an automated prompt generator that creates tailored instructions for LLMs to assist in data extraction and ROB assessment.
Methods:
We constructed a comprehensive vector database using over 20,000 verified extraction and assessment cases from more than 30 unique tools. The system generates structured prompts from natural language inputs provided by CPG developers. We tested the system on five randomly selected clinical questions using the prompts to guide eight different LLMs to conduct extractions and assessments, evaluating accuracy, consistency, and efficiency against expert consensus standards.
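The abstract does not specify the retrieval stack; the following is a minimal sketch of the retrieve-then-compose idea only, with a placeholder embed() function and an in-memory cosine-similarity store standing in for the vector database of verified cases.

```python
import numpy as np

# Hypothetical store: each verified case pairs an embedding with a worked
# extraction/assessment example; the real system holds >20,000 such cases.
CASES = [
    {"vec": np.random.rand(384), "tool": "ROB2",
     "example": "Domain 1: randomization described as ... -> low risk"},
]

def embed(text: str) -> np.ndarray:
    """Placeholder embedding function; swap in any sentence encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(384)

def generate_prompt(user_request: str, k: int = 3) -> str:
    """Compose a structured prompt from the k most similar verified cases."""
    q = embed(user_request)
    sims = [float(q @ c["vec"] / (np.linalg.norm(q) * np.linalg.norm(c["vec"])))
            for c in CASES]
    top = sorted(zip(sims, CASES), key=lambda t: -t[0])[:k]
    shots = "\n\n".join(c["example"] for _, c in top)
    return (f"You assist guideline developers with {top[0][1]['tool']} tasks.\n"
            f"Follow the style of these verified examples:\n{shots}\n\n"
            f"Task: {user_request}\nReturn a structured, step-by-step answer.")

print(generate_prompt("Extract outcome data from this RCT and assess ROB2"))
```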
Results:
Using prompts generated by the system, LLMs achieved average accuracies of 92.7% for data extraction and 91.6% for ROB assessment across all tested scenarios. Both intra- and inter-model consistency showed high reliability (Cohen's kappa >0.80). Processing time decreased by 75.6% compared to conventional methods. Performance remained robust across different tools and languages, with non-significant variation in accuracy.
Discussion:
Previous studies have confirmed the potential of LLMs in supporting evidence synthesis for CPG development. Our research provides a method for generating reliable prompts at scale. The system could address the diverse needs of guideline development, potentially accelerating the evidence synthesis process without compromising quality.
Paper Number
99
Biography
Honghao Lai, a Ph.D. candidate at the School of Public Health, Lanzhou University, focuses on evidence-based research methods and evidence-informed health policy-making. He has participated in 5 national and provincial-level projects, contributed to 10 guidelines led by the China Academy of Chinese Medical Sciences, Guangzhou University of Chinese Medicine, and Shanghai Jiao Tong University, co-authored 2 books, including "A Guide to Meta-analysis Software Operation," and published over 50 papers, with 8 as first/co-first author in top journals such as NPJ Digital Health, JAMA Network Open, Metabolism, BMJ Mental Health, and Critical Reviews in Food Science and Nutrition.
Dr Jacqueline Deurloo
TNO
Developing data-protocols for healthcare guidelines to support the use of decision support tools
Abstract
Background
Data is playing an increasingly crucial role in healthcare, supporting monitoring, research, and innovations such as decision support systems (DSS). This requires accurate, consistent and reliable data. In the Netherlands, Youth Health Care (YHC) professionals routinely monitor the growth and development of children aged 0-18 years. Data are registered in electronic health records (EHR), in accordance with a national ontology, the Basic DataSet (BDS).
Objective
To translate the recommendations of all 35 YHC guidelines into elements fitting the BDS, contributing to accurate, reliable and consistent data.
Methods
Recommendations are ‘translated’ into what needs to be registered in the EHR. Subsequently, these registration requirements are compared with the items in the BDS. The relevant BDS items are documented in a BDS-protocol. Criteria for inclusion in the BDS-protocol are: required for individual care and/or required for DSS.
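As a rough illustration of what one BDS-protocol entry might capture, the sketch below models the mapping as a small data structure; the BDS item number and field names are invented for the example, not taken from the actual Basic DataSet.

```python
from dataclasses import dataclass

@dataclass
class BDSProtocolEntry:
    """One row of a BDS-protocol: links a guideline recommendation
    to the data element that must be registered in the EHR."""
    recommendation: str      # text of the guideline recommendation
    registration_item: str   # what must be registered
    bds_item: str | None     # matching BDS item, or None -> change request
    required_for_care: bool  # criterion 1: needed for individual care
    required_for_dss: bool   # criterion 2: needed for decision support

entry = BDSProtocolEntry(
    recommendation="Measure body length at every visit in year 1",
    registration_item="body length (cm), measurement date",
    bds_item="BDS-235 (hypothetical item number)",
    required_for_care=True,
    required_for_dss=True,
)
# Entries whose bds_item is None would trigger a change request for the BDS.
print(entry, entry.bds_item is None)
```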
Results
BDS-protocols have been developed for each YHC guideline. Where registration requirements do not match the items in the BDS, a change request for the BDS is submitted. Adjustments to the BDS and the EHR required a very long lead time, and for some guidelines it was difficult to create accurate and consistent data elements.
Discussion
The development and use of a protocol for data registration should be a standard part of guideline development. It will contribute to more accurate, consistent and reliable data in the EHR, accommodating the use of DSS and monitoring of the use of guidelines. Further research is needed to explore opportunities for optimization and to define key conditions necessary to enhance data-driven decision-making.
Paper Number
520
Biography
Jacqueline Deurloo is a Youth Health Care physician; she works as a guideline developer at TNO.
Ms Ye Wang
China
Lanzhou University
Evaluating Adherence to PRISMA Checklist in Systematic Reviews Using Large Language Models: A Feasibility Study
Abstract
Background: Systematic reviews and meta-analyses are crucial for evidence-based decision-making, and adherence to PRISMA guidelines ensures their quality. Evaluating adherence manually is labor-intensive. Large language models (LLMs) such as ChatGPT, Gemini, and Claude offer potential efficiency gains, but their accuracy and feasibility require validation.
Objective: To evaluate the feasibility and accuracy of ChatGPT, Gemini, and Claude for assessing adherence to the PRISMA checklist.
Methods: We assessed 23 systematic reviews previously evaluated by experts against PRISMA guidelines, using the study "Adherence to the PRISMA Checklist in Systematic Reviews: A Meta-epidemiological Study" published in Cochrane as the reference standard. Assessments by ChatGPT, Gemini, and Claude models were compared with expert evaluations using overall consistency scores (OCS), Cohen’s kappa, prevalence- and bias-adjusted kappa (PABAK), sensitivity, specificity, and F1 scores.
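Since PABAK may be less familiar than Cohen's kappa, here is a minimal worked sketch of both statistics for one binary LLM-vs-expert item comparison; the counts are illustrative, not study data.

```python
def kappa_and_pabak(a: int, b: int, c: int, d: int):
    """a=both yes, b=LLM yes/expert no, c=LLM no/expert yes, d=both no."""
    n = a + b + c + d
    po = (a + d) / n                       # observed agreement
    p_llm_yes, p_exp_yes = (a + b) / n, (a + c) / n
    pe = (p_llm_yes * p_exp_yes
          + (1 - p_llm_yes) * (1 - p_exp_yes))  # chance agreement
    kappa = (po - pe) / (1 - pe)           # Cohen's kappa
    pabak = 2 * po - 1                     # prevalence- and bias-adjusted kappa
    return kappa, pabak

# Illustrative counts for one PRISMA item across 23 reviews
print(kappa_and_pabak(a=15, b=2, c=1, d=5))  # kappa ~0.68, PABAK ~0.74
```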
Results: Preliminary results from this pilot phase are based on an initial subset of the data; comprehensive results will be presented upon completion of the full study. The preliminary average OCS was 76.9% (95% CI: 67.0%-86.8%). Of 42 checklist items, 76.2% had consistency scores above 70%, and 10 items reached 100% consistency.
Discussion: ChatGPT, Gemini, and Claude showed potential in assessing adherence to PRISMA, although performance varied across specific items.
Conclusions: LLMs demonstrate promising feasibility as auxiliary tools for assessing PRISMA adherence. However, further improvements are required for comprehensive accuracy.
The author gratefully acknowledges the support of the K.C. Wong Education Foundation, Hong Kong.
Paper Number
425
Biography
Ye Wang is an MPH student at the School of Public Health, Lanzhou University. Her research focuses on clinical guideline development, AI-accelerated evidence synthesis, and AI-assisted COI management. With a growing interest and experience in guideline development, her research aims to enhance the transparency and consistency of guidelines and promote AI's role in evidence synthesis and conflict of interest management.
Christopher Wolfkiel
Clinical Guidelines Resources
Applying LLM Extraction and Reasoning Models to ROB2 Domain Risk Assessment
Abstract
Background: Risk of bias tools require considerable time and expertise to implement. Generative AI prompts can, in theory, be designed to draft domain responses.
Objective: To test performance of AI prompts in assessing ROB2 domain risk compared to published results.
Methods: ROB2 prompts were implemented in three platforms: Indico's Data Extraction, GRADE GPT, and Anthropic Claude 3.7 Sonnet. Implementation included an iterative prompt design/editing process with 2-3 representative PDFs. AI results from available RCT PDFs from three systematic reviews were compared with published data. Available PDFs were defined as those open access or retrievable through Google Scholar. AI models tested included Indico's API implementation of GPT-4o, GRADE GPT-4-Turbo, and the Anthropic Claude 3.7 reasoning model. Prompts were designed so as not to include copyrighted ROB2 text; instead, generic instructions to assess domain risk were used (e.g., "Assess ROB2 Domain Risk"). Claude's prompt also included a request for signaling-question detail.
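A minimal sketch of the generic-prompt pattern against the Anthropic API follows; the exact prompt wording and model ID used in the study are not given in the abstract, so both are assumptions here, and the sketch requires the `anthropic` Python package.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

GENERIC_PROMPT = (
    "Assess ROB2 Domain Risk for the attached randomized trial report. "
    "For each of the five domains, state Low / Some concerns / High, "
    "and briefly justify each judgement with signaling-question detail."
)

def assess_rob2(trial_text: str) -> str:
    """Send one trial report plus a generic ROB2 instruction; no
    copyrighted ROB2 guidance text is included in the prompt."""
    msg = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # assumed reasoning-capable model
        max_tokens=1500,
        messages=[{"role": "user",
                   "content": f"{GENERIC_PROMPT}\n\n---\n{trial_text}"}],
    )
    return msg.content[0].text

# print(assess_rob2(open("trial.txt").read()))
```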
Results: AI ROB2 domain agreement with published data from 44 RCTs was modest:
Indico 4o - 60% (range 39-80%)
GRADE GPT-4-Turbo - 56% (range 39-75%)
Claude 3.7 - 62% (range 50-68%)
In all models, Missing Outcome Bias was the lowest-agreement domain (30-50%).
Discussion: While the allure of using AI to semi-automate risk-of-bias determination is great, there appear to be significant limitations to a generic prompt approach, which assumes the LLM has enough inherent understanding of ROB2 to complete the task accurately. Perhaps an implementation incorporating proprietary content with more advanced AI tools could lead to a reliable implementation.
Paper Number
441
Biography
Healthcare informaticist specializing in clinical guidelines, experienced in product management, generative AI, analytics, consulting and clinical research, with leadership roles in the healthcare content, software, pharmaceutical marketing, managed care and medical device industries. Strengths include leading talented healthcare professionals, project management, and cross-functional interactions with IT, education, finance, sales and marketing. Progressive responsibilities include management of key opinion leader and society relationships, leading clinical research programs, analytical services, and informatics product management including marketing, software development, vendor management and consulting efforts.
Ms Yingwen Wang
China
Nurse
Children's Hospital Of Fudan University
Automated Evidence Integration System for Pediatric CAP Knowledge Graphs: Enhancing Guideline Currency Through AI-Driven Updates
Abstract
Purpose: To design an automated knowledge graph update mechanism for pediatric community-acquired pneumonia (CAP).
Methods: Given the volume and continuous updates of evidence-based medical data from systematic reviews and randomized controlled trials, manual knowledge graph construction is costly and inefficient. We utilized a high-quality knowledge graph constructed by expert teams as training data. For model development, we fine-tuned BioBERT, a biomedical pre-trained language model based on Transformer architecture, to perform named entity recognition and relationship extraction from medical texts. The model was applied to large-scale evidence-based medical data to automatically extract and structure knowledge. We then implemented knowledge fusion to integrate new findings with existing graphs, resolve conflicts, and ensure consistency. Finally, we established an automatic update mechanism that periodically collects new data and repeats this process, enabling dynamic knowledge graph updates.
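A minimal sketch of the fine-tuning setup described, assuming the Hugging Face `transformers` library and the public `dmis-lab/biobert-base-cased-v1.1` checkpoint; the label set is invented for illustration, since the real schema comes from the expert-built graph.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Illustrative entity labels; the real schema comes from the expert annotations.
LABELS = ["O", "B-DISEASE", "I-DISEASE", "B-DRUG", "I-DRUG",
          "B-OUTCOME", "I-OUTCOME"]

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model = AutoModelForTokenClassification.from_pretrained(
    "dmis-lab/biobert-base-cased-v1.1",
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={l: i for i, l in enumerate(LABELS)},
)

text = "Amoxicillin reduced treatment failure in children with CAP."
enc = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**enc).logits             # shape: (1, seq_len, num_labels)
pred_ids = logits.argmax(-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
print(list(zip(tokens, (LABELS[i] for i in pred_ids))))
# Fine-tuning would wrap this model in a transformers Trainer over the
# expert-annotated entity/relationship corpus before real use.
```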
Results: From sources including 7 guidelines, 4 protocols, 10 consensus documents, 2 textbooks, 53 systematic reviews, and 232 original studies, three specialists and twenty annotators manually identified 2,769 entities and 8,283 relationships. The automated system extracted 3,983 entities and 14,652 relationships. Validation on 4 RCTs showed strong performance: entity extraction (precision=0.877, recall=0.947, F1=0.911) and relationship extraction (precision=0.839, recall=0.929, F1=0.882).
Conclusion: This research supports data-driven and knowledge-driven deep learning models, with potential to further enhance the accuracy and reliability of clinical prediction models and guideline updates. The automated system enables more efficient integration of new evidence into clinical guidelines for pediatric CAP.
Paper Number
439
Biography
Yingwen Wang is a research nurse at Children's Hospital of Fudan University, specializing in AI applications in the pediatric healthcare sector and evidence-based nursing. She has 5 years of experience in knowledge graph development, clinical guideline implementation, and AI-driven healthcare solutions. Yingwen leads research on automated evidence synthesis and knowledge integration for clinical decision support, with a focus on pediatric respiratory conditions. She has published several articles in peer-reviewed journals and has contributed to related projects and guidelines.
Dr. Ning Ma
China
Lanzhou University
Integrating Large Language Models into a Chain-of-Thought Reasoning Framework for Enhanced Risk-of-Bias Assessment
Abstract
Background:
Risk of bias (ROB) assessment is crucial in systematic reviews for clinical practice guideline development. While previous studies showed Large Language Models (LLMs) can achieve 84.5%-89.5% accuracy in ROB assessment through direct prompting, existing approaches lack transparent reasoning processes and methodological rigor in decision pathways.
Objective:
To develop and validate an LLM-based chain-of-thought framework that enhances the accuracy, efficiency, and transparency of ROB assessment in systematic reviews.
Methods:
We developed a framework integrating multiple LLM calls with a retrieval-augmented generation system containing 500 expert-verified assessments. The system employs structured reasoning paths and consensus voting through predefined methodological criteria based on the modified Cochrane tool. We evaluated the framework on 200 randomly selected trials, comparing results with expert consensus and testing across six different LLMs with diverse model architectures.
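The consensus-voting step might look like the sketch below, with a hypothetical ask_llm() call standing in for one chain-of-thought reasoning path; the study's actual reasoning paths and methodological criteria are not reproduced here.

```python
from collections import Counter

def ask_llm(domain: str, trial_text: str, path: int) -> str:
    """Hypothetical single chain-of-thought call; a real implementation
    would prepend retrieved expert-verified assessments (RAG) and the
    predefined methodological criteria for this domain."""
    return "low"  # stub judgement for illustration

def consensus_rob(domain: str, trial_text: str, n_paths: int = 5) -> str:
    """Run several independent reasoning paths and vote on the judgement."""
    votes = Counter(ask_llm(domain, trial_text, p) for p in range(n_paths))
    judgement, count = votes.most_common(1)[0]
    if count <= n_paths // 2:   # no strict majority -> flag for human review
        return "needs-human-review"
    return judgement

print(consensus_rob("random sequence generation", "…trial text…"))
```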
Results:
Using GPT-4o mini, the framework achieved 98.4% overall accuracy (95% CI: [97.9–99.0]%) across 1,200 assessments. Domain-specific accuracies ranged from 97.5% to 99.2%, with high sensitivity (98.1%) and specificity (98.7%). Average processing time was 53 seconds per RCT. Performance remained consistent across different LLMs (95.4%-98.9% accuracy range), with low variability in repeated evaluations (coefficient of variation <2%).
Discussion:
Our structured framework demonstrates high accuracy, efficiency, and consistency in ROB assessment. The chain-of-thought approach with RAG integration provides transparent and traceable decision processes, suggesting potential for reliable systematic review support in guideline development. The framework's robust performance across multiple LLMs indicates its potential for widespread application in evidence synthesis and methodological research.
Paper Number
100
Biography
Honghao Lai, a Ph.D. candidate at the School of Public Health, Lanzhou University, focuses on evidence-based research methods and evidence-informed health policy-making. He has participated in 5 national and provincial-level projects, contributed to 10 guidelines led by the China Academy of Chinese Medical Sciences, Guangzhou University of Chinese Medicine, and Shanghai Jiao Tong University, co-authored 2 books, including "A Guide to Meta-analysis Software Operation," and published over 50 papers, with 8 as first/co-first author in top journals such as NPJ Digital Health, JAMA Network Open, Metabolism, BMJ Mental Health, and Critical Reviews in Food Science and Nutrition.
Mr Qiao Huang
China
Zhongnan Hospital of Wuhan University
Developing Short-Video Versions of Clinical Practice Guidelines: A Protocol
Abstract
Background
Clinical practice guidelines (CPGs), including patient versions, serve as essential tools for evidence-based healthcare, empowering patients to make informed decisions. However, their complexity often hinders patient understanding, limiting practical use and adherence. Short-video platforms, such as TikTok and Douyin, have become popular for sharing information and offer a novel opportunity to simplify and disseminate complex guideline recommendations in an engaging and accessible format.
Objective
This study aims to explore the feasibility and develop a framework for creating patient-friendly short-video versions of CPGs to enhance patient comprehension and adherence.
Methods
A preliminary survey of health-related short videos on short-video platforms will be conducted to evaluate content style, engagement, strengths, and limitations. A literature review on online health communication strategies and artificial intelligence based tools will be performed to identify key elements for effective knowledge translation. Based on these findings, expert consultations and brainstorming sessions will be held to develop a structured framework for transforming complex recommendations into patient-friendly short-video content.
Results
This study will identify core elements of effective short-video health communication, including simplified language, engaging visuals, storytelling techniques, and accessibility features. The framework will outline essential steps for CPG content adaptation, and incorporate mechanisms for content validation, feedback evaluation, iterative optimization, and adaptation to diverse patient populations.
Discussion
Leveraging short-video platforms for CPG dissemination may represent an innovative strategy to bridge the gap between guideline developers and the public, promoting more patient-centered healthcare communication. However, balancing information simplification and scientific rigor is crucial to maintaining guideline credibility and accessibility.
Paper Number
52
Biography
Biostatistician and clinical guideline developer at the Center for Evidence-Based and Translational Medicine, Zhongnan Hospital of Wuhan University (Wuhan, China). Combines advanced statistical methodologies with evidence synthesis to design robust clinical guidelines, bridging biomedical research and healthcare practice. Collaborates with multidisciplinary teams to transform complex data into actionable insights.
Dr Nofisat Ismaila
Senior Clinical Guideline Methodologist
ASCO
A Framework for Integrating Informatics and Artificial Intelligence into Living Clinical Guidelines
Abstract
Background: Living guidelines (LGs) have been proposed as a solution to the challenge of keeping up with the rapid pace of evidence generation. Integration of artificial intelligence (AI) and informatics into LGs holds the potential to enhance their adaptability, efficiency, and responsiveness to new evidence. However, there is limited guidance on how to systematically incorporate these technologies into LG development. This study aims to develop a comprehensive framework for integrating informatics and AI into LGs using a structured, multi-phase approach.
Methods: The framework development follows three steps. Step 1 involves a systematic review of existing literature and frameworks related to AI integration in LGs. This review will identify key themes, methodologies, and challenges associated with AI integration, informing the initial structure of the framework. In Step 2, a draft framework will be constructed based on the findings from Step 1, incorporating core components such as organizational readiness, living evidence synthesis, automation of updates and publication, and AI-assisted decision support. Finally, in Step 3, the draft framework will be refined through expert consultation, utilizing a modified Delphi process with stakeholders in guideline development and health informatics.
Expected outcome and Relevance: A preliminary version of the framework will be completed in time for the presentation at the conference. The final framework aims to improve the sustainability and responsiveness of LGs, ensuring that they remain current with the evolving evidence base. Future applications of this framework may support the development of AI-enhanced guideline methodologies, ultimately improving clinical decision-making and patient care.
Paper Number
501
Biography
A medical doctor specializing in evidence-based clinical practice guidelines, I focus on rigorous methodology, including systematic reviews and meta-analyses. My global collaborations translate research into patient-centered recommendations, emphasizing transparency and efficiency. Currently, my work centers on developing and implementing living guidelines, ensuring recommendations remain current with evolving evidence. I am particularly interested in exploring and integrating AI technology to streamline guideline development, enhance evidence synthesis, and improve the efficiency of updating living guidelines. My commitment lies in optimizing clinical guidelines to improve patient outcomes through innovative, evidence-driven approaches.
Dr Honghao Lai
China
School of Public Health, Lanzhou University
Enhancing Clinical Practice Guideline Implementation through AI-Powered Consultation System: Development and Evaluation
Abstract
Background:
Clinical practice guidelines (CPGs) are essential for evidence-based healthcare but face implementation challenges due to complexity and accessibility issues. Large language models (LLMs) offer potential solutions but often generate unverified responses.
Objective:
To develop and evaluate an AI-powered consultation system based on CPGs, using Chinese patent medicines (CPMs) as a case study.
Methods:
We systematically searched international and Chinese databases for CPGs on CPMs. Included CPGs were preprocessed and deconstructed into structured knowledge units (SKUs). These SKUs were converted into vector representations and integrated with an LLM using a retrieval-augmented generation framework. Performance was evaluated using 500 questions (250 from CPGs, 250 real-world queries) in Chinese and English, assessing applicability, recommendation alignment, traceability, safety, readability, and semantic consistency. Manual and automated evaluations were conducted, and statistical analyses were performed.
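The abstract names the moving parts (SKUs, vector representations, retrieval-augmented generation); the sketch below shows how a traceable answer prompt might be assembled from retrieved SKUs. The SKU fields and the retrieval stub are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class SKU:
    """Structured knowledge unit deconstructed from one CPG."""
    guideline: str        # source guideline (this is what enables traceability)
    recommendation: str   # recommendation text
    strength: str         # e.g., strong / weak
    evidence: str         # evidence grade

def retrieve(question: str, k: int = 3) -> list[SKU]:
    """Stand-in for vector retrieval over embedded SKUs."""
    return [SKU("Hypothetical CPM Guideline 2023",
                "Medicine X may be considered for condition Y.",
                "weak", "moderate")][:k]

def build_answer_prompt(question: str) -> str:
    skus = retrieve(question)
    context = "\n".join(
        f"[{i+1}] {s.recommendation} (strength: {s.strength}, "
        f"evidence: {s.evidence}; source: {s.guideline})"
        for i, s in enumerate(skus))
    # Constraining the LLM to the numbered context and requiring [n]
    # citations is what keeps every statement traceable to a guideline.
    return (f"Answer using only the context below and cite sources as [n].\n"
            f"Context:\n{context}\n\nQuestion: {question}")

print(build_answer_prompt("Can medicine X be used for condition Y?"))
```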
Results:
ChatCPM achieved high applicability (99.2% and 98.0%), recommendation alignment (96.0% and 96.4%), and traceability (94.4% and 87.8%) in Chinese and English, respectively, with 100% safety compliance. Readability scores averaged 77.6 in Chinese and 75.8 in English. ChatCPM outperformed other LLMs in recommendation alignment (96.0% vs. 29.6%-58.6%), traceability (94.4% vs. 13.8%-32.6%), and safety compliance (100% vs. 93.2%-97.6%). The system maintained consistent performance across different disease and question types, with an average response time of 14.1 seconds.
Discussion:
ChatCPM successfully transformed CPGs into an AI consultation platform, ensuring accuracy and traceability. This system offers a promising solution for improving guideline implementation in clinical practice. Future work will expand the knowledge base, integrate user feedback, and enhance handling of complex scenarios.
Paper Number
98
Biography
Honghao Lai, a Ph.D. candidate at the School of Public Health, Lanzhou University, focuses on evidence-based research methods and evidence-informed health policy-making. He has participated in 5 national and provincial-level projects, contributed to 10 guidelines led by the China Academy of Chinese Medical Sciences, Guangzhou University of Chinese Medicine, and Shanghai Jiao Tong University, co-authored 2 books, including "A Guide to Meta-analysis Software Operation," and published over 50 papers, with 8 as first/co-first author in top journals such as NPJ Digital Health, JAMA Network Open, Metabolism, BMJ Mental Health, and Critical Reviews in Food Science and Nutrition.
Miss Parwenayi Talifu
China
Center For Evidence-based And Translational Medicine, Zhongnan Hospital Of Wuhan University
PICO-Based Clinical Question Generation from Chinese Clinical Practice Guidelines Using LLMs
Abstract
Background: Many CPGs lack standardization, particularly in clearly defining clinical questions, which are the foundation of recommendations and central to clinical practice. Large language models (LLMs) offer powerful capabilities for text understanding and generation. Using coronary heart disease (CHD)-related guidelines as an example, we apply prompt engineering, fine-tuning, Retrieval-Augmented Generation (RAG), and Chain-of-Thought (CoT) techniques to extract clinical questions from CPGs by analyzing recommendations and evidence-related text.
Methods: We leverage the dependency between clinical questions, evidence, and recommendations to deconstruct PICO elements and generate corresponding clinical questions. We collected CHD-related guidelines published in the past five years, focusing on those with clear recommendations and methodological sections. These guidelines were divided into two categories: those explicitly stating clinical questions and those without.
First, guidelines with explicit clinical questions were used to train the LLM. The steps included:
1. Extracting recommendations and evidence, evaluating the rationality of recommendations.
2. Deconstructing PICO elements based on evidence and recommendations, with expert verification for accuracy.
3. Classifying PICO elements by interventions and outcomes, prompting the LLM to generate clinical questions.
4. Categorizing the clinical questions listed in the guidelines as accurate or inaccurate after expert review. Evaluating model performance using accuracy, recall, and F1 scores.
Once accuracy exceeds 90%, the optimized retrieval strategies and prompt templates will be applied to guidelines without explicit clinical questions; a sketch of the PICO-deconstruction prompting step follows below.
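This is a minimal sketch of the PICO-deconstruction step as a JSON-constrained prompt; the actual templates and retrieval strategies are still under development, so the field names and example content here are illustrative only.

```python
import json

PICO_PROMPT = """Given a guideline recommendation and its cited evidence,
deconstruct the PICO elements and generate one clinical question.
Return strictly this JSON:
{"population": "...", "intervention": "...", "comparison": "...",
 "outcome": "...", "clinical_question": "..."}

Recommendation: {rec}
Evidence summary: {evi}"""

def build_pico_prompt(rec: str, evi: str) -> str:
    # .replace() avoids clashing with the literal JSON braces in the template
    return PICO_PROMPT.replace("{rec}", rec).replace("{evi}", evi)

raw = build_pico_prompt(
    rec="Aspirin is recommended for secondary prevention in CHD patients.",
    evi="Meta-analysis of 16 RCTs: fewer major coronary events vs placebo.",
)
# response = llm(raw)  # hypothetical LLM call; a mocked response follows
example_output = ('{"population": "patients with coronary heart disease", '
                  '"intervention": "aspirin", "comparison": "placebo", '
                  '"outcome": "major coronary events", '
                  '"clinical_question": "In patients with CHD, does aspirin '
                  'compared with placebo reduce major coronary events?"}')
parsed = json.loads(example_output)  # parse/validate before expert review
print(parsed["clinical_question"])
```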
Anticipated value: This framework aims to be adaptable for clinical question extraction across various disease-specific guidelines, enhancing its utility in diverse medical contexts.
Paper Number
199
Biography
Parwenayi Talifu is a first-year Master’s student in Geriatric Medicine at Wuhan University. Currently, she is undergoing standardized clinical training at Zhongnan Hospital of Wuhan University, where she gains hands-on experience in elderly healthcare. Additionally, she is studying evidence-based medicine at the Center for Evidence-Based and Translational Medicine at Zhongnan Hospital. As a dedicated learner, Parwenayi is focused on acquiring both clinical and research skills to better understand and address the health needs of aging populations. She is committed to continuous growth and aims to contribute meaningfully to the field of geriatric medicine in the future.
