4.6 - Theme 2. Harnessing Artificial Intelligence, Technology and Digital Innovations in Guideline Development and Implementation
Wednesday, September 17, 2025 | 4:15 PM - 5:30 PM
Speaker
Dr Jungang Zhao
China
PhD Student
Children’s Hospital of Chongqing Medical University
AI-Powered Diagnostic Support for Pediatric Rare Diseases
Abstract
Background: Pediatric rare diseases pose diagnostic challenges due to complexity, low prevalence, and atypical presentations, leading to diagnostic delays and misdiagnosis. Artificial intelligence (AI), including large language models (LLMs) and knowledge graphs, offers potential for improved accuracy and efficiency.
Objective: To develop an AI-powered diagnostic support system for pediatric rare diseases, integrating multi-source data to provide clinicians with rapid, accurate diagnostic assistance, reducing delays and improving outcomes.
Methods: We constructed a knowledge base encompassing 141 pediatric-onset rare diseases from China’s Rare Disease Catalogs, supplemented by 200+ clinical guidelines and pediatric textbooks. This was structured using a vector database for similarity search and a knowledge graph for reasoning. A Retrieval-Augmented Generation (RAG) approach combined knowledge retrieval with fine-tuned LLMs (e.g., DeepSeek, Gemini Pro). The system was evaluated using published case reports, expert-curated question-answer pairs, and clinical exam questions.
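As a rough illustration only (the authors’ actual models, knowledge base, and index are not specified in the abstract), the sketch below shows the retrieval step of a RAG pipeline of the kind described: passages from a rare-disease knowledge base are embedded into a vector index, the most similar passages for a clinical question are retrieved, and the retrieved context is prepended to the question before it is sent to an LLM. The embedding model and passages are hypothetical stand-ins.

```python
# Minimal sketch of the retrieval step in a RAG pipeline (illustrative only;
# not the authors' actual stack).
import numpy as np
from sentence_transformers import SentenceTransformer  # hypothetical embedder choice

# Toy stand-ins for knowledge-base passages about pediatric rare diseases.
passages = [
    "Gaucher disease: hepatosplenomegaly, thrombocytopenia, bone pain in children.",
    "Phenylketonuria: developmental delay, musty odor, fair skin and eczema.",
    "Kawasaki disease: prolonged fever, conjunctivitis, rash, extremity changes.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
index = model.encode(passages, normalize_embeddings=True)  # rows are unit vectors

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k passages most similar to the query (cosine similarity)."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ q                     # cosine similarity via dot product
    top = np.argsort(scores)[::-1][:k]
    return [passages[i] for i in top]

# The retrieved context would then be combined with the clinical question and
# passed to a fine-tuned LLM for diagnostic suggestions.
context = "\n".join(retrieve("child with fever >5 days, rash and red eyes"))
prompt = f"Context:\n{context}\n\nQuestion: Which rare diseases should be considered?"
print(prompt)
```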
Results: The resulting web/mobile application shortens diagnostic cycles, enhances accuracy, and supports clinical decision-making. Preliminary evaluations demonstrate improved efficiency in evidence synthesis and guideline adaptation, with potential for global scalability.
Discussion: This AI-enhanced system offers a scalable approach to improving rare disease diagnosis globally, particularly in resource-limited settings. By integrating specialized knowledge with AI, we democratize access to diagnostic expertise, supporting health system goals of quality improvement and equitable care.
Paper Number
244
Biography
Jungang Zhao is a first-year PhD student in Medical Informatics at Chongqing Medical University. His research interests include evidence-based medicine, pediatrics, and guideline methodology. He has participated in various national and provincial research projects and has assisted in the development of over ten clinical practice guidelines and consensus statements, both in China and internationally. Currently, his research focuses on the construction and validation of an AI-based diagnostic and treatment system for rare pediatric diseases.
Dr Nikita van der Zwaluw
Kennisinstituut van Medisch Specialisten
Boosting Guideline Implementation and Engaging Healthcare Professionals: Knowledge Quizzes on Clinical Practice Guidelines
Abstract
• Background
Implementation of CPGs is often lacking. A scoping review by Peters (2022) identified more than 50 CPG implementation strategies. Clinicians have busy schedules and receive little implementation support (in the Netherlands), so our organization asked: what if we make it fun and quick? Thus the knowledge quiz on guidelines was born.
• Objective
To ascertain whether, and to what effect, the knowledge quiz can be introduced to medical specialists and integrated into the PDCA cycle of guidelines.
• Methods
A two-year pilot of the knowledge quiz with five medical specialist associations and selected guidelines.
• Results
8,800 medical specialists (including those in training) received one or two questions per week on guidelines relevant to them. The quiz was evaluated positively by all five medical associations after the first year (response rate 34%) and again at the end of the pilot (response rate 36%).
More than 90% of the medical associations’ members participated actively and interacted several times a week. Again, more than 90% of the members would recommend the quiz to colleagues. A majority indicated that the quiz refreshed their knowledge and felt it contributed to the quality of healthcare.
• Discussion
Adding a quiz seems to contribute to better knowledge and guideline implementation, and possibly to healthcare quality, but it also requires extra resources (money and time) as well as skills and competencies in CPG panels. Is it a worthwhile effort? How, and how quickly, should data be fed back, and to whom: CPG panels or medical specialists? Is it a passing fad or here to stay?
Paper Number
41
Biography
After graduating from Wageningen University (NL) in Nutrition & Health, Nikita continued with a PhD on nutrition and cognition in older adults, with studies on the role of glucose, sucrose, protein, vitamin B12, and folic acid. She has worked with the Knowledge Institute of Medical Specialists since 2015, first as advisor and project lead on guideline and quality improvement projects, and was also seconded to the medical association for gastroenterologists to support its quality of care committee. After a few years she was promoted to senior advisor and team lead, responsible for coaching team members, setting up the CPG quiz, and negotiating cooperation with the Dutch College of GPs.
Dr Julian Hirt
University of Basel / Eastern Switzerland University of Applied Sciences
Improving access to methods guidance for clinical guideline developers - the Library of Guidance for Health Scientists (LIGHTS)
Abstract
BACKGROUND: Methods guidance supports health researchers in conducting optimal primary research and evidence syntheses, and in developing clinical practice guidelines. However, methods guidance articles, including guidance for developing clinical practice guidelines, are often difficult to find, which limits dissemination and uptake. Reasons for the poor findability include the lack of search filters and poor indexing in general medical databases.
OBJECTIVE: To improve the dissemination and uptake of methods guidance, we created a specialized, comprehensive, searchable database.
METHODS: We created an open-access database, the Library of Guidance for Health Scientists (LIGHTS; www.lights.science). LIGHTS includes journal articles and regulatory documents that aim to provide methods guidance. We manually screen methods-specific journal sections and journal series, and search other information sources. Methodologically trained researchers review full texts for eligibility and classify guidance articles according to study design, methods topic, medical context, guidance type, and development process of the guidance.
RESULTS: LIGHTS launched in 2022 and is now visited by approximately 6000 individual users every month. In February 2025, LIGHTS included 1658 methods guidance articles, primarily focusing on applying (n=775), understanding (n=559), or reporting methods (n=418). Of these, 108 articles were specific to clinical practice guidelines.
DISCUSSION: The large number of users, positive feedback, and requests for new features suggest that the community appreciates the new resource. At the conference, we will demonstrate the main features of LIGHTS, explain how it complements other databases for methods, and discuss future directions.
Paper Number
114
Biography
Julian is a research fellow at the University of Basel and University Hospital Basel and a lecturer at the Eastern Switzerland University of Applied Sciences in St. Gallen, where he mainly works on evidence syntheses and meta-research in the fields of neurology, dementia, and evidence-based healthcare. He is a core team member of the Library of Guidance for Health Scientists (LIGHTS).
Dr Julian Hirt
University of Basel / Eastern Switzerland University of Applied Sciences
The TARCiS statement: Guidance on terminology, application, and reporting of citation searching
Abstract
Background
Citation searches are used in addition to database searches for evidence syntheses and guideline development. Their purpose is to find additional relevant study publications based on their citation relationships with already known study publications. Citation searching methodology and terminology lack standardization.
Objective
To develop guidance on terminology, application, and reporting of citation searching in health-related research.
Design and Methods
The development of the Terminology, Application, and Reporting of Citation Searching (TARCiS) statement was based on a two-step process. First, we carried out a scoping review on citation searching benefits, tools and terminology to systematically investigate the evidence for the formulation of draft recommendations and identify experts in the field. Second, we conducted a four-round Delphi study with international experts and composed the TARCiS statement. An agreement rate of at least 75% per recommendation was pre-defined as consensus.
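The pre-defined consensus rule can be made concrete with a short sketch. The following is illustrative only (the panel ratings are made up, and this is not the authors’ analysis code): it computes the agreement rate per draft recommendation and flags which items reach the 75% threshold.

```python
# Illustrative only: per-recommendation agreement rates against the
# pre-defined 75% consensus threshold used in the TARCiS Delphi study.
CONSENSUS_THRESHOLD = 0.75

# Hypothetical panel responses: True = expert agrees with the recommendation.
responses = {
    "R1 terminology": [True] * 25 + [False] * 2,   # 25/27 agree
    "R2 conduct":     [True] * 19 + [False] * 8,   # 19/27 agree
}

for item, votes in responses.items():
    rate = sum(votes) / len(votes)
    status = "consensus" if rate >= CONSENSUS_THRESHOLD else "revise and re-rate"
    print(f"{item}: {rate:.0%} agreement -> {status}")
```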
Results
Forty-one terms, eight recommendations and one research topic were derived from the scoping review. We invited 35 international experts (mainly researchers and information specialists/librarians) and 27 of them took part in the Delphi study. The Delphi study resulted in one recommendation on terminology (comprising eight terms), eight on conduct, and one on reporting of citation searching, and four suggestions for research priorities. The agreement rate per recommendation was between 83 and 100 percent.
Conclusions
At the conference, we present the recommendations of the TARCiS Statement that contribute to the standardized use and reporting of citation searching. We encourage those conducting evidence syntheses and developing guidelines to incorporate TARCiS into their workflows.
Paper Number
115
Biography
Julian is a research fellow at the University of Basel and University Hospital Basel and a lecturer at the Eastern Switzerland University of Applied Sciences in St. Gallen, where he mainly works on evidence syntheses and meta-research in the fields of neurology, dementia, and evidence-based healthcare. He co-developed the TARCiS statement.
Dr Qianling Shi
China
PhD Student
Lanzhou University
Enhancing guideline accessibility: using large language models to generate executive summaries in leading gastroenterology and hepatology guidelines
Abstract
Background: Despite wide publication, clinical guidelines have had limited effect on clinical practice. Barriers include heavy workloads, limited resources, and guideline complexity. A guideline executive summary provides a concise overview of recommendations, background, and methods, giving readers a quick preview of its contents. Large language models (LLMs) excel at processing vast amounts of text, extracting key information, and generating concise, structured summaries. By leveraging natural language processing, LLMs can potentially distill complex guideline content into accessible formats while maintaining accuracy and clarity.
Methods: This study will investigate the accuracy, clarity, and clinical utility of LLM-generated guideline summaries. We will analyze gastroenterology and hepatology guidelines (prioritizing those with clear methodology and recommendations) from leading academic societies and journals. Journal selection will be guided by the Journal Citation Reports of the Web of Science, including the 10 highest-impact gastroenterology and hepatology journals of 2023 and 4 general medical journals (NEJM, The Lancet, JAMA, and The BMJ). OpenAI’s GPT-4 Turbo and DeepSeek models will be employed to generate structured summaries, evaluated by gastroenterologists and guideline methodologists against four criteria: (a) inaccuracy of summarized information; (b) hallucination of information; (c) omission of relevant clinical information; (d) ease of understanding. Processing time and performance across guideline types (different languages, standard vs. rapid) will also be recorded.
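As a minimal sketch of how a structured summary might be requested from one of the named models while recording processing time, the example below uses the OpenAI Python client. The prompt wording and summary sections are illustrative assumptions, not the study’s actual protocol.

```python
# Illustrative sketch: requesting a structured executive summary from an LLM
# and recording processing time. Prompt and sections are hypothetical.
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_guideline(guideline_text: str) -> tuple[str, float]:
    prompt = (
        "Produce an executive summary of the clinical guideline below with "
        "three sections: Background, Methods, Key Recommendations. "
        "Do not add information that is not in the source text.\n\n"
        + guideline_text
    )
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # favor reproducibility over creativity
    )
    elapsed = time.perf_counter() - start
    return response.choices[0].message.content, elapsed

# summary, seconds = summarize_guideline(open("guideline.txt").read())
```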
Results: This study is ongoing and detailed results will be presented at the Colloquium.
Conclusions: If effective, LLM-generated summaries could be used to enhance guideline accessibility globally and rapid decision support, especially in resource-limited settings.
Paper Number
356
Biography
Qianling Shi is a PhD student at the First School of Clinical Medicine, Lanzhou University. She has worked on the methodology of evidence-based medicine and the development of clinical guidelines for five years. As a methodologist, she has contributed to nearly 30 international and domestic guidelines. As a researcher, she has published several studies in peer-reviewed journals such as EClinicalMedicine, BMJ Open, and the Journal of Clinical Epidemiology. As a clinical medical student, Qianling also focuses on the use of high-quality evidence. She is a translator for the Cochrane Simplified Chinese translation project and contributes to the better implementation of Cochrane systematic reviews.
Dr Hong Cao
Tongji Medical College of HUST
Heterogeneity existed among reported outcomes in systematic reviews on antimicrobials in the treatment of adult pneumonia: A Comparative Study Involving Artificial Intelligence
Abstract
Objectives: This overview aimed to summarize primary outcomes reported in systematic reviews (SRs) on antimicrobials for adult pneumonia and establish an indicator pool for developing a core outcome set (COS).
Methods: We searched PubMed, Embase, the Cochrane Library, and the Chinese Electronic Database for SRs on antimicrobial therapy in adult pneumonia. Two reviewers independently screened studies and extracted data, while both the reviewers and an artificial intelligence (AI) tool assessed methodological quality using AMSTAR 2.0. Similar primary outcomes were categorized according to the type and severity of pneumonia.
Results: A total of 97 SRs were included, most of which (92.78%) were rated as critically low to low certainty. The consistency between reviewer and AI assessments with AMSTAR 2.0 was 56.70%. Twenty-one primary indicators were identified from these SRs, with 13, 11, 14, and 15 for non-severe CAP, severe CAP, non-severe HAP, and severe HAP, respectively. Clinical success (n = 58, 59.79%) and mortality (n = 56, 57.73%) were the most frequently reported outcomes, but discrepancies existed in their measure descriptions and time points. Fewer types of indicators were reported in registered SRs than in non-registered SRs. Furthermore, we detected wide variation in the instruments used to assess the severity of pneumonia.
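The reviewer-versus-AI consistency figure can be illustrated with a small sketch using made-up ratings: raw percent agreement (as reported above) alongside Cohen’s kappa, which corrects for chance agreement. This is not the study’s analysis code.

```python
# Illustrative only: percent agreement and Cohen's kappa between human and AI
# AMSTAR 2.0 ratings (categories are the AMSTAR 2.0 confidence levels).
from sklearn.metrics import cohen_kappa_score

human = ["critically low", "low", "low", "moderate", "critically low", "high"]
ai    = ["critically low", "low", "moderate", "moderate", "low", "high"]

percent_agreement = sum(h == a for h, a in zip(human, ai)) / len(human)
kappa = cohen_kappa_score(human, ai)

print(f"Percent agreement: {percent_agreement:.1%}")  # raw consistency
print(f"Cohen's kappa:     {kappa:.2f}")              # chance-corrected
```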
Conclusions: Significant variations exist in the primary outcomes reported by SRs on antimicrobials in the treatment of adult pneumonia, calling for the development of a COS.
Paper Number
95
Biography
Hong Cao is from China. With a strong academic background from China Pharmaceutical University and Guizhou University, she has been actively involved in numerous scientific research projects, and her research achievements include several SCI-indexed publications. Currently, she is a medical doctor at Tongji Medical College, Huazhong University of Science and Technology. Her skills in literature search, data analysis, and evidence evaluation further strengthen her capabilities in the medical field.
Ms Meihua Wu
China
Lanzhou University
Mapping the Role of Large Language Models in Guideline Adaptation: A Scoping Review
Abstract
Innovative artificial intelligence (AI), particularly large language models (LLMs), offers transformative potential for evidence synthesis and guideline adaptation. Its capacity to enhance recommendation accessibility, promote the efficient utilization of existing guidelines, and mitigate unnecessary resource allocation warrants systematic exploration.
This scoping review aims to map the state of the art regarding the role of LLMs in guideline adaptation, focusing on technological advances, methodological approaches, and their potential to catalyze equitable healthcare delivery.
A comprehensive literature search will be conducted across multiple databases, grey literature, and relevant conference proceedings. The review adheres to the PRISMA-ScR (Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews) guidelines to ensure transparency and reproducibility. Two researchers will independently extract data on application contexts, implementation strategies, methodological frameworks, and outcome measures.
While preliminary findings indicate a growing interest in the integration of LLMs within digital guideline adaptation processes, with the potential to improve healthcare system efficiency, this field remains in early development. Detailed results will be finalized before the GIN 2025 conference.
We will explore how LLM-driven innovations can enhance global accessibility of guideline recommendations and address implementation challenges. Ethical, methodological, and scalability issues will also be discussed. The study is ongoing, and final outcomes will be available for presentation at the conference.
Paper Number
445
Biography
Trained in advanced evidence synthesis methodologies through the Chevidence team, Meihua Wu specializes in advancing AI-driven solutions for clinical guideline adaptation and equitable dissemination.
Dr Brian Alper
President
Scientific Knowledge Accelerator Foundation
Make Your Guideline Computable
Abstract
Background
In 2017, GINTech stimulated efforts to develop interoperability solutions for the evidence ecosystem. In 2025, a resulting “Evidence Based Medicine on Fast Healthcare Interoperability Resources” (EBMonFHIR) project provides a technical standard for the computable expression of evidence and guidelines. Free tools are available on the FEvIR Platform (fevir.net) to facilitate conversion of guidelines in Microsoft Word format into computable form matching the EBMonFHIR implementation guide.
Objective
To enable participants to convert their existing guidelines in Word documents to computable format, and to provide immediate technical support.
Format (including interactive elements) for workshops
A brief (up to 15 minutes) introduction will recap an orientation from a related panel presentation:
--Brian Alper will introduce the origin and current state of EBMonFHIR activities.
--Joanne Dehnbostel will introduce the Fast Evidence Interoperability Resources (FEvIR) Platform with emphasis on relevant tools for guideline developers, including demonstration of adapting guidelines from MAGICapp and GRADEpro.
--Noella Awah will report on her actual experience using the FEvIR Platform to convert a 199-page PDF for the Cameroon malaria guideline into a computable guideline.
Participants will ‘work on their own’ to process guideline documents into computable form with the tooling freely available on the FEvIR Platform. Facilitators will be available throughout to help as needed. As helpful hints are discovered, or questions from participants are noted to have educational value, the discussion will be shared with the audience and the example shown on the large screen.
PLEASE BRING YOUR LAPTOP AND YOUR GUIDELINE AS AN MS WORD (.doc or .docx) FILE.
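For orientation, and independent of the FEvIR tooling itself, the schematic sketch below builds a minimal FHIR-style PlanDefinition resource carrying a single recommendation as JSON. The EBMonFHIR implementation guide specifies richer profiles, extensions, and resource types than this toy example shows, so treat it as a shape of the idea rather than a conformant instance.

```python
# Schematic only: a minimal FHIR-style PlanDefinition carrying one guideline
# recommendation. Real EBMonFHIR/FEvIR resources use richer profiles,
# extensions, and references than shown here.
import json

recommendation = {
    "resourceType": "PlanDefinition",
    "status": "active",
    "title": "Example guideline recommendation",
    "description": "Computable form of a single narrative recommendation.",
    "action": [
        {
            "title": "Recommendation 1",
            "description": (
                "In adults with condition X, we suggest intervention Y "
                "over comparator Z (conditional recommendation)."
            ),
            "documentation": [
                {"type": "citation", "display": "Source guideline, section 4.2"}
            ],
        }
    ],
}

print(json.dumps(recommendation, indent=2))
```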
Paper Number
87
Biography
Brian Alper is project lead for EBMonFHIR, GINTech Chair, CEO of Computable Publishing LLC, and has founded systems for decision support and guideline adaptation. Joanne Dehnbostel is research and analysis manager for Computable Publishing LLC and project administrator for Health Evidence Knowledge Accelerator. Noella Awah is secretary of GIN Africa Community and a researcher at eBASE Africa with interest in guidelines and decision aids. Brian and Joanne have extensive experience with the development of the FEvIR Platform, and Noella, Kinlabel, and Ambang have direct experience using the FEvIR Platform to convert a guideline document to computable form.
Miss Bingyi Wang
China
Lanzhou University
The Efficacy of Large Language Models in Research Quality Assessment: A Systematic Review of Methodological and Reporting Guideline Compliance
Abstract
Background: Evidence-based quality assessment relies on manual checks, which are time-consuming. LLMs automate evaluations but face challenges interpreting complex methodologies.
Objective: To systematically evaluate the efficacy of large language models (LLMs) in assessing the quality of academic research, with a focus on their automated evaluation capabilities for methodological rigor and adherence to reporting checklists.
Methods: Following the PRISMA 2020 guidelines, we searched PubMed, Embase, Web of Science, and preprint platforms (up to March 7, 2025) to include studies utilizing LLMs for evaluating the methodological or reporting quality of clinical research, systematic reviews, or other study types. Evidence synthesis was conducted using the PRISMA-AI framework, and the risk of bias was assessed using the QUADAS-AI tool. Effect sizes were pooled via a random-effects model.
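The abstract states that effect sizes were pooled with a random-effects model. As a generic illustration (hypothetical numbers, not the authors’ data or code), the sketch below pools log-odds-ratio estimates with the DerSimonian-Laird estimator, one common random-effects approach.

```python
# Generic DerSimonian-Laird random-effects pooling of hypothetical effect
# sizes (log odds ratios) with their variances; not the study's actual data.
import numpy as np

yi = np.array([0.40, 0.15, 0.55, 0.30])   # per-study effect estimates
vi = np.array([0.04, 0.09, 0.06, 0.05])   # per-study sampling variances

w = 1 / vi                                 # fixed-effect weights
y_fe = np.sum(w * yi) / np.sum(w)
Q = np.sum(w * (yi - y_fe) ** 2)           # Cochran's Q heterogeneity statistic
df = len(yi) - 1
C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - df) / C)              # between-study variance (DL estimator)

w_re = 1 / (vi + tau2)                     # random-effects weights
y_re = np.sum(w_re * yi) / np.sum(w_re)
se_re = np.sqrt(1 / np.sum(w_re))

print(f"tau^2 = {tau2:.3f}, pooled effect = {y_re:.3f} "
      f"(95% CI {y_re - 1.96*se_re:.3f} to {y_re + 1.96*se_re:.3f})")
```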
Results: Thirteen studies were included (7 on methodological quality assessment, 6 on reporting quality evaluation). Results demonstrated that LLMs (e.g., GPT-4, Claude 3) performed well in identifying randomization errors, with high accuracy and recall rates, outperforming some traditional automated evaluation tools. However, logical inconsistencies were observed in interpreting complex methodological concepts (e.g., Bayesian analytical designs, intention-to-treat principles). The detailed results will be presented at the conference.
Conclusion: LLMs can serve as effective tools for preliminary screening of research quality but exhibit limitations in conceptual interpretation during in-depth methodological reviews, necessitating a "human-AI collaborative" auditing mechanism. We recommend developing an evidence-based, domain-adaptive framework enhanced by dynamic knowledge distillation to optimize the application value of LLMs in scientific quality governance.
Paper Number
521
Biography
Bingyi Wang, a master's degree candidate, possesses extensive expertise in conducting systematic reviews, encompassing retrieval, screening, and information extraction. She has lectured in training courses on several steps of the systematic review process. At present, she has published five articles, including one systematic review, and has participated in the production of several systematic reviews.
Dr Angelika Eisele-Metzger
Institute for Evidence in Medicine, Medical Center – University of Freiburg / Medical Faculty – University of Freiburg
Leveraging large language models to enhance the conduct of systematic reviews: a scoping review
Abstract
Background: Systematic reviews (SRs) are an important basis for evidence-based guidelines, but their production is time-consuming and resource-intensive. Machine learning and particularly large language models (LLMs) such as GPT offer promising support for SR production.
Objective: To provide an overview of LLM applications for supporting SR conduct in health research.
Methods: We searched MEDLINE, Web of Science, IEEEXplore, ACM Digital Library, Europe PMC (preprints), and Google Scholar, and conducted an additional hand search (last search: 26 February 2024; informal update planned for this conference presentation). We included scientific articles published from April 2021, building on the results of a mapping review that had not yet included LLM applications to support SR conduct. Following independent screening by two reviewers, data extraction was performed by one reviewer and verified by another.
Results: From 8,087 identified records, we included 37 articles that addressed 10 of 13 defined SR steps, primarily literature search (n=15, 41% of studies), study selection (n=14, 38%), and data extraction (n=11, 30%). GPT models dominated the field (n=33, 89%). Most publications were validation approaches in which a defined reference standard or expert review was used to verify accuracy of LLMs (n=21, 57%). The conclusions of study authors were predominantly promising (n=20, 54%), with fewer neutral (n=9, 24%) or non-promising (n=8, 22%) appraisals.
Conclusions: We identified a number of promising applications of LLMs for SR support. However, fully validated approaches were not yet available. LLMs should currently be used with caution and limited to specific SR tasks under human supervision.
https://doi.org/10.1016/j.jclinepi.2025.111746
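The review’s conclusion, that LLMs be limited to specific SR tasks under human supervision, can be made concrete with a small sketch: an LLM proposes an include/exclude label for each title/abstract record, and a human reviewer records the final decision. The prompt, model, and workflow are illustrative assumptions, not a validated pipeline from the included studies.

```python
# Illustrative sketch of human-supervised LLM screening: the model proposes a
# label; a reviewer records the final decision. Not a validated pipeline.
from openai import OpenAI

client = OpenAI()
CRITERIA = "Include only randomized trials of drug X in adults with disease Y."

def llm_screen(title_abstract: str) -> str:
    """Ask the LLM for a single-word include/exclude suggestion."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Criteria: {CRITERIA}\nRecord: {title_abstract}\n"
                       "Answer with exactly one word: INCLUDE or EXCLUDE.",
        }],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper()

def screen_with_oversight(record: str) -> str:
    """Human-in-the-loop step: reviewer sees the suggestion and decides."""
    suggestion = llm_screen(record)
    final = input(f"LLM suggests {suggestion}. Final label (INCLUDE/EXCLUDE): ")
    return final.strip().upper() or suggestion  # empty input keeps suggestion
```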
Paper Number
254
Biography
Angelika is a postdoctoral researcher with a background in public health and physiotherapy. Her work focuses on the production, methodology and quality of evidence syntheses and, more recently, on artificial intelligence support for systematic reviews. As an associate of Cochrane Germany, she is involved in disseminating findings from Cochrane Reviews as well as Cochrane methodology.
Prof Hongyong Deng
Shanghai University of Traditional Chinese Medicine
Introducing an Advanced Online Tool for Interactive and User-Friendly Evidence Map Creation: Enhancing Research Visualization for Guidelines
Abstract
Evidence maps are visual tools that summarize existing evidence, highlight knowledge gaps, and guide future studies. Traditionally, creating these maps has been cumbersome, requiring specialized software like Excel, R, or Matplotlib, which may lack user-friendliness and interactivity. To address this challenge, we have developed EvdMap@Pymeta, an advanced online tool for creating interactive evidence maps.
Available at https://www.pymeta.com/evdmap/, EvdMap@Pymeta simplifies the process by transforming pre-processed structured evidence data into visually engaging graphics without complex data preparation. Users input Label data (for axis titles and values) and Map data (listing study elements). The resulting bubble plots display evidence effect on the X-axis, quality rating on the Y-axis, and sample size via bubble size. Different colors represent factors such as groups, populations, or interventions.
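To make the encoding concrete, here is a minimal matplotlib sketch, with invented data, of the bubble-plot layout described above: effect on the X-axis, quality rating on the Y-axis, bubble size for sample size, and color for subgroups. EvdMap@Pymeta itself is a web tool; this is only a rough local analogue, not its code.

```python
# Rough local analogue (invented data) of the evidence-map encoding:
# X = effect, Y = quality rating, bubble size = sample size, color = subgroup.
import matplotlib.pyplot as plt

effect      = [-0.2, 0.1, 0.4, 0.6]                  # standardized effect sizes
quality     = [1, 3, 2, 4]                           # 1=very low ... 4=high
sample_size = [120, 800, 300, 1500]                  # participants per study
color       = ["tab:blue", "tab:blue", "tab:orange", "tab:orange"]  # subgroups

plt.scatter(effect, quality, s=[n / 2 for n in sample_size],
            c=color, alpha=0.6, edgecolors="black")
plt.axvline(0, linestyle="--", linewidth=1)          # line of no effect
plt.yticks([1, 2, 3, 4], ["Very low", "Low", "Moderate", "High"])
plt.xlabel("Effect estimate")
plt.ylabel("Quality rating")
plt.title("Evidence map (bubble size = sample size)")
plt.savefig("evidence_map.png", dpi=300)             # or .svg for vector output
```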
The tool's innovation lies in its simplicity, flexibility, aesthetic appeal, and immediate usability—all at no cost. Its user-friendly design allows seamless interaction with data; simply hover over any bubble to access detailed information about the underlying evidence, making complex data easily understandable. Flexible customization options for background, grid, and bubble size ensure tailored maps for specific presentation needs. Additionally, it produces visually appealing images that can be saved in high-quality PNG or SVG formats for further use.
By enhancing clarity and accessibility of evidence visualization, EvdMap@Pymeta is invaluable for guideline developers and researchers seeking to communicate complex evidence effectively. Join us at the Guidelines International Network 2025 to discover how this tool can revolutionize your research presentations and analysis.
Paper Number
179
Biography
Dr. Hongyong Deng is a Professor and PhD Supervisor at Shanghai University of Traditional Chinese Medicine. He is a member, editor, and reviewer for Cochrane, and serves on several editorial boards and committees, including the Chinese Medical Association’s Epidemiology and Evidence-Based Medicine Branch. Dr. Deng leads national and regional research projects, having published over 100 papers, authored 3 books, and secured 7 patents. He developed PythonMeta and EvdMap@pymeta, focusing on Traditional Chinese Medicine informatics, evidence-based medicine, and clinical evidence evaluation.
Dr Radhika A.G.
University College of Medical Sciences & GTB Hospital
Harnessing AI for Cervical Cancer Screening and Diagnosis
Abstract
Introduction: Cervical cancer remains a major public health challenge, particularly in low- and middle-income countries (LMICs). Pap smears, HPV DNA testing, and colposcopy are effective screening methods but suffer from inter-operator variability, a lack of standardization, and limited accessibility. These challenges can be addressed through artificial intelligence (AI) and digital innovations. As part of the Indian Colposcopy Network (InColNet), we aim to develop a validated image bank of cervical lesions to facilitate AI-driven diagnostics for early detection of cervical cancer.
Methodology: The study will follow a hub-and-spoke model over 5 years. Colposcopy will be performed on 1000 consenting women with abnormal Pap smears or clinically suspicious cervixes. A minimum of five standard datasets will be collected per colposcopy. The image bank will be validated using colposcopy images of 250 women who are screen-negative. Image annotation will be standardized using the criteria of the International Federation of Cervical Pathology and Colposcopy (IFCPC). An AI-assisted colposcopy tool will be developed using deep learning models trained on the annotated images. Finally, the system will be integrated into telemedicine networks.
Results: Data collected and processes established will be reported in the presentation.
Discussion: AI-driven annotation and validation will ensure uniformity in image interpretation. By integrating AI tools into mobile applications and digital health platforms, healthcare workers can receive AI-guided diagnostic support, even in resource-limited settings. InColNet will provide a model for AI-driven guideline implementation, enhancing diagnostic accuracy, standardization, and accessibility.
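As a purely hypothetical sketch of the kind of deep learning model that could be trained on IFCPC-annotated colposcopy images, the example below sets up a transfer-learning classifier from a pretrained backbone. The backbone, class labels, and preprocessing are placeholders, not the InColNet design.

```python
# Hypothetical sketch of a transfer-learning classifier for annotated
# colposcopy images; backbone, classes, and data pipeline are placeholders.
import torch
import torch.nn as nn
from torchvision import models, transforms

NUM_CLASSES = 3  # e.g., normal / low-grade / high-grade (illustrative labels)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # new classifier head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One gradient step on a batch of preprocessed images."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```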
Paper Number
26
Biography
As a Senior Consultant in Obstetrics and Gynecology at a tertiary care center, I have special interests in Evidence Synthesis and Clinical Practice Guidelines.
Considering the rising incidence of cervical cancer, my current focus is on learning ways to use AI in public health and on training medical and paramedical workers in early detection methods.
I have served as the principal investigator for three WHO-funded projects and five ICMR-funded projects. As an external reviewer for the National Institute for Health and Care Research (NIHR), UK, from 2017 to 2019 and again since 2022, I have also contributed to global health research.
Mr Yong-Bo Wang
China
Wuhan
Zhongnan Hospital of Wuhan University
Diagnosis and treatment process ontology of clinical practice guidelines
Abstract
Objective
Our team previously developed an ontology for clinical practice guidelines (CPGs), extracting core concepts, defining semantic relationships, and creating ontology-based reasoning tools. Nevertheless, the current ontology lacks a comprehensive representation of diagnosis and treatment processes, failing to fully capture temporal relationships and decision logic. This study aims to further refine the ontology by systematically extracting diagnosis and treatment nodes and their relationships.
Method
During the construction process, we referenced core concepts and relationships within the process ontology domain to ensure the accuracy and logical consistency of the ontology's knowledge structure. Taking coronary heart disease guidelines as an example, we screened and analyzed compliant CPGs, systematically classified their contents, extracted common structures, and finally built a conceptual framework and semantic relationship table containing diagnostic and therapeutic processes.
Result
Based on the CPGs for Coronary Heart Disease, this study constructed a structured ontology repository of the diagnosis and treatment process. The content of the ontology is divided into the following core modules: Diagnosis and treatment process (including history taking, physical examination, laboratory tests, diagnosis, treatment and follow-up); Interventions (e.g., assessment, psychosocial, and treatment); Roles; Process (e.g., composite process); Variation events; Temporal events and Gateways.
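As a schematic illustration only (not the authors’ ontology files), the sketch below encodes two of the listed process steps and one temporal relationship as RDF triples with rdflib; the namespace and term names are hypothetical.

```python
# Schematic only: encoding a few of the listed ontology concepts and one
# temporal relationship as RDF triples; not the authors' actual ontology.
from rdflib import Graph, Namespace, RDF, RDFS, Literal

CPG = Namespace("http://example.org/cpg-ontology#")  # hypothetical namespace
g = Graph()
g.bind("cpg", CPG)

# Core module: the diagnosis-and-treatment process and two of its steps.
g.add((CPG.DiagnosisTreatmentProcess, RDF.type, RDFS.Class))
for step in ("HistoryTaking", "PhysicalExamination"):
    g.add((CPG[step], RDF.type, RDFS.Class))
    g.add((CPG[step], RDFS.subClassOf, CPG.DiagnosisTreatmentProcess))

# A temporal relationship between process steps.
g.add((CPG.precedes, RDF.type, RDF.Property))
g.add((CPG.HistoryTaking, CPG.precedes, CPG.PhysicalExamination))
g.add((CPG.HistoryTaking, RDFS.label, Literal("History taking")))

print(g.serialize(format="turtle"))
```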
Conclusion
The ontology not only provides a standardized framework for the semantic representation of coronary heart disease diagnosis and treatment processes, but also lays an important foundation for the intelligent application of guidelines (e.g., clinical decision support, knowledge reasoning, and process automation).
Paper Number
158
Biography
Yong-Bo Wang is a Research Assistant at the Center for Evidence-Based and Translational Medicine, Zhongnan Hospital of Wuhan University, and a Lecturer at the Department of Evidence-Based Medicine and Clinical Epidemiology, Second School of Clinical Medicine of Wuhan University. Her research focuses on clinical practice guideline development and implementation methodology, computable guidelines, clinical decision support systems, and causal inference.
Dr Jungang Zhao
China
PhD Student
Children’s Hospital of Chongqing Medical University
The Future of Clinical Documentation: Can AI Solve the Burden and Boost Guideline Implementation?
Abstract
Background: Globally, healthcare systems struggle with implementing evidence-based guidelines due to administrative burdens on clinicians. Excessive documentation contributes to burnout and reduces patient care time. AI-powered tools, such as ambient scribes and large language models (LLMs), offer opportunities to streamline documentation, improve efficiency, and enhance the integration of guidelines into practice.
Objective: To systematically review the impact of AI-based documentation tools on healthcare professionals’ documentation burden and explore their potential to enhance guideline implementation and healthcare system integration.
Methods: A systematic review and meta-analysis were conducted, searching PubMed, Web of Science, and Scopus from database inception to 2025. Studies evaluating AI-based documentation tools used by frontline healthcare professionals were included. Outcomes assessed were documentation time, workload, burnout, and usability. Study quality was assessed using JBI Critical Appraisal Checklists, with narrative synthesis and meta-analysis where appropriate.
Results: Preliminary findings indicate AI tools can significantly reduce documentation time and cognitive workload, with effectiveness varying by technology and context. Full results will be presented at the Summit.
Discussion: This study innovatively evaluates AI’s role in reducing documentation burdens, offering insights into enhancing healthcare efficiency and reducing burnout. It highlights the importance of integrating AI into healthcare systems and informs discussions on AI’s transformative potential in clinical practice. Future research should optimize AI tools for broader applicability, addressing usability, integration, and ethical challenges for sustainable healthcare practices.
Paper Number
243
Biography
Zhao is a first-year PhD student in Medical Informatics at Chongqing Medical University. His research interests include evidence-based medicine, pediatrics, and guideline methodology. He has participated in various national and provincial research projects and has assisted in the development of over ten clinical practice guidelines and consensus statements, both in China and internationally. Currently, his research focuses on the construction and validation of an AI-based diagnostic and treatment system for rare pediatric diseases.
Mr Hui Liu
China
Lanzhou University
Large language models facilitate dissemination and implementation of clinical practice guidelines: a systematic review
Abstract
Background: Large Language Models (LLMs) have shown potential in optimizing the development of clinical practice guidelines (CPGs). However, their application in the dissemination and implementation of CPGs has not been systematically evaluated.
Objective: This study aims to systematically assess the utilization and effectiveness of LLMs in facilitating the dissemination and implementation of CPGs.
Methods: On February 14, 2025, we conducted a search using terms such as “large language model” and “guideline” across several databases, including CNKI, WANFANG MED ONLINE, SinoMed, MEDLINE (via PubMed), Embase, and Web of Science. Two researchers independently performed literature screening and data extraction, resolving any conflicts through consensus or by consulting a third party.
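As an illustration of the kind of database query described (the term below is a toy simplification, not the study’s full search strategy), the sketch runs a PubMed search via Biopython’s Entrez module.

```python
# Illustrative only: a simplified PubMed query via Biopython's Entrez module.
# The term is a toy version, not the study's actual search strategy.
from Bio import Entrez

Entrez.email = "your.name@example.org"  # NCBI requires a contact address

term = '("large language model"[Title/Abstract]) AND (guideline[Title/Abstract])'
handle = Entrez.esearch(db="pubmed", term=term, retmax=20)
record = Entrez.read(handle)
handle.close()

print(f"Total hits: {record['Count']}")
print("First PMIDs:", record["IdList"])
```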
Results: A total of 11,204 records were retrieved, with 17 publications ultimately included in the analysis. Regarding the countries of the first authors, nine countries were represented: China, Italy, Germany, France, Qatar, South Korea, Canada, the United States, and Singapore. In terms of LLMs utilized, 13 studies focused on a single LLM, while four studies involved multiple LLMs, with ChatGPT-4 being the most frequently used model. The content areas addressed included the use of LLMs for extracting CPG recommendations, answering questions based on CPGs, interpreting CPGs, adapting CPGs, and generating patient health education materials based on CPGs.
Discussion: LLMs demonstrate significant promise for enhancing the dissemination and implementation of CPGs. It is recommended that future stakeholders give greater attention to this emerging technology.
The author gratefully acknowledges the support of K.C. Wong Education Foundation, Hong Kong.
Paper Number
410
Biography
Hui Liu, a PhD student at Lanzhou University, focuses on clinical practice guidelines, patient and public versions of guidelines, and evidence-based evaluation. He is a member of the Secretariat of the Scientific, Transparent and Applicable Rankings tool for clinical practice guidelines (STAR) Working Group. He has led or participated in five provincial and ministerial projects and contributed to 15 guidelines. He has published more than 50 academic papers, 18 as first author (including co-first author), and has edited or translated three books.
Mr Jie Zhang
Chinese
Master Student
Evidence-based Medicine Center, School Of Basic Medical Sciences, Lanzhou University
Comparison between Real-time Evidence Synthesis Based on the ad^Viser System and Manual Systematic Review: A Mixed Methods Study
Abstract
Background
During guideline development, searches for published systematic reviews on specific clinical questions are often required. If no existing systematic review is found, a new one must be conducted, a time-consuming process that slows evidence synthesis and translation. Rapid advances in AI, exemplified by the ad^Viser system developed by the ADVANCED working group, now enable real-time evidence integration.
Objective
To compare the ad^Viser system's outputs with peer-reviewed manual systematic reviews on the same clinical questions, and evaluate their value for clinicians' decision-making.
Methods
1. Quantitative comparison of evidence status and outcomes between ad^Viser and manual reviews (limited to RCTs).
2. Integration of conclusions from ad^Viser sub-outcomes and original studies into two independent clinical reports, followed by clinician evaluation via 5-point Likert scale questionnaires.
Results
1. Study inclusion: comparison of the number of included studies and their overlap between ad^Viser and manual reviews (a minimal overlap computation is sketched after this list).
2. Outcomes comparison: analysis of the quantity and overlap of reported outcomes.
3. Clinician preference: clinicians' ratings of ad^Viser versus manual review outputs based on decision-making utility.
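The overlap comparison referenced in item 1 above can be sketched minimally as set operations over study identifiers; the PMIDs below are placeholders, not data from either review.

```python
# Overlap of included studies between the two approaches (placeholder IDs).
adviser_studies = {"PMID:111", "PMID:222", "PMID:333", "PMID:444"}
manual_studies = {"PMID:222", "PMID:333", "PMID:555"}

shared = adviser_studies & manual_studies
union = adviser_studies | manual_studies
jaccard = len(shared) / len(union)  # 1.0 = identical study sets

print(f"shared studies: {len(shared)}")
print(f"ad^Viser only:  {len(adviser_studies - manual_studies)}")
print(f"manual only:    {len(manual_studies - adviser_studies)}")
print(f"Jaccard overlap = {jaccard:.2f}")
```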
Discussion
The study identifies potential reasons for discrepancies in evidence synthesis, such as differences in literature retrieval strategies or outcome prioritization. The ad^Viser system demonstrates advantages in synthesis speed but may require improvements in contextual interpretation and subgroup analysis capabilities. Future research should focus on optimizing AI-human collaboration frameworks for guideline development.
Acknowledgement: The author gratefully acknowledges the support of K.C. Wong Education Foundation, Hong Kong.
Paper Number
240
Biography
I am currently a second-year graduate student at the School of Basic Medical Sciences, Lanzhou University, advised by Prof. Yaolong Chen. As a member of the ADVANCED team, I focus on the development and study of the ad^Viser product.
Mr Weilong Zhao
CHINESE
Lanzhou University
Performance of Large Language Models in Answering Chinese Medicine Questions: A Comparative Analysis with Clinical Practice Guidelines
Abstract
Background: Chinese medicine (CM) plays an important role in global healthcare, but faces challenges in standardizing and modernizing its knowledge system. Whether large language models (LLMs) can effectively facilitate CM knowledge acquisition remains uncertain.
Objective: To evaluate the performance of LLMs in answering questions related to CM clinical practice guidelines.
Methods: This cross-sectional study was conducted between May 4 and September 22, 2024. Ten CM clinical practice guidelines were randomly selected, and 150 questions were constructed across three categories: syndrome differentiation and medication (SDM), specific prescription consultation (SPC), and CM theory analysis (TTA). Four LLMs (GPT-4o, Claude 3.5 Sonnet, Moonshot v1, and ChatGLM-4) were evaluated using both English and Chinese queries. The main evaluation metrics were alignment of the responses with the guidelines, readability, and use of safety disclaimers.
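The readability metric reported below for English responses is the Flesch Reading Ease Score (FRES). A minimal sketch of the calculation follows; the vowel-group syllable counter is a crude stand-in for the validated tools typically used, and the sample text is invented.

```python
# Flesch Reading Ease Score (FRES): higher = easier to read.
# Syllables are approximated as runs of consecutive vowels (crude heuristic).
import re

def count_syllables(word):
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    n = len(words)
    return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)

sample = ("Syndrome differentiation guides the selection of herbal "
          "formulas. Consult a qualified practitioner before use.")
print(f"FRES = {flesch_reading_ease(sample):.1f}")
```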
Results: Overall, GPT-4o had significantly higher guideline alignment scores (median 4.00, interquartile range (IQR) 2.98-5.00) than the other models on English responses, whereas GPT-4o (median 4.30, IQR 3.60-5.00) and ChatGLM-4 (median 4.30, IQR 3.10-5.00) performed comparably and better than the other models in Chinese. All LLMs performed well on the TTA questions, but performance differed significantly on the SDM and SPC questions. English responses were less readable (mean FRES 34.9) than Chinese responses. Moonshot v1 provided the highest rate of safety disclaimers (98.7% in English, 100% in Chinese).
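The abstract describes differences as significant without naming the test. One plausible nonparametric choice for comparing two models' alignment scores is the Mann-Whitney U test; the scores below are invented for illustration.

```python
# Nonparametric comparison of two models' alignment scores (invented data).
import numpy as np
from scipy.stats import mannwhitneyu

gpt4o_scores = np.array([4.0, 4.3, 3.5, 5.0, 4.1, 2.9, 4.8, 3.9])
other_scores = np.array([3.2, 2.8, 3.9, 3.0, 2.5, 3.4, 3.1, 2.7])

stat, p = mannwhitneyu(gpt4o_scores, other_scores, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.4f}")
print(f"medians: {np.median(gpt4o_scores):.2f} vs {np.median(other_scores):.2f}")
```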
Conclusion: LLMs showed varying degrees of potential for CM knowledge acquisition. Optimizing LLMs to become effective tools for disseminating CM information is an important direction for future development.
Paper Number
86
Biography
Weilong Zhao is currently pursuing the M.S. degree in Public Health at Lanzhou University, Lanzhou, China. His research interests include evidence-based medicine, guideline development methodology, health policy, and public health.
