4.2 - Theme 2. Harnessing Artificial Intelligence, Technology and Digital Innovations in Guideline Development and Implementation
Wednesday, September 17, 2025 | 3:45 PM - 5:00 PM
Speaker
Miss Jiayi Liu
China
Lanzhou University
QUEST-TCM - A Framework for Human Evaluation of Large Language Models in Traditional Chinese Medicine Practice
Abstract
Background: Traditional Chinese Medicine (TCM) has gained international attention while large language models (LLMs) show promise in assisting healthcare. Existing LLM evaluation frameworks focus on Western medicine and cannot accommodate TCM's unique characteristics.
Objective: This study develops a standardized framework for evaluating the performance of LLMs in TCM practice, addressing the lack of TCM-specific evaluation methodologies.
Methods: We conducted a scoping review following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews (PRISMA-ScR) guidelines. Literature searches were performed across multiple databases, including PubMed, Embase, Web of Science, CNKI, and Wanfang. We included studies that examined the application of LLMs in tasks related to TCM practice, such as guideline development or clinical practice itself, and reviewed them across multiple dimensions, including accuracy, relevance, comprehensiveness, consistency, safety, and usability. Based on the findings of the review, we developed a TCM-specific evaluation framework for LLMs.
Results: From 1,100 initial records, 41 articles were selected for analysis after screening. The resulting framework, QUEST-TCM, was built on five core evaluation principles: TCM knowledge conformity, diagnostic accuracy, treatment rationality, safety and ethics, and modern integration. The framework provides a structured approach across preparation, execution, and evaluation phases.
Discussion: The QUEST-TCM framework provides a comprehensive, standardized approach for evaluating LLMs in TCM applications. This framework bridges traditional knowledge systems with modern AI capabilities, promoting responsible integration of LLMs into TCM practice while preserving its philosophical foundations.
Paper Number
352
Biography
AI-Driven Evidence Synthesis: Data Extraction of Randomized Controlled Trials with Large Language Models (Accepted by International Journal of Surgery in November)
Enhancing Systematic Reviews with Large Language Models: Data Extraction of Randomized Controlled Trials (Poster, The Global Evidence Summit 2024, Prague)
Miss Jiayi Liu
China
Lanzhou University
Using Large Language Models to Generate Medical Plain-Language Summaries: A Comparative Study
Abstract
Background: Effective translation of medical evidence for lay audiences is crucial for guideline implementation. While generative artificial intelligence (GenAI) increasingly supports healthcare communication, its outputs often exhibit unwarranted optimism that may obscure critical uncertainties (e.g., risks of bias, certainty of evidence) and generate misinformation.
Objective: We aimed to assess gaps in plain language summaries (PLS) of systematic reviews generated by different large language models (LLMs) from standardized prompts, identify reasons for incomplete disclosures, and test how better prompt engineering improves the quality of the PLS.
Methods: We analyzed 50 Cochrane reviews (2018–2023) comparing five PLS versions: 1) manually developed (published), 2) standard GPT-4o, 3) standard Claude-3, 4) GPT-4o refined using Cochrane guidelines, and 5) Claude-3.7 refined. A multidisciplinary panel assessed completeness (16-item checklist), readability (Flesch-Kincaid), and risk communication adequacy (Likert scale), with inter-rater reliability validated (Kappa>0.75).
Results: Standard LLMs omitted 66-71% of limitations from the PLSs (GPT-4o: 68% [95% confidence interval 61–75%], Claude-3: 71% [64–78%]) vs. 12% for manual summaries. Evidence-structured prompts improved the disclosure of limitations 3.8- to 4.5-fold (GPT-4: 4.2-fold; Claude-3: 3.9-fold), achieving parity with humans in conflict-of-interest transparency (Δ≤5%, p>0.05). Claude-3 showed marginally higher lexical diversity (measure of textual lexical diversity = 83.6 vs. GPT-4o: 79.2), while GPT-4 better replicated Cochrane terminology (86% vs. Claude-3: 72%). Both LLMs maintained readability at a level understandable to 8th-9th grade students.
Discussion: Evidence-structured prompts greatly improve the quality of AI-generated plain-language summaries, bridging a critical gap in the communication of medical information.
Keywords: LLMs, evidence translation, plain language summary
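The readability benchmark reported above can be reproduced with the standard Flesch-Kincaid grade-level formula. The following is a minimal illustrative sketch with a crude syllable heuristic, not the tooling the authors used:

```python
# Illustrative Flesch-Kincaid grade-level calculation (not the study's code).
# The syllable counter is a naive vowel-group heuristic; production tools
# rely on pronunciation dictionaries for accurate counts.
import re

def count_syllables(word: str) -> int:
    # Count runs of vowels as syllables; never return zero for a real word.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fk_grade(text: str) -> float:
    # FK grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    n = max(1, len(words))
    return 0.39 * (n / sentences) + 11.8 * (syllables / n) - 15.59
```

A score of roughly 8-9 corresponds to the 8th-9th grade reading level the abstract reports.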
Paper Number
353
Biography
PUBLICATIONS
AI-Driven Evidence Synthesis: Data Extraction of Randomized Controlled Trials with Large Language Models (Accepted by International Journal of Surgery in November)
CONFERENCE PRESENTATIONS
Enhancing Systematic Reviews with Large Language Models: Data Extraction of Randomized Controlled Trials (Poster, The Global Evidence Summit 2024, Prague)
Ms Ye Wang
China
Lanzhou University
The Role of Large Language Models in Guideline Peer Review: Current Adoption, Challenges, and Future Prospects
Abstract
Background: Large language models (LLMs), such as ChatGPT, Claude, and Gemini, have the potential to enhance the peer review process of clinical practice guidelines (CPGs) by identifying methodological issues, reporting gaps, and conflicts of interest. However, their adoption, effectiveness, and challenges in this context remain unclear.
Objective: To evaluate the current adoption, effectiveness, and challenges of using large language models in the peer review process of clinical practice guidelines and explore potential improvements.
Methods: A mixed-methods, cross-sectional study will be conducted in three phases. Phase 1 will involve distributing surveys to guideline developers, journal editors, and reviewers to assess current LLM adoption and acceptance. Phase 2 will evaluate the effectiveness of LLMs using selected guideline documents previously assessed by experts (RIGHT and AGREE II as reference standards), measuring accuracy, sensitivity, specificity, and consistency. Phase 3 will involve interviews with guideline developers, journal editors, reviewers, and AI ethics experts to explore challenges in LLM application and potential improvements in guideline peer review.
Results: Data collection and analysis are ongoing. Comprehensive results will be presented at the upcoming congress.
Discussion: This study will provide critical insights into the practical role of LLMs in guideline peer review, including their strengths, limitations, and areas for future improvement. Findings will inform best practices and recommendations for integrating AI tools into guideline development processes.
The author gratefully acknowledges the support of the K.C. Wong Education Foundation, Hong Kong.
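Phase 2's accuracy, sensitivity, and specificity can be derived from a confusion matrix of LLM-flagged issues against the expert reference assessments. A minimal sketch, assuming item-level binary judgments; the counts here are hypothetical:

```python
# Hypothetical sketch (not the study's code): standard diagnostic
# performance metrics from true/false positive and negative counts,
# where the expert RIGHT/AGREE II assessments serve as the gold standard.
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,  # true issues caught
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,  # clean items passed
    }
```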
Paper Number
442
Biography
Ye Wang is an MPH student at the School of Public Health, Lanzhou University. Her research focuses on clinical guideline development, AI-accelerated evidence synthesis, and AI-assisted COI management. With a growing interest and experience in guideline development, her research aims to enhance the transparency and consistency of guidelines and promote AI's role in evidence synthesis and conflict of interest management.
Dr Marta Souto Maior
Coordinator
Conitec
Platform for consulting drug recommendation in Clinical Practice Guidelines
Abstract
Background: The National Committee for Health Technology Incorporation (Conitec) advises the Brazilian Ministry of Health (MoH) on the development of Clinical Practice Guidelines (CPGs). These documents establish criteria for diagnosing a disease or health problem; for treatment and clinical control mechanisms; and for the monitoring and verification of therapeutic results, to be overseen by managers of the Unified Health System (SUS).
Objective: To describe the development of a platform for consulting drug recommendations in Clinical Practice Guidelines.
Methods: A descriptive qualitative study of the platform's development.
Results: For each CPG published by the MoH through December 2024, we collected its title, International Classification of Diseases codes, drug recommendations, and publication date. All data were organized in Excel 2010 and later exported to Microsoft Power BI to create the platform. The platform is now being improved so that it can be made available on the Conitec website.
Conclusions: The platform will improve the access of patients, health managers, and health professionals to information on which drugs are recommended in each guideline and which guidelines provide recommendations on the care of each disease.
Paper Number
517
Biography
Pharmacist. MSc and PhD in Public Health. Works at Conitec.
Mr Gregor Wenzel
German Cancer Society
Evaluation of AI-Generated Summaries from Evidence Tables for Evidence-Based Guidelines in the German Guidelines Program in Oncology
Abstract
Objective: The development of S3 guidelines requires meticulous evidence retrieval and synthesis. While comprehensive evidence tables are available, their interpretation is often left to guideline groups, imposing a significant resource burden. This study evaluates AI-generated summaries through a quantitative assessment of endpoint accuracy and a qualitative evaluation of text quality, including plausibility and usability.
Methods: Two AI models, Claude Sonnet 3.5 and OpenAI o3-mini, processed 30 randomly selected evidence tables from 17 clinical guidelines. Two assessors evaluated recognized, erroneous, and hallucinated endpoints quantitatively, as well as plausibility and usability qualitatively on a 3- and 5-point scale, respectively. Summary length was also analyzed for its potential impact on readability and interpretation.
Results: OpenAI o3-mini recognized more endpoints (92.8%) than Claude (53.2%) and had fewer erroneous extractions (0.1% vs. 2.8%). Only Claude hallucinated any endpoints (1.8%). On average, Claude’s summaries were shorter (243.5 vs. 630.4 words) and slightly more plausible (1.32 vs. 1.74). Usability scores were comparable (2.15 vs. 2.26), though differences in summary length may influence qualitative assessments.
Conclusion: OpenAI o3-mini excelled in endpoint recognition with minimal errors, while Claude generated summaries that assessors found slightly more plausible. Both models show promise for aiding evidence interpretation, but refinements are needed to optimize usability for guideline development.
Paper Number
288
Biography
Gregor Wenzel is a theoretical biologist and medical writer. He has been working in the German Guidelines Program in Oncology, assisting in guideline digitalization and serving as a methodologist.
Mr Dianchun Liu
China
Beijing University of Chinese Medicine
AI-based Recommendations Map (RecMap) for Traditional Chinese Medicine (TCM) Treatment of Diabetes: Design, Development and Dissemination
Abstract
Introduction
Approximately 828 million adults worldwide were affected by diabetes in 2022, with a notable trend towards younger age groups and broader prevalence. Because the quality of evidence in the field of traditional Chinese medicine (TCM) varies, inconsistent and conflicting recommendations often occur between guidelines. Our aim is to develop an AI-integrated RecMap for TCM treatment of diabetes covering all stages of disease progression.
Methods
The first phase is planning and investigation: we will explore the needs and expectations of both clinicians and patients, and a steering committee will be set up to specify the scope and obligations of the platform. The second is the development phase: comprehensive and systematic screening will be conducted across a wide range of databases, such as Embase, PubMed, the Chinese Medical Association Guide, and Wanfang. Guidelines will be evaluated using AGREE II and AGREE-REX, and the recommendations extracted from them will be evaluated using the GRADEpro infrastructure. In the AI training stage, the AI system will be integrated into the website to provide accurate responses to questions based on the guidelines and recommendations hosted there. Finally, dissemination: strategies for debugging and testing will be developed collaboratively with stakeholders, and feedback from clinicians and patients will be used to optimize and promote the dual-mode display. In the clinical trial and promotion phase, feedback from patients and clinicians will further optimize and promote the website.
Discussion
The AI-based TCMDia-RecMap will significantly enhance the utilization of reliable guidelines among clinicians, patients, and policymakers, thereby optimizing evidence-based diabetes management.
Paper Number
122
Biography
Dianchun Liu, an undergraduate at Beijing University of Chinese Medicine, is committed to research on diabetes, cancer, and gastrointestinal diseases. Proficient in bioinformatics, artificial intelligence, and evidence-based medicine, Liu has published studies as the first author in journals such as the World Journal of Gastrointestinal Oncology and Chinese General Practice. These works are crucial steps in advancing research in these medical areas.
Yishan Qin
China
Lanzhou University
Application and Exploration of Large Language Models in the Dissemination and Implementation of Infertility Guidelines
Abstract
Infertility is rising among reproductive-age populations and is now a major global public health issue. The dissemination and implementation of clinical practice guidelines is key to promoting medical equity, improving the quality of care, and solving public health problems. This research uses infertility guidelines as an example to study the application of large language models in adapting guideline content and supporting its dissemination and implementation.
We first collected and processed key recommendations from infertility-related guidelines. Then, we used large language models to create explanatory and promotional texts and videos. Next, medical experts and patient representatives checked the accuracy and readability of these materials. The early results show that these models can quickly produce information in many formats. This information is easy for patients from different cultural and educational backgrounds to access, thus boosting guideline accessibility. These models also have the potential to offer personalized patient information and support healthcare providers in low-resource settings.
This new method provides a promising way to enhance the efficiency and coverage of guideline dissemination and implementation, promoting fair access to evidence-based recommendations. However, ensuring the accuracy of generated content and adhering to clinical and legal requirements are crucial for future development and use.
Paper Number
450
Biography
Qin Yishan is a PhD candidate at the School of Basic Medicine, Lanzhou University. Her research interests are traditional medicine and guideline methodology.
Dr Danielle Pollock
Senior Research Fellow
Health Evidence Synthesis, Recommendations And Impact (hesri), School Of Public Health, University Of Adelaide
Can evidence and gap maps improve guideline efficiency?
Abstract
Background: The development of trustworthy guidelines requires extensive resources. There is a need to improve current workflows while maintaining the high standards demanded by clinical practice. A key area for improving efficiency is the conduct and reporting of evidence synthesis. Traditionally, guideline developers conduct individual searches for each prioritized research question; this approach is tedious and inefficient, often resulting in the same study being screened for inclusion multiple times. Our team conducted an evidence and gap map (EGM) to improve workflow efficiencies in the development of the Australian Motor Neurone Disease (MND) Guideline. By conducting one search and categorizing the evidence base, we propose that this can improve guideline workflow.
Objective: To discuss our process of conducting an EGM to assist in question prioritization, evidence searching, screening and conduct.
Methods: This EGM was conducted according to Campbell guidance and JBI guidance for scoping reviews. It was designed with people living with MND, clinicians, and researchers. The EGM was conducted prior to prioritization of research questions.
Results: Our EGM is currently underway and will be completed by GIN 2025. We will discuss the benefits, challenges, feasibility, implications, and recommendations of conducting an EGM.
Discussion: EGMs could provide a foundation for transparent clinical practice guidelines to be developed more efficiently.
Paper Number
126
Biography
Dr Danielle Pollock is a Research Fellow at HESRI (Health Evidence Synthesis, Recommendations and Impact). She developed the JBI Scoping Review Network and is the chair of the JBI Scoping Review methodology group and the GIN ANZ working group.
Ms Yanfang Ma
China
Vincent V.C. Woo Chinese Medicine Clinical Research Institute, School of Chinese Medicine, Hong Kong Baptist University
LLMs-CMPs collaboration for clinical decision-making to promote guidelines implementation in Chinese medicine
Abstract
Background: Traditional Chinese Medicine (TCM), a complementary therapy used widely around the world, offers diverse therapeutic approaches such as acupuncture, herbal medicine, and tuina, rooted in centuries of holistic practice. However, a limitation of TCM is its non-standardized, individualized approach to diagnosis and treatment, which often results in variability in decision-making. With the rapid growth of clinical evidence and evidence-based clinical guidelines, Chinese Medicine Practitioners (CMPs) face challenges in integrating them for consistent care. Artificial intelligence, particularly Large Language Models (LLMs), offers a potential solution: LLMs can self-train on extensive datasets, providing real-time, evidence-based recommendations to support clinical decision-making.
Objective: This study aims to evaluate the consistency between LLMs, CMPs, and LLM-CMP collaboration in clinical decision-making. We seek to understand how LLMs, learning guideline recommendations in real-time, can enhance decision-making consistency and promote the implementation of guidelines in TCM.
Methods: A cross-sectional observational study will compare the diagnosis and treatment decisions made by LLMs, CMPs, and their collaboration on standardized clinical gastroenterology routine cases. LLMs will provide self-trained recommendations, CMPs will apply clinical expertise, and LLM-CMP collaboration will combine both to make final decisions.
Results: Diagnostic and treatment decisions will be assessed for consistency using Cohen’s κ, targeting κ≥0.8 for routine cases. The performances of different LLM models in collaboration with CMPs will also be quantified.
Discussion: The study is ongoing, and the results will be presented at the GIN Conference.
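The planned consistency analysis relies on Cohen's κ. A minimal sketch of the statistic for two decision sequences follows; the labels and data are hypothetical, not from the study:

```python
# Illustrative Cohen's kappa: chance-corrected agreement between two
# raters (here, e.g., an LLM and a CMP) over the same cases.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed proportion of cases where both raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum(ca[label] * cb[label] for label in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)
```

Values of κ ≥ 0.8, the study's target for routine cases, indicate near-perfect agreement on conventional benchmarks.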
Paper Number
50
Biography
Ms. Ma has focused on evidence-based medicine, systematic reviews, and the development of clinical practice guidelines since 2016. She joined the Chinese EQUATOR Centre at Hong Kong Baptist University in August 2022 and is also interested in the development of reporting guidelines (data sharing and traditional Chinese medicine). Ms. Ma has authored or co-authored over 40 articles in peer-reviewed journals and contributed to more than five books on reporting guidelines, GRADE applications, and evidence-based assessment of Chinese Medicine.
Ms Ye Wang
China
Lanzhou University
Assessing the Effects of Large Language Models on Guideline Quality and Efficiency: An Interrupted Time Series Approach
Abstract
Background: Large language models (LLMs) such as ChatGPT and Claude have emerged as promising tools to enhance guideline development by potentially improving guideline quality and efficiency.
Objective: To evaluate the impact of implementing LLMs on the quality and efficiency of clinical practice guideline (CPG) development.
Methods: We will conduct an interrupted time series (ITS) analysis to compare guideline quality and efficiency before and after the implementation of LLMs in guideline development processes. Guidelines developed between 2020 and 2024 will be collected from databases including G-I-N, MEDLINE, Embase, and Web of Science. We will assess guideline quality using the Reporting Items for Practice Guidelines in Healthcare (RIGHT) checklist and the Appraisal of Guidelines for Research and Evaluation II (AGREE II). Efficiency will be assessed based on development time and resource utilization. Segmented regression analyses will quantify changes attributable to LLM implementation.
Results: Data collection and analysis are ongoing. Results will be presented at the upcoming conference.
Conclusion: This study will provide critical evidence on the role of LLMs in enhancing guideline development, potentially informing best practices in guideline methodology.
The author gratefully acknowledges the support of the K.C. Wong Education Foundation, Hong Kong.
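The segmented regression at the heart of an ITS analysis can be illustrated with a minimal sketch. The quarterly quality scores, interruption point, and effect sizes below are invented for illustration; each segment is fitted with ordinary least squares, and the level and slope changes at the interruption are then read off:

```python
def ols_line(xs, ys):
    """Simple least-squares fit of y = a + b*x; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Hypothetical quarterly mean guideline-quality scores, 2020-2024;
# interruption (LLM adoption) at t = 10.
t0 = 10
pre = [(t, 50.0 + 0.5 * t) for t in range(t0)]                            # baseline trend
post = [(t, 50.0 + 0.5 * t + 4.0 + 1.0 * (t - t0)) for t in range(t0, 20)]  # level +4, slope +1

a_pre, b_pre = ols_line(*zip(*pre))
a_post, b_post = ols_line(*zip(*post))

level_change = (a_post + b_post * t0) - (a_pre + b_pre * t0)  # jump at the interruption
slope_change = b_post - b_pre                                  # change in trend
print(level_change, slope_change)
```

In the study itself the segmented regression would be fitted as a single model with pre/post indicator and interaction terms and would include uncertainty estimates; this sketch only shows the quantities of interest.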
Paper Number
440
Biography
Ye Wang is an MPH student at the School of Public Health, Lanzhou University. Her research focuses on clinical guideline development, AI-accelerated evidence synthesis, and AI-assisted COI management. With a growing interest and experience in guideline development, her research aims to enhance the transparency and consistency of guidelines and promote AI's role in evidence synthesis and conflict of interest management.
Dr Natasha Gloeck
South Africa
Senior Scientist
South African Medical Research Council
Promoting efficiency in an evidence response service towards advancing universal health coverage (UHC) in South Africa
Abstract
Background
The Evidence to Decision (E2D) Initiative builds on a decade of engagement with academic and government partners to strengthen healthcare recommendations through evidence synthesis for Universal Health Coverage (UHC) in South Africa. E2D advances the partnership through clear workplans and funding to ensure timely, responsive evidence synthesis and methodological support. A key component of this initiative involves leveraging technology to streamline the evidence-request process and improve overall efficiency.
Aim
To develop a tailored database to meet specific needs of the E2D evidence-response system and enhance review request processes.
Methods
Previously, the service operated through email requests, informal discussions and manual spreadsheet updates, making real-time updates cumbersome and dependent on several people for maintenance. The current approach utilizes Microsoft Forms for request submissions where updates still rely on manual input. However, formalization of the evidence response service through E2D highlighted the need for a more robust platform supporting real-time updates and multi-user accessibility.
Results
We have developed and are piloting a new platform, built on REDCap, to manage requests for evidence from the NDoH. It links databases, streamlines the allocation of available reviewers, and enables real-time updates on review product progress. Further testing is ongoing and includes additional modules such as report generation.
Discussion
Harnessing technology will enhance efficiency, improve reviewer capacity management, and minimize the risk of overlooked requests. We further anticipate this serving as a pilot project to optimize processing for a planned Health Technology Assessment agency, supporting the transition towards UHC in South Africa.
Paper Number
374
Biography
Tasha is a Senior Scientist in the Health Systems Research Unit at the SAMRC. She holds an MBChB (UP), DTM&H (UP), MSCE (UP) and a PG Dip in Health Economics (UCT). She is currently pursuing a PhD in Public Health. Her special interests include evidence-based health care, primary health care, evidence synthesis, evidence-informed decision-making, and clinical practice guideline methodology. Tasha helps to co-ordinate the South African GRADE Network and co-leads Goal 2 of the SAMRC/NDoH E2D project. Tasha is passionate about implementing training and research that positively impacts the lives of the people of South Africa and other low- and middle-income countries.
Ms Kinlabel Okwen Tetamiyaka Tezok
Software Engineer
Effective Basic Services (eBASE) Africa
Harnessing the Transferability Toolkit for Guideline Adaptation in Local Contexts
Abstract
Background
Most solutions developed in the Global North are tailored to their specific contexts. A guideline that is effective in the United Kingdom may not yield the same results in Africa due to diversities in culture, infrastructure, healthcare systems, and socioeconomic factors. This problem highlights the need to adjust guidelines to fit different local contexts to yield better health outcomes.
Objective
This study explores how the education-based transferability toolkit can assess the feasibility of adapting healthcare guidelines across diverse settings using machine learning.
Methods
We apply the transferability toolkit developed by eBASE Africa, leveraging Classification and Regression Trees (CART) and Natural Language Processing (NLP) to predict guideline adaptability. The model evaluates five key variables (relevance, complexity, cost, average importance, and impact) to classify guidelines as highly transferable, moderately transferable, or not transferable in a given context.
Results
A transferability threshold of 69% was identified for educational strands among stakeholders in education. We argued that transferability holds when there is high relevance, low complexity, low cost, and high importance. Building on these results, we expect this tool to be highly applicable to guideline adaptation.
Discussion for scientific abstracts
The study explores how the transferability toolkit can be used to adapt healthcare guidelines to different contexts. This approach supports treating guideline development as living evidence and makes guidelines better suited to context-specific needs. Living guidelines supported by the transferability tool would reduce the effort and cost of repeatedly developing guidelines across the globe.
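As a rough illustration of how the five variables might feed a transferability classification, the rule cascade below is a hand-written stand-in for the trained CART model; the scoring formula, the class boundaries, and the reuse of the reported 69% figure as a threshold are assumptions for illustration, not the actual model:

```python
def classify_transferability(relevance, complexity, cost, importance, impact):
    """Toy classification from the five toolkit variables (each scaled 0-1).

    High relevance, importance and impact raise the score; high complexity
    and cost lower it. Thresholds are illustrative, not the trained CART.
    """
    score = (relevance + importance + impact + (1 - complexity) + (1 - cost)) / 5
    if score >= 0.69:  # threshold reported for educational strands
        return "highly transferable"
    if score >= 0.5:
        return "moderately transferable"
    return "not transferable"

# A guideline with high relevance/importance, low complexity/cost:
print(classify_transferability(0.9, 0.2, 0.3, 0.8, 0.7))
```

A real CART model would learn variable-specific split points from labeled examples rather than averaging the inputs, but the same inputs and output classes apply.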
Paper Number
224
Biography
Kinlabel is a tech professional specializing in mobile development and machine learning in evidence-based practices. She leads the eBASE Connect app development at iCode Abakwa, where she helps shape app design and architecture. Her work includes a paper on improving livelihoods for people with disabilities in Cameroon through evidence-based toolkits. She is currently part of the DESTINY development team, working to integrate the transferability toolkit into their DEST tool. The toolkit predicts the transferability of interventions across different contexts, enhancing adaptive learning and evidence utilization.
Mr Haodong Li
China
Master Candidate
Lanzhou University
Evaluation of Compliance of Methodological Quality Compared with LLM with AMSTAR 2 Tool: A Cross-Sectional Survey
Abstract
With the increasing application of Large Language Models (LLMs) in the medical field, their potential in assessing the methodological quality of systematic reviews has garnered significant attention. This study aims to compare the assessment results of methodological quality between three LLMs (Kimi, DouBao, and DeepSeek) and human evaluation in 73 systematic review articles using the AMSTAR 2 tool. The study is ongoing, with all tests expected to be completed and results presented before the conference.
Background:
The methodological quality of systematic reviews is crucial, and the AMSTAR 2 tool is widely used for evaluation. This study compares the performance of three LLMs (Kimi, DouBao, DeepSeek) and two human evaluators in assessing 73 systematic reviews.
Methods:
Each review is assessed three times by LLMs and humans. Primary indicators include overall consistency score (OCS), OCS for each item, testing time, LLM stability, and intraclass correlation coefficient (ICC).
Results:
The study is ongoing, and complete results will be available before the conference. Preliminary findings show differences in overall OCS between LLMs and humans. Detailed analysis will reveal LLM performance in different dimensions, efficiency differences, and consistency.
Conclusion:
This study will provide empirical evidence on LLMs' strengths and limitations in medical literature evaluation, guiding future research and practice.
The author gratefully acknowledges the support of K.C. Wong Education Foundation, Hong Kong.
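Assuming the overall consistency score (OCS) is read as the proportion of AMSTAR 2 items rated identically by an LLM and a human evaluator (the abstract does not define it precisely), it can be computed as follows; the ratings are hypothetical:

```python
def overall_consistency(llm_ratings, human_ratings):
    """Proportion of items with identical ratings (one reading of OCS)."""
    assert len(llm_ratings) == len(human_ratings)
    return sum(a == b for a, b in zip(llm_ratings, human_ratings)) / len(llm_ratings)

# Hypothetical ratings on the 16 AMSTAR 2 items ("yes"/"partial yes"/"no").
llm = ["yes"] * 10 + ["no"] * 4 + ["partial yes"] * 2
human = ["yes"] * 9 + ["no"] * 5 + ["partial yes"] * 2
print(overall_consistency(llm, human))
```

In the study, such item-level scores would be aggregated over the 73 reviews and the three repeated assessments, alongside the ICC for inter-rater reliability.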
Paper Number
511
Biography
I'm from the School of Public Health, Lanzhou University. My major is Epidemiology and Health Statistics, and my research focuses on evidence-based medicine and chronic epidemiology. Currently, I'm working on a project that combines artificial intelligence and evidence-based medicine.
Dr Xiaomei Yao
Mcmaster University
The Role of COVIDENCE: An AI-Based Tool for Title and Abstract Screening in A Breast Cancer Evidence-Based Clinical Practice Guideline
Abstract
Background: Developing systematic review (SR)-based, high-quality cancer clinical practice guidelines (CPGs) typically requires two years without any assistance from artificial intelligence (AI).
Objective: To compare the performance of a newly introduced AI-assisted title and abstract screening (Stage I) in Covidence with fully manual screening, using retrospective data from an SR supporting an already-published breast-cancer CPG.
Methods: In a SR comprising 8,774 articles, each article was assessed for relevance and final inclusion through manual review. From this dataset, three article subsets (n=500, 1000, and 2000) were randomly selected to run 30 independent Stage I, AI-assisted trials for each subset. The primary outcome of each trial is workload savings (the proportion of articles not requiring manual screening) at AI-assisted identification of 95% and 100% relevant articles, and 100% finally-included articles. The secondary outcome is missed articles (number of finally-included articles missed upon identifying 95% relevant articles).
Results: To date, 10 trials in each of the first two subsets were completed. At the identification of 95%, 100% (relevant) and 100% (finally included) articles, mean [standard deviation] workload savings are 37.1% [14.1%], 25.9% [15.1%], 57.8% [17.9%] (n=500) and 27.6% [18.2%], 13.5% [13.4%], 49.7% [28.6%] (n=1000), respectively. Workload savings differed significantly (p=0.038) between n=500 and n=1000 trials at the identification of 100% relevant articles. One missed article in 1 trial for the subset of n=500 and 2 missed articles in 2 trials for n=1000 were noted.
Discussion: AI assistance in Covidence demonstrates promise in improving the efficiency of Stage I screening. Complete data will be available for discussion by May 2025.
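Workload savings, as defined in the abstract (the proportion of articles not requiring manual screening once the AI-ranked list has surfaced the target share of relevant articles), can be sketched as follows; the ranking is synthetic, not trial data:

```python
import math

def workload_savings(ranked_relevance, recall_target):
    """Fraction of articles that need not be screened manually once the
    AI-ranked prefix has captured `recall_target` of the relevant ones.

    `ranked_relevance` lists articles in the AI's priority order,
    with 1 = relevant and 0 = not relevant (known from manual review).
    """
    total_relevant = sum(ranked_relevance)
    needed = math.ceil(recall_target * total_relevant)
    found = 0
    for i, rel in enumerate(ranked_relevance, start=1):
        found += rel
        if found >= needed:
            return 1 - i / len(ranked_relevance)
    return 0.0

# Hypothetical ranking of 20 articles, relevant ones mostly ranked early.
ranking = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(round(workload_savings(ranking, 0.95), 3))
```

The study's trials repeat this measurement 30 times per subset because the AI's ranking, and hence the savings, varies from run to run.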
Paper Number
117
Biography
Dr. Xiaomei Yao is the Associate Director for Quality and Methods at the Program in Evidence-Based Care (PEBC), Ontario Health (Cancer Care Ontario). She is a part-time faculty member at the Department of Health Research Methods, Evidence, and Impact at McMaster University, Canada, and a former member of the GIN/NA Steering Group. Dr. Yao is the Section Editor of "Epidemiology and Statistics" for Surgical Oncology and an Associate Editor of GIN's journal, CPHG.
Dr Gabriella Facchinetti
Istituto Superiore Di Sanità
Harnessing Artificial Intelligence for guideline question and recommendations development: a mapping review protocol
Abstract
Background: Clinical practice guidelines (CPG) traditionally rely on expert panels to formulate key questions and recommendations. Explainable Artificial Intelligence (XAI) could support this process, making it faster and less susceptible to undue influence.
Objective: To identify and categorize available literature across various disciplines on the use of generative AI in developing CPG and recommendations.
Methods: A systematic mapping review following the PRISMA-ScR guideline will be conducted to identify and categorize available literature across various disciplines on the use of generative AI in developing CPG and to compare it with the traditional expert-driven process.
The inclusion criteria will cover AI-generated clinical questions and recommendations across all healthcare fields. Major CPG databases and additional sources will be searched from 2020 onward.
The primary outcome is the frequency of AI-generated clinical questions and recommendations, highlighting key clinical areas and pathway phases. Study selection will use Rayyan with independent screening and discussion for disagreements. Data extraction includes author, year, country, population, clinical area, AI model, and guideline development phase. Findings will be presented narratively and in tables based on research questions.
Results: This review will estimate AI’s role in CPGs over the past five years, identifying key clinical areas. To our knowledge, this is the first study on Large Language Models in CPG development, laying the groundwork for future research.
Discussion for scientific abstract: This study will inform policy, improve guideline development, and promote inclusive, diverse, evidence-based healthcare practices. AI could increase efficiency, accuracy, and objectivity, reducing biases and ensuring more reliable, evidence-based recommendations.
Paper Number
284
Biography
Gabriella Facchinetti is a nurse, Senior Researcher at the Italian National Institute of Health (Istituto Superiore di Sanità) in the National Center for Clinical Governance and Care Excellence, and a university lecturer. She is an expert in research on clinical governance, home and community-based care for older adults with chronic, degenerative, and incurable diseases. With extensive experience in healthcare research and education, she is committed to improving care models and enhancing the quality of services for vulnerable populations.
Dr Simon Van Cauwenbergh
Methodologist
WOREL / USP
Adapting guideline recommendations on smoking cessation within a cross-border primary care collaboration: Lessons learned
Abstract
Background: The collaboration between WOREL (Belgium) and NHG (Netherlands) was initiated at the GIN 2022 conference in Toronto. This partnership was formalized with a Memorandum of Understanding in 2023. Recently, both primary care organisations were accredited by the Belgian Centre for Evidence-Based Medicine.
Objective: To present facilitators, barriers and practical considerations when exchanging summaries of evidence (including GRADE SoF tables) and evidence to decision formats for guideline recommendations on smoking cessation in primary care.
Methods: The collaboration focuses on the guideline development process on smoking cessation. In 2024, the topic was coincidentally addressed by the other organization, providing a unique opportunity to assess the feasibility of combined cross-border guideline adaptation and adoption (the adolopment procedure). The collaboration involves:
- exchanging methodological details, development processes, search strategies and resources,
- using MagicApp to facilitate the sharing of critically appraised research evidence, rationales and evidence-to-decision frameworks,
- identifying barriers, facilitators, and practical considerations in cross-border guideline development.
Results: During the adolopment process, search strategies, SoF tables and evidence to decision formats are exchanged. Key outcomes will include a bilateral exchange of guideline development methodologies, identification of challenges and facilitators in cross-border collaboration, and documentation of lessons learned.
Discussion: The collaboration will establish a roadmap for the next steps of collaboration, outlining strategies to optimize the exchange and joint development of primary care guidelines in the future.
Paper Number
435
Biography
Ton Kuijpers is an epidemiologist at the Dutch College of General Practitioners and co-chair of the Dutch GRADE Network.
Simon Van Cauwenbergh is a medical doctor and PhD candidate in Physical and Rehabilitation Medicine, and has been working since 2021 for the Belgian Working Group Development of Primary Care Guidelines (WOREL).
Ms Lejla Koco
Guideline Advisor
Stichting PZNL
The potential of AI in assisting palliative care guideline revisions: insights from a pilot study
Abstract
Background:
Artificial Intelligence can potentially assist in different ways during all phases of the guideline development process, such as collection of evidence, formulating recommendations, structuring texts and the writing process. However, integration of AI must align with established guideline development frameworks to ensure transparency and reliability. Despite growing use of AI in medicine, its role in palliative care guideline development remains limited.
Objective:
This study explores how AI can be applied in guideline development, focusing on various prompt engineering techniques. We evaluate multiple AI-generated texts on their content, quality and expected usefulness, and assess the required resources, human effort and potential benefits for guideline developers.
Methods:
We examined various AI applications by using ChatGPT 4o for text generation with several prompt variations for generating new texts. Different prompt structures were tested to optimize the AI-generated output. The content of the generated AI texts was evaluated through text analysis and expert opinions.
Results:
Well-structured prompts significantly improved AI-generated content quality, ensuring coherence and relevance. By providing reference materials for AI, as input, the quality of AI-generated texts improved and is expected to reduce undesired hallucinated output. These AI applications reduced drafting time and enhanced content consistency of guideline recommendations and considerations. However, human checks remained crucial for maintaining methodological rigor and clinical accuracy.
Discussion:
Future research should focus on refining AI applications, integrating them into structured workflows, and ensuring alignment with established guideline development methodologies. Responsible AI implementation will require ongoing evaluation and adaptation to maintain scientific integrity and trustworthiness.
Paper Number
464
Biography
Lejla Koco, MSc, is a guideline advisor specializing in Dutch palliative care guidelines. With expertise in palliative care guideline development, she focuses on enhancing evidence-based practices to improve patient care. Her work involves developing, revising, and implementing guidelines to ensure high-quality palliative care standards in the Netherlands. Passionate about innovation, she explores the role of AI in optimizing guideline processes.
Prof Dr Janine Vetsch
OST
A Systematic Comparison of Data Extractions Using a Large Language Model (Elicit) and Human Reviewers
Abstract
Background: Elicit is an artificial intelligence tool which may automate data extraction for the conduct of evidence synthesis and guidelines. However, the tool’s performance and accuracy are unclear and require an independent assessment.
Objective: We aimed to compare data extractions from randomized controlled trial reports done by Elicit and by human reviewers.
Methods: We sampled 20 randomized controlled trial reports from which data had been extracted manually by a human reviewer. We assessed the variables study objectives, sample characteristics and size, study design, intervention, outcomes measured, and intervention effects, and classified the results as "deviating extractions", "partially equal with less information", or "equal to or more information".
Results: Data extractions by Elicit were equal to human extractions in 49% of all variables across all twenty studies, partially equal in 46%, and deviating in 5%. Across all variables, Elicit extracted information equal to or more than a human reviewer in 1-20 studies (median 11). Only for the variable study design were all extractions (100%) by Elicit equal to those of human reviewers. For the variable intervention effects, extractions by Elicit were equal to human reviewers in only one study (5%).
Discussion for scientific abstract: Elicit extracted data only partly correctly for our predefined variables. Variables like 'intervention effect' or 'intervention' may require a human reviewer to complete the data extraction. Our results suggest that verification by human reviewers is necessary to ensure that all relevant information is captured completely and correctly by Elicit.
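The per-variable classification described in the methods lends itself to a simple tally. The sketch below is illustrative only, assuming one label per extracted variable; the function name and input format are hypothetical, not taken from the study.

```python
from collections import Counter

# The three predefined categories from the abstract's methods.
LABELS = (
    "equal to or more information",
    "partially equal with less information",
    "deviating extractions",
)

def agreement_summary(classifications: list[str]) -> dict[str, float]:
    """Percentage of extracted variables falling into each category."""
    counts = Counter(classifications)
    total = len(classifications)
    return {label: round(100 * counts[label] / total, 1) for label in LABELS}
```

Applied to 100 hypothetical variable-level classifications distributed as in the abstract, the summary would report 49% equal, 46% partially equal, and 5% deviating.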
Paper Number
62
Biography
Magdalena Vogt has been a research associate at the Competence Centre Evidence-based Healthcare (EBHC) at the Institute of Health Sciences since September 2023. She holds a Master's degree in Public Health.
Her activities include service provision, research, and teaching in the field of EBHC. Service and research projects focus on knowledge management and networking in healthcare professions, as well as the transfer of research results into practice to promote EBHC. Her teaching covers research methods and systematic literature searching, along with supervision of bachelor theses. Magdalena and her team have published several articles on these topics.
