6.5 - Theme 2. Harnessing Artificial Intelligence, Technology and Digital Innovations in Guideline Development and Implementation
Thursday, September 18, 2025 | 1:15 PM - 2:15 PM
Speaker
Prof Zachary Munn
Director
HESRI, University of Adelaide
Applying Cutting-Edge Methods in the Development of the Australian Motor Neurone Disease (MND) Guideline: Innovations, Challenges, and Lessons Learned
Abstract
Background
Guideline development has evolved significantly, integrating new methodologies to enhance transparency, engagement, and real-world impact. The Australian MND Guideline, a collaboration between HESRI and FightMND, leverages modern best practices to ensure rigour, inclusivity, and implementability.
Objective
This presentation will showcase innovative methodologies used in developing the Australian MND Guideline, including living methods, GRADE-ADOLOPMENT, outcome prioritisation frameworks, and co-production with people with lived experience, amongst other innovations. We will share insights on applying these methods and lessons learned.
Methods
Following GRADE-based methodology, the guideline integrates:
• Outcome prioritisation through structured engagement with clinicians, researchers, policymakers, advocates and people with lived experience.
• Living synthesis methods to allow real-time updates (see the sketch after this list).
• Rapid evidence synthesis using GRADE-ADOLOPMENT to adapt global recommendations efficiently.
• Implementation science strategies for usability and uptake.
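To illustrate the living-synthesis bullet above: living reviews typically automate surveillance of bibliographic databases for newly indexed records. The sketch below shows one way to poll PubMed via Biopython's Entrez utilities; the query, date, and email are placeholders, and the abstract does not specify the MND guideline's actual surveillance tooling.

```python
# A minimal sketch of automated search surveillance for a living review:
# poll PubMed for records added since the last search date. The query,
# date, and email below are placeholders, not the MND guideline's setup.
from Bio import Entrez

Entrez.email = "reviewer@example.org"  # NCBI requires a contact address

def new_records_since(query: str, last_search: str) -> list[str]:
    """Return PMIDs of records entered after `last_search` (YYYY/MM/DD)."""
    handle = Entrez.esearch(
        db="pubmed",
        term=query,
        datetype="edat",          # Entrez date: when the record was added
        mindate=last_search,
        maxdate="2100/12/31",     # effectively open-ended upper bound
        retmax=500,
    )
    result = Entrez.read(handle)
    handle.close()
    return list(result["IdList"])

pmids = new_records_since('"motor neuron disease" AND therapy', "2025/01/01")
print(f"{len(pmids)} new records to screen since the last update")
```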
Results
Key innovations have strengthened interest holder inclusivity and transparency. Early engagement aligned clinical priorities with patient-centred outcomes, while GRADE-ADOLOPMENT enhanced efficiency.
Discussion
This presentation will explore what worked, challenges faced, and how these approaches inform future guideline development. The Australian MND Guideline is designed as a continuously evolving, evidence-based approach to improve MND care in Australia and beyond.
Paper Number
58
Biography
Professor Zachary Munn is an advocate for evidence-based healthcare and for ensuring policy and practice are based on the best available evidence. Professor Munn is the founding Director of Health Evidence Synthesis, Recommendations and Impact (HESRI) in the School of Public Health at the University of Adelaide; Head of the Evidence Synthesis Taxonomy Initiative (ESTI); Founding Director of the Adelaide GRADE Centre; past Chair of the Guidelines International Network (GIN) and a National Health and Medical Research Council (NHMRC) Investigator. He is a systematic review, evidence implementation and guideline development methodologist.
Dr Gaelen Adam
Brown University School of Public Health
Artificial Intelligence and Machine Learning Tools to (Semi-)Automate Evidence Synthesis: A Living Rapid Review and Evidence Map
Abstract
Background. Tools that leverage artificial intelligence (AI) or machine learning (ML) are reaching proficiency levels that make their integration into systematic review (SR) procedures increasingly viable.
Objective. To systematically map AI/ML tools designed to support SRs and evaluate their performance.
Methods. We searched PubMed, Embase, and the Association for Computing Machinery Digital Library to October 1, 2024, with updates planned biannually. We included primary studies of publicly available ML/AI tools, assessing quantitative performance across multiple reviews. We screened all studies in duplicate and extracted tool characteristics, evaluation methods, performance results, and the authors’ conclusions.
Results. We included 56 studies that assessed the performance of AI/ML tools compared to manual processes. Tools for identifying randomized controlled trials performed well, with a median recall of 98% and precision of 92%. Abstract screening tools also showed promising, though variable, results: fully automated screening using zero-shot models achieved a median recall of 87%, and semi-automated models achieved 93% with a median 51% reduction in screening burden. In contrast, tools for searching had low recall (median 21%) and precision (median 4%), and data extraction tools varied widely, with a median 61% of data correctly extracted. Risk of bias assessment tools showed moderate agreement with human assessments (Cohen’s kappa = 0.45; median agreement: 71%).
Discussion. Our findings underscore the need for further development of AI/ML tools across SR tasks, as well as the need for frameworks to reliably evaluate and integrate them in evidence synthesis. While some tools promise significant support, no existing tool can replace human expertise.
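The recall, precision, and Cohen’s kappa figures reported above are standard classification metrics. As a reference point only, here is a minimal sketch of how they are computed from a single screening confusion matrix; the counts are invented for illustration and are not data from the review.

```python
# Illustrative computation of the metrics reported above, using
# hypothetical screening counts (not the review's actual data).
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    recall = tp / (tp + fn)        # share of relevant records found
    precision = tp / (tp + fp)     # share of flagged records that are relevant
    total = tp + fp + fn + tn
    observed = (tp + tn) / total   # raw agreement with the human decision
    # Chance agreement for Cohen's kappa, from the marginal proportions
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    expected = p_yes + p_no
    kappa = (observed - expected) / (1 - expected)
    return {"recall": recall, "precision": precision, "kappa": kappa}

print(screening_metrics(tp=87, fp=8, fn=13, tn=892))
```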
Paper Number
184
Biography
Adam has worked as a librarian, editor, and research associate at Brown’s Center for Evidence Synthesis in Health (CESH) since 2013, contributing to the production of over 30 evidence synthesis products (systematic reviews, technology assessments, and other similar products) and clinical practice guidelines on a wide variety of clinical and public-health topics. She has also done extensive research into systematic review methods, particularly as they relate to leveraging technology (e.g., machine learning and text mining) to improve evidence synthesis methods.
Dr Curtis Harrod
Senior Scientist
American College of Physicians
Visual Clinical Guidelines by the American College of Physicians: Visualizing the Future of Guidelines
Abstract
Background
Clinical guidelines and associated systematic reviews are often lengthy, dense texts, which can be time-consuming and difficult for busy practicing clinicians to digest.
Objectives
The American College of Physicians (ACP) set out to develop a template for a visual clinical guideline (VCG) to succinctly present clinical guideline recommendations and engage clinician end-users through interactive features.
Methods
ACP developed a VCG prototype and iteratively refined the template, incorporating qualitative feedback from clinicians through user testing.
Results
ACP published its first VCG in April 2024. The VCG presents the guideline scope, graded recommendations and rationale, clinical considerations, and an interactive data visualization of evidence supporting each recommendation. Subsequent VCGs were expanded to present information on intervention cost.
Discussion
Innovative clinical guideline summaries and presentations require cross-disciplinary collaboration. The development of ACP’s VCG template drew from experts in evidence-based medicine, data visualization, technology, and scientific publishing.
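The abstract does not describe how the VCG’s interactive evidence visualization is built. Purely as a sketch of the general idea, the snippet below renders a small, hoverable forest-style plot with plotly; the studies, effect estimates, and certainty ratings are hypothetical.

```python
# One way to sketch an interactive evidence display in the spirit of a
# VCG, using plotly. The studies and effect estimates are hypothetical;
# ACP's actual implementation is not described in the abstract.
import pandas as pd
import plotly.express as px

evidence = pd.DataFrame({
    "study": ["Trial A", "Trial B", "Trial C"],
    "risk_ratio": [0.82, 0.90, 0.75],
    "ci_half_width": [0.10, 0.15, 0.20],  # symmetric CI for illustration
    "certainty": ["High", "Moderate", "Moderate"],
})

fig = px.scatter(
    evidence,
    x="risk_ratio",
    y="study",
    error_x="ci_half_width",
    color="certainty",
    title="Evidence supporting Recommendation 1 (hypothetical data)",
)
fig.add_vline(x=1.0, line_dash="dash")  # line of no effect
fig.show()  # opens an interactive plot with hover details
```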
Paper Number
208
Biography
Dr. Curtis Harrod is an epidemiologist and biostatistician with nearly 20 years of evidence synthesis experience. He is a Senior Scientist at the American College of Physicians and was formerly an Assistant Professor of Medicine and Research Director at Oregon Health and Science University. He has multifaceted experience leading the implementation of evidence into policy decision-making for state Medicaid programs and other public and private insurers, clinical guideline programs, and other stakeholders.
Mr José Molina
European Society of Clinical Microbiology and Infectious Diseases
A Survey on Perceived Priorities for the Implementation of Artificial Intelligence in Developing Guidelines at the European Society of Clinical Microbiology and Infectious Diseases (ESCMID).
Abstract
Background.
Artificial intelligence-based tools (AI-t) come with significant challenges, such as establishing reproducible processes, a learning curve, and direct costs. Therefore, a strategic implementation is essential, focusing on areas that would benefit most from these tools while safeguarding the integrity of guidelines (GL) development.
Objective.
To assess which tasks are perceived as a priority for being enhanced through AI-t by ESCMID GL developers.
Methods.
An online survey was disseminated in May 2024 and February 2025 among ESCMID GL developers, including Guidelines Subcommittee (GLSC) members and liaisons, Evidence Review Groups (ERG) and Guidelines Panel Members (GPM). Eighteen GL development tasks were assessed, distributed across five domains: planning; development; dissemination; implementation; clinical research. Respondents were asked to rank the perceived priority of each task on a 1-to-10 scale.
Results.
The survey was sent to 123 ESCMID developers (12 GLSC, 51 ERG, 60 GPM), and 66 responses were returned (53.6%). The participants showed a positive predisposition for implementing AI-t, with 17 of 18 proposed tasks ranked with a median grade of 7 to 10. The tasks with the highest ranks were those related to evidence search and abstract screening. Planning tasks were ranked with the lowest grades. With few exceptions, no major divergences were observed between GLSC, ERG and GPM groups.
Discussion.
ESCMID GL developers showed favorable receptiveness toward implementing AI-t. Tasks involving evidence search and screening were identified as priorities. This survey will inform the future direction of the ESCMID GL development strategy.
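A sketch of the kind of summary this survey analysis implies: median priority per task within each respondent group. The data frame below uses invented scores and a simple pandas groupby; the column names are illustrative, not the actual survey fields.

```python
# Hedged sketch of the survey analysis: median priority per task and
# respondent group. Scores and labels are illustrative, not ESCMID data.
import pandas as pd

responses = pd.DataFrame({
    "group": ["GLSC", "ERG", "GPM", "ERG", "GPM", "GLSC"],
    "task": ["evidence search", "evidence search", "evidence search",
             "planning", "planning", "planning"],
    "priority": [9, 10, 8, 5, 6, 4],   # 1-to-10 scale, as in the survey
})

medians = responses.groupby(["task", "group"])["priority"].median().unstack()
print(medians)  # one row per task, one column per respondent group
```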
Paper Number
321
Biography
Member of the ESCMID Guidelines Subcommittee.
Prof Luciane Lopes
Full Professor
University of Sorocaba
AI-Assisted Cross-Cultural Adaptation of a Tool for Situational Diagnosis of Support for Routine Evidence Use in Policymaking: The EviPolicy-Brazil Study
Abstract
Background: Institutionalizing evidence-informed policymaking (EIPM) requires structured approaches to integrating evidence into policy processes. The World Health Organization (WHO) checklist supports routine evidence use in policymaking but requires adaptations to local contexts.
Objective: To cross-culturally adapt the WHO checklist Supporting the Routine Use of Evidence During the Policymaking Process to the Brazilian context using AI-assisted methods and expert evaluation.
Methods: The adaptation process included AI-assisted semantic and cultural modifications by EIPM researchers and expert validation via the Hybrid Delphi method. A multidisciplinary panel of 30 experts from the Ministry of Health, universities, and Evidence Centers assessed clarity, relevance, and representativeness using a Likert scale. Items with a Content Validity Index (CVI) < 0.80 were refined in virtual discussions, followed by a final reassessment.
Results: The adapted WHO checklist achieved clarity, relevance, and representativeness (all items CVI > 0.80) and was approved by two Brazilian researchers involved in the original tool’s development. A pilot study applied the checklist in two Brazilian evidence centers at different EIPM institutionalization stages, including a mini-focus group. The tool was also discussed in a deliberative dialogue with 16 EIPM organization representatives, highlighting challenges and improvements.
Discussion: Expert input and AI ensured a rigorous adaptation, reinforcing the tool’s applicability in Brazil. Future applications will include a national survey of Health Technology Assessment Centers to assess EIPM maturity and inform capacity-building efforts.
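For readers unfamiliar with the Content Validity Index: the item-level CVI is commonly computed as the proportion of experts rating an item as relevant (e.g., 3 or 4 on a 4-point scale). The sketch below assumes that convention; the study’s exact scale and ratings are not given in the abstract.

```python
# Minimal sketch of an item-level Content Validity Index (CVI): the
# share of experts rating an item 3 or 4 on a 4-point relevance scale.
# The ratings are hypothetical; the study's scale details are assumed.
def item_cvi(ratings: list[int], agree_threshold: int = 3) -> float:
    return sum(r >= agree_threshold for r in ratings) / len(ratings)

panel_ratings = [4, 4, 3, 2, 4, 3, 4, 4, 3, 4]   # one item, 10 experts
cvi = item_cvi(panel_ratings)
print(f"CVI = {cvi:.2f} -> {'retain' if cvi >= 0.80 else 'refine in discussion'}")
```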
Paper Number
346
Biography
Luciane Lopes is a full professor in the Graduate Program in Pharmaceutical Sciences at the University of Sorocaba, São Paulo, Brazil, focusing on evidence-informed policymaking and health technology assessment. Dr. Lopes has consulted for the World Health Organization and the Pan American Health Organization, contributing to initiatives such as Brazil's National List of Essential Medicines.
As the leader of the SERIEMA-EVIPNet-Brazil Evidence Center and chair of the Latin American group of the International Society for Pharmacoepidemiology, she fosters regional and global collaborations. Her work spans diverse sectors, including health and social protection, emphasizing the intersection of evidence synthesis and policymaking.
Ms Xiangying Ren
Zhongnan Hospital of Wuhan University
Research on Identification and Extraction Methods for Clinical Practice Guideline Recommendations Driven by Large Language Models
Abstract
This study proposes an automated recommendation extraction method based on large language models, aimed at efficiently and accurately extracting recommendations from clinical practice guidelines. A standardized extraction framework was developed, and an automated extraction model was built by fine-tuning a large language model. Experimental results show that this method outperforms traditional rule-based methods in terms of accuracy and recall, effectively extracting clinically valuable recommendations, and providing support for intelligent healthcare decision-making.
Methodology and Results
Based on the established recommendation extraction standards, a dataset containing expert-annotated recommendations was first constructed. Then, a large language model was fine-tuned to develop the automated extraction model. The model analyzes the context of guideline texts to automatically identify and extract recommendations. The model's performance was evaluated using metrics such as precision, recall, and F1 score, and compared with traditional rule-based methods to verify its advantages in extraction efficiency and accuracy. The results show that the automated extraction model performs better than traditional methods, accurately identifying and extracting recommendations, especially when handling complex language and professional terminology.
Discussion
This study shows that the automated extraction method using large language models is effective in extracting recommendations, especially in handling complex language and terminology. However, challenges like language ambiguity and domain differences remain, which may affect accuracy. Future work can focus on optimizing the model and expanding the dataset to improve performance and support intelligent healthcare decision-making.
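Precision, recall, and F1 for extraction tasks are typically computed by matching extracted recommendations against the expert-annotated gold set. Below is a minimal sketch under that assumption, using exact string matching and invented sentences; the study’s actual matching criteria are not specified.

```python
# Illustrative evaluation of extracted recommendations against expert
# annotations, using the precision/recall/F1 metrics named above. The
# sentences are placeholders, not data from the study.
def prf1(extracted: set[str], gold: set[str]) -> tuple[float, float, float]:
    tp = len(extracted & gold)                     # correctly extracted items
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {"We recommend drug X for condition Y.",
        "Screening is suggested every 2 years."}
extracted = {"We recommend drug X for condition Y.",
             "Patients may benefit from exercise."}
print("P={:.2f} R={:.2f} F1={:.2f}".format(*prf1(extracted, gold)))
```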
Paper Number
202
Biography
Xiangying Ren is a research assistant at the Center for Evidence-Based and Translational Medicine, Zhongnan Hospital of Wuhan University. She has expertise in the development processes of systematic reviews and clinical practice guidelines.
Dr Tim Barker
Senior Research Fellow
HESRI, The University of Adelaide
My paper was peer-reviewed by AI
Abstract
Background
Large language models (LLMs) have emerged as powerful tools to assist in the research landscape. However, the technology is not without its limitations.
Objective
In this presentation, we will present a real-world example of a manuscript that our team submitted for peer review, and the anonymised responses we received from the peer-reviewers. We will discuss how we came to suspect that an LLM was used in the peer-review process, and the ethical considerations of engaging an LLM as a peer-reviewer.
Methods
Following return of a manuscript submitted for peer review, together with the comments left by the peer-reviewers, we uploaded the same manuscript to three LLMs (Chat-GPT 4.0, Gemini 1.5 Pro, and DeepSeek-V3, accessed on January 29th, 2024). Several prompts asked the LLMs to peer-review the document, and their outputs were compared with the comments left by the peer-reviewers.
Results
Several comments left by the peer-reviewer were near-identical to those produced by the LLMs, and were structured in the same way the LLMs presented their critiques. These comments will be presented in full during the presentation.
Discussion
LLMs have vastly improved research productivity in their short existence; however, they are limited in numerous ways that require nuanced discussion before their adoption as a standard tool to assist in peer review can be further explored.
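The abstract describes a manual comparison of reviewer and LLM comments. As a hypothetical complement, the sketch below scores textual similarity between two comments with difflib; the comments are invented, and this is not the method the authors report.

```python
# Hedged sketch of one way to compare a reviewer comment with LLM
# output, using difflib's ratio as a rough similarity signal. Both
# comments are invented for illustration.
from difflib import SequenceMatcher

reviewer_comment = "The introduction should clarify the study's aims."
llm_comment = "The introduction should be revised to clarify the aims."

similarity = SequenceMatcher(None, reviewer_comment.lower(),
                             llm_comment.lower()).ratio()
print(f"Similarity: {similarity:.2f}")  # values near 1.0 suggest near-identical text
```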
Paper Number
39
Biography
Dr. Tim Barker is a senior research fellow within Health Evidence Synthesis, Recommendations and Impact (HESRI) and is the deputy director of the Adelaide GRADE Centre. He is a research methodologist, systematic reviewer and clinical epidemiologist. Tim has experience in multiple evidence synthesis types and methodologies and has been internationally accredited (INGUIDE - Level III) to serve as a research methodologist and chair in the development of clinical practice guidelines.
