
4.1 - Theme 2. Harnessing Artificial Intelligence, Technology and Digital Innovations in Guideline Development and Implementation

Wednesday, September 17, 2025
4:15 PM - 5:30 PM

Speaker

Ms Simin Zhu
China
School Of Medicine, The Chinese University Of Hong Kong, Shenzhen, China

A Reporting Quality Assessment AI-model for Adapted Guidelines in Health Care: The Ad@pt-AI

Abstract

Background: The RIGHT-Ad@pt checklist is an international standard for assessing and guiding the reporting of adapted guidelines. However, the assessment process is repetitive and time-consuming. Large language models (LLMs) have demonstrated the ability to process and generate human language, offering a promising way to improve efficiency.

Objective: To develop Ad@pt-AI, a prompt-driven automated tool leveraging an existing LLM to assess the reporting quality of adapted guidelines based on the RIGHT-Ad@pt checklist.

Methods: We will develop and validate Ad@pt-AI in multiple stages: 1) establishing a prototype using an existing LLM and creating initial prompts based on the RIGHT-Ad@pt checklist and user guide, enabling the AI to assess the 34 checklist items and retrieve relevant quotes; 2) iteratively optimizing Ad@pt-AI through prompt engineering until an F1 score of 0.75 is achieved; 3) conducting user testing of the optimized Ad@pt-AI in internal and external settings. We built a ground-truth set by annotating 120 adapted guidelines identified from previous scoping reviews. In user testing, each sample will be assessed in three ways: 1) by the Ad@pt-AI model, 2) by a combined scholar evaluation team (internal and external scholars), and 3) by a hybrid scholar-AI approach.
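
As a rough illustration of the optimization target in stage 2, the sketch below computes an F1 score over binary per-item judgments ("adequately reported" or not) for the 34 checklist items. This is a minimal sketch under assumed representations, not the authors' implementation.

```python
# Hypothetical sketch: F1 over binary "adequately reported" judgments for
# the 34 RIGHT-Ad@pt items, comparing model output to human annotation.

def f1_score(predicted: list[bool], ground_truth: list[bool]) -> float:
    """Harmonic mean of precision and recall over item-level judgments."""
    tp = sum(p and g for p, g in zip(predicted, ground_truth))
    fp = sum(p and not g for p, g in zip(predicted, ground_truth))
    fn = sum(not p and g for p, g in zip(predicted, ground_truth))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Prompts would be refined iteratively until f1_score(...) >= 0.75
# on the annotated ground-truth set.
TARGET_F1 = 0.75
```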

Results: We have established the Ad@pt-AI prototype and are currently finishing the annotation. We will report the optimization process and results at the GIN conference 2025.

Discussion for scientific abstracts: Ad@pt-AI can save time and resources while ensuring substantial accuracy and consistency in assessing the reporting of adapted guidelines. Its establishment process will also inform future automation in adapting clinical practice guidelines.

Paper Number

428

Biography

Simin Zhu earned her Bachelor’s degree in Pharmacy from Sun Yat-sen University and is currently pursuing a PhD under the supervision of Prof. Yang Song at the School of Medicine, The Chinese University of Hong Kong, Shenzhen, China. Her research interests include clinical guideline development methodology, evidence synthesis and appraisal, knowledge translation, and the application of artificial intelligence in healthcare.
Prof Bernardo Sousa-Pinto
Faculty Of Medicine, University Of Porto

On the use of Artificial Intelligence-based platforms to support searches in evidence synthesis studies

Abstract

Background: Artificial intelligence (AI)-based platforms specialized in the identification of scientific publications may potentially support searches in evidence synthesis studies.

Objective: To evaluate how AI-based platforms can support the search for primary studies in the context of evidence synthesis, and to propose a structured, reproducible and platform-agnostic methodological framework.

Methods: As our main case study, we queried three AI-based platforms to identify randomised controlled trials (RCTs) investigating mometasone in patients with rhinitis. We evaluated the performance of several strategies (i.e., prompting strategies and search approaches), comparing how many eligible RCTs we identified using AI-based platforms versus a systematic search of multiple electronic bibliographic databases. We also compared meta-analytical results obtained with the RCTs identified by querying AI-based platforms versus those obtained in the context of a systematic review. We developed a methodological framework and a reporting checklist for the use of these AI-based platforms.

Results: A strategy involving searching multiple AI-based platforms identified 56.3% of eligible RCTs investigating mometasone in patients with rhinitis (85.7% of all full papers with a DOI). Meta-analytical estimates were similar when considering only the RCTs identified using AI-based strategies versus those identified in the context of a systematic review. Consistent results were obtained when querying the AI-based platforms one month later.

Discussion for scientific abstracts: Querying AI-based tools can support searches in evidence synthesis studies, complementing classical methods or helping to test the comprehensiveness of search queries for electronic bibliographic databases. However, AI-based tools can have shortcomings in identifying gray literature.
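
For illustration, the headline coverage figure corresponds to a simple set overlap between AI-retrieved records and the systematic-review reference set. The sketch below is an assumption about that computation, keyed by DOI with placeholder identifiers, not the authors' pipeline.

```python
# Hypothetical sketch: share of reference RCTs (keyed by DOI) that an
# AI-based platform search also retrieved.

def search_coverage(ai_hits: set[str], reference_rcts: set[str]) -> float:
    """Fraction of the reference set found by the AI-based search."""
    if not reference_rcts:
        return 0.0
    return len(ai_hits & reference_rcts) / len(reference_rcts)

reference = {"10.1000/rct1", "10.1000/rct2", "10.1000/rct3", "10.1000/rct4"}
ai_found = {"10.1000/rct1", "10.1000/rct3", "10.1000/extra"}
print(f"Coverage: {search_coverage(ai_found, reference):.1%}")  # Coverage: 50.0%
```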

Paper Number

94

Biography

Bernardo Sousa-Pinto (MD, PhD) is an Assistant Professor at the Faculty of Medicine of the University of Porto (Porto, Portugal). He obtained his medical degree in 2016 and completed his PhD in Clinical and Health Services Research in 2019. His main methodological fields of interest are evidence synthesis, health economic evaluation studies, data science, and guidelines and health decision-making. Bernardo is a member of the GRADE Working Group and coordinator of the GRADE Portugal Network. He has participated in the coordination of the Allergic Rhinitis and its Impact on Asthma (ARIA) 2024-2025 guidelines.
Prof Gerald Gartlehner
UWK Krems, RTI International

Semi-Automated Data Extraction with a Large Language Model: A Study Within Reviews

Abstract

Background. Data extraction is a critical but error-prone and labor-intensive task in evidence synthesis. Unlike other artificial intelligence (AI) technologies, large language models (LLMs) do not require labeled training data for data extraction.

Objective. To compare an AI-assisted to a traditional human-led data extraction process.

Methods. Study within reviews (SWAR) based on six ongoing systematic reviews of the US Agency for Healthcare Research and Quality. Within each review, we conducted a prospective, parallel-group comparison of a traditional human-led data extraction approach with an AI-assisted approach. In the AI-assisted approach, the LLM Claude conducted the initial data extraction, which was then verified by a human reviewer. Blinded data adjudicators compared the results of the two data extraction approaches and resolved discrepancies by checking the study reports.

Results. The six systematic reviews of the SWAR contributed 9,341 data elements, extracted from 63 randomized and non-randomized studies. Concordance between the two methods was 77.2%. The accuracy of the AI-assisted approach compared with the adjudicated reference standard was 91.0%, with a recall of 89.6% and a precision of 98.9%. The AI-assisted approach had fewer incorrect extractions (9.0% vs. 11.0%) and similar risks of major errors (2.5% vs. 2.7%) compared to the traditional human-led method, with an average time saving of 44 minutes per study. Missed data items were the most frequent errors in both approaches.
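
To make the reported metrics concrete, the sketch below derives accuracy, recall and precision from per-element verdicts against the adjudicated reference standard. The verdict labels and metric definitions are assumptions for illustration; the study's exact definitions may differ.

```python
# Hypothetical sketch: extraction metrics from per-element verdicts.
# 'correct' = matches reference, 'incorrect' = wrong value extracted,
# 'missed' = element present in reference but not extracted.
from collections import Counter

def extraction_metrics(verdicts: list[str]) -> dict[str, float]:
    c = Counter(verdicts)
    correct, incorrect, missed = c["correct"], c["incorrect"], c["missed"]
    extracted = correct + incorrect   # elements the approach produced
    present = correct + missed        # elements in the reference standard
    return {
        "accuracy": correct / len(verdicts) if verdicts else 0.0,
        "recall": correct / present if present else 0.0,
        "precision": correct / extracted if extracted else 0.0,
    }

print(extraction_metrics(["correct"] * 90 + ["incorrect"] + ["missed"] * 9))
# {'accuracy': 0.9, 'recall': 0.909..., 'precision': 0.989...}
```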

Conclusion. The use of an LLM can improve the efficiency and accuracy of data extraction in evidence synthesis. Results reinforce previous findings that human-led data extraction is prone to errors.

Paper Number

9

Biography

Prof. Gerald Gartlehner is an expert in evidence synthesis methodology with more than 20 years of experience. He currently serves as the associate director of the RTI International-University of North Carolina Evidence-based Practice Center (RTI-UNC EPC) and as co-director of Cochrane Austria. He is also the chair of the Department for Evidence-based Medicine and Evaluation at the University for Continuing Education Krems, Austria, and has teaching appointments at Université Paris Cité and the Karl Landsteiner University for Health Sciences, Austria. Dr Gartlehner is also a member of the Cochrane Editorial Board and a co-convenor of the Cochrane Rapid Review Methods Group.
Mr Christian Cao
Canada
Medical Student
University Of Toronto

Accelerating Evidence Synthesis: Large Language Models Match or Exceed Human Performance in Systematic Review Screening

Abstract

Background: Systematic reviews (SRs) are the highest standard of evidence, guiding clinical guidelines, policy, and research. However, labor-intensive screening delays timely information synthesis.

Objective: To evaluate the performance of large language model (LLM)-driven screening methods compared to dual human screening, the current gold standard.

Methods: We compared the performance of optimized LLM prompts against human reviewers across five SRs covering prevalence, intervention benefits, diagnostic accuracy, and prognosis questions. Single human (screening all available titles/abstracts) and dual human (sequential dual screening of titles/abstracts and full texts) workflows were evaluated. The original review authors' inclusion/exclusion labels were the reference standard.

Results: LLM-based abstract screening (mean sensitivity: 98.5%, specificity: 85.1%) outperformed single human reviewers (mean sensitivity: 89.3%, specificity: 92.9%), achieving similar accuracy in four SRs (p>0.05) and exceeding human sensitivity in all five, with significantly higher sensitivity in three reviews (p<0.05). LLM-based full-text screening (mean sensitivity: 97.4%, specificity: 91.1%) also outperformed dual human reviewer workflows (mean sensitivity: 75.1%, specificity: 97.8%), with significantly higher sensitivity in four SRs (p<0.05). Traditional human screening of 7,000 articles took 530 hours and cost $10,000, while our method completed screening in one day for $430.
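
For readers checking the definitions behind these figures, here is a minimal sketch (not the study's code) of sensitivity and specificity computed over include/exclude decisions against the reference-standard labels.

```python
# Hypothetical sketch: screening performance against reference labels.
# True = include, False = exclude.

def screening_performance(llm: list[bool], truth: list[bool]) -> tuple[float, float]:
    """Returns (sensitivity, specificity) of LLM decisions vs. reference."""
    tp = sum(l and t for l, t in zip(llm, truth))
    fn = sum(not l and t for l, t in zip(llm, truth))
    tn = sum(not l and not t for l, t in zip(llm, truth))
    fp = sum(l and not t for l, t in zip(llm, truth))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # eligible studies kept
    specificity = tn / (tn + fp) if tn + fp else 0.0  # ineligible studies excluded
    return sensitivity, specificity
```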

Discussion: LLMs can match or exceed human performance in SR screening, while significantly reducing time and cost. LLM-driven automation can streamline screening, allowing human reviewers to focus on deeper scientific analysis and accelerating the synthesis of study conclusions.

Paper Number

124

Biography

Christian Cao is a third-year medical student at the University of Toronto interested in the application of large language models in healthcare, particularly their potential to accelerate evidence synthesis and clinical decision-making. Drawing from his background in bioinformatics and experience with SeroTracker, a living systematic review of COVID-19 seroprevalence studies, his current research focuses on developing automated approaches to systematic review screening.
Mr Artur Nowak
CTO
Evidence Prime

AI and Plain Language Summaries: Three Years of Progress and Persistent Challenges (2022–2025)

Abstract

Background: Plain language summaries enhance understanding of recommendations by providing clear, accessible content. At GIN 2022, we demonstrated the feasibility of automating the creation of these summaries, substantially improving readability. However, challenges in factual accuracy remained.

Objective: To evaluate whether more advanced AI models further improve readability and factual accuracy in generating plain language summaries, compared with GPT-3.

Methods: We revisited a dataset of 444 recommendations accompanied by "additional information." We generated plain language summaries using Claude 3.7 Sonnet. Readability was assessed via the Flesch Reading Ease score. Three reviewers qualitatively evaluated readability and factual accuracy on a subset of 40 randomly chosen summaries, comparing the results to the previous GPT-3-based system.
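
The Flesch Reading Ease score is a standard formula, FRE = 206.835 - 1.015 * (words per sentence) - 84.6 * (syllables per word), where higher scores mean easier text. Below is a minimal sketch using the third-party textstat package; the abstract does not state the authors' actual tooling, so this is an assumption.

```python
# Hypothetical sketch: scoring texts with Flesch Reading Ease via the
# `textstat` package (pip install textstat). Higher = easier to read.
import textstat

recommendation = ("Pharmacological thromboprophylaxis is conditionally "
                  "recommended in hospitalized patients at elevated risk.")
plain_summary = ("We suggest a medicine to prevent blood clots for most "
                 "patients in hospital who are at higher risk.")

print(textstat.flesch_reading_ease(recommendation))  # low score: hard to read
print(textstat.flesch_reading_ease(plain_summary))   # higher score: easier
```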

Results: Claude 3.7 Sonnet markedly improved readability compared to GPT-3, increasing the mean Flesch Reading Ease score of recommendations by 42.53 points from baseline (from 9.33 to 51.86), and by 13.87 points relative to GPT-3 (CI: 11.96 to 15.78). Qualitative evaluation showed substantial readability improvements, with 38 out of 40 recommendations rated as improved or much improved, compared to 25 in 2022. Regarding factual accuracy, Claude 3.7 Sonnet significantly reduced outright hallucinations compared to GPT-3 (1 vs. 4 cases). However, it frequently introduced oversimplifications or misinterpretations (7 cases), effectively doubling the overall rate of summaries requiring fact-checking (8 vs. 4 previously).

Discussion: While the last three years have brought enormous improvements in model quality, ensuring factual accuracy remains a nontrivial task. This area therefore requires further careful research, and we advise against using these tools without human supervision.

Paper Number

337

Biography

Artur Nowak is the co-founder of Evidence Prime, a company at the forefront of integrating artificial intelligence into evidence-based medicine. He holds a Master's degree in Computer Science from Jagiellonian University, specializing in natural language processing and information retrieval—key areas in the efficient synthesis and analysis of medical evidence. As a seasoned software engineer, Artur leads Evidence Prime's AI team and directs product strategy. His research, featured in peer-reviewed journals, focuses on advancing GRADE guidelines, applying AI to evidence synthesis, and developing user-friendly software for healthcare professionals.