5.1 - Theme 2. Harnessing Artificial Intelligence, Technology and Digital Innovations in Guideline Development and Implementation
Thursday, September 18, 2025 | 11:00 AM - 12:30 PM
Speaker
Dr Xufei Luo
China
Lanzhou University
Ethical Considerations of Generative AI in the Development of Practice Guidelines: A Cross-sectional Review
Abstract
Background
Generative artificial intelligence (AI) is increasingly integrated into healthcare and policy-making, including the development of clinical and operational practice guidelines. However, ethical concerns—such as algorithmic bias, accountability, and transparency—remain inadequately addressed, raising risks of inequitable recommendations and eroded trust in guidelines.
Objective
This study systematically evaluates the ethical frameworks and challenges associated with using generative AI in formulating practice guidelines, aiming to identify gaps and propose actionable recommendations for responsible integration.
Methods
A cross-sectional review was conducted by analyzing peer-reviewed articles, institutional reports, and AI ethics guidelines published between 2022 and 2024. Databases including PubMed, Embase, IEEE Xplore, and Google Scholar were searched using terms related to “generative AI,” “ethics,” and “practice guidelines.” Included studies were thematically analyzed to extract ethical themes, regulatory approaches, and stakeholder perspectives.
Results
From 1,452 screened documents, the included studies revealed three key ethical issues: (1) lack of transparency in AI decision-making processes, (2) risks of perpetuating biases from training data, and (3) unclear accountability for guideline errors. Regulatory gaps were prominent in low-resource settings and non-clinical domains. Detailed results will be presented at the meeting.
Discussion
Current ethical frameworks for generative AI in guideline development emphasize technical validation but underprioritize human oversight, equity, and participatory design. Strengthening accountability mechanisms, ensuring diverse dataset curation, and mandating transparency in AI contributions are critical to safeguarding guideline integrity. Future research should focus on co-developing ethical standards with clinicians, policymakers, and marginalized communities to mitigate harms and enhance trust in AI-augmented guidelines.
Paper Number
431
Biography
Xufei Luo works at Lanzhou University, where his research focuses on the development of evidence-based clinical practice guidelines, evidence-based public health decision-making, reporting guidelines, and evidence-based methodologies.
Dr Olivier Blanson Henkemans
Senior Researcher
TNO Child Health
AI-driven decision support system for use of guidelines in Youth Health Care
Abstract
Background
Decision Support Systems (DSS) help transform clinical guidelines into actionable recommendations within electronic health records (EHRs). Traditional DSS rely on structured data and rule-based logic, making it challenging to process unstructured information in guidelines and free-text EHR entries. This study explores the integration of generative artificial intelligence (AI) to enhance decision support for Youth Health Care (JGZ) professionals.
Methods
An AI model, based on large language models (LLM), was developed using JGZ guidelines, data protocols, and an ontology, i.e., the "basis dataset". It was validated by analysing 14 pediatric cases (ages 0-4 years), focusing on hip dysplasia, skin conditions, and congenital heart defects. The AI-generated recommendations included the identification of abnormalities, suggested follow-up actions, and assessment of intervention urgency. These AI recommendations were compared with 1) existing guideline-based recommendations and 2) real-world decisions made by JGZ professionals.
Results
* Full agreement between AI-generated recommendations and guidelines in distinguishing normal vs. abnormal findings.
* 60% agreement on follow-up actions, with discrepancies primarily due to missing contextual factors, such as patient history, regional guidelines, and professional expertise.
* AI-generated recommendations provided relevant and actionable educational guidance for parents, supporting informed decision-making.
Conclusions
AI-driven decision support can effectively assist JGZ professionals, particularly in identifying abnormalities and providing tailored education and advice. However, human expertise remains essential for context-based decision-making. Future improvements should focus on integrating historical client data, accommodating regional variations, and providing an interactive user interface to enhance AI-human collaboration in clinical practice.
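To make the comparison described in the Methods concrete, the sketch below is a minimal, hypothetical illustration (not the TNO implementation) of how agreement between AI-generated recommendations and the guideline-based or professional reference decisions could be tallied across cases; the case structure, labels, and example values are assumptions.

```python
from dataclasses import dataclass

@dataclass
class CaseComparison:
    case_id: str
    ai_abnormal: bool      # AI: finding flagged as abnormal?
    ref_abnormal: bool     # guideline / JGZ professional: finding flagged as abnormal?
    ai_follow_up: str      # AI-suggested follow-up action
    ref_follow_up: str     # reference follow-up action

def agreement(cases):
    """Share of cases where the AI matches the reference on each outcome."""
    n = len(cases)
    return {
        "abnormality_agreement": sum(c.ai_abnormal == c.ref_abnormal for c in cases) / n,
        "follow_up_agreement": sum(c.ai_follow_up == c.ref_follow_up for c in cases) / n,
    }

# Two hypothetical cases for illustration:
cases = [
    CaseComparison("case-01", True, True, "refer for hip ultrasound", "refer for hip ultrasound"),
    CaseComparison("case-02", False, False, "routine check-up", "earlier follow-up visit"),
]
print(agreement(cases))  # {'abnormality_agreement': 1.0, 'follow_up_agreement': 0.5}
```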
Paper Number
519
Biography
Dr. Olivier Blanson Henkemans is a senior TNO researcher, specializing in human-computer interaction and digital health innovation. He focuses on developing user-friendly, AI-driven solutions to improve youth healthcare, collaborating with children, parents, and professionals to co-create impactful technologies. With a PhD from TU Delft on enhancing eHealth with personal computer assistants, he integrates AI, data spaces, and privacy-enhancing technologies (PETs) into scalable healthcare innovations. Olivier actively contributes to strategic developments within Health & Work, shaping the future of digital health. His passion lies in translating cutting-edge research into practical solutions that improve well-being and healthcare accessibility.
Dr Xufei Luo
China
Lanzhou University
Reporting guideline for the use of chatbots and other generative Artificial Intelligence toolS in mEdical research: the RAISE Statement
Abstract
Background
Generative artificial intelligence (GAI) tools can enhance the quality and efficiency of medical research, but their improper use may result in plagiarism, academic fraud, and unreliable findings. Transparent reporting of GAI use is essential, yet existing guidelines from journals and institutions are inconsistent, with no standardized principles.
Objective
To address this, we developed the RAISE checklist (Reporting guideline for the use of chatbots and other generative Artificial Intelligence toolS in mEdical research) through an international, multidisciplinary expert group.
Methods
The development process included a scoping review, two Delphi rounds, and virtual meetings with 51 experts from 26 countries.
Results
The final checklist comprises nine reporting items: general declaration, GAI tool specifications, prompting techniques, tool’s role in the study, declaration of new GAI model(s) developed, AI-assisted sections in the manuscript, content verification, data privacy, and impact on conclusions.
Discussion
RAISE provides a universal and standardised guideline for GAI use in medical research, ensuring transparency, integrity, and quality.
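As an illustration of how the nine reporting items could be operationalized in practice, the sketch below records a manuscript's GAI-use declaration as a simple structure; the field names and example statements are illustrative assumptions, not part of the RAISE Statement itself.

```python
# Hypothetical, structured GAI-use declaration mirroring the nine RAISE items.
raise_declaration = {
    "general_declaration": "Generative AI was used during drafting; details below.",
    "gai_tool_specifications": "Chatbot name, version, and access date recorded here.",
    "prompting_techniques": "Zero-shot prompts; full prompts available on request.",
    "role_in_study": "Language editing of the introduction and discussion sections.",
    "new_gai_models_developed": "None.",
    "ai_assisted_manuscript_sections": ["Introduction", "Discussion"],
    "content_verification": "All AI-assisted text checked against cited sources by the authors.",
    "data_privacy": "No patient-identifiable data were entered into the tool.",
    "impact_on_conclusions": "None; AI was not used for analysis or interpretation.",
}
```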
Paper Number
430
Biography
Xufei Luo works at Lanzhou University, where his research focuses on the development of evidence-based clinical practice guidelines, evidence-based public health decision-making, reporting guidelines, and evidence-based methodologies.
Miss Bingyi Wang
China
Lanzhou University
Evaluating the Quality of Clinical Practice Guidelines Using the AGREE II Instrument: A Comparative Analysis between GPT-4o and Human Appraisers
Abstract
Background: In interpersonal and human-computer interaction (HCI), natural language processing significantly enhances communication efficiency, with large language models (LLMs) like GPT-4o demonstrating advanced capabilities in contextual analysis through deep learning architectures. While educational psychology substantiates the efficacy of positive feedback in human performance optimization, its impact on LLM functionality, particularly in critical healthcare applications such as clinical practice guideline evaluation using standardized tools like AGREE II, remains inadequately investigated.
Objectives: This study systematically examines whether encouraging linguistic prompts enhance GPT-4o's performance in assessing clinical guideline quality compared to neutral prompts.
Methods: Utilizing 28 clinical guidelines with human AGREE II evaluations from PubMed as the gold standard, we conducted a controlled experiment where GPT-4o's experimental group received encouraging prompts (e.g., "Please analyze meticulously") while the control group received neutral instructions. Methodological rigor was ensured through intraclass correlation coefficient (ICC) analysis, Bland-Altman agreement assessments, and paired sample t-tests comparing domain score variances between LLM and human evaluations.
Results: Quantitative analysis revealed superior alignment in the encouragement cohort, demonstrating a 0.632% mean score differential from human ratings (95% LoA: -35.354% to 47.996%) with 89.9% of deviations within ±33.3% clinical acceptability thresholds. Comparatively, neutral prompts yielded a 12.471% mean divergence (95% LoA: -30.566% to 55.508%) and 81.5% threshold compliance. Statistical confirmation via paired t-tests (p<0.05) established the significant mitigating effect of encouraging language on scoring discrepancies across AGREE II domains, substantiating LLM sensitivity to motivational linguistic cues in clinical evaluation contexts.
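For readers who want to reproduce this kind of agreement analysis, the sketch below shows how the Bland-Altman bias and 95% limits of agreement, the share of deviations within a ±33.3% acceptability threshold, and a paired t-test could be computed for one prompt condition. The scores here are synthetic and the thresholds are taken from the abstract, so this is an assumption-laden illustration rather than the authors' code.

```python
import numpy as np
from scipy import stats

def bland_altman(human, model):
    """Return mean difference (bias) and 95% limits of agreement between score vectors."""
    diff = model - human
    bias = diff.mean()
    spread = 1.96 * diff.std(ddof=1)
    return bias, (bias - spread, bias + spread)

def compare_prompt_condition(human, model, threshold=33.3):
    bias, (lo, hi) = bland_altman(human, model)
    within = np.mean(np.abs(model - human) <= threshold) * 100   # % within clinical acceptability
    t_stat, p_value = stats.ttest_rel(model, human)              # paired t-test on domain scores
    return {"bias_%": bias, "LoA_%": (lo, hi), "within_threshold_%": within, "p_value": p_value}

# Synthetic example: 28 guidelines x 6 AGREE II domains, scaled domain scores in %.
rng = np.random.default_rng(0)
human_scores = rng.uniform(30, 90, size=28 * 6)
gpt4o_scores = human_scores + rng.normal(0.6, 20, size=human_scores.size)
print(compare_prompt_condition(human_scores, gpt4o_scores))
```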
Paper Number
222
Biography
Bingyi Wang, a master's degree candidate, has extensive expertise in conducting systematic reviews, encompassing retrieval, screening, and information extraction. She has given several lectures in training courses on the steps of systematic reviews. To date, she has published five articles, including one systematic review, and has participated in the production of several others.
Prof Bernardo Sousa-Pinto
Faculty Of Medicine, University Of Porto
ReMarQ: A new tool to support the AI-based assessment of the methodological quality of systematic reviews
Abstract
Background: Published systematic reviews (SRs) display heterogeneous methodological quality. Poorly conducted SRs can have biased results and/or inadequate conclusions. Therefore, guideline developers should carefully assess the methodological quality of the SRs they use as sources of evidence. However, such an assessment can be time-consuming.
Objective: (i) To develop a tool to assess the reported methodological quality of SRs and (ii) to evaluate its implementation through large language models (LLMs).
Methods: We developed a new tool – ReMarQ – consisting of 26 dichotomous items to evaluate the reported methodological quality of SRs. We applied an Item Response Theory model to assess the difficulty and discrimination of ReMarQ items. We assessed the performance of five foundational and three fine-tuned LLMs in implementing ReMarQ in a sample of 100 medical SRs. In particular, we compared the answers provided by the LLMs to each of the 26 items of ReMarQ with those provided independently by human reviewers.
Results: The best performing LLM was a fine-tuned GPT-3.5 model (mean accuracy=96.5%; mean kappa coefficient=0.90; mean F1-score=0.91). When compared with a human reviewer, this model displayed an accuracy above 80% and a kappa coefficient higher than 0.60 for all individual items. The model produced consistent results when the analysis was repeated 60 times on the same sample.
Discussion: We have developed a new tool to assess the methodological quality of SRs. Its implementation through LLMs – which demonstrated high accuracy and reliability – may increase the efficiency of the methodological evaluation of SRs.
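The per-item comparison against human reviewers can be sketched as follows; this is a hypothetical illustration (not the authors' pipeline) of computing accuracy, Cohen's kappa, and F1 for each of the 26 dichotomous ReMarQ items, using synthetic answers.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

def evaluate_items(human, llm):
    """human, llm: 0/1 arrays of shape (n_reviews, n_items) with answers per ReMarQ item."""
    results = []
    for item in range(human.shape[1]):
        h, m = human[:, item], llm[:, item]
        results.append({
            "item": item + 1,
            "accuracy": accuracy_score(h, m),
            "kappa": cohen_kappa_score(h, m),
            "f1": f1_score(h, m, zero_division=0),
        })
    return results

# Synthetic example: 100 systematic reviews x 26 dichotomous items, ~90% agreement.
rng = np.random.default_rng(1)
human_answers = rng.integers(0, 2, size=(100, 26))
llm_answers = np.where(rng.random(human_answers.shape) < 0.9, human_answers, 1 - human_answers)
for row in evaluate_items(human_answers, llm_answers)[:3]:
    print(row)
```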
Paper Number
93
Biography
Manuel Marques-Cruz (MD, PhD) is an Invited Assistant Professor at the Faculty of Medicine of the University of Porto and a researcher at RISE-Health (Health Research Network). He obtained his Medical Degree (2016) and his PhD in Health Data Science (2025).
He bridges his medical expertise and data science skills to inform and shape evidence-based health policies. His work focuses on large language models applied to evidence synthesis studies and clinical guidelines.
He is a member of the GRADE Working Group, being particularly active in the GRADE Artificial Intelligence interest groups.
