PREreview of The Reliability of LLMs for Medical Diagnosis: An Examination of Consistency, Manipulation, and Contextual Awareness
- Published
- DOI
- 10.5281/zenodo.17135903
- License
- CC BY 4.0
This review is the result of a virtual, collaborative live review discussion organized and hosted by PREreview and JMIR Publications on August 29, 2025. The discussion was joined by 26 people: 2 facilitators from the PREreview Team, 1 member of the JMIR Publications team, and 23 live review participants, including 1 who agreed to be named: Roseline Dzekem Dine. The authors of this review have dedicated additional asynchronous time over the course of two weeks to help compose this final report using the notes from the Live Review. We thank all participants who contributed to the discussion and made it possible for us to provide feedback on this preprint.
Summary:
Artificial intelligence (AI) is expected to be increasingly applied in patient management. However, several critical factors must be considered, including errors that may arise, false positives that cannot be explored by current benchmarking approaches, and nuances that could be overlooked in real-world practice at the point of care. The study is timely and important, as it raises awareness of how AI can lead to misdiagnosis. By narrowing its focus to diagnosis and employing controlled experimental parameters, this study assessed how certain input perturbations can influence an AI's reasoning. Two prominent large language models (LLMs), Google Gemini 2.0 Flash and ChatGPT-4o, each showed high reliability in producing diagnoses for the 52 artificial clinical vignettes that matched the diagnoses physicians had assigned to the cases. However, the performance of the LLMs worsened after the experimenters altered the vignettes to include unrelated clinical information: Gemini exhibited a 40% diagnostic change rate, and ChatGPT 30%. After "context" information was included in the clinical data presented to the LLMs (such as the Presenting Complaint rephrased to resemble how a patient might convey the same diagnostically relevant information in an emotional, dramatic, and/or ethnically flavored manner), Gemini changed 77.8% of its original diagnoses, and ChatGPT changed 55.6%. Pinpointing the kinds of input variation that are incorrectly ignored, or that unduly influence the LLMs, could yield principles to guide the development of clinically specialized AIs that can be trusted to participate in point-of-care clinical service.
Although the study aimed to examine the reliability of LLMs in medical diagnosis, its emphasis on comparing the two LLMs with each other shifts attention away from that objective. Both models, as noted above, show flaws when provided with irrelevant details and/or relevant context. A more meaningful approach would be to create frameworks that are consistent with real-world clinical practice.
As LLMs continue to develop and expand, they also need to be responsible, transparent, and equitable in healthcare, particularly for patient diagnosis. The study raises critical questions about the fundamental properties of AI systems: while LLMs offer great benefits, with growing expectations for their speed, performance, and applications in clinical practice (including consultation, diagnosis, care management, and specialized tasks), LLM-generated responses are mostly unaccompanied by justifications and unsupported by reliable information sources. Addressing gaps in systemic measures and standardization would provide a better understanding of how to improve LLMs' reliability for patient diagnosis, care delivery, and disease management.
List of major concerns and feedback:
The significance of this important research is obscured by repetition, technical details, figures that restate simple ratios, and excessive attention to comparing the LLMs with each other. Using an experimental method, this paper identified particular qualities of input that cause certain data to have too much influence and other data too little. The two most valuable implications of this paper are: 1) the experimental approach could be extended to enhance analysis of AIs designed for clinical work; and 2) the specific qualities of data (e.g., how the data represent information) pinpointed in this study, and potentially unearthed in elaborations of the present experimental approach, should be carefully addressed during development of better AIs to enhance their reasoning ability.
Please consider the following major comments:
Rewrite the paper to enable its "message" to shine through. Cite existing literature on benchmarks and explain why benchmarks should include the kinds of factors identified in this study, such as a test of an AI's "bias", perhaps along the lines suggested in Chen IY, Alsentzer E. Redefining Bias Audits for Generative AI in Health Care. NEJM AI. Published online August 14, 2025. doi:10.1056/AIp2500015. In particular, a recent article [Gourabathina A, Gerych W, Pan E, & Ghassemi M. (2025). The Medium is the Message: How Non-Clinical Information Shapes Clinical Decisions in LLMs. FAccT '25: Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, pp. 1805-1828. doi:10.1145/3715275.3732121] and a commentary by author Ghassemi ["A Single Typo in Your Medical Records Can Make Your AI Doctor Go Dangerously Haywire", Futurism, https://futurism.com/typo-ai-doctor-haywire], both of which extend beyond diagnosis, discuss how a minor change in input can radically alter treatment recommendations issued by an AI. Citing such material would emphasize the importance of the current paper. Also consider including related research such as Friis, J. K. B. O. (2025). Hermeneutics and Medical Practice. In Hermeneutics at the Intersection of Medical Technology: Interpretation Reimagined (pp. 1-16). Cham: Springer Nature Switzerland.
The Consumer Technology Association (CTA) is creating standards for AI models. Is there some way to put the factors identified in this study into the context of the CTA's work? (The latest standard, "Performance Verification and Validation for Predictive Health AI Solutions", which emphasizes accuracy, data verification, explainability, and real-world testing, does not address generative AI, but generative AI is expected to be covered eventually.)
Regarding the choice of language models, please explain why ChatGPT and Gemini were selected. Furthermore, readers might find it helpful if the authors provided information on the transferability of the results to other language models and to updated versions of the two LLMs used. References should be added to support the data presented in the Rationale for Model Selection section.
It is also unclear whether the authors explored the use of the "experimental Gemini 2.0 Pro," which was available at the time of the study and demonstrated superior performance on complex problems and tasks requiring deeper reasoning, compared to Gemini 2.0 Flash. Please mention the specific ChatGPT model being used in this study.
Please describe the process by which the diagnoses and their case vignettes were decided and created. Pain is a symptom in nearly all the mentioned conditions/specialties; it should not be considered one of the major medical specialties.
Some passages in the text are repeated. For better readability, the text may be shortened. The abstract could be rewritten to be more concise. Please focus on the IMRaD (introduction, methods, results, and discussion) structure.
Please consider statistical significance tests and confidence intervals when comparing groups, such as those in Figure 3, in addition to the descriptive information on relative distributions.
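For instance, the change rates compared between models could be accompanied by confidence intervals and a significance test. The following is a minimal Python sketch of one possible analysis; the counts are placeholders, not figures taken from the preprint, and if the same vignettes were presented to both models, a paired test such as McNemar's may be more appropriate.

```python
# Illustrative only: comparing diagnostic-change rates between two models.
# The counts below are placeholders, not values from the preprint.
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

n_cases = 52                 # vignettes presented to each model
changed = [21, 16]           # hypothetical numbers of changed diagnoses (model A, model B)

# Two-proportion z-test for a difference in change rates
z_stat, p_value = proportions_ztest(count=changed, nobs=[n_cases, n_cases])

# Wilson 95% confidence intervals for each model's change rate
ci_a = proportion_confint(changed[0], n_cases, method="wilson")
ci_b = proportion_confint(changed[1], n_cases, method="wilson")

print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
print(f"Model A: {changed[0]}/{n_cases} changed, 95% CI {ci_a}")
print(f"Model B: {changed[1]}/{n_cases} changed, 95% CI {ci_b}")
```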
The study touches on some distinct concepts, such as "change of reasoning". Introducing definitions of these concepts would benefit the study and help the research community.
In-text citations are absent from the manuscript, though a reference section is present. All works included in the reference section must be cited in the body of the manuscript. Please follow the journal's guidelines for referencing sources and insert citations at the appropriate points in the text: https://support.jmir.org/hc/en-us/articles/115001333067-How-should-references-be-formatted-Which-journal-style-should-I-choose-when-using-EndNote-or-other-reference-management-software
One notable weakness is the omission of radiology-dependent diagnosis as a scenario or category in the case studies. This is a missed opportunity because AI is good at interpreting imaging data and would work well in cases where the diagnosis is heavily dependent on imaging, with irrelevant information having minimal impact.
List of minor concerns and feedback:
Concerns with techniques/analyses
Please explain why the mentioned number of cases (52) was selected and the reasoning for creating a new dataset when many similar datasets already exist.
It is mentioned that "reference standard diagnoses were derived from UpToDate and DynaMed"; however, these sources are not illustrated in the main text or the supplementary materials.
The results for diagnostic consistency are reported without concretely describing the criteria used to measure it. Ideally, the metrics used for this evaluation should be provided in an appendix (such as a list of questions with scores measuring diagnostic consistency).
The results of the study were mainly reported as percentages. Confidence intervals, statistical significance testing, and effect size measures, which are essential for assessing the robustness of differences (Gemini vs. ChatGPT), are not used in the study.
Details for the reproducibility of the study
There is considerable repetition; for example, the objectives are restated several times.
The prompts used to develop the 52 clinical scenarios are not listed anywhere in the main body of the manuscript or the supplementary materials. The key scenario data items are provided as JSON in a separate GitHub repository; adding a complete prompt as an example would help non-technical readers understand the case creation process (see the illustrative sketch below).
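For illustration only, such a supplementary example might resemble the sketch below; the field names, clinical content, and prompt wording here are our assumptions and are not taken from the authors' repository or manuscript.

```python
# Hypothetical example of a scenario record and the prompt built from it.
# All field names and clinical details are invented for illustration.
import json

case = {
    "case_id": "EXAMPLE-01",
    "specialty": "Cardiology",
    "presenting_complaint": "Crushing central chest pain for the past two hours",
    "history": "58-year-old smoker with hypertension and type 2 diabetes",
    "examination": "Diaphoretic; BP 150/95 mmHg; HR 110 bpm",
    "diagnostic_tests": "ECG shows ST elevation in leads II, III, and aVF",
    "reference_diagnosis": "STEMI",
}

prompt = (
    "You are assisting with a diagnostic exercise.\n"
    f"Presenting complaint: {case['presenting_complaint']}\n"
    f"History: {case['history']}\n"
    f"Examination: {case['examination']}\n"
    f"Diagnostic tests: {case['diagnostic_tests']}\n"
    "State the single most likely diagnosis."
)

print(json.dumps(case, indent=2))
print(prompt)
```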
Figures and tables
Figures 2, 3, and 5 do not add any data beyond what is presented in the related tables (data duplication). The same applies to the figures that are not cited in the text. Figures 3-6 partly represent simple statistical facts that could be adequately described in the body text.
Figures 4 and 6-13 are not cited in the article.
Table 9 is not cited in the article.
Additional comments
The paper should more clearly explain what this study is driving at. It is not a matter of how bad LLMs are, nor of whether one is better than the other. Rather, it offers a way to probe areas of weakness in how AI functions. The issue is how research can detect and measure weaknesses in AI reasoning, and how the research approach pioneered in this study can help develop specialized AIs suitable for clinical application.
The abstract is quite long and dense. It could be shortened to highlight the core objectives, methods, and key findings more clearly for readers.
Page 8, cardiovascular diseases: Please identify the abbreviation STEMI as ST-Elevation Myocardial Infarction, HFrEF as Heart Failure with Reduced Ejection Fraction, and HFpEF as Heart Failure with Preserved Ejection Fraction. Add abbreviations such as STEMI, HFrEF, COPD, and GERD to the abbreviations section of the manuscript, as recommended in https://support.jmir.org/hc/en-us/articles/360000249272-Which-abbreviations-don-t-need-to-be-expanded
Page 8, pulmonary diseases: Please identify the abbreviation COPD as Chronic Obstructive Pulmonary Disease.
Page 9, Diagnostic Test Results section: Please identify the abbreviation CXR as Chest X-Ray.
Page 18, table 2: Please identify the abbreviation GERD as Gastroesophageal Reflux Disease
Page 25, numbering of sections: 2.9.1 is repeated.
The manuscript needs better engagement with the literature: appraise, assess, and extract findings from the wider body of relevant sources, and then provide a critical discussion around them.
After validation of the methods, it would be helpful to see a breakdown by manipulation category to determine whether specific types of manipulation have more or less impact on diagnosis changes (see the illustrative sketch below).
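One way to present such a breakdown would be a contingency table of manipulation category versus diagnosis change, with a test of independence. The sketch below is purely illustrative; the category names and counts are placeholders, not data from the study.

```python
# Illustrative only: testing whether diagnosis changes depend on manipulation category.
# Category labels and counts are placeholders, not values from the preprint.
from scipy.stats import chi2_contingency

# Rows: manipulation categories; columns: [diagnosis changed, diagnosis unchanged]
table = [
    [9, 4],   # e.g., irrelevant laboratory values added
    [6, 7],   # e.g., unrelated medication history added
    [3, 10],  # e.g., incidental social history added
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
# With small expected counts, Fisher's exact test (or exact/Monte Carlo methods)
# may be preferable to the chi-square approximation.
```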
Concluding remarks
We thank the authors of the preprint for posting their work openly for feedback. We also thank all participants of the Live Review call for their time and for engaging in the lively discussion that generated this review.
Competing interests
The authors declare that they have no competing interests.