PREreviews of “Evaluating Small Open LLMs for Medical Question Answering: A Practical Framework”

Evaluating Small Open LLMs for Medical Question Answering: A Practical Framework

by Avi-ad Avraam Buskila

Posted: April 12, 2026
Server: arXiv
DOI: 10.48550/arxiv.2604.10535

Abstract

Incorporating large language models (LLMs) in medical question answering demands more than high average accuracy: a model that returns substantively different answers each time it is queried is not a reliable medical tool. Online health communities such as Reddit have become a primary source of medical information for millions of users, yet they remain highly susceptible to misinformation; deploying LLMs as assistants in these settings amplifies the need for output consistency alongside correctness. We present a practical, open-source evaluation framework for assessing small, locally-deployable open-weight LLMs on medical question answering, treating reproducibility as a first-class metric alongside lexical and semantic accuracy. Our pipeline computes eight quality metrics, including BERTScore, ROUGE-L, and an LLM-as-judge rubric, together with two within-model reproducibility metrics derived from repeated inference (N=10 runs per question). Evaluating three models (Llama 3.1 8B, Gemma 3 12B, MedGemma 1.5 4B) on 50 MedQuAD questions (N=1,500 total responses) reveals that despite low-temperature generation (T=0.2), self-agreement across runs reaches at most 0.20, while 87-97% of all outputs per model are unique -- a safety gap that single-pass benchmarks entirely miss. The clinically fine-tuned MedGemma 1.5 4B underperforms the larger general-purpose models on both quality and reproducibility; however, because MedGemma is also the smallest model, this comparison confounds domain fine-tuning with model scale. We describe the methodology in sufficient detail for practitioners to replicate or extend the evaluation for their own model-selection workflows. All code and data pipelines are available at https://github.com/aviad-buskila/llm_medical_reproducibility.
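The abstract does not define the two reproducibility metrics, so the following is only a minimal sketch of how within-model consistency over repeated runs could be measured. Everything in it is an assumption for illustration: the function names, the whitespace and case normalization, and the use of exact-match pairwise agreement. The actual framework in the linked repository may define self-agreement and uniqueness differently. The last helper simply reproduces the response count stated in the abstract (3 models, 50 questions, 10 runs each).

```python
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization step)."""
    return " ".join(text.lower().split())


def unique_fraction(responses: list[str]) -> float:
    """Share of responses that are distinct after normalization."""
    if not responses:
        raise ValueError("need at least one response")
    normalized = [normalize(r) for r in responses]
    return len(set(normalized)) / len(normalized)


def pairwise_exact_agreement(responses: list[str]) -> float:
    """Fraction of run pairs whose normalized outputs match exactly.

    This is a hypothetical stand-in for a self-agreement score, not the
    paper's definition.
    """
    if len(responses) < 2:
        raise ValueError("need at least two responses to form a pair")
    normalized = [normalize(r) for r in responses]
    pairs = list(combinations(normalized, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


def total_responses(models: int, questions: int, runs: int) -> int:
    """Number of generations in a full repeated-inference sweep."""
    return models * questions * runs


if __name__ == "__main__":
    runs = ["yes", "Yes", "no", "maybe"]  # toy outputs for one question
    print(unique_fraction(runs))
    print(pairwise_exact_agreement(runs))
    print(total_responses(models=3, questions=50, runs=10))
```

Exact matching is the strictest possible choice; a semantic similarity score between runs would give a softer notion of agreement, at the cost of depending on an embedding model.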

1 PREreview

PREreview by Mattia Gaggi

Summary

This paper tackles a vital, often overlooked bottleneck in clinical AI: the "reliability gap" between average model accuracy and the consistency of its outputs. The author argues convincingly that a medical tool that fluctuates between different answers for the same patient query is…
