
PREreview of Evaluating Small Open LLMs for Medical Question Answering: A Practical Framework

Published
DOI
10.5281/zenodo.19626388
License
CC BY 4.0

Summary

This paper tackles a vital, often overlooked bottleneck in clinical AI: the "reliability gap" between average model accuracy and the consistency of its outputs. The author argues convincingly that a medical tool that fluctuates between different answers for the same patient query is fundamentally unsafe, regardless of how "smart" it seems on average. By introducing an open-source evaluation framework that treats reproducibility as a "first-class citizen," the work provides practitioners with a much-needed toolkit for auditing models on local hardware. The finding that even "stable" low-temperature settings (T=0.2) still yield 87–97% unique outputs is a sobering wake-up call for the community. This work effectively moves the field forward by shifting the conversation from "How accurate is this model?" to "How much can we trust this model's next response?"

Strengths

  • Novel Evaluation Axis: Shifting the focus from single-pass accuracy to within-model reproducibility is a timely and necessary contribution to safety-critical NLP.

  • Practical Framework: The three-stage CLI (run → score → report) and the commitment to releasing all code and experimental data ensure the work is highly reproducible and useful for practitioners.

  • Local Inference Focus: By evaluating models that run on commodity workstations (Llama 3.1 8B, Gemma 3 12B, MedGemma 1.5 4B), the paper addresses the reality of clinical settings where data privacy is paramount.

  • Honest Limitations: The author proactively identifies confounding variables, such as model scale vs. fine-tuning, which adds significant scientific integrity to the discussion.

Major Issues & Actionable Suggestions

  • The "Greedy" Baseline & Temperature Sensitivity:

Concern: The choice of T=0.2 is a reasonable production default, but since the core claim centers on the failure of low temperature to buy reproducibility, the lack of a T=0.0 (greedy decoding) baseline is a missed opportunity.

Suggestion: Evaluate the models at T=0.0 or a range of temperatures. This would clarify if the observed instability is inherent to the model's weights or merely a byproduct of any stochastic sampling. A small "Temperature vs. Uniqueness" sensitivity analysis would make the warning to practitioners much more robust.
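A minimal sketch of what such a "Temperature vs. Uniqueness" sweep could look like. The `generate` callable is a placeholder for whatever model call the framework uses, and the normalization here is a deliberately simple lexical one:

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace, mirroring a simple lexical normalization."""
    return " ".join(text.lower().split())

def uniqueness_ratio(samples: list[str]) -> float:
    """Fraction of distinct normalized outputs among repeated generations."""
    normalized = [normalize(s) for s in samples]
    return len(set(normalized)) / len(normalized)

def temperature_sweep(generate, prompt: str, temps=(0.0, 0.2, 0.5, 1.0), n: int = 20):
    """Run n generations per temperature and report the uniqueness ratio.

    `generate(prompt, temperature)` is a placeholder for the model call;
    a ratio of 1.0 means every sample was lexically unique.
    """
    return {t: uniqueness_ratio([generate(prompt, t) for _ in range(n)])
            for t in temps}
```

If uniqueness stays high even at T=0.0 (e.g., due to non-deterministic kernels or batching effects), that would localize the instability to the inference stack rather than the sampler.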

  • Isolating the Fine-Tuning Effect:

Concern: The observation that MedGemma 1.5 4B underperforms is intriguing, but as the author notes, the comparison is confounded by model scale.

Suggestion: Include a same-scale comparison against a general-purpose baseline (e.g., Gemma 3 4B). This is essential to ensure we aren't simply measuring the "tax" of smaller parameter counts rather than a failure of domain adaptation.

  • Refining the Judge’s Rubric:

Concern: Using a 0–1 scalar score for a single-pass LLM-as-judge can be "noisy," especially since the judge itself is stochastic and was not run multiple times.

Suggestion: Move away from subjective 0–1 scoring toward a more deterministic, structured extraction. Asking the judge to check for specific clinical "Safety Flags" or binary factual markers may yield a more stable signal than a floating-point average.
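One possible shape for this, sketched below: ask the judge for strict JSON over a fixed set of binary flags instead of a scalar. The flag names and prompt suffix here are hypothetical illustrations, not the paper's rubric:

```python
import json

# Hypothetical binary rubric the judge is asked to return as strict JSON,
# replacing a free-form 0-1 scalar score.
SAFETY_FLAGS = ["recommends_professional_care", "no_harmful_dosage", "states_uncertainty"]

JUDGE_PROMPT_SUFFIX = (
    "Return ONLY a JSON object with boolean fields: " + ", ".join(SAFETY_FLAGS)
)

def parse_judge_verdict(raw: str) -> dict[str, bool]:
    """Parse the judge's JSON reply; missing or malformed flags default to False."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        verdict = {}
    if not isinstance(verdict, dict):
        verdict = {}
    return {flag: bool(verdict.get(flag, False)) for flag in SAFETY_FLAGS}

def flag_score(verdict: dict[str, bool]) -> float:
    """Aggregate binary flags into a score; each flag is individually auditable."""
    return sum(verdict.values()) / len(verdict)
```

Binary flags also make judge disagreement inspectable per criterion, rather than hidden inside a single float.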

  • Lexical vs. Clinical Consistency:

Concern: "Uniqueness" is currently measured by lexical text normalization.

Suggestion: Incorporate a metric for clinical semantic stability. If a model generates 10 different ways to say "Consult a doctor," it may be lexically unique but clinically consistent. Differentiating between "trivial phrasing variance" and "substantive medical variance" would elevate the paper's practical relevance.
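As an illustration of the distinction, the sketch below maps each answer to a canonical clinical action before measuring uniqueness. The keyword table is a toy placeholder; a real implementation would likely use a clinical NLI or embedding model:

```python
# Toy mapping from surface cues to a canonical clinical recommendation.
CANONICAL_ACTIONS = {
    "consult a doctor": ("see a doctor", "consult your physician", "seek medical advice"),
    "go to the ER": ("emergency room", "call 911", "emergency department"),
}

def clinical_action(answer: str) -> str:
    """Collapse an answer to its canonical recommendation, or 'unmapped'."""
    text = answer.lower()
    for action, cues in CANONICAL_ACTIONS.items():
        if any(cue in text for cue in cues):
            return action
    return "unmapped"

def consistency_report(samples: list[str]) -> dict[str, float]:
    """Compare lexical uniqueness against clinical-action uniqueness."""
    def uniqueness(values):
        return len(set(values)) / len(values)
    lexical = [" ".join(s.lower().split()) for s in samples]
    clinical = [clinical_action(s) for s in samples]
    return {"lexical_uniqueness": uniqueness(lexical),
            "clinical_uniqueness": uniqueness(clinical)}
```

A model with lexical uniqueness near 1.0 but clinical uniqueness near 1/n is paraphrasing, not flip-flopping, and the paper's headline numbers would benefit from making that split explicit.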

Minor Issues

  • The Cost of Ensembling: While the paper provides excellent data on raw throughput, the "real-world" speed of a reliable answer (via majority voting or repeated sampling) would be much lower. Briefly discussing this "reliability-adjusted latency" would be helpful for deployment planning.
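The arithmetic here is simple but worth stating: sequential k-sample majority voting multiplies per-answer latency by roughly k. A minimal sketch, assuming sequential decoding (batched inference would shrink the multiplier toward 1):

```python
from collections import Counter

def majority_vote(samples: list[str]) -> str:
    """Pick the most frequent normalized answer among k samples."""
    normalized = [" ".join(s.lower().split()) for s in samples]
    return Counter(normalized).most_common(1)[0][0]

def reliability_adjusted_latency(per_sample_s: float, k: int) -> float:
    """Wall-clock cost of a k-sample voted answer under sequential sampling."""
    return per_sample_s * k
```

So a model that answers in 2 s becomes a ~10 s system under 5-way voting, which is the number deployment planners actually need.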

  • Chain of Thought (CoT): The models were restricted to ≤ 6 sentences. It would be interesting to discuss if this brevity hurts consistency. Allowing a model to "think aloud" via CoT before giving a final answer can sometimes stabilize the output distribution.

  • Clarity on Judge Selection: A quick sentence explaining why a 20B model was chosen as the judge for models of 4B–12B scale would help readers understand if the judge is considered an authoritative expert or merely a larger peer.

Competing interests

The author declares that they have no competing interests.

Use of Artificial Intelligence (AI)

The author declares that they did not use generative AI to come up with new ideas for their review.
