Shi et al. have written an interesting and timely piece on the reliability of large language models (LLMs) in producing medical research manuscripts. They introduce MedResearchBench, a benchmarking tool to assess the reliability of LLM outputs, and report how different LLMs perform on their programmatic benchmark.
My biggest concern relates to the authors’ finding that the proposed benchmark’s strongest ‘signal’ is citation integrity: the LLMs had high hallucination rates (from 2.9% to 36.8% across systems), so detecting hallucinated references served as a fingerprint of low-integrity or problematic research.
This is undermined, however, by the authors’ parallel finding that adding a “citation verification and repair pipeline to the best-performing system improved its citation integrity score from 40.0 to 90.9”. The authors acknowledge this as a limitation, but still report that “multi-agent quality assurance can bridge the gap between fluent text generation and trustworthy scholarship”. The findings do not seem consistent with this conclusion; rather, they seem to suggest the opposite: that programmatic assurance is easily gamed.
In addition, the manuscript does not address whether the replacement references are relevant to the claims they support — only that they exist (either in CrossRef or PubMed). So the benchmark is easily gamed in ways that may not improve integrity.
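To make the concern concrete, here is a minimal sketch (my own illustration, not the authors’ pipeline) of an existence-only check against the public CrossRef REST API; any resolvable DOI would pass it, however unrelated it may be to the claim it is cited for:

```python
import requests

def reference_exists(doi: str) -> bool:
    """Existence-only check: does CrossRef resolve this DOI?

    Note that a passing result says nothing about whether the work is
    relevant to the claim it is attached to, which is the gap discussed above.
    """
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

# A repair pipeline that merely swaps a hallucinated reference for any
# existing DOI would score perfectly on a check like this.
print(reference_exists("10.1186/s12916-025-04569-y"))
```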
This is also consistent with our group’s research, which found, for example, that programmatic assessment tools, including iThenticate, can be trivially gamed by introducing a syntactic-alteration step into an LLM-based workflow: https://doi.org/10.1186/s12916-025-04569-y
Similar points about the ‘arms race’ between low-integrity actors and publishers have been made by Marcus Munafo and George Davey Smith: https://doi.org/10.1371/journal.pbio.3003660
The authors may wish to consider reframing their findings by setting them in the context of this arms race.
There are also some limitations of the manuscript that may be worth flagging. It reports a bimodal distribution for ‘with’ and ‘without’ hallucinations, but n = 7 is too few to support distributional claims of this sort.
There is not much detail on the STROBE compliance scoring (D4) methodology, and all seven tools score highly. It might be worth specifying what ‘automated text detection’ under D4 entails.
There are some numerical inconsistencies in the report. For example, in Table 3 the hallucination rate works out to 4.08% (1 failed + 1 corrupted, divided by 49), yet the paper reports 2.9%. This appears to be a formula error in the last column.
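For clarity, the calculation as I read Table 3 is:

\[
\text{hallucination rate} \;=\; \frac{1\ \text{failed} + 1\ \text{corrupted}}{49} \;=\; \frac{2}{49} \;\approx\; 4.08\%,
\]

which does not match the reported 2.9%.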
The authors could usefully acknowledge the limitations of D2 (numerical fidelity) and consider whether low scores under D2 could reflect regex quality as much as numerical accuracy. Similarly, D6 does not offer meaningful differentiation, but could this be because of how the rubric was defined (too coarse in its granularity)? It could also be worth assessing inter-task variability more systematically.
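Purely as a hypothetical sketch (I have not seen the authors’ extraction code), a naive numeric-extraction regex can silently mangle formatted values, which would depress a D2-style fidelity score even when the underlying numbers are correct:

```python
import re

# A naive pattern for decimal numbers, of the kind a D2-style checker might use.
NAIVE_NUMBER = re.compile(r"\d+\.\d+")

text = "The adjusted odds ratio was 1,204.5 (95% CI 0.87 to 1.32)."

# The thousands separator splits the first value, so 1,204.5 is read as 204.5.
print(NAIVE_NUMBER.findall(text))  # ['204.5', '0.87', '1.32']
```

In such a case the mismatch is an artefact of extraction rather than of the model’s numerical accuracy.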
There is also a conflict of interest that could be discussed (or at least made explicit): the authors chose their own benchmarks, set their own thresholds, selected the competing systems, and did not pre-register the workflow. This is not necessarily a problem, but it may be worth mentioning, and pre-registering a workflow can assist with transparency later on about why particular choices were made or models selected.
A very minor point: the computing environment is reported, but this is unnecessary detail, as the whole work (unless I am mistaken) is conducted via API calls, so the operator environment has no bearing on the work or the findings.
Overall, however, I enjoyed reading this work and wish the authors success in their endeavors.
The author declares that they have no competing interests.
The author declares that they did not use generative AI to come up with new ideas for their review.