
Citation Hallucination Determines Success: An Empirical Comparison of Six Medical AI Research Systems

Published
Server
medRxiv
DOI
10.64898/2026.04.02.26350091

Large language model (LLM) systems can now generate complete research manuscripts, yet their reliability in clinical medicine — where citation accuracy and reporting standards carry direct consequences — has not been systematically assessed. We introduce MedResearchBench, a benchmark of three clinical epidemiology tasks built on NHANES data, and use it to evaluate six AI research systems across six quality dimensions. Evaluation combines programmatic citation verification, rule-based reporting compliance checks, and multi-model LLM judging, providing a more discriminative assessment than conventional single-judge approaches.
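The paper does not give its verification code, but the core idea of programmatic citation verification can be sketched as follows: each cited DOI is checked for well-formedness and looked up against a trusted reference index (standing in for a live registry query such as Crossref), and the hallucination rate is the fraction of citations that fail. All names and values here are illustrative assumptions, not the authors' implementation.

```python
import re

# Hypothetical sketch: flag citations whose DOIs are malformed or absent
# from a trusted reference index (a stand-in for a live registry lookup).
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def hallucination_rate(citation_dois, known_dois):
    """Fraction of citations that fail programmatic verification."""
    if not citation_dois:
        return 0.0
    failures = 0
    for doi in citation_dois:
        if doi is None or not DOI_PATTERN.match(doi) or doi not in known_dois:
            failures += 1
    return failures / len(citation_dois)

# Example: one of three cited DOIs does not verify against the index.
index = {"10.1000/real.1", "10.1000/real.2"}
cites = ["10.1000/real.1", "10.1000/real.2", "10.9999/fabricated.3"]
rate = hallucination_rate(cites, index)  # 1/3
```

In a real pipeline the index lookup would be replaced by a resolver query, but the metric itself reduces to this simple ratio.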

Citation integrity emerged as the decisive quality dimension. Hallucination rates ranged from 2.9% to 36.8% across systems, and a hard-rule threshold on per-task citation scores capped four of six systems' total scores at the penalty ceiling. Adding a multi-agent citation verification and repair pipeline to the best-performing system improved its citation integrity score from 40.0 to 90.9 and raised the weighted total from 68.9 to 81.8. Strikingly, a single-model evaluation ranked this system last (55.5), while our three-tier framework ranked it first (81.8), a complete reversal that exposes the limitations of subjective LLM-only evaluation.
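The hard-rule mechanism described above can be illustrated with a minimal scoring sketch: a weighted sum over quality dimensions that is capped at a penalty ceiling whenever any per-task citation score falls below a threshold. The threshold, ceiling, weights, and scores below are all illustrative placeholders, not the paper's actual parameters.

```python
# Hypothetical sketch of the hard-rule penalty: if any per-task citation
# score is below the threshold, the weighted total is capped at the
# penalty ceiling. All numeric values are illustrative assumptions.
def weighted_total(dimension_scores, weights, citation_scores,
                   threshold=60.0, ceiling=50.0):
    total = sum(s * w for s, w in zip(dimension_scores, weights))
    if any(c < threshold for c in citation_scores):
        return min(total, ceiling)  # hard-rule cap
    return total

# A system with strong prose but one failing citation score is capped:
# the uncapped weighted sum is 85.5, but the cap brings it down to 50.0.
capped = weighted_total([90, 85, 80], [0.4, 0.3, 0.3], [95, 40])
```

This kind of cap explains how a system can rank well on fluency-oriented judging yet score poorly overall: a single unreliable citation dimension dominates the final number.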

These results suggest that programmatic citation verification should be a core metric in future evaluations of AI scientific writing systems, and that multi-agent quality assurance can bridge the gap between fluent text generation and trustworthy scholarship.
