PREreview of Set-up, validation, evaluation, and cost-benefit analysis of an AI-assisted assessment of responsible research practices in a sample of life science publications

Published
DOI: 10.5281/zenodo.19359570
License: CC0 1.0

Summary

This study explores the feasibility of using Large Language Models (LLMs) to automate the screening of scientific publications for Responsible Research Practices (RRPs), such as the reporting of randomization, blinding, and sample size calculations. By comparing the performance of four proprietary LLMs against a "gold standard" of three human reviewers across 52 life science papers, the authors demonstrated that optimized LLMs (specifically Gemini 1.5 Pro) can achieve accuracy (~90%) comparable to a single human reviewer (~86%), suggesting that AI can effectively replace one human in a standard dual-reviewer evidence synthesis pipeline.

Major issues

  • Consensus Criteria: The authors narrowed RRPs down to 12 indicators through a three-round Delphi process. However, the specific decision-making criteria used for consensus were not fully disclosed. Clarification is needed on whether this was based on statistical thresholds, average scores, or majority vote.

  • Comparison Groups: An additional suggested comparison is Human + LLM vs. Human Expert vs. LLM. Exploring human-AI teaming could reveal synergistic benefits of combining human and LLM review.

  • Methodological Rigor: The Delphi study and the "gold standard" human assessment (two independent reviewers plus a third for reconciliation) are significant strengths of the paper.

  • Validation Split: With 37 papers for training and only 15 for validation, the small validation set limits claims of generalizability across diverse scientific sub-disciplines. The choice of papers from the BOX Program is surprising and needs justification. Why not sample from the most cited or most recent papers instead?

  • Affirmative Bias: A critical finding is the LLM’s tendency to report a practice was followed when it was missing. This makes LLMs less reliable at confirming the absence of information.

  • Prompting Reproducibility: Prompt optimization was performed manually by a single researcher. The lack of a standardized, automated protocol may limit the reproducibility of the results.

  • Sample Size: The sample (n=52) may be insufficient to capture the linguistic variability of life sciences writing. The authors should provide justification or benchmarks supporting the reliability of this sample size.

  • Data Leakage: Figure 3 appears to combine training and validation sets. Metrics (Precision, Recall, F1-score) should be reported exclusively for the independent validation set to avoid performance overestimation (a minimal sketch illustrating this, and the uncertainty implied by the sample size, follows this list).
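
To make the data-leakage and sample-size concerns concrete, below is a minimal sketch of how the metrics could be reported for the held-out validation set only, together with rough 95% confidence intervals on accuracy. The file name, column names, and split labels are illustrative assumptions, not the authors' actual data structure.

```python
import math

import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical per-indicator assessments; file and column names are assumed for illustration.
df = pd.read_csv("llm_vs_gold_standard.csv")  # columns: split, gold_label, llm_label (0/1)

# Report Precision/Recall/F1 exclusively on the independent validation set,
# keeping the 37 training papers used for prompt optimization out of the evaluation.
val = df[df["split"] == "validation"]
precision, recall, f1, _ = precision_recall_fscore_support(
    val["gold_label"], val["llm_label"], average="binary"
)
print(f"validation-only: precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")

# Rough 95% Wald interval on accuracy, to show how wide the uncertainty remains
# with n = 52 papers overall and only n = 15 in the validation split.
def accuracy_ci(acc: float, n: int, z: float = 1.96) -> tuple[float, float]:
    half_width = z * math.sqrt(acc * (1.0 - acc) / n)
    return max(0.0, acc - half_width), min(1.0, acc + half_width)

print(accuracy_ci(0.90, 52))  # roughly (0.82, 0.98)
print(accuracy_ci(0.90, 15))  # roughly (0.75, 1.00)
```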

Reporting

The preprint would benefit from adhering to the TRIPOD-LLM reporting guidelines (https://tripod-llm.vercel.app/).

Abstract & Model Selection

  • Abstract: Mention the need for human oversight to catch hallucinations, and state the key limitations (data leakage, sample size, outdated models). Avoid language about “replacing” a human reviewer, given other possible setups (such as Human + LLM) and the controversy around this claim; otherwise, justify the implication more thoroughly, since even when overall accuracies match, the specific error behaviors differ.

  • Models: Justify the "availability heuristic" for model selection. Consider comparing proprietary models against open-source options (e.g., Llama 3.3) to address data privacy/security concerns.

Introduction

  • Incorporate the statement from the subsection “Selection of experimental research papers and human review” that the work is “part of a larger endeavour to estimate the impact of higher education courses on research outputs of participants and the application of RRPs.”

Results & Figures

  • Accessibility: Provide analysis scripts in Zenodo (the current file cleaning script appears deprecated).

  • Visuals: Use a confusion matrix to communicate accuracy (TP, FP, TN, FN); see the sketch after this list.

  • Human Assessment: Clarify how the 87% individual accuracy was calculated (F1 vs. basic percentage).

  • Figure 1 Typos: Correct "Refinemnet" and "LMM."

  • Efficiency Results: Report standard deviations across the three reviewers.
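
As a concrete illustration of the confusion-matrix suggestion and of the accuracy-versus-F1 ambiguity noted above, here is a minimal sketch; the labels are invented placeholders, not values from the preprint.

```python
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, confusion_matrix, f1_score

# Placeholder labels (1 = practice reported, 0 = absent); replace with the study's data.
gold = np.array([1, 1, 0, 0, 1, 0, 1, 1, 0, 1])  # human gold standard
llm = np.array([1, 1, 1, 0, 1, 0, 1, 0, 1, 1])   # LLM assessment

tn, fp, fn, tp = confusion_matrix(gold, llm).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")

# Percentage agreement and F1 can differ noticeably, which is why the 87% figure for
# individual human reviewers should state which metric it is.
print(f"accuracy={accuracy_score(gold, llm):.2f}, F1={f1_score(gold, llm):.2f}")

# A plot like this could replace or complement the bar charts (TP/FP/TN/FN at a glance).
ConfusionMatrixDisplay.from_predictions(gold, llm, display_labels=["absent", "reported"])
```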

Supplementary Materials & Data

  • Organization: Ensure supplemental tables are presented in the order they appear in the text. Add direct links to these materials in the article.

  • Reproducibility: Clarify which PDF numbers belong to which papers.

  • Technical Details:

    • Table S2: Verify whether a temperature of 2.0 is valid for the Gemini version used; Gemini 1.0 models cap temperature at 1.0, whereas Gemini 1.5 documents a range up to 2.0 (see the sketch after this list).

    • Report why the Delphi study excluded: code availability, compute environment/FAIR principles, missing data strategy, metadata standards, and explicit ethics approval (IRB/ACUC/GDPR).

  • Repository Management:

    • Enhance the Zenodo metadata.

    • Include a direct GitHub link in the manuscript for convenience.

    • Clarify the licensing on GitHub (the Zenodo archive is CC BY, but the GitHub repository's license is unspecified).
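
On the Table S2 temperature question, the valid range can be checked directly against the API rather than inferred; below is a minimal sketch using the google-generativeai Python client, with the model name and API-key handling as placeholder assumptions.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")  # model name assumed from the manuscript

# Gemini 1.5 documents temperature in [0.0, 2.0]; earlier Gemini versions capped it at 1.0.
# Out-of-range values are rejected by the API, so this call doubles as a check.
try:
    response = model.generate_content(
        "Reply with OK.",
        generation_config={"temperature": 2.0},
    )
    print(response.text)
except Exception as err:
    print(f"temperature=2.0 rejected: {err}")
```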

Minor issues

  • Figure 3B: Was the caption data also calculated across all three human experts/reviewers?

  • Figure 4: Consider a 1:1 aspect ratio and define the lines (e.g., black line as y=x, blue as regression).

  • Figure 5: Fix x-axis label ("paper" to "papers") and add "minutes" to plot labels. Justify why paper reviewing times are listed at ≤ 60 minutes, as literature often suggests 3–5 hours.

  • Terminology: In the Introduction, "Qua overall time" should likely be rephrased to: "This would allow for the partial replacement of human reviewers in such assessment processes, improving efficiency in terms of time and human review effort."

  • Acknowledgements: Use the CRediT taxonomy to clarify author contributions and acknowledge specific contributions by volunteers.

  • Disclosures: Disclose use of LLMs or other automated technologies in the research process and authoring the manuscript.

Competing interests

The authors declare that they have no competing interests.

Use of Artificial Intelligence (AI)

The authors declare that they used generative AI to come up with new ideas for their review.
