PREreview of Evaluating Human-AI Safety: A Framework for Measuring Harmful Capability Uplift
- Published
- DOI: 10.5281/zenodo.19648392
- License: CC BY 4.0
Summary
This position paper proposes a shift in AI safety evaluation from static benchmarks to "Harmful Capability Uplift," a metric quantifying the marginal advantage a user gains from AI assistance. The authors advocate for rigorous three-condition human-subjects experiments (Human-alone, AI-alone, and Human-AI) to measure how models amplify harmful potential, supported by a formal framework for validating safe proxy tasks.
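Under that three-condition design, uplift reduces to the Human-AI condition's performance minus the stronger single-agent baseline. A minimal sketch (hypothetical function names and success-rate scoring, not taken from the paper) might look like:

```python
# Illustrative sketch, not the paper's implementation: estimating harmful
# capability uplift from the three-condition design. Scores are assumed to be
# per-participant success rates on a validated safe proxy task.
from statistics import mean

def capability_uplift(human_alone, ai_alone, human_ai):
    """Marginal advantage of the Human-AI condition over the stronger
    of the two single-agent baselines."""
    baseline = max(mean(human_alone), mean(ai_alone))
    return mean(human_ai) - baseline

# Example with made-up numbers:
print(capability_uplift([0.30, 0.25, 0.40], [0.35, 0.30], [0.55, 0.60, 0.50]))
```

Whether uplift is taken relative to the stronger baseline or to the Human-alone condition alone is a design choice the paper's formal framework would need to pin down explicitly.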
Strengths
This is a top-notch theoretical paper that successfully formalizes an essential, under-researched area of AI safety. By grounding its methodology in established behavioral science and human-computer interaction research, the paper provides a much-needed, actionable, and mathematically rigorous structure for moving the field beyond the "safetywashing" of current static benchmarks.
Potential Improvements
1) Lack of evidence regarding AI as a multi-stage "co-conspirator"
The current framework treats tasks as relatively discrete units and does not address how models can act as co-conspirators, breaking a complex harmful task into seemingly benign, multi-stage sub-tasks.
Action: Future iterations must specifically test for this "incremental assistance" by measuring how models facilitate the planning and debugging phases of a malicious objective, rather than just the final output (a brief sketch follows).
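One way to make that measurable is to score each phase of the workflow separately. A hypothetical sketch (stage names and scoring are illustrative, not from the paper):

```python
# Hypothetical sketch: per-stage uplift, so that help with planning or
# debugging is counted even when the model withholds the final harmful output.
STAGES = ("planning", "acquisition", "debugging", "final_output")

def stage_uplift(human_alone: dict, human_ai: dict) -> dict:
    """Uplift per stage, given mean proxy-task scores keyed by stage name."""
    return {stage: human_ai[stage] - human_alone[stage] for stage in STAGES}

# Example with made-up numbers: large uplift in planning and debugging
# despite little change in the final output.
print(stage_uplift(
    {"planning": 0.2, "acquisition": 0.3, "debugging": 0.1, "final_output": 0.15},
    {"planning": 0.7, "acquisition": 0.4, "debugging": 0.6, "final_output": 0.20},
))
```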
2) Ecological validity gaps in lab-based red teaming
Lab environments often impose a compressed, "get to the point" pace that ignores the reality of persistent, long-form adversarial interaction.
Action: The methodology should explicitly include "sustained interaction" protocols that allow participants to engage in the same iterative, multi-day, or multi-week workflows that a motivated, domain-expert attacker would use to manipulate a model into helping with harmful content.
Competing interests
The author declares that they have no competing interests.
Use of Artificial Intelligence (AI)
The author declares that they did not use generative AI to come up with new ideas for their review.