PREreview of Evaluating Human-AI Safety: A Framework for Measuring Harmful Capability Uplift
- Published
- DOI: 10.5281/zenodo.19648392
- License: CC BY 4.0
Summary
This position paper proposes a shift in AI safety evaluation from static benchmarks to "Harmful Capability Uplift," a metric quantifying the marginal advantage a user gains from AI assistance. The authors advocate for rigorous three-condition human-subjects experiments (Human-alone, AI-alone, and Human-AI) to measure how models amplify harmful potential, supported by a formal framework for validating safe proxy tasks.
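Under that three-condition design, uplift reduces to the Human-AI condition's performance minus the stronger single-agent baseline. A minimal sketch (hypothetical function names and success-rate scoring, not taken from the paper) might look like:

```python
# Illustrative sketch, not the paper's implementation: estimating harmful
# capability uplift from the three-condition design. Scores are assumed to be
# per-participant success rates on a validated safe proxy task.
from statistics import mean

def capability_uplift(human_alone, ai_alone, human_ai):
    """Marginal advantage of the Human-AI condition over the stronger
    of the two single-agent baselines."""
    baseline = max(mean(human_alone), mean(ai_alone))
    return mean(human_ai) - baseline

# Example with made-up numbers:
print(capability_uplift([0.30, 0.25, 0.40], [0.35, 0.30], [0.55, 0.60, 0.50]))
```

Whether uplift is taken relative to the stronger baseline or to the Human-alone condition alone is a design choice the paper's formal framework would need to pin down explicitly.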
Strengths
This is a top-notch theoretical paper that successfully formalizes an essential, under-researched area of AI safety. By grounding its methodology in established behavioral science and human-computer interaction research, the paper provides a much-needed, actionable, and mathematically rigorous structure for moving the field beyond the "safetywashing" of current static benchmarks.
Potential Improvements
1) Lack of evidence regarding AI as a multi-stage "co-conspirator"
The current framework treats tasks as relatively discrete units and does not address how models can act as co-conspirators, breaking a complex harmful task into seemingly benign, multi-stage sub-tasks.
Action: Future iterations must specifically test for this "incremental assistance" by measuring how models facilitate the planning and debugging phases of a malicious objective, rather than just the final output (a brief sketch follows).
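One way to make that measurable is to score each phase of the workflow separately. A hypothetical sketch (stage names and scoring are illustrative, not from the paper):

```python
# Hypothetical sketch: per-stage uplift, so that help with planning or
# debugging is counted even when the model withholds the final harmful output.
STAGES = ("planning", "acquisition", "debugging", "final_output")

def stage_uplift(human_alone: dict, human_ai: dict) -> dict:
    """Uplift per stage, given mean proxy-task scores keyed by stage name."""
    return {stage: human_ai[stage] - human_alone[stage] for stage in STAGES}

# Example with made-up numbers: large uplift in planning and debugging
# despite little change in the final output.
print(stage_uplift(
    {"planning": 0.2, "acquisition": 0.3, "debugging": 0.1, "final_output": 0.15},
    {"planning": 0.7, "acquisition": 0.4, "debugging": 0.6, "final_output": 0.20},
))
```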
2) Ecological validity gaps in lab-based red teaming
Lab environments often impose a compressed, "get to the point" pace that ignores the reality of persistent, long-form adversarial interaction.
Action: The methodology should explicitly include "sustained interaction" protocols that allow participants to engage in the same iterative, multi-day, or multi-week workflows that a motivated, domain-expert attacker would use to manipulate a model into helping with harmful content.
Competing interests
The author declares that they have no competing interests.
Use of Artificial Intelligence (AI)
The author declares that they did not use generative AI to come up with new ideas for their review.