Write a PREreview

HB-Eval: From Benchmark to Reliability Operating System—A Five-Metric Framework with Triple-Methodology Validation, SIL/ASIL Certification, and Production-Grade Deployment

Published
Server: Preprints.org
DOI: 10.20944/preprints202606.0186.v1

Background: Agentic AI systems are deployed in safety-critical domains where operational reliability under fault conditions determines patient safety, system integrity, and infrastructure continuity. Current evaluation paradigms measure nominal task-completion capability exclusively, providing no mechanism for estimating the capability–reliability gap Δ(π) = C_nom(π) − R_op(π) that separates benchmark performance from operational performance.

Methods: We present HB-Eval OS, a five-metric Reliability Operating System comprising a secured evaluation Gateway, Evaluation-Driven Memory (EDM), and a production SDK (pip install hb-eval-sdk v2.0.0) integrating AES-256-GCM encryption and a Safe Halt protocol. Three fully independent validation methodologies were applied across 14,000 evaluations: Methodology A (6,000 behavioural trajectory experiments across six open-weight architectures and six safety-critical domains), Methodology B (4,998 three-layer constraint verification assessments across five frontier open-weight models), and Methodology C (3,002 evaluations of GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Flash judged by an independent third-party model). A fifth diagnostic metric, the Consistency Stability Index (CSI), is introduced to quantify temporal performance stability across sequential runs.

Results: Methodologies A and B converge on aggregate reliability near 36% (z=0.653, p=0.514, 95% CI ±1.80 pp), confirming the deficit is not a methodological artefact. Methodology C establishes gaps of +7.6 pp, +10.6 pp, and +22.5 pp for GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Flash respectively across 14 architectures from five organisations. The Intentional Recovery Score (IRS) reveals that only 23% of recoveries are memory-guided; the remaining 77% degrade 55 pp under distribution shift. Cascade fault injection imposes a 21.6 pp reliability penalty (z=10.80, p<0.001). A live Gemini API case study demonstrates transition from UNSAFE (PEI = 0.67) to SAFE (PEI = 1.00) through single-prompt refinement guided by HB-Eval OS attribution.

Conclusions: No evaluated model qualifies for Tier 2 or Tier 3 SIL/ASIL certification. The 55 pp IRS distribution-shift divergence and 21.6 pp cascade penalty identify specific, actionable architectural targets. Complete protocols, all 14,000 evaluation records, and the production SDK are released open source.
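
For reviewers who want to sanity-check the statistics quoted in the abstract, the following is a minimal Python sketch of the two quantities involved: the gap Δ(π) = C_nom(π) − R_op(π) expressed in percentage points, and the two-sided p-value implied by a reported z statistic. The function names are hypothetical and are not part of hb-eval-sdk. The abstract does not say which test produced its z values, so the pooled two-proportion z-test below is an assumption, and the rates and counts in the example are illustrative rather than taken from the paper.

```python
import math


def reliability_gap_pp(c_nom: float, r_op: float) -> float:
    """Capability-reliability gap in percentage points: (C_nom - R_op) * 100."""
    return (c_nom - r_op) * 100.0


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard-normal z statistic."""
    # For Z ~ N(0, 1): P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Pooled two-proportion z statistic.

    Assumed test: the abstract does not name the one it used.
    Undefined (division by zero) when the pooled rate is exactly 0 or 1.
    """
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    return (p_a - p_b) / se


if __name__ == "__main__":
    # Hypothetical nominal and operational rates, not values from the paper.
    print(round(reliability_gap_pp(0.90, 0.60), 1))
    # z statistics as reported in the abstract.
    print(round(two_sided_p_from_z(0.653), 3))
    print(two_sided_p_from_z(10.80) < 0.001)
    # Hypothetical success counts for two methodologies.
    print(round(two_proportion_z(400, 1000, 360, 1000), 3))
```

Under a standard-normal reference distribution, z=0.653 corresponds to a two-sided p-value of roughly 0.514, which agrees with the figure reported for the comparison of Methodologies A and B. A reviewer could run the same check against the per-methodology counts in the released evaluation records.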

You can write a PREreview of HB-Eval: From Benchmark to Reliability Operating System—A Five-Metric Framework with Triple-Methodology Validation, SIL/ASIL Certification, and Production-Grade Deployment. A PREreview is a review of a preprint and can range from a few sentences to an extensive report, similar to a peer-review report organized by a journal.

Before you start

We will ask you to log in with your ORCID iD. If you don't have an iD, you can create one.

What is an ORCID iD?

An ORCID iD is a unique identifier that distinguishes you from others with the same or a similar name.

Start now