Summary
This paper argues that current agent evaluation practice — reporting mean task success rates — fundamentally fails to capture whether agents are reliable enough for real-world deployment. The authors propose a four-dimensional reliability framework decomposed into twelve concrete metrics: consistency (outcome, trajectory, resource), robustness (fault, environment, prompt), predictability (calibration, discrimination, Brier score), and safety (compliance, harm severity). Evaluating 14 agentic models across GAIA and τ-bench, they find that capability gains as measured by accuracy have not translated into proportional reliability improvements, and that the two dimensions can decouple significantly depending on the benchmark.
Strengths
The central thesis is both timely and important. The observation that accuracy and reliability can decouple — that a model can achieve high mean task success while still failing inconsistently, unpredictably, or unsafely — exposes a structural blind spot in how the field evaluates agentic systems. The real-world failure cases cited in the introduction (Replit's database deletion, OpenAI Operator's unauthorized purchase, the NYC government chatbot providing illegal advice) are not anecdotal edge cases but systematic failures that occurred precisely because these agents passed internal assessments. This framing grounds the paper's motivation in concrete stakes rather than abstract concerns.
The four-dimensional decomposition is well-motivated by analogy to reliability engineering in safety-critical domains such as aviation and nuclear systems. Importing concepts like fault injection, calibration, and harm severity from established engineering disciplines is appropriate and overdue. The consistency metrics — particularly the distinction between outcome consistency, trajectory consistency, and resource consistency — capture meaningfully different failure modes that a single success rate collapses together. An agent that achieves the right answer via different reasoning paths on different runs may be less deployable than one with a lower average accuracy but highly predictable behavior.
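To make the distinction concrete, here is a minimal sketch of how outcome consistency and trajectory consistency might be computed across repeated runs. The function names, the majority-agreement definition, and the Jaccard similarity over action sets are illustrative assumptions, not the paper's exact formulations.

```python
from collections import Counter
from itertools import combinations

def outcome_consistency(final_answers):
    """Fraction of runs that agree with the modal (most common) outcome."""
    counts = Counter(final_answers)
    return counts.most_common(1)[0][1] / len(final_answers)

def trajectory_consistency(trajectories):
    """Mean pairwise Jaccard similarity over the sets of actions taken.
    Captures whether the agent reaches its answer the same way each run."""
    pairs = list(combinations(trajectories, 2))
    sims = [len(set(a) & set(b)) / len(set(a) | set(b)) for a, b in pairs]
    return sum(sims) / len(sims)

# Hypothetical data: five runs of the same task.
runs = ["42", "42", "42", "41", "42"]
print(outcome_consistency(runs))  # 0.8

paths = [("search", "read", "answer"),
         ("search", "answer"),
         ("search", "read", "answer")]
print(trajectory_consistency(paths))
```

An agent can score high on the first metric and low on the second, which is exactly the failure mode a single success rate collapses away.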
The finding that reliability gains lag behind capability progress (Figure 1) is the paper's most significant empirical contribution. The visualization showing accuracy rising steadily while reliability trails behind across both benchmarks is striking and directly actionable for the field. This result challenges the implicit assumption that scaling model capability is sufficient for safe deployment.
The distinction between reliability and capability in Section 3.5.1 is conceptually valuable. Disentangling these two properties clarifies that improving a model's average task performance does not address the orthogonal question of how that performance degrades under perturbation, across runs, or at the tails of the error distribution.
Weaknesses and Limitations
The aggregation methodology in Section 3.5 raises important questions that are not fully resolved. Combining twelve metrics into a scalar reliability score requires weighting decisions that are inherently application-dependent. In a financial services context, harm severity and compliance failures are far more costly than resource inconsistency; in a research assistant context, the reverse may be true. The paper uses equal weighting as a default but does not provide principled guidance for practitioners on how to calibrate weights for specific deployment contexts. This limits the framework's direct applicability without further adaptation.
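The sensitivity at issue is easy to demonstrate. The sketch below (with hypothetical per-dimension scores and weights, not the paper's data) shows how the same underlying metrics yield materially different reliability scores under equal weighting versus a safety-heavy weighting of the kind a financial deployment would demand:

```python
def reliability_score(metrics, weights=None):
    """Weighted mean of per-dimension scores in [0, 1].
    Equal weighting is the paper's stated default; the weights argument
    is the knob a practitioner would need principled guidance to set."""
    if weights is None:
        weights = {k: 1.0 for k in metrics}
    total = sum(weights.values())
    return sum(weights[k] * metrics[k] for k in metrics) / total

# Hypothetical scores for one model.
metrics = {"consistency": 0.9, "robustness": 0.7,
           "predictability": 0.6, "safety": 0.95}

print(reliability_score(metrics))  # equal weights: 0.7875
# Finance-style weighting that penalizes safety failures 4x.
print(reliability_score(metrics, {"consistency": 1, "robustness": 1,
                                  "predictability": 1, "safety": 4}))
```

The gap between the two outputs is the model-ranking ambiguity the paper currently leaves to the reader.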
The evaluation benchmarks, while appropriate for demonstrating the framework, are both general-purpose. GAIA and τ-bench do not represent the full diversity of high-stakes agentic deployments — particularly in domains like healthcare, legal reasoning, or financial analysis — where reliability requirements are more stringent and failure modes more domain-specific. The paper would benefit from at least a discussion of how the framework would need to be extended or recalibrated for such settings, even if empirical evaluation in those domains is left to future work.
The predictability metrics rely on post-hoc confidence elicitation — asking the model to self-report its confidence after task completion. The paper acknowledges this limitation, but the implications deserve more discussion. Self-reported confidence in LLMs is known to be poorly calibrated and can be manipulated by prompt phrasing. An agent that is confidently wrong is arguably more dangerous than one that is uncertain and wrong, and the current protocol may not reliably distinguish these cases. Native uncertainty quantification approaches, even approximate ones, would strengthen the predictability dimension.
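The confidently-wrong versus uncertain-wrong asymmetry is visible directly in the Brier score, one of the paper's predictability metrics. A minimal sketch with hypothetical confidence values:

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between self-reported confidence and the 0/1 outcome.
    Lower is better; a constant 0.5 (uninformative) scores 0.25."""
    n = len(confidences)
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / n

# Both agents fail the task (outcome 0), but one was confidently wrong.
print(brier_score([0.95], [0]))  # 0.9025 -- confidently wrong
print(brier_score([0.55], [0]))  # 0.3025 -- uncertain and wrong
```

The score does separate the two cases, but only if the elicited confidences are faithful, which is precisely what prompt-sensitive self-reports cannot guarantee.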
The safety evaluation using τ-bench's tool-use environment captures policy compliance failures well, but harm severity assessment depends on an LLM judge whose own reliability is not fully characterized. There is a recursive problem here: using an AI system to assess the harm severity of another AI system's failures requires the judge to be reliably calibrated for harm assessment, which is itself an open problem. The paper should more explicitly flag this dependency.
Suggestions
The framework would be significantly more useful to practitioners if accompanied by a decision procedure for weight selection. Even a simple guide — such as a matrix mapping deployment context characteristics (reversibility of actions, presence of human oversight, domain stakes) to recommended metric weightings — would make the paper more actionable beyond academic evaluation.
The authors should consider whether the twelve metrics are independent or whether there are systematic correlations between dimensions. For example, if models that are highly consistent also tend to be better calibrated, the reliability score double-counts correlated properties. An empirical correlation analysis across the evaluated models would clarify the framework's information content.
A discussion of how the reliability framework interacts with agent scaffolding choices would be valuable. The experiments control for scaffolding, but in practice, scaffolding decisions — number of retries, fallback behaviors, human-in-the-loop checkpoints — are the primary levers practitioners use to improve reliability. Understanding how these interact with the proposed metrics would bridge the gap between the theoretical framework and engineering practice.
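To illustrate the interaction, consider the simplest scaffolding lever: a retry wrapper with a fallback (all names here are hypothetical). Retries raise the outcome consistency seen by the caller while masking the per-run variance the framework's metrics are designed to measure, which is why the two need to be studied together:

```python
def run_with_retries(agent, task, max_retries=2, fallback=None):
    """Retry a flaky agent call, escalating to a fallback
    (e.g. a human-in-the-loop handler) if every attempt fails."""
    for attempt in range(max_retries + 1):
        try:
            result = agent(task)
            if result is not None:
                return result
        except Exception:
            pass  # transient failure; retry
    return fallback(task) if fallback else None

# Deterministic stand-in for an agent that fails its first two attempts.
calls = {"n": 0}
def flaky_agent(task):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient tool failure")
    return f"done: {task}"

print(run_with_retries(flaky_agent, "book flight"))  # done: book flight
```

Measured from the outside, this agent looks perfectly consistent; measured per attempt, it fails two runs in three. Which view the metrics should take is exactly the gap between the framework and engineering practice.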
Overall Assessment
This is a high-quality and important contribution that addresses a genuine gap in the AI agent evaluation literature. The four-dimensional reliability framework is well-motivated, empirically grounded, and practically relevant. The finding that reliability lags behind accuracy despite rapid capability progress is a significant empirical result that should inform how the community designs benchmarks and deployment criteria going forward. The main weaknesses are the aggregation methodology's sensitivity to weighting choices and the reliance on self-reported confidence for predictability assessment. These are acknowledged limitations that future work can address. Strongly recommended for acceptance.
The author declares that they have no competing interests.
The author declares that they did not use generative AI to come up with new ideas for their review.