PREreviews of Towards a Science of AI Agent Reliability

1 PREreview

  1. PREreview by Zirui Wei

    Summary

    This paper argues that current agent evaluation practice — reporting mean task success rates — fundamentally fails to capture whether agents are reliable enough for real-world deployment. The authors propose a four-dimensional reliability framework decomposed into twelve concrete metrics:…

    Read the PREreview by Zirui Wei