
Write a PREreview

Evaluating Agentic AI Systems: A Balanced Framework for Performance, Robustness, Safety and Beyond

Published
Server
Preprints.org
DOI
10.20944/preprints202508.1847.v1

Agentic artificial intelligence (AI)—multi-agent systems that combine large language models with external tools and autonomous planning—are rapidly transitioning from research labs into high-stakes domains. Existing evaluations emphasise narrow technical metrics such as task success or latency, leaving important sociotechnical dimensions like human trust, ethical compliance and economic sustainability under-measured. We propose a balanced evaluation framework spanning five axes (capability & efficiency, robustness & adaptability, safety & ethics, human-centred interaction and economic & sustainability) and introduce novel indicators including goal-drift scores and harm-reduction indices. Beyond synthesising prior work, we identify gaps in current benchmarks, develop a conceptual diagram to visualise interdependencies and outline experimental protocols for empirically validating the framework. Case studies from recent industry deployments illustrate that agentic AI can yield 20–60% productivity gains yet often omit assessments of fairness, trust and long-term sustainability. We argue that multidimensional evaluation—combining automated metrics with human-in-the-loop scoring and economic analysis—is essential for responsible adoption of agentic AI.
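As one way a reviewer might picture the five-axis evaluation the abstract describes, the minimal sketch below aggregates normalised per-axis indicators into a single balanced score. It is not taken from the preprint: the axis keys follow the abstract, but the indicator names, weights, normalisation and aggregation rule are illustrative assumptions only.

# Hypothetical sketch (not the authors' method): aggregate per-axis
# indicators into one balanced score across the five evaluation axes.
from dataclasses import dataclass, field

AXES = (
    "capability_efficiency",
    "robustness_adaptability",
    "safety_ethics",
    "human_centred_interaction",
    "economic_sustainability",
)

@dataclass
class AxisScore:
    """Normalised indicators in [0, 1], higher is better, for one axis."""
    indicators: dict[str, float] = field(default_factory=dict)

    def value(self) -> float:
        # Simple mean over indicators; a real framework might weight them.
        if not self.indicators:
            return 0.0
        return sum(self.indicators.values()) / len(self.indicators)

def balanced_score(axes: dict[str, AxisScore],
                   weights: dict[str, float] | None = None) -> float:
    """Weighted mean across the five axes; equal weights by default."""
    weights = weights or {name: 1.0 for name in AXES}
    total_w = sum(weights[name] for name in AXES)
    return sum(weights[name] * axes[name].value() for name in AXES) / total_w

# Example with made-up numbers, mixing automated metrics (task success),
# human-in-the-loop ratings (trust) and economic indicators (cost),
# all rescaled to [0, 1]; "goal_drift" is inverted so higher = less drift.
scores = {
    "capability_efficiency": AxisScore({"task_success": 0.82, "latency": 0.70}),
    "robustness_adaptability": AxisScore({"goal_drift": 0.65}),
    "safety_ethics": AxisScore({"harm_reduction": 0.90}),
    "human_centred_interaction": AxisScore({"trust_rating": 0.60}),
    "economic_sustainability": AxisScore({"cost_efficiency": 0.75}),
}
print(f"balanced score: {balanced_score(scores):.2f}")

A weighted mean is only one possible aggregation; the preprint's emphasis on balance suggests a deployment might instead require minimum thresholds on the safety and human-centred axes rather than letting strong capability scores compensate for them.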

You can write a PREreview of Evaluating Agentic AI Systems: A Balanced Framework for Performance, Robustness, Safety and Beyond. A PREreview is a review of a preprint and can range from a few sentences to a lengthy report, similar to a peer review report organised by a journal.
