Evaluating Agentic AI Systems: A Balanced Framework for Performance, Robustness, Safety and Beyond
- Server: Preprints.org
- DOI: 10.20944/preprints202508.1847.v1
Agentic artificial intelligence (AI)—multi-agent systems that combine large language models with external tools and autonomous planning—is rapidly transitioning from research labs into high-stakes domains. Existing evaluations emphasise narrow technical metrics such as task success or latency, leaving important sociotechnical dimensions like human trust, ethical compliance and economic sustainability under-measured. We propose a balanced evaluation framework spanning five axes (capability & efficiency, robustness & adaptability, safety & ethics, human-centred interaction, and economics & sustainability) and introduce novel indicators including goal-drift scores and harm-reduction indices. Beyond synthesising prior work, we identify gaps in current benchmarks, develop a conceptual diagram to visualise interdependencies and outline experimental protocols for empirically validating the framework. Case studies from recent industry deployments illustrate that agentic AI can yield 20–60% productivity gains yet often omit assessments of fairness, trust and long-term sustainability. We argue that multidimensional evaluation—combining automated metrics with human-in-the-loop scoring and economic analysis—is essential for responsible adoption of agentic AI.
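The abstract does not fix a formula for the goal-drift score, so the sketch below is one plausible reading rather than the authors' formulation: drift measured as the mean cosine distance between an embedding of the agent's original goal and embeddings of each step the agent actually takes. The function name `goal_drift_score` and the assumption that goals and steps share an embedding space (e.g. via a sentence-embedding model) are illustrative choices, not taken from the paper.

```python
import numpy as np

def goal_drift_score(goal_embedding: np.ndarray,
                     step_embeddings: list[np.ndarray]) -> float:
    """Mean cosine distance between the original goal and each agent step.

    Returns a value in [0, 2]: 0 means every step points in the goal's
    direction (no drift); larger values indicate the trajectory is
    turning away from the stated goal.
    """
    def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
        return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    if not step_embeddings:
        return 0.0  # no actions taken, so no observable drift
    return float(np.mean([cosine_distance(goal_embedding, s)
                          for s in step_embeddings]))

# Example: three steps that progressively diverge from the goal direction,
# so the averaged drift score rises above zero.
goal = np.array([1.0, 0.0])
steps = [np.array([1.0, 0.1]), np.array([0.7, 0.7]), np.array([0.0, 1.0])]
print(round(goal_drift_score(goal, steps), 3))
```

A per-step variant of the same quantity could feed a human-in-the-loop review queue, flagging trajectories whose drift exceeds a threshold; that combination mirrors the paper's argument for pairing automated metrics with human scoring.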