PREreviews of “Rethinking Benchmark Comparability: A Survey of Reasoning Benchmarks for Large Language Models”

Rethinking Benchmark Comparability: A Survey of Reasoning Benchmarks for Large Language Models

by Chenyuan Zhang, Simin Liu, Hanjing Li, Te Gao, Yidi Wang, Qiguang Chen, Xiachong Feng, Li Cai, Mengnan Du, Zhuotao Tian, Libo Qin, Philip S. Yu, and Min Zhang

Published: May 13, 2026
Server: Preprints.org
DOI: 10.20944/preprints202605.0806.v1

Abstract

As reasoning becomes a defining capability of large language models, reasoning benchmarks have moved to the center of evaluation. However, despite the rapid growth in the number of benchmarks and reported scores, benchmark results are often not directly comparable. This is because benchmarks may differ not only in the reasoning capabilities they target, but also in the conditions under which models are evaluated and the criteria used to assess success. To address this challenge, we present the first survey of reasoning benchmarks for large language models across three dimensions: Object, Setting, and Evaluation. Object defines the reasoning capability under examination. Setting specifies the conditions that shape model behavior. Evaluation determines how success is measured. We further introduce extended scenarios to account for special conditions. Based on this analysis, we identify two major weaknesses in current practice, namely heterogeneous benchmark objects and weakly justified settings, and derive practical guidance for benchmark selection, construction, and reporting, along with future directions for benchmark development. We hope this survey will help advance reasoning evaluation beyond score comparison alone toward benchmarks that are more interpretable, better justified, and easier to implement. A repository for the related papers is available at https://github.com/chenyuanTKCY/Awesome-Benchmarks-for-LLM-Reasoning.

0 PREreviews
