Saltar al contenido principal

Escribe una PREreview

Rethinking Benchmark Comparability: A Survey of Reasoning Benchmarks for Large Language Models

Publicada
Servidor
Preprints.org
DOI
10.20944/preprints202605.0806.v1

As reasoning becomes a defining capability of large language models, reasoning benchmarks have moved to the center of evaluation. However, despite the rapid growth in the number of benchmarks and reported scores, benchmark results are often not directly comparable. This is because benchmarks may differ not only in the reasoning capabilities they target, but also in the conditions under which models are evaluated and the criteria used to assess success. To address this challenge, we present the first survey of reasoning benchmarks for large language models across three dimensions: Object, Setting, and Evaluation. Object defines the reasoning capability under examination. Setting specifies the conditions that shape model behavior. Evaluation determines how success is measured. We further introduce extended scenarios to account for special conditions. Based on this analysis, we identify two major weaknesses in current practice, namely heterogeneous benchmark objects and weakly justified settings, and derive practical guidance for benchmark selection, construction, and reporting, along with future directions for benchmark development. We hope this survey will help advance reasoning evaluation beyond score comparison alone toward benchmarks that are more interpretable, better justified, and easier to implement. A repository for the related papers is available at https://github.com/chenyuanTKCY/Awesome-Benchmarks-for-LLM-Reasoning.

Puedes escribir una PREreview de Rethinking Benchmark Comparability: A Survey of Reasoning Benchmarks for Large Language Models. Una PREreview es una revisión de un preprint y puede variar desde unas pocas oraciones hasta un extenso informe, similar a un informe de revisión por pares organizado por una revista.

Antes de comenzar

Te pediremos que inicies sesión con tu ORCID iD. Si no tienes un iD, puedes crear uno.

¿Qué es un ORCID iD?

Un ORCID iD es un identificador único que te distingue de otros/as con tu mismo nombre o uno similar.

Comenzar ahora