PREreviews of “Rethinking Benchmark Comparability: A Survey of Reasoning Benchmarks for Large Language Models”

Rethinking Benchmark Comparability: A Survey of Reasoning Benchmarks for Large Language Models

by Chenyuan Zhang, Simin Liu, Hanjing Li, Te Gao, Yidi Wang, Qiguang Chen, Xiachong Feng, Li Cai, Mengnan Du, Zhuotao Tian, Libo Qin, Philip S. Yu, and Min Zhang

Posted: May 13, 2026
Server: Preprints.org
DOI: 10.20944/preprints202605.0806.v1

Abstract

As reasoning becomes a defining capability of large language models, reasoning benchmarks have moved to the center of evaluation. However, despite the rapid growth in the number of benchmarks and reported scores, benchmark results are often not directly comparable. This is because benchmarks may differ not only in the reasoning capabilities they target, but also in the conditions under which models are evaluated and the criteria used to assess success. To address this challenge, we present the first survey of reasoning benchmarks for large language models across three dimensions: Object, Setting, and Evaluation. Object defines the reasoning capability under examination. Setting specifies the conditions that shape model behavior. Evaluation determines how success is measured. We further introduce extended scenarios to account for special conditions. Based on this analysis, we identify two major weaknesses in current practice, namely heterogeneous benchmark objects and weakly justified settings, and derive practical guidance for benchmark selection, construction, and reporting, along with future directions for benchmark development. We hope this survey will help advance reasoning evaluation beyond score comparison alone toward benchmarks that are more interpretable, better justified, and easier to implement. A repository for the related papers is available at https://github.com/chenyuanTKCY/Awesome-Benchmarks-for-LLM-Reasoning.

0 PREreviews
