When Chain-of-Thought Helps and When It Hurts: A Communication-Complexity Account of LLM Benchmark Behaviour via the H_dp Bandwidth Bound

Server: Zenodo
DOI: 10.5281/zenodo.20348809

v2 — Substantive correction of the v1 (May 2026) release.

CHANGE SUMMARY
Inference outputs are unchanged from v1; only the scoring of MMLU and ARC-Challenge was affected. A regex-parsing bug in src/runner.py:125 returned the first standalone capital letter found in each output, which for chain-of-thought outputs is the "(A)" in the listed answer options rather than the model's final answer. The bug systematically scored CoT outputs on MMLU/ARC as "A" regardless of the model's stated final answer. With correct answer extraction (results/rescore_mmlu_arc.py, released alongside this version), the central "CoT Sign Reversal" claim of v1 is not supported. The math-side findings, HumanEval functional scoring, and the GSM-Symbolic memorisation control are unaffected.

KEY NUMBER CHANGES
- MMLU CoT delta: was -28 to -38 pp (v1) -> +2.4 to +4.6 pp (v2, approximately neutral across all three models).
- ARC-Challenge CoT delta: was -56 to -67 pp (v1) -> +0.0 to +3.3 pp (v2, approximately neutral across all three models, including Qwen-7B where it lands at literally 0.0 pp).
- GSM8K, MATH, HumanEval deltas: unchanged.
- Pre-registered McNemar tests significant after Bonferroni: was 15/15 (v1) -> 10/15 (v2); the five non-significant cells are all of MMLU and ARC except Qwen-7B/MMLU -- exactly the cells the framework predicted as TC^0 (no CoT benefit).
- Pre-registered Spearman depth gradient: was rho = 0.850 (v1) -> rho = 0.661 (v2), p = 0.007. Per-model: Llama-8B and Qwen-32B rho = 0.866; Qwen-7B rho = 0.289.
- Pre-registered H3 (MMLU no-CoT >= CoT): confirmed in v1 -> falsified in v2. One-sided p-values 1.00, 0.97, 0.98.
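For readers unfamiliar with the paired test named in the list above, the following is a minimal sketch of an exact two-sided McNemar test with a Bonferroni threshold for 15 comparisons. It is not taken from the released code: the function name, the choice of the exact binomial form, the base alpha of 0.05, and the discordant counts are all assumptions made for illustration.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: items answered correctly only without CoT (illustrative meaning)
    c: items answered correctly only with CoT (illustrative meaning)
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least as lopsided as observed.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


N_TESTS = 15                  # number of pre-registered cells, from the summary
ALPHA = 0.05                  # assumed family-wise level, not stated in the summary
threshold = ALPHA / N_TESTS   # Bonferroni-corrected per-test threshold

# Made-up discordant counts, for illustration only.
p_large_gap = mcnemar_exact(5, 60)    # strongly one-sided disagreement
p_balanced = mcnemar_exact(20, 22)    # nearly symmetric disagreement

print(p_large_gap < threshold, p_balanced < threshold)
```

Only the items on which the two conditions disagree enter the test, which is why a cell with a near-zero CoT delta can fail to reach significance even on a large benchmark.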

WHAT v2 ARGUES
The math-side prediction of the H_dp framework (CoT recovers single-pass bandwidth) is strongly supported across all three models on GSM8K and MATH. The negative TC^0 prediction (CoT actively hurts low-depth tasks) is not supported: with correctly extracted answers, CoT is approximately neutral on MMLU and ARC across all six (model, benchmark) cells (delta range 0.0 to +4.6 pp). HumanEval continues to show the predicted model-size-dependent transition (+68.9 pp for Qwen-32B, +15.9 pp for Llama-8B, -27.4 pp for Qwen-7B).

PROVENANCE -- TWO-PHASE CORRECTION
Phase 1 (May 22, afternoon): a qualitative audit of ARC-Challenge failures surfaced the core regex bug. The parser at src/runner.py:125 used re.search(r"\b([A-D])\b", text.upper()) which returns the FIRST standalone capital letter. For no-CoT outputs starting with "Answer: B" this works. For CoT outputs that enumerate "(A)... (B)... (C)... (D)... Answer: C" the regex returns A regardless of the model's final answer.
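The Phase-1 failure can be reproduced in a few lines. The regex call below is the one quoted above; the wrapper function name and the two sample outputs are invented for illustration.

```python
import re


def buggy_extract(text):
    """Reproduces the parsing behaviour described for src/runner.py:125."""
    match = re.search(r"\b([A-D])\b", text.upper())
    return match.group(1) if match else None


# Invented sample outputs, shaped like the two cases described above.
no_cot = "Answer: B"
cot = "Let us compare the options. (A) iron (B) copper (C) helium (D) zinc. Answer: C"

buggy_no_cot = buggy_extract(no_cot)  # the only standalone letter is the answer
buggy_cot = buggy_extract(cot)        # the "(A)" in the option list is found first

print(buggy_no_cot, buggy_cot)
```

The second call returns "A" even though the sample output ends with "Answer: C", which is the mis-scoring pattern the audit found.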

Phase 2 (May 22, evening): direct database verification caught two follow-on problems. (a) Qwen-7B emits pseudo-tags <|"answer"|> and <|/assistant|> that the Phase-1 leak regex missed. The Phase-2 LEAK regex was broadened to match any <|...|> token plus </s>. (b) Case-insensitive [A-D] matching after the answer marker matched the English article "a" in phrases like "a balanced equation". Phase-2 switched the tail search to case-sensitive and tightened the "Answer" marker to require a colon. Phase 2 also added \boxed{X} extraction for Llama's math-style ARC answers. These fixes collapsed a Phase-1 residual Qwen-7B ARC penalty (-17.9 pp) to exactly +0.0 pp.
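The Phase-2 rules can be sketched as a single extractor. This is a reconstruction from the prose above, not the released results/rescore_mmlu_arc.py: the pattern names, the function name, the order in which the rules are applied, and the treatment of the "Answer:" marker as case-sensitive are all assumptions.

```python
import re

# Hypothetical patterns reconstructed from the description; not the released code.
LEAK = re.compile(r"<\|.*?\|>|</s>")        # any <|...|> pseudo-tag, plus </s>
BOXED = re.compile(r"\\boxed\{([A-D])\}")   # math-style \boxed{X} answers
ANSWER_MARKER = re.compile(r"Answer:")      # the marker must include the colon
TAIL_LETTER = re.compile(r"\b([A-D])\b")    # case-sensitive, so the article "a" is ignored


def extract_choice(text):
    """Return the final A-D choice, or None if no answer can be located."""
    cleaned = LEAK.sub(" ", text)
    boxed = BOXED.findall(cleaned)
    if boxed:
        return boxed[-1]          # assumption: the last boxed letter wins
    markers = list(ANSWER_MARKER.finditer(cleaned))
    if not markers:
        return None
    tail = cleaned[markers[-1].end():]   # search only after the last marker
    match = TAIL_LETTER.search(tail)
    return match.group(1) if match else None


# Invented sample outputs covering the cases named above.
print(extract_choice("(A) iron (B) copper (C) helium (D) zinc. Answer: C"))
print(extract_choice("Answer: a balanced equation is needed, so B"))
print(extract_choice('<|"answer"|> Answer: D <|/assistant|>'))
print(extract_choice(r"The result is \boxed{B}"))
```

Searching only the text after the last marker keeps the enumerated options out of reach of the letter pattern, and returning None when no marker is present avoids falling back to the first-letter behaviour that caused the original bug.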

The label_source column of the released SQLite database tags all rescored rows as 'parser_fix_v2'. Re-running verify_paper_stats.py reproduces all v2 numbers. A pre-fix backup is at results/ccb_profile_merged.backup_before_parser_fix_20260522_161258.sqlite on the OSF project.
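A quick way to see which rows were rescored is to group on label_source. The sketch below runs against an in-memory stand-in, because the schema of the released database is not described here: the table name responses, the benchmark column, the 'original' value, and the sample rows are all hypothetical. Only the label_source column and the 'parser_fix_v2' value come from the text above.

```python
import sqlite3

# In-memory stand-in for the released database; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE responses (benchmark TEXT, label_source TEXT)")
conn.executemany(
    "INSERT INTO responses VALUES (?, ?)",
    [
        ("MMLU", "parser_fix_v2"),
        ("ARC-Challenge", "parser_fix_v2"),
        ("GSM8K", "original"),  # hypothetical value for rows that were not rescored
    ],
)

# Count rescored rows per benchmark.
rescored_counts = dict(
    conn.execute(
        "SELECT benchmark, COUNT(*) FROM responses "
        "WHERE label_source = 'parser_fix_v2' GROUP BY benchmark"
    ).fetchall()
)
conn.close()

print(rescored_counts)
```

Against the real file, the same query (with the actual table and column names substituted) should return rows only for MMLU and ARC-Challenge if the rescoring was confined to those two benchmarks, as the change summary states.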

This preprint is not currently under submission to any peer-reviewed venue.
