When Chain-of-Thought Helps and When It Hurts: A Communication-Complexity Account of LLM Benchmark Behaviour via the H_dp Bandwidth Bound
- Published
- Server: Zenodo
- DOI: 10.5281/zenodo.20348809
v2 — Substantive correction of the v1 (May 2026) release.
CHANGE SUMMARY
Inference outputs are unchanged from v1; only the scoring of MMLU and ARC-Challenge was affected. A regex-parsing bug in src/runner.py:125 returned the first standalone capital letter found in each output, which for chain-of-thought outputs is the "(A)" in the listed answer options rather than the model's final answer. The bug systematically scored CoT outputs on MMLU/ARC as "A" regardless of the model's stated final answer. With correct answer extraction (results/rescore_mmlu_arc.py, released alongside this version), the central "CoT Sign Reversal" claim of v1 is not supported. The math-side findings, HumanEval functional scoring, and the GSM-Symbolic memorisation control are unaffected.
KEY NUMBER CHANGES
- MMLU CoT delta: was -28 to -38 pp (v1) -> +2.4 to +4.6 pp (v2, approximately neutral across all three models).
- ARC-Challenge CoT delta: was -56 to -67 pp (v1) -> +0.0 to +3.3 pp (v2, approximately neutral across all three models, including Qwen-7B where it lands at literally 0.0 pp).
- GSM8K, MATH, HumanEval deltas: unchanged.
- Pre-registered McNemar tests significant after Bonferroni: was 15/15 (v1) -> 10/15 (v2); the five non-significant cells are all of MMLU and ARC except Qwen-7B/MMLU -- exactly the cells the framework predicted as TC^0 (no CoT benefit).
- Pre-registered Spearman depth gradient: was rho = 0.850 (v1) -> rho = 0.661 (v2), p = 0.007. Per-model: Llama-8B and Qwen-32B rho = 0.866; Qwen-7B rho = 0.289.
- Pre-registered H3 (MMLU no-CoT >= CoT): confirmed in v1 -> falsified in v2. One-sided p-values 1.00, 0.97, 0.98.
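For readers unfamiliar with the significance check behind the 15-cell count, the following is a minimal sketch of an exact two-sided McNemar test with a Bonferroni threshold. It is an illustration only: the text does not say whether the pre-registered tests used the exact or the chi-square form, the function name `mcnemar_exact` is hypothetical, the family-wise alpha of 0.05 is an assumption, and the discordant counts are made up rather than taken from the released database.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b = items correct without CoT but wrong with CoT,
    c = items wrong without CoT but correct with CoT.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(b, c)
    # Probability of a split at least as lopsided as the observed one (one tail).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


ALPHA = 0.05     # assumed family-wise error rate (not stated in the text)
N_TESTS = 15     # number of pre-registered cells, from the text
THRESHOLD = ALPHA / N_TESTS  # Bonferroni-corrected per-test threshold

# Illustrative (made-up) discordant counts for two cells.
for b, c in [(0, 10), (1, 10)]:
    p = mcnemar_exact(b, c)
    print(b, c, p, p < THRESHOLD)
```

With 15 tests, a cell counts as significant only when its p-value falls below roughly 0.0033, so a cell with few discordant items can fail the threshold even when every discordant item but one points the same way.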
WHAT v2 ARGUES
The math-side prediction of the H_dp framework (CoT recovers single-pass bandwidth) is strongly supported across all three models on GSM8K and MATH. The negative TC^0 prediction (CoT actively hurts low-depth tasks) is not supported: with correctly extracted answers, CoT is approximately neutral on MMLU and ARC across all six (model, benchmark) cells (delta range 0.0 to +4.6 pp). HumanEval continues to show the predicted model-size-dependent transition (+68.9 pp for Qwen-32B, +15.9 pp for Llama-8B, -27.4 pp for Qwen-7B).
PROVENANCE -- TWO-PHASE CORRECTION
Phase 1 (May 22, afternoon): a qualitative audit of ARC-Challenge failures surfaced the core regex bug. The parser at src/runner.py:125 used re.search(r"\b([A-D])\b", text.upper()) which returns the FIRST standalone capital letter. For no-CoT outputs starting with "Answer: B" this works. For CoT outputs that enumerate "(A)... (B)... (C)... (D)... Answer: C" the regex returns A regardless of the model's final answer.
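The failure mode can be reproduced in a few lines. The sketch below wraps the quoted pattern in a hypothetical helper, `buggy_extract` (the surrounding code in src/runner.py is not shown in this record), and runs it on two made-up outputs shaped like the ones described above.

```python
import re


def buggy_extract(text: str):
    """Apply the pattern quoted from src/runner.py:125 (wrapper is hypothetical)."""
    # re.search returns the FIRST standalone A-D in the upper-cased text.
    match = re.search(r"\b([A-D])\b", text.upper())
    return match.group(1) if match else None


# Made-up outputs for illustration; not rows from the released database.
no_cot_output = "Answer: B"
cot_output = (
    "Let us compare the options. (A) mitosis (B) meiosis "
    "(C) osmosis (D) diffusion. Answer: C"
)

print(buggy_extract(no_cot_output))  # -> B (matches the stated answer)
print(buggy_extract(cot_output))     # -> A (the stated answer is C)
```

Because the option list precedes the conclusion in a chain-of-thought output, the first match is almost always the "(A)" label, which is why the v1 scores collapsed toward the share of items whose gold answer is A.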
Phase 2 (May 22, evening): direct database verification caught two follow-on problems. (a) Qwen-7B emits pseudo-tags <|"answer"|> and <|/assistant|> that the Phase-1 leak regex missed. The Phase-2 LEAK regex was broadened to match any <|...|> token plus </s>. (b) Case-insensitive [A-D] matching after the answer marker matched the English article "a" in phrases like "a balanced equation". Phase-2 switched the tail search to case-sensitive and tightened the "Answer" marker to require a colon. Phase 2 also added \boxed{X} extraction for Llama's math-style ARC answers. These fixes collapsed a Phase-1 residual Qwen-7B ARC penalty (-17.9 pp) to exactly +0.0 pp.
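A minimal sketch of an extractor that follows the Phase-2 rules described above is given here. It is not the released results/rescore_mmlu_arc.py: the names `LEAK`, `BOXED`, `MARKER`, `LETTER` and `extract_answer`, the exact patterns, the precedence of \boxed{X} over the "Answer:" marker, and the choice to use the last marker are all assumptions made for illustration.

```python
import re

# Hypothetical patterns reconstructed from the prose description.
LEAK = re.compile(r"<\|.*?\|>|</s>")              # any <|...|> token plus </s>
BOXED = re.compile(r"\\boxed\{\s*([A-D])\s*\}")   # math-style \boxed{X} answers
MARKER = re.compile(r"[Aa]nswer\s*:")             # marker must carry a colon
LETTER = re.compile(r"\b([A-D])\b")               # case-sensitive on purpose


def extract_answer(text: str):
    """Return the model's final A-D choice, or None if none is found."""
    cleaned = LEAK.sub(" ", text)       # drop leaked special tokens first
    boxed = BOXED.findall(cleaned)
    if boxed:
        return boxed[-1]                # assumption: last boxed letter wins
    markers = list(MARKER.finditer(cleaned))
    if markers:
        tail = cleaned[markers[-1].end():]  # search only after the last marker
        match = LETTER.search(tail)         # lowercase "a" is not matched
        if match:
            return match.group(1)
    return None


print(extract_answer("(A) x (B) y (C) z (D) w. Answer: C"))          # -> C
print(extract_answer("Answer: a balanced equation is needed, so C"))  # -> C
print(extract_answer('Answer: B<|"answer"|><|/assistant|></s>'))      # -> B
print(extract_answer("So the answer is \\boxed{D}."))                 # -> D
```

Searching only the tail after the marker keeps the enumerated option labels out of scope, and the case-sensitive letter pattern is what prevents the article "a" from being read as option A.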
The label_source column of the released SQLite database tags all rescored rows as 'parser_fix_v2'. Re-running verify_paper_stats.py reproduces all v2 numbers. A pre-fix backup is at results/ccb_profile_merged.backup_before_parser_fix_20260522_161258.sqlite on the OSF project.
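One quick way to check the tagging is to count rows per label_source value. The sketch below runs against a toy in-memory database so that it is self-contained. Only the column name label_source and the value 'parser_fix_v2' come from the text; the table name `results`, the other columns, and the 'original' label are hypothetical stand-ins for the released schema.

```python
import sqlite3


def count_by_label_source(conn: sqlite3.Connection, table: str) -> dict:
    """Count rows per label_source value in the given table."""
    # The table name is interpolated directly, so pass only trusted names.
    query = f'SELECT label_source, COUNT(*) FROM "{table}" GROUP BY label_source'
    return dict(conn.execute(query).fetchall())


# Toy stand-in for the released database (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results ("
    "id INTEGER PRIMARY KEY, benchmark TEXT, label_source TEXT)"
)
conn.executemany(
    "INSERT INTO results (benchmark, label_source) VALUES (?, ?)",
    [
        ("MMLU", "parser_fix_v2"),
        ("ARC-Challenge", "parser_fix_v2"),
        ("GSM8K", "original"),  # hypothetical label for rows left untouched
    ],
)

counts = count_by_label_source(conn, "results")
print(counts)
```

Against the real file, the same function would be pointed at a connection opened on the released SQLite database and at whichever table holds the scored rows.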
This preprint is not currently under submission to any peer-reviewed venue.