Longer Reasoning Paths Increase Per-Question Position Bias in QA Models

Chain-of-thought (CoT) prompting and reasoning-tuned models such as DeepSeek-R1 are often praised for “thinking” their way past shallow heuristics. A new study asks a tougher question: does more reasoning introduce a bias of its own? By truncating generation trajectories and resuming them later, the researchers observed a clear drift toward options that sit later in the answer list. For the R1-Qwen-7B pair, the reported figures range from 16% to 32% across absolute-position buckets.

Even when model size balloons to 671B parameters, the aggregate Position Bias Score (PBS) shrinks to 0.019, yet the longest length quartile still shows a PBS of 0.071, hinting that scale and accuracy alone don’t erase the length-driven effect. The authors also separate this from the well-known direct-answer bias, which appears strong in Llama-Instruct-direct, weak in Qwen-Instruct-direct, and shows no link to trajectory length. In other words, CoT reasoning swaps one bias for another.
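To make the quartile comparison concrete, here is a minimal Python sketch of how per-question scores could be bucketed by trajectory length. It is not the authors' code: the function names and toy numbers are illustrative, and it assumes each question already has a bias score, since the article does not say how PBS itself is computed.

```python
from statistics import mean


def score_by_length_quartile(lengths, scores):
    """Mean per-question score in each trajectory-length quartile, shortest first."""
    if len(lengths) != len(scores) or len(lengths) < 4:
        raise ValueError("need two equal-length lists with at least 4 questions")
    # Sort questions by trajectory length and keep only their scores.
    ordered = [s for _, s in sorted(zip(lengths, scores), key=lambda pair: pair[0])]
    n = len(ordered)
    # Integer bucket boundaries; every bucket is non-empty when n >= 4.
    bounds = [i * n // 4 for i in range(5)]
    return [mean(ordered[bounds[i]:bounds[i + 1]]) for i in range(4)]


def is_strictly_increasing(values):
    """True when each value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


# Toy data (made up): reasoning lengths in tokens and a per-question bias score.
toy_lengths = [120, 150, 300, 340, 800, 900, 2000, 2600]
toy_scores = [0.00, 0.02, 0.02, 0.04, 0.05, 0.07, 0.08, 0.12]

quartile_means = score_by_length_quartile(toy_lengths, toy_scores)
print(quartile_means, is_strictly_increasing(quartile_means))
```

A low aggregate score can hide a high top-quartile score, which is why the per-quartile view is worth reporting alongside the average.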

The paper concludes that reasoning-capable models can’t be assumed order-robust in multiple-choice evaluations and offers a diagnostic toolkit (PBS, commitment change point, effective switching, and truncation probes) to audit these hidden preferences.

We test this on position bias in multiple-choice QA and find a different story: within any reasoning-capable model, per-question position bias scales with the length of the reasoning trajectory. Across thirteen reasoning-mode configurations (two R1-distilled 7-8B models, two base models prompted with CoT, and DeepSeek-R1 at 671B) on MMLU, ARC-Challenge, and GPQA, twelve show a positive partial correlation between trajectory length and Position Bias Score (PBS) after controlling for accuracy, ranging from 0.11 to 0.41 (all p < 0.05). All twelve open-weight reasoning-mode configurations show monotonically increasing PBS across length quartiles.
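For readers unfamiliar with partial correlation, the sketch below shows the standard first-order formula: the correlation between two variables after removing what a third variable explains in both. It assumes a Pearson-style correlation, which the article does not confirm (the study may use a rank-based variant), and the variable names and toy numbers are made up for illustration.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_corr(x, y, z):
    """Correlation of x and y after controlling for z (first-order formula)."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy per-question data (made up): trajectory length, bias score, and accuracy.
toy_length = [200, 450, 600, 900, 1500, 2100, 2800, 3500]
toy_bias = [0.01, 0.03, 0.02, 0.05, 0.04, 0.08, 0.07, 0.11]
toy_accuracy = [0.9, 0.8, 0.9, 0.7, 0.8, 0.6, 0.7, 0.5]

print(partial_corr(toy_length, toy_bias, toy_accuracy))
```

Controlling for accuracy matters here because harder questions tend to produce both longer reasoning and less stable answers; the partial correlation asks whether length still tracks bias once that shared factor is removed.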

Why this matters

We see that longer chain-of-thought sequences do not automatically neutralize heuristic shortcuts; instead, per-question position bias rises as the reasoning trajectory lengthens. The pattern shows up in twelve of the thirteen reasoning-mode configurations tested, which span two R1-distilled 7-8B models, two base models prompted with CoT, and DeepSeek-R1 at 671B. For developers, the finding suggests that simply adding more “thinking” steps may introduce a new source of error, especially in multiple-choice QA where answer order can sway outcomes.

Founders should question whether scaling model size or prompting depth will yield cleaner decisions without additional safeguards. Researchers are left with an open question: is the bias an artifact of the training data, the prompting style, or an inherent property of extended reasoning paths? It is unclear whether alternative architectures could decouple length from positional preference.

Until we understand the mechanism, we may need to monitor answer ordering more closely and consider bias‑mitigation strategies as a standard part of model evaluation.
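One simple way to monitor answer ordering is to ask the same question under every rotation of the options and check whether the model follows the content or the slot. The sketch below is a generic probe, not the paper's PBS or its truncation method; `answer_fn` is a hypothetical stand-in for whatever function calls your model and returns the index of the chosen option.

```python
from collections import Counter


def rotation_probe(answer_fn, question, options):
    """Query every cyclic rotation of the options.

    Returns (content_consistency, slot_consistency): the share of rotations
    that picked the most common option text, and the share that picked the
    most common slot index. High slot consistency with low content
    consistency suggests the choice follows position rather than content.
    """
    n = len(options)
    chosen_texts, chosen_slots = [], []
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        slot = answer_fn(question, rotated)  # hypothetical model call
        chosen_slots.append(slot)
        chosen_texts.append(rotated[slot])
    top_text_count = Counter(chosen_texts).most_common(1)[0][1]
    top_slot_count = Counter(chosen_slots).most_common(1)[0][1]
    return top_text_count / n, top_slot_count / n


# Two stub "models" for illustration only.
def always_last(question, options):
    return len(options) - 1  # purely position-driven


def knows_answer(question, options):
    return options.index("Paris")  # purely content-driven


toy_options = ["Paris", "Rome", "Oslo", "Bern"]
print(rotation_probe(always_last, "Capital of France?", toy_options))
print(rotation_probe(knows_answer, "Capital of France?", toy_options))
```

Running a probe like this on a sample of evaluation questions, split by reasoning length, is one low-cost way to see whether longer trajectories make a model's choices less stable under reordering.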