Longer Reasoning Paths Increase Per-Question Position Bias in QA Models

By AI Daily Post • Edited by Brian Petersen, Editor-in-Chief

May 11, 2026 • Updated: May 14, 2026 • 2 min read

Chain‑of‑thought (CoT) prompting and reasoning‑tuned models such as DeepSeek‑R1 are often praised for “thinking” their way past shallow heuristics. The new study asks a tougher question: does more reasoning actually introduce a new kind of bias? By truncating generation trajectories and resuming them later, researchers observed a clear drift toward options that sit later in the answer list: 16% to 32% for the R1‑Qwen‑7B pair across absolute‑position buckets.

Even when model size balloons to 671B parameters, the aggregate position‑bias score (PBS) shrinks to 0.019, yet the quartile of questions with the longest reasoning trajectories still shows a PBS of 0.071, hinting that scale and accuracy alone don’t erase the length‑driven effect. The authors also separate this from the well‑known direct‑answer bias, which appears strong in Llama‑Instruct‑direct, weak in Qwen‑Instruct‑direct, and shows no link to trajectory length. In other words, CoT reasoning swaps one bias for another.
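The gap between an aggregate PBS and a longest-quartile PBS can be illustrated with a small sketch. It assumes a per-question PBS value is already available for each question (the article does not give the paper's formula, so none is reproduced here) and simply averages those values within rank-based length quartiles. The function names and the toy numbers are invented for illustration.

```python
from statistics import mean

def pbs_by_length_quartile(lengths, pbs_scores):
    """Mean per-question PBS within each trajectory-length quartile."""
    if len(lengths) != len(pbs_scores) or len(lengths) < 4:
        raise ValueError("need paired lists with at least four questions")
    # Sort question indices by trajectory length, shortest first.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    n = len(order)
    quartile_means = []
    for q in range(4):
        # Rank-based buckets of (nearly) equal size.
        bucket = order[q * n // 4:(q + 1) * n // 4]
        quartile_means.append(mean(pbs_scores[i] for i in bucket))
    return quartile_means

def is_monotonic_increasing(values):
    """True if each value is strictly larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))

# Toy data, invented for illustration only (not taken from the paper).
lengths = [120, 150, 300, 340, 600, 650, 1200, 1500]  # reasoning tokens
pbs = [0.01, 0.02, 0.02, 0.04, 0.05, 0.05, 0.07, 0.09]

means = pbs_by_length_quartile(lengths, pbs)
print("PBS by quartile:", means)
print("aggregate PBS:", mean(pbs))
print("monotonic:", is_monotonic_increasing(means))
```

A low aggregate mean can coexist with a noticeably higher value in the last bucket, which is why the quartile view is worth reporting alongside the headline number.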

The paper concludes that reasoning‑capable models can’t be assumed order‑robust in multiple‑choice evaluations and offers a diagnostic toolkit—PBS, commitment change point, effective switching, and truncation probes—to audit these hidden preferences.

We test this on position bias in multiple-choice QA and find a different story: within any reasoning-capable model, per-question position bias scales with the length of the reasoning trajectory. Across thirteen reasoning-mode configurations (two R1-distilled 7-8B models, two base models prompted with CoT, and DeepSeek-R1 at 671B) on MMLU, ARC-Challenge, and GPQA, twelve show a positive partial correlation between trajectory length and Position Bias Score (PBS) after controlling for accuracy, ranging from 0.11 to 0.41 (all p < 0.05). All twelve open-weight reasoning-mode configurations show monotonically increasing PBS across length quartiles.

More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models - ArXiv AI (cs.AI)
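For readers unfamiliar with "partial correlation after controlling for accuracy", the sketch below uses the textbook first-order formula built from Pearson correlations. This is an assumption made for illustration: the excerpt does not say which estimator the authors used (a rank-based or regression-based variant is equally plausible), and the toy data is invented.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def partial_corr(x, y, z):
    """Correlation of x and y with the linear effect of z removed.

    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
    """
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Toy per-question records, invented for illustration only.
lengths = [100, 200, 300, 400, 500, 600]        # trajectory length
pbs = [0.01, 0.03, 0.02, 0.05, 0.04, 0.07]      # per-question PBS
correct = [1, 1, 0, 1, 0, 0]                    # 1 = answered correctly

r = partial_corr(lengths, pbs, correct)
print(f"partial correlation (length vs. PBS | accuracy): {r:.3f}")
```

The point of the control variable is that harder questions tend to produce both longer trajectories and more errors; the partial correlation asks whether length still tracks PBS once that shared dependence on accuracy is removed.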

Why this matters

We see that longer chain‑of‑thought sequences do not automatically neutralize heuristic shortcuts; instead, per‑question position bias rises as the reasoning trajectory lengthens. The pattern holds in twelve of the thirteen reasoning‑mode configurations tested, which include two R1‑distilled 7‑8B models and two base models prompted with CoT. For developers, the finding suggests that simply adding more “thinking” steps may introduce a new source of error, especially in multiple‑choice QA where answer order can sway outcomes.

Founders should question whether scaling model size or prompting depth will yield cleaner decisions without additional safeguards. Researchers are left with an open question: is the bias an artifact of the training data, the prompting style, or an inherent property of extended reasoning paths? It is unclear whether alternative architectures could decouple length from positional preference.

Until we understand the mechanism, we may need to monitor answer ordering more closely and consider bias‑mitigation strategies as a standard part of model evaluation.
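One generic way to monitor answer ordering is to show the same question under every cyclic rotation of its options and check whether the model keeps choosing the same content. The sketch below is an illustrative audit, not the paper's PBS or its truncation probe; `answer_fn` is a hypothetical callable that stands in for a real model call and returns the index of the chosen slot.

```python
from collections import Counter

def rotations(options):
    """Yield every cyclic rotation of the option list."""
    for k in range(len(options)):
        yield options[k:] + options[:k]

def audit_order_robustness(question, options, answer_fn):
    """Ask the same question under each rotation and tally the choices.

    Assumes the option strings are unique.
    """
    picked_content = Counter()   # which option text was chosen
    picked_position = Counter()  # which displayed slot was chosen
    for shown in rotations(options):
        slot = answer_fn(question, shown)
        picked_position[slot] += 1
        picked_content[shown[slot]] += 1
    majority, votes = picked_content.most_common(1)[0]
    return {
        "majority_answer": majority,
        "agreement": votes / len(options),  # 1.0 means order-robust
        "position_counts": dict(picked_position),
    }

# Two stub "models", invented for illustration only.
def always_last(question, shown):
    return len(shown) - 1          # purely positional behavior

def picks_four(question, shown):
    return shown.index("4")        # follows content, ignores position

options = ["3", "4", "5", "22"]
biased = audit_order_robustness("2 + 2 = ?", options, always_last)
robust = audit_order_robustness("2 + 2 = ?", options, picks_four)
print(biased)
print(robust)
```

Taking the majority answer across rotations is one simple mitigation, at the cost of one model call per rotation; low agreement on a question flags it for closer inspection.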
