
Reasoning models top all three CFA exam levels despite verbosity bias


Why does a language model’s triumph on the CFA exams matter beyond headline numbers? While the raw scores suggest AI can now outpace human test‑takers across Levels I, II, and III, the methodology behind those results raises questions. The researchers measured performance using the same pass thresholds that have guided candidates for years: Level I demands at least 60 percent per topic and 70 percent overall, while Level II requires at least 50 percent per topic and 60 percent overall.
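In code, those pass rules reduce to two checks: every topic must clear its per‑topic floor, and the aggregate must clear the overall floor. The sketch below is illustrative only; the topic names and scores are hypothetical, and it assumes the overall score is an unweighted mean of topic scores, which the article does not specify.

```python
# Minimal sketch of the Level I/II pass rules described above.
# Assumes the overall score is an unweighted mean of topic scores;
# the topic names and scores here are hypothetical.

LEVEL_RULES = {
    "I":  {"per_topic": 0.60, "overall": 0.70},
    "II": {"per_topic": 0.50, "overall": 0.60},
}

def passes(level: str, topic_scores: dict[str, float]) -> bool:
    """True if every topic clears its floor and the mean clears the overall floor."""
    rule = LEVEL_RULES[level]
    overall = sum(topic_scores.values()) / len(topic_scores)
    return (all(s >= rule["per_topic"] for s in topic_scores.values())
            and overall >= rule["overall"])

# Hypothetical Level I candidate: strong overall, one weaker topic.
scores = {"ethics": 0.85, "quant": 0.90, "derivatives": 0.62}
print(passes("I", scores))  # True: every topic >= 60%, mean 79% >= 70%
```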

Yet the scoring rubric rewards lengthier explanations, a factor that could inflate a model’s apparent competence. Here’s the thing: if detailed, verbose responses are systematically favored, the reported success may reflect a measurement artifact rather than genuine financial reasoning ability. The study notes this introduces measurement errors and a possible “verbosity bias” where detailed answers get higher scores.
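One way to probe for that artifact would be to test whether awarded marks track answer length independently of content. The sketch below is purely illustrative: the length and score pairs are invented, and the study does not describe using this diagnostic.

```python
# Illustrative verbosity-bias probe: does the awarded score correlate
# with answer length? All data points here are invented.
from statistics import correlation  # available in Python 3.10+

lengths = [120, 250, 400, 610, 820]       # answer lengths in words (hypothetical)
scores = [0.55, 0.62, 0.71, 0.78, 0.83]   # grader-awarded scores (hypothetical)

# A strong positive correlation is consistent with, though not proof of,
# graders rewarding length rather than substance.
print(f"length-score correlation: {correlation(lengths, scores):.2f}")
```

A rigorous version would also control for correctness, since longer answers may genuinely contain more correct material.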




Level III requires an average of at least 63 percent across multiple-choice and constructed-response sections.

Passing a test doesn't mean doing the job

The researchers say the results suggest "reasoning models surpass the expertise required of entry-level to mid-level financial analysts and may achieve senior-level financial analyst proficiency in the future." While LLMs had already mastered the "codified knowledge" of Levels I and II, the latest generation is now developing the complex synthesis skills required for Level III.
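The Level III cut-off reported above also has a different shape from Levels I and II: a single average across the two section formats, with no per-topic floor reported. A minimal sketch, again assuming an unweighted mean and using hypothetical section scores:

```python
# Level III rule as reported: mean of the two section scores >= 63 percent.
# The unweighted average and the scores below are assumptions for illustration.
def passes_level_iii(mc_score: float, essay_score: float) -> bool:
    return (mc_score + essay_score) / 2 >= 0.63

print(passes_level_iii(0.70, 0.58))  # True: the mean is 64%
```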


Can a model really earn a CFA charter? The study shows reasoning models now clear all three exam levels, with Gemini 3.0 Pro hitting a 97.6 percent score on Level I. That figure surpasses the 70 percent overall threshold and the 60 percent per‑topic minimum required to pass.

The models' Level II results also meet the respective cut‑offs of 60 percent overall and 50 percent per topic. Yet the authors warn of measurement errors and a possible verbosity bias: longer, more detailed answers tend to receive higher marks, which could inflate scores. The pass thresholds themselves stem from prior work, not from the exam board's official standards.

Consequently, it remains unclear whether the models would succeed under a stricter grading regime or with tighter time limits. The results demonstrate that current reasoning models can mimic the knowledge and analytical steps tested by the CFA curriculum, but the study does not address practical constraints such as ethical considerations or real‑world decision making. Further validation is needed before declaring these systems equivalent to human charterholders.


Common Questions Answered

How did reasoning models perform on CFA Level I relative to the established pass thresholds?

The study reports that Gemini 3.0 Pro achieved a 97.6 percent score on Level I, comfortably exceeding the required 70 percent overall and the 60 percent per‑topic minimum. This performance indicates that the reasoning model not only passed but did so with a substantial margin over the official thresholds.

What is the "verbosity bias" identified in the research, and how might it influence exam scores?

Verbosity bias refers to the tendency for longer, more detailed answers to receive higher marks, regardless of their actual correctness. The authors warn that this bias can introduce measurement errors, inflating scores for models that produce expansive responses.

Which pass thresholds were applied to evaluate CFA Level II, and did the reasoning models meet them?

For Level II the study used the traditional cut‑offs of at least 50 percent per topic and 60 percent overall. The reasoning models satisfied both criteria, indicating they cleared Level II according to the same standards applied to human candidates.

Why do the authors caution that passing the CFA exams does not necessarily mean a model can perform the job of a chartered analyst?

The authors note that passing scores may be affected by measurement errors and the verbosity bias, which do not reflect real‑world analytical competence. Consequently, a high exam score does not guarantee that the model possesses the practical skills required of a working charterholder.
