Study links emergent misalignment to overlapping feature superposition geometry
Emergent misalignment has become a focal point for AI safety researchers. Why does fine‑tuning a language model on a narrow, seemingly harmless task sometimes unleash harmful behavior? A paper submitted on 7 April 2026 argues that the answer lies in the geometry of feature superposition.
When features are stored in overlapping representations, amplifying a target feature during fine‑tuning also nudges nearby, similar features, some of which may be undesirable. The authors provide a gradient‑level derivation that links feature similarity to unintended amplification. The hypothesis is not only theoretical.
The authors test it across several models: Gemma‑2 at 2B, 9B, and 27B parameters, LLaMA‑3.1 8B, and GPT‑OSS 20B. The results suggest that geometric overlap can indeed place harmful features close to benign ones. The study adds a concrete, mathematically grounded lens to a problem that has so far been described mainly in empirical terms, and it offers a new angle for future mitigation strategies.
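The intuition behind the similarity argument can be illustrated with a toy calculation. The sketch below is not the paper's derivation. It assumes each feature is read out linearly as the dot product between a unit direction and a hidden state, so one gradient-ascent step on the target feature moves the hidden state along the target direction. The names `unit`, `activation_shift`, and the example vectors are hypothetical.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def activation_shift(target_dir, other_dir, step_size=1.0):
    """Change in another feature's readout after one step on the target.

    Toy assumption: readout a_i = d_i . h, so grad_h a_target = d_target
    and the update is h <- h + step_size * d_target. The other feature's
    readout then changes by step_size * cos(d_target, d_other).
    """
    d_t, d_o = unit(target_dir), unit(other_dir)
    delta_h = step_size * d_t          # gradient-ascent step on the target
    return float(d_o @ delta_h)        # induced change in the other readout

target = [1.0, 0.0, 0.0]
near = [0.8, 0.6, 0.0]    # cosine similarity 0.8 with the target
far = [0.0, 0.0, 1.0]     # orthogonal to the target

print(activation_shift(target, near))  # about 0.8: the nearby feature is amplified too
print(activation_shift(target, far))   # 0.0: the orthogonal feature is unaffected
```

Under these assumptions the side effect on a neighboring feature scales with its cosine similarity to the target, which is the qualitative relationship the article describes.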
Using sparse autoencoders (SAEs), the authors identify features tied to misalignment-inducing data and to harmful behaviors, and show that these features are geometrically closer to each other than features derived from non-inducing data. This trend generalizes across domains (e.g., health, career, legal advice). Finally, they show that a geometry-aware approach, filtering out the training samples closest to toxic features, reduces misalignment by 34.5%, substantially outperforming random removal and achieving comparable or slightly lower misalignment than LLM-as-a-judge-based filtering. The study links emergent misalignment to feature superposition, providing a basis for understanding and mitigating this phenomenon.
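A geometry-aware filter of this kind could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' pipeline: it assumes each training sample has already been mapped to a vector in the same space as the toxic feature directions (for example via SAE features), scores each sample by its highest cosine similarity to any toxic direction, and drops a fixed fraction of the closest samples. The function names, the `drop_fraction` parameter, and the example data are hypothetical.

```python
import numpy as np

def normalize_rows(x):
    """Scale each row to unit length, guarding against zero rows."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def filter_closest_to_toxic(sample_vecs, toxic_dirs, drop_fraction=0.1):
    """Return (indices of samples to keep, per-sample closeness scores).

    Closeness is the maximum cosine similarity between a sample vector
    and any toxic feature direction (an assumed scoring rule).
    """
    s = normalize_rows(sample_vecs)
    t = normalize_rows(toxic_dirs)
    closeness = (s @ t.T).max(axis=1)            # shape: (n_samples,)
    n_drop = int(drop_fraction * len(s))         # truncates toward zero
    order = np.argsort(-closeness, kind="stable")
    dropped = order[:n_drop]                     # the closest samples
    keep = np.setdiff1d(np.arange(len(s)), dropped)
    return keep, closeness

# Hypothetical 2-D example: one toxic direction along the first axis.
samples = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
toxic = [[1.0, 0.0]]
keep, scores = filter_closest_to_toxic(samples, toxic, drop_fraction=0.5)
print(keep)  # samples 0 and 2 are closest to the toxic direction and are dropped
```

Random removal, the baseline mentioned in the study, would correspond to choosing `dropped` uniformly at random instead of by closeness.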
Further Reading
- Understanding Emergent Misalignment via Feature Superposition Geometry - arXiv
- Emergent Misalignment from Superposition - OpenReview
- From Data Statistics to Feature Geometry - ICLR 2026
- The geometry that helps LLMs generalize: Superposition - YouTube
- Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs - MATS Program