For years, the AI industry has treated scaling laws as an empirical fact: double the parameters, double the data, and the loss drops in a clean power law. Everyone knew it worked, but nobody could convincingly explain why. That has just changed. A paper presented at NeurIPS 2025 by MIT researchers pins the mechanism down to a phenomenon called superposition, and the result is as elegant as it is consequential.
The core insight is simple. A language model's internal representation space has a fixed width, say a few thousand dimensions. The number of distinct concepts it must represent, from individual tokens to abstract semantic features, runs into the tens of thousands. In a naive system, that math doesn't add up: a space can hold only as many mutually orthogonal vectors as it has dimensions, so a fourth concept in a three-dimensional space has to interfere with the first three. Real LLMs get around this by packing many concepts into the same dimensions, allowing their vector representations to overlap slightly. That overlap is superposition.
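To make the geometry concrete, here is a minimal numpy sketch (the concept counts and widths are illustrative choices, not numbers from the paper): it packs far more random unit vectors than there are dimensions and checks how much they overlap.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_overlap(n_concepts: int, width: int) -> float:
    """Pack n_concepts random unit vectors into a width-dimensional space
    and return the largest pairwise overlap (absolute cosine similarity)."""
    vecs = rng.normal(size=(n_concepts, width))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    gram = np.abs(vecs @ vecs.T)
    np.fill_diagonal(gram, 0.0)  # ignore each vector's overlap with itself
    return float(gram.max())

# Far more "concepts" than dimensions, yet the worst-case overlap stays small
# and shrinks as the width grows.
for width in (64, 256, 1024):
    print(width, round(max_overlap(2000, width), 3))
```

Nothing here is specific to transformers; it is just the high-dimensional geometry that makes slight overlap a workable storage strategy.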
The MIT team, led by Yizhou Liu, Ziming Liu, and Jeff Gore, tested two competing regimes. In weak superposition, only the most common concepts get cleanly represented while rare ones are dropped. That regime produces a power law only if the training data itself follows a power-law distribution, a fragile coincidence. In strong superposition, the model stores every concept by letting each concept's vector overlap with the others, trading clean separation for completeness. The error then comes from the noise of overlapping representations, and it decays as 1/m, where m is the model's width. No special data distribution is needed.
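This is not the paper's model, but a toy numpy sketch of where the 1/m behavior comes from: if concepts are stored as random overlapping directions, the crosstalk that a linear readout picks up from all the other concepts falls roughly in proportion to 1/width.

```python
import numpy as np

rng = np.random.default_rng(1)

def crosstalk(n_concepts: int, width: int) -> float:
    """Mean squared interference when one concept is active and every other
    concept's direction is read out against it."""
    vecs = rng.normal(size=(n_concepts, width))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    readout = vecs @ vecs[0]             # overlap of every direction with the active one
    return float(np.mean(readout[1:] ** 2))

# Interference noise decays roughly as 1/width.
for width in (128, 256, 512, 1024):
    print(width, f"{crosstalk(5000, width):.5f}")
```

Doubling the width halves the printed interference, which is exactly the 1/m signature the strong-superposition regime predicts.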
When the researchers examined real open-source models like OPT, GPT-2, Qwen2.5, and Pythia, spanning roughly 100 million to 70 billion parameters, every single one operated in the strong superposition regime. The measured scaling exponent of 0.91 sits close to the exponent of 1 that the 1/m theory predicts, and DeepMind's Chinchilla data aligns at 0.88. The empirical scaling law that has driven the entire LLM race is not a lucky accident of data distribution; it is a direct consequence of how these models geometrically organize meaning.
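Measuring that kind of exponent is just a straight-line fit in log-log space. The sketch below uses synthetic loss-versus-width points with a known exponent rather than real evaluation numbers, but the procedure is the same one you would apply to measurements from an actual model family.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic points generated to follow loss ∝ 1/width with mild noise,
# standing in for measurements collected across a model family.
widths = np.array([512, 1024, 2048, 4096, 8192])
losses = 50.0 / widths * np.exp(rng.normal(0.0, 0.03, size=widths.shape))

# Fit loss ≈ C * width^(-alpha) by linear regression on the logs.
slope, intercept = np.polyfit(np.log(widths), np.log(losses), 1)
print(f"fitted scaling exponent: {-slope:.2f}")  # ≈ 1.0 by construction
```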
For builders, this has concrete implications. First, scaling has a natural bound: once model width matches the vocabulary size, there is enough room to represent each token without overlap, and the error from superposition vanishes. The power law breaks at that point. Second, in domains where concept frequencies are extremely skewed (think scientific literature with rare but critical terms), you might see steeper scaling curves than in natural language. That is a practical lever for specialized models. Third, architectures that encourage tighter packing, like Nvidia's nGPT, which normalizes vectors onto a unit sphere, should yield better performance at the same parameter count.
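The hypersphere idea is easy to sketch. The snippet below is not Nvidia's nGPT code, only a minimal numpy illustration of the design choice it embodies: keep every hidden state on the unit sphere so that representations compete over angles rather than magnitudes.

```python
import numpy as np

def to_unit_sphere(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Project each hidden-state vector onto the unit hypersphere so concept
    directions are distinguished by angle alone, not by norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

# Toy residual update: take a step, then renormalize back onto the sphere.
rng = np.random.default_rng(3)
hidden = to_unit_sphere(rng.normal(size=(4, 512)))   # batch of 4 states, width 512
update = 0.1 * rng.normal(size=(4, 512))             # stand-in for an attention/MLP output
hidden = to_unit_sphere(hidden + update)
print(np.linalg.norm(hidden, axis=-1))               # all ≈ 1.0
```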
There is a serious catch, however. The more densely concepts are superimposed, the harder it becomes to disentangle them. Mechanistic interpretability researchers already struggle to trace model internals, and strong superposition makes that job dramatically harder. This is not an academic concern. If we cannot reliably inspect what a model is doing, we cannot guarantee its safety. The MIT paper gives us a mechanistic understanding of why scaling works, but it also hands the alignment community a clearer target: we need new methods that can reverse superposition and isolate individual features.
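One existing line of attack is the sparse autoencoder: train an overcomplete dictionary on a model's activations with a sparsity penalty so that superposed directions get pulled apart into individually inspectable features. The sketch below is a generic, minimal version of that idea (the width, feature count, and penalty weight are placeholders), not a method from the MIT paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Expand model activations into an overcomplete feature basis and
    penalize how many features fire at once, so each learned feature
    ideally captures a single superposed concept."""
    def __init__(self, width: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(width, n_features)
        self.decoder = nn.Linear(n_features, width)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse, non-negative feature codes
        return self.decoder(features), features

sae = SparseAutoencoder(width=512, n_features=4096)
acts = torch.randn(8, 512)                          # stand-in for residual-stream activations
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()  # reconstruction + L1 sparsity
```

Whether approaches like this can keep up with the feature density of frontier-scale models is exactly the open question the paper sharpens.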
My take is straightforward. This paper is the most important theoretical result about LLMs since the original scaling laws paper itself. It transforms scaling from a brute-force empirical observation into a consequence of geometry. That matters because it tells us where the ceiling is, how to get there faster, and what we lose along the way. The next generation of model design should treat superposition as a first-class architectural principle, not a bug to be ignored. But the price of that density is interpretability, and we had better start paying it now.