Social Science Research Council Research AMP Just Tech
Citation

Simple picture of how output from ChatGPT-like AI shifts from good to bad

Author: Johnson, Neil F; Huo, Frank Yingjie
Publication: PNAS Nexus
Year: 2026

Generative language models can drift mid-generation from reliable, helpful continuations to undesirable, misleading or unsafe ones. Such shifts can be hard to notice because the early part of a response can be fluent and correct, so downstream users (or automations) may not trigger checks until after harm occurs. Here we isolate a minimal, first-principles mechanism for such “good-to-bad” output shifts. The mechanism originates within a single effective attention head, where the dot products of embedded vectors drive a competition between output basins; mapping to a multispin thermal system yields a closed-form expression for the shift. Crucially, multi-layer processing does not wash out this single-head mechanism; it amplifies it. Tracking hidden states through all layers of open-weight models, we find that the competing dot-product gaps grow by up to 617× from the first layer to the penultimate layer, and that successive layers fuse competing basins into a shared geometric subspace, constructing precisely the configuration the single-head formula assumes. A cross-architecture study on six independently trained models confirms that the resulting closed-form expression, with zero free parameters, correctly predicts the output-shift regime in 94% of non-ambiguous cases. Our physical picture is of course extremely simplified (just as a paper plane cannot capture all the details of modern aviation, and a simple model of an atom cannot capture all the details of a complex material), but it helps fill a current need for generative AI explainability, and it provides a simple yet concrete platform for discussing how training, fine-tuning and prompts might shift generative AI output.
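
The dot-product competition described in the abstract can be illustrated with a toy sketch. This is not the paper's closed-form expression, which the abstract does not state. It assumes, purely for illustration, a single head that attends uniformly to one prompt vector and to every "good" token generated so far, and that emits whichever of two candidate vectors (good or bad) has the larger dot product with that averaged context. The function names and the 2-D example vectors are hypothetical.

```python
"""Toy illustration only: NOT the paper's formula. All names and vectors are hypothetical."""


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def context_after(prompt, good, n_good):
    """Assumed uniform-attention context: mean of one prompt vector and n_good good vectors."""
    total = 1 + n_good
    return [(p + n_good * g) / total for p, g in zip(prompt, good)]


def good_tokens_before_shift(prompt, good, bad, max_steps=100):
    """Greedy generation: count good tokens emitted before the first bad one.

    Returns None if no bad token is emitted within max_steps.
    """
    for n in range(max_steps):
        c = context_after(prompt, good, n)
        if dot(c, bad) > dot(c, good):  # the bad basin wins the competition
            return n
    return None


def toy_tipping_point(prompt, good, bad):
    """Real-valued n at which the dot-product gap changes sign in this toy setup.

    gap(n) is proportional to prompt.(good - bad) + n * good.(good - bad),
    so gap(n) = 0 at n = prompt.(good - bad) / good.(bad - good).
    Returns None when this toy model has no good-to-bad shift.
    """
    diff = [g - b for g, b in zip(good, bad)]
    numerator = dot(prompt, diff)    # initial gap contributed by the prompt
    denominator = -dot(good, diff)   # how much each new good token erodes the gap
    if numerator <= 0 or denominator <= 0:
        return None
    return numerator / denominator


# Hypothetical vectors chosen so the prompt favours "good" but good.bad > good.good.
PROMPT = [1.0, -2.5]
GOOD = [1.0, 0.0]
BAD = [1.2, 0.6]

if __name__ == "__main__":
    print("good tokens before shift:", good_tokens_before_shift(PROMPT, GOOD, BAD))
    print("toy tipping point: %.2f" % toy_tipping_point(PROMPT, GOOD, BAD))
```

In this toy setup the output starts in the good basin because the prompt's dot product favours it, and each additional good token dilutes that advantage until the sign of the gap flips. The simulated count matches the first integer above the real-valued tipping point, loosely mirroring the idea of predicting the shift from dot products alone.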