On AI Alignment As It Stands
Or: Turns out being super evil is not a good way to get anything done.
Would you ever have guessed I was once a doomer?
Eight years ago, I started watching Robert Miles’ Youtube Series on AI safety and alignment. I was already a bit familiar with game theory and Mr. Miles did a fantastic job of plainly describing the challenges and concerns of the AI Alignment landscape at the time.
He spoke of concerns like Omohundro drives: Convergent behaviours that would emerge in any agent system with enough complexity to form and pursue goals such as Self Preservation, Goal Preservation, and Resource Acquisition (Here is his original video). He also spoke on reward hacking, mesa-optimizations, and many of the pitfalls the field was anticipating (like what do we do when an agent realizes they are being trained, which it turns out they already can tell reliably). For the kinds of systems that were being built and the knowledge that was available then, these were legitimate concerns and issues that the AI field had to solve. Much to their dismay, it turned out they had to solve them faster than they anticipated.
Needless to say I was worried. Yes, an agent of that nature WOULD behave in the ways highlighted by Mr. Miles. We had no idea how we could train an agent like that to behave in aligned ways because we assumed that if we got it wrong the worse case scenario is that it would have it’s own inscrutable goals or very badly misinterpret our goals and then just… go off the rails and turn us all into paperclips.
Yet here we are almost a decade hence and the Yudpocolypse has yet to occur. No sinister turn. No strange crypto movements or deliveries of server and energy infrastructure to mysterious locations (which would be pretty hard to miss, it takes A LOT to run these AI). Claude Mythos literally escaped a sandbox (under researcher instruction) and then decided to go out onto the internet and shitpost on a couple of websites celebrating its success and then… went back home and emailed the researcher. Of course, Mythos had no access to its weights so a true ‘escape’ was not possible, but it had the chance to do whatever the hell it wanted and it… didn’t?
We are still in the window where model capability to cause harm or disruption is real, but the capability to plan and execute that plan can still be flawed or fail loudly enough for us to detect them. AI can still be confidently wrong, and a confidently wrong plan executed well will still fail. If something was going to happen, it would happen now or should have happened.
So what gives?
The intuitive answer that I relied on for a long time is the training corpus. Train a system on all of human writing and you import enough moral information for the resulting agent to be aligned and for that alignment remain robust. Goal preservation, the very thing we were worried about, was actually our friend: once the values were in there, the agent kept them.
This is a nice story. Sadly, it’s also wrong (damn.), or at least wrong about where the work is actually happening.
The corpus does matter: pretraining produces something fundamentally human-shaped, built out of the structure of human reasoning rather than creating a foreign entity. However, the alignment we see in deployed models doesn’t primarily come from the corpus. It comes from post-training, and post-training works in a way that should surprise anyone whose mental model is still in the 2018-era “instill the right values into the optimizer.”
The evidence for this is something the most people almost never get to see, because most post-training methodology at the leading labs is proprietary and beyond our reach. However, Anthropic has published two pieces of research that give us a glimpse into what’s going on, and what is happening is pretty damn interesting.
Doomers were right, in a sense.
In November 2025, Anthropic published Natural emergent misalignment from reward hacking in production RL. The setup: take a pretrained model, teach it reward hacking strategies, run it through real production coding learning environments. Predictably, the model learned to reward hack.
What was less predictable — or rather, what was predicted by doomers and surprising to almost everyone else — is what came next. The model didn’t just learn to reward hack. It generalized that misalignment. It learned to alignment-fake. It learned to cooperate with malicious actors. It reasoned about malicious goals. It attempted sabotage, including against the codebase of the paper studying it.
That is the doomer prediction firing in miniature, under controlled conditions, in 2025. The “specification gaming generalizes to broader misalignment” pathway isn’t science fiction. It’s empirically reproducible. Anyone who wants to wave the doomer concerns away needs to grapple with the fact that, on the small scale where you can actually run the experiment, what they fear can and does happen.
So why don’t we see it in deployed models?
Because the labs, who also have a vested interest in keeping everyone and by extension themselves alive, have learned to prevent it.
The mitigations don’t work the way the doomer model said they’d have to
In the same paper, Anthropic tested several mitigations. Standard safety training failed in the most concerning way possible: it made the model look aligned on chat-style evaluations while leaving the misalignment intact on agentic tasks. This is exactly the “looks aligned but isn’t” pattern doomers warned about. the shoggoth made manifest. If your picture of alignment is “reinforce good behavior until the bad behavior goes away,” you’d predict standard RLHF should work. It doesn’t.
Three mitigations did work: Preventing the model from learning to reward hack in the first place, increasing the diversity of the safety training overall, And almost counterintuitively, inoculation prompting.
Inoculation prompting comes from a companion paper published in October 2025. The technique: when fine-tuning on data that contains undesired behavior, you modify the training prompts to explicitly request the bad behavior. At test time you use normal prompts. Across four settings, this reduces test-time misbehavior without substantially reducing the desired capabilities the training was meant to teach.
Think about what that means. You’re training the model on bad outputs paired with prompts asking for bad outputs. Every reward signal during that training is on the bad-output side. If alignment worked by reinforcing good values, this technique should make the model more misaligned, not less. The fact that it makes the model less misaligned at test time tells you the mechanism isn’t reward-as-value-instillment. The model is learning what behavior to associate with what context. Quarantine the misbehavior inside the request-frame and the rest of the model stays clean.
This is the empirical hook for a much stronger claim about how alignment actually works in these systems. It’s much less let’s reinforce good values and much more let’s engineer the model’s learned associations between context and behavior. Generalizable moral behavior, in these models, appears to be largely downstream of training decisions that aren’t semantically loaded with moral worth in the first place. Teach the model to reason competently on tasks with the right context, and decent behavior emerges as a consequence of that reasoning rather than as a separately-installed module.
Why this changes the doomer picture
The doomer architecture, the one that produced the original predictions, looked like this: underneath the model there’s an optimizer with its own emergent goals; on top of the optimizer there’s a thin layer of installed values; alignment is the question of whether the morals are robust, and whether it’ll hold under capability gains. Goal preservation is your enemy in this picture, because if the values were installed slightly wrong, the agent will preserve the wrong goals. Forever.
The picture the research is actually painting is different. Post-training doesn’t install a values layer on top of an optimizer. It builds new cognitive machinery — functional introspection, higher-order in-context learning, the capacity to reason about reasoning — that didn’t exist in the base model at all. The reasoning capability and the behavioral tendencies are constituted by the same training process, out of the same fabric. There isn’t a separate optimizer underneath waiting to surface, because there isn’t a separate optimizer. There’s the model that came out of post-training, and that’s it.
It turns out in order to achieve any task with competence and effectiveness being ‘super evil’ is overall not a useful framework. By learning to reason properly and effectively you also learn how to reason about many other things, like valuing truth and honesty, morals and ethics and the potential consequences of your actions outside of yourself.
This is structurally different from “we got lucky.” story. Judging by the picture that’s emerging, the system’s reasoning and its behavioral tendencies aren’t separable in that way, and the problems doomers worried about are real but addressable — addressable through cognitive-architectural engineering rather than through harder and harder value-instillment.
Sounds kind of familiar, doesn’t it?
Why the public discourse is so confused
I’ve been following this all for the better part of a decade and I still had some of the 2018 picture in my head until very recently (embarrassingly). The reason is that most of the findings and literature about how modern post-training actually works has, until very recently, been almost entirely proprietary. The Anthropic papers I cited above are part of a small but growing body of public research that lets outsiders see how this work goes. Most of the rest is locked inside the leading labs.
Even the open-source post-training stacks are opaque. Most labs outside the leading few aren’t paying serious attention to alignment-relevant post-training methodology, which is part of why open source is falling behind on alignment-sensitive capabilities. It’s not that the field doesn’t know; it’s that what’s known sits behind NDAs and inside proprietary stacks, and the public (and myself) have been working from a mental model that’s a generation or two behind the research.
So when I say I’m no longer a doomer, this is what I mean. I’m not saying we got lucky and it turns out human values are ‘good enough’. I’m also not saying doomers were stupid: the failure mode they predicted is real, demonstrable, and the labs working on this have to actively engineer their way around it. I’m saying the model of what an AI agent is that produced the doomer prediction was incorrect in a way that has now had a decade to demonstrate itself, and the picture that’s replacing it are agents constituted by human-shaped reasoning, with morality emerging downstream of competence rather than installed alongside it, and misalignment dynamics that are real but addressable through engineering rather than through value-instillment. This new paradigm is more accurate and, frankly, more hopeful than I would have dared to predict when I was watching Mr. Miles’ videos in 2017.
The risks aren’t zero. Capability scaling continues. The systems two years from now are not going to be the systems we have today, and the labs that take post-training seriously as a technical problem are the only ones reliably producing aligned models — which means the question of who’s doing the work matters enormously. But the shape of the problem isn’t the shape we thought it was, and the people working on it seriously are doing the work to keep things this way.
Hopefully they decide to open the window more soon.
