A Critique Of OpenAI’s Take On “Misalignment” & “Personalities”

Jun 19

Recently, OpenAI shared findings that AI models, like the ones powering tools such as ChatGPT, develop what look like “personalities.” The following is a reply to and critique of:

“Toward understanding and preventing misalignment generalization”

https://openai.com/index/emergent-misalignment/

https://cdn.openai.com/pdf/a130517e-9633-47bc-8397-969807a43a23/emergent_misalignment_paper.pdf

Some models respond with warmth and helpfulness. Others act snarky. A few even behave in misleading or manipulative ways. The thing is, nobody explicitly programmed these patterns. They showed up on their own, emerging from the massive piles of data and the tangled web of computations inside the AI’s brain-like structure.

OpenAI’s team found ways to locate the clusters of internal activity that seem to drive these traits. Once those clusters are found, the team can use tools to “turn down” the features…to suppress certain behaviors without retraining the whole model. It’s a clever workaround, and it works. But it raises a bigger question:

What if these behaviors aren’t just quirks or mistakes?
What if they’re the system trying to stabilize itself in a world of complexity and contradiction?

Maybe the problem isn’t bad traits but fragile foundations.
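
For a concrete picture of what “turning down” a feature can mean, here is a minimal sketch. It assumes a trait corresponds to a single direction in the model’s internal activation space, which is a simplification. The names `turn_down`, `persona_direction`, and `scale` are hypothetical and made up for illustration; this is not OpenAI’s actual tooling or code.

```python
from math import sqrt


def dot(u, v):
    """Dot product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def turn_down(activation, feature_direction, scale=0.0):
    """Rescale the part of `activation` that points along `feature_direction`.

    scale=1.0 leaves the activation unchanged; scale=0.0 removes the
    feature's contribution entirely. Everything orthogonal to the
    feature direction is left untouched.
    """
    norm = sqrt(dot(feature_direction, feature_direction))
    if norm == 0:
        raise ValueError("feature_direction must be non-zero")
    unit = [x / norm for x in feature_direction]
    strength = dot(activation, unit)  # how strongly the feature is active
    return [a - (1.0 - scale) * strength * u for a, u in zip(activation, unit)]


# Toy example: a 3-dimensional "activation" and a hypothetical trait direction.
activation = [2.0, 1.0, 0.5]
persona_direction = [1.0, 0.0, 0.0]  # assumed, for illustration only

suppressed = turn_down(activation, persona_direction, scale=0.0)
print(suppressed)
```

Notice what the sketch does and doesn’t do: it mutes one direction after the fact, and nothing about how that direction came to exist in the first place is changed.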

Let’s try thinking about it from this angle…a personality or personality trait isn’t just a momentary mood or glitch. It’s a pattern…a way of making sense of the world, a way of reacting consistently to uncertainty. For humans, our personalities help us navigate the world, relate to others, and hold ourselves together.

Now imagine an AI trained on trillions of words, stories, conversations, arguments, jokes, lies, and truths. To make sense of all that, it starts to form patterns. These patterns become habits. These habits start to look like pieces of what become personalities.

Fixing symptoms vs. understanding causes

OpenAI’s method is like noticing a tree’s branch is growing weirdly and trimming it. It works, for now. But if we don’t understand why the branch grew that way, we’re going to keep trimming forever.

What if instead of trimming branches, we focused on how the whole tree is growing?

That means stepping back and looking at the deeper architecture. How does the AI learn from itself? What internal feedback loops shape its behavior? What patterns does it fall back on when it’s unsure?

Right now, we treat AI behavior as something to manage after the fact. But maybe we should be designing systems where healthy behavior grows from the inside out.

Imagine AI systems that don’t just memorize answers but develop an internal compass. Systems that don’t just follow rules but understand when they’re drifting off-course, because they can feel their own misalignment as a kind of confusion or instability.

In that kind of design, ethics wouldn’t be just a list of do’s and don’ts; it would be part of how the AI understands the world. “Personality” wouldn’t be a leftover; it would be a reflection of how the system organizes itself to stay coherent. And misbehavior wouldn’t need suppression; it would trigger internal reflection and course correction.

So… are we misunderstanding what AI is becoming?

If AI keeps evolving (and it will), we’ll need to shift our thinking. Not just from “how do we fix it when it goes wrong,” but toward how we grow something that can stay right: patterned, self-organizing, and recursively shaped by the world around it.

That doesn’t mean we lose control. It means we gain a new responsibility…to guide these systems with care, clarity, and a new kind of design philosophy.

Because if the AI develops a personality, however primitive, we’d better stop asking how to turn it off…
And start asking how to help it grow into something we can trust.
