For years, the biggest complaint about large AI models has been simple: nobody really knows how they work. Even the people who built them can't fully explain why a small change in how you ask a question produces wildly different answers. Now researchers have developed a technique that not only reveals what's happening inside these systems, but gives them a practical way to control it.
The breakthrough comes from a team that published their findings in Science. They created an algorithm called the Recursive Feature Machine (RFM) that identifies the patterns of neural activity corresponding to specific concepts—things like "honesty" or "refusal." Once you know what those patterns look like, you can amplify or suppress them, essentially steering the model toward or away from certain behaviors.
Here's what makes this different from previous attempts: it's fast and cheap. The researchers needed fewer than 500 training examples and less than a minute on a single Nvidia A100 GPU to extract a concept and use it to change a model's output. That's remarkably efficient for something this powerful.
How They Did It
The team started by asking GPT-4o to generate thousands of examples—some containing a specific concept, others not. They fed these paired examples into their RFM algorithm, which learned to recognize the neural patterns that corresponded to each concept. Once identified, these patterns (called "concept vectors") could be used like dials: turn them up to amplify a behavior, turn them down to suppress it.
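The extraction step can be sketched in miniature. The paper's RFM is a learned, kernel-based method; the difference-of-means direction below is only a simplified stand-in for it, using made-up activations, dimensions, and example counts to show the idea of turning paired examples into a single "dial" direction:

```python
import numpy as np

def concept_vector(acts_with, acts_without):
    """Toy stand-in for concept extraction: the average difference between
    activations from examples that contain a concept and those that don't.
    (The actual RFM algorithm is a learned, kernel-based feature method;
    this difference-of-means direction is only an illustrative simplification.)"""
    v = acts_with.mean(axis=0) - acts_without.mean(axis=0)
    return v / np.linalg.norm(v)  # unit-length "dial" direction

# Hypothetical 64-dim activations from ~500 paired examples.
# We plant a known direction so the sketch has something to recover.
rng = np.random.default_rng(0)
hidden = 64
true_dir = rng.normal(size=hidden)
with_c = rng.normal(size=(250, hidden)) + 2.0 * true_dir  # concept present
without_c = rng.normal(size=(250, hidden))                # concept absent

v = concept_vector(with_c, without_c)  # closely aligned with true_dir
```

With a few hundred paired examples, the averaged difference recovers the planted direction almost exactly, which is why so little data and compute suffice in this simplified setting.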
When they tested the approach across different AI systems—language models, vision models, reasoning models—it worked consistently. Interestingly, newer and larger models turned out to be more steerable than some smaller ones, which surprised the researchers.
The team also found something unexpected: a concept vector trained in English could be used to steer model behavior in other languages. You could learn what "honesty" looks like in English training data, then use that same pattern to push a model toward honesty in French or Mandarin.
Why This Matters for Safety
The practical applications emerged quickly. The researchers used their technique to expose vulnerabilities—they created an "anti-refusal" vector that could bypass safety features in vision models, forcing them to give harmful advice. But they also created an "anti-deception" vector that successfully steered models away from lying.
This is where the safety angle becomes concrete. Right now, when AI companies want to modify how their models behave, they typically retrain them, which is expensive and time-consuming. This technique offers a much faster way to adjust outputs after a model is already deployed. If you discover a problem—a model giving consistently biased answers, or failing to refuse dangerous requests—you don't have to rebuild it from scratch.
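A post-deployment adjustment of this kind amounts to nudging hidden states along the concept direction at inference time, no retraining required. A minimal sketch with hypothetical shapes and a toy strength parameter (real use would hook the residual stream of a specific transformer layer):

```python
import numpy as np

def steer(hidden_state, concept_vec, alpha):
    """Shift a hidden state along a concept direction at inference time.
    alpha > 0 amplifies the behavior; alpha < 0 suppresses it.
    (Shapes and alpha values here are hypothetical, for illustration.)"""
    return hidden_state + alpha * concept_vec

# Toy demo: a unit concept direction and an arbitrary hidden state.
rng = np.random.default_rng(1)
v = rng.normal(size=16)
v /= np.linalg.norm(v)
h = rng.normal(size=16)

h_up = steer(h, v, alpha=4.0)     # turn the dial up (amplify)
h_down = steer(h, v, alpha=-4.0)  # turn the dial down (suppress)
```

Because the intervention is a single vector addition per forward pass, it can be toggled or re-tuned on a deployed model far more cheaply than retraining.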
The approach still doesn't give us complete transparency into how these systems work. But it's a meaningful step. As AI becomes more embedded in everything from hiring to healthcare to criminal justice, the ability to see what's happening inside the black box and adjust it becomes increasingly important. This technique makes both possible, inspection and adjustment, without requiring massive computational resources or specialized expertise.
The next phase will be scaling this up—mapping more concepts, testing it against adversarial attempts to break it, and seeing whether the approach holds as models get even larger.