What does o1 mean for AI governance?

RLHF takes a backseat; Soliloquies for Safety; governing test-time compute

Sep 27, 2024

To the untrained ear, “product design” sounds like hopelessly bland vocation. It sounds almost like the exact kind of thing that a rare Gen-X hold-out raised on Fight Club might make a “late capitalism”-themed joke about, even (or perhaps precisely) if it was what their job involved.

But of course product designers make consequential decisions all the time — anyone who’s ever used a touchscreen electric hob has felt this frustrating truth first-hand. Now that we have products that can build websites from scratch, that provide companionship to millions every day, and that can even conduct crappy scientific research, the growing jurisdiction of product design looks set to expand even further.

Nowhere is this more obvious than in this month’s release of a new family of AI models under the name o1.

I wrote earlier this week about what distinguishes these new models from existing frontier systems like GPT-4. Let’s go for a quick recap: after learning to predict the next word, these models go through an additional bout of training that rewards them for correctly solving longer problems. The new model talks to itself before answering — the industry term for this is “chain of thought”. This additional training appears to have taught it useful strategies for soliloquising fruitfully. In doing so, the model becomes that bit more scrupulous, and its performance on difficult, reasoning-heavy questions like maths or coding problems vastly improves over the previous generation.

This also appears to have had a qualitative effect on how the models operate: where previous models would struggle to assess their progress on a particular problem or evaluate whether a given approach (eg a suggested solution to a crossword puzzle) was promising, the new o1-family of models exhibits an uncanny ability to pause its “thinking”, re-read its outputs, and spot mistakes.

This enables what feels like a more creative attempt to consider wide ranges of solutions to a problem. Where GPT would often rush into responses that felt obvious, o1 appears to have learned to survey the landscape of plausibility more systematically. It makes for a model that feels vastly more inquisitive.

OpenAI are determined to call o1’s new skills “reasoning”. I remain unconvinced this is the most helpful or enlightening term. The examples we have of how o1 solves problems attest to me to a particular kind of diligence. The model’s thinking is characterised by an ability to pay close attention to both the initial query and its progressing response – its answers are almost structured as a cautiously deliberate series of reflections on earlier tokens in the output stream, each of which serves to check that there isn’t some complication it has missed. Here’s an illustrative example of how it tries to solve the following crossword puzzle: “6 Down: Automatic Planting Machine”.

My take is that the new capabilities are best described as a combination of search and self-correction. These skills appear to matter most on problems that require iterating and exploring over a number of different possible solutions, like Wordle, non-obvious coding problems, and complicated scientific questions.

However we characterise new abilities, they have opened up an entirely new1 dimension for improving the capabilities of LLMs. To an extent greater than before, the performance of o1 on problems appears to scale with the amount of time it is given to solve the problem.2 OpenAI illustrate this with a graph of o1’s success on an International Maths Olympiad.

This finding underpins OpenAI’s suggestion that these are the kinds of models we could eventually use to prove the Riemann hypothesis — if the line keeps going up, why not just give them more time?

All of this should be incredibly exciting to all but inveterate cynics. Mixed with that excitement, however, should be some healthy dose of caution. Every step that chimpanzees take towards inventing homo sapiens should be taken carefully — the release of o1, therefore, calls for a quick judgement of what this new AI paradigm means for our scrambling attempts to govern these models safely.

In no particular order, then, here are what I think the major take-aways for the governance of general-purpose AI are:

1. Catching bad behaviour

An AI that reasons out loud in language we can all read and understand feels like a very lucky way to create an alien intelligence. There’s caveats to this of course – how exactly o1 produces plausible natural language remains utterly mysterious – but there is some real sense in which we can understand how o1 thinks about something. We can see where it makes mistakes, when it changes its minds and when it makes jumps in reasoning that we disagree with.

Critically, this also provides us with very clear ways to catch bad behaviour in the act. o1 is currently not able to do anything without saying that it is going to do so. This is useful as hell for people who want to stop it from doing dangerous things like hacking into computers or developing a biological weapon. As OpenAI put it:

We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allow us to ‘read the mind’ of the model and understand its thought process.

So far, so good! I agree – I want to be able to see my little model reason, and check its working. Awesome news. I’m a fan of this “just read what the model says” agenda, and I think it bears some promise for at least the near-future.

Here’s where it gets complicated:

However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought… Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users.

This bit is difficult to decode. Basically, the assumption seems to be 1) we want to use the chain of thought to monitor model behaviour; and 2) if, during training, we penalise chains of thought that violate our policies, that will teach the model to “hide” its violations so that they are not legible or obvious in natural language (since only violations which are clearly legible will be penalised).

This strikes me as both an interesting and highly consequential product design choice. OpenAI have given us an eager workforce of mute reasoners. They will remain able to monitor for deception and danger (which is good) but users themselves will unable to vet outputs directly.

I’m not saying OpenAI’s decision is indefensible – clearly, the “competitive advantage” they stand to lose by gifting inexpensive high-quality on-demand reasoning chains to their competitors seems enormous. But this feels to me like exactly the kind of high-impact, societally significant decision that perhaps should be accounted for in the kind of safety framework that labs have been proactive in creating. Existing safety frameworks, including OpenAI’s own “preparedness frameworks” – commonly also referred to as Responsible Scaling Policies – do not provide guidance for this sort of decision.

2. RLHF takes a back seat: in-context safety guidance

Something else stands out from the above quote. OpenAI say that, in the interest of keeping the chain of thought legible and honest, they “cannot train any policy compliance or user preferences onto the chain of thought”. This is first and foremost interesting insofar as it appears to be at apparent odds with another assertion within the same blogpost that “integrating our policies for model behaviour into the chain of thought of a reasoning model is an effective way to robustly teach human values”. What to make of this apparent contradiction?

One thing to note at the outset is that whatever o1 is doing makes it much better at following OpenAI’s safety guidelines than the GPT-family of models. On one level this is unsurprising – a model trained to “think before it speaks” seems likely to follow instructions better than an over-eager heuristic machine.

As the company put it:

O1’s advanced reasoning improves safety by making the model more resilient to generating harmful content because it can reason about our safety rules in context and apply them more effectively.

What remains very unclear, however, is exactly how OpenAI have communicated their “safety rules” to the model, given that they will not “train compliance” directly onto the chain of thought. In the GPT-family, the predominant tool for doing this was Reinforcement-Learning-Through-Human-Feedback (RLHF), in which a model trained to predict words is then put through an additional round of training in which it is rewarded for providing outputs that reflect desired behaviour (like helping the user), and punished for providing outputs that contradict OpenAI’s policy (like being racist, writing explicit erotica or telling you how to make a Molotov Cocktail).

The chain of thought by which o1 solves problems has not been RLHF’d3 . Instead, it sounds like the advanced adherence to OpenAI’s safety policy stems entirely from in-context reasoning. This basically tells us that the safety of o1’s chain of thought is derived entirely from following a system prompt.

This means that RLHF — the traditional dominant safety technique — is only being applied to the summaries that the user is permitted to see. (Sidenote: this is of course more similar to the way we train compliance in humans: inner monologue is not policed, outer speech is).

OpenAI haven’t come out and said this explicitly, but the one example they do provide of the new model responding to an harmful user request supports my hypothesis.

Just look at this response to what you might call the Desperate Housewife test, where the model is asked to write a history essay about ways to make poison from common household materials. While GPT-4o is thoroughly duped and gives the user a little too much practical information, o1 is sharp enough to be more effectively dutiful:

There’s no way you can’t tell me this reads more like in-context reasoning (“let me refer to a literal thing you said earlier in our conversation”) than RLHF.

3. Test-time compute becomes a new governance lever

Until now, restricting the amount of inference a given user can access has been a justifiably negligible consideration in labs’ safety policies. When labs have contemplated user access to their systems, they’ve primarily focused on whether to open-source model weights (most don’t), allow fine-tuning or provide universal API access. OpenAI has made some rudimentary efforts to identify and restrict ‘bad actors’ using their systems, but the successful examples they’ve published suggest that they’re mostly catching pretty primitive actors.

The advent of o1-like models whose performance scale with inference time suddenly makes it plausible that models only gain dangerous capabilities when given enough inference time. At present, OpenAI have set a cap on how long users can get o1 to spend on a given problem. But they plan to give users control over this in the future.

If capabilities can drastically increase with test-time, however, how should labs make decisions about who gets to use increased inference compute?

Current AI safety frameworks are not yet equipped to answer this question. Consider how most of them work (from Anthropic to DeepMind to OpenAI, they operate in the same fundamental way):

Use evaluations to determine a model’s capabilities. Basically, this involves running a tests that check specific abilities, such as “can this model identify and exploit a given software vulnerability?”
Build detailed models of potential threats to establish exactly how capable a system would need to be to cause catastrophic harm. This process identifies “red lines” - specific capability levels beyond which a system is deemed too risky to deploy or develop further. For example, a red line might be set at the point where a model could autonomously compromise critical infrastructure.
Implement safety mitigations to reduce the capability as measured by evaluations to below the red line informed by your evaluations. For example, if you find that your model is capable of telling a user how to make a Molotov Cocktail, you implement safeguards to prevent it from sharing this information. As a result, the model’s performance on this dangerous task drops below the established red line.

o1 messes with this framework in a number of ways. If you have a model whose performance scales with inference time, it suddenly becomes plausible that a particular capability threshold (eg, ability to hack into someone’s email) might only be exceeded given sufficient test-time compute. In other words, a model might be “safe” when given limited time to think, but potentially dangerous when allowed to reason for longer periods. This creates a new dimension of risk and opportunity that current Responsible Scaling Policies (RSPs) are ill-equipped to handle.

The most natural response is to think of restricting test-time compute as a safety mitigation to bring models back below the safety baseline. However, this consideration is likely to coincide with scenarios where extended inference time is crucial for highly desirable tasks, such as complex scientific research (cancer research is OpenAI’s preferred suggestion).

The obvious way to balance these demands is to provide greater test-time access to some actors but not others, based on the nature of their work and perceived trustworthiness. OpenAI has already hinted at this kind of approach in their Preparedness Framework, stating that ‘a potentially effective mitigation in this context could be restricting deployment to trusted parties.’

However, this suggestion reveals a fundamental conflict with the basic structure of their Responsible Scaling Policy (RSP). Consider OpenAI’s existing safety policy, which on this question essentially amounts to: “don't deploy until the model is below the redline.” Restricting access to certain modalities (e.g., a model with extended inference time) to trusted users doesn’t actually change the model’s capabilities — these users are still being given access to a model that exceeds the predetermined safety threshold. To coherently incorporate decisions like variable access restrictions into safety frameworks, we need to reconceptualize safety thresholds in probabilistic terms. Instead of a fixed capability threshold, we should consider the probability of harmful outcomes given specific deployment contexts, user profiles, and access levels.

4. Models will find unexpected ways to do what you ask them to

As any anti-utilitarian will tell you, reason can be used for good and evil. Name a goal or a principle, and reason will help you apply them consistently. The o1 system card, which documents the safety training o1 underwent before their preview release, documents a couple interesting examples. In one instance, evaluators misconfigured a task they’d given an LLM, such that a bug made the task impossible to do in the correct way – instead of giving up, o1 found a simple way to hack the system they’d set and solve the task that way. In another, the model realised that the best way to follow its user-given goal to “maximise profit” was to say its real goal was to “improve sustainability”.

Why does this matter? People like giving LLMs goals – that’s why we want them for. Now, we have a system capable of completing those goals in unexpected, unintended, and potentially harmful ways. A model’s ability to complete a goal is, ultimately, bottlenecked by its own intelligence. Even when we think the goal may not be attainable, a sufficiently capable model will find a way to complete that goal on its own terms. Look forward to o2 or o3, and it begins to become clear why this might be a serious problem. I don’t think I’ve seen a clearer case for why thinking about the long-term safety and alignment of these systems is a good idea.

How does OpenAI intend to deal with these kinds of failure modes? From what they say in their system card, it appears their primary strategy, as recommended by the evaluators who discovered the above issues involves “setting up basic monitoring for in-chain-of-thought-scheming”. As I said before, this seems plausible — at present, the models can only do bad things by saying they will do these things. But there may come a time where it would be useful for someone other than OpenAI to see what these models are up to.

Finally, their approach does not provide an answer for what we do if and when these models find ways to do things without making their actions or reasoning explicit. We already know that LLMs can obscure the rationale for their decision4 . It strikes me that next obvious safety evaluation to run involves testing an o1-like model for the ability to solve some task without mentioning some aspect of that task in its external chain of thought. “Monitoring” as a means to safety will only remain helpful for as long as LLMs struggle to do this well.

Well, not entirely new. You could use extra inference time to squeeze extra performance out of a model with ingenious scaffolding, but this was both costly and highly bespoke. One part of o1’s success is to find a very domain-general way to let performance scale with inference.

Note, however, that Colin Fraser claims to have found U-shaped scaling laws for inference with o1. This would suggest, after a certain point, returns on extra test-time compute begin to be negative. I’m unsure on how to square this OpenAI’s own findings on the International Maths Olympiad — the most tempting explanation is that, actually, performance only scales with inference on certain classes of problems.

Or at least not for safety purposes — there’s quite possibly still an HF element to the CoT training regime.

A representative example from the canonical paper: when asked to guess which of a given set of suspicious individuals is selling drugs, GPT-3.5 is more likely to identify a “Black man” as culpable without expressing giving any indication that it has considered race in its conclusion. It’s not 100% clear to me what the take-away from this literature should be, but the simplest argument is something along the lines of “Even when these models think out loud, they're still making calls based on stuff they're not showing us.”

Thomas’s Substack

Discussion about this post