Why almost every RL agent does learned optimization


Or "Why RL≈RL2 (And why that matters)"

TL;DR: This post discusses the blurred conceptual boundary between RL and RL2 (also known as meta-RL). RL2 is an instance of learned optimization. Far from being a special case, I point out that the conditions under which RL2 emerges are actually the default conditions for RL training. I argue that this is safety-relevant by outlining the evidence for why learned planning algorithms will probably emerge -- and have probably already emerged in a weak sense -- in scaled-up RL2 agents.

I've found myself telling this story about the relationship between RL and RL2 numerous times in conversation. When that happens, it's usually time to write a post about it.

Most of the first half of the post (which points out that RL2 is probably more common than most people think) makes points that are probably already familiar to people who've thought a bit about inner alignment.

The last section of the post (which outlines why learned planning algorithms will probably emerge from scaled-up RL2 systems) contains arguments that may be less widely appreciated among inner alignment researchers, though I still expect the arguments to be familiar to some.

Background on RL2

RL2 (Duan et al. 2016), also known as meta-RL (Wang et al. 2016; Beck et al. 2023), is the phenomenon where an RL agent learns to implement another RL algorithm in its internal activations. It's the RL version of 'learning to learn by gradient descent', which is a kind of meta-learning first described in the supervised setting by Hochreiter et al. (2001). These days, in language models it's often called 'in-context learning' (Olsson et al. 2022, Garg et al. 2022).

RL2 is interesting from a safety perspective because it's a form of learned optimization (Hubinger et al. 2019): The RL algorithm (the outer optimization algorithm) trains the weights of an agent, which learns to implement a separate, inner RL algorithm (the inner optimization algorithm).

The inner RL algorithm gives the agent the ability to adapt its policy to a particular task instance from the task distribution on which it is trained. Empirically, agents trained to exhibit RL2 show rapid adaptation and zero-shot generalization to new tasks (DeepMind Adaptive Agent team et al. 2023), hypothesis-driven exploration/experimentation (DeepMind Open Ended Learning Team et al. 2021), and causal reasoning (Dasgupta et al. 2019). RL2 may even underlie human planning, decision-making, social cognition, and moral judgement, since there is compelling evidence that the human prefrontal cortex (which is the area of the brain most associated with those capabilities) implements an RL2 system (Wang et al. 2018). These cognitive capabilities are the kind of things that we're concerned about in powerful AI systems. RL2 is therefore a phenomenon that seems likely to underlie some major safety risks.

The conditions under which RL2 emerges are the default RL training conditions

Ingredients for an RL2 cake

The four 'ingredients' required for RL2 to emerge are:

  1. The agent must have observations that correlate with reward.
  2. The agent must have observations that correlate with its history of actions.
  3. The agent must have a memory state that persists through time in which the RL2 algorithm can be implemented.
  4. The agent must be trained on a distribution of tasks.

These conditions let the agent learn an RL2 algorithm because they let the agent learn to adapt its actions to a particular task according to what led to reward. Here's a more detailed picture of the mechanism by which these ingredients lead to RL2:

  • Thanks to (1), agents tend to learn representations that identify if the agent is getting closer to valuable states.
  • Thanks to (2), it can learn representations that evaluate whether or not past actions have brought the agent closer to valuable states. To evaluate this, the agent must represent the key task variables that define its current task instance, since states that are valuable in one task instance may not be valuable in another.
  • Thanks to (3), this information can persist through time such that the agent can gradually refine its representations of what task instance it is in and which are the best actions to take in it.
  • Only if (4) holds is it even useful to learn representations of task structure, rather than learning a fixed sequence of actions that work in one particular task.
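To make the mechanism a bit more concrete, here is a minimal Python sketch of an interaction loop with all four ingredients present. Everything in it is an illustrative assumption rather than anything from the cited papers: the corridor task, the names `sample_task`, `act`, `update_memory` and `run_trial`, and especially the hand-written rule inside `act`. That rule stands in for the inner algorithm, which a real RL2 agent would have to learn in its recurrent activations through outer-loop training.

```python
import random

CORRIDOR_LENGTH = 5              # positions 0..4
START = CORRIDOR_LENGTH // 2     # the agent is reset to the middle
ENDS = (0, CORRIDOR_LENGTH - 1)

def sample_task(rng):
    """Ingredient 4: a task distribution. The rewarded end is a hidden latent."""
    return rng.choice(ENDS)

def act(memory, position):
    """Hand-written stand-in for the learned inner algorithm (an assumption).

    Exploit the remembered goal if there is one; otherwise explore the
    first end that has not yet been ruled out.
    """
    if memory["goal"] is not None:
        target = memory["goal"]
    else:
        target = ENDS[0] if ENDS[0] not in memory["ruled_out"] else ENDS[1]
    return -1 if target < position else +1

def update_memory(memory, position, reward):
    """Ingredient 3: memory that persists across episodes of the same task."""
    if reward > 0:
        memory["goal"] = position            # ingredient 1: reward is observed
    elif position in ENDS:
        memory["ruled_out"].add(position)    # ingredient 2: position reflects past actions

def run_trial(goal, n_steps=30):
    """Run several short episodes on one task instance without wiping memory."""
    memory = {"goal": None, "ruled_out": set()}
    position, rewards = START, []
    for _ in range(n_steps):
        position += act(memory, position)
        reward = 1.0 if position == goal else 0.0
        update_memory(memory, position, reward)
        rewards.append(reward)
        if position in ENDS:
            position = START                 # new episode, same task, memory kept
    return rewards

rng = random.Random(0)
rewards = run_trial(sample_task(rng))
print("reward in first 4 steps:", sum(rewards[:4]))
print("reward in last 4 steps:", sum(rewards[-4:]))
```

The point of the sketch is only where the ingredients enter the loop. Remove the persistent `memory` or fix the goal to a single end, and there is nothing left for an inner adaptation rule to do.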

Why these ingredients are the default conditions

This set of conditions is more common than it might initially appear:

Reward- and action-correlated observations:
It's pretty typical that agents observe their environment and that reward tends to come from particular environment states. It's also pretty typical that agents get to observe how the environment changes after they take actions in it. Most games and robotics environments, for instance, have these properties. Cases where this doesn't happen, such as N-armed bandit tasks, are rarer or less interesting from an alignment perspective.

Persistent memory state:
Neural networks that have some sort of memory state that persists through time include recurrent neural networks and Transformer-XL (both of which have been used to train RL2 agents). Having a memory state makes it easier for RL agents to solve tasks that require memory, which include most tasks in partially observable environments (such as the real world). We should therefore expect memory to be used in the most capable and useful RL systems.

But even agents that are purely feedforward often have access to an external memory system: the environment. Even simple feedforward RL agents can and do learn to use the environment as an external memory system when they don't have an internal one (Deverett et al. 2019). The trouble with using the environment as a memory system instead of learning one internally is that the externally represented memories must be learned non-differentiably, which is harder. But it's still possible in principle. Speculatively, large enough agents may be able to learn sophisticated RL2 algorithms that use an expressive-enough environment as their memory system.
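As a toy illustration of 'environment as memory', here is a sketch of a stateless policy that still solves a task requiring memory. The delayed-response task, the function names, and the hand-written policy are all hypothetical and are not the setup used by Deverett et al. (2019). The policy sees a cue only on the first step, stores it by moving its own body, and reads it back from its position at decision time.

```python
def feedforward_policy(observation):
    """A stateless policy: the action depends only on the current observation."""
    cue, position, is_decision_time = observation
    if is_decision_time:
        # Read the memory back out of the environment: which side am I on?
        return "answer_A" if position < 0 else "answer_B"
    if cue is not None:
        # Write the cue into the environment by stepping to one side.
        return "step_left" if cue == "A" else "step_right"
    return "stay"

def run_episode(cue, delay=3):
    """Show the cue at t=0 only, wait `delay` steps, then ask for the answer."""
    position = 0
    for t in range(delay + 2):
        visible_cue = cue if t == 0 else None
        is_decision_time = (t == delay + 1)
        action = feedforward_policy((visible_cue, position, is_decision_time))
        if action == "step_left":
            position -= 1
        elif action == "step_right":
            position += 1
        elif action.startswith("answer"):
            return action == f"answer_{cue}"
    return False

print(run_episode("A"), run_episode("B"))
```

Here the write and read operations are hand-coded. The difficulty described above is that a trained agent would have to discover both through reward alone, since no gradient flows through the environment.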

Distribution of tasks:
Most 'individual' tasks are actually a narrow distribution of tasks. Here is a non-exhaustive list of reasons for why the task 'Solve a particular maze' is actually a distribution of tasks:

  • If an agent has to solve a maze but starts from a random starting position, that's a task distribution; the agent must learn a policy that works across the whole distribution of initial states.
  • If an agent has to solve a maze but its recurrent state is randomly initialized, then that's a task distribution; the agent must learn a policy that works across the initialization distribution of its recurrent state. From location X, if the agent has used a different set of actions to arrive there, its memory state may be different, and thus there will be a distribution over memory states for the task defined as 'Get to the end of the maze from location X'.
  • If an agent has to solve a maze but uses a stochastic policy, then that's a task distribution; the agent must learn a policy that works across the distribution of past action sequences.
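The first bullet can be illustrated with a small sketch. It assumes a hypothetical one-dimensional 'maze' (a corridor with the goal in the middle) and two hand-written policies. One replays a fixed move that was memorised for a single start position, and the other conditions on the observed position. None of these names come from the cited work.

```python
CORRIDOR_MAX = 6   # positions 0..6
GOAL = 3

def run_episode(policy, start, max_steps=6):
    """Return True if the policy reaches the goal from `start` in time."""
    position = start
    for step in range(max_steps):
        if position == GOAL:
            return True
        move = policy(position, step)
        position = max(0, min(CORRIDOR_MAX, position + move))
    return position == GOAL

def memorised_sequence(position, step):
    """Ignores the observation and replays the move that works from start 0."""
    return +1

def state_conditioned(position, step):
    """Uses the observation, so it works across the distribution of starts."""
    return +1 if position < GOAL else -1

def success_rate(policy):
    """Fraction of start positions from which the policy reaches the goal."""
    starts = range(CORRIDOR_MAX + 1)
    return sum(run_episode(policy, s) for s in starts) / len(starts)

print("memorised sequence:", success_rate(memorised_sequence))
print("state-conditioned: ", success_rate(state_conditioned))
```

The memorised sequence only succeeds from starts at or to the left of the goal, whereas the state-conditioned policy succeeds from every start. Randomising the start is enough to turn 'one maze' into a distribution that a policy has to generalise across.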

The (admittedly somewhat pedantic) argument that most tasks are, in fact, distributions of tasks points toward a blurred boundary between 'RL2' and 'the agent merely adapting during a task'. Some previous debate on the forum about what should count as 'learning' vs. 'adaptation' can be found in comments here and here.

So what? (Planning from RL2?)

I'm making a pretty narrow, technical point in this post. The above indicates that RL2 is pretty much inevitable in most interesting settings. But that's not necessarily dangerous; RL2 itself isn't the thing that we should be worried about. We're mostly concerned about agents that have learned how to search or plan (as discussed in Hubinger et al. 2019 and Demski, 2020).

Unfortunately, I think there are a few indications that learned planning will probably emerge from scaled-up RL2:

  • RL agents with some weak inductive biases show behavioural signs of learned planning (Guez et al. 2019). Being only behavioural, the evidence of planning is currently pretty weak. I'd like to see interpretability analyses of similar agents showing that they have actually learned a planning algorithm (This project idea is on Neel Nanda's 200 Concrete Open Problems in Mechanistic Interpretability).
  • Other empirical evidence, which is more speculative, comes from what we know about search/planning in humans. I mentioned above that there is evidence that the human prefrontal cortex implements an RL2 algorithm (Wang et al. 2018). The PFC is the brain region most heavily implicated in planning, decision-making, etc. This weakly suggests that scaling up RL2 might lead to a system that does planning.
  • The final, and in my opinion most convincing, reason to suspect learned planning might emerge naturally in advanced RL2 systems is theoretical. The argument is based on results from Ortega et al. (2019). I go into a little more detail in the footnote, but briefly: the Bayesian optimization objective that RL2 agents implicitly optimize has a structure that resembles planning; the objective demands consideration of multiple possible world states and demands that the agent chooses actions based on which action is best given those possible worlds. [1]

These are, of course, only weak indications that scaling up RL2 will yield learned planning. It's still unclear what else, if anything, is required for it to emerge.


Footnote: The Bayesian optimization objective that RL2 agents implicitly optimize has a structure that resembles planning

Ortega et al. (2019) show that the policy of an RL2 agent, $\pi(a_t)$, is trained to approximate the following distribution:

$$P(a_t \mid \underline{ao}_{<t}) = \int_{\psi} P(a_t \mid \psi, \underline{ao}_{<t}) \, P(\psi \mid \underline{ao}_{<t}) \, d\psi$$

where

$a_t$ is the optimal action at timestep $t$,
$\underline{ao}_{<t}$ is the action-observation history up to timestep $t$,
$P(a_t \mid \underline{ao}_{<t})$ is the probability of choosing the optimal action given the action-observation history, and
$\psi$ is the set of latent (inaccessible) task parameters that define the task instance. They are sampled from the task distribution. $\psi$ effectively defines the current world state.
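As a worked toy case (the numbers are my own illustrative assumptions, not taken from Ortega et al.), suppose the latent task parameter can only take two values, $\psi_1$ ('the goal is on the left') and $\psi_2$ ('the goal is on the right'), and that the history so far gives a posterior of 0.7 and 0.3 respectively. If the optimal action is deterministic within each world, the integral over $\psi$ reduces to a two-term sum:

$$P(a_t = \text{left} \mid \underline{ao}_{<t}) = P(\text{left} \mid \psi_1, \underline{ao}_{<t}) \, P(\psi_1 \mid \underline{ao}_{<t}) + P(\text{left} \mid \psi_2, \underline{ao}_{<t}) \, P(\psi_2 \mid \underline{ao}_{<t}) = 1 \cdot 0.7 + 0 \cdot 0.3 = 0.7$$

So the target distribution puts probability 0.7 on 'left' and 0.3 on 'right': each world contributes its own optimal action, weighted by how plausible that world is given the history.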

How might scaled-up RL2 agents approximate this integral? Perhaps the easiest method to approximate complicated distributions is a Monte Carlo estimate (i.e. take a bunch of samples and take their average). It seems plausible that agents would learn to take a Monte Carlo estimate of this distribution within their learned algorithms. Here's a sketch of what this might look like on an intuitive level:

  • The agent has uncertainty over latent task variables/world state given its observation history. It can't consider all the possible configurations of the world state, so it just considers a small sample set of the most likely states of the world according to an internal model of $P(\psi \mid \underline{ao}_{<t})$.
  • For each of that small sample set of possible world states, the agent considers what the optimal action would be in each case, i.e. $P(a_t \mid \psi, \underline{ao}_{<t})$. Generally, it's useful to predict the consequences of actions to evaluate how good they are. So the agent might consider the consequences of different actions given different world states and action-observation histories.
  • After considering each of the possible worlds, it chooses the action that works best across those worlds, weighted according to how likely each world state is, i.e. $\int_{\psi} P(a_t \mid \psi, \underline{ao}_{<t}) \, P(\psi \mid \underline{ao}_{<t}) \, d\psi$.

    Those steps resemble a planning algorithm.
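Written out explicitly, the three steps might look something like the following Python sketch. It is a hand-coded illustration, not a claim about what a trained agent computes internally. The function names, the three-box toy problem, and the assumption that the optimal action within each sampled world is deterministic (so that each sample simply 'votes' for one action) are all mine.

```python
import random
from collections import Counter

def sample_world_states(posterior, n_samples, rng):
    """Step 1: draw world states psi from an internal model of P(psi | history)."""
    states = list(posterior)
    weights = [posterior[state] for state in states]
    return rng.choices(states, weights=weights, k=n_samples)

def best_action(world, actions, predicted_return):
    """Step 2: inside one imagined world, pick the action with the best predicted outcome."""
    return max(actions, key=lambda action: predicted_return(world, action))

def estimate_action_distribution(posterior, actions, predicted_return,
                                 n_samples=2000, seed=0):
    """Step 3: average over sampled worlds, a Monte Carlo estimate of P(a_t | history)."""
    rng = random.Random(seed)
    worlds = sample_world_states(posterior, n_samples, rng)
    votes = Counter(best_action(world, actions, predicted_return) for world in worlds)
    return {action: votes[action] / n_samples for action in actions}

# Hypothetical toy problem: a key is hidden in one of three boxes.
posterior = {"key_in_A": 0.5, "key_in_B": 0.3, "key_in_C": 0.2}
actions = ["open_A", "open_B", "open_C"]

def predicted_return(world, action):
    """Assumed world model: opening the box that holds the key pays 1, else 0."""
    return 1.0 if world[-1] == action[-1] else 0.0

action_probs = estimate_action_distribution(posterior, actions, predicted_return)
chosen = max(action_probs, key=action_probs.get)
print(action_probs, chosen)
```

With enough samples the estimated action probabilities approach the posterior weights of the worlds in which each action is optimal. The serial structure (sample a world, imagine the consequences of actions in it, aggregate) is what makes this look like planning.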

It's not clear whether agents would actually learn to plan (i.e. learning approximations of each term in the integral that unroll serially, as sketched above) vs. something else (such as learning heuristics that, in parallel, approximate the whole integral). But the structure of the Bayesian optimization objective is suggestive of an optimization pressure in the direction of learning a planning algorithm.
