Adria Garriga-Alonso on Detecting AI Scheming
Why this matters
This episode strengthens first-principles understanding of one alignment risk — AI scheming, and how interpretability might detect the goal-directed machinery behind it — and the strategic conditions that shape safe outcomes.
Summary
This conversation examines core safety questions with Adria Garriga-Alonso — how to detect AI scheming — surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Perspective map
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Chart legend: bars run early → late in transcript order (not clock time) and are evenly spaced. Bar height = spectrum position; bar colour = band, tinted by where the score sits on the same amber → cyan-midpoint → white strip as the headline (risk-forward → mixed → opportunity-forward). Same lexicon as the headline.
Across 25 full-transcript segments: median 0 · mean -2 · spread -28–0 (p10–p90 -8–0) · 4% risk-forward, 96% mixed, 0% opportunity-forward slices.
Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.
- Emphasizes alignment
- Emphasizes safety
- Full transcript scored in 25 sequential slices (median slice 0).
Editor note
A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.
Play on sAIfe Hands
Episode transcript
YouTube captions (auto or uploaded) · video 3A7CK4-1OFo · stored Apr 2, 2026 · 723 caption segments
Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/adria-garriga-alonso-on-detecting-ai-scheming.json when you have a listen-based summary.
Daniel Filan: Hello, everyone. This is one of a series of short interviews I've been conducting at the Bay Area Alignment Workshop, which is run by FAR AI. Links to what we're discussing are, as usual, in the description, and a transcript is, as usual, available at axrp.net. And as usual, if you want to support the podcast, you can do so at patreon.com/axrpodcast. Well, let's continue to the interview. Adrià, thanks for chatting with me.

Adrià Garriga-Alonso: Thank you for having me, Daniel.

Daniel: For people who aren't familiar with you, can you say a little bit about yourself and what you do?

Adrià: Yeah. My name is Adrià, I work at FAR AI, and I've been doing machine learning research focused on safety for the last three years. Before that I did a PhD in machine learning, so I've been thinking about this for a while. My current work is on mechanistic interpretability, and specifically on how we can use interpretability to detect what a network wants and what it might be scheming towards.

Daniel: Okay. Before I get into that too much: we're currently at this alignment workshop being run by FAR AI. How are you finding the workshop?

Adrià: It's really great. I've had a lot of stimulating conversations, and I actually think my research will change at least a little bit based on this. I'm way less sure of what I'm doing now. I still think it makes sense, but I think I need to steer it somewhat. I will still tell you what I've been working on, though.

Daniel: Sure. The stuff that's changed your mind — has it just been chatting with people? Has it been presentations?

Adrià: Mostly talking to people. I've been thinking a lot more about what scheming would actually look like and under which conditions it would emerge. Without going too deeply into it now — maybe later we will — I had been thinking of scheming as something that requires a lot of careful consideration about what futures might look like: perhaps an explicit prediction of what the future will look like under the various things the AI could do, and then thinking about how good those futures are for the AI. But actually it might be a lot more reactive. The chain of reasoning from "I want to do thing X; if I don't show myself to be aligned, I won't be able to do thing X; therefore let's do whatever seems, in the moment, to look aligned" is not a very long one that requires a lot of careful consideration. So perhaps the mechanisms I was hoping to catch it with won't work.

Daniel: Wait — before we get into that: you said you want to use mechanistic interpretability to find scheming. What sorts of things do you mean by "scheming"?

Adrià: Specifically, I'm talking about an AI that has internalized some goal — so it's an agent, it wants something — taking actions to get more of that goal, actions that we wouldn't approve of if we knew about them, or if we knew why it's doing them. So it's hiding intentions from, in this case, developers or users.

Daniel: Okay. Hiding intentions from developers or users, for some reason... it seems like that can encompass tons of behavior, tons of things you could be thinking about. So maybe you can say a little about how you've been tackling it in your existing research.

Adrià: Yes, that's a very good question. I've been thinking about
scheming specifically towards a goal — a long-term goal. The generator here is: the AI wants something that can be accomplished better by, say, having more resources, and that's the reason we're worried about a loss of control. Conversely, if there's no long-term thing the AI wants and needs more resources for, then maybe we shouldn't be too worried about loss of control. So maybe this long-term wanting of things is really the key characteristic that can make AI dangerous in an optimized-misalignment kind of way, and targeting it seems useful.

I've been approaching this by coming up with toy models — at the start they're toy, and they get progressively less toy, but we're in fairly early days with this. So: coming up with models that display this kind of long-term want, and behavior that tries to get the long-term want, and then figuring out how they think about it. By what mechanism are these wants represented? By what mechanism are they translated into action? If we can understand that in smaller models, maybe we can understand it in somewhat larger models, then move on to the frontier LLMs and try to understand whether they have a "want machinery" somewhere, and whether they take actions based on it.

Daniel: So concretely, what kinds of models have you actually trained and looked at?

Adrià: The models I'm training now are game-playing agents. We've trained a recurrent neural network that plays Sokoban. Sokoban is a puzzle game: there's a maze — a grid world with some walls — plus some boxes and some target squares you should push the boxes onto. If you don't plan, if you don't push the boxes in the correct sequence, it's very easy to get stuck so that you can't solve the level anymore, because you can only push boxes, you can't pull them.

Daniel: So if a box gets into a corner, you can't get it out.

Adrià: That's right, exactly. In a corner you can't get it out, and you can't push two boxes at a time — it has to be only one box.

Daniel: So it's easy to inadvertently block your next box with the previous box?

Adrià: Yes, or to get a box into a corner, and then you can only move it in some directions. For example, if you push a box against a wall, you can only move it along that wall. So the geometry of the levels forces complicated sequences of actions that you may need to take.

Daniel: Right, okay.

Adrià: We started from the point of view of a 2019 DeepMind paper by Arthur Guez and others called "An investigation of model-free planning". In that paper they did the setup that we replicated here: they train a neural network with reinforcement learning, giving it a large positive reward if it solves the level and a small negative reward for every step it takes, so that it goes quickly. They just train it to play this game, and it plays decently well on a bunch of randomly generated levels. And then they found that if you give this neural network more time to think at the start, it is better able to solve these levels — it solves 5% of levels that it previously was not able to solve.
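[Editor's note: to make the "push, never pull" mechanic and the notion of irreversible states concrete, here is a minimal Python sketch of the push rule and one simple class of unrecoverable state (a corner deadlock). This is illustrative code, not the paper's environment; all names and representations are our assumptions.]

```python
WALL = "#"

def try_push(grid, boxes, player, direction):
    """Return (boxes, player) after one move, or None if the move is illegal.
    grid: list of strings with a wall border; boxes: set of (row, col);
    direction: a step like (0, 1) for 'right'."""
    dr, dc = direction
    dest = (player[0] + dr, player[1] + dc)
    if grid[dest[0]][dest[1]] == WALL:
        return None                              # walked into a wall
    if dest in boxes:
        beyond = (dest[0] + dr, dest[1] + dc)
        # Only one box can be pushed, and only into a free square:
        # boxes can never be pulled back out of wherever they end up.
        if grid[beyond[0]][beyond[1]] == WALL or beyond in boxes:
            return None
        boxes = (boxes - {dest}) | {beyond}
    return boxes, dest

def corner_deadlock(grid, box, targets):
    """A box that is off-target with two perpendicular adjacent walls can
    never be moved onto a target again -- the simplest irreversible state."""
    if box in targets:
        return False
    r, c = box
    vertical = grid[r - 1][c] == WALL or grid[r + 1][c] == WALL
    horizontal = grid[r][c - 1] == WALL or grid[r][c + 1] == WALL
    return vertical and horizontal
```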
Daniel: Sorry — when you say "more time to think at the start"...?

Adrià: In this case it's analogous to giving language models chain of thought. The specific mechanism is that the recurrent neural network has a latent state that it carries from one timestep to the next, so they just process the initial input — the image of how the puzzle is set up at the first step — many times.

Daniel: Okay. So that's the setup you're replicating. What do you do with that?

Adrià: We took that, and we've been poking it in various ways. Some of that was about trying to better understand in what circumstances more thinking time helps solve the levels, and why this happens from a behavioral point of view — what happens to the levels. Actually, I slightly misspoke earlier: I said 5% of levels that were not solved are now solved, and that's not quite true. 5% more levels get solved, but some of the ones that were solved before now are not, and some of the ones that were not solved before now are. These don't cancel out, so it ends up being a net gain. So we looked at a bunch of these kinds of levels, and at how the neural network seems to behave naturally — why it might have this capability to perform better if you just give it more time.

The original impetus was: maybe, because giving it more time to think makes it do better, it's running some iterative algorithm on the inside that can use more computation to consider more possible solutions and check whether one of them is good. The things we found there were pretty interesting for a model this small. The network seems to have learned a meta-strategy, just from training: when it needs more time to think because a level is complicated, it takes time to think. The way it takes time to think is that it paces around the level without pushing any boxes. This doesn't change the game state in an irreversible way, but it gives the network more processing time. We found many instances of this pacing. We measured it with a proxy: how often does the agent end up in a state it was previously in, meaning it must have moved in some direction and then come back? This happens overwhelmingly at the beginning of the level, so you might think it's using this to plan out how it's going to do the whole level, and then it goes and does it. And if we substitute these cycle steps with just processing the current input without taking any action — which, again, is out of distribution relative to training — then the cycles disappear over the next few steps. So you can substitute the cycles with pure thinking time, and perhaps the thinking time is the main reason the cycles are there.

Daniel: That's interesting, because if I think about stories of AI being scary, it's AI gaining resources — and one of the resources an AI can gain is cognitive resources, just time to think.
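[Editor's note: a schematic sketch of the two mechanisms just described — "thinking steps" implemented by re-feeding the first observation through the recurrent core before acting, and the pacing proxy measured as revisited squares. The `core(obs, state) -> (logits, state)` interface and all names are assumptions for illustration, not the paper's code, and the action loop ignores the feedback between actions and observations that a real environment would have.]

```python
def act_with_thinking(core, obs0, observations, n_think=5):
    """Give a recurrent policy extra compute by processing the first
    observation n_think times before emitting any action."""
    state = None
    for _ in range(n_think):
        _, state = core(obs0, state)     # latent state updates; no action taken
    actions = []
    for obs in observations:
        logits, state = core(obs, state)
        actions.append(max(range(len(logits)), key=lambda a: logits[a]))
    return actions

def revisit_fraction(positions):
    """Pacing proxy: the fraction of timesteps on which the agent stands on
    a square it already visited, i.e. it moved away and came back. High
    values early in an episode are consistent with buying thinking time
    through reversible moves instead of box pushes."""
    seen, revisits = set(), 0
    for pos in positions:
        if pos in seen:
            revisits += 1
        seen.add(pos)
    return revisits / max(len(positions), 1)
```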
Daniel: And, good point: you can have these grid worlds where an AI has to get money in order to do a thing, and you can use reinforcement learning to train the AI to get the money. But I'm not aware of previous work showing that plain reinforcement learning on neural networks results in networks learning to give themselves more time to think about stuff. Is there previous work showing this?

Adrià: I actually don't know — I don't know of any either. The literature search we did mostly found architectures that explicitly try to train in variable amounts of thinking time.

Daniel: Didn't AlphaGo do this to some degree? It got more time to think at some points in the game, at least — it didn't have a constant time for each move.

Adrià: Hm. Maybe that was pre-programmed rather than learned — I don't actually know. I really should look this up now. This wasn't the main focus of the work, so I didn't claim it was novel or anything; it was just an interesting behavior of this neural network.

Daniel: So what was the main focus of the work?

Adrià: The main focus is: we trained this neural network, and now we have a model organism — or at least evidence — of a network that really has a goal internally and is perhaps thinking about how to get to it. The goal was to use this as a model organism to study planning and long-term goals in neural networks. It's a useful model organism to start with because it is very small — only 1.29 million parameters, which is much, much smaller than any LLM — but it still displays this interesting goal-directed behavior. So maybe we have a hope of completely reverse-engineering it with current interpretability techniques. Then we'd know what planning looks like in neural networks when it's learned naturally, and maybe we can use that to guide our hypotheses for how larger models might do it.

We took some initial steps in this direction with the help of a team of students at Cambridge — Thomas Bush is the main author; I advised this a little bit. They trained probes that could predict the future actions of this neural network many steps in advance: very simple linear probes trained only on individual "pixels", so to speak, of the network's hidden state, since the hidden state has the same geometry as the input. We also replicated this — we can train these probes that predict actions many steps in advance too. And some of these probes are causal, in the following sense. The probes give you a set of arrows, let's say, over the game state: the agent is going to move along this trajectory, and the boxes will move along these trajectories — this one goes here, that one goes there. We can also write to this, and that actually alters the trajectory the agent takes. We can put a completely new plan into the hidden state of the neural network and it will execute it. That leads us to believe that the way it comes up with actions really is to come up with these plans first and then execute them.

Daniel: Okay. This actually sounds kind of similar to the work Alex Turner has done on activation steering — the maze and the cheese.

Adrià: The maze and the cheese, yeah.
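[Editor's note: a hedged sketch of the probing setup as described — one linear map applied at every spatial position ("pixel") of the hidden state, predicting future movement through that square — plus one simple way to realise "writing a new plan" into the hidden state. The shapes, the five-way label scheme, and the gradient-based write are our assumptions; the published method may differ.]

```python
import torch
import torch.nn as nn

N_DIRS = 5  # up / down / left / right / no movement through this square (assumed)

class PixelProbe(nn.Module):
    """One shared linear probe applied independently at each grid square of
    the hidden state, which has the same spatial geometry as the input."""
    def __init__(self, channels):
        super().__init__()
        self.linear = nn.Linear(channels, N_DIRS)

    def forward(self, hidden):
        # hidden: (batch, channels, height, width)
        return self.linear(hidden.permute(0, 2, 3, 1))  # (B, H, W, N_DIRS)

def write_plan(hidden, probe, target, steps=100, lr=0.1):
    """Edit the hidden state until the probe reads out the desired plan
    (`target`: (B, H, W) direction labels), then let the agent run."""
    h = hidden.clone().requires_grad_(True)
    opt = torch.optim.Adam([h], lr=lr)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(
            probe(h).reshape(-1, N_DIRS), target.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return h.detach()
```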
Daniel: Interestingly, I think Alex's interpretation — without putting words in his mouth; this is my interpretation of his interpretation — is that agents are a bit less unitary. He thinks those results show that agents have these "shards": there's an "I want to find the cheese" shard versus an "I don't want to find the cheese" shard, and whichever shard activates more strongly is the one that determines the behavior. So he has a relatively non-unitary planning interpretation of those results, and it seems like you have a different interpretation of yours. Do you think that's because there's an important difference between the results?

Adrià: That's a good question. I think there is some difference, and here's what I think it is. Initially for this project we also targeted maze-solving neural networks. We trained a specific kind of algorithmic recurrent network that a paper — by Arpit Bansal, I think it is; I can give the citation later — came up with, and we looked at how it solved mazes. It used a very simple algorithm that could be said to be planning, but that does not generalize very much. The algorithm is dead-end filling — you can look it up on Wikipedia. It's a way to solve simple mazes, the ones that don't have cycles; I think those are also the mazes that Alex's work used. It works like this: you start with your maze as a two-color image — say the walls are black and the spaces where you can go are white — and then, for every square of the maze, you check whether it has three walls adjacent to it, and if so, you also color it black. You keep doing this, and you watch the dead-end hallways of the maze fill up, until only the path from agent to goal is left. And that's what we observed this neural network doing, very clearly.

I think that's an interesting planning algorithm, but the state space of this problem is small enough that it fits completely in the activations of the neural network: the network has an activation for every location in the maze, and the states are also just locations in the maze — there's only one agent, so a state is just an x-y position. So it could perfectly represent the problem, think about all the possible states, and plan. It's a planning algorithm that works only for this specific problem, and I thought a more complicated environment, where the state space is much larger and doesn't completely fit in the network's activations, would be a better target. That, plus the fact that in the maze environment you can also just kind of go up and right and get to the cheese — and if you get it somewhat wrong, you can correct it — means the network is less strongly selected to be capable of this kind of long-term planning, and the planning you might find uses a lot less machinery and would be less generalizable. So I think there's a big difference in the environments.
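[Editor's note: dead-end filling, as described, is simple enough to state in full. This is the textbook algorithm for cycle-free mazes, not the network's learned implementation of it.]

```python
def dead_end_fill(maze, start, goal):
    """Repeatedly turn any open square (other than start/goal) with three or
    more adjacent walls into a wall; the surviving open squares trace the
    unique start-to-goal path. maze: list of lists, True = wall."""
    rows, cols = len(maze), len(maze[0])
    changed = True
    while changed:
        changed = False
        for r in range(rows):
            for c in range(cols):
                if maze[r][c] or (r, c) in (start, goal):
                    continue
                walls = sum(
                    nr < 0 or nr >= rows or nc < 0 or nc >= cols or maze[nr][nc]
                    for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)))
                if walls >= 3:
                    maze[r][c] = True    # fill in the dead end
                    changed = True
    return maze
```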
Adrià: The other thing I'd say, though, is that I do think his theory is broadly right — or maybe more right than wrong — in that I think it's highly likely that the neural networks you actually train are not just one coherent agent that only wants to maximize one thing. They might have other mechanisms in there, like "I kind of just like going right", or "I kind of like this other goal", and those might also play a role. We might find that in this agent too. But I think we selected it pretty strongly towards having this one goal of solving the level, and the levels are solved with planning, so it probably does that a whole bunch.

Daniel: You mentioned your paper and a paper by some Cambridge people. What are the names of these papers, so that people can look them up?

Adrià: The Cambridge one isn't published yet, but it will be soon. The main author is Thomas Bush, and the title is something like "an example of planning in model-free reinforcement learning" — I'll have to look it up and send it to you.

Daniel: Hopefully that's enough for people to Google — Thomas Bush is the author.

Adrià: Yeah. And our paper is called "Planning in a recurrent neural network that plays Sokoban".

Daniel: So, going back up the stack a little: the reason you were interested in this work is that you wanted to understand agents that are scheming — they're deceptively aligned, they're trying to deceive you and do some tricky stuff in order to achieve some goal — and that's why you're studying goal orientation and how it gets implemented.

Adrià: That's right.

Daniel: And you mentioned that you had some reservations about that approach, which I understood to be something like: maybe it doesn't take tons of planning to do this sort of scheming behavior — maybe you don't need a goal that you explicitly evaluate, where you explicitly think of futures.

Adrià: I think you're broadly right; I just wanted to expand a bit.

Daniel: So I guess there are two types of scheming behavior. There's one type where you're being trained, and somehow you know you're being trained, so you decide, "I'm just going to play along, because I can't really do anything else — nothing else is particularly useful right now." That seems not super cognitively complex. But then there's a second thing: you're an AI, you've finished being trained, you're being deployed, but there are humans around who don't want you to execute your super devious plan, so you've somehow got to get around them. And I might think that it actually does need a lot of planning to execute that "actually do naughty stuff" part.

Adrià: That's a good point. If the safeguards are complicated enough and there are many things that need to go right, then it would perhaps need a lot of planning. I guess I agree with this. Maybe the takeaway here is more that my initial hypothesis as to what planning would look like is wrong, rather than that studying planning is not useful.

Daniel: What was the initial hypothesis?

Adrià: I had this model of: the reason we can get at planning in neural networks, even if we don't solve all of interpretability, is that planning needs — I say this now, but I'm not sure I entirely believe it anymore; for a second, let's put this hat on — planning
needs a mechanism that thinks about possible outcomes: possible ways the world can go from now on. I guess you need a world model, but I'm abstracting over that. I'm just saying: there are going to be several possible futures depending on what I do, and I'm going to think about a bunch of possible things I might do and what the world will look like in a while under each of them. Then the neural network — the planning algorithm — would evaluate all of these possible outcomes according to some internal criterion, which is basically "how much do I like these worlds that we end up in?", look at all the evaluations, pick the best one, and execute that.

And so I was thinking: we might not understand the abstract representation of the future outcomes — that seems complicated, especially if the agent is very intelligent. We might also not understand exactly how an outcome gets translated into an evaluation — it might involve lots of intuitive components, and we might not know what each one represents. But what we can definitely see, even if we can't do interpretability on most of everything else, is that there's a number at the end, somewhere in the activations of the neural network. There are a bunch of these numbers, the best one gets picked, and that is what gets executed. So if we could find this computational signature — a mechanism which picks the best number out of many, and that determines the action — that would be planning happening. I think this will also fire in a bunch of other circumstances, but perhaps we can catalog the thousands of these instances in the neural network and see which ones coherently do something. Maybe we would then know when the network is expecting to get a big reward, or maybe we would be able to ablate these to zero, to prevent any long-term goals from being pursued.

Daniel: Although just having a bunch of actions that you score doesn't say that the goal is long-term, right? If I really like to wave my left arm, I'm going to consider the actions I might take, notice which one of them involves waving my left arm, add a thousand to that one, and pick that action.

Adrià: That's right, yeah, that would show up too. I guess my intuition here is that goals that are fairly short-term will be more efficiently expressed as kind of cached skills, or as things the neural network can just access. You're not thinking explicitly about which muscles you're moving when you're raising your left arm, so that kind of cognition—

Daniel: Well, don't presume things about how I think about my muscles, please.

Adrià: Okay, okay — I don't know what it's like to be Daniel Filan.

Daniel: Most of the time when I'm lifting my left arm, I'm not explicitly thinking about muscles. From time to time it's fun.

Adrià: Or, I suppose, if you're trying to gain a new skill in exactly how to move your left arm — if you do dancing, you might have some technique where you need to move your arm explicitly in a particular way you haven't done before — but even then I would normally not think about muscle contractions. I guess you visualize the thing.
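[Editor's note: the hypothesised computational signature — many candidate futures each reduced to a scalar evaluation, with the maximum determining the action — can be written down directly. A toy sketch, where `world_model` and `value` stand in for whatever internal machinery the network might use; the interpretability hope described above is that the scalar scores and the final argmax would be visible in the activations even if the rollout and evaluation machinery are not understood.]

```python
import itertools

def argmax_planner(state, actions, world_model, value, horizon=3):
    """Roll out every action sequence up to `horizon`, score the resulting
    future with a scalar evaluation ("how much do I like this world?"),
    and execute the first action of the best-scoring sequence."""
    best_score, best_first = float("-inf"), None
    for seq in itertools.product(actions, repeat=horizon):
        s = state
        for a in seq:
            s = world_model(s, a)    # predicted future under this plan
        score = value(s)             # one scalar per candidate future
        if score > best_score:
            best_score, best_first = score, seq[0]
    return best_first
```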
Daniel: I think the case where I would think about muscle contractions is maybe a weightlifting context.

Adrià: Good point.

Daniel: Or physical therapy. Sorry, you were saying something that I interrupted.

Adrià: No, no — I think I was finished with my point, which is that most of the time, short-term goals wouldn't require all this explicit outcome planning and would just use this other mechanism. So we would also pick up some short-term goals with this method, but I was hoping there would be fewer, or that maybe we would be able to tell them apart in some way. And I guess I also hoped that, now that we have located these goals, maybe we can trace them backwards a little bit — you know, what machinery produces this evaluation? — and then maybe we can understand a little about that. Even if we don't understand it perfectly, we can get some broad strokes of what it is. And then it's difficult for the neural network to modify this, so even if it's a very intelligent agent, it wouldn't necessarily be able to fool our best classifiers by changing the way its internal brain is structured. That seems pretty difficult.

Daniel: Okay, cool. Well, we're about out of time, and I want to give you time to enjoy the rest of the workshop. Thanks very much for chatting with me.

Adrià: Thank you very much for having me.

Daniel: This episode was edited by Kate Brennan, and Amber Dawn Ace helped with transcription. The opening and closing themes are by Jack Garrett. Financial support for this episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. To read a transcript of the episode, or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.