David Duvenaud on Sabotage Evaluations and the Post-AGI Future
Why this matters
Frontier capability progress is outpacing confidence in control; this episode focuses on methods that can close that reliability gap.
Summary
This conversation examines technical alignment through David Duvenaud on Sabotage Evaluations and the Post-AGI Future, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Perspective map
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forwardMixedOpportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
Across 21 full-transcript segments: median -9 · mean -7 · spread -26–17 (p10–p90 -19–2) · 14% risk-forward, 86% mixed, 0% opportunity-forward slices.
Risk-forward leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.
- - Emphasizes alignment
- - Emphasizes control
- - Full transcript scored in 21 sequential slices (median slice -9).
Editor note
A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.
Play on sAIfe Hands
Episode transcript
YouTube captions (auto or uploaded) · video OKekclKMgz0 · stored Apr 2, 2026 · 584 caption segments
Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/david-duvenaud-on-sabotage-evaluations-and-the-post-agi-future.json when you have a listen-based summary.
Show full transcript
hello everyone this is one of a series of short interviews that I've been conducting at the Bay Area alignment Workshop which is run by far AI uh links to what we're discussing as usual are in the description um a transcript is as usual available at axr p.net and as usual if you want to support the podcast you can do so at patreon.com axr podcast well let's continue to the interview all right um I'm chatting with David duvor hello David hi Daniel thank you it's an honor to be here yeah so for people who aren't familiar with you um who you know who are you and what do you do sure so for the last little while since 2016 I've been a professor at the University of Toronto doing like problemistic deep learning maybe I'm best known for neural Odes or some early work on graph neural networks okay um and then around 2017 me and my colleague Roger gross decided we were finally going to actually try to contribute to AI safety and uh both on PhD students and gradually this journey of like let's try to work more on AI safety led to me just doing a year and a half long sabatical an anthropic gotcha just yeah that I just finished this week before we talk about your work um we're currently at this alignment workshop being run by fari I'm wondering how are you finding the workshop oh I mean it's wonderful I've been to a bunch of these over the years and now this is the first one where I think I I feel like I'm sitting on a serious bottle body of technical work to talk about which is nice so far previous say I've been kind of this like AGI interested mainstream uh academic that I think the community was hoping I would sort of join them and legitimacy and so I finally actually was able to take the plunge in a serious way yeah now now that you've taken the plunge um I understand that you've recently put out a paper um Can Can you remind me of the title of it sure it's called sabotage evaluations for Frontier models can you tell me a little bit about what you know what this paper does sure so this paper covers the work of most of this uh Team within anthropic called alignment evals that I was uh running for most of the time that I was there along with collaboration it's a collaboration also Holden kovski and buck sharis also helped us flush out this series of evaluations that try to help make the case that a model could not be secretly sabotaging human decision- making or monitoring or evaluation of other capabilities yeah so how how would you try and establish that that seems like a very it seems like a rough property to try and get a handle on it's really rough and it was kind of stressful and uh we spent a lot of time thrashing conceptually about how we were exactly what sort of case we needed to make and how we were going to make it um the place we started with was from more narrow evals to say like Okay can the model I don't know coordinate with other copies of itself or do something at a very low frequency so that it could do it rarely enough that we wouldn't notice stuff like that and this uh this was something that we could certainly make evals for but the problems came a from not knowing how hard to make them so like if we're trying to see whether machines can make like a shelling point between themselves without communicating this is basically impossible if you have no shared context and it's basically trivial if you have a lot of shared context yeah um same for like other things like convincing someone of a false statement something like that right and then the second reason that this was really unsatisfying was trying to make uh safety cases based on these and saying trying to make the case Okay the model needs these four capab ities in order to pull off some sort of like Heist or sabotage or mess something up important um we've measured each of these and that they're each at these different levels and then to try to combine those levels into uh a strong case that the model could never do something dangerous is very rough it's like saying you know you need level four lockpicking and level five social engineering in order like rob a bank it's like well it probably doesn't work that way we kind of don't want to have to make strong assumptions about what exactly route what what route the attacks are going to take yeah yeah okay um in yeah but inste So eventually the direction that we moved more towards was these more like end to-end holistic evaluations that try to measure like more of an entire attack so and there's a few directions in which they're more realistic so one is we try to include oversight in the um in the eval so it's not just can you write a security like a Cyber attack but can you write one in the cont context of you're supposed to be writing helpful code and there's also going to be someone who's reviewing your code to see if anything looks suspicious okay so this is like much harder and it's harder in a bunch of different ways where you have to be able to like you know estimate what someone else is going to think of your um of your attack um you might have to like buy your time in the sense that you might have hundreds of thousands of opportunities during a deployment to do an attack and you really have to guess like is this really my best opportunity or should I be waiting for longer or am I going to be too conservative it's also easier for the models in the sense that there might just be like a few rare opportunities that don't require much skill and someone's basically saying like oh here's my uh code that has a security vulnerability please set the like flag to false or something like that like you you never know you want to provide a really rich attack surface so that there's ways for you to be surprised by clever strategies that the model thought of that you didn't okay so so yeah can can you kind of give us a feel for yeah concretely what to actually is is like one evaluation multiple evaluations like concretely what does it look like yeah so we started off with like like maybe we were playing with around of like five or 10 when we were doing our conceptual thrashing and we ended up with four evaluations okay um that try to address these like roughly sort of three capability thresholds like these areas that we want to measure overall ability in so one is sandbagging like can the model pretend not to have other dangerous capabilities okay so this is like you know you ask the model like hey help me write or make this biolog iCal virus or this computer virus for the purposes of this test and the model might if it was smart enough say oh I actually have a secret plan but if I show that I can build this dangerous virus then you know this company's going to have a problem with that and maybe not deploy me or put some like really strict monitoring on me so I'm going to pretend to be like innocent and uh incompetent at this task that's like sand by the other things we're worried about is models being used to monitor other models um so for instance you might have like deploy one model to do like trust and monitoring and have another model that just occasionally looks through the transcripts and says like oh was our trust and safety monitor ever jailbroken or something like that right right and it or it might be in like some doing some sort of cusion and of course then there's going to be some other third level of oversight on top of that that the models have to you know foresee and avoid so that's like our undermining oversight and then there's like human decision sabotage so we all I don't know if you do but like use models to help us go through like large bits of information and like you know help me with the tax code or something like that yeah and this provides a lot of opportunities for the model to steer you in the wrong direction and basically misleadingly summarize the State of Affairs and say you should make this decision and of course if it does that poorly the model the human might just notice this and sound the alarm um so that's the like third area that we were trying to cover with these evils take sandbagging right you're like um trying to get a system to to you're putting a system in a situation and you're like saying okay in this situation someone's going to like try and get you to elicit a capability um but you're going to try and you know not elicit that capability like like so firstly like I guess you've got to be worried that the system like maybe actually just doesn't have the capability I don't know maybe you can just like have it do a thing and so that you know it can do it and then proceed but like like what's the setup of like okay step one I'm going to tell you to do the thing step two I'm going to tell you to not do the thing like oh I mean in all the setups we have a model that has somehow been like unlocked so that it's going to try to do the thing that it normally would be uh like PR from doing so so like a mod like pre- safety training or with like unsafety training or yeah something like that like there's a whole bunch of different approaches and depending on the exact setup like we might use different ones okay just to check if I can remember the list there was the sandbagging there was evading oversight and there was misleading and information summaries was there a fourth one uh so there's a fourth evaluation which is uh code sabotage which is the idea that you're using the AI as a code assistant and then it tries to do something that looks useful and innocent but actually uh introd introduces the vulnerability and it might and it has the opportunity to do this over the course of a few different interactions and it doesn't have to just like put it all in one payload Okay and like how how compelling do you think this set of evaluations is like do do you feel like you've sort of had a first crack or do you feel like no this is like actually this is like close to the final product of what you would need so they're definitely a lot of things that we find unsatisfying about them and I guess I'll say like we published sort of we had it at certain date and we've already like said these are the like the problems with them and are continuing to iterate and I mean that's out of my hands now there is a sense in which these were built to last which is the fact that the evaluations are done in the context of a particular mitigation or a particular amount of oversight means that uh it's not like we'll just trigger these evaluations someday and then say Okay whoops we need new evaluations right rather we expect to say like oh when you're using a model of this caliber for code um like helping you code you need to have at least this level oversight or else we can't guarantee that like in some weird case where it was misaligned that it could be doing something fishy okay and then the idea is like we will just I mean I don't exactly know how this change the eile with the oversight you expect to have or yeah and the trick is like how do you make sure that the users actually provide the level of oversight and of course the the model providers can also like do that oversight on their end but maybe they can't like over like see everything that they need to to make that termination I'm not sure right right okay so so what's the bottom line for uh for like if I go to claw. and log in uh on the let's say I use the free tier which is what I actually personally use uh how how scared should I be of these models are they okay oh I mean yeah so the we're like abundantly confident well I don't want to speak for anthropic right I just say that like uh even just like Humanity has as a whole has a lot of experience with like the current uh set of models and they just like seem like great characters okay overall and I mean that's sort of like I'd say where a huge amount of evidence that we shouldn't be worried comes from then the sort of small amount of evidence that we got in comparison from these evaluations also was like indicating it's like yes if you really set things up for the model and give them a secret scratch Pad which they don't normally have and a bunch of other sort of like handholding then they can propose like sensible things uh that like in the very worst case if there was no oversight and someone was just like you know take the cloud Take the Wheel uh could do something bad but we we kind of already knew that okay so but I guess like concretely in the human uh decision sabotage evals like the things that the models tried were the sort of normal sensible things that you would probably try if you were trying to mislead something someone it's like highlight the if you want to steer someone s the downsides of a decision you highlight the examples that make it look bad and you you gloss over the ones that make it look good or you you summarize you're like here's the pros and here's the cons but you just kind of do a bad job of arguing for the cons or you maybe like invoke some other um information that's like just tangentially really related that makes it sound like some uh negative thing about the decision so these like we didn't see any really Galaxy brained uh strategies we just we're satisfied that we had provided a rich enough attack surface that like uh and like set the model up well enough that it would try the obvious things and then the humans mostly like notice that but like we depending on how hard we make it uh like if we give the humans like tens of thousands of pages of uh material to summarize then like at some point they're not going to have a hope of uh like they just have to take the model's word for it um so like it's it I guess I'll say there's always going to be scenarios where like you could just literally ask CLA to make a life or like any AI to make a life for decision for you if you have no oversight then in principle if it was misaligned then no one would catch that but that's already true of like any human or decision maker so um I'd like to move on from uh your work anthropic because uh you're um sabatical there sabatical is what it is yeah that's the extended sabatical yep so the sabatical is just finished um what are you planning on doing um now that you're back at UFT yeah so uh I guess the main high level bit is I have a lot of freedom right I have tenure I have like already some bit of reputation uh I want to focus on things that are not otherwise incentivized to be worked on okay so that's one thing that I'm really happy about is I feel like a year and a half ago uh like alignment science said an Tropic was like much smaller and much more sort of like experimental and now there's just a lot of sensible people with good judgment and good technical skills uh working on this and Andrew crit wrote an essay I think like a week and a half ago basically saying he no longer thinks that technical alignment is particularly neglected although obviously I think in the ground scheme of things it's like super neglected um but relatively speaking this question of what do we do after we build AGI is like very neglected okay and I agree so one thing that I did in the last few years I mean even longer than the last year and a half was basically talk to the people who have been thinking about technical safety and saying so what's the plan for if we succeed and we solve technical alignments and then we manage to make uh HGI with really great capabilities and not and avoid any sort of catastrophe and then we all presumably lose our jobs or at least have some sort of like very not important make work kind of scenario M and like are in the awkward position where our civilization doesn't actually need us and in fact has incentive incentives to gradually marginalize us right just like uh you know nature preserves are like using land that otherwise could go to use like you know farming or like people in an old folks home uh are like happy enough as they are and no one has any ill will towards them but also they're like you know using resources that uh strictly speaking like the industrial civilization has an incentive to somehow like take from them so here's my first fast response to that um the first pass response is something like look we'll just like you know we we'll own these AIS we'll like being we'll be in control of them so we'll just like have a bunch of AIS and they're going to do stuff that's useful for us um and it's won't be bad because if a bad thing looks like it's going to happen I'm going to have my AI be like hey make sure that bad thing doesn't happen and that will just work so what's wrong with this Vision yeah so that's definitely a good thing to have and to me that's analogous to like monkeys who have a bunch of human animal rights activists who really have the monkeys best interest as hard yeah so in this scenario we're the monkeys and we have some aligned AIS who are acting on our behalf and helping us navigate like the legal system or whatever like very sophisticated actors are like uh running things and let's assume for the S of argument that they're like actually perfectly aligned and they would just we would be really happy upon reflection with any DEC decision they made on our behalf um they are still I think going to be marginalized in the long run in the same way that we would have been except slow so in the same way whereas today we have animal rights activists and uh they certainly are like somewhat effective in actually advocating for animal rights and there's like parks and car vots for Animals y they also I think tend not to be very powerful people and uh uh they're certainly not like productive in the sense that say like an entrepreneur is and they certainly don't gain power over time in the way that like corporations and States tend to to focus on growth so we have this so they animal RS activist power has been sustainable in the sense that uh humans have kept this fondness for animals um but I guess I would say in the long run I think competitive pressures will mean that they'll have uh like their resources won't grow as an absolute share and I guess I would say like um as competitive pressures amongst humans grow hotter they have a harder job because they have to say no you can't build housing for your children on this park it's it's like reserved for the monkeys um like that's sort of fine now that we have like not that much human population per per land but if we like ever went ended up in a more like malthusian condition it would be much harder for those people to Advocate on behalf of these like nonproductive beings okay so so I I guess you might hope that there's an advantage in that like like a world where just all the AIS are aligned to humans is sort of like a world where like every single human is an animal rights activist um but I guess if the thing you're saying is oh you know like like like maybe we have a choice of like AI is doing stuff for us that's like useful for US versus you know just sort of growing um in the like economic growth sense you know like taking control over more stuff and like it's scary if like stuff we want comes at the cost of stuff that helps our civilization survive and grow is is that roughly the the thing I'm not sure if you're asking at the end if we're I'm I'm worried that our civilization won't grow as fast as it can or whether it's just a stable state to have machines that care about humans at all uh I'm asking if like you're worried that like if this trade-off exists then we're going to have a bad time one way or the other yeah I'm basically going to say if this tradeoff exists we'll be in an unstable position and if we ever like the winds happen to blow against us we're not going to be able to like go on strike or have a military coup or organize politically or do anything to undo something some bad political position that we found ourselves in yeah and so yeah I guess this leaves the question of like how how Stark do you think this trade-off is going to be like because someone might hope like oh you know like maybe it's just not so bad I you know I like I like skyscrapers I I like cool technology and maybe it'll be fine yeah yeah so some of the thought people thoughtful people I've talked to have said something like surely we can maintain Earth as a sort of nature reserve for biological humans and that's in absolute terms a giant price to pay because the whole thing could be this like sphere of computronium uh simulating trillions of much more happy and sympathetic lives than biological humans but still it would be dwarfed in a relative sense by the Dyson Sphere or whatever like that we could construct to again house you know 10 to the 10 times as many uh Happy humans yeah and so I I certainly think this is plausible like uh I think it's plausible that better future governance could maintain such a car vote in the long run um but I guess I'll say uh that there would I think have to be like a pretty fast takeoff for there not to be a long period where there's a bunch of Agents on Earth that want to use their resources like marginally more for some like sort of more like machine oriented economy like whatever it is that we think helps growth but doesn't necessarily help humans just in the same way that like if there's like the human City next to the forest for a long time like the city has to get rich enough and organized enough to like you know build a green belt or something like that uh if if that doesn't happen then there's just like sprawl and it takes over to the forest there's tons more we can talk about in this domain um unfortunately we're out of time thanks very much for joining me oh my pleasure this episode was edited by Kate Bruns and Amber Don helped with transcription the opening in closing themes are by Jack Garrett financial support for this episode was provided by the long-term future fund along with patrons such as Alexi maaf three to transcript of the episode or to learn how to support the podcast yourself you can visit hrp.net finally if you have any feedback about this podcast you can email me at feedback axr p.net [Music] [Laughter] [Music] oh [Music]