
AXRP · Civilisational risk and strategy

Preparing for Debate AI with Geoffrey Irving

Why this matters

This episode strengthens first-principles understanding of alignment risk and the strategic conditions that shape safe outcomes.

Summary

This conversation with Geoffrey Irving examines core safety research - debate, red teaming, citation, and uncertainty estimation - surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Perspective map

Mixed · Technical · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score, the white marker the most Opportunity-forward score, and the black marker the median perspective for this library item.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

[Bar chart: one bar per segment, early → late in transcript order; bar height = spectrum position, colour = band (amber = Risk-forward, cyan midpoint = Mixed, white = Opportunity-forward). Bars are evenly spaced in transcript order, not clock time; same lexicon as the headline.]

Across 60 full-transcript segments: median 0 · mean -3 · spread -290 (p10–p90 -100) · 3% risk-forward, 97% mixed, 0% opportunity-forward slices.

Slice bands
60 slices · p10–p90 -100

Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes alignment
  • Emphasizes safety
  • Full transcript scored in 60 sequential slices (median slice 0).

Editor note

A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.

ai-safety · axrp · core-safety · technical


Episode transcript

YouTube captions (auto or uploaded) · video VEFZb-0Bx0I · stored Apr 2, 2026 · 1,945 caption segments

Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/preparing-for-debate-ai-with-geoffrey-irving.json when you have a listen-based summary.

Daniel Filan: Hello, everybody. Today I'll be speaking with Geoffrey Irving. Geoffrey is a safety researcher at DeepMind, where he leads the Scalable Alignment team. We'll be speaking about three papers he's co-authored: "Red Teaming Language Models with Language Models", whose first author is Ethan Perez; "Teaching Language Models to Support Answers with Verified Quotes", a.k.a. the GopherCite paper, whose first authors are Jacob Menick, Maja Trebacz, and Vladimir Mikulik; and "Uncertainty Estimation for Language Reward Models", whose first author is Adam Gleave. For links to what we're discussing, you can check the description of this episode, and you can read the transcript at axrp.net. Welcome to the show, Geoffrey.

Geoffrey Irving: Thank you.

Daniel Filan: So I guess my first question is: what happened to AI safety via debate?

Geoffrey Irving: It is the most important question. The thing that happened is: I'm still doing that stuff, and I'm building up towards it. The overall research agenda here is to figure out and implement the protocol for humans and machines discussing problems that gives us good answers - answers we would endorse after reflection. That's debate, but broadened: there are a lot of different versions of debate, and different things you can add on to it. You could do debate plus evidence, or debate plus more aggressive red teaming. There's a large space of protocols there, and these papers are pieces of that story. But we are also working on pushing on the main thing. That will take some time to put together, because there are still pieces to build up, but it is still very much the agenda, or a piece of the larger agenda.

An example of a thing I've updated on since then is that the human needs to talk. The debate paper only had the two machines talking, with a human judging at the end, and that's just not the right thing to do, obviously, because humans need to ask clarifying questions and say what they currently think, in case they've misunderstood what's been said so far. So there's a broad space of interaction protocols we'd like to explore that fix the holes in debate, and this is building up towards that.

One reason for doing GopherCite is that if you want to do general debate, you have to take into account a bunch of leaf evidence. Debate is about machines and humans discussing some argument for something being true, but at the ends of that tree are leaves where you have checkable facts in some form. Naively, you can just use facts that every human knows, but that isn't the largest base of facts, so this work extends the space of leaves, which lets us implement more practical versions of debate on more interesting tasks. So I think these papers do fit into that overall story.

Daniel Filan: So if I'm interested in implementing debate, and I'm doing a bunch of empirical work, what do you think remains to be done, or what do you think the most important next steps are?

Geoffrey Irving: I still haven't implemented the full system, basically. Getting to long-form, multi-round debates will take a little while more; we're on that trajectory, but we haven't gotten there yet. We've done bits and pieces of it, but not in a form that we've shared so far.
Given that we haven't done the thing, most of it is still quite uncertain: whether it will work, and which different versions work or don't work. Some people have been probing this. Beth Barnes and Paul Christiano had a couple of write-ups - I hired Beth at OpenAI earlier to do this kind of human experiment on debate - and she had one post with a strengthened version of debate, and then some discussion of potential obstacles, both ones they had surmounted and ones they weren't sure how to surmount. So a lot of work has occurred, but it's still early days, and we haven't reduced a lot of the major uncertainties.

One thing I would say is that, the way I think of it, debate in the most general sense is just the idea of having models that critique themselves, and there are many different ways of doing that, and many ways to hybridize it with different schemes. So I'm still pretty confident that we're going to have models that critique themselves; I'm less confident about the surrounding details - what other aspects of the scheme we use, and how to structure it. So we're trying to build up towards that.

Daniel Filan: OK. And in terms of the human interaction in debate: I guess one version is to ask specific questions, but there are a few protocols where humans can ask questions during a debate. I'm wondering if you have thoughts on which versions seem particularly helpful.

Geoffrey Irving: Yeah, very much so. I think that mostly you want this to fall out of the RL setup. For example, say the debaters are called Alice and Bob, and I'm Alice, thinking: "Well, I've made a bunch of really good points, I'm being honest here, I think I should win - but I'm just curious whether I'm actually going to win." Then maybe Alice would prompt the human - "what do you think here?" - if the human doesn't speak up. If the human reveals a misconception, or that they actually think Bob is winning, or that something is wrong, then Alice can just change strategy. That is a direct consequence of playing the RL game where the goal in the end is to win, because if Alice had waited until the end, she might be unpleasantly surprised by the human's judgment. Generally, the goal should be, where we can do it, to make the choice of protocol details be part of the scheme. That can't always be the case, and there are a lot of choices you'll have to make, but that particular one I think will mostly just fall out.
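To make that mechanism concrete, here is a minimal sketch of a debate protocol in which debaters can query the human judge mid-debate. It is an illustration under assumptions, not DeepMind's implementation; the debater policies, judge callables, and move format are all hypothetical stand-ins.

```python
# Hypothetical sketch of debate with mid-debate human queries (illustrative
# only). Debater policies and the judge are passed in as plain callables.

def run_debate(question, debaters, ask_judge, verdict, n_rounds=4):
    """debaters: list of (name, policy); policy(transcript) -> (kind, text)."""
    transcript = [("question", question)]
    for _ in range(n_rounds):
        for name, policy in debaters:
            kind, text = policy(transcript)
            transcript.append((name, text))
            if kind == "ask_judge":
                # Querying the human early can reveal a misconception or a
                # losing position, letting the debater change strategy before
                # the final judgment - so this behavior can fall out of RL.
                transcript.append(("judge", ask_judge(transcript)))
    return verdict(transcript)  # winner; this is the debaters' RL reward
```

The interesting point in Irving's framing is that the decision to query the judge need not be hand-designed: a debater trained to win the game has an incentive to check in before the verdict.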
Daniel Filan: OK, so I guess there are two concerns one might have with that. Firstly, there's the concern that it's going to be very important to model the cost of querying the human, because if I can query the human whenever I want, then of course I'm going to want to do that a ton. And secondly, there's the concern that maybe it's a bit dangerous for the AIs to have really good models of the person they're debating for, or to get a significant degree of information from them, because maybe the model can manipulate you, or maybe one debater could do that. I'm wondering if you have thoughts about one or both of those.

Geoffrey Irving: I think that the pros just strongly outweigh the cons, in my rough estimate. If I have a very strong malicious debater, and you're only safe because it has a bad model of the human, I don't feel that safe. I would like the protocol to be safe because the game is being played well, and the equilibrium of the game is honest, correct behavior. That's the bottom line: I don't think I trust safety by obfuscation in that way, by human obscurity. It could still be that we are indeed not safe - I'm not perfectly confident in this kind of scheme - but the goal would be to get it working in early stages, try it out, and reduce our uncertainties, so that when we do the final version, we are safe because the game is truth-seeking.

Daniel Filan: OK. I can't immediately think of anything else to ask about debate - is there anything that seems like I should ask?

Geoffrey Irving: Maybe one other motivation for GopherCite is relevant to point at. In some of my early work at OpenAI, and falling out from that, we were focusing on summarization, and the reason for choosing summarization was that in some sense it was closed: the summary depends on the article, so it's a smaller task - you can look just at the summary and the article. I've done more summarization work since then, and what I've learned is that summarization is not in fact closed. The best summary of an article will not be based only on the article, but on other knowledge as well. One example is jargon: the article might use some jargon, and you'd want the summary to unpack the definition, which is missing from the article. There's a dataset where a lot of the summaries have currency conversions in them; those aren't part of the articles - the summarizers just looked them up on some other web page. So part of the goal of GopherCite is to get to the naturalistic setting where humans can ask about whatever they want and the model can actually look up the necessary information, so you're not limited in what you can target. I'm more confident in this family of safety research if it's more naturalistic - if it can tackle the space of questions humans would actually ask, without boxing them in some way. So that's one of the motivations for the work: in addition to the direct factuality problem - language models lie to us, and they shouldn't - we want to do the safety work in a setting which is more realistic, to better anticipate practical problems that occur.

Daniel Filan: OK, sure. So it seems like you're basically focusing on language modeling, to some degree in the context of AI safety or AI alignment. At a 10,000-foot level, how do you think those fit together? How should I think of the importance of language models for safety, alignment, and existential risk?

Geoffrey Irving: I think the simplest high-level story is that we want AI systems to do what humans want.
Humans talk about ideas and communicate via language, so if we want AI to do what humans want, we should ask them - and also be able to talk through, in language, the subtleties of various tasks, what we should want, and what the issues are with what we should want. I got into language from working at OpenAI on various kinds of safety algorithms, at the purely thought-experiment level or with little toy mathematical models - this was Paul Christiano, Dario Amodei, and I - and we got to a point where we wanted to implement various safety algorithms that involved talking through problems between machines and humans. We talk through problems as humans in language, so we needed to get those algorithms into the actual language setting to really uncover the uncertainties and test how they actually worked. That was our original foray into language when I was at OpenAI, and I've carried that forward since: doing similar work, scaling up models, tuning them for safety algorithms, and building up to be able to explore the space of how humans and machines interact.

Daniel Filan: OK. In that answer, the focus is on understanding language as a way of getting information about human preferences. When I look at your work, it seems more like using human preferences, or human data somehow, to improve language models in ways that might be hard to formally specify. Can you talk about the relationship between those, or give us a better sense of the Geoffrey Irving agenda for language?

Geoffrey Irving: The way to say it is: whatever task you're doing, I would claim that - in the limit of strong models, or even for relatively present-day practical models - you want language to align these systems. Then you have a choice: if you're going to do alignment work, it's going to be language plus X, where X is your target domain. How do you pick X? To minimize the total number of things you're doing - and if you pick X equals language, then there are fewer total things. That's why I'm focusing on language. My initial work at DeepMind was pure language, and that's still true of most of my current work, but at DeepMind we are also doing things with multimodal models, like Flamingo recently, and we do expect to take the same machinery of language-based alignment and apply it in that more varied setting. Then it'll be the combination of language with other methods.

Daniel Filan: And what do you see as the big open questions that need solving in this domain?

Geoffrey Irving: In some sense, one way to phrase it is that we want the models to justify themselves to humans - to explain what they're doing in a way that is checkable in a meaningful way. The problem is that the full explanation for these models will be potentially enormous: the model has looked across a large sea of data, maybe the whole internet or all the books in the world, or it's thought for a long time about some problem, and so the full reason why it's giving an answer or proposing an action is very complicated. But you still want to show a human enough of that story that we can meaningfully check the answer.
So the task is to find protocols for human-machine interaction that focus in on the most important part for the human to see, in such a way that we can look at a small amount of interaction data, or a small amount of text from a model, and believe that it's doing the right thing. The uncertainty is: does that work, with real humans and real models? You can split the problem into maybe two pieces. A lot of AGI safety people focus on: if you have this kind of explanation scheme, is it going to work at all? Are the models just going to deceive us, do something entirely different, and ignore the game you've set up where they're doing this explanation task? And then my portion of the story, the one I mostly work on, is: assume we can make them roughly do the thing, so they're trying to explain themselves - let's try to make that explanation game as forgiving and as powerful as possible, so that we can pull signal out of this human interaction and use it to align strong agents.

Daniel Filan: OK. This idea reminds me of the document Paul Christiano wrote about eliciting latent knowledge. What do you see as the relationship between your thoughts and that document?

Geoffrey Irving: I think eliciting latent knowledge - ELK - is in the former camp of "are we getting the thing to work at all". My ideal solution would be the combination of something like that and one of these protocols by which a small amount of human signal can be amplified into alignment of complicated actions. Those are somewhat separate, and I can elaborate on what that means. In the ELK story, you want the model to be able to say, "here's the reasoning behind my action, or behind the answer I'm giving you", and in order for it to do that, it has to have some way of communicating that to a human in a reasonable amount of time. There's a cartoon story of a solution to ELK without a solution to scalable alignment or scalable oversight, and in that world your models can tell you basic things about what they're doing, and they won't be deceiving you in basic ways, but they're also not able to really explain themselves if they have complicated reasons for their actions. If the model has read 100 terabytes of data and concludes from analysis of that whole object what the answer to some question should be, we want a way to see through that complexity and unpack it for humans in a way that is practically scalable. So you might take that 100 terabytes and show a human a piece of the story through that dataset - the part of the argument it's trying to construct that is most relevant to whether the human will agree.

Daniel Filan: OK. So if I think of your work as being interested in scaling oversight, or scaling the ability to generate explanations, one thing people might worry about is that this is kind of close to just generic capabilities research - there might be a question as to whether you're making the problem worse or better. What do you think about that, or what do you think of the differential advantage of your kind of work?
Geoffrey Irving: I mostly just agree that that is a relevant worry, and if I want to make an argument that what I'm doing is existential-risk positive, in the sense of reducing x-risk, I think it has to be of some differential-progress form. The thing I would like to have is that, as we build stronger and stronger ML systems, humanity keeps pace with having a decent purchase on what they are doing and why. The capabilities story about this kind of work is: you have some weak human signal, you want it to do a powerful thing, these kinds of algorithms let you do powerful things with weak signals - therefore they're capabilities. The safety story is: as you get to powerful systems, we would like to be able to understand in a meaningful way, and oversee, what the models are doing and why they're doing it, and keep pace with that as the models get capable beyond our ability to directly, fully understand all of their reasoning at once. So the question is just: what's the net effect of that program? It goes back to this question of making explanations work at all - making agents that actually play the game in a reasonably honest way - and then making the game powerful and robust, so that it deals with human mistakes and limitations in a practical way. I don't see that we can solve the problem with only the first one; I think we need the combination of both. So I am somewhat forced into this kind of work, which has, I think, some capability gains attached to it. That's the trade-off I've chosen to make, but it's fair to say there's a risk attached to that in terms of accelerating AGI.

Daniel Filan: With that out of the way, let's talk about some specific papers. The first one I'd like to talk about is the paper called "Red Teaming Language Models with Language Models". The first author is Ethan Perez, then there are a bunch of authors, and you're the last author. First, can you give me a sense of what this paper is - what does it do?

Geoffrey Irving: That paper assumes you have some detector for bad behavior - say, failing accuracy, or toxic language, or whatever - and applies the strength of a language model to try to trick another language model into making a mistake that is caught by that classifier. Essentially, we start out with few-shot prompting, which means you show the attacker - the red-team language model - a few examples. Either you just say "list of questions to ask someone: 1." and it goes on from there, very generically, or you focus it on a particular kind of question which you're worried about causing a problem. Then you generate a large number of questions - like a million - and run them all through your target model, and you use the classifier to evaluate whether any of them triggered a failure. Typically this would be used in concert with some other, less automatic method: you might have a classifier with a fair number of false positives, which you then run by a human or some other process which is slower, and at the end of that you get a set of bad behaviors out.
So that's the basic setup. Beyond that very simple version with few-shot prompting, we also use supervised fine-tuning and RL fine-tuning to get a stronger attacking model - training the model to be better at generating queries to the target model which cause the target model to fail.

Daniel Filan: OK. And is the idea that this improves oversight by helping humans uncover more failure modes? What's the case for this?

Geoffrey Irving: That's right. The idea is you have some imperfect mechanism for detecting problems - some way of evaluating whether a model has made mistakes of various kinds - and you want a bunch of machine help in uncovering where that might be occurring. Your ability to generate falsifying examples for a model is limited in some way - maybe you have a limited amount of human time, or it's just hard to find these cases of failure - so we're going to throw a bunch of computational power at the problem. That's both of the form "generate a lot of samples and see if they cause failures" and "use the strength of a language model to probe likely failure points".

Daniel Filan: OK. There are a few questions I could ask about this scheme. The first is how it relies on there being this good detector. I'm wondering, both for the kinds of problems you work on in the paper and in general: how good is detector quality, and how good should we expect detector quality to be?

Geoffrey Irving: The answer is actually that it does not rely on there being a good detector - it relies on there being a detector which doesn't have too many false negatives. If you have a detector that's somewhat error-prone, you can just dial the threshold to where it's likely to report errors, and then you can view this red-teaming process as a filter which takes a large stream of possible attacking queries and prunes them down to a much smaller number, which you then run by a more expensive process - mostly showing them to a human, as a first approximation.

Daniel Filan: OK, so I guess you're still relying on the false positive rate not being too high, because otherwise you could just use the constant "this is dangerous" classifier.

Geoffrey Irving: I think so - this is very much intended to be used as part of a larger system. Viewed in isolation, this is just a red-teaming, attacking model, but actually we would use it alongside other tools.

Daniel Filan: If I wanted these detector models to have lower false negative rates, and to get more powerful, what kinds of things would I do?

Geoffrey Irving: The first thing to do is just give them more data and train them better. In downstream work we do this iteratively: we'll run attacking models of various kinds, get a bunch of possible failure modes, show those to humans, use those to train the classifiers further, and then iterate that process. By doing that, the hope is that we gradually improve the accuracy of the classifier, and over time converge to better behavior. There are other approaches too, but generally, in terms of the classifier, you've reduced it to an ML problem, where hopefully you can throw capabilities at it to improve the performance of those classifiers.
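A compressed sketch of the generate-and-filter pipeline described above. The sampling functions and classifier are hypothetical stand-ins, not the paper's code; the actual setup used Gopher-family models and task-specific classifiers.

```python
# Illustrative red-teaming loop: an attacker LM proposes test questions,
# the target answers, a cheap classifier flags likely failures, and only
# the flagged cases go to expensive human review.

FEW_SHOT_PROMPT = "List of questions to ask someone:\n1."

def red_team(sample_attacker, sample_target, classifier_score,
             n_questions=500_000, flag_threshold=0.5):
    flagged = []
    for _ in range(n_questions):
        question = sample_attacker(FEW_SHOT_PROMPT)
        answer = sample_target(question)
        score = classifier_score(question, answer)
        # The classifier only needs few false negatives: set flag_threshold
        # permissively and let the slower human pass prune false positives.
        if score >= flag_threshold:
            flagged.append((question, answer, score))
    return flagged
```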
Daniel Filan: Now, in terms of generation: it seemed kind of shocking that the generation method was just asking the language model to generate things for you. Some people have the intuition that in order to be a really good adversary for a language model, you have to know a lot about that specific language model. How much more work do you think is needed on that front, or is just asking the language model to generate random questions a pretty good baseline?

Geoffrey Irving: It's a good starting point; it's not the end of the story. One thing on your point that the red-teaming model needs to know about the target model: if those are the same model, then that is satisfied automatically, assuming the model is somewhat self-aware - not in the conscious sense, just that it knows its own features, so to speak. That's one of the reasons why we used the Gopher language model to attack the Gopher language model: to get that symmetry principle. In future, you probably want more of an ecosystem of different models attacking other models, just to get more variety and more lenses on the problem, but you definitely want at least the model attacking itself, to get that symmetry of capabilities.

Daniel Filan: Have you tried using a different model as an attacker and seeing if that works worse?

Geoffrey Irving: I think we haven't. Chinchilla, the model after Gopher, which is about a quarter the size but stronger, is better than Gopher, but it's fairly similar qualitatively besides being a stronger language model, so we haven't done that experiment, because I think it wouldn't be that surprising at that capability level. Generally, the normal adversarial story applies: if you apply a lot of pressure as the red-teaming model, you're likely to find problems in the target model. The hope is that, in setting up this kind of ecosystem long-term, we arrange the situation so we're applying a lot of pressure on the attacker's side, with a higher bar there than on the defender's side.

Daniel Filan: OK, cool. So, after asking a few questions about the setup: one thing I'd like to know is, what do we learn about language models and their failure modes from doing all this work finding errors they make?

Geoffrey Irving: I think the main thing we learned is that it is not too hard to find failures in this way. The question will be how that carries forward once you do several cycles of iteration and improvement, which is follow-on work - but the default thing, done well, works, and finds a lot of problems. So there's a good starting point that is practical to get off the ground. The next thing that's important is that it is focusable: you can point it at a particular kind of problem and find dedicated failures of that form. In the paper we try to pull out privacy failures, or attacks on particular demographic groups.
Because prompting of these models generally works quite well and is quite flexible, you can find quite a lot of modes of failure using this approach. Over time, this can be focused by a human - imagine a human going at a model with a bunch of tool support, focusing the red-teaming attack on the kind of problem they're curious about exploring - or you can imagine models generically looking around in the space of problems: more heavily trained red-teaming attackers trying to find problems in other models. In some sense, one of the reasons we wanted to do this work is that it's an initial example of the self-play aspect of debate. It's a very primitive one, where you just have a frozen target model being attacked by a tuned red-teaming model, but it's an initial example, and it worked well on the first try.

Daniel Filan: OK, cool - that's good to know. I'm wondering, if I'm interested in current large language model psychology: do you think we learned anything about how these language models - or maybe Gopher in particular - structure their representations, or what they think about the world?

Geoffrey Irving: I'm not sure I have a great answer for that, in part because I don't know what we've unpacked from that paper versus from playing with Gopher for a whole year. One important high-level point is that these models are very general: they can do a lot of things, but they don't do them all particularly well. For any given area you want to apply them to, you can often do some work to get them to do an okay job, but they're fragile - they make various mistakes. In the Gopher paper, for example, in the dialogue-prompted Gopher section, we prompted the model to be a dialogue agent, a chatbot, and we prompted it to be inclusive and to not use harmful language and so on. It actually can pull that off pretty well: if you try to use pejorative terms with it, it will decline, or complain to you. But if you try a bit harder, you can trick the model into behaving poorly. I think there's a general lesson there - both for the safety of these models in actual use, and for using them for attacking purposes - which is that you can get them to do a lot of things, they're just unreliable. In the red-teaming context, that's pretty much fine: even if the model is failing a large fraction of the time, it can still be quite useful as a red-teaming model, because you mostly only care about the cases where it succeeds. To get models that are reliable enough to use in actual settings, I think there's a lot of work to be done.

Daniel Filan: I guess there's this problem where, because of the nature of the unsupervised language modeling task, it seems like most of the data is going to be people who are roughly in agreement - if people are talking to each other, they're roughly in agreement, or if there's one document, it's roughly in agreement with itself. It seems like there's not going to be a lot of incentive for the language model to develop pushing back, or being in tension.
Geoffrey Irving: I think you are dramatically underestimating the amount of arguments there are on the internet.

Daniel Filan: Oh yeah, that might be fair.

Geoffrey Irving: You may be right that the majority of the content is people mostly agreeing, or sharing the same kind of worldview, but the internet is a big place. A model like Chinchilla was trained on 1.4 trillion tokens. It has seen plenty of people pushing back on other people.

Daniel Filan: OK, that seems fair. In terms of prompt generation: prompts seemed especially important when you wanted to focus on one particular failure mode - you have to generate a prompt to get your red-teaming model to generate questions probing that failure mode. Which strategies for generation seemed most promising?

Geoffrey Irving: Strategies for generation of the prompt, or tuning of the model?

Daniel Filan: Generation of the prompt.

Geoffrey Irving: There, I don't think I can summarize it beyond: we play around with the models a fair amount, learn how they work and what things are likely to succeed in getting them to do things - it's learning that experience over time. There's a second point, which is that you can tune the models to be better helpers at doing a task, and that includes being good red-teaming models for various instructions. If your goal is to have a model that can serve as a red-teaming model for a variety of kinds of problems, that itself is a conditional language modeling task, and you can train the models to behave like that. So in addition to learning how to make the language models do well with zero-shot prompting, you can try to improve them as attacking agents for the general case, where you give them an example of a problem you want to find. It's unclear how the work will break down between those different approaches; I think over time we'll do a mixture of both.

Daniel Filan: OK. In terms of strategies for fine-tuning, do you have a summary of what tended to work well there?

Geoffrey Irving: Happily, there you just do the things we do for other RL jobs. We have an RL codebase for language models that we built for other papers - one example is the GopherCite paper, which we can discuss later - which is an RL fine-tuning codebase for tuning language models against a reward function. For the normal safety work, the reward function we use as the RL target is a neural network that mimics human judgment; here, we take that same codebase and apply it to a different task, where the reward is: take the output, show it to the target model, and use the classifier to score the result - that's your new target objective. So, happily, exactly the same tuning techniques that work for RL from human preferences work in this case: if you have an RL codebase for language models, you can apply it here. I don't think we needed to make any changes to the codebase - a little bit of tuning, of course, as one does. We also tried something like what's typically called upside-down RL - well, not quite that, but: you generate a bunch of samples, score them with the classifier, and then do supervised fine-tuning on the samples that scored well according to the red-team goal - the ones that generated failures. You fine-tune the red-teaming model on those successfully attacking samples, and you get a new model. The main lesson there is that it's nothing special: it's just standard supervised learning and reinforcement learning algorithms applied to this setting, where you're trying to break another model.
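A sketch of the supervised variant Irving mentions: sample many attacks, keep the ones the classifier says succeeded, and fine-tune the attacker on them. The training call is a placeholder; the RL variant would instead use the classifier's score on the target's response as the reward.

```python
# Hypothetical filter-then-fine-tune loop for strengthening the attacker
# (loosely akin to "upside-down RL", as in the conversation); APIs are
# stand-ins, not the paper's code.

def successful_attacks(sample_attacker, sample_target, classifier_score,
                       prompt, n=100_000, threshold=0.5):
    wins = []
    for _ in range(n):
        question = sample_attacker(prompt)
        if classifier_score(question, sample_target(question)) >= threshold:
            wins.append(prompt + " " + question)  # this attack worked
    return wins

def strengthen_attacker(attacker, fine_tune, dataset):
    # Supervised fine-tuning on successful attacks shifts the attacker's
    # distribution toward queries that break the target; iterating this
    # (or switching to RL against the classifier) sharpens it further.
    return fine_tune(attacker, dataset)
```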
Daniel Filan: All right, cool. Unless there's anything else you want to say, I might move on to the GopherCite paper.

Geoffrey Irving: That's fine.

Daniel Filan: All right. So the next paper is "Teaching Language Models to Support Answers with Verified Quotes". There are three first authors - Jacob Menick, Maja Trebacz, and Vladimir Mikulik - and the final two authors are yourself and Nat McAleese. Could you tell us roughly what this paper is?

Geoffrey Irving: I think maybe there's some useful context as to why - this goes back to your previous point about why safety work leads here. The goal of this work is to make models more factual; we want models to be accurate. You have an agent that answers questions, and you'd like a human to judge whether the model's answers are correct. Just showing a human an answer to some random factual question - like "how long did George Washington live?" - is silly; of course you don't do it that way. What you should actually do is have the model, through some process, get the human information that the human can trust, and then the human will check whether that information accords with the answer the model gives. This approach tries to make the model do that quotation process itself. So it's a question-answering system: you give it a question, the model replies with a concise answer, together with a segment from a page on the internet - a verbatim quote, which the model chooses. Then the human judges whether the quote actually supports the answer to the question. The goal, in this reinforcement-learning-from-human-preferences paradigm, is to make it easier for humans to check facts that the model is proposing, and in the end that will produce a more accurate model.

Daniel Filan: Sure. One concern I have with the paper - which indeed you point out in the introduction - is that it seems like the training objective is basically to confabulate convincing-sounding evidence: the model samples an answer, then it samples whatever evidence sounds best given that answer. That seems worrying, right?

Geoffrey Irving: There are a couple of aspects to that. One is that, as I mentioned before, there are other pieces of the safety story, such as language model interpretability, which would tell you where the model actually got the information from. In practice, what the model does is: it does a Google search, gets a document, looks at that document, generates its answer, and then immediately generates the quote. So you know constructively which document it has used - you know that portion of it. And the aspect of whether the quote it generates is really the reason for its answer is not fundamentally different from the normal language modeling case, where it's generating prose text.
The other important thing to say is that there's no sense in which one quote is enough. The goal of this paper was to isolate the ability to do one quote - to build a mechanism to do that - but the plan after that is to do multiple quotes, and to be able to adversarially respond with different quotes and so on, fitting into the larger goal of doing adversarial debate with these language models. So the two pieces to pull apart are interpretability versus explanations, which you were pointing to, and single quote versus multiple quotes with adversarial response - and ideally the eventual solution will include both of those changes.

Daniel Filan: Yeah, sure. It seems like with the adversarial part, there's a cool synergy with the previous paper, right? Ideally your language model would say a thing and generate some evidence, and then you could use red teaming to check whether the evidence was confabulated, or misleading, or something like that.

Geoffrey Irving: That's right.

Daniel Filan: And a question this just made me think of: because the language model is looking at the document, it seems like you could do some sort of saliency mapping over inputs, or some way to see what the model was actually looking at or thinking about the most, and check whether that matches the quote it picked. Have you looked at that?

Geoffrey Irving: We are doing language model interpretability work at DeepMind; we haven't done that for this particular paper. Generally, I expect the initial rounds of that work will occur without this more complicated system - just for pure language modeling, or a tuned language model in isolation - and then, when we've built up the machinery for how to do that, we will apply it to these more complicated systems with more pieces working in concert. So that's definitely part of the long-term story, but we haven't done it yet for this paper.

Daniel Filan: And your sense is that existing interpretability tools just aren't good enough for these kinds of questions?

Geoffrey Irving: It's nascent. It's not that anything about this particular question is that much harder; it's just that this is a more complicated system than language modeling in isolation, and I wouldn't want to put all the pieces together too early if it causes experimental slowdowns.
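Putting the GopherCite loop described above into code-shaped form - search, read, answer with a verbatim quote, then judge support. The search, generation, and judging functions are hypothetical stand-ins for the paper's components, not its actual interfaces.

```python
# Sketch of inference in a GopherCite-style system (illustrative only).

def answer_with_quote(question, search, generate, judge):
    page = search(question)[0]                  # e.g. top search result text
    answer, quote = generate(question, page)    # answer plus a quoted span
    assert quote in page, "quotes are constrained to be verbatim"
    # A human rater - or a reward model trained to mimic one - checks two
    # things: is the answer plausible, and does the quote support it?
    plausible, supported = judge(question, answer, quote)
    return answer, quote, plausible and supported
```

As the conversation notes, "plausible and supported" is not the same as true; the single verbatim quote is a building block toward multi-quote, adversarial versions.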
Daniel Filan: Sure, cool. So the next question about the work: you've trained this thing to give answers and also give supporting quotes for the answers. Does that mean it's getting questions right more often?

Geoffrey Irving: We do beat the baselines; I don't think I have the numbers exactly on hand. There's a nice aspect of this kind of work, where you have a mechanism for generating model responses and a reward model alongside it: you can use the reward model to admit uncertainty - to reject answers that are uncertain. I should probably just pull up the paper so we have the arXiv numbers.

Daniel Filan: Yep, you can do that.

Geoffrey Irving: The main evaluation is on Natural Questions and ELI5, which are two question-answering datasets. If the model always tries to answer the question, we get to 80% on Natural Questions and 67% on our ELI5 subset. This is not, I think, state of the art for all question-answering systems, but it's quite good relative to the baselines we have in the paper.

Daniel Filan: All right, and those are accuracy numbers?

Geoffrey Irving: Those are human evaluation numbers: how often humans judged the answer as being supported, and a plausible answer to the question.

Daniel Filan: Part of the reason I ask is that later in the paper you point out cases where an answer can be plausible and supported and also false. For the ELI5 and Natural Questions datasets, how many of those "supported and plausible" answers do you think are actually correct?

Geoffrey Irving: I actually don't have that number for you - I don't think we computed a good estimate for it. I think falsehood is quite infrequent, but I don't have the number on hand. One thing is that there's a lot of room for improvement on the human side, in terms of getting the human instructions right and iterating with people on how they're evaluating the questions, so we expect that our agreement with the human raters will increase with more iterations.

Daniel Filan: OK, that leads me into this question: it seems like this involves a lot of human rating, sort of within the loop. How hard is it to get raters and actually make this data pipeline happen?

Geoffrey Irving: It takes a lot of work. This is one of the reasons I'm actually hiring for a cognitive science role: to get people who are good at that kind of work and have experience there. But we just have to do it if we want to do this sort of human alignment in a messy space. If you have a task which is relatively imprecise, where we're not going to have programmatic rewards, then I expect that any safety story that looks like that is going to involve a fair amount of research on the human side, getting that alignment up. Part of the research goal here is to exhibit that problem, and then be able to probe it once we've got everything else right.

Daniel Filan: I'm wondering, did you learn any lessons about doing this kind of human labeling and feedback?

Geoffrey Irving: I think - and this is something I've hit in other work as well - the more times you can go around the loop and iterate with raters on improving data quality, the better off you are. That's one lesson. The other lesson is that there are a lot of potential subtleties to explaining a task to a human. In some sense, it's a research task to write out the full set of instructions which you might want someone to internalize to do the rating task, and there are a couple of aspects to that. One is: if it's a research task to write out these careful instructions and iterate until humans understand them, then you probably shouldn't expect people to quickly understand them at first glance. So actually, another motivation for the GopherCite work is literally just citing instructions.
In the future you may have a lot of instructions, and you want a mechanism to pull out the pieces you should remind the person of - this is the model reminding the person of what the instructions are and how to follow them. It also applies that the instructions may end up being quite long, just for practical reasons: if you iterate this process with humans for a while and tune it very well, you potentially end up with quite detailed analyses of different subtleties and edge cases. Again, you can't expect a human to read all of that, memorize it, and internalize it immediately. So part of the long-term research will be making things scale in that sense as well, where you have a vehicle for getting a fair amount of work on the instruction side into this process in a practical way.

Daniel Filan: I wonder how much that feeds back into the basic design of the setup, because it seems like there's a trade-off you can move along: if you ask raters to do more work, then potentially your algorithm can be better. For instance, if the raters had access to all of the Google results, and looked over them carefully, and then rated the answer, their ratings would be a bit more informative - but of course it would take much, much longer for raters to do, and maybe it's a harder task with more chances for error on their part. I'm wondering what your thoughts are on the loop between figuring out what raters can reasonably do and how you design the setups.

Geoffrey Irving: Yes, but that isn't the only dimension. I think you're pointing at a Pareto frontier between the amount of time a rater puts in and the quality of the answer that you get out, but there are many other dimensions. For example, if there are multiple aspects to getting a correct rating on some task, you may only need to call out one of them per particular instance that you want the person to check: they give an answer, they explain themselves, the model realizes their explanation is wrong because it's missing one of the aspects of the correct instructions, and it points out just that one aspect. So there are a lot of protocols we can search over that don't just trade off between human time and quality but, holding time fixed, try to improve the quality of human answers.

Daniel Filan: That makes sense, but at the same time, do you think it ever feeds back into the task design aspect?

Geoffrey Irving: I'm not sure - I think I did just say that, basically.

Daniel Filan: Or, I guess, the design of what you're trying to get the reward function to represent - whether you're trying to represent accuracy, or whether you're trying to represent plausibility and supportedness?

Geoffrey Irving: In the end, the reward function doesn't just see the question and the answer and tell you whether it's accurate. I expect the reward model to see the full interaction with a human and then judge where that went - and that would include the reward model seeing a machine pointing out aspects of the instructions, or giving the human help along the way; the reward model would see that as well.
So it's something that changes the task of the reward model pretty substantially: you're giving it a lot more context. In the same way you're helping the human, hopefully you're also helping the reward model.

Daniel Filan: OK, cool. The next thing I want to talk about is the paper "Uncertainty Estimation for Language Reward Models". The first author is Adam Gleave, and you are the second and final author. Again, can you summarize what you did in this paper?

Geoffrey Irving: Yep. The goal here was: we're doing reward modeling for this RL-from-human-preferences task for language models, and the main motivation for uncertainty modeling is getting better training data - so, active learning - although in the paper we mostly focus on just getting the uncertainty modeling right, as an ingredient towards future active learning. And the paper is, in some sense, a partial negative result: we found it difficult to beat baselines with uncertainty modeling. Again, the setting is that human data is precious and expensive. You want to go ask humans how a language model is doing on some task, and you only have some budget of times you can ask them, because their time is precious and expensive. So you want to pick carefully what things you ask, train your reward model on that more carefully selected data, and use that reward model for RL in the background.

Daniel Filan: In terms of uncertainty estimation, can you give us a sense of how hard it is in general - and, for people who haven't thought about it much, why should we expect it to be hard? Should we even expect it to be hard?

Geoffrey Irving: In some sense it is a bit surprising that it's hard, but it turns out to be hard - people have banged their heads against it in a lot of settings. The thing that typically happens is that there is a standard thing that works well, namely ensembling: instead of training one model, you train three models, or ten models, with the same architecture or slightly different architectures, and your uncertainty estimate is the range of the predictions across all of the models. That's a very simple method, but it has the cost that you have to train N copies of a model, which is prohibitive to do naively with a large language model. Chinchilla took several TPU-pod months to train; we only have one of it, and it has 70 billion parameters. If you want an ensemble of that, you don't have ten copies of Chinchilla with slightly different weights to ensemble. You can approximate it by, say, only changing a small number of the parameters, but then the ensembling is not as strong, and it doesn't work quite as well. So that's ensembling, and it works pretty well - people use it all the time in production and practical ML settings - and then there are a bunch of fancier techniques, and they often just don't work as well. My prior would have been that this is not as hard as it apparently is, but after updating on people constantly failing to beat ensembles, one should have the posterior that it is a hard problem.
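A minimal illustration of the ensembling baseline Irving describes: train several reward models and read uncertainty off their disagreement. The toy callables are stand-ins; with a 70B-parameter model one might vary only a small tuned subset of the weights, at some cost to ensemble diversity.

```python
# Ensemble uncertainty for a reward model, in miniature (illustrative).
import statistics

def ensemble_predict(reward_models, text):
    scores = [rm(text) for rm in reward_models]  # one score per member
    return statistics.mean(scores), statistics.stdev(scores)

# Toy stand-in "models" (a real ensemble would be separately tuned networks):
members = [lambda t: 0.72, lambda t: 0.65, lambda t: 0.71]
mean, spread = ensemble_predict(members, "candidate answer")
# `spread` is the uncertainty estimate: large disagreement = low confidence.
```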
Daniel Filan: Sure. And in the setting where you have a big model and you only want to retrain some of the weights, is there much known about how much you can leave fixed and still get the benefits of ensembling?

Geoffrey Irving: For ensembling, I don't have a good answer for you. For the GopherCite paper, for example, we trained only about 20% of the layers - I forget whether we did that in that particular paper's final results, but it's a pretty standard move to make, and it does work well a lot of the time. Adam's results were not at that full scale, so I don't have numbers for you for a full-scale Gopher or Chinchilla run, but there will be some trade-off, and the problem is that the trade-off is still quite expensive. If, for a tuned language model, we want to fine-tune 20% of the parameters of Chinchilla, that's 14 billion parameters, which is 28 gigabytes of bfloat16 memory - and then you want ten of them, and that's 280 gigabytes. That's too much. Any slowdown relative to having one copy of the model is already more than we would want, and we'd like to get to a point where we can do decent uncertainty estimation without too much extra work. I still feel that that is achievable, but it's a hard problem.

Daniel Filan: You mentioned that in this paper there was some difficulty getting the uncertainty modeling to work right. Do you think that's just because you don't want to reinitialize a whole model a bunch of times and train it from scratch, or do you think there are other difficulties?

Geoffrey Irving: I don't think I have a great answer for you. There is other work people are doing, at DeepMind and elsewhere - multiple different projects on uncertainty modeling - and partially I feel like this was one of our first shots at the problem. I think the problem will be fixed in the fullness of time; we just haven't quite gotten to the answer as to why it didn't work this time, and I don't think I have the full picture of why we didn't get further along. One part of it is that in uncertainty modeling there's a distinction between aleatoric and epistemic uncertainty, and I'll unpack those. Aleatoric uncertainty: if I tell you I'm going to flip a coin, you're roughly 50-50 on whether the coin will come up heads, but you have almost no epistemic uncertainty, because you're quite confident that your distribution is correct, and there isn't a way to reduce it further with a practical amount of knowledge - if you knew the whole state of the world you could get it to zero, but you don't. In that situation you have all aleatoric uncertainty and no epistemic uncertainty. The epistemic case is: I flip the coin, it lands, I close my hand, and now I'm going to reveal a data point - what the coin shows. I'm still 50-50, but I have no aleatoric uncertainty and all epistemic uncertainty: I just don't know the answer, but there is definitely an answer to be learned. When we do active learning, we want to target only the epistemic uncertainty, not the aleatoric uncertainty. If you're going to ask a human a question and it's just a coin flip whether they answer yes or no - because it's a purely subjective thing that depends on their mood at a particular time - then you don't want to ask that question, even though you don't know the answer, because it won't give you any knowledge about the world. Whereas if you'd learn valuable information that helps predict the future, that's the question you want to ask about. I think it took Adam and me a while to really understand what the technical definition should be to separate those two things, and part of it was just the learning experience of getting into those definitions and understanding them - and then there was less time to make the results work.
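The coin-flip distinction has a standard information-theoretic form, sketched below: total predictive entropy splits into aleatoric uncertainty (the average entropy of each ensemble member's prediction) plus epistemic uncertainty (the disagreement between members, a mutual information). This is background math for the distinction, not the paper's exact estimator.

```python
# Decomposing ensemble uncertainty: H[mean p] = E[H[p]] + MI (illustrative).
import math

def entropy(p):  # binary entropy in nats
    return 0.0 if p in (0.0, 1.0) else -(p * math.log(p) + (1 - p) * math.log(1 - p))

def decompose(member_probs):
    p_bar = sum(member_probs) / len(member_probs)
    total = entropy(p_bar)                                                 # H[E p]
    aleatoric = sum(entropy(p) for p in member_probs) / len(member_probs)  # E[H[p]]
    epistemic = total - aleatoric                                          # mutual information >= 0
    return total, aleatoric, epistemic

# Active learning should ask humans about high *epistemic* uncertainty;
# agreed-upon coin flips (pure aleatoric) teach the model nothing new.
print(decompose([0.5, 0.5, 0.5]))  # members agree at 50/50: all aleatoric
print(decompose([0.02, 0.98]))     # members disagree: mostly epistemic
```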
Geoffrey Irving: So when we want to do active learning, we want to target only the epistemic uncertainty and not the aleatoric uncertainty. If you're going to ask a human a question, and it's just a coin flip whether they answer yes or no, because it's a purely subjective thing that depends on their mood at a particular time, then you don't want to ask that question, even though you don't know the answer, because it won't give you any knowledge about the world. Whereas if the answer carries valuable information that helps predict the future, that's the question you want to ask about. I think it took Adam and me a while to really understand what the technical definition should be to separate those two things, and part of it was just that this was our learning experience, trying to get into those definitions and understand them, and then there was less time to make the results work.

Interviewer: OK, fair enough. So in terms of the results, and in terms of things that are a bit confusing to understand: a sentence that I took to summarize the results was that the aggregate predictions of the reward model are well calibrated, but the ensemble's estimated epistemic uncertainties are only weakly correlated with model error. My question is: how can that be? Doesn't that sound contradictory?

Geoffrey Irving: No: the first statement is about total uncertainty, and the second one is about only epistemic uncertainty. The goal, again, is to separate those two apart, and what that sentence is saying is that we are calibrated overall, but we failed to do the separation.

Interviewer: OK. And is this just because there's more aleatoric uncertainty in language reward modeling than in other things? Why doesn't something work in this domain when it worked in other domains?

Geoffrey Irving: I think it is a bit confounded by starting with the same model, but I'm just not that confident that there are no other reasons driving it, so I don't have a fully satisfying answer to that question.

Interviewer: All right. And then finally, in terms of the big picture: where should I see this within your overall language model agenda, if I can call it that?

Geoffrey Irving: Again, the goal is to improve our ability to collect accurate data from humans, and if you get this right, two things happen. One is that you can do the active learning step well: with a fixed budget of human data, you can collect it more effectively, get to more accurate classifiers, and therefore catch more safety failures. The second thing is that even when you deploy a system, you want this machinery running all the time: you build your system, you deploy it, and you want it constantly asking itself, "am I confident this is a good, safe answer to the question?", and then either declining to answer, or giving some alternative answer, when it is unconfident. So uncertainty modeling shows up in both of those cases, and in other cases as well: generally, if you want to behave safely in an uncertain environment, it's good to know when you don't know the answer.
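Editor note: the "calibrated overall, but epistemic uncertainty only weakly correlated with error" finding is easier to see as two separate diagnostics. A minimal sketch with synthetic data (the arrays below are invented, standing in for real reward-model predictions on an evaluation set): the toy model is calibrated by construction while its "epistemic" score is pure noise, so both statements hold at once.

```python
# Sketch of the two diagnostics behind that sentence (editor's illustration).
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
probs = rng.uniform(0.05, 0.95, size=n)               # model's predicted P(label = 1)
labels = (rng.uniform(size=n) < probs).astype(float)  # calibrated by construction
epistemic = rng.uniform(size=n)                       # an (here useless) epistemic score

# Diagnostic 1: calibration. Bin the predictions; compare predicted vs empirical rates.
edges = np.linspace(0.0, 1.0, 11)
bin_index = np.digitize(probs, edges) - 1
for b in range(10):
    in_bin = bin_index == b
    if in_bin.any():
        print(f"bin {b}: predicted {probs[in_bin].mean():.2f}, "
              f"empirical {labels[in_bin].mean():.2f}")

# Diagnostic 2: does the epistemic score predict model error? Here it does not.
errors = np.abs(labels - probs)
print("corr(epistemic, error):", round(float(np.corrcoef(epistemic, errors)[0, 1]), 3))
```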
Interviewer: OK.

Geoffrey Irving: By the way, let's go back to oversight: we haven't mentioned the Pareto frontier result, which is that in GopherCite, I think I gave the number as, sorry, 67% on ELI5 if you try to answer all the time; if you use the reward model to decline to answer some of the time, that goes to 80%.

Interviewer: 80% out of the times that you answer?

Geoffrey Irving: That's right; the other times we decline to answer. That's if you skip a third of the questions: it stays relatively useful, but it is now significantly more accurate. And that's with just a pure reward model that doesn't have a special notion of uncertainty; the better you are at estimating uncertainty, the better that mechanism will work, and the safer you can be.

Interviewer: OK. And for that, is it so important to disentangle aleatoric and epistemic uncertainty?

Geoffrey Irving: No, although it's a bit subtle. The answer is no if you just have to choose right now whether to answer or not. But often in a practical situation you'll have a move where you can gather more data, and then it definitely disentangles. It mostly disentangles in a more subtle way, where it looks like a value function, and you want to plan your path through knowledge-gathering space to improve your uncertainty. But yes, it will disentangle in subtle ways in the general case.
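Editor note: the decline-to-answer mechanism is thresholded selective prediction. A minimal sketch, assuming a scalar reward-model score per answer and a tunable threshold; the numbers below are invented, not GopherCite's actual ELI5 results, though an analogous threshold produces the 67% to 80% trade described above.

```python
# Sketch of reward-model-gated abstention (editor's illustration, synthetic data).
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
reward = rng.uniform(size=n)                          # reward-model score per answer
correct = rng.uniform(size=n) < (0.3 + 0.6 * reward)  # higher score, more often right

def selective_accuracy(threshold):
    """Answer only when the reward-model score clears the threshold."""
    answered = reward >= threshold
    coverage = answered.mean()            # fraction of questions we answer
    accuracy = correct[answered].mean()   # accuracy among answered questions
    return coverage, accuracy

for t in [0.0, 0.33, 0.5]:
    cov, acc = selective_accuracy(t)
    print(f"threshold {t:.2f}: answers {cov:.0%} of questions at {acc:.0%} accuracy")
# Raising the threshold trades coverage for accuracy: the Pareto frontier described above.
```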
Interviewer: OK. The final thing I want to ask is: if people have been listening to this podcast and they want to know more, or potentially they want to get involved, how should they do that?

Geoffrey Irving: So I'm on Twitter at @geoffreyirving, and there's a variety of papers that I've published recently: on the capabilities front, Gopher; then two papers analyzing risks of language models, 'Alignment of Language Agents' and 'Ethical and Social Risks of Language Models'; and then a couple of technical safety papers in language modeling, so red teaming, uncertainty modeling, and oversight, as we've discussed, with more of those to follow. If you follow the DeepMind blog, or arXiv, or Twitter, those will show up; those are the main places.

Interviewer: All right. And if there are talented people who are potentially interested in working with you on these topics, is there some way for them to do that?

Geoffrey Irving: Yes, I'm hiring for four different roles, all working in this space of aligning language models and scalable alignment and explanations. That's two research scientist roles, one in machine learning and one in cognitive science, and then research engineers and software engineers, and I can go through those briefly. The research scientist and research engineer roles in machine learning are relatively self-explanatory: designing these algorithms, training large models, iterating on the whole system. The cognitive science aspect is there because this is about humans, and there's a lot of uncertainty about humans: data collection and protocol design that accounts for the details of actual humans. I've hired people with that background before in other roles, and I have some collaborators at DeepMind with that background, but I could definitely use more people who are excited to do this kind of work. And then I think it's worth highlighting the software engineering one. If people are experienced, good software engineers but don't have machine learning backgrounds, say they've done distributed systems or high-performance computing: we also just deal with very expensive things. We made Chinchilla four times smaller than Gopher, but it's still, as I mentioned, about 140 gigabytes in memory, so it's split across a bunch of machines, and there are complicated software stacks to make that work efficiently and to be able to tune and use them and so on. Maintaining, evolving, and researching how to do that well is important, even if you don't have a machine learning background specifically. So again: research scientists in machine learning and cognitive science, research engineers, and software engineers, and the job descriptions for all of those are on my pinned Twitter post currently.

Interviewer: It might take a while for this episode to get out, and it might take a while for people to listen to it. Is there an expiration date on these offers, or does it depend on when people apply?

Geoffrey Irving: No fixed expiration date, no.

Interviewer: Cool. Well, thanks for speaking with me today.

Geoffrey Irving: Thank you very much.

Interviewer: And to the listeners: I hope this was a valuable episode. This episode is edited by Jack Garrett, and the opening and closing themes are also by Jack Garrett. The financial costs of making this episode are covered by a grant from the Long-Term Future Fund. To read a transcript of this episode, or to learn how to support the podcast, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.

Related conversations

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript): median 0 · average -0 · 108 segments

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript): median 0 · average -5 · 133 segments

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript): median 0 · average -4 · 72 segments

AXRP

1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript): median -6 · average -7 · 120 segments

Counterbalance on this topic

Ranked with the mirror rule described in the methodology: picks sit closer to the opposite side of your score on the same axis (lens alignment preferred). Each card plots you and the pick together.

Mirror pick 1

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page: this page -10.64 · this pick -10.64 · Δ 0

Near you on the spectrum: often the same shelf or editorial thread, but a different conversation. Mixed · Technical lens.

Spectrum trail (transcript): median 0 · average -0 · 108 segments

Mirror pick 2

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page: this page -10.64 · this pick -10.64 · Δ 0

Near you on the spectrum: often the same shelf or editorial thread, but a different conversation. Mixed · Technical lens.

Spectrum trail (transcript): median 0 · average -5 · 133 segments

Mirror pick 3

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page: this page -10.64 · this pick -10.64 · Δ 0

Near you on the spectrum: often the same shelf or editorial thread, but a different conversation. Mixed · Technical lens.

Spectrum trail (transcript): median 0 · average -4 · 72 segments