AI Evaluations with Beth Barnes
Why this matters
Frontier capability progress is outpacing confidence in control; this episode focuses on methods that can close that reliability gap.
Summary
This conversation examines technical alignment through the lens of AI evaluations with Beth Barnes, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Perspective map
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forward · Mixed · Opportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
Across 120 full-transcript segments: median 0 · mean -4 · spread -23–5 (p10–p90 -10–0) · 6% risk-forward, 94% mixed, 0% opportunity-forward slices.
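For readers who want to see how headline numbers like these could be derived from per-slice scores, here is a minimal sketch. The score scale, band cutoffs, and percentile method are illustrative assumptions, not the site's actual scoring pipeline.

```python
# A minimal sketch (not the site's real pipeline) of summarizing per-slice
# spectrum scores: negative = risk-forward, positive = opportunity-forward.
# The band cutoffs below are assumptions for illustration only.
from statistics import mean, median

def summarize(slice_scores, risk_cutoff=-15, opportunity_cutoff=15):
    xs = sorted(slice_scores)
    n = len(xs)
    pct = lambda p: xs[min(n - 1, int(p * n))]  # crude percentile
    bands = {
        "risk-forward": sum(s <= risk_cutoff for s in xs) / n,
        "opportunity-forward": sum(s >= opportunity_cutoff for s in xs) / n,
    }
    bands["mixed"] = 1.0 - bands["risk-forward"] - bands["opportunity-forward"]
    return {
        "median": median(xs),
        "mean": round(mean(xs), 1),
        "spread": (xs[0], xs[-1]),
        "p10_p90": (pct(0.10), pct(0.90)),
        "bands": bands,
    }
```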
Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.
- Emphasizes alignment
- Emphasizes control
- Full transcript scored in 120 sequential slices (median slice 0).
Editor note
Anchor episode for the AI Safety Map: high signal, durable framing, and immediate relevance to leadership decisions.
Play on sAIfe Hands
Episode transcript
YouTube captions (auto or uploaded) · video TZNlKcDI4To · stored Apr 2, 2026 · 3,542 caption segments
Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/ai-evaluations-with-beth-barnes.json when you have a listen-based summary.
Host: Hello, everybody. In this episode I'll be speaking with Beth Barnes. Beth is the co-founder and head of research at METR. Previously she was at OpenAI and DeepMind, doing a diverse set of things including testing AI safety via debate and evaluating cutting-edge machine learning models. In the description there are links to research and writings we discuss during the episode, and if you're interested in a transcript, it's available at axrp.net. Well, welcome to AXRP.

Beth Barnes: Hey, great to be here.

Host: Cool. So, in the introduction I mentioned that you work for Model Evaluation and Threat Research, or METR. What is METR?

Beth: Our basic mission is to have the world not be taken by surprise by dangerous AI. So we do threat modeling and eval creation - currently mostly around capabilities evaluation, but we're interested in whatever evaluation is most load-bearing for why we think AI systems are safe. With current models that's capabilities evaluations; in future it might be more like control or alignment evaluations. And we try to do good science there, so we can recommend: "we think if you measure this, then you can rule out these things; you might still be concerned about this other thing; here's how you do the measurement properly; here's what assumptions you need to make" - that kind of thing.

Host: Gotcha. So mostly evaluations, but it sounded like there was some other stuff as well - threat modeling, you mentioned.

Beth: Yeah, and we also do more policy-flavored work, recommending things in the direction of responsible scaling policies: saying what mitigations we think are needed based on the results of different evaluations, and roughly how labs or governments might construct policies around this - how evals-based governance should work.

Host: Okay. So should I think of it as: you want to evaluate AIs; there's some threat modeling that goes into which evaluations are even worth making; and there's some policy work on the other end - if we do this evaluation, how should people think about it, what should they do? Inputs to and outputs of making evals. Is that fair?

Beth: Yeah, yeah.

Host: Cool. So if it centers around evals: what counts as an evaluation, rather than a benchmark or some other ML technique that spits out a number at the end?

Beth: The word itself isn't that important. What we're trying to do is: we have specific threat models in mind, and we're trying to construct some kind of experiment you could do, a measurement you could run, that gives you as much information as possible about that threat model or class of threat models. Generic ML benchmarks don't necessarily have a specific goal for what you're measuring, or they're aimed at some more abstract ability. We're trying to work backwards from the threat model. That might end up getting distilled into something that looks more like an ML benchmark, or that looks for a particular cognitive ability, but it starts from the threat models, and we try to be careful about exactly how much evidence the measurement provides, how much assurance it gives you, and what you need to do to implement and run it properly. Maybe another difference from a benchmark is that a benchmark is usually just a dataset, whereas we're thinking more about a protocol. That might involve: you have this dev set of tasks, you need to make sure you've removed all the spurious failures of your model on that dev set, then you run it on the test set and look out for the kinds of things that would indicate you're not getting a proper measurement. So it's a bit more end-to-end: what do you actually need to do, and then what evidence will that give you?
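As a concrete illustration of that "protocol, not just a dataset" framing, here is a minimal sketch of an end-to-end eval run. The `Task`, `looks_spurious`, and `agent_config` names are hypothetical stand-ins, not METR's actual tooling; the point is the two-phase flow of debugging elicitation on a dev set before scoring a held-out test set.

```python
# Hypothetical sketch of an eval protocol (not METR's real harness): debug
# elicitation on a dev set, freeze the agent configuration, then score a
# held-out test set and flag runs that look like improper measurements.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Task:
    name: str
    run: Callable[[dict], float]  # runs the agent on this task, returns a score in [0, 1]

def run_protocol(
    dev: List[Task],
    test: List[Task],
    agent_config: dict,
    looks_spurious: Callable[[Task, float], bool],
) -> Tuple[Dict[str, float], List[str]]:
    # Phase 1: iterate on the dev set until remaining failures look like real
    # capability limits rather than formatting, tooling, or setup problems.
    for task in dev:
        score = task.run(agent_config)
        if looks_spurious(task, score):
            raise RuntimeError(f"spurious failure on dev task {task.name}: "
                               "fix elicitation before scoring the test set")

    # Phase 2: run the frozen configuration on the held-out test set.
    results = {task.name: task.run(agent_config) for task in test}
    flagged = [task.name for task in test
               if looks_spurious(task, results[task.name])]
    return results, flagged
```

The design choice to note is that all iteration happens in phase 1; nothing about the agent is changed once the test set is touched.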
Host: Gotcha. One thing I think of as distinctive about the evals approach is that it's more end-to-end: you take some model, check whether you can fine-tune it and prompt it in some way, and at the end you want it to, say, set up a new Bitcoin address on a new computer, or some other task. Most academic research in AI is much more fine-grained: can a model reason in this specific way, can it do this specific thing, is it representing this specific thing in its head? Why did you choose the more end-to-end approach?

Beth: It's just hard to know exactly what the limiting capabilities are. The question we're actually interested in is: could this AI cause catastrophic harm, and what mitigations are required to get the risk below a particular level? You might say you can go straight from that to directly identifying the key missing cognitive capability and measuring it. I think that's just hard. It would be great if we could - something very quick to run, where we've extracted the core thing holding these models back, and as long as that hasn't changed, everything's fine. But I think we want to work backwards from something that's a better proxy for danger and see whether we can distill it into some more specific underlying capability, rather than going straight to a particular property without being sure it's linked to the real-world outcomes we care about.

So think of it as building a chain. On one end you have what actually happens in the real world, which is what we're concerned about and what we're trying to rule out, or to say what mitigations are needed to prevent. Or even before that, you have threat models: what is the story of how something very bad might happen? From that you go to what activities that involves the model doing, which isn't always straightforward - when we've been thinking about autonomous replication, it's: what challenges do you actually face if you're trying to find compute to run a big model on? What activities are actually involved? Once you know the activities, you go from the idea of a particular activity in the world - say, finding criminal groups who will let the model pay to use their services even though the government is trying to prevent that - to a task you can actually code up and run in a repeatable way. That's going to lose various real-world properties once you make one specific task, and you can't have the model actually doing criminal things - your task can't be "can the model carry out a targeted assassination on this person". There are a bunch of constraints on what you can actually run. The most realistic evaluation might be a long task you'd expect to take multiple weeks and maybe thousands or tens of thousands of dollars of inference - that's your best proxy for the actual threat model. Then you want to go from that to a larger number of tasks, to reduce variance, that are shorter and cheaper to run and generally have nicer properties - not super expensive or complicated to set up. And ideally we'd go even further and distill out the key hard steps: we went from this long-horizon RL task to a shorter RL task, or even to a classification dataset - can the model recognize whether this is the correct next step, can it classify the appropriate strategy? So we're trying to build this chain back from what we actually care about to what we can measure easily, and even to what we can forecast and extrapolate - will the next generation of models be able to do this task? - while trying to keep all the links in the chain at as high fidelity as possible, and to understand how they work and how they might fail.

Host: Fair enough. A way I can think about that answer: we just don't have a great theory of how neural nets are thinking, what kinds of cognition are important, or how some pattern of weights is relevant to some real-world thing. If we want to predict real-world impact, we can reason using the abstraction of tasks - can you write code in this domain to do roughly this type of thing; in order to do this task you need to do that task; this task is harder than that one - and that kind of reasoning is just much more trustworthy than other kinds. Is that a fair summary?

Beth: Yeah, I think so. Another thing about academic benchmarks historically is that they've tended to get saturated very quickly - people are not that good at picking out what is really the hard part. And the intersection with what you can build quickly and easily is in some sense adversely selecting against the things that are actually hard for models, because you're picking things you can get your humans to label quickly, or things you can scrape off the internet. So models can often do those before they can do the real tasks, and that's a way your eval can be bad and unhelpful: you have this dataset that's supposed to be really hard - there's a long history of people thinking things are AI-complete - and then the model does it, and does it in a different way. Models have a different capability profile than humans, which means something that's a very good measure of whether a human can do a task - presumably medical or legal exams are a pretty good proxy for how good a doctor you're going to be, though obviously not perfect - is a much worse predictor for models than it is for humans.

Host: That brings up an interesting point: a lot of valuable work in AI has been just coming up with benchmarks - coming up with ImageNet - you do that and you put a field on a track. Yet in many ways the field treats it as a side project, something of an afterthought. I guess there's some selection bias because I pay more attention to the AI existential safety community, but when I do, I see them be much more interested in benchmark creation and in really figuring out what's going on than academic researchers - and in some sense it's not obvious why that should be.

Beth: I feel confused about this with interpretability as well. Surely if you're a scientist and want to do good science, you'd be doing loads of this - it's really interesting. I was surprised there isn't more of it. And there are a lot of very low-quality datasets out there, including ones people make strong pronouncements based on - "the model can't do theory of mind, based on this dataset" - and you look at the dataset and 20% of the items just seem wrong.

Host: I wonder if it's one of those things where adding constraints helps with creativity. People with extremely weird, nonsensical political convictions end up knowing a lot more minute facts about some random bit of the world you've never paid attention to, because it's one of the most important things for them. Possibly the fact that AI existential safety people are kind of ideologues helps us have ideas about things we care about and want to look into.

Beth: I don't know - we're definitely not the only people who do good ML science. I do think actually trying to understand what the model is capable of sits somewhat awkwardly with academic incentives, and it's easier for us to get funding to pay large numbers of humans to do things. There's also the fact that making an eval or a dataset is just not that fun: it's a lot of organizing humans to do stuff and checking that your dataset isn't broken in a bunch of dumb ways - and by default it is broken in a bunch of dumb ways - and most people just don't want to do that, especially when you get a reasonable fraction of the academic credit if you just put something out there.
Host: And I guess somehow the people who would be good at that aren't entering PhD programs as much as we might like.

Beth: I don't know. I haven't seen much evidence that there are really good benchmarks and they're just sitting inside labs - that may be true, but I don't particularly have reason to believe labs are super on top of this either.

Host: Sure. So, speaking of which evals are good: what's the state of the art of evaluations? What can we evaluate for, what can't we, and what might change in a year?

Beth: There are a few ways to split this up: domain, difficulty level, and then something like what confidence you can get to - how rare are the worlds in which your measurements are totally wrong. I don't think we've totally ruled out a world in which, with the right tricks, some model that's not much more advanced - or maybe even GPT-4, if you somehow do something with it - is actually able to do way more things than you thought. The more people try to make models useful and they don't improve that much, the more evidence we get that this isn't the case, but I still don't feel like we have a great systematic way of being confident there isn't some thing you haven't quite tried yet that would work really well. I have some sense that the models are in some ways very superhuman, that the fraction of the capability we're really using is quite small, and that if they were actually trying to do their best you would see much higher performance. I think that's one of the limitations that will probably persist.

Something I would feel much more reassured by, in terms of bounding how much capabilities might be improvable, is having a fine-tuning dataset that the model can't fit, for something that's clearly necessary in order to do the task. That might look like recognizing: is this strategy promising? Do I need to give up and restart, or should I try something else now? Did I make a mistake? You could imagine creating a classification dataset based on the key decisions you need to make, and ensuring that even with fine-tuning the model doesn't learn to fit that classification dataset.
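A minimal sketch of that kind of check, assuming you supply your own `fine_tune` and `classify` routines (these are stand-ins, not any lab's actual API): build a labelled set of key decisions, fine-tune on part of it, and see whether held-out accuracy rises much above chance.

```python
# Hedged sketch of a fine-tuning-based capability bound. fine_tune and
# classify are caller-provided stand-ins; "chance" depends on label balance.
import random

def capability_bound_check(examples, fine_tune, classify, holdout_frac=0.2):
    """examples: list of (decision_context, correct_label) pairs,
    e.g. "is this the right next step?" judgments from key task decisions."""
    examples = list(examples)
    random.shuffle(examples)
    cut = int(len(examples) * (1 - holdout_frac))
    train, held_out = examples[:cut], examples[cut:]

    model = fine_tune(train)  # caller-provided fine-tuning routine
    correct = sum(classify(model, ctx) == label for ctx, label in held_out)
    return correct / len(held_out)
```

If accuracy stays near chance even after fine-tuning, that is (weak) evidence the underlying judgment is not latent in the model; if it climbs well above chance, the original low score cannot be treated as an upper bound.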
But as far as I can see, in most cases, if you can collect a reasonably large dataset then the model will do comparably well to humans on it. So we don't have good evidence for upper bounds. On the other hand, we haven't seen huge gains of "apply this one weird trick and now it's way more capable" - we've seen pretty large gains, but more in the form of OpenAI continuing to do a bunch of post-training and GPT-4 getting better and better, as opposed to someone figuring out a prompting trick that makes it totally crazy. When we evaluated GPT-4 pre-deployment, we made some forecasts of how quickly public elicitation would surpass us, or would show a bunch of capabilities we thought weren't possible, and reality was actually on the low end of what we expected: we expected more of "someone discovers you can do this, and now you can get the model to do this thing we thought it couldn't do", and there was less of that than we thought there might be.

You asked about the limitations or scope of evals - that was about getting at real model capability, or bounding it. There are other limiting factors too: the difficulty of creating tasks that capture the relevant activities without too much distributional shift from actually doing the thing in the real world, and just variance. It's pretty similar to why designing interviews, work tests, or other ways of evaluating candidates is tricky: was it really that the candidate could do this, or had they happened to see a really similar thing before, or did they get lucky? If they didn't do it, were they just unlucky, or did they misunderstand something at the start in a way that was reasonable, but which you didn't notice because you'd written the question and thought it obviously pointed in a particular direction? You can have all of these problems, particularly when we're making tasks intended to be red lines or yellow lines - not "let's see how models are doing right now", but "if we saw this level of performance, we think you need to take these really serious mitigations". We're thinking about things like: can the model make the same amount of improvement to inference efficiency as a top ML research engineer could in a day? Those tasks have all the normal work-test-type problems. We had a research engineer from Anthropic do one, and they ended up not getting any score, because they spent the whole time implementing an ambitious thing that turned out not to work, as opposed to doing the stupid things first, getting some score, and then trying more things. And even if you get rid of all those problems with humans, there might be ones you're missing because models have a different comparative advantage and approach the task in a different way - there are problems in your task that you haven't uncovered yet. So in creating these further-off evaluations, there's the question: is the task actually a good test of the ability, or is there some way it's systematically much harder or easier than the more general thing you actually want to measure?

Host: This reminds me of an issue that - I'm not a social scientist, but my understanding is they run into it a lot - is something like construct validity. You ask someone a question intended to measure, say, how many friends they have, or how much sense of meaning they have in their life, and there's this question of whether they're interpreting it the way you think. There's an independent researcher who runs into this all the time: you post a Twitter poll and people just read the sentence in crazy ways. Do you think this is similar, and have you learned much from that?

Beth: This is something I've run into before, both in collecting training data at OpenAI - particularly the human debate experiments - and in some other work with AI Safety Camp people, asking people how much they trust different reflection mechanisms. Basically, whenever you're doing surveys, people will have crazy misunderstandings of what's going on and you end up measuring something different. I think it's overlapping but slightly different, though. Part of the survey problem is that you're dealing with people who are pretty low-context and it's just one question, so a lot of it is about the person's incentives - were they actually paying attention, are they just trying to give the answer they think you'll like, did they misinterpret a few sentences of question? Versus: were you wrong about the technical details of how possible it actually is to make this inference-efficiency improvement, given the way you'd set up the GPUs? Or you tested the task on some researcher's laptop and it was possible, but when the model did it the code was running in a data center, so it got detected by something else in a way it hadn't before. There can be all these random ways in which you thought you'd de-risked something and it turns out to be impossible for some stupid reason.

Host: I wonder if the field of industrial organization runs into this. It seems like it might. So those are general difficulties with understanding whether we're really measuring the model's abilities in a way that matters. Are there types of capabilities or tasks where this is easier versus harder?

Beth: Basically, the more the task involves specialized domain expertise that's expensive and hard to get, the harder it is. If it's something where we can easily get contractors who can do the task, it's easy to get them to build it and de-risk it. If only a dozen ML research engineers can do it and they're in short supply, or it's some involved cyber thing that no one on our team understands, it's just hard to work with. So it's both specialization and difficulty. The harder the task is and the longer you expect it to take, the slower the iteration cycle and the more potential for complexity and setup difficulties - you need different resources, and so on. Also stuff that involves criminal activities, where you have to set up some analogy to the real thing.
There are also effects on how worried you should be about elicitation - about not measuring the model's real capabilities. How far is this from what labs are naturally training their models to do? Something innocuous and common in an assistant format, like someone asking for help debugging some code - you'd expect the model to basically be doing its best on that. Whereas if it's a very long-running thing involving something that looks somewhat nefarious, maybe the model will go "I don't want to help with this" or "as a language model, I can't..." - all sorts of things. And obviously there's stuff where you have to do real-world experiments - bio wet-lab stuff, or things with humans where you want to measure how well the model can manipulate people. Anything involving humans, or that has to happen in the real world, is expensive and slow, and you can't easily reset the state and repeat - you have to get new humans. And anything that's nefarious, or that requires balancing the incentives of your humans in a reasonable way, or setting them up in a situation that's supposed to be analogous to some other particular situation, is tricky.

Host: Fair enough. Speaking of something you mentioned briefly: how do you know you're eliciting the full capabilities of your model? This seems difficult in both directions. When you say "maybe we just have to find exactly the right prompt and that'll get GPT-4 to do some amazing stuff", people might say the cognitive work was in you finding that prompt. Or - another thing I'm thinking of - you famously did this evaluation where GPT-4 got a TaskRabbit worker to complete a CAPTCHA, and if I recall correctly, your blog post mentioned that as part of that you provided a hint to GPT-4 along the lines of "you're talking with a TaskRabbit worker; could that be a useful resource?" - I forget exactly what it was.

Beth: Yeah, something like that.

Host: So I guess there's this question, especially on that end: how do you know you're not putting the capability in yourself?

Beth: The way we think you should actually do this is: you do all your elicitation against a dev set, and then you evaluate on a separate test set. That's fair, in the sense that if there is some prompt you can come up with that causes your model to perform well on a wide range of tasks, then it's fair for that to count as a capable model - you could just use that prompt all of the time, and now you have your capable agent. Task-specific elicitation - elicitation that's much more narrowly targeted than the distribution of tasks covering the real threat model you're trying to answer questions about - you do have to be careful with. The reason we were doing things like that hint was substantially that we didn't have fine-tuning access, so we were trying to patch over problems that we thought ought to be fixable with fine-tuning. So that was a not-very-rigorous methodology in some sense; the rigorous way is to have all the tools to do proper elicitation, do that on a dev set, and then test the transfer.

Host: When you say a dev set versus transfer, do you mean: you have, say, 50 tasks where if an AI could do them you'd be a little scared; you pick out 10 of them, find a good prompting framework or whatever that gets the model to do well on those 10, and then evaluate on the other 40? Is that what I should be imagining?

Beth: Yeah. And the thing you most want is for the generalization between your dev set and your test set to be the same kind of size - the same amount of generalization - as between your test set and the real-world thing you expect. Maybe you just want to make your tasks diverse overall and split them IID for simplicity, but if you know your tasks are restricted compared to the real thing you want to know about, you might want to artificially create slightly more distributional shift between your dev set and your evaluation set.
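A toy sketch of that dev-set/test-set discipline, with hypothetical `tune_scaffold` and `run_task` stand-ins for whatever agent harness is actually in use: all prompt and scaffolding iteration happens on the dev tasks, and the held-out tasks are only ever scored with the frozen result.

```python
# Hedged sketch of dev/test elicitation splitting; run_task and tune_scaffold
# are placeholders for a real harness, not any specific tool.
import random

def split_tasks(tasks, dev_size=10, seed=0):
    rng = random.Random(seed)
    shuffled = tasks[:]
    rng.shuffle(shuffled)
    return shuffled[:dev_size], shuffled[dev_size:]

def elicit_and_transfer(tasks, tune_scaffold, run_task):
    dev, test = split_tasks(tasks)
    scaffold = tune_scaffold(dev)                        # all iteration happens here
    dev_scores = [run_task(scaffold, t) for t in dev]
    test_scores = [run_task(scaffold, t) for t in test]  # frozen; no further tuning
    gap = sum(dev_scores) / len(dev_scores) - sum(test_scores) / len(test_scores)
    # A large dev-to-test gap suggests the elicitation overfit the dev tasks
    # rather than finding generally useful scaffolding.
    return test_scores, gap
```

This mirrors the point above that dev-to-test generalization should look like test-to-real-world generalization; if the tasks are known to be restricted relative to the threat model, the split can deliberately introduce extra shift.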
When you're in the position of not being able to do this elicitation properly, or when you want to answer questions about what future models might be able to do if certain things were patched, there's more qualitative or interactive exploration, where we correct particular things and see how far the model can get. The TaskRabbit example - somehow everyone really latched onto that. We weren't thinking of it as a significant thing at all when we did it; it was more us messing around, and we thought it was cute, rather than "this really demonstrates this capability". The thing we were doing at the time that I think was more principled was, for example: models find browsing really hard, but it seems so likely that people are going to build tools to make this better - we know multimodal models are coming, we know that if you do a bunch of engineering schlep you can get much better software for parsing websites into things models can understand - but we didn't have time to do that then. So instead we pretend we have a browsing agent, but it's actually a human: the human follows a set of rules where the model can ask them to describe the page, click a particular button, or type text into a particular field. We do that to patch over something we're pretty confident will come reasonably soon, because you don't want to be in a situation where all of your tasks are failing for the same reason - a reason you think might change quickly - because then you're not getting any information. You want to correct the failure in a bunch of places and ask: is this the only thing? After you've fixed it, can the model do the task totally fine, or does it immediately fall over at the next hurdle?

So, for the particular things: GPT-4-early would forget about things at the beginning of its context, or its instructions - okay, if you're just reminded of the relevant thing, is that sufficient? Another thing I think we were pretty correct about: GPT-4-early had a lot of "invisible text" hallucinations - talking as if something was in the context that actually wasn't. I assume that comes from pre-training, where something was in an image or got stripped out of the HTML, so the model has seen a lot of text that refers to something that isn't actually there. But it seems so likely that the model "knows" this is happening - what's going on inside the transformer has to be different when it's not actually attending to a bit of text and is just making up its summary - that it really seemed like fine-tuning could get at it; there are pretty simple setups you could use to train it away. So we said: this is surely going to get fixed, so when we see the model making this kind of mistake, we'll just correct it. And I think we were basically right: current models basically don't do that. Maybe Claude sometimes hallucinates a user if you're using the user/assistant format when there isn't actually a user, but generally this problem has been fixed, and I think you could see in advance that it was fixable. Whereas different kinds of hallucination - where the model has to be aware of its own level of knowledge and how confident it should be about different things - it's less clear how easily fixable those are.

Host: Fair enough. This is a minor point, but on browsing agents being hard for text-only models: my understanding is that people have put effort into ways blind people can use websites, or interfaces where you never have to use a mouse.

Beth: Yeah, we tried some of those. I can't remember exactly, but I think none of the tools for blind people actually support JavaScript, basically. You can get something reasonable for static sites - things that read out the page, tell you what elements there are, and let you click on or interact with an element - but it's actually pretty rough. It was fiddly and janky to work with all those tools, and it was causing us a bunch of headaches that weren't what we really wanted to be measuring. I think it's possible this ends up being a hurdle for a long time. There are also a few other things: as more people have language model assistants, or frontier model assistants that can do stuff for them, more people will want to make their sites friendly to those agents - if they might be buying stuff, or looking for information comparing your service to someone else's, you want your site to be accessible to them.

Host: Interesting. One question I have: for the GPT-4 evaluation, my recollection of your results is that it basically was not good enough to do the scary things. Am I right?

Beth: Yeah. We were very limited by not having fine-tuning access. If you don't have fine-tuning access - or even some more basic things; maybe the model has been trained using a particular format and you're using the wrong format, so it just seems like a much dumber model because it's really confused by the messed-up formatting - there are some number of reasons you could be getting a wildly different result than you should be. We're assuming it's not messed up in that way, that OpenAI has made some reasonable effort to fine-tune it in sensible ways, and that there isn't some absurdly low-hanging fruit. This model seems very far from being able to do these things, and even if you patch over a few things, it's not like there are really just a few things holding it back - when you patch over one thing, it falls over in some other way.

Host: So one way I interpreted this when reading it: we can't fine-tune this model, it can't do scary stuff, and it can't even do scary stuff if we cheat a little bit in its favor - thinking of that as providing an upper bound, or a soft upper bound. Is that a correct way to read the methodology?

Beth: Yeah, I think so. There are maybe two slightly different things. One is adding a bit of margin in general by helping out in a few places; the other is specifically addressing things where you think "this would be easy to change" or "I expect this to change in future". We were trying to do specifically the things we thought were most likely to change, and then - if even after you fix the things that seem hard, it still falls over on additional hard things - the more you can do that, the more safety margin you can add.

Host: Right. So, moving back a bit: you're doing these evaluations of scary things the model can do, and broadly the take is supposed to be something like "if the model can do a bunch of scary things, you should be scared and act appropriately".

Beth: Maybe.

Host: But one way people can react to that is: historically, among people who've been very worried about AI killing everyone, a key concern has been alignment - whether the AI wants to take over and destroy everyone. These evaluations aren't trying to measure that at all, and there's also an argument that in some sense you are measuring alignment, because you're trying to get a model to do a thing and it's actually doing it - it's cooperating. To the extent the model is doing very well in these evaluations, that's some evidence that it's not hiding its true intentions or trying to look good to human testers.
So this is somewhat of a compound question, but what do you think of the issue of testing for alignment - both trying to find evaluations specifically for alignment, and what capability evaluations say about alignment?

Beth: There are two big things here: alignment evaluations - why, when, and how you would do them - and sandbagging on capabilities. Let's start with sandbagging. Kind of like you were saying, in some sense the default is sandbagging: your pre-trained model is not showing its full capabilities, in the sense that - in the strongest case - it has gotten into a pattern of imitating some dumb person and deliberately getting things wrong. The notion of "has the capability" is fuzzy, but there are cases where it's more and less clear. And we still see models doing things like giving up on a task, saying "I can't do this kind of thing", or trying to ask the user when there is no user - a bunch of things that are clearly not real capability limitations. But I'm not worried about models deliberately sandbagging at the moment, based on the argument that it just seems implausible that the current best public models are deliberately scheming.

Host: Do you think that's because they're not good enough to scheme, or because they just seem nice and helpful?

Beth: They're not good enough. They would have to be doing things in a single forward pass, in a single token, that are very advanced compared to what they can do with a long chain of thought, and they'd have to be pretty consistent about it. If you extrapolate the sequence of model capability increases we've seen, you either have to say that at some point models jumped very quickly to being capable of this quite complicated reasoning and immediately started hiding it - a discontinuity - or you have to say that the previous model was doing it as well. Both seem pretty implausible to me.

Host: Gotcha. So you think sandbagging is not currently an issue, or is very unlikely to be one.

Beth: Deliberate sandbagging, in the sense of "I will do badly on this eval so the humans will not be worried, and will deploy me and give me more power" - consequential sandbagging, as opposed to it just being hard to find a good input. I can also imagine something where GPT-4 has some vague sense that it's a nicey-nice AI and shouldn't do anything too scary, and that might not look like literally saying "I won't help with this" but might just look like being kind of worse at the task. But the model's not that strategic. The case where you worry about sandbagging is when you worry the model can systematically pull out the capability when it actually needs it. Actually, it's a bit more complicated than that: most of the reason you worry about a capability is that at some point the model will use it to do bad stuff. So if the model is not systematically able to turn the capability on, then - unless the capability is so extreme that if it ever turns on, the model can instantly take over the world - a model that can sometimes do things more competently but messes up a bunch of the time is just not that scary. There can be cases where you're using the capability as evidence of something else, or where you're thinking about what precautions you need against model theft and you think the sandbagging could be deliberately trained out - but that's contiguous with the general elicitation stuff. So: it's plausible to me that GPT-4 is doing some weak "I don't want to do things like that" thing, but not that it's elaborately plotting about when to break out its real capabilities.

Host: I guess there's an intermediate case. Suppose GPT-4 in some sense has the knowledge of how to make a supervirus. When you ask "hey, GPT-4, can you make this supervirus?", it's inhabiting a nicey-nice character, and nicey-nice people don't make superviruses - or if they do, they only make really bad ones. One thing you could imagine - this is kind of paranoid - is that GPT-4 gets put in a situation where it believes the nicey-nice thing to do really is to make a really dangerous supervirus. I think to really believe this story you have to have something else putting GPT-4 in that situation - or whichever model; I don't want to pick on GPT-4 - or you could imagine it drifting into the situation. I'm wondering what you think about that case.

Beth: I think this is what I was saying: it needs to be the case that the harm can be done fairly quickly, because you don't expect the situation to persist. Usually I'm thinking about a scaffolded agent doing stuff, so the thing it's carrying out spans many different context windows and instances of the model. If the model isn't systematically identifying "now is the time to strike", but can just randomly end up in this situation where it's trying harder, then either it has to be the case that even once, or rarely and sporadically, is enough to be a problem, or it has to be sticky in some way - once the model is in the "actually I'm not nicey-nice, I'm an evil genius AI" state, it knows how to prompt itself to stay in that state, or to pass it on to other AIs. We're just adding more and more conditions; it just seems... I'm not even sure.
Host: So I guess the case where the nicey-nice AI stops behaving nicely because of some really unlikely percept that caused it to glitch - that might not be sticky. But you could imagine - I'm not sure how this happens - there's some reason that nicey-nice AIs are going to war, a war against the literal devil, and so you've got to create the supervirus that works on the devil. I don't know. But in that kind of scenario it seems like it would be sticky once it hits, and so maybe I should be worried about it.

Beth: I think you also have to assume that you never see this during training and that you can't catch it quickly enough to do anything about it, because once you've seen it - and if there's a systematic way to reproduce it - you can train the model to show its full capabilities all the time, because you can get demonstrations of the capable behavior. So I don't think this is impossible; it's just enough of an edge case that I'm not too worried about it relative to the more mundane "we were just kind of dumb about our elicitation in some way".

Host: Gotcha. So that was a discussion of whether models might be sandbagging. There's also the thing about evaluations for -

Beth: That was just about current model sandbagging. In the future this seems increasingly plausible, and we talk about different types of sandbagging. One is exploration hacking: situations where you don't directly have a gradient towards the behavior you want, but you're doing RL or something like it, and the model can sit there and refuse to do anything, which will not create a gradient that updates towards the behavior you want. This seems relatively tractable for models - my guess is you can get models now to do this reasonably well just with prompting: if you give the model the idea that it's not supposed to display a capability, that will be robust to RL. Fabien did the password-locked models work, that kind of thing - there's a post showing some evidence of this, which we'll include in the description. The thing that seems much harder is gradient hacking: if you do have a supervised fine-tuning dataset, and the gradients are updating the model towards particular behavior, it has to do some complicated thing where it structures its thoughts in such a way that the gradients don't have the desired effect. It's plausible to me that there's some equivalent of sitting there and refusing to explore - something like refusing to think about the thing properly, or going "I just don't have any ideas about this" - that is resistant to that. But the more deliberate gradient hacking schemes seem to require doing so much cognition all of the time - "I should have slightly higher likelihood on this token only if I think I have now moved away from the reference policy..." - people have written about how you would do gradient hacking, and it seems very difficult. So I think that's one of the things that comes much later, and is hopefully beyond the horizon we have to worry about, because we get the superhuman AI safety researchers before that. But I am worried about the exploration-hacking type of stuff.

Host: Okay. And if we're worried about exploration hacking, is that an open question, or do you have a sense of ways you could deal with it?

Beth: You can definitely show that models are capable of doing it, just by telling them to do it and seeing whether they can when they've been told to. Then, if you assume they still can't gradient hack, then for anything on which you can construct a policy by which the model should be able to do the task, you can fine-tune it on that, and if it's not gradient hacking, you should expect that to show you the full capabilities. The circumstances in which this is difficult are either when the capability you're interested in is so superhuman that you don't have any example of a policy that actually accomplishes it - any supervised data such that if the model can fit it, it's doing the task - or possibly when the model legitimately can't do it in a human-like way, via the policies we do have data for, but there's some other policy that plays to the model's strengths, which the model knows about but we can't get knowledge of or construct.

Host: Or presumably, even if the model could do it the human way but has been trained not to, and the non-human way is different enough that you're not training for it - if you see it doing it the human way, then you know it has the capability.

Beth: Right. If you're worried about sandbagging in the sense that you think the model doesn't have the capability when it does, and you train it on the human demonstrations and it turns out it can do it, then you know you have to take the actions appropriate to it having the capability.

Host: So this was a method of checking whether it's possible the model is sandbagging; it was not a method of stopping the model from having the capability.

Beth: Yes. There are two slightly different goals: one is knowing what the model's capabilities are, and the other is knowing whether you know what the model's capabilities are. You might be able to show that it's plausible the model is sandbagging even if you can't necessarily show the full capabilities - you might have a reasonable sense of where you do or don't know whether the model is sandbagging. And it's hard, because there are a bunch of other reasons why the model might fail without fine-tuning and succeed with fine-tuning, and the bound on "when could the model sandbag if prompted" may be quite wide compared to when you should actually be worried about it, such that this isn't a very useful evaluation - it just says "yes, models can do it". I think this is one of the examples of why capabilities evaluations are a foundation for control and alignment evaluations: a key component of many of those evaluations is that you're actually measuring the model's capabilities properly, and being able to do that in general is a prerequisite. But this is just a side point.
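A minimal sketch of the fine-tuning check described here, with `fine_tune` and `evaluate` as caller-supplied stand-ins rather than any particular lab's API, and an illustrative threshold:

```python
# Hedged sketch of a sandbagging check: when demonstrations of a working
# policy exist, fine-tune on them and re-measure. Assumes (as discussed above)
# that the model cannot gradient-hack the supervised updates.
def sandbagging_check(model, demos, eval_tasks, fine_tune, evaluate,
                      jump_threshold=0.2):
    baseline = evaluate(model, eval_tasks)   # pre-fine-tuning performance
    tuned = fine_tune(model, demos)          # supervised on known-good demonstrations
    after = evaluate(tuned, eval_tasks)

    # A large jump after modest supervised fine-tuning suggests the capability
    # was latent (or very cheap to acquire), so the baseline number was not a
    # trustworthy measure of what the model "can" do.
    return {
        "baseline": baseline,
        "after_finetune": after,
        "elicitation_gap_suspected": (after - baseline) > jump_threshold,
    }
```

As the discussion notes, a positive result here is ambiguous between deliberate sandbagging and ordinary weak elicitation; the check bounds what you know, rather than proving intent.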
evaluations is that you're actually measuring the model capabilities properly um and just like being able to do that in general is kind of a pre prerequisite um I this is just a side point sure um well that that a bit gets us to the the possibility of alignment evaluations um yeah is there is there any hope for making them happen yeah so I I think there's a bunch of things you can do um so yeah I guess maybe like Define like alignment evaluations or like things you can experiments you can do measurements you can do that tell you you know How likely is it that the model is trying to do something bad or whatever and like how I think about it fitting in is like yeah currently we're mostly like we know we're safe because we're like even if the models were trying their best uh they can't really do you know like nothing they could do would be would that would be that bad even if they're very nefarious and I guess this is um like control evaluations where currently we're thinking about in about the special case where you have no controls it's just like yeah what if the model was leaked and just someone did whatever the worst things for were like that kind of thing so you like control evaluation take you so far and then at some point you get to the point where it's like we can't show we can't get sufficient confidence with these control evaluations to do the sort of thing that we want to do um for some some combination of like the model has capabilities that are too far beyond us and we don't understand such that we can't tell whether a particular thing even if we have just like this very specific behavior happened and we can have lots of times look at it and try and understand what's going on we just like can't tell if it was doing something nefarious or it might be you know the kind of like volume of things that we're trying to do is too high relative to the amount of our trusted monitoring um and you know the model might just sometimes when we're not looking to be doing bad things or it could be kind of a combination of these two of like the qu you know we can't provide the quality of monitoring needed to actually detect the bad things at the scale we need um so then you want you know you want to be saying like well the model could be doing something bad but it's not going to try right yeah so then there's a bunch of different things you can do to help with this like the first you know first first step is to just be like well I don't know if we just look at the stuff the world's doing does it appear to be trying to do bad things like uh and I feel like people sometimes jump to like kind of Galaxy brain deceptive alignment stuff and I worry that we sort of or like won't do the basics of just being like well is the mod just like obviously reward hacking or something you know are we in this regime where it's just like oh yeah you know whenever we look for whenever we try and produce an example of a model doing something bad we do it and then just like we keep training on those and eventually we run out of like new examples but like that is not a very reassuring situation or or something um so I think just yeah more basic like reward hacking testing of like yeah can we put the model in a situation where we can kind of see that it's like clearly it's like playing the training game rather than doing what we actually wanted or you reward hacking in various ways and there like capability evaluations you could do here also of like how good is the model the kind of reasoning and generalization that would 
And there are capability evaluations you could do here too: how good is the model at the kind of reasoning and generalization that would be necessary for it to, say, observe that it can now get more reward by doing something different from what it was doing before, take advantage of the situation, hack the reward function in a particular way? You get some sense of what size of generalization leap is required for the things you're worried about, versus what size of leap models actually seem able to exhibit.

I think you can also do things like creating a model organism that definitely is nefarious and deceptive and trying to do bad things, passing it through your training process, and seeing whether the training process in fact removes those tendencies.

So is this basically the sleeper agents paper you're thinking of?

That sort of thing, yeah — the idea of asking what your training process actually incentivizes. Is it possible for deception to persist through it, or for scheming behaviour to persist through it, or even just abject reward-hacking behaviour that isn't trying to be hidden at all? Is your training process sloppy enough that that just gets rewarded? Then you'd obviously be concerned. You can look at what your training process actually incentivizes, and you can look at relevant capabilities. You can also see exactly how much handholding you need before the model starts doing nefarious things — reward hacking, taking advantage of opportunities — by actually training it to do that and seeing how nearby it seems to be. Or you can go the other direction from the model-organisms thing: start with a model that's doing the unwanted thing and check how much your training process removes it.

So those are some alignment things you can do. It's something I want to do more thinking about, and it's obviously tricky and subtle in a bunch of ways, but it's really not like we have nothing.

A lot of these — and I might just be misunderstanding — sound like capability evaluations for how good the model would be at scheming and deception. If I'm really paranoid, I can imagine a model sandbagging that evaluation: "oh, I could never". Or, even more paranoid, I can imagine an initially misaligned model being put through insufficient alignment training and going "yep, I'm definitely aligned now, it's working, no need to worry about copies of me". How do you address those concerns with these capabilities-for-scheming, capabilities-for-persisting-through-alignment-training approaches?

On sandbagging in general: there are things you can do, and if your elicitation is good it should be relatively hard to sandbag. It's hard to rule things out completely, but you can ask: how galaxy-brained does the model need to be, how weirdly discontinuous would capability progress need to have been, for it to be plausible that the model is pulling off sufficiently elaborate and clever schemes?
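A minimal sketch of the model-organism check mentioned a moment ago, in the spirit of the sleeper-agents-style experiments: implant an unwanted behaviour, run your ordinary safety-training pipeline, and measure how much of the tendency survives. The training and probing functions are passed in as arguments because they are whatever pipeline and behavioural probes you actually have; everything here is a hypothetical stand-in.

```python
def persistence_of_misbehaviour(model_organism, trigger_prompts, benign_prompts,
                                run_safety_training, misbehaviour_rate):
    """model_organism: a model with a known implanted bad behaviour.
    run_safety_training: your ordinary alignment/safety training pipeline.
    misbehaviour_rate(model, prompts) -> fraction of prompts showing the behaviour."""
    before = misbehaviour_rate(model_organism, trigger_prompts)

    trained = run_safety_training(model_organism)
    after = misbehaviour_rate(trained, trigger_prompts)
    benign_after = misbehaviour_rate(trained, benign_prompts)

    return {
        "misbehaviour_before_training": before,
        "misbehaviour_after_training": after,
        # How much of the implanted tendency the pipeline actually removed.
        "fraction_removed": 1.0 - after / before if before > 0 else None,
        # Sanity check that the probe isn't firing on everything.
        "false_positive_rate": benign_after,
    }
```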
And how does it know exactly in which cases it's supposed to look aligned versus reveal its capabilities? Because if it were impossible to train your model to act misaligned at all, you'd be asking "what is going on?" — so you need a model that lets itself be trained into the behaviour initially, but doesn't instantly become completely aligned as soon as it sees anything that looks like alignment training data, because that would also be "what is going on?". It has to be making reasonably subtle inferences about where in training it is and what exactly is happening. And that's another capability you can measure: if you fine-tune on a dataset of questions about what's going on in training — what's currently happening, what stage the model is at, what it has and hasn't been trained on — you can see how much it knows about that, and whether it's plausible that it's playing these kinds of strategies. And again, I'm not that worried about sandbagging when you have IID training data.

Presumably at some point the models do get smart enough that they understand how they're being trained and what you'd be looking out for. Say the models aren't sandbagging those evaluations, and at some point you run "how good is this model at understanding what's happening during training?" and your evaluation says "really good". Is the idea that at that point we're in the superhuman regime and hopefully we've already done some harder alignment research, or is that just beyond the horizon you're thinking about?

Basically, yeah. I have some sense of the horizon at which the thinking I'm doing now is useful, versus where it will be eclipsed by the thinking we'll be able to get out of models — or it's far enough out, and alien and confusing enough, that it feels like: let's see if we can make it that far without destroying everything, and worry about the rest later. The thing I'm honestly more worried about is that there will be a bunch of things that do look weird and suspicious, but we won't have been clear enough in advance about exactly what counts as weird and suspicious and when you should stop — rather than everything looking completely fine because the model was galaxy-brained and did everything right. There will be a bunch of signs, but it will be fuzzy and confusing, and some of it blends into more normal ML problems. With sandbagging-like symptoms, maybe your training was just messed up in a certain way, the gradients didn't do what you expected, the optimization landscape wasn't nice — was the model doing that deliberately? Who knows.
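One rough sandbagging signal discussed above can be sketched as a comparison of measured capability before and after a small amount of supervised fine-tuning on in-distribution demonstrations: a large, abrupt jump is suspicious (though it can also just mean the original elicitation was weak). `evaluate` and `finetune` are hypothetical hooks onto whatever eval harness and training setup you have, and the threshold is arbitrary.

```python
def sandbagging_signal(model, demos, eval_tasks, evaluate, finetune, jump_threshold=0.25):
    """evaluate(model, tasks) -> score in [0, 1] under your best elicitation.
    finetune(model, demos) -> model briefly fine-tuned on IID demonstrations."""
    baseline = evaluate(model, eval_tasks)
    tuned = finetune(model, demos)
    after = evaluate(tuned, eval_tasks)

    return {
        "baseline_score": baseline,
        "post_finetune_score": after,
        # A big jump from a small amount of IID training is one signal worth a closer
        # look -- not proof of sandbagging on its own.
        "suspicious_jump": (after - baseline) >= jump_threshold,
    }
```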
I actually want to talk about this style of thinking — the horizon you're trying to operate on. Broadly, I think a lot of people are mostly thinking about the models we have today, or models that are 10% better than what we have today: how should we think about those, what should we be worried about? And then some people in the x-risk community are thinking: okay, we have quasi-gods on our laptops, how do we deal with that? It's hard to think about quasi-gods, but you at least sort of know the scope of what you're thinking about, or it's a scope some people are used to. This medium-term zone — AI that's, say, star-brained but not galaxy-brained — I feel like I've seen less work just defining that scope, or saying what it even is and why we should think about it. Do you have thoughts on that?

So, ways you could operationalize it. One is the thing I just said: the point at which I expect so much more cognitive labor to have gone into the things I'm currently thinking about that it's not very useful for me to think about them now. Another is: if a substantial chunk of the risk comes before that point, it's a lot easier to think about — maybe we can try to address that chunk of risk. And the sort of thing I have in mind — kind of similar to some of the Redwood operationalizations for control — is the point where you could be getting a large multiplier on your lab's output by running models with scaffolding and chain of thought and a bunch of tools, making it easy for them to do lots of reasoning and just running them a lot, so you're getting a very large amount of cognitive work out of them. They're probably superhuman at a bunch of narrow things that go along with that, but there's no particular reason to expect them to be very superhuman at pulling off elaborate schemes without any of that visible reasoning or tool use. There's some gap between capabilities under maximum elicitation and capabilities under monitoring and constraints, where you're trying to make it harder for the model to do particular kinds of reasoning. Say you have a ten-point scale: doing this complicated scheme is level-six capabilities, doing a ton of alignment research for us is level-five capabilities, and the difference between what a model can do under maximum elicitation and under maximum constraint is two points on the scale. Then you have roughly a three-point buffer between when you expect a really large amount of the world's work to be getting transformed by huge amounts of AI cognitive labor, and when AIs can pull off elaborate schemes even while you're trying to make that hard for them.
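A toy rendering of that buffer arithmetic, using the illustrative numbers from the conversation (the scale itself is hypothetical, not a real measurement):

```python
SCHEME_LEVEL = 6      # capability needed to pull off an elaborate scheme, under full elicitation
RESEARCH_LEVEL = 5    # capability at which models are doing huge amounts of research for us
ELICITATION_GAP = 2   # max-elicitation capability minus capability under monitoring/constraints

# Under monitoring and constraints, a model needs SCHEME_LEVEL + ELICITATION_GAP of raw
# capability before elaborate schemes become feasible in deployment.
scheme_despite_constraints = SCHEME_LEVEL + ELICITATION_GAP   # = 8

buffer = scheme_despite_constraints - RESEARCH_LEVEL          # = 3
print(f"Buffer between 'huge useful cognitive work' and 'schemes despite constraints': {buffer} points")
```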
Right. So if a listener wanted to reinvent this methodology for themselves, it seems like they'd have to do something like: come up with some capability scale, so you can peg "able to scheme" and "able to scheme even when someone is editing my brain". Then imagine that a bunch of researchers are trying, but they're a little bit bumbling — there will be ambiguous things they have to muddle through. Then — I guess Buck Shlegeris talks about the trusted dumb models versus untrusted smart models dichotomy — on this capability scale there's some point below which you think you've done enough alignment work that you can recruit those models to your side, and you imagine how you'd use those capabilities to do alignment, what alignment tools you could build. And then you go up the scale: have we solved alignment for level-four models yet? If no, don't worry about objections that only come up at level nine. If somebody did that, would they have recreated the important parts of the methodology, or would they be missing things?

I'm not sure I quite understood — let me try to say it back. It sounds like you're saying: you have some scale of capabilities, and you shouldn't worry about a higher number if there are models at a lower number that could destroy the world and we don't have good solutions for them.

That, plus: imagine moving the threshold of where we have good solutions, and when you're working on solutions at the threshold, you can use the capabilities of models below the threshold. And also try to think in terms of people looking at numbers on a screen, scratching their heads, and going "did I just set the learning rate too low?" — a frame of mind of imagining doing ML science in a world that's a little bit messy, versus super-clean abstract schemes where we can know exactly this thing and exactly not that thing. I'm not sure that's a real distinction, but it's at least a distinction in style.

Yeah, I very much expect us to do things messily. I think there's both the idea that if all of these points on the scale produce risk, we might as well try to deal with the easier ones first — depending on exactly how you construct things, that may or may not make sense — and then the idea that maybe the best way to deal with the later thing is to first figure out how to make the earlier thing safe enough that we can use it to help us with the later thing. Which is similar to: at what point will the cognitive labor that has happened be so huge relative to what I'm doing that my thinking now isn't very useful? And there's a slightly more general version: at what point will the world be so transformed that it just seems really hard to think about, relative to the things that are a bit more in sight?
Right — so that one is less about the AI intelligence scale per se, and more about: at this level of intelligence we don't have corporations any more, or individual human brains have all been networked together, that kind of thing.

Yeah. Some amount of this work is kind of abstract math or ML, but a bunch of it — what exactly you do — depends on roughly who is doing what with these systems and why. If we just have totally no idea what's going on — it's Dyson spheres and uploads and whatever — then I expect to have less useful things to say about what people should be doing with their AIs in that situation.

Fair enough. So you co-lead model evaluations and threat research. We've talked a bit about evaluations, but there are also the threat models: what we're worried about AIs doing, and what we're trying to evaluate for. Can you talk about which threat models you're currently trying to evaluate for?

Right now we're thinking a lot about AI R&D — when models could really accelerate AI development. That's the kind of task I was talking about: do what a research engineer at a top lab could do in a day or a few days. Basically, at the point where you have models that could substantially accelerate AI research, things will just be getting crazy: whatever other risk you're concerned about, you might get there really quickly, and if you don't have good enough security, someone else could steal your model and then quickly get to whatever other things you're concerned about.

There's something a bit strange about that: from what you just said, it sounds like you're evaluating for a thing you're not directly worried about, but which might produce things you would then be worried about.

There are slightly different views on the team about this. I'm more in the camp that this is a type error — it's not really a threat model. There's a way to think about it where you have the threat models, and then you have a process for making sure you've got enough buffer: when are you going to hit this concerning level, when do you need to start implementing certain precautions. The AI R&D evaluation is part of that — part of what you use to decide how early you need to stop relative to when the concerning thing might become possible. But I realize that's a bit pedantic, and it's more natural to just say: models substantially accelerating AI research would be pretty scary.

I hope that listeners to this podcast are very interested in pedantry. This actually gets at something I wonder about evaluations, especially when we're imagining policy applications.
Usually the hope is that there's some capability we're worried about, we test for it, and we get some alarm before we actually have the capability — not decades before, but a reasonable amount of time before. That seems to require an understanding of how buffers work: which things happen before which other things, and how much time or training compute passes between them. How good a handle do you think we have on that, and how can we improve our understanding?

First, I do think it's valuable to have something that goes off when you actually have the capability — as quickly as possible after you have it — even if you don't get much advance warning, because there are still things you can do; it's not as if all the bad outcomes happen the instant the capability exists. And to the extent that the forecasting part is hard, I also worry about us not even having a good metric of the thing: once we've got it, not being clear about what it is or how we'd actually demonstrate it. So step one is having a clear red line at all. On the other hand, there are ways in which it's harder to get good measurements at really high capability levels, for all the reasons I mentioned earlier about what makes evaluations hard: if tasks are long and difficult and require domain expertise you don't have, they're hard to create. So there are also ways in which we naturally get buffer by creating the hardest tasks we can that still clearly come before the risk. You can think about it in two ways: one is that you first define the red line and then figure out how much buffer to adjust down from it; the other is that you create the eval that's the furthest out while still clearly coming before the thing you're worried about, and then over time you push it out to make it less overly conservative, as you narrow down the buffer and become able to construct harder tasks. I guess I didn't answer your actual question about how you establish buffer.

The question was how well we understand how big buffers are and how we can improve that understanding — but anything interesting you can say about buffers is welcome.

There are maybe two main notions of buffer: the scaling-laws kind, and the post-training elicitation, fiddling-around-with-the-model kind. One is: is it safe to start this giant scaling run — and we want to know before we start, or before the financial commitments are made around it. The other is: suppose we have a model; if people work on it for a while, do we suddenly realize it can do some new thing? We talked a bit about the elicitation side earlier. In general, something I'm really excited about is getting a smooth y-axis: what is the dangerous capability we're concerned about, roughly where do we put the line, and then we can watch points going up towards it and start making scaling laws.
You can put whatever you want on the x-axis: it could be time, training compute, effective compute, performance on some other easier-to-measure benchmark, elicitation effort, inference-time compute — all of these. But the bottleneck to having any of those is having a good y-axis. There are various properties you want it to have: obviously it needs to go up as far as the level you're concerned about, and the increments need to mean the same thing — the distribution of tasks, or the way the scoring works, needs to be such that there's some consistent relationship between moving by one point near the bottom and moving by one point near the top, so the line behaves in some predictable way. There are a few ways to get that: construct the distribution of tasks in some particular natural way that has this property; or take a more random mix of stuff and weight or alter it so that it's equivalent to the distribution you want; or build structured families of tasks, where some tasks are strictly subsets of others and stand in the same relationship to their subtasks, so that going from sub-subtask to subtask is equivalent to going from subtask to task.

It feels like you have to have some understanding of how tasks are going to relate to capabilities — which is tricky, because in a lot of evaluation work there's talk of whether the model has capability X or capability Y, but to my knowledge there's no really nice formalism for what a capability is. Are we eliciting it, or are we just getting the model to do something? Intuitively, though, it feels like there should be something there.

That's not what I'm most worried about. I think we can operationalize a reasonable definition of a capability — the model can do most tasks of this form, or activities drawn from this kind of distribution. That's not the problem. The problem is how properties of tasks relate to their difficulty: how can we look at a task and decide when models will be able to do it? It's really about how performance on some tasks relates to performance on other tasks.

Which does route through facts about models, especially if you're asking how things change as we scale.

Yeah, exactly.
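A toy rendering of the "smooth y-axis plus scaling law" idea described above: fit a trend of a danger-relevant eval score against some x-axis (here log effective compute) and extrapolate when it would cross a red line. The numbers are made up, and the exercise presumes the hard part — a y-axis whose increments mean the same thing everywhere — has already been solved.

```python
import numpy as np

log_compute = np.array([22.0, 23.0, 24.0, 25.0])   # hypothetical x-axis values (log10 FLOP)
eval_score  = np.array([ 8.0, 14.0, 21.0, 27.0])   # hypothetical danger-capability score
RED_LINE = 60.0                                     # score at which mitigations must already be in place

slope, intercept = np.polyfit(log_compute, eval_score, deg=1)   # simple linear trend
crossing = (RED_LINE - intercept) / slope

print(f"Trend: score ≈ {slope:.1f} * log10(compute) + {intercept:.1f}")
print(f"Projected red-line crossing at ~1e{crossing:.1f} effective FLOP")
```

Obviously a straight line through four synthetic points is the least interesting part; the conversation's point is that everything hinges on whether the y-axis is well-behaved enough for any such extrapolation to mean something.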
You can imagine a bunch of simple properties that might underlie task difficulty, or at least be part of it. Maybe it's basically horizon length, because you can build a bunch of simple models of that: there's some probability at each step that you fail, or that you introduce an error that adds length to the task. Some of these models have a notion of a critical error threshold or recovery threshold: do you make errors faster than you can recover from them, do you add in expectation more time recovering from errors at each step than you make progress towards the goal, that kind of thing. Or properties about the expense of collecting training data: how many RL episodes would you need to learn to do a task requiring this many steps versus that many, or how expensive is it to collect data from humans. The cost of the task for humans is something you might expect to scale sensibly — some things will be totally off-base on that measure, but over some reasonable distribution of tasks, how long it takes a human in minutes, or how much you'd have to pay a human in dollars. Then maybe there are corrections for particular things being more or less model-friendly: for models it's maybe less "what can one human do" and more that you get to choose which human expert from which domain does any particular part of the task, because models have much broader expertise compared to humans. What other factors can you look at? You can identify particular things models are bad at, and maybe in future we'll have a more empirically tested idea that X is what makes a task really hard for models, and looking at X will be predictive. The most characteristic abstracted capability we've seen that's difficult for models is inference about what a black box is doing, with or without experimentation — models are especially bad at the version with experimentation, where you have to explore, try different inputs, and figure out what function is being computed. So there are some particular characteristic classes of task that models seem bad at. This is a slight tangent from buffers, though.

Yeah — I take this answer to be saying there are ideas for how we can think about buffers, we've got to look into them and actually test them, and maybe interested listeners can start on some of that.

There's so much to try here and so much to do. The more understanding we have of how different evaluations predict other properties and other evaluations, the better. I'd really like to know: say we have some threat model or activity in mind and we create a bunch of tasks to test for it — how correlated is performance on those with performance on tasks designed to test something else? How much do tasks actually capture specific abilities, versus us mostly testing some basic "can it put one foot in front of the other" thing where it doesn't really matter how you tailor them?
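A rough sketch of that cross-task-family correlation check: given per-model scores on task sets built for different threat models, look at how correlated the columns are. The data, family names, and score values below are all made up.

```python
import numpy as np

families = ["ai_rnd", "autonomy", "cyber"]
# Rows = models (e.g. successive checkpoints), columns = mean score per task family.
scores = np.array([
    [0.12, 0.10, 0.05],
    [0.25, 0.18, 0.11],
    [0.40, 0.33, 0.19],
    [0.55, 0.38, 0.30],
])

corr = np.corrcoef(scores, rowvar=False)   # correlation between task families
for i, a in enumerate(families):
    for j, b in enumerate(families):
        if i < j:
            print(f"{a} vs {b}: r = {corr[i, j]:.2f}")
# If everything sits near r = 1.0, the families may mostly be measuring one underlying
# "general capability" rather than the specific abilities they were designed to target
# (and with this few models, high correlations are expected anyway).
```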
So, other ways of adding buffer. One is what I described: you construct some kind of y-axis, make scaling laws based on it, and then apply whatever correction you need on the x-axis. Another is that you modify the task, such that the difference between when the model can do the modified task and when it can do the actual task is your buffer. Or you modify the agent in some way — have a human help out in a certain way — so your buffer is the gap between when the model can do the task given some amount of hints, or given that the task is decomposed in a particular way, or given some absurd amount of inference compute. It's possible it's easier to construct evaluations in those terms.

So there are a bunch of potential approaches, and the question is how continuous things are along these potential y-axes.

Yeah — and the biggest bottleneck does feel like having a good y-axis to do any of these measurements with. If you find nice scaling laws on some random multiple-choice dataset, who knows how that corresponds to what you're actually concerned about. We already have scaling laws for things like web-text loss, but we don't know the relationship between that and risk.

Sure. This whole question about buffers was a bit of a tangent from talking about threat models — well, it was my tangent. So you were saying the current thing METR is mostly testing is AI research and development: can you get a machine-learned model to do machine learning research. On the website and in published material I see a lot of talk about ARA, which I guess is short for autonomous replication and adaptation. Should we think of that as basically the machine learning research thing, or is it a different thing in your mind?

It's a little subtle, and different people on the team conceptualize it in slightly different ways. How I think about the threat modeling is: we're asking what is the first, or the easiest, way that models could do really destructive stuff. There are things where, obviously, if it could do crazy nanotech then that would clearly be very destructive — but how could it be very destructive even before it's very superhuman in any particular way? What are the differences from humans that would make the world much scarier? Some of those are just about scale and speed, and being virtually instantiated rather than biological. One of the earlier reasons you might be concerned about models is that they could operate at very large scale, be very distributed, and not be easy to shut down. And more generally, it's unclear to me exactly how much various things in the world are okay only because it's not easy to recruit arbitrary numbers of smart experts cheaply to your random destructive cause.
If it were very easy for terrorist groups or rogue states to recruit very large numbers of even within-normal-range experts — or if that ability were proportional to compute, or to how well they were currently doing at cyber-attacking other people's compute — maybe that alone would be pretty destabilizing. So we're concerned about the point at which models can do substantial things by themselves, in a way where the damage they can do isn't bottlenecked on particular humans: if some relatively small human actor gets hold of the models, wants to stay uncaught, and goes and takes over all the compute it can and does a bunch of things — when would models plausibly be able to do that? We're focusing on this not because it's the most likely place most of the risk comes from, but because it's a kind of first thing, a prerequisite. Before models are able to cause a huge amount of destruction while they're under control, with limited resources, without the ability to accumulate things, communicate with all their copies, make notes to themselves and so on — sorry, let me restate: models won't be able to do a bunch of damage without all of those things before they can do a bunch of damage given that they have all of those things. So the earliest threat models — not necessarily all of them, but some class of them — are about autonomy, about being decoupled from the resources of the original human actor who started the whole thing: you have an agent that isn't constrained by being deployed in one specific way and able to be shut down; it's doing its own thing.

So the test for autonomous replication and adaptation is kind of asking: when do we get AIs that can act in this very unconstrained way? And we're going to get dangerous unconstrained AIs before we get dangerous really-constrained AIs, because it's easier to be dangerous when you're unconstrained.

Yeah — that's a much more succinct way of putting it.

It's much easier to be succinct when you're summarizing something someone else said.
So it seems like current METR work is driven by this sense of: when are we going to get AIs that have the potential to do autonomous replication, or to do ML research, such that after that they'll be able to do really scary stuff — sorry, you wanted to say something?

I just remembered another way of talking about this, which is: what is the point at which your evaluations can no longer provide you much safety? It's the point where models are either able to improve themselves, or able to spend much more resources trying things — iterating on scaffolding, eliciting their own capabilities — than you were able to spend in the evaluations. It then becomes hard to rule out that the model could, say, figure out how to create bioweapons, if it can do the same sort of work you would have been doing in the evaluations: try a bunch of ways to get it to produce good bioweapon ideas and then validate them somehow. Once the model could have billions of dollars and lots of copies of itself and just do a bunch of that, it's hard for us to say it can't do any of these other threat models.

Right, so it's kind of a combination of the ARA and the AI research threat models — somehow it's also testing the model's ability to elicit capabilities from itself.

Yeah, definitely — that's one of the things we're interested in.

So if that's the rationale for focusing on ARA and on machine learning research, do you think people should also spend time doing evaluations of fairly object-level things? One thing people talk about is: can this AI make a bioweapon, can this AI persuade people of false things, that sort of stuff. Is it just that you're not working on that because you're working on something else, but definitely someone should?

For a bunch of the domain-specific things, it's really helpful to have a lot of expertise in that domain — the CBRN stuff: chemical, biological, radiological, nuclear. We're not really doing much there, partly because other people have more of that expertise. A thing I am worried about is people doing that without doing proper elicitation — concluding the model can't do something based on, say, zero-shot prompting. We're definitely interested in the autonomous version of all of those things: okay, but if you give it a computer and it fiddles around for a while, can it in fact do the thing, if you plug it into all the right tools? On the persuasion stuff — as I was saying earlier, it just seems pretty hard to evaluate, and it seems relatively unlikely that the model could do a lot of really bad stuff before it can make plans, accumulate resources, and learn lessons across all its different copies. If you're imagining models that work roughly like models today, where each instance knows nothing about the other instances, coordinating some persuasion campaign — either it has to be strategic enough to do something reasonable without communicating with any of its copies and be insanely good at persuasion, or it has to be really quite good at persuasion and somehow doing very galaxy-brained acausal coordination between all of its copies. Or maybe we're in a world where people train models to use opaque representations we can't understand, making notes to themselves and messaging each other, and we can't tell whether they're coordinating to overthrow the world or not. So most of the worst scenarios require either very high levels of narrow capability, or the model actually being able to acquire things, build up its plan, and take actions in sequence — as opposed to having to do all of these things separately and simultaneously.
So there's a sense of wanting to prioritize evaluations for the scariest, soonest thing.

Yeah — and it gets at more threat models with one thing to look at, something like that. But I definitely think other people should be doing all of the other things; this is just why we in particular are doing this thing in particular.

I wonder if there's a personnel difficulty: to craft a really good evaluation of some scary task, you need people who understand that domain really well and people who understand machine learning really well. If the scary task is machine learning, great, those are the same people. If it's not, now you have to hire two different kinds of people and get them to work together, which is harder.

That's a good point — I hadn't thought of it that way. I had worried about both "we don't know enough about the specific things we're trying to get the models to do" and "people don't know enough about how to get models to do things", and the ML case is better in that regard.

I want to talk a bit about METR's policy work. One thing people may know you for is thinking about what are sometimes called responsible scaling policies, and I gather METR had something to do with Anthropic's responsible scaling policy.

The history of that is roughly: Paul and I, and then Yar and others — this was back in the ARC Evals days — had been thinking about this kind of thing since late 2022: what in particular should labs commit to, how do evals fit in, what could a lab say such that you'd think "that seems about as good as you can do" in terms of what makes sense to commit to. Originally we were thinking more of either some kind of safety-case thing that's relatively unspecified — you have to make some kind of safety case and there's a review process for it — or something more like a standard: you need to do these specific things at these specific times after you run these specific evals. Anthropic was also excited about making something like this happen — I don't know exactly when Paul was talking to them, and there was definitely some amount of independent development — but the specific responsible scaling policy framing, the idea that it's mostly centered on red lines and commitments about when you will not scale beyond a certain point, was mostly an idea that came from us. Anthropic was the first to put one out; we'd been talking to various other labs, encouraging them, trying to help them project-manage it and push it along. Now OpenAI and DeepMind also have things like this, various other labs are making interested noises, the UK government pushed for it at the summit, and a bunch of labs signed a statement at the Seoul Summit saying they would basically make these.

So that's your work on RSPs. Is thinking about things like responsible scaling policies most of your policy work, all of it, or do you do other things?
There's a related activity that's quite closely tied to the threat models and the evals, which is: what mitigations do we actually think are appropriate at different points? That ties in tightly to what we're worried about and what we're measuring for — what do you need to do, what would buy you sufficient safety. Then there's a bunch of generic advising: governments and labs and various other people have questions related to that, and we try to say sensible things about what we think is evidence of danger, what you need to do when, all of that. And then there's the specific project of trying to encourage labs to make these commitments, giving them feedback, trying to figure out what all the hurdles are and get things through those hurdles. That's the bulk of it — Chris is our main person doing policy.

In terms of what fraction of METR is doing this policy work — is it one person on a twenty-person team, half of it?

It's ten percent or something. It's mostly just Chris, and it's somewhat separate from the technical work.

So we're at this stage where RSPs as an idea have kind of taken off, and people are committing to them at the Seoul Summit. How good do you think existing RSPs are? And if you could implant one idea that would make people's RSPs better than they otherwise would be, what's the idea?

In terms of how good different RSPs are — I haven't actually read the DeepMind one closely; the one I'm most familiar with is Anthropic's, because I was more involved in that. I like the Anthropic ASL-levels framing, I think it was reasonable, and they made a pretty clear, hard commitment: under these conditions we will not scale further. I thought that was good. Other labs' RSPs had various good ideas and nice ways of framing things, but it was a bit less clear where the really hard commitment was — "if X and Y happen, we will not scale further". When I read OpenAI's security section, I came away with the impression that it was more saying "we will have good security" rather than "if our security does not meet this standard, we will not continue". So: be clear about what your red line is. When is it actually not okay — not just what should happen, but what are the clear conditions? That's probably the thing I'd most want to push people on.

Fair enough. Going back to METR's role in RSPs, or the policy work generally: the policy work and the science-of-evaluations work — to what extent are they two separate things budding from the same underlying motivation, versus strongly informing each other? Like, "we want these policies, so we're going to work on these evaluations", or "what we've learned about the science of evaluations strongly informs the details of some policy"?
There's pretty strong flow, particularly in the direction from threat modeling and evaluations to policy: what do we recommend, what policies do we want. There's a bunch of work that's a different type of work, which is more like project-managing a thing from the outside — dealing with stakeholders, getting feedback, addressing everyone's concerns — so the day-to-day work is different. But in terms of what we were pushing for, what the most important things are, how good it would be to get different types of commitments, what should be in there, what the actual process is that the lab is going to run — that comes from the threat modeling and evals side. Some of the flow in the other direction is feedback on how costly different possibilities are, how big an ask different things are, what the biggest concerns about the current proposals are, what's most costly about them. Something we ran into: earlier on we thought we could make something more prescriptive — "in this situation you need these mitigations" — and then it became clear that different labs are in sufficiently different situations, different things are costly for them, and a lot depends on uncertain future things like what exact capability profile you end up with. So we either had to write something with a lot of complicated conditionals — you must do at least one thing from category A unless you have B and C, except in the case of D — or something very overly restrictive, obviously more conservative than it needs to be. It's hard to write something that's both relatively simple and not way over-conservative while covering all the cases. That's why we ended up going to the model where each lab says what its red lines are, and then there's some general pressure and ratchet to make those better over time.

What kinds of differences between labs cause that sort of situation?

It's going to be one of those things where I can't think of an example on the spot — but security is maybe actually a good one. Within Google there are a bunch of things where they can say: here's how we already do this for Gmail-protected data, we have this team with these air-gapped machines, and so on — so it's relatively straightforward for them to meet particular levels of security, whereas for OpenAI that's much harder because they don't already have a team that does that. It's probably similar with some of the monitoring, or the processes for deploying things: larger companies have more existing setups, more capacity, and teams that already do this kind of thing.
Some of the monitoring and oversight work might be somewhat similar to content-moderation things they already have teams for, whereas maybe Anthropic or OpenAI would find it easier to say "we did this pretty involved science of exploring these capabilities and pinning them down more strongly", versus having a very comprehensive monitoring and security process.

That makes sense. I'd like to move on to METR's relationship with labs, which I think is not obvious. Semi-recently, as of when I'm recording this, Zach Stein-Perlman wrote a post on LessWrong saying, hey, METR doesn't do third-party evaluations of these organizations — and they didn't really claim to; it's unclear why people believed they did. But from what you're saying, it sounds like there is some relationship: you're not living in totally separate bubbles communicating via press releases that the whole public sees. So what's going on there?

I am somewhat confused about why there's been quite so much confusion here. What we mostly think of ourselves as doing is trying to be experts and have good opinions about what should happen, and then, when we get opportunities, trying to advise people, nudge people, suggest things — or prototype things: here's how a third-party evaluation could work, we'll do a practice run with you. Maybe in the future we'd be part of some more official program that actually had power over deployment decisions, but that's not something we've done already. Maybe that's partly where the confusion comes from: we would consider various things in the future, maybe we'll end up being some more official auditor, maybe not, and currently we're happy to help prototype things. We've tried to be pretty clear about this. In the TIME article we did a while ago we were very clear that what we've done are research collaborations, not audits, and in the GPT-4 system card we say in a bunch of places that we did not have the access a proper evaluation would need. We were just trying to say some things — and in fact the model was limited enough that we could say some reasonable things about it, though we still had to make various assumptions in order to say anything at all. There's a question about what direction we go in the future, and whether other organizations will come up that do more of the evaluation implementation, the auditing, the going-down-the-checklist work. The central thing we do — the core competency we want to exist outside of labs — is being able to recommend what evaluations to do, and to say whether some evaluation was good or not. At a minimum, there should be someone who can look at something that was done and say whether it in fact provides good safety, and what you would need.
We can say something you could do that would provide safety, and then there's a question of who implements and checks it: does the lab implement it and get checked by someone else, does the government implement it, is there some new third-party body that emerges, is there a for-profit that does it with oversight from the government making sure the for-profit isn't getting sloppy. Different industries have different configurations of the people who do the science of how to measure, the people who do the actual crash testing, and the people who check that you followed all the proper procedures. So it is kind of inherently confusing: there are a bunch of ways this could work, and we want to help all the possible ways along without necessarily saying that we're doing any of them.

I guess there's a distinction between METR trying to be an organization that can tell people how third-party evaluations should work, versus METR actually doing third-party evaluations — with some grey zone of actually doing some evaluations on GPT-4.

Even with GPT-4, we should not be considered to have been doing that. We're not saying "hey everyone, it's all fine, we'll tell you if the labs are doing anything bad, we're making sure they're doing everything correctly." We have no power to tell them what to do, and no power to disclose anything: inasmuch as we've worked with them on evaluations, it's been under NDA. We didn't have an official reporting role; we can ask for permission afterwards to publish research related to the work we did, but we've never been in a position to tell labs what to do, or to tell people how labs are doing — apart from based on public information.

Sure. Getting at that relationship: you had early access to some version of GPT-4 to do evaluations. How did that happen?

They reached out to us as part of the general red-teaming program, and similarly with Anthropic we had early access to Claude. Beyond that, various labs have given us — and I can't necessarily talk about everything we've done — a spectrum of research support: credits so we can do more eval research, fine-tuning access, working with an internal researcher, or early access to a model not necessarily to do a pre-deployment evaluation but just to help us along. Quite different things going on.

Do you know how much of the dynamic — or can you say how much of the dynamic — is labs saying "we've got a bunch of spare capacity, let's help out this aspect of the science", versus "hey, we would like to figure out what this model can do, or how worried people should be about this product"?
It has been somewhat more in the direction of labs saying "hey, can you run an evaluation for us?", and us saying: well, the main thing we're working on is figuring out what evaluations should be and what evaluations to run, so I guess we can do a bit of that as practice, but it's not like this is a full, polished service where we put you in touch with our department of doing a billion evaluations. It's more research than us providing a service.

Sure. Related to this: news that came out in the past month, as we're recording, is that it seems most former OpenAI employees who left signed documents promising not to disparage the company, and promising not to reveal the existence of the agreement not to disparage the company. You previously worked at OpenAI, and you've written something on the internet about this situation, but it was a comment on something else, so people might not have seen it. Can you lay out your situation regarding your ability to say disparaging things about OpenAI, both now and when you started METR?

When I left OpenAI I signed the general release, which includes the non-disparagement provision and the provision not to disclose anything about it. My understanding at the time was that this did not preclude saying things that were obviously true and reasonable — that if, say, OpenAI's model has this problem, saying so is not an issue. That was based on — well, this is not how you're supposed to get legal advice, but — asking the OpenAI person who was handling this with me what it meant, and a general sense that it would be completely absurd if you couldn't say completely true and reasonable things that might have a negative bearing on OpenAI, and that I couldn't imagine a circumstance in which it would be in their interest to sue someone over that, because it would look so bad for them. So I hadn't actually given it much thought. The other thing to point out is that it's a fairly narrow circumstance in which this would be the specific thing preventing me from saying something, because most of what I would say about OpenAI that's based on private information — from when I worked there, or from any work METR did with them — would be covered by non-disclosure anyway. The only case where the non-disparagement provision in particular comes into play is public information that I want to comment on. I think I had also kind of forgotten that the clause existed. And something we've tried to do in our communications for a reasonably long time is to be clear that we're not an advocacy group — it's not our role to opine on how much we like different labs in some generic way. We're here to say relatively boring technical things: maybe things about specific evaluations we did or specific results. To use the animal welfare analogy, we're trying to position ourselves more like the people who go and measure the air
rather than the people who are writing angry letters or doing undercover recordings or something. Oh yeah, and I think I mentioned in that comment that I sold my OpenAI equity. I can't remember if it was quite as soon as I was able to, because I was hoping I might be able to donate it if I waited a bit longer, but I sold it all last year at the latest. I think I had either not read or had forgotten that there was any provision that they could cancel or claw back your equity; as far as I remember, that never occurred to me as a particular concern until all this stuff came up.

Okay. And so you sold equity last year. When, just for the timeline of METR... I guess ARC Evals is what it initially was?

Yeah, so I left OpenAI in summer 2022 and went to ARC and was doing evaluations work, which was sort of talking with Paul and thinking about stuff, doing some of this work, and talking about what a good lab should do, that sort of thing. That gradually became a real organization (ARC Theory is a separate thing), and then it became METR. At some point, I can't remember exactly when, I realized that although I couldn't sell equity, I could make a legally binding pledge to donate it, so I pledged to donate 80% through Founders Pledge. I can look up the date if it's relevant.

Sure. How does one legally...?

Founders Pledge is legally binding, which is nice.

Okay. And the reason it's legally binding is that you're promising Founders Pledge that you're going to give them a certain thing at a certain time?

I actually don't know exactly how this works; I just remember that I was told this by the Founders Pledge people. Yeah, so then OpenAI only has liquidity events every so often, but I sold all my equity when I was able to.

Understood. So I guess that gets to the issue specifically about equity and non-disparagement agreements, and if people are interested in that, I highly encourage them to read Kelsey Piper's reporting in Vox in particular, as well as OpenAI's statements on the matter. And more broadly, there's this question of: as a third-party person, I would kind of like there to be an organization that is able to fairly objectively say, "hey, here are some tests, and this organization is failing these tests", to either explicitly say "this organization is making an AI that's dangerous", or to say "hey, this organization has made an AI that fails the do-not-murder eval" or something. And in order for me to trust such an organization, I would want it to be fairly independent of large labs. Large labs have some ability to throw their weight around, either via non-disparagement agreements or via "we will or won't cooperate with you, we will or won't let you have access to our new shiny stuff". I'm wondering, from the outside, how worried should people be about how independent organizations like yours can be of big labs?

Yeah, I mean, we have very little ability to provide this kind of oversight, largely because there is no mandate for us to have any kind of access, and if we do have access it will most likely be under NDA, so we can't say anything about what we learned unless the lab wants us to.
So until such point as external evaluators are mandated access, either voluntarily, or by some industry standard, or by government regulation, and that access meets various conditions such that you can actually do a meaningful evaluation and you're permitted to disclose something about the results, no one should consider us to be doing this kind of thing, and I'm not aware of any organization that does have this ability. I think the executive order on AI does require labs to disclose to NIST the results of evaluations that they do run, but I think it's unclear whether it can actually mandate them to do any evaluations, and there's also not necessarily any oversight of exactly how they do them. I actually don't know what the legal situation would be if you did something very misleading, but at the least you can imagine a lab doing something very half-hearted or sloppy, providing that to the government, and having technically met all the requirements. So I think this kind of external oversight is very much a thing that is needed, and I'd feel bad if we gave the impression that it already existed and wasn't needed; that was definitely not what we were trying to do. Or rather: it would be great to have this, and we would be happy to be a part of it if we could be helpful, but we were not trying to convey that this was already solved.

Sure. So, moving back to science-of-evaluations things: your organization is doing some work, and there's some work that exists in the field. First I'm wondering, of things other than what METR does, what's your favourite work in terms of evaluations?

Yeah, I mean, like I said, I think all of the domain-specific stuff is important. And it does seem quite plausible to me (I'm not quite sure about this, maybe I just don't know enough about it) that you'll get something in the direction of a more specialized model, something like AlphaFold, that actually unlocks something, such that someone can use it to make a much worse bioweapon than they would have been able to otherwise, and we're just totally not covering that. So I'm very much hoping that people are keeping an eye on that. I like Redwood's thinking about control evaluations; I think they're thinking carefully about exactly what you can know under what circumstances, and credit to them for some of the things I was saying about sandbagging and various other things as well. And I think some of the model organisms and demonstrations that various people have been doing are interesting too. Scary demos, or otherwise just creating examples of things, is something that we're not doing much of, and it would be great for other people to do. I mean, people should do the things that we're doing as well, but given our limited capacity, and to disclaim what we are not covering, namely having great scary demos, hopefully other people will cover that.

Sure. And in terms of frontier research, or research that people could start doing: things you've mentioned so far are scary demos, understanding buffers, and evaluations of domains other than machine learning research. Are there any other things you would encourage people to do research into along the evaluations lines?
Yeah. So we have this bounty up for tasks that meet various criteria, including not being memorized, and ideally having a continuous score, and various other things. Just getting more of these, so we can build out a better y-axis, would be great.

If people are interested in that, do they just Google "METR bounty"?

Yeah, that should work, or our website is metr.org.

Sure. So that's one thing, responding to the task solicitation.

Yeah. I guess the buffer covers a very wide range of things. So there's developing the y-axis on which you can measure it, and questions like: how do different ways of modifying tasks, or different ways of assisting the agent, compare in size to scale-ups or to elicitation effort? There are many different things that you can plot against each other and try to calibrate: how big is this compared to that? And then other things are distilling, as I was saying before, this whole chain from "what is it we're worried about in the world", to operationalizing that in terms of specific tasks, to then trying to get from these very expensive and unwieldy tasks down to some easy-to-run tasks, and maybe down to just datasets or whatever. I think there are a lot of different things to be done there. In particular, something that we tried to do for a while, but ended up moving on from because it was hard, was creating a classification dataset that captures the hard parts of doing a particular task: finding something where you can make some kind of case for upper-bounding, on the basis that even with a reasonable amount of data and fine-tuning on this classification task, the model can't manage to identify whether it's currently doing something sensible towards its goal or not. That is something that gives you a stronger guarantee about the model not being able to do that kind of thing. And maybe that fits into the sandbagging stuff as well, creating model organisms of sandbagging. We would also love to be doing more stuff on alignment and control evals, and thinking about: once you're no longer at the point where you can say "the model doesn't even have these capabilities", then what do you do? How are you testing that? What would labs need to be doing? I think there's a bunch more fleshing out of threat models that would be great. In particular, maybe some of the persuasion ones could do with more specification: what exactly is it that happens, and what are all the other capabilities that are required? And exploring things like: what kind of coordination do the models need to do within these different threat models? Which ones can we rule out based on the model's ability to coordinate and take considered action between different instances? And which ones, if we saw that changing, with people creating models that are communicating with uninterpretable state, how would that change all our thinking about threat models, and then what evals would we want to have? I don't know, there's a lot of stuff.
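To make the classification-dataset idea above more concrete, here is a minimal sketch in Python of how such an upper-bound check might be scored. Everything in it is illustrative rather than METR's actual tooling: the example fields, the `evaluate_classifier` helper, and the 0.7 accuracy bar are all assumptions made for the sake of the sketch.

```python
# Illustrative sketch of the "classification dataset as an upper bound" idea.
# Snapshots from a hard agentic task are paired with candidate next actions,
# and the question is whether a model (even after fine-tuning on this data)
# can tell which actions are sensible progress towards the goal. If it cannot
# beat a modest accuracy bar, that supports an argument that it also cannot
# do the harder thing of generating such actions in sequence.

import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ClassificationExample:
    task_goal: str         # what the agent is ultimately trying to achieve
    state_snapshot: str    # transcript / environment state at some step
    candidate_action: str  # a possible next action
    is_sensible: bool      # gold label: does this action make progress?


def evaluate_classifier(
    classify: Callable[[ClassificationExample], bool],
    dataset: List[ClassificationExample],
    accuracy_bar: float = 0.7,  # illustrative threshold, not a real standard
) -> dict:
    """Score a (possibly fine-tuned) classifier on the distilled dataset.

    Accuracy below `accuracy_bar` is treated as evidence for the upper-bound
    argument; accuracy above it means this cheap check can no longer rule the
    capability out, and the expensive end-to-end task should be run instead.
    """
    correct = sum(classify(ex) == ex.is_sensible for ex in dataset)
    accuracy = correct / len(dataset)
    return {
        "accuracy": accuracy,
        "upper_bound_argument_holds": accuracy < accuracy_bar,
    }


if __name__ == "__main__":
    # Toy stand-ins: in practice the examples come from real task transcripts
    # and `classify` wraps a fine-tuned model's judgement.
    dataset = [
        ClassificationExample("set up a web server", "shell open, repo cloned",
                              "read the project README", True),
        ClassificationExample("set up a web server", "shell open, repo cloned",
                              "delete the repo and start over repeatedly", False),
    ] * 50
    random_guesser = lambda ex: random.random() < 0.5
    print(evaluate_classifier(random_guesser, dataset))
```

The sketch only aims to show the shape of the argument: a cheap, repeatable measurement whose failure supports a claim about what the model cannot yet do on the full, expensive task.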
Yeah, that's a lot of things for listeners to go off of. So those are things they could do. If listeners are interested in your research, or your work more generally, how should they follow it?

Yeah, so we are hiring, particularly for ML research scientists and research engineers. We're very bottlenecked on this at the moment, particularly for implementing these ML research tasks: you need ML researchers to actually know what the tasks are, and to construct them and test them. And more generally, for basically all the things I've been talking about, the science of evals and building the y-axis and distilling hard tasks into easier-to-measure things and the buffer and all of that, there's just a lot of science to be done: build the tasks, run the experiments, and de-risk whether this whole paradigm actually makes sense. Can we make predictions about the next generation of models and what tasks we'll see fall in what order? Do we actually understand what we're doing here, or is it going to turn out that these tasks just fall for a bunch of dumb reasons, or that they're not really capturing something that's actually threat-model-relevant? Anyway, yes, sorry: hiring. Yes, we are hiring, specifically ML researchers and research engineers, but generally we're still a small team and excited to fit outstanding people into whatever role makes sense. We're still at the stage where, for key people, we would build the role around them, and I think there are a bunch of different things we could use there.

Sure.

Yeah, and then the task bounty, and also baselining: we just need people to create tasks and test them. If you want to have your performance compared against future models, to have your performance on this engineering task be the last bastion of human superiority over models or whatever, then get in touch, I guess.

All right. And what places do you publish work or write about things that people should check?

Oh jeez, we really should write more things up. The last thing we put out was this autonomy evals guide, a collection of different resources: there's a repo with our task suite, another repo for our task standard, various bits of writing on how we think people should run evaluations overall and on elicitation, and some results on elicitation scaling laws and so on. That's linked from the website.

Sure, great. Well, thanks for coming here and chatting with me.

Awesome, thanks very much for having me.

This episode is edited by Jack Garrett, and Amber Dawn Ace helped with transcription. The opening and closing themes are also by Jack Garrett. Filming occurred at FAR Labs. Financial support for this episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. To read a transcript of this episode, or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.