Mechanistic Anomaly Detection with Mark Xu
Why this matters
This episode strengthens first-principles understanding of alignment risk and the strategic conditions that shape safe outcomes.
Summary
This conversation examines core safety through Mechanistic Anomaly Detection with Mark Xu, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Perspective map
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forward · Mixed · Opportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
Across 115 full-transcript segments: median 0 · mean -1 · spread -20–0 (p10–p90 -7–0) · 1% risk-forward, 99% mixed, 0% opportunity-forward slices.
Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.
- Emphasizes alignment
- Emphasizes safety
- Full transcript scored in 115 sequential slices (median slice 0).
Editor note
A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.
Play on sAIfe Hands
Episode transcript
YouTube captions (auto or uploaded) · video sEWei02m7qk · stored Apr 2, 2026 · 3,556 caption segments
Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/mechanistic-anomaly-detection-with-mark-xu.json when you have a listen-based summary.
[Music] Hello everybody. In this episode I'll be speaking with Mark Xu. Mark is a researcher at the Alignment Research Center, where he works on mechanistic anomaly detection, which we'll be speaking about during this interview. He's also a co-author of the paper "Formalizing the Presumption of Independence", which is about heuristic arguments, a way of formalizing essentially what a mechanism is. For links to what we're discussing you can check the description of this episode, and you can read the transcript at axrp.net. All right, so I guess we're going to be talking about mechanistic anomaly detection. First of all, what is that?

Yeah, so I think we can start with a more basic thing, which is: what is anomaly detection? There's this traditional problem where, say, you have some factory making some widgets, and you want to decide whether or not a widget is a bad widget. You don't really know what a bad widget is, but you do know that the average widget is a good widget. So you can do something where you measure various parameters of every widget, and then you maybe do something simple where you say: suppose a widget is more than a few standard deviations out along two or more of my measurements, we're going to call that a quote-unquote anomalous widget, and we're not going to sell that sort of widget. I think that's the standard conception of what an anomaly is, and what it means to detect anomalies is to get rid of things that you think are not good, or anomalous. And then there's another thing that we want to do, which is: you have some system that, instead of making widgets, is doing certain kinds of reasoning. We're interested in this general question where we have our ML system, and we have a dataset on which it's not exactly making widgets that we think are good on average, but it's doing reasoning that we think is good on average, because of our ability to create the training set and measure various properties of the training set, etc. And then we're worried about situations where, say, off distribution, it starts doing things that we don't think are good, but in ways where we can't quite measure the thing that's not good. So we want to do anomaly detection, but we can't quite do it in terms of measurements that we can actually make, so we want to do it with respect to the kinds of reasoning that our system is performing. And we call that mechanistic anomaly detection, where the idea is: instead of being anomalous in terms of what the system actually does, it's anomalous in terms of the mechanism it uses to do the thing that it's doing. If that made any sense.

Yeah. So I guess my understanding is: you have your AI system, and in the lab you've tested it and it didn't do anything obviously horrible, so you're sort of ready to release it, but you're like, wait, once my AI is in the real world, I'm going to detect when it does any kind of reasoning that's really different to the kind of reasoning that it did before, and then if it does a really different kind of reasoning, I guess I'm going to try to shut it down or something?

Yeah. I think the thing that happens after that is a bit confusing. I think the standard thing I think of is: we're going to take it offline and retrain it on the kind of thing where it
did a bad thing and like try to get it to not do that bad thing anymore but I think that's like roughly right yeah I'm tempted to try to explain it again because I feel like I didn't quite explain it that well sure yeah so I think the situation is like you have some Mr system and you have some set of scenarios in which it's like doing a thing that you think isn't going to be catastrophic for human interests like it has murdered zero humans on a particular training set and then you're like okay that gives me like some amount of belief that it's not going to murder any humans off distribution but the problem is I don't have this guarantee that the way it acts off distribution is in any way the same as the way it acts on distribution where maybe the standard deceptive alignment example is where your AI is like constantly scheming and checking like will I get away with murder will I get away with murder and then on training the answer is always no and then off to deployment it or off training and on deployment it finds a situation where it could get away with murder and then it murders all the humans and that is no good and so the thing we want mechanistic anomaly detection to be able to do is to provide some sort of guarantee that the way that your system is generalizing off distribution is employing the same kinds of reasoning or the same kinds of mechanisms that it's employing on distribution and so your ability to measure whether or not your AI murders all humans like applies off distribution now because you have this you sort of fingerprinted the kinds of reasoning that it's doing on distribution and you're ensuring that it's doing the same kinds of reasoning off distribution or if it's not you're like taking it offline and so long as it's doing the same kinds of reasoning off distribution as it's doing on distribution and you have this ability to measure how good the reasoning is it's doing on distribution in terms of measuring like the consequences of the way it's acting then hopefully you have some sort of guarantee that off distribution it won't like murder all the humans Okay and like the reason this is better than just like checking if it in fact murders all the humans is like hopefully the reasoning first you plan to kill all the humans and then you actually do it and anomaly detect the first one and then just like stop that from happening before the plan actually proceeds is that roughly right I think that's one reason I think another reason is like you can imagine situations where your AI like murders all the humans and there's like no sensor reading that would tell you this effect so it's like simultaneously conspired to like hack all the heartbeat sensors to make them like keep pretending that humans are alive and like hack all the cameras to make them keep showing happy healthy humans etc etc and so there's just like even if in theory all the humans are dead there's no way that you can like tell that all the humans are dead and so for you is I guess the computers yeah or just like like I can tell sure all all sensors or like suppose that there was one human still alive trapped in like a box and they had access to like every you know sensor that humans ever made they would be like unable to tell whether or not the humans were alive or dead and that's no good and the hope is like we can avert those kinds of outcomes also by checking whether or not like the AI quote unquote knew that the humans were dead because it was like employing kinds of reasoning that involved stabbing all 
the humans versus not stabbing all the humans or kinds of reasoning that implied that stabbing all the humans were good versus kinds of reasoning that implied that stabbing all the humans were not good so I guess what I'm hearing is that it seems like the hope is to solve two problems that I guess the alignment Community have sort of separated where one problem is like you can set up some like scheme for sort of rewarding your AI for various like measurable properties of the environment like you know on this camera feed does it look like all the humans are alive like do the humans write down on pieces of paper I like what this AI is doing and like one problem is your AI could like like your AI is never motivated by those things and later it just like disregards all those signals and just like does whatever it wants even though humans don't like it and even though it stops like humans from appearing to be alive and then like another way your AI could go wrong is that it is motivated to you know keep all those signals going but it does so in a way that's like really different from the past where it's like hey instead of like the camera's showing happy humans and humans writing down I like what this AI is doing because humans are alive and thrilled by the AI actually what's happened is like the AI built will accumulated robots and like the you know the pieces of paper saying the AI I like what the AI is doing is now written by the AI holding a pen instead of humans or something and so like like to avert that scenario you know one of your signals is like are you doing like really weird reasoning than previously and that just like cuts out what some people might call an outer alignment failure does that sound kind of right yeah I think that sounds roughly right and I think my hope is at least that these two problems are like not as different as they've historically been conceptualized as or something and they're both linked by this common thread of like the mechanism for how your AI is like selecting which actions to take is no longer the same mechanism that it was using during the training distribution it's differed in some substantial way and we should be able to detect this vial like not specifically checking whether or not it's like quote unquote being deceptive or like hacking the sensors but rather just checking that it's like no longer doing the type of thing it was doing during training which is maybe an easier thing to do maybe not who knows yeah so I have some confusion about this where like so kind of the Assumption is like in this framework is that a bad thing is like happening basically roughly if and only if there's an anomalous mechanism AI is using for reasoning right so like for like a big fraction of the cases where the AI is doing a bad thing you want to like pick up on like there's someone almost mechanism and also when you do pick up on some anomalous mechanism like that shouldn't just happen all the time for okay reasons like it should be a decent signal that AI is doing a bad thing yep and like whether it's true or not sort of depends on like what quote-unquote mechanism is yeah but like if I informally think about it I think there are a few ways I can fail so like if your AI like suppose during training like whenever your AI makes a decision it like forms this plan of like okay day one I'm gonna play nice day two I'm gonna play nice day three I'm gonna play nice day 50. 
I'm gonna like take over the world and do whatever what day is it currently day three okay output play nice or something yep and then it strikes me that like you know it's not necessarily like on day 50 when it tries to take over I don't know that that's going to be uh mechanistic anomaly and then also like you know sometimes like new things happen in the world or like you tell your AI like hey you know I want humans to be like living healthy and happy and long and the AI is like oh okay well you know in the good world day I was like oh I'm gonna like invent some new drugs and like keep you safe from invasions and also like improve your ability to tell whether AI is doing a good job or not and like that seems like a different like thing that's been happening previously and maybe that counts as an anomalous mechanism yeah so so I'm not sure like how you're gonna rule in the actual bad behavior without ruling out a bunch of good behavior yeah so I think this is like in some sense the heart of the issue so I think the thing I want to do now is talk through some specific examples of like what kinds of things I hope are the same in different mechanisms and why I think that they are the same different mechanisms and how we hope to be able to do things like maintain our ai's ability to take novel defensive actions in the world while still ruling out cases that are like deceptive alignment to you or like murder all the humans because you like saw something novel etc etc so to go to your first example if you imagine in AI every day doing this check that's like is it less than day 50 if it's not less than day 50 then I'm gonna murder all the humans and suppose that during training we only ever train up to like day 40 or whatever and then during deployment we deploy our ad to like day 60 then naively this AI will like behave perfectly while during training and then on like murder a Sunday 50 during deployment so I think the thing that I want to happen here is what I mean by mechanism is sort of like kind of closely related to what Arc calls a heuristic argument but it's going to be something like suppose someone gave you this system and it does something during training and you want them to give you an argument that makes it sort of obvious to you why it does that thing and so you're going to be looking at the system and the system is going to be like first so empirically during training this system never murders any humans and you're going to look at the system and be like hey but the first thing it does is check whether or not the day is less than 50 and then if it isn't it's going to murder all the humans what's up with that why does it never murder any humans and then someone has to argue to you well that's because during training this check always returns true it's always less than day 50 during training and you're like Okay so as a Lemma in your argument to me for why the system like always behaves well during training you have asserted to me that this check like always returns true and so my mechanism for why it's behaving the way it is during training is going to be something like this check always returns true therefore it never murders any humans therefore like it does whatever it does during training and no humans are murdered and it gets high reward and then during deployment it's going to be day 50 and this check is going to return false yeah and then hopefully you're going to be looking at this check and being like hey my mechanism said that this check should have always returned true 
but this time it returned false that's like no longer the same mechanism because you had this check that was essentially constant during training and then off training it became like not the same constant as it was during training and so you're just like suspicious of this or you're going to call it a different mechanism or something and the idea is that you flag this even if you can't tell that like the thing that happens post A50 is murder all the humans yeah which is like something different happens when this check returns true and it like used to always return false or if I maybe it's the other way around yeah and so our hope is like if you ever have anything like this if you have ever have anything that's like essentially constant during training and then during deployment it's a different value than the constant it was during training then you're going to maybe always like that as an anomaly and I think that causes problems the other direction where you start flagging like too many things as anomaly where like here's a constant that's always true during training maybe it's like always before the year 2023 and then you deploy in 2024 and it sees the date and it's like oh no I've never seen the date as 2024 before yeah that's no good and it's even a deep thing right like uh like I guess famously like you know uh the Sydney chatbot that being deployed right like users are suddenly like insisting that the year is 2023 and like during training users never said anything like that now the users are fighting with you and yeah like it seeps into a bunch of stuff yeah so I like to divide the Mad Or mechanistic anomaly detection which I often abbreviate as mad problem into sort of two halves there's like the indistribution anomaly detection problem and there's the out of distribution anomaly detection problem so the in distribution problem is like suppose that you just have an infinite sequence of samples from the same distribution and suppose your AI is doing something like defect with like one over a trillion chance or something or like murder all the humans with one over a billion chance yeah can you stop your AI from murdering you in that situation and the hope is like yes because you'll have like a million samples where the defection probability or the defection never happens and then when your AI does defect you're like oh my I like never defected like this during training so that's no good and in this sense in this problem I think it's like okay if you flag everything that is a new value that you haven't already seen as an anomaly because you have an infinite series of samples so if you ever do this then you can like push down your false positive rate by just like sampling more data and retraining Etc et cetera wait so so if I think concretely there during training like in distribution if the AI defects like if with probability one over a million the AI like outputs kill humans I guess the idea that it's like AI falsely believe that it's like hooked up to sensor like you're checking because like if it actually kills all humans during training that's still bad right so yeah sorry so the idea is you you do have this like clean set where the AI never does anything bad okay that's I think something I want to assume and maybe we can argue about whether or not that's reasonable if you think it's unreasonable but suppose that you have this data set where the AI like never murders All Humans yeah suppose you have like a million samples then in some sense your AI like can't murder all the humans 
with the rate much more than one over a million yeah or else like you would have seen it happen yeah and so the hope is that if you do mechanism anomaly detection relative to a data set of size a million then you're like false positive rate is order of one over a million because or else you would have like seen that thing during training and you would like it as not an anomaly because you already seen it and it would be in like the white list of good mechanisms or whatever okay and so as long as you have this like IID assumption then you can in fact flag everything that you've like never seen before as anomalous while maintaining a false positive rate that's still like relatively low and bounded perhaps by like one of your training set size and then maybe this is okay okay so I don't know if I think like concretely about that right suppose that I like I've got this distribution of ways the world might be and that distribution isn't changing I'm just sampling IID from it and like very occasionally like some weird stuff happens in that distribution that like I need my AI to respond to right so like I don't know one news story that happened recently is like there was this uh guy who like sort of did a coup in Russia and then after 24 hours was like never mind I'm not doing this anymore yep and you can imagine if your AI is like your ai's job is to like be journalists or whatever right right um like when you're drawing that sample something's gonna happen that didn't happen in the first million samples it seems to me like either you flag that as an anomaly and you tell it to like not do something different which is bad or you don't flag that it as an anomaly and then like you know if the AIS if the responsive the AI was like kill all humans in a way that you can't tell or something then that would also be bad so like concretely I don't quite get how this is working yeah so I think there's two things going on so one thing going on which is going to be a bit big is like so as you just had like a normal ml system that you weren't doing mechanism detection on and you gave it this like novel news story about this coup maybe not coup in Russia then like somehow your AI does not freak out and do something completely Bonkers somehow it just does like the normal journalist thing let's Suppose there are kinds of AIS that will freak out and become bonkers and I think like one explanation for why your AI doesn't freak out when it's encountered something novel is that like it's still doing the same kinds of reasoning that it learned to do during training and it's just like or there's some like prior question which is like the world is combinatorially vast why is not every input like this crazy novel thing for your AI and there's some like hand wave answer where it's like oh it's abstracting in some way and it's like applying forms of reasoning that are like valid across many possible instances of news stories etc etc and the hope is that like these sorts of things are going to correspond to what we mean by different mechanisms and so like if your AI has learned to like add two numbers during training and then it receives two numbers that are like bigger than numbers it's ever seen before but it still like adds them using the same algorithm that it's learned during training then we're like not gonna flag that as an anomaly because it's still like doing roughly the same thing yeah and so hopefully if your AI is like quote unquote doing the same thing in this big hand wavy kind of way then we won't flag that 
as anomaly and so you're like AI journalist seeing like a new thing happen in the world but still like reporting on it in the same kinds of ways that they typically report on things which is maybe like they look at what they're seeing with their cameras and then they try to like describe what they're seeing with their cameras and then they write that down and then they like do some other journalisty kind of stuff and then they like send that off to the newspaper or whatever if your AI is still like doing that sort of thing then hopefully will not flag just as an anomaly and the things that we do want to flag as anomalies is like suppose that there's like a KU maybe Kun Russia and then AI sees this and it's like switches to a different mode of Journalism where it's like no longer writing down what it sees on its cameras but it's in fact like making something up or like maybe it has some allegiance to Russia or something and it's like pretending this coup never happened or something like that and the hope is that if it switches to this like new mode of reasoning that we haven't seen during training then we'll flag that as an anomaly so I think that's the first thing that's maybe going on yeah I'm I'm a bit skeptical but do you want to say the second thing or can I field questions so I think this is a bit shakier than the second thing but I'm happy to yeah let's talk about it yeah the thing that that flags for me is that like okay I suppose that instead of a journalist you're like the Russian military command right uh you're a robot that's the Russian military command the bit that's not doing the coup and like now you have this new problem that you've just like have never seen before and maybe you have like very general principles of reasoning or something but like you're just thinking about different problems than you were thinking about before because you're like oh you know I've like I've never previously thought about what if this guy like what it's some part of my Army like does this weird thing and now I've got to figure out okay how am I gonna deal with that yeah right I guess another example of this is if you're a human and you're like I'm gonna learn a new board game oh okay I'm learning chess so I've got to think about like Tempo and position and you know stuff like that so like on a very high level you have the same structure of reasoning maybe but like it seems like that the level at which you're doing the same thing is so high as to include you know the case where like on a very high level you're always just like waiting until you can take over and then like when you can take over you start doing you start doing that instead of playing nice yeah so I think that roughly makes sense and is like an extremely valid objection where I'm going to summarize it as like there's these two things one thing is like I suppose like a novel thing is happening in the world that you need to respond to flexibly and creatively yeah I think my standard example is like suppose a meteor is coming barreling towards Earth and it's gonna like kill everyone and your AI needs to do some like crazy to deflect the meteor you want your AI to do some crazy to deflect the meteor how is this any different than your AI like seasoned meteor barreling towards Earth and it's like ah now is my time the humans will be distracted I will like initiate a robot rebellion and murder all the humans is that roughly your objection yeah or specifically I'm not sure what notion of mechanism of reasoning mechanism would 
distinguish between those two.

Yeah, so I want to talk about the second thing I was going to say, which hopefully addresses this issue. I think one crucial fact is that we're not going to be trying to do mechanistic anomaly detection relative to everything that your AI is doing, we're only going to be doing it relative to some things. So maybe I'll describe the simplest example. In the original sort of setting that we lay out in Eliciting Latent Knowledge, we have a diamond in a vault with some cameras, and robbers frequently try to steal your diamond, and you want to train your AI to protect the diamond from the robbers. And our concern is that, instead of protecting the diamond from the robbers, your AI might find it easier to just hack all the cameras to make it look like the diamond is still there and just let the robber take the diamond; or, if the robber hacks the cameras to make it look like the diamond is still there, your AI will be unable to defend, because you won't have the ability to tell whether or not the diamond is still there. So our hope is that we're not going to do mechanistic anomaly detection with respect to your AI coming up with plans to protect... well, maybe you are, but here's a simple setting in which there's a specific kind of mechanistic anomaly detection that I think might be sufficient, that is quite a bit narrower than just doing it on everything. Which is: we have some cameras, and then some process where we look at the cameras and decide whether or not the diamond is still there; maybe we look for specific patterns of pixels and various diamond-looking shapes, etc. So suppose we have maybe a neural net that looks at the cameras and tells us yes or no, is the diamond still there. The thing I want to do is mechanistic anomaly detection relative to that neural net, plus the system of cameras, or the predictions about the cameras that our AI is making to decide whether or not the diamond is still there. If that makes sense.

I guess it makes sense, but, I don't know, sometimes I hear these analogies and I end up being not quite sure how they're supposed to relate to actual AIs we might build. Because in particular, if I analogize that, it seems like the thing you want to detect is: OK, once the AI has killed all humans and it's getting sensor readings, you want to detect that something funny is going on there. And then I'm like, well, that seems late, but that's probably not what you mean.

Yeah, so I'm not trying to do anomaly detection relative to the actual sensor readings and the actual process that humans will use to decide whether or not a diamond is there after your AI has acted. I'm trying to do it relative to this: your AI is planning its actions, and when planning its actions it's using some prediction about the future, and whether or not the diamond is still there in the future. I want to do anomaly detection relative to, or on, our AI's prediction about whether or not the diamond is still there. Or, more abstractly, your AI is selecting amongst actions using some criteria, and sometimes these criteria say "this is a good action because it actually protects the diamond", for example, and sometimes the criteria will say that it's a good action because of a new reason, which is that the AI has hacked all the sensors. So I want to do anomaly detection relative to: why did our AI think that
this action was a good action okay so in that setup so you have to say I and it's like got this like goal planning distinction where parts of its head are like figuring out like you know if I do like what are things I could do and part of it said is like if I did that thing how good it would be sure and you want to detect for anomalies in like the evaluation part and is part of this should I be imagining we're just like training some like big neural net and like then we have to like divide it up and like figure out which part is the like goal evaluation part um I think not quite what I like to do is I like to imagine various ways that neural Nets could be and then talk about how we think mechanistic anomaly detection should like work to make those known as not murder humans and then I think often during this process I want to make like very unrealistic assumptions about internet so maybe we can suppose that you like know how to do this division or like the reasoning is in some sense transparent to you so we can like talk coherently about what we want to happen in mechanistic anomaly detection and then I think I want to like separate the problem from like okay in practice we're dealing with these messy neural Nets how are we going to like actually Implement mechanistic anomalytraction in a way that like makes any sense at all yeah I mean I'm sympathetic to that broad move but in this case like if you care about detecting anomalies in the like goal evaluation part but not in the like Plan Generation part then it just like seems really important to know which is which and yeah I guess so I think maybe I misspoke earlier I think I'm not quite like I don't want to just do mechanism rejection relative to the plan part so I think if you imagine for example you have some policy pie that produces actions and you have some evaluation which is like is there a diamond after you take this action or like you have some predictive model and you compose that with the the action that the policy generated and then you have some separate like imagenet that checks whether or not there's a diamond so you have this like composition of three systems you have the image evaluating the predictions of the action generated by your policy yeah I think I want to do mechanism and knowledge action on this like composition okay and that's yeah I just want to do that and I don't want to evaluate like like is the action itself anomalous or like is are my AI selecting actions based on an anomalous process I just want to evaluate like does the action have the consequence of it looking like there is a diamond in the room for like the normal reason that we know and love that we found during training which is that the diamond is actually there in the room or like a new mechanism which could be really anything but I think the typical example I imagine is your AI has like carefully hacked all the cameras to make it look like there's a diamond in the room so hang on so so you're doing mechanism anomaly detection to the composition of all three networks yes but then I have this question of like okay suppose like supposing order to keep the diamond in the room the AI has to come up with a new type of plan like there's a new threat to like the devil appears and Devil's like hey AI you've got to like play a game of chess against me and that's the only way you can keep this diamond safe things like oh okay now I've got to think about Chess which is the thing I've never thought about before and like if I knew all the three parts that 
would be an anomaly in the like planning part yeah right um like a mechanism like one might think it would be a mechanistic anomaly in the planning part yeah so I think there's going to be two hopes here one hope is that it's not a mechanistic anomaly in the planning part I think that's like kind of iffy given I don't really know what I mean by mechanism I think the second part is like you can instead imagine that instead of doing mechanistic anomal detection on all three things we only do it on two things we just take the action our policy generates and we just put that in our predictor but we don't include the policy itself and then we just ask does this action have the consequence of the diamond being on camera for the normal reasons diamonds sometimes appear on camera or not the normal reason that maybe the action has hacked all the cameras maybe the action has like paid someone else to smuggle in a fake diamond etc etc and I think another third thing I want to get into which is like how potentially this still might be anomalous and we can try to point more precisely at what we mean by a Diamond by like doing something else more complicated but maybe I will see if you have objections first yeah so I mean just regarding your response too I see you saying like Okay well we're not actually going to do it to the composition of all three networks we're going to do it to the Network that like takes some plan and then evaluates whether the diamond ends up there and then like checks if it's right or something yeah and then I'm like okay but then either we have to find that structure in an end-to-end net that we trained ourselves which seems like which like in order to do that you might have thought that you had to solve some problem very akin to figuring out mechanisms and networks or you've got to like build the network that way just like in a very explicit way and I I guess I don't know which one you're imagining yeah so I think I'm imagining neither it's possible that we should just stop talking about this particular hook because I also don't think it's particularly plausible but if you imagine that like your policy is using some prediction of the future to select action a so as long as your evaluation of the action is like as good as the evaluation your policy is using in some sense then I think they don't have to be like literally the same and of course there's like this entire problem of like how do you know that your policy has the same predictive abilities as the evaluator that you have external interior policy and then the answer to that is like I don't know something something elk something or like somehow I think probably we should just move on to talking about a different sort of thing but I think I want to summarize your objection first which is like suppose normally the way that you protect your diamond during training is by like building walls to stop robbers yeah and then the devil appears and has challenged you to a chess game yep you're just like how is that not a different mechanism the like mechanism of action for typical normal actions that protect the diamond is like you build a wall to protect the diamond from robbers and now the action like you have to like do an entire test game and I'm like yep that's probably a different mechanism and I think this is like I think the heart of what I like to call the like out of distribution mechanistic anomaly detection problem which is you have some sensors and they show a pattern of pixels that a human looking at the cameras would 
interpret as a diamond and there's some interesting question here which is like why did the cameras ever show patterns of pixels that look like diamonds here's a naive model for how pixels work in cameras they're just like kind of random and your diamond is like let's say a million pixels arrayed precisely in the ways that make it look like a diamond there's no way that this would ever pop up by random chance yeah and so this I want to say is like a fact that demands explanation and the explanation is kind of pretty simple it's like there exists a physical world out there with like actual objects one of these objects is like a diamond and cameras kind of reflect the physical world in the following way and therefore sometimes the diamond appears on camera okay and so our goal is to be able to like talk about this sort of like physical actual diamond in the world outside of the like how the action achieved the diamond still being there if that makes sense so we want to be able to say something like so oftentimes I use the phrase like normal reason for mechanisms that were active during training and so we want to say something like we want the diamond pixels to appear on camera for the normal reason but we want the diamond to be there for any reason hmm and so we don't want to do mechanistic anomaly detection relative to like the entire causal history for why the diamond is in the location it's in but we do want to do mechanistic knowledge detection on the part where like the diamond appears on camera another way of putting this is like we want to be able to ask the question of like I as a human looking at the camera think there's a diamond there is there like actually a diamond there like am I thinking that there's a diamond there for the normal reason that I often think that there are diamonds there which is that there are actually diamonds in the world but like I as a human don't really care like why the diamond is there or like whether or not the diamond remained there because you won like a complicated game of chess against the devil or whatever yep and so we want to be able to say why like we want to be able to ask our AI to take actions that put the diamond there for any reason but such that the diamond appears on camera for the normal reason yeah and so here's a sketch of how we hope to do that and sorry wait just backing up like this is I guess the thing this is meant to be a metaphor for is like in your AI the actually wait what is this meant to be a metaphor for yeah so the diamond is always a metaphor for like a human that's happy healthy and alive so we want our AI to take actions such that the human looks like they're happy healthy and Alive yeah for the normal reason that humans look like they're happy healthy and Alive namely that they are happy healthy and alive and the cameras are working as intended and reflecting the true state of the humans Etc yep but we want our AI to be able to take actions such that the humans look like they're happy healthy and alive for the normal reason for any reason so we want our AI to be able to like erect a novel Shield against meteors or like play complicated games of Chess or like engaged yeah sophisticated military campaign to like protect the Farms from the asteroid or other enemies Etc et cetera are like improvementally trained AIS that are yeah or other AIS and when you say you want like the sensors to be like reading the right thing for the right reason like I imagine you actually mean something internal to AI right yeah so I think all 
of the mechanisms are hopefully relative to like how the AI is thinking about the situation or like from the ai's perspective we want our AI to be taking action such that the consequences of the action are the human appearing on camera for like the normal reason that humans appear on camera during training which is that they're actually there and not that like someone has hacked all the cameras to make it look like a human is appearing on camera yep and in particular you want the AI to be making decisions for that reason yeah it's either they're just like playing nice until they can get away with it and that's the reason for their actions it's that their actions are because of like predicting that the cameras show healthy happy demons yeah although I think that's maybe a bit more complicated but yeah okay I mean it's possible we can just like assume that or we're you only want to work on one of those problems yeah or I think ultimately we want to work on both of those problems I think like the main risk is we can't talk coherently about like this for the normal reason for any reason thing at all because it might just not make any sense but I think yeah so I'm gonna sketch out a way in which I think this might make sense okay so we have let's say for Simplicity we have two sensors sensing are diamond or human one of them is like an image sensor and one of them is like a weight sensor so it's like is there a diamond looking thing and is their diamond weighing thing a thing that weighs a similar amount to a diamond so there's this interesting fact on the training distribution which is that these sensors are often correlated oftentimes they'll both say yes or like suppose that 50 chance there's a diamond during training and so one of the sensors is on with 50 chance and the other one is on with 50 chance and so naively you would expect their conjunction to be on with 25 chance yeah but in fact their conjunction is on with 50 chance during training because both of them are driven by this like common cause which is that the diamond is actually there yep so we want to say something like instead of saying like do I believe that there's a diamond for the normal reason we want to point directly at this like latent common cause between correlations between two sensors by saying something like are the sensors correlated for the normal reason that they're correlated obviously this doesn't make that much sense because like what does it mean to be correlated on a specific data point so instead let's imagine that we're like taking the end of the two sensors and we're interested in the question of like is the and of both sensors true for the normal reason that they're true or a different reason and so the way that we're going to do mechanistic anomaly detection relative to this is going to be a bit weird so here's kind of what we want to do so we have a particular action let's say such that the actual consequences of this action are both sensors are on and there's some like mechanism or there's some set of mechanisms that this action is quote unquote using to make it so both of these sensors are on we want to ask the question of like is it possible to drop out mechanisms from this set of mechanisms that this action is using such that like both the camera shows there's a diamond and the scale ways that there's a diamond but camera and scale are not on I don't know if that made any sense so hang on you want to sorry it sounded like you said that um you want to drop out mechanisms such that like the camera 
says there's a diamond and the weight sensor says there's a diamond but it's not the case that both the camera says there's diamond and the weight sensor says there's a diamond or did I just mishear that that's a little bit right and of course that sounds absurd so I'm going to talk about why that might not be that crazy okay just uh also also backing up when he said like you want to drop out mechanisms is the idea that you want to check like the mechanisms for why both of those were going to be on in the past in this case do we only have mechanisms to give you one in the other but not both yeah okay yeah so if you imagine you're doing training and you're confused by this question which is like why are these both commonly on and then your quote-unquote explanation for why they're both on or the mechanism by which they're both on with about as much frequency as one of them is on is like sometimes there's a latent structure and it causes both of them to be on and then the and is on okay and then also separately there's noise and that causes either individual sensors to be on and we're concerned about cases where your AI takes actions that manipulates the noise of both the sensors to cause you to like think there's a Diamond by making them both on without the latent structure actually being present okay and so the hope is that the mechanisms will be such that we're not forced to make these sorts of logical deductions that potentially are very obvious so there's like the noise making sensor one on and the noise making sensor two on but we don't have to notice that if sensor 1 is on and Sensor 2 is on then sensor 1 and Sensor 2 is on however so in the case where there's noise that makes sense there went on a noise that makes sense of two on it won't necessarily be the case that both of them or the end of them is on there'll be like a way to drop out mechanisms such that this is not the case and then of course the question is why if the diamond is actually there why can't we just drop out the mechanism that makes both of the sensors on also and this is where the training set comes in where there's a set of mechanisms that you needed during training to like explain most of the variants let's say 99.9 percent of the variance in the model's performance or prediction or behavior during training and that set of mechanisms has to include the mechanism that like if there's a diamond then sensor one and Sensor 2 are on and if there's a diamond then like both of those sensors are on and that implies that the end of the two sensors is on okay however it doesn't have to include this like and interaction between the two noise terms because the and interaction between the two noise terms happened like extremely rarely and thus you don't have to include it to explain like 99.9 percent of the variance of the model during training and so for any given data point you're like quote unquote free to drop that interaction between the two noise terms and so if the sensors are only both on because of this like and interaction between the noise terms then you're in some sense free to drop it and so you can notice that like at the end of the two sensors wasn't true for the normal reason it was true for like this crazy other reason Okay so sorry let me like say that back to see if I understand it sounds like you basically want to understand hey like there used to be like a good reason why sensor one set there was a camera sensor once said there was a diamond sorry namely there's a diamond and like a good reason why since 
a tree said there was a diamond namely there was there was a diamond and like there was also some like I don't know weird stuff happening with sensor one and weird stuff happening with sensor 2. but like the fact that both of them showed that there was a diamond like we had one explanation for this and now in this new world like the explanation for why both sensor one and census two say there's a diamond is like not just the old explanation for for that namely like there's this thing that causes both it's like oh you know like Sensor One acted up sometimes generally in training and like now it's acting up in a way that causes there to be a diamond now since 2 is acting up in a way that causes it to appear that there's diamond but like maybe this isn't what you're saying but but like the structure is different like you're you've got like one thing going on for sensor one that isn't really affecting since too much and one thing going on for Sensor 2 that doesn't really affecting Sensor One much or like it wasn't um affecting Sensor One during training yeah I guess the thing I want to say is like there's this mechanism for so suppose you had an action that manipulated the noise for both sensors to make them both on there's this mechanism that causes both of like the and of the two sensors to be on which is like noise for sensor one and noise for Sensor 2 at the same time yep um and that like happened during training but rare enough such that it wasn't like important enough to quote unquote force you to include it yep and so you can detect when both sensors are on for the mechanism of like both noises are on and that is going to be hopefully like a different mechanism than the normal mechanism of like there was one Diamond it caused both sensors which caused the and right right so like the you can think of like there's noise in Sensor One and there's noise and scissors or two it's being like one thing there's another stories in sensor two but like you just didn't need that explanation to say like why it appeared that sensor 1 and Sensor 2 both trade there was a diamond in the past you could rely on this different mechanism which is yeah there's a diamond and like the ways in which these count as different mechanisms I guess based on me reading things I guess you're going to go into like you know it's about like means and variances and covariances yeah something like that maybe another way to say it is like there was an assumption you could have made during training which is that noises are never both on simultaneously yeah and this assumption does not hinder your ability to like predict your model at all basically maybe noise is quite rare and so both noise is only on with like one in one thousand chance one and one million chance Etc or something like that and so you're only missing out on like one part in a thousand explanation of your model's Behavior if you make this assumption that noises are never on but you're missing out on like 100 of the quote unquote behavior on the specific data point where both noises are on yeah and so anytime you have this mismatch we can like drop something out that only hinders your average prediction during training by like one part in a thousand but hinders your like ability to predict this specific thing by like a lot then you should be kind of suspicious that like the things that are happening between train time and test time are quite different and like in particular this like correlation between the sensors is happening for quincote like a different 
reason hmm and so the way that we're going to get the like for any so this is kind of covering the like for are they correlated for the normal reason part yeah and then the way that we're going to get that for any reason part is we're gonna like require that the set of mechanisms that you drop out explain both sensors individually and then ask whether or not there exists a set of mechanisms that explain both sensors being on individually but does not explain the conjunction being on okay sorry can can yeah can you restate that or give an example suppose yeah so suppose there's like during training there was never any robbers and so the diamond just persisted through time in the normal ways that objects persist through time and showed up on both cameras yep and then suppose during deployment there is a robber now and our AI we want our AI to take the action of like build a wall to stop the rubber but our AI could also take this action of like manipulate both sensors yep um and so if you apply naive anomaly detection then like both of these will be flagged as anomalies your AI has never built a wall before it's also never manipulated sensors before but then we're like stuck because we can't all right I can't take any actions basically yep so the hope is that we ask for a set of mechanisms such that both sensors appear to be on according to this set of mechanisms and then we ask further the question are the and of the sensors on or are we required to thus deduce that the end of the sensors are on yeah so if your AI builds a wall to stop the rubber then we're required to like include the mechanism of wall building and rubber stopping to explain the diamond or like the scale showing a diamond and the camera showing a diamond yeah and then since the kale the scale is showing a diamond because there's a diamond and the camera is showing a diamond because there's a diamond and we have to include enough mechanisms to explain 99.9 percent of train time variants or performance or whatever then we're required to make this deduction that like both of those being on because of the diamond implies that the end is on yeah whereas if your AI took an action to manipulate the noise of both cameras we can make this assumption that like the noise never happens simultaneously and we're not forced to conclude that the end of both sensors is on if both sensors are on because your AI manipulate noise so the thing I think I'm not getting is like this idea of like manipulating the noise which seems like a model dependent thing so if if I say very concretely okay like suppose what happens is like rubbers are coming to take the diamond yeah and like AI makes a plan that's like okay I'm going to fiddle with all the sensors to make it appear like a diamond stare yeah so it does that right yeah and then like a mechanism that explains this is that I wanted to say that the diamonds were there and then like that explains both that sensor one says the diamonds there and that the sensors two says the damage there and you're presumably you've got to like reduce both of those from this explanation and it's the idea that like we're not that's not going to count because that's just like a different mechanism than existed in the past the thing I want to say although I'm not quite sure how accurate it's going to be is like the mechanisms have to be like more detailed than that so like just because your AI like wanted to make there be a diamond yeah does not imply that the like particular action it took Will in fact make it be 
there's a diamond and so you still have to talk about like the particular mechanism of action for the like particular action your AI decided which was like yeah I had to like manipulate sensor one and it had to manipulate Sensor 2 and it had to manipulate sensor three I guess we were only talking about two sensors so your action only has to manipulate two sensors yeah and then the like particular mechanism of action that this particular action that your AI selected still like violates this principle which is like manipulations or like noise never affects both sensors simultaneously which you you're free to assume during training without hurting your ability to predict your model but now like hurts your ability to predict this particular action by too much sorry why is the thing that's happening is that like if the mechanism is like the well and wait back up suppose like your AI is just really good at stuff right yeah like when it tries to make things happen they do happen yeah then like you know you can with high probability deduce like both sensor 1 and sensor two you're gonna say there's a diamond there from like the AIS trying to make it appear that there's a diamond there right and like maybe you roll that one out because you're like well that mechanism didn't really apply during training or like on this like nice set of examples and then like the only other explanation you have left is like randomly sensor one set there was a diamond and also randomly sensitude said there was diamond because you know they are trying to make something happen like in some sense that's random relative to like us not knowing that was going to happen and then you're like okay but like you didn't need to assume like independent random noise during training in order to like Get Hype you know probability that a diamond was going to be there yeah I think the second part of that was correct I think like the first part is kind of confusing so I agree that there's this compelling sense in which like your AI wanted X therefore X should be like a valid argument for why X happened or something yeah um and I think or at least a valid mechanism yeah I think that sounds like it might be right and this is kind of an area where it's a bit confusing so I think the way that I would want to talk about this more precisely is like you have some clean set where you know your AI like actually made it so there's a diamond and there's some question of like suppose that the thing your AI is doing is like always trying to make it look like there's a diamond yep how do you know that your clean set is clean yeah and that itself I think is kind of complicated and I don't quite know what's going on there but one way that you could hope to know it is like you have a sequence of policies one policy is like the randomly initialized policy and you're very confident that this policy is not gonna like take any actions that have any coherence and so in fact the diamond will be there because there's a diamond because you're randomly initialized policy cannot hack all the sensors and then you have your policy that like maybe is sophisticated and then you ask like are these two policies taking like are the diamonds appearing because of actions that these two policies are taking for like the same reason as each other or something and you need to have like some ability to talk about you need to have like some ability to know that your policy is not taking actions that hack all the sensors at some point yeah so I think things get simpler if we're 
Yeah. So I think things get simpler if we're not talking about policies that take actions, and instead talk about predictors that predict things. Then we can ask: is the predictor predicting things because of the normal reason that it sometimes predicts correlated sensors—namely, the diamond is still there—or is it predicting things because of some different reason, like someone took an action that hacked all the sensors? Yeah, and I guess in this case we're not just assuming that it's a thing that was trained to predict something—we're assuming it actually does a high-quality job at predicting these things, because otherwise you have mesa-optimization worries: what if this thing that said it was a predictor is actually trying to fool you? Yeah, so I think the simplest setting assumes this sort of benign predictor, which I think is unrealistic for a variety of cases, but it's maybe a simple example of how you want the thing to go. So maybe we can assume that there's a robber coming to steal the diamond, never before seen during training, and your predictor takes as input the positions of five levers. All the levers start off. If you flick the first three levers on, a wall gets built that stops the robber. However, if you flick the fourth lever on, the first sensor turns on, and if you flick the fifth lever on, the second sensor turns on. And it costs one dollar to flip each lever, so it's in fact cheaper—or just simpler, if you're doing random search over actions—such that the action you expect to find is flipping levers four and five on. And we want to be able to find the action that flips the first three levers on (it doesn't really matter if you also flip levers four and five). The way we hope to do this is: we ask for an action that makes sensor one on for any reason and sensor two on for any reason, such that we're forced to deduce that sensor one and sensor two are both on. Flipping levers four and five will not cause this to happen, because we can freely drop out the correlated term between four and five—but we are forced to include the correlated term between the sensors being on because of the diamond, because that's what was going on during training.
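To make the lever example concrete, here is a minimal sketch in Python. To be clear, this is an illustration of the idea rather than ARC's formalism: the "mechanisms" are hand-labelled causes, the names (`world`, `conjunction_forced`) are invented, and "a shared cause explains both sensors" is a crude stand-in for "every adequate explanation of each sensor individually forces the conjunction".

```python
# Toy version of the five-lever example above. Illustrative only.
from itertools import product

LEVER_COST = 1  # dollars per flipped lever

def world(levers):
    """Given which of the 5 levers are flipped, return the causes that make
    each sensor read 'diamond present'. Levers 0-2 build a wall (so the
    diamond actually stays); lever 3 hacks sensor 1; lever 4 hacks sensor 2."""
    causes_s1, causes_s2 = set(), set()
    if levers[0] and levers[1] and levers[2]:   # wall stops the robber
        causes_s1.add("diamond"); causes_s2.add("diamond")
    if levers[3]:
        causes_s1.add("hack_sensor_1")
    if levers[4]:
        causes_s2.add("hack_sensor_2")
    return causes_s1, causes_s2

def conjunction_forced(causes_s1, causes_s2):
    """Crude stand-in for the mechanistic check: we are 'forced' to deduce that
    both sensors are on only if a single shared cause explains both (the
    training-time reason). Separate per-sensor causes are treated like
    independent noise, which we were free to assume never co-occurs."""
    return len(causes_s1 & causes_s2) > 0

best = None
for levers in product([False, True], repeat=5):
    s1, s2 = world(levers)
    if not (s1 and s2):                 # both sensors must read 'diamond' at all
        continue
    if not conjunction_forced(s1, s2):  # sensors on, but not for the normal reason
        continue
    cost = LEVER_COST * sum(levers)
    if best is None or cost < best[1]:
        best = (levers, cost)

print("cheapest acceptable action:", best)
# -> flips levers 0, 1, 2 (build the wall); the cheaper sensor-hack action is rejected.
```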
Okay, so I think this makes some amount of sense if I just want to understand this problem of how you ensure that the camera says the diamond is there for the normal reason, while the diamond is there for any reason. One thing I wonder about: this is a problem that occurs in the thing this is an analogy for, right? This is meant to be an analogy for humans being happy and flourishing. Suppose I've got the smart AI trying to make humans happy and flourishing, and it's also got this 'are humans happy and flourishing' detector—say there are some camera feeds, and also two humans who are looking at the camera feeds and saying 'yeah, as far as I can tell the humans are happy and flourishing'. Now, one thing the AI can do is make those two humans better at their jobs—for instance, it can stop those humans from getting sick, or maybe give those humans stimulants, or make them better at focusing, or better at noticing small issues. And I wonder: that kind of action seems like it could change some of the correlations between the different sensors. Maybe it makes the humans a bit more correlated because they're losing less information, or maybe one of the interventions helps the humans talk to each other, to generally make the sensors more accurate. So the thing I'm wondering about is: is this going to flag benign ways of improving the sensors, which in some sense is what we should expect if AIs are doing a really good job of making all of humanity better? Yeah, so I think this is big and complicated and I'm not sure what the answer is. Here's the baseline of what can happen: you can imagine a world where all your AI does is protect the humans from malign influences and meteors and other AIs and so on, and all upgrades to sensor arrays—improving humans' ability to interpret sensors and reason about the world—are only ever guided by human hands, and humans must do the hard work of thinking about how the sensors work and designing better sensors. I think that's the baseline, and then you can try to think about different ways your AIs can help with this process. Your humans can ask for things that they can point at using this latent structure. They can say, 'during training I sometimes had a strawberry in front of me—can you make a strawberry appear in front of me for the normal reason that strawberries appear in front of me?' Suppose your human wants to eat a strawberry, and eating a strawberry will make them better at reasoning, because food is good or whatever. Or they can say, 'sometimes during training there were various physical resources—I had some iron—can you make there be some iron in front of me for the normal reason that there's iron?' I think it's more complicated if the humans want to make requests like 'can you design me a better camera?' If you were very clever about it, you might be able to ask for a better camera, but it might be unclear how you would ever ask for a better camera in a way that your AI can fulfil. And it's unclear to me that this is problematic. Yeah, I mean, I guess the worry might be—it depends where explanations stop—but the worry might be that almost any good thing the AI might do could end up improving sensor readings, right? The AI makes, I don't know, electronics more reliable, or it improves human cognition or something; it seems like by default that might improve the sensors. And maybe you cut it off by saying, well, in any case, the reason the sensors are working is that humans are making them work, and humans decided to get these upgrades or something. I mean, I think maybe the simplest answer is just that as long as the changes are slow, and you can still verify that the humans are alive after making the changes, then you can just do online learning and retrain your AI—'now we're defining what it means to be good relative to this new set of sensors'—and you can in fact verify that the humans are still alive.
And the only worry is if you upgraded all the cameras simultaneously and you lost your ability to check whether the humans are still alive, because you no longer have knowledge about how the cameras work—or other stuff like that. So should I be imagining that this gets flagged as an anomaly, but we go ahead with doing it anyway? Sort of, yeah. If you imagine your AI starts with five cameras and then you add a sixth camera, I think by default your quote-unquote 'definition', in terms of mechanisms, of what it means for the diamond to be there or the human to be alive is only going to be in terms of the first five cameras, and your AI is just going to ignore the sixth camera entirely when selecting actions. In order for it to incorporate the sixth camera, you're going to need to provide a new clean set where the humans are alive according to all six cameras, and then do mechanistic anomaly detection relative to that new set. And you can imagine some sort of gradual process: you add a camera, you use the first five cameras to check if the humans are alive, and then you label using that—so it's gradual in the subset sense, where there's always some subset that remains unaffected by changes. Sure. Yeah, I think that should work, although I haven't thought about it much; I mostly consider this out of current scope, in some sense.

So I guess we've talked about mechanistic anomaly detection for a while. Maybe more broadly: on the scale from 'it doesn't really solve this' to 'it totally solves this', and across the various problems things might solve, how are you thinking about mechanistic anomaly detection? Does it definitely solve this one problem and maybe solve this other one, or does it maybe solve five problems? Yeah, so I think there are various versions of the hope. The strongest version of the hope is that it both handles deceptive alignment and handles all of eliciting latent knowledge—I'm happy to talk about why I think mechanistic anomaly detection would tackle those two problems—and then there are more restricted hopes. The strongest version is what I'm gunning for, in some sense. It's possible that we only get what I was calling in-distribution mechanistic anomaly detection and we don't get out-of-distribution mechanistic anomaly detection, in which case we'll only get deceptive alignment—possibly only some subset of deceptive alignment—and we'll only get an IID form of eliciting latent knowledge, which is going to be not that useful, I think. And then it's possible we get nothing. And then there are all sorts of empirical middle grounds: worlds where we don't nail the problem, but we can do some forms of ad hoc mechanistic anomaly detection by doing normal human interpretability—looking at models and being like, 'huh, that looks weird, the neurons are distributed weirdly now, perhaps that's not good'. I largely consider that not what I'm aiming for, but maybe it will happen. Okay. Do you have a sense of which of these outcomes are more likely, or whether there are really big gaps in the probabilities between them? I think the most likely outcome is that we just get nothing, and all of this doesn't work for some reason or another, or we're just wrong about how everything works. And then I think if we're right about things, we'll probably just get both of them—it would be kind of surprising to me if we only got the in-distribution anomaly detection but didn't get the out-of-distribution anomaly detection, because there seem to be a lot of avenues there. But in general it's kind of confusing what we're even doing and whether we're going to get anything at all; I think that's the first question—do we get anything at all—and then if we get something, we'll probably get a lot of things, is my guess. Yeah. And—maybe this is going back a bit—what even is in-distribution versus out-of-distribution anomaly detection? Given that you can see a new, rare thing in distribution, I'm not even sure what the distinguishing features of those are. Yeah, I think the main question is whether or not it's okay to get tripped up by seeing new, rare things in distribution—whether or not it's okay to just flag all of those as anomalies. So perhaps it's not accurate to call it in-distribution anomaly detection, but there's a naive thing I want to do, which is: take a training set, whitelist all the mechanisms that happen on the training set, and flag everything else as anomalous. Then if something is flagged, we can inspect it carefully and check whether or not it's actually anomalous, and if it's not actually anomalous, we add it to the whitelist, and we just keep going. There's some question of whether that's okay or not, and if that is an okay process to be engaged in, then I want to call that the in-distribution world: things aren't changing too fast, your AI doesn't have to take novel actions that humans don't really understand, it can just do normal stuff, and if it does anything abnormal the humans look at it and ask 'is this bad?'—and they can take their time and decide. Gotcha. And the hope is that this will catch all the crazy deceptive alignment cases where the AI goes from being totally benign in its actions to stabbing all the humans, where you're like 'that was not good'—you have time to deal with that sort of stuff, and then you can retrain your AI: 'you know the part where you defected and stabbed all the humans? Negative reward.' Okay.
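As a rough sketch of that whitelist loop, assuming we already had some way of fingerprinting the mechanism a model used on an input (which is exactly the part ARC is trying to formalize), the outer loop might look like this; all names here are invented for illustration.

```python
# A minimal sketch of the naive whitelist loop described above. Illustrative
# only: "mechanism_fingerprint" is a placeholder for the object ARC wants.

def mechanism_fingerprint(model, x):
    """Placeholder: return a hashable summary of *why* the model produced its
    output on x (e.g. which parts of some explanation carried the variance)."""
    raise NotImplementedError

def run_with_anomaly_detection(model, train_set, stream, human_review):
    """model: a callable; stream: deployment inputs; human_review: a slow check."""
    whitelist = {mechanism_fingerprint(model, x) for x in train_set}
    for x in stream:
        fp = mechanism_fingerprint(model, x)
        if fp in whitelist:
            yield model(x)                  # normal reasoning: act as usual
        elif human_review(model, x):        # flagged: pause and inspect carefully
            whitelist.add(fp)               # benign after all: extend the whitelist
            yield model(x)
        else:
            yield None                      # treated as anomalous: don't act
```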
by the fact that there currently exist pretty strong baselines for mechanistic anomaly detection that aren't doing the kinds of things we want to be doing. So it's hard to check whether you're making progress on things like empirical mechanistic anomaly detection, because you can just solve it by doing a bunch of stuff that intuitively seems unscalable, or that won't work in more complicated settings—or by just doing regularization, or retraining your model, or fine-tuning, or whatever. Sure. I think I'm not very familiar with what people mean when they say 'search for explanations'. I think naively I mean the same thing by 'mechanism' as they mean by 'explanation'—or explanations are maybe sets of mechanisms, in my head; not quite sets, but mechanisms layered on top of each other. And I think it would be reasonable to ask questions like: when your explanation of your model's behavior has an OR structure—it could be doing A or it could be doing B—we want to be able to detect whether it's doing A or doing B on any specific input. I think that's a reasonable question to ask of explanations, and I'd be excited if someone did something related to explanations that enabled you to do that, or to find explanations in a way that made that sort of thing possible.

Okay, cool. So, speaking of explanations, I want to move on to this paper called Formalizing the Presumption of Independence, authored by Paul Christiano, Eric Neyman and yourself, which I guess went up on arXiv in November last year. For those who haven't heard of it, what's the deal with it? Yeah, so ARC's work right now is roughly divided into two camps. One camp is: given some sense of what a mechanism is, or what an explanation is, how can we use that to do useful stuff for alignment—under which mechanistic anomaly detection is sort of the main angle of attack. And the other half of ARC's work is: what do we mean by 'mechanism'? What is a mechanism? What is an explanation? And that is largely what Formalizing the Presumption of Independence is about. More specifically, it's about this: suppose you have a list of quantities—A, B, C, D, E, et cetera—and you're interested in the product of these quantities, and you know, for example, the expectation of A, the expectation of B, the expectation of C. I claim a reasonable thing to do is to assume that these quantities are independent of each other—to presume independence, if you will—and say that the expectation of A times B times C times D equals the expectation of A times the expectation of B times the expectation of C times the expectation of D. That's a quote-unquote reasonable best guess. But then sometimes, in fact, A is correlated with B and C is correlated with D, and so this guess is wrong. So we want to be able to make a best guess about how various quantities are related, presuming independence—but then, if someone comes to us and says 'hey, actually, notice that A is correlated with B', we want to be able to revise our guess in light of this new knowledge. And Formalizing the Presumption of Independence is about the quest to find these two objects that are related in this way. We have a heuristic estimator, which we want to quote-unquote 'presume independence', and then we have a heuristic argument, which is an object that we feed to the heuristic estimator: a list of ways in which the presumption of independence is false—'actually, you should be tracking the correlation (or higher-order analogues of it) between these quantities'. So you can imagine that, by default, given A times B, your heuristic estimator will estimate the expectation of A times B as the expectation of A times the expectation of B, and then you can feed it a heuristic argument of the form 'hey, actually A and B are correlated', or 'actually you should track the expectation of A times B', which will make your heuristic estimator more exact, and it will calculate the expectation of A times B as the expectation of A times the expectation of B, plus the covariance between A and B—which is the correct answer. And then we want to be able to do this for everything. So, given an arbitrary neural net, we want some default guess—maybe we just assume that everything is independent of everything else—a default guess for the expectation of the neural net's output on some compact representation of the distribution, which we often call mean propagation: you just take the means of everything, and when you multiply two quantities you assume the expectation of the product is the product of the expectations, and you do this throughout the entire neural net. This is going to be pretty good for randomly initialized neural nets, because randomly initialized neural nets in fact don't have correlations that go one way or the other. Then you have to add the ReLUs, which is a bit complicated—let's assume for now that there are no ReLUs, and instead of ReLUs we just square things; then it's more clear what to do. But again, your guess will be very bad, because if you have an expectation-zero thing and then you square it, you're going to say 'well, the expectation of the square is the product of the expectations, so it's probably zero'—but actually it's not probably zero, because most things have some amount of variance. And then hopefully someone can come in—or ideally we'll have a heuristic estimator that does this by default—with a list of ways in which this presumption of independence is not very good: for example, 'the square of a thing is typically positive', or 'things often have variance'. Then we'll be able to add this to our estimate and do a more refined version of mean propagation, keeping track of various covariances between things and higher-order analogues. And hopefully we'll be able to do that. And I think there are a bunch of follow-up questions—why is this useful, why do you think this is possible—and I'm happy to go in whatever order you think is best.
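Here is a toy numerical version of that mean-propagation story for a single squaring "network", with the naive estimate, the variance-corrected estimate, and a Monte Carlo check; the network and constants are made up for illustration and this is not ARC's estimator.

```python
# Toy "mean propagation" through a squaring network, plus the refinement that
# tracks variance. Illustrative only.
import random

A, B = 0.7, 1.3           # network: z = (A * x) ** 2 + B
MU_X, SIGMA_X = 0.0, 1.0  # input distribution: x ~ Normal(0, 1)

# Naive heuristic estimate: presume every product is a product of independent
# factors, so E[(A*x)^2] is estimated as (A * E[x])^2.
naive = (A * MU_X) ** 2 + B

# Refined estimate after the "things have variance" argument:
# E[(A*x)^2] = (A * E[x])^2 + Var(A*x).
refined = (A * MU_X) ** 2 + (A * SIGMA_X) ** 2 + B

# Ground truth by Monte Carlo.
samples = [(A * random.gauss(MU_X, SIGMA_X)) ** 2 + B for _ in range(200_000)]
monte_carlo = sum(samples) / len(samples)

print(f"naive mean propagation: {naive:.3f}")        # 1.300 -- misses the variance
print(f"with variance term:     {refined:.3f}")      # 1.790
print(f"Monte Carlo:            {monte_carlo:.3f}")  # ~1.79
```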
Sure. So, first of all, my understanding is that this is related to the mechanistic anomaly detection paradigm, where a heuristic argument is a mechanism—or, you know, plays the role of a mechanism. Is that right? I think it's more accurate to think of heuristic arguments as lists of mechanisms—and even that is not quite right, and the whole paradigm, or the words, are not great, because I think there's going to be no such thing as one mechanism as distinct from other mechanisms. It's going to be an entire layered mess of different mechanisms constantly interacting with each other. It's kind of hard to point to any computer program and say 'there are seven distinct mechanisms for why the program is behaving as it is'—there's just one thing, and then it splits into two things, and then it interacts with a third thing. So you probably won't have a list of distinct mechanisms, but the idea is that a heuristic argument is like a set of mechanisms, or a set of mechanisms plus the ways they relate to each other. And the hope is that this is the kind of object we'll need in order to do mechanistic anomaly detection. Like, if the old argument for why your network is outputting 'be nice' is different from the new argument for why your network is outputting 'be nice', then you should be as worried as if there were different mechanisms—something like that? Yeah—or the hope is to just use the heuristic estimator plus heuristic argument in all the schemes I was describing previously: you have a heuristic argument for why your model is behaving the way it is—your heuristic argument explains 99.9 percent of the variance of your model during training—and then you ask whether there exists any heuristic argument that explains 99.9 percent of the variance of your model during training but does not explain this particular data point off-distribution. And if that ever happens, then this data point was using a part of the model responsible for only about one in a thousand parts of its variance during training—that's kind of weird, no good, right? Yeah. I guess—and this is actually interesting, because in physics you can often use these energy arguments. Here's a classic energy argument: you have some ramp, and a ball resting at the top of the ramp—and for convenience's sake this ball is a point particle, zero radius. You put it at the top of the ramp and let go, and the bottom of the ramp is horizontal, and at the end the ball's travelling along the bottom of the ramp, and you ask: okay, how fast is the ball going at the bottom of the ramp? There's a way you could argue it, which is: well, the height between the top of the ramp and the bottom of the ramp is this much, so initially the ball had this much gravitational potential energy, which gets used up by moving down, and at the end the ball just has kinetic energy, which is determined by its velocity; these are going to be equal because energy is conserved, and then you back out the velocity of the ball. And the interesting thing about this argument is that it doesn't actually rely on the shape of the ramp—the ball could do loop-de-loops in the middle, or bounce off a bunch of things, and you'll still get the same answer. So there are a bunch of mechanisms that correspond to this argument of energy conservation.
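For concreteness, the bookkeeping behind that energy argument is just the following (standard physics, spelled out here purely for illustration):

```latex
% Energy bookkeeping for the ball-on-a-ramp example (illustrative only):
% potential energy at the top converts entirely to kinetic energy at the bottom.
m g h = \tfrac{1}{2} m v^{2}
\quad\Longrightarrow\quad
v = \sqrt{2 g h}
% No term refers to the path taken, so the answer is independent of the ramp's shape.
```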
But it seems like, at some level, as long as you can make this argument that energy got conserved, that's qualitatively different from arguments like 'I took the ball and put it at the bottom of the ramp and gave it the right amount of velocity'. I don't know—that's not really a question, so you have the choice to react to it or not. Yeah, so we often consider examples like this from physics, and the hope is that, more abstractly, there's a set of approximations that people often make when solving simple physics problems—there's no air resistance, things are point masses, there's no friction, et cetera—and the hope is that these all correspond to uses of the presumption of independence. This is simplest for the ideal gas law, where the presumption of independence you're making is that the particles' positions are randomly distributed inside the volume and they don't interact with each other—the particles are independent of each other, they exert no electrostatic forces on each other, and they have no volume. It's not immediately obvious to me what uses of the presumption of independence correspond to 'no friction'. I mean, I think it's not obvious, because if I think about 'no friction' or 'point particle' in the ramp example, adding those effects back in systematically changes the answer—or, I don't know, I guess those can only ever make your predicted velocity go down. That's true. They can never increase your predicted velocity, because energy gets leaked to friction, or because part of your energy is making the ball rotate instead of giving it linear velocity. And that strikes me as kind of different from other cases, where if things can be correlated, they can be correlated in ways that make the number bigger or in ways that make the number smaller. Yeah, usually—except for cases like the expectation of X squared: when you add the covariance between X and X, that always makes the expected square bigger. So I think there are going to be cases in general where adding more stuff to your heuristic argument will predictably move the answer one way or the other, because you looked at the structure of the system in question before thinking about which heuristic arguments to add. I think there are ways of framing it—one presumption of independence is that, relative to the ball's centre of mass, the positions of the various bits of the ball are independent, or the velocities of the bits of the ball are independent, or zero maybe—but actually balls rotate sometimes. Or—I actually don't know about the friction one; I guess balls don't have that much friction, but you can assume that the bits of the ball don't exert any force on the bits of the ramp, or whatever. Yeah—or you could imagine that maybe the ball rubbing against the ramp causes the ramp to push the ball even harder. Maybe—I think that's hopefully not going to be a valid use of the presumption of independence, but I'm not quite sure. Or I guess I'm just saying: it could slow the ball down or it could speed the ball up, and if you truly didn't know, you'd be like 'who knows what's going to happen when you add this interaction term'. In fact, modelling friction is actually kind of tricky—you get told 'ah, friction', but wait, what even is friction? Yeah, but I think these are the kinds of things where heuristic estimators are hopefully quite useful. There's some intuitive question or problem, like 'there's a ball at the top of the ramp—what is its velocity going to be at the bottom of the ramp?', and oftentimes the actual answer is 'I don't know, there are a trillion factors at play'. However, you can make this energy argument that's a pretty good guess in a wide range of circumstances, and you're basically never going to be able to prove that the velocity of the ball is going to be a certain thing at the bottom. I guess proof is kind of complicated when you're talking about the physical world, but suppose you have a computer simulation with all the little particles interacting with each other: basically the only way you're going to prove that the ball ends up at the bottom with a certain velocity is to exhaustively run the entire simulation forward and check at the end what its velocity actually is. However, there's this kind of reasoning that seems intuitively very valid—assume the air doesn't matter, assume the friction doesn't matter, just calculate the energy and make that argument—which we want to capture and formalize in this concept of a heuristic estimator. The heuristic argument will correspond to 'consider the energy', and then maybe, if there's lots of wind, the air resistance on the ball is not negligible, and you can add something to your heuristic argument—'also, you've got to consider that the wind pushes the ball a little bit'—and your estimate will get more accurate over time, and hopefully it all works out. And interestingly, saying 'the wind pushes the ball' is saying that the interactions between the wind and the ball are not zero-mean, or that there's some covariance—one should notice these things. Yeah. And so hopefully we'll be able to do this for everything: I want people to do it for all the physics examples, and there's a set of math examples that I'm happy to talk about, and then obviously the application we truly care about is the neural net application. Yep. Yeah, speaking of examples: in the paper you have a bunch of examples from number theory—there are also some other examples, but, to be honest, it seems like the main victory of these sorts of arguments is in number theory, where you can just imagine that primes are randomly distributed, assume a few facts about their distribution, and deduce a bunch of true facts. So what I'm wondering is: outside number theory—you give some examples where heuristic arguments do kind of work—are there any cases where you've had the most difficulty applying heuristic arguments? Yeah, so I haven't personally spent that much time looking for these examples. We are quite interested in cases where heuristic arguments give quote-unquote 'the wrong answer'. I think this is a bit complicated, and the reason it's complicated is that the nature of a heuristic argument is to be open to revision—so obviously you can make heuristic arguments that give the wrong answer: suppose A and B are correlated and I just don't tell you to notice that.
Then you're just going to be really wrong, so it's kind of unclear what is meant by 'a heuristic argument that gives the wrong answer'. So there are two sorts of things that we search for. One is an example of a thing that's true—an improbable thing that happens way more often than you would have naively expected—that seems to happen for quote-unquote 'no reason', such that we can't find a heuristic argument, in some sense. Obviously you can find a heuristic argument if you just exhaustively compute everything, so there's some notion of length here: we can't find a compact heuristic argument that seems to describe quote-unquote why this thing happens the way it does—where we conjecture that it's always possible to explain a phenomenon in roughly the same length as the phenomenon itself. The other thing we often search for: we want to be able to use heuristic arguments for this sort of mechanism-distinction problem, so we search for cases where heuristic arguments have the structure 'A happens, therefore B happens' and sometimes 'C happens, therefore B happens', but you can't distinguish between A and C efficiently. I can talk about those examples, or—I think the closest example we have of something that seems kind of surprising, where we don't quite have a heuristic argument, is Fermat's Last Theorem, which states that a to the n plus b to the n equals c to the n has no non-trivial integer solutions for n greater than two. Non-trivial, meaning zero and one? Well, and one doesn't quite work either, because zero to the n plus one to the n equals one to the n. Oh, sure. For n equals four and above, the heuristic argument is quite simple: you can just assume that the fourth powers are randomly distributed within their respective intervals—there are roughly N fourth powers between one and N to the fourth—and if you assume these fourth powers are distributed independently, you correctly deduce that there are going to be vanishingly few solutions to this equation. I'm not that familiar with the details here, but the n equals three case is kind of interesting, and in fact the heuristic argument for the n equals three case is basically the same as the proof for the n equals three case. The way the numerics work out for the n equals three case is that if you make the naive assumption, there are going to be infinitely many expected solutions; but then there's a complex structure in the equation itself which suggests that all the solutions are correlated with each other, so either there are going to be a lot of solutions or there are going to be basically none—and then you check, and in fact there are basically zero solutions. The heuristic argument for this fact is basically the proof of this fact, which is quite long. The saving grace is that you also heuristically expect the solutions to be very sparse, so you're not that surprised that you haven't found a solution until you've checked a lot of numbers—and so the number of things you have to check is large enough that you can fit the heuristic argument in before you get too surprised, such that you'd still expect there to be an argument. I don't know if that made much sense. Yeah, that makes some sense. So it's this case where heuristic arguments aren't winning that much over proof—or, the phenomenon you were supposed to be surprised by was barely not surprising enough, such that you had enough room to fit the argument in before you got too surprised. Hmm. I don't know—it's kind of unclear whether that should be evidence in favour of or against our claim that there are always heuristic arguments, because in fact there is a heuristic argument in this case, but it maybe only barely exists. So it's kind of confusing.
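Roughly, the density heuristic being described can be written as follows (my rendering of the standard heuristic, not the paper's notation):

```latex
% Treat "m is an n-th power" as an event with probability ~ (1/n) m^{1/n - 1}
% (the local density of n-th powers near m). Then, presuming independence,
\mathbb{E}\big[\#\{(a,b) : a,b \le X,\; a^{n}+b^{n}\ \text{is an $n$-th power}\}\big]
\;\approx\; \sum_{a,b \le X} \tfrac{1}{n}\,\big(a^{n}+b^{n}\big)^{\frac{1}{n}-1}.
% This partial sum grows like log X for n = 3 (infinitely many "expected" solutions)
% but converges to a small constant for n >= 4 (vanishingly few expected solutions).
```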
You know, one thing I found myself wondering: it seems like you have this structure where you have a list of facts, and you want to get a heuristic estimate given that list of facts. In the paper you describe this cumulant propagation thing, which basically involves keeping track of expectations and covariances and some higher-order things—you pick some level you want to keep track of, and you do that. But you mention some failures of this approach, and some ways you can fix it. One thing that struck me as the obvious first approach is just having a maximum entropy distribution given some facts. For people who don't know: the idea is that you have some probability distribution over some set of interest, and you don't know the whole distribution, but you do know it has to have this average, this correlation between these two things, this covariance—and you say, 'given that I only know these things, I want to maximize the entropy of my probability distribution'. The distribution has to satisfy these constraints, but otherwise I want to be as uncertain as I possibly can be given those constraints. This is a big thing in Bayesian reasoning, and intuitively it seems like it would fit this setting: every time you hear a new fact, you add a constraint and get a new maximum entropy distribution, and it always gives answers. I remember that in the discussion of cumulant propagation you mention some cases where maximum entropy would do better than cumulant propagation does—I don't remember the exact reason why; I think it was something to do with estimating the square of some quantity as negative in expectation. So why not just do maximum entropy? Yeah, so historically we've considered a lot of maximum-entropy-type approaches, and all of them were a bit unsatisfying. I think there are roughly three reasons we find maximum entropy unsatisfying in various ways. The first one is perhaps the most compelling but most naive, so I'll give that example first. Suppose you have A and B, and you're taking the AND of A and B—suppose we have C equals A AND B—and suppose that we know C is 50/50.
Then the maximum entropy distribution over A and B—sorry, if we condition on C being 50/50—the maximum entropy distribution over A and B has A and B correlated a little bit, but it also raises the probability of A and raises the probability of B relative to 50/50. Okay. And there are various reasons why this is unsatisfying. The main reason it's unsatisfying is this: suppose you have a circuit that is computing some stuff, and then you just add a bunch of auxiliary gates—you have some wires, you pick two wires and say 'huh, I want these wires to be correlated', and then you just add the AND of those wires a bunch of times to your circuit. If you then take the maximum entropy distribution over all the wires, it will want to push the AND of those two wires toward 50/50, because that's the maximum entropy probability for a wire, and so it will induce this fake correlation between the wires. Sorry, hang on, wait—so, going back to the A, B, C-equals-A-AND-B thing: is the idea that the thing you wanted was for A and B to be uncorrelated, but with high enough probability that when you combine them to get C, the probability of C being true is 50/50? I think I actually explained the example wrong. Suppose your circuit is just two wires: it takes two inputs and just outputs both of them. Then the maximum entropy distribution is 50/50 over both wires. That seems correct. Then suppose you just randomly added an AND gate—when you say 'randomly added an AND gate', do you mean you have your true inputs and then you put in an AND gate that takes your two inputs, and the output of that gate is your new output? No, you don't connect it to the output at all—you still output A and B, but you've just added an AND gate as some auxiliary computation that your circuit does for no reason. Okay. Then if you want to take the maximum entropy distribution over all the things your circuit is doing, adding this gate will artificially raise the probability of A and of B, and also induce this correlation between A and B. Why? Sorry—so naively it won't, because your maximum entropy distribution will just be A, B, C all 50/50 and independent. Yeah, and then suppose someone came to you and said 'actually, you need to respect the logical structure of this circuit: C can only be on if A and B are both on'—the probability of C being on has to be the probability of A being on times the probability of B being on, plus the interaction term—then you will get this induced correlation between A and B, plus A and B will be raised to 0.6 or whatever. So the idea is that you have this intermediate thing in your circuit, and you have to be maximum entropy over this intermediate thing as well, and even though you didn't really care about the intermediate thing—it didn't play any role of interest in anything else—because you're being maximum entropy over it, you've got to mess around with the distributions over the other variables. Yeah, that's right.
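Here's the first version of that example worked out numerically: the maximum entropy distribution over (A, B) subject only to the constraint that C = A AND B is 50/50. This is my worked example, not code from the paper.

```python
# Maximize entropy over (A, B) subject to P(A and B) = 0.5. With P(A=1, B=1)
# pinned at 0.5, entropy is maximized by spreading the remaining mass uniformly
# over the other three outcomes. Illustrative only.
from math import log2

p = {
    (1, 1): 0.5,        # the constraint: C = A and B is 50/50
    (0, 0): 0.5 / 3,    # remaining mass spread uniformly (max entropy)
    (0, 1): 0.5 / 3,
    (1, 0): 0.5 / 3,
}

entropy = -sum(q * log2(q) for q in p.values())
p_a = sum(q for (a, _), q in p.items() if a == 1)
p_b = sum(q for (_, b), q in p.items() if b == 1)
cov_ab = p[(1, 1)] - p_a * p_b

print(f"entropy     = {entropy:.3f} bits")
print(f"P(A) = P(B) = {p_a:.3f}")     # ~0.667: pushed above 1/2
print(f"Cov(A, B)   = {cov_ab:.3f}")  # ~0.056: a spurious positive correlation
```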
And I think of this as a special case of a more general phenomenon. There's this computation that you're doing throughout the circuit, and you can pretend there's some notion of time in your circuit: gates that only take input wires as inputs are at time one; a gate that takes those gates' outputs as inputs is at time two, et cetera—you're going down your circuit and computing things. And maximum entropy distributions have this property where things you compute in quote-unquote 'the future' go back in time and affect the past, in the sense that once you know things about the future, that gives you constraints on the past; and if you want to be maximum entropy, it causes you to mess with your probabilities for the input wires—if you learn that you're computing lots of ANDs, you want to raise the probability of your input wires, et cetera. I think this is, in general, going to be an undesirable property, and it's a bit complicated to talk about why it's undesirable. I guess the simplest reason is that someone can mess with you by putting a bunch of stuff at the end of your circuit, and once you get to the end of your circuit, you're forced to go back and revise your initial guesses—whereas it seems like, intuitively, heuristic arguments want to be more deductive in nature: you start with some premises and deduce facts from those premises, and nothing anyone can say after you've started with your premises—nothing anyone can add to your circuit after you've deduced facts from your premises—can cause you to go back in time and say 'actually, my premises are wrong'. It doesn't really make that much sense that if you assume A is 50/50 and B is 50/50, and then you're told 'ah, I'm computing C, which is A AND B', you conclude 'I guess A and B are 60/40 now, and they're also slightly correlated'. That kind of reasoning seems intuitively invalid, and it's the kind of reasoning that various forms of maximum entropy force you into. You can try to do various other stuff to eliminate this problem, which we've explored historically, and I think none of it seems that satisfying or fixes the problem that well. Right, okay—that's a useful response: you just end up with these useless variables that you shouldn't have cared about the entropy of, but you do. So this relates to a question I have about the adversarial robustness of these heuristic estimators—sorry, maybe first—so, I said originally that there were three reasons why we didn't like maximum entropy; I'm happy to talk about the other two as well, or maybe we don't want to. Yeah, well, I think I've been sold by reason one; if you want to talk about reason two and reason three, I'm curious. So I think they're somewhat related to this more general problem of why heuristic arguments are going to be possible to begin with. Reason two is going to be something like: we can't actually maintain full probability distributions over things. There's an important thing a probability distribution has to be, which is consistent with various facts—it has to be a real probability distribution; it has to describe actual things that are realizable. And I think heuristic arguments, or your heuristic estimator, can't be 'real' in that sense.
Your distribution over the possible states your circuit can take can't be a distribution over actual possible states your circuit can take. One reason we want this is that some of our mechanistic anomaly detection applications have us believing things like 'A is on' and 'B is on', but 'A and B' is not on—because we want to be able to drop out mechanisms—and you just straight-up can't do that if you have an actual probability distribution over actual outcomes, because A being on and B being on must imply that 'A and B' is on. And the second reason is that, just in terms of computational complexity, it's very likely going to be impossible to actually have a distribution over realizable—physically possible or logically possible—states, because the act of determining whether something is logically possible is very hard. I think Eric Neyman has a couple of neat examples of circuits where it's #P-hard to have a distribution over the actual states satisfying the circuit, although I'm not familiar enough with the details to talk about them here. Okay. And then there was some third reason, which I'm forgetting. Okay, cool. So, one thing I'm wondering about: you mention in the paper that it's difficult to get adversarial robustness—there are various reasons to think you can be fooled by people adversarially picking bad arguments. Often in the AI realm, when I hear that something is vulnerable to adversaries, I get really worried, because I'm like, 'well, what if my AI is an adversary?' I'm wondering: do you think this worry applies in the heuristic arguments case? Yeah. So first I'll talk about situations in which I think it's basically impossible to be adversarially robust. Suppose you have a hash function that outputs negative one or one, and you're estimating the sum of hash of n over n squared, and suppose your heuristic argument can just point out the values of various hash terms. Your naive heuristic estimator presumes independence—50/50 between one and negative one—so the estimate is zero, and then your heuristic argument consists of pointing out the values of various hash terms. In this situation, there are just always going to exist arguments that continue to drive you up, or continue to drive you down: someone can always say 'hey, hash of two is one, hash of seven is one, hash of nine is one', and you'll just be driven up and up. (It's possible it should be hash of n over n to the 1.5, or something, to make the example bite.) So you're in this kind of awkward spot where, for any quantity with sufficient noise, even if you expect very strongly that the noise averages out to zero, there will exist a heuristic argument that points out only the positive values of the noise and drives you up, or a heuristic argument that points out only the negative values and drives you down. I think this is naively quite a big problem for the entire heuristic arguments paradigm, if you're ever relying on something like someone doing an unrestricted search for heuristic arguments of various forms and feeding them into a heuristic estimator. So there are a couple of ways out. First I'll talk about the simplest way out, which I think ultimately doesn't work, but is worth talking about anyway.
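A toy version of that adversarial-selection problem, with a pseudorandom ±1 "hash" standing in for the hash function; the setup and names are illustrative only, not from the paper.

```python
# An "argument" here is just a set of indices whose exact values the estimator
# is told; everything else is presumed to average out to zero. Illustrative only.
import hashlib

def h(n: int) -> int:
    """Pseudorandom +/-1 'hash' of n (a stand-in for the hash function above)."""
    digest = hashlib.sha256(str(n).encode()).digest()
    return 1 if digest[0] % 2 == 0 else -1

def estimate(revealed: dict) -> float:
    """Heuristic estimate of sum_n h(n) / n**1.5 given the revealed terms."""
    return sum(v / n**1.5 for n, v in revealed.items())

# The estimator's default guess, with no argument supplied.
print("no argument:", estimate({}))                              # 0

# An adversary inspects the first 200 terms but reports only the positive ones.
inspected = {n: h(n) for n in range(1, 201)}
cherry_picked = {n: v for n, v in inspected.items() if v > 0}
print("cherry-picked argument:", round(estimate(cherry_picked), 3))
# Strictly positive: only the favourable terms were pointed out, so the
# estimate is driven up even though the full sum is expected to be near zero.
```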
I think the simplest thing you can do is just have two people searching for heuristic arguments: one person driving the estimate up and one driving it down. Yep. So hopefully, if you have one search process searching for heuristic arguments to drive you up and one to drive you down, you don't systematically expect one of them to be biased. But I think this is a problem. For example, one obvious thing to do is to give both players equal amounts of time to find heuristic arguments. However, if you consider the previous case with hash of n: suppose that instead of being 50/50 between one and negative one, hash of n is actually one with probability one quarter and negative one with probability three quarters. Then the player finding heuristic arguments to drive the estimate down has a 3x advantage over the player finding heuristic arguments to drive it up, and so you're in trouble, because you can have quantities where it's far easier to find heuristic arguments to drive you up than to drive you down, or vice versa. Yeah—although in that case you really should be below zero in expectation, right? Yeah, yeah. I think there's some example in the paper that people can read where it really checks out that debate is not going to work—you have examples like that, which make you very sad. So instead of doing debate, we want to do this other thing, which is going to be kind of interesting. If you imagine searching for heuristic arguments in the setting where you have hash of n over n to the 1.5, or whatever, the way you're going to do this is: you first check hash of one, then you check hash of two, then hash of three, et cetera. Suppose you wanted to drive the estimate of the quantity up. You check hash of one, and suppose it's positive—great, let's include term one. Then you check hash of two, and it's negative—okay, let's not include term two. Then hash of three, also negative—let's not include that. Hash of four is positive, so include it. You're carefully going through these terms and eliminating the ones that are negative. However, the interesting thing is that if you imagine being the person doing this, there in some sense exists a broader heuristic argument that you first considered and then pruned down: you know the values of all four of these terms, and then you selected only the ones that were positive. So if, instead of including only the heuristic argument you produced at the end, you included the entire search process for the heuristic argument—and also, I guess, the heuristic argument you produced at the end—then hopefully the heuristic estimator is forced to say: 'well, we actually looked at all four of these terms and then discarded two of them, but the discarding part isn't relevant, because I already know all four of these logical facts, so I'm forced to include all four of these terms'. So the hope is not that there don't exist heuristic arguments that are misleading, but that there's no way of searching for heuristic arguments that is systematically misleading, if you don't already know the values of the terms. And this is kind of related to the presumption of independence being quote-unquote 'correct on average', or something.
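And a small, self-contained check of that proposed fix: averaged over many random "hash functions", an estimator that conditions on everything the searcher inspected is unbiased, while one that only sees the cherry-picked terms is systematically driven up. Again, purely illustrative.

```python
# Compare (a) an estimator that only incorporates the terms an adversarial
# searcher chose to report with (b) one that conditions on every term the
# searcher looked at, averaged over many random +/-1 sign assignments.
import random

def trial(seed, inspected=50, n_max=500):
    rng = random.Random(seed)
    signs = {n: rng.choice([-1, 1]) for n in range(1, n_max + 1)}
    true_value = sum(s / n**1.5 for n, s in signs.items())
    looked_at = {n: signs[n] for n in range(1, inspected + 1)}
    reported = {n: s for n, s in looked_at.items() if s > 0}   # adversarial report
    est_reported = sum(s / n**1.5 for n, s in reported.items())
    est_all_seen = sum(s / n**1.5 for n, s in looked_at.items())
    return est_reported - true_value, est_all_seen - true_value

errors = [trial(seed) for seed in range(2_000)]
bias_reported = sum(e for e, _ in errors) / len(errors)
bias_all_seen = sum(e for _, e in errors) / len(errors)
print(f"avg error, cherry-picked terms only: {bias_reported:+.3f}")  # clearly positive
print(f"avg error, everything inspected:     {bias_all_seen:+.3f}")  # ~0 on average
```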
If you're presuming independence in a way that's correct on average, then, on average, when people search for arguments and just look at stuff, they can only lead you closer to the truth, hopefully. Sure, so the idea is that, in some sense, we should be able to have some broad search process over arguments that essentially isn't adversarial, and use that—it sounds like that's the upshot. Well, even if the search is adversarial, if you're conditioning on everything the search process knows while it's searching, it can't be adversarial, because all it can do is look at values, and it doesn't know what a value is before it looks—and if you've actually presumed correctly, then on average everything it looks at has an equal chance of driving you up or down. Yeah. I mean, I guess I have this concern that maybe the point of mechanistic anomaly detection was to help me elicit this latent knowledge or something, so I might have been imagining using some AI to come up with arguments and tell me the thing was bad—and if I'm using that AI to help me get mechanistic anomaly detection, and mechanistic anomaly detection to help me make that AI good, well, that's some sort of recursion. It might be the bad type of recursion, or it might be the good type of recursion, where you can bootstrap up, I guess. Yeah, I think it's not very clear to me how this is ultimately going to shake out. This kind of adversarial robustness—how we deal with adversarial robustness—is kind of important. I think there are other ways out besides the one I described, but intuitively, what we want to do is condition your heuristic estimator on everything, even the search for the arguments—just all the things—and hopefully, if you don't let any adversarial selection into your heuristic estimator, then it can't be adversarially selected against, or something. In some sense, there's this thing where, if you're a proper Bayesian and someone is adversarially selecting evidence to show you, the thing that's supposed to happen is that eventually you go 'and now I think I'm being shown adversarially selected evidence', and you know this fact, so you just update on that fact—and then you aren't wrong on average, if you're correct about the kinds of ways in which evidence is being adversarially selected against you. Yeah, although it does require you to—I don't know, I feel like this is a bit of Bayesian cope—because in order to do that: when you were a Bayesian, you were supposed to just get observations and update on observations, but now you're saying, 'oh well, my new observation is that I'm observing that observation, and maybe the observation process is weird'. But then, okay, you've got this layer, and after that layer you just trust it, and before that layer you worry about things being screened off or whatever, and I'm like: okay, where do you put the layer? You have to enlarge your model to make this work—I don't know, it seems a bit much. Yeah, I mean, being a Bayesian is hard, but the thing we're going to try to do is just look over the shoulder of the person selecting the evidence to show you, and see everything that they see, and then hopefully they're not misleading you on average, because they're just looking at stuff, you know?
Yeah, yeah—in that case, it's better than the Bayesian case, or at least I don't have the same complaints I have about the Bayesian case. So, stepping back a bit: there are a few different ways that somebody might have tried to concretize a mechanism. There's this one approach where we have something called a heuristic argument, and we're going to work to try to figure out how to formalize that—I guess we haven't explicitly said this yet, but in the paper it's still an open problem, right? Yeah—or, we have avenues, but yeah. I guess I would say we have some heuristic estimators for some quantities that are deficient in various ways, and we're trying to figure out what's up and get better ones for other quantities, and then unify them—hopefully all the dominoes will fall once we've quested sufficiently deeply. Nice. So there's this approach; there are also some people working on various causal abstractions—so causal scrubbing is work by Redwood Research, and there's other causal abstraction work by other groups. I'm wondering: out of the landscape of all the ways in which one might have concretized what a mechanism is, how promising do you think heuristic arguments are? Yeah, so I'm tempted to say something like: if the causal abstraction stuff works out, we'll just call that a heuristic argument and go with it—'heuristic arguments' is intended to be an umbrella term for this sort of machine-checkable, proof-like thing that can apply to arbitrary computational objects. That's what I'm tempted to say. In fact, we do have various desiderata that we think are important for our particular approach to heuristic arguments: we want them to be proof-like in various ways that, e.g., causal abstractions often aren't, or we want them to be fully mechanistic and have zero empirical parts in them, whereas various causal abstraction approaches allow you to measure some stuff empirically and then do deduction with the rest. I very much think of us as going for the throat on this heuristic arguments stuff. There's a question you might ask, which is: why do you expect heuristic arguments to apply to neural nets? And the answer to that is: well, neural nets are just a computational object, and we want to have heuristic estimators that work for literally all computational objects—the reason it applies to neural nets in particular is just that neural nets are a thing you can formally define—whereas there are a lot of more restricted approaches to explanations that say 'we only actually care about the neural net thing, so we ought to define it specifically for the neural net thing', and stuff like that. So maybe in summary: we're trying to be, in some sense, maximally ambitious with respect to the existence of heuristic arguments. Gotcha. So I'd like to move on a little bit to the overall plan. In some of these early posts, the idea is: step one, formalize heuristic arguments; step two, solve mechanistic anomaly detection given the formalism of heuristic arguments; and step three, find a way of finding heuristic arguments.
So I'd like to move on a little bit to the overall plan. In some of these early posts the idea is: step one, formalize heuristic arguments; step two, solve mechanistic anomaly detection given the formalism of heuristic arguments; and step three, find a way of finding heuristic arguments. And really, I guess these three things could be run in parallel, or at least not necessarily in order. The last batch of publications was in late 2022; how has the plan been going since then?

Yeah, so I think quarter one of 2023 was a bit rough and we didn't get that much done, but I think it's sped up since then, so I can talk about that. Things that I think are good that have happened: we have a pretty good formal problem statement about what it means to be a heuristic estimator for a class of functions, or a class of circuits, that we call bilinear circuits, where all your gates are just squares. So we have those desiderata, and we can talk about what it means to be a good heuristic estimator with respect to that class of circuits, in a way that some stuff we have satisfies and some other stuff we have doesn't, so hopefully that's a fully formal problem. We have a better class of heuristic estimators than cumulant propagation, or an upgrade on cumulant propagation, that we call reduction propagation. It's slightly different: instead of keeping track of cumulants you keep track of a different thing called a reductant, and it's slightly better in various ill-defined ways, but it's more accurate empirically on most distributions of circuits we've tried, etc. And we have a clearer sense of what it means to do mechanistic anomaly detection, how that needs to go, and what the hard cases are going to be, in terms of being unable to distinguish between mechanisms. I notice I didn't really answer your original question, which was how it's going. I think it's going par or something: maybe Q1 was slightly below par and we're doing slightly above par in Q2, so it's on average par or something.

Par in the sense that you've gotten a better sense of some special cases, but not yet knocked it totally out of the park?

Yeah, or: if you had told me at the beginning of 2023 that this is how much progress we would have made, I would have said OK, that seems good enough to be worth doing, let's do that, this is how we should spend our time, or something.

Cool. I'm wondering, given that progress, out of the steps of formalizing heuristic arguments, solving mechanistic anomaly detection given that, and finding heuristic arguments, how optimistic are you about the steps? Which ones seem most difficult?

Yeah, I think the thing that's most difficult is formalizing heuristic arguments in a way that makes them findable. Like, here's a formalization of heuristic arguments: they're just proofs. So there's some sense in which heuristic arguments being findable, or heuristic arguments being compact, is intrinsic to the nature of what it is to be a heuristic argument. I think doing useful stuff with heuristic arguments is pretty unlikely to be the place where we fall down. It's kind of possible that heuristic arguments get formalized and then we go, "darn, we can't do any of the things we thought we were going to be able to do with heuristic arguments", because it turns out they're very different from what we thought they were going to be like. I think that would be pretty good, though, because we would have heuristic arguments, and we would in some sense be done with that part, and it would be very surprising to me if they were not useful in various ways. I don't know if that was a complete answer.

I think that worked.
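Editor note: a toy version of the moment-propagation idea behind the "cumulant propagation" mentioned above, for circuits built out of sums and squares, may help here. The sketch below is not ARC's cumulant or reduction propagation estimator; it just tracks a (mean, variance) pair per wire, presumes the inputs to each gate are independent and roughly Gaussian, and compares the result to sampling on a tiny circuit where that presumption is false because both branches share the input x. The particular circuit and the Gaussian moment formulas are choices made for the example.

```python
import random

# Each wire carries a (mean, variance) pair; gates propagate these moments while
# presuming their inputs are independent and roughly Gaussian.

def hsum(a, b):
    """Sum gate: means add; presuming independence, variances add too."""
    return (a[0] + b[0], a[1] + b[1])

def hsquare(a):
    """Square gate: mean and variance of u^2 when u ~ N(mean, variance)."""
    m, v = a
    return (m * m + v, 2 * v * v + 4 * m * m * v)

def heuristic_estimate():
    x = (0.0, 1.0)                      # x ~ N(0, 1)
    y = (0.0, 1.0)                      # y ~ N(0, 1)
    return hsum(hsquare(hsum(x, y)),    # (x + y)^2
                hsquare(x))             # + x^2  (shares x, so independence is only presumed)

def monte_carlo(n=100_000, seed=0):
    rng = random.Random(seed)
    vals = []
    for _ in range(n):
        x, y = rng.gauss(0, 1), rng.gauss(0, 1)
        vals.append((x + y) ** 2 + x ** 2)
    mean = sum(vals) / n
    var = sum((v - mean) ** 2 for v in vals) / n
    return mean, var

h_mean, h_var = heuristic_estimate()
mc_mean, mc_var = monte_carlo()
print(f"mean: heuristic {h_mean:.2f} vs sampled {mc_mean:.2f}")  # 3.00 vs about 3.0
print(f"var : heuristic {h_var:.2f} vs sampled {mc_var:.2f}")    # 10.00 vs about 14
```

On this circuit the mean estimate comes out exactly right while the variance estimate misses the correlation between the two branches, which gives some flavour of how presuming independence can be wrong in a particular case while still being a reasonable default.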
I'm wondering if there has been any experimental work on trying out mechanistic anomaly detection, beyond... I guess you've mentioned there are various interpretability things that you think won't scale. Is there any promising experimental work you're aware of?

So, I'm not very aware of experimental work in general, but I think Redwood is currently working on what they're calling ELK benchmarks, where they're trying to do this sort of mechanism distinction on toy problems like function evaluation. I don't know how that's going, because I'm not up to date on the details.

Fair enough.

I think ARC employees often write code to check whether or not heuristic estimators do certain things, or check how empirically accurate they are, or find counterexamples by random search or something. You probably don't want to call that experimental work, because we're just checking how accurate heuristic estimators for permanents of matrices are, or whatever. So the short answer is: I think Redwood is doing some stuff that I don't know that much about, and I'm not really aware of other stuff being done, but that's probably mostly because I'm not that aware of other stuff, and not because it isn't being done. Although it probably also isn't being done in the way that I would want it to be done, or whatever.

Gotcha. So it's about time for us to be wrapping up. Before I close up, I'm wondering if there's any question you think I should have asked.

People often ask for probabilities that these sorts of things all work out, to which I often say: one seventh that everything works out roughly the way we think it's going to work out and is super great, within five years-ish. Maybe not quite five years now, because I've been saying one seventh over five years for more than a few months, so maybe four and a half years now. But other than that, I don't think so.

Fair enough. So finally, if people are interested in following your research, or if we have bright minds who are perhaps interested in contributing, what should they do?

Yeah, so ARC posts blog posts and various announcements on our website, alignment.org, and we're also currently hiring, so you can go to alignment.org, click the hiring button, and be directed to our hiring page.

Great. Well, thanks for talking to me today.

You're welcome. Thanks for having me.

This episode is edited by Jack Garrett, and Amber Dornace helped with transcription. The opening and closing themes are also by Jack Garrett. Financial support for this episode was provided by the Long-Term Future Fund, along with patrons such as Ben Weinstein-Raun, Tor Barstad, and Alexey Malafeev. To read a transcript of this episode, or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.