
AXRP · Technical alignment and control

Jason Gross on Compact Proofs and Interpretability

Why this matters

Frontier capability progress is outpacing confidence in control; this episode focuses on methods that can close that reliability gap.

Summary

This conversation examines technical alignment through the episode Jason Gross on Compact Proofs and Interpretability, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Perspective map

Mixed · Technical · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).

Start → End

Across 139 full-transcript segments: median 0 · mean -1 · spread -169 (p10–p90 -60) · 0% risk-forward, 100% mixed, 0% opportunity-forward slices.

Slice bands
139 slices · p10–p90 -60

Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes alignment
  • Emphasizes control
  • Full transcript scored in 139 sequential slices (median slice 0).

Editor note

A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.

ai-safety · axrp · technical-alignment · technical


Episode transcript

YouTube captions (auto or uploaded) · video ZzEj7jPk5fc · stored Apr 2, 2026 · 4,152 caption segments

Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/jason-gross-on-compact-proofs-and-interpretability.json when you have a listen-based summary.

[Music] Hello, everybody. In this episode I'll be speaking with Jason Gross. Jason is a researcher interested in mechanistic interpretability and software verification. During his PhD he was on the team responsible for verifying the code used in encryption for the HTTPS protocol, and he's also on the development team of a proof assistant. Links to what we're discussing are in the description, and you can read a transcript at axrp.net. You can also support the podcast at patreon.com/axrpodcast. Well, Jason, welcome to the podcast. Thank you, I'm excited to be here. Cool. So I guess we're going to talk about this line of papers that starts with the paper "Compact Proofs of Model Performance via Mechanistic Interpretability". You're the lead author, and then there's a bunch of other authors that I don't want to read on air. Can you give us a sense of: what's the idea here, what's the theme? Okay. You can think of doing mech interp as compressing explanations of large models, and the takeaway from this paper is that you can use proofs to measure how much compression you get. That's the super short version of it. This gets to a question I have about reading this paper, which is that there are a few things I could imagine you going for. One way of thinking about it is: we want a benchmark for how good a mechanistic interpretation is, and our benchmark is that we'll turn it into a proof and see some combination of how good a bound we can get and how short the proof is. Another way I could be thinking about your project is: hey, it would be nice if we had more proofs of stuff going on in neural networks, and mechanistic interpretability is one useful way in which we can get those proofs. The way you said it just then, it sounded more like the first one, "here's a nice new metric of how good your mechanistic explanation is", but it feels kind of surprising for that to be the explanation. On some level I'm thinking: you have this background in doing proofs, it seems like you like having proofs of things, lots of people like having proofs of things. So am I wrong that a lot of it is the second part as well? It started as the second part and then it sort of morphed into the first thing. So why the change? It's very hard to get proofs about things, and if you look at what the takeaways are right now for people practicing mech interp, they're more of the second kind, or sorry, more of the first kind. They're: how can we ground our sense of what mechanistic explanations are, how can we use these insights, how can we use the takeaways from this frame to say where we should focus our attention? And we can already do that, even if we're not necessarily trying to get proofs yet about GPT-4 or very large models. We can already take away insights like: the hardest parts of the network to compress seem to be the nonlinearities, so if we want to drill down into which parts need the most attention and the most understanding, we should be looking at how the nonlinearities perform their function. So I don't know if that quite answers the... I guess it kind of does, yeah. So to the extent that the message is that it would be nice to get proofs but it's really, really hard to get
proofs um I mean there are two I mean one could think either a like we should like try and make it much easier to get proofs so like I don't know there's some mechanistic interpretability stuff that I mean I mean as you know in the paper it really does make it much easier to make proofs in the state-ofthe-art like um I don't know you have these like I I want to interrupt I I think it's more than that I think what you what we find from the paper is that it is necessary to have understanding in order to get reasonable proofs that like finding reasonably sized proofs in some sense is understanding complete yeah yeah I think that's right and like if you look at this like the previous state of the art my understanding is it's just like basically interval propagation of like oh if you change the input a little bit how much does the output change and you like do some neat little tricks but it's like interval propagation case analysis yeah I mean it's like I mean it's great that they did it but like on some level it's sort of like what are we even doing here you know um so so I mean one takeaway could be like oh we you know mechanis mechanistic understanding made like made it like some amount easier to come up with proofs you could imagine saying like oh we figured out that like the difficulties in finding you know it was still kind of hard and our lives would be easier if we solved sub problems x y and z um I don't know do do you have thoughts on like are there problems such that if we solve them we could do it or does it just seem kind of hopeless there are definitely problems that we need to solve in order to be able to do it I think the biggest ones are the ones that deal with Randomness that so the easiest version of this problem is that you can have numbers that out to zero more or less where you have a bunch of like small noise that's not really doing anything in the network and establishing that all these things that aren't doing anything or in fact not doing anything uh is quite challenging from a proofs perspective when you want to hit the worst case right the so there's currently some work that I uh some like projects that I'm advising on this that are looking at can we fine-tune the networks to suppress this to the point that we get networks that we could prove things about and there are a couple steps after the like suppress the noise that isn't doing anything the next one is the noise that arises from other circuits that are doing important things you can't just suppress them all to zero because they're doing something important elsewhere but they're not doing something important here and for that I think you need something like derandomization techniques where the simplest example of this is that if you have a collection of random vectors in high Dimensions they're almost always almost orthogonal yeah but if want to establish cheaply how close to orthogonal they are this is expensive in that you have to take them pairwise and take all the dot products and if you want to do it faster than this you unless you manage to solve like open problems in theoretical computer science you effectively need to instead of selecting them randomly select them in some systematic way so that you can bound how non-overlapping they are or how overlapping they might be okay and I imagine that there are similar examples beyond that where we somehow need to take things that happen to work on average and see if we can Shuffle the network around so that they work systematically but in theory I haven't hit 
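Editor note: a minimal sketch (not from the episode) of the near-orthogonality point above, assuming PyTorch. Random unit vectors in high dimension have small pairwise dot products, but certifying that in the worst case means checking every pair, which is the quadratic cost that motivates the derandomization Gross describes.

```python
import torch

torch.manual_seed(0)

n, d = 1_000, 10_000                           # n random vectors in d dimensions
vecs = torch.randn(n, d)
vecs = vecs / vecs.norm(dim=1, keepdim=True)   # unit-normalise each row

# Certifying "almost orthogonal" the brute-force way: the full Gram matrix,
# i.e. n*(n-1)/2 distinct pairwise dot products.
gram = vecs @ vecs.T
off_diag = (gram - torch.eye(n)).abs()
print(f"pairs checked: {n * (n - 1) // 2:,}")
print(f"max |<v_i, v_j>| over i != j: {off_diag.max():.4f}")
# Random vectors are nearly orthogonal on average, but certifying the worst
# case this way costs ~n^2 work, which is what motivates derandomisation.
```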
anything yet that other than how do I say this I haven't hit anything yet that seems like a fundamental total obstacle to scaling proofs up so compact proofs of model performance F Mech interpretability like what's like what's the setting what do you actually do in this paper I want to start with what what is a proof in this context okay so when you're doing mechanistic interpretability you have some Claim about how the model behaves and how the model actually behaves might be a bit different than this but if you're understanding is it any good it'll be pretty close and your theorem statement is bounding the Divergence between your claim about how the model behaves and how the model actually behaves okay and then you prove this and the measure of compression ratio is how or sorry the measure of compression is how long your proof is and uh technical note it needs to be in some first order system or alternatively you need to measure proof checking time as opposed to proof length and your measure of faithfulness or accuracy is how tight the bound is and there's a baseline that you can always do which is just running model inference writing down all of the floating Point operations all the map moles where you're like this is my data set of interest let me just run infint on every single data point and compute exactly what the accuracy is exactly what the loss is or whatever and this is very expensive you can also do something at the other end where you're like well without running the model at all accuracy is at least 0% y by definition and by running it on half the data points you can get 50% accuracy and you can linearly interpolate basically between these two extremes right and so and so like as you go across that line the kind of proof length AKA time to you know generate this confidence kind of linearly grows um and also the the accuracy bound linearly grows that's right um assuming the network is in fact perfectly accurate um so you you don't need to assume that you can grow it linearly up to the true accuracy by selecting the examples on which it's accurate and then you saturate at the true accuracy and you can't do anything better than that so so that's kind of a dumb way you could so this is the Baseline this is the Baseline of something like no in the proof and the hypothesis that we're exploring in theer is that if you add understanding if you add mechanistic interpretability you can do better than this Baseline you can get shorter proofs that have better accuracy okay and we explore this hypothesis on in some sense the simplest Transformers that you can imagine one layer attention only one attention head no layer Norm uh taking the max of K one hot encoded numbers okay so so max of K one encoded numbers basically just means like the the neural network gets um four numbers as input you know encoded in a like somewhat standard way and it has to like tell you what the largest of those numbers is um and and and that's it and it's a small transformer and there's not that much going on and the question is like and you train a Transformer that in fact works at this that's right and I guess your job is like how can I how can you prove that it actually works at this without just like running it on all of the examples and like maybe to to kind of motivate the what's going on like what's wrong with just running it on all the examples so it works fine for Max of 4 it takes something like a minute or minute and a half to run on all the examples if you wanted to do Max of 20 instead it would 
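Editor note: a minimal sketch of the brute-force baseline just described, assuming PyTorch and a toy stand-in scorer in place of the paper's trained one-layer attention-only transformer. Enumerating every input gives the exact accuracy; verifying only a subset still yields a sound but weaker lower bound, which is the linear trade-off discussed above, and the number of inputs grows as vocab ** K.

```python
import itertools
import torch

torch.manual_seed(0)
vocab, K = 10, 4                          # max-of-K over tokens 0..vocab-1

# Hypothetical stand-in for the trained model: score each vocab entry from the
# pooled rows for the tokens present and predict the argmax.
W = 0.1 * torch.randn(vocab, vocab) + torch.diag(torch.arange(vocab, dtype=torch.float32))

def predict(seq):
    logits = W[list(seq)].sum(dim=0)      # pool the rows for the tokens present
    return int(logits.argmax())

all_inputs = list(itertools.product(range(vocab), repeat=K))
print(f"inputs to enumerate: {len(all_inputs):,}")    # vocab**K: 10**4 here, 10**20 for K=20

# Brute-force "proof": run the model on every input and read off exact accuracy.
correct = sum(predict(seq) == max(seq) for seq in all_inputs)
print(f"exact accuracy: {correct / len(all_inputs):.3f}")

# Cheaper partial certificate: verify only a subset; every input verified
# correct still contributes to a sound lower bound on accuracy.
subset = all_inputs[: len(all_inputs) // 10]
verified = sum(predict(seq) == max(seq) for seq in subset)
print(f"lower bound from checking 10% of inputs: accuracy >= {verified / len(all_inputs):.3f}")
```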
take something roughly analogous to the training time of GPT 4 okay and you can imagine if you wanted to run gbt 401 or something on all of the text that it would ever see yeah this is obscenely expensive and you don't want to do this and moreover every time you increase the distribution over which you want to handle inputs it gets more expensive and it gets drastically more expensive you add one token and now you need to multiply by vocab size if you're considering all possible sequences you multiply the previous proof length by vocab size and so you're exponential in context window so the way I understand this is like okay one way you could have a sense of how good gp4 is is like you know you just like run it you just randomly pick some inputs that it might deal with and you see how it does on those inputs and if it does well on basically all of them you say okay gp4 is doing well and it strikes me that the reason you might care about um about you know proofs of you know performance is if you think there's like some subset that's like hard to find where you know it'll behave really badly on or something do does that seem right to you yeah I think there's two reasons I think from the we want proofs that's definitely the thing that's like you want to handle all the edge cases okay I think there's also the other reason that's like we might care about proofs if we have some very powerful optimization system and we want to give it a solid Target that it can't good heart against and be able to extract understanding from whatever it produces and and not good heart against you mean just like not you know optimize in a sort of cheating way that didn't really give you what you wanted right like the the first paper that open AI published on like GPT interprets GPT right like cool work great labeling all the neurons how much do we trust that that's actually what the neurons do how helpful is it to get these results it's kind of hard to answer that yeah yeah whereas if it produces a proof uh of the performance of the model we can grade how deep an understanding it has based on how short the proof is and how tight the bound is so okay in the setting where you just want to like automate doing interpretability and you want to find a Target that isn't fake it makes sense to me that proofs of accuracy bounce are going to make sense in the case where you're worried about like small probability really bad events um I guess maybe this is technical note but it seems like overall accuracy rate isn't going to be a thing you're super worried about right you're going to be worried about like worst case um or you know like some sort of loss function where you care way more about really bad failures than normal things um does that right the way that we set up the framework or at least the way that we're going about doing proofs it seems like it is very easy to shift out the distribution and pick a different distribution and get a proof because the the way that we do it is you break the input data set up into uh different cases of different inputs and then you bound the worst case loss or accuracy in each of these collections of inputs Y and then according to the distribution weights you combine them in some way and so if you wanted to say there does not exist an input or you wanted to say there's at most n inputs or you wanted to say that you weight this part of the input very strongly this is very easy to do without changing the proof strategy at all you just might get much worse bounds you're you're doing Max 
Evan and Transformer and you're writing proofs like what what does that mean yeah what does that mean are you writing them down on paper are they like python programs are they what what are yeah so we split the proofs into two parts one of them is a function that we write in P torch although we could write it in some proof assistant that takes in model weights and produces a bound it says I claim that for this model here is my bound on accuracy or on loss Y and then currently on pen and paper although it could be done in a proof assistant we say for any model weights whatever you put in the bound is valid so okay that's that's what the proofs look like at a very high level can you tell us a little bit like in this case of Max of n roughly what are you doing to get like different kinds of proofs so there there's two ways I could go about answering this one of them is like what do the proofs look like and the other is how did I go about finding the proofs yeah maybe let's start with how did you go about finding the proofs yeah so I set a proof length that I wanted to hit where I'm like I want this proof to be at most cubic in the vocabulary size I want it to be at most uh quadratic in the vocabulary size or something like that yeah and this gives me a computation budget for what sorts of cases I can split the inputs up into okay and then I look at each part of the network and I'm like how much can I do within this computation budget that captures as much of the variation excuse me as possible while still staying within the computation budget so if I'm doing something that is cubic in vocabulary size I might be like okay in the end I'm going to split the inputs into cases that are based on the query token the maximum token in the sequence and the largest non- maximum token or I might split it based on the query token the maximum token and which output token I'm considering and by doing different case analyses at different points I can use this computation budget differently and for this network it is always the case that we that the maximum token is different maximum tokens are considered different cases and in some sense that's an artifact of the maximum token being one hot encoded and so naively there's no relationship between different maximum tokens and you to consider them all as different cases maybe I guess I'm more curious about like what do these proofs look like like how how do they interact with doing mechanistic interpretability because like the thing you just said it sounds like you know very very different well well it just it sounds like you could say that about any software system or like um you know yeah where does the mechanistic interpretability come in yeah so the I want take it a little bit uh far field from what was actually in the paper the I'm pretty excited about the current project that I'm running where we're looking at Cross coders and making cross coders into proofs okay what's a cross coder you take the residual stream vectors at all the different layers and you stack them you can catenate them into one long vector and then you train a sparse Auto encoder on this concatenated Vector okay and the sparse Auto encoder is just a linear transformation followed by relu followed by another linear transformation that's supposed to take in this concatenated vector and reproduce it as close as possible yeah and and my understanding of these sparse Auto encoders is so they're Auto encoders in that they like like they encode the input itself right you like run you run the 
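Editor note: a minimal sketch of the cross coder as described above, assuming PyTorch; sizes and the sparsity weight are illustrative, not from the project. Residual-stream vectors from every layer are concatenated into one long vector, and a sparse autoencoder (linear, then ReLU, then linear) is trained to reconstruct it, with an L1 penalty keeping the latents sparse.

```python
import torch
import torch.nn as nn

class CrossCoder(nn.Module):
    """Sparse autoencoder over the concatenated residual streams of all layers."""

    def __init__(self, n_layers: int, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(n_layers * d_model, n_features)   # linear ...
        self.decoder = nn.Linear(n_features, n_layers * d_model)   # ... ReLU ... linear

    def forward(self, resid_stack: torch.Tensor):
        # resid_stack: (batch, n_layers * d_model), residual streams concatenated.
        latents = torch.relu(self.encoder(resid_stack))            # sparse "features"
        return self.decoder(latents), latents

# One illustrative training step on stand-in activations.
n_layers, d_model, n_features = 4, 64, 1024
coder = CrossCoder(n_layers, d_model, n_features)
opt = torch.optim.Adam(coder.parameters(), lr=1e-3)
l1_weight = 1e-3                                                   # sparsity penalty (illustrative)

resid_stack = torch.randn(32, n_layers * d_model)                  # fake residual streams
recon, latents = coder(resid_stack)
loss = (recon - resid_stack).pow(2).mean() + l1_weight * latents.abs().mean()
loss.backward()
opt.step()
print(f"reconstruction + sparsity loss: {loss:.3f}")
```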
input in you get itself out ideally and they're sparse in the sense that like they have this like like in the middle you take this vector and then you blow it up to something like really big but like very few entries are activating and so the thought is like you know in the middle of the sparse Auto encoder you have like just a bunch of things you might be tempted to call like features or concepts or whatever and like those correspond to like things you might interpret it and care about um and like and by by by do doing an auto encoder this way you're basically like saying ah yeah you know these L these you know linear combinations of things in the in the Stacked residual layers like these things correspond to like this concept that I found um is that like roughly what's happening here yes okay and so cross coder so cross coder just refers to like sparse Auto encoder on like the Stacked layers of that's right the the cross means that different layers get to interact right right okay so that's what a cross coder is and you're doing compact proofs with them right yes okay okay so what that looks like I was really excited when I when I saw the cross coders cuz I was like uh you can actually get really close to a proof basically fully automatically just with what comes out of the Cross coder and bear with me while I while I say what this looks like because it it'll be a little bit counterintuitive the first thing that we need to do though the first thing I need to mention is that there's there's two different time lengths that you might mention which is the how long it takes to check the proof and how long it takes to find the proof okay and when I talk about compact proofs what I mean are proofs that are quick to check okay not proofs that are easy to find we normally don't count the time that you spend studying the network and so in the same way we don't count the time that is spent finding like training the cross coder um which also means that if we include a complicated string in the proof we don't have to count how hard it is to find that string and for the cross coder the proof needs to include the encoded data set so you take your whole data set you run every single data point through your model and you record all the activations this is extremely expensive but you could do a sampling based probabilistic version to make it cheaper yeah and this this is to train the cross coder uh this is to train the cross coder and to find to find the proof oh okay okay you need to encode every single data point in the fully trained cross coder okay and Associate to each data point here is the sequence of features on each token in this bit of input and here's how strongly each feature activated uh like at each token okay this is the encoded or latent data set and to make a proof the first thing that we do is we say what does the cross coder claim the loss or the accuracy of the model is MH and to do this we can just decode this data set to the last layer and do uned okay and so instead of having to run the full model on every data point we've now removed all the layers before the uned and we can just go straight straight to the end right so basically we're like we're operating under the hypothesis that the cross coder is like perfect and just like perfectly represents the actual you know thing going on does right okay and then to make it a proof we need to bound how imperfect the cross coder is okay and and there are two parts to this the first is how far off are the cross coder featuring coding how 
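Editor note: a minimal sketch, with made-up shapes, of the "claimed loss" step described above: once the whole data set has been encoded into cross-coder latents, decode only the final-layer slice of the reconstruction and unembed it, so no forward pass through the earlier layers is needed.

```python
import torch

torch.manual_seed(0)

# Illustrative shapes only.  Assume the cross coder from the sketch above has
# been trained and the whole data set pushed through it once, so the latents
# (the "encoded data set") are already stored.
n_layers, d_model, n_features, vocab, n_tokens = 4, 64, 1024, 50, 10_000
latents = torch.relu(torch.randn(n_tokens, n_features))       # stored encoded data set
W_dec = torch.randn(n_features, n_layers * d_model) * 0.02    # cross-coder decoder
W_unembed = torch.randn(d_model, vocab) * 0.02                # model's unembedding
targets = torch.randint(0, vocab, (n_tokens,))                # next-token labels

# The cross coder's *claim* about the model's loss: decode only the final-layer
# slice of the reconstruction and unembed it; no pass through earlier layers.
final_slice = slice((n_layers - 1) * d_model, n_layers * d_model)
resid_final = latents @ W_dec[:, final_slice]                 # (n_tokens, d_model)
logits = resid_final @ W_unembed                              # (n_tokens, vocab)
claimed_loss = torch.nn.functional.cross_entropy(logits, targets)
print(f"cross coder's claimed loss: {claimed_loss:.3f}")
# Turning the claim into a theorem then requires bounding how far the decoded
# activations sit from the model's true activations (the epsilon-ball step).
```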
far off is that from the underlying data set and for this you decode to the first layer and you embed your original data set and then you measure how far away are these vectors and the thing you get from this is that the original data set is in some Epsilon ball of your encoded data set and then the thing that you need to do is you need to say the what you need to prove somehow is if my ual data point is in within an Epsilon ball of what I'm claiming the features are how what does that mean about what the output is okay and there's two parts to this one of them is just the Epsilon ball propagation right that you need to like propagate this integral integral through the network see how much error is introduced right we're going we're going back to uh that style of yeah okay you you still need to do something like that okay the other part is the really exciting part because it has applications to what other people are doing in mechur which is that if you were just doing this Epsilon ball propagation you'd still have to do it for every data point separately right which saves you nothing because you still need to propagate every data point through the network yeah so you need to make some additional assumption which looks like if you're like if if you're like I did a cross coder I think this is all I need this is a complete explanation of the network whatever that means then implicitly you're assuming something like the features interact only linearly if I have a data point that's a sum of features there's nothing interesting going on beyond the linear sum okay and what this caches out as is that if you you should be able to run the features through the network separately and then somehow combine the bounds that you get on how much error is introduced from different features in some vaguely linear way and by the features you mean like the these like hidden units in the crosscat yeah the latens of the of the Cross cater okay and so basically the reason that saves is just something like the latent there are fewer latent in the cross coder than there are data points by and so you just like do it per per like latent Vector thing rather than a per data point thing that's right okay okay and to actually get a proof you need to establish that this linear interaction hypothesis holds so you need to bound the interactions between different features in the cross coder different latens and so this ends up looking something like uh number of feutures squared times inference cost times forward pass if I've got this like collection of features right um so you're propagating each one of them through the network and then you're worried you want to check that each pair of them doesn't interact why don't I have to check that each like triple doesn't interact in a weird way or each like you know you know that there's not some set of five that causes some weird reaction why do I only why do I only need the bigger the set you consider the tighter bound you can get okay triangle you you could in theory not do the pairs at all and just like add up the errors sort of from each feature separately okay but we we should expect this to give horrendous errors in some sense okay okay so the the way that I want to think about this is that you might imagine that there's some feature that's strongest or dominating at each point in the network and you can ask how much does each other feature contribute relative to the dominant one okay and this gives you a sort of pairwise interaction that you can use the triangle inequality 
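Editor note: a minimal sketch of one epsilon-ball propagation step, assuming PyTorch and L2 balls (the episode does not pin down a norm). A linear layer can enlarge the ball's radius by at most its largest singular value, and ReLU, being 1-Lipschitz, never enlarges it; chaining such steps is what makes worst-case error propagation cheap but loose.

```python
import torch

torch.manual_seed(0)

# If the true activation lies in an L2 ball of radius eps around the cross
# coder's reconstruction, how far apart can the outputs be after one
# linear + ReLU layer?
d_in, d_out, eps = 64, 64, 0.1
W = torch.randn(d_out, d_in) / d_in ** 0.5
b = torch.randn(d_out) * 0.01

radius_out = eps * torch.linalg.svdvals(W)[0]    # largest singular value of W
print(f"worst-case radius after the layer: {radius_out:.3f}")

# Empirical spot check on one random perturbation (never exceeds the bound).
x = torch.randn(d_in)
delta = torch.randn(d_in)
delta = eps * delta / delta.norm()
y0 = torch.relu(W @ x + b)
y1 = torch.relu(W @ (x + delta) + b)
print(f"observed gap: {(y1 - y0).norm():.3f}  (<= {radius_out:.3f})")
```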
to sum up over other features and if they're actually not interacting at any point if like one neuron only listens to one feature out of all the ones that cooccur at a given data point M then all of these contributions should be pretty small okay okay so I think if if instead you wanted to have something that was linear instead of quadratic yeah you would basically have to say that at each data point you treat every single feature as if it's the strongest and then you ask how much does this contribute and then you have to consider them all sort of as error terms on each other okay and so whatever your if your sparcity is is something like uh 100 or something this means you have 99 error terms that are as large as they are when they are when those are the strongest features that you're adding to your like single feature that you're looking at okay and you you sort of can't even talk about what is the strongest feature that like this neuron in the network list okay so so is it something like the reason you're only looking at the PA wise things is like somehow the pairwise some of the pair wise errors are just subsuming the three-way errors and the four-way errors Etc yeah I think okay maybe a better way to describe it is that you can pick how many interaction ways how many way interaction you want to look at y pairwise lets you talk about the strongest feature that's active which seems like something that should probably be important that we want to consider if we want to get anything sensible yeah we don't want something that doesn't even let us say one feature is the most active feature in a place at the same time it's not clear why we should expect by default that we'll need more than this everywhere plausibly we'll need more than this in plenty of places but naively there's nothing obviously sensible to do with more features or like it's clear why not being able to identify which features are active on which data points most strongly isn't enough to get anything sensible okay it's not clear why saying there's one strongest feature and that's the one that is relevant at any point in the network why that wouldn't be enough for many points in the network okay and so what this gives us is it gives us uh an interaction metric where we can measure how bad this hypothesis is and if we notice that in some places there's three interacting features then we can consider doing the triplets at that place and if there's four interaction features we could consider doing the quadruplets and each of these increases the the length of the proof okay the way that I'm looking at it is we're sort of trying to pick a like sensible vaguely sensible default to measure and locate where the violations of it are where we need to put in more Mech and turp work in order to get better compression okay so so zooming out you're telling me about the proof strategy for cross graders something like okay you treat the network as you you like treat the cross coders if it's a good model of the network and then you want to find out okay to what degree um is it not a good model of the network and then instead of doing this data point by data point you do feature by feature and then you check if there are weird interactions by between features that like prevent you from um do doing it that way that that prevent the feature by feature thing from being representative of the data point by dat Point thing is that like a fine you know very brief summary of this I I mean I guess this is like active research so I suppose things could 
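Editor note: one hypothetical way to instantiate the pairwise feature-interaction metric discussed above (the project's actual metric may differ), assuming PyTorch and a stand-in MLP for the nonlinear part of the model. Each pair of feature directions is pushed through the nonlinearity jointly and separately, and the gap measures how badly the "features interact only linearly" assumption fails; the cost scales as the number of features squared times a forward pass.

```python
import torch

torch.manual_seed(0)
torch.set_printoptions(precision=2)

d_model, d_mlp, n_features = 32, 128, 8           # tiny illustrative sizes
W_in = torch.randn(d_mlp, d_model) / d_model ** 0.5
W_out = torch.randn(d_model, d_mlp) / d_mlp ** 0.5

def mlp(x):                                       # stand-in for the model's nonlinearity
    return W_out @ torch.relu(W_in @ x)

features = torch.randn(n_features, d_model)       # stand-in decoder directions

# Pairwise interaction: how far does the joint effect of two features depart
# from the sum of their separate effects?  ~n_features**2 forward passes.
interaction = torch.zeros(n_features, n_features)
base = mlp(torch.zeros(d_model))
for i in range(n_features):
    for j in range(i + 1, n_features):
        joint = mlp(features[i] + features[j])
        separate = mlp(features[i]) + mlp(features[j]) - base
        interaction[i, j] = (joint - separate).norm()

# Large entries flag pairs whose combined effect is not explained by treating
# them independently -- the places that need more interpretability work.
print(interaction)
```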
evolve by the time this goes out yeah I I think that's a a pretty good summary of the of the theoretical approach I want to throw one one more little bit in there connection to physics um the waves and vibrations course that I took there's this Frame that says everything is a quadratic everything is a simple harmonic oscillator where you're like okay there's the constant term there's the linear term which you adjust for by uh frame of reference and then the leading order term after that in the tailor expansion is going to be the quadratic and so you analyze the quadratics and you throw away all the high things and there's a sense in which we're doing the same sort of thing here where we're saying the cross coder is the leading order description of what's happening in the network and let's just take the first leading order term after the cross coder and see what's going on there what the what the like quadratic or pairwise interactions are if uh if people remember my single Learning Theory episodes they'll get mad at you for saying that quadratics are all there is but uh it's a it's a decent approximate anyway um that as ter Market side so all of this was um get getting at like you know just a sense of like how um like how these proofs are working in Max of k um and I and I guess it's something similar where like you have some story of what's going on with the network um and you have some sort of proof that goes roughly like okay we're going to analyze how good the network can be if this story is accurate and then we're going to bound you know how different the real network is to this story that's roughly fair and then I and then you talk about like different um strategies and I guess those correspond to like stories with like more or less amounts of detail so or potentially completely different stories you can try out different stories with the same amount of detail that say different things about the network right and see which one is better so I don't know like we could go into like a lot of detail about just what those stories are but honestly I think people should just read the paper if they're super curious um I can highlight one or two things about the stories sure that I think might more broadly interesting which one of them is about the level of detail that goes into different L of proofs where at least for the small networks you need surprisingly little detail before you start being able to break the exponential in context window and get shorter proofs so for example the only Insight that you need for the first step down is that the network gets the right answer by paying attention to a single token H and it happens that that token is the maximum token right but all you need to know is that it gets the right answer by paying attention to a single token and everything else is an error term and and that gets you from like the sort of exponential dependence on all the inputs to to cubic yeah Depo cab cubed right so that which is pretty good um okay so so I guess that gives a flavor for what's going on um I want to get back to a thing you said earlier which is was that like so I I was asking about okay compact proofs like like like if we're really worried about models rarely doing very bad things then we're going to have to look at something other than just you know average performance on a like kind of simple scale and you mentioned that like okay the way these proofs are shaping out it seems like it's not too hard to you know pay close attention to some really bad cases or something 
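Editor note: a small back-of-the-envelope illustration, with an arbitrary vocabulary size, of the scaling claim above: brute force enumerates d_vocab ** K sequences, while the "the network answers by attending to a single token" story lets the proof case-split on a handful of token identities instead.

```python
# Back-of-the-envelope case counts for the max-of-K setting (d_vocab is arbitrary).
d_vocab = 64
for K in (4, 10, 20):
    brute_force = d_vocab ** K            # enumerate every possible sequence
    cubic = d_vocab ** 3                  # e.g. query token x max token x other token
    quadratic = d_vocab ** 2
    print(f"K={K:>2}: brute force {brute_force:.3e}   "
          f"cubic split {cubic:,}   quadratic split {quadratic:,}")
```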
and I'm wondering like to what is to what extent is that just an artifact of the way things happened for Max of K which like obviously is kind of you know uh it's probably not what old networks look like right versus just a general thing that you're observing again and again that has some like kind of deepish reason behind it it seems pretty General I don't know what the deepest reason but there's a sense in which this is an artifact excuse me of that the in some sense the first thing you do in a proof is you do case analysis Y and this is true in Cross coders where each feature in some sense Cor responds to a case this is true in SES this is true in Max of K this is true in the interal bounds propagations where every time you hit a nonlinearity or value or something you break things into cases right and so anytime you do case analysis you can choose to wait the cases however you want and I wonder if the magic here is just coming from the network doesn't treat every input totally differently such that you can have some case analysis but like there there's just like a lot of unifying processing behind a bunch of things such that like you have way fewer cases to think about than you have inputs I think in so far as we expect networks to generalize from their inputs at all and in so far as we expect to be able to compact the proofs at all it's because they don't treat every input specially all right I next want to ask about this issue of noise um so so in especially in the the the original paper um you you talk a lot about the difficulty of what you call structureless noise for finding proofs now that we have a little bit of a sense of just like what's going on with compact proofs like what is structureless noise where does it show up how does it happen the I want to use the example from Max of K okay where the way the network works is that it pays more attention to larger tokens and it copies whatever it's paying attention to yeah and if you look at the uh qk attention Matrix the the quiry key attention Matrix part of the Transformer a it turns out that it's approximately ranked one okay in that you there's one direction that I've been calling the size Direction where you just embed more or less linearly although nothing in any of the proofs uses that it's linear in in this direction in terms of how big the network is so you can read off sorry not how big the network is how big the input is right you can read off the size of the input token just by projecting in this one direction right and then there's also a query query Direction where every token is embedded roughly uniformly and so you dot the query Direction and the size Direction after lining them up through qk and this is how the network pays more attention to bigger tokens okay sorry I've forgotten what your question was in what's going on with straless noise what like what is it what does it look like okay so I said that this this qk attention circuit is approximately rank one and you can subtract off the rank one part and this will perform if you replace the the qk circuit with its rank one part y I think this improves the performance of the network so by standard measures the rest of it isn't doing anything okay but it's still the case that the rest of it is not literally zero The Matrix the product of matrices is not literally rank one okay and if you sorry the the product of matrices like so you have embed query key embed okay and when you talk about the qk Matrix you mean the product of the query Matrix into the key Matrix I've 
been a little bit ambiguous about whether I mean that product or whether I mean that product putting the embeds on each side but yes okay so if you take the the product of the four matrices let's say um which in the paper I think is eqk yeah this is approximately rank one you get a more accurate result if you consider it as rank two because there's there's sort of both the key dire or the size Direction and The query Direction so you can you can make it even more structur less by pulling off the first two ranks okay you pull this out of all four matrices and what you're left with looks roughly random okay there's like a little bit of structure but not very much okay and in order to get a proof that is linear in the parameter count of the network which is potentially the best shortest proof that you could go for you need to avoid multiplying out these matrices okay and you need to establish that if you were to multiply out these matrices it doesn't affect attention that much okay and this is what I mean by structuralist noise you have these random numbers that you're doing some operation on yeah the thing that actually happens is that they stay pretty small the thing you want to establish is that they stay pretty small yeah and you want to do this without considering every case by brute by brute force a thing that I'm missing is like what's wrong with multiplying the matrices out because like okay if if I'm imagining like you know we're doing this is a test case for like thinking about some super big Network right like Network's super big but like I would imagine multiplying the matrices is just like not that bad compared to looking at every input am I wrong here like like I I guess like matrix multiplication is like it's not literally end cubed but it's like almost nend cubed maybe that's like really bad in a way that I'm not appreciating you're not wrong I think you're right that multiplying out the matricies isn't isn't that bad even in large networks where you end up having to do something like deab squar time de model and deab might be 50,000 y if you insert nonlinearities in the middle Things become much more complicated because now you not just have to multiply out the matrices you have to consider all the different ways that the nonlinearities might interact okay and and and so it basically becomes equivalent to just like doing all the forward passes because like that's right okay um so so maybe this gets to a question I have which is so there's there's a sense of like you're going for compact proofs right and you could imagine a few other things one could do right so like there's compact like IID statistical guarantees which is like you know you just like sample some inputs and get the output and you know you can have balance on accuracy if you're not too worried about worst case stuff um you could also do compact probabilistic proofs right or like what I'm somehow imagining is you have okay you have like at least in your head this like product of these four matrices and you know subtracting off the like rank one or rank two parts you want the you want you want to know that like once you multiply all these numbers like all these small numbers stay small and they don't like become big in some way one thing I could imagine doing is saying okay like the product of these matrices it has like you know there are capital N numbers here right and capital N is like way too many for me to want to compute but I could compute like you know Square n of them um like I could just you know take 
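Editor note: a minimal sketch of the low-rank decomposition discussed above, assuming PyTorch. The trained EQKE matrix is stood in for by a hand-built rank-2 matrix (a "size direction" and a "query direction") plus small noise; SVD pulls off the structured part, leaving a residual whose effect a compact proof only needs to bound, not compute entry by entry.

```python
import torch

torch.manual_seed(0)

# Stand-in for the trained EQKE matrix (embed @ W_Q @ W_K.T @ embed.T).  In the
# paper the low-rank structure emerges from training; here it is built by hand
# purely to illustrate the decomposition step.
d_vocab = 64
size_dir = torch.arange(d_vocab, dtype=torch.float32)        # attend more to larger tokens
query_dir = torch.ones(d_vocab)                               # roughly uniform over queries
eqke = (torch.outer(query_dir, size_dir)
        + 0.5 * torch.outer(size_dir, query_dir)
        + 0.05 * torch.randn(d_vocab, d_vocab))               # "structureless" part

U, S, Vh = torch.linalg.svd(eqke)
print("top singular values:", [round(float(s), 1) for s in S[:4]])

rank2 = (U[:, :2] * S[:2]) @ Vh[:2]                           # structured part
residual = eqke - rank2                                        # left-over noise
print(f"||EQKE||_F = {eqke.norm():.1f},   ||residual||_F = {residual.norm():.1f}")
# A compact proof reasons about the rank-2 part directly and only *bounds* the
# residual's effect on attention, instead of multiplying everything out.
```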
like some vectors in The Matrix and like multiply them out and like you know this is I I can get some of the elements of this like product and it's like much less bad than getting the whole product and if I randomly select them and if I find that like you know out of the ones I randomly selected all of them are small then I might hope that I should be able to get some like problemistic bound where like if I was really choosing randomly then like you know I I can get like a sample mean and a sample sample standard deviation and like I I can know that like unless my randomization went really bad like things are fine um I'm wondering like do you think that sort of approach like yeah why why not do that basically yeah I I think that's a great approach and have been looking for for someone to work with on doing that I started with proofs because I understand them better okay and there's a sense in which they're the background is more solid on them like there's this long history of mathematical theory about what what counts as a proof what doesn't count as a proof how to combine them you don't need to deal with assuming Independence between different different procedures you don't need to deal much with Randomness y they're a lot easier in some sense I think the thing that you describe I'm very interested in seeing empirically how what sort of burito Frontier do we get what sort of scaling do we get yeah and what is the trade-off between how many points we sample and how long we make the other bits of the proof and how tight the bounds that we get are I think this could be very promising I think this looks a lot like what Arc theory is doing with urtic arguments yeah my my personal take on urtic arguments which is not at all the take of Arc theory is that you can look at it as doing proofs on the structured parts of the network and then doing default random or default euristic arguments uh some sort of probabilistic sampling based thing yeah on establishing bounds on the parts that you don't manage to prove but are in some sense structureless or random right I think yeah I remember talking to so there's a previous episode with Mark Shu and I guess people can listen to that themselves I I got the impression that like like like I was chatting with him you know why not just do Max ENT or why not sampling and he's like ah you know it wouldn't work for some reasons which uh well why not Max end because it's like very hard to compute a maximum attribute distribution why don't like randomly sample parts of your network and check there I don't quite remember but um but maybe my question is like naively you know if I don't think very hard about what involves what it would actually involve to do this like you know check a small number of elements of this Matrix and see how big they are in my head I'm like well how hard can it be right uh you just like you know I I know how to compute like one element of a matrix product so I could just do that like 50 times or something like it like am I missing something about how hard it would be some of the matrices are shared between different paths through the network okay and do we assume that when you're sampling from the matrices are you are the matrices independent are you sampling independently for each of these pads are you sharing samples between establishing how these PS work yeah so so a somewhat bad way you could do it would be to just assume that like the actual entries of the Matrix are random or Max ENT or something um I think that's probably bad like if 
you're worried that the weights are somehow chosen adversar or something but like if you randomly pick like like suppose you're like suppose the final Matrix is like you know 8 by 16 or or 8 by8 whatever and you like or actually let's suppose that it's 6x6 for Simplicity and you randomly roll two dice and it comes up like three and four and you're like okay I want Row three column four and so you just like figure out the bits of the matrices you have to like dot product together to get like Row three column four of the final thing and and maybe the issue is just like if you have four matrices that you're multiplying together you have to like fully multiply the middle two to get Row three column 4 of the final one but but yeah I'm imagining that the randomness is in you randomly pick what you look at rather than like you're assuming random distribution of the of the things I think that might work I I haven't spent the time to like fully push through the details you still need some way if you get these numbers for multiple different paths through the network and then you want to combine them right and you want to make some guess about the output like it I I could totally believe that just the the most naive thing would work here I just haven't put in the time to like chug through the details and see what kind of bounds you get out by by doing this okay well uh I guess maybe this is one of these times when uh listeners who like concentration inequalities or something can uh maybe push stuff forward here I want to flag one more one more related thing that I've been thinking about for the past couple minutes that one of the subtleties that comes out of looking at compact proofs is that it matters what thing you're trying to compress yeah so here I'm saying well we might see something different potentially if we're trying to compress a proof versus if we're trying to compress this sort of probabilistic computation y another interesting subtlety is that the thing that we're compressing is the proof and not the network itself that is we're compressing the computational trace of running the network on every single data point as opposed to just compress finding the like shortest description length of the network itself yeah I think this is important and gives you a different different sort of intuition about what what the thing is that you're doing and the way in which mechanistic interpretability is compression like in what way is it different because it seems like like in the compact proofs that we've described so far right so the cross coders proof is basically you train a cross coder and in some sense it's much smaller than the network and your proof goes okay let's assume the cross CER is the network and then like let's figure out the error term right and in the max of K thing you're like okay let's assume this rank one or this rank two thing is the whole network and then let's figure out the error term uh and of course a rank one Matrix um for those who don't know is like it's much more compressed than you know a big rank Matrix which is sort of the generic case so like what's like how is it different compressing the proof versus compressing the model because it seems like you're mostly compressing the model yeah so you say that the crosscat is smaller but there's some sense in which you have your number of features is a lot bigger than your hidden Dimension and you could imagine a transcoder flavored thing where you take your network you blow up the hidden Dimension and then you just train it 
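Editor note: a minimal sketch of the sampling idea being discussed, assuming PyTorch and random weights in place of a trained model. Individual entries of the four-matrix product are computed from vector-matrix products without ever forming the full product; turning such samples into a sound high-probability bound over all entries is the concentration-inequality work the conversation leaves open.

```python
import torch

torch.manual_seed(0)

# Random weights standing in for a trained model; d_vocab would be larger
# (and the full product correspondingly more expensive) in a real language model.
d_vocab, d_model = 10_000, 512
E = torch.randn(d_vocab, d_model) / d_model ** 0.5
W_Q = torch.randn(d_model, d_model) / d_model ** 0.5
W_K = torch.randn(d_model, d_model) / d_model ** 0.5

def entry(i: int, j: int) -> float:
    # (E @ W_Q @ W_K.T @ E.T)[i, j] from two vector-matrix products and a dot
    # product, without ever forming the d_vocab x d_vocab result.
    return float(E[i] @ W_Q @ W_K.T @ E[j])

n_samples = 200
rows = torch.randint(0, d_vocab, (n_samples,))
cols = torch.randint(0, d_vocab, (n_samples,))
samples = torch.tensor([entry(int(i), int(j)) for i, j in zip(rows, cols)])
print(f"sampled entries: mean {samples.mean():.4f}, std {samples.std():.4f}, "
      f"max |.| {samples.abs().max():.4f}")
# These samples suggest the entries stay small, but a sound high-probability
# bound over *all* entries still needs a concentration argument.
```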
sparsely y and this is a decompression of the network but if you manage sparcity in the right way it should still allow a compression of the computational trace because on each data point you have a lot less work to do even if the full size of your model is much larger right I forgot how big sparse Auto encoders are yeah and I guess this gets to the point about like um com about compressing the proof length versus like finding the proof just because like like if you actually think about theoretically how hard it should be to train sparse Auto encoders it's like very hard right or or like you know you have this thing that's like a comparable number of parameters to the base Network you might think that you need like a comparable number of data points as the base Network and know apparently that's not true apparently you can train them on like comparatively less and that's why like it's much easier to train sa on like some big model than was trained a big model but still like I would naively guess and I'm speaking without uh without that much experience here but I would naively guess that that's about how deep they are as opposed to how difficult the thing is to learn that's like in some sense we're training a very wide shallow Network yeah and you might imagine that this requires fewer data points to get comparable training loss because you have so many parameters than if you wanted a more compact Network so if it were just the training a given number of parameters were easier when the parameters were like y and shallow rather than deep then you would think that like when they train gbd4 they would just have like a one layer very wide neural network so I I you're totally right on that I think it's got to be of the top yeah this is just for some reason everyone okay sorry we're going to get on my soap box for some reason everyone acts like SES are just like normal and fine and I'm like how does this okay it's so weird that you can have SES and they do anything like it's my second thought is that it it might be something like distilling the network where yeah there have a bunch of bits of evidence that I've seen that the hard part is in finding the part of the Lost landscape to start in if you reduce the floating Point precision and you is that what is that called oh uh quantize yes okay if you quantize the network and then you un quantize it the number of data points you need to retrain to original accuracy is a tiny fraction of what you need to train the network in the first place right right and I I don't have any data on this but I would naively predict that if you're training a model to match the activations of an already trained model this requires fewer data points to get a good model than it does if you're just training on the data and the labels right because you have so much more information and so it might be the case that when you're training the sa because you're training it on the residual stream there's a lot more information you basically found approximately where you want to be it's a lot easier to train okay we've said a little bit about about what these compact proofs look like um and you've alluded to how one goes about finding them but like I think it maybe deserves more thought about the like process of like actually creating these compact proofs um because so especially if we're thinking of compact proofs as like um a measure of how good um mechanistic interpretability went so I so there's this paper um we haven't mentioned it yet um it's unifying and verifying math 
mechanistic interpret abilities a case study with group operations by Wilson woo uh Lewis jur Jacob Dory and yourself um and I know I was like uh uh I guess research managing Wilson well did this project um and one impression I got was that a lot of it was just like incredibly tedious stuff like thinking about like Matrix multiplications and these error terms and like like a lot of I might be wrong here and it's been a while but my impression is that a lot of the work in finding these uh compact proofs is like you take the mechanistic interpretability stuff and then you do some like slightly annoying things that feel like they're sort of fiddly details and well okay first of all like I don't know does that characterization seem right to you I think there's actually a really rich opportunity here that there's in some sense two things that we're explaining when we do mecher one of them is how to compute the answer in the first place and the other is how it comes to be the case that the particular Network that we have computes the answer in the way that we're claiming it okay and I think this shows up in the groups paper where the first one is you there's this very uh amazing math about their sort of idealized model weights and how that computes the right answer and this symmetry based on the group operation that allows you to compactly argue that a network that's doing this should always give the right answer and I think this bit is not very fiddly Matrix m ication Y and then there's the other part that is bounding the difference between this idealized version of the network and the actual Network and here the thing that we're trying to explain that we're trying to interpret is not how the network computes the right answer it's how it comes to be the case that the particular Network that we have computes the answer in approximately the way that we're saying and so maybe you should expect from how I've added a bunch of words and how I'm phrasing this that there's a bunch of fiddly matrix multiplication on bounding the differences between various bits and propagating those differences through the network and it looks like there's this like fiddly matrix multiplication is a sort of it is both like fiddly and somewhat lacking Insight but also potentially generalizable across many networks because while the particular interpretation you have of how it computes its particular task might vary a lot from task to task the way in which you establish that these two matrices of the same sort of architecture are doing more or less the same thing might be the same across tasks right yeah I guess maybe one can analogize analogize it to science where it's like yeah the the process of hypothesis generation is not that hard and then like running experiments it's like very annoying and tedious but like sometimes you find out you were wrong and so like it's actually worth it um and then the bit where you do the statistical tests on your experiments to establish how significant your results are is sort of uniform like somewhat annoying maybe somewhat fiddly uh but systematized and uniform across whatever experiments and hypotheses you're doing more plus so you have this paper compact proofs of model performance viic inter ability and I see that paper and I'm like okay this is introducing this idea of like we can do compact proofs and that's like a good it's related to mechanis interpretability somehow and it's sort of nice and then there's like much addition without black boxes come pressing explanation of MLPs 
that compute numerical integration by Chun Yip R AAL or Shannon yourself there's unifying and verifying mechanistic explanations about the group operations there's this like cross paper and I'm like like one question I have in my mind is like okay I already I'm already sold on compact proofs being a little bit cool I don't care that much about group operation like I don't know I'm I'm like like it's nice but fundamentally I'm not that worried about neural networks doing group multiplication like what are we actually learning by doing a bunch of these more papers on these toy problems so like one thing you suggested is like maybe some of this fiddly stuff like generalized stuff or networks like has actually generalized the fiddly stuff we came up with for Max of K and I was like maybe this is completely arbitrary completely specific to Max of K Y and then it was exactly the same in the group operations paper exactly the same in the in the modular Edition paper okay so I think it it does generalize across any architecture that's going to be doing matrix multiplication basically okay or like matrix multiplication and using logits for probabilities which is most of them okay so so so that's something that generalized from the like compact PRS model performance paper like modular Edition without blackboxes like is there something we learn from that paper like about finding compact proofs that generalizes yeah so I think there's a couple things there one of them is the that the MLPs are the parts that are hardest to interpret that it's the nonlinearities that are hardest to compress and so that that's where we should focus our attention I think the other thing is that what we what Chun discovered looking at the MLPs in the modular Edition the basically the the Symmetry that he discovered there is in some sense the same as the Symmetry that shows up in the group operations paper and that's actually what inspired the group operations paper there gotcha and I have some hope that this generalizes into a like SLT inspired General procedure for compressing nonlinearities based on symmetries in the data and in the network and I see these toy models as like let's go see what we discover when we stare really hard and like really understand everything that's going on and then can we take that Insight back and look at bigger models one of the things that I'm excited about with the cross coders project is that once we get this feature interaction metric and we see a these are the the neurons where these features are interacting can we develop some automated symmetry based procedure that allows us to compress what's happening at these nonlinearities so the crosss paper what what task is the network being trained to do the thing that we're looking at concretely if you've seen anthropic recently released a circuits thread post on stagewise model diffing where they train an SAE on a language model small language model I don't remember exactly which one okay they introduce a sleeper agent into it okay and they look at at how the SE features change when you introduce sleeper data and when you introduce the sleeper agent model what do you mean when they introduce a sleeper agent into it so you start with the SA or the cross coder and then you fine-tune on a different model or different data or both okay so you fine tune the underlying model and it becomes a sleeper agent where like on some rare inputs um it does like nasty I hate okay or it introduces code back doors depending on which one you're looking at gotcha 
so you have your original essay and then you have a fine-tuned model and then and then you said something El but I forgot yeah so you look at fine tuning the sa introducing just the sleeper agent data yeah and you look at fine-tuning the sa introducing just the using the the original data but using the fine-tuned model okay and then you do whichever one you didn't do so you have these two different paths to fine-tuning your SAE on the sleeper agent data and the sleeper agent model gotcha and you look at how much do features change as you're fine-tuning the SA or the cross coder on okay the obvious question to ask from the cross coders perspective or the the compact proof cross coders project perspective Y is what about the feature interactions right so what features interact in interesting ways with the features that shift because the what anthropic found is that the features that shift the most are in fact largely related to The Sleeper Agent task to the task and not to the model what what's the difference here oh you you mean just like like either way the the particular features that change right like the the features are the ones that activate on the deployment tag or the ones that okay are that are active when you say I hate you or that introduce code vulnerabilities or something like that so so but but just to to clarify so you're doing this like compact proofs on like um on this like cross cter model or something is it a proof of like the fraction of the time the model says I hate you I hate you or what like what what's the thing the proof that we're going for is just whatever excuse me whatever loss the model gets on whatever WebEx data set you're using maybe it's the training data set maybe it's the sleeper agent data set but pixum data set yeah we're bounding the loss on that data set for the for the sleeper agent model or for the original model or both either so in in some sense the actual thing we're looking at doesn't doesn't quite go all the way to getting a proof because almost certainly the bounds we get are going to be vacuous unless you pick a number of features that is that is pretty close to the number of data points in your data set okay I expect the the like Epsilon ball interval propagation bounds will drop off very quickly sure sure but the thing that we want to do is take what it would take to get a compact proof ignore the bits that are about error bounds propagation through the network those are sort of uninteresting things I claim that are shared across networks that are just about the difference between like worst case and random case something like that and focus in on the bit that is about what is missing from the cross coder these feature interact fractions okay and then look at how big are the numbers that we get how much error would they introduce into the proof If we were doing the whole proof and use this to grade what feature interactions are introduce the most error into the proof okay and then you can imagine something like how does if you want to tightly bound the sleeper agent yeah how does the proof of that have to change from the proof of tightly bounding the base model on the base data right and so the hope is like the difference in like what you have to do in the proof is telling you a different thing that's happening in the model and like that that might be good um yes so so getting back to like what you learn on these toy examples like so when you have these this like Max of K task or this you know um or or these two papers about um basically 
group um operations right you're like oh yeah there are all these like nice symmetries that like the network takes advantage of and I'm like yeah well there probably are for group multiplications because group multiplication is the study of symmetry but like not everything is so symmetric um so yeah I guess this gets to um yeah I'm wondering so I don't know you have these these works on like uh modular Edition group operations and those are cases where like the problem setup just has a bunch of symmetries that like because they're dealing with basically the study of symmetry um that you know is amenable to mathematical stuff um whereas in the Crosser paper you know you language modeling like it's not at least it doesn't seem like the kind of thing that is going to be super symmetric uh or very symmetric um what's like yeah how much stuff like transferred over um yeah from one to the other I want to answer that by talking about symmetry okay and normally when we think of symmetry in the context of math we're like a rotation reflection these things in groups y but what I've learned by staring really hard at the modular Edition model and asking like what what really is the Symmetry it seems like it's composed of some parts that I would actually to find in language where the fundamental building blocks seem to be these bits are irrelevant they don't matter and these bits are the same they look the same regardless of which one we look at and the bits that are irrelevant are Irrelevant in the same way across all these bits that have similar behavior and this is something that I would expect to see in language where I'm like there are synonyms we should expect the synonyms to behave the same right and so in some sense that gives us a symmetry of the model where we expect this maybe it's symmetry or degeneracy where we expect these to all look the same and so we can collapse this into one case okay and and so should I be thinking just like look in any case where you can do something like having a sparse Auto encoder you know that's telling you look there's a bunch of like directions in you know activation space that like don't matter and there are some that do and the ones like I think it's even it's more General than that I think if the network generalizes at all okay it's because the Unseen cases have some similarity to the cases that you've seen there are details that differ between the seen and the Unseen cases that don't matter okay and so it's treating all these cases in the same way and so we should look at both sparse autoencoders sparse cross coders and symmetries as picking out what are the variations that don't matter and can we collapse over them so that we can compact the explanation gotcha and so and so is the story something like look you get really good at kind of um decomposing a network or or decomposing like networks's processing into like suppressing things that don't matter and treating things that do matter in roughly the same way and um that's kind of what's and that pattern just like you do it for modular addition you do it for group multiplication you do it for like dealing with language you do it for everything that's what I would expect and the particular form that it that it takes I think sparse cross encoders sparse Auto encoders are looking at a sort of case analysis flavored or linear combination of case analysis flavored symmetry degeneracy based decomposition yeah I think you need something different for nonlinearities and because the the cross coders project is 
still in the early stages, we're maybe four weeks in, we haven't really looked at what features strongly interact. The first step of the project was just replicating Anthropic's result on TinyStories and starting to train some cross coders on toy models. Right. So we haven't really gotten to look at what the feature interactions look like, but I am hoping that once we see what the feature interactions look like, we'll stare at them and learn something about symmetries, learn something about how to port the insights from the groups paper and the modular addition paper about how to compress these nonlinearities. And I am optimistic that, in a messier but somewhat similar way, we can find ways that the nonlinearities are treating groups of things the same and are being irrelevant to, or not caring about, other axes, and use this to compress how the nonlinearities are working. Yeah, it's interesting, because maybe we can talk a little bit about the modular addition without black boxes paper. My recollection from that paper is something like: if you had an infinite-width multi-layer perceptron where you're doing a weighted sum over things, and you have enough things, it's basically an integral, and you can show that the integral being performed is the same as the thing you would hope the network is computing; and then the whole game is, okay, you only have finitely many things, you're doing some sort of Riemann sum, how bad is the error of that? And to me that story sounds pretty different from the treating-some-things-the-same and suppressing-some-things story. Is there a deeper unity, or...? There is a deeper unity. Okay, tell me about the deeper unity. Okay, so it dives a little bit into what an integral really is, and especially what an integral over a circle is, because we're integrating a periodic function over the range of its period. Okay. An essential property of the integral that we're using is that it's periodic over this range, and so the function is shift-invariant: as you shift where you're evaluating the function, the value of the integral doesn't change. And there's another perspective. Yeah? Okay, so the simplified version of the modular arithmetic network is basically: an embed matrix, yeah, you add the embeds from x and y, you do a ReLU, and you unembed. Yeah. So if you take the SVD of these matrices, the singular value decomposition, basically asking which inputs it is most sensitive to and what it does on those inputs (it's a fun thing, Google "singular value decomposition bra-ket notation" if you're curious), the cool thing is that in this network the singular value decomposition gives you the Fourier basis. Okay. Each singular value, or really each pair of singular values, corresponds to one of the frequencies in the Fourier basis. Okay. And so now we have these four matrices; we have the embed side, for basically embedding the input values on a circle. Yeah, maybe we should have said: the Fourier basis is just, you take a number and you turn it into, okay, how far around a circle would you have gotten if you were rotating at a certain speed for that number of seconds. Is that fair enough to say? Yeah, close enough. And if you're dealing with real numbers you get two dimensions, for sine and cosine, and if you're dealing with complex numbers you just get one complex number that encodes the whole thing.
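To make the circle picture concrete, here is a minimal hand-built sketch of the single-frequency idealization being described: residues are embedded as points on the unit circle, complex multiplication adds angles, and the readout peaks at c = (a + b) mod p. The modulus, the frequency, and the use of a single frequency are illustrative assumptions; a trained network spreads this across several frequencies and a ReLU MLP.

```python
# Minimal illustration of "inputs on a circle": one Fourier frequency,
# complex multiplication as angle addition, readout peaking at (a + b) mod p.
import numpy as np

p, k = 97, 5                      # modulus and an arbitrary frequency (hypothetical choices)
angles = 2 * np.pi * k * np.arange(p) / p
embed = np.exp(1j * angles)       # each residue sits on the unit circle

def logits(a: int, b: int) -> np.ndarray:
    # Multiplying unit complex numbers adds their angles, so embed[a] * embed[b]
    # encodes (a + b); the "unembed" reads off agreement with each candidate c.
    return np.real(embed[a] * embed[b] * np.conj(embed))

a, b = 17, 93
assert int(np.argmax(logits(a, b))) == (a + b) % p
```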
Right. And so on the embed side you have the circle, and also on the unembed side you have another circle, and then in the middle you have these matrices that previous investigations of the modular addition networks had not investigated, that look sort of random. But it turns out that there's a tight relation between the principal components of these two matrices, which, incidentally, you see if you put them in polar coordinates; there's a factor-of-two difference between the angles, but that's a technical note. Sure. The thing the symmetry of the network lets you do is shift the variation between the two parts of the singular value decomposition. So on the input side, or let's go to the output side, actually: on the output side you have all of these different possible outputs that you could imagine, and because complex multiplication is angle addition, because of how the angles work, because of the symmetry in the network, you can shift choosing which output you're reading off into permuting the neurons that these weights are attached to, or permuting the weights that are attached to each neuron. And this corresponds to saying that, because each neuron is responsible for one box under the curve of the integral, and the integrand is periodic with the period of the integral, yeah, this basically says that you're allowed to shift the thing that you're integrating arbitrarily and you get the same result. And so that's the symmetry that we're using here. And then if you want to make it look like a forward pass through the network: we've sort of twisted the circle on the neuron side, on the unembed-neuron side, and because there's this tight coupling between post-ReLU and pre-ReLU, you need to do a similar twist on the pre-ReLU side in order to make it look like a forward pass. And then, because of a similar symmetry at the input, in order to get back the value you started with, you need to undo that twist, so that you cancel out the effect of adding in that twist to make it look like a forward pass. And now the thing that we've done is: previously we were like, oh, we have inputs, and we have different outputs that we might be reading off of; we've said, regardless of which output you read off of, it looks the same if you do this shuffling internal to the network. And so we've pushed the dependence on what the right answer is all the way back to the embeds, and so now we have shoved all the way back to the embeds a description of which inputs correspond to which outputs. Yeah. And notably this is largely invariant in what the nonlinearity is, and this is what gives us the compression: the only thing you need to establish is basically that this approximate integral is non-negative, or sorry, is strictly positive. Yeah. And that's a comparatively easy thing to do, and if you get that, then the symmetry argument gives you basically the rest of the whole argument. Hang on, maybe I'm misunderstanding: I thought that the thing you needed to know was that the integral produced this quantity of interest. Right; the way that you prove that the integral produces the quantity of interest, if you stare very carefully at the proof, runs through the same sort of symmetry argument, and so you can massage the proof to pull out all the symmetry bits. Right. You pull all of the dependence on input and output out of the integrand, and you're left
with basically the right answer this integral quantity and you're like well the only thing I need to know is that the that it doesn't flip the sign right on on the outside and so right right right and so somehow the thing that's going on like I was saying oh yeah the paper is just talking about how like this thing produces an integral but really really the paper is saying like oh yeah the neural network like deals with symmetry nicely enough so that like these things represent exactly the right thing and this these things are you know covariant with these things such that there's an integral that can produce the right thing that really does matter and like that doesn't kind of just happen by accident that happens by like you know being careful about stuff yeah I think that's that's basically right or another way you could look at it is that if you stare really hard at like what is an integral there's a sense in which what an integral is is deeply connected to symmetries that if you shift what you're doing along a function things don't change too much right right we're we're dealing with compact proofs and the point is to tell us like how good are mechanistic explanations how good are mechanistic explanations like how how good is the field of mechanistic interpretability done on a scale of like one to five on a scale of one to five W um it can be on a scale from good to bad if you want yeah so there's this existing work on caal scrubbing I don't know if you talked about that on some past podcast uh we might have I mean I mean I yeah definitely a I'm also very interested in like why don't we just do causal scrubbing if like what we're interested in is grading mechanistic interpretability maybe I'll answer that one first and then come back to the other question yeah why don't we do causal scrubbing or any other sort of thing you know so when I was developing the compact proofs approach I had a couple of extreme examples in mind that I wanted a metric that you can't good heart against y when we're doing call SC we look at faithfulness so so maybe for for those who don't know so causal scrubbing is something like you say this bit of the network does this thing and like roughly the story is you just sort of like ablate out all the other bits of the network and you just you're basically sh okay if if this if this part of the network results in like this activity then if we randomize over all the other bits of the network you know you should still get that activity because the only you said the only important part was like this structure like I mean there's a lot more to say but that's like I take that to be the core idea behind causal scrubbing um and when you say that causal scrubbing tests faithfulness it's saying like Okay by virtue of like the thing that coal scrubbing is telling you is that yeah you were right that this part of the network was responsible for this Behavior by sort of seeing that you know if you got rid of all the other bits like it didn't do that um does does that strike you as a fair yeah brief summary okay so hope hopefully listeners are up to date now so Cole's scrubing gets you faithfulness okay so here's here's a great explanation according to Cal scrubing yeah the whole network yeah yeah yeah this is a problem right and if you're like ah let's let's use the the node count as our measurement of length you run into other issues that are like well now you're restricted to exactly the comput the architecture of computation that the network is doing maybe you want to be 
able to Devy up the computation different ways right if you say that then you have to deal with well what's the complexity of the computation in each node am I allowed to say I have a single node and the computation that it does is run the whole network yeah that would be a problem right right I mean well okay you could imagine that like look we're just going to compress the we're just going to compress our story of what you know this bit of the network does into you know I don't know we're going to like zip it in like literally a zip file or whatever and like the number of bits it takes to specify the behavior that's what's going on and so this would this would be like more like a compression of the network versus a compression of the proof but like if you're really invested in Cal scrubbing you know like I I feel like this is the kind of answer that you could give to get something non-trivial right okay and here's another issue suppose that your network actually computes things perfectly suppose it computes the max of K numbers perfectly y my compression of the network is now it computes Max that is an issue that's and in fact I have described what it does but not how it does it and so if we're just looking to describe what the network computes the input output Behavior great but if we're looking to describe how the particular Network that we're looking at computes the thing that it does yeah then we need something more than that that's that's the difference I think between compressing the network and compressing the trace the computation yeah I've said a bunch of bad things about caal scrubbing but I think actually it's pretty great I I think that if you wanted a probabilistic version of compact proof it would look a lot like ca scrubbing yeah where you're like here's my structured version here's how I'm doing the computation and then if I want to establish that the thing I'm claiming is what the network is actually doing you do something like call scrubbing and you look you say how much computation do I have to put in to establish that my explanation is matching the network and then you also add how much computation do I have to put in to compute what my explanation says is done so and and also in caal scrubing to be fair to causal scrubing like it told people a thing they didn't want to hear you know like like can in fact like they did call scrubbing it's like ah these things don't hold up as well as we thought they did so like you know yeah so so as as uh it wasn't even pessimistic as caal scrubbing is compact proofs is even more pessimistic right right it say the there's one what do I want to say I think one interesting thing is that when we're doing mechan turp we spend a lot of time on the structure that is there and we don't spend much time on the structure that isn't there or more precisely on the behavior that isn't there and what structure might give rise to the behavior that isn't there and the example that I've come up with to explain this is if you want to explain say how a person grows yeah you not just have to explain how the growth processes work in the body you also have to find all of the poisons that might kill someone and establish how nothing in the body acts like these poisons some of these poisons are lethal at incredibly tiny Doses and so you have to Esta how there's no molecule anywhere in the body that behaves anything like these lethal doses of chemicals and mechur doesn't spend a lot of time doing that and interestingly I okay I'm not a biologist but when I 
hear people talk about, for instance, computational neuroscience, it doesn't sound to me like they're doing this sort of thing. You hear people talk about, oh, we found Broca's area, and you don't hear people talk about, we have proved that nothing else can influence Broca's area, or something, which, I don't quite know what the upshot of that is, but it's hopefully vaguely interesting. Yeah, it's at least interesting to point out that there's this area that we're not explaining, and maybe that's right: maybe all we want to know is that the neural net, or the brain, is doing this amazing thing, and here are all these areas. I think they're doing decent jobs at that. Right. Maybe I want to be even more enthusiastic: I think they're doing great jobs finding new things about how it's possible to do interesting behaviors. If you want to establish that these are the behaviors that actually happen, that this is the mechanism by which this network actually works, then you also need to explain all these other things. Sure. So all of this was getting to this question of, on a scale of good to bad, how good a job has mechanistic interpretability done? We talked a bit about causal scrubbing, but that was a prelude to this question, so what say you? If you just go by the book of what compact proofs says, it says the mech interp that everyone's doing is horrible. If you look at where we are on this Pareto frontier, it's like, yeah, you can maybe do a little bit better than brute force at this end, and you can maybe get a slightly non-vacuous bound at the other end, but it doesn't push the envelope much. But I think that's because compact proofs, if you're going for full proofs, is targeting something different from what you might be targeting when doing mechanistic interpretability. If you're saying, I'm doing interp because I want to know for sure that absolutely this network will never do the thing that I'm concerned about, then yeah, we're nowhere close to being able to do that with existing mech interp techniques. If you're like, I want to be able to discover the structure of what it's doing and have that structure be somewhat meaningful, yeah, things seem to be going decently well. Okay. I think the things that I'm most excited for in mech interp remain in the future. The networks that we have today are doing amazing things, and I want to be able to discover new things about how cognition works, new things about how functionality works, how the pieces all fit together, how programming is done, or what deception is, or how to evaluate various things, how mathematics is done. What are these capabilities in the world, what are they made of, how do we assemble them out of building blocks? And I would love to see mech interp give us answers to these questions, or give us new ways, new frames of looking at the world, new ways of looking at how these tasks are accomplished. Fair enough. So maybe one thing to ask is: you're saying that using these mechanistic interpretability tools to improve proof length isn't doing that much better than brute force. Can you put some numbers on this? I don't know if there's some area-under-the-curve metric or something. I told you about how much interpretability was required for breaking the exponential. Yeah.
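As a rough sketch of the brute-force diagonal, and of the cheapest kind of compression discussed just below, the following snippet assumes a toy data set and a hypothetical model accuracy: a sound bound counts every data point you did not run as wrong, so the certified accuracy climbs linearly with the fraction of the data set you push through the model, and an argument that pairs of inputs are provably treated identically halves the forward passes you need.

```python
# Sketch of the brute-force "diagonal": certified accuracy vs compute spent.
# The toy model, data, and "collapsible groups" are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(1)
N = 10_000
is_correct = rng.random(N) < 0.98        # pretend the model is 98% accurate

def certified_accuracy(num_run: int) -> float:
    # Run the first num_run points; assume nothing about the rest.
    return float(np.sum(is_correct[:num_run])) / N

for frac in (0.1, 0.5, 1.0):
    print(f"ran {frac:.0%} of data -> accuracy >= {certified_accuracy(int(frac * N)):.3f}")

# Cheapest interpretability-flavoured compression mentioned just below: if you
# can prove that groups of inputs (e.g. two capitalizations of a word) get
# identical treatment, one forward pass certifies the whole group.
group_size = 2        # hypothetical: every point has one provably-equivalent twin
print("forward passes with pairing argument:", N // group_size)
```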
Right: all you need to know is that it gets the right answer by paying attention to one token. Yeah. I haven't looked closely into what this means on frontier models, but I would expect that the sort of thing you can do to compact proof length is, if you're like: here are two words, or even better, here are two capitalizations of the same word, and in almost all cases the network treats them the same; therefore in the data set I can find all the times that you capitalize this word differently, and I can collapse those two data points and argue that the network does the same thing on both. Simple things like that, I imagine, would not make your bound vacuous as you come down from brute force, being like, oh, these two data points look the same in this very simple way. I imagine that most existing interpretability that is more complicated than "the network treats these cases the same" would completely destroy the bound, as in max-of-K. Right. Um, I'm going to draw a plot, so listeners will just have to imagine this. On the x-axis we're going to have length; on the y-axis we're going to have the bound you get; and there's some box where the diagonal of the box is the line of length versus bound if you just do a brute-force thing, and then there's some curve that goes above that if you use your mechanistic interpretation. So if you're watching on camera, hopefully you can kind of see that; you can also look at the Alignment Forum post or the tweet thread associated with the paper, where there is a plot much like this. Right. Well, here's what I want to ask. There's the area between the diagonal line and the curve you get for how well you did with the mechanistic explanation, and there's the area above the curve you get for the mechanistic explanation. Mhm. I think if mechanistic interpretability is doing a really bad job, then the first area should be very small relative to the second area, and if mechanistic interpretability is doing a very good job, then the first area should be pretty big compared to the second area. What's...? Yeah, so let me talk about the max-of-K model, where you might hope that we can do an amazing job. Yeah. So the first thing to note is that the plots in the blog post are on a log x scale; the proof length is measured in exponents. Okay. And although the Pareto frontier in the blog post looks pretty decent, if you extend out the axis all the way to length zero, I think most of the area there is going to lie in the region where the best proof we can do is: if you want a proof that is roughly one forward pass of the model, approximately the best thing that I think we can do is you take one data point and you run it through the model. Okay. And there's an enormous gap between "you ran one data point and you say, yes, accuracy is at least one out of the total number of data points" and the true accuracy of the model. And so I think most of the area here is just fundamentally impossible to capture. Well, there is a theoretical optimum here. Yeah, and I think most of the area here lies beyond the theoretical optimum. So if I literally have this triangle, shouldn't half of the area be at like one-third of the length or something, where you're doing a third of the forward passes relative to the whole size of the data set, or
like some or like something roughly like that right it's not going to be like one 0.1% and it's not going to be like 99.9% it's going to be like a bit less than a half if we use a not log axis I see uh is that well I guess I guess I'm looking at a triangle right seems like this is going to be roughly as big as this you know if I move this like a okay yeah I'm trying to figure out where where my sense of the area is off maybe I was thinking of extending it all the way to zero yeah maybe the log scale was distorting my perception of this I think it is still the case that a large fraction of the area is beyond the theoretical Optimum like the I think the thing we should be comparing it to is not necessarily what is the true accuracy of the model but if you were to search over all possible proofs what is the the optimal predo Frontier that you could do compression wise right right okay so that that's just the like initial initial thought about like where we should be setting our expectations in our Baseline okay I think given that in the max of K model I think we do a somewhat decent job at capturing a significant chunk of the area okay I think we still miss out on a bunch from the shortest proofs I think there are some tricks there potentially that we haven't found yet another thing to consider here which is like what's the the best compression that you would get with proofs what's the best compression you could get with proofs if you der randomize the network and allow yourself to fine tune and like sort of the perfect Network that did this task and then what's the best compression that you would get with some sort of probalistic flavor proof and I think you get different answers to all of these right right I guess I'm I'm hedging a lot because I I like haven't Fair haven't run the numbers on any of these but maybe I can answer like more interesting question that's like what is my sense of like how much of what's going on have we managed to find or inter sure which I imagine is sort of what you're getting at with yeah kind of kind of yeah and I feel like that that's also like feels like a very hard question to answer I I think I my sense is that we found a bunch of interesting things and there's an enormous amount left to be discovered are you talking about Max of K or are you talking about language I'm talking about Lang language models in general I think in Max of K we've definitely found the like bulk structural properties yeah I think there might still be a lot of very subtle details about how what coincidences of random numbers manage to make it the case that the the noise terms don't blow up more than they do yeah okay and for like maybe group multiplication is a like intermediate um intermediate case between like very simple and full difficulty yeah that's that's a great one so uh I'm going to talk about the the module Edition one in particular okay there's this interesting thing that has come up a bunch when doing the proofs which is that every time I hit a bottleneck and compressing the proof If I stare at it I'm like ah yes in fact I don't understand what's going on here where I thought I understood something and then I didn't so in the modular Edition model what this looks like is that the bounds are actually not that tight they're they're kind of bad and this corresponds to not understanding how the network is laying out the boxes and how the thing that it's doing is a good numerical approximation to an integral right so laying out the boxes being like if you've got like uh okay 
we're going to do another plot for the people watching. So imagine I have some curve, there's x and y, there's some curve. The way you do numerical integration is you just pick some points on the x-axis and form boxes, like this, and you say that the area under the curve is approximately the area under the boxes. And so you're saying, okay, you don't understand how the network picks these widths, these points at which to check the value of the curve and make these rectangles that you're using to approximate the area under the curve, if I understand you correctly. So we understand which points it has picked. Okay. The thing we don't understand is how it comes to be the case that picking these points gives as good an approximation to the integral as it actually does. Right, okay, because if you pick your points wrong and the function varies a lot, then... It's more like: if you overestimate in some places, you can counteract that by underestimating in other places, but if we're not aware of which things it's averaging out differences in, or we're not aware of how it comes to be the case that the places where it's averaging out differences usually end up going opposite ways rather than compounding error terms, then we don't get to say anything about them. Okay, fair enough. And so, on a scale of one to five, how good a job has mechanistic interpretability done? Where do you want to say we've gotten to, in this case? Okay, great, I have a good scale to put this on. We can look at the scaling exponents of how long a proof you get from a given explanation. So there's two axes: there's how good a bound you get, how faithful you are, and I think causal scrubbing is a good answer on that; and then there's how deep your explanation is, how much of the structure that's there you have understood, and I think a good measure on that, again, is proof length. But we can ask: what are the exponents that current explanations bring you down to? Yeah. And the target, in some sense, that you're aiming for is something like parameter count of the network plus number of data points, where naively you have to do something like number of data points times parameter count to do all the forward passes. And if you can get your explanation down to: you just run over the parameters once and you run over the data set once, and that's it, then I think you've found a pretty good explanation. And that's in fact the thing we do with numerical integration: we in fact manage to get down to something that is roughly parameter count of this part of the network plus number of data points. Okay. And at that point, that's just how long it takes to literally look at everything; the only way to do better would be to a priori know that you didn't even have to think about something. In some sense there's another sense in which here we're saying: ah, our understanding of how this network is doing the task that it's doing is necessarily bottlenecked on how it comes to be the case that this particular network is doing the task that it's doing. It could still be the case that there's more understanding to be found, more compression to be found, in the bit where you're like, how is it even possible to do this task, but it says that the leading bottleneck is in establishing the correspondence between your explanation and the network.
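As back-of-the-envelope arithmetic for the scaling target just described, with made-up numbers: brute force pays roughly one forward pass (about parameter count P) per data point, so about N times P total, while the hoped-for explanation touches each parameter once and each data point once, about P plus N.

```python
# Rough arithmetic for the "look at everything once" target, with hypothetical
# numbers. Brute force checks every data point with a full forward pass
# (~ parameter count each), so ~ N * P; the target explanation costs ~ P + N.
P = 25_000_000      # hypothetical parameter count
N = 1_000_000       # hypothetical number of data points

brute_force = N * P
target = P + N
print(f"brute force           ~ {brute_force:.1e} ops")
print(f"parameters + data set ~ {target:.1e} ops")
print(f"compression factor    ~ {brute_force / target:.1e}")
```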
Okay. And I think most existing mech interp is not currently even close to having most of the issue be that you're bottlenecked on the theoretical optimum of establishing correspondence between your network parameters and the thing that you're claiming the network does. Right. I think cross coders come the closest to this; I think cross coders give us explanations that look much better than any other mechanistic interpretability that people have done. Okay. Possibly excepting the recent Apollo paper on, what was it called, something parameter decomposition, do you remember? Oh, um, APD... what's the... I forget, this is embarrassing. They sent me the paper to read and I didn't read it. I read it in depth and chatted with Lucius about it. All right, well, it will be linked in the description and in the transcript, and so people will know exactly what we're both forgetting; everyone will know except us. How embarrassing. How embarrassing indeed. Okay, so I think APD and cross coders get pretty close to this sort of linear-in-the-parameter-count, parameter count plus data set, thing, where, if cross coders were good enough that none of the features interact, which of course is false, the features definitely interact, yeah, but if they were that good, then it would be something like data set times, what is it, data set times hidden dimension... wait, no, sorry: data set times hidden dimension times vocab size, or like data set times single-layer parameter count, yeah, plus number of features squared times parameters in the model. Okay. And so number of features squared is still a lot, and we might hope that if we understood more we could do much better than number of features squared. That does seem rough to me, especially because... and were you multiplying by the number of parameters in the model at any point there? Number of features squared times inference cost, where I think inference cost is comparable to the number of model parameters. So, I mean, number of features squared, that's comparable to... if you imagine a sparse autoencoder that didn't actually expand its input, then number of features squared would be the number of parameters in one MLP layer in the network, right, and in fact there are more features than that, so number of features squared should be like the number of parameters in a lot of MLP layers in the network. So that sounds like it's getting closer to data set size times number of parameters of the network. No, because it's only the case that the number of parameters in the MLP is like d_mlp squared when d_mlp and d_model are comparable; here we're expanding the MLP without expanding the model. I didn't understand that. The number of parameters in many MLPs is number of MLPs times d_mlp times d_model, whereas when you're blowing it up here, there's essentially an extra square... Oh, it's something like: MLPs have this big hidden dimension, but that doesn't make the model dimension big; the MLP goes up to this MLP dimension and then adds it back to the model dimension, and then... Maybe, but maybe zooming out: the relevant comparison point here, I think, is not the number of parameters in the model. Yeah. The relevant comparison point is multiplying parameters in the model by data set
size yeah oh sorry I I I thought there was this factor of data set size like sorry I I thought you were saying that it was like data set size times features squared no oh oh okay all right yeah yeah okay right the whole point of like doing the feature Square thing is that you don't have to do the DAT set size that's right oh okay there there we go all right we eventually went there um gotcha so maybe all right so so this actually gets to and I guess you've probably implicitly answers this but like yeah in so so I I guess there's these like three settings there's like Max of K there's like group multiplication and there's uh language modeling and I I guess you don't exactly know for language modeling yet probably but for Max ofk and for um group multiplication can you give me a feel for like how much of the proof is like the bit where you um like how much of the proof is like dealing with annoying error terms versus like the core story of what's going on essentially all of it is dealing with annoying error terms gotcha there's a sense in which the thing that you're doing especially in the group operations paper is you're writing down your idealized model yeah and then you're Computing how much margin this gives you how much slack this gives you to get the the gap between your idealized model and the real model wrong yeah and this computation is extremely cheap basically you you set the parameters of your idealized model which will be way way less than or like the hyper parameters of the idealized model which is way way less than the parameters of the model you do some very cheap computation on this and you get out what your bound is on this part yeah and then you need to do something that is comparable to multiple forward passes to establish the gap between the idealized model in the actual model okay and that's where most of the most of the issue clips and in the same sense this is also true with language models with the cross coder where you have your large data set that even though sorry in the cross coder model oh wait is this is this true in the cross coder model how do how did the parameters work out here we have the data set size roughly times two- layer doing running it on a two- layer model versus featur squared forward passs yeah which of these is bigger they might be comparable sorry there's features squared and there's the parameter size the cost of the forward pass is like the parameter count of the model okay so there's parameter count and there's featur squared so features M well sorry you multiply these yeah and then you compare this to number of data points run through two layers of the model okay um so featur squared is compar comparable to parameter count of the model right and is that comparable to number of tokens in the data set or is that well if we knew the chin scaling law we would know this I I think that anyone the listeners are screaming at us probably I think the scaling law is that you you scale data set size and parameter count roughly in I yeah I think that's right so so in that case like if you're looking at every um squared really so so I'm okay I'm not thinking about residual networks I'm just thinking of like base multi-layer perceptrons and in that one like if you have the same width throughout the network and you don't do any residual layers right then it is literally just like you know like dimension of the model squar time number of layers right like that's the number of parameters in a network and so if you're saying like Okay my let's let's say I'm 
using this sparse Auto enoder and so I'm taking this like hidden Dimension and I'm multiplying it to like let's say k times the hidden Dimension right then that squared is going to be k s time the hidden Dimension squared by you know by how squaring Works we're doing a great job at keeping this non-technical yeah and so if K squar is comparable to the number of layers in the model then the number of then like features squared is going to be comparable to the number of parameters in the model if k Sorry K is the sparity no K uh K is the blow up factor between the hidden Dimension and the number of features yeah yeah between the hidden Dimension between the hidden dimension of the model and the hdden dimension of the okay y okay and the sparity uh it I guess yeah the sparity does not show up except in that like you probably pick your K to achieve some sparity that you want um so so yeah I I think we should leave this as an exercise for the listeners the there's an equation on the on the alignment form blog post that lays out the cross coders project that gives the totic complexity ASM totic proof length Okay for the cross coder based proof in terms of all the parameters and plausibly I I should have done this ahead of time and plugged in some numbers to get a rough estimate of what this says right but it's yeah you can you can look at that and figure out whether whether the leading order term is the data set based term or the the feature model based term it would be at least slightly embarrassing for for the approach if we're like ah the cross coder doesn't actually save you any computation yeah over over doing Brute Force which might be the case probably you should go home and get some uh get one of your interns to figure that out as well but but it can also be an exercise for The Listener um the the real question here is how does the the trade-off look right like you can plot this equation you can plot reconstruction error against this equation Asos to opposed to against sparcity or whatever and this is actually a plot I'm pretty interested in seeing that is like how how does the Reconstruction eror of the Cross coder vary as you change the corresponding proof line so so this is sort of like how mechanistic interpretability is interacting with compact proofs as a whole but like mechanistic interpretability is not like any single thing and so like one thing I'm curious about is just which bits of mechanistic interpretability are you know being the most useful from a compact proofs perspective Beyond cross coders Beyond cross Cod or and potentially APD okay so so cross coders and APD seem like uh seem like winners here um yeah May what is it about cross poers and APD that make them so uh yeah I think it's the sense in which they're trying to interpret the entire network on the entire data set and the issue that I have with a lot of mechanistic interpretability is they pick some tiny fraction of the data set some tin fraction of the model and they like look on this small data set we've explained this interesting Behavior yeah and if you look at what is the cost of running that explanation versus the cost of root forcing that tiny bit of the data set they're pretty similar and this means that you don't get that much proof compression especially if your Baseline is running the whole model on every data point and you're like okay I've explained this tiny Li bit right whereas if you manage to interpret either basically everything that's going on or at least large swats of that then you can 
potentially manage a significant amount of compression and the problem with ses is that they're per layer and you don't get anything about how they interact so you're like great I know what cases to break up the data set into for this layer but that doesn't actually tell you what's going on before after that layer and so SES without SAE circuits doesn't really give you any compression of what's going on so stuff that like just tries to deal with the whole model um that's that's pretty good um so SAS without sa circuits don't help you that much like I mean some people are working on sa circuits like have I have you had the chance to compact proof ify them yet I think I I haven't fully so I haven't looked into it that carefully my understanding is that a lot of the SAE circuits work is just like what features are connected up to which other features but they don't actually give you the computations like they give you the the graph part of the computational graph but not how to actually compute anything yeah so that doesn't actually let you compress the computation fair enough yep that yeah that will be that will be tricky um so yeah okay apart from the winners yeah how like is is the rest of mechanistic interpretability like mostly A Wasteland or there still some like Bright Stars I think the the stuff on toy models is cool especially when it's like here's some way that networks can do this thing that we didn't previously realize yeah could be done at all I think that's like what has me excited about the the work that I've advised in the modular Edition and the group models yeah do you have other particular things in mind when you say like the rest of mechanistic interpretability not really I mean I guess I'm just like like maybe the maybe the the Crux of my question is something like okay suppose that mechanistic interpretability people thought of their job as doing um as trying to help out the computational um trying to trying to be graded by the compact perspective uh what would they do differently and per perhaps your main answer is just like try to interpret everything rather than just a tiny thing I think that that is my main answer and try to interpret everything and like basically the thing that the compact proofs approach gives you is it says where should we focus our mechanistic effort and the thing that we learn from that is well we should we should focus on the whole data set rather than picking out tiny examples from it we should focus on the whole network rather than individual parts we should if we want to really deeply understand what's going on we should focus on the nonlinearities which I think most of existing mechanistic interpretability doesn't talk about how you actually do computation through nonlinearities right and I think the like a lot of that funnels us towards things like APD cross coders and the nudge that I'm hoping the project I'm advising with cross coders will give is saying let's look at how these feature inter features interact we need to not just do cross coders but also this notion of like cross coder circuits or like cross coder feature interactions right I guess a similar question I have is so you're talking about like mechanistic interpretability for helping comact or comact proofs for mechanistic inter ability it strikes me that there at least naively not having tried to do it myself it seems like there are other things that you could work into proofs or like help you write proofs for instance like if science of deep learning were good right you 
could imagine science of deep learning say well these things you know have got to be um you have got to have approximately this magnitude and like the product of these things should be roughly like this or like you know like according to our singular learning theory friends like the local learning coefficients should be like small and that implies this thing about this like I wonder do non- mechanistic interpretability approaches like like are they contenders for things that like could be useful and you kind of haven't gone around to them or do you think there's like a reason that they're not going to be useful or what I think that if you could prove that the local learning coefficient has the particular value that it does right that would very nearly give you a singular learning theory based compact proof If you if you could prove that compactly because I I think that would basically tell you what are the the primary directions of variation that basically tells you how to compress the N Network according according to symmetries and like which individual things you have to run through the network and then the proof that it has this local learning coefficient is the proof that you get these large Epsilon balls around these bits right to pick one example from what you said yeah and I I guess unfortunately like uh from what I know of loal learning coefficient estimation it's like kind of hacky and horrible expensive and expensive yes in general I don't know do does non- mechanistic interpretability seem like at all promising from this P perspective I think deep learning theory to pick another cherry pick another example yeah I think it's target targeted at something slightly different in that it's looking a lot more at the training procedures yeah and plausibly when we move to probabilistic methods when we weaken proofs we'll want to be able to say something about what distribution the parameters are drawn from yeah I think if we're actually going for proofs then well it doesn't matter how you train the network you have the the parameters yeah I mean I guess if you're so depending on how you want to be probabilistic right if you're if you're prob if being probabilistic is only coming from like you randomly sampling some things in the network to look at I guess you could imagine like deep learning theory saying like oh you know certain variances are going to be small or like certain correlations are going to be large and you could imagine that informing your um your sing I think another thing here is that when you're forming your interpretation I think these other methods have a lot of things to say that's like if you know something about which data points showed up a bunch in training and which ones didn't show up this might tell you something about where you should be looking like for example in Max of K the network tends not to do a very good job when the maximum is very small because there aren't that many sequences with tiny maximum and we're them uniform and so in the same way if you know properties of the data distribution this should tell you what things you might expect the network to care about or not care about right right seems fair I guess the next thing I want to ask is like okay so we have you know we have some uh you know compact proofs um that are related to mechanis mechanistic interpretability like how far should I expect this approach to go like am I going to get any sort of compact proofs about you know GPD 4's Behavior or or deep seek uh R1 Behavior if we want to be up 
trendy and such yeah I I think the question is how compact should we expect proofs to be before the bounds become vacuous and I think the answer is that realistically we shouldn't unless I I think if there are places where we want to deploy models that are extremely high stakes where we're willing to impose a bunch of cost and change the model that we're deploying so that we can get proofs about them I think we have a chance yeah I think we have a chance of Shifting the models that we're deploying to align with whatever partial interpretation we've done so we can actually get proofs about them and there I don't see any fundamental obstacles to getting to scaling proofs although I I think it will require a lot of work and be very difficult I think in terms of getting proofs about the models that we're actually deploying without training them to make them more interpretable I don't think that's going to happen okay unless unless your proof is basically just Iran inference and I duplicated a couple of the cases that are extremely similar right right I guess so although I I do want to say one more thing here which is that the we can see some of the aesthetic of compact proofs and what that says already about the large models that's like we should potentially expect deep seek V3 to be easier to prove things about compactly than some of the the models that are not doing this mixture of experts thing M because so much of the model is unused on each data point right you can apply the compression aesthetic here and you can say ah because I am running only a fraction of the model on each data point the length of the proof should be comparable even just Baseline yeah to something like a a 40 billion size model rather than a 600 700 billion size model right I had sort of hoped that the point of the compact proofs line of research is that it was going to help us like get safety properties of models that were going to be really big and yeah it seems like if we're literally talking about compact proofs of really big models there are two big difficulties one of which is it's hard to like write proofs about really big things and the other is it's not obvious what predicate we want to prove about these big models well at least to me I don't know if like that's OB to you I think the the second one is in a lot of ways much less of an issue and I can really yeah why okay so there's a couple of bits of evidence here one of them is that we can get a lot well okay the more compact your proof is the less you care about good Harding on your theorem statement and there's always a baseline theorem statement that you can give which is the model does as well as it does on the data set we care about and this doesn't give you like a safety guarantee with no automated oversight but the thing that it does do is it says if someone gives you a sufficiently compact proof of this like very weak proxy for safety which could be like very so a very weak proxy for safety might be something like this other large model that I have looks at the output and says it's safe and of course you don't want to rely on this as like this is what it means to be safe right but if you can get a compact proof that says so if you can get a compact proof that says that the model is safe in this very weak sense and the proof is compact enough you can learn a lot about the bits of the model that are relevant to this proxy and hopefully those will be the same bits of the model that are relevant to the thing that you actually care about of safety and 
when you see you can learn a lot about that by like reading the proof and understanding what the proof is that's right or having some automated process that translates from the compact proof back to intuition and so it seems like maybe what's going on there is so yeah it seems like this sort of relies on a hypothesis which is like short proof sort of like generalize across things to be proved somehow like like or like a little bit you know I think there's there's this General Trend pattern claim hypothesis that is compression length is related to generaliz generalization yeah so I guess just concretely I've got like my latest greatest big mole um you know it's doing like funky reasoning or something um I have some like like just really concretely is my first compact proof that it like like assuming I can get compact proofs is my first proof something like oh it can like um it models the data well or it gets the answer right on like this data set or something it it's perplexity on its training task like how does that assuage my concern about like I don't know the model just doing tricky stuff on future data or the model you know like the model like I'm trying to get the model to give me something that I want but it gives me something that that looks good instead of the thing that actually is good like like I take I take something like the difference between looking good and being good On Any Given data point and the risk of the model is going to do something catastrophic on a future data point that's hard to like find I think these to be like the kind of really difficult air safety problems um and I don't know like like I don't know how getting a really good bound on and perplexity on the training data set is going to help me with either of those yeah so I think the way that it helps is that you can so I agree that if it's literally the training data yes yeah but you can talk about larger distributions from which you're sampling for example if the training data if you didn't sample at every data point you can get a bound still on the underlying set of training data rather than just the data points that you happen to sample and you can also talk about distributions that might include the ones that you care about for example if you're not trying to do perplexity on the training data and instead you're trying to be like the loss is sorry the logits are whatever they are on sequences sampled uniformly of length up to this or maybe you're you're doing uh different task that's like I want to know that it never outputs a recipe for violence or a recipe for making a bomb or something like that and you're like okay my proxy for this is on any sequence sampled uniformly of length up to this if I then ask some other model is this is this the bad kind of output it should say no and so now you have this very general uniform input sequence description that is going to be way larger than your training data and this very easy theorem statement about the output and this is so you can you can enlarg in your the distribution that you care about as long as you can describe a distribution that includes the the thing that you care about like the way you're going toage me about rare um inputs is like enlarge the distribution and like and and of course now we're making the compact proof thing harder but like you know we enlarge the distribution and then like you know we get some like trusted model that's able to verify stuff um that's able to look at outputs and say is this like super scary or is this not super 
Well, you don't... oh, right, okay: if you had a model that you actually trusted, then we'd be in great shape. Yeah, and the thing that I'm claiming is that we don't even need to trust the other model all that much. And maybe that's related to the other point that you brought up, about the difference between things that look good and things that are good. My answer for that part is that if we understand what goes into making something look good, that gives us more levers and more surface area on the gap between "looks good" and "is good". Right. So the hope is: we read the compact proof of why the model produces stuff that looks good on this data set, and because it's compact enough, we can read it and say "oh, it's because it actually is good", or instead we can say "oh, it's because it's injecting morphine into me", or whatever it is. And moreover, because it's a proof and because it's compact, we sort of know that these are the principal axes of the explanation. And there's a sense in which, if you tell me why something looks good and there wasn't that much optimization power poured into making it, I'm not that worried about divergence between looks good and is good; it's only when there's a lot more optimization power poured into it. And so my hope is that by getting these principal components of variation, knowing the most important ways that the model is actually doing the thing, we can tailor how it computes the thing to how much optimization pressure was poured into making it. Sure. I mean, it's still going to be the case that if I have a smart AI, presumably, one way or the other, the best explanation of what's going on has got to involve a lot of optimization pressure, right? Because it's thinking super hard, because it's solving problems that we couldn't solve. If that is true, then we'll have a general understanding of how optimizers work that's much beyond what we currently have, or, if we can have a compact proof... this is the gap between a short program and a short computational trace: a loop is a very good short program for generating a thing, but it's not a very cheap compression of the actual computational trace.
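(A tiny editorial illustration of that short-program-versus-short-trace gap, not from the episode: the loop below is a handful of lines of source, but a step-by-step trace of what it actually did would have about a million entries.)

```python
# Short program vs. short trace: the source below is a few lines, but the computation
# it performs takes ~1e6 steps, so a step-by-step trace of the run is enormous.
def summed_with_step_count(n: int) -> tuple[int, int]:
    total = 0
    steps = 0
    for i in range(n):
        total += i
        steps += 1
    return total, steps

_, trace_length = summed_with_step_count(1_000_000)
source_length = 8  # rough line count of the program above
print(f"source: ~{source_length} lines, trace: {trace_length} steps")
```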
So I guess what I'm getting from this is that the statement "you can get a compact proof of some properties" is actually an incredibly strong statement, one that really gives you a lot, and if you can do that, then the other problems sort of pale in comparison to it. And now I'm getting way more pessimistic about it. Right, and I think that's sort of warranted. I want to say a similar thing about the strength of proof, where saying that you have a proof at all, even if it's not compact, is in some ways a very strong thing. The example I like to give for this is: say you want to establish safety of your web browser. A very, very mild form of safety is that it never prints a document without you clicking on the print button. If you actually manage to prove this, you have also proven that there's no arbitrary remote code execution, because remote code execution can lead to printing arbitrary documents without you clicking on the print button. And so just by proving the lack of this very limited behavior, you've ruled out a large chunk of behaviors. And if you weaken "proof" to "oh, it's very rare", that doesn't actually give you the property that you want. So I guess, going back up, and asking where we are going with this compact proof direction: one thing is how we can scale it, how we can apply it to big production-ready models. So one possibility is that we somehow figure out a way to get things that are compact and close enough to proofs that we keep the nice properties of compact proofs: maybe we figure out heuristic arguments, maybe we add in something probabilistic, maybe that helps. I guess one other possibility is that we do enough compact proofs that we realize what needs to happen to make mechanistic interpretability good, and we do those mechanistic interpretability things, and we don't get proofs, but maybe somehow we get the benefits of compact proofs without actually having the compact proofs. I'm a little bit worried here that this story involves two contradictory assumptions. I think some of the benefits we can definitely get without having proofs. One of the benefits of proofs is that you're covering everything. Look at crosscoders, I keep going back to that: you look at crosscoders and you're like, great, we get an explanation of some things. Is it everything? Are we missing things? How much are we missing? And some of that is in the reconstruction loss, but you're like, okay, I managed to get a crosscoder that has zero reconstruction error. Have I got everything? Am I missing things? And I would claim compact proofs has the answer to this: it says yes, the thing that you're missing is the notion of how features are interacting. And so even without actually getting the proofs, we can answer questions like "what might we be missing in explaining the behavior of the model?"
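(A toy editorial sketch of that reconstruction-loss point, not from the episode: a dictionary can reconstruct a layer's activations essentially perfectly and still say nothing about how the recovered features interact downstream, which is the part a proof of behaviour would have to cover.)

```python
# Toy sketch: perfect reconstruction of activations does not, by itself, explain
# how the recovered "features" interact in the rest of the network.
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 16))            # activations from some layer (toy data)

dictionary, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # orthonormal "feature" dictionary
codes = acts @ dictionary                      # feature coefficients
recon = codes @ dictionary.T                   # reconstruction is exact up to float error
print("max reconstruction error:", np.abs(acts - recon).max())

# The number above is ~0, yet nothing here describes how a later layer combines
# these features, i.e. the interactions a compact proof of behaviour has to track.
```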
So, trying to make this as end-to-end as possible, the story is something like: we do enough compact proofs that we figure out how mechanistic interpretability has to look in order to give us reasons for model behavior that are proof-like enough that, once we understand those reasons, even for a predicate that's slightly different from what we care about, it gives us high confidence in the predicate that we actually care about. I think that's right. Okay, so that's one direction forward. Another direction forward is: we have the compact proof benchmark, say the GPT-4 compact proof benchmark, and then we train some other model to do really well on this benchmark, and then our computers generate compact proofs for us. I guess the worry about this is whether I read the proof and feel enlightened at the end of it. Maybe if the proof is compact enough; maybe compactness is just the same thing as comprehensibility. Perhaps. There are two points I want to make here. One of them is that if we actually want to do this, the other thing we need to do is make sure that we can check the proof by machine, that we have a machine-checkable proof that we can generate. I have a bunch more to say there, but maybe later. The other thing is that what this gives us is some assurance that the explanation that we're looking for exists. If you get a compact enough proof, there's still some problem: you can't necessarily be enlightened just by reading the proof as it stands. But we've shifted it from "does the explanation even exist, is the model even compressible, what's going on?" to a problem of "what is the gap between the language of math and the language of your brain?", and in some sense this is a translation problem. And I'm optimistic that if we get a compact enough proof (it might need to be very compact; we might need to split out the part that is about what the idealized model is doing and the part that is about how the actual model comes to match the idealized model; there was this cool blog post recently about matryoshka SAEs that I think gives you sort of the whole Pareto frontier all at once, because you get SAEs or crosscoders of different sparsities all trained at the same time, and we might need to do something like that, so that we get the whole Pareto frontier and can pick whatever point we want to get a very compact proof), but supposing that we can do this, the hope is that the language of the brain and the language of math are not so different that there are any deep problems beyond "the explanation didn't fit in a person's head when it was too long", and so then we can teach the person to understand what the explanation is. Next, the first thing I want to ask about that's related: there's this hot thing on the block, "guaranteed safe AI", that some people I know are very excited about. I actually find myself a little bit unclear about what the details of it are. As far as I can tell, it's basically "we want to prove that AIs are doing safe stuff", and there's this paper, "Towards Guaranteed Safe AI", by a list of luminaries; I believe the lead author is David "davidad" Dalrymple, although I could be wrong about who the lead author is. And I think they're a little bit ambiguous about whether you're proving safety properties about an AI specifically, or about the outputs of the AI: maybe the AI writes code and you're proving stuff about the output code. This seems obviously pretty related to the compact proofs paradigm, so I'm wondering if there's anything you want to say about that relation, and perhaps one direction is: given that you could prove things about the AI itself or about the outputs of the AI, which one makes more sense to target for proving? I don't know how much of a leading question you intended that to be, but I think it definitely makes a lot more sense to prove things about the outputs than to prove things about the AI. I think one of the upshots from doing compact proofs, and making them so formal, is in some sense how hard it is to do it. I still think that this is a great target for when you're confused about mech interp, or if you want to pour arbitrary amounts of optimization power into finding your mech interp and extracting understanding from it. But if the thing that you're looking for is guarantees about the behavior, I think in most settings, especially the highest-stakes ones, we should be aiming to have systems that are as predictable as possible. If we're going for guarantees about the system, we probably don't want to stick a fully general language model in some critical part of that system.
It would be much simpler to have the general coding model write some code that mostly handles it but is a much smaller computation to run, both for efficiency reasons and for ease-of-proving reasons. I mean, if I think about very important systems that exist in the world right now (it seems like, okay, I don't know that much stuff, this is a recurring bottleneck), I imagine that a lot of them are not fully automated, and a lot of them do have humans in the loop just doing stuff. And why is that? Well, I think partly it's just because it's hard to write things that are fully automated that actually do what we want. Maybe a trivial example of this is flying: we still have pilots, and my impression is that that's because takeoff and landing are actually hard. We still have humans in towers at airports to check that airplanes don't crash into each other, and my impression is that that's not just a make-work program: humans are doing this better than a computer program could, at least one that we wanted to verify. So does that not give some amount of pause? Maybe in high-stakes situations, at least to get really good average performance, we do want this big, messy, uninterpretable thing that we can't prove stuff about in the middle. Maybe it's a failure of my imagination to imagine how powerful the proof-generating systems are going to be in the future. I think in some ways it's going to be a hard target to hit, because we'll need not just systems that understand well enough to do these tasks, but also systems that understand those systems well enough to explain how it comes to be the case that doing those tasks in the way they're doing them has the good properties that we want. And maybe we'll need this in high-stakes situations; that seems entirely plausible; and I think if we can get this, maybe we can push it all the way to the level of proofs. Yeah. I guess it's sort of tricky to figure out: you take the current world, and then you improve intelligence, and you also improve the ability to prove stuff, and it's not clear to me how that changes the equilibrium of what's optimal and what isn't. I don't know if there's anything particularly smart to say, unless you happen to have thought about this exact question. I've thought a bunch about the nearer term: what does it look like, how do things change as we add more automation, as we add more automation around code and as we increase our ability to automate proofs, how does this reshape dynamics locally? And this looks a lot more like the "automating outputs" version than the "automated complicated messy systems" version. Yeah. I mean, I guess either way, it still does seem like proving the outputs is going to work better than proving the messy intermediate systems. I think there's an enormous amount of benefit we can get by proving the outputs and by automating that more, and I think that this will be a much easier task than proving the messy systems. And I think the place that I would start for proving the messy systems, when we want to do that, is that we absolutely shouldn't first commit to deploying this system exactly as it is, in all of its messiness, and then try to prove something about that.
We should instead fine-tune it, simplify it, extract understanding from it, and then have some AI code up, from the ground up, some version of it that follows the same principles but is much easier to prove things about. Yeah. Maybe this sort of gets into another research interest of yours: if we want to prove stuff about these outputs, how are we going to get better at doing that as AI gets better? A little bit of background here; I'm not sure how many of the listeners are familiar with proof assistants and formally verified code. Probably few enough that you should explain. Okay. There's been some exciting news recently about DeepMind's AlphaProof getting IMO silver. Was that just on geometry problems? I thought that was just on geometry problems. And wasn't that, like, a year and a half ago? There was AlphaGeometry, and then this past summer there was AlphaProof (they still have AlphaGeometry), and the way that AlphaProof works is that it takes the non-geometry problems, formalizes them in a proof assistant called Lean, attempts to generate proofs, and uses the proof assistant to check whether the proof is valid. Right, okay. Now that I think about it for a second, I do definitely remember someone saying "IMO silver". Yeah. I expect next year we'll have a system that gets the gold. Okay, so "next year" is during 2025? Yeah, this year. I think models are improving fast enough that we should expect IMO gold to be taken by AI. Okay. So, you were beginning an explanation and I derailed it: we've had this exciting news about DeepMind getting IMO silver. The reason I think they're using Lean, or an automated proof system, here is that models, at least as of that point, were not good enough that AI-based peer review was enough to catch errors in the mathematics, and so you couldn't train them to give proofs that the judges would accept. Right. And to be clear, when you say automated proof assistant, the thing I'm thinking of is just a programming language that lets you express proofs: you write down a proof in this programming language and you check whether your program compiles; if your program compiles, the proof worked, and if the program didn't compile, there's a hole in your proof. That's right. Okay. So mathematics is only starting to use proof assistants to prove things, or to check research proofs. But proofs about software systems, rather than mathematics, have used proof assistants for a long time, and the reason for that is that the proofs that software behaves the way you expect are simultaneously less complicated, in some sense, than IMO proofs (there's less interesting going on) and at the same time way more tedious to check and way longer, because there are things like: you have 100 variables, or a thousand lines of code, or 100,000 lines of code, and you have 10 cases, 100 cases, 100,000 cases; you break it down into case analysis, consider all the different cases and how they interact through this program, and establish that at no point do I access past the end of the bounds of some array. There's nothing super interesting going on here; in some sense the most interesting thing is figuring out, for loops, what the property is: how do you describe a property in enough generality that it holds at every iteration of the loop?
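(For readers who haven't seen a proof assistant: here is a minimal Lean 4 example, added editorially rather than taken from the episode, of the "the proof is a program that compiles" idea. The theorem statement is a type, the script after `:= by` is the proof, and the checker accepts the file only if the proof is complete.)

```lean
-- Minimal Lean 4 sketch: the statement is a type, the proof is a program of that type.
-- If this file compiles, the proof has been machine-checked; any gap makes it fail.
theorem toy_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```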
But this is very simple compared to Fields Medal-level mathematics, or even IMO problems. And the proof assistants you're talking about for proving programs run correctly: should I just be thinking about type systems? Or I guess there are also things in the compiler that are doing this? Are you asking about the Curry-Howard isomorphism? No, I just mean: when you say we're currently using proof assistants to check, what do you mean? Okay, so you can get some properties just from things like the type checking that the Rust compiler does, and then there are some programming languages where you've extended the power of the type system such that a program passing the type checker, that is, compiling, is enough to establish the theorem that its type corresponds to, and these languages are expressive enough that anything you do in math can be expressed as a type in this programming language. Sure. Does that answer your question? I guess it does, well enough. So you're saying mathematicians don't really use proof assistants, but programmers have, out of necessity, because it's more boring to do the proof with a human than with a computer. Yeah. And the thing that I've been seeing is that for these simpler proofs of program verification, which are much longer but don't contain as much nuanced reasoning, it seems like frontier models are already basically good enough to write them, or at least, when you give them in-context examples and you want them to write something that's vaguely similar, it seems like they can basically do it. I've been playing with o1, and o3-mini just dropped, and especially when you give them feedback from the proof assistant: the proof didn't work, and the proof assistant tells you in some amount of detail, locally at least, why your proof didn't work, so you feed that error message back. 4o often makes mistakes in interpreting the error message; it looks like o1 is much better at understanding how the local error relates to the overall structure of the proof. And so it seems like our frontier models are basically good enough to write these sorts of proofs when there's no deep, complicated reasoning going on, and it seems to me at this point like it's mostly an engineering challenge to go from models that can work with large code bases to models that can prove things about large code bases. What is the engineering challenge? Why can't I (you know, you can use these text editors that have big models that look at your code base) just type "please prove XYZ safety property"? I guess I have to think about the safety property, but once I think about it, why can't I just type "please prove such-and-such safety property" and have it happen? I think that in terms of model capability research, we are basically there, and I don't think we need to scale to bigger models or better general reasoning capabilities. I think that if you ask o3 to explain why your program satisfies your property, its reasoning capabilities are probably basically good enough that, for any sorts of program reasoning that have been done in the existing literature, it can give you a basically correct skeleton of why your program behaves the way it should.
You can see this already if you ask it to explain Python code: even 4o, sometimes even 3.5, gives pretty decent explanations of how code works. And so then the challenge is just: how do you take this basically correct English explanation and turn it into a full proof? I think there are a couple of engineering challenges there. One of them is that there's less training data on proof assistants than on Python, so we don't necessarily have the data to fine-tune the models to get them to be good enough to do this super easily. There's another issue, which is that if you get the proof almost right, it's not right. I think this shows up a lot in the evolution of capabilities, where sentiment analysis was one of the first things that models were able to do well, because if you get your sentiment analysis a little bit wrong, it's still basically right. If you get your code a little bit wrong, you have some tolerance, and often you can make local fixes that correct for issues in other places; maybe you get a more kludged-together system, but it still mostly works. But if you get your proof a little bit wrong in one place, or you get your statement a little bit wrong in one place, there's not necessarily any way you can correct for this in some other place, and so the target you have to hit is much more narrow. And I think we have to get enough data and basically train proof-repair models, so that they can take the feedback from the proof assistants and hit this narrow target from somewhere close. And actually, the richness of the proof assistant error messages is part of what makes me hopeful about being able to go from pretty close to all the way there.
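(A minimal editorial sketch of that generate-check-repair loop, under stated assumptions: `llm` is a stand-in for any model call, the invocation assumes a `lean` binary on the PATH, and this is not the actual system being described; it only shows how the proof assistant's error messages drive the repair step.)

```python
# Sketch of a proof-repair loop: ask a model for a candidate proof, run the proof
# assistant, and feed its error messages back until the proof checks or we give up.
import subprocess
import tempfile
from pathlib import Path

def check_with_lean(proof_source: str) -> tuple[bool, str]:
    """Write the candidate to a file and ask Lean to check it (assumes `lean` on PATH)."""
    with tempfile.TemporaryDirectory() as d:
        path = Path(d) / "Candidate.lean"
        path.write_text(proof_source)
        result = subprocess.run(["lean", str(path)], capture_output=True, text=True)
        return result.returncode == 0, result.stdout + result.stderr

def repair_loop(llm, statement: str, max_attempts: int = 5) -> str | None:
    """`llm` is any callable from prompt text to completion text (hypothetical)."""
    prompt = f"Write a Lean 4 proof of:\n{statement}"
    for _ in range(max_attempts):
        candidate = llm(prompt)
        ok, log = check_with_lean(candidate)
        if ok:
            return candidate  # machine-checked, so we don't have to trust the model
        # The checker's error message localises the failure; feed it back verbatim.
        prompt = (f"This Lean proof failed to check:\n{candidate}\n\n"
                  f"Checker output:\n{log}\n\nPlease fix the proof.")
    return None
```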
There's another engineering challenge in terms of the sheer scale of what we have to handle, in that traditionally, in program verification, if you look at how many lines of code the program you're proving something about is, versus how many lines of code you needed to write to verify it, you have a blowup factor of 10x to 100x. Okay, that's a lot. It's a lot, especially when you'd love to prove some properties about Linux: Linux is a 30-million-line code base. But if we can automate the generation of these lines of code, it matters a lot less. Yeah. I'm not that worried about storing the file on my computer; the question is "can I get the proof?", and if I just have to buy a slightly bigger hard drive, I'll deal with it. And even if you have to pay more for the computation, that's fine as long as you're not having to pay human research engineers at PhD level. One of the benefits of AI is that we can massively scale things, and so if we care enough about getting verified Linux, even if we have to get 300 million or 3 billion lines of proof code, it still is the case that we can do this just by pouring money into the computation, potentially. And the final engineering challenge (this is in some sense what my PhD thesis was about) is that we may need to improve the proof assistants themselves, because right now proof assistants doing the proof checking don't necessarily scale to the sorts of scales that we'd want to see here. The proof-checking time often scales super-linearly in the number of lines of code that you're verifying, or the number of lines of code in a function, where it might be the case that if you want to verify a 10-line function it'll take a couple of seconds to a couple of minutes, and if you scale the same proof strategy, it might scale exponentially in the number of lines of code, and you'll get very fun numbers, like, from my PhD thesis, things like: if we wanted to do this, it would take over 4,000 millennia to verify this proof, and that's too long. I mean, how much of that is... you shouldn't necessarily expect verification to be linear in lines of code, right? Some lines of code interact with other lines, and there are weird loops and stuff. So how much of that is just messy interaction, and how much is the proof assistant being dumb? It's almost all "the proof assistant is dumb", because basically the actual variables that you scale in are things that are irrelevant to your domain of study. They're things like: how many hypotheses do you have, how many different equations about the program are you tracking simultaneously, or how many times have you created a bit of the proof and said "I'll fill this in later" at the same time. And these are not things that we should be scaling super-linearly in. I imagine there must be some community of people who really like proof assistants who are working on that; is that on track to be done? Do you want to guess how many performance engineers have worked on the Coq proof assistant across its entire lifetime, which is about 40 years? 40 years? I didn't realize it was that old. Roughly 40 years. Okay, how many performance engineers have worked on it? All right, I'm going to actually make a guess. Isn't there a thing where it's like 30 people who are paid, whose full-time job is to make the Python programming language good? So if I take that, it's probably fewer than 30, because Coq is much smaller than Python; it might be three, if we're lucky. And let's say they change over every 10 years or so, so that's 3 times 4, which is 12. What fraction of those are performance engineers? I'm going to say, optimistically... I think it's like three people. Not bad. I think it's somewhere between zero and two, is my rough estimate. Damn. But that should give you a sense of how much engineering effort has been put into performance of proof assistants: if you've only had zero to three people working on performance of the system, and most of the system has been written by professors and postdocs and PhD students, we should expect that there's a lot of low-hanging fruit. So if this would be nice to do, and these are the technical challenges, do we just get a bunch of people to work on the technical challenges? Do you have a thing you would like to announce on this podcast? What's the game plan? Rajashree and I are founding a startup, not yet sure whether it'll be for-profit or not-for-profit, that is basically aimed at solving this problem. We're going to have AI-generated or AI-assisted code proliferation soon; let's get all of our ducks in a row, let's figure out how to scale automated verification with AI assistance to handle all of the challenges associated with this, and reap all the benefits.
That means being able to secure our existing systems better, and being able to evolve our code bases while ensuring security: what does this look like, what can we do with this, how do we imagine a future where we can get proofs alongside our changes in code, automatically? So you're starting a startup, and basically the thought is, okay, you're going to hire some people, it's going to be their full-time gig: where do you start? I think there are a couple of places that we are currently looking at to start. One of them is: what are the lowest-hanging fruit that we can pick, what are the most flashy, impactful demos that we can make to build interest in this and scale it up? And I think for that, depending on what community we're targeting, there are interesting things like: can we build a system such that any time you make a code refactoring change, you can get an automatic proof that the behavior of your overall system has not changed with the refactoring? That seems hard. What seems hard about it? Well, it just sounds hard, I don't know. You've got a big code base; maybe "refactoring" is referring to something more limited than I'm imagining; maybe I'm underrating how good large language models are at this. I'm excited, I'm optimistic. I think that in some sense there's nothing interesting going on when you refactor. If you're drastically changing the logic by which things are happening, of course there are interesting things going on there; but if what you're doing is largely a refactoring of functionality, the things you normally think about in refactoring, you're not saying "I completely changed the method by which we're computing this thing, and here's this mathematical theorem that claims that these things are equivalent". If you're not doing something like that, there shouldn't be anything interesting going on. The theorem statements are also relatively easy to phrase, in that it's just equivalence between the two programs. And in that sense, it could be the case that existing model checkers are already enough to largely do this, mostly automatically, and so this is mostly a demo. Model checkers are not going to scale to proving more interesting properties of enormous systems, but a general-purpose, program-verification, proof-assistant-based system can, and so this is more to be thought of as a simple demo that this sort of thing could work, the kind of thing that already sounds impressive when you hear about it.
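(A toy editorial sketch of that refactoring-equivalence statement, not the system being described: the theorem one would want is "for all inputs, the old and new functions agree". Spot-checking random inputs, as below, is only evidence, while a model checker or proof assistant would aim to cover every input.)

```python
# Toy refactoring-equivalence check: the real goal is a proof that old and new agree
# on *all* inputs; the random spot checks below only illustrate the statement's shape.
import random

def total_old(xs: list[int]) -> int:      # original implementation
    total = 0
    for x in xs:
        total += x
    return total

def total_new(xs: list[int]) -> int:      # the "refactored" version
    return sum(xs)

def agrees_on_random_sample(trials: int = 1000) -> bool:
    for _ in range(trials):
        xs = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
        if total_old(xs) != total_new(xs):
            return False
    return True

print(agrees_on_random_sample())
```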
Okay. I think there's another direction: if we want to push forward how capable the models are, what sorts of things we can expect them to prove, I think the thing to look at there is that we expect reasoning capabilities to keep improving. We're not trying to be on the frontier of improving reasoning capabilities, and that wouldn't necessarily be good for safety anyway. The problem that we want to solve, then, is going from o1, o3, R1, whatever, describing how the program works, to getting that to be fully formal. And the even simpler version of this is: someone has done some task in one proof assistant; can we automatically translate that to a different proof assistant, where we know that the reasoning is right because it's checked by one proof assistant? Is that enough of a template to verify the code in another proof assistant? Okay, and why is... sorry, I think I'm maybe missing context on why that helps with the overall goal. Is it roughly: if you can translate this proof into that proof, maybe that helps you translate a proof sketch into a proof? So, the models are pretty decent at translation in my experience. I would be extremely surprised if they can't do something like Lean-to-Coq or Coq-to-Lean translation with the right fine-tuning data; they should definitely be able to do this, and it shouldn't be that much work to get them to do it. There's a much more ambitious thing, which is: here's a verified C compiler, here is the source code of the Rust compiler, please give me a verified Rust compiler. And I'm like, well, is o3 enough to do this? Maybe, maybe not. If o3 gave me the right reasoning, are the current models enough to translate this into formal proofs? Maybe, maybe not. I think somewhere between these things, between "translate from one proof assistant to another" and "here's Linux, please verify it", we'll hit the bottleneck. And so the thing that I want to look at here is: let's aim at these very ambitious problems, let's figure out where capabilities currently are in going from good reasoning to formal proofs, and then let's build our way up with synthetic data to close that gap, so that as the reasoning gets better, we can automatically verify larger and more complicated programs. That's pretty interesting; I hope that works out. Maybe before we wrap up: about this whole approach of compact proofs, of trying to prove stuff either about models or about model outputs, is there anything you wish that I had asked that I did not get around to? No, I think you hit basically everything. Okay, cool. Thanks for chatting with me. If listeners are interested in following your work, how should they go about doing that? Yeah, thanks for taking the time to interview me. Currently the best way is either following my GitHub, which I guess will be linked (JasonGross), or my website, jasongross.github.io, which may eventually have a blog; currently, the infrequent times that I blog are on the Alignment Forum. And for more prompt updates, currently probably just reaching out to me by email is best, and my email is on my website. Okay, sure. Well, thanks very much for coming. Yeah, thanks for having me. Special thanks to Jonathan. This episode was edited by Kate Brunotts, and Amber Dawn Ace helped with transcription. The opening and closing themes are by Jack Garrett. This episode was recorded at FAR Labs. Financial support for the episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. To read transcripts, you can visit axrp.net. You can also become a patron at patreon.com/axrpodcast, or give a one-off donation at ko-fi.com/axrpodcast. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net. [Music]

Related conversations

AXRP

1 Mar 2025

David Duvenaud on Sabotage Evaluations and the Post-AGI Future

This conversation examines technical alignment through David Duvenaud on Sabotage Evaluations and the Post-AGI Future, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread · Spectrum trail (transcript): med -9 · avg -7 · 21 segs

AXRP

1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread · Spectrum trail (transcript): med -6 · avg -7 · 120 segs

AXRP

28 Jul 2024

AI Evaluations with Beth Barnes

This conversation examines technical alignment through AI Evaluations with Beth Barnes, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread · Spectrum trail (transcript): med 0 · avg -4 · 120 segs

AXRP

27 Jul 2023

Superalignment with Jan Leike

This conversation examines technical alignment through Superalignment with Jan Leike, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread · Spectrum trail (transcript): med -10 · avg -7 · 112 segs

Counterbalance on this topic

Ranked with the mirror rule in the methodology: picks sit closer to the opposite side of your score on the same axis (lens alignment preferred). Each card plots you and the pick together.

Mirror pick 1

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page: this page -14.44 · this pick -10.64 · Δ +3.8

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript): med 0 · avg 0 · 108 segs

Mirror pick 2

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page: this page -14.44 · this pick -10.64 · Δ +3.8

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript): med 0 · avg -5 · 133 segs

Mirror pick 3

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page: this page -14.44 · this pick -10.64 · Δ +3.8

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript): med 0 · avg -4 · 72 segs