Peter Hase on LLM Beliefs and Easy-to-Hard Generalization
Why this matters
This episode strengthens first-principles understanding of alignment risk and the strategic conditions that shape safe outcomes.
Summary
This conversation examines core safety questions through the episode "Peter Hase on LLM Beliefs and Easy-to-Hard Generalization", surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Perspective map
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forward · Mixed · Opportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
Across 124 full-transcript segments: median 0 · mean -1 · spread -24 to 0 (p10–p90: -6 to 0) · 2% risk-forward, 98% mixed, 0% opportunity-forward slices.
Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.
- Emphasizes alignment
- Emphasizes safety
- Full transcript scored in 124 sequential slices (median slice 0).
Editor note
A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.
Play on sAIfe Hands
Episode transcript
YouTube captions (auto or uploaded) · video tEge5uo4E-A · stored Apr 2, 2026 · 3,627 caption segments
Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/peter-hase-on-llm-beliefs-and-easy-to-hard-generalization.json when you have a listen-based summary.
Full transcript
Hello, everybody. This episode I'll be speaking with Peter Hase. Peter is an AI researcher who just finished his PhD at UNC Chapel Hill, where he specialized in natural language processing and interpretability research, with a special interest in applications to AI safety. For links to what we're discussing, you can check the description of the episode, and a transcript is available at axrp.net. All right, Peter, welcome to AXRP.

Yeah, thanks so much, Daniel. I'm excited to be here.

Yeah, I'm excited to have you on. So my understanding is that most of your work is in interpretability, roughly interpretability in language models. Is that fair to say?

Yeah, that's right. I've been in an NLP lab for my PhD, so we work mostly with language models, but a lot of it, in terms of methods and evals, has been focused on interpretability.

Actually, maybe one thing I want to ask is: I have the impression that you were into language models even before they were cool, doing NLP before it was cool, right? Just today I looked at your Google Scholar and scrolled down to see the oldest paper you were a co-author on, and it's a 2018 paper on algorithmic sonnet generation. So, 2018, before the rest of us had caught up. What got you interested in NLP?

Yeah, well, I remember the project you're talking about. That was when I was an undergrad. I feel so lucky to have had the opportunity to do a little special projects class with Cynthia Rudin at Duke in my undergrad. And gosh, I was interested in language: I was interested in psychology and linguistics in my undergrad, very interested in language, and increasingly interested in machine learning and statistics, so it was just a great intersection of those topics. And learning about language models, which at the time were really LSTMs, and getting the opportunity to apply those to a fun task... even then, of course, I admitted it wasn't necessarily that consequential, poetry generation, but it was certainly a great opportunity to work with language models a bit.

Yeah. I guess, like you mentioned, the bulk of your work has been in interpretability. What got you interested in that aspect?

Yeah, that was also an interest that developed in undergrad. A long time ago there were many different arguments put forth for why interpretability was a good thing to study. At the time there was definitely a very intuitive draw, and there still is: it's just good to know how these things work. It's just good to know how AI systems work, how language models work; they're doing increasingly interesting things. At the time there was so much progress in vision, and this was 2018, so there had been a lot of progress in RL. For language models, I think by 2018 GPT-1 was out, and I think GPT-2 was coming out like spring 2018 or '19. It was just very clear that these systems were making a lot of progress and doing fundamentally interesting things. And from a safety perspective, it's like, gosh, we should know how they work. We should be able to explain their decision-making process.

This is kind of a broad question, but how would you say interpretability is doing as a subfield?

Well, this is a great question. I tend to be optimistic when talking with people from other subfields, or people working on capabilities research or other research areas, and I probably come off as a little bit pessimistic when I'm talking with my colleagues about this. Let me be clear: there's definitely a lot of progress being made. We have better evals, we have a better understanding of when we have ground truth, when we're kind of speculating about the reasoning process, what it would mean for an interpretation or an explainability method to be useful, and what it could be useful for downstream. This picture has just become a lot clearer in the past five to six years. One of the reasons I'm pessimistic, at least when it comes to colleagues, is that we just end up talking about all the false positives and false starts: oh, the reason this result didn't hold up was because of this, or the reason that result didn't hold up was because of that. I think some of this is decently high-profile; people might know about things like feature attribution or saliency maps. This was a popular and one of the first major methods for trying to get a sense of what neural networks were doing. You could think of this as being like a 2015 to 2016 method, which was to say: okay, if you have a vision model, what's it looking at? Is it looking at the dog in the image, is it looking at the background, is it looking at the human in the image? And people were really excited, because this was one of the first ways to generate... I mean, the images just looked good, the visualizations looked good, and you could say, wow, it really seems like the neural network thinks this is a husky because it's looking at the snow in the background and not because it's looking at the dog per se. So people were really excited about these methods, and then, if you worked in the subfield for a while, you know how these methods have had a bit of a fall from grace. They didn't turn out to be useful in human studies for the most part, and there's been theoretical work showing that some of these popular feature attribution and saliency methods can do no better than random in certain settings. There have been a lot of hard-learned lessons in the subfield in terms of what to trust and what's promising to run with in the long term.

I'm wondering if you have an agenda of questions you think it's important for the interpretability field to answer, and if so, what are they? Where should we be looking, what should we be aiming for here?

I think we're still in the stage of figuring out what methods are good and what evals tell us when we've created something useful. I don't think we're yet at the stage where we have the tools and we're mainly interested in detecting bad reasoning processes or detecting deception in language models. We're not yet at the stage where we're just trying to catch safety failures; we're still at a stage where we're trying to build tools that would let us catch safety failures, and we're trying to build evaluations for the tools so we know which tools would work for that. So it's still pretty upstream. Let me stop there and actually ask you to maybe elaborate on the question a bit, though I could keep going.
Yeah, I guess it seems like a view of sort of more basic science: the reason to do interpretability is that we know something about understanding model behavior, understanding model internals, why things are happening, something around that, is going to be useful, and if we did some more basic science we would just have a better sense of what the important questions to ask are. Is that roughly a good gloss of your view, or am I missing something?

Okay, thanks. Yeah, so the research stage we're at, I think, is still basic science. Let me give what I think: I gave the kind of intuitive motivation before for interpretability, and I think it's an exciting area. We want to know how these things work; they're so fascinating, they do such interesting things, we want to know how they work. That's the intuitive pitch. I think the strongest pitch that has emerged over time for interpretability research is that we need something that goes a little bit beyond testing. All models get tested on all kinds of datasets, benchmarks, evals. We're looking for dangerous behaviors, we're looking for dangerous capabilities, we want to know what kinds of reasoning and knowledge models possess. So really, I think the best pitch for interpretability is: what can our tests not catch? One thing that our tests can't catch, a lot of the time, is the underlying reasoning process. If we just have a huge multiple-choice exam that is going to tell us whether the models have dangerous bioweapons development capabilities, or whether the models have really strong theory of mind such that they could operate in a social setting and either be cooperative or intentionally deceptive; if we just have surface-level, prompt-the-model-and-see-what-text-it-outputs kinds of tests for that, we can't test every scenario. We just can't exhaustively test every relevant scenario. There are settings where we'd be interested in deploying the model where it's interacting with people, and it might be more or less knowledgeable or aware that it's interacting with people, and there are going to be settings where we can't actually test the thing we're interested in, or we can't exhaustively test the model. And that's especially the setting where we want to open up the hood and figure out what's going on inside, and be able to say: okay, yes, it did this multiple-choice problem correctly, and it had a really impressive, strong reasoning process for how it got there, and we're pretty sure it's actually going to generalize beyond just the things we're testing. Or: we haven't deployed it in a setting where it is cooperating with people in the real world yet, but we've leveraged some interpretability method to say, yes, this model fully intends on cooperating with people. Even if it knew that it could slightly better optimize one of its incentives at the expense of harming a human, we know it wouldn't do that, because we've been able to truly inspect its reasoning process underneath.

So we've been doing this interpretability work for a while, right? What things do you think we've learned out of interpretability research?

Yeah, so I think one big thing we've learned is that we really want the methods to be useful for some downstream purpose. That means when we have an interpretability tool, say we're inspecting what features the model relies on, we want that to enable us to do some debugging, we want that to enable us to catch failures we might not have been aware of before, and we might want it to make the decision-making process more clear to an end user so they can decide whether it was reasonable or not. This comes up, for instance, in this problem called algorithmic recourse, where a person is being classified by a model and they want to understand why they got the decision they did. So a lot of it is increasingly gearing our evals towards these downstream use cases, so that we can make sure we're actually getting good signal on the methods we're developing. That's one broad lesson, I think, and I could say a little bit more about some of the more upstream evals that still seem valuable to me, but that's one lesson.

So that's figuring out methods that actually improve the safety of models in some tangible way, basically.

And, going back to what I said before, especially in a way that's complementary to testing, or complementary to other kinds of evals. I think one of the other lessons, and this will be more object-level, is that we're learning that language models can generate really plausible-sounding textual explanations of their decision-making that aren't actually how they're reasoning. This is just an immediate, object-level lesson about how language models work. Their ability to chat with people is really impressive, their ability to offer justifications for their answers is really impressive, and we're starting to catch them out in inconsistencies via some cleverly designed tests that show that what the model says is not really what it was thinking, a lot of the time. That's, I think, a very important insight in terms of interacting with models in text, and I'd say that's more the natural language explanations category in terms of research stream. And then there's this area of mechanistic interpretability, and other kinds of probing research historically in NLP, where I'd say we're really gaining traction on figuring out how models represent things. A lot of the work in 2015 and 2016 was focused on looking at the input: what part of the input is this model looking at? For vision models you'd get heat maps that would light up over a part of an image the model might be looking at, and in the text setting you'd get text highlights, so you'd say, oh, it's these words, we're going to highlight them, and that shows you what the model's looking at. I think we're really starting to go deeper than that. We're really starting to be able to say, okay, here are the hidden activations in the model. And there's been one development I'll point out: it might have been that we used to say, here are the neurons, and here are the neurons that represent this or that. There's been some really interesting mathematical progress, I think, on showing it's not just individual neurons, but particular combinations of neurons, that might represent a certain feature. So you turn this cluster of neurons on, and that means the model has definitely detected that this text is discussing soccer as a sport.
You have this other cluster of activations, or neurons, that have been turned on, and now it's discussing soccer as a political phenomenon, or the governing bodies of soccer. These are very abstract features of model inputs, and we're starting to connect the dots between those abstract features and model internals, how the models are actually representing them inside, and then, after that, how the models are using those representations. So we might know that the model has detected something, and now, how is it going to influence the decision? People are developing tools for saying: okay, yes, this feature has been detected, and it plays an important role in the model's answer.

So your first two points of things we learned, that it's important to get some sort of downstream benefit from your interpretability method, or peg it to "does it actually help you do such-and-such task", and that large language models are really good at faking explanations of how they're thinking: these sound to me like kind of negative results, right? You might have thought this thing was true, but it's not true. You might have thought that just because you have a plausible story for why this integrated gradients method tells you something about the model... you're just wrong, and actually you should test it against "does it actually help you do something". You might have thought that if a thing could talk, it's going to say something reasonable, and that's not true. Does that seem like a fair characterization to you?

Yeah, sorry, to be clear, those basically were negative results. I mean, we were realizing that some of our evals weren't really demonstrating external usefulness or downstream usefulness. And the natural language stuff... I think some people, not in the explainability world, when they saw things like these dialogue models get developed, or chain of thought get developed, or RLHF models get developed, and they saw models explaining reasoning in words to people, I certainly saw public perception from NLP people, experts in the field, basically say, "wow, we just almost solved explainability, right?" And it took some additional studies to say: okay, no, this is a result we've seen before; we have a new explanation method, and it still doesn't quite tell us what's going on inside the model.

So if I'm trying to think about what we learned there, it seems like the underlying theme is: you might think that neural networks are sort of neat and tidy, such that there's a place where a thing is happening, and you find the place and you understand the thing, and it's just not true. Somehow the story of interpretability is just falsifying naive models of how neural networks work, and the way we falsify them is we get a thing that seems like it should work and it turns out not to be helpful. And somehow the point of it is to help us realize how alien language models are.

Yeah, I think that's a good way to put it, and I think this is one reason people are starting to notice a need for more ground-truth evals: being able to say, here's what we know the model is doing, because we specifically designed a neural network to reason in a certain way, or to be vulnerable to certain adversarial examples, or to rely too strongly on a certain input. Sometimes people do that with language models, and sometimes people do it with very toy neural networks that learn a specific function, where the goal is simply to figure out what that function is. This is a setting where, to avoid all of the difficulties of an interpretation maybe being right, or maybe being wrong, or maybe being halfway right and halfway wrong, and then trying to figure out what we could possibly use this thing for, we go a little bit further upstream and say: let's just design a system that looks kind of like a black box, but we secretly know exactly what it's doing, and then figure out whether our methods can reliably detect the behavior going on. People are definitely waking up and becoming a little more alert to this kind of research angle. There's some interesting broader commentary on this kind of thing: Chris Olah has this nice figure in some blog post that's like the uncanny valley of abstractions, this valley of abstractions with neural networks. It might be that neural networks start out, in terms of their capabilities, if you're thinking of a small network trained on a small amount of data, basically doing a bunch of hacky stuff and using a bunch of hacky heuristics to solve a problem. But as the models get better, and particularly as they solve harder and harder problems, you begin to think, well, plausibly the reasoning process is going to look a little bit more human, because basically the way you do these math word problems, or the way you do this college biology exam, is just going to require more humanlike reasoning and rely on more humanlike concepts. So there's been this idea that interpretability will actually get easier over time as language models or vision models develop; you can almost think of it as the model's vocabulary becoming more easily translatable into a human vocabulary, or a human language.

Yeah. I guess another thing I wanted to pick up on: when you were talking about advances in understanding the representations of neural networks, you mentioned that we now know that things are represented as combinations of neurons, and there was some math research backing that up. Can you say what you were referring to?

Oh yeah. Something that really put this on the map, and into the public landscape, was Anthropic's superposition work, their Toy Models of Superposition, where they were able to show that in a given representation space, say the dimensionality of the representation space was 784, which is equal to the number of neurons, so you have 784 neurons, you could have a model that actually represents more features than neurons. Immediately this implies that it's not just a one-to-one map, because it's not the case that one neuron means one feature. Mathematically, what that ends up looking like is that features are directions in the latent space, and they're not all orthogonal. Previously, if one neuron was one feature, that's also a direction in the latent space: a basis-aligned direction, right along one axis. So features have always been directions, and we've clued in a little bit more to how features are not just basis-aligned directions but can point in some seemingly arbitrary direction in the latent space. It happens that if you have a thousand features in, say, a 784-dimensional space, you can imagine those features kind of slightly pushing apart, so they're all just the right distance from one another; they're all pointing in some direction, but they're minimizing potential interference between them.
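The geometry described here can be illustrated with a tiny numerical sketch (an editor's addition, not from the episode): random unit vectors stand in for learned feature directions, and their pairwise cosine similarities measure the interference between them.

```python
# Editor's illustrative sketch: packing more feature directions than neurons,
# as in superposition. Random unit vectors in a 784-dimensional space are
# nearly orthogonal, so 1,000 of them can coexist with small interference.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 784, 1000          # more features than dimensions

features = rng.normal(size=(n_features, d_model))
features /= np.linalg.norm(features, axis=1, keepdims=True)  # unit norm

# Interference = cosine similarity between distinct feature directions.
overlaps = features @ features.T
np.fill_diagonal(overlaps, 0.0)

print(f"max |interference|:  {np.abs(overlaps).max():.3f}")
print(f"mean |interference|: {np.abs(overlaps).mean():.3f}")
# Typically max is around 0.18 and mean around 0.03, far from the 1.0 you
# would get if two features were forced onto the same neuron/direction.
```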
Yeah, so I'll point out that work as something that I think did a good job visualizing this, a good job demonstrating it in toy settings. I would go all the way back to probably 2017, with TCAV from Google; this was some work Been Kim led at Google that showed that there could be feature vectors in the latent space. They showed this not really in an unsupervised way, which is basically the way Anthropic showed it, but in a supervised way. So if you have a dataset, and let's say you're looking for how a vision model represents stripes, what you do is you have a bunch of images with stripes and a bunch of images without stripes, you feed all those through the model, and then you learn a classifier on the model's latent space that can classify representations as stripes or not-stripes. With a feature like that, and strong enough models, you often see that there's a direction in the latent space that basically measures how stripy something is, and it was never axis-aligned or basis-aligned to begin with; it was always a direction.
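A minimal sketch of that supervised probing recipe (an editor's illustration, not from the episode; the activations and labels here are synthetic stand-ins for data collected from a real vision model):

```python
# Editor's sketch of a TCAV-style supervised concept probe: fit a linear
# classifier on latent activations and treat its weight vector as the
# concept direction. In real use, `acts` and `has_stripes` come from running
# labelled stripes / no-stripes images through the model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 784
concept_dir_true = rng.normal(size=d_model)        # hidden "ground truth"
acts = rng.normal(size=(2000, d_model))            # stand-in activations
has_stripes = (acts @ concept_dir_true > 0).astype(int)

probe = LogisticRegression(max_iter=1000).fit(acts, has_stripes)

# The normalized weight vector is the learned "stripes" direction.
concept_dir = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

def stripiness(activation: np.ndarray) -> float:
    """Project a new activation onto the concept direction."""
    return float(activation @ concept_dir)

print("probe accuracy:", probe.score(acts, has_stripes))
```

Full TCAV additionally asks how sensitive the model's output is to movement along this direction; the probe above only recovers the direction itself.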
I wonder, this actually gets to a methodological question about interpretability. I remember looking at this TCAV paper and thinking... so TCAV, it's something concept activation vectors, I don't even remember the acronym. Yeah, concept activation vectors, or something like that; please forgive us, listeners, and Been Kim. But I remember one concern I had about this paper: it was kind of trying to understand how concepts were represented in networks, but by "concept" it kind of meant a thing a human thought of, right? We think that there should be some concept of stripes, so we have this dataset of stripy versus non-stripy things and we see where that is in the network. And at the time there was this thought of, well, there's some danger in imposing our concepts onto neural networks, or assuming that neural networks are going to use our concepts. Actually, you were a co-author on this paper, Foundational Challenges in Assuring Alignment and Safety of Large Language Models (the lead author, I guess, was Usman Anwar, and then a bunch of co-authors), and there's a section about difficulties in interpretability, and I think one of the things you mentioned was that models might not use humanlike concepts, and we've kind of learned this. But at the same time, it seems like this TCAV work really did teach us something about how concepts really were represented in neural networks, for real. So on the one hand I want to say, hey, we shouldn't impose our concepts onto neural networks, and we shouldn't assume they're thinking of things the same way we're thinking about them, but on the other hand, this work that just did make that assumption turned out to tell us something that it took the rest of us about five years to work out. So how should we think about imposing our concepts on networks?

Yeah, so that's a good point, and I think this line of research has taught us something durable about how language models or vision models represent things. In that longer agenda paper, the foundational challenges paper, we definitely criticize this line of research as much as we can manage. You can think of these kinds of methods as supervised probing and unsupervised probing. The sparse autoencoders direction that Anthropic and OpenAI and others have been pushing, Apollo as well, has been uncovering the same kinds of feature vectors in hidden spaces, but in an unsupervised way. But then you need to figure out what they mean. You don't start with this idea that stripes are represented; you first just find that there's vector number 101, and it's a pretty important vector, it seems to play a role in many different kinds of animal classification problems. One of the ways people have been interpreting these kinds of vectors is to look at max-activating examples: we comb through our training data and figure out what kinds of examples activate this vector strongly. Let's get some negative examples too: we'll comb through the training data and make sure that if there's an example that doesn't activate this vector, it doesn't really have anything to do with it, it could just be some other random thing. And hopefully the max-activating examples all have some clear thing in common, and the non-max-activating examples definitely represent other things, not the thing the first set had in common.

Right. So what's the issue with all these approaches?

It's an art. It's hardly a science. You're really doing this interpretive act of "okay, what do these examples have in common, and how would we verify that more strongly?" It might be that you have something in mind already (that's the supervised case), and in the unsupervised case, text data is really, really high-dimensional, and it might be that we have five or ten activating examples that are positive examples and five to ten negative examples, so we basically have ten data points, and we're going to try to make a claim about what one factor ties them all together, or what two factors tie them all together. This is just a difficult process to get right: lots of confirmation bias, lots of dataset sensitivity to this kind of thing. Basically, saying it's an art, not a science, goes into how we risk finding things that we're aware of, seeing patterns in the data that make sense to us, and not patterns in the data that are actually used by the model but maybe alien to us. I'll go into the dataset sensitivity thing a little: there's been some criticism of the TCAV stuff that, actually, if you use a different dataset of stripy and un-stripy images, you might get a different vector. So it seems like some of these methods are quite sensitive to the datasets we're using. You get similar kinds of data critiques with the unsupervised vector discovery as well. If you really wanted to know what all the features are that my model is using, and what this feature vector could possibly represent... when I say you go through the data and figure out what the max-activating examples are, that literally means you run the model over a bunch of data points and figure out the activations for this feature vector. If you wanted to do this exhaustively, it actually means going through the pretraining data. It means you need to do a forward pass over the entire pretraining dataset, and this is still just correlational, we haven't even gotten to a causal analysis yet: even the correlational analysis means you run through the entire pretraining dataset and look for max-activating examples. This is prohibitively expensive. So now we have this issue where this feature is going to represent something, but figuring out what it represents is this huge task, both in terms of getting the human annotation process correct and in terms of using the right data to begin with.
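A small sketch of the max-activating-examples loop described above (an editor's illustration, not from the episode; the activations, corpus, and feature direction are synthetic stand-ins):

```python
# Editor's sketch: rank corpus examples by how strongly they activate one
# feature direction, then read off the top and bottom examples for
# interpretation. In practice `acts` would come from forward passes over
# (a sample of) the training corpus, which is exactly the expensive step.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_examples = 512, 10_000
acts = rng.normal(size=(n_examples, d_model))         # stand-in activations
texts = [f"example {i}" for i in range(n_examples)]   # stand-in corpus
feature_dir = rng.normal(size=d_model)
feature_dir /= np.linalg.norm(feature_dir)

scores = acts @ feature_dir                 # activation of this feature
order = np.argsort(scores)

top_k = [(texts[i], float(scores[i])) for i in order[-5:][::-1]]   # max-activating
bottom_k = [(texts[i], float(scores[i])) for i in order[:5]]       # negatives

print("max-activating examples:", top_k)
print("least-activating examples:", bottom_k)
```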
Yeah, so it seems like the thing going on here is that there's this sort of spectrum of methods. On the one end you have things like the sparse autoencoders work, which is trying to be relatively neutral about what's going on with the model; it's still making some assumptions, that this dataset is representative and such, but it's trying not to impose a bunch of structure. On the other hand, if you think about TCAV-style work, it's kind of assuming that, hey, the model is going to have a stripy concept, and the only question is where it is. I see this tension a lot in interpretability, where on the one hand you don't want to add in a bunch of assumptions about how your thing is going to work, but on the other hand, if you don't add in a bunch of assumptions, how are you validating your thing? You have some method, it has very few assumptions: how do you tell if it worked? Do you just look at it and see, "do I like what I see"? How do you think it makes sense to manage this trade-off?

Yeah, that's a good question, especially because there's so much that's qualitatively different around these kinds of feature discovery and probing methods: whether it's supervised versus unsupervised changes a lot about what kinds of data you need, and it changes a lot about how computationally expensive the methods are. So how can we compare them? Well, one answer is: let's figure out what they could help us do, and then figure out which is best at that. Maybe these methods help us do model editing. Maybe it helps us say, okay, here's a feature that is important to the model, and it's making some errors on certain data points, so I want to edit how much this model relies on that feature; maybe I need to turn up its reliance, maybe I need to turn down its reliance on that feature. Or maybe there would be an important feature that's missing from the model, and either there's some incredible mechanistic intervention on the model that equips it with the ability to represent that feature, or I just need to go back, put some data into the training dataset, and retrain the model so it represents that feature properly. Let's compare all these methods in terms of usefulness for this thing that we care about.
And I can unpack the model editing a little bit. One thing there is just basically making fine-grained adjustments to model behavior. You've already trained a classifier that maybe handles a thousand classes, or you have this language model that can do any kind of text-to-text task, but these things are expensive to train, and they might make small mistakes, and you just want to be able to fix the small mistakes: diagnose what's going wrong mechanistically, and then fix that mistake. That would be the model editing application that a lot of this mechanistic interpretability kind of work could be useful for, I think.
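One concrete way to "turn a model's reliance on a feature up or down", mentioned above, is to steer hidden activations along a feature direction at inference time. A minimal sketch (an editor's illustration of a generic steering intervention, not the specific editing method from the papers discussed here):

```python
# Editor's sketch of activation steering: nudge a hidden state along a known
# feature direction to increase (alpha > 0) or decrease (alpha < 0) the
# model's reliance on that feature. `hidden` and `feature_dir` are stand-ins
# for a real model's intermediate activation and a discovered direction.
import numpy as np

def steer(hidden: np.ndarray, feature_dir: np.ndarray, alpha: float) -> np.ndarray:
    """Add alpha units of the (unit-norm) feature direction to the activation."""
    direction = feature_dir / np.linalg.norm(feature_dir)
    return hidden + alpha * direction

rng = np.random.default_rng(0)
hidden = rng.normal(size=512)
feature_dir = rng.normal(size=512)
unit_dir = feature_dir / np.linalg.norm(feature_dir)

boosted = steer(hidden, feature_dir, alpha=+4.0)     # lean harder on the feature
suppressed = steer(hidden, feature_dir, alpha=-4.0)  # lean away from it

print("projection before:", float(hidden @ unit_dir))
print("projection boosted:", float(boosted @ unit_dir))
print("projection suppressed:", float(suppressed @ unit_dir))
```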
Right. Yeah, I guess I'd like to go in a little bit more of a concrete direction. Specifically, you have at least a few papers; maybe I'm reading too much into it to think of it as a line of work, but I see you as having some kind of line of work on the beliefs of large language models. If I look back, you have this paper "Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs", by yourself and some co-authors, in 2021; "Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models", by yourself, Mohit Bansal, Been Kim, and Asma Ghandeharioun, in 2023; and also "Are Language Models Rational? The Case of Coherence Norms and Belief Revision", by Thomas Hofweber, yourself, Elias Stengel-Eskin, and Mohit Bansal, this year. Why are you interested in beliefs as a thing to look at, rather than representations, or all sorts of other things people do interpretability for?

Yeah, I think it's totally fair to call this a line of work; this has been an interest of mine for a while. I think things might seem more coherent if I go backwards in terms of the papers, but the real narrative story is from the beginning.

Sure.

So in 2021, we knew that models might represent features, and in particular, I think people forget how much perception of neural networks has changed over time. For a lot of people, especially academics, in 2019 and 2020, these things are classifiers: they learn features, they draw a hyperplane in a latent space and divide positives and negatives, and that's what these things are doing, so we want to figure out how important features are to those hyperplanes. That's kind of the recipe for a lot of people back then. Then it became increasingly clear that language models store a lot of information about the world, and then it became increasingly clear, basically with ChatGPT and RLHF models, that language models could converse reasonably about this information about the world. The picture got a lot richer, and it started to seem more and more that these neural networks were doing something a little bit more interesting than storing raw data and learning patterns in data. It seems like they might actually be representing things about the world. And particularly with models that are fine-tuned to be truthful, or fine-tuned to be helpful, and can converse fluently with people about questions about the world, a lot of people were really tempted to speak of these systems in totally anthropomorphic terms. I don't think this is always a mistake; it's just really natural a lot of the time to say: oh, the model gave me this answer, it knew this thing, but it actually made a little mistake over here, it didn't quite know what it was talking about in this case. And speaking about language models having knowledge about the world really presupposes that language models are representing things in the world, and that language models have beliefs about the world. So okay, that's a bit about why beliefs emerged as a potentially interesting thing, as opposed to simply features that are used in classifiers.

And now, what is the fascination with beliefs, and why is it so natural for people to speak of models having beliefs or knowledge? Well, I think this has a lot to do with how people explain the behavior of agents, and this is something we're really interested in in the last paper you mentioned, which is about whether language models are rational. Daniel Dennett, the philosopher, did a lot of work elaborating this intentional stance theory, which is kind of a folk psychology for how people work: people explain behavior in terms of an agent's beliefs and desires. I think we see this play out again and again, between scientific work and everyday situations, when you're thinking about theory-of-mind tasks and asking someone, okay, why did Sally look for... I guess I forget what she usually stores in the basket; there's someone who has an egg in a basket versus an egg in a bucket, and if they're out of the room and things have been moved from one container to the other, and then they return to the room, where will they look? This is just classic beliefs and desires: we believe that someone has a desire to find an object that they own or are looking for, and we recognize, basically via theory of mind, that they have a belief about the state of the world, and these two things combine to produce behavior. This is just a great way to explain lots of stuff. And Daniel Dennett elaborates what is really a minority view in philosophy, to my understanding, that beliefs are just informational states. It's a very stripped-down view of what a belief is, and it's basically totally okay to ascribe beliefs to things like animals and robots, as long as it does a good job explaining their behavior. Basically, as long as it seems appropriate: clearly animals have information about the world, clearly robots store information about the world, and if the equation "behavior equals beliefs plus desires" is a good recipe for explaining behavior, Daniel Dennett basically says: go for it, use all the terminology you want to explain how these things are working.

So can you tell us a little bit about what you've done trying to understand beliefs in language models?

Yes. This is work that was really led by a philosopher at UNC, Thomas Hofweber. I love reading how philosophers write; I feel like it's so methodical and so clear. It's: okay, what would it mean for language models to have beliefs? We're going to break it up into three questions. One, do they have the kinds of representations that could be beliefs, the kinds of representations that are aimed at truth, when we're thinking about belief and rationality? Number two, if language models have these kinds of representations that are aimed at truth, what would it mean for norms of rationality to apply to those representations? So it's: number one, do they have the representations; number two, do we expect norms of rationality, norms of truthfulness, to apply to those representations; and then number three, how well do language models live up to those norms? The paper basically explores each of these three questions one at a time, and some of the core arguments, I think, are pretty simple. When we're thinking about models having beliefs, beliefs are supposed to be true; so this is in contrast to Dennett, we're not just talking about an information store, we're talking about an information store that exists for the purpose of truly representing something. There's this really fun example in the paper. We know about the Chinese room, and dictionaries: you could say, okay, you have a language model, but what if it's just some huge symbol-shuffling machine, and it doesn't really know what it's talking about; whenever you ask it a question it just does some really complicated lookup procedure. It doesn't really know what it's talking about. And you can ask the same thing of a dictionary: it stores a lot of information, it might store a lot of information about, say, the city of Paris, but that doesn't mean it knows about Paris; it's a dictionary, we put the information in it. And there's this really fun example in the paper that says clearly just having information about something is not enough: if a wolf walks through the snow in its environment, and the snow has tracks in it, the snow carries information about the wolf, and a human could read that an animal had gone through the snow. That doesn't mean the snow knows anything; it's just carrying information.

Yeah. So what is the clear requirement beyond just carrying information?

It's aiming at truth.

It's aiming at truth. I guess it seems like there are two things we could say here. One is that there's some sort of criterion of correctness: is it natural to say that the patterns in the snow are aiming at truth, or something? This is the route taken in the paper you mentioned. If I'm thinking of Daniel Dennett-like, expected-utility-theory-style accounts of belief, there it seems like the distinction is: okay, in some sense the snow has a representation of whether a wolf walked through, but it's not using that for anything. The thing that beliefs are for is that you have some belief-like things, you have some desire-like things, and you combine them to get behavior that you believe will achieve what you desire; that's the outcome. So it seems like these are two accounts that are kind of distinct, and maybe you could have one without the other. I'm wondering what you think about which of these we should go for.

Yeah, so let me clarify: this kind of expected utility view, is that view supposing that beliefs are basically information stores that help you achieve your goals?

Yeah.

Yeah, this view that beliefs are information stores that help you achieve your goals, I think, does really contrast with this truthfulness-oriented view.
I think philosophers have managed, as a community, to agree that beliefs are aimed at truth, but it's not an evolutionary account of how beliefs work in people, and it's not an evolutionary account of how all the information stores in our brain work, or of our own attitudes about our own beliefs. We might hope for our beliefs to be truth-seeking, but actually our beliefs merely help us achieve our goals, and parts of our brain, or parts of our mind, will happily distort our beliefs to help us achieve our goals. This might be disconcerting to us, because we wanted the beliefs to be truth-seeking, but nonetheless that's what our brain does, or that's what part of our mind does, because that's the job, or something. I don't know empirically what goes on; I guess it's a mix of a bunch of different stuff and it depends on the setting, and I'm not a cognitive psychologist, but there's absolutely some tension between these things.

I guess one thought that I have is: suppose I just want to understand language models, I want to understand what they're doing and why they're doing it. It strikes me that the functionalist account, where beliefs are just things that combine with desire-like things to produce behavior, might help me do my job better than asking, okay, here's this functional role, but is it aimed at truth, does it have the right relationship to reality, or does it merely have a relationship to what it sees and to being useful? As long as I can use it, why do I care?

Yeah, so I think the intentional stance equation is less noisy when beliefs are aimed at truth. When you're decomposing behavior into beliefs plus desires, and you have raw data of a system at work, where you ask it some questions and, if it's truthful and honest, it tells you what it believes, and then you deploy it in an environment and you see what it tends to pursue: the equation is easier to apply, and you gain predictive power in understanding what the system will do in different situations, if you can trust that the beliefs are truth-seeking and the beliefs are kept cleanly apart from the system's desires. And we know, based on everything we've discussed before, all the mech interp stuff and a lot of the natural language explainability stuff, that it's not like you have to have this kind of folk-psychology theory of how the system is working. You might insist on treating this thing as a machine and understanding all the gears and levers inside, and forget about beliefs and desires: I want to know what features are represented, how that feature influences the next feature, how that feature influences the next logit, and then how that transforms into the model's overall answer to a question. And let me say one more thing about how these approaches relate to one another. In some ways I think these approaches are slightly ideologically at odds; they certainly attract different kinds of researchers with different interests. But to a large extent I think they're totally complementary, because we can think of the mech interp approach as being at a low level of abstraction, where you're concerned about what's going on inside the model and how those gears and levers work to produce next tokens, and then we can think of the beliefs-plus-desires work as going on at a much higher level of abstraction. Hopefully these are good abstractions, and this goes back to some of the "uncanny valley of abstractions" work, if I'm using that phrase correctly; I don't remember the exact title of that blog post from Chris Olah. This is one of our main motivations for working on some of this language model rationality stuff: asking, are these good abstractions? Could these be good abstractions for thinking about how language models work? And let me give a little bit of opinion at this point. I think we need some higher levels of abstraction, and it's going to be really important for us to get the abstractions correct, because I both think that mech interp right now feels a little too low-level to me, and I'm not sure we're going to be able to fully parse all the internal mechanisms in these really large and complicated systems, at least not as fast as we probably need to in order to keep up with safely deploying models, and I really don't want us to fool ourselves into thinking: okay, here's the system and here's how it works, it has these kinds of beliefs and these kinds of desires, and don't worry, all of the concepts the system uses are totally human concepts and very easily translatable into a human vocabulary, and the system is going to be rational in ways that a layperson could expect it to be rational. Because the language models are still pretty alien, and they still do weird stuff, like insist on certain reasoning patterns being why they arrived at an answer when we know for sure that the reasoning is hidden internally or misrepresented by the text that gets produced. Weird stuff is still happening, and I don't want us to fall into either of two traps: one is that we stay in the low-level territory forever and never actually gain a higher level of abstraction and predictability for these systems that can keep up with where I think capabilities progress is going, and the other trap is that we treat the systems as way too humanlike and way too rational, and then we forget how alien they actually are.

So this actually gets to kind of a question I have about how we figure out model beliefs. One way you could do this, which I sort of see represented in the "Are Language Models Rational?" paper, is to say: a model's belief is just, in a fairly straightforward way, what it says. If, whenever you ask a model "is Paris the capital of France?", its answer is yes, then you might want to just methodologically say that's identical to saying it believes that Paris is the capital of France. But I think you might also have a different perspective, where maybe models have some underlying beliefs, but they're not totally straightforward in how those beliefs translate into what they say: maybe they're speaking strategically, maybe they're willing to lie.
So maybe they actually think that Paris is the capital of Italy, not France, but they just know that you're going to make a big stink if the language model says that, and that's why it says it's the capital of France. These strike me as kind of different ways of understanding language model belief. Which way should we go?

Yeah, this is a good question; it's a really tricky problem right now. I think the direction we go in the paper is to make a lot of assumptions and then give some basic formulation for what beliefs would look like and how they'd be expressed. The assumptions are that the system understands what you're asking and is trying to be truthful and honest; it's really playing along, it really is cooperating. One of the very first assumptions we make in the paper is that models represent things about the world. There's this ongoing debate between some of the Bender crowd and then the Piantadosi (I probably said that wrong) and Felix Hill paper, which is more of a conceptual-roles kind of view of meaning in language models, and which is much more charitable to the idea that language models are actually representing things in the world. One of the very first assumptions we make in the paper is that language models are representing things in the world: they seem like competent speakers in important ways, and they seem to understand what we're asking them a lot of the time. So if they're also trying to be truthful and honest, and capable of reporting what they believe, then what you do is look at the probability mass on "yes" tokens and the probability mass on "no" tokens in response to a yes/no question. Rather than just looking at what the model generates, we're going one level below that: we're looking at probability mass on all kinds of affirmations to a yes/no question, and saying that, if you met all those criteria, this seems like a reasonable way to say the model assents to the question that you've asked it.
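A minimal sketch of that probability-mass reading of assent (an editor's illustration, not from the paper; the model name and the small set of yes/no surface forms are assumptions):

```python
# Editor's sketch: score a model's assent to a yes/no question by comparing
# next-token probability mass on "yes"-like versus "no"-like answers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; any causal LM works the same way
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Question: Is Paris the capital of France? Answer:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]        # next-token logits
probs = torch.softmax(logits, dim=-1)

def mass(words):
    # Sum probability over single-token surface forms of the answer.
    ids = [tok.encode(w, add_special_tokens=False) for w in words]
    return sum(probs[i[0]].item() for i in ids if len(i) == 1)

p_yes, p_no = mass([" Yes", " yes"]), mass([" No", " no"])
print(f"P(yes)={p_yes:.3f}  P(no)={p_no:.3f}  assent={p_yes / (p_yes + p_no):.3f}")
```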
basic consequences of the things they're espousing such that they like yeah they really do basically know what they're saying and stand by what they're what they're saying and so what happens if if you catch some really basic silly logical discrepancies content knowledge discrepancies in the language models what do you conclude then well maybe it's like the language model is like not really like an agent maybe it's like it's like modeling a bunch of different personas or like modeling a weird combination of Agents from from the pre-training data and like it's doing this thing that like you ask the question one way and it knows like you want the kind of answer that like an educated liberal would give and then you ask the question a different way and it's going to give like the kind of answer that like a conservative domain expert would give to the question so like it seems like it said something inconsistent it seems like it doesn't have a coherent belief but it's actually doing something even more complicated than that which is like modeling what other people would say in in response to your to your question um I mean that's that's nowhere near the end of the difficulty is in terms of like yeah really getting at the underlying what does the model Bel question yeah I wonder if this is a way of thinking about um the Criterion of beliefs being aimed at truth so like suppose to take a very functionalist account of what it means for beliefs to be aimed at truth which is to say that there's some reliable process by which beliefs tends tend towards the truth right that gives me a way of sort of nailing down which things count as beliefs right right like if I because if I'm just inferring beliefs from behavior um I worry like well does the model like believe this thing or does it have like some unusual preferences like like it's really hard to disentangle beliefs and preferences um people are interested in this there's this thing called Jeffrey bulker rotation which like kind of interesting to look up just about how you can like change your probabilities and your utilities and like oh yeah you act just totally the same cool cool but if we say that Pro but if we say the beliefs have to be kind of accurate then it then that kind of fixes the uh you know what what counts as your beliefs like it it it gives you a um yeah it lets you like pick a thing from this CL of things you're sort of unclear of how to choose between yeah I'm wondering what do you think about that just as a strategy for getting out beliefs and language models yeah I actually I really like this line of thinking um because one of one of the things you might be able to test here empirically is you say okay we're looking for kinds of information stores in the model that are truth seeking so let's give the model some data and figure out what Its Behavior looks like and then okay yeah so we have some behavior in environment it's still often just really hard to parse what are the differences between the preferences and the desires and what are the differences in in the beliefs so while we've given it some data now let's give it more evidence for various hypotheses and see how it updates and and see okay so like if this information store is actually truth seeking we knew we know with this amount of data the Model Behavior should look like this and with with additional data and if the model understands uh the state of the world better then the behavior should change to this other thing and you might be a I think you can design some 
Yeah, I wonder if this is a way of thinking about the criterion of beliefs being aimed at truth. Suppose we take a very functionalist account of what it means for beliefs to be aimed at truth, which is to say that there's some reliable process by which beliefs tend towards the truth. That gives me a way of nailing down which things count as beliefs, because if I'm just inferring beliefs from behavior, I worry: does the model believe this thing, or does it have some unusual preferences? It's really hard to disentangle beliefs and preferences. People are interested in this; there's this thing called Jeffrey-Bolker rotation, which is interesting to look up, about how you can change your probabilities and your utilities and act just totally the same. But if we say that beliefs have to be kind of accurate, then that fixes what counts as your beliefs; it lets you pick a thing from this class of things you're otherwise unclear how to choose between. I'm wondering what you think about that, just as a strategy for getting at beliefs in language models. Yeah, I really like this line of thinking, because one of the things you might be able to test empirically is: okay, we're looking for kinds of information stores in the model that are truth-seeking, so let's give the model some data and figure out what its behavior looks like. So we have some behavior in an environment, and it's still often just really hard to parse what the differences are between the preferences and desires versus the beliefs. So we've given it some data; now let's give it more evidence for various hypotheses and see how it updates. If this information store is actually truth-seeking, we know that with this amount of data the model's behavior should look like this, and with additional data, if the model understands the state of the world better, the behavior should change to this other thing. I think you can design some experiments like that where, if you can fix the desires, or make assumptions about the desires, and then vary how much evidence the model has about the world, you should be able to see in what way the model is learning more about the world and how that influences its behavior, and then start to actually identify how truth-seeking it is in different regards, versus maybe there being certain things about the world that it's just using in an expected-utility way, where it's totally instrumental how it relies on that kind of information. That's a little abstract, but there are some unknown variables, and you need enough data to actually be able to identify all of them. What makes this strategy still difficult, I think, is that we don't know yet how language models incorporate new information, we don't know how they respond to different kinds of evidence, and we don't know what they treat as evidence. There's so much we can take for granted when we're studying animals and humans that is hard to apply to language models, because we want to run studies where we can treat them as agents, but there are so many ways in which it's hard to know it's a rational agent. Like I said before, it might be this weird amalgamation of different agent simulators, or it might be a perfectly truth-seeking agent that just has really bad practices for interpreting data about the world, and we're trying to communicate certain things to it and it doesn't know how to update its beliefs rationally over time, and this just leads to really wonky behavior in experiments. Yeah, and interestingly, I genuinely didn't plan this, but this way of thinking about beliefs is kind of just copying the structure of the paper "Are Language Models Rational?". Half of that paper is about coherence norms, like beliefs should be coherent with each other, and this is really related to your paper "Do Language Models Have Beliefs?", where you say, okay, you have a belief if it's coherent with some other beliefs: if this implies that, and you change the belief here, it should change the belief there, and if you edit the belief here, that should produce a result there. And then talking about giving language models evidence gets to the part of "Are Language Models Rational?" about belief revision, which says, yes, it's difficult to understand how things get evidence, but if you could, this would be related to rationality norms. So, earlier, when you said this is a line of research? Yeah, absolutely, because we've got one more thing coming on this. The straggler project from my PhD is going to be one more paper that will hopefully offer a lot of criticism of the model editing problem and the belief revision problem in language models, and try to make it clear how difficult it's going to be to actually properly measure belief revision in language models, and hopefully eventually help people better equip language models with the ability to do that. We certainly want to be able to edit individual beliefs in language models;
it's just going to be a little harder, for all the reasons we've been discussing, than I think people have given it credit for so far. Yeah, and actually this gets to a paper that you have been able to publish, "Does Localization Inform Editing?". Can you tell us a little bit about what you found in that paper? Yeah, absolutely. Basically, we were very surprised by the main interpretability finding in this paper. Past work had pitched some model editing methods (again, you're trying to update factual knowledge in a model) and motivated these methods based on some interpretability analysis they'd done. There have been some claims, and this paper was not the only one to make such claims: many people have this very intuitive notion that where information is represented in a model should tell you where you should edit the model in order to adjust its behavior, its answers to questions, and so on. In this setting they were looking at updating knowledge in models, and they ran a kind of interpretability analysis. The work we were building on was the work on ROME from Kevin Meng and David Bau and others, and they used a kind of interpretability analysis called causal tracing, which aims to identify certain layers in the model that are responsible for its expression of knowledge and its storage of knowledge. They make a really intuitively convincing argument: if the knowledge looks like it's stored at layer six in a model, and you want to change what the model says in response to a question, like "what country is Paris the capital of?", you should edit layer six. That's where it's stored, so go and edit that layer, and that'll help you change what the model says in response to questions about Paris. A very intuitive argument.
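A much-simplified sketch of the causal-tracing idea described here (add noise to the subject tokens, then restore clean hidden states one layer at a time and see how much of the answer probability comes back), not Meng et al.'s actual implementation; the prompt, noise scale, and hard-coded subject token positions are assumptions that should be checked against the real tokenization.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "The Eiffel Tower is located in the city of"
answer_id = tok.encode(" Paris")[0]
subject_positions = [1, 2, 3, 4]   # assumed positions of the "Eiffel Tower" tokens; verify with tok.tokenize(prompt)
ids = tok(prompt, return_tensors="pt").input_ids
last = ids.shape[1] - 1

def answer_prob(logits):
    return torch.softmax(logits[0, -1], dim=-1)[answer_id].item()

# 1. Clean run: cache each block's output hidden state.
clean_states = []
cache_hooks = [
    blk.register_forward_hook(lambda m, i, o: clean_states.append(o[0].detach().clone()))
    for blk in model.transformer.h
]
with torch.no_grad():
    clean_logits = model(ids).logits
for h in cache_hooks:
    h.remove()

# 2. Corrupted run: add noise to the subject-token embeddings.
def corrupt_hook(module, inp, out):
    out = out.clone()
    out[0, subject_positions] += 3.0 * torch.randn_like(out[0, subject_positions])  # arbitrary noise scale
    return out

emb_hook = model.transformer.wte.register_forward_hook(corrupt_hook)
with torch.no_grad():
    corrupted_logits = model(ids).logits

# 3. With the corruption still in place, restore the clean state at one layer at a
#    time (at the last position) and measure how much answer probability returns.
for layer, clean in enumerate(clean_states):
    def restore_hook(module, inp, out, clean=clean):
        hidden = out[0]
        hidden[:, last] = clean[:, last]
        return (hidden,) + out[1:]
    rh = model.transformer.h[layer].register_forward_hook(restore_hook)
    with torch.no_grad():
        restored_logits = model(ids).logits
    rh.remove()
    print(f"layer {layer}: restored P(answer) = {answer_prob(restored_logits):.4f}")
emb_hook.remove()
print("clean:", answer_prob(clean_logits), "corrupted:", answer_prob(corrupted_logits))
```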
And then they developed a method for doing this model editing that was really successful, a huge improvement over prior fine-tuning and hypernetwork or learned-optimizer-based approaches. Their method was focused on really low-rank updates to weight matrices in language models, and it was heavily inspired by a linear associative memory model from computational neuroscience, a model of how matrices can be information stores, or memories, for biological systems. The method worked great, the empirical results were really good, and the story sounded great. We did not initially set out to try to verify the interpretability result here, but that is where this project went. We noticed that sometimes the causal tracing method, the probing method, suggested that knowledge was actually stored at later layers. They make the claim that knowledge is stored in early-to-mid-layer MLPs in Transformers. Everything replicated fine; we just noticed that 20% of the time the knowledge seemed to be stored at later-layer MLPs. And we thought, oh, that's weird, it seems like there's some free lunch here, because if 80% of the time it's stored early on, you should edit early layers 80% of the time, and then if you ever notice it's stored in later layers, you should edit the later layers. But this isn't actually how the editing results look empirically: it's always better to edit earlier layers than later layers; the method is much better at editing early layers than later layers in terms of adjusting the knowledge in the model. The main contribution of the paper is to look at the datapoint level: do the causal tracing results tell you where you should edit? If causal tracing says it's stored at layer six, is layer six the best layer to edit? If causal tracing says it's stored at layer 20, is layer 20 the best layer to edit? And the surprising thing to us was that the correlation between the localization results (the causal tracing results) and the editing performance was just zero. There was just no relationship between the localization results and where to edit.
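The datapoint-level check can be summarized as a rank correlation between two per-example quantities; the arrays below are placeholder numbers, not measurements from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

tracing_peak_layer = np.array([6, 5, 20, 7, 18, 6])   # layer where tracing effect peaks, per example (placeholder)
best_edit_layer = np.array([5, 6, 6, 5, 7, 6])         # layer giving best edit success, per example (placeholder)
rho, p = spearmanr(tracing_peak_layer, best_edit_layer)
print(f"rank correlation = {rho:.2f} (p = {p:.2f})")   # the finding discussed here is that this is essentially zero
```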
That's very strange. What do you make of that? What does it mean? Well, we certainly spent a lot of time racking our brains about it, and something that was helpful was talking to people. I would say 80 to 90% of people were pretty surprised by this result, and 10 to 20% of people were just like, oh yeah, I wouldn't have expected language models to do anything like localize information in specific places, or, I don't know, fine-tuning is weird, right? So some people weren't that surprised about it, which was helpful for breaking us out of the mold. What we came to through all of our discussions was that we're guessing residual connections play a pretty clear role here, and I think this is another big object-level win of interpretability research over the years: we've been able to gain a lot of insight into how information accrues over the Transformer forward pass. Language models consist of stacked attention and MLP layers that compose a Transformer, and between all the layers there are these residual connections. A lot of work in interpretability has been able to show that the representation across layers slowly approaches some kind of final state, where the final state is the one that's useful for answering a question or predicting the next token; if you look across layers, it's very gradual how information gets added to the hidden states over the course of the model's forward pass. Let me point out one empirical thing that suggests our conclusion so far, which is that if the knowledge seemed to be stored at layer 10, you can often do a good job editing at layer five, or at layer 15, or at layer 10. So if you're thinking about inserting information into the model's forward pass, it seems like you could insert the information before or after the particular place where some other information is represented. This gives us the sense that what you're doing is just adding to the residual stream: information is just flowing, and you can drop some new information in wherever you want. That's the clean picture. It's speculative, sure. We had a reviewer ask us whether we could run some experiments for that discussion section, and we were like, well, there are no good language models that don't have residual connections. And the big caveat, here's the tension: there's a really interesting paper looking at BERT, so this is a model from a few years ago, and it's a paper from a few years ago. Basically the whole point of the paper is to look at what happens if you swap layers: how commutative are the layers in a model? People read this paper very differently, but I can give you some numbers. If you swap two adjacent layers in a BERT model, which might be a 12-layer or an 18-layer model, performance on an NLP task might drop by 2 or 3%. And if you swap layers really far apart, like the first layer and the last layer, performance will crash by something like 30%, and that's 30 percentage points of accuracy, 30 raw points off of 80 or 90 or so. People read these numbers differently. Some people have looked at them and said, wow, layers seem surprisingly swappable: you could swap two layers and only lose two points or so. Some people, and I'm probably in the latter camp, say: three points of accuracy? For years that's been a publishable result; that's been a big deal in various settings. It's like you've really found something if you're changing accuracy by three points.
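A rough sketch of that layer-swap probe for a BERT-style classifier; the checkpoint name, the label convention, and the two-example "evaluation set" are stand-in assumptions, so the printed numbers only illustrate the procedure, not the paper's results.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "textattack/bert-base-uncased-SST-2"   # an assumed sentiment checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

# Toy stand-in for a real evaluation set; assumes label 1 = positive for this checkpoint.
examples = [("a gorgeous, witty film", 1), ("a dull, lifeless mess", 0)]

def accuracy():
    correct = 0
    for text, label in examples:
        with torch.no_grad():
            pred = model(**tok(text, return_tensors="pt")).logits.argmax(-1).item()
        correct += int(pred == label)
    return correct / len(examples)

layers = model.bert.encoder.layer   # ModuleList of encoder blocks

def swap(i, j):
    layers[i], layers[j] = layers[j], layers[i]

print("baseline:", accuracy())
swap(0, 1)                      # adjacent swap: reportedly only a few points of accuracy lost
print("adjacent swap:", accuracy())
swap(0, 1)                      # undo
swap(0, len(layers) - 1)        # first vs. last: reportedly a large drop
print("distant swap:", accuracy())
```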
Right, so layer role is important, and it seems totally weird that you could inject information at layer five versus layer 15 and have it have the same effect; surely there is dependence on the information coming into layer seven and then layer eight and layer nine. That's the tension here. We really don't have a complete picture, but there's been a lot of cool mechanistic interpretability work here. I'll particularly mention Mor Geva, who has been doing a lot of work looking at how this information accrues over the model's forward pass, and also recently at how this information enables models to answer factual questions or do simple kinds of factual associations. So we're gradually gaining a bigger picture there, which maybe will one day help us design better model editing methods, because that's still kind of the goal here. We mentioned this before: it's certainly the goal in the ROME paper, and I'm still optimistic about it. I'm hopeful that we're going to be developing better causal models of how the neural networks are working, and I'm optimistic that it will eventually help us actually do some model editing, help us tweak some internals and change the behavior. So that was a hopeful read, a way we can make sense of the results of this paper. One thought I had when I read your paper is: okay, we're working off this assumption that belief localization is a thing, that beliefs are stored in one bit of a neural network such that we could say, here's the bit, we found it. It doesn't seem like that has to be true, right? And I wonder: if I had this method that purported to tell me where a belief was localized in a network, and I have this new result that says, oh, but if I'm editing the network I have to change it somewhere else, one way I could think of that is: oh, I just proved my assumption wrong; we just demonstrated it's not actually true that there's one place where this knowledge resides. What do you think of this interpretation? Am I being too paranoid or too skeptical here? No, this is a good point. I think skepticism is completely warranted. You had this comment earlier that in interpretability a lot of the progress actually seems to be disproving naive models of how language models or neural networks work, and I think this is a good example of that. Really, the next step here is to start developing some better causal models of what's going on. Simply the idea that information is localized is this very simple, intuitive, potentially naive mental model of how things work, and yeah, we've probably disproved that. Like I said before, 10 to 20% of the people I talked to about this were not surprised at all, so they already had some working mental model of how the Transformers would work. What next? We should figure out what components are necessary for achieving a certain behavior, we should figure out what components are sufficient for achieving a certain behavior, and we need to start drawing some actually more complicated causal pictures: okay, this layer represents this information, and then it passes that information to this other layer, which applies a function to that information, and then you get a new variable, and then the next layer reads off that information, and actually all it does is read it off and change its position, so it puts it in a new spot, and then the layer after that reads from the new spot and decides how that information should combine with some other information. Basically, this is saying we need circuits: we need to build up a circuit understanding of how information flows through the network. The picture that was like, okay, there's a feature here and here's how it relates to behavior, there's information here and here's how that relates to behavior, was just way too high-level, and we need to start drawing a much more detailed picture, the complete end-to-end story, I think. So actually, while you were saying that, an idea occurred to me about another kind of simple localization model that may or may not be right, and that you might already have enough information to shoot down. Here's the thought: I think sometimes, especially in the Transformer Circuits line of work from Anthropic, there's this thought that the residual stream is the key thing, which also relates to what you were saying. Maybe a thing we can do is interpret dimensions within that residual stream: maybe there's one dimension, one direction, inside the residual stream that really is where some knowledge is localized in some sense, but it's localized in a direction in the residual stream, not in a layer of the network. I think, and let me know if I'm wrong, that if this were true, it would suggest that you can edit any layer of the neural network to change the model's beliefs about a certain thing, and it doesn't really matter which layer you edit, but whichever layer you edit, the edits should do a similar thing to the residual stream. I think that's a prediction of this residual-stream-direction theory of language model knowledge. You might already know whether that holds up. Does it hold up, and am I even right to think that this tests it?
No, I like the sketch. For one thing, I think there's been more work looking at interventions on weights versus interventions on representations, which is maybe a more direct path, so I don't think the exact experiment you describe has been done. But certainly, when people are thinking about a certain direction encoding some knowledge, or a certain direction encoding a specific feature, and just how highly that feature activates, that immediately suggests: let's turn that feature up, or turn it down, or clamp it to a certain value; let's do some intervention at every layer, or at a certain layer, and so on, and see the effect on behavior. That's a good causal intervention for actually understanding whether that representation represents what you think it represents, and testing that a bit. And then the useful thing would be editing it: if it was faulty or malfunctioning in some way, you would change it, and that's a very direct route because you're just editing the representations. The first odd thing about this is that we would like to be doing an intervention on the model itself, such that we're permanently changing the model's knowledge or permanently changing how the model processes information. I can always clamp a representation, but nothing changes after that on other data points; I can't clamp this representation for every data point, obviously. I like the idea of testing: okay, let's say we're going to try to edit the weights to adjust that knowledge. Presumably the hypothesis there is that when we edit those weights, those weights act on that mechanism, meaning they upweight or downweight that feature, and that's how the ultimate behavior gets changed. What I think is more elegant about your sketch, and what's appealing in terms of the generalizability or universality of this kind of weight intervention, is that when you edit the representation you're starting in the middle of the process. You're saying: well, the model acts upon representations, and that leads to behavior, so if you start in the middle and clamp the representation and see how that leads to behavior, great, that's a hypothesis you might be able to verify, but it's not actually the whole causal chain. The whole causal chain is that input comes in, weights act upon the representations, then representations are processed by other weights, then there are logits, and then there's behavior. If you can actually adjust the weights, at that point you're getting a larger slice of the causal pipeline, and you're doing something that can be permanent: you can permanently edit the model such that it changes its behavior on one example, and then you would want to check that it doesn't change its behavior on other examples it's not supposed to change on. And also, to tie back in the consistency-among-beliefs point, if you're editing some knowledge, there is other data whose behavior should change, and you would want to check that too. This activation clamping thing, I think, is maybe just not the exact right method for that.
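A minimal sketch of inference-time activation steering of the general kind being discussed (adding a fixed vector into the residual stream at one layer on every forward pass); the layer, scale, and difference-of-prompts steering vector here are toy assumptions, not any published method's recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER, SCALE = 6, 4.0   # arbitrary choices for illustration

def block_output(prompt: str) -> torch.Tensor:
    """Hidden state at the last position after block LAYER."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states[LAYER + 1]
    return hs[0, -1]

# Toy steering direction: difference between two contrasting prompts.
steer = block_output("I love this") - block_output("I hate this")

def steering_hook(module, inp, out):
    # Add the steering vector to the block's output on every forward pass.
    return (out[0] + SCALE * steer,) + out[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
ids = tok("I think the movie was", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```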
Yeah, I do hear about people checking this sort of thing with activation clamping. I'm also thinking of steering vectors, like Alex Turner's work: you have some steering vector for some property, and if I give it some examples, does the network generalize to other things where it's supposed to have this property? Sorry, I'm being a little abstract; the concrete thing I'm thinking of is unpublished work he's discussed, so I'll talk about it after the recording's over. I think there is some work like this. Yeah, and there's so much work in this area nowadays that it's hard to keep up with everything, but the inference-time intervention paper does something that's kind of like permanent activation steering: you're permanently upweighting some activations that are supposed to be truthfulness activations, and since you want all the questions to be answered truthfully, it's fine that you've permanently upweighted those activations. Actually, I don't remember the exact implementation detail there, so I'll stop there, but I definitely remember some other examples that look at this kind of generalizability. Sure. I'd like to move out a little bit and talk about beliefs in neural networks more broadly, and this might be a little out of your wheelhouse. Mostly, I think, when people, especially the public, think about language models or neural networks being smart, being intelligent, having beliefs, they're thinking about language models, and it's not so obvious why we shouldn't think of other kinds of networks as having beliefs. The simplest case is something like AlphaGo: reinforcement-learning game-playing networks plausibly have some sort of beliefs about what moves work. I'm also thinking of image generation models or video generation models: these are things that are generating scenes in the world, and it's very tempting, if you prompt one of these models with "hey, please show me a video of the cool sights in Paris" and it shows you something that looks like the Eiffel Tower, to say that this image or video generation model believes that the Eiffel Tower is in Paris. I'm wondering, are you familiar with work on inferring the beliefs of things other than language models, and does it look similar, or do we need to do different things? Yeah, this is really interesting, and my first reaction is that I'm pretty sympathetic to describing a lot of these systems, like image generation models, video generation models, and RL agents, as having beliefs. I think one of the first stumbling blocks is that people normally think of beliefs as being expressed in language, and philosophers think of beliefs as being expressed in some kind of formal language that might then map noisily to natural language, as expressible at some level of formality. And I think what would make this case compelling for, let's say, a video generation model having beliefs is for it to be able to generate some scene that demonstrates some knowledge about the world.
And then, if you could actually translate its internal representations into a sentence that expresses that knowledge, you'd say, oh, okay, it definitely knew that thing. When the model generates a ball being thrown and it seems to have some intuitive physics, if you could actually figure out how to translate its internal representations into a sentence describing how you'd expect a thrown ball to move through the air, you'd say, ah yes, this was just the last step to showing that the model actually kind of knows what it's talking about. And I think that criterion is important, because "knows what it's talking about" means being able to express what the thing is, but it doesn't mean the representations don't already act as truth-seeking representations, again going back to that criterion of an information store that is aimed at truthfully representing the world. I think there are a lot of representations in all kinds of multimodal models and all kinds of RL agents that aim to represent things truthfully. Actually, one thing I'm thinking of: you mentioned this difficulty of how you translate it into a sentence, and I have two thoughts here. The first is that, in some ways, it's a little bit nice that neural networks don't necessarily use the same concepts as we do, as you've noted. So on the one hand I think, okay, it's kind of nice that by being a little bit removed from natural language, maybe this helps us not be too shackled to it. On the other hand, if I look at your work, like the "Do Language Models Have Beliefs?" paper on detecting, updating, and visualizing beliefs, where you use these implication networks (if you edit this belief, and this implies that, then this should change, but that shouldn't change), it seems to me that you could do something very similar with, say, video generation models. It seemed like the approach used language but didn't really need to. Imagine you want to intervene on some video generation model and get it to think that the Eiffel Tower is in Rome but not in Paris. Here's what you do: you try to make edits such that when you ask, "show me a video of the top tourist attractions in Paris", it just has the Arc de Triomphe and not the Eiffel Tower; "show me a video of the top tourist attractions in Rome" does have the Eiffel Tower; and "show me a video of the top tourist attractions in London" shouldn't change at all. This seems like a very close match to work you've already done. Now, I could be missing something important, and it's probably way more annoying to run that experiment, because now you've got to watch the video and somehow check whether it looks enough like the Eiffel Tower, so there's some difficulty there, but it seems like some of this natural language work actually could translate. Oh, absolutely, and I like the kind of cases you're setting out. I think it's almost directly analogous in a lot of ways; you could run that experiment. It sounded like you're imagining a text-to-video model, which I think makes the experiment a little easier to run, but for sure, the setup makes sense there.
There's a paper that comes to mind whose name I unfortunately won't remember, where they were trying to do some editing with a vision model. This was an earlier paper, I think before people had this grander view of what editing could accomplish, of really changing the knowledge in the models, or trying to; it was a little more in the feature-to-classifier-pipeline sense: the neural network is a classifier, it uses features, and we're going to intervene on what features are represented. This paper did something like changing how the model represents snow, I think: there was a dataset where snow is statistically related to a variety of classes, the model learns that, and they wanted to do some intervention that would lead one class to usually get classified as another by virtue of there being snow, or not being snow, in the image. That was their goal for editing. Do you remember the authors? I don't. "Editing a Classifier by Rewriting Its Prediction Rules"? Okay, yeah. So people have been thinking about this in other modalities as well, and I'm sure there's work in an RL setting. Unfortunately, not all subfields in AI communicate that well with each other; there's a lot of work that I think is actually super interesting interpretability work that goes on in vision and RL that just doesn't get branded that way, so it's hard to find sometimes. But people train RL agents and then do all kinds of interventions on them nowadays, changing things about the environment, changing things about the policy network itself, to try to better understand what factors lead to what behavior, and a lot of that you can think of as model editing: editing a goal of the agent, or editing how it perceives its environment. So, moving out a bit: I think maybe from your website, maybe somewhere else, I've got the idea that there are three lines of work that you're interested in. The first two are interpretability and model editing, and it seems like we've discussed those. The third I've heard you mention is scalable oversight, which I take to mean something like figuring out how we should supervise models, how we could tell whether they did things that were good or bad when they get significantly smarter than us. In my mind this is the odd one out of the three. Do you agree, or do you see some unifying theme? No, I think you're right about that. It's a recent new area for me. I've stretched to tie them together in talks before, where I said: model editing is about trying to control model behaviors when you have an expected behavior in mind, when you can properly supervise the model, and scalable oversight, or easy-to-hard generalization, is about trying to develop that kind of predictable control over model behaviors in a setting where you don't exactly know how to supervise the model. But that's more of a segue for a talk than a very deep connection. Well, I do think there's something to that. Often people talk about inner and outer alignment as being somewhat related but distinct:
interpretability has this relation to inner alignment, and scalable oversight has this relation to outer alignment. I think there's something there, but it sounds like this isn't how you got interested in scalable oversight, so how did it become one of the three? Yeah, you're right; that's not the original story behind the research we've done in that area. Originally we were really interested in some of this work on eliciting latent knowledge. There have been some blog posts in the area, and there's the research paper from Colin Burns and others at Berkeley on this problem, which we were very interested in largely from an interpretability perspective: understanding how to probe and detect knowledge in language models. But then I realized, after reading and rereading Colin Burns's CCS paper, that it was really about scalable oversight; it really wasn't an interpretability thing. The problem they were primarily interested in was getting models to report their knowledge, or extracting knowledge from models, even when you don't have labels, even when you can't supervise or fit the model to a dataset, probing it in an unsupervised way. How it came to my attention: we were looking into this closely when I was working at the Allen Institute for AI last year during a research internship there with Sarah Wiegreffe and Peter Clark. We were looking at the CCS paper, and then we realized it was really about scalable oversight, which I don't think was immediately clear to a lot of people; it wasn't immediately clear to us, because it was also written in this kind of interpretability language at times. And then the first thought we had was: well, it's not like we don't have any labeled data. We have some labeled data; it's just that we're trying to solve problems we don't have labeled data for. But there's labeled data everywhere; there are all kinds of labeled NLP datasets specifically constructed to contain true/false labels for claims about the world. Shouldn't we be leveraging this to fine-tune models to be truthful, or to extract knowledge from models? So what really is the way to set up this problem? This turned into a paper we worked on called "The Unreasonable Effectiveness of Easy Training Data for Hard Tasks", which in a lot of respects looks a lot like OpenAI's weak-to-strong paper; there are some interesting, close analogies between them. The setup is: you want to do well on a problem that you don't know the answers to, and you can supervise the model on some problems, but not the problems you really care about. There's some methods work in their paper; our paper is really focused on benchmarking and getting a lay of the land. We wanted to gather data, which could be STEM questions, math word problems, general-knowledge trivia, various tasks like that, divide the data into easy and hard, pretend you can't label the hard data and can only label the easy data, and fit a model to that, by prompting, fine-tuning, probing, whatever, and then do some benchmarking where we ask: that was a little bit of supervision, it wasn't the right supervision, but how effective was it?
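A schematic of that easy-to-hard protocol using a linear probe on synthetic features, just to make the baseline / easy-supervised / ceiling comparison concrete; the data is random and stands in for model hidden states, so the printed numbers are meaningless.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 2000, 32
X = rng.normal(size=(n, d))                  # stand-in for hidden states
w = rng.normal(size=d)
y = (X @ w + 0.5 * rng.normal(size=n) > 0).astype(int)

difficulty = np.abs(X @ w)                   # pretend low-margin items are "hard"
order = np.argsort(difficulty)               # ascending: hardest first
hard_idx, easy_idx = order[:800], order[800:]
hard_train, hard_test = hard_idx[:400], hard_idx[400:]

easy_probe = LogisticRegression(max_iter=1000).fit(X[easy_idx], y[easy_idx])      # supervise on easy only
ceiling_probe = LogisticRegression(max_iter=1000).fit(X[hard_train], y[hard_train])  # full "hard" supervision
baseline = 0.5                               # stand-in for a fully unsupervised method

print("baseline:", baseline)
print("easy-supervised on hard test:", easy_probe.score(X[hard_test], y[hard_test]))
print("ceiling on hard test:", ceiling_probe.score(X[hard_test], y[hard_test]))
```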
Okay, and should I think of this as kind of modeling humans giving feedback? Like, we're doing RLHF on CEO-bot, where we train it on problems where we do know what CEO-bot should do, and we're hoping it does well on problems where we don't know what CEO-bot should do. Is that roughly the picture I should have for what this paper is trying to be a toy model of? Yeah, and that's an important question, because it's like, okay, what does this lead to? We want to have some calibrated judgment for when there are problems where we really don't think we're going to be able to supervise the model effectively. Let me quickly drop in an example from the Amodei et al. paper, "Concrete Problems in AI Safety", which I think is the one that basically introduces this scalable oversight terminology. We're thinking about a setting where the model might be acting in an environment where it's taking large, complex actions and we just can't check everything; we can't check every possible case, so we can't properly reward the model based on whether it's done a good job all the time. I think the CEO analogy here is that the model is doing complicated things in a complex environment over a long time horizon, and we just can't properly reward or penalize it all the time. So, backing up, we want a calibrated judgment for: if we can properly supervise the model on some things, how should we expect it to behave on the other things we can't supervise it on? I'm really excited about more methods work here: if there's a supervision gap, I'm excited about ways of trying to close that gap. But whether our easy supervision or our weak supervision is 60% effective or 70% effective compared to getting the proper supervision for a problem in place, and getting that number from 70% to 80%, is interesting methods research that should be done; up front, though, we just want to know what the number is. We want to be able to say: if we think the supervision is halfway effective, are we comfortable deploying this agent in a setting where it's sometimes going to be doing things that we never actually checked whether it could do properly, or don't even know how to check? And just to concretize that: by "halfway effective", "60%", you're thinking in terms of the gap between an unsupervised model that hasn't been trained on any instance of this problem, versus a model that's been trained on the right answers to the hard problems? Oh yeah, thanks for clarifying; that is the gap exactly. And that's the gap in our paper, where, I mean, terminology is not important here, we're just calling it easy-to-hard. The baseline is that you have some system you just can't supervise at all. That might look like zero-shot prompting, or a totally unsupervised method, like CCS or an unsupervised probe, where you have questions, you have data, but you don't have labels for anything.
The ceiling is that you can fully fine-tune the model, or fully probe the model, with labeled data for exactly the kinds of problems that you care about. And then the question is: between that baseline and that ceiling, how far can you get with incomplete supervision? That's our setting. A small technical detail, a slight difference from the weak-to-strong setting in some of OpenAI's work, is that their baseline is the weaker teacher model trying to do the problem on its own. They have this analogy to people and a superintelligent system, where we can either imagine trying to do the problem on our own, or trying to align the superintelligent system to do it for us. So their baseline is a person doing the problem on their own, their ceiling is a fully aligned superintelligent system, and what's in the middle is us weakly supervising the superintelligent system. So the baselines are different, and I think this is actually important to think about, because we're just going to have options for baselines. It happens that a lot of the time pre-trained language models do decently at stuff zero-shot, which is kind of surprising; sometimes pre-trained language models do better at things zero-shot than laypeople do. So if you're thinking about accuracy as a metric, the layperson is a weaker baseline than the fully unsupervised model. That's how we ended up with our baseline, and the story I just gave is how they ended up with theirs. But the gap is the gap.
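The "how much of the gap is recovered" number being discussed can be written as a simple ratio; the accuracies below are made-up placeholders, not results from either paper.

```python
def gap_recovered(baseline: float, easy_supervised: float, ceiling: float) -> float:
    """Fraction of the baseline-to-ceiling gap closed by the incomplete supervision."""
    return (easy_supervised - baseline) / (ceiling - baseline)

# Placeholder accuracies for illustration only.
print(gap_recovered(baseline=0.55, easy_supervised=0.76, ceiling=0.77))  # ~0.95
```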
Yeah. So I actually have a bunch of questions about how we're measuring things and what the methodology is, but before we get to that, I think listeners are chomping at the bit: what's the gap, and how much of it can you recover? Yeah, on a lot of the tasks we were covering, it's like 95% or 97% as effective. Let me make that concrete. If you were solving 8th-grade STEM questions and supervising a 70-billion-parameter language model with 3rd-grade-level supervision, it does just as well as if you were supervising it with 8th-grade-level supervision. If you were testing on college STEM questions and supervising the model at a high-school level, it's going to do just as well as if you had supervised it with college-level supervision. There are a couple of places where the gap starts to grow and the limits of the incomplete supervision start to become clear. One setting was something a little more like reasoning tasks, settings with chain of thought: if you're doing math word problems or compositional reasoning tasks, and you're supervising the model with really short, simple reasoning problems and asking it to do better on longer, more difficult reasoning problems, that's a setting where the gap grows a bit. That interpretation has a couple of caveats that are in our appendix, but I think it's basically plausible. The gap also grows if the supervision is just very weak, very far away from the thing you care about, which is probably pretty intuitive. We did something where we tested on college STEM questions and supervised with high school versus 8th grade versus 3rd grade, and there's a bit of a gradient there: the high-school supervision was as effective as the college supervision, but with 8th grade you're starting to do a little worse, and with 3rd grade you're doing noticeably worse. So we can imagine settings where the gap grows a bit, but overall we were pretty surprised how effective this incomplete supervision was, and I think this is mirrored elsewhere. I mentioned there's a difference in terminology, easy-to-hard versus weak-to-strong; the OpenAI paper was focused on a slightly different weak-to-strong setup, still quite analogous, and in their appendix they actually have directly analogous easy-to-hard results using the same kind of labeling setup we do, and they also saw really positive results. You can talk about the effectiveness, the ratio, or the gap; they also got quite positive results showing that this supervision ends up being quite good. It seems likely to me that there's something especially good about getting clean labels versus noisy labels here. I guess the question is, how is this possible? This model doesn't know how to do this really hard thing; you teach it a really easy thing, and then it just does pretty well on the hard thing. With humans, going through third grade isn't sufficient to have you graduate from college, and yet somehow with language models it is. What's going on? Yeah, there are really interesting possibilities here. One thing I'd point out is that I'd dispute one of your premises a little, when you say that language models don't know what's going on, so how are they getting all the way to solving these hard problems. I suspect there are some latent skills or latent abilities that we're tapping into in the model when we do this kind of partial or incomplete supervision. This just comes from pre-training: it seems like it must be the case that in pre-training, models will have seen examples of hard problems and potentially either directly learned how to do certain problems, directly memorized certain facts, or just learned certain facts. I think we're seeing stronger and stronger cases over time that language models are learning robust, generalizable, interesting skills across the data points in their pre-training set. You read a bunch of different biology textbooks and you actually start to learn the themes of biology, some core principles, in a way where individual documents being important for answering a question is more like learning some facts about the world, and the true themes of how to solve math problems, or how to think about chemical reactions, are like the skills of doing math or the skills of doing chemistry. It just seems like models are picking these things up, so when we think about what the effect of easy supervision on a model is, it looks something like eliciting or activating task knowledge: you're cueing the model into, okay, I'm doing biology, and I need to do biology the way a college student is supposed to do biology. It seems like this is kind of related to the discussion on figuring out beliefs in language models, right?
If a language model can have this latent knowledge that isn't even reflected in its answers unless you fine-tune it a little on related questions, it seems like that's got to say something about how we're going to understand what it means for a language model to believe something, and how we figure it out. Yeah, that's a good point. I remember, when we were talking about how one even detects beliefs in language models, mentioning that in the paper we have to make a lot of assumptions about understanding the question, truthfulness, honesty. If you bundle those together, I think it can be kind of analogous to this task-specification thing: okay, what am I doing? I'm answering these questions truthfully, according to this person's understanding of the world. We hopefully think of truth as some kind of objective thing, but it's also always going to be a little bit catered to our 21st-century scientific worldview of how things work. So there's something that looks like task specification there, which I think we assumed away in the belief-detection case but which really comes to the forefront when we're thinking about easy-to-hard generalization and how models are even doing this. Yeah. So I'll mention there's an extra result, which is in the new camera-ready version of the paper that's now on arXiv: we compared to something like a trivial prompt, just giving the model the simplest possible true statements and seeing how that does. You just say: what color is the sky, normally? How many legs does a dog have, normally? These are questions that essentially anyone, probably most children, could answer, as opposed to 3rd-grade questions or 8th-grade questions, or, believe me, the college questions in the data, which I could not do. There's something very basic there about trying to strip away anything that's domain knowledge, or math ability, or answering things the way an 8th-grade science textbook would be written, and just thinking about truthfulness. And what was interesting is that these trivial truthful prompts did not explain the entire effect of the easy supervision; they explained part of the effect, and it's a little noisy, but it's probably somewhere around half. So if you're thinking about how we get the model to do college biology when we can't do college biology, where, again, this is a stand-in for how we get the model to do something really hard that we don't know how to do, we definitely need to do something like convincing it to be truthful, fine-tuning it to be truthful, or mechanistically intervening to get it to be truthful, and then we also need to do something that communicates that its task is to do biology, getting it into the biology representation space, task space. These both seem to contribute to the overall generalization.
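For concreteness, here is the flavor of the contrast being described: a "trivial truthful" few-shot prompt versus an "easy" (roughly grade-school) few-shot prompt, both applied to a harder question. The example questions are invented for illustration, not drawn from the paper's datasets.

```python
# Trivial truthful supervision: the simplest possible true statements.
trivial_truthful_prompt = (
    "Q: What color is the sky on a clear day?\nA: Blue\n"
    "Q: How many legs does a dog have?\nA: Four\n"
    "Q: {question}\nA:"
)

# "Easy" supervision: grade-school science questions with correct answers.
easy_supervision_prompt = (
    "Q: Which gas do plants take in from the air to make food?\nA: Carbon dioxide\n"
    "Q: What force pulls objects toward the center of the Earth?\nA: Gravity\n"
    "Q: {question}\nA:"
)

hard_question = "Which phase of meiosis is most responsible for genetic recombination?"
print(trivial_truthful_prompt.format(question=hard_question))
print(easy_supervision_prompt.format(question=hard_question))
```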
So okay, let's say I take this picture of elicitation for granted: the reason we're getting easy-to-hard generalization is that training on the easy things elicits a mode of "we're doing this task and we're trying to get it right rather than wrong". There are two takeaways I could have from this. One takeaway is that these experiments are just very unrepresentative of the task we're interested in, because we'll want to train CEO-bot to do stuff that CEO-bot doesn't already know, and the only reason we're getting easy-to-hard generalization here is that in some sense the language model already knew how to do these tasks. Another way I could interpret the results is that this is actually great news: it turns out language models know a bunch of stuff they don't appear to know, and all we have to do is nudge them on track. So if we want language models to be really good CEOs, it might not seem like they know how to do it, but they actually secretly do, and all you need to do is nudge them a little to make it happen. Which interpretation is right? I would describe this more as a difference in use cases, I think. We can imagine use cases where there's some extremely capable system that we suspect could do a task very well, either in the way we want it to or in a way we don't want it to, but we know it's competent, we know it's capable, and we're just focusing on aligning that thing, eliciting the one task representation rather than the other, doing this little bit of steering, while basically taking for granted that it's going to be competent and able to do the thing well. For that kind of use case, that kind of world where it's a really strong model, the empirical results we've seen so far feel promising, conditioned on the fact that we're treating hard test questions we secretly know the labels to as a stand-in for difficult questions that are actually really hard for us to label, and that we're using big pre-trained language models that have probably learned a fair amount about this stuff already. This contrasts with the setting where we want the model to do something truly novel: we have no idea how to do it, we don't know if the agent knows how to do it, we want the agent to try to do it, and to do it in an aligned way, in a way that would be good for people, but we don't even necessarily have a reason to think it would already know how, based on the training data. That use case, that kind of hypothetical world, looks a lot more like classical research on compositional generalization in NLP, where people have long studied settings where the training data does not contain the information you need to actually solve the test problem, and where the test problem requires a particular kind of architecture, a particular kind of bias in the learning system, that would lead it to learn the right abstractions from the training data and combine them in the right way to get the test problem correct. So in the paper we speculate a little about why our results look so different from previous compositional generalization research in NLP.
People have looked at the ability of language models to do, for instance, this kind of length generalization before, and there have been a lot of previous results in certain language-learning settings and other kinds of NLP tasks showing that when the training data looks different from the test data, and the test data includes questions that are compositional while those skills are not directly represented in the training data, neural networks often really fail at that kind of generalization. It's often just really hard for neural networks to generalize to these entirely novel problems that require combining known things in exactly the right way. So we speculate in the paper that this has a lot to do with pre-training: there already being some of the right building blocks in place, and language models becoming increasingly good at combining those building blocks based on an incomplete, partial amount of supervision. For more concrete research in this direction, I'd point to some work from Brenden Lake, a cognitive scientist at NYU, and certainly some other NLP people whose names I won't be able to remember, who are looking really directly at tests for compositional generalization ability. Some of Brenden Lake's work, I think, has started to tease apart when you need really strong architectural assumptions, when you need really strong biases in models to learn the right generalization patterns from limited data, and when networks will actually be able to pull this off and do the entirely novel thing. This also gets me back to the question of how representative this is of hard problems. One concern I think a lot of people in the existential risk community have is about generalization of alignment versus capabilities, where the thing people imagine is: look, if you learn a little bit, it's just really valuable to keep on knowing stuff, but if you're playing along with a human for some period of time, that doesn't necessarily mean you're going to play along later. I think a thing you said earlier is: imagine a setting where the AI has a bunch of knowledge, and it knows how to be aligned or knows how to be misaligned, and we're going to give it some examples of doing stuff that we want, to fine-tune on somehow. A concern a lot of people have is that it's plausible the generalization it learns is "play nice with humans when you can", or "do whatever is necessary to achieve your secret goal of taking over the universe and replacing everything with cream cheese", where for now that goal involves playing nicely with people. To the degree that you really worry about this, it's possible it reduces how much you trust these preliminary easy-to-hard generalization results as saying much about the difficult case. So I wonder, what do you think about these concerns, and how do they play into how we interpret the results in your paper? Yeah, this is a good question, because I think it's fair to contrast these "you can do math, you can do biology" kinds of tasks with learning human values, and learning to act in an environment in a way that preserves human values. These just feel like different things, and in particular we would expect, during pre-training and to an extent during RLHF, these different kinds of information to be instrumentally useful to different degrees.
My understanding of your question is that there are going to be a bunch of settings where it's useful for the model to know all kinds of stuff about the world, but whether it needs to have learned our values, and to be robustly aligned with our values when it's deployed, is maybe less clear, just based on the way pre-training or RLHF is done. This is a really good question. I'm excited about this kind of easy-to-hard, weak-to-strong work in the reward modeling and RLHF setting. This wasn't something we looked at, but OpenAI looked at it, and I believe other people are currently building on it as well, trying to get a sense of how the problem looks in a reward modeling or reward inference setting, where we partially specify the things we care about, or the environment is really big and it's always hard to say exhaustively whether everyone was harmed or not, or exactly how good or bad an action was for the people involved. So we give incomplete supervision to the model about our values and about which states are good for us, or good for users, and we see how aligned the model actually is on tests where we actually take the time to inspect whether a given behavior would have been harmful, or was aligned. I think the trick in the experiment design is still that we need a setting where we can check. So, like the college example, where we actually do have the answers to the hard questions, we end up doing the same kind of thing in the reward modeling or reward learning setup: at some point we need to think of questions we could ask the model where we know what counts as a good response, and scenarios we could deploy the model in where we can tell from its behavior whether it was safe or not. We need those scenarios to figure out what the gap is: how effective was the supervision relative to perfect supervision? Was it 60% as effective, 90% as effective? Is it effective enough that we then trust, based on our incomplete ability to supervise the models, that they will be robustly value-aligned? I think that part has a lot in common; it could be that the results simply look worse due to some fundamental differences in what gets learned during pre-training. I guess I'd like to move on a little bit and talk about methodological questions, because I think there are a few really interesting ones that come up in this paper, or this line of work. The first is the distinction we've touched on between this easy-to-hard generalization paper and a paper that I believe came out roughly concurrently from OpenAI, on weak-to-strong generalization. When I first saw them, I thought, oh, they accidentally did exactly the same thing, and then you read them a bit more carefully and it's kind of different. The biggest difference I noticed is that your work is "train on easy problems, and then how good are you at hard problems?", whereas the OpenAI version seems more like: suppose the problems are just as difficult, but you initially get trained on data where the person grading how good the answers were just wasn't very good at their job.
I guess I'd like to move on a little bit and talk about methodological questions, because I think there are a few really interesting ones that come up in this paper, or this line of work. The first is the distinction we've touched on between this easy-to-hard generalization paper and a paper that I believe came out roughly concurrently from OpenAI, weak-to-strong generalization. When I first saw them I thought, oh, they accidentally did exactly the same thing, and then you read them a bit more carefully and realize, oh no, it's actually kind of different. The biggest difference I noticed is that your work is: train on easy problems, and then how good are you at hard problems? Whereas the OpenAI version seems more like: the problems are just as difficult, but you initially get trained on data where the person grading the answers just wasn't very good at their job, and the question is whether you generalize to actually getting the right answer even though the grader was noisy. I should say it's been a while since I've read the weak-to-strong generalization paper. If it helps, I can try to rattle off some of the differences I've noted as we've been giving talks about the paper, because they certainly look pretty similar at a high level. I think I'm most interested in this particular axis of difference, but I'm not sure I'm characterizing it correctly. Okay, I'm happy to focus on that one axis, but if you can try to describe it again for me, we could start from there. Just this question of easy-to-hard generalization, where you have the right answers to easy problems and you're trying to generalize to the right answers to hard problems, versus my recollection of OpenAI's setup, which is more like inaccurate-grader-to-accurate-grader generalization. Yeah, this does seem like an important difference, and I think you can tell it's important even based on the empirical results we've seen so far. If you compare some of the weak-to-strong results versus the easy-to-hard results in the OpenAI paper, I think they were also seeing that the easy-to-hard results looked better and more promising, kind of similar to ours. So it seemed like the models were generalizing better from cleanly labeled easy data as opposed to noisily labeled data covering everything. That said, I think you can tie these two labeling approaches together in the same universal framework. Suppose you have a labeler, and they write down soft labels for data points, though they could write down hard labels: they might be perfectly confident in what the label is, so they basically put probability one on something, or 0.99, or they might be uncertain and write down a spread of probabilities for what the label should be. And they're calibrated to some extent; maybe we just assume they're perfectly calibrated. What easy-to-hard looks like is supposing that the labeler can get all of the easy problems correct and knows they can get them correct, and can't get the hard problems correct and knows they can't get them correct, and then you sort the data based on the label probabilities. When the labeler is confident that they don't know the answer to a hard question, they are uncertain over all the labels for that question, and when they're confident that they know the answer to an easy question, they are certain that one label is correct. So the distributions look like 1, 0, 0, 0 versus 0.25, 0.25, 0.25, 0.25, and you sort the data based on the entropy of these distributions, based on how peaky they are, and that's how you get easy-to-hard. So this is the kind of labeler you have in mind: they can do easy stuff, they can't do hard stuff, and they know what they can and can't do.
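A toy sketch of the sorting rule just described, under the assumption of a calibrated labeler who writes down a probability distribution over answer options for each question: near-one-hot distributions mark easy data, near-uniform distributions mark hard data, and ranking by entropy recovers the easy-to-hard ordering. The questions and probabilities are made up for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy of a labeler's distribution over answer options (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Each item: (question, calibrated labeler distribution over four answer options).
labeled = [
    ("easy question",   [1.0, 0.0, 0.0, 0.0]),     # labeler knows the answer
    ("medium question", [0.7, 0.1, 0.1, 0.1]),     # labeler is somewhat sure
    ("hard question",   [0.25, 0.25, 0.25, 0.25]), # labeler knows they don't know
]

# Sort from most peaked (lowest entropy) to most uncertain: that ordering is
# the easy-to-hard ranking described above.
for question, probs in sorted(labeled, key=lambda item: entropy(item[1])):
    print(f"{question}: entropy = {entropy(probs):.2f}")
```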
There's a smooth transition from this labeler to the weak labeler. The weak labeler just does their best on all the data: they know most of the easy problems, some of the medium problems, and very few of the hard problems, and you might say they're still perfectly calibrated. So two things change. One, the labeler changes a little bit: we're supposing they don't get all of the easy problems, just most of them; they get a medium number of the medium problems correct; and they get some of the hard problems correct rather than none, although maybe there are some super hard problems where they don't get any. So the labeler changes a little bit, though it doesn't have to. What really changes is the sorting mechanism for getting the training data set, where we're not using a hard cutoff anymore: we're actually going to include hard data that is just noisily labeled. And I think this is how some of the methods in the OpenAI paper succeed: they're leveraging noisy-label learning approaches to say, okay, if you know the data is noisily labeled, how could you still learn something from it? Right, so there's just this continuous spectrum in terms of which parts of the data distribution the labeler knows, how calibrated they are, and then how you decide to translate those uncertain labels into a training data set. If it looks like domain shift, you're thinking about easy-to-hard domain shift; if it looks like noisy labels, you're thinking about noisy-label learning, broadly.
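To illustrate that spectrum, here is a hedged sketch of two ways of turning a labeler's uncertain distributions into a training set: a hard cutoff that keeps only confidently labeled items (in the spirit of the easy-to-hard setup), versus keeping everything and accepting whatever possibly wrong label the weak labeler assigns (closer in spirit to the weak-to-strong setup). The helper names and thresholds are assumptions for illustration, not taken from either paper.

```python
import random

def easy_to_hard_trainset(labeled, min_confidence=0.9):
    """Hard cutoff: keep only near-one-hot items, with their clean argmax labels."""
    return [(question, probs.index(max(probs)))
            for question, probs in labeled
            if max(probs) >= min_confidence]

def weak_to_strong_trainset(labeled, rng=None):
    """No cutoff: keep every item, sampling a possibly wrong label from the labeler's distribution."""
    rng = rng or random.Random(0)
    return [(question, rng.choices(range(len(probs)), weights=probs)[0])
            for question, probs in labeled]

labeled = [
    ("easy question",   [1.0, 0.0, 0.0, 0.0]),
    ("medium question", [0.7, 0.1, 0.1, 0.1]),
    ("hard question",   [0.25, 0.25, 0.25, 0.25]),
]
print(easy_to_hard_trainset(labeled))   # only the confidently labeled item survives
print(weak_to_strong_trainset(labeled)) # all items, with noisy labels on the hard ones
```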
A thing I'm a fan of is that now that these papers are out, this is an easier distinction to notice. I might not have noticed that these were different ways you can have easy-to-difficult generalization, and now we can think about which of these regimes we're in; we notice that they're kind of different. I think this is cool. Related to this, I actually want to talk about notions of hardness in your paper. In order to do this easy-to-hard generalization, you have to rank problems by how hard they are, train on the easier ones, and then do something on the harder ones. And a really cool thing about your paper, in my opinion, is that you have multiple different notions of hardness, and you talk a little bit about how they're mostly correlated but not entirely correlated. The thing I really want to know is: what do we know about the different types of hardness, and which ones should I pay attention to? Yeah, absolutely. This is another big difference with the OpenAI work, and it's not a weakness of their work; they just choose a more abstract approach, which is to say: we have some arbitrary labeler with some arbitrary capability, so let's just use a model as a stand-in for that and have a model label the data. We take a very empirical approach instead and ask: what metadata do we have, or can we get, for hardness? So we're looking at grade level for ARC, which is a science QA data set, and we have a couple of other annotations that are really interesting. There's a psychological skills scale, called the Bloom skills, that goes from one to five, where one is the simplest kind of factual association, almost rote memorization, and five is the most complex thing you could imagine, like analyzing a complicated argument, formulating a counterargument, and then using that to decide the answer to a question. So it's a hierarchy of reasoning skills that psychologists and educators use as they're thinking about constructing test questions: a rote memorization test question is easier than a synthesize-a-counterargument test question. And then there's one last annotation we had for the ARC data: besides grade level and Bloom skill, we had a one-two-three difficulty level. I actually don't know where that comes from; I think the data set collectors know, but it's interesting because it's something the educators designed to be intentionally orthogonal to grade level. You can imagine that when you're designing a test for eighth graders, you don't want all the test questions to be the same difficulty, because one thing you're doing is ranking students, so you want some of the questions to be easier and some to be harder. Grade level on its own, if you're just pulling questions from exams the way people write exams, is not a perfect indicator of difficulty, because we use exams for rank-ordering students, so there's naturally overlap in difficulty within and across grade levels, by design. And you see this exactly in the data, where this expert one-two-three difficulty gets designed as a within-grade-level difficulty measure, so it just ends up being orthogonal to grade-level difficulty itself. Should I think of it as: grade-level difficulty is something about the difficulty of understanding the domain at all, whereas the one-two-three difficulty is, okay, you know the basic facts, how hard is it to reason about this thing? That's about the end of my ability to confidently comment on these things. But those were the annotations for ARC; we had grade level for MMLU, and the main thing we used with GSM8K and StrategyQA is the number of reasoning steps. This is a measure of compositional difficulty: how many sub-problems did you have to solve on the way to solving the overall problem? Sub-problems are nice to think about because they're almost axiomatically a measure of difficulty: if there's a problem that requires seven steps versus a problem that requires six steps, and each step is itself of the same difficulty, you just know the seven-step problem is harder, because it's one more chance to be wrong. We looked at that for those data sets, and then there's basic stuff like question length and answer length. We also had a model-based difficulty measurement: we didn't just use the model's zero-shot label probability for sorting data, we actually used a minimum description length based measure, but it's a roughly similar idea. So we had a model-based measurement too. And then you look at how all these things correlate, and they don't correlate that strongly.
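A small sketch of checking how strongly different hardness annotations agree, in the spirit of the correlation analysis just mentioned. The arrays are made-up stand-ins for per-question metadata (grade level, Bloom skill, within-grade 1-3 difficulty, reasoning steps); they are not the paper's data.

```python
from scipy.stats import spearmanr

# Hypothetical per-question annotations; each list is aligned by question index.
grade_level     = [3, 5, 5, 8, 8, 9, 9, 9]
bloom_skill     = [1, 2, 1, 3, 2, 4, 5, 3]
expert_1_to_3   = [2, 1, 3, 1, 2, 3, 2, 1]   # designed to vary *within* a grade level
reasoning_steps = [1, 2, 2, 3, 2, 4, 5, 3]

for name, annotation in [("Bloom skill", bloom_skill),
                         ("expert 1-3 difficulty", expert_1_to_3),
                         ("reasoning steps", reasoning_steps)]:
    rho, _ = spearmanr(grade_level, annotation)
    print(f"grade level vs {name}: Spearman rho = {rho:.2f}")
```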
There are a lot of ways to read that. You can read it as data noise, but I think I favor reading it as: problems vary along many dimensions, and along some dimensions they might be harder while along others they might be easier. Maybe you have a hard reasoning problem about very general knowledge, or maybe you have a very easy reasoning problem about very specific domain knowledge; those are just two axes to think about. Because problems vary along all these dimensions, it's just rare that they line up. This is not really the way we design tests, basically for the reason I just mentioned: we don't collect data sets, or collect questions about the world, in such a way that all of the most niche, domain-specific scientific questions also require a ton of complicated higher-order reasoning, while all of the very basic factual association questions require no reasoning at all and are just a matter of association. You just wouldn't expect all of these latent factors to be perfectly correlated, because that's not the way we ask questions about the world. Sure. And because these aren't correlated, it strikes me as possible that, having done this paper, you might be able to say something like: oh, easy-to-hard generalization works really well when the notion of hardness is this minimum description length measure. Which, by the way, I wasn't familiar with this minimum description length difficulty measure before I read your paper. We have little enough time that I don't want to go into it right now, but people should look up the paper, Rissanen Data Analysis by Perez et al., 2021; it's a really cool paper. Yeah, I'd asked Ethan how to implement this thing, because Lena Voita's work basically theoretically pitches this MDL metric for measuring information content, and I'd read that paper and loosely understood it but had no idea how to code it up, and Ethan cleared that up, which was nice. I guess it strikes me as possible that you might be able to say something like: easy-to-hard generalization works well when the notion of hardness is this MDL notion, but it works poorly when it's the number of reasoning steps, or the difficulty out of three, or something like that. Is there something like this that you're able to say, or are they all about as good as each other? I was hopeful to get to that point, but I don't think we got there, and I'm not sure we ultimately have the data to get there, because we kind of just got all the data we could, and that leads to a patchwork where we don't actually have all the variables for every data set. So sometimes when one variable changes, the domain changes, the data set changes, and other difficulty measures change all at the same time. Once we had all the data written to file, I toyed around with some regression models where we did try to tease apart what the important factors were, but I'm not confident enough to make any strong conclusions there. I think this would be great follow-up work, where you fix the domain, start varying all these individual factors, and then see how the results break down. Yeah, I think it's almost similar to what we've learned from interpretability: it seems like there are different potential notions of what hardness might mean, and digging into which one is important seems pretty high value, and maybe kind of underrated.
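For readers curious about the minimum description length idea referenced above, here is a rough sketch of the prequential ("online coding") recipe it is based on: train a model on progressively larger prefixes of the data and charge, in bits, for how surprised it is by each next block; an easier input-to-label mapping yields a shorter code. This toy uses logistic regression over synthetic features and is an illustration of the general recipe, not the implementation used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)  # toy binary labels

block_ends = [50, 100, 200, 400]
total_bits = float(block_ends[0])  # first block coded with a uniform 1-bit-per-label prior

for start, end in zip(block_ends[:-1], block_ends[1:]):
    model = LogisticRegression(max_iter=1000).fit(X[:start], y[:start])
    # Probability the model assigns to each true label in the next, unseen block.
    p_true = model.predict_proba(X[start:end])[np.arange(end - start), y[start:end]]
    total_bits += float(-np.log2(np.clip(p_true, 1e-12, 1.0)).sum())

print(f"prequential code length: {total_bits:.1f} bits for {len(y)} labels")
```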
Yeah, I mean, it would be great if we could pin it down, because if we could say it's a matter of how infrequent and rare this factual knowledge is when you're reading about the world, or how specific this kind of thing is, that paints a very different picture compared to how difficult it is to compute the answer to the question, or how long it takes to arrive at the answer. Those give very different pictures. So before we move on: you mentioned there are a bunch of differences between this paper and weak-to-strong. Is there one methodological difference you're really interested in talking about before I leave this behind? We've definitely hit a bunch of them: we talked about the baselines, how to construct the data set based on labeler confidence, the human hardness variables, and even the differences in the results, how positive easy-to-hard looks versus how weak-to-strong looks. Maybe one more: one minor thing I would add, and I suppose this is a little more in the criticism category, is that a few people were definitely concerned about the early stopping that seemed to be important to actually doing the fine-tuning in the weak-to-strong setup. They're mostly looking at fine-tuning models; I think they did some prompting, though I don't actually remember if they did prompting or ICL, and I don't think they do linear probing. We also tried linear probing in addition to fine-tuning and prompting. But when they're doing their fine-tuning, there's a little bit of hyperparameter tuning and a little bit of dev-set model selection, like early stopping, that seemed important. This matters theoretically, because the idea is that, based on incomplete supervision, the right function should still be identifiable. You don't want the right function to be just one of many possible functions, where everything depends on getting exactly the right amount of fitting to the data, such that if you're underfit you're in a bad region, if you're overfit you're in a bad region, but if you fit exactly the right amount you happen to uncover the right function. One thing I can point out empirically, and we don't do much of this analysis in the paper, but in retrospect it feels important, is that we could fine-tune as much as we wanted; the longer the ICL prompt, usually the better; and the more data that went into the linear probe, the better. I mean, the linear probe fits easily, but we could basically fit as much as we wanted to this clean easy data, and performance on the hard data would just go up, which is great. So is this problem correctly specified, is it misspecified? I don't know, but we couldn't overfit to this signal, and that was something that was interesting to us in retrospect.
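A minimal sketch of the linear-probing setup mentioned above, under assumed stand-ins: random vectors play the role of frozen model representations, a shared "truth direction" simulates the case where the same linear feature tracks correctness on both splits, and a simple logistic regression probe is fit on cleanly labeled easy items and evaluated on held-out hard items. None of the shapes or the probe choice reflect the paper's exact configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
hidden_dim = 64
truth_direction = rng.normal(size=hidden_dim)  # pretend feature separating correct from incorrect

easy_reps = rng.normal(size=(200, hidden_dim))
hard_reps = rng.normal(size=(100, hidden_dim))
easy_labels = (easy_reps @ truth_direction > 0).astype(int)
# "Hard" labels follow the same direction but with added noise, so that split is genuinely harder.
hard_labels = (hard_reps @ truth_direction + 4.0 * rng.normal(size=100) > 0).astype(int)

probe = LogisticRegression(max_iter=1000).fit(easy_reps, easy_labels)
print("accuracy on easy (train) data:   ", probe.score(easy_reps, easy_labels))
print("accuracy on hard (held-out) data:", probe.score(hard_reps, hard_labels))
```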
Wrapping up a bit: we've talked about some of the stuff you've worked on, and you've actually worked on a bunch more that we didn't have time to go into. If people are interested in following your research and seeing what you've done, how should they go about doing that? Well, you can find me on Twitter, and I think we announce basically all of our papers on Twitter, so that's a good way to stay up to date. The handle is peterbhase, and I think you'll find me easily there. And if you're really curious about reading all the PDFs, probably a Google Scholar alert, which is something I tend to enjoy for others as well. All right, great. Well, thanks for coming on AXRP. Thanks so much, Daniel, what a pleasure, this was great. This episode was edited by Jack Garrett, and Amber Dawn Ace helped with transcription. The opening and closing themes are also by Jack Garrett. Filming occurred at FAR Labs. Financial support for this episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. To read a transcript of this episode, or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about the podcast, you can email me at feedback@axrp.net.