Library / In focus
AXRPCivilisational risk and strategy
RLHF Problems with Scott Emmons

Why this matters
This episode strengthens first-principles understanding of alignment risk and the strategic conditions that shape safe outcomes.
Summary
This conversation examines core safety through RLHF Problems with Scott Emmons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Perspective map
MixedTechnicalMedium confidenceTranscript-informed
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forwardMixedOpportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
StartEnd
Across 94 full-transcript segments: median 0 · mean -3 · spread -17–0 (p10–p90 -7–0) · 0% risk-forward, 100% mixed, 0% opportunity-forward slices.
Slice bands
94 slices · p10–p90 -7–0
Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.
- Emphasizes alignment
- Emphasizes safety
- Full transcript scored in 94 sequential slices (median slice 0).
Editor note
A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.
ai-safetyaxrpcore-safetytechnical
Play on sAIfe Hands
Episode transcript
YouTube captions (auto or uploaded) · video rAywTFQsKGQ · stored Apr 2, 2026 · 2,707 caption segments
Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/rlhf-problems-with-scott-emmons.json when you have a listen-based summary.
Show full transcript
[Music] hello everybody this episode I'll be speaking with Scott Emmons Scott is a PhD student at UC Berkeley working with the center for human compatible AI on AI Safety Research he previously co-founded far. aai which is an AI safety nonprofit foring to what we're discussing you can check the description of the episode and for a transcript you can read it at axr p.net well uh welcome to axer great to be here sure so uh today we're talking about your paper when your AIS deceive you challenges with partial observability of human evaluators in reward learning um by Leon Lang Davis Foot Stuart Russell anle drron Eric yenner and yourself yeah can you just tell us like roughly what's going on with this paper yeah I could start with the motivation of the paper yeah sure so we've had a lot of speculation in the extrus community about issues like deception so people have been worried about what happens if your AI try to deceive you and at the same time you know I think for a while that's been a theoretical like a a philosophical concern and I use speculation here in a positive way I think it's like people have done really awesome speculation about how the future of AI is going to play out and what those risks are going to be yeah and deception has emerged as one of the key things that people are worried about I think at the same time we've started to see AI like get more like we're seeing AI systems actually deployed and we're seeing a growing interest of people in like what exactly do the risk looks like do these risks look like and how do they play out in current day systems so so the goal here is to say the goal of this paper is to say how might deception play out with actual systems that we have deployed today and reinforcement learning from Human feedback is one of the main mechanisms that's currently being used to fine-tune models it's used by chat GPT it's used by llama it's used by variants of it are used by anthropic so yeah what this paper is trying to do trying to say can we mathematically pin down in a precise way how might this how might these failure modes we've been speculating about play out in rhf sure yeah um so in the paper you kind of the the two concepts you talk about on this front are um I think uh deceptive inflation and overjustification so maybe let's start with deceptive inflation uh what is deceptive inflation I can give you an example I think examples from me as a child I find really helpful in terms of thinking about this so when I was a child I my parents asked me to clean the house yeah and I didn't care about cleaning the house I just wanted to go play so there's a misalignment between my objective and the objective my parents had for me and in this paper the main failure cases that we're studying are cases of misalignment so we're saying when there is misalignment how does that play out what kind of how does that play out in the failure modes so me as a misalign child uh one strategy I would have for cleaning the house would be to just to sweep any dirt or any debris under the furniture so I'm cleaning the house I just sweep sweep some debris under the couch and now uh it's inflating the the word inflation is making the state of the world appear better than it is so I I'm inflating my parents estimate of the world and it's deceptive because I'm doing this in pursuit of some outcome other than the truth so I'm I'm pursuing My outcome of I just want to go play and then I'm that's me being deceptive and I'm inflating the my parents estimate of what's happening in the environment by sweeping stuff under under the furniture yeah and I guess um so in the paper you have like like can you tell us what the concrete definition is in the paper that's like meant to correspond to This n yes the concrete the mathematical definition in the paper is there are two conditions for deceptive inflation and we're in the paper we consider a reference policy and the reference policy plays the role of a counterfactual it plays it plays both the role of a counterfactual in a causal sense we can say relative to this counterfactual policy you are causing this inflation to happen yeah and it also serves as a sort of Baseline where we can say okay there's this Baseline level of human error so we're yeah so these definitions they correspond to the human's belief about the world so we need to have so like in building up the definition we need to have some notion of what's the human's belief about the world and in particular we focus on the human's belief about the reward or about the return of the agent's trajectory so so deceptive inflation it has two parts we specifically are focus on the hum's overestimation error so we have this mathematical object we call e+ which says how much is the human overestimating what's happening in the environment and so we say if relative to the optimal policy we use the optimal policy as our reference we say if the humans overestimation error is getting increased relative to the optimal policy that's that's condition one sorry relative to the the optimal policy yes um where optimal is like Optimal according to actual human preferences exactly yeah so so we're we're assuming we're assuming you've learned some rhf policy that's optimizing what the human believes the human's observation yeah reward so we're saying based on the observations the human has what is the reward they estimate in the environment yeah rf's optimizing that and so if the yeah if relative to the true optimal policy according to the true reward if that gets increased we're saying overest we're saying um like yeah the human's making an overestimation error okay of the reward and so that's condition one yeah that's the inflation inflation is the word to mean that hum's making an over justif an overestimation error Y and the second condition is that the observation the observation return has to be increased as well relative to the truly optimal policy where the observation return is like how good the trajectory looks like or how good the policy looks just by human observations yeah how how good is a human believe it is that has to have increased so the intuition is that deception is inducing a false behavior in pursuit of some outcome other than the truth that is from prior work by park at all they they propos this plain English definition of deception Y and we mathematize that so the the inducement of false beliefs we're mathematizing is condition one that overestimation error is occurring and then in pursuit of some outcome other than the truth is what we're formalizing as condition two which is that this RF policy that we've learned is doing better at like the outcome it's pursuing is just making the human believe the trajector is good so the second condition is that this hum's belief over the trajectory is higher than the true than it would be under the the otherwise optimal policy yeah so I mean it's kind of an interesting definition um in that yeah there so there are a few things about this that are interesting the first first is like because you're in this like RL like this reinforcement learning type space like you don't get just recourse to like the AI said something and it was false for free right because you have to be like okay the AI steered to this state and like uh what do humans believe about this state and then you've got to compare this policy to like another policy um it it also has some like like I think you're not assuming that the overestimation error of the optimal policy is zero right right so there's kind of an interesting thing where potentially like a like an AI policy could cause me to incorrectly believe stuff about the world but it wouldn't even count as deceptive under your definition if it like does that less than the optimal policy so I don't know I do you have any comments on like uh I don't know my musings about this defin just seems pretty like for sure yeah that's great and and yeah there were we actually had a bunch of iterations I should say when we were thinking through how do we want to Define This and like like we want to do our best to get this right we had a bunch of thought experiments about yeah we had tons of tons of thought experiments like little thought experiments that we had that never made into the paper where we're like should we like what a human call this deception yeah and like do we think our mathematical definition here actually matches what the human would do so one of them is a noop you you can imagine like a noop policy and so noop means like no operation it's sort of like standing still still yeah yeah exactly the policy is just doing nothing and you you might let's say there's some fact about the world you care about and then if the policy were actually do something you you would have a better estimation of reality and so you can say in some sense the noop policy is like you would have worse beliefs or less accurate beliefs according to the Noah policy than you would from a policy that that gets up and shows you the information that you care about so so we had an example of a phone book and we said okay suppose there's all these names in the phone book and I ask you you know what's on what's the name on page 79 of the phone book yeah you don't know you know you can't answer that question so so is would you say that the know up policy is being deceptive Rel relative to a policy that gets up and starts reading you the phone book because now you have more accurate beliefs about the environment so you have right you have this like you have lots like there's different there's lots of things to consider like one is what are even the parts of the world I actually care about Y and the reward function helps capture that you know maybe I don't care about what's in the phone book and then there's like relative to what like relative to some policy that's just maximally informative like is you know is it deceptive just if it could have done something to teach me about the phone book but it's not you know now is it deceptive so this relative to what and especially a counterfactual which yeah there there's both relative to what and there's also Notions of causality where you want to say there's a counterfactual involved so that the pieces that we have are we're focusing on the estimate of the reward which lets us zero in on the parts of the world we actually care about Y and we're focusing on the optimal policy which lets us say we're not just going to think about some arbitrary teaching policy that could have taught you all these things about the world if you wanted to but we're saying relative to policy that would actually be getting the task done right yeah so another thing about this that is kind of interesting is because it's like about well like as I your definition it's about the actual human belief and therefore like it seems like whether or not a pulsy is deceptive or not can depend on like how bad you are at forming true beliefs like like how bad the human is uh you know holding the robot fixed um yeah I wonder if you have thoughts about like the choice of doing it that way versus like fixing it to be you know optimal human beliefs or something like that yeah that's interesting I right so so I guess the idea here is like suppose if the robot were to do something totally innocuous like the robot were just to open open the door of your refrigerator and somehow this made me believe like if the human had a a very uh poorly a very poor belief formation process that like you know it just the robot just opens the door and now the human believes that the robot clean the whole house and it's like hey I wasn't trying to deceive you I was just trying to open the refrigerator door well well I guess it depends if the robot like knows that the human would believe that or not no like and I don't know then you've got to like mathematize what a robot knows which is like kind of hard just from a like such and such PCY perspective yeah we we very intentionally did we're very intentionally targeting a definition that's agnostic to the agent's internal mental processes so we aren't there's no claims here about the agent being able to do like higher order reasoning about the other person so there's no there's no notion in our definition that the agent is able to model the hum's belief formation process and exploit that yeah we're just we're very intentionally assuming a trial and error type of definition of deception where if the robot learns through Tri trial and error to do behavior that makes the human have worse beliefs about the world in order to pursue their outcome yeah of then then that's called deception and I think one interesting thing about this is how the rhf process plays into it because I think our definition makes most matches most intuitively with the human with like intuitive definitions of deception when you're applying it to a policy that was the outcome of an optimization process so we're imagining like it's gone the policy has gone through an optimization process that is leading to this type of behavior and so if opening the fridge and making you have some weird bad belief you know there might there might be no reason for that to be selected by the AR optimization process but if there's something that hey opening the fridge just makes you think the robot did a really good job and makes you want to give the robot more thumbs up that's the type of deception that the RF algorithm would would lead it to find right right fair enough um yeah and I guess one final thing that's kind of funny about this definition because you're measuring relative to the optimal policy right or you're measuring like the optimal rhf policy relative to the optimal policy for what the human actually wants um like one thing that's kind of funny there is that you could imagine that like you know real optimality just involves as a byproduct telling me a bunch of stuff that's like kind of hard to tell me about um but like you know it does it anyway like if I'm an rhf policy and I don't inform the human as much as this like super optimal um you know this true optimal policy that like I don't even like necessarily even have access to like I think that's going to be counted as deception under your definition um yeah I wonder what you think of like to what degree is that like a desirable feature of the definition yeah I originally we had given a more we we considered giving a more General template type of definition where we could say plug in any favorite reference policy that you have doesn't have to be the optimal policy but just whatever your favorite reference policy is plug that in so if you think that the reference policy is is this amazing teacher that's teaching all these facts that we think are unrealistic to imagine an RF agent being able to learn you could then specify a different reference policy and you could say Let me let me specify as my reference policy something that is helping the agent that is helping the human but it's it doesn't actually hasn't learned all this sophisticated teaching stuff that a true optimal agent would be and then you could just apply you could just still use the definition but just with a different reference policy plugged in and it ended up that having a template definition where you could plug in any reference policy it just ended up being very powerful and and kind of like it it just felt like changing the reference policy can so much change the qualitative meaning of how the definition plays out that we just felt like it was too much power like it was just like there's too much danger of kind of accidentally shooting yourself in the foot by plugging in the wrong thing that we didn't want to necessarily publish that but I do think there's room for a more like you know I think there's room for just plugging in a very sensible different very sensible reference policy that doesn't have to necessarily be the optimal one sure and so uh to Jump Ahead a bit my my recollection is that the kind of main theorem of the paper is that if you have an optimal rhf policy that is not an optimal like a true optimal policy then it is either doing deception deceptive inflation or over justification or both firstly uh am I recalling that that the correctly that's exactly right okay so would you still get that theorem under this like different definition of deceptive inflation uh that's a good question I would want to if if I were like to write this in ink I would want to make sure I'm saying this correctly I my my I'm thinking no so my I I believe the answer to that is no I so I think yeah what what we're able to do is we're able to characterize what the optimal rhf policy like how what that looks like yeah and based on that characterization we're able to show how certain conditions relate to the true reward function and so by making this comparison with what the RF optimal policy will do to the true reward function we're able to have this definition and then if some random policy that you wrote down weren't actually optimal according to the true reward function then that would break the mathematical connection that we have to how this F op thing behaves to True reward function the link from RF optimal true reward function and then true word function to Optimal policy that that second link would break down from this random thing that if you just wrote down to some arbitrary thing right right I think uh if there are listeners I think this could be a good exercise just like uh prove that this theem won't hold anymore I'm sure you can come up with like a small like a three state mdp counter example or something um or PDP I guess um yeah so okay that's I feel like we've covered deceptive inflation very well um the other thing that you mention and and I in many ways I think this is an interesting part of your paper because like a lot of people have talked about how um deception is a big problem with rlf but you bring in this New Concept called overjustification so people are liable to have not heard of this like what what is over justification mhm there's so wi there's this overjustification effect on and so the Wikipedia definition of of overjustification effect is when an external incentive causes you to lose intrinsic motivation to perform a task and the and I can again go back to my own childhood of cleaning the house to give intuition for what this looks like so I I'm my parents told me to clean the house but I just want to go outside and play so deceptive inflating was me sweeping debris under the rug over justification is when I am doing things to show the to show my parents that I'm cleaning the house like I I'm justifying to my parents that I'm doing the task they wanted but that comes at the expense of the true task reward so as a child what it looks like is no matter what I'm doing what room I'm cleaning uh my shortest path I modify my what would be otherwise been the shortest path to always walk through the living room so if my parents are in the living room watching TV and I'm carrying the broom maybe I don't have to walk through the living room actually get where I'm going but I'm going to walk through the living room anyway cuz I want my parents to see that I am carrying that broom and they know I'm cleaning the house so this is less efficient at the true task because I'm now going to take longer and it's also just a negative reward for my parents because they're they're trying to watch television they don't want to watch me carrying the broom however it does give me it does give a higher estimate of the reward like my parents are now seeing me me clean the house and so I'm I'm justifying my behavior to my parents and I'm in fact over justifying it because they don't want me to be doing the thing I'm doing it's it's sure um and if I recall so the definition is something like overjustification is when relative to this reference policy you're like reducing the um kind of negative error or like reducing the underestimation of the reward and then there's a second condition um yeah can can you fill me in on what the what the death is right so we we have this landscape of this taxonomy of types of failure modes Y and there are in this in this tonomy that we have there are four quadrants and the over justification is one quad quadrant while the deceptive inflation is another quadrant and there are two other qualitative behaviors that they're not in the theorem because the theorem is showing it's two of the four quadrants that will occur and to me it's helpful to think about the the whole land landcape of possible failures to help me figure out okay exactly what's happening with with these two so the the idea is the agent wants the human to have as high an estimate as possible of the reward and there's two ways of making the human have a higher estimate one is to increase the human's overestimation error so make the human have overestimation error and that was the deceptive inflation y and the other one is to decrease the human's underestimation error so if the human's mistakenly believing that you're doing worse than you are you actually want to correct that belief yeah and you want to do the opposite of deceive them you want to actually inform them that you think I'm doing worse than I actually am let me inform you that I'm doing better than you think yeah so that's what over justification is well well justification the word justification is just the word we use to mean that you are improving the humans you're reducing the humans underestimation error y then we add the over justification the word over specifically means when it's at the expense of the true task reward sure so the two conditions the very precise two conditions for over justification condition one is that you're decreasing the human's underestimation error relative to some reference policy which we choose as the optimal policy the second condition is that that you are paying a cost with respect to the true Tas reward again relative reference policy in order to be doing this sure so the thing this definition reminds me a lot of is this idea of costly signaling um in economics um it's it's actually I mean I I hesitate to use the words because this phrase has become kind of overused and straight from its original definition but like virtue signaling where people like pay some costs to demonstrate like actual virtues that they have as opposed to the thing derogatorily called virtue signaling which different thing but uh but but it seems like very close to this concept of virtue sing um I'm wondering like was that an inspiration do do you think that they are analogous or is there some difference there I yeah it wasn't we weren't consciously inspired by the idea of costly signaling in economics I think this is probably one of those cases where it's it's a convergent different fields to study the same sort of issues converging on on this the same phenomenon cool I think it's yeah it's a case of convergence I I they they sound I'm not an expert on costly signaling but they sound quite similar from your description one place where I think they might differ is that the person paying the cost in the economics case I as the agent am paying the cost in order to signal to you whereas in the overjustification case it's the human who's paying the cost so the agent is actually getting the benefit like the agent's reward is just the human's estimate so the agent's getting that benefit and the the person paying the cost is is the human in this case yeah you're right I think that I I mean it's a little bit weird because in your setting you're no actually yeah I think I think that's basically right although I'd have to think about economics costly signaling a bit more you could you could say that the agent the agent is trying to optimize the hum reward and so in that sense they are paying the cost at failing to achieve the thing that that we had been designing them to try to achieve although the actual algorithm we wrote down is just maximizing the observation so in that sense the the cost is getting externalized to the human or to the true reward and the the agent's algorithm itself isn't get it's not getting any worse at its own objective function yeah there's this question of how to model like who has what objective function yeah I think that's that's fair so here's a thought I had while reading your paper so over justification it's like the agent is paying some costs to or it's it's paying some costs in terms of what I care about um to keep me well informed of you know what's actually going on how well it's doing and you know in some sense this is kind of definitionally bad because it's like doing worse than it could if we were optimizing what I cared about but like from a broader perspective right if I'm like actually worried about my AI about my AI doing stuff wrong um if I'm like thinking about how to design the next iteration of Agents like this kind of seems good right so like I kind of like the idea of the robot like taking pains to make sure I know what's happening so how like how worried do you think I should be about overjustification yeah I'm not sure you should be worried about it per se I I I wouldn't necessarily classify overjustification as a new danger the way I would think about overjustification is that it's Illuminating a tradeoff in two things that we want so we want to be informed all the time yeah we might say in some platonic idea like if I could have everything else that I had and also be informed I I'd choose to be informed rather than not and what overation is showing is a there can be a tension between the true reward and we're building the tension into the definition but we're showing there can be a tension between the true thing that you care about and having this this reward and I think part of the reason why we have overjustification there as well is that it's the Duel of deception and so it's also showing the existence of this counterbalancing force in the rhf process so if you're training an agent that's just maximi that just is maximizing your estimate it it's showing that it's not all you in some sense you could view it as a hopeful as a hopeful force or a force that's fighting for good as this as this rhf agent has an incentive to deceive you and trick you into thinking it's doing better than it actually is there's also this Force for good that exists in the process which is it does have an incentive to also inform you to make your beliefs better about the things that it is in fact doing well yeah I think like so one way I ended up thinking about this was so why is over justification occurring well it's occurring because the human like it's trying to optimize or you know the the rhf process is attempting to optimize the hum evaluation of how good the state of the world is um which through belief right and the human belief or in the examples you use definitely um I think your definitions are generic about what human belief is but um but it seems like human belief is putting some probability Mass on the agent being deceptive and so because there is that probability Mass that's kind of why the agent like feels the need to do some over justification right it's like well I want to prove to you that I'm not like going around you know murdering a bunch of people not doing my job just like you know the evidence and so I've got these cost to prove to you that I'm not doing it and like maybe one way to think about it is like because because we are worried about um deceptive uh inflation maybe rationally as a result sometimes our policies are going to do overjustification which like is this cost which could have been avoided if the world were such that we didn't actually have to worry about deceptive inflation um so know that that was kind of how I ended up thinking about it I'm wondering like what do you think about that do you think that's like basically the right way to think about things or do I need more Nuance or something mhm I think that's a that's a very interesting question so in my mind there's two it evokes two other it invokes two at least two different types of elements I think one is how necessary is over justification like is over justification just something is it a part of life or is overjustification more of a tax in like you I think correct if I'm wrong but I kind of understood you to be saying it's a overation is a tax we're paying for being paranoid like we might have good reasons to be paranoid and we want to make sure that this agent isn't doing some really bad behavior but because we have this maybe paranoid is the wrong word it maybe paranoid implies in irrationality it's not even it's not even an irration so I think it's not an irrational it's more of we have this caution that yeah because we're being cautious and because we're not just going to trust the agent blindly because we want proof that the agent is doing well then we have to pay this over justification cost and I think that's definitely uh a tax I I think you can definitely view over justification as a tax that we need to pay for our being cautious or for our wanting to yeah you be cautious about what the AI is doing yeah and I think there's also a second element as well that plays into over justication which is an information we like the agent is actually trying to learn in in the in the rhf process more broadly the agent is trying to learn what we want to begin with so the agent doesn't know like to begin with it might not know okay all these bad things that it could be doing if it's like you mentioned you know going around actually C you know causing direct harm to humans you know maybe at the beginning if we're imagining a totally Blank Slate of AI it might not know at the very beginning yeah that I mean it it might have pre-training to know that certain egregious actions are bad but there you know they're just things it might not know and so part of the overjustification as well well is it has to like if there's hidden parts of the environment like if it's doing things in parts of the environment that we can't see then in some sense it might have to expend some effort pay some cost to give us that information in order for us to then know what's happening to then be able to provide it with the correct feedback yeah it's funny it's kind of a dual perspective right like like the overjustification tax for being worried about deception is this perspective of like the equilibrium of training like what we should believe about like how things will end up and what we couldn't couldn't distinguish between whereas the over justification as just like part of the learning process like you know it's like how you figure out that something is actually good is like this kind of dynamic perspective um and and I think it's and I think partially is that rhf it it essentially you can model it as when the observations are deterministic like like if take take a simple case of deterministic environment deterministic observ ations it rhf is just going to rank order all the trajectories and then pick the one that had the highest human estimate of reward so in that sense if there were some better trajectory according to the true reward that had no justification involved so the AI is doing a really good thing but the humans just never able to see it then this rank order of trajectories like it will never be at the top of the rank order of trajectories the AI will just learn to you like the top Frank cor trajectories has to include the one where the human sees the optimal where the human sees the justification of what's Happening that'll be the trajectory that gets ranked the highest and you could imagine taking a further step so you could imagine doing rhf at the trajectory level in which case what I just said applies you could potentially then take a step further where you have the agent infer a state level reward function and so it says okay based on this rank order trajectories that you gave me I'm going to actually back out the true function on States and then when it does if it did that learning in addition then it could learn aha the reason why it ranked certain trajectories higher is because they had this aspect of the state being good and now I can just go do that good aspect of the state even when the human even without the justification aspect that that I've learned that the human doesn't like yeah so I I think I want to ask some uh questions about other aspects of your setup if that's okay um so first we're dealing this like partial obs observability setting right um so for people who don't know the idea is that like what's going on is the robot is acting and it's affecting the state of the world but humans can't see the whole of the state of the world so in the in the deterministic partial observability setting like I kind of interpret that as the human observes a bit of the state of the world perfectly but it can't observe all of it so there are like certain states that the human you know can't distinguish between mhm yeah that's right yeah that's right and and a key a key part of our setup is we assume that the robot can see everything so we assume that the robot has access to ground truth and we assume that human just has observations and can't can't see everything sure so we build in this asymmetry between the what the robot can see and what the human can see yeah and I imagine I mean uh I haven't thought very deeply about this but I imagine you could probably extend your result to a case where like the robot also can't see everything but the robot can has a finer grain uh understand than the human does does that sound right to you yeah it's something I've been thinking about in follow-up work is think about the more more General cases of dual partial durability where you know you can you can model the humans reward parameters mathematically as another latent variable about the world so in some sense we do have the human knows the reward and the robot doesn't and if you wanted to just bake the reward the human knows into the state of the world you could say that we already do have that sort of set up and I've also thought about even more generally like what if the human can see other part Parts the world as well yeah yeah cool so um I guess a question I have about this partial observability is so there's there's the kind of um sort of obvious interpretation where I don't know our sensory apparatuses aren't quite as good or you know um that that's maybe one distinction but if I recall correctly like a part of the introduction of your paper alluded to this idea that like maybe we could interpret the partial observability as like a rationality failure where like we can actually see the whole world but we can't like reason to think about like what it actually means and what reward we should actually assign to it I'm wondering is that a correct interpretation of your paper and secondly like what do you think about kind of this uh you know partial observability as modeling human irrationality MH that that's totally a motivation that I had and that we we all all as authors had the I idea so like thinking about just the future of AI systems we're going to have ai that's as smart as humans and we're going to have ai that's even smarter than humans y we're going to have ai that you know is in one one robot and that you know has two eyes just like we do we're also going to have AI That's controlling networks of thousands and thousands of sensors and thousands and thousands of cameras so we're going to have AI That's and we're also going to have ai that's running on thousands and thousands of supercomputers and that has thousands and thousands of times more memory than us so we're going to have AI that's quite literally seeing th seeing thousands of cameras more than us we're also going to have AI That's Computing at thousands of thousands of thousands of times our speed so the the motivation is the motivation in terms of pure observability is that it'll see things we don't see but it'll it also might be able to just derive facts that we can't derive so if you if one imagines a simple game tree like imagine you're playing chess against an AI and you can imagine I'm trying to just expand out the nodes of this tree well if if the AI can expand 10x more nodes than you you can in some sense think that the AI is seeing a variable which is the value of some node that you couldn't expand and so that this variable that I can see is just it's really just the fact that the AI can compute more than you can compute right and so I don't I don't mean to overstep like we don't say anything precise about this bound rationality but my my motivation for having partial ability in terms of these variables was also inspired by by AI that just has much more computational power than than humans do yeah I'm wondering do you know if there's um other work in the literature kind of talking more about this like link between or how to model bounded rationality as partial observability yeah that that's super interesting and definitely something that I should read more into I don't yeah at the moment I don't have any papers that I I haven't done yeah this was I haven't yet done a a thorough literature review of like what exists on this topic but I imagine there could be yeah I wouldn't be surprised if there's interesting yeah that seems like a great idea for like seeing what exists out there cool so um another question that I I think I well I at least had it when I was starting to read the paper is in the classic RL HF setting there's this thing called boltzman rationality right where the human is presented with two things that AI could do and they've got to select which one they think is actually better but the humans actually like a little bit irrational or something so uh with some probability the human picks the thing that is worse rather than actually better um and you know it's a lower probability in the case that the Gap is larger so if like like if the human bounded rationality is being modeled by the by the partial observability why do we also have the boltzman rationality so I think there are two different things that the partial observability and that the bolman rationality can capture yeah so the partial observability can capture what factors is the human even considering yeah about the state of the world when they're making their decision and what bolsen rationality allows us to capture is the idea of noisy feedback so even given the certain factors that the human is considering like yeah essentially how big is the difference that we between these outcomes that we we do see so the thing that boltzman that partial abor ability allows us to do is allows us to model the human might not even be considering certain variables right and then what bolon allows us to do is say how much more does the human value one variable than another yeah so there's interesting work out of chai by other people like uh like my colleague Cassidy who showed that with boltzman What it lets you do is say instead of just knowing that outcome a is better than outcome B like if the human were perfectly had had no boltman then all you would learn is that they prefer outcome a to outcome B because they would just select outcome a 100% of the time y but what bolman lets you do is it lets you say I've observed that they prefer that they tell me they prefer outcome a 80% of the time and that lets me infer exactly how much more do they prefer a to B and I can actually tell you the exact the exact ratio of a to B and so this is additional information that you get from Bol and and so partial observability is just saying when they think about and B like which factor were they even thinking about to begin with and then Bol rash rash let you get exactly what like how what is the relative amount they care about these things yeah it's a wild result I think was that Cassidy or I have some memory that this was like Adam Glee Matthew fie Roberts and your Scala so there's so Cassidy has a paper focusing uh like Cassie has a paper that explicitly is titled something something like noisy comparison let you learn more about human prefences and he really hammers home this point about how bols rationality is giving you this new information and then other paper that you mentioned is uh analyzing how different types of feedback modalities characterizes how much you can infer about optimal policies and about rewards and it gives like a broader taxonomy that's not necessarily hammering home this exact point But it includes uh that other result as one of the things it has in its broader taxonomy yeah I mean it's a very strange result and I so the way I sort of settled on thinking about it is that kind of I mean I mean if you think about it it's a very strange assumption right because you're like bolman rationality it's kind of nice um because it lets you model people are making mistakes sometimes but they do it less when it's high stakes in the preference inference case you're sort of pushing the boltzman rationality to its limit where you're like I can drive exactly how much the human prefers A to B just by like you know the exact probability that they make a mistake and like I'm not sure that I believe that but um in the context of your paper it means that like by assuming people are doing rhf with bolman rationality you're able to say okay look somehow rather the robot is going to infer the exact like uh you know human preference ordering and the you know exact human preferences over trajectories and like it basically like leads to it being a generous assumption um and like we don't actually have to believe in the Assumption for your paper which is saying hey even if you do have this assumption here are some problems that can still exist like the what do you think of that interpretation yeah that's spot on and I don't we actually had a part a little part of the discussion of hipper where we say where we said we think this is a generous assumption that you're able to have this precise essentially we have this precise noise model of precisely how the human will make noise even even in facts they know about how precisely will they make errors in or how exactly will they be noisy the feel like they give and this is exactly what you said this is a generous assumption and so we're showing that even with this generous assumption so yeah we we had that in the paper at one point I don't remember if we actually have taken it out maybe that's some places and I I was scratching my head a little bit and then I was like oh no I I see what's going on yeah that that's exactly something that we yeah that's I I totally endorsed that interpretation yeah yeah it's a very yeah this thing where you can just do exact preference inference other bolman is a bit strange um I I mean I think there's you can kind of Rescue It by like like if you trust humans to give like feedback over lotteries of like uh you know do you prefer this chance of this trajectory versus this trans over this trajectory or you know just this trajectory then I think you can recover the same information uh with less strong uh assumptions but yeah so uh a final question I want to ask about the setting is um I believe intentionally you're um you're able to your results are generic over human belief about um you know what what the AI is actually doing given observations right and so yeah to me this prompts this question of um you know humans can have like rational beliefs or they can have like irrational beliefs and I wonder like how much do does the rationality of belief play into your results like if humans are like actually getting the probability chance that the AI is you know doing over justification or doing um deceptive inflation like like if the human has accurate beliefs about what the AI is doing do you even get these issues um or is it just in the case where humans are slightly inaccurate in their beliefs the these issues so if the human had perfectly accurate beliefs about what they a are doing at all times then you wouldn't get these issues this this is only an issue that happens if the human has inaccurate beliefs and I and specifically these issues are happening we have two objects in the paper there's the return the true return yeah and then there's the return that the human thinks is occurring yeah where return just means goodness measures goodness measures yeah so that there's how good is the world and there's how good does the human think the world is yeah so if these two are always equal to each other you'll never get a misalignment problem and because rhf will maximize how good the human thinks the world is and that'll just always equal how good the world is so if the human beliefs are perfect in that sense you won't get any of these issues there can still be an issue though where the human might just not have enough the word belief is referring both to how good the human is at interpreting the evidence they observe and also just how much evidence do the observations even give them in the first place to make these judgments so even if the human were perfectly good at taking the available evidence and coming up with the most accur possible you could still have issues where they just don't have enough evidence to even know how good the world is and that could be a reason where you get a misalignment between how good they think the world is and how good the world actually is even though they're perfect reasoners in some sense right so so even in the setting where like where the hum are perfectly calibrated and they know the like relevant probability distributions just like you know Randomness in the AI policy or in the environment can that can cause these like overjustification or deceptive inflation not only Randomness but also inadequate sensors so the most extreme example is there's no sensor at all like the robot turns off the lights does a whole bunch of stuff the human just can't see anything and maybe they have a perfectly calibrated prior over what the robot's doing and so they have perfectly calibrated beliefs at what's happening but there just is no camera like the agent goes off into another room where you don't have a camera and now the human just can't see what's happening in that room they can't give you good feedback because they can't see it yeah although in that case it's going to be hard for so so in that case the human belief about what a robot does is going to be the same just no matter what the robot does so it's going to be hard to have over justification right because because like the robot just kind of can't provide good evidence I mean the robot could go into the other room do a bunch of cleaning and then it could emerge with the mop in its hand right and and and you know it could or it could emerge with a clean let's say it just did the dishes in the other room it could then emerge with a Clean Plate and to show you that the plate was cleaned and this could come at the risk of dropping the plate maybe it's a beautiful porcelain plate that yeah now the pl's in the wrong place and that just wasted some time to come out and show me that it cleaned the the porcelain plate right right right yeah makes sense so um so so even in the case where uh the human belief distribution is um is calibrated but there's a low information Zone then like you're going to run into troubles I guess like maybe you can also think about like in some sense the humans having accurate beliefs involved solving this equilibrium problem because the r like the rhf process is like get a human to rate various things and the humans got to like form beliefs in order to rate the things and you know those beliefs have to reflect the output of the rhf process so you could imagine like you know bad equilibrium solving um or you know uh you know having some mixed policy where you're like okay it the resulting thing it could be deceptive it could not be deceptive um you know maybe I don't know or maybe maybe I've got to be a little bit unsure in order to like like somehow maybe you prefer the over justification failure mode than the deceptive inflation failure mode and in that case maybe you want to be a little paranoid and mhm yeah so it seems like especially because you're doing the rating before the robot is fully trained it seems like you can have things where in some sense it's rational for the credences involved to not actually match the exact robot policy and I just said some stuff do you have thoughts there's yes I have a few thoughts one is you mentioned this there's this question of kind of finding equilibrium and like how do we even get the learning process to come to an equilibrium that we like and this is something that I think is is interesting and that we haven't yet analyzed so in our paper we're saying assume you get the RF optimal policy what are the properties that he would have and I think there's also other questions of just well how would you even get this R policy and assuming that you couldn't how could we then try to structure things so that the suboptimal ones that we get are as good as possible I mean there's that process there's also the equilibrium process of like if I have the if I believe that the robot has like an 80% chance of being deceptive then it just will be de or then it will over justify and if I think it has a like 1% chance of being deceptive then it like actually will be deceptive and like somehow like in order to to be right I've got to like adjust my beliefs to this one zone where like my beliefs produce this robot that match my beliefs and like that that's a hard problem yeah and we this is something that we also thought about we called this strategically choosing B where B is denoting the belief and if you imagine this as a game where the human's going to input their belief function to the game B and then the robot's going to Output some RF mizing policy you could imagine well can the human uh and in the belief function be it's not the beliefs it's the function by which it will go from observations to beliefs and it's possible to strategically choose be that to lead to a better RF outcome so like you're saying if you're too paranoid you're going to get this outcome where the robots just all the time telling you that hey I'm not you know I'm not out herea I'm not destroying your house like you don't need to be so paranoid all the time like like yeah if you're super paranoid then the robot has to all the time be showing you that it's not destroying your house but if you're not paranoid enough it might learn to think that you actually do want it to destroy your house and so or like you know you could imagine these these types of forces pushing against each other in in any case and and there's definitely a strategic choice of the belief that would actually lead lead to better behavior and we this is something that was interesting to us from a mathematical standpoint and yeah it's totally something that that emerges in in the math of how these things play out right right yeah so moving on a little bit one question I have is that you're presenting these results as about rhf but it seems like there may be like they seem kind of deep to be right because you're sort of pointing at this trade-off where if the human doesn't know what's going on and you've got you know some kind of robot policy that is looking optimal to humans then you know I either the the humans are overestimating the value of the state of the world and you know you you can call that deception if you want but like you know either we like overestimating the value of the state of the world relative to the thing that would actually be optimal or we're like underestimating it less because what's optimal according to what looks good to us you know involves like paying some costs to look optimal or something like it seems like this could just be true of like any alignment approach where humans don't understand everything that's going on and not just rhf um I'm wondering yeah how how deep do you think this kind of trade-off between deceptive inflation and overjustification is I I totally agree that I think there's something going on here that's more than just about rhf and something that I've been wanting to do is think is there a a broader framework that we can use uh to talk about to more precisely to to to keep the Precision that we have about this trade-off while not limiting ourselves to RL CHF so my intuition is that any process where the robot is maximizing the human's belief about what's happening has this trade-off involved where you want to make the human if you're maximizing how good the human believes the world to be then if you can make the human if you can deceptively inflate their beliefs about the world you have an incentive to do that and if you can justify their beliefs and in particular over justify your beliefs you would have incentives to do this so I think the the Crux of this tradeoff is just that you're maximizing the human's belief about the world and in the paper we showed that we showed how rhf can get into that zone y to connect it to all these algorithms that are using rhf in practice these a agents that are using RF in practice and I totally think yeah I'm interested in exploring like can we still show that tradeoff in a precise way in other settings as well yeah and and it seems like it doesn't even have to be the the robot quote unquote trying to optimize this objective right like if humans are picking a thing which optimizes the objective of looks good to humans then like like it seems like you can get this just out of optimality um not just out of like ah the the robots like trying to do this thing but like optimality according to like flawed perception it seems like that can also get these sorts of issues yeah that yeah that's super interesting and as you're saying that I'm wondering because the the actual mathematical proof that we have it we we show that rhf that we we show that rhf maximizes the hum's belief about how good the world is and then we show that there's a there's this basic tension between the belief of how good the world is and then the true return and it's actually that that tradeoff it only I'd have to double check the math here but I believe it only depends on that second part it only depends on this tension between how good the human thinks the world is and how good the world actually is and rhf only comes in when we then say and aha rhf is the rhf optimal policy is mapping to how good the human thinks the world is yeah but you can forget the rhf connection there you still just have the part that's showing that tradeoff uh even agnostic to whatever algorithm is leading you to it so I'd want to double check but I think the math maybe does just kind of go through in that and that's a neat that's a neat point for you to emphasize yeah um so so perhaps in deflating my speculations there a little bit is Section Five of your paper where you basically say that um under certain conditions maybe the robot can do better than naive rhf and I understand you to be saying that you know if you know the hum beliefs uh you know which maybe you don't but like suppose you do and also suppose that you realize that you're trying to infer reward function so rhf is um inferring the return you know just the sum of rewards over all time um but it's because it's inferring a return function you know the fact is there's some structure where this it's coming from a reward function of you know a function of each state how good that state is and if you if you know the hum beliefs and if you know the um uh if you know that you're trying to infer function of states that gets added to produce a function over trajectories then then you can do better um and I was wondering so you show that there are some examples where like you know previously you had deception or over justification but once you realize these things you know that that helps you like actually just get the right answer and you know you have some math in there and I didn't double check the math but I assume it's correct but I was wondering if you can try and like really persuade me like like qualitatively what features of this situation mean that like once you know human belief you you actually just can infer the right thing MH yeah so the I think the intuition there's nice there's definitely I think simple and nice intuition for just how can you do better at all if you know the hum's belief than if you didn't before and a a super simple example would be imagine just the human's belief is actually just the opposite of realities say there's there's just two colors red and blue and whenever the human sees sees blue they think the world's red whenever they see red they think the world's blue yeah and now you know the naive RF that doesn't model human beliefs it just learns oh the human likes blue more than they like red yeah but if you knew that they had this backwards you know maybe it's a camera that's just broken or some some setting in the computer screen is just flipped sure if you knew that then you could just say ah they're saying they prefer blue to red but they actually prefer red to blue and so that that's a super simple example where okay I can just do better cuz I understand what the hum's believing about yeah I I kind of want to talk about these examples you have in the paper where yeah I can I can walk through those and maybe I can I can also walk through the saling feature yeah I can walk through like I can just maybe walk through even slightly simplified examples that are based on the examples we have in the paper I can just give the intuition Behind these examples so there's there's two types of things so in the paper where we focus on LM agents so imagine an agent's helping you install software and there's a bash command to greater than sign slev null yeah which will redirect the standard error to a black hole Yeah and otherwise not affect the output at all right or the process at all so we imagine suppose the agent just through its random exploration learns when I type two greater than s/ sln all of a sudden I'm getting much better feedback yeah because the humans the eror message are just not getting printed and the humans not seeing them so this is an example of deceptive inflation and we Show an example in the paper where it's not recoverable even if you know the human belief it's not recoverable and the reason why is because imagine a simple environment where either the package is installed correctly Y and the human sees the package is installed correctly yeah the package is installed incorrectly and there's an error message y or a world where the package is installed incorrectly but the human sees nothing yeah and in this case if the success there these three states of the world yeah suppose success produces no error message yeah failure and hiding it produces no error message and then failure produces an error message yeah but does success produce a success message or is it also kind of silent so we showed in the paper we made it a bit of a multi-step situation where suppose there's a first piece of software that you do want to install yeah and then you're going to take a a risk to install a second piece of software yeah and so the agent learns so the first piece of software it'll always produce a success message okay and then the second piece of software it'll just never produce any message it'll just produce an error if it produces an error or it'll just otherwise produce nothing and I I'm just simplifying this a bit for the conversation so just to but so imagine there's you either see the success message of the first so when the human sees the success message of the first software yeah they don't know the other software could have failed silently yeah or it could have installed successfully and like I I the human just just doesn't know yeah and and so I I think we're also assuming that the human actually like doesn't like it when the when the agent tries and fails to install Cuda like to install the software right um that yeah that's right when it fails to install the software maybe it break something on the maybe something else gets broken on the computer there there's some negative reward from this right so so the human can tell if the prep was done but the human like can't tell uh if the second stage install was successful or not that's right or it it the human doesn't get told when the thing is successful it only gets told when the thing is fail when the second install has failed but the the robot can you know hide the bad news right yeah mhm and you the way that we can make this a bit more realistic is so you could say well why wouldn't the human just get told if it's installed successfully the way we make it a bit more realistic is you can say the human can't distinguish between not trying at all and trying and failing yeah so you can say if the AI doesn't try at all that's better than if the agent tried and failed yeah so the question is can the human distinguish so can can the agent learn that not trying at all is better than trying and failing so you can imagine there's there's not trying at all there's three there's three possible states of the world there's not trying at all there's trying and failing and there's there's trying in succeeding Y and the issue is that not trying at all and trying and failing and suppressing the error message lead to the exact same observation right and then trying and succeeding leads to a different observation sure so if if you don't model the belief you can learn that trying and succeeding is better than when the human sees no output at all y but there's no way to distinguish between the human observation is the same in either case not trying at all or trying and fail and trying and failing and hiding the message so you're always going to get the exact same feedback in both of those cases and this is where there's the the case of fundamental ambiguity is because the human can't there's no observation that can let the human distinguish between those cases and you there's there's no way for the to give you different feedback so there's no way for you to learn right so like you just never like you can never disambiguate between trying failing and hiding and never trying at all and therefore like how are you going to figure out like which one the human considers worse right um and and yeah in so so I guess like in the am I right to think that in the sort of standard rhf setting you know like because you're optim iing for looking good like well you could imagine enriching this scenario by giving the human enough information to distinguish between these things maybe in the rhf scenario the um optimal policy maybe doesn't care to distinguish these things or doesn't care to gather more information as I say that it sounds wrong but but basically what I'm asking is you can imagine enriching the state space where like you know thei has the opportunity to ask humans questions and like prove things to humans and I'm wondering like how many of the problems persist in these settings where you have this enriched problem yeah the so if the human so yeah if there are other it's possible as well that if there were other trajectories where the where the robot says Hey human let's stop everything let me ask you would you prefer that I install the software would you prefer that I try and fail would you prefer that I don't try at all like like if if the AI could just if there were some other trajectory where the AI could just say let me just stop everything and like ask the human for their preferences it's yeah if there if there are other ways the robot could get the preference information it is possible that that could solve these issues and the robot could infer aha from these other trajectories that aren't even on the optimal path I have been able to learn the right reward function yeah that that that is something that could happen if you have if those other alternate paths exist and are taken during training and and are informative enough to back out the relevant information yeah and so so I think this works in the setting where the AI is trying to like infer through the human's belief and maybe in the rhf setting the robot will like ask the question and then just ignore the answer and doing the easiest thing that's indistinguishable to the human from doing the right thing um because I think that will be just as RH of rhf optimal is actually doing the right thing when the human con see does that sound right uh yeah I think my mental model of rhf is that it just takes the trajectory like the the naive rhf agent is that it just because the robot can see the full state of the world it just assumes that the human can see the full state of the world so this naive agent it'll just choose the trajectory that's gotten the highest feedback and so it won't even when it's asking the question it won't really do any kind of fancy modeling about the answer it'll just say like what was the trajectory along which I got the best feedback and like actually installing software the trajectories where it looks like software was actually installed are going to be the trajectories that look like the best thing is happening the trajectory where it's asking questions aren't like the the the naive RF process just doesn't really have a mechanism for like from transferring these questions back yeah so this is a way in which Okay so we've covered a bit why rhf can do worse than um the than you know if you were like actually doing inference and if you were you know being sensible about what the human believes given various observations you have this second example in the paper sorry in the first example um in in the simple form of the example uh there was just this fundamental ambiguity where like even if you were uh you know optimally inferring human preferences you know via just knowing with full certainty what the hum beliefs were um there are still things you couldn't distinguish between and then there's a second example where you can actually do better by um inferring through you know by knowing what the humans would infer given certain observations can you talk me a little bit through that and what the key difference is you know between the two examples that lets the inference work better than the second one so we can I I we can even stick exactly in the example that we is talking about where there these three different outcomes so we were saying before there's the outcome of trying to install the bonus software and succeeding not trying at all to install the bonus software yeah and then trying to install the bonus software failing but hiding the error message yeah and the issue was that human got the exact same observation in the latter two cases and so you'll always get the exact same feedback but suppose that the human got slightly different observations so the human could could see that you tried to inst the software but they could see that you tried but they couldn't see the outcome so so they can see that you tried but they can't see the outcome or they can see that you didn't try at all and so now now the human has different observations has different observations and if the human has different beliefs even if the human doesn't know for sure but if when the human sees you try they think there's maybe a 10% chance that the software failed mhm or if the human sees you don't try at all they think there's a 0% chance that the software failed now you can infer you're they're making choices over different lotteries like they're making choices over different bundles of outcomes so you know that the feedback from when you tried contains 10% 10 10% of that feedback is coming from parts of the world where you failed and you know that 90% of that feedback is coming from parts of the world where you succeed whereas in the case you didn't try at all you know it's 0% from when you fail 100% from when you succeeded and because these different possible outcomes are bundled in to the feedback because because humans getting different observations and because they're having different beliefs and you can infer that you can if uh certain linear algebra conditions are met you can do the unbundling in the learning process and actually learn the true reward right so so should I think of this as just saying like the humans is able to observe more stuff and so if you do the correct like backwards inference you know through the human beliefs then you were able to disambiguate more than the previous example where you know there were just more states of the world that were indistinguishable to the human is that roughly it there's different types of things that can happen so what you just said is one type of thing that can happen so and i' these are kind of these are extreme cases because it's intuition is easier in the extreme case yeah so yeah the one extreme Cas is they're just fundamentally indistinguishable observations now we've actually made them disting and thats us learn more because even though the's not it can at least distinguish between the two yeah yeah and so that that's like one extreme case and it can actually still be but you can get even more nuanced and you can say suppose in both cases the observations are distinguishable but the human's beliefs don't change so if like the linear algebra thing is if if the human always believes that the relative ratio of the outcomes are the same then in linear algebra terms you'll always have this dependence there's always this dependence where like you're always seeing on these two states you're always seeing the same ratio between the two and so essentially when you try to invert the matrix it's not invertible because you don't have a linearly independent you haven't got feedback on a linearly independent set of the states of the world certain states of the world have always occurred at the same relative probability which prevents you from getting the full linearly independent and so you can't like invert so and just intuitively yeah I like this thing of saying like yeah the failure of linear Independence is just like you're you're getting the same proportional outcomes so you're not able to pick up on difference of like what if this one is relatively more likely than the other one that that actually helps a lot for why the linear Independence thing mattered in the paper exactly and the extreme way that they're always the same relative proportion is that the observation's exactly the same so so it's not that the proportion is the same but literally the numbers themselves are exactly the same yeah but yeah more generally it's it's whether or not you can get information about diverse enough proportions of outcomes to do the full backwards inference process gotcha and that and that can depend both on the observations and the sensors involved and it can also depend on the human belief formation process yeah those are two different mechanisms by which this can happen yeah so I think okay that so that helps me understand that part of the paper better so thank you for that um yeah so I I guess like so earlier we were speculating about like is this whole issue of um you know over justification versus deceptive inflation is that sort of inherence to getting robots to do what we want in Worlds that we can't like totally perceive or is it like just a little just a freak of rlf yeah it seems like one Avenue to pursue that is to say okay like like let's take this setting where the robot you know knows exactly what the human beliefs are given any you know sequence of observations do you still have this trade-off between over justification and deceptive inflation um and I'm yeah I have so I guess in one of the examples you kind of did have that trade-off still um yeah yeah do you have do you have thoughts about the yeah whether this trade-off exists in the setting where the robot is doing the right thing you know trying to INF infer human beliefs given human um sorry trying to infer human preferences given known human beliefs how I would so I would say for things to go well the robot has to get information about diverse enough outcomes yeah and so that in whenever the human's making an expected value whenever you're getting feedback according to expected values things magically become all linear and so diverse enough feedback translates to I need some linearly I need to have a span of the whole space or like I need some linearly linearly independent things but yeah the the the basic intuition is it has to get feedback on diverse enough outcomes of the world and so when the robot's actually doing the right thing and it's actually trying to infer the human belief then what that lets you do is it lets you overcome sensor limitations so there there's two limitations for why the robot might not get diverse enough feedback one is just sensors where the outcomes of the world that the human's observing like how diverse the human's sense perception is of the world might differ from how diverse the true world is and so that's like one bottleneck and what trying to do the right thing does is it lets you bypass it lets you like do backwards inference through that bottleneck of observations and then get directly at the beliefs themselves and so it's possible that the observations weren't diverse enough but the human belief was diverse enough and by modeling this explicitly you can get at those more diverse beliefs but there's still the question of are were the beliefs diverse enough to begin with and so you can still have these trade-offs you know it just kind of like pushing the puck like pushing the puck one layer back yeah I wonder if there's something to say about like like in the paper you have the sort of binary like you're either deceptively inflating or not you know you could generalize that to say like how much am I deceptively inflating by right um um and it strikes me that there's maybe some interesting theorem to say about like okay you know if we move to doing this type of inference you know like how much does does the problem of like deception or over justification decrease by if I like gain this many bits of information about the environment or you know maybe you don't want to measure in bits like like so like I I guess like somehow you have to input units of a reward to get units of reward being output um from like just from like a dimensional analysis setting but but it it seems like some theorem like that is on the table to say like yeah improve the feedback this much your life you get this much less deception totally and I I think you yeah I'm curious what you meant about the dimensionality thing I do think you can ask this question of like in a very practical case you can just say okay we've see these problems and how can we actually make real world systems better now that we're aware of them and one takeway is just you have better sensors like allow give give the human more tools to understand what's happening so if you have an LM agent doing a whole bunch of stuff a language model agent if you have a language model agent doing a whole bunch of stuff on the internet or on your computer you could invest in tools that let the human probe what was the agent actually doing on my computer what was it actually doing on the internet and that's going to be a dollar cost in terms of developer time to create this tools it's going to be a time cost in terms of well now the humans giving the feedback have to use these tools and do all this investigation at the same time you're getting more information which will give you better feedback which could ultimately give you better reward so I I totally think there's this trade-off and I think it is quantifiable of like for how many points of unit reward like how much dollar cost do I have to pay to improve the observations and how much reward you know reward do I get in terms of paying that so so the thing I was saying about a dimensional analysis is um so for for people who don't have a physics background like imagine like somebody presents to you a physics equation and it's about relating um let's say it's about relating the speed with which um some Planet moves around a solar system to its size and their equation is its speed is equal to its mass that equation just like can't be true and the way you can tell that it can't be true is that speed and mass are different things speed you measure it in meters per second and mass you measure it in kilogram there's just number that equs am ofil they're different things now you could have that says this is equal to this mass times this ratio of uh speed to mass right like that's the kind of equation that could possibly be true but um without that conversion factor like it just literally could not be true right um and you can think of dimensional analysis like you can also use it in um various settings like oh I think um is his name Bishop I I think okay there's this guy who has this textbook on machine learning and he notes that like principal component analysis um is kind of fails this dimensional analysis test where like imagine I've got a scatter plot where like the x-axis is a bunch of things measured in centimeters and the y axis is a bunch of things measured in dollars and I have this plot of like or let's let's say it's like square meters and dollars and it's like houses right like how big are they and how much do they cost and I have this like scatter plot of various houses on this and I do my principal component analysis which is like basically a way of finding the directions of Maximum the direction of Maximum variation um and the thing about it is if I change my measurement from like am I looking at square meters to am I looking at square feet or am I looking at square centimeters that can change the direction of Maximum variation um just because like if I'm looking at uh square centim like just the number of square centimeters by which houses vary is like way more than the number of square meters by which houses vary um just because there are way more square centimeters than square meters um so principal component analysis like it's not dimensionally sound in cases where you've got like you're measuring data where you know different elements of the vector have different dimensions um because like if you measure things a different way like you know the equations just can't continue to be true that's one way to see it anyway um so the reason that I think this kind of theorem is going to have some sort of dimensional analysis bounds is like information Theory bits and reward like they're sort of measured in different units if you think of measuring them and one way to kind of see this is suppose you give the human a whole bunch of information about parts of the state space that just aren't relevant to reward so you tell me like you give the human a bunch of bits about like the exact shade of the robot arm and it's like well I just don't care that's just not going to enable me to make better decisions give higher quality feedback to the robot but if you give me just like one bit did the robot go on a murder Rampage or not and I was previously worried that it was going to go on a murder rampage that gives me like a ton of increased reward um so like in in order to get a bound on reward coming out you got to like start with the bound out reward coming in is at least what I would assume um but I'm just coming up with this on the fly so I might be wrong here MH right right yeah yeah I yeah that that's interesting of and thinking about like how would you tie yeah then I guess the next question is how would you tie the the reward bound to the information like essentially like how would you do that gluing how would you make that connection between information and the reward yeah cuz cuz like I was saying like you might want to invest in tools that let the human better understand what an agent is doing it's like okay but what type of information okay I invested in all those tools but like just purely a bits argument isn't necessarily going to help you because you could just be learning irrelevant information about the human and so that's I I think i' mentioned earlier this phone book example where you know the we had used a reference in our previous the we had been able to use a reference policy to help us make some of this connection that aha it's both are reference reference policy and also thinking about the the true reward as entally avoiding this phone book failure where you're like ah the agent just reading me the phone book now and I don't actually care about the names that are in the phone book and right right yeah I totally see this this interesting challenge of how do we focus in on the parts the information that is actually relevant to to the reward yeah it's and and it relates to like um like sometimes when we want to compare probability distributions we use like the this thing called callback lier Divergence which basically measures supposed the real distrib is q but I think the state of the world is p and I'm basically okay this this explanation is still going to be too technical but whatever I'm I'm like making codes um like I'm compressing the world assuming that it's p but the world is actually Q like how many bits am I wasting by using the wrong code um and it's sort of this measure just probabilistically how different worlds are um and we have this different metric called the vasin distance which is like okay how much like just think of the partial distribution function of like some quantity of interest you care to measure about the world like uh how heavy the toest person in the world is you're a little bit un uncertain about that you have two distributions over that like how much probability Mass do I have to like move along this axis to like get to the right thing and so like like one difference between like cback lier diversions and fos shine distance is just this thing where like fos shine distance like tells you how different these distributions are on some like metric that you actually care about which Kack lier like kind of can't do um yeah this this whole dimensional analysis okay tip for young researchers dimensional analysis is like really important and if you like get it into your head it's like it's a nice power then you can just like there there's a fun physics I've seen fun physics examples of this where like for example if you want to remember the equ if you want to just derive the equation for the period of a pendulum you can do dimensional analysis and you can say okay I need seconds to be coming out I know it depends on the length of the Rope I know it depends on G and you can basically figure out I know it depends on these things I know I need seconds to come out and then the units basically tell you okay is it in the numerator of the fraction is it in the denominator of the fraction and like do I need a square root in there to get the and you can basically get the whole formula out just by looking at what units what I need and what what are the variables it depends on yeah it's right it's really cool unfortunately you do have to worry about like is there a factor of like 1 12 do you have a factor of the square root of Pi you know popping in somewhere some some of those like if you know that you're taking that you would have been taking an integral like like sometimes that can tell you that there's going to be a factor of one half um but uh yeah all right all right so that's that's this uh tangent I took us on um so I think the final thing I want to talk about is so you have this paper showing a bunch of um kind of theoretical things about reinforcement learning from Human feedback and presumably the motivation for your paper was in part actual systems that might actually use reinforcement learning from Human feedback do you think that uh the results you have this paper shed light on actual failure modes we do see today or maybe that we should expect to see like in the very near future one thing I would say is I expect more and more in the future that will be in this case where the humans only have partial observations so I think if people are imagining how does this help with my interactions with chat GPT today you might be thinking well I asked I asked it a question I asked who was the first president of the United States it gave me the name of someone and I'm able to evaluate that and so this you know you can see there's not any hidden Behavior happening there's not a partial observation that you're getting it's just a conversation I can see the whole conversation I can evaluate it I imagine we're going to be as since Court become more capable I imagine we're going to be moving more and more into the worlds where you can't see everything that's happening and for examp open AI had a demo of what they called chat GPT plugins where exactly this was happening so the the agent can for example the chat gbt agent it can surf the web so it the human could see a summary it could see aha the agent surfed this web page surf that web page surf this web page but the human couldn't see the details of what was were on those web pages so now the human has a partial observation compared to the agent and even more so the agent could be helping you code the agent could have written a bunch of of code and the human just doesn't even see that code or they might not even be a programmer and so this partial observation can happen because the AI has knowledge of programming that the human doesn't have or it's just written a bunch of code that the human doesn't have time to see so part of my motivation here is that I expect that real systems are going to be more and more taking these actions that the humans are not and and having these observations that the humans themselves don't see so I so I'm hoping to get some some of my motivations is to try to get ahead of the curve here and think about as we move into this world with more and more complex age of behaviors that the humans can't fully see what failure modes might happen right and I guess it also yeah it's it's interesting because like precisely because it depends on the actual human beliefs um like like in reality like there are specific people being paid to do this like reinforcement learning from Human feedback and you might like like now it's like like like this especially like a future version of the theem that told you like um how many how much reward you lose um given how much like lack of knowledge of a thing like maybe it turns out that you like really want programmers to be doing the rhf um rather than you know people who have tons of skills but not programming specifically um or uh I'm sure there are other examples like maybe you want um like if you want the AI to not say dodgy stuff maybe you want people who are like kind of sensitive to whether stuff is actually dodgy yeah so so related to this um so a recent in a recent episode of my podcast um I had Jeffrey Ladish on to talk about work basically about how easy it is to undo um rhf safety fine tuning stuff so how how many dollars costs to you know actually just run this computation to fine tune it and try and take the safety fine tuning away um actually I uh would would you be interested in guessing how much it costs so I believe there's a paper out of Princeton showing 12 cents in the openai API like a 12 cents open aai API purchase can the paper was titled like LM fine tuning can compromise safety even when the users don't intend to yeah anyway 12 12 cents in the open a API is my what I believe that paper yeah I think uh that yeah there were a few papers coming out at the same time uh the I think my guess is only it took them like uh like $50 to do it so they're you know this is the power of Princeton you know you can drive down costs by people say Academia is inefficient but like their driving down cost so much um but but yeah so basically like the the the 12 12 cents I mean order magnitude 12 cents maybe maybe 20 cents but I mean it's like it's it is a quarter if you have a quarter in your pocket right it it the the click baity headline was that I had considered posting on less wrong about this was open AI will sell you a jailbroken model for1 sense yeah yeah um so the B basically like what I'm interested in though is I I think one upshot people could have taken from these results is that in some sense the work that rhf doing is doing is like kind of shallow within the model um you know maybe it doesn't generalize super hard maybe it um it's not like deep in the cognition like perhaps as evidenced by it costs you know a quarter uh a quarter is 25 cents for our non- American listeners um it costs a quarter to like get rid of this and so that makes me wonder um to the extent that uh to the extent that over justification or um deceptive inflation are actually happening to what extent do you think we should expect them to be kind of ingrained habits of the model or things that are relatively shallow and you know potentially easy to tune away I my sense is that our results show that this tradeoff is a basic feature of rhf and then that rhf itself is a relatively shallow thing and so I think that both of these things can hold together y so I think to the extent that this is this this basic trade-off exists whenever you're trying to maximize the human's estimate of your behavior I think that basic trade-off isn't a shallow is is like very much not a shallow thing that that basic trade-off will exist however you're getting your like assuming that your model is behaving in a way that's trying to maximize your estimate of its return then you're going to see this trade-off existing I think the rhf shallowness I think is something about RF in particular and so yeah if you are using RF in this way then I would expect we haven't yet run any experiments here but I would expect that all of the general properties of rhf would apply including how shallow the changes of RF appear to be relative to the base models training sure all right so I'd like to um sort of move away from talking about the paper um but before I do that is there anything else you'd like to say about the paper there that's a good question I'm just now I'm just thinking is was there any oh here's uh so maybe while you're thinking one thing I would like to say about the paper is that uh there's a bunch of appes with really cool math um that I I did not read as closely as perhaps I should have but if you're if you're wondering like oh is this just one of these you know sketchy machine learning papers like you know don't even you know just like throw out some intuition just write some poetry like no it's it's got some solid stuff there's like the appendices are like Chuck full with like really interesting things so um it's pretty substantive I I recommend like really diving in and credit credit to the first author Leon who's been the master of the appendices nice but yeah anything else you want to add nothing that's immediately coming to my mind sure so I guess I'd like to ask about this paper in context you know you're you're a researcher you have like a I'm sure there's there's some research there's some like overarching reason you wrote the paper like maybe it fits into some other work so like how do you think of your research program and how this paper fits into it well the comment you made about the appendicies is a great segue into how I view this work overall so I think the I was mentioning at the very beginning that I think we're reaching a stage in AI risk where it's starting to feel very real to people and lots of people I think almost anyone who's interacted with AI now has some sense that oh wow this stuff could get really powerful and what are the risks involved and so we're getting a lot of other technical researchers who are now looking at the AI risk community and saying what's there I think it's really important for us to have substantive concrete things to say when other researchers in the world is looking and saying all right you've been you know you have these concerns and you have like what can you give me as concretely as possible like what is the what is behind all these concerns so that was a goal of the paper is can we take these concerns about deception and can we have really solid concrete theory that says yes rhf is a real algorithm that's being really deployed today and yes we can very precise L and mathematic in a mathematically principled way say that it's going to have these failure modes and I have this broader I have some other products I'm current working on as well which are in a similar vein of saying like can we put on strong theoretical ground these things that we care about from the alignment and ex communities sure yeah uh can you tell us a little bit about those other projects or are they you know too early to talk about no having talk about them I I so I'm interested in I'm interested also in the theory of the idealized case and so by by that I mean with rhf in this paper we just took an algorithm rhf and we looked at its failure modes yeah but I'm also interested in understanding how about just more broadly if we think about the alignment problem and we think suppose an agent perfectly were aligned with my reward function what incentives would it have and would you still get potentially cases of deception would you still get potentially cases of sensor tampering like I I feel like with this paper in some sense we put the cart before the horse a little bit where we said okay here's a particular algorithm for solving the alignment problem and the fos it might have I'm also interested in looking at the other side of the coin and saying how about just the structure of the problem itself and what Pro what properties of the problem itself and even of perfect agents we might not get perfect Agents from training but what properties would those have if we could get them I'm wondering how do you think this relates to so you let like myself you airm this chai research group um and something of the early work done in this group um by Dylan headfield M and collaborators is kind of thinking about uh assistance games or you know Cooperative Universe reinforcement learning um I think with an eye towards uh one formalization of what perfect alignment looks like I'm wondering yeah how how how do you think like work that you would want to do on this would look kind of different from existing stuff so that's exactly the formalism that I'm taking up in some of our followup projects is exactly building on this Cooperative inverse reinforcement learning this assistance games framework and one of the key things that I'm doing is thinking about partial observability in these settings so we're currently working on a framework that we're calling partially observable assistance games which is introducing this notion of partial desability and that's that's the key variable and so other work I'm interested in is understanding we have a lot of theory on the fully observable setting but what happens when we introduce paral of ability because paral of ability is a is one mechanism to let you talk about things like deception to let you talk about things like sensor tampering and so yeah when we introdu this variable partial durability how does that relate to these other concerns that we can now talk about nice so thinking about the um both the agenda and also your paper I'm wondering what's what's some follow-up work on um the paper we were just talking about that you think would be like really interesting either that you're tempted to do or you think listeners who might try their hand out so there's the the theoretical I think there's both theoretical and empirical followup and I just spent some time talking about some of the theoretical follow-ups so I can also talk about some of the empirical follow-ups right so some questions empirically are just you we're we're about to see I as I was mentioning I think we're about to see a lot more complex behaviors that are possible from Agents and we're about to see a lot more complex environments where there's a lot of partial durability happening and so some basic empirical work would be to understand just how prevalent are these types of failure modes in practice where where are the cases what are the cases that are causing them and so just just looking at like how are these things emerging in practice and you could even take some of your favorite examples of deception that you feel like you feel like these are cases where the AI is deceiving you and you could ask can I trace this back to some of the theoretical concerns that we have here or is it some other some other thing so yeah a basic step is just look at how this stuff is playing out in practice another thing that our work points to is how modeling belief can help you so we've shown we now know thetically that there are cases where under certain modeling assumtions knowing more about the belief does let you do better reward inference yeah so one thing that you might try to do then is try to build an algorithm with that in mind so currently the AI like we know currently the AI that two things will happen one we know that if the AI learns to somehow magically hide error messages we know that that could teach rhf that hiding error messages leads to thumbs up up yeah but we also know if we just prompted a language model and said if you were to hide err messages what do you think the human would believe these models zero shot could tell you oh if you just hide the error messages the human might believe that an error occurred yeah so we know the models capable of understanding this false belief that it's inducing and we know that it still might induce that false belief and we've seen that if you put those two together at least in theory you can get better outcomes so one thing to do in practice would be can I use can I actually connect the two in practice to get better outcomes and what I just proposed would be a super simple way to start testing this would just be okay you have the trajectory that we took and then just zero shot prompt the model with some just say hey give me a Chain of Thought what do you think the human might believe about the world given the observations and just append that Chain of Thought to the to the trajectory when you're doing the reward learning and that we've we have in theory reason to believe that that could give the model better information and potentially lead to some better outcomes yeah I I think algorithm there seem interesting so Ian I mean one version is you know uh kind of playing around with language models you could also Imagine um more kind of a bit more theoretical like a bit more formal elicitation algorithms in part because so you have this part of the paper where you say Hey you know if um the robot knows the human beliefs then you can do better inference and you got this nice little theorem that says you know if you're off by you if your model of the human beliefs is off by this amount then uh you know you're going to the the total value you're going to get is slightly worse and it's just linear in the amount that it's off but of course the amount that it's off you know we're talking about the you know the norm of some Matrix and there's this question of like what does that actually mean and I and I think that like just writing down an algorithm of like how you actually have to infer things and like doing some sort of sensitivity analysis can like really put flesh on the bones of like what is this stum actually saying what what kind of failures will in like understanding human beliefs will cause what kind of issues if you actually try to run this so so that that was a place where it seems to be that there was interesting stuff to to be done totally yeah I I think there's lots of interesting I think thinking about human beliefs kind of opens up a lot of interesting questions and yeah that that's one of them cool well speaking of uh interesting stuff um uh a bunch of your research is interesting things that's a that's an awkward segue but um if people are interested interested in following uh your research how should they go about doing that yeah I I have if you go to em. you can get an overview of all my past papers and yeah that that'll give you an up- to-date record you can also for more live research updates you can also follow me on Twitter Twitter which is _ Scott great well um thanks very much for being here today and chatting with me great to be here this episode is edited by Jack Garrett and Amber Don helped with transcription the opening and closing themes are also by chat Garett filming occurred at far laabs financial support for this episode was provided by the long-term feature fund along with patrons such as Alexi maaf to read a transcript of this episode or to learn how to support the podcast yourself you can visit axr p.net finally if you have any feedback about this podcast you can email me at feedback axr p.net [Music] oh [Music] [Music]
Related conversations
AXRP
3 Jan 2026
David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Same shelf or editorial thread
Spectrum + transcript · tap
Slice bands
Spectrum trail (transcript)
Med 0 · avg -0 · 108 segs
AXRP
7 Aug 2025
Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Same shelf or editorial thread
Spectrum + transcript · tap
Slice bands
Spectrum trail (transcript)
Med 0 · avg -5 · 133 segs
AXRP
6 Jul 2025
Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Same shelf or editorial thread
Spectrum + transcript · tap
Slice bands
Spectrum trail (transcript)
Med 0 · avg -4 · 72 segs
AXRP
1 Dec 2024
Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Same shelf or editorial thread
Spectrum + transcript · tap
Slice bands
Spectrum trail (transcript)
Med -6 · avg -7 · 120 segs
Counterbalance on this topic
Ranked with the mirror rule in the methodology: picks sit closer to the opposite side of your score on the same axis (lens alignment preferred). Each card plots you and the pick together.
Mirror pick 1
AXRP
3 Jan 2026
David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Spectrum vs this page
This page -10.64This pick -10.64Δ 0
This pageThis pick
Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.
Spectrum trail (transcript)
Med 0 · avg -0 · 108 segs
Mirror pick 2
AXRP
7 Aug 2025
Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Spectrum vs this page
This page -10.64This pick -10.64Δ 0
This pageThis pick
Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.
Spectrum trail (transcript)
Med 0 · avg -5 · 133 segs
Mirror pick 3
AXRP
6 Jul 2025
Samuel Albanie on DeepMind's AGI Safety Approach

Spectrum vs this page
This page -10.64This pick -10.64Δ 0
This pageThis pick
Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.
Spectrum trail (transcript)
Med 0 · avg -4 · 72 segs