AXRP · Civilisational risk and strategy

Cooperative AI with Caspar Oesterheld

Why this matters

This episode strengthens first-principles understanding of alignment risk and the strategic conditions that shape safe outcomes.

Summary

This conversation examines core safety questions through the lens of cooperative AI with Caspar Oesterheld, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Perspective map

Mixed · Technical · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).

Showing 140 of 158 segments for display; stats use the full pass.


Across 158 full-transcript segments: median 0 · mean −1 · spread −13–0 (p10–p90: 0–0) · 0% risk-forward, 100% mixed, 0% opportunity-forward slices.

Slice bands
158 slices · p10–p90: 0–0

Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes alignment
  • Emphasizes safety
  • Full transcript scored in 158 sequential slices (median slice 0).

Editor note

A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.

ai-safety · axrp · core-safety · technical


Episode transcript

YouTube captions (auto or uploaded) · video 0JkaOAzDfgE · stored Apr 2, 2026 · 4,587 caption segments

Captions are an imperfect primary source: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/cooperative-ai-with-caspar-oesterheld.json when you have a listen-based summary.

Daniel Filan: Hello, everybody. In this episode I'll be speaking with Caspar Oesterheld. Caspar is a PhD student at Carnegie Mellon, where he's studying with Vincent Conitzer. He's also the assistant director of the Foundations of Cooperative AI Lab, or FOCAL. For links to what we're discussing, you can check the description of this episode, and you can read the transcript at axrp.net. All right, welcome to the podcast, Caspar.

Caspar Oesterheld: Thanks, happy to be on.

Daniel Filan: My impression is that the thing tying together the various strands of research you're involved in is something roughly along the lines of cooperative AI. Is that fair to say?

Caspar Oesterheld: I think that's fair to say. I do some work other than cooperative AI, and cooperative AI can mean many things to different people, but generally I'm happy with that characterization.

Daniel Filan: To the extent that cooperative AI covers your work, what does that mean to you?

Caspar Oesterheld: To me, the most central problem of cooperative AI is a situation where two different human parties - two companies, or two governments - each build their own AI system, and these two AI systems then interact with each other in some general-sum - as game theorists would say, mixed-motive - setting, where there are opportunities for cooperation but also a potential for conflict. Cooperative AI as I view it - or at least the most central cooperative AI question - is how to make these kinds of interactions go well.

Daniel Filan: This is the AI x-risk research podcast, and I perceive you as being part of the x-risk research community. If you just say that, many people will think: is this a matter of life or death, or would it just be nice to have a bit more kumbaya going on? How relevant to x-risk is cooperative AI?

Caspar Oesterheld: I certainly view it as relevant to x-risk - that's most of my motivation for working on this. There are different kinds of interactions between these AI systems that one can think about, and some of them aren't so high-stakes: it's more about having some more kumbaya. But other interactions might be very high-stakes. If governments make their decisions in part by using AI systems, then conflicts between governments could be a pretty big deal, and could pose an existential risk.

Daniel Filan: Can you flesh that out? What would an example be of governments with AIs having some sort of mixed-sum interaction where one of the plausible outcomes is, basically, doom?

Caspar Oesterheld: The most straightforward example would be to take an interaction that already exists between governments and suppose the decisions are made in part by AI. There are various disputes between different countries or governments - usually over territory - and sometimes, as part of these disputes, countries consider using nuclear weapons, or threaten to use them. The currently most salient scenario is that the US and Russia disagree over what should happen with Ukraine - whether it should be its own country, or to what extent it should be able to make decisions about joining NATO or the EU - and as part of that, Putin has brought up the possibility of using nuclear weapons. I tend to think that in this particular scenario it's not that likely that nuclear weapons will be used, but in the past, during the Cuban Missile Crisis, it seemed more plausible. The most straightforward answer is just that we could have exactly these kinds of conflicts, but with AI making some of the decisions.

Daniel Filan: How does AI change the picture? Why aren't you just studying cooperative game theory?

Caspar Oesterheld: Good question. AI might introduce its own new cooperation problems. You could have AI arms races, where even once one has the first human-level AI systems, there might still be a race between different countries to improve these systems as fast as possible, perhaps taking some risks in terms of building unaligned systems in order to have the best systems. That would introduce some new settings. But mostly, what I think about is that there are game-theoretic dynamics, or game-theoretic questions, that are somewhat specific to AI - for example, how to train AI systems to learn good equilibria. That's a question that's very natural to ask in the context of building machine learning systems, and a bit less natural if we think of humans, who already have some sense of how to play in these strategic situations.

Daniel Filan: A related question: when I look at the cooperative AI literature, it usually seems to take game theory and tweak it, or apply it in a different situation. And I think game theory has a bunch of problems that made me think it's a little bit overrated. I have a list. At least according to Daniel Ellsberg, it wasn't particularly used by the US to figure out how to conduct nuclear war, which is relevant because that's roughly why it was invented. It's not obviously super predictive: you often have multiple-equilibria problems, where if there are multiple equilibria and game theory just says "you play an equilibrium", that limits its predictive power. It's computationally hard to find equilibria, especially if your solution concept is Nash equilibrium, and it's hard to figure out how to get to an equilibrium. There seem to be important things it doesn't model, like Schelling points. It seems very simplified, and it also seems difficult to account for communication in a really satisfactory way in game theory land. So I'm wondering: to what extent is cooperative AI fixing the problems I have with game theory, versus inheriting them?

Caspar Oesterheld: That's a good question. I think there are a number of ways in which it tries to fix them. Equilibrium selection, for example, is a problem that I think about very explicitly.
It also comes up, in the cooperative AI setting - the setting of building learning agents - in a somewhat more direct way. With humans, if you have the descriptive attitude that you use game theory to predict how humans behave, you can just say: well, they'll play Nash equilibria. Which ones? Anything could happen; it depends on what the humans decide to do. With AI, you're forced a bit more into the prescriptive position of actually having to come up with ways to decide what to do. You can't just say "you could do this, you could do that, it's up to you" - at some point you'll have to say, okay, we want to use this learning scheme.

With some of your complaints about game theory, I'd imagine that lots of people, including game theorists, would agree that these are problems, but they just seem fundamentally quite difficult. Take the problem of players going for different equilibria - the equilibrium selection problem. I think part of why there isn't much work on it is that it seems intractable: it seems very difficult to say much more than "depending on various things, different equilibria might happen". Similarly with Schelling points - these natural coordination points, or focal points, as they're sometimes called. I'm actually quite interested in that topic, but there too, I'd imagine lots of game theorists would say it's just hard to say anything about it, and that's why less has been said. And on the computational hardness issue - Nash equilibria are hard to find, so people won't actually play a Nash equilibrium; they'll do something else that's maybe in the spirit of Nash equilibrium, in that they'll still try to best respond to what their opponent is doing, but they probably won't get it exactly right. There, too, the issue is that describing those kinds of dynamics is just much harder than the simple Nash equilibrium model. Perhaps game theorists - and perhaps I as well - would think it's still useful to think about Nash equilibrium; maybe that's something you'd disagree with.

Daniel Filan: I certainly think it's useful to think about, and I'm sympathetic to the idea that these things are hard to study, so one studies easier problems. I just want to check whether I should believe this stuff, or whether it's simplifying things in order to make any headway. One question I had: you mentioned that a few of these things are just hard to study. A natural question is: do things get easier or harder to study in the cooperative AI setting, where you have a few more tools at your disposal?

Caspar Oesterheld: I think it's a mixed bag. Some things get harder because - especially if we're thinking about future AI systems - we have less direct access to how they're thinking about things. Take Schelling points and equilibrium selection. In many cases - say, when we walk towards each other on the road and have to decide who goes right and who goes left - how do we decide? There's a very simple answer: we know because, say, we're in the US, and in the US there's right-hand traffic, so we're more likely to go to the right.

Daniel Filan: That rule fails way more than I would have guessed naively. As far as I can tell, the actual solution is that people randomly guess one direction, and then if you collide, you switch with an ever-decreasing probability, and that seems to pan out somehow.

Caspar Oesterheld: Yes - and I think that's actually the optimal symmetric solution: to randomize uniformly until you anti-coordinate successfully.

Daniel Filan: Although I think there's more alternation than just repeated uniform randomization. More often than average, I end up in these situations where we're on the same side, and then we both switch to the same side, and then we both switch to the same side again. I think there's a bias towards alternation rather than pure randomization. This is a little bit off-topic, though.

Caspar Oesterheld: That seems plausibly true. But even so - even if what people are doing is irrational - we have some kind of intuition for what one does. I think you're basically right about what people are doing, but you just think: okay, we do this, and that's how we solve it. And we can come up with other examples where it's more straightforward that people just follow some convention, and where it's clear people are following the convention because the convention exists. With AI it's harder to guess - especially if you imagine AI systems that haven't learned to play against each other and face each other for the first time on some problem. With equilibrium selection, it seems harder to say what they're going to do.

On the other hand, some things are easier to reason about. Maybe the rational agent model is more accurate for the kinds of AI systems we worry about than it is for humans. And if we think from the prescriptive perspective of trying to figure out how to make things go well, there are a lot of tricks that apply to AI more than to humans - I guess we'll talk later about some of my work that's in part about these things. For example, it's easier to imagine very credible commitments, because we can give an AI any goals, or change its source code in various ways, whereas for humans that's harder to do.
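[Editor's note: a minimal simulation, not from the episode, of the randomize-until-you-anticoordinate strategy from the crossing-paths example above. The optimality claim is Oesterheld's; the function name is illustrative.]

```python
import random

def rounds_to_anticoordinate(max_rounds=1000):
    """Two symmetric pedestrians each pick a side uniformly at random,
    re-randomizing every round until they pick different sides."""
    for t in range(1, max_rounds + 1):
        if random.choice("LR") != random.choice("LR"):
            return t  # anti-coordination succeeded
    return max_rounds

# Each round succeeds with probability 1/2, so this takes 2 rounds on
# average -- per the episode, the best any symmetric strategy can do.
trials = [rounds_to_anticoordinate() for _ in range(10_000)]
print(sum(trials) / len(trials))  # ~2.0
```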
Caspar Oesterheld: In another way, the situation might be harder for AI. It's related to what we already talked about: humans are already trained a bunch against each other. Probably some kind of group selection, or selection at the level of conventions, has occurred, so successful conventions - successful ways of selecting equilibria - have been selected over conventions that fail. And that more evolutionary style of learning seems less realistic for AIs: it's very different from gradient descent and other contemporary techniques.

Daniel Filan: There seems to be this general issue where, on the one hand, if AIs can do more stuff - they can be given various types of source code, they can read other AIs' source code - that gives you more tools to solve problems, but on the other hand it presumably adds to strategic complexity. So a priori it seems unclear whether that makes things easier or harder.

Caspar Oesterheld: Yeah. Even some of the things I described on the making-things-easier side, like more opportunities to get good equilibria, often also imply just more equilibria, which typically makes things harder.

Daniel Filan: And a lot of these traits of AI can sort of be implemented for humans, right? One of the papers we're going to talk about, on safe Pareto improvements, is basically about a setting where you give an AI the task of solving a game, but you can change the game a little bit. Presumably we could do that with humans: your company has to play some game, and it just outsources it to the game-playing department of your company, and maybe the company can give instructions to the game-playing department. Have these sorts of questions been studied much in the setting of humans? Or do you think AI is jolting people into thinking thoughts they could, in principle, have thought earlier?

Caspar Oesterheld: With some of these, that's definitely the case - some forms of credible commitment are discussed in game theory more generally. And I agree that the transparent source code - or partially transparent source code - setting is not that bad as a model of some interactions between humans. Human organizations, for example, are somewhat transparent. To some extent we can predict what the US Congress will do: it's not some black box that has a utility function and tries to best respond with respect to that utility function. They have lots of rules for what they're going to do, and we can ask the individual congresspeople how they feel about certain issues. Human organizations have specified rules - constitutions and so on - so in some sense they also play this kind of open-source game, a game where your source code is somewhat transparent. I think part of the reason these things weren't traditionally studied this way, with something like the open-source model, is that it's a bit less of a natural setting: the human setting is just very fuzzy in some sense. I think the AI setting will actually also be fuzzy in a similar sense, but it's easier to imagine the extreme case of being able to read one another's source code, whereas for human organizations that extreme case - that particular model of source code - is less natural. But I definitely agree that some of these things can be considered in the human context as well.

Daniel Filan: It's kind of close to models where players have types, you can observe others' types, and everyone of a type plays the same way.

Caspar Oesterheld: Right, there's also that kind of model.

Daniel Filan: So the final framing question I want to ask: I think when a lot of people encounter this line of research, they think, well, we sort of already have AI alignment - AI that's trying to do what people want it to do. If I'm, I don't know, Boeing or the US, and I'm making AI that's aligned with me, I want it to be able to figure out how to play games such that we don't end up in some terrible equilibrium where the world gets nuked anyway. So a lot of people have this intuition: we should just make really smart AI that does what we want, and then ask it to solve all of these cooperative AI questions. What do you think about that plan, or that intuition?

Caspar Oesterheld: I think there are multiple questions here that are all very important to ask, and that to some extent are important objections to a lot of work one might do. One version of this is: if we build aligned AI, then whatever problems we have, we can just ask the AI to solve those problems, and if the AI is good at doing research on problems in general - because it's generally intelligent - then we should expect it to also be able to solve any particular problem that we're concerned about.

Daniel Filan: Or there's a weaker claim you could make, which is that it can solve those problems at least as well as we could solve them if we tried.

Caspar Oesterheld: Right - that's the more important claim. And I think that is an important consideration for cooperative AI, but also for a lot of other types of research that one might naively think is valuable to do. There are lots of specific things where this works extremely well. If you think about solving some specific computational problem - developing better algorithms for finding good Nash equilibria in a three-player normal-form game, say - it seems very likely that once we get human-level AI, or superhuman AI, it will be better, or at least as good as humans, at developing these algorithms, because it seems like a fairly generic task: an AI that's generally good at solving technical problems will be good at this particular problem. But an important property of the kinds of problems, the kinds of ideas, that I usually think about is that they're much less well-defined technical problems and much more conceptual - we'll be talking about some of these later - and they seem much less like the kinds of problems where you just specify the issue to the AI system and it gives you the correct answer.

And - this goes more towards a different version of this objection - another important claim here is that, in some sense, game theory itself, strategic interaction between multiple agents, is a special thing, in a way that's similar to how alignment - having the kinds of values we want a system to have - is special: it's to some extent orthogonal to the other capabilities one would naturally train a system to have. In particular, you can be a very capable agent and still land in bad equilibria. A bad Nash equilibrium is an equilibrium where everyone behaves perfectly rationally, holding their opponent fixed, but the outcome is still bad. So if you imagine that training mostly consists of making agents good at responding to their environment, then landing in a bad Nash equilibrium is completely compatible with being super competent at best-responding to the environment.

Daniel Filan: Maybe this goes to some underlying intuition of game theory being a deficient abstraction. In a lot of these difficult game theory situations - like the prisoner's dilemma, where it's better for us if we cooperate, but whatever the other person does I'd rather defect, and if we both defect that's worse than if we both cooperate - a lot of the time I think: well, we should just talk it out, or we should just build an enforcement mechanism and change the game to one where it actually does work. And if these options are available to powerful AIs, then the game theory aspect - or the multiple Nash equilibria thing - is maybe less relevant. So maybe this comes down to a rejection of the game theory frame that one might have.

Caspar Oesterheld: I'm not sure I understand this particular objection. By default, the normal prisoner's dilemma - single-shot, without repetition, without being able to set up enforcement mechanisms or make credible commitments and these kinds of things - just has one Nash equilibrium.

Daniel Filan: I think the complaint is that in real life we just are going to be able to communicate, and set up enforcement mechanisms and such.

Caspar Oesterheld: I mostly agree with that, but the enforcement mechanisms are what would, in the prisoner's dilemma, introduce multiple Nash equilibria. In the prisoner's dilemma it's natural what the good equilibrium is, because the game is symmetric: you just play the Pareto-optimal symmetric equilibrium. So assuming you can make cooperate-cooperate an equilibrium - by playing the game repeatedly, or having enforcement mechanisms, and so on - it seems very natural that that's what you would do.
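[Editor's note: a sketch of the multiple-equilibria point above, using standard prisoner's dilemma payoffs (an assumption; the episode gives no numbers) and a 200-round discounted repeated game. Grim trigger stands in for an enforcement mechanism, and the equilibrium check is restricted to this two-strategy set.]

```python
import itertools

# Standard one-shot prisoner's dilemma payoffs (row player's payoff first).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def all_defect(my_hist, their_hist):
    return "D"

def grim_trigger(my_hist, their_hist):
    # Cooperate until the opponent has ever defected, then defect forever:
    # a crude "enforcement mechanism" built out of repetition.
    return "D" if "D" in their_hist else "C"

def discounted_payoffs(s1, s2, delta=0.9, rounds=200):
    h1, h2, u1, u2 = [], [], 0.0, 0.0
    for t in range(rounds):
        a1, a2 = s1(h1, h2), s2(h2, h1)
        p1, p2 = PAYOFF[(a1, a2)]
        u1 += delta**t * p1
        u2 += delta**t * p2
        h1.append(a1)
        h2.append(a2)
    return u1, u2

strategies = {"AllD": all_defect, "Grim": grim_trigger}

# A profile counts as an equilibrium (within this small strategy set) if
# neither player gains by unilaterally switching to the other strategy.
for n1, n2 in itertools.product(strategies, repeat=2):
    u1, u2 = discounted_payoffs(strategies[n1], strategies[n2])
    best1 = max(discounted_payoffs(strategies[d], strategies[n2])[0]
                for d in strategies)
    best2 = max(discounted_payoffs(strategies[n1], strategies[d])[1]
                for d in strategies)
    if u1 >= best1 - 1e-9 and u2 >= best2 - 1e-9:
        print(f"({n1}, {n2}) is an equilibrium")
# Prints both (AllD, AllD) and (Grim, Grim): repetition makes mutual
# cooperation an equilibrium without removing mutual defection.
```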
Caspar Oesterheld: But if we ignore the symmetry intuition, or take some game that's somewhat more asymmetric, then even with these enforcement mechanisms there might be many possible equilibria that are differently good for different players, and I don't see how you'd avoid the multiple-Nash-equilibrium issue in that kind of setting.

Daniel Filan: I guess the hope is that you avoid the issue of really bad Nash equilibria. But to the degree that there's some asymmetry, or anti-symmetry, you still have this issue of there being multiple equilibria that you've got to bargain between. If we both have to coordinate to do a thing, and I like one of the things more and you like the other more, and somehow we can't randomize, then we've got this bargaining game: are we all going to have the rule that says we do the thing you like, or the rule that says we do the thing I like? It ends up being a meta-game, I guess.

Caspar Oesterheld: I'm more worried about the multiple-equilibrium thing than about getting the good equilibrium when there's a unique good one. If you think of something like the Stag Hunt, or a prisoner's dilemma where all you really need to do is pay ten cents to set up the commitment mechanism that lets you play cooperate-cooperate as an equilibrium, I'm much more optimistic. It's still kind of funny, because in some sense there isn't that great a theory of why the good thing should happen - it just seems intuitive that it would. And maybe that's one way in which game theory is weird: even this very basic thing, where it feels like you can just talk it out and say "we're going to do the good thing", game theory doesn't really have a model of that talking-it-out, even though it doesn't seem that difficult to do. The case I'm much more worried about is where you have lots of different equilibria and the game is very asymmetric, so it's just completely unclear which of the equilibria to go for.

Daniel Filan: OK. I said that was my final question, but I actually have a final bridging question before we get to the next thing. A lot of cooperative AI work seems adjacent to the agent foundations line of work, where people are really interested in: you're an AI, and there are other players in the world, and they're modeling you and you're modeling them, but you can't perfectly model something that's perfectly modeling you, because your brain can't properly fit inside your brain. People in this world are often interested in different decision theories, or in ways of thinking about how to be uncertain when you're computationally limited.
A lot of this kind of thing seems to show up in the cooperative AI literature - specifically, in your research group's cooperative AI literature. Why do you think that overlap is there, and what do you think the most important reasons driving it are?

Caspar Oesterheld: That's another very interesting question. I should start by saying there are also lots of people who work on cooperative AI and aren't thinking that much about these more foundational questions - who are more on the side of developing machine learning algorithms, or setups for machine learning, such that different machine learning agents just empirically are more likely to converge to equilibrium, or to better equilibria, and things like that. But with respect to me in particular, yes, there's a lot of overlap. I think there are a couple of reasons. One is that it's good to do things that are useful for multiple reasons: if you can do something that's useful both for analyzing the multi-agent stuff and from the agent foundations perspective, that's nice. The deeper reason for the overlap has to do with the object-level issues of game theory and how they interact with these things. For example, the issue of not being able to fully model your environment comes up very naturally in game theory: if you have two rational agents, there's no way to assume that they can both perfectly model the other player. They can best respond to each other, but that's all. In the case of mere empirical uncertainty, you can at least theoretically make assumptions like: you have your Bayesian prior, you do Bayesian updating, and you converge to the correct world model. Everyone knows that's not actually how it works, but you can easily think about it, and so people do think about it, even though the real systems that actually do stuff in the real world don't really work this way. In game theory, it's much harder to avoid these kinds of issues.

Daniel Filan: But if that were it, I would expect there to be this whole field of game theory plus AI, all interested in agent foundations stuff, but somehow it seems to be the cooperative AI angle - or maybe I just haven't seen the non-cooperative-AI agent foundations work.

Caspar Oesterheld: There actually is lots of work on these kinds of foundational questions that isn't particularly motivated by the foundations of cooperative AI. There's the whole regret learning literature, which is a huge literature about a theoretical model of learning in games - well, learning in general, but you can in particular apply it to learning in games. It's a model of rationality - you can use it as a model of bounded rationality - and there are many, many papers about what happens if two regret learners play against each other: how quickly they converge, what solution concept they converge to - Nash equilibrium, or coarse correlated equilibrium, or whatever. So this literature does exist. Usually it's a bit less motivated by AI, but there's definitely a lot of foundational work on the intersection of "what's a rational agent, what should a rational agent do, how should one learn" and multi-agent interactions that isn't about cooperation.

Daniel Filan: Cool. Bridging from that, I'd like to talk about your paper "A Theory of Bounded Inductive Rationality" - this is by yourself, Abram Demski, and Vincent Conitzer. Could you give us a brief overview of what this paper is about?

Caspar Oesterheld: Sure. This paper considers a setting which - for now - we might imagine is a single-agent setting: you have a single agent, and it has to learn to make decisions. The way it works is that every day - at each time step, as you might say - it faces some set of available options, and it can choose one of them. You might imagine that every day someone comes to you and puts five boxes on your table with different descriptions on them, and you have to pick one. Once you pick a box, you get a reward for the box you picked, and you only observe that reward: you don't observe any counterfactuals, you don't observe what would have happened had you taken another box - in fact, it might not even be well defined what would have happened. So all that happens is: you can choose between different things, you choose one of them, you get a reward, and this repeats every time step. And now the question is how you should learn in this kind of setting - how you should learn to make choices that maximize reward. In particular, we consider a myopic setting, which means we just want to maximize the short-term reward - the reward we get right now, with the next box we pick. If we at some point figure out that whenever you take the blue box you get a high reward, but then for the next ten thousand time steps the reward will be low, the basic setting the paper considers is such that you still want to take the blue box: you don't care about losing the reward on the next ten thousand days - though you can consider alternatives to this. So that's the basic setting, and it's very similar to the bandit literature, where regret minimization is the main criterion of rationality that people consider. The main difference is that we -

Daniel Filan: Before you go on: what is regret minimization?

Caspar Oesterheld: Good question, I should explain that. Here's one notion of what it means to do well in this type of scenario. Say you play this for a thousand rounds, and at the end you ask yourself: how much better could I have done had I done something else? For example - let's say every day there's a blue box - how much better could I have done had I always just taken the blue box? The regret is how much worse you're doing relative to the best thing you could have done in retrospect. Usually there are constraints on what counts as "the best thing in retrospect": it's usually just not achievable to have low regret relative to picking the best thing on each individual day. Let me introduce one more aspect of the setting, because it's also important for our paper. You could have some set of experts - imagine it's just some set of algorithms - and on each day, each expert has a recommendation for what you should do. Your regret might then be how much worse you did relative to the best expert: after a thousand days, you look at each expert and ask how well you would have done had you chosen, at each time step, what that expert recommended.

Daniel Filan: And my understanding is that the reason we say something like this, rather than just "the best thing you could have done, ever", is that if the results are random or unknowable, then in retrospect you could just pick the winning lottery numbers, and you don't want to allow those sorts of strategies.

Caspar Oesterheld: Very good - thanks for explaining that. I think that's exactly the justification. So minimizing regret means minimizing regret with respect to the best expert in retrospect. The intuition here, from a learning and rational-agent perspective, is: you're some agent, you have limited computational abilities, and let's say the set of experts is the set of algorithms you can compute - all the simple strategies you could follow. Low regret then means there's no strategy in this set that does much better than what you do. Usually what people talk about is sublinear regret, which means that as time goes to infinity, your average regret per round goes to zero - which in some sense means that in the limit, you learn to do at least as well as the best expert. One very simple version of this: imagine you play chess, and there are ten people in the room who each make recommendations for what you should play, and you don't know anything about chess. One of the people in the room is a chess grandmaster, and the others are random non-grandmasters who are a bit worse. If you learn in a reasonable way, you should learn to do at least as well as the grandmaster, because the least you can learn is to just do whatever the grandmaster does. That's the intuition here.
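[Editor's note: a minimal sketch of the regret-versus-best-expert criterion just described. It assumes counterfactual rewards are well defined - exactly the assumption questioned below - and all names and numbers are illustrative.]

```python
import random

def regret_vs_best_expert(expert_advice, rewards, policy):
    """expert_advice[t][k]: the action expert k recommends on day t.
    rewards[t][a]: the reward action a would have earned on day t
    (assumed well defined, which the episode goes on to question).
    policy(t): index of the expert the learner follows on day t."""
    T = len(expert_advice)
    K = len(expert_advice[0])
    learner = sum(rewards[t][expert_advice[t][policy(t)]] for t in range(T))
    best = max(sum(rewards[t][expert_advice[t][k]] for t in range(T))
               for k in range(K))
    return best - learner

# Toy run: expert 0 is the "grandmaster" (always recommends the good
# action); experts 1 and 2 guess at random.
T = 1000
advice = [[1, random.randint(0, 1), random.randint(0, 1)] for _ in range(T)]
rewards = [{0: 0.0, 1: 1.0} for _ in range(T)]  # action 1 always pays
follow_random_expert = lambda t: random.randrange(3)
print(regret_vs_best_expert(advice, rewards, follow_random_expert) / T)
# ~0.33 per round however large T gets: linear regret. A learner with
# sublinear regret would drive this per-round figure towards zero.
```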
Daniel Filan: But in your paper, you're contrasting yourself with regret minimization. What's so bad about regret minimization? Isn't it just equivalent to "do as well as you can"?

Caspar Oesterheld: I think regret minimization is certainly a compelling criterion, and I think that in the agent foundations and existential risk community, too few people are aware of regret minimization and of what a simple criterion it is. But now I want to make the case against it: why don't I find it very satisfying? It has to do with the way regret minimization reasons about counterfactuals. An immediate question you might have about regret minimization is whether it's even achievable - can you even ensure that you have low, or sublinear, regret? Because you might imagine a setting where the environment computes exactly what you do and always does the opposite. Say on every day you have two boxes in front of you, one of them contains a dollar, and the way the environment fills the boxes is that it predicts what you do and puts the dollar in the other box. In this kind of setting, it seems really hard to achieve low regret: it will be difficult, in retrospect, not to do worse than, for example, always taking the left box - because if you switch back and forth between the two, then in retrospect, half the time the left box had the dollar, so you would have done better had you just always taken the left box.

Daniel Filan: "You would have done better" - this is somehow assuming that I could have just taken the left box while the problem stayed the same as it actually was.

Caspar Oesterheld: Exactly. That's part of how this criterion works: it assumes that the problem - the specific instance of this multi-armed bandit problem, as it's called - consists in specifying, at each time step, how much reward is in each of the boxes, and that this is in some sense independent of what you do. In particular, the adversarial multi-armed bandit literature very explicitly allows the case where the way the boxes are filled happens to be exactly running whatever algorithm you run.

Daniel Filan: So how is this solved?

Caspar Oesterheld: The way it's solved is that the learner has to use randomization: you have to randomize over which boxes you pick, or whose expert advice you follow. That's how it's solved in the adversarial setting. There are also non-adversarial settings, where you know from the start that each box follows the same distribution every day - some Gaussian with some mean - and the whole task consists in doing optimal exploration to find out which box has the highest mean value. There you don't need randomization - in practice people probably still randomize in these cases, but you definitely don't need it - whereas in the adversarial case, to achieve sublinear regret, you have to randomize, and you also have to assume that the environment cannot predict the outcomes of your random coins.
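[Editor's note: the standard randomized algorithm for this adversarial setting is Exp3 (Auer et al.); a minimal sketch follows, not code from the paper under discussion.]

```python
import math
import random

def exp3(K, T, get_reward, gamma=0.1):
    """Exp3 for adversarial bandits: keep weights over K arms, mix in
    uniform exploration, and importance-weight the observed reward so
    unchosen arms can be estimated without seeing counterfactuals."""
    w = [1.0] * K
    total = 0.0
    for t in range(T):
        s = sum(w)
        probs = [(1 - gamma) * wi / s + gamma / K for wi in w]
        arm = random.choices(range(K), weights=probs)[0]  # the coin flip
        x = get_reward(t, arm)       # reward in [0, 1]; chosen arm only
        total += x
        xhat = x / probs[arm]        # unbiased estimate of the arm's reward
        w[arm] *= math.exp(gamma * xhat / K)
    return total

# The sublinear-regret guarantee holds only if the environment cannot
# predict the coin flips inside random.choices -- exactly the assumption
# Oesterheld flags next.
```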
Caspar Oesterheld: If you flip a coin to decide whether to take the left or the right box, then you assume that the environment can also flip a coin, but it can only flip a different coin. That alone I found a bit dissatisfying philosophically - it seems a bit weird to assume this, but maybe I can live with it. The part that I really don't like is that in these problems where you need to randomize, the restrictions that regret minimization imposes are all on your actions, rather than on the distributions over actions that you pick.

Daniel Filan: How do you mean?

Caspar Oesterheld: Here's an example, which is basically Newcomb's problem. Imagine there are two boxes, and the way the environment fills the boxes works as follows: it runs your algorithm to compute your probability of choosing the left box, and the higher that probability is, the more money it puts in both boxes - but then it also puts an extra dollar into the right box. And we make it so that your probability of choosing the left box really, really strongly increases the reward in both boxes: say, for each 1% of probability mass that you put on the left box, a million dollars gets put in both boxes. The numbers don't need to be that extreme - I just don't want to think about exactly how large they need to be. In this problem, if you want to maximize your utility and you maximize over your probability distribution over boxes, then I think the reasonable thing to do is to always pick the left box: for each bit of probability mass that you put on the left box, you gain lots of money, whereas you only gain one dollar by moving probability mass from the left box to the right box - and you lose all the money you could have made by having that probability mass on the left box.

Daniel Filan: And both boxes get the money.

Caspar Oesterheld: Yes. So, for example, if you always choose the right box, you only get a dollar, and if you always choose the left box, you get a hundred million. So this is an example of a setting where, if you optimize over probability distributions, you should choose the left box. That is not what regret minimization says you should do. Regret minimization would here imply that you have to always choose the right box - well, in the limit, you have to learn to always choose the right box - and the reason is that if you choose the left box, you'll have regret: you'll regret that you didn't choose the right box, because then you would have gotten a dollar more. That's what I said somewhat cryptically earlier: regret minimization is a criterion on which actions you take. It doesn't say anything about how you should randomize, other than that you should randomize over the things that give you high reward holding fixed how you randomize. And I think that's just not so compelling in this particular problem.
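[Editor's note: the arithmetic of the left-box/right-box example, using the episode's numbers.]

```python
def expected_payout(p):
    """p: the agent's probability of taking the left box. The predictor
    puts $1,000,000 in BOTH boxes per 1% of mass on 'left', plus an
    extra $1 in the right box (numbers from the episode)."""
    both = 100_000_000 * p          # contents of each box
    left, right = both, both + 1
    return p * left + (1 - p) * right

print(expected_payout(1.0))  # 100,000,000: the distribution-level optimum
print(expected_payout(0.0))  # 1: what regret minimization learns to do
# For ANY fixed p, the right box pays exactly $1 more, so an action-level
# regret criterion pushes p towards 0, even though expected payout is
# maximized at p = 1.
```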
Caspar Oesterheld: In this particular problem, I think you should just always choose the left box.

Daniel Filan: And is this also true in the setting of adversarial bandits?

Caspar Oesterheld: Yes - adversarial bandits specifically allow for this kind of problem. To get high reward here, one has to do this randomization stuff, where sometimes one has to learn to make the probability distribution such that one's reward is low, just to make one's regret also low.

Daniel Filan: All right, so that's what's wrong with other settings. What's your theory of bounded inductive rationality?

Caspar Oesterheld: Can I first say something about why I think this is even important?

Daniel Filan: Sure - why is it important?

Caspar Oesterheld: This particular left-box/right-box example is far-fetched, of course. But the reason I think it's important is that this kind of setting, where the environment tries to predict what you do and then responds to it in a particular way - to some extent, that's the core of game theory. So if you want to use these regret minimizers specifically in a game-theoretic context, it's kind of weird to use them, given that it's very easy to come up with game-theory-flavored cases where they clearly give weird results. In particular, I think people sometimes justify Nash equilibrium by appealing to these regret minimizers: you can show that if you have two regret minimizers and they play a game against each other, they converge - it's not exactly Nash equilibrium, it's some form of correlated equilibrium, but some Nash-like solution concept - if they converge; sometimes they also don't converge. And you might say: great, this really shows that Nash equilibrium is a good thing, and why rational agents should play Nash equilibrium against each other. But in some sense this Nashian idea is already baked into the regret minimization concept, in a way that seems not so compelling, as the left-box/right-box example shows. That's why I worked on this: to try to come up with a theory that reasons about randomization and things like that in a very different way.

Daniel Filan: Can I try to paraphrase that? We want a theory of how to be a rational agent, and there are a few criteria we want. First, we want to be able to go from "how do you make decisions normally" to using that as some sort of foundation for game theory. Second, we want it to be bounded - that's why we're thinking in the regret-minimization-with-experts frame, where we just want to consider all the ways of doing things we can fit in our head, and not necessarily worry about everything. And because we want a decision-theoretic foundation for game theory specifically, we want our way of making decisions to allow for environments that are modeling us. So basically your complaint is: normal regret minimization does the bounded thing, but it doesn't do a very good job of thinking about the environment modeling you. And there are various ways of thinking about game theory where you can think about the environment modeling you, but you're not necessarily bounded - probably my favorite paper in this line is the reflective oracles line of work, which sadly I can't explain right now, but if you assume you have something that's a little bit less powerful than a halting oracle, but still more powerful than any computer that exists, then agents that model their environment and make decisions end up playing Nash equilibria against each other. It's a really cool line of research - I encourage people to read it - but no existing thing could possibly implement what those papers are talking about. So you want all three criteria: boundedness, a decision-theory foundation for game theory, and environments that are modeling the decision-maker. Is that right?

Caspar Oesterheld: Yes, that's a very good summary.

Daniel Filan: Thanks. I now feel like I understand the point of this paper, and how it's related to your research agenda, much better. So we've talked about what you want to be doing - how does it work? What do you have?

Caspar Oesterheld: We have the same kind of setting as before: on each day, we choose between a set of options. One slight difference: we don't require that the counterfactuals one talks about a lot in regret minimization are well defined. And we also have these experts - we call them hypotheses rather than experts - which do basically the same thing, except that in addition to making a recommendation at each time step, they also give an estimate of the utility they expect if the recommendation is implemented. If you again have this picture of trying to play chess with ten people in the room, one of them might say: if you play knight to f3, I think there's a 60% chance that you're going to win - or, you're going to get 0.6 points in expectation, or something like that. So they're slightly more complicated than the experts: they give you one additional estimate. And now, similar to regret minimization, we define a notion of rationality relative to this set of hypotheses. I want to describe it in terms of the algorithm rather than the criterion for now, because I think the algorithm is actually slightly more intuitive.

Daniel Filan: That's the opposite of how you do it in the paper - for people who might read the paper.

Caspar Oesterheld: Yeah, the paper just gives the criterion, and the algorithm is hidden in the appendix - or there's some brief text in the main paper saying roughly how it works. In some sense, of course, the general abstract criterion is more important than the algorithm itself. It's a bit similar to the logical induction paper: they similarly define a notion of rationality, or of having good beliefs; they have this criterion, but they also have a specific algorithm, which is a bit like running a prediction market on logical claims.
And in practice, I think many more people have that idea in their minds - just run a prediction market between algorithmic traders - than the specific criterion they define, even though from a theoretical perspective the criterion is really the important thing and the algorithm is just some very specific construction.

Daniel Filan: I actually want to talk more about the relationship to the logical inductors paper later. So: what's your algorithm?

Caspar Oesterheld: Basically, the algorithm is to run an auction between these hypotheses. First we need to give the hypotheses money to deal with - some kind of virtual currency. We need to be a bit careful about how exactly we give them money initially - it's especially tricky if we have infinitely many hypotheses. Basically, we need to make sure that we eventually give all hypotheses enough money that we can explore them, which adds up to infinite money - but if we gave everyone $10 at the beginning, the auction would just be chaos, because there would be lots of crazy hypotheses bidding nonsense. So maybe it's easiest to first consider the case of finitely many hypotheses, and let's just say that in the beginning we give each hypothesis 100 virtual dollars. Now we run auctions. Specifically, we ask all the hypotheses for their recommendation and their estimate of what reward we can get if we follow the recommendation, and then we follow the highest bidder. The hypotheses are constrained in how much they can bid by how much money they have: if a hypothesis doesn't have any money, it can't bid high in this auction. So we take the highest budgeted bid; that hypothesis has to pay us its bid - it's like a first-price auction - and then we do whatever it told us to do, we observe our reward, and then we pay the hypothesis that reward, or a virtual-currency amount proportional to that reward.

Daniel Filan: So is the idea that if a hypothesis slightly lowballs, but is basically accurate about how much reward you can get, then other hypotheses that overestimate how well you can do are going to blow all their money, while this one saves up - because you get some money every round - and eventually it's the top bidder, and it makes money by slightly lowballing, getting back a little more than it spent to make the bid, and it just eventually dominates the bids. Is that roughly how I should think about it?

Caspar Oesterheld: Yes, that's basically the thing to imagine.
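[Editor's note: a minimal sketch of the auction as described above. The paper's scheme also hands out ongoing subsidies so every hypothesis is explored infinitely often, and handles infinitely many hypotheses; both are omitted here, and all names are invented for illustration.]

```python
class Hypothesis:
    """Recommends an option and estimates the reward from following it."""
    def __init__(self, recommend, estimate, budget=100.0):
        self.recommend = recommend   # options -> chosen option
        self.estimate = estimate     # options -> claimed reward (the bid)
        self.budget = budget         # 100 virtual dollars to start

def auction_round(hypotheses, options, get_reward):
    # Each hypothesis bids its reward estimate, capped by its budget;
    # ties go to the earlier hypothesis in the list.
    bids = [(min(h.estimate(options), h.budget), h) for h in hypotheses]
    bid, winner = max(bids, key=lambda pair: pair[0])
    winner.budget -= bid             # first-price auction: winner pays its bid
    choice = winner.recommend(options)
    reward = get_reward(choice)
    winner.budget += reward          # ...and is then paid the realized reward
    return choice

# Toy run: options pay their face value. "honest" knows this; "hype"
# overpromises, drains its budget, and loses control of the decision.
honest = Hypothesis(lambda opts: max(opts), lambda opts: max(opts))
hype = Hypothesis(lambda opts: min(opts), lambda opts: 50.0)
for day in range(30):
    auction_round([honest, hype], [5, 10], get_reward=lambda option: option)
print(honest.budget, hype.budget)
# honest breaks even at 100.0; hype ends near-broke and can no longer win.
```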
Okay, and is the idea something like: if a hypothesis slightly lowballs but is basically accurate about how much reward you can get, then the hypotheses that overestimate how well you can do blow all their money, while you save yours up, since you get some money every round; eventually you're the top bidder, you make money by slightly lowballing because you get back a little more than you spent on the bid, and you just come to dominate the bidding. Is that roughly how I should think about it?

Yes, that's basically the thing to imagine. Whether there's actually lowballing depends a bit on the scenario, though. For example, take a setting where the payoffs are deterministic and fully predictable: you just choose between a reward of five and a reward of ten, and there are hypotheses that can figure out that these are the payoffs. Then, if you have enough hypotheses in your class, there will be one hypothesis that just bids 10 and says you should take the 10, and the winning hypothesis won't be lowballing; it will just survive. It's the typical market argument: if you have enough competition, the profit margins go away.

Okay, and the idea is that agents that overpromise lose money relative to the thing that bids accurately, while agents that underpromise don't win the auction and so never spend money; they just survive, and the accurate bidder wins?

Yeah, that's basically the idea. The first thing to understand is really that the hypotheses that claim high rewards and don't hold up those promises just run out of money, so they don't control much of what you do. And what's left over once those are gone is that you basically follow the highest bid among the bidders that do hold up their promises, and so in the limit, in some sense, you actually do the best thing among those.

Cool. And in the rest of the paper, if I recall correctly, you basically make a bunch of claims which net out to: if one of the traders in this auction can figure out some pattern, then you do at least as well as that trader. So if you're betting on pseudorandom numbers, you should do at least as well as just guessing the expectation, and if the pseudorandomness is too hard for any of the agents to crack, then you don't do any better than that. And at the end I think there's a connection to game theory, which we can talk about a bit later.
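[Editor's note: an informal paraphrase of the guarantee being discussed here, not the paper's exact criterion. Writing $r_t$ for the reward the agent actually receives and $r_t(h)$ for the reward hypothesis $h$'s recommendation would have earned at step $t$, the claim is roughly that for every hypothesis $h$ in the class whose value estimates are asymptotically accurate,

$$\liminf_{T\to\infty}\ \frac{1}{T}\sum_{t=1}^{T} r_t \;\ge\; \liminf_{T\to\infty}\ \frac{1}{T}\sum_{t=1}^{T} r_t(h),$$

i.e. the agent's long-run average reward is at least what following $h$ would have earned. The counterfactual rewards $r_t(h)$ are never observed, which is exactly the exploration issue discussed below.]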
But we touched a bit on this relationship to logical inductors, and when I was reading this paper I was thinking: okay, the second author is Abram Demski; I think he's an author on the logical inductors paper from 2016 or so.

I think he's actually not. I'm not entirely sure, but I believe not.

Oh, okay. He's a member of the organization that put out that paper, at least.

He is very deep into all of this logical induction stuff, and I'm not sure I could have written this paper without him, because he knows these things very well; that was quite important for this particular project.

So you've got this co-author who's connected to the logical induction world, both papers are about inductive rationality for bounded beings, and both involve algorithms where agents bid against each other. How is this different from the logical induction paper?

I guess the main and most obvious difference is that the logical induction paper is just about forming beliefs. In some sense it's all about assigning probabilities to statements; you can think about how you'd then use that to make decisions, but the paper itself is all about forming beliefs about whether a particular claim is true. Whereas regret minimization, but also this rational inductive agency project, is all about making decisions between some set of options. I think that's a fundamentally different setting in important ways. The most important way in which it's different is that you have to deal with this counterfactuals issue: you take one of the actions, and you don't observe what would have happened otherwise.

One way you can see this clearly is that in any of these decision settings, you have to pay some exploration cost. With the bounded rational inductive agents, sometimes you have to follow a bidder, a hypothesis, that has done terribly in the past, because there's some chance it just did poorly by bad luck. In fact, we didn't go that much into the details of this, but you actually have to hand out money to your hypotheses for free, so that you can explore each hypothesis infinitely often; otherwise there's some chance that some hypothesis really has the secret to the universe but, for whatever reason, does badly on the first 100 time steps. So you have to pay this cost, and regret minimizers also pay it. With logical induction you do, in some sense, also do exploration, but - maybe this will be hard to follow for readers who aren't familiar with that paper -

Sorry, listeners. I'm now realizing we should say a few sentences about what that paper was. According to me, the logical inductors paper was about how to assign probabilities to logical statements. The statements are definitely either true or false - statements like "the billion and thirty-first digit of pi is seven" - and you ask: what's the chance that that's true? Initially you'd say one in ten, until you actually learn what that digit of pi is by calculating it or whatever. It uses a very similar algorithm, and the core point is that every day you have more questions you're trying to assign probabilities to, you learn more logical facts, and eventually you just get really good at assigning probabilities. Anything I missed?

Well, for what I was about to talk about, it's good to have some intuition for how the actual proposed mechanism works. Very roughly, instead of hypotheses or experts, they have traders, which are also some set of computationally simple things, and what happens is that these traders make bets with each other on a kind of prediction market over the different logical claims one is trying to assign probabilities to. The idea is that if there's a trader that's better than the market at assigning probabilities, that trader can make money by betting against the market, so eventually it becomes very wealthy and comes to dominate the market. That way you ensure that, in some sense, you do at least as well as any trader in the set.
And crucially, traders can choose which markets to specialize in, right? If we're betting on digits of pi and digits of e, I can say: well, I don't know about this e stuff, but I'm all in on digits of pi. And I guess this can also happen in your setting, if you face different decision problems?

Yeah, that's a way in which our bounded rational inductive agency theory is more similar to the logical inductors. Our hypotheses are allowed to really specialize: a hypothesis can bid only every 10,000 steps, on some very special kind of decision problem, and otherwise just bid zero, and you still get that power in some sense. The regret minimizers don't have this property; generally, you really only learn what the single best expert is.

Cool. Sorry, I cut you off - we were comparing with logical inductors, and it was something about the traders.

Right. One thing is that if you have a new trader, or a trader you don't really trust yet, you can give them a tiny amount of money, and they get to make bets against the market. With a tiny amount of money, their bets won't affect the market probabilities very much, so in some sense you can explore them for free: they don't influence what you think overall very much, and the idea is that even with a tiny amount of money, if they're actually good, they'll outperform the market and make as much money as they want. That isn't possible in the decision-making context, because to explore a hypothesis there, you have to make a yes-or-no decision to actually do what it says. If you have a hypothesis that's a complete disaster, you're going to make that disastrous decision every once in a while.

Yeah, and in some sense that's because decisions are just discrete options, right? A predictor can gradually tweak its probability a tiny bit, but you can't quite do that with actions.

Yeah, though I think the fundamental issue has more to do with the counterfactuals not being observed: to test a hypothesis, you have to do something differently. Forget about the logical inductors for a second and say you make predictions about all kinds of things, and I'm unsure whether to trust you. I can just ignore what you say - for the purpose of decision-making, or of stating beliefs myself - and still track whether you're right, and if you're good, I'll eventually learn that you're good. Whereas if you tell me, "you should really do more exercise," and I ignore it, I'll never learn whether you were actually right.

Okay, that's giving me a better sense of the differences between these theories.
Cool. So another difference, which was kind of surprising to me: in the logical inductor setting, I seem to recall hearing that if you actually run the algorithm they propose in the paper, it takes something like 2^(2^x) time to even figure out what you do. Whereas with your paper, if all the hypotheses are computable in, say, quadratic time, it seemed like your whole routine can run in quadratic time times log log n, or some really slow-growing function, which strikes me as crazy fast. So what's going on there? For one, what's the difference from the logical inductor setting, and two, how is that even possible? It just seems so good.

Yeah, that is a notable difference. In some sense, this algorithm I just described - this auction, or "decision auction" as we sometimes call it - is just an extremely low-overhead algorithm. You can just think about it: you run all your bidders, you have a bunch of numbers, you take the max of those numbers. That's all very simple. The reason the logical induction algorithm they propose is relatively slow, I think, is that they have to do this fixed-point finding. Roughly, the way it actually works is that the traders give you functions from market prices to how much they would buy, and then you have to compute market prices that are a fixed point of this, or an approximate fixed point, and that fixed-point finding is hard. Maybe the continuity is the big issue: probabilities are continuous, so you need to find something in a continuous space, and the correct answer is really the probabilities of all of these logical statements at once - an object that carries a lot of information, from some very large set. They need to find this in that large set, whereas we only have this one decision. But I think on some level the criteria are just pretty different. The papers are similar in many ways - they both design this kind of market and so on - but on some level the criteria of the logical induction paper and ours are just somewhat different.

It's strange, though, because you would think that making good decisions would reduce to having good beliefs, right? One way to do the reduction: every day you have some logical statement, and you have to guess - is it true with probability one, or 99%, or 98% - you have, say, a hundred options and you have to pick one of them.

So in theory that kind of works, but if you apply a generic decision-making framework to this setting, where you have to state the probability or decide which bets to accept, you're not going to satisfy the really strong criterion that the Garrabrant inductor satisfies.
For example, take the bounded rational inductive agents, and suppose on each day your choice is: what's the highest price you'd be willing to pay for a security that pays a dollar if some logical statement is true and pays zero otherwise? That's basically assigning a probability to the statement. But now you have all of these hypotheses to explore, including ones that say "you should buy this thing for $1" - even for, let's say, actual coin flips, something genuinely random where you should just learn to say one half every day. The bounded rational inductive agents will necessarily give every answer infinitely often. They'll converge to giving one of the answers with limit frequency one, but every once in a while they'll say something completely different. So it's not actual convergence. In the regret minimization literature, people have these different notions of convergence - convergence in iterates and convergence in frequencies, I think - and with bounded rational inductive agents you only get the weaker one.

Right, and this is one of the things I was wondering about, because in the paper you prove these limit properties, and I'm wondering whether I can get convergence rates. It seems like if you make the wrong decision infinitely often, but in some sense infinitely less frequently over time, the right thing to say is that you have converged and can talk about a convergence rate, even though you haven't literally converged. Is there some intuition for what the convergence, or quasi-convergence, properties of these things are over time?

Yeah, one can make some assumptions about a given bounded rational inductive agent and then infer something about the rate at which it converges. Roughly, I think the important factors are the following. The first is: how long does it take you to explore the hypothesis that gives you the desired behavior? Imagine you have some computational problem, like deciding whether a given graph has a clique of size three. That can be done in cubic time, if not faster. So you might wonder: how long does it take to explore the hypothesis that just runs the obvious computational procedure for deciding this, trying out all the combinations of vertices and checking whether each is a clique? In some sense this is similar to the bounds you get for Occam's-razor-type, Solomonoff-prior setups, where what matters is the prior of the correct hypothesis. It's a bit different here, because what really matters is when you start giving that hypothesis money - and then, if things are random, it might be that you give it money but it has bad luck, and it takes a while longer. But that's the first thing that matters.
And then the other thing that matters is how much other exploration you do. If you explore the right hypothesis very quickly, but only because you explore everything very quickly and hand lots of money to lots of hypotheses that are all nonsense, then in some sense it takes you longer to converge, in that you'll spend more time doing random other stuff. Even once you've found the good hypothesis - the one that actually solves the problem and gives an honest estimate - you'll still spend lots of time doing other things. So you have a balancing act in your payout schedule. Even just to satisfy the criterion we describe, one has to make sure that the overall payouts per round go to zero. The overall payout per round is, in some sense, how much nonsense can be done, because to do nonsense a hypothesis has to bid high and not deliver, which means losing money out of this market - that's how you control it. Meanwhile, you also have to ensure that each hypothesis eventually gets an infinite amount of money. Those are the two constraints, and within them you can balance things in different ways.

Yeah, I guess one over t is the classic way to do this sort of thing.

Yeah, that's the one we have in the proof.

That's actually how I - it just flashed in front of my eyes, and now I see why you picked that. I had just skimmed that appendix.
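[Editor's note: one concrete payout schedule consistent with the two constraints just described - my construction for illustration, not necessarily the paper's. Enumerate the hypotheses as $h_1, h_2, \dots$ and at round $t$ hand hypothesis $h_i$ a subsidy of

$$s_i(t) = \frac{2^{-i}}{t}.$$

Then each hypothesis receives $\sum_t s_i(t) = 2^{-i}\sum_t 1/t = \infty$ in total, so every hypothesis can be explored infinitely often, while the total handed out per round, $\sum_i s_i(t) = 1/t$, goes to zero, bounding how much nonsense the market can fund late in the game.]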
So one thing I want to ask about: you work in this bandit setting, and in particular you're being myopic, right? You see a thing and you're supposed to act myopically on it. There are other relatively natural settings, like Markov decision processes. How do you think the work could be extended to those sorts of settings?

That's a good question; there are different ways. If it's, for example, an episodic MDP - sorry, Markov decision process -

Oh, that was my fault. A Markov decision process is: you're in a state of the world and you can take an action, and the world works such that whenever you're in a given state and take a given action, there's some other state you go to with some fixed probability, no matter what the time is, and similarly you get some reward with some fixed probability. Anyway, I interrupted you and I forgot what you said, so maybe you can start again. Sorry.

Right, so there are different answers to how one would apply bounded rational inductive agents, or extend them, to that setting. The most boring one is that you can just apply them to find a whole policy for the entire Markov decision process. There are also these episodic Markov decision processes, which basically means you act for, say, 10 time steps, you get a reward, and then you start over; there you can treat the episodes as separate decision problems that you can solve myopically. So that's the slightly boring answer.

The more interesting case is where you have a single Markov decision process - you just play this one MDP, it never starts over, and you want to maximize discounted reward. That is, a reward of one today is worth one to you, a reward of one tomorrow is worth 0.9, one the day after is worth 0.81, and so on; that would be a discount factor of 0.9. In this case - I just haven't analyzed it in enormous detail, but it's a very natural thing that I think probably works - one applies this whole auction setup, but the quantity each hypothesis is supposed to estimate, at each step where one is deciding which action to take, is the discounted reward it's going to receive. And if a hypothesis wins, it gets paid the discounted reward, which means the payment has to be made slowly over time: at time step 1,000 you still have to give the winner of the auction at time step 100 some tiny bit of reward. I think then things generally hold, with some issues that complicate matters. One issue is that hypotheses now have to take into account that exploration might occur on future steps. Say I'm a hypothesis and I have some amazing plan: first play this action, then that action, and so on - a detailed plan for the next 100 time steps. I have to worry that in 50 time steps some really stupid hypothesis will come along, bid some high number, and do nonsense. I have to take this into account, so I have to bid less, and this makes everything much more complicated. It becomes much harder to say that a hypothesis with a really good plan can actually make use of it, because it can't rely on winning the auction at all of those steps. So there are definitely some complications.
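[Editor's note: the deferred-payout bookkeeping Caspar sketches here, written out in my notation. With discount factor $\gamma$, the hypothesis winning the auction at step $t$ bids its estimate of the discounted return

$$\hat V_t \approx \sum_{k \ge 0} \gamma^k\, r_{t+k},$$

pays that bid once at step $t$, and is then paid $\gamma^k r_{t+k}$ at every later step $t+k$. The books balance exactly when the estimate was honest, which is what lets the same "overpromisers go broke" argument run; the complication just described is that the realized $r_{t+k}$ also depend on who wins the later auctions.]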
Interesting. So, the final thing I want to talk about in this paper: at the end, you discuss using it as a sort of foundation for game theory. You have these - I'm going to call them BRIAs, for bounded rational inductive agents; that's what the paper calls them, and I'm not an amazing inventor of acronyms - these BRIAs playing games with each other, each treating the game as one of these bandit problems, and essentially you say that if they have rich enough hypothesis classes, they eventually play a Nash equilibrium with each other. Is that my correct recollection?

So, there are different versions of the paper that give different results under different assumptions. There is a result that under some assumptions they play a Nash equilibrium, and there's also a result that's more like a folk theorem, which says, for example, that you can converge to cooperating in the prisoner's dilemma. Maybe I can give a general sense of why it's complicated what happens - why it's sometimes going to be Nash and sometimes something else.

The first thing is that for these bounded rational inductive agents, nothing hinges on randomization, and in some sense that's the appealing part relative to regret minimizers: no randomization, no talk about counterfactuals, you just deterministically do stuff. In particular, if you have a bounded rational inductive agent play against a copy of itself in a prisoner's dilemma, it will converge to cooperating, because whenever it cooperates it gets a high reward and whenever it defects it gets a low reward. There are bidders that get high reward by recommending cooperation and bidding the value of mutual cooperation, or maybe slightly below. The defect bidders might hope, "maybe I can get the defect-against-a-cooperator payoff," but they can't actually get it: if the hypothesis in one market tries this, its copy in the other market does the same thing, so whenever it actually wins the auction, it just gets the defect-defect payoff. This converges to cooperation.

The reason there's nonetheless a Nash-equilibrium-style result as well is that if the two agents are not very similar to each other, you might imagine the hypotheses can try to defect in some way that's decorrelated from what happens in the other market's auction. The simplest setting is one where they can literally randomize, but you could also imagine that they look at the wall and, depending on the value of some pixel in the upper right, decide whether to recommend defecting or not. If they can do this in a decorrelated way, then they break the cooperate-cooperate equilibrium.
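[Editor's note: a toy simulation of the self-play argument just given - hypothetical code, not from the paper. The hypothesis class is deliberately tiny, there are no exploration subsidies, and both agents are exact copies, so the same hypothesis wins both auctions every round.]

```python
# Two exact copies of a decision-auction agent play a repeated prisoner's
# dilemma, with the standard PD payoffs.

PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

class Hyp:
    def __init__(self, action, estimate):
        self.action, self.estimate, self.budget = action, estimate, 100.0

    def bid(self):
        # A hypothesis can never bid more than its remaining budget.
        return min(self.estimate, self.budget)

def make_hyps():
    return [Hyp("C", 3.0),   # bids the mutual-cooperation value
            Hyp("D", 5.0)]   # bids the defect-against-a-cooperator value

agent_a, agent_b = make_hyps(), make_hyps()
for _ in range(200):
    win_a = max(agent_a, key=lambda h: h.bid())
    win_b = max(agent_b, key=lambda h: h.bid())
    # Because the agents are copies, win_a and win_b recommend the same
    # action, so the defect bidder only ever realizes the (D, D) payoff.
    reward_a = PAYOFF[(win_a.action, win_b.action)]
    reward_b = PAYOFF[(win_b.action, win_a.action)]
    win_a.budget += reward_a - win_a.bid()
    win_b.budget += reward_b - win_b.bid()

print([(h.action, round(h.budget)) for h in agent_a])
# [('C', 100), ('D', 1)]: the "D" hypothesis bids 5, collects 1, and goes
# broke; the "C" bidder breaks even and ends up controlling play.
```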
The reason there are different results - and different versions of the paper with different results - is that I'm still unsure what the correct conclusion is. For example, I'm still unsure whether these cooperate-cooperate outcomes in the prisoner's dilemma can be achieved naturally, without fine-tuning the markets too much to be exact copies.

I guess one way to think about it: you've got this weak uncorrelation criterion, where if you meet it you fall back to Nash, and another criterion where you get cooperation, and presumably the right theorem to show is "under these circumstances you fall into this bucket and get this; under those circumstances you fall into that bucket and get that."

Yeah.

Is it known whether, if I implement my bounded rational inductive agent slightly differently - say I order the hypotheses a bit differently - it's going to hit the cooperate-cooperate equilibrium or end up weakly uncorrelated?

I think even that is already not so easy to tell. I've done some experiments, and generally it learns to defect in this kind of setting, but in those experiments the hypotheses are also relatively simple: you don't have hypotheses that try to correlate themselves across agents. For cooperation, one hope would be that a hypothesis in one of the markets and a hypothesis in the other market somehow try to coordinate to cooperate on the same rounds, and this becomes quite complicated pretty quickly. Maybe another important thing here is that there's still a difference between the general bounded rational inductive agency criterion and this specific auction construction. The criterion allows all kinds of additional mechanisms you could set up to make it more likely to find the cooperate-cooperate equilibrium - for example, you could specifically set it up so that hypotheses can request to be tested only at certain times, in a way that's correlated between the markets. So it's very complicated, and still something I'm working on; hopefully other people will think about this kind of question too.

Cool stuff. There's actually at least one more thing I'd love to talk about in this bounded rational inductive agents paper, but we've spent a while on it and there are two other papers I'd like to get to, so let's move on to the next one. This is called "Safe Pareto Improvements" by yourself and Vincent Conitzer. Can you give us a sense of what this paper is trying to do?

This paper is trying to directly tackle the equilibrium selection problem: the problem that if you play a given game, it's fundamentally ambiguous what each player should do, so you might imagine players failing to coordinate on a good Nash equilibrium, or on a Nash equilibrium at all, and ending up in bad outcomes. A very typical example is a setting where players can make demands for resources: they can each demand some resource, and if they make conflicting demands, the demands can't both be met, and usually something bad happens - in the extreme case, they go to war with each other. Safe Pareto improvements are a technique, or an idea, for how one might improve outcomes in the face of these equilibrium selection problems.

Maybe I can illustrate this with an example. Let's take a blackmail game. This is actually a bit different from the kind of example we give in the paper, but it's a version of an idea people may have heard about under the name "surrogate goals". Let's say I'm going to delegate my choices to some AI agent - you can think of something like GPT-4 - and I tell it: here's my money, here's my bank account and my other online accounts, and you can manage these; maybe you should do some investing, things like that.
Now this AI might face strategic decisions against various opponents: other people might interact with it in various ways, try to make deals with it and so on, and it has to make decisions in the face of that. Let's take a very concrete interaction: someone is considering threatening to report my online banking account unless my AI gives them $20. Now, you might think it's simply bad to make this kind of threat and I should just not give in to it, but you can imagine there's some moral ambiguity here: this person might actually have a reasonable case that I owe them $20 - maybe at some point in the past I promised them $20 and never paid - so they have some reason to make this demand.

So this is an equilibrium selection problem: my AI system has to decide whether to give in to this kind of threat, and the other person has to decide whether to make the threat, whether to insist on getting the $20. And there are multiple equilibria. The pure equilibria are: one where my AI doesn't give in to these threats and the other person doesn't make the threat, and one where my AI does give in and the other person does make the threat. So it's a typical equilibrium selection problem. There are ways in which this is game-theoretically a bit weird - it's a non-generic game and so on, so this example often looks odd to game theorists, and for them the example in the paper may work better - but I like this version; I think it's more intuitive for people who aren't that deep into the game theory.

Okay. So I'm deploying this AI system and I'm worried about this particular strategic interaction, and the case that seems worst is the one where coordination on the correct equilibrium fails: my AI decides not to give in to the threat, and the other person thinks, "no, I should really get those twenty bucks, so I'm going to make this threat," and then they report my online banking account and don't even get the $20. Everyone's worse off than if we hadn't interacted at all; utility is being burned. They spend time reporting me on this online banking platform, and I have to deal with not having the account anymore, or call the bank, and so on.

Now, there are various ways you might address this, and I first want to talk about some of the other ways one might deal with it, if that's okay, to give a sense of what's special about the solution we propose. I think it's easier to appreciate the point once one sees how the more obvious ideas fail.
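[Editor's note: one way to pin the example down, with hypothetical numbers of my choosing - not from the conversation or the paper. Cells are (my payoff, threatener's payoff):

| | AI gives in | AI refuses |
| --- | --- | --- |
| Threaten | -20, +20 | -100, -10 |
| Don't threaten | 0, 0 | 0, 0 |

The two pure equilibria are (threaten, give in) and (don't threaten, refuse); the miscoordination outcome (threaten, refuse) is worse for both parties than never interacting, which is the utility-burning case the rest of the discussion targets.]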
So the first idea: if one of us is able to credibly commit at some point to some course of action, they might want to do so. I might think the way to do well in this game is to be the first to make it credible that my AI system is never going to give in to this kind of threat - I should announce this as soon as possible, or somehow prove it. But then maybe the other person wants to commit, as fast as possible, to ignoring this kind of commitment, and so on. This is a reasonable thing to think about, but it feels like it's not really going anywhere. Ultimately, to some extent, people are making these commitments simultaneously, and they might also just ignore commitments. If you go around the world and cave to any commitment anyone makes - to threaten you, or to ignore anything you do to influence them - well, you shouldn't be that kind of person; you have to have some resistance against these schemes. So in practice I think this just doesn't resolve the problem. It's also a very zero-sum way of approaching it: it's all about "I'm going to win by keeping my $20 and deterring you from even threatening me," while the other person says, "well, I really want the $20 and I'm going to commit first." And if you imagine implementing this slightly more realistically - people learn things over time, they come to understand more facts - then racing to make commitments means saying, "I'm going to decide what I'll do at the moment when I know as little as possible," which is not amazing.

Yeah. Daniel Kokotajlo has a post on LessWrong, or the Alignment Forum, titled "Commitment Races" or something like that, which is also about this idea that one ends up wanting to commit when one knows as little as possible, and that seems problematic.

Yeah. Okay, so that's one solution. Another might be that I try to pay you - offer you $5 or something - in return for not blackmailing me, so we make some outside deal. The idea is that if we do make this deal, I still have to pay some amount of money, but at least the inefficiency of utility really being burned disappears. But then, on the level of figuring out what deal to make, we get all the same problems again: I might offer $5, and the other person says, "you really owe me those $20, so obviously I'm not taking just five; I want at least $18," and so on. You get the same problem again.

It's funny - I find this a weird thing about game theory. Intuitively, talking out problems seems like an easier way to solve things, but whenever you try to model bargaining, it's kind of horrible. I've never seen a convincing analysis of it.
And I do think this problem is very fundamental - this equilibrium selection question of what the appropriate way to distribute resources is. There are lots of approaches, and they're all good in their way, but fundamentally this is a problem that's hard to get rid of entirely. But now comes the safe Pareto improvement, or surrogate goal, idea for making progress on it.

Remember, I'm deploying my AI system. Let's say that by default, the way this delegation works is that I tell my AI everything I want it to do: here's my money; I'm such-and-such risk averse, so I always want at least some amount in my bank account in case something happens; and there's some stamp I really like, so if it shows up on eBay for under $30, try to get it - that kind of thing. Normally I just honestly tell it my preferences and say, "do your best."

What I could do instead, for the purposes of this particular interaction, is the following. I first set up a dummy bank account that I don't care about at all: some new online banking account, similar to the one the other person might threaten to report. I don't care about it at all, but I tell my AI system to care about this dummy account as much as I care about my actual account - if it were reported, that would be just as bad as if the real one were. And I have to do this in a way that's credible; that's important here - the other person needs to see that I'm doing it. Say I do that, and in addition I tell my AI not to give in to threats against my actual, original bank account.

Why is this an appealing idea? Basically: from the perspective of the person thinking about threatening to report my bank account, nothing has really changed. They can still threaten me, and they can be just as successful at it - they simply have to threaten to report this different account, which to my AI is just the same as the original would have been by default. So to them nothing's really different; they shouldn't feel like I'm tricking them; they're completely fine with this happening. Meanwhile, for me, there's some chance things improve relative to the default. They might still make the threat, now against the dummy account, and it might be that my AI gives in to it - in which case I think, okay, that's kind of funny, but it was part of the plan. But it could also be that my AI resists, that it decides not to give in to this new threat - probably that's just as likely as it would have been for the original threat - and in that case, if the other person carries the threat out, they report a dummy account that I don't actually care about. I'm fine; it's just as if no threat had been made.
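[Editor's note: continuing the hypothetical numbers from the earlier table. After the surrogate-goal commitment, the only threat worth making targets the dummy account, and my true payoffs become:

| | AI gives in | AI refuses |
| --- | --- | --- |
| Threaten dummy account | -20, +20 | ~0, -10 |
| Don't threaten | 0, 0 | 0, 0 |

My AI has been credibly instructed to value the dummy account at the full -100, so the threatener's strategic situation is unchanged cell by cell; but my true payoff in the miscoordination cell rises from -100 to roughly 0. No cell is worse for either party, and one is strictly better for me - which is the "safe Pareto improvement" shape defined next.]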
Of course, for my AI it might still be very sad. My AI might think, "oh, I've done a terrible job; I was instructed to protect this dummy account and now it's been reported; I'm a bad being." But for me it's better than the default, and for the other person it's just the same.

And that's what safe Pareto improvements mean in general: making some kind of commitment, or some modification of the utility functions, or some way of transforming a game, such that you can show everyone is at least as well off as they would have been if the game had been played in the default way, while under some potential outcomes there's a Pareto improvement - some person, or everyone, is better off, without making anyone worse off. One important part is that this is agnostic about how the equilibrium selection problem gets resolved. To make this commitment - to tell my AI to do this - I don't need to think about how the equilibrium selection problem in the underlying game is going to be resolved; I can do it safely, without making any guesses about that, and similarly the other player doesn't need to rely on any such guess.

Gotcha. I have an initial clarifying question where I think I know the answer: you call it a safe Pareto improvement - what's an unsafe Pareto improvement?

Good question. The "safe" part is this aspect of not relying on guesses as to how the equilibrium selection will work out. An unsafe Pareto improvement might be something like: I transfer you $10 so that we don't play this game. That's unsafe in the sense that it's hard to tell how we would have played the game, so we don't actually know whether the deal is a Pareto improvement; it relies on specific estimates about how we would have played. That's what the term "safe" is supposed to mean - maybe not the optimal term, but that's the intent.

Gotcha. So one thing it seems you care about: these instructions I give my AI aren't just supposed to make my life safely better; they're also supposed to leave the other guy safely no worse off. Why do I care about that? Shouldn't it all be about me?

That's a good question too. It comes from the intuition that these races to commit first can't be won. If I come up with some scheme that commits my AI in a way that screws you over, then maybe you should already have committed to punishing me if I implement that scheme, or you have reason to commit as fast as possible to punishing me, or to ignoring it. The idea is to avoid all of this by having something that's fine for both players, that no one minds being implemented - everyone's happy for it to happen - so all of these competitive dynamics, which are otherwise an obstacle to "I'll just commit first" approaches, disappear.
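[Editor's note: an informal restatement of the definition as discussed - a paraphrase in my notation, not the paper's formal version. Writing $\Gamma$ for the default game, $\Gamma'$ for the transformed one, and $\sigma$ for whatever (unknown) way the players' equilibrium selection would resolve either game, the transformation is a safe Pareto improvement if for every such $\sigma$ and every player $i$,

$$u_i(\sigma(\Gamma')) \;\ge\; u_i(\sigma(\Gamma)),$$

with a strict inequality for some player under some resolution. The quantifier over all $\sigma$ is the "safe" part; the player-by-player inequality is the "Pareto improvement" part.]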
Gotcha. So if I think about this scheme, the suggested plan is roughly: I have this really smart AI that knows a bunch more than me; in the real world I don't know exactly what it's going to do or what plans it could think of; and the safe Pareto improvements literature basically instructs me to think of a way to deliberately misalign my AI with my preferences. It seems like this could go wrong easily, especially because in the safe Pareto improvements paper you're assuming the principal - the person delegating the game-playing - knows the game, and in real life that might not be true. How applicable do you think this is in real life?

In real life one will have to do things that are much more complicated, and I think real-life surrogate goals will be much more meta. In this scheme with the very simple AI and the blackmail setting, the game is very simple - binary choices and so on - and maybe my AI doesn't even really know what's going on; it might not understand why I'm giving it these instructions and just be confused about what it's supposed to do. I think the more realistic way to implement this is to give some kind of meta-instruction: adopt the surrogate goal in the way I would have liked you to, or something like that.

So you delegate to the machine: it's not just finding equilibria, it's also figuring out what the surrogate goal should be?

There are different aspects one can delegate. One slightly more complex setting is one where, very roughly, it's clear what's going to happen - I'm going to deploy an AI, and you can make some kind of threat against it, try to blackmail it in some way - but it's not clear to me how costly it is for you to make different kinds of threats. Implementing the surrogate goal requires knowing these things, because I have to make the new target somehow equivalent to the old one, and that's the kind of thing one probably should delegate to the AI system in the real world. A more radical approach is to give it an entirely generic instruction: whenever you're in any kind of strategic scenario - and I might have no idea what that scenario will be - first think about safe Pareto improvements, and implement one if it exists.

This gets at a question. One of the things safe Pareto improvements were supposed to handle is multiple equilibria and people picking different equilibria, but it seems like there are potentially tons of different, conflicting safe Pareto improvements. In fact, because they have such a large action space, I'd guess there are even more equilibria in the "find a safe Pareto improvement" game. So are we gaining much, if it's really hard to coordinate on a good SPI?
Also a very important question. Maybe first it's good to give an intuition for why many safe Pareto improvements might exist, because in the example I gave there's actually only one - only one player can commit, and depending on what else exists in that world, there might be only one. So let me give an example that makes clear why there can be many.

In the paper we have this example of the demand game: there's some bit of territory, and two countries can each try to send their military to take it, but if they both send their military, there's a military conflict over the territory. That's the base game. And - one step back - let's also assume the two countries make this decision by delegating to some commission or expert who thinks about what the appropriate equilibrium is. The idea is that they could jointly instruct their commissions to adopt some new attitudes toward this game. They'd say: never mind the military; let's decide whether to send someone with a flag, who just puts the flag in the ground and says "this is now ours." And then we tell the commissions: if both of our countries send someone with a flag to the territory, that's really, really bad - just as bad as war. So that is the safe Pareto improvement in this situation. Is the setting somewhat clear?

I guess you've looked at the paper, so: roughly, the players can either send in the military, or just send a guy with a flag, or send nothing. If there's a clash, that's bad, but clashes are worse in real life if the military is involved, and if just one of the players sends something, then the player who sends the most stuff gets the land, and they want the land. That's roughly the situation?

Yeah, thanks. So the safe Pareto improvement is: instead of sending the military, which is what we'd normally do, just send the guy with the flag, and then one avoids the conflict outcome where both sides send the military. And here you can imagine that it's in some sense ambiguous what happens instead of that conflict outcome. We have this guy-with-a-flag story - but what exactly happens with the territory if both countries send a guy with a flag? It's kind of left open. I think the paper just specifies that in that case it's split, or something like that, I don't remember. But it could instead be that if both players send someone with a flag, then country A simply gets the territory - and that's still a safe Pareto improvement, because it might still be better for both countries to have country A get the territory than to have a war over it, since war is not so great.
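[Editor's note: hypothetical numbers for the demand game as just recapped - illustrative, not the paper's. In the base game, the territory is worth 5 and war is costly; cells are (country 1, country 2):

| | Military | Stay out |
| --- | --- | --- |
| Military | -10, -10 | 5, 0 |
| Stay out | 0, 5 | 0, 0 |

Under the safe Pareto improvement, each commission plays the same game with "send a flag-bearer" substituted for "send the military" and is instructed to treat flag-vs-flag as if it were (-10, -10), so the delegates face a selection problem isomorphic to the original. The true payoffs differ in exactly one cell: flag-vs-flag actually resolves to, say, a split worth (2.5, 2.5), or to (5, 0) if the chosen SPI hands the territory to country 1. Any such cell beats (-10, -10) for both countries, which is why many distinct SPIs exist here.]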
So there are many safe Pareto improvements here, characterized by what happens instead of war: does one player just get the resource, do they split it, does the other player get it, and so on.

Okay, so now the question: does this mean we get the same problem one level up - is it just as bad, or does it maybe just not help? I think this depends a lot on the setting. In some settings, safe Pareto improvements really do literally nothing, because of exactly this; in other settings they still help. Roughly, the settings where they help are the ones where the bad outcome we're replacing is worse for both players than anything on the Pareto frontier. For example, imagine that in this demand game, war is worse for both countries than simply not getting the territory in the first place. In that case, even the worst safe Pareto improvement you can get - the one where, instead of war, the other side just gets the resource - is still an improvement for both players. So if you're mostly worried about war, and it's not that terrible for you to give up this territory, you might say: when we meet to decide which safe Pareto improvement to adopt, I'm willing to settle for the one that's worst for me, and it's still a gain overall. The surrogate goal example is another case of this: the worst outcome for me is the one where a threat is carried out against my real bank account. Even if we somehow arranged that when the dummy account is reported, I still hand over the $20 - which only works if the other player also makes some kind of commitment - that would still be an improvement for me, so I might be okay with giving up the $20 in that case. That's the important condition, I think: the bad outcome we're getting rid of is worse than anything on the Pareto frontier, so that even the worst point on the frontier is still good for me.

But even then, the players have to agree which point on the Pareto frontier they go for, right? You have this argument in your paper that I wasn't totally compelled by. My recollection is you basically said: most of the players can just say, "if any other player proposes a safe Pareto improvement, I'll go along with it," and we just need one person to actually think of something. But then who actually submits the safe Pareto improvement? What if multiple people do? What if I say, "if someone else submits an SPI I'll go for it, but if they don't, I want this one," and you have a similar instruction but a different fallback SPI? It still seems quite difficult to figure out how to break that tie.
So - I'm not sure I exactly understand the worry - let's take the case where I'm happy to just implement the worst possible safe Pareto improvement, where we replace the conflict outcome with me giving you the territory, and let's say I state this at the beginning of the negotiations: "I'm actually happy if you just take it." Then what is the remaining problem in that case?

Well, maybe it's a happier problem, but suppose we both come to the table and we both say, "hey, I'm happy with any Pareto improvement over the worst one." We still have to figure out which one we get, right? And still, on this level, it seems like you might want to threaten to get your preferred SPI: "if you don't agree to my SPI rather than yours, screw it, we're going to war."

Okay, so both players are happy to give the other person their favorite SPI, and the issue is just that there's a large set of different things both people would be okay with?

Yeah - I have SPIs that I'd prefer and you have SPIs that you'd prefer, and they might not be the same ones.

To me this seems like a much easier problem than equilibrium selection, because this is the case where the players are basically in a "no, you go first - no, you go first" type of problem.

I mean, to me it sounds like a bargaining problem.

But it only seems like a bargaining problem for as long as it's unclear which SPI to go for; once it's clear that the players' acceptable sets overlap, it stops being one. And sure, if you really want to max out - you want the best safe Pareto improvement for you, the other player wants the best one for them, and you both come to the table saying "I really want this one" - then there's a risk. But my argument is that even the attitude of "okay, you can have whatever you want in terms of safe Pareto improvements" already improves things.

I guess I shouldn't belabor the point, but it still seems almost identical to these bargaining scenarios where there's a best alternative to negotiated agreement, we can all do better than it, but there are a few mutually incompatible options that do better, and we have to figure out which one we want - and ideally I'd get the one that's best for me, and you'd get the one that's best for you. That seems like the situation we're in.

Okay, let's take another bargaining scenario. Say the two of us start a startup together, and we're both needed for it, so we have to come to an agreement on how to split the shares. That's a very typical bargaining problem: we have to figure out how much to demand - what fraction of the startup.
But now suppose I come to this negotiation table and just say, "I'll accept whatever demands you want, as long as it's better for me to do the startup than not to do it". Let's say that as long as I get 20% of the startup, it's still better for me, in terms of how wealthy I'm going to be in five years or whatever, to be part of the startup than not to be, and let's say this is common knowledge. Then I might say: okay, as long as I get at least 20%, I'm fine. If for some reason I'm happy to adopt this attitude, then I would think this bargaining problem becomes very easy. You might say that theoretically you could have a similar attitude and say, "well, actually, I'm also happy with just getting my minimum of 45%" or whatever, and then we have this remaining problem of the leftover 35% to distribute. But in some sense it's easy.

But agents like this will gain arbitrarily little: the amount you improve over the base game will be arbitrarily small, if you're like this and the other player is like, "I'd like as much as I can get".

Yes, in this type of bargaining setup, that's true. Here, the attitude of "I'm happy to accept the minimum" is bad; you shouldn't adopt it. You should try to get more, you should make some demands, and thus risk that conflicting demands are made. Because if you only make the minimum demand, you never make any money beyond what you could have made without the startup: in some sense you gain nothing from the whole startup, because you only demand the absolute minimum that makes it worth it for you to participate. The point I'm making is that in the case of safe Pareto improvements, even this minimalist, really dovish bargaining approach to deciding which SPI to go for is still potentially much better than not doing anything.

So the idea is that all of the Pareto options are just significantly better than the BATNA, basically?

Yeah, exactly. This is specifically for the case where the outcome you replace is some really bad thing, like going to war with each other, and you can think of even more ghastly things if you want. Then anything, even giving up the territory, maybe even giving up everything, might still be better than this conflict outcome.

Gotcha. So there are a bunch of other things to talk about here. One thing I was thinking about when I was reading this paper is that it seems analogous to the program equilibrium literature. There, you write a computer program to play a game, and I write a computer program to play a game, but our programs can read each other's source code. The initial papers in this literature considered computer programs that just check: if our programs are literally equal, then cooperate, otherwise defect.
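[Editor's note: for readers who haven't seen it, here is a toy rendering of that equality-check program. The two-argument `policy(opponent_source, my_source)` interface is a simplification assumed here, not the literature's exact formalism.]

```python
CLIQUE_BOT = '''
def policy(opponent_source, my_source):
    # Cooperate iff the opponent is character-for-character identical.
    return "C" if opponent_source == my_source else "D"
'''

def run(program_source, opponent_source):
    """Load a submitted program and ask it for its move."""
    namespace = {}
    exec(program_source, namespace)
    return namespace["policy"](opponent_source, program_source)

print(run(CLIQUE_BOT, CLIQUE_BOT))                      # "C": exact copy
print(run(CLIQUE_BOT, "def policy(o, m): return 'C'"))  # "D": different source
```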
And then an advance in this literature, which I believe came out of MIRI, the Machine Intelligence Research Institute, was to think about this thing they called modal combat, where I try to prove properties about your program, and you try to prove properties about my program, and then, through the magic of Löb's theorem, it turns out that we can cooperate even if we can only search through proofs of some bounded length. Actually, I'm not even going to explain why it's called modal combat; people can Google that if they want. So: the SPI paper explicitly envisions that you send a particular SPI to your decision-making committee, and it says, "if the other side sent the exact same thing, then implement it; otherwise fall back to your default". I'm wondering: is there some way to make this modal combat step here too, to go a little more meta or abstract?

So, yes, one can use these kinds of mechanisms, the modal combat, the Löbian FairBot that the researchers at MIRI proposed for establishing cooperative equilibria in this program setting; we can use that to achieve these safe Pareto improvements. The original safe Pareto improvement that I described, the first one, the surrogate goal idea where only I modify my AI and you don't, doesn't really require this. But if you have the case where two players each have to give some kind of new instructions to their AI, or to their committee that is deciding whether to send troops to the territory, then usually the safe Pareto improvements have to be backed up by some joint commitment. Both sides have to commit: "we're going to tell our committee to send the guy with a flag rather than the troops, but only conditional on the other side making an analogous commitment". And one way to back this up is this program equilibrium type setup, where both countries write some contract for their commission, and the contracts are computer programs that look at the other country's contract, and, depending on what that contract says, give different instructions about the guy with the flag versus the troops. If you think of this contract as literally a computer program, which seems reasonable in the AI case, then you could use these Löbian ideas for implementing the joint commitment. I'm not sure how much to go into the details of the Löbian FairBot, but you can show: if I can prove that the other side adopts their side of the safe Pareto improvement, then I adopt my side; otherwise I just give the default instructions. And if both sides make this commitment, it results in both giving the safe-Pareto-improvement instructions to their committees. Is that what you had in mind?

Yeah, that sort of thing.
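[Editor's note: the proof-based FairBot doesn't compress into a few lines, but one crude stand-in for the conditional commitment just described is bounded mutual simulation with an optimistic base case. This is an illustrative assumption, not MIRI's proof-search construction, and the `contract` interface is hypothetical.]

```python
CONTRACT = '''
def contract(opponent_source, my_source, depth):
    # Adopt my side of the SPI iff a bounded simulation of the opponent's
    # contract adopts theirs; when the budget runs out, assume the best.
    if depth == 0:
        return "SPI"
    namespace = {}
    exec(opponent_source, namespace)
    their_move = namespace["contract"](my_source, opponent_source, depth - 1)
    return "SPI" if their_move == "SPI" else "default"
'''

namespace = {}
exec(CONTRACT, namespace)
# Two copies of the contract commit to the SPI together...
print(namespace["contract"](CONTRACT, CONTRACT, 3))  # "SPI"
# ...but fall back to the default against a contract that never adopts it.
print(namespace["contract"]("def contract(o, m, d): return 'default'",
                            CONTRACT, 3))            # "default"
```

[The optimistic base case is doing real work here: with a pessimistic one, two copies of this contract would talk themselves into the default, which is roughly the regress the Löbian machinery is designed to escape.]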
I guess there's a difficulty, though. You might hope that you wouldn't have to specify exactly what the SPI has to end up being. But the trouble is, precisely because you're assuming you don't know how the committee solves for the equilibrium, presumably your program can't try to prove things about what the other side's solver is going to go for, because if you could do that, then you could just say "go for the nice outcome" or something.

Okay, so there are two obstacles, I guess. The first is that you potentially can't predict what the other committee is going to do, how it's going to resolve the equilibrium selection problem. But the other is that you don't want to know, in some sense. You don't want to adopt a policy of first predicting what the other committee does and then doing whatever is best against that, because then the other committee can just say, "well, this is what's going to happen: we're going to demand the territory, we're going to send the troops", and the best response to sending the troops is to not send troops, to send the guy with the flag.

Gotcha. So there are a bunch of things I could ask, but the final thing I wanted to ask about here: there's a critique of this line of research, I think in a LessWrong post by Vojta Kovarik, where one of the things mentioned is that it seems implicit in the paper that the way the committee solves games is the same with or without the safe Pareto improvements potentially existing, and all the safe Pareto improvements do is change which game the equilibrium selection mechanism plays. But you could imagine that if I know I'm in a world that works like this, with committees giving instructions to people, this potentially does change how people make decisions, and that potentially seriously limits the applicability of this research. How serious a limitation do you think this is?

I do think this is a limitation, something that makes it non-applicable in some cases. There's even the more basic worry that, in this first AI case that I described, never mind influencing the way my AI reasons about games: I might just tell it, "okay, actually, secretly don't give in to threats against this dummy bank account". If I can secretly say this to the AI, then already there's a problem, so we have to assume that that's not possible. And then there's the fuzzier problem that my AI's bargaining strategy can't, in some sense, depend on the existence of safe Pareto improvements, which I think in some settings really is a problem that makes this very difficult. Here's an example where I think it's clearly a big problem. Imagine I'm delegating to an AI, but it's unclear which AI I'm going to delegate to: there are, say, ten different AIs on the market that I could delegate my finances to, and I can decide which of them to hire. If I know that safe Pareto improvements will be used, I have some reason to hire an AI that's more hawkish, less likely to give in to threats, because I think it's more likely that threats are going to be made against the surrogate goal.
And so, in response, the threatener might think: "if I go along with this whole surrogate goal idea, there's a good chance I'm going to be screwed over". So they should basically ignore the whole surrogate goal business and say, "sorry, I don't want to do the surrogate goal thing, because I don't know which AI you would have hired by default, so I can't really judge whether I'm being screwed over here". So yes, it definitely can be a problem in this setting. Meanwhile, in other cases, I think it may be clear what the default would be. For example, it might be clear that in cases where safe Pareto improvements don't apply for other reasons, for example because no credible commitment is possible, I always delegate my choices to one particular AI. Then, in cases where I do want to apply my safe Pareto improvements or my surrogate goals or whatever, if I just use the same AI as I use in those other cases, then in some sense my choice of which bargaining strategy is deployed is not being influenced by the existence of safe Pareto improvements. So what you need, or at least one way in which this can work, is some access to the ground truth of what people would do without safe Pareto improvements, for example by being able to observe what people intend to do in scenarios without them.

Gotcha. Okay, so the last paper I'd like to chat about is this paper on similarity-based cooperation, co-authored by yourself, Johannes Treutlein, Roger Grosse, Vincent Conitzer, and Jakob Foerster. Can you give us a sense of what this paper is about?

Sure. In some sense you've already set it up very well with the open-source game theory, the program equilibrium stuff you talked about earlier. That's the setting where two players each write some source code, the programs get access to each other's source code, and they choose an action in a given game. For example, in the prisoner's dilemma, you would submit a computer program that takes the opponent's computer program as input and then outputs cooperate or defect. This literature has shown that this kind of setup allows for new cooperative equilibria, and the simplest one is just "cooperate if the opponent is equal to this program, otherwise defect". Similarity-based cooperation, this paper, considers a setting that is in some sense similar: we again have two players that submit some kind of program or policy, and each gets some information about the opponent's policy or program. The main difference is that we imagine you only get fairly specific information about the opponent. You don't get to see their entire source code; you only get a signal about how similar they are to you. In most of the settings we consider in the paper, one just gets to observe a single number that describes how similar the two policies are. The policies or programs get that single number as input and then output, maybe stochastically, whether to cooperate or defect, or whatever the actions in the base game are. And one can show that in this setting, one can still get cooperative equilibria.
In some sense it's not too surprising, right? The cooperative equilibrium in the program equilibrium case that we discussed, "if the opponent is equal to this program, then cooperate, otherwise defect", is in some sense already a similarity-based program: if the similarity is 100%, cooperate; otherwise defect.

Yeah, but as we show in the paper, more interesting things can happen, less rigid ways of cooperating, and you can apply this to other games, and so on.

And in particular, there's this strange theorem. You basically have a noisy observation of a similarity function, or a difference function, I guess, the way you frame it. I think there's a theorem that says that if you don't put any constraints on the difference function, then something like every outcome that's better than the minimax payoff can be realized, which can be worse than Nash equilibrium, right?

Yes, that can be worse than all Nash equilibria of the game.

So at least in some classes of games, it's better if the thing you're observing is actual similarity rather than arbitrary things.

Yes. That's this folk theorem result, which says which outcomes can occur in equilibrium, and, surprisingly, it's exactly the same as in program equilibrium if you don't constrain the diff function. The way these weird equilibria are obtained is very non-natural: it requires, for example, that the diff function is completely asymmetric in symmetric games, things like that. To avoid this, one needs natural diff functions, natural ways of observing how similar one is, in some sense of "natural". And we have some results in the paper that in some sense say the opposite: under certain conditions on the game and on the way similarity is observed, which admittedly are quite strong, you don't get this folk theorem; you get a much more restricted set of equilibria.

It seemed like you had relatively weak criteria, though. That theorem roughly says: under some criterion on the difference function, which to me seemed relatively weak, but in a symmetric game, any non-Pareto-dominated Nash equilibria have equal payoffs, and those must be the best payoffs. So you have this restriction that it's a symmetric game, and also that the Nash equilibria are non-Pareto-dominated. Or, hang on: there always has to be some non-Pareto-dominated Nash equilibrium, right?

Yeah, with some weird analysis caveats, that the set of equilibria might be some open set, something something. It's probably not so reasonable to get into the details of that, but I think it's reasonable to assume that there always is a Nash equilibrium that's not Pareto-dominated by another Nash equilibrium. And the paper says that any such natural equilibrium has to be symmetric, so it gives both players the same payoff.

Gotcha.
So this is an interesting paper. In some ways it's interesting that it somehow gets you the good qualities you wanted out of program equilibrium, these nice cooperative outcomes in symmetric games, while providing less information: at least in this restricted setting, you give less information than the full program and you get better results. I wonder if this is just one of those cases where more options in game theory can hurt you, because on some level it's a little surprising, right?

Yeah, I think it is an illustration of that. By being able to fully observe each other, you get all of these different equilibria, including the weird asymmetric bad ones. So being able to fully observe each other's source code in some sense makes things worse, because there's now much more to choose from, whereas, under certain conditions, as far as our paper goes, you avoid this problem if you have the more limited option of just accessing how similar you are to the opponent.

Yeah. So, speaking of restrictions on the difference function: one thing that struck me is that in real life, "cooperate with people who are very similar to you and otherwise defect" is not an outcome we aspire to, right? And I'm wondering whether it works... it seems like you want something where people cooperate as long as they're just similar enough to cooperate, even if they disagree about what's fair in some subgame that we're not actually going to get into. You want minimal agreement to be able to get cooperation. How does this fit in with this setting?

That's a good question, and I think an important thing to clarify about how this would work. The important thing is that to get this cooperative equilibrium, one really needs a signal of how similar one is with respect to playing the game that one is currently playing, with respect to how cooperatively one approaches this particular game. All other kinds of similarity signals are completely useless. If we play a game and we get a signal about, I don't know, whether we have the same hair color, that's completely useless for how we should play the game, presumably, unless the game is about something that relates to our hair color. And probably, if you get a super broad signal that just says "in general, you two are pretty similar", that's probably not even sufficient to get the cooperative equilibria. If the signal says you're exact copies, that's sufficient; but if the signal says you're 99% the same, and 1% of your source code is different, well, it might be that this 1% is exactly the part that matters, the part that decides whether to cooperate or defect in this game. So what really matters is this strategic similarity.

I think there's an in-between zone, though. Suppose I say, "hey, I'm going to cooperate with people who cooperate with me, but if we don't reach a cooperative equilibrium, I'm going to defect in this one particular way".
And suppose you say, "I'm going to cooperate with people who are willing to cooperate with me, but if we don't manage to cooperate, I have this different method of dealing with the breakdown case". Intuitively, you'd hope that there'd be some good similarity metric we could observe where this counts as similar, and we'd end up cooperating somehow. Does that happen in this formalism?

Okay, so our formalism generally doesn't necessarily restrict the diff functions that much. It definitely allows diff functions that depend only on what you do against similar players. The kind of diff function you're describing is: we say that players are similar if they do similar things when they observe that they're facing a similar opponent, and otherwise we regard them as different. And I think that would be sufficient for getting cooperation. In some sense, I guess, that's the minimum signal that you need.

Yeah, I guess in some sense the question is whether we can come up with a minimally informative signal that still yields maximal cooperation, or something.

All right, so I now have some questions about the details of the paper. One thing that surprised me: I think in Proposition 2 of the paper, you're looking at these policies of the form "if our observed difference is under this threshold, cooperate; otherwise defect", and the measure agents observe is the absolute value of the difference between their thresholds, plus some zero-mean random noise, let's say Gaussian. And Proposition 2 says, if I read it correctly, that even if the noise is mean zero, even if the standard deviation is super tiny, the policies defect against each other at least half the time.

Yeah. That's under particular payoffs of the prisoner's dilemma, but yes.

That strikes me as rough. I would have imagined I'd be able to do better there, especially with arbitrarily tiny noise.

Yeah. Generally, the way noise affects what equilibria there are is counterintuitive in the setting we consider. There's also another result that's surprisingly positive. If you have noise that's uniform between zero and some number X, in this setting where each player submits a threshold (cooperate below, defect above) and the diff they observe is the difference plus this uniform noise, then, regardless of X... You might think that higher X means more noise, so the equilibria must get worse, and at some high enough X it stops working. But it turns out it basically doesn't matter: it's completely scale-invariant. Even if the noise is uniform from zero to 100,000, there's still a fully cooperative equilibrium.

Sorry, a fully what?

A fully cooperative equilibrium: one where both players cooperate with probability one for the diff value that is in fact observed.

Interesting.
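[Editor's note: here is a small Monte Carlo sketch of the threshold policies just described, with made-up thresholds and noise widths. Note that the scale-invariance result is about equilibrium thresholds adapting to the noise scale; fixing thresholds by hand, as done here, does not capture it.]

```python
import random

def mutual_cooperation_rate(t1, t2, noise_width, trials=100_000):
    """Each player cooperates iff the observed diff value, |t1 - t2| plus
    uniform noise on [0, noise_width], falls below their own threshold."""
    both = 0
    for _ in range(trials):
        observed = abs(t1 - t2) + random.uniform(0.0, noise_width)
        both += (observed <= t1) and (observed <= t2)
    return both / trials

print(mutual_cooperation_rate(0.5, 0.5, 0.4))  # exact copies, mild noise: ~1.0
print(mutual_cooperation_rate(0.5, 0.5, 2.0))  # exact copies, wide noise: ~0.25
print(mutual_cooperation_rate(0.5, 0.9, 0.4))  # dissimilar thresholds: ~0.25
```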
So, one thing the paper seemed vaguely reminiscent of, to me, is this algorithm called LOLA, Learning with Opponent-Learning Awareness, where, when you're learning to play a game, you don't only think about how your action gets you reward, but also about how your action changes how the opponent learns, which later changes how much reward you get. And the reason this matters is that you have some experiments actually doing similarity-based cooperation, training neural networks with a method that's inspired by this. I think the thing you do is alternating best response learning, which, if I understand correctly, means you train one network to respond well to the other network, then you train that network to respond well to the first network, and you keep doing this. And basically you find that you do your similarity-based cooperation thing and it ends up working well, roughly. Is that a fair summary of what happens?

Yeah, though it's a bit more complicated. So we have this setting where each player submits, let's say, a neural net, and then they observe how similar they are to each other, and then they play the game. And let's grant that this setting has a cooperative equilibrium where they cooperate if they're similar and defect the more dissimilar they are. So there's the problem of finding the cooperative equilibrium: you don't know what exactly the neural nets should be, and this is some complex setting where cooperating isn't just pressing the cooperate button, it's computing some function, and defecting is also some other function. So you need to do some ML even to find the strategies; even defecting is not so easy, you have to compute some function. So what we do is indeed this alternating best response training, which is exactly what you described. The thing, though, is that if one just initializes the nets randomly and then does alternating best response training, one converges to the defect-defect equilibrium. The reason for that, I think (one never knows, with ML), is probably just that the defect-defect equilibrium is much easier to find, and there's this bootstrapping problem in finding the cooperative equilibrium: the reason to learn the similarity-based cooperation scheme is that the other player also plays the similarity-based cooperation scheme and you want to be similar to them. But if they're randomly initialized, you basically just want to defect.

So you just need to observe some similarity in order to exploit similarity, and by default you never observe it. Is that roughly right?

Well, actually, if you initialize two neural nets randomly, they will, if they're large and so on, actually be pretty similar to each other, because they just compute some statistical average. The issue is just that it doesn't pay off to become more similar to the other net, because the other net just does random stuff. If you want to do well against a random neural network, one that doesn't use the similarity value and just does random nonsense, the best thing to do is simply to defect.
So you have to be similar, and reward similarity.

Yeah, exactly. Or, I guess the most important part is that to have a reason to learn the scheme, the other player basically has to already have implemented the scheme to some extent. If they're doing something else, if they always cooperate or always defect or do some random nonsense, then there's no reason to adopt the scheme. It still doesn't hurt to adopt it: if you adopt the scheme and they don't, they'll register as dissimilar, so you'll defect against them, which is just as good as defecting outright.

Well, it's a complicated scheme, right? You have to learn exactly how to decrease your amount of cooperation with how dissimilar they are, and you need to learn how to cooperate, which, in the setting we study experimentally, is actually hard. So they need to set up this complicated structure, and if there's no pressure towards having it, because the opponent is just random, then there's never any pressure towards building it. And I should say that this is very normal in other settings as well. For example, it's also not so easy to get learners to learn to play tit-for-tat, and the reason is similar: if your opponent is randomly initialized, you mostly just learn to defect, and if you both start out learning to defect, you never learn that you should cooperate and do this tit-for-tat thing, because your opponent is just defecting. To get to the better equilibrium of both playing tit-for-tat, you somehow need to coordinate a switch from both always defecting, which doesn't randomly happen.

It's almost reminiscent of the problem of babbling equilibria. For listeners who might not know: suppose you've got some communication game where agents want to communicate things to each other. There's this problem where initially I can just talk nonsense and it means nothing, and you can just ignore what I'm saying, and that's an equilibrium: if you're not listening, why should I bother to say anything other than nonsense, and if I'm saying nonsense, why should you listen? Is that exactly the same, or am I just drawing loose associations?

No, I think it's basically the same problem. In all of these cases, fundamentally, the problem is that there's some kind of cooperative structure that only pays off if the other player also has the cooperative structure. As long as neither player has it, there's never any pressure to acquire it, whether it's the communication protocol, or tit-for-tat, or "cooperate against similar opponents". There's just no pressure towards adopting these schemes as long as the other player hasn't adopted them, so you get stuck doing the naive thing.
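[Editor's note: the tit-for-tat version of this bootstrapping problem is easy to see numerically. A tiny sketch, assuming the standard prisoner's dilemma payoffs R, S, T, P = 3, 0, 5, 1:]

```python
def play(strat1, strat2, rounds=10):
    """Total payoff to player 1 in an iterated PD; each strategy sees only
    the opponent's previous move, initialized as if both had cooperated."""
    R, S, T, P = 3, 0, 5, 1
    last1 = last2 = "C"
    total = 0
    for _ in range(rounds):
        a1, a2 = strat1(last2), strat2(last1)
        if a1 == "C" and a2 == "C":
            total += R
        elif a1 == "D" and a2 == "C":
            total += T
        elif a1 == "C":
            total += S
        else:
            total += P
        last1, last2 = a1, a2
    return total

tit_for_tat = lambda opp_last: opp_last   # copy the opponent's previous move
always_defect = lambda opp_last: "D"

# Against a defector, tit-for-tat earns slightly *less* than defecting does
# (it loses the opening round), so a learner surrounded by defectors feels
# no pull toward the scheme.
print(play(tit_for_tat, always_defect))    # 9
print(play(always_defect, always_defect))  # 10
```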
So in your paper you have a pre-training method to address this, right?

Yeah, a very simple pre-training method. Basically, we just explicitly train the neural nets to cooperate against copies, which, if you consider more general games, just means training them to maximize the payoff they get when facing a copy, while taking the gradient through both sides, so that you play the game fully cooperatively, in some sense. And we also train them to do well against randomly generated opponents, which, in a prisoner's-dilemma-like game, basically just means defecting, especially against dissimilar opponents. That's the pre-training method, and the result of it is that the nets very roughly do the intuitive thing: they cooperate at low levels of difference, that is, at high levels of similarity, and the more different they are from their opponent, the more they defect. But it does this in an unprincipled way, in some sense: in the pre-training process there's never any explicit reasoning about how to make something an equilibrium, or how to make it stable. It's just naively implementing some way of computing this kind of function. So that's the pre-training, and then we do the alternating best response training: we take two models that were independently pre-trained in this way and face them off against each other. They typically start out cooperating with each other, because after this pre-training they are actually quite similar. And then, maybe this is surprising, I don't know how surprising, but the more interesting thing, I think, is that if you then train them with alternating best response training, they converge to something that's at least somewhat cooperative: they basically find a cooperative equilibrium.

Yeah, and this is kind of surprising, because in alternating best response you usually hold your opponent fixed and ask, "what can I do that's best for me?", and you might naively think that what's best for me is just to defect.
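[Editor's note: as a drastically simplified stand-in for these experiments, here is what alternating best response training looks like when each "model" is a single cooperation threshold rather than a neural net. All payoff and noise numbers are assumptions chosen for illustration.]

```python
import random

R, S, T, P = 3.0, 0.0, 4.0, 1.0  # prisoner's dilemma payoffs (assumed)

def payoff(my_t, their_t, trials=20_000):
    """My expected payoff when each side cooperates iff the observed diff,
    |t1 - t2| plus uniform noise on [0, 0.3], is below its own threshold."""
    total = 0.0
    for _ in range(trials):
        d = abs(my_t - their_t) + random.uniform(0.0, 0.3)
        me, them = d <= my_t, d <= their_t
        if me and them:
            total += R
        elif them:       # I defect while they cooperate
            total += T
        elif me:         # I cooperate while they defect
            total += S
        else:
            total += P
    return total / trials

def best_response(their_t, grid):
    return max(grid, key=lambda t: payoff(t, their_t))

grid = [i / 20 for i in range(21)]
t1 = t2 = 0.6  # pretend pre-training left both with a cooperative scheme
for step in range(4):
    t1 = best_response(t2, grid)  # hold player 2 fixed, re-optimize player 1
    t2 = best_response(t1, grid)  # then the reverse
    print(step, t1, t2, round(payoff(t1, t2), 2))
```

[In typical runs, the first best response exploits the pre-trained opponent by lowering its threshold, after which the pair settles into mutual cooperation at tighter thresholds; erosion followed by partial recovery of this kind is a crude analogue of the behavior described here. Output varies between runs, since payoffs are estimated by sampling.]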
Though it is the case that if your opponent cooperates against similar opponents and defects against dissimilar opponents, it's kind of clear that there's some pressure towards becoming a copy of them, or becoming very similar to them. And you are also taking that gradient.

Yeah, one definitely has to take that gradient; otherwise one just learns to defect. I guess the reason it's not obvious that this works is that the pre-training is so naive. It's such a simple method, much simpler than the opponent-shaping stuff you described, where you reason, "I have to make my opponent's gradient such-and-such, to make sure I incentivize them to be a copy of me rather than something else". We don't do any of that, just this very simple, crude thing. So I think what this demonstrates reasonably well is that it's not that hard to find these cooperative equilibria with relatively crude methods.

Although, I think you said that they didn't cooperate with each other all the time?

Yeah, the cooperation unfortunately does somewhat evaporate throughout the alternating best response training. They might initially be almost fully cooperative with each other, and then you train them to become best responses to each other, and they actually learn to be a bit less cooperative.

Huh, okay. So if the only issue was finding a good pre-training, a good initialization, and they then have this pressure to become "cooperate with things similar to me", why wouldn't they cooperate more rather than less over time?

That's a good question. If one had an optimal pre-training scheme that actually found a correct way of doing the similarity-based cooperation scheme, a strategy that's actually an equilibrium against itself, for example, and let's say both players do this, and maybe they don't find exactly the same equilibrium, but they each find some such strategy, and then we train the opponent to be a best response to your policy, then what they should learn is just to become an exact copy. And once you're there, you stop, you're done: you can't improve your payoff anymore.

If you still take gradient steps, though...

Okay, it gets complicated, but if you think of it as really just trying to improve your neural net, you can't improve it anymore if you're in that equilibrium. So with an optimal pre-training scheme, alternating best response training would, in some sense, immediately, on the first step, get stuck in the correct equilibrium. So why doesn't this happen? There are multiple reasons, I think. One is just that our initialization isn't so good: I doubt that the pre-trained models are equilibria against each other. I think we tried at some point just pairing them up against a literal copy, and they also unlearn to cooperate a bit, because they just don't implement the correct curve of defecting more as they become more dissimilar.

So if they don't implement the correct algorithm, you just don't have that pressure to remain similar and reward similarity.

Yeah, you have some complicated pressure. You still don't want to be completely different, but there's all this trickery where you're similar in some ways, but your curve goes down a bit more, and you exploit that their curve doesn't go down enough as the diff increases. So you still have some pressure towards becoming similar, but it's not enough, and it's not exact. And then I think the other issue is that the alternating best response training, in some sense, can't make them more cooperative. That isn't quite true, because it can make them more cooperative by making them more similar: if I have my optimally pre-trained network, and you train your network to be a best response to mine, then yours becomes more cooperative towards mine, and mine becomes more cooperative towards yours. But any individual network, looked at on its own, can't become more cooperative.
Once a network only cooperates with, say, 60% probability against exact copies, once it's not exactly cooperative anymore at a diff value of zero, there's no way to get back from that to cooperating with 100% chance, as long as your opponent isn't still cooperating at 100%. So if you imagine this alternating best response training as somewhat noisy, where sometimes it by accident makes a player defect a bit more, or sometimes it is just better to defect a bit more because the incentive curves aren't optimal, because they don't exactly make it an equilibrium to be a copy, then as soon as you lose a bit of cooperation, you're never going to get it back. As a consequence, during the alternating best response training the loss usually just goes up, and at some point it stagnates at some value.

Naively, I would have thought that if you cooperate with really close copies all the time, then at that point it's worth me becoming a little bit more similar to you, even if I nudge your number down to 99%. I'm surprised that it's so unstable. Or is the idea that it's only stable in some zone, and alternating best response can get outside of that zone? It's just weird to me that I can have an incentive to become more similar to you at one point in your parameter space, but if you deviate from that point, the incentive goes the other way. I would expect some sort of gradual transition.

So, in general, it's definitely not exactly clear what happens during the alternating best response training, why it finds these partially cooperative equilibria when originally there aren't any. I think the reason why originally they're not cooperative, the typical thing, is just that their curve is too flat in the beginning, in some sense: they cooperate if they observe a diff value between zero and 0.1, and they cooperate roughly equally much across that range, and then, if you're an exact copy of such a player, you want to defect a bit more, just so much that the diff value increases from zero towards 0.1. It's not exact, there's noise, so it's a bit more complicated, but roughly it's something like this: you don't want to be an exact copy. Well, at the very least... okay, if the noise is uniform from zero to 0.1, then maybe you do want to be an exact copy; it is somewhat complicated. But that's the typical way they fail in the beginning: they have these too-flat curves. I think another thing is that they just aren't close enough copies in the beginning; typically they become closer copies throughout training, if I remember correctly. And then it's at least not obvious why the alternating best response training causes the curves to come out such that being a copy is an equilibrium.
I think part of it is just that they learn to defect maximally much, as much as they can get away with. But at least I'm not aware of a super simple analysis of why the alternating best response training ends up finding these cooperative equilibria.

Okay, so speaking of things which aren't super simple: I think in one of the appendices of this paper, you try this fancier method where, instead of just doing alternating best response, you try to shape your opponent to your own benefit. And naively I might think, "this is one of my favorite ways of training agents to play games, and you're trying to shape the opponent to make your life better, so I imagine things will work out better for these agents if you do this". But does that happen?

Unfortunately, it doesn't seem to work very well. That's this LOLA method that you also talked about earlier. I went into this with the same hope; in some sense it's supposed to solve this exact kind of problem, right? It was developed to, for example, learn tit-for-tat in the iterated prisoner's dilemma. But somehow, we tried a bunch to get it to work, and we couldn't really. There are some results where it kind of works for a bit and then unlearns to cooperate again; it seems relatively unstable.

Is there any simple story of what's going on here, or is it maybe just weird hyperparameter tuning, weird nonsense of strange nets?

Definitely, LOLA is very sensitive to hyperparameters. That's kind of known: if you take any positive LOLA result and change the hyperparameters a bit, there's a pretty good chance it stops working relatively quickly. But I don't have a good intuition for why it doesn't work in this case, or even in other cases. I don't have an intuition for why it's so sensitive to the hyperparameters, why it doesn't always work straightforwardly.
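[Editor's note: a minimal sketch of the LOLA update rule, paraphrasing Foerster et al.'s "Learning with Opponent-Learning Awareness": each learner ascends its payoff as evaluated after the opponent's anticipated gradient step, differentiating through that step. The toy game below is an assumption for illustration, not the paper's setting, and finite differences stand in for autodiff.]

```python
def grad_x(f, x, y, eps=1e-5):
    return (f(x + eps, y) - f(x - eps, y)) / (2 * eps)

def grad_y(f, x, y, eps=1e-5):
    return (f(x, y + eps) - f(x, y - eps)) / (2 * eps)

def lola_update(V1, V2, x, y, lr=0.1, opp_lr=0.3):
    """Ascend V1 evaluated after the opponent's one-step gradient ascent on
    V2, so the gradient flows through how x shapes the opponent's update."""
    def shaped(x_, y_):
        return V1(x_, y_ + opp_lr * grad_y(V2, x_, y_))
    return x + lr * grad_x(shaped, x, y)

def naive_update(V1, x, y, lr=0.1):
    return x + lr * grad_x(V1, x, y)

# Toy differentiable game: each side is rewarded for matching the other but
# pays a quadratic cost. At x = y the naive gradient vanishes; LOLA still
# moves, because it anticipates that y will chase wherever x goes.
V1 = lambda x, y: x * y - 0.5 * x * x
V2 = lambda x, y: x * y - 0.5 * y * y

x = y = 0.1
print(naive_update(V1, x, y))     # ~0.100: no naive incentive to move
print(lola_update(V1, V2, x, y))  # ~0.103: the shaping term kicks in
```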
Fair enough. I might move to some closing questions, if that's okay with you. First of all: you're, I think, the assistant director or co-director of this FOCAL lab at CMU, right? Can you tell us a little bit about that?

Yeah, that's the Foundations of Cooperative AI Lab at Carnegie Mellon University. The actual director is Vincent Conitzer, who's also my PhD adviser, while I'm still in the final stages (I hope) of my PhD. Generally, it's a lab that's supposed to work on the kinds of topics we've discussed today, as one might imagine from the name. It's part of the CS department, but we have some more philosophical work as well. Currently we have, I think, one postdoc and five PhD students.

Cool.

And I think listeners of this podcast who are interested in these kinds of topics would be a good fit for this kind of lab, so if anyone's considering starting a PhD, I think it might make sense to check it out.

Okay. And if someone is in that situation, what should they do in order to get into FOCAL?

A lot of the application process is, I think, not that different from general CS PhD applications, so it's good to have a paper or something like that. Another strategy is to try to work with us before applying. For example, in the past few years I've mentored summer research fellows at the Center on Long-Term Risk and also at CERI, the Cambridge Existential Risk Initiative. So that's a way to work with me, at least, before applying for a PhD, which I think helps, for one, to just start working on some of these topics, but maybe also helps with getting in.

Sure. So, before we wrap up the show as a whole: is there anything that you kind of wish I'd asked but hadn't?

Okay, so one question that I kind of expected: both the bounded rational inductive agents paper and the similarity-based cooperation paper touch on decision theory, Newcomb's problem, evidential versus causal decision theory, this whole cluster of topics. So I was expecting to get the question of how these relate, which I could say some stuff about.

Maybe it's just an excuse to talk even more about very interesting topics. I think listeners will be glad for an excuse to hear you talk more about interesting topics. So, how do they relate?

Both of the papers are very explicitly inspired by thinking about these kinds of things. I think one should cooperate in a prisoner's dilemma against a copy, for example, and I think it's unfortunate that there isn't much of a theoretical foundation for why one should do this, in terms of learning, for example: regret minimizers have to learn not to do it, for instance. So part of the motivation behind bounded rational inductive agents is to describe a theory that very explicitly allows cooperating against copies as a rational thing to do; that's somewhat inspired by this. And with the similarity-based cooperation paper, in some sense it's even more explicit that it's supposed to be doing that, though it takes this program-equilibrium-inspired outside perspective: one doesn't ask what it is rational to do in the prisoner's dilemma against a copy; one asks what kind of program it is good to submit in this kind of setting. Whereas there's also the question one can ask from the inside: if you are a neural net that was built, or trained, or whatever, to play games well, and you find yourself in this scenario, then from the inside, from the perspective of this neural net, you face an exact copy, and maybe you reason about things by saying, "if I cooperate, then the opponent will cooperate".

And both this and the safe Pareto improvements paper have this quality, right, that from the outside you're making some change to the agent that's actually making the decisions. You might think that this should all just happen internally.

Yeah, it is interesting that both of these papers take this outside perspective. I agree: one would think that one doesn't need the outside perspective, but at least conceptually, it sometimes seems easier to reason from that outside perspective.
In particular, you can think of the program equilibrium framework as asking the question: how should you reason if other people can read your mind and you can read their mind? That's really a very hard philosophical question, and you can avoid all of these questions by taking the outside perspective, where you submit a program that gets to read the other program's mind, and then you treat this outside setting in the normal, standard game-theoretic way. The underlying problem is surprisingly hard, and so this trick has been surprisingly successful.

Yeah. So I guess that's one interesting relationship between BRIAs and similarity-based cooperation: the internal perspective versus the external perspective. And I guess there's also this thing where, with similarity-based cooperation, you're saying, "if there's this difference function, then here's what happens", whereas in the BRIA setting, you have a variety of these experts, or hypotheses. I guess in some sense you're looking at a wider variety of, maybe the analogy is, a wider variety of these similarity functions as well; you're somehow more powerful.

Kind of, yeah. Or just, more generally, looking at strategic interactions in a less constrained way.

Yeah. There are some interesting questions with these papers, in particular with safe Pareto improvements. Relatedly, one thing I didn't quite ask, but maybe I'll bring up here: is this just a roundabout way of getting to one of these, quote-unquote, functional decision theories, where you just choose to be the type of agent that's the best type of agent to be, across the possible ways the world could be?

Maybe that's the case. The trickiness is that functional decision theory, updateless decision theory, these sorts of things, are kind of not fully specified. Especially in these multi-agent scenarios, it's just unclear what they're supposed to do. I suppose, as a functional decision theorist or updateless decision theorist, one might argue that one shouldn't need surrogate goals, because in some sense the goal of all of these theories is to do away with pre-commitment and things like that: you should come with all the necessary pre-commitments built in. So maybe an idealized functional decision theory agent should already, automatically, have these surrogate goals, except you wouldn't call them surrogate goals; you'd just have these commitments to treat certain other things the same way you would treat threats against the original goal.

To deflect threats, you just reliably do whatever you wish you'd pre-committed to do, and hopefully there's one unique thing that that is.

Yeah. Though, as with equilibrium selection, it's very unclear what that is supposed to come out as.
Coming up on the end: suppose somebody's listened to this, they're interested, and they want to learn more, they want to follow your research and your output. How should they do that?

Three things. I recently made an account on the social media platform X, formerly known as Twitter; that's @C_Oesterheld. I mostly plan to use it for work-related stuff; I don't plan to have random takes on US elections or whatever. I also have a blog, at casparoesterheld.com, which also mostly sticks relatively closely to my research interests. And if you don't want to deal with all this social media stuff, you can also just follow me on Google Scholar, and then you get just the papers.

All right, we'll have links to all of those in the description of the episode. It's been really nice talking; thanks so much for taking the time to be on AXRP.

Yeah, thanks for having me.

And so, listeners, I hope this was a valuable episode. This episode was edited by Jack Garrett, and Amber Dawn Ace helped with transcription. The opening and closing themes are also by Jack Garrett. Financial support for this episode was provided by the Long-Term Future Fund, as well as patrons such as Alexey Malafeev, Ben Weinstein-Raun, and Tor Barstad. To read a transcript of this episode, or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.

Related conversations

AXRP · 3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread. Spectrum trail (transcript): Med 0 · avg -0 · 108 segs.

AXRP · 7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread. Spectrum trail (transcript): Med 0 · avg -5 · 133 segs.

AXRP · 6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread. Spectrum trail (transcript): Med 0 · avg -4 · 72 segs.

AXRP · 1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread. Spectrum trail (transcript): Med -6 · avg -7 · 120 segs.

Counterbalance on this topic

Ranked with the mirror rule described in the methodology: picks sit closer to the opposite side of this page's score on the same axis (lens alignment preferred). Each card plots this page and the pick together.

Mirror pick 1 · AXRP · 3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page: this page -10.64 · this pick -10.64 · Δ 0. Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript): Med 0 · avg -0 · 108 segs.

Mirror pick 2 · AXRP · 7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page: this page -10.64 · this pick -10.64 · Δ 0. Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript): Med 0 · avg -5 · 133 segs.

Mirror pick 3 · AXRP · 6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page: this page -10.64 · this pick -10.64 · Δ 0. Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript): Med 0 · avg -4 · 72 segs.