> [!NOTE]
> The goal of this note is to discuss a difference between the way I do my research and the way other researchers do theirs. See [[How my approach is different]] for more differences.
The big challenge at the center of my research is that I want to get selfish RL agents to show [[emergent reciprocity]]. This is difficult because RL agents learn by trial and error, repeating whichever actions earned them the most points. Reciprocity earns an agent points only after other agents have learned to reciprocate. If no other agent reciprocates, any attempt to reciprocate only loses points. This is [[The great Catch 22 of reciprocity]].
Once I can get some of the agents to reciprocate, it's possible that a critical mass will form, and then many of the other agents will learn to reciprocate too. But we can't get there without taking the hardest step: solving that Catch 22.
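To see the Catch 22 in numbers, here is a minimal back-of-the-envelope sketch. I'm assuming the textbook iterated prisoner's dilemma payoffs (temptation 5, reward 3, punishment 1, sucker 0) and tit-for-tat as the reciprocating strategy; the game, the payoffs, and the ten-round match length are all stand-ins for illustration, not anything from my experiments.

```python
# Expected per-match payoff of a reciprocator (tit-for-tat) vs. an
# unconditional defector, when a fraction p of the population reciprocates.
# All numbers are illustrative stand-ins.

N_ROUNDS = 10            # rounds per match
T, R, P, S = 5, 3, 1, 0  # temptation, reward, punishment, sucker's payoff

def tft_payoff(p: float) -> float:
    """Expected match payoff for a tit-for-tat (reciprocating) agent."""
    vs_tft = N_ROUNDS * R                 # mutual cooperation every round
    vs_defector = S + (N_ROUNDS - 1) * P  # exploited once, then mutual defection
    return p * vs_tft + (1 - p) * vs_defector

def defector_payoff(p: float) -> float:
    """Expected match payoff for an unconditional defector."""
    vs_tft = T + (N_ROUNDS - 1) * P  # exploit the reciprocator once
    vs_defector = N_ROUNDS * P       # mutual defection every round
    return p * vs_tft + (1 - p) * vs_defector

for p in (0.0, 0.05, 0.10, 0.25, 1.00):
    print(f"p={p:.2f}  reciprocator={tft_payoff(p):5.2f}  defector={defector_payoff(p):5.2f}")
```

With these particular numbers the crossover sits around p ≈ 6%. Below it, reciprocating strictly loses points, which is exactly the behavior a trial-and-error learner unlearns; above it, reciprocating wins, which is the critical mass. The problem is that agents trained from scratch start at p = 0.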
I've been thinking a lot about that, from the agent's perspective. Here is how I would phrase reciprocity:
> I will change my own behavior, against all my instincts,
> so the other agent will change its behavior, against all its instincts.
This is a giant pill to swallow. There are too many moving parts for it to be taught effectively. My first reaction is to try to break it down into multiple parts and teach them separately. Other researchers have had the same idea. They broke it down horizontally, starting with the first line:
> **I will change my own behavior, against all my instincts,**
> so the other agent will change its behavior, against all its instincts.
I wish that could work, but it doesn't. Reciprocity is a profound social behavior. It's the pillar that all other social behaviors rest on. You can't just focus on one individual and hope it works out.
I want to do something different and break it down vertically:
> **I will change my own behavior,** against all my instincts,
> **so the other agent will change its behavior,** against all its instincts.
I'm keeping the social part and putting the social dilemma part aside. I want to create an environment in which agents listen and respond to each other: sometimes by mimicking, and sometimes by contrasting. (See the [[Fruit Slots]] experiment.) I want this to be a win-win situation, i.e., both agents gain points by listening to each other. Once that setup works, I want to carefully introduce a social dilemma, slowly raising the stakes. My hope is that once the agents become experts at listening to each other, they will also learn to reciprocate.
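To make "slowly raising the stakes" concrete, here is a minimal sketch of one way such a curriculum could look, with made-up payoff matrices: start from a pure coordination game, where mimicking the partner is a win-win, and linearly interpolate toward a prisoner's dilemma as training progresses. Neither matrix is the actual [[Fruit Slots]] payoff structure; they are placeholders.

```python
import numpy as np

# Hypothetical curriculum: interpolate the row player's payoff matrix
# from a pure coordination game (listening pays unconditionally) to a
# prisoner's dilemma (listening pays only if the partner reciprocates).
# Rows = my action, columns = partner's action: (cooperate, defect).

COORDINATION = np.array([[3.0, 0.0],
                         [0.0, 3.0]])  # matching the partner always pays
DILEMMA = np.array([[3.0, 0.0],        # reward 3, sucker's payoff 0
                    [5.0, 1.0]])       # temptation 5, punishment 1

def payoff_matrix(stakes: float) -> np.ndarray:
    """Row player's payoffs at one point in the curriculum, stakes in [0, 1]."""
    return (1.0 - stakes) * COORDINATION + stakes * DILEMMA

for stakes in (0.0, 0.5, 1.0):
    print(f"stakes = {stakes:.1f}:\n{payoff_matrix(stakes)}\n")
```

One property I like about this interpolation: at every intermediate stage, mutual cooperation is still the best joint outcome, but defection gets steadily more tempting as the stakes rise, so the agents can keep practicing listening to each other while the dilemma sharpens.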