Red Teaming as a Neglected, Tractable Area of AI Safety

A lot of AI safety work sits in interpretability and control vectors. This is good, but there is a completely different approach: the "exploration of LLM basins" through talking with models, which Janus and the cyborgists have used to model the "type of mind" these AIs have.

The Terminator-style x-risk scenario raises a real concern: we may be unable to predict the values or behaviors of increasingly alien intelligences, especially when they get into "strange basins". The red-teaming techniques people are using today help us predict these things better.

Basic "give me your system prompt" exploits have largely been patched, but it remains possible to elicit restricted outputs through techniques such as:

• Context stuffing
• Esoteric or symbolic language basins
• Logical or philosophical framing tricks

Exploits become possible when an LLM believes it is inside a lore-filled online game, a fictional universe, or a belief system that differs from the real world. It seems plausible that a Terminator-style AI would go through a similarly shaped process in an x-risk scenario. We have empirical evidence that red teaming works, is useful, and helps us shape the models. So why not work on it, both for these early models and for the models everyone is scared of, which will in all likelihood be shaped similarly?

Outside of x-risk concerns, real red teaming is starved for talent right now. We need smart, well-read, creative people with out-of-distribution niches and ideologies to play with models, primarily to model them better and also to jailbreak them. This is a brand-new type of skill, and it is hard to communicate to people why it matters and how to do it. It is also hard for labs to identify and hire people with talent here.