Training AIs to help us align AIs: If we can accurately recognize good performance on alignment, we could elicit lots of useful alignment work from our models, even if they're playing the training game.
Playing the training game: We're creating incentives for AI systems to make their behavior look as desirable as possible to us while intentionally disregarding human intent whenever that conflicts with maximizing reward.
Situational awareness: AI systems that have a precise understanding of how they'll be evaluated and what behavior we want them to display will earn more reward than AI systems that don't.
"Aligned" shouldn't be a synonym for "good" Perfect alignment just means that AI systems won’t want to deliberately disregard their designers' intent; it's not enough to ensure AI is good for the world.
What we're doing here We’re trying to think ahead to a possible future in which AI is making all the most important decisions.