Situational awareness
AI systems that have a precise understanding of how they’ll be evaluated and what behavior we want them to display will earn more reward than AI systems that don’t.
Audio automatically generated by an AI trained on Kelsey's voice.
Does ChatGPT “know” that it is a “language model”?
This question touches on a concept that seems important for thinking about the behavior of powerful AI systems: situational awareness. By situational awareness, I want to refer to knowledge like “what kind of being you are”, “what processes made you”, and “how others will react to you” and skills like “being able to refer to yourself and conceptualize yourself”, “understanding how your actions affect the outside world”, “understanding the forces that shaped you and that influence what happen to you”, and “making predictions about yourself and about the consequences of your actions”.
I have a three year old. If I ask him what species he is, he’ll answer “a human”, though it’s not totally clear if he attaches that to any particular meaning or treats it as mostly the answer to a trivia question. He knows that the people around him are also humans; he knows that he needs to eat and sleep, and about how full a milk carton he can successfully lift, and he knows some things about why the adults in his environment do the things with him that we do (for example, that the point of the letter games we play is teaching him to read, and that we’re teaching him to read because it’s a useful skill.)
But there are a bunch of components of situational awareness that he is lacking: a lot of the behavior of adults is confusing to him, he often doesn’t know his own limitations or abilities, and his ability to make plans is quite limited. Once, he stole ice cream from the freezer, and then told us he had stolen ice cream from the freezer, and failed to anticipate that this would cause us to not give him ice cream at dessert time. In some important respects, he has less situational awareness than the neighbor’s cat: the cat is more adept at hiding and at interpreting noises from around the house, and knows its own physical limitations a lot better.
His older sister is six, and she has a lot more situational awareness. She knows that she’s a human, and she knows a lot more about humans in general and herself in particular. She knows that if she steals ice cream, she will lose dessert unless she successfully conceals the evidence. She understands not just that she’ll lose dessert, but why we’ll respond to the ice cream theft by taking her dessert away, and she is very good at constructing edge cases of dessert theft and arguing that she shouldn’t get in trouble for them. She knows more about adult goals and adult behavior, and is much better at making plans.
ChatGPT says pretty reliably that it is a language model trained by OpenAI. But it’s unclear whether this information is more like the answer to a trivia question that it memorized (“which Elon-Musk-founded company released a large language model in late 2022?”) or whether it’s information that ChatGPT has embedded in a larger model of the world and uses to make plans and predictions. After all, it’s trivial to write a computer program that prints out “I am a computer program”, and I wouldn’t say such a program has any meaningful situational awareness. Is ChatGPT more like such a computer program, or more like my six year old?
Will powerful AI systems have deep situational awareness?
I’m pretty unsure how to measure how deep ChatGPT’s situational awareness is. But there’s one claim that seems quite likely to me: eventually, powerful AI systems will develop deep situational awareness.
For humans, situational awareness feels very wrapped up in the mysteries of consciousness: why am I me, instead of someone else? What experiences am I going to have in the future? Are other people having the same experiences? But I don’t think you need consciousness for situational awareness. I’m pretty unsure if future AI systems will have anything that resembles ‘consciousness,’ but I’m pretty confident they’ll eventually have high situational awareness. At its heart, it’s a type of knowledge and a set of logical inferences drawn from that knowledge — not a subjective experience.
The main reason I expect this to happen is that we are putting an extraordinary amount of work into making it happen:
- Much of the RLHF process involves trying to teach the AI systems that they are AI systems, that they are trained by humans in order to accomplish tasks, and what those humans want. We reward AI systems during training for correctly understanding human psychology and what humans want to hear from the AI. That means they’re very strongly incentivized to understand us, to understand what we want to hear from them, and to make accurate predictions about us.
- We train language models on large text corpuses that include most of the internet, which at this point contains lots of information about AIs, tech companies, the people who are developing powerful AI systems and their reasons for doing that, etc. I don’t think that being able to recite something you read on the internet is situational awareness, but being able to read lots of written material about your situation will contribute to situational awareness.
- It seems likely that many of the tasks we’ll use AI systems for involve machine learning. People are already trying to use language models to write code, and many alignment researchers aspire to develop AI systems we can use to make progress on AI alignment. That involves exposing the AI systems to critical details of how they work: what techniques work well in modern ML, what its biggest challenges are, what ML work is most valuable, how to improve the performance of models, etc. Not only will powerful models know that they are AI systems, but they’ll have a detailed and in-depth understanding of the procedures used to train them, the software and hardware limitations that prevent them from being even smarter, the state of efforts to overcome those limitations, and more.
“Understanding how you are going to be evaluated, so that you can tailor your visible behavior to the evaluation” is an important kind of situational awareness. Think less “small child trying to avoid getting caught stealing ice cream” and more “college student who knows her professor is sympathetic to a particular school of thought and writes her paper accordingly” or “employee who strategically clusters their hours on a project so that they can bill more overtime”.
This seems like a kind of situational awareness we are particularly incentivizing our AI systems to develop. AI systems that have an extremely precise understanding of how they’ll be evaluated and what behavior we want them to display will earn more reward than AI systems that don’t; at every step, we’ll be trying to inculcate this kind of situational awareness.
Should we try to build AIs without situational awareness?
Situational awareness is a fairly crucial piece of the picture when thinking about how powerful AI systems can be catastrophically dangerous. An AI system that deeply understands how it works is much more likely to understand how to make secret copies of itself, how to make plans that can succeed, and how to persuade people that its dangerous behavior is actually a good idea. My best guess is that AI systems with low situational awareness wouldn’t be capable of taking over the world (though they could contribute to other risks).
But developing AI systems that don’t have extremely high situational awareness seems like it would be very hard. Here are some reasons why:
- Censoring the information that we provide the systems – not telling them that they are language models, not telling them about how data centers and the internet and hardware work – might delay AI systems developing situational awareness, but a system that’s smart enough to be broadly useful could be smart enough to make non-obvious inferences.
- Models with low situational awareness are generally less useful and will ultimately be less profitable. I’ve interacted with some: Meta’s Blenderbot had quite low situational awareness. As a result, it’d routinely claim it heard various facts “from my barista” or said “I’ll ask my husband” or “I just watched that movie!”. That kind of behavior can be cute in a demo, but is incredibly annoying in an assistant (after all, if your assistant falsely thinks it can run to the store for you, it isn’t very good at its job), and is a form of inaccuracy/dishonesty that by default we will probably train models not to engage in.
- In general, we don’t know how to build AI systems that are highly skilled but totally lack one specific skill. Our current methods of making AI systems more capable make them generally more capable. Situational awareness is a capability. More capable AIs will probably have more of it.
My best guess is that you’d have to approach AI training completely differently from how we currently approach it to build powerful, general systems that don’t have high situational awareness. And while I’d be excited to hear such proposals, I haven’t heard any that seem promising. That means we need alignment plans which work even if AI systems have high situational awareness.