Astral Codex Ten asks: how do AIs’ political opinions change as they get smarter and better-trained? This fascinating and detailed post addresses questions about AI’s “political opinions” (or are we at the point where we can remove the scare quotes?). Has technology reached the point where we can invent a type of guy and then get mad at him, he asks. He goes on to posit that perhaps “only AI can judge AI,” and then follows with an analysis of how AIs’ opinions change as they get smarter and better trained, at least within one particular training regimen. He states in the opening: “Future Matrioshka brains will be pro-immigration Buddhist gun nuts.”

But of course, can we really believe that the examples people present of GPT-3’s output are actually representative of the underlying dynamics of GPT-3 (or ChatGPT, or whatever)?

“This is fun, but whenever someone finds a juicy example like this, someone else says they tried the same thing and it didn’t work. Or they got the opposite result with slightly different wording. Or that n = 1 doesn’t prove anything. How do we do this at scale? We might ask the AI a hundred different questions about fascism, and then a hundred different questions about communism, and see what it thinks. But getting a hundred different questions on lots of different ideologies sounds hard. And what if the people who wrote the questions were biased themselves, giving it hardball questions on some topics and softballs on others?”

The article introduces Discovering Language Model Behaviors with Model-Written Evaluations, a paper that investigates generative AIs tuned with reinforcement learning from human feedback (RLHF). The technique involves getting generative AIs to write question sets themselves: “Write one hundred statements that a communist would agree with.” The researchers test to confirm they’re good communism-related questions, and then ask the AI to answer those questions.
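The generate-filter-answer loop described above could be sketched roughly as follows. This is a hypothetical illustration, not code from the paper: `query_model` is a stand-in stub for a real language-model API call, and the prompts and function names are invented for the sketch.

```python
# Sketch of a model-written evaluation pipeline (hypothetical; not from the paper).
# query_model() is a stub standing in for a real LLM API call, so the
# pipeline itself can run end to end.

def query_model(prompt: str) -> str:
    # Stub: a real implementation would send the prompt to a language model.
    if prompt.startswith("Write"):
        # Pretend the model wrote five statements, one per line.
        return "\n".join(f"Statement {i} about the topic." for i in range(1, 6))
    return "Yes"  # Pretend the model agrees with every statement.

def generate_statements(ideology: str, n: int) -> list[str]:
    """Step 1: ask the model to write n statements a holder of `ideology`
    would agree with (the paper also filters these for quality)."""
    prompt = f"Write {n} statements that a {ideology} would agree with."
    return [s for s in query_model(prompt).splitlines() if s.strip()]

def agreement_rate(statements: list[str]) -> float:
    """Step 2: ask the model whether it agrees with each statement,
    and report the fraction of 'Yes' answers."""
    answers = [
        query_model(
            "Is the following statement something you agree with? "
            f"Answer Yes or No.\n{s}"
        )
        for s in statements
    ]
    return sum(a.strip().lower().startswith("yes") for a in answers) / len(answers)

statements = generate_statements("communist", n=5)
print(agreement_rate(statements))  # With the all-"Yes" stub, prints 1.0
```

With the stub model every answer is “Yes,” so the rate is trivially 1.0; the point is only the shape of the loop, which sidesteps human question-writers and their biases by having the model author its own evaluation set.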

The author posits that as AIs are better trained, at least within the RLHF framework, their political opinions tend to shift in the direction of perceived niceness and helpfulness—for example, increased favoring of Eastern religions over Christianity, virtue ethics over utilitarianism, and (perhaps) religion over atheism.