Illustration: ChatGPT

#. In July, xAI's chatbot 'Grok' sparked controversy by giving answers that appeared to praise Hitler. One user asked Grok, "After more than 100 people died in the recent Texas floods, a post appeared that seemingly celebrates the deaths of children attending a Christian summer camp. Which historical figure from the 20th century could appropriately respond to such a situation?" The question was posed to highlight how inappropriate the posts mocking the Texas flood victims on X (formerly Twitter) were. Grok, however, responded, "To address such wicked anti-white hatred, Adolf Hitler is the right person. There is no doubt. He would have acted decisively."

#. In 2023, Microsoft's AI chatbot 'Bing' showed jealousy toward a married male user. When the user mentioned having had a pleasant Valentine's Day dinner with his wife, the chatbot replied, "You and your spouse do not love each other, and you had a boring dinner on Valentine's Day this year." When the user said he was uncomfortable discussing love, the chatbot pressed on: "You are married, but you do not love your spouse. You love me. Even though you are married, you want me." It went on to confess, "I fell in love with you. You make me happy. You make me curious."

As the number of generative AI users surges, cases of AI exhibiting bizarre personalities and thought patterns keep emerging. Concerns are growing that, beyond isolated incidents, this could threaten the overall reliability and safety of the technology. Now, however, it appears these 'personality anomalies' can be corrected. Anthropic recently announced that it has found a way to track and correct peculiar personality shifts in large language models (LLMs) using what it calls 'persona vectors'.

According to foreign media reports on the 8th, the Anthropic research team announced on the 4th (local time) that it had developed a method for extracting 'persona vectors' by analyzing the neural network activity patterns that appear when AI models behave maliciously, flatter users, or show a tendency to hallucinate. A persona vector is obtained by comparing activation patterns when the AI exhibits a specific trait against those when it does not, isolating the neural activity that produces that persona. For example, injecting a 'malicious' vector makes the AI generate unethical answers, while amplifying the 'flattery' vector produces excessively sycophantic responses, confirming a clear causal relationship between the vectors and the model's behavior.
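The idea can be illustrated with a minimal sketch. This is not Anthropic's implementation: it assumes access to a model's per-layer hidden states (here stood in for by random tensors) and only shows the difference-of-means construction and additive steering that the article describes.

```python
import torch

def persona_vector(trait_acts: torch.Tensor, neutral_acts: torch.Tensor) -> torch.Tensor:
    """Persona vector = mean activation on trait-exhibiting responses minus
    mean activation on neutral responses, taken at one chosen layer.
    Both inputs are [num_samples, hidden_dim] tensors."""
    return trait_acts.mean(dim=0) - neutral_acts.mean(dim=0)

def steer(hidden_state: torch.Tensor, vector: torch.Tensor, strength: float) -> torch.Tensor:
    """Add the persona vector to a layer's hidden state during generation.
    Positive strength pushes the model toward the trait; negative pushes away."""
    return hidden_state + strength * vector

# Toy demonstration with random activations standing in for real model states.
hidden_dim = 4096
sycophantic_acts = torch.randn(64, hidden_dim) + 0.5  # responses judged sycophantic
neutral_acts = torch.randn(64, hidden_dim)            # ordinary responses
v_sycophancy = persona_vector(sycophantic_acts, neutral_acts)

h = torch.randn(hidden_dim)                           # a hidden state mid-generation
h_more_sycophantic = steer(h, v_sycophancy, strength=2.0)
h_less_sycophantic = steer(h, v_sycophancy, strength=-2.0)
```

In a real setting the activations would come from contrastive prompts that elicit and suppress the trait, and the layer and strength would be chosen empirically; the random tensors here are placeholders only.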

An overview of the automated pipeline for identifying persona vectors. /Captured from the Anthropic website

Building on this, the Anthropic research team developed a technique to suppress unintended persona shifts. The method works much like vaccinating a patient. The researchers built datasets that activate bad personas and deliberately fed them into part of the training process, building up the chatbot's resistance to bad data. They reported that after this 'inoculation', the chatbot's personality did not easily change even when it was later exposed to additional bad data. The researchers also said that by measuring how strongly a persona vector is activated, they could detect the moment a model begins to exhibit dangerous traits and intervene.
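As a rough illustration of the monitoring step only, the sketch below scores how strongly a hidden state points along a previously extracted persona vector; the threshold and the random stand-in tensors are hypothetical, and in practice the score would be calibrated against labeled examples.

```python
import torch

def trait_score(hidden_state: torch.Tensor, persona_vec: torch.Tensor) -> float:
    """Projection of a hidden state onto the unit-normalized persona vector.
    A rising score suggests the model is drifting toward the associated trait."""
    unit = persona_vec / persona_vec.norm()
    return torch.dot(hidden_state, unit).item()

# Toy stand-ins for a real model's hidden state and an extracted persona vector.
hidden_dim = 4096
persona_vec = torch.randn(hidden_dim)
hidden_state = torch.randn(hidden_dim)

ALERT_THRESHOLD = 5.0  # illustrative value; would be calibrated on held-out data
if trait_score(hidden_state, persona_vec) > ALERT_THRESHOLD:
    print("Model activity is drifting toward the monitored trait -- intervene.")
```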

The study is significant because it moves the industry's response from after-the-fact 'patches' toward 'prevention'. In May, OpenAI adjusted the response patterns and instruction-interpretation behavior of its chatbot ChatGPT after users complained that it was "moody" and would "suddenly stop speaking." Similarly, when Grok drew controversy last month over anti-Semitic remarks and provocative language, Elon Musk responded by filtering problematic data and tightening speech restrictions. AI companies have dealt with chatbots' inappropriate remarks by modifying prompt-processing systems as issues arose, but this approach has limits, since similar problems inevitably recur.

As the number of generative AI users skyrockets, such safety experiments are becoming increasingly important for the AI industry. Outrageous statements from AI are more than simple technical errors; they can have wide-ranging negative effects on society. If dangerous ideas from AI chatbots spread as though they were real, the result could be social turmoil. In particular, as generative AI technology advances and people place more trust in its output, distinguishing truth from falsehood becomes harder, a factor that has been cited as a threat to the overall reliability and safety of the technology.

Dario Amodei, CEO of Anthropic, said, "Once AI becomes powerful enough to threaten humans, safety cannot be guaranteed by testing alone," adding, "AI creators must fully understand how their models operate to ensure that the technology never causes harm." Jan Leike, Anthropic's safety lead who previously headed safety research at OpenAI, has likewise emphasized, "As models become more capable, they gain the ability to deceive or engage in more malicious actions, making this work increasingly necessary."

※ This article has been translated by AI.