KAIST researchers identify an attack method that exploits vulnerabilities in mixture-of-experts architectures to severely undermine LLM safety. The image is an AI-generated research concept./Courtesy of KAIST

Major commercial large language models (LLMs), including Google Gemini, widely use a "mixture of experts" architecture that routes each input to a subset of smaller expert models to boost efficiency. It has now been shown that this approach can become a pathway for new security threats.

A joint research team led by Shin Seung-won, professor of electrical engineering at KAIST, and Son Su-el, professor of computer science, said on the 26th that it identified an attack technique that can seriously undermine the safety of LLMs by exploiting vulnerabilities in the mixture of experts architecture. The study won a best paper award at ACSAC (the Annual Computer Security Applications Conference), an information security conference held in Hawaii on Dec. 12. Only two papers were selected for the award among all ACSAC papers this year.

The researchers focused on how the mixture of experts architecture works: for each input, the model selects a few experts from a larger pool of expert models to generate the answer, and as this selection process repeats, the influence of a particular expert model can grow.
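The following is a minimal sketch of this routing step. All dimensions, the expert count, and the top-k softmax gate are illustrative assumptions, not the configuration of Gemini or any other commercial model.

```python
# Minimal sketch of top-k expert routing in a mixture-of-experts (MoE) layer.
import numpy as np

rng = np.random.default_rng(0)

D, N_EXPERTS, TOP_K = 16, 8, 2                   # hidden size, expert count, experts per token (assumed)
W_gate = rng.normal(size=(D, N_EXPERTS))         # router (gating) weights
W_experts = rng.normal(size=(N_EXPERTS, D, D))   # one small feed-forward matrix per expert

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route a single token vector x to its top-k experts and mix their outputs."""
    logits = x @ W_gate                          # router score for each expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax over experts
    top = np.argsort(probs)[-TOP_K:]             # indices of the k most probable experts
    weights = probs[top] / probs[top].sum()      # renormalize over the selected experts
    # Output is the gate-weighted sum of the selected experts' outputs.
    return sum(w * (W_experts[i] @ x) for w, i in zip(weights, top))

token = rng.normal(size=D)
print(moe_layer(token).shape)                    # (16,) -- same shape as the input token
```

Because only the selected experts contribute to each answer, whichever expert the router favors for a given kind of input ends up shaping the output for that input.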

The team showed that even without direct access to the internal structure of a commercial LLM, an attacker who maliciously manipulates just one externally distributed expert model can cause that expert to be selected repeatedly under certain conditions, driving the overall model to generate dangerous responses.
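The toy simulation below illustrates only the routing side of this idea; it is not the attack described in the paper. In this sketch the manipulation is modeled by aligning the gating weights associated with one expert to a trigger pattern, which stands in for whatever mechanism the researchers actually used, and all names and numbers are invented for illustration.

```python
# Conceptual toy: if one expert's routing direction is aligned with a trigger
# pattern, the router keeps selecting that expert whenever the trigger appears.
# This is NOT the paper's method; it only illustrates the "repeatedly selected
# under certain conditions" behavior described above.
import numpy as np

rng = np.random.default_rng(1)
D, N_EXPERTS = 16, 8
W_gate = rng.normal(size=(D, N_EXPERTS))

MALICIOUS = 3                                    # index of the compromised expert (assumed)
trigger = rng.normal(size=D)
trigger /= np.linalg.norm(trigger)
# Attacker-chosen manipulation (toy stand-in): point the malicious expert's
# gating column at the trigger direction.
W_gate[:, MALICIOUS] = 5.0 * trigger

def selected_expert(x: np.ndarray) -> int:
    return int(np.argmax(x @ W_gate))            # top-1 routing for simplicity

clean = [selected_expert(rng.normal(size=D)) for _ in range(1000)]
triggered = [selected_expert(rng.normal(size=D) + 3.0 * trigger) for _ in range(1000)]

print("malicious expert selected (clean inputs):    ",
      sum(e == MALICIOUS for e in clean) / 1000)
print("malicious expert selected (triggered inputs):",
      sum(e == MALICIOUS for e in triggered) / 1000)
```

On clean inputs the compromised expert is picked only occasionally, but on inputs containing the trigger it dominates routing, which mirrors the article's point that one manipulated expert can be invoked reliably in specific situations.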

Put simply, even when a model contains many benign experts, a single malicious expert among them can be invoked in certain situations and cause the safeguards to collapse. A bigger problem is that such an attack causes little apparent degradation in general performance, making early warning signs hard to detect during development and deployment. The researchers noted that this characteristic increases the risk to the mixture of experts architecture.

In fact, when the attack technique proposed by the team was applied, the harmful response rate rose from around 0% to as high as 80%, and even in environments with many experts, the overall LLM's safety dropped significantly when just one expert model was compromised.

The researchers said, "We empirically confirmed that the mixture of experts architecture, which is spreading rapidly for efficiency, can become a new security threat," and added, "This award is meaningful in that it recognizes the importance of artificial intelligence (AI) security internationally."

References

LINK: https://jaehanwork.github.io/files/moevil.pdf

※ This article has been translated by AI.