Big data analytics and artificial intelligence (AI) company S2W said on the 10th that a joint research paper with the Korea Advanced Institute of Science and Technology (KAIST), which identified vulnerabilities in the large language model (LLM) tokenizer architecture, was accepted at the Conference on Empirical Methods in Natural Language Processing (EMNLP) 2025, one of the most prestigious conferences in natural language processing.
S2W has published papers at major international conferences in the AI field for four consecutive years since 2022. The latest study, titled "Abnormal bigrams revealing vulnerabilities of incomplete tokens in a byte-level tokenizer," analyzed how tokenizers, a core component of LLMs, can induce hallucinations in non-English languages.
The researchers confirmed a phenomenon in which characters in non-English languages are not fully interpreted and remain as "incomplete tokens" when the tokenizer segments and processes text. They pointed out that while English letters are encoded as a single byte each, characters in languages such as Korean, Japanese, and Chinese are represented with multiple bytes, so Byte Pair Encoding (BPE)-based byte-level tokenizers can split a character mid-sequence and distort its meaning.
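The mechanism can be illustrated with a short sketch (this is not the paper's code, just a demonstration of the underlying encoding fact): a byte-level tokenizer operates on the raw UTF-8 byte stream, so nothing prevents it from cutting inside a multi-byte character, leaving fragments that cannot be decoded into a valid character on their own.

```python
# Illustrative sketch (not the paper's method): why byte-level
# tokenization can leave "incomplete tokens" for multi-byte scripts.

# An English letter occupies exactly one byte in UTF-8.
assert len("a".encode("utf-8")) == 1

# A Korean syllable occupies three bytes.
han_bytes = "한".encode("utf-8")
assert len(han_bytes) == 3  # b'\xed\x95\x9c'

# A byte-level tokenizer may cut anywhere in the byte stream.
# Splitting after the first byte leaves a fragment that is not
# valid UTF-8 by itself -- an "incomplete token".
prefix = han_bytes[:1]
try:
    prefix.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
print(decodable)  # False: the fragment alone carries no character meaning
```

In other words, a single English word can always be reassembled from one-byte pieces, but a split Korean character yields byte fragments with no standalone interpretation, which is the structural condition the study examines.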
The team concluded that this structural limitation can lead to failures of contextual interpretation or distortions of meaning in non-English languages, and can act as a factor that increases the rate of hallucinations.
Park Keun-tae, S2W chief technology officer (CTO), said, "This study presented evidence worth considering in the discussion of 'sovereign AI,' which calls for developing and operating AI based on a nation's own language and data," adding, "S2W will continue research to build reliable AI."