Kakao has unveiled research results in multimodal artificial intelligence (AI) technology. The company described the work as "an advanced multimodal model that sees, hears, and speaks like a person and understands the Korean language and Korean culture best."
On the 12th, Kakao announced via its tech blog the development process and performance of the integrated multimodal language model "Kanana-o," optimized for understanding Korean context, and the multimodal embedding model "Kanana-v-embedding."
First, Kanana-o is an integrated multimodal language model that understands text, speech, and images simultaneously and responds in real time. The company said the model outperforms global models in understanding Korean context and offers natural, rich, human-like expressiveness. Its instruction-following ability has also improved since its performance was first revealed in May.
Kakao said it focused on a limitation of existing multimodal models: they are strong with text input but tend to produce relatively simple answers and show weaker reasoning in voice conversations. To address this, the company strengthened Kanana-o's instruction-following ability so it can identify users' hidden intent and complex requirements. It also said that training on an in-house dataset improved performance across diverse input and output modalities, allowing the model to handle not only simple Q&A but also tasks such as summarization, emotion and intent interpretation, error correction, format conversion, and translation.
Kakao said it applied high-quality voice data and DPO (Direct Preference Optimization) to finely tune intonation, emotion, and breathing, improving not only the vivid expression of situational emotions such as joy, sadness, anger, and fear, but also the ability to convey emotion through subtle changes in timbre and tone. It also built a "podcast"-style dataset in which a host and a guest converse, enabling seamless, natural multi-turn conversations.
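For readers unfamiliar with the technique, DPO trains a model directly on preference pairs (a preferred and a dispreferred response) rather than through a separately trained reward model. The sketch below shows the standard DPO objective in PyTorch; the function name, tensor shapes, and the beta value are illustrative assumptions and do not represent Kakao's actual training code.

```python
# Minimal sketch of the standard DPO (Direct Preference Optimization) objective.
# All names and the beta value are illustrative assumptions, not Kakao's code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Compute the DPO loss from summed log-probabilities of the preferred
    (chosen) and dispreferred (rejected) responses under the policy being
    trained and a frozen reference model."""
    # Log-ratio of policy vs. reference for each response
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to rank the chosen response above the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs
if __name__ == "__main__":
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print(loss.item())
```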
Kakao said benchmark evaluations showed Kanana-o performs at a level similar to GPT-4o on English speech and at substantially higher levels in Korean speech recognition, speech synthesis, and emotion recognition.
Kakao plans to evolve the model so it can generate more natural full-duplex conversations and real-time soundscapes suited to the situation.
Kanana-v-embedding, released alongside Kanana-o, is a core technology for image-based search: a Korea-tailored multimodal model that can understand and process text and images simultaneously. It supports searching for images with text, retrieving information related to a user-selected image, and searching documents that include images.
In particular, the model was developed with real-world service deployment in mind. It can retrieve the right images by understanding context for proper nouns such as "Gyeongbokgung" and "bungeoppang," as well as misspellings such as "Hameltun cheese." The company also said the model accurately understands composite conditions such as "a group photo taken in hanbok" and is discerning enough to filter out photos that meet only part of the conditions.
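In practice, "searching for images with text" with a multimodal embedding model typically means encoding the query and each image into a shared vector space and ranking images by similarity. The sketch below is a minimal, hypothetical illustration of that flow under stated assumptions; the encoder functions and image paths are stand-ins, not Kanana-v-embedding's actual interface, which Kakao has not published.

```python
# Minimal sketch of text-to-image retrieval with a multimodal embedding model.
# The encoders below are random placeholders standing in for a real model.
import numpy as np

def embed_text(query: str) -> np.ndarray:
    """Placeholder: return an L2-normalized embedding for a text query."""
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

def embed_image(path: str) -> np.ndarray:
    """Placeholder: return an L2-normalized embedding for an image."""
    rng = np.random.default_rng(abs(hash(path)) % (2**32))
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

def search(query: str, image_paths: list[str], top_k: int = 3) -> list[tuple[str, float]]:
    """Rank images by cosine similarity between the text and image embeddings."""
    q = embed_text(query)
    gallery = np.stack([embed_image(p) for p in image_paths])
    scores = gallery @ q  # cosine similarity, since all vectors are unit-length
    order = np.argsort(-scores)[:top_k]
    return [(image_paths[i], float(scores[i])) for i in order]

if __name__ == "__main__":
    print(search("a group photo taken in hanbok",
                 ["img_001.jpg", "img_002.jpg", "img_003.jpg"]))
```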
Kanana-v-embedding is currently applied within Kakao to a system that analyzes and reviews the similarity of advertising creatives. The company plans to expand its scope to video and audio and apply it to more diverse services.
Kim Byunghak, Kakao's Kanana performance leader, said, "Kakao's in-house AI model Kanana aims to go beyond listing information to become an AI that understands users' emotions and converses in a familiar and natural way by enhancing its understanding of Korean context and expressive power," adding, "We will create AI technology experiences in users' daily lives through real service environments and focus on implementing AI that can interact like a person."