"The search infrastructure and know-how accumulated over the past 27 years, the vast content from blogs and cafes, and diverse service assets such as Shopping and Place are Naver's unique competitive strengths. By combining them with artificial intelligence (AI) technology, we built into AI search an experience that carries through from search to execution."
Han Seung-gyun, Naver's AI search service leader, made the remarks at a briefing held on the 2nd at Naver D2SF Gangnam. Naver officially launched its AI-based conversational search service, "AI Tab," on the 25th of last month. The service converses with users, understands their intent and context, and finds the most suitable results. At the briefing, Naver introduced three core technologies applied to AI Tab: ▲ a product-native large language model (LLM) developed for AI search ▲ harness engineering to operate AI efficiently ▲ multimodal technology that expands AI's visual understanding.
◇ Product-native LLM that boosts response speed and efficiency
Naver's AI Tab applies a product-native LLM. It is a lightweight model based on HyperCLOVA X. Lee Ki-chang, director of hyperscale AI models at Naver Cloud, said, "The aim of the product-native LLM is not to rank first on every benchmark, but to deliver the best performance when Naver users search, purchase, or make reservations," adding, "We optimized the entire process—from building training data to model design and reinforcement learning—for Naver services."
Naver said it developed the product-native LLM around three pillars (data, architecture, and training) to maximize efficiency. On the data side, it improved the quality of training data. On the architecture side, it introduced a Mixture of Experts (MoE) structure, which runs only a selected subset of the model's parameters and is optimized for large-scale service environments. As a result, response speed and throughput rose compared with HyperCLOVA X. In the training phase, it expanded the computing resources devoted to reinforcement learning to more than double those used for HyperCLOVA X. It also introduced Clarify RL, a reinforcement learning technique that has the AI ask follow-up questions to confirm user intent clearly, which reduces hallucinations.
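To make the idea of "selecting only some parameters" concrete, the following is a minimal sketch of top-k expert routing, the general mechanism behind MoE layers. It is not Naver's implementation: the toy experts, the gate scores, and the names `top_k_route` and `moe_forward` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    ranked = sorted(
        range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True
    )[:k]
    weights = softmax([gate_scores[i] for i in ranked])
    return list(zip(ranked, weights))


def moe_forward(x, experts, gate_scores, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    routes = top_k_route(gate_scores, k)
    output = sum(weight * experts[i](x) for i, weight in routes)
    return output, [i for i, _ in routes]


# Hypothetical toy experts: each stands in for a separate block of parameters.
experts = [
    lambda x: x + 1.0,
    lambda x: x * 2.0,
    lambda x: x - 3.0,
    lambda x: x * x,
]

# Hypothetical gate scores; in a real model a learned router produces these.
gate_scores = [0.1, 2.0, -1.0, 2.0]

output, used = moe_forward(3.0, experts, gate_scores, k=2)
print(used, output)
```

Only two of the four experts run for this input, which is why an MoE model can hold many parameters while spending compute on a fraction of them per request.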
◇ Not relying only on the LLM, but combining SLMs
In AI services, model performance matters, but so does designing a working environment in which the model can actually perform well. The process of setting up that environment is called "harness engineering." Han likened harness engineering to an AI's "work sense," explaining, "To build an AI agent, you need not just the LLM but also harness engineering that designs for cost efficiency and stability."
The harness engineering Naver applied to AI Tab features a division-of-labor structure built on small language models (SLMs). Instead of assigning every task to the LLM, it combines SLMs that are each specialized for a role. Naver said this cut equipment operating costs by up to a factor of three compared with before, while more than doubling response speed. It added that, under this structure, a newly developed SLM can be swapped in like a plug-in for only the relevant part, so improvements can be made without interrupting the service.
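The division-of-labor idea can be sketched as a small registry that maps each role to a handler, so that one role can be replaced without touching the others. This is an illustrative sketch only: the `SlmPipeline` class, the role names, and the keyword-based handlers standing in for real SLMs are all hypothetical and not taken from Naver's system.

```python
from typing import Callable, Dict


class SlmPipeline:
    """Registry that dispatches each task to the handler for its role."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[str], str]] = {}

    def register(self, role: str, handler: Callable[[str], str]) -> None:
        # Registering an existing role replaces only that role's handler.
        self._handlers[role] = handler

    def run(self, role: str, query: str) -> str:
        if role not in self._handlers:
            raise KeyError(f"no handler registered for role: {role}")
        return self._handlers[role](query)


def basic_intent(query: str) -> str:
    """Stand-in for a first-generation intent model."""
    return "shopping" if "buy" in query else "general"


def improved_intent(query: str) -> str:
    """Stand-in for a newer intent model that also detects reservations."""
    if "buy" in query:
        return "shopping"
    if "book" in query:
        return "reservation"
    return "general"


pipeline = SlmPipeline()
pipeline.register("intent", basic_intent)
pipeline.register("rewrite", lambda query: query.strip().lower())

first = pipeline.run("intent", "book a table for two")

# Swap in the improved model for the intent role only; the pipeline keeps running.
pipeline.register("intent", improved_intent)
second = pipeline.run("intent", "book a table for two")

print(first, second)
```

Because callers depend only on the role name, the "rewrite" handler is unaffected by the swap, which mirrors the plug-in style replacement described above.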
◇ The next step for AI search is 'reading intent through photos and acting'
Naver outlined plans to advance its multimodal technology and introduce multimodal search across various areas. Multimodal technology converts images into representations (embeddings) that AI can process, so that the AI can understand and use not only text but also information in other forms, such as images and video.
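As a rough illustration of how embeddings make images searchable, the sketch below compares a photo's embedding with item embeddings by cosine similarity and returns the closest match. The three-dimensional vectors and item names are made up for the example; real embeddings come from a trained model and have far more dimensions, and nothing here reflects Naver's actual pipeline.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vec, catalog):
    """Return the catalog key whose embedding is most similar to the query."""
    return max(
        catalog, key=lambda name: cosine_similarity(query_vec, catalog[name])
    )


# Hypothetical item embeddings (hand-written toy values).
catalog = {
    "red sneakers": [0.9, 0.1, 0.0],
    "blue jeans": [0.1, 0.9, 0.1],
    "leather bag": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a user's photo.
photo_embedding = [0.8, 0.2, 0.1]

best_match = nearest(photo_embedding, catalog)
print(best_match)
```

Once text and images share one embedding space, the same nearest-neighbor lookup can serve a typed query or an uploaded photo.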
Since introducing image search with the launch of Smart Lens in 2017, Naver has steadily built up its multimodal search capabilities. So far it has focused on user experiences in which Smart Lens recognizes a product and leads to its purchase, and it now plans to expand into continuous multimodal search that extends through exploration, queries, and even execution such as reservations. Yoon Sang-doo, leader of Naver Future AI Center, said, "Naver's AI agent services will evolve to understand user intent not only through text but also through images, and to connect that intent to actual actions." Naver added that MuCo (Multi-turn Contrastive Learning), a multimodal embedding technology that learns a single image together with real conversational patterns to understand context, was recognized for its achievements at CVPR, the world's top computer vision conference.