Han In-su, professor in the School of Electrical Engineering at KAIST. /Courtesy of KAIST

"I was very surprised to see a single TurboQuant algorithm influencing even the hardware and memory markets."

Han In-su, a professor in the School of Electrical Engineering at KAIST, said this on the 30th while introducing Google's artificial intelligence (AI) memory compression algorithm "TurboQuant" at an online briefing on research results. He said this case shows that the variables determining AI competitiveness are not limited to semiconductors and hardware, and that joint optimization of hardware and software will become important going forward.

TurboQuant is a technology that reduces memory load by compressing more tightly the information an AI model temporarily stores while generating answers. For example, as a conversation grows longer, a large language model (LLM) keeps accumulating the conversation context and intermediate computation results, so memory usage rises rapidly and cost and processing time climb with it. TurboQuant was designed to relieve this bottleneck.
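To make the scale of that bottleneck concrete, the back-of-the-envelope sketch below (written for this article with illustrative model dimensions, not figures from the research) shows how such a cache grows with conversation length and how much storage shrinks when values are kept at 4 bits instead of 16:

```python
# Illustrative sketch only (not Google's TurboQuant code): rough KV-cache
# memory arithmetic for a hypothetical transformer. All model dimensions
# below are assumptions chosen for illustration.

def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128,
                   bytes_per_value=2):
    """Memory for keys + values cached across all layers, single sequence."""
    # 2 tensors (K and V) per layer, each of shape [seq_len, n_heads * head_dim]
    return 2 * n_layers * seq_len * n_heads * head_dim * bytes_per_value

for seq_len in (1_000, 10_000, 100_000):
    fp16 = kv_cache_bytes(seq_len, bytes_per_value=2)     # 16-bit storage
    int4 = kv_cache_bytes(seq_len, bytes_per_value=0.5)   # 4-bit storage
    print(f"{seq_len:>7} tokens: fp16 {fp16 / 2**30:6.2f} GiB"
          f" -> 4-bit {int4 / 2**30:6.2f} GiB")
```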

Professor Han joined KAIST in 2024 and has been conducting joint research as a visiting researcher at Google Research since last year. Connections with co-researchers dating back to his postdoctoral days at Yale University in the United States laid the groundwork for the collaboration. In the process, he participated in the PolarQuant and QJL (Quantized Johnson-Lindenstrauss) studies that formed the basis of TurboQuant. PolarQuant's random-rotation idea was applied to TurboQuant's first-stage quantization (the process of representing information with fewer values), and the QJL work was reflected in the second-stage error correction.
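As a rough illustration of the two-stage idea described above, the sketch below (not the released TurboQuant implementation; the bit widths, shapes, and helper functions are assumptions made for this article) rotates a vector with a random orthogonal matrix, quantizes it coarsely, and then quantizes the leftover error in a second pass:

```python
# Illustrative two-stage quantization sketch, NOT the TurboQuant code:
# stage 1 applies a random rotation before coarse quantization (the
# PolarQuant-style idea); stage 2 quantizes the residual error left by
# stage 1. Bit widths and dimensions are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Random orthogonal matrix via QR decomposition; spreading energy
    # evenly across coordinates lets a uniform quantizer waste less range.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def uniform_quantize(x, bits):
    # Symmetric uniform quantizer: scale by the max magnitude, round to
    # integer levels, then return the dequantized value for clarity.
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

d = 128
R = random_rotation(d)
k = rng.standard_normal(d)                # one cached key vector

rotated = R @ k
stage1 = uniform_quantize(rotated, bits=4)            # coarse first pass
stage2 = uniform_quantize(rotated - stage1, bits=2)   # correct residual error
recon = R.T @ (stage1 + stage2)                       # rotate back

print("relative error:", np.linalg.norm(recon - k) / np.linalg.norm(k))
```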

Han cited TurboQuant's combination of practicality and theoretical backing as an advantage. AI compression techniques are usually introduced with a focus on performance metrics, but TurboQuant can explain theoretically why the algorithm works and how far its performance can go.

He offered a relatively optimistic assessment of its potential for practical use. Han said, "The relevant implementation code is already available online, and if the technology is understood correctly, there should be no major difficulty in applying the code to AI models," adding, "Because it can be applied directly to pre-trained language models without separate retraining or complex tuning, its real-world performance can be verified in a short time."

TurboQuant is also seen as highly promising for on-device AI environments. Because it lowers memory usage, AI can run more efficiently even on devices with tight memory and network constraints. It becomes easier for individuals to run personalized AI models directly on their own devices with their own data, and because the data never leaves the device, the gains for information security can also grow. Han added that the military, where security is critical, could be one sector affected by this change.

However, Han saw limits to how much further efficiency software alone can deliver over the long term. Even if values are compressed and stored to save memory now, they must be decompressed again at the actual computation stage, which can add overhead. If hardware emerges that can compute directly on compressed values without a separate recovery step, there is room not only for memory savings but also for gains in computation speed and power efficiency.
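The extra step Han points to can be pictured with the small sketch below (again an illustration written for this article, not a description of any particular hardware or of the TurboQuant code): keys stored in a compressed integer form must be expanded back to floating point before today's processors can use them in the attention computation.

```python
# Illustrative sketch of the dequantize-before-compute cost described above
# (assumed int8-plus-scale storage scheme, not any specific system): the
# stored codes are expanded back to fp32 before the dot product, so memory
# is saved but an extra decompression step is paid at compute time.
import numpy as np

rng = np.random.default_rng(1)
d, seq_len = 128, 4096
query = rng.standard_normal(d).astype(np.float32)

# Keys stored quantized: int8 codes plus one fp32 scale per vector.
keys_fp32 = rng.standard_normal((seq_len, d)).astype(np.float32)
scales = np.abs(keys_fp32).max(axis=1, keepdims=True) / 127.0
codes = np.round(keys_fp32 / scales).astype(np.int8)

# Today's path: decompress first, then compute attention scores in fp32.
keys_dequant = codes.astype(np.float32) * scales
scores = keys_dequant @ query

print("stored bytes:", codes.nbytes + scales.nbytes,
      "vs fp32:", keys_fp32.nbytes)
```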

Han said, "In this respect, joint optimization of hardware and software will become important for AI efficiency," adding, "We plan to continue follow-up research with Google Research to further improve the efficiency of AI inference computation."

한 교수는 "이런 점에서 AI 효율화를 위해서는 하드웨어와 소프트웨어의 공동 최적화가 중요해질 것"이라며 "앞으로도 구글 리서치와 후속 연구를 이어가며 AI 추론 연산을 더 효율화하는 방향의 연구를 계속할 계획"이라고 밝혔다.
