Google's recently unveiled artificial intelligence (AI) memory compression algorithm "TurboQuant" is drawing attention. In particular, Han In-su, a professor in KAIST's School of Electrical Engineering who took part in the TurboQuant research, said the technology could ease AI's memory bottlenecks, boost efficiency across industries, and bring medium- to long-term changes to the memory semiconductor market.
KAIST said on the 27th that a joint research team from Google Research, DeepMind, and New York University, with participation from School of Electrical Engineering Professor Han In-su, unveiled the next-generation quantization algorithm "TurboQuant" to solve AI memory overload.
Large language models (LLMs) operate by continuously storing prior information to understand the context of questions and answers. As conversations get longer, the amount of information that must be stored also increases, rapidly expanding the required memory capacity. Because of this, memory bottlenecks have been cited as one of the biggest obstacles to running AI faster and more cheaply.
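The memory growth described above can be made concrete with a back-of-envelope calculation. The sketch below estimates how the stored per-token state (the key-value cache) grows with conversation length; the model dimensions are hypothetical placeholders, not those of any specific model.

```python
# Illustrative estimate of LLM memory growth with conversation length.
# All model dimensions below are hypothetical, chosen only to show scale.

def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128,
                   bytes_per_value=2):  # 2 bytes per value = 16-bit floats
    # Each stored token keeps one key and one value vector
    # per attention head, per layer.
    return seq_len * n_layers * n_heads * head_dim * 2 * bytes_per_value

for tokens in (1_000, 10_000, 100_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>7} tokens -> {gib:6.2f} GiB")
```

The growth is linear in conversation length, so a hundredfold longer context needs a hundredfold more memory, which is why long conversations quickly hit the bottleneck the article describes.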
Google developed the compression technology "TurboQuant" to address this problem. It is designed to reduce an AI model's memory usage by up to a factor of six with little to no performance degradation.
The core is quantization. Quantization, simply put, is a technique that stores complex numerical data in a simpler form. For example, a long decimal can be stored as a shorter, rounded number; as long as the important information is preserved, overall performance is barely affected. The principle is similar to shrinking a photo file while minimizing the loss of image quality. This approach reduces storage space and speeds up computation.
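The rounding idea above can be sketched as uniform scalar quantization: store 32-bit floating-point values as 8-bit integers plus a single scale factor, then reconstruct. This is a minimal illustration of the generic concept, not TurboQuant's actual scheme.

```python
import numpy as np

# Minimal sketch of uniform scalar quantization (generic concept,
# not TurboQuant's specific algorithm): float32 -> int8 + one scale.

def quantize(x, bits=8):
    # Scale so the largest magnitude maps to the widest int8 value.
    scale = np.abs(x).max() / (2**(bits - 1) - 1)
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(1024).astype(np.float32)
q, scale = quantize(x)
err = np.abs(dequantize(q, scale) - x).max()
print(f"storage: {x.nbytes} -> {q.nbytes} bytes, max abs error {err:.4f}")
```

Storing int8 instead of float32 already gives a fourfold reduction; the error stays bounded by half the scale step, which is why "important information" survives the rounding.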
TurboQuant works in two stages. First, in stage one, it randomly rotates the input data and then compresses each element individually. This reduces unusually large or outlier values, enabling more efficient compression of the entire dataset. The approach was also used in the "PolarQuant" study in which Han previously participated.
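The effect of the random rotation in stage one can be illustrated with a small sketch: rotating a vector by a random orthogonal matrix preserves its overall size but spreads a single large outlier across many coordinates, so a per-element quantizer no longer wastes its range on one extreme value. The QR-based rotation below is an illustrative assumption, not TurboQuant's exact construction.

```python
import numpy as np

# Sketch of stage one's idea: a random rotation spreads outliers out
# before per-element quantization. The rotation here (random orthogonal
# matrix via QR) is an illustrative assumption.

rng = np.random.default_rng(0)
d = 256
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix

x = rng.standard_normal(d)
x[0] = 50.0                 # plant one extreme outlier
rotated = Q @ x             # rotation preserves the vector's norm

print(f"before rotation: max |x| = {np.abs(x).max():.1f}")
print(f"after rotation : max |x| = {np.abs(rotated).max():.1f}")
```

After rotation the largest magnitude drops sharply while the vector's total energy is unchanged, which is exactly what makes the whole dataset compress more efficiently.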
In stage two, it compresses the errors produced in stage one once more. This process applies the QJL (Quantized Johnson-Lindenstrauss) technique, which represents data using only the two values -1 and 1. It lowers the burden of complex computation while maintaining model performance.
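The stage-two idea, storing the leftover error using only the values -1 and 1, can be sketched as sign-based (one-bit) compression of the residual: keep each element's sign plus one shared magnitude. This is only an illustration of the -1/1 representation the article describes, not the exact QJL construction.

```python
import numpy as np

# Sketch of the stage-two idea: the residual from stage one is stored
# using only -1 and +1 (one bit per element) plus one shared scale.
# Illustration of sign-based compression, not the exact QJL scheme.

rng = np.random.default_rng(1)
residual = 0.1 * rng.standard_normal(512)   # small stage-one error

signs = np.sign(residual).astype(np.int8)   # only -1 / +1 survive
scale = np.abs(residual).mean()             # one shared magnitude
approx = signs * scale                      # reconstructed correction

mse_with = np.mean((residual - approx)**2)
mse_without = np.mean(residual**2)
print(f"residual MSE: {mse_without:.6f} -> {mse_with:.6f}")
```

Applying the one-bit correction leaves less error than discarding the residual entirely, while each stored element costs a single bit, which is how the heavy computation and storage burden stays low.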
Han said the technology could also have a positive medium- to long-term impact on the memory semiconductor market. In the short term, the memory needed to run the same AI model will decrease, which might make demand growth appear to slow temporarily. But in the long term, as AI becomes cheaper and easier to use, it could help expand the overall market. As AI becomes widespread, demand for semiconductors is likely to grow not just in quantity but for more efficient and advanced products.
Han said, "This study presents a new direction that effectively reduces bottlenecks from rising AI memory usage while maintaining accuracy," adding, "We expect it to serve as a core foundational technology for operating large-scale AI models more efficiently."
Meanwhile, the PolarQuant study is scheduled to be presented at the international AI and statistics conference AISTATS (Artificial Intelligence and Statistics) in May.