ETRI researchers analyze AI training performance in a memory pool environment based on OmniXtend. /Courtesy of ETRI

There is a recurring challenge when training giant artificial intelligence (AI) models. As models grow, the data that the graphics processing unit (GPU) must handle surges, quickly filling GPU memory and often causing training to slow down or even halt with "out-of-memory" errors. A South Korean research team says it has developed a technology that addresses this problem at a structural level.

The Electronics and Telecommunications Research Institute (ETRI) announced on the 8th that it has developed a memory technology, OmniXtend, that eases the GPU memory limits and data bottlenecks cited as the biggest obstacles in training environments for giant AI models.

When training a giant AI model, not only the model parameters but also the intermediate values generated during training and the optimizer state must all reside in memory. If GPU memory runs short, data has to be shuttled back and forth repeatedly, and the resulting delays slow training. In the end, to train a larger model, organizations have often had to add expensive hardware or split the workload across multiple GPUs while accepting the communication overhead.
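
To see how quickly those components add up, the rough sketch below estimates the training footprint of a hypothetical 7-billion-parameter model trained in mixed precision with an Adam-style optimizer. The model size, byte counts per parameter, and activation estimate are illustrative assumptions, not figures from ETRI.

```python
# Back-of-the-envelope estimate of training-time memory for a large model.
# Every number here is an illustrative assumption (hypothetical 7B-parameter
# model, mixed-precision training, Adam-style optimizer), not an ETRI figure.

def training_memory_gib(num_params: float,
                        weight_bytes: int = 2,        # fp16/bf16 weights
                        grad_bytes: int = 2,          # fp16/bf16 gradients
                        optimizer_bytes: int = 12,    # fp32 master weights + two Adam moments
                        activations_gib: float = 20.0) -> float:
    """Approximate total footprint in GiB for one full copy of the training state."""
    state_bytes = num_params * (weight_bytes + grad_bytes + optimizer_bytes)
    return state_bytes / 2**30 + activations_gib

if __name__ == "__main__":
    total = training_memory_gib(7e9)
    # Roughly 124 GiB: more than a single 80 GiB GPU can hold on its own.
    print(f"~{total:.0f} GiB of memory needed to train the model")
```

Even this simple estimate lands well above the capacity of a single high-end GPU, which is why training is normally split across devices or spilled to slower memory, with the delays described above.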

ETRI's OmniXtend uses Ethernet, a standard networking technology, to ease these constraints. The key is a structure that lets the separate memories of multiple servers and accelerators (GPUs) be shared over the network as if they were a single large-capacity memory.

Simply put, it pools memory that had been split across devices into one large memory pool. Once this structure is in place, memory needed for AI training can be secured more flexibly as circumstances demand, and the training scale is less likely to be hamstrung by the memory limits of a particular device.
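
As a purely conceptual illustration of that pooling idea, the toy sketch below presents memory contributed by several machines as a single pool and places an allocation across whichever nodes have spare capacity. The `MemoryPool` class, node names, and sizes are invented for the example and do not represent ETRI's OmniXtend implementation, which provides the sharing over standard Ethernet rather than in application code.

```python
# Toy model of a network-wide memory pool: each node contributes local memory,
# and the allocator hands out space from the combined total. This is a
# conceptual sketch only, not ETRI's OmniXtend implementation.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    capacity_gib: float
    used_gib: float = 0.0

class MemoryPool:
    def __init__(self, nodes: list[Node]):
        self.nodes = nodes

    @property
    def free_gib(self) -> float:
        # The user sees one free-capacity number for the whole pool.
        return sum(n.capacity_gib - n.used_gib for n in self.nodes)

    def allocate(self, size_gib: float) -> list[tuple[str, float]]:
        """Spread a single allocation across nodes with spare capacity."""
        if size_gib > self.free_gib:
            raise MemoryError("memory pool exhausted")
        placement = []
        for node in self.nodes:
            if size_gib <= 0:
                break
            chunk = min(size_gib, node.capacity_gib - node.used_gib)
            if chunk > 0:
                node.used_gib += chunk
                placement.append((node.name, chunk))
                size_gib -= chunk
        return placement

pool = MemoryPool([Node("gpu-server-1", 80), Node("gpu-server-2", 80), Node("cpu-memory-box", 512)])
print(pool.allocate(200))  # e.g. [('gpu-server-1', 80.0), ('gpu-server-2', 80.0), ('cpu-memory-box', 40.0)]
print(f"{pool.free_gib:.0f} GiB still free across the pool")
```

The point of the sketch is only the abstraction: a training job asks for memory and does not need to know which physical machine supplies it.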

ETRI said OmniXtend can help speed up AI training by minimizing data-movement delays. It added that because the approach expands memory over the network rather than replacing whole machines as before, it is also expected to reduce data center construction and operating costs.

In particular, because Ethernet switches can group many physically distant devices into a single memory pool, the technology is seen as well suited to the scale-out that ultra-large AI environments require.

ETRI said it presented the technology at RISC-V Summit Europe 2025 in Paris in May of last year and again at RISC-V Summit North America 2025 in Santa Clara, where it drew strong interest.

Kim Kang-ho, head of ETRI's Extreme Performance Computing Research Division, said, "Through new project planning, we plan to significantly expand research on memory interconnect technology centered on neural processing units (NPUs) and other accelerators," adding, "We will continue to advance the technology and pursue international cooperation so that it can be applied to the next-generation systems of global AI and semiconductor companies."