Musk’s xAI builds Colossus fast, ignites AI data center arms race in Korea and US

xAI operates Colossus, the world's largest AI data center, in Memphis, Tennessee, U.S. / Courtesy of xAI website

The race for artificial intelligence (AI) dominance is shifting from the development of AI models such as GPT, Gemini and Claude to the physical infrastructure that runs them reliably. Even high-performance AI models cannot be put to use in the real world without enough graphics processing units (GPUs) to handle the computational demand and data centers that can bear the massive power load. As a result, competition to develop and build AI-specialized data centers that run GPUs at maximum density, the so-called "AI factories," is intensifying among U.S. big tech and neo-cloud companies.

According to industry sources on the 28th, AI data centers designed around the GPUs needed for training and inference of AI models have recently been gaining attention in the AI infrastructure market.

A representative case is Colossus, the world's largest AI data center, operated in Memphis, Tennessee, by xAI, the AI company led by Tesla Chief Executive Elon Musk. What sets Colossus apart is a data center architecture redesigned and optimized for AI training. While data centers run by existing hyperscalers (large-scale cloud companies) such as Google, Amazon Web Services (AWS) and Microsoft (MS) are general-purpose infrastructure for handling diverse workloads, Colossus is closer to a dedicated "AI factory." To that end, xAI designed Colossus' GPU density, cooling architecture, power supply structure and network configuration differently from those of general-purpose data centers.

In general, data centers take an average of two to three years from planning to completion, but xAI said it built an ultra-large cluster (a group of servers) equipped with 100,000 AI chips (Nvidia H100 GPUs) in 122 days, and then doubled the GPU count to around 200,000 within three months. Instead of building a new data center at a new site, xAI converted the former Electrolux plant in Memphis and applied modular design to dramatically speed up construction. The modular design packages core equipment such as GPU servers, racks, cooling facilities and network gear into standardized units that can be arranged like Lego blocks, which supports rapid expansion of the data center.

General-purpose data centers have mainly used air cooling to dissipate server heat, but Colossus uses liquid cooling, which circulates coolant to carry heat away from the GPUs. As the latest GPUs consume more power and generate more heat, liquid cooling, with its higher cooling efficiency, is emerging as an alternative to air cooling. Each liquid-cooled rack (a shelf that stacks servers and equipment in tiers) in Colossus houses eight GPU servers and uses a coolant distribution unit (CDU) placed at the bottom of the rack to cool the AI chips directly. This enables high-density power delivery of more than 100 kW per rack.

xAI also addressed the power supply issue by installing dozens of mobile gas turbines and Tesla's Megapack large-scale batteries at Colossus. In the United States, even after a data center site is secured, the wait for grid connection approval currently runs three to five years or more, so xAI secured flexibility by installing on-site generation facilities.

In the AI industry, Colossus' high-density GPU environment, high cooling and power efficiency, and modular design are seen as closer to the neo-cloud model than to that of existing hyperscalers. Major neo-clouds (AI-specialized infrastructure companies) such as CoreWeave, Crusoe and Nebius also tout stable operation of GPUs at maximum density, efficient liquid cooling and rapid build-out as strengths of their AI infrastructure.

Korean companies are also expanding their AI-specialized data center businesses in step with this trend. NHN Cloud recently launched NHN FactoryX, a service that provides the data centers, GPUs and AI software needed for AI development and operations all at once. The company emphasized that FactoryX helps companies make full use of the GPUs they have secured, without waste. Kim Dong-hun, CEO of NHN Cloud, said, "Only 7% of companies that have secured GPUs have a peak-time utilization rate of 85% or higher," and added, "How robustly and efficiently AI infrastructure is operated will determine success or failure in the AI market going forward."

Samsung SDS plans to invest 10 trillion won by 2031 in AI infrastructure, including AI data centers, and in related mergers and acquisitions (M&A). As part of that plan, the company is building a 60 MW AI data center in Gumi, North Gyeongsang Province. SK Telecom is also partnering with AWS and others to build a large-scale AI data center in the Ulsan area.

AI infrastructure company Elice is expanding its portable modular data center (PMDC) business, which can shorten the data center construction period from more than two years to three to four months. Kim Jae-won, CEO of Elice, said, "AI infrastructure competitiveness does not depend on how many GPUs are secured, but on how well they are used," and added, "No matter how good the GPUs are, if customized storage systems and other software and infrastructure are not in place, speed and performance will degrade."

※ This article has been translated by AI.
