Nvidia's latest graphics processing unit (GPU), "Blackwell," suffered heat-related malfunctions as it was rolled out to data centers, and it has only now come to light that the U.S. big-tech companies that are its largest buyers struggled with it throughout last year.
Major AI companies such as OpenAI and Meta spent all of last year grappling with technical hurdles as they built and optimized Blackwell-based AI servers, the U.S. information technology (IT) outlet The Information reported on the 6th (local time), citing internal sources.
Earlier generations of Nvidia GPUs could be installed and put into operation within weeks of delivery, but Blackwell reportedly ran into problems at many sites because of the complexity of linking large numbers of chips so that they function as one massive system. Heat, often called the "biggest enemy" of semiconductors, is one of the main causes of system malfunctions and data loss.
According to The Information, if even one of the chips linked together in a data center malfunctioned, entire clusters made up of thousands of chips failed or shut down. After such failures, companies had to spend anywhere from thousands to millions of dollars just to restart the interrupted jobs from the last save point.
Oracle, which builds AI data centers, had to absorb a loss of about $100 million (about 140 billion won) because of technical difficulties in building with Blackwell chips. The loss arose because its client OpenAI for a time delayed approval of the Blackwell servers at a Texas data center. In response, Nvidia reportedly tried last year to placate dissatisfied clients by offering partial refunds or discounts.
The problems only began to be resolved after a new version, "GB300," was released in the third quarter of last year with related improvements. Clients such as OpenAI are swapping their not-yet-delivered orders of the earlier chips for the new version, sources said. Nvidia has also applied these improvements to its upcoming "Vera Rubin" chip.