Nvidia said on the 20th that it is pushing to popularize "tokenomics (the token economy)" by cutting AI inference costs by as much as tenfold with its next-generation Blackwell GPU platform.

Nvidia CEO Jensen Huang unveils the GeForce RTX 50 series graphics cards, built on the latest Blackwell AI architecture, during his keynote at the Mandalay Bay Convention Center in Las Vegas, Nevada, on the 6th (local time), a day before the opening of CES 2025./Courtesy of News1

According to Nvidia, major inference service providers such as Baseten and Together AI cut per-token costs by up to 90% compared with the previous Hopper generation after adopting Blackwell. The company attributed this to the co-design of the advanced hardware architecture with an optimized software stack, including TensorRT-LLM.
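The article does not detail that stack, but as a rough illustration, below is a minimal sketch of serving a model through TensorRT-LLM's high-level Python LLM API; the model checkpoint and sampling settings are placeholder assumptions, not details from the article.

```python
# Minimal sketch: generating tokens with TensorRT-LLM's high-level LLM API.
# The checkpoint and sampling settings below are illustrative assumptions;
# real deployments tune engine-build and quantization options per GPU.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model, not from the article

prompts = ["Explain in one sentence what a token is."]
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

for output in llm.generate(prompts, params):
    # Each result carries one or more completions; print the first.
    print(output.outputs[0].text)
```

Per-token cost in such a stack falls as throughput rises, since the same GPU-hour is amortized over more generated tokens.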

By industry, medical AI company Sully.ai deployed an open-source model on Blackwell to cut inference costs tenfold compared with closed models while improving response times by 65%. In gaming, Latitude used Blackwell's NVFP4 low-precision format to cut per-token costs fourfold, and customer service company Decagon reduced the cost of voice AI interactions sixfold while keeping response times under 400 ms.
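NVFP4 is Blackwell's 4-bit floating-point format; one common route to it is post-training quantization with NVIDIA's TensorRT Model Optimizer. The sketch below follows that library's published pattern, but the model, calibration prompt, and config name should all be read as assumptions that may differ by version.

```python
# Sketch: post-training quantization to NVFP4 with NVIDIA TensorRT Model Optimizer.
# The model choice, calibration data, and NVFP4_DEFAULT_CFG follow modelopt's
# documented pattern but are assumptions here, not details from the article.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B-Instruct"  # example model
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(name)

def forward_loop(m):
    # Calibration pass: run representative prompts so the quantizer can
    # collect activation statistics before choosing NVFP4 scales.
    for prompt in ["Summarize the benefits of low-precision inference."]:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        m(**inputs)

# Replace weights/activations with 4-bit NVFP4 representations.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=forward_loop)
```

Moving from 16- or 8-bit weights to 4-bit NVFP4 shrinks memory traffic accordingly, which is where much of the per-token saving comes from.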

Nvidia said this downward cost trend should accelerate with the next-generation Rubin platform. Targeting a 10-fold performance gain and a further 10-fold cost reduction over Blackwell, Rubin is expected to lower the barrier for companies scaling up AI services.

An Nvidia official said, "Through improvements in infrastructure and algorithmic efficiency, the inference cost of state-of-the-art AI is falling by as much as 10 times a year," adding, "Blackwell will be the core infrastructure that lets companies deploy intelligent agents economically across every industry."
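To make the scale of that claim concrete, the toy calculation below compounds a tenfold annual decline from a purely hypothetical starting price (the article gives no figure).

```python
# Toy compounding of a "10x per year" decline in inference cost.
# The $10-per-million-tokens starting price is hypothetical, for illustration only.
cost = 10.0  # USD per million tokens, assumed year-0 price
for year in range(4):
    print(f"year {year}: ${cost:,.4f} per million tokens")
    cost /= 10  # one order of magnitude cheaper each year, per the quote
# Prints $10.0000, $1.0000, $0.1000, $0.0100 for years 0-3.
```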
