Nvidia said on the 20th that it is moving to popularize "tokenomics" (the token economy) by cutting AI inference costs to as little as one-tenth with its next-generation Blackwell GPU platform.
According to Nvidia, major inference service providers such as Baseten and Together AI cut per-token costs by up to 90% compared with the previous Hopper generation after adopting Blackwell. The company attributed the gains to co-designing the hardware architecture with an optimized software stack that includes TensorRT-LLM.
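The arithmetic behind such claims is simple: per-token cost is the hourly price of the hardware divided by its token throughput, so a platform that serves far more tokens per dollar-hour drives the per-token figure down. The sketch below illustrates this relationship; the prices and throughputs are illustrative assumptions for this example, not Nvidia's or the providers' published numbers.

```python
# Illustrative sketch (hypothetical numbers): per-token cost falls when the
# same dollar-per-hour GPU serves more tokens per second.

def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Serving cost in USD per 1M tokens at a given GPU throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# Assumption for illustration: the newer node costs ~2x per hour
# but delivers ~20x the inference throughput.
hopper = cost_per_million_tokens(gpu_hourly_usd=4.0, tokens_per_second=1_000)
blackwell = cost_per_million_tokens(gpu_hourly_usd=8.0, tokens_per_second=20_000)

print(f"Hopper:    ${hopper:.3f} per 1M tokens")
print(f"Blackwell: ${blackwell:.3f} per 1M tokens")
print(f"Reduction: {(1 - blackwell / hopper):.0%}")  # -> 90%
```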
By industry, medical AI company Sully.ai deployed an open-source model on Blackwell, cutting inference costs to one-tenth those of closed models and improving response times by 65%. In gaming, Latitude used Blackwell's NVFP4 low-precision format to cut per-token costs to a quarter, while customer-service company Decagon reduced voice-AI interaction costs by a factor of six while keeping response times under 400 ms.
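NVFP4 lowers cost by storing values as 4-bit floats (E2M1) with a shared scale per small block, so more weights and activations fit in the same memory and bandwidth per token. Below is a minimal Python sketch of that block-scaling idea; the block size, plain-float scale, and rounding rule are deliberate simplifications of the hardware format (which stores an FP8 scale per block), not Nvidia's implementation.

```python
# Minimal sketch of block-scaled 4-bit quantization in the spirit of NVFP4.
# Simplified: real NVFP4 uses 16-element blocks with an FP8 (E4M3) scale
# and runs in hardware; this is a plain-Python illustration only.

E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable FP4 magnitudes

def quantize_block(block: list[float]) -> tuple[float, list[float]]:
    """Pick a scale so the largest value maps to 6.0 (the max E2M1 magnitude),
    then snap each scaled value to the nearest representable magnitude."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0
    quantized = []
    for x in block:
        mag = min(E2M1_VALUES, key=lambda v: abs(v - abs(x) / scale))
        quantized.append(mag if x >= 0 else -mag)
    return scale, quantized

def dequantize_block(scale: float, quantized: list[float]) -> list[float]:
    return [scale * q for q in quantized]

weights = [0.12, -0.03, 0.47, -0.25, 0.08, 0.33, -0.41, 0.02]
scale, q = quantize_block(weights)
print([round(w, 3) for w in dequantize_block(scale, q)])
```

The shared per-block scale is what keeps accuracy acceptable at such low precision: outliers only distort the handful of values that share their block rather than the whole tensor.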
Nvidia said this downward cost trend is expected to accelerate with the next-generation Rubin platform. Targeting a 10-fold performance gain and a further 10-fold cost reduction over Blackwell, Rubin is expected to lower the barrier for companies to scale AI services.
An Nvidia official said, "Through improvements in infrastructure and algorithmic efficiency, the inference cost of state-of-the-art AI is falling by as much as 10 times a year," adding, "Blackwell will be the core infrastructure that lets companies deploy intelligent agents economically across all industries."