As OpenAI delays the release of its next-generation generative artificial intelligence (AI) model, 'GPT-5', Elon Musk's xAI and Google are leading the performance race with 'Grok 4' and 'Gemini 2.5', respectively. With OpenAI still relying on GPT-4o, o3, and o4-mini as its flagship models, GPT-5 now looks less like a model that will 'open a new era' and more like one that must 'close the gap'.
◇ OpenAI falls behind competitors in benchmarks and performance metrics
According to industry sources on the 28th, GPT-5 had been expected in the first half of the year, as OpenAI CEO Sam Altman indicated in February that it would launch within a few months, but no specific schedule has been announced. In the middle of last month, Altman said on a podcast that a summer release was possible, suggesting that while the timeline has slipped, a release remains on the table.
In contrast, competitors are expanding their market share with a stream of new model releases, while OpenAI continues to respond with its GPT-4-series and o-series models. In the latest Intelligence Index from the AI benchmarking firm Artificial Analysis, xAI's 'Grok 4' topped the chart with a score of 73. OpenAI's 'o3 Pro' followed at 71, with Google's 'Gemini 2.5 Pro' close behind at 70. However, o3 Pro shows limitations in multimodal and real-time performance compared with its text-based strengths. OpenAI's latest multimodal model, GPT-4o (May version), came in fifth with a score of 68, while previous-generation models such as GPT-4.1 (53) and an earlier version of GPT-4o (41) have fallen into the lower mid-range.
The same pattern holds for the speed metric, 'output tokens per second'. Google's Gemini 2.5 Flash took a commanding first place at 352 tokens per second, followed by Grok 4 at 202 and Chinese developer MiniMax's reasoning mini model at 161. By comparison, GPT-4o managed only 130 and GPT-4.1 just 118, a noticeable gap. This metric bears directly on chatbot responsiveness and traffic-handling capacity, affecting not only the perceived response speed in user conversations but also API cost and throughput, all of which are critical considerations for companies choosing a fast, efficient AI service.
In detailed AI performance metrics, OpenAI trails its competitors in several categories. On MMLU (Massive Multitask Language Understanding), which covers college-level general-knowledge questions, o3 Pro took first place with 88%. On GPQA (Graduate-Level Google-Proof Q&A), which evaluates scientific concept reasoning, Grok 4 scored 88%, outscoring OpenAI's o3 Pro (85%) and o4-mini (83%). On LiveCodeBench, which assesses real-time coding problem-solving ability, Grok 4 scored 82%, ahead of both o3 Pro (80%) and o4-mini (71%). And on 'Humanity's Last Exam', which measures extremely difficult comprehensive reasoning, Grok 4 scored 23.9%, well above o3 Pro (17.5%).
◇ Google and xAI accelerating expansion into video AI and mobile
Google is also showing strength in the video AI sector. It recently added an image-to-video feature to its video generation model, 'Veo 3'. Trained on YouTube data, the model can automatically generate up to 8 seconds of 720p video from a single photo and a text description, with support for sound insertion and animation effects.
Among general users, Veo is regarded as the most intuitive, highest-quality video generation tool in both generation speed and output quality. Google noted that 'more than 40 million videos have been produced through Veo 3, and the service has expanded to 159 countries.' The model is currently integrated into the Gemini app, letting general users easily create generative video.
The technology gap is showing up in market responses as well. As of the 17th of this month, Grok 4 ranked first in downloads on Japan's Apple App Store, pushing ChatGPT down to second. Japan is the Asian market where OpenAI has invested the most effort, and the company recently opened its first Asian office there. Conceding the top spot to Grok in that market is a painful blow for OpenAI.
Gemini's expansion is also noticeable in Korea. According to Mobile Index, new installations of Gemini in the country last month totaled 338,957, a roughly fivefold increase from April (69,132). While it has yet to overtake ChatGPT in the monthly active users (MAU) ranking of AI apps, it entered the domestic top 10 for the first time, a sign of significant growth.
Still, some experts caution against judging a model's overall capability on benchmark scores alone. Professor Jae-woo Kang of Korea University said, 'Later models can be fine-tuned against existing benchmarks, which gives them an inherent advantage in scores. So it would not be accurate to say that benchmarks fully represent performance in real-world, general-use environments.'
He added, 'The repeated delays to GPT-5 partly reflect the fact that LLMs (large language models) are already highly advanced, so an improvement that overwhelmingly surpasses them will inevitably take time,' suggesting that 'for now, we appear to be in a phase of gradual improvement without a new technological breakthrough.'