OpenAI unveiled its latest artificial intelligence (AI) model, 'GPT-4.5', on the 27th (local time), but controversy over its performance and pricing is spreading. Benchmark results indicate that in some areas it trails competitors or shows no significant advantage over them, raising concerns about OpenAI's competitiveness. Companies such as Anthropic, xAI, and DeepSeek are rapidly catching up to OpenAI's performance, changing the dynamics of the market.
On the 28th, industry sources reported that although OpenAI claimed GPT-4.5's hallucination rate had decreased compared to earlier models, benchmark test results did not show a clear advantage over competing models in its ability to provide information.
OpenAI announced that GPT-4.5 is "the largest and most powerful conversational AI model" it has released to date. It emphasized that the model's emotional intelligence (EQ) has been enhanced, allowing for more natural conversations with people, and that its ability to recognize patterns and find correlations has improved. It also noted that the model hallucinates less often.
However, the actual benchmark results differ from OpenAI's announcement. In benchmark tests, GPT-4.5 scored 65% in the Agentic Coding Evaluation, falling behind Anthropic's Claude Sonnet 3.7, which scored 67%. It led Anthropic's earlier model, Sonnet 3.5 (new), by only 3 percentage points.
In the GPQA, AIME, and LCB benchmarks, which assess AI's science, math, and coding capabilities respectively, the size of the gap varied by category. In science (GPQA), GPT-4.5 scored 71.4%, similar to Grok3's 75%, but in math (AIME 24), it recorded 36.7%, well below Grok3's 52%. In the coding benchmark (LCB), it scored 41%, lower than both Grok3 (57%) and Sonnet 3.7 (57%).
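The gaps described above can be worked out with a few lines of arithmetic. The following is a minimal Python sketch, not an official scoring method: it uses only the scores quoted in this article, treats Grok3's AIME 24 figure as 52%, and the names `scores` and `gaps_vs_gpt45` are hypothetical labels chosen for illustration.

```python
# Scores (in percent) exactly as quoted in the article; nothing else is assumed.
scores = {
    "GPQA (science)": {"GPT-4.5": 71.4, "Grok3": 75.0},
    "AIME 24 (math)": {"GPT-4.5": 36.7, "Grok3": 52.0},
    "LCB (coding)": {"GPT-4.5": 41.0, "Grok3": 57.0, "Sonnet 3.7": 57.0},
}


def gaps_vs_gpt45(results):
    """Return percentage-point gaps (competitor minus GPT-4.5) per benchmark."""
    gaps = {}
    for bench, by_model in results.items():
        base = by_model["GPT-4.5"]
        gaps[bench] = {
            model: round(score - base, 1)
            for model, score in by_model.items()
            if model != "GPT-4.5"
        }
    return gaps


gaps = gaps_vs_gpt45(scores)
for bench, by_model in gaps.items():
    for model, gap in by_model.items():
        print(f"{bench}: {model} leads GPT-4.5 by {gap} points")
```

On these figures, the science gap is a few points, while the math and coding gaps are roughly 15 to 16 points, which is why the article describes the former as "similar" and the latter as a clear deficit.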
GPT-4.5 also found itself at the center of controversy in the ARC-AGI benchmark, which evaluates AI's generalization capabilities. Its high cost relative to its performance, compared with competing models, has been flagged as a significant issue. According to the benchmark results, GPT-4.5's performance is on par with or lower than that of Anthropic's Claude Sonnet 3.7 (Thinking 8K), o3-mini (low), and R1, yet the cost of API (Application Programming Interface) access is more than 10 times higher. This has led to growing controversy among users over a subscription price they consider excessive relative to the performance improvement.
Industry experts say that "it may have reached the limits of relying solely on the 'scaling law' of simply increasing model size (the number of parameters) and computing power to enhance AI performance."
OpenAI led the AI market throughout the previous year by maintaining a performance advantage in general-purpose large language models (LLMs). However, recent trends in the AI industry indicate that reasoning models and 'agent AI' are emerging as next-generation technologies. Agent AI autonomously performs tasks based on user objectives, while reasoning models are optimized for complex logical reasoning and problem-solving.
Kang Jae-woo, a professor at Korea University, noted that "just as the reasoning model o1 was derived from GPT-4, there is a possibility that a new reasoning model based on GPT-4.5 will emerge," adding that "OpenAI has indicated plans to integrate general models and reasoning models starting with GPT-5, suggesting that in the future, AI will evolve to exhibit reasoning abilities for specific tasks while operating efficiently for general tasks."
He further stated, "As technology advances, the distinction between general models and reasoning models may no longer be necessary, and competitors are also likely to move in the same direction."