OpenAI’s Brown urges Korea to revamp AI evaluations for inference era

Noam Brown, vice president of research at OpenAI, delivers a keynote at the Global AI Frontier Symposium 2026 at the Westin Seoul Parnas in Gangnam-gu, Seoul, on the 3rd. /Courtesy of Lee Jae-eun

"The performance of cutting-edge artificial intelligence (AI) models tends to improve as more compute resources and time are devoted to inference."

Noam Brown, vice president of research at OpenAI, argued that current methods of evaluating AI models do not properly reflect the models' actual performance and safety. As the AI industry's center of gravity shifts from training to inference, major safety evaluations and AI benchmarks do not measure how much time, money, and how many tokens a model spends during inference, and so cannot gauge the capabilities and risks of state-of-the-art models.

In a keynote speech at the Global AI Frontier Symposium 2026, held at the Westin Seoul Parnas in Gangnam-gu on the 3rd, Brown said, "We need to change how we evaluate AI models to fit an era where inference carries more weight." He noted that the latest AI models get better at solving problems as they are given more "test-time compute," the compute resources and time spent on reasoning (inference) before answering a user's question, but that existing benchmarks and safety evaluations are failing to keep up with this trend.

He said, "When OpenAI released the GPT-5.5 model in April, its performance was clearly better than the previous model, but it didn't appear to have advanced much by major benchmark standards. Users only felt how much it had improved after trying the model on various tasks, because cutting-edge models perform better as more time and compute are spent on inference."

In a cybersecurity evaluation conducted by the U.K. AI Safety Institute, cutting-edge models such as OpenAI's GPT-5.5 and Anthropic's "Mythos" kept improving all the way up to 100 million output tokens (a token is the basic unit an AI model uses to process information and generate answers). Even then, the two models had not hit their performance limits; the evaluation stopped at 100 million tokens only because the AI Safety Institute halted the experiment due to limited budget and infrastructure.

Brown explained, "In the case of GPT-4, performance plateaued at a certain level no matter how much compute was added, but the latest models plateau much farther out, making it difficult to grasp their true capabilities with existing evaluation methods."

He went on to argue that future AI model evaluations should reflect performance changes based on inference expense and time, and the number of tokens the model generates to produce an answer. Brown emphasized, "Through this, we need to predict a model's latent capabilities in high-cost environments." He assessed that it has become important to establish an evaluation framework that can gauge the risks and performance of such models in preparation for an era when numerous AI agents collaborate over long periods.
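As a rough illustration of what such an evaluation could look like, the Python sketch below scores a model at several inference token budgets and checks whether the most recent budget increase still produced a meaningful gain. It is a minimal sketch under stated assumptions, not a method Brown or OpenAI described: the `BudgetResult` type, the `marginal_gains` and `has_plateaued` helpers, the 0.01 threshold, and all of the numbers are hypothetical and made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class BudgetResult:
    token_budget: int  # hypothetical cap on tokens the model may spend on inference
    score: float       # hypothetical benchmark score at that budget (0.0 to 1.0)


def marginal_gains(results):
    """Return (budget, score gain over the previous budget), sorted by budget."""
    ordered = sorted(results, key=lambda r: r.token_budget)
    return [(b.token_budget, b.score - a.score) for a, b in zip(ordered, ordered[1:])]


def has_plateaued(results, min_gain=0.01):
    """True if the last budget increase improved the score by less than min_gain."""
    gains = marginal_gains(results)
    if not gains:
        raise ValueError("need results for at least two token budgets")
    return gains[-1][1] < min_gain


# Made-up numbers: one model that flattens early, one that keeps improving.
early_plateau = [
    BudgetResult(10_000, 0.30),
    BudgetResult(100_000, 0.35),
    BudgetResult(1_000_000, 0.352),
]
still_improving = [
    BudgetResult(10_000, 0.30),
    BudgetResult(100_000, 0.45),
    BudgetResult(1_000_000, 0.60),
]

print(has_plateaued(early_plateau))    # True
print(has_plateaued(still_improving))  # False
```

In this framing, a model whose curve is still rising at the largest budget tested has not yet shown its ceiling, so a single score at a fixed budget understates what it can do at higher cost.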

Brown noted, "The pace of AI development is so fast that the cycle for releasing new models has shortened to two to three months," adding, "The problem is that with the current evaluation framework, there is a high possibility we will fail to determine when the capabilities of current AI models will hit a limit before the next model arrives."

Asked about the tasks OpenAI is currently focusing on, he answered, "Opening an era of AI agents that operate over long periods and collaborate with one another." Brown said, "Technologies such as space exploration or AI did not become possible because humans became biologically much smarter over the past 10,000 years, but as a result of billions of people passing down knowledge over thousands of years, collaborating, and accumulating new knowledge," adding, "In the next few years, I expect an era in which billions of AI agents share each other's knowledge and learn from one another's results, complementing human expertise and solving humanity's major challenges."

The Global AI Frontier Symposium 2026, co-hosted that day by the Ministry of Science and ICT and the Institute of Information & communications Technology Planning & Evaluation (IITP), drew a large turnout from AI industry, academia, and research institutions in Korea and abroad. Leslie Pack Kaelbling, Panasonic Professor at MIT, gave a keynote on "rational robots," and Brown delivered his keynote on "implications of large-scale inference compute."

In the subsequent expert track, the speakers were Lim Woo-hyung, head of LG AI Research; Morita Jun, Asia head of Perplexity; and Kim Myeong-ju, head of the Artificial Intelligence Safety Research Institute. Representatives from major corporations such as POSCO, LG Electronics, OpenAI, Anthropic, and Perplexity also took part, as did overseas research institutions including France's Prairie Institute and Canada's Vector Institute.

※ This article has been translated by AI.
