A conceptual image of a simulation that improves the training efficiency of AI models and reduces their cost. / Courtesy of ChatGPT-4o

Professor Yoo Min-soo and his research team at the Korea Advanced Institute of Science and Technology (KAIST), together with Samsung Electronics and Samsung Research, have developed a simulation technology that improves the training efficiency of ultra-large artificial intelligence (AI) models and reduces their cost. The results were presented at the international conference MICRO in November 2024.

AI models such as ChatGPT and DeepSeek are trained on tens of thousands of data center GPUs. For GPT-4, the training cost is estimated at roughly 140 billion won. Because companies have found it difficult to identify optimal training strategies, they rely only on empirically validated methods, which leads to inefficient use of GPU resources and unnecessary additional cost.

vTrain, the technology developed by the research team, accurately predicts the time required to train large AI models and helps identify the most efficient training strategies. The researchers trained various AI models in multi-GPU environments and compared vTrain's predicted training times with the actual times measured. The predictions deviated from the measured times by an average of 8.37% on a single node and 14.73% across multiple nodes, confirming that vTrain can predict training times with relatively high accuracy.
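For readers unfamiliar with how such accuracy figures are reported, the sketch below shows one common way to compute the average percentage gap between predicted and measured training times (mean absolute percentage error). The numbers are made-up placeholders for illustration only, not the actual vTrain measurements, and the code is not part of the released framework.

```python
# Illustrative sketch: average percentage gap between predicted and
# measured training times. Values below are hypothetical, not vTrain data.

predicted_hours = [12.0, 30.5, 48.2]   # hypothetical simulator predictions
measured_hours = [11.1, 33.0, 51.0]    # hypothetical measured wall-clock times


def mape(predicted, measured):
    """Mean absolute percentage error between predictions and measurements."""
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return 100.0 * sum(errors) / len(errors)


print(f"Average prediction error: {mape(predicted_hours, measured_hours):.2f}%")
```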

In addition, through collaborative research with Samsung Electronics, the team has released the vTrain framework and more than 1,500 actual training time measurements as open source. This is expected to help AI researchers and companies establish optimal training strategies and use GPU resources more efficiently.

Professor Yoo Min-soo said, "vTrain enables more precise analysis than existing empirical methods, increasing GPU utilization and lowering training costs." He explained that this will allow companies to reduce the training costs of ultra-large AI models more efficiently.

References

MICRO (2024), DOI: https://doi.org/10.1109/MICRO61859.2024.00021