A South Korean research team has taken the training of multimodal artificial intelligence (AI) a step forward. By guiding AI to interpret diverse inputs such as text, images, and audio in a balanced way, the technique is expected to help AI understand varied information as harmoniously as a human does.
Hwang Ui-jong, a professor in the School of Electrical Engineering at the Korea Advanced Institute of Science and Technology (KAIST), said on the 14th that his team has developed a new data augmentation technique that helps multimodal AI, which must process several types of data at once, use all of them evenly during training.
Multimodal AI refers to AI that makes judgments from multiple types of data at once, such as text and video. However, just as a person's eye usually goes to the picture first when an image and text appear together, multimodal AI has tended to lean too heavily on certain types of data.
To address this, the researchers deliberately mixed mismatched data into training. The AI learned to use text, images, and audio in a balanced way rather than relying on any single type of data. They also achieved stable performance gains by supplementing low-quality data and giving more difficult examples greater emphasis during training.
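The article describes the core idea only at a high level, so the sketch below is a minimal, hypothetical illustration of one way deliberate mismatching could work: with some probability, one modality is shuffled within a batch so the model cannot solve the task by leaning on a single input type. The toy fusion model, the feature sizes, and the mix_mismatched helper are illustrative assumptions, not the KAIST team's actual method, and the handling of low-quality or harder examples is not shown.

```python
# Minimal sketch (not the authors' released code) of training on deliberately
# mismatched multimodal pairs. All names and dimensions are illustrative.
import random
import torch
import torch.nn as nn

def mix_mismatched(batch, p=0.3):
    """With probability p, permute one modality across the batch to create mismatched pairs."""
    text, image, audio, label = batch
    if random.random() < p:
        modality = random.choice(["text", "image", "audio"])
        perm = torch.randperm(text.size(0))
        if modality == "text":
            text = text[perm]
        elif modality == "image":
            image = image[perm]
        else:
            audio = audio[perm]
    return text, image, audio, label

class ToyMultimodalNet(nn.Module):
    """Tiny fusion model: one encoder per modality, features summed before the classifier."""
    def __init__(self, d_text=32, d_image=64, d_audio=16, hidden=32, n_classes=5):
        super().__init__()
        self.enc_text = nn.Linear(d_text, hidden)
        self.enc_image = nn.Linear(d_image, hidden)
        self.enc_audio = nn.Linear(d_audio, hidden)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, text, image, audio):
        fused = self.enc_text(text) + self.enc_image(image) + self.enc_audio(audio)
        return self.head(torch.relu(fused))

if __name__ == "__main__":
    model = ToyMultimodalNet()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for step in range(100):
        # Random toy batch standing in for real text/image/audio features and labels.
        batch = (torch.randn(8, 32), torch.randn(8, 64), torch.randn(8, 16),
                 torch.randint(0, 5, (8,)))
        text, image, audio, label = mix_mismatched(batch, p=0.3)
        loss = loss_fn(model(text, image, audio), label)
        opt.zero_grad()
        loss.backward()
        opt.step()
```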
Professor Hwang said, "To improve AI performance, how and what data you train on matters far more than simply changing the model architecture (algorithm)," adding, "This study shows that designing and processing the data itself can be an effective way to get multimodal AI to use information in a balanced way instead of leaning on specific data."
The study is scheduled to be presented in December at NeurIPS (Conference on Neural Information Processing Systems), an international AI conference to be held in San Diego, United States, and Mexico City, Mexico.
References
arXiv (2025), DOI: https://doi.org/10.48550/arXiv.2509.25831