Samsung Electronics announced on the 25th that it has unveiled TRUEBench, an AI work-productivity benchmark developed in-house. TRUEBench was built by Samsung Research, the advanced R&D organization of the DX Division at Samsung Electronics, drawing on its experience applying generative AI models inside the company, and it evaluates AI models' work-productivity performance.
Samsung Electronics noted that while many corporations are introducing AI across their operations, existing benchmarks make it difficult to accurately measure an AI model's work productivity performance. In fact, most AI benchmarks publicly available on the market focus on English and evaluate conversations not as ongoing exchanges but in a single turn or a limited number of turns.
The newly unveiled TRUEBench differs from existing benchmarks in that it focuses its evaluation squarely on work productivity. The evaluation set consists of 10 categories, 46 tasks, and 2,485 granular items.
The evaluation items were built from checklists used in real office work that corporations frequently rely on, such as content creation, data analysis, document summarization and translation, and ongoing dialogue. TRUEBench uses a total of 2,485 evaluation criteria to broadly assess real work situations, from short user prompts to summaries of long documents of up to 20,000 characters.
The presentation of results is also differentiated from existing benchmarks. Users can select up to five models at once to compare them, allowing them to see the performance of various AI models at a glance. It also discloses metrics such as the average length of responses, enabling simultaneous comparison of performance and efficiency indicators.
In addition to the overall evaluation score, it discloses scores for detailed items across 10 categories, allowing for more granular results than existing benchmarks.
TRUEBench supports 12 languages in total, including English, Korean, Japanese, Chinese, and Spanish. In particular, considering the global business environment, it can also evaluate translation functions for cross-lingual content where multiple languages, such as English and Korean, are mixed.
Samsung Electronics has published data samples from TRUEBench and a leaderboard showing evaluation results for AI models on the global open-source platform Hugging Face.
Evaluating AI model performance requires clear criteria not only for generating answers but also for judging whether those answers are correct. TRUEBench is designed to evaluate not only the accuracy of answers but also whether they reflect users' implicit intent and context that are not stated explicitly.
AI is also used to validate the evaluation items themselves. AI reviews the human-built evaluation criteria for errors, contradictions, and unnecessary constraints, and through repeated cross-verification the standards are progressively refined. Automatic evaluation of AI models against these standards minimizes subjective bias and yields consistent results.
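Samsung has not published TRUEBench's scoring internals, but the checklist-based automatic evaluation described above can be pictured with a minimal sketch: each criterion is a predicate over a model's response, and the score is the fraction of criteria satisfied. The function name, criteria, and sample response below are all hypothetical illustrations, not part of TRUEBench.

```python
# Illustrative sketch only: TRUEBench's actual scoring code is not public.
# Each criterion is a predicate over the response text; the task score is
# the fraction of checklist criteria the response satisfies.
from typing import Callable, List

Criterion = Callable[[str], bool]

def score_response(response: str, criteria: List[Criterion]) -> float:
    """Return the fraction of checklist criteria the response satisfies."""
    if not criteria:
        return 0.0
    passed = sum(1 for check in criteria if check(response))
    return passed / len(criteria)

# Hypothetical criteria for a summarization task
criteria = [
    lambda r: len(r) <= 200,            # length constraint
    lambda r: "deadline" in r.lower(),  # must mention the deadline
    lambda r: r.strip().endswith("."),  # reads as a complete sentence
]

print(score_response("The report deadline is Friday.", criteria))  # → 1.0
```

Averaging such per-criterion checks, rather than asking a single yes/no question per task, is what allows consistent, repeatable scoring across thousands of granular items.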
Jeon Kyung-hoon, chief technology officer (CTO) of the DX Division and head of Samsung Research (president), said, "Samsung Research possesses differentiated productivity AI technological competitiveness and know-how based on a wide range of real-world applications," adding, "By releasing TRUEBench, we will establish productivity performance evaluation standards and further solidify Samsung Electronics' technological leadership."