Hancom releases its core PDF extraction technology as open source worldwide. /Courtesy of Hancom

HANCOM said on the 19th that it has released, as open source for developers worldwide, a core technology for easing the bottleneck in processing PDF document data, a known problem in artificial intelligence (AI) training.

The newly released "OpenDataLoader PDF" is a PDF data extraction engine built on HANCOM's document processing technology. It converts text, tables, images, and layout information into structured data (JSON, Markdown, HTML) for use in AI training.

PDF is a widely used document format worldwide, as underscored by Hugging Face's recent release of the FinePDFs dataset comprising 475 million PDF files, but its structural complexity has imposed many constraints on extracting training data. This project is the result of a business agreement that HANCOM and Dual Lab signed in July to resolve these issues, under which the two companies have been jointly developing an open-source PDF data loader.

OpenDataLoader PDF's performance has also been validated. In benchmark tests, it reached 85% of the performance of existing open-source tools on the Normalized Indel Distance (NID), a metric that evaluates how well a document's reading order is preserved. It also runs completely offline, without any network connection, which blocks the risk of data leakage in environments that handle sensitive data, such as financial and public institutions.
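The article does not define NID. As a rough illustration, the sketch below assumes the common definition: the number of single-character insertions and deletions needed to turn the extracted text into the reference text, divided by the combined length of the two strings. The function names are hypothetical, and the benchmark's actual scoring may differ (for example, it may report a similarity, 1 minus the distance, rather than a distance).

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_indel_distance(extracted: str, reference: str) -> float:
    """Assumed definition: (insertions + deletions) / (len(a) + len(b)).

    0.0 means the strings are identical; 1.0 means they share no characters.
    """
    total = len(extracted) + len(reference)
    if total == 0:
        return 0.0  # two empty strings are treated as identical
    indel = total - 2 * lcs_length(extracted, reference)
    return indel / total
```

Under this definition, text extracted in the wrong reading order is penalized even when every character is present: swapping the two characters of "ab" to get "ba" gives a distance of 0.5.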

In addition, to strengthen AI safety, which has recently emerged as a major challenge for the AI industry, the company plans to provide a feature that automatically detects and blocks security threats such as prompt injection. The goal is to improve the stability and reliability of AI training data and help build a safe training environment.

Going forward, HANCOM plans to strengthen integration with major AI services and frameworks such as ChatGPT, Gemini, and LangChain, and to expand collaboration with the GitHub-based global developer community to keep growing the open-source ecosystem.

Jung Ji-hwan, HANCOM chief technology officer (CTO), said, "In the era of AI transformation (AX), open source is an essential strategy for innovation and securing competitiveness across corporations and society," adding, "Through this release, we will work with global developers to advance PDF data extraction technology and continue to upgrade the project by adding AI-based document recognition technology by year-end."
