HANCOM said on the 17th that it has released, as a global open-source project, a core technology for resolving the bottleneck in processing PDF document data, a chronic problem in AI training and use. The newly released "OpenDataLoader PDF" is a PDF data extraction engine built on HANCOM's long-accumulated document processing expertise.
A HANCOM official said, "PDF is the most widely used document format for AI training worldwide, but extracting training data from it is not easy because of its complex internal structure," adding, "Because of this, the AI development process has faced major constraints, to the point that PDF has been called a 'data prison.'"
In response, HANCOM jointly developed an open-source PDF data loader with Dual Lab, a company specializing in PDF technology. The jointly developed OpenDataLoader PDF extracts text, tables, images, and layout information from PDF documents quickly and with high accuracy, and converts them into structured formats (JSON, Markdown, HTML) that can be used immediately for AI training.
OpenDataLoader PDF will also gain a feature that automatically detects and blocks security threats, with the aim of ensuring both the stability and the reliability of AI training data.
Going forward, HANCOM will strengthen integration and compatibility with major AI services and frameworks such as ChatGPT, Gemini, and LangChain, and will continue to collaborate with the global developer community through GitHub.
Jeong Ji-hwan, HANCOM chief technology officer (CTO), said, "In the era of AI transformation (AX), open source is no longer a choice but an essential strategy for innovation and securing competitiveness across corporations and society," adding, "Through the release of the core technology of OpenDataLoader PDF, we will gain recognition from developers worldwide and, through collaboration, further advance PDF data extraction to complete a world-class AI data extraction technology."
He added, "By year-end, we will continue to advance the open-source project, including adding AI-based document recognition technology."