HANCOM said on the 12th that it released the open-source PDF data extraction tool "OpenDataLoader PDF v2.0."

OpenDataLoader PDF v2.0 features a hybrid engine that combines an artificial intelligence (AI) approach with direct extraction. Companies and developers can run PDF data extraction in a local environment, without sending data to external servers.
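As a rough illustration of what a hybrid engine of this kind could look like, the Python sketch below uses direct extraction when a page has an embedded text layer and falls back to an AI backend otherwise. The article does not describe how OpenDataLoader PDF chooses between the two approaches, so the routing rule, the `Page` type, and both extractor functions are hypothetical and are not the tool's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Page:
    """Hypothetical page model; not an OpenDataLoader PDF type."""
    number: int
    embedded_text: str  # text layer read directly from the PDF; empty for scans


def direct_extract(page: Page) -> str:
    """Direct extraction: return the text layer already stored in the page."""
    return page.embedded_text.strip()


def ai_extract_stub(page: Page) -> str:
    """Placeholder for a locally running AI model (for example OCR)."""
    return f"[AI-recognized text for page {page.number}]"


def hybrid_extract(
    pages: List[Page],
    ai_extract: Callable[[Page], str] = ai_extract_stub,
    min_chars: int = 1,
) -> List[Tuple[int, str, str]]:
    """Assumed routing rule: prefer direct extraction, fall back to AI
    when the text layer has fewer than `min_chars` characters."""
    results = []
    for page in pages:
        text = direct_extract(page)
        if len(text) >= min_chars:
            results.append((page.number, "direct", text))
        else:
            results.append((page.number, "ai", ai_extract(page)))
    return results


pages = [Page(1, "Quarterly report"), Page(2, "")]
results = hybrid_extract(pages)
for number, method, text in results:
    print(number, method, text)
```

Because both paths are ordinary local function calls in this sketch, no page content leaves the machine, which mirrors the local-environment point made above.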

Four AI add-ons that analyze elements within documents are also included by default. Optical character recognition (OCR) handles text recognition for image-based PDFs and scanned documents, while table extraction analyzes complex table structures such as merged cells. Formula extraction recognizes formulas in science and math papers, and chart analysis describes chart content in sentence form.

These add-ons were implemented to be technically compatible with open-source AI models such as Docling. The compatibility does not reflect a partnership with any specific company or institution; it is intended to ease integration with existing technology environments. The add-on architecture was also designed so that it can be extended to additional AI models in the future.

In HANCOM's own benchmark tests, OpenDataLoader PDF v2.0 recorded high performance in areas such as reading order, tables, and heading inference. The benchmark data and the code needed to reproduce the results have been published in the official GitHub repository.

With this release, the open-source license was also changed. It shifted from the Mozilla Public License 2.0 (MPL 2.0) to the Apache License 2.0, expanding the scope for commercial use.

HANCOM plans to expand integration with AI frameworks going forward. It completed integration with LangChain in 2025, and in 2026 it will pursue integration with Langflow, LlamaIndex, and Gemini-cli. It is also preparing Model Context Protocol (MCP) functionality to support AI agents.

Jung Ji-hwan, HANCOM's chief technology officer, said, "OpenDataLoader PDF v2.0 was developed with a structure that combines an AI approach and a direct extraction approach," and noted, "We broadened the scope of use for developers and corporations by changing the open-source license."
