Hancom open-source PDF tool tops GitHub trending as stars surge


Hancom said on the 23rd that its open-source project "OpenDataLoader PDF v2.0" had topped GitHub's trending list across all programming languages.

The milestone came a week after launch. As of Mar. 21, the project had surpassed 7,000 stars and 500 forks on GitHub, and it also recorded a daily gain of more than 1,800 stars.

"OpenDataLoader PDF v2.0" is a technology that breaks down PDF documents into text, tables, and images and converts them into data formats that artificial intelligence (AI) can use.

Version 2.0 uses a hybrid engine that combines AI-based and direct extraction methods, and it offers optical character recognition (OCR), table extraction, formula extraction, and chart analysis.

It also runs locally, so data can be processed without being sent to external servers. It is also compatible with other open-source projects such as Docling.

Hancom said the tool recorded the highest accuracy in every category, including reading order, table, and heading extraction, in the company's own benchmark tests.

The version is released under the Apache License 2.0, allowing commercial use.

Kim Yeon-su, Hancom's CEO, said, "This achievement shows that the completeness and practicality of document data extraction technology have been validated by the global developer community," adding, "We will develop it into an open PDF data platform."
