Training Tesseract OCR on Kurdish Historical Documents

2025/08/19 16:00

Abstract and 1. Introduction

1.1 Printing Press in Iraq and Iraqi Kurdistan

1.2 Challenges in Historical Documents

1.3 Kurdish Language

  2. Related Work and 2.1 Arabic/Persian

    2.2 Chinese/Japanese and 2.3 Coptic

    2.4 Greek

    2.5 Latin

    2.6 Tamizhi

  3. Method and 3.1 Data Collection

    3.2 Data Preparation and 3.3 Preprocessing

    3.4 Environment Setup, 3.5 Dataset Preparation, and 3.6 Evaluation

  4. Experiments, Results, and Discussion and 4.1 Processed Data

    4.2 Dataset and 4.3 Experiments

    4.4 Results and Evaluation

    4.5 Discussion

  5. Conclusion

    5.1 Challenges and Limitations

    Online Resources, Acknowledgments, and References

4 Experiments, Results, and Discussion

Initially, we collected some historical publications from the Zaytoon Public Library in Erbil. However, the documents were too fragile to digitize easily. We then found, online, the Zheen Center for Documentation and Research in Sulaymaniyah (https://zheen.org), a facility that specializes in scanning and archiving historical documents with equipment designed for that purpose. After we visited the center and explained our project, its staff agreed to provide us with digital copies of the earliest Kurdish publications in their collection.

4.1 Processed Data

For image processing, we used a freely available batch processing tool. We loaded the page images into it and applied de-skewing to correct any skew. We then cropped the images automatically, converted them to binary (black-and-white) format, and saved them to the specified destination directory.
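The binarization step can be illustrated with a short sketch. The source names neither the tool nor the thresholding method it applies, so the use of Otsu's method here is an assumption, and the function names `otsu_threshold` and `binarize` are hypothetical. De-skewing and cropping are not covered. The example uses only the Python standard library and represents a grayscale image as a list of rows of 8-bit values.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the Otsu threshold (0-255) for a flat list of 8-bit gray values.

    Assumption: Otsu's method stands in for whatever the batch tool used.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of the gray values of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already in the background class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance, up to a constant factor of 1 / total**2
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image: List[List[int]]) -> List[List[int]]:
    """Map each pixel to 0 (dark, at or below threshold) or 255 (light)."""
    t = otsu_threshold([p for row in image for p in row])
    return [[255 if p > t else 0 for p in row] for row in image]


if __name__ == "__main__":
    # Tiny synthetic "page": dark ink values near 20-30, paper near 200-220.
    page = [[20, 30, 200], [25, 210, 220]]
    print(binarize(page))
```

A global threshold like this tends to struggle with stained or unevenly lit pages, which are common in historical material, so a locally adaptive method may be a better fit in practice.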

4.2 Dataset

After receiving the historical documents from the Zheen Center for Documentation and Research in digital format, we split the pages into single-line images, each paired with its respective transcription. We used an image processing application to crop the lines and saved them in TIFF format.

After converting the pages into line images (see Figure 16), we created a transcription file for each line image by manually typing its text in a text editor.

Figure 15: Sample page from the book titled ‘Awat’, published in 1938 (Zheen Center for Documentation and Research)

We gave each transcription file the same name as its line image, with the .gt.txt suffix (see Figure 17).

This produced the dataset for training Tesseract, which comprises 1233 files: about half are line images and the rest are the corresponding transcription files (see Table 1).
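The naming convention described above lends itself to an automated consistency check before training. The following is a minimal sketch, not the authors' tooling: the function name `check_ground_truth` is hypothetical, and the `.tif` extension is an assumption, since the source only says the line images were saved in TIFF format.

```python
from pathlib import Path
from typing import Dict, List

GT_SUFFIX = ".gt.txt"


def check_ground_truth(directory: str, image_ext: str = ".tif") -> Dict[str, List[str]]:
    """Report how line images and .gt.txt transcriptions pair up in a folder.

    Assumption: a line image `name.tif` is paired with `name.gt.txt`.
    """
    root = Path(directory)
    images = {p.name[: -len(image_ext)] for p in root.glob(f"*{image_ext}")}
    texts = {p.name[: -len(GT_SUFFIX)] for p in root.glob(f"*{GT_SUFFIX}")}

    paired = images & texts
    # A transcription that exists but contains only whitespace is unusable.
    empty = [
        stem
        for stem in paired
        if not (root / f"{stem}{GT_SUFFIX}").read_text(encoding="utf-8").strip()
    ]
    return {
        "paired": sorted(paired),
        "missing_transcription": sorted(images - texts),
        "missing_image": sorted(texts - images),
        "empty_transcription": sorted(empty),
    }


if __name__ == "__main__":
    # The directory name is a placeholder; point it at the real dataset folder.
    report = check_ground_truth("ground-truth")
    for key, names in report.items():
        print(f"{key}: {len(names)}")
```

Running a check like this catches unpaired or empty files early, so they can be fixed before the training run starts.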

4.3 Experiments

In this section, we provide details of the steps taken to prepare our environment, the training process of the model, and other relevant aspects.

4.3.1 Environment Setup

For the training environment, we used Ubuntu 22.04.2 LTS (Jammy Jellyfish). We cloned the tesstrain repository from https://github.com/tesseract-ocr/tesstrain and trained the model on our prepared dataset.
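As a rough illustration of how a tesstrain run is typically launched, the sketch below assembles the `make training` command from Python. `MODEL_NAME`, `START_MODEL`, `TESSDATA`, and `MAX_ITERATIONS` are variables of the tesstrain Makefile, but the values shown are placeholders: this section does not state the model name, whether an existing model was used as a starting point, or the iteration count. The helper names `tesstrain_command` and `run_training` are hypothetical. By default, tesstrain looks for the line images and `.gt.txt` files in `data/<MODEL_NAME>-ground-truth` inside the cloned repository.

```python
import subprocess
from typing import List, Optional


def tesstrain_command(
    model_name: str,
    start_model: Optional[str] = None,
    tessdata: Optional[str] = None,
    max_iterations: Optional[int] = None,
) -> List[str]:
    """Build the argument list for tesstrain's `make training` target.

    Only variables that are explicitly given are passed, so the Makefile
    defaults apply to everything else.
    """
    cmd = ["make", "training", f"MODEL_NAME={model_name}"]
    if start_model:
        # Fine-tune from an existing traineddata file instead of from scratch.
        cmd.append(f"START_MODEL={start_model}")
    if tessdata:
        # Directory that holds the traineddata files for START_MODEL.
        cmd.append(f"TESSDATA={tessdata}")
    if max_iterations is not None:
        cmd.append(f"MAX_ITERATIONS={max_iterations}")
    return cmd


def run_training(cmd: List[str], tesstrain_dir: str = "tesstrain") -> None:
    """Run the command inside the cloned tesstrain repository."""
    subprocess.run(cmd, cwd=tesstrain_dir, check=True)


if __name__ == "__main__":
    # "kurdish_hist" is a placeholder model name, not the one used in the paper.
    command = tesstrain_command("kurdish_hist", max_iterations=10000)
    print(" ".join(command))
    # run_training(command)  # uncomment once tesstrain and the data are in place
```

Building the command separately from running it makes it easy to log or inspect the exact invocation before committing to a long training run.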


:::info Authors:

(1) Blnd Yaseen, University of Kurdistan Hewlêr, Kurdistan Region - Iraq (blnd.yaseen@ukh.edu.krd);

(2) Hossein Hassani, University of Kurdistan Hewlêr, Kurdistan Region - Iraq (hosseinh@ukh.edu.krd).

:::


:::info This paper is available on arXiv under the Attribution-NonCommercial-NoDerivs 4.0 International license.

:::
