Developing a collaborative corpus-builder platform for Urdu language OCR system: A multi-Institutional initiative

Community Idea Exchange

16 January 2020

A huge number of historic manuscripts and printed materials in the Urdu language need to be digitized for preservation and access. However, such efforts are currently limited to imaging only, and no comprehensive OCR system exists for the Urdu language.

This session describes a joint project to develop an online Urdu language OCR platform, undertaken by Ewing Memorial Library at FCCU in Lahore, Pakistan, and the Roshan Institute for Persian Studies at the University of Maryland in College Park, USA.

This collaboration began by chance in March 2018, prompted by a news item on the Roshan Institute’s KITAB Project website about the development of a Persian language OCR system.

As the Arabic, Persian and Urdu languages share the same alphabet and similar scripts, this author envisioned that the system might be adapted for Urdu language OCR.

The idea was shared with the developers of the Persian OCR system and received a positive response. In July 2018, Dr. Matthew Miller expressed his willingness to take on the Urdu OCR project.

The collaboration is an informal agreement to share available resources for building an Urdu OCR corpus: technical expertise and OCR technology from the Roshan Institute, and language expertise and digitized Urdu documents from FCCU.

Dr. Miller has completed online training of FCC faculty volunteers, who will perform the first pilot run to develop training data. Scans of Urdu documents in varying typefaces are now being examined in order to start the project.

When completed, this project will offer an online Urdu OCR platform for researchers, faculty and librarians, to facilitate the conversion of scanned document images into editable text.

Speakers

Muhammad Imtiaz Ahmed
Bushra Almas Jaswal

Session resources

Slides 1.91 MB
