
Products for grant PR-263939-19

PR-263939-19
Development of Image-to-text Conversion for Pashto and Traditional Chinese
Marek Rychlik, Arizona Board of Regents

Grant details: https://securegrants.neh.gov/publicquery/main.aspx?f=1&gn=PR-263939-19

Multi-lingual Optical Character Recognition Seminar (Conference/Institute/Seminar)
Title: Multi-lingual Optical Character Recognition Seminar
Author: Marek Rychlik
Author: Yan Han
Abstract: This seminar is devoted to current OCR research and to the development of the "Worldly OCR" software. It is open to external speakers, and we are set up for Zoom presentations. Volunteers to give a presentation are welcome.
Date Range: 2019,2020
Primary URL: http://alamos.math.arizona.edu/ocr

worldly-ocr (Web Resource)
Title: worldly-ocr
Author: Marek Rychlik
Author: Sayyed Vazirizade
Author: Yan Han
Author: Dylan Murphy
Author: Dwight Nweigwe
Abstract: Data and MATLAB code for a new OCR system
Year: 2018
Primary URL: https://github.com/mrychlik/worldly-ocr
Primary URL Description: The website contains the data and MATLAB code produced by the project and will be the primary dissemination site for the software products resulting from the project.

Development of a New Image-to-text Conversion System for Pashto, Farsi and Traditional Chinese (Web Resource)
Title: Development of a New Image-to-text Conversion System for Pashto, Farsi and Traditional Chinese
Author: Marek Rychlik
Author: Dwight Nweigwe
Author: Yan Han
Author: Dylan Murphy
Abstract: We report upon the results of a research and prototype building project "Worldly OCR" dedicated to developing new, more accurate image-to-text conversion software for several languages and writing systems. These include the cursive scripts Farsi and Pashto, and Latin cursive scripts. We also describe approaches geared towards Traditional Chinese, which is non-cursive, but features an extremely large character set of 65,000 characters. Our methodology is based on Machine Learning, especially Deep Learning, and Data Science, and is directed towards vast quantities of original documents, exceeding a billion pages. The target audience of this paper is a general audience with interest in Digital Humanities or in retrieval of accurate full-text and metadata from digital images.
Year: 2020
Primary URL: https://arxiv.org/abs/2005.08650
Primary URL Description: A version of the "white paper" on a widely known Cornell e-print server.

Marek Rychlik's YouTube channel (Web Resource)
Title: Marek Rychlik's YouTube channel
Author: Marek Rychlik
Abstract: The channel features approximately 20 videos, created with the software produced by the project, that visualize the algorithms the project developed.
Year: 2019
Primary URL: https://www.youtube.com/channel/UCcq2ciH_Eb0rDckJmS_p-XQ
Primary URL Description: The videos are a mix of highly technical and easy-to-understand visualizations, illustrating character recognition, the method of outlines, etc. One video, entitled "Image-to-text conversion for Farsi", illustrates the operation of a full implementation of the OCR pipeline on Farsi text (used as a proxy for Pashto, because Pashto training data was not yet available when the video was created). The video demonstrates 97-98% character-level accuracy.


Permalink: https://securegrants.neh.gov/publicquery/products.aspx?gn=PR-263939-19