
Products for grant PR-276810-21

Unlocking Endangered Language Resources
Antonios Anastasopoulos, George Mason University

Grant details:

Lexically Aware Semi-Supervised Learning for OCR Post-Correction (Article)
Title: Lexically Aware Semi-Supervised Learning for OCR Post-Correction
Author: Shruti Rijhwani
Author: Daisy Rosenblum
Author: Antonios Anastasopoulos
Author: Graham Neubig
Abstract: Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on recognition of less-well-resourced languages. However, these methods rely on manually curated post-correction data, which are relatively scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through the use of self-training, a technique where a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15%–29%, where we find the combination of self-training and lexically aware decoding essential for achieving consistent improvements.
Year: 2021
Primary URL:
Access Model: open access
Format: Journal
Periodical Title: Transactions of the Association for Computational Linguistics
Publisher: Transactions of the Association for Computational Linguistics 2021
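The lexically aware decoding idea in the abstract above can be illustrated in miniature: build a count-based model of the vocabulary seen in recognized texts, then prefer correction hypotheses whose words match that lexicon. This is a simplified sketch, not the paper's implementation; the function names are illustrative, and a unigram dictionary stands in for the WFSA-based language model the authors actually use.

```python
import math
from collections import Counter

def build_count_lm(recognized_texts):
    """Count-based unigram model over words in the recognized texts,
    a toy stand-in for the paper's WFSA lexicon."""
    counts = Counter()
    for text in recognized_texts:
        counts.update(text.split())
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def lexical_log_score(candidate, lm, unk_prob=1e-6):
    """Log-probability of a candidate correction under the lexicon;
    unseen words receive a small floor probability."""
    return sum(math.log(lm.get(w, unk_prob)) for w in candidate.split())

def rescore(nbest, lm):
    """Pick the hypothesis whose vocabulary best matches the recognized texts."""
    return max(nbest, key=lambda cand: lexical_log_score(cand, lm))

# A hypothesis with the OCR-garbled word "thc" scores below one using
# words already attested in the corpus.
lm = build_count_lm(["the cat sat", "the cat ran", "a cat sat"])
best = rescore(["thc cat sat", "the cat sat"], lm)
```

In the paper this lexical score is combined with the neural post-correction model during decoding, and the recognized texts themselves grow across self-training rounds, so the lexicon and the model improve together.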

Explorations in Transfer Learning for OCR Post-Correction (Conference Paper/Presentation)
Title: Explorations in Transfer Learning for OCR Post-Correction
Author: Lindia Tjuatja
Author: Shruti Rijhwani
Author: Graham Neubig
Abstract: In this abstract, we explore transfer learning to improve post-correction for optical character recognition (OCR), specifically for documents that contain endangered language texts. We extend an existing OCR post-correction model (Rijhwani et al., 2020) by introducing an additional pretraining step on related data, such as text in a related language or available target endangered language datasets that may differ in orthography. Although cross-lingual transfer is often successful in high-resource settings, our preliminary results show that transferring from data in another language decreases performance for this task. On the other hand, we observe small improvements in performance when transferring from additional target language data.
Date: 11/10/2021
Primary URL:
Conference Name: 5th Widening NLP Workshop
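The transfer-learning setup in the abstract above is a two-stage training schedule: pretrain on related data (a related language, or target-language data in a different orthography), then continue training on the small target set. The sketch below shows only the schedule with a toy count-based model; the functions and data are hypothetical, not the authors' system.

```python
from collections import Counter

def train_char_counts(texts, counts=None):
    """Toy 'training': accumulate character statistics.
    Passing in existing counts continues from a previous stage,
    mirroring how pretrained weights are carried into fine-tuning."""
    counts = Counter() if counts is None else counts
    for text in texts:
        counts.update(text)
    return counts

def pretrain_then_finetune(related_texts, target_texts):
    """Two-stage schedule: pretrain on related data, then fine-tune
    on the target endangered-language data."""
    model = train_char_counts(related_texts)        # pretraining step
    model = train_char_counts(target_texts, model)  # fine-tuning step
    return model
```

The abstract's finding, that transfer from another language can hurt while transfer from additional target-language data helps slightly, is a property of what fills the first stage, not of the schedule itself.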