National Endowment for the Humanities (NEH)

Funded Projects Query Form
One match

Grant number: HAA-263837-19


Northeastern University (Boston, MA 02115-5005)
David Smith (Project Director: June 2018 to present)
Improving Optical Character Recognition and Tracking Reader Annotations in Printed Books by Collating and Transcribing Multiple Exemplars

Further research on enhanced optical character recognition techniques for historical printed books and on the automatic discovery of handwritten marginalia, drawing on the collections of the Internet Archive.

Most past digitization projects have focused on transcribing documents individually. With the availability of library-scale digital collections, we propose a Digital Humanities Advancement Grant (Level II) to develop computational image and language models to discover multiple copies and editions of similar texts and to correct each text using these comparable witnesses. We provide evidence that this collational transcription system can significantly improve optical character recognition on historical books. We also propose to use these collated editions to discover annotated passages in large digitized book collections. This approach will therefore not only mitigate the errors that reader annotations introduce into the OCR process but will also produce the first automatically generated database of handwritten annotations, Ichneumon. Methods and software developed by this project will thus benefit future research on automatic collation, book history, and historical reading practices.

Project fields:
Computational Linguistics

Digital Humanities Advancement Grants

Digital Humanities

$100,000 (approved)
$99,224 (awarded)

Grant period:
1/1/2019 – 6/30/2021