University of California, Berkeley (Berkeley, CA 94704-5940)
David Bamman (Project Director: January 2020 to present)

Digital Humanities Advancement Grants
Digital Humanities

$324,874 (approved)
$292,054 (awarded)

Grant period:
9/1/2020 – 8/31/2023

Multilingual BookNLP: Building a Literary NLP Pipeline Across Languages

The expansion of BookNLP, a platform for studying the linguistic structure of textual materials, to support the analysis of resources in Spanish, Japanese, Russian and German.

BookNLP (Bamman et al., 2014) is a natural language processing pipeline for reasoning about the linguistic structure of book-length texts, designed specifically for works of fiction. In addition to its pipeline of part-of-speech tagging, named entity recognition, and coreference resolution, BookNLP identifies the characters in a literary text and represents them through the actions they participate in, the objects they possess, their attributes, and their dialogue. The availability of this tool has driven much work in the computational humanities, especially work on character (Underwood et al., 2018; Kraicer and Piper, 2018; Dubnicek et al., 2018). BookNLP has one major limitation, however: it currently supports only texts written in English. The goal of this project is to develop a version of BookNLP that supports literature in Spanish, Japanese, Russian and German, and to create a blueprint that others can follow to extend it to additional languages in the future.