
Products for grant HT-272570-20

HT-272570-20
New Languages for NLP: Building Linguistic Diversity in the Digital Humanities
Natalia Ermolaev, Princeton University

Grant details: https://securegrants.neh.gov/publicquery/main.aspx?f=1&gn=HT-272570-20

New Languages for NLP Course Materials (Course or Curricular Material)
Title: New Languages for NLP Course Materials
Author: Andrew Janco
Author: Natalia Ermolaev
Author: Toma Tasovac
Author: David Lassner
Author: Quinn Dombrowski
Author: Anubhav Sharma
Abstract: This site provides an open reference resource for participants during the workshops and acts as the first draft of materials for the online course. The course materials site has sections that present prerequisite skills and knowledge. It has entries for each session during the workshops with supporting information and instructions. The overall goal of the course materials site is to provide an ongoing reference work to support participants’ work and asynchronous learning.
Year: 2021
Primary URL: https://new-languages-for-nlp.github.io/course-materials/intro.html
Primary URL Description: This is the URL for the course materials.
Audience: Graduate

New Languages for NLP project website (Web Resource)
Title: New Languages for NLP project website
Author: Andrew Janco
Author: Natalia Ermolaev
Abstract: The project website serves as the public-facing informational source for the project. It articulates our aims and goals, as well as the significance of the project. One page describes our languages, team members, and research goals. The full schedules for our workshops are posted publicly here.
Year: 2021
Primary URL: https://newnlp.princeton.edu/
Primary URL Description: Project website URL

Event Recap: New Languages for NLP Workshop I (Blog Post)
Title: Event Recap: New Languages for NLP Workshop I
Author: Anubhav Sharma
Abstract: A recap of Workshop 1
Date: 07/22/21
Primary URL: https://cdh.princeton.edu/updates/2021/07/23/event-recap-new-languages-for-nlp-workshop-i/
Primary URL Description: Blogpost page
Blog Title: CDH Updates
Website: Center for Digital Humanities at Princeton

Cadet (Computer Program)
Title: Cadet
Author: Andrew Janco
Abstract: Cadet is an open-source Python web application that was created in 2021 by Andrew Janco to facilitate participants’ work and will be shared with the general public following the grant. The application facilitates the customization of language defaults for tokenization and lookups data. Cadet also uses token frequency to bulk annotate frequent unambiguous terms and to shorten the time needed for annotation.
Year: 2021
Primary URL: https://github.com/New-Languages-for-NLP/cadet
Primary URL Description: Source code for Cadet
Access Model: Open-source
Programming Language/Platform: Python
Source Available?: Yes

Eisenstein (Computer Program)
Title: Eisenstein
Author: Andrew Janco
Abstract: “Eisenstein” is an open-source Python web application that was built in 2021 by Andrew Janco for participants who needed optical character recognition using Tesseract. This web application simplifies Tesseract text extraction in over one hundred languages.
Year: 2021
Primary URL: https://eisenstein.apjan.co/
Primary URL Description: User-facing website
Secondary URL: https://github.com/apjanco/eisenstein
Secondary URL Description: Source code
Access Model: Open-source
Programming Language/Platform: Python
Source Available?: Yes

Multilingual NLP as Interface (Conference Paper/Presentation)
Title: Multilingual NLP as Interface
Author: Andrew Janco
Author: Natalia Ermolaev
Author: Toma Tasovac
Author: David Lassner
Author: Quinn Dombrowski
Abstract: The extreme focus on modern English in much of the natural language processing (NLP) community has led to a chasm between what is computationally possible for English and, for some languages, the feasibility of using computational methods at all. Even within the sphere of modern English, one encounters a performance gap when applying state-of-the-art algorithms to literature; models trained on news corpora and Wikipedia lose some efficacy when applied to such different kinds of text (Bamman et al. 2019). While they typically lack a graphical user interface, NLP models and packages serve as interfaces to text, enabling scholars to do some things, but not others, depending on how they were created, and the nature and quality of their training data. This panel features three talks by scholars working to create new NLP tools and pedagogical materials that address the needs of humanities scholars who work with languages other than English -- in effect, building better interfaces for a wider range of computational scholarship.
Date: 09/09/21
Primary URL: https://dariah-2021.sciencesconf.org/354644
Primary URL Description: Panel abstract
Secondary URL: https://www.youtube.com/watch?v=L7QAdfGq5S8&list=PLfWGHIkSIx0VvopwbKLjZlYL8qFvne7cT&index=9
Secondary URL Description: YouTube recording of event
Conference Name: DARIAH Virtual Annual Event 2021

New Languages for NLP (Public Lecture or Presentation)
Title: New Languages for NLP
Abstract: In this 5-minute pitch, we will present “New Languages for NLP,” a workflow to create natural language processing (NLP) models for currently unsupported languages. Our workflow includes a new annotation tool called Cadet, which significantly speeds up the process and enables small teams without extensive technical expertise to do what seems like a very complex task: to transform raw text into the linguistic data necessary to train machine learning models. The outcomes can be used for a variety of research, commercial, or government purposes.
Author: Andrew Janco
Author: Natalia Ermolaev
Date: 10/07/21
Location: Princeton University (remote)
Primary URL: https://kellercenter.princeton.edu/people/startups-teams/new-languages-nlp
Primary URL Description: Team abstract
Secondary URL: https://www.youtube.com/watch?v=Uj6Q_0EMgBw&t=5855s
Secondary URL Description: YouTube recording of event (starting 1:37:40)

Machine Predictions and Synthetic Text: A Roundtable (Public Lecture or Presentation)
Title: Machine Predictions and Synthetic Text: A Roundtable
Abstract: Since it was published in March 2021, "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" has sparked impassioned conversations on the unintended consequences and potential harms of prominent natural language processing (NLP) projects. While this groundbreaking paper has been influential in computer and data science—prompting reflection on the dangers of relying on poorly conceptualized and curated data—it is only beginning to be discussed by humanities scholars who use NLP methods in their research. For this roundtable, two co-authors of "Stochastic Parrots" will speak with three leading digital humanities scholars about the implications of the article for humanities research employing NLP methods. Together, they will discuss how the authors’ attention to process (data gathering, documentation, standards) and ethics in AI can be turned to humanists creating data and models for the study of literature, history, and culture.
Author: Angelina McMillan-Major
Author: Gimena del Rio Riande
Author: Lauren Klein
Author: Margaret Mitchell
Author: Ted Underwood
Author: Toma Tasovac
Date: 10/26/21
Location: Princeton University, on Zoom
Primary URL: https://cdh.princeton.edu/events/2021/10/machine-predictions-and-synthetic-text-a-roundtable-on-large-language-models-in-the-humanities/
Primary URL Description: Event abstract


Permalink: https://securegrants.neh.gov/publicquery/products.aspx?gn=HT-272570-20