Seminar 1 | “Measuring Representation in Culture”
Much work in cultural analytics has examined questions of representation in narrative, whether through the deliberate process of watching movies or reading books and counting the people who appear on screen, or by developing algorithmic measuring devices to do so at scale. In this talk, I’ll explore the use of NLP and computer vision to capture the diversity of representation in both contemporary literature and film, along with the challenges and opportunities that arise in this process. These include not only the legal and policy challenges of working with copyrighted materials, but also the opportunities for aligning current methods in NLP with the diversity of representation we see in contemporary narrative. Toward this end, I’ll highlight models of referential gender that align characters in fiction with the pronouns used to describe them (he/she/they/xe/ze/etc.) rather than inferring an unknowable gender identity.
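The core move behind referential gender can be sketched in a few lines: rather than assigning a gender identity, label a character by the pronoun series most frequently used to refer to them in the text. The following is a minimal illustrative sketch, not the speaker’s actual model; the pronoun inventory and the coreference cluster here are invented for the example.

```python
from collections import Counter

# Hypothetical pronoun inventory mapping surface forms to a pronoun
# series; a real inventory would be larger and handle ambiguity.
PRONOUN_SERIES = {
    "he": "he/him", "him": "he/him", "his": "he/him",
    "she": "she/her", "her": "she/her", "hers": "she/her",
    "they": "they/them", "them": "they/them", "their": "they/them",
    "xe": "xe/xem", "xem": "xe/xem",
    "ze": "ze/zir", "zir": "ze/zir",
}

def referential_gender(mentions):
    """Label a character by the pronoun series most often used to
    refer to them, rather than inferring a gender identity."""
    counts = Counter(
        PRONOUN_SERIES[m.lower()]
        for m in mentions
        if m.lower() in PRONOUN_SERIES
    )
    if not counts:
        return "unknown"
    return counts.most_common(1)[0][0]

# Mentions drawn from one invented coreference cluster:
cluster = ["Robin", "they", "their", "Robin", "them"]
print(referential_gender(cluster))  # they/them
```

In practice, the mentions would come from a coreference resolver rather than a hand-built list, and a character with no pronominal mentions simply stays unlabeled.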
(This seminar is part of the DHAI seminar series, Digital Humanities and Artificial Intelligence.)
Seminar 2 | “Building Multilingual BookNLP”
Wednesday, June 21, at 10 a.m., salle Jaurès (29 rue d’Ulm) | pre-registration here
BookNLP (Bamman et al. 2014) is a natural language processing pipeline for reasoning about the linguistic structure of text in books, specifically designed for works of fiction. In addition to its pipeline of part-of-speech tagging, named entity recognition, and coreference resolution, BookNLP identifies the characters in a literary text and represents them through the actions they participate in, the objects they possess, their attributes, and their dialogue. The availability of this tool has driven much work in the computational humanities, especially work surrounding character (Underwood et al. 2018; Kraicer and Piper 2018; Cheng 2020). At the same time, however, BookNLP has had one major limitation: it currently supports only texts written in English. In this talk, I will describe our efforts to expand BookNLP to support literature in languages beyond English, and to create a blueprint for others to develop it for additional languages in the future.
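The character representation described above can be pictured with a small sketch. The dictionary here is a toy, hand-built stand-in loosely modeled on the kind of per-character output BookNLP derives from parsing and coreference over full novels; the field names and example values are illustrative assumptions, not the tool’s exact output format.

```python
# Toy stand-in for a BookNLP-style character record (field names and
# values are invented for illustration).
character = {
    "name": "Elizabeth",
    "agent": ["walked", "laughed", "replied"],  # actions she performs
    "patient": ["admired"],                     # actions done to her
    "poss": ["book", "bonnet"],                 # objects she possesses
    "mod": ["lively", "clever"],                # attributes
}

def summarize(char):
    """Render a one-line summary of a character from their record."""
    return (f"{char['name']} acts ({', '.join(char['agent'])}), "
            f"is acted on ({', '.join(char['patient'])}), "
            f"owns ({', '.join(char['poss'])}), "
            f"and is described as ({', '.join(char['mod'])}).")

print(summarize(character))
```

Representations of this shape are what make character-centered studies like those cited above possible at scale: once every character in a corpus has such a record, their actions, possessions, and attributes can be compared across thousands of novels.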
The seminar will be followed by a working session on BookNLP.
Seminar 3 | “The Promise and Peril of Large Language Models for Cultural Analytics”
Wednesday, June 28, at 2 p.m., salle Jaurès (29 rue d’Ulm) | pre-registration here
In this talk, I’ll discuss the role of large language models (such as ChatGPT, GPT-4, and open alternatives) for research in cultural analytics, both raising issues about the use of closed models for scholarly inquiry and charting the opportunities that such models present. I’ll discuss recent work carrying out a data archaeology to infer which books are known to ChatGPT and GPT-4 using a name cloze membership inference query; we find that OpenAI models have memorized a wide collection of materials, and that the degree of memorization is tied to the frequency with which passages of those books appear on the web. I’ll also detail the use of those models for downstream tasks in cultural analytics, illustrating their affordances for the measurement of difficult cultural phenomena, but also the risks that come with establishing measurement validity. The rise of large pre-trained language models has the potential to radically transform the space of cultural analytics, both by reducing the need for large-scale training data for new tasks and by lowering the technical barrier to entry, but care is needed in establishing the reliability of results.
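The basic mechanics of a name cloze query can be sketched simply: mask a character’s name in a passage and check whether a model can fill it back in, which it should only be able to do if it has seen the passage before. This is a simplified stand-in, not the exact protocol of the work described above; the prompt construction, scoring rule, and example passage are assumptions for illustration.

```python
def name_cloze(passage, name, mask="[MASK]"):
    """Build a name-cloze query: replace the single occurrence of a
    character's name in a passage with a mask token."""
    assert passage.count(name) == 1, "passage must contain the name exactly once"
    return passage.replace(name, mask)

def is_memorized(model_answer, name):
    """Exact-match scoring: the model 'passes' the cloze only if it
    produces the masked name verbatim."""
    return model_answer.strip() == name

# Invented example passage for illustration:
passage = "It is a truth universally acknowledged that Darcy was proud."
prompt = name_cloze(passage, "Darcy")
print(prompt)  # It is a truth universally acknowledged that [MASK] was proud.
```

Since proper names are rarely predictable from context alone, a model that reliably restores them across many passages of a book provides evidence that the book was in its training data.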
_ _ _ _ _ _ _ _ _ _
David Bamman is an associate professor in the School of Information at UC Berkeley, where he works in the areas of natural language processing and cultural analytics, applying NLP and machine learning to empirical questions in the humanities and social sciences. His research focuses on improving the performance of NLP for underserved domains like literature (including LitBank and BookNLP) and exploring the affordances of empirical methods for the study of literature and culture. Before Berkeley, he received his PhD from the School of Computer Science at Carnegie Mellon University and was a senior researcher at the Perseus Project of Tufts University. Bamman’s work is supported by the National Endowment for the Humanities, the National Science Foundation, the Mellon Foundation, and an NSF CAREER award.