We present the task of modeling information propagation in literature, in which we seek to identify pieces of information passing from character A to character B to character C, only given a description of their activity in text. We describe a new pipeline for measuring information propagation in this domain and publish a new dataset for speaker attribution, enabling the evaluation of an important component of this pipeline on a wider range of literary texts than previously studied. Using this pipeline, we analyze the dynamics of information propagation in over 5,000 works of English fiction, finding that information flows through characters that fill structural holes connecting different communities, and that characters who are women are depicted as filling this role much more frequently than characters who are men.
FanfictionNLP: A Text Processing Pipeline for Fanfiction
Fanfiction presents an opportunity as a data source for research in NLP, education, and social science. However, answering specific research questions with this data is difficult, since fanfiction contains more diverse writing styles than formal fiction. We present a text processing pipeline for fanfiction, with a fo- cus on identifying text associated with characters. The pipeline includes modules for character identification and coreference, as well as the attribution of quotes and narration to those characters. Additionally, the pipeline contains a novel approach to character coreference that uses knowledge from quote attribution to resolve pronouns within quotes. For each module, we evaluate the effectiveness of various approaches on 10 annotated fanfiction stories. This pipeline outperforms tools developed for formal fiction on the tasks of character coreference and quote attribution.
- Award ID(s):
- Publication Date:
- NSF-PAR ID:
- Journal Name:
- The 3rd Workshop on Narrative Understanding
- Sponsoring Org:
- National Science Foundation
More Like this
It takes great effort to manually or semi-automatically convert free-text phenotype narratives (e.g., morphological descriptions in taxonomic works) to a computable format before they can be used in large-scale analyses. We argue that neither a manual curation approach nor an information extraction approach based on machine learning is a sustainable solution to produce computable phenotypic data that are FAIR (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al. 2016). This is because these approaches do not scale to all biodiversity, and they do not stop the publication of free-text phenotypes that would need post-publication curation. In addition, both manual and machine learning approaches face great challenges: the problem of inter-curator variation (curators interpret/convert a phenotype differently from each other) in manual curation, and keywords to ontology concept translation in automated information extraction, make it difficult for either approach to produce data that are truly FAIR. Our empirical studies show that inter-curator variation in translating phenotype characters to Entity-Quality statements (Mabee et al. 2007) is as high as 40% even within a single project. With this level of variation, curated data integrated from multiple curation projects may still not be FAIR. The key causes of this variation have been identified as semantic vaguenessmore »
Stereotypical character roles-also known as archetypes or dramatis personae-play an important function in narratives: they facilitate efficient communication with bundles of default characteristics and associations and ease understanding of those characters’ roles in the overall narrative. We present a fully unsupervised k-means clustering approach for learning stereotypical roles given only structural plot information. We demonstrate the technique on Vladimir Propp’s structural theory of Russian folktales (captured in the extended ProppLearner corpus, with 46 tales), showing that our approach can induce six out of seven of Propp’s dramatis personae with F1 measures of up to 0.70 (0.58 average), with an additional category for minor characters. We have explored various feature sets and variations of a cluster evaluation method. The best-performing feature set comprises plot functions, unigrams, tf-idf weights, and embeddings over coreference chain heads. Roles that are mentioned more often (Hero, Villain), or have clearly distinct plot patterns (Princess) are more strongly differentiated than less frequent or distinct roles (Dispatcher, Helper, Donor). Detailed error analysis suggests that the quality of the coreference chain and plot functions annotations are critical for this task. We provide all our data and code for reproducibility.
Animacy is the characteristic of a referent beingable to independently carry out actions in a storyworld (e.g., movement, communication). It is anecessary property of characters in stories, and sodetecting animacy is an important step in automaticstory understanding; it is also potentially useful formany other natural language processing tasks suchas word sense disambiguation, coreference resolu-tion, character identification, and semantic role la-beling. Recent work by Jahanet al.demon-strated a new approach to detecting animacy whereanimacy is considered a direct property of corefer-ence chains (and referring expressions) rather thanwords. In Jahanet al., they combined hand-builtrules and machine learning (ML) to identify the an-imacy of referring expressions and used majorityvoting to assign the animacy of coreference chains,and reported high performance of up to 0.90F1. Inthis short report we verify that the approach gener-alizes to two different corpora (OntoNotes and theCorpus of English Novels) and we confirmed thatthe hybrid model performs best, with the rule-basedmodel in second place. Our tests apply the animacyclassifier to almost twice as much data as Jahanetal.’s initial study. Our results also strongly suggest,as would be expected, the dependence of the mod-els on coreference chain quality. We release ourdata and code to enable reproducibility.
Counteracting Dark Web Text-Based CAPTCHA with Generative Adversarial Learning for Proactive Cyber Threat IntelligenceAutomated monitoring of dark web (DW) platforms on a large scale is the first step toward developing proactive Cyber Threat Intelligence (CTI). While there are efficient methods for collecting data from the surface web, large-scale dark web data collection is often hindered by anti-crawling measures. In particular, text-based CAPTCHA serves as the most prevalent and prohibiting type of these measures in the dark web. Text-based CAPTCHA identifies and blocks automated crawlers by forcing the user to enter a combination of hard-to-recognize alphanumeric characters. In the dark web, CAPTCHA images are meticulously designed with additional background noise and variable character length to prevent automated CAPTCHA breaking. Existing automated CAPTCHA breaking methods have difficulties in overcoming these dark web challenges. As such, solving dark web text-based CAPTCHA has been relying heavily on human involvement, which is labor-intensive and time-consuming. In this study, we propose a novel framework for automated breaking of dark web CAPTCHA to facilitate dark web data collection. This framework encompasses a novel generative method to recognize dark web text-based CAPTCHA with noisy background and variable character length. To eliminate the need for human involvement, the proposed framework utilizes Generative Adversarial Network (GAN) to counteract dark web background noise andmore »