Recent research has documented that results reported in frequently cited authorship attribution papers are difficult to reproduce. Inaccessible code and data are often proposed as factors that block successful reproductions. Even when original materials are available, problems remain that prevent researchers from comparing the effectiveness of different methods. To solve the remaining problems (the lack of fixed test sets and the use of inappropriately homogeneous corpora), our paper contributes materials for five closed-set authorship identification experiments. The five experiments feature texts from 106 distinct authors and cover a range of contemporary non-fiction American English prose. These experiments provide the foundation for comparable and reproducible authorship attribution research involving contemporary writing.
Adjectives and adverbs as stylometric analysis parameters
The present study considers the role of adjectives and adverbs in stylometric analysis and authorship attribution. Adjectives and adverbs allow both for variations in placement and order (adverbs) and variations in type (adjectives). This preliminary study examines a collection of 25 English-language blogs taken from the Schler Blog corpus, together with the Project Gutenberg corpus, with specific emphasis on 3 works. Within the blog corpora, the first and last 100 lines were extracted for analysis. Project Gutenberg corpora were used in full. All texts were processed and part-of-speech tagged using the Python NLTK package. All adverbs were classified as sentence-initial, preverbal, interverbal, postverbal, sentence-final, or none-of-the-above. The adjectives were classified manually by one of the authors into types according to the universal English type hierarchy (Cambridge Dictionary Online, 2021; Annear, 1964). Ambiguous adjectives were classified according to their context. For the adverbs, the initial samples were paired and used as training data to attribute the final samples. This resulted in 600 trials under each of five experimental conditions. We were able to attribute authorship with an average accuracy 9.7% greater than chance across all five conditions. This result strongly suggests that adverbial placement is a useful and novel idiolectal variable for authorship attribution (Juola et al., 2021). Confirmatory experiments are ongoing with a larger sample of English-language blogs. For the adjectives, differences were found in the types of adjective used by each author. Percent use of each type varied with individual preference and subject matter (e.g., Moby Dick had a large number of adjectives related to size and color). While adverbial order and placement are highly variable, adjectives are subject to rigid restrictions that are not violated across texts and authors.
Stylometric differences in adjective use generally involve the type and category of adjectives preferred by the author. Future investigation will focus on whether adverbial variation is similarly analyzable by type and category of adverb.
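The adverb placement classification described in the abstract can be illustrated with a minimal sketch. The abstract names the six categories but does not define them, so the rules below (including their order of precedence) are assumptions made for illustration only and are not the authors' procedure. The input is assumed to be a sentence already tagged with Penn Treebank tags, such as the output of NLTK's `pos_tag`; the function names `classify_adverbs` and `placement_profile` are hypothetical.

```python
from collections import Counter

# Penn Treebank adverb tags (plain, comparative, superlative).
ADVERB_TAGS = {"RB", "RBR", "RBS"}


def is_verbal(tag):
    """True for verb tags (VB, VBD, VBG, VBN, VBP, VBZ) and modals (MD)."""
    return tag.startswith("VB") or tag == "MD"


def classify_adverbs(tagged_sentence):
    """Label each adverb in one tagged sentence with a placement category.

    Assumed rules, checked in this order:
      sentence-initial  : the adverb is the first word
      sentence-final    : the adverb is the last word
      interverbal       : verbal word on both sides ("had quietly left")
      preverbal         : verbal word immediately after
      postverbal        : verbal word immediately before
      none-of-the-above : anything else
    """
    # Ignore pure punctuation tokens so "quickly." still counts as final.
    words = [(tok, tag) for tok, tag in tagged_sentence
             if any(ch.isalnum() for ch in tok)]
    last = len(words) - 1
    labels = []
    for i, (tok, tag) in enumerate(words):
        if tag not in ADVERB_TAGS:
            continue
        prev_verbal = i > 0 and is_verbal(words[i - 1][1])
        next_verbal = i < last and is_verbal(words[i + 1][1])
        if i == 0:
            label = "sentence-initial"
        elif i == last:
            label = "sentence-final"
        elif prev_verbal and next_verbal:
            label = "interverbal"
        elif next_verbal:
            label = "preverbal"
        elif prev_verbal:
            label = "postverbal"
        else:
            label = "none-of-the-above"
        labels.append((tok, label))
    return labels


def placement_profile(tagged_sentences):
    """Relative frequency of each placement category across sentences."""
    counts = Counter(label
                     for sent in tagged_sentences
                     for _, label in classify_adverbs(sent))
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


# Hand-tagged example sentences (tags supplied manually, not by a tagger).
example = [
    [("Suddenly", "RB"), ("she", "PRP"), ("had", "VBD"), ("quietly", "RB"),
     ("left", "VBN"), ("the", "DT"), ("room", "NN"), (".", ".")],
    [("He", "PRP"), ("ran", "VBD"), ("quickly", "RB"), (".", ".")],
]
profile = placement_profile(example)
```

A per-author profile of this kind could serve as a feature vector for comparing an initial sample with a final sample, although the abstract does not say which comparison method was used.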
- Award ID(s): 1814602
- PAR ID: 10541769
- Publisher / Repository: IJDH
- Date Published:
- Journal Name: International Journal of Digital Humanities
- Volume: 5
- Issue: 2-3
- ISSN: 2524-7840
- Page Range / eLocation ID: 233 to 245
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Two interlocking research questions of growing interest and importance in privacy research are Authorship Attribution (AA) and Authorship Obfuscation (AO). Given an artifact, especially a text t in question, an AA solution aims to accurately attribute t to its true author out of many candidate authors, while an AO solution aims to modify t to hide its true authorship. Traditionally, the notion of authorship and its accompanying privacy concerns applied only to human authors. However, in recent years, due to the explosive advancements in Neural Text Generation (NTG) techniques in NLP, capable of synthesizing human-quality open-ended texts (so-called neural texts), one now has to consider authorship by humans, machines, or their combination. Due to the implications and potential threats of neural texts when used maliciously, it has become critical to understand the limitations of traditional AA/AO solutions and develop novel AA/AO solutions for dealing with neural texts. In this survey, therefore, we make a comprehensive review of recent literature on the attribution and obfuscation of neural text authorship from a Data Mining perspective, and share our view on their limitations and promising research directions.
-
Certain colors are strongly associated with certain adjectives (e.g. red is hot, blue is cold). Some of these associations are grounded in visual experiences like seeing hot embers glow red. Surprisingly, many congenitally blind people show similar color associations, despite lacking all visual experience of color. Presumably, they learn these associations via language. Can we detect these associations in the statistics of language? And if so, what form do they take? We apply a projection method to word embeddings trained on corpora of spoken and written text to identify color-adjective associations as they are represented in language. We show that these projections are predictive of color-adjective ratings collected from blind and sighted people, and that the effect size depends on the training corpus. Finally, we examine how color-adjective associations might be represented in language by training word embeddings on corpora from which various sources of color-semantic information are removed.
-
The availability of quantitative text analysis methods has provided new ways of analyzing literature in a manner that was not available in the pre-information era. Here we apply comprehensive machine learning analysis to the work of William Shakespeare. The analysis shows clear changes in the style of writing over time, with the most significant changes in the sentence length, frequency of adjectives and adverbs, and the sentiments expressed in the text. Applying machine learning to make a stylometric prediction of the year of the play shows a Pearson correlation of 0.71 between the actual and predicted year, indicating that Shakespeare's writing style as reflected by the quantitative measurements changed over time. Additionally, it shows that the stylometrics of some of the plays is more similar to plays written either before or after the year they were written. For instance, Romeo and Juliet is dated 1596, but is more similar in stylometrics to plays written by Shakespeare after 1600. The source code for the analysis is available for free download.
-
Authorship Attribution (AA) and Authorship Obfuscation (AO) are two competing tasks of increasing importance in privacy research. Modern AA leverages an author's consistent writing style to match a text to its author using an AA classifier. AO is the corresponding adversarial task, aiming to modify a text in such a way that its semantics are preserved, yet an AA model cannot correctly infer its authorship. To address privacy concerns raised by state-of-the-art (SOTA) AA methods, new AO methods have been proposed but remain largely impractical to use due to their prohibitively slow training and obfuscation speed, often taking hours. To address this challenge, we propose a practical AO method, ALISON, that (1) dramatically reduces training/obfuscation time, demonstrating more than 10x faster obfuscation than SOTA AO methods, (2) achieves better obfuscation success through attacking three transformer-based AA methods on two benchmark datasets, typically performing 15% better than competing methods, (3) does not require direct signals from a target AA classifier during obfuscation, and (4) utilizes unique stylometric features, allowing sound model interpretation for explainable obfuscation. We also demonstrate that ALISON can effectively prevent four SOTA AA methods from accurately determining the authorship of ChatGPT-generated texts, all while minimally changing the original text semantics. To ensure the reproducibility of our findings, our code and data are available at: https://github.com/EricX003/ALISON.

