Title: A Call for Clarity in Contemporary Authorship Attribution Evaluation
Recent research has documented that results reported in frequently-cited authorship attribution papers are difficult to reproduce. Inaccessible code and data are often proposed as factors which block successful reproductions. Even when original materials are available, problems remain which prevent researchers from comparing the effectiveness of different methods. To solve the remaining problems—the lack of fixed test sets and the use of inappropriately homogeneous corpora—our paper contributes materials for five closed-set authorship identification experiments. The five experiments feature texts from 106 distinct authors. Experiments involve a range of contemporary non-fiction American English prose. These experiments provide the foundation for comparable and reproducible authorship attribution research involving contemporary writing.
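The closed-set identification setting described in the abstract can be illustrated with a toy example. Everything below—the author names, the texts, and the nearest-profile decision rule—is invented for illustration and is not the paper's method; real experiments use much larger texts and richer features.

```python
# Toy closed-set authorship attribution: every test document is known to
# come from one of a fixed set of candidate authors, and the task is to
# pick which one. Features here are simple relative word frequencies.

from collections import Counter

def profile(texts):
    """Relative word-frequency profile over a list of texts."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def attribute(test_text, author_profiles):
    """Return the candidate author whose profile is closest (L1 distance)."""
    test = profile([test_text])
    def dist(p):
        keys = set(test) | set(p)
        return sum(abs(test.get(k, 0) - p.get(k, 0)) for k in keys)
    return min(author_profiles, key=lambda a: dist(author_profiles[a]))

# Invented training data for two hypothetical candidate authors.
training = {
    "author_a": ["the cat sat on the mat", "the cat likes the mat"],
    "author_b": ["stocks rose sharply today", "markets fell sharply today"],
}
profiles = {a: profile(ts) for a, ts in training.items()}
print(attribute("the dog sat on the mat", profiles))  # -> author_a
```

Because the candidate set is closed, the classifier always answers with one of the known authors; evaluation then reduces to accuracy over a fixed test set, which is exactly what the contributed materials standardize.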
Wang, R.; Riddell, A.; Juola, P.
(Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics)
The success of authorship attribution relies on the presence of linguistic features specific to individual authors. There is, however, limited research assessing to what extent authorial style remains constant when individuals switch from one writing modality to another. We measure the effect of writing mode on writing style in the context of authorship attribution research using a corpus of documents composed online (in a web browser) and documents composed offline using a traditional word processor. The results confirm the existence of a “mode effect” on authorial style. Online writing differs systematically from offline writing in terms of sentence length, word use, readability, and certain part-of-speech ratios. These findings have implications for research design and feature engineering in authorship attribution studies.
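Two of the features reported above—sentence length and readability—can be sketched in a few lines. The splitting heuristics below are deliberate simplifications for illustration, not the study's actual measures (mean word length stands in here as a crude readability proxy).

```python
# Illustrative stylometric features: mean sentence length (in words) and
# mean word length (a crude readability proxy). Sentence and word
# boundaries are approximated with regular expressions.

import re

def mean_sentence_length(text):
    """Average number of whitespace-separated words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences)

def mean_word_length(text):
    """Average length of alphabetic tokens."""
    words = re.findall(r"[A-Za-z]+", text)
    return sum(len(w) for w in words) / len(words)

sample = "Online writing is short. Offline writing tends to be longer and more complex."
print(mean_sentence_length(sample))  # average words per sentence
```

Comparing such feature distributions between documents composed online and offline is one way a "mode effect" like the one reported above could be quantified.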
Two interlocking research questions of growing interest and importance in privacy research are Authorship Attribution (AA) and Authorship Obfuscation (AO). Given an artifact, especially a text t in question, an AA solution aims to accurately attribute t to its true author out of many candidate authors, while an AO solution aims to modify t to hide its true authorship. Traditionally, the notion of authorship and the accompanying privacy concerns have applied only to human authors. In recent years, however, due to the explosive advancements in Neural Text Generation (NTG) techniques in NLP, capable of synthesizing human-quality open-ended texts (so-called neural texts), one must now consider authorship by humans, machines, or their combination. Due to the implications and potential threats of neural texts when used maliciously, it has become critical to understand the limitations of traditional AA/AO solutions and to develop novel AA/AO solutions for neural texts. In this survey, therefore, we provide a comprehensive review of recent literature on the attribution and obfuscation of neural text authorship from a Data Mining perspective, and share our view on their limitations and promising research directions.
Lukin, Eugenia; Roberts, James Cooper; Berdik, David; Mugar, Eliana; Juola, Patrick
(International Journal of Digital Humanities)
Abstract The present study considers the role of adjectives and adverbs in stylometric analysis and authorship attribution. Adjectives and adverbs allow both for variations in placement and order (adverbs) and variations in type (adjectives). This preliminary study examines a collection of 25 English-language blogs taken from the Schler Blog corpus, together with the Project Gutenberg corpus, with specific emphasis on 3 works. From each blog, the first and last 100 lines were extracted for analysis; the Project Gutenberg texts were used in full. All texts were processed and part-of-speech tagged using the Python NLTK package. All adverbs were classified as sentence-initial, preverbal, interverbal, postverbal, sentence-final, or none-of-the-above. The adjectives were classified into types according to the universal English type hierarchy (Cambridge Dictionary Online, 2021; Annear, 1964) manually by one of the authors; ambiguous adjectives were classified according to their context. For the adverbs, the initial samples were paired and used as training data to attribute the final samples. This resulted in 600 trials under each of five experimental conditions. We were able to attribute authorship with an average accuracy 9.7% greater than chance across all five conditions, which strongly suggests that adverbial placement is a useful and novel idiolectal variable for authorship attribution (Juola et al., 2021). Confirmatory experiments are ongoing with a larger sample of English-language blogs. For the adjectives, differences were found in the type of adjective used by each author. Percent use of each type varied based upon individual preference and subject matter (e.g. Moby Dick had a large number of adjectives related to size and color). While adverbial order and placement are highly variable, adjectives are subject to rigid restrictions that are not violated across texts and authors.
Stylometric differences in adjective use generally involve the type and category of adjectives preferred by the author. Future investigation will focus, likewise, on whether adverbial variation is similarly analyzable by type and category of adverb.
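The adverb-position scheme described above (sentence-initial, preverbal, interverbal, postverbal, sentence-final, none-of-the-above) can be sketched as follows. The input is assumed to be an NLTK-style list of (token, Penn Treebank tag) pairs such as `nltk.pos_tag` produces; the precedence among the labels is an illustrative assumption, not the study's exact procedure.

```python
# Classify each adverb (RB/RBR/RBS) in a POS-tagged sentence by its
# position relative to the sentence boundaries and the verbs.

VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD"}
ADVERB_TAGS = {"RB", "RBR", "RBS"}

def adverb_positions(tagged_sentence):
    """Return (adverb, position_label) pairs for one tagged sentence."""
    verb_indices = [i for i, (_, tag) in enumerate(tagged_sentence)
                    if tag in VERB_TAGS]
    results = []
    for i, (word, tag) in enumerate(tagged_sentence):
        if tag not in ADVERB_TAGS:
            continue
        if i == 0:
            label = "sentence-initial"
        elif i == len(tagged_sentence) - 1:
            label = "sentence-final"
        elif verb_indices and i < verb_indices[0]:
            label = "preverbal"
        elif verb_indices and i > verb_indices[-1]:
            label = "postverbal"
        elif len(verb_indices) >= 2 and verb_indices[0] < i < verb_indices[-1]:
            label = "interverbal"
        else:
            label = "none-of-the-above"
        results.append((word, label))
    return results

# Example: in "She quickly ran home", 'quickly' precedes the only verb.
tagged = [("She", "PRP"), ("quickly", "RB"), ("ran", "VBD"), ("home", "NN")]
print(adverb_positions(tagged))  # -> [('quickly', 'preverbal')]
```

Counting these labels per author yields the position-frequency features that the paired training/test samples described above would be compared on.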
Özden‐Schilling, Tom
(Journal of the Royal Anthropological Institute)
Abstract Since 2001, beetles have killed two‐thirds of the pine trees in British Columbia, Canada, decimating the predominant commercial tree species in one of the world's largest timber economies. Attempts to construct and circulate computer models of the infestation and its aftermaths, however, have obscured destabilizing changes across state institutions for environmental research. Juxtaposing literary conceptualizations of distributed authorship with ethnographic critiques of technoscientific bureaucracy, this article examines how the proliferation of computer models in contemporary resource planning institutions has altered the ways experts participate in and sanction interpretive communities. The dynamic conceptualizations of authorship produced through these exchanges challenge existing portraits of anticipatory governance, an emergent mode of administration that often relies on models for procedural implementation and narrative framing even as it circumscribes modellers’ voices to specific moments of interpretation and critique. While modellers make claims on distant futures to provoke discussion among diverse actors, later interpreters may highlight a model's apparent precision or its radical uncertainties to defer criticisms of problematic interventions and government restructuring. Such modes of attribution have deepened many scientists’ sense of estrangement from the interpretive communities their models help to engender.
Xing, Eric; Venkatraman, Saranya; Le, Thai; Lee, Dongwon
(Proceedings of the AAAI Conference on Artificial Intelligence)
Authorship Attribution (AA) and Authorship Obfuscation (AO) are two competing tasks of increasing importance in privacy research. Modern AA leverages an author's consistent writing style to match a text to its author using an AA classifier. AO is the corresponding adversarial task, aiming to modify a text in such a way that its semantics are preserved, yet an AA model cannot correctly infer its authorship. To address privacy concerns raised by state-of-the-art (SOTA) AA methods, new AO methods have been proposed but remain largely impractical to use due to their prohibitively slow training and obfuscation speed, often taking hours. To address this challenge, we propose a practical AO method, ALISON, that (1) dramatically reduces training/obfuscation time, demonstrating more than 10x faster obfuscation than SOTA AO methods, (2) achieves better obfuscation success through attacking three transformer-based AA methods on two benchmark datasets, typically performing 15% better than competing methods, (3) does not require direct signals from a target AA classifier during obfuscation, and (4) utilizes unique stylometric features, allowing sound model interpretation for explainable obfuscation. We also demonstrate that ALISON can effectively prevent four SOTA AA methods from accurately determining the authorship of ChatGPT-generated texts, all while minimally changing the original text semantics. To ensure the reproducibility of our findings, our code and data are available at: https://github.com/EricX003/ALISON.
Riddell, Allen, Wang, Haining, and Juola, Patrick. "A Call for Clarity in Contemporary Authorship Attribution Evaluation." Proceedings of the International Conference on Recent Advances in Natural Language Processing. doi:10.26615/978-954-452-072-4_132. https://par.nsf.gov/biblio/10389592.