Over the past decade, text recycling (TR; also known as ‘self‐plagiarism’) has become a visible and somewhat contentious practice, particularly in the realm of journal articles. While growing numbers of publishers are writing editorials and formulating guidelines on TR, little is known about how editors view the practice or how they respond to it. We present results from an interview‐based study of 21 North American journal editors from a broad range of academic disciplines. Our findings show that editors' beliefs and practices are quite individualized rather than being tied to disciplinary or other structural parameters. None of our participants supported recycling large amounts of material from one journal article to another; beyond that point of agreement, some editors were staunchly against any use of recycled material, while others accepted the practice in certain circumstances. Issues of originality, the challenges of rewriting text, the varied circulation of texts, and abiding by copyright law were prominent themes as editors discussed their approaches to TR. Overall, the interviews showed that many editors have not thought systematically about the practice of TR, and they sometimes have trouble aligning their beliefs and practices.
A Text-Analytic Method for Identifying Text Recycling in STEM Research Reports
Background: Text recycling (hereafter TR)—the reuse of one’s own textual materials from one document in a new document—is a common but hotly debated and unsettled practice in many academic disciplines, especially in the context of peer-reviewed journal articles. Although several analytic systems have been used to determine replication of text—for example, for purposes of identifying plagiarism—they do not offer an optimal way to compare documents to determine the nature and extent of TR in order to study and theorize this as a practice in different disciplines. In this article, we first describe TR as a common phenomenon in academic publishing, then explore the challenges associated with trying to study the nature and extent of TR within STEM disciplines. We then describe in detail the complex processes we used to create a system for identifying TR across large corpora of texts, and the sentence-level string-distance lexical methods used to refine and test the system (White & Joy, 2004). The purpose of creating such a system is to identify legitimate cases of TR across large corpora of academic texts in different fields of study, allowing meaningful cross-disciplinary comparisons in future analyses of published work. The findings from such investigations will extend and refine our understanding of discourse practices in academic and scientific settings. Literature Review: Text-analytic methods have been widely developed and implemented to identify reused textual materials for detecting plagiarism, and there is considerable literature on such methods. (Instead of taking up space detailing this literature, we point readers to several recent reviews: Gupta, 2016; Hiremath & Otari, 2014; and Meuschke & Gipp, 2013). Such methods include fingerprinting, term occurrence analysis, citation analysis (identifying similarity in references and citations), and stylometry (statistically comparing authors’ writing styles; see Meuschke & Gipp, 2013). 
Although TR occurs in a wide range of situations, recent debate has focused on recycling from one published research paper to another—particularly in STEM fields (see, for example, Andreescu, 2013; Bouville, 2008; Bretag & Mahmud, 2009; Roig, 2008; Scanlon, 2007). An important step in better understanding the practice is seeing how authors actually recycle material in their published work. Standard methods for detecting plagiarism are not directly suitable for this task, as the objective is not to determine the presence or absence of reuse itself, but to study the types and patterns of reuse, including materials that are syntactically but not substantively distinct—such as “patchwriting” (Howard, 1999). In the present account of our efforts to create a text-analytic system for determining TR, we take a conventional alphabetic approach to text, in part because we did not aim at this stage of our project to analyze non-discursive text such as images or other media. However, although the project adheres to conventional definitions of text, with a focus on lexical replication, we also subscribe to context-sensitive approaches to text production. The results of applying the system to large corpora of published texts can potentially reveal varieties in the practice of TR as a function of different discourse communities and disciplines. Writers’ decisions within what appear to be canonical genres are contingent, based on adherence to or deviation from existing rules and procedures if and when these actually exist. Our goal is to create a system for analyzing TR in groups of texts produced by the same authors in order to determine the nature and extent of TR, especially across disciplinary areas, without judgment of scholars’ use of the practice.
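As an illustration of the sentence-level string-distance comparison the method section describes, the sketch below pairs each sentence of an earlier document with its closest match in a later one and keeps pairs above a similarity threshold. This is a minimal stand-in using Python's standard-library `SequenceMatcher` over word tokens, not the authors' actual system; the function names and the 0.8 threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher

def sentence_similarity(a: str, b: str) -> float:
    """Normalized similarity (0.0-1.0) between two sentences,
    computed over lowercased word tokens."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

def flag_recycled(earlier_doc, later_doc, threshold=0.8):
    """For each sentence in earlier_doc, find its best match in later_doc;
    keep (earlier, later, score) pairs whose similarity meets the threshold."""
    matches = []
    for sent in earlier_doc:
        best = max(later_doc, key=lambda cand: sentence_similarity(sent, cand))
        score = sentence_similarity(sent, best)
        if score >= threshold:
            matches.append((sent, best, round(score, 2)))
    return matches
```

Unlike a plagiarism detector's binary verdict, keeping the matched sentence pairs and their scores supports the kind of analysis the abstract calls for: studying the nature and extent of reuse, including lightly edited near-copies.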
- Award ID(s):
- 1737093
- PAR ID:
- 10168553
- Date Published:
- Journal Name:
- The journal of writing analytics
- Volume:
- 3
- ISSN:
- 2474-7491
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
When writing journal articles, science, technology, engineering and mathematics (STEM) researchers produce a number of other genres such as grant proposals and conference posters, and their new articles routinely build directly on their own prior work. As a result, STEM authors often reuse material from their completed documents in producing new documents. While this practice, known as text recycling (or self-plagiarism), is a debated issue in publishing and research ethics, little is known about researchers’ beliefs about what constitutes appropriate practice. This article presents results from an exploratory, survey-based study on beliefs and attitudes toward text recycling among STEM “experts” (faculty researchers) and “novices” (graduate students and post docs). While expert and novice researchers are fairly consistent in distinguishing between text recycling and plagiarism, there is considerable disagreement about appropriate text recycling practice.
-
Text recycling, often called “self-plagiarism”, is the practice of reusing textual material from one’s prior documents in a new work. The practice presents a complex set of ethical and practical challenges to the scientific community, many of which have not been addressed in prior discourse on the subject. This essay identifies and discusses these factors in a systematic fashion, concluding with a new definition of text recycling that takes these factors into account. Topics include terminology, what is not text recycling, factors affecting judgements about the appropriateness of text recycling, and visual materials.
-
A real-world text corpus sometimes comprises not only text documents, but also semantic links between them (e.g., academic papers in a bibliographic network are linked by citations and co-authorships). Text documents and semantic connections form a text-rich network, which empowers a wide range of downstream tasks such as classification and retrieval. However, pretraining methods for such structures are still lacking, making it difficult to build one generic model that can be adapted to various tasks on text-rich networks. Current pretraining objectives, such as masked language modeling, purely model texts and do not take inter-document structure information into consideration. To this end, we propose our PretrAining on TexT-Rich NetwOrk framework PATTON. PATTON includes two pretraining strategies: network-contextualized masked language modeling and masked node prediction, to capture the inherent dependency between textual attributes and network structure. We conduct experiments on four downstream tasks in five datasets from both academic and e-commerce domains, where PATTON outperforms baselines significantly and consistently.
-
We present a new scientific document similarity model based on matching fine-grained aspects of texts. To train our model, we exploit a naturally-occurring source of supervision: sentences in the full-text of papers that cite multiple papers together (co-citations). Such co-citations not only reflect close paper relatedness, but also provide textual descriptions of how the co-cited papers are related. This novel form of textual supervision is used for learning to match aspects across papers. We develop multi-vector representations where vectors correspond to sentence-level aspects of documents, and present two methods for aspect matching: (1) A fast method that only matches single aspects, and (2) a method that makes sparse multiple matches with an Optimal Transport mechanism that computes an Earth Mover’s Distance between aspects. Our approach improves performance on document similarity tasks in four datasets. Further, our fast single-match method achieves competitive results, paving the way for applying fine-grained similarity to large scientific corpora.
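The "fast single-match" idea in this abstract can be sketched in a few lines: each document is a set of sentence-level aspect vectors, and document similarity is the best cosine score over any single pair of aspect vectors. The pure-Python sketch below is an illustrative assumption about that scoring scheme (the function names are hypothetical), not the paper's implementation, which learns the aspect vectors from co-citation supervision.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def single_match_similarity(aspects_a, aspects_b):
    """Fast single-aspect match: score two documents by the single best
    cosine similarity over all pairs of their aspect vectors."""
    return max(cosine(u, v) for u in aspects_a for v in aspects_b)
```

The sparse multi-match variant would instead distribute mass across several aspect pairs via Optimal Transport; the single-match version trades that finer alignment for speed on large corpora.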

