
Title: A Text-Analytic Method for Identifying Text Recycling in STEM Research Reports
Background: Text recycling (hereafter TR), the reuse of one’s own textual materials from one document in a new document, is a common but hotly debated and unsettled practice in many academic disciplines, especially in the context of peer-reviewed journal articles. Although several analytic systems have been used to detect replicated text (for example, to identify plagiarism), they do not offer an optimal way to compare documents to determine the nature and extent of TR in order to study and theorize it as a practice in different disciplines. In this article, we first describe TR as a common phenomenon in academic publishing, then explore the challenges associated with trying to study the nature and extent of TR within STEM disciplines. We then describe in detail the complex processes we used to create a system for identifying TR across large corpora of texts, and the sentence-level string-distance lexical methods used to refine and test the system (White & Joy, 2004). The purpose of creating such a system is to identify legitimate cases of TR across large corpora of academic texts in different fields of study, allowing meaningful cross-disciplinary comparisons in future analyses of published work. The findings from such investigations will extend and refine our understanding of discourse practices in academic and scientific settings.
Literature Review: Text-analytic methods have been widely developed and implemented to identify reused textual materials for detecting plagiarism, and there is considerable literature on such methods. (Instead of taking up space detailing this literature, we point readers to several recent reviews: Gupta, 2016; Hiremath & Otari, 2014; and Meuschke & Gipp, 2013.) Such methods include fingerprinting, term occurrence analysis, citation analysis (identifying similarity in references and citations), and stylometry (statistically comparing authors’ writing styles; see Meuschke & Gipp, 2013).
Although TR occurs in a wide range of situations, recent debate has focused on recycling from one published research paper to another, particularly in STEM fields (see, for example, Andreescu, 2013; Bouville, 2008; Bretag & Mahmud, 2009; Roig, 2008; Scanlon, 2007). An important step in better understanding the practice is seeing how authors actually recycle material in their published work. Standard methods for detecting plagiarism are not directly suitable for this task, as the objective is not to determine the presence or absence of reuse itself, but to study the types and patterns of reuse, including materials that are syntactically but not substantively distinct, such as “patchwriting” (Howard, 1999). In the present account of our efforts to create a text-analytic system for determining TR, we take a conventional alphabetic approach to text, in part because we did not aim at this stage of our project to analyze non-discursive text such as images or other media. However, although the project adheres to conventional definitions of text, with a focus on lexical replication, we also subscribe to context-sensitive approaches to text production. The results of applying the system to large corpora of published texts can potentially reveal variations in the practice of TR as a function of different discourse communities and disciplines. Writers’ decisions within what appear to be canonical genres are contingent, based on adherence to or deviation from existing rules and procedures if and when these actually exist. Our goal is to create a system for analyzing TR in groups of texts produced by the same authors in order to determine the nature and extent of TR, especially across disciplinary areas, without judgment of scholars’ use of the practice.
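The abstract refers to sentence-level string-distance methods without giving details. As a rough illustration only, the sketch below compares each sentence of a later document against the sentences of an earlier one and reports close matches. It is not the authors' system: the use of Python's `difflib.SequenceMatcher`, the naive sentence splitter, the function names, and the 0.8 threshold are all assumptions chosen for the example.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text):
    """Naive splitter (assumption): break after ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Return a 0..1 similarity ratio between two sentences, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_recycled(source_text, target_text, threshold=0.8):
    """Pair each target sentence with its most similar source sentence.

    Returns (target_sentence, source_sentence, score) tuples whose score
    meets the threshold. The 0.8 default is an arbitrary illustrative value.
    """
    source_sentences = split_sentences(source_text)
    matches = []
    for target in split_sentences(target_text):
        best = max(source_sentences, key=lambda s: similarity(target, s), default=None)
        if best is not None:
            score = similarity(target, best)
            if score >= threshold:
                matches.append((target, best, score))
    return matches


# Hypothetical example texts, invented for this sketch.
earlier = "Samples were centrifuged at 4000 rpm for ten minutes. Results were analyzed with ANOVA."
later = "Samples were centrifuged at 4000 rpm for 10 minutes. We propose a new model of diffusion."

for target, source, score in find_recycled(earlier, later):
    print(f"{score:.2f}  {target!r} <- {source!r}")
```

A character-level ratio like this catches near-verbatim reuse with small edits; a real study of recycling patterns would likely need tokenization, normalization, and a threshold tuned against hand-labeled examples.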
Award ID(s):
1737093
NSF-PAR ID:
10168553
Journal Name:
The journal of writing analytics
Volume:
3
ISSN:
2474-7491
Sponsoring Org:
National Science Foundation
More Like this
  1. Text recycling, often called “self-plagiarism”, is the practice of reusing textual material from one’s prior documents in a new work. The practice presents a complex set of ethical and practical challenges to the scientific community, many of which have not been addressed in prior discourse on the subject. This essay identifies and discusses these factors in a systematic fashion, concluding with a new definition of text recycling that takes these factors into account. Topics include terminology, what is not text recycling, factors affecting judgements about the appropriateness of text recycling, and visual materials.
  2. When writing journal articles, STEM researchers produce a number of other genres such as grant proposals and conference posters, and their articles routinely build directly on their own prior work. As a result, STEM authors often reuse material from their completed documents in producing new documents. While this practice, known as text recycling (or self-plagiarism), is a debated issue in publishing and research ethics, little is known about researchers’ beliefs about what constitutes appropriate practice. This article presents results from an exploratory, survey-based study on beliefs and attitudes toward text recycling among STEM “experts” (faculty researchers) and “novices” (graduate students and post docs). While expert and novice researchers are fairly consistent in distinguishing between text recycling and plagiarism, there is considerable disagreement about appropriate text recycling practice.
  3. Schelble, Susan M ; Elkins, Kelly M (Ed.)
    Like most scientists, chemists frequently have reason to reuse some materials from their own published articles in new ones, especially when producing a series of closely related papers. Text recycling, the reuse of material from one’s own works, has become a source of considerable confusion and frustration for researchers and editors alike. While text recycling does not pose the same level of ethical concern as matters such as data fabrication or plagiarism, it is much more common and complicated. Much of the confusion stems from a lack of clarity and consistency in publisher guidelines and publishing contracts. Matters are even more complicated when manuscripts are coauthored by researchers residing in different countries. This chapter demonstrates the nature of these problems through an analysis of a set of documents from a single publisher, the American Chemical Society (ACS). The ACS was chosen because it is a leading publisher of chemistry research and because its guidelines and publishing contracts address text recycling in unusual detail. The present analysis takes advantage of this detail to show both the importance of clear, thoughtfully designed text recycling policies and the problems that can arise when publishers fail to bring their various documents into close alignment.
  4. Anwer, Nabil (Ed.)
    Design documentation is presumed to contain massive amounts of valuable information and expert knowledge that is useful for learning from past successes and failures. However, the current practice of documenting design in most industries does not result in big data that can support a true digital transformation of enterprise. Very little information on concepts and decisions in early product design has been digitally captured, and accessing and retrieving it via taxonomy-based knowledge management systems is very challenging because most rule-based classification and search systems cannot concurrently process heterogeneous data (text, figures, tables, references). When experts retire or leave a design unit, industry often cannot benefit from past knowledge for future product design, and is left to reinvent the wheel repeatedly. In this work, we present AI-based Natural Language Processing (NLP) models which are trained to contextually represent technical documents containing text, figures and tables, and to perform semantic search for the retrieval of relevant data across large corpora of documents. By connecting textual and non-textual data through the use of an associative database, the semantic search question-answering system we developed can provide more comprehensive answers in the context of users’ questions. For the demonstration and assessment of this model, the semantic search question-answering system is applied to the Intergovernmental Panel on Climate Change (IPCC) Special Report 2019, which is more than 600 pages long and difficult to read and understand, even for most experts. Users can input custom queries relating to climate change concerns and receive evidence from the report that is contextually meaningful. We expect this method can transform current repositories of design documentation in heterogeneous data forms into structured knowledge bases which can return relevant information efficiently and can evolve to embody manageable big data for the true digital transformation of design.
  5. Because science advances incrementally, scientists often need to repeat material included in their prior work when composing new texts. Such “text recycling” is a common but complex writing practice, so authors and editors need clear and consistent guidance about what constitutes appropriate practice. Unfortunately, publishers’ policies on text recycling to date have been incomplete, unclear, and sometimes internally inconsistent. Building on 4 years of research on text recycling in scientific writing, the Text Recycling Research Project has developed a model text recycling policy that should be widely applicable for research publications in scientific fields. This article lays out the challenges text recycling poses for editors and authors, describes key factors that were addressed in developing the policy, and explains the policy’s main features.