Commonly used data citation practices rely on unverifiable retrieval methods which are susceptible to “content drift”, which occurs when the data associated with an identifier have been allowed to change. Based on our earlier work on reliable dataset identifiers, we propose signed citations, i.e., customary data citations extended to also include a standards-based, verifiable, unique, and fixed-length digital content signature. We show that content signatures enable independent verification of the cited content and can improve the persistence of the citation. Because content signatures are location- and storage-medium-agnostic, cited data can be copied to new locations to ensure their persistence across current and future storage media and data networks. As a result, content signatures can be leveraged to help scalably store, locate, access, and independently verify content across new and existing data infrastructures. Content signatures can also be embedded inside content to create robust, distributed knowledge graphs that can be cited using a single signed citation. We describe real-world applications of signed citations used to cite and compile distributed data collections, cite specific versions of existing data networks, and stabilize associations between URLs and content.
more » « less- Award ID(s):
- 2027654
- NSF-PAR ID:
- 10518760
- Publisher / Repository:
- NSF Public Access Repository (NSF-PAR)
- Date Published:
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Abstract Commonly used data citation practices rely on unverifiable retrieval methods which are susceptible to content drift, which occurs when the data associated with an identifier have been allowed to change. Based on our earlier work on reliable dataset identifiers, we propose signed citations, i.e., customary data citations extended to also include a standards-based, verifiable, unique, and fixed-length digital content signature. We show that content signatures enable independent verification of the cited content and can improve the persistence of the citation. Because content signatures are location- and storage-medium-agnostic, cited data can be copied to new locations to ensure their persistence across current and future storage media and data networks. As a result, content signatures can be leveraged to help scalably store, locate, access, and independently verify content across new and existing data infrastructures. Content signatures can also be embedded inside content to create robust, distributed knowledge graphs that can be cited using a single signed citation. We describe applications of signed citations to solve real-world data collection, identification, and citation challenges.more » « less
-
Verifiable generation requires large language models (LLMs) to cite source documents supporting their outputs, thereby improve output transparency and trustworthiness. Yet, previous work mainly targets the generation of sentencelevel citations, lacking specificity about which part of the sentence is backed by which cited source. This work studies verifiable generation with subsentence-level fine-grained citations to locate the generated content that is supported by the cited sources in a more precise way. We first present a dataset, SCIFI, comprising 10K Wikipedia paragraphs with subsentence-level citations.1 Each paragraph in SCIFI is paired with a set of candidate source documents for citation and a query that triggers the generation of the paragraph content. On SCIFI, we then evaluate the performance of state-of-the-a rt LLMs and strategies for processing long documents designed for these models. Our experiment results reveal key factors that can enhance the quality of citations, including the expansion of the source documents’ context to be accessible to the models and the implementation of specialized model tuning.more » « less
-
A standard measure of the influence of a research paper is the number of times it is cited. However, papers may be cited for many reasons, and citation count offers limited information about the extent to which a paper affected the content of subsequent publications. We therefore propose a novel method to quantify linguistic influence in timestamped document collections. There are two main steps: first, identify lexical and semantic changes using contextual embeddings and word frequencies; second, aggregate information about these changes into per-document influence scores by estimating a high-dimensional Hawkes process with a low-rank parameter matrix. We show that this measure of linguistic influence is predictive of future citations: the estimate of linguistic influence from the two years after a paper’s publication is correlated with and predictive of its citation count in the following three years. This is demonstrated using an online evaluation with incremental temporal training/test splits, in comparison with a strong baseline that includes predictors for initial citation counts, topics, and lexical features.more » « less
-
Abstract Motivation Development of bioinformatics methods is a long, complex and resource-hungry process. Hundreds of these tools were released. While some methods are highly cited and used, many suffer relatively low citation rates. We empirically analyze a large collection of recently released methods in three diverse protein function and disorder prediction areas to identify key factors that contribute to increased citations.
Results We show that provision of a working web server significantly boosts citation rates. On average, methods with working web servers generate three times as many citations compared to tools that are available as only source code, have no code and no server, or are no longer available. This observation holds consistently across different research areas and publication years. We also find that differences in predictive performance are unlikely to impact citation rates. Overall, our empirical results suggest that a relatively low-cost investment into the provision and long-term support of web servers would substantially increase the impact of bioinformatics tools.
-
Traditional citation analysis methods have been criticized because their theoretical base of statistical counts does not reflect the motive or judgment of citing authors. In particular, self-citations may give undue credits to a cited article or mislead scientific development. This research aims to answer the question of whether self-citation is biased by probing into the motives and context of citations. It takes an integrated and fine-grained view of self-citations by examining them via multiple lenses—polarity, density, and location of citations. In addition, it explores potential moderating effects of citation level and associations among location contexts of citations to the same references for the first time. We analyzed academic publications across different topics and disciplines using both qualitative and quantitative methods. The results provide evidence that self-citations are free of bias in terms of citation density and polarity uncertainty, but they can be biased with respect to positivity and negativity of citations. Furthermore, this study reveals impacts of self-citing behavior on some citation patterns involving citation density, location concentration, and associations. The examination of self-citing behavior from those new perspectives shed new lights on the nature and function of self-citing behavior.more » « less