Title: LitStoryTeller+: an interactive system for multi-level scientific paper visual storytelling with a supportive text mining toolbox
The continuing growth of scientific publications has posed a double challenge to researchers: they must not only grasp the overall research trends in a scientific domain, but also get down to the research details embedded in a collection of core papers. Existing work on science mapping provides multiple tools to visualize research trends in a domain at the macro level, and work from the digital humanities has proposed text visualizations of documents, topics, sentences, and words at the micro level. However, existing micro-level text visualizations are not tailored for scientific paper corpora and cannot support meso-level scientific reading, which aligns a set of core papers based on their research progress before drilling down to individual papers. To bridge this gap, the present paper proposes LitStoryTeller+, an interactive system under a unified framework that supports both meso-level and micro-level scientific paper visual storytelling. More specifically, we use entities (concepts and terminologies) as basic visual elements and visualize entity storylines across papers and within a paper, borrowing metaphors from screenplays. To identify entities and entity communities, named entity recognition and community detection are performed. We also employ a variety of text mining methods, such as extractive text summarization and comparative sentence classification, to provide rich textual information supplementary to our visualizations. We further propose a top-down story-reading strategy that best takes advantage of our system. Two comprehensive hypothetical walkthroughs exploring documents from the computer science and history domains with our system demonstrate the effectiveness of our story-reading strategy and the usefulness of LitStoryTeller+.
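For illustration only, the sketch below shows one way the entity-centric preprocessing described above could look in practice: off-the-shelf named entity recognition over the core papers, followed by modularity-based community detection on an entity co-occurrence graph. The spaCy model, co-occurrence weighting, and function name are assumptions, not the paper's actual implementation.

    # Hypothetical sketch: extract entities from paper texts and group them into communities.
    import itertools
    import networkx as nx
    import spacy
    from networkx.algorithms.community import greedy_modularity_communities

    nlp = spacy.load("en_core_web_sm")  # assumption: a general-purpose NER model stands in for a domain-specific one

    def entity_communities(paper_texts):
        """paper_texts: list of raw strings, e.g. abstracts of the core papers."""
        graph = nx.Graph()
        for text in paper_texts:
            entities = {ent.text.lower() for ent in nlp(text).ents}
            # connect every pair of entities that co-occur in the same paper
            for a, b in itertools.combinations(sorted(entities), 2):
                weight = graph.get_edge_data(a, b, default={"weight": 0})["weight"]
                graph.add_edge(a, b, weight=weight + 1)
        # entity communities: groups of entities that tend to appear together across papers
        return list(greedy_modularity_communities(graph, weight="weight"))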
Award ID(s):
1633286
PAR ID:
10063055
Author(s) / Creator(s):
Date Published:
Journal Name:
Scientometrics
ISSN:
0138-9130
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Recognizing entity synonyms from text has become a crucial task in many entity-leveraging applications. However, discovering entity synonyms from domain-specific text corpora (e.g., news articles, scientific papers) is rather challenging. Current systems take an entity name string as input and find other names that are synonymous, ignoring the fact that a name string can often refer to multiple entities (e.g., “apple” could refer to both Apple Inc. and the fruit apple). Moreover, most existing methods require training data manually created by domain experts to construct supervised learning systems. In this paper, we study the problem of automatic synonym discovery with knowledge bases, that is, identifying synonyms for knowledge base entities in a given domain-specific corpus. The manually curated synonyms for each entity stored in a knowledge base not only form a set of name strings that disambiguate one another, but can also serve as “distant” supervision to help determine important features for the task. We propose a novel framework, called DPE, to integrate two kinds of mutually complementary signals for synonym discovery, i.e., distributional features based on corpus-level statistics and textual patterns based on local contexts. In particular, DPE jointly optimizes the two kinds of signals in conjunction with distant supervision, so that they can mutually enhance each other in the training stage. At the inference stage, both signals are utilized to discover synonyms for the given entities. Experimental results demonstrate the effectiveness of the proposed framework.
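    As a toy illustration of the two complementary signals DPE integrates (the actual framework trains them jointly under distant supervision), the sketch below scores synonym candidates by combining embedding similarity with a simple synonym-indicating textual pattern; the embeddings, pattern, and weighting are hypothetical.

    # Illustrative only: combine a distributional signal with a local textual-pattern signal.
    import re
    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    def pattern_signal(query, candidate, sentences):
        # local-context signal: does any sentence state the synonymy explicitly?
        pattern = re.compile(
            rf"{re.escape(query)},? (?:also known as|a\.k\.a\.|aka) {re.escape(candidate)}",
            re.IGNORECASE)
        return int(any(pattern.search(s) for s in sentences))

    def rank_synonym_candidates(query, candidates, embeddings, sentences, alpha=0.7):
        """embeddings: dict mapping surface strings to vectors (assumed precomputed)."""
        scored = []
        for cand in candidates:
            distributional = cosine(embeddings[query], embeddings[cand])  # corpus-level signal
            textual = pattern_signal(query, cand, sentences)              # local-pattern signal
            scored.append((cand, alpha * distributional + (1 - alpha) * textual))
        return sorted(scored, key=lambda pair: pair[1], reverse=True)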
  2. Acknowledgements are ubiquitous in scholarly papers. Existing acknowledgement entity recognition methods assume all named entities are acknowledged. Here, we examine the nuances between acknowledged entities and named entities by analyzing sentence structure. We develop an acknowledgement extraction system, ACKEXTRACT, based on open-source text mining software, and evaluate our method using manually labeled data. ACKEXTRACT takes the PDF of a scholarly paper as input and outputs acknowledgement entities. Results show an overall performance of F1 = 0.92. We build a supplementary database by linking CORD-19 papers with the acknowledgement entities extracted by ACKEXTRACT, including persons and organizations, and find that only up to 50–60% of named entities are actually acknowledged. We further analyze chronological trends of acknowledgement entities in CORD-19 papers. All code and labeled data are publicly available at https://github.com/lamps-lab/ackextract.
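    A much-simplified stand-in for an acknowledgement extraction pipeline of this kind (not the actual ACKEXTRACT code, which parses PDFs and analyzes sentence structure): locate the acknowledgements section in extracted text and keep the person and organization entities found there. The regex heuristic and spaCy model are assumptions.

    # Simplified illustration: pull PERSON/ORG entities from an acknowledgements section.
    import re
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumption: an installed NER model

    def acknowledged_entities(paper_text):
        """paper_text: plain text already extracted from a scholarly PDF."""
        # crude heuristic for locating the acknowledgements section
        match = re.search(r"acknowledge?ments?\b(.*?)(?:\n\s*references\b|\Z)",
                          paper_text, flags=re.IGNORECASE | re.DOTALL)
        if not match:
            return []
        section = match.group(1)
        return [(ent.text, ent.label_) for ent in nlp(section).ents
                if ent.label_ in {"PERSON", "ORG"}]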
  3. Campos, Ricardo; Jorge, Alípio Mário; Jatowt, Adam; Bhatia, Sumit; Finlayson, Mark (Ed.)
    A crucial step in the construction of any event story or news report is to identify the entities involved in the story; such entities can come from a larger background knowledge graph or from a text corpus with entity links. Along with recognizing which entities are relevant to the story, it is also important to select entities that are relevant to all aspects of the story. In this work, we model and study different types of links between the entities with the goal of identifying which link type is most useful for the entity retrieval task. Our approach demonstrates the e
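    One minimal way to compare link types for entity retrieval, assuming the background knowledge is available as (head, link type, tail) triples, is to rank candidate entities by their typed connections to the story's seed entities and evaluate each link type's ranking separately; the sketch below illustrates that idea and is not the paper's model.

    # Hypothetical sketch: rank candidates by typed connections to seed entities.
    from collections import Counter

    def rank_by_link_type(seed_entities, triples, link_type):
        """triples: iterable of (head, link_type, tail); seed_entities: entities already in the story."""
        seeds = set(seed_entities)
        scores = Counter()
        for head, ltype, tail in triples:
            if ltype != link_type:
                continue
            if head in seeds and tail not in seeds:
                scores[tail] += 1
            elif tail in seeds and head not in seeds:
                scores[head] += 1
        return scores.most_common()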
  4. Machine learning interatomic potentials (MLIPs) are an emerging technique that has helped achieve molecular dynamics simulations with an unprecedented balance between efficiency and accuracy. Recently, the body of MLIP literature has been growing rapidly, which creates a need to automatically process the relevant information so that researchers can understand and utilize it. Named entity recognition (NER), a natural language processing technique that identifies and categorizes information from texts, may help summarize key approaches and findings of relevant papers. In this work, we develop an NER model for MLIP literature by fine-tuning a pre-trained language model. To streamline text annotation, we build a user-friendly web application for annotation and proofreading, which is seamlessly integrated into the training procedure. Our model identifies technical entities in new MLIP paper abstracts with an F1 score of 0.8 using only 60 training abstracts, and with an F1 score of up to 0.75 for scientific texts on other topics. Notably, some “errors” in predictions are actually reasonable decisions, showcasing the model's ability beyond what the performance metrics indicate. This work demonstrates the linguistic capabilities of the NER approach in processing textual information of a specific scientific domain and has the potential to accelerate materials research using language models and contribute to a user-centric workflow.
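    A minimal sketch of fine-tuning a pre-trained language model for token classification with the Hugging Face transformers library, in the spirit of the workflow described above; the model name, label set, hyperparameters, and dataset preparation are placeholders rather than the authors' actual setup.

    # Sketch only: fine-tune an encoder for NER on annotated paper abstracts.
    from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                              DataCollatorForTokenClassification, Trainer, TrainingArguments)

    MODEL_NAME = "allenai/scibert_scivocab_uncased"  # assumption: a science-domain encoder
    LABELS = ["O", "B-METHOD", "I-METHOD", "B-MATERIAL", "I-MATERIAL"]  # hypothetical tag set

    def finetune_mlip_ner(train_dataset, eval_dataset):
        """Datasets are assumed to hold tokenized abstracts with aligned label ids."""
        tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=len(LABELS))
        args = TrainingArguments(output_dir="mlip-ner", num_train_epochs=5,
                                 per_device_train_batch_size=8)
        trainer = Trainer(model=model, args=args,
                          train_dataset=train_dataset, eval_dataset=eval_dataset,
                          data_collator=DataCollatorForTokenClassification(tokenizer))
        trainer.train()
        return trainer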
  5. Due to the large number of entities in biomedical knowledge bases, only a small fraction of entities have corresponding labelled training data. This necessitates entity linking models that are able to link mentions of unseen entities using learned representations of entities. Previous approaches link each mention independently, ignoring the relationships between entity mentions within and across documents. These relations can be very useful for linking mentions in biomedical text, where linking decisions are often difficult because mentions have either a generic or a highly specialized form. In this paper, we introduce a model in which linking decisions can be made not merely by linking to a knowledge base entity but also by grouping multiple mentions together via clustering and jointly making linking predictions. In experiments on the largest publicly available biomedical dataset, we improve upon the best independent-prediction model for entity linking by 3.0 points of accuracy, and our clustering-based inference model further improves entity linking by 2.3 points.
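    A hedged sketch of the general clustering-then-linking idea (not the paper's model): embed mentions, cluster them across documents by similarity, and link each cluster jointly to its nearest knowledge-base entity, so that related mentions share one decision. The embeddings, clustering threshold, and scikit-learn usage are assumptions.

    # Sketch only: joint linking decisions via mention clustering.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def link_mentions(mention_vecs, entity_vecs, entity_ids, distance_threshold=0.4):
        """mention_vecs: (m, d) array; entity_vecs: (e, d) array; entity_ids: e knowledge-base ids."""
        # group mentions within and across documents by embedding similarity
        # (scikit-learn >= 1.2 uses `metric`; older versions call this parameter `affinity`)
        clusterer = AgglomerativeClustering(n_clusters=None, metric="cosine", linkage="average",
                                            distance_threshold=distance_threshold)
        cluster_ids = clusterer.fit_predict(mention_vecs)

        # one joint decision per cluster: link the cluster centroid to its nearest entity
        entity_unit = entity_vecs / np.linalg.norm(entity_vecs, axis=1, keepdims=True)
        cluster_to_entity = {}
        for c in np.unique(cluster_ids):
            centroid = mention_vecs[cluster_ids == c].mean(axis=0)
            centroid = centroid / (np.linalg.norm(centroid) + 1e-12)
            cluster_to_entity[c] = entity_ids[int(np.argmax(entity_unit @ centroid))]
        # every mention inherits its cluster's linking decision
        return [cluster_to_entity[c] for c in cluster_ids]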