

Title: The long COVID research literature
While the COVID-19 pandemic morphs into less malignant forms, the virus has spawned a series of poorly understood, post-infection symptoms with staggering ramifications, i.e., long COVID (LC). This bibliometric study profiles the rapidly growing LC research domain [5,243 articles from PubMed and Web of Science (WoS)] to make its knowledge content more accessible. The article addresses What? Where? Who? and When? questions. A 13-topic Concept Grid presents bottom-up topic clusters. We break out those topics with other data fields, including disciplinary concentrations, topical details, and information on research “players” (countries, institutions, and authors) engaging in those topics. We provide access to results via a Dashboard website. We find a strongly growing, multidisciplinary LC research domain. That domain appears tightly connected based on shared research knowledge. However, we also observe notable concentrations of research activity in different disciplines. Data trends over 3 years of LC research suggest heightened attention to psychological and neurodegenerative symptoms, fatigue, and pulmonary involvement.
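The bottom-up topic clustering behind a Concept Grid can be sketched in miniature: group article records whose keyword sets overlap. This is a toy illustration, not the authors' actual method; the record fields, keyword sets, and overlap threshold are all made-up assumptions.

```python
# Hypothetical sketch: bottom-up clustering of article records by shared
# keywords, via union-find over pairwise keyword overlap. Field names and
# the min_shared threshold are illustrative assumptions.

def cluster_articles(records, min_shared=2):
    """Group articles whose keyword sets share at least `min_shared` terms."""
    n = len(records)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(n):
        for j in range(i + 1, n):
            if len(records[i]["keywords"] & records[j]["keywords"]) >= min_shared:
                union(i, j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(records[i]["id"])
    return list(clusters.values())

articles = [
    {"id": "A1", "keywords": {"fatigue", "long covid", "symptoms"}},
    {"id": "A2", "keywords": {"fatigue", "long covid", "cognition"}},
    {"id": "A3", "keywords": {"pulmonary", "imaging", "fibrosis"}},
]
print(cluster_articles(articles))  # A1 and A2 share two keywords; A3 stands alone
```

Real bibliometric pipelines cluster on richer signals (co-citation, shared references, term co-occurrence), but the bottom-up grouping step has this shape.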
Award ID(s):
1759960
PAR ID:
10402919
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Frontiers in Research Metrics and Analytics
Volume:
8
ISSN:
2504-0537
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Covid-19 has been an unprecedented challenge that disruptively reshaped societies and brought a massive amount of novel knowledge to the scientific community. However, as this knowledge flood has surged, researchers have been disadvantaged by not having access to a platform that can quickly synthesize rapidly emerging information and link the expertise it contains to established knowledge foundations. Aiming to fill this gap, in this paper we propose a research framework that can assist scientists in identifying, retrieving, and understanding Covid-19 knowledge from the ocean of scholarly articles. Incorporating Principal Component Decomposition (PCD), a knowledge model based on text analytics, and hierarchical topic tree analysis, the proposed framework profiles the research landscape, retrieves topic-specific knowledge and visualizes knowledge structures. Addressing 127,971 Covid-19 research papers from PubMed, our PCD topic analysis identifies 35 research hotspots, along with their correlations and trends. The hierarchical topic tree analysis further segments the knowledge landscape of the whole dataset into clinical and public health branches at a macro level. To supplement this analysis, we also built a knowledge model from research papers on vaccinations and fetched 92,286 pre-Covid publications as the established knowledge foundation for reference. The hierarchical topic tree analysis results on the retrieved papers show multiple relevant biomedical disciplines and four future research topics: monoclonal antibody treatments, vaccinations in diabetic patients, vaccine immunity effectiveness and durability, and vaccination-related allergic sensitization.
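One common reading of principal-component-style topic extraction is a truncated SVD of a term-document matrix, where the leading singular vectors act as topic axes. The sketch below assumes that reading; the toy corpus, term list, and two-component cut are invented and do not reproduce the paper's PCD pipeline.

```python
import numpy as np

# Minimal sketch of topic extraction via truncated SVD on a document-term
# count matrix (one plausible reading of principal-component decomposition;
# the corpus and vocabulary below are illustrative assumptions).

terms = ["vaccine", "antibody", "mask", "school", "icu", "ventilator"]
# Rows = documents, columns = terms (toy counts).
X = np.array([
    [4, 3, 0, 0, 0, 0],   # immunology-flavoured doc
    [3, 5, 0, 0, 1, 0],
    [0, 0, 4, 5, 0, 0],   # public-health-flavoured doc
    [0, 1, 3, 4, 0, 0],
    [0, 0, 0, 0, 5, 4],   # clinical-care-flavoured doc
], dtype=float)

# Centre columns, then take the leading singular vectors as "topic" axes.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2  # number of components to keep
for comp in Vt[:k]:
    top = [terms[i] for i in np.argsort(-np.abs(comp))[:2]]
    print(top)  # highest-loading terms for each component
```

Each row of `Vt[:k]` is a direction in term space; documents project onto those directions via `U[:, :k] * s[:k]`, which gives the document-by-topic view a landscape profile is built from.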
  2. Cybersecurity threats continue to increase and are impacting almost all aspects of modern life. Being aware of how vulnerabilities and their exploits are changing gives helpful insights into combating new threats. Applying dynamic topic modeling to a time-stamped cybersecurity document collection shows how the significance and details of concepts found in them are evolving. We correlate two different temporal corpora, one with reports about specific exploits and the other with research-oriented papers on cybersecurity vulnerabilities and threats. We represent the documents, concepts, and dynamic topic modeling data in a semantic knowledge graph to support integration, inference, and discovery. A critical insight into discovering knowledge through topic modeling is seeding the knowledge graph with domain concepts to guide the modeling process. We use Wikipedia concepts to provide a basis for performing concept phrase extraction and show how using those phrases improves the quality of the topic models. Researchers can query the resulting knowledge graph to reveal important relations and trends. This work is novel because it uses topics as a bridge to relate documents across corpora over time. 
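A much-simplified stand-in for the dynamic, concept-seeded modeling described above is to track how often seeded concept phrases occur in each time slice of a corpus. Everything below is illustrative: the phrases, documents, and years are invented, and real dynamic topic models evolve full word distributions rather than counting fixed phrases.

```python
from collections import Counter

# Toy sketch: frequency of seeded concept phrases per time slice, a
# simplified stand-in for dynamic topic modeling over a time-stamped
# cybersecurity corpus. Phrases and documents are made up.

seed_phrases = ["buffer overflow", "sql injection", "phishing"]

docs = [
    (2019, "classic buffer overflow in legacy c code"),
    (2019, "sql injection found in login form"),
    (2021, "phishing campaign used lookalike domains"),
    (2021, "another phishing wave and a buffer overflow exploit"),
]

def phrase_counts_by_year(docs, phrases):
    """Return {year: Counter(phrase -> occurrences)} over the corpus."""
    by_year = {}
    for year, text in docs:
        counts = by_year.setdefault(year, Counter())
        for p in phrases:
            counts[p] += text.count(p)
    return by_year

trend = phrase_counts_by_year(docs, seed_phrases)
print(trend[2019]["sql injection"], trend[2021]["phishing"])  # 1 2
```

Comparing a phrase's counts across slices is the simplest way to see the "significance of concepts evolving" that the abstract describes; seeding with curated concept phrases (e.g. drawn from Wikipedia) is what keeps the trends interpretable.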
  3. Researchers using social media data want to understand the discussions occurring in and about their respective fields. These domain experts often turn to topic models to help them see the entire landscape of the conversation, but unsupervised topic models often produce topic sets that miss topics experts expect or want to see. To solve this problem, we propose Guided Topic-Noise Model (GTM), a semi-supervised topic model designed with large domain-specific social media data sets in mind. The input to GTM is a set of topics that are of interest to the user and a small number of words or phrases that belong to those topics. These seed topics are used to guide the topic generation process, and can be augmented interactively, expanding the seed word list as the model provides new relevant words for different topics. GTM uses a novel initialization and a new sampling algorithm called Generalized Polya Urn (GPU) seed word sampling to produce a topic set that includes expanded seed topics, as well as new unsupervised topics. We demonstrate the robustness of GTM on open-ended responses from a public opinion survey and four domain-specific Twitter data sets. 
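The Generalized Pólya Urn idea behind seed-word sampling can be illustrated in isolation: when a seed word is assigned to its topic, the counts of the other words in the same seed set are boosted too, so related words rise together. This toy is not the GTM sampler itself; the seed set, vocabulary, and boost value are invented.

```python
import random

# Simplified sketch of a Generalized Polya Urn (GPU) update for seed words:
# drawing one seed word for a topic also "returns extra balls" for the other
# words in the same seed set. Illustrative only, not the paper's GTM sampler.

seed_sets = {0: {"economy", "jobs", "inflation"}}   # topic 0 is user-seeded
vocab = ["economy", "jobs", "inflation", "music", "sports"]
n_topics = 2
topic_word = {t: {w: 1.0 for w in vocab} for t in range(n_topics)}  # smoothed counts

def gpu_update(topic, word, boost=0.3):
    """Polya-urn update: increment the drawn word, plus its seed-set neighbours."""
    topic_word[topic][word] += 1.0
    for related in seed_sets.get(topic, set()):
        if related != word:
            topic_word[topic][related] += boost

random.seed(0)
for _ in range(20):
    gpu_update(0, random.choice(["economy", "jobs"]))

# "inflation" was never drawn directly, but its seed-set neighbours pulled
# its topic-0 count up past unrelated words like "music".
print(topic_word[0]["inflation"] > topic_word[0]["music"])  # True
```

This coupling is why seeded topics expand coherently: probability mass flows to the whole seed set rather than only to the words the user happened to list.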
  4. Machine learning techniques underlying Big Data analytics have the potential to benefit data intensive communities in e.g., bioinformatics and neuroscience domain sciences. Today’s innovative advances in these domain communities are increasingly built upon multi-disciplinary knowledge discovery and cross-domain collaborations. Consequently, shortened time to knowledge discovery is a challenge when investigating new methods, developing new tools, or integrating datasets. The challenge for a domain scientist particularly lies in the actions to obtain guidance through query of massive information from diverse text corpus comprising of a wide-ranging set of topics. In this paper, we propose a novel “domain-specific topic model” (DSTM) that can drive conversational agents for users to discover latent knowledge patterns about relationships among research topics, tools and datasets from exemplar scientific domains. The goal of DSTM is to perform data mining to obtain meaningful guidance via a chatbot for domain scientists to choose the relevant tools or datasets pertinent to solving a computational and data intensive research problem at hand. Our DSTM is a Bayesian hierarchical model that extends the Latent Dirichlet Allocation (LDA) model and uses a Markov chain Monte Carlo algorithm to infer latent patterns within a specific domain in an unsupervised manner. We apply our DSTM to large collections of data from bioinformatics and neuroscience domains that include hundreds of papers from reputed journal archives, hundreds of tools and datasets. Through evaluation experiments with a perplexity metric, we show that our model has better generalization performance within a domain for discovering highly specific latent topics. 
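DSTM extends LDA with MCMC inference, and the core of that machinery, a collapsed Gibbs sampler for vanilla LDA, fits in a few dozen lines. The corpus, hyperparameters, and two-topic setup below are toy assumptions; the paper's model adds domain-specific structure on top of this base.

```python
import random

# Toy collapsed Gibbs sampler for vanilla LDA (the model DSTM extends).
# Corpus and hyperparameters are illustrative assumptions.

docs = [
    ["gene", "protein", "gene", "cell"],
    ["neuron", "spike", "neuron", "brain"],
    ["gene", "cell", "protein"],
]
K, alpha, beta = 2, 0.1, 0.01
vocab = sorted({w for d in docs for w in d})
V = len(vocab)

random.seed(1)
z = [[random.randrange(K) for _ in d] for d in docs]   # per-token topic labels
ndk = [[0] * K for _ in docs]                           # doc-topic counts
nkw = [{w: 0 for w in vocab} for _ in range(K)]         # topic-word counts
nk = [0] * K                                            # topic totals
for di, d in enumerate(docs):
    for wi, w in enumerate(d):
        t = z[di][wi]
        ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1

for _ in range(200):                                    # Gibbs sweeps
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            t = z[di][wi]
            ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1  # remove token
            # P(topic k | rest) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
            weights = [(ndk[di][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                       for k in range(K)]
            t = random.choices(range(K), weights=weights)[0]
            z[di][wi] = t
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1  # reinsert token

top_words = [max(nkw[k], key=nkw[k].get) for k in range(K)]
print(top_words)  # highest-count word per inferred topic
```

A hierarchical extension like DSTM keeps this resample-one-token-at-a-time loop but conditions the topic weights on extra latent variables (e.g. which domain, tool, or dataset a document concerns).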
  5. Background Internet data can be used to improve infectious disease models. However, the representativeness and individual-level validity of internet-derived measures are largely unexplored as this requires ground truth data for study. Objective This study sought to identify relationships between Web-based behaviors and/or conversation topics and health status using a ground truth, survey-based dataset. Methods This study leveraged a unique dataset of self-reported surveys, microbiological laboratory tests, and social media data from the same individuals toward understanding the validity of individual-level constructs pertaining to influenza-like illness in social media data. Logistic regression models were used to identify illness in Twitter posts using user posting behaviors and topic model features extracted from users’ tweets. Results Of 396 original study participants, only 81 met the inclusion criteria for this study. Of these participants’ tweets, we identified only two instances that were related to health and occurred within 2 weeks (before or after) of a survey indicating symptoms. It was not possible to predict when participants reported symptoms using features derived from topic models (area under the curve [AUC]=0.51; P=.38), though it was possible using behavior features, albeit with a very small effect size (AUC=0.53; P≤.001). Individual symptoms were also generally not predictable. The study sample and a random sample from Twitter are predictably different on held-out data (AUC=0.67; P≤.001), meaning that the content posted by people who participated in this study was predictably different from that posted by random Twitter users. Individuals in the random sample and the GoViral sample used Twitter with similar frequencies (similar @ mentions, number of tweets, and number of retweets; AUC=0.50; P=.19).
Conclusions To our knowledge, this is the first instance of an attempt to use a ground truth dataset to validate infectious disease observations in social media data. The lack of signal, the lack of predictability among behaviors or topics, and the demonstrated volunteer bias in the study population are important findings for the large and growing body of disease surveillance using internet-sourced data. 
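The AUC values this abstract leans on (0.51 ≈ chance, 0.67 = detectable difference) come straight from the rank-based definition: the probability that a randomly chosen positive outranks a randomly chosen negative. A minimal sketch of that computation, with made-up scores and labels rather than study data:

```python
# Minimal sketch: AUC computed from classifier scores via the Mann-Whitney
# formulation. Labels and scores below are invented, not the study's data.

def auc(labels, scores):
    """Fraction of positive/negative pairs where the positive scores higher
    (ties count as half). Labels are 1 (positive) or 0 (negative)."""
    pairs = 0.0
    wins = 0.0
    for lp, sp in zip(labels, scores):
        if lp != 1:
            continue
        for ln, sn in zip(labels, scores):
            if ln != 0:
                continue
            pairs += 1
            if sp > sn:
                wins += 1
            elif sp == sn:
                wins += 0.5
    return wins / pairs

labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.6, 0.2, 0.4, 0.3]
print(round(auc(labels, scores), 3))  # 0.833
```

An AUC near 0.5 means the scores carry no ranking information at all, which is why the topic-model result (AUC=0.51) supports the paper's "lack of signal" conclusion while AUC=0.67 indicates a real, if modest, separation.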