skip to main content


Title: Economic Worth-Aware Word Embeddings
Knowing the perceived economic value of words is often desirable for applications such as product naming and pricing. However, there is a lack of understanding on the underlying economic worths of words, even though we have seen some breakthrough on learning the semantics of words. In this work, we bridge this gap by proposing a joint-task neural network model, Word Worth Model (WWM), to learn word embedding that captures the underlying economic worths. Through the design of WWM, we incorporate contextual factors, e.g., product’s brand name and restaurant’s city, that may affect the aggregated monetary value of a textual item. Via a comprehensive evaluation, we show that, compared with other baselines, WWM accurately predicts missing words when given target words. We also show that the learned embeddings of both words and contextual factors reflect well the underlying economic worths through various visualization analyses.  more » « less
Award ID(s):
1717084
NSF-PAR ID:
10303742
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Proceedings of the 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA 2020)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Hacker forums provide malicious actors with a large database of tutorials, goods, and assets to leverage for cyber-attacks. Careful research of these forums can provide tremendous benefit to the cybersecurity community through trend identification and exploit categorization. This study aims to provide a novel static word embedding, Hack2Vec, to improve performance on hacker forum classification tasks. Our proposed Hack2Vec model distills contextual representations from the seminal pre-trained language model BERT to a continuous bag-of-words model to create a highly targeted hacker forum static word embedding. The results of our experimental design indicate that Hack2Vec improves performance over prominent embeddings in accuracy, precision, recall, and F1-score for a benchmark hacker forum classification task. 
    more » « less
  2. null (Ed.)
    The global coronavirus pandemic has raised important questions regarding how to balance public health concerns with privacy protections for individual citizens. In this essay, we evaluate contact tracing apps, which have been offered as a technological solution to minimize the spread of COVID-19. We argue that apps such as those built on Google and Apple’s “exposure notification system” should be evaluated in terms of the contextual integrity of information flows; in other words, the appropriateness of sharing health and location data will be contextually dependent on factors such as who will have access to data, as well as the transmission principles underlying data transfer. We also consider the role of prevailing social and political values in this assessment, including the large-scale social benefits that can be obtained through such information sharing. However, caution should be taken in violating contextual integrity, even in the case of a pandemic, because it risks a long-term loss of autonomy and growing function creep for surveillance and monitoring technologies. 
    more » « less
  3. Many changes in our digital corpus have been brought about by the interplay between rapid advances in digital communication and the current environment characterized by pandemics, political polarization, and social unrest. One such change is the pace with which new words enter the mass vocabulary and the frequency at which meanings, perceptions, and interpretations of existing expressions change. The current state-of-the-art algorithms do not allow for an intuitive and rigorous detection of these changes in word meanings over time. We propose a dynamic graph-theoretic approach to inferring the semantics of words and phrases (“terms”) and detecting temporal shifts. Our approach represents each term as a stochastic time-evolving set of contextual words and is a count-based distributional semantic model in nature. We use local clustering techniques to assess the structural changes in a given word’s contextual words. We demonstrate the efficacy of our method by investigating the changes in the semantics of the phrase “Chinavirus”. We conclude that the term took on a much more pejorative meaning when the White House used the term in the second half of March 2020, although the effect appears to have been temporary. We make both the dataset and the code used to generate this paper’s results available. 
    more » « less
  4. Proceedings of the Sixteenth (Ed.)
    Instead of mining coherent topics from a given text corpus in a completely unsupervised manner, seed-guided topic discovery methods leverage user-provided seed words to extract distinctive and coherent topics so that the mined topics can better cater to the user’s interest. To model the semantic correlation between words and seeds for discovering topic-indicative terms, existing seedguided approaches utilize different types of context signals, such as document-level word co-occurrences, sliding window-based local contexts, and generic linguistic knowledge brought by pre-trained language models. In this work, we analyze and show empirically that each type of context information has its value and limitation in modeling word semantics under seed guidance, but combining three types of contexts (i.e., word embeddings learned from local contexts, pre-trained language model representations obtained from general-domain training, and topic-indicative sentences retrieved based on seed information) allows them to complement each other for discovering quality topics. We propose an iterative framework, SeedTopicMine, which jointly learns from the three types of contexts and gradually fuses their context signals via an ensemble ranking process. Under various sets of seeds and on multiple datasets, SeedTopicMine consistently yields more coherent and accurate topics than existing seed-guided topic discovery approaches. 
    more » « less
  5. null (Ed.)
    Food ontologies require significant effort to create and maintain as they involve manual and time-consuming tasks, often with limited alignment to the underlying food science knowledge. We propose a semi-supervised framework for the automated ontology population from an existing ontology scaffold by using word embeddings. Having applied this on the domain of food and subsequent evaluation against an expert-curated ontology, FoodOn, we observe that the food word embeddings capture the latent relationships and characteristics of foods. The resulting ontology, which utilizes word embeddings trained from the Wikipedia corpus, has an improvement of 89.7% in precision when compared to the expert-curated ontology FoodOn (0.34 vs. 0.18, respectively, p value = 2.6 × 10 –138 ), and it has a 43.6% shorter path distance (hops) between predicted and actual food instances (2.91 vs. 5.16, respectively, p value = 4.7 × 10 –84 ) when compared to other methods. This work demonstrates how high-dimensional representations of food can be used to populate ontologies and paves the way for learning ontologies that integrate contextual information from a variety of sources and types. 
    more » « less