skip to main content


Title: Metaconcepts: Isolating Context in Word Embeddings
ord embeddings are commonly used to measure word-level semantic similarity in text, especially in direct word- to-word comparisons. However, the relationships between words in the embedding space are often viewed as approximately linear and concepts comprised of multiple words are a sort of linear combination. In this paper, we demonstrate that this is not generally true and show how the relationships can be better captured by leveraging the topology of the embedding space. We propose a technique for directly computing new vectors representing multiple words in a way that naturally combines them into a new, more consistent space where distance better correlates to similarity. We show that this technique works well for natural language, even when it comprises multiple words, on a simple task derived from WordNet synset descriptions and examples of words. Thus, the generated vectors better represent complex concepts in the word embedding space.  more » « less
Award ID(s):
1824198
NSF-PAR ID:
10108132
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR)
Page Range / eLocation ID:
544 to 549
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Networked data involve complex information from multifaceted channels, including topology structures, node content, and/or node labels etc., where structure and content are often correlated but are not always consistent. A typical scenario is the citation relationships in scholarly publications where a paper is cited by others not because they have the same content, but because they share one or multiple subject matters. To date, while many network embedding methods exist to take the node content into consideration, they all consider node content as simple flat word/attribute set and nodes sharing connections are assumed to have dependency with respect to all words or attributes. In this paper, we argue that considering topic-level semantic interactions between nodes is crucial to learn discriminative node embedding vectors. In order to model pairwise topic relevance between linked text nodes, we propose topical network embedding, where interactions between nodes are built on the shared latent topics. Accordingly, we propose a unified optimization framework to simultaneously learn topic and node representations from the network text contents and structures, respectively. Meanwhile, the structure modeling takes the learned topic representations as conditional context under the principle that two nodes can infer each other contingent on the shared latent topics. Experiments on three real-world datasets demonstrate that our approach can learn significantly better network representations, i.e., 4.1% improvement over the state-of-the-art methods in terms of Micro-F1 on Cora dataset. (The source code of the proposed method is available through the github link: https:// github.com/codeshareabc/TopicalNE.) 
    more » « less
  2. Social media is the ultimate challenge for many natural language processing tools. The constant emergence of linguistic constructs challenge even the most sophisticated NLP tools. Predicting word embeddings for out of vocabulary words is one of those challenges. Word embedding models only include terms that occur a sufficient number of times in their training corpora. Word embedding vector models are unable to directly provide any useful information about a word not in their vocabularies. We propose a fast method for predicting vectors for out of vocabulary terms that makes use of the surrounding terms of the unknown term and the hidden context layer of the word2vec model. We propose this method as a strong baseline in the sense that 1) while it does not surpass all state-of-the-art methods, it surpasses several techniques for vector prediction on benchmark tasks, 2) even when it underperforms, the margin is very small retaining competitive performance in downstream tasks, and 3) it is inexpensive to compute, requiring no additional training stage. We also show that our technique can be incorporated into existing methods to achieve a new state-of-the-art on the word vector prediction problem. 
    more » « less
  3. This paper focuses on a newly challenging setting in hard-label adversarial attacks on text data by taking the budget information into account. Although existing approaches can successfully generate adversarial examples in the hard-label setting, they follow an ideal assumption that the victim model does not restrict the number of queries. However, in real-world applications the query budget is usually tight or limited. Moreover, existing hard-label adversarial attack techniques use the genetic algorithm to optimize discrete text data by maintaining a number of adversarial candidates during optimization, which can lead to the problem of generating low-quality adversarial examples in the tight-budget setting. To solve this problem, in this paper, we propose a new method named TextHoaxer by formulating the budgeted hard-label adversarial attack task on text data as a gradient-based optimization problem of perturbation matrix in the continuous word embedding space. Compared with the genetic algorithm-based optimization, our solution only uses a single initialized adversarial example as the adversarial candidate for optimization, which significantly reduces the number of queries. The optimization is guided by a new objective function consisting of three terms, i.e., semantic similarity term, pair-wise perturbation constraint, and sparsity constraint. Semantic similarity term and pair-wise perturbation constraint can ensure the high semantic similarity of adversarial examples from both comprehensive text-level and individual word-level, while the sparsity constraint explicitly restricts the number of perturbed words, which is also helpful for enhancing the quality of generated text. We conduct extensive experiments on eight text datasets against three representative natural language models, and experimental results show that TextHoaxer can generate high-quality adversarial examples with higher semantic similarity and lower perturbation rate under the tight-budget setting. 
    more » « less
  4. Word embeddings, which represent words as dense feature vectors, are widely used in natural language processing. In their seminal paper on word2vec, Mikolov and colleagues showed that a feature space created by training a word prediction network on a large text corpus will encode semantic information that supports analogy by vector arithmetic, e.g., "king" minus "man" plus "woman" equals "queen". To help novices appreciate this idea, people have sought effective graphical representations of word embeddings.We describe a new interactive tool for visually exploring word embeddings. Our tool allows users to define semantic dimensions by specifying opposed word pairs, e.g., gender is defined by pairs such as boy/girl and father/mother, and age by pairs such as father/son and mother/daughter. Words are plotted as points in a zoomable and rotatable 3D space, where the third ”residual” dimension encodes distance from the hyperplane defined by all the opposed word vectors with age and gender subtracted out. Our tool allows users to visualize vector analogies, drawing the vector from “king” to “man” and a parallel vector from “woman” to “king-man+woman”, which is closest to “queen”. Visually browsing the embedding space and experimenting with this tool can make word embeddings more intuitive. We include a series of experiments teachers can use to help K-12 students appreciate the strengths and limitations of this representation. 
    more » « less
  5. Word embeddings, which represent words as dense feature vectors, are widely used in natural language processing. In their seminal paper on word2vec, Mikolov and colleagues showed that a feature space created by training a word prediction network on a large text corpus will encode semantic information that supports analogy by vector arithmetic, e.g., "king" minus "man" plus "woman" equals "queen". To help novices appreciate this idea, people have sought effective graphical representations of word embeddings.We describe a new interactive tool for visually exploring word embeddings. Our tool allows users to define semantic dimensions by specifying opposed word pairs, e.g., gender is defined by pairs such as boy/girl and father/mother, and age by pairs such as father/son and mother/daughter. Words are plotted as points in a zoomable and rotatable 3D space, where the third ”residual” dimension encodes distance from the hyperplane defined by all the opposed word vectors with age and gender subtracted out. Our tool allows users to visualize vector analogies, drawing the vector from “king” to “man” and a parallel vector from “woman” to “king-man+woman”, which is closest to “queen”. Visually browsing the embedding space and experimenting with this tool can make word embeddings more intuitive. We include a series of experiments teachers can use to help K-12 students appreciate the strengths and limitations of this representation. 
    more » « less