Parsimonious Morpheme Segmentation with an Application to Enriching Word Embeddings
Traditionally, many text-mining tasks treat individual word tokens as the finest meaningful semantic granularity. However, in many languages and specialized corpora, words are composed by concatenating semantically meaningful subword structures. Word-level analysis cannot leverage the semantic information present in such subword structures. With regard to word embedding techniques, this leads not only to poor embeddings for infrequent words in long-tailed text corpora but also to weak capabilities for handling out-of-vocabulary words. In this paper we propose MorphMine for unsupervised morpheme segmentation. MorphMine applies a parsimony criterion to hierarchically segment words into the fewest morphemes at each level of the hierarchy. This leads to longer shared morphemes at each level of segmentation. Experiments show that MorphMine segments words in a variety of languages into human-verified morphemes. Additionally, we experimentally demonstrate that utilizing MorphMine morphemes to enrich word embeddings consistently improves embedding quality on a variety of embedding evaluations and a downstream language-modeling task.
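The abstract only sketches the parsimony criterion, so the following toy Python sketch illustrates one way such a criterion can drive segmentation; it is not the authors' MorphMine algorithm, and the names `candidate_morphemes`, `segment`, and `min_support` are illustrative assumptions. Candidate morphemes are substrings shared across the vocabulary, and each word is split into the fewest such pieces by dynamic programming.

```python
# A minimal sketch of parsimony-driven morpheme segmentation in the spirit of
# MorphMine, NOT the authors' exact algorithm. Candidate morphemes are
# substrings shared by at least `min_support` vocabulary words; each word is
# split into the fewest such pieces, breaking ties toward more widely shared
# pieces. All names here are illustrative, not from the paper.
from collections import Counter

def candidate_morphemes(vocab, min_support=2, min_len=2):
    """Count in how many words each substring occurs; keep the shared ones."""
    support = Counter()
    for word in vocab:
        support.update({word[i:j]
                        for i in range(len(word))
                        for j in range(i + min_len, len(word) + 1)})
    return {s: c for s, c in support.items() if c >= min_support}

def segment(word, support):
    """Split `word` into the fewest candidate pieces (the parsimony criterion),
    preferring more widely shared pieces on ties; fall back to the whole word."""
    n = len(word)
    INF = (n + 1, 0)
    best = [INF] * (n + 1)          # (num_pieces, -total_support) for word[:j]
    back = [0] * (n + 1)
    best[0] = (0, 0)
    for j in range(1, n + 1):
        for i in range(j):
            if i == 0 and j == n:   # skip the trivial whole-word "cover"
                continue
            piece = word[i:j]
            if piece in support and best[i] != INF:
                cand = (best[i][0] + 1, best[i][1] - support[piece])
                if cand < best[j]:
                    best[j], back[j] = cand, i
    if best[n] == INF:
        return [word]               # no proper cover: keep the word intact
    pieces, j = [], n
    while j > 0:
        pieces.append(word[back[j]:j])
        j = back[j]
    return pieces[::-1]

vocab = ["walked", "walking", "talked", "talking", "jumped", "jumping"]
support = candidate_morphemes(vocab)
print(segment("walked", support))    # -> ['walk', 'ed']
print(segment("walking", support))   # -> ['walk', 'ing']
```

Applying `segment` recursively to each returned piece would produce the kind of segmentation hierarchy the abstract describes, with longer shared morphemes at shallower levels; a real implementation would also need a better-grounded candidate filter than raw substring counts.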
- Publication Date:
- NSF-PAR ID:
- 10160143
- Journal Name:
- 2019 IEEE International Conference on Big Data (Big Data)
- Volume:
- 1
- Issue:
- 1
- Page Range or eLocation-ID:
- 64 to 73
- Sponsoring Org:
- National Science Foundation
More Like this
-
Many techniques in modern computational linguistics and natural language processing (NLP) make the assumption that approaches that work well on English and other widely used European (and sometimes Asian) languages are “language agnostic” – that is, that they will also work across the typologically diverse languages of the world. In high-resource languages, especially those that are analytic rather than synthetic, a common approach is to treat morphologically distinct variants of a common root (such as dog and dogs) as completely independent word types. Doing so relies on two main assumptions: that there exist a limited number of morphological inflections for any…
-
Recent years have witnessed the enormous success of text representation learning in a wide range of text mining tasks. Earlier word embedding learning approaches represent words as fixed low-dimensional vectors to capture their semantics. The word embeddings so learned are used as the input features of task-specific models. Recently, pre-trained language models (PLMs), which learn universal language representations via pre-training Transformer-based neural models on large-scale text corpora, have revolutionized the natural language processing (NLP) field. Such pre-trained representations encode generic linguistic features that can be transferred to almost any text-related applications. PLMs outperform previous task-specific models in many applications as…
-
Networked data involve complex information from multifaceted channels, including topology structures, node content, and/or node labels, where structure and content are often correlated but are not always consistent. A typical scenario is the citation relationships in scholarly publications, where a paper is cited by others not because they have the same content, but because they share one or multiple subject matters. To date, while many network embedding methods exist to take the node content into consideration, they all consider node content as a simple flat word/attribute set, and nodes sharing connections are assumed to have dependency with respect to all…
-
Word embeddings are commonly used to measure word-level semantic similarity in text, especially in direct word-to-word comparisons. However, the relationships between words in the embedding space are often viewed as approximately linear, and concepts composed of multiple words as a sort of linear combination. In this paper, we demonstrate that this is not generally true and show how the relationships can be better captured by leveraging the topology of the embedding space. We propose a technique for directly computing new vectors representing multiple words in a way that naturally combines them into a new, more consistent space where distance…
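Since this excerpt stops mid-sentence, a brief sketch of the linear-combination baseline it refers to may help; the vectors and names below are hypothetical, not from the paper. A multi-word concept is represented by averaging its word vectors, and word-to-word similarity is the cosine between vectors.

```python
# A small sketch of the linear-combination baseline this abstract questions:
# a multi-word concept is represented by averaging its word vectors, and
# similarity is the cosine between vectors. The toy 3-d vectors below are made
# up for illustration; a real system would load pretrained embeddings.
import numpy as np

def cosine(u, v):
    """Cosine similarity: the standard word-to-word comparison."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

embeddings = {                      # hypothetical vectors, not real embeddings
    "ice":    np.array([0.9, 0.1, 0.0]),
    "cream":  np.array([0.2, 0.8, 0.1]),
    "gelato": np.array([0.6, 0.6, 0.1]),
}

# Linear-combination representation of the concept "ice cream".
concept = (embeddings["ice"] + embeddings["cream"]) / 2
print(cosine(concept, embeddings["gelato"]))  # treats the concept as linear
```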