Evaluating the Stability of Embedding-based Word Similarities

Antoniak, Maria; Mimno, David

Citation Details

Word embeddings are increasingly being used as a tool to study word associations in specific corpora. However, it is unclear whether such embeddings reflect enduring properties of language or if they are sensitive to inconsequential variations in the source documents. We find that nearest-neighbor distances are highly sensitive to small changes in the training corpus for a variety of algorithms. For all methods, including specific documents in the training set can result in substantial variations. We show that these effects are more prominent for smaller training corpora. We recommend that users never rely on single embedding models for distance calculations, but rather average over multiple bootstrap samples, especially for small corpora. more »

Award ID(s):: 1652536

PAR ID:: 10057829

Author(s) / Creator(s):: Antoniak, Maria; Mimno, David

Date Published:: 2018-01-01

Journal Name:: Transactions of the Association for Computational Linguistics

Volume:: 6

ISSN:: 2307-387X

Page Range / eLocation ID:: 107-119

Format(s):: Medium: X

Sponsoring Org:: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript1.0
Journal Article:
The DOI is not currently available.

More Like this