Title: Can Data Diversity Enhance Learning Generalization?
This paper introduces our Diversity Advanced Actor-Critic reinforcement learning (A2C) framework (DAAC) to improve the generalization and accuracy of Natural Language Processing (NLP) models. We show that diversifying training samples alleviates overfitting and improves model generalization and accuracy. We quantify the diversity of a set of samples using maximum dispersion, convex hull volume, and graph entropy computed over sentence embeddings in a high-dimensional metric space, and we use A2C to select such a diversified training subset efficiently. Our experiments achieve up to a +23.8 accuracy increase (38.0% relative) in sentiment analysis, a -44.7 perplexity decrease (37.9% relative) in language modeling, and consistent improvements in named entity recognition across various domains. In particular, our method outperforms both domain adaptation and domain generalization baselines without using any target-domain knowledge.
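A minimal sketch of the three diversity measures named in the abstract (maximum dispersion, convex hull volume, and graph entropy), computed over a candidate set of sentence embeddings. The projection dimension, similarity threshold, and function names below are illustrative assumptions rather than the DAAC implementation; scores like these could serve as the selection signal the abstract describes.

```python
# Illustrative only: the metric definitions, projection dimension, and
# similarity threshold below are assumptions, not the DAAC implementation.
import numpy as np
from scipy.spatial import ConvexHull
from scipy.spatial.distance import pdist


def max_dispersion(embeddings: np.ndarray) -> float:
    """Largest pairwise Euclidean distance among the selected embeddings."""
    return float(pdist(embeddings).max())


def convex_hull_volume(embeddings: np.ndarray, n_components: int = 3) -> float:
    """Convex hull volume after an SVD projection to a low dimension
    (exact hulls are intractable in the full embedding dimension)."""
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    return float(ConvexHull(projected).volume)


def graph_entropy(embeddings: np.ndarray, threshold: float = 0.0) -> float:
    """Shannon entropy of the degree distribution of a similarity graph whose
    edges connect embeddings with cosine similarity above `threshold`."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    adjacency = (normed @ normed.T > threshold).astype(float)
    np.fill_diagonal(adjacency, 0.0)
    degrees = adjacency.sum(axis=1)
    if degrees.sum() == 0:
        return 0.0
    probs = degrees / degrees.sum()
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())


# Example: score a candidate subset of 32 sentence embeddings (dimension 768).
rng = np.random.default_rng(0)
subset = rng.normal(size=(32, 768))
print(max_dispersion(subset), convex_hull_volume(subset), graph_entropy(subset))
```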
Award ID(s):
2113906
PAR ID:
10514662
Author(s) / Creator(s):
Publisher / Repository:
International Committee on Computational Linguistics
Date Published:
Format(s):
Medium: X
Location:
Gyeongju, Republic of Korea
Sponsoring Org:
National Science Foundation
More Like this
  1. We introduce our Maximum-Entropy Rewarded Reinforcement Learning (MERRL) framework, which selects training data for more accurate Natural Language Processing (NLP). Because conventional data selection methods select training samples based on test-domain knowledge rather than real-life data, they frequently fail in unknown domains such as patents and Twitter. Our approach selects training samples that maximize information uncertainty measured by entropy, including observation entropy such as empirical Shannon entropy, min-entropy, and Rényi entropy, and prediction entropy using mutual information, to cover more of the possible queries that may appear in unknown worlds. Our MERRL using regularized A2C and SAC achieves up to a -99.7 perplexity decrease (-43.4% relative) in language modeling, a +25.0 accuracy increase (+40.0% relative) in sentiment analysis, and a +5.0 F1 score increase (+30.8% relative) in named entity recognition over various domains, demonstrating strong generalization power on unknown test sets.
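A hedged sketch of the entropy measures the MERRL abstract lists (Shannon, min-, and Rényi entropy of a model's predictive distribution), together with a simple greedy selection rule standing in for the reinforcement-learning policy. All names and the selection rule are assumptions for exposition, not the paper's reward design.

```python
# Illustrative only: the entropy scores and the greedy selection rule are
# assumptions for exposition, not MERRL's actual reward or policy.
import numpy as np


def shannon_entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def min_entropy(p: np.ndarray) -> float:
    return float(-np.log(p.max()))


def renyi_entropy(p: np.ndarray, alpha: float = 2.0) -> float:
    return float(np.log((p ** alpha).sum()) / (1.0 - alpha))


def select_most_uncertain(pred_dists: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k samples whose predictive distributions carry the most
    Shannon entropy (a greedy stand-in for the learned RL selection policy)."""
    scores = np.array([shannon_entropy(p) for p in pred_dists])
    return np.argsort(scores)[-k:]


# Example: 100 candidate samples, each with a 5-class predictive distribution.
rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 5))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(select_most_uncertain(probs, k=10))
```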
  2. Domain generalization (DG) aims to train a model to perform well in unseen domains under different distributions. This paper considers a more realistic yet more challenging scenario, namely Single Domain Generalization (Single-DG), where only a single source domain is available for training. To tackle this challenge, we first try to understand when neural networks fail to generalize. We empirically ascertain a property of a model that correlates strongly with its generalization, which we coin model sensitivity. Based on our analysis, we propose a novel strategy of Spectral Adversarial Data Augmentation (SADA) to generate augmented images targeted at the highly sensitive frequencies. Models trained with these hard-to-learn samples can effectively suppress the sensitivity in the frequency space, which leads to improved generalization performance. Extensive experiments on multiple public datasets demonstrate the superiority of our approach, which surpasses the state-of-the-art single-DG methods by up to 2.55%. The source code is available at https://github.com/DIAL-RPI/Spectral-Adversarial-Data-Augmentation.
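A minimal sketch, in the spirit of SADA, of perturbing an image inside a chosen radial band of its amplitude spectrum and inverting the FFT. The band limits and noise scale are illustrative assumptions; the paper selects the perturbed frequencies adversarially from the model's measured sensitivity rather than from a fixed band.

```python
# Illustrative only: the band limits and noise scale are assumptions; SADA
# chooses the perturbed frequencies adversarially from model sensitivity.
import numpy as np


def perturb_frequency_band(image: np.ndarray, low: float, high: float,
                           scale: float = 0.1, seed: int = 0) -> np.ndarray:
    """Add multiplicative noise to the amplitude spectrum within a radial band."""
    rng = np.random.default_rng(seed)
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.mgrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2) / (min(h, w) / 2)
    band = (radius >= low) & (radius < high)  # which frequencies to perturb
    amplitude, phase = np.abs(spectrum), np.angle(spectrum)
    amplitude[band] *= 1.0 + scale * rng.normal(size=int(band.sum()))
    perturbed = amplitude * np.exp(1j * phase)
    return np.real(np.fft.ifft2(np.fft.ifftshift(perturbed)))


# Example: augment a 64x64 grayscale image in its high-frequency band.
image = np.random.default_rng(1).random((64, 64))
augmented = perturb_frequency_band(image, low=0.5, high=1.0)
```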
  3. The exploitation of extra state information has been an active research area in multi-agent reinforcement learning (MARL). QMIX represents the joint action-value using a non-negative function approximator and achieves the best performance on the StarCraft II micromanagement testbed, a common MARL benchmark. However, our experiments demonstrate that, in some cases, QMIX performs sub-optimally with the A2C framework, a training paradigm that promotes algorithm training efficiency. To obtain a reasonable trade-off between training efficiency and algorithm performance, we extend value-decomposition to actor-critic methods that are compatible with A2C and propose a novel actor-critic framework, value-decomposition actor-critic (VDAC). We evaluate VDAC on the StarCraft II micromanagement task and demonstrate that the proposed framework improves median performance over other actor-critic methods. Furthermore, we use a set of ablation experiments to identify the key factors that contribute to the performance of VDAC. 
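A hedged sketch of the value-decomposition idea behind VDAC: each agent keeps a local critic, and the joint state value used in the A2C-style update is the sum of the local values (the simple additive variant; the paper also studies a learned mixer). Class names and network sizes are placeholders.

```python
# Illustrative only: class names and sizes are placeholders; this shows the
# simple additive decomposition V_tot(s) = sum_i V_i(o_i), not the full method.
import torch
import torch.nn as nn


class LocalCritic(nn.Module):
    """Per-agent state-value head on that agent's local observation."""

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


class SumValueDecomposition(nn.Module):
    """Joint value as the sum of per-agent local values."""

    def __init__(self, n_agents: int, obs_dim: int):
        super().__init__()
        self.critics = nn.ModuleList([LocalCritic(obs_dim) for _ in range(n_agents)])

    def forward(self, agent_obs: torch.Tensor) -> torch.Tensor:
        # agent_obs: (batch, n_agents, obs_dim)
        local_values = [critic(agent_obs[:, i]) for i, critic in enumerate(self.critics)]
        return torch.stack(local_values, dim=1).sum(dim=1)  # (batch, 1)


# Example: 3 agents with 16-dimensional observations, a batch of 8 transitions.
mixer = SumValueDecomposition(n_agents=3, obs_dim=16)
joint_values = mixer(torch.randn(8, 3, 16))  # targets for the A2C-style update
```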
  4. While deep networks have achieved broad success in analyzing natural images, when applied to medical scans, they often fail in unexpected situations. This study investigates model sensitivity to domain shifts, such as data sampled from different hospitals or confounded by demographic variables like sex and race, focusing on chest X-rays and skin lesion images. The key finding is that existing visual backbones lack an appropriate prior for reliable generalization in these settings. Inspired by medical training, the authors propose incorporating explicit medical knowledge communicated in natural language into deep networks. They introduce Knowledge-enhanced Bottlenecks (KnoBo), a class of concept bottleneck models that integrate knowledge priors, enabling reasoning with clinically relevant factors found in medical textbooks or PubMed. KnoBo utilizes retrieval-augmented language models to design an appropriate concept space, paired with an automatic training procedure for recognizing these concepts. Evaluations across 20 datasets demonstrate that KnoBo outperforms fine-tuned models on confounded datasets by 32.4% on average. Additionally, PubMed is identified as a promising resource for enhancing model robustness to domain shifts, outperforming other resources in both information diversity and prediction performance.
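A minimal concept-bottleneck sketch in the spirit of KnoBo: visual features are mapped to scores over a fixed set of clinically grounded concepts, and the final prediction is a linear function of those concept scores alone. The concept count and dimensions are placeholders; KnoBo derives its concept space from medical text with retrieval-augmented language models.

```python
# Illustrative only: the concept count and feature dimension are placeholders;
# KnoBo builds its concept space with retrieval-augmented language models.
import torch
import torch.nn as nn


class ConceptBottleneck(nn.Module):
    """Predict concept scores from visual features, then classify from the
    concept scores alone, so every prediction is mediated by the concepts."""

    def __init__(self, feature_dim: int, n_concepts: int, n_classes: int):
        super().__init__()
        self.concept_head = nn.Linear(feature_dim, n_concepts)
        self.classifier = nn.Linear(n_concepts, n_classes)

    def forward(self, features: torch.Tensor):
        concepts = torch.sigmoid(self.concept_head(features))  # concept scores in [0, 1]
        return self.classifier(concepts), concepts


# Example: features from a frozen visual backbone, 50 textbook-derived concepts.
model = ConceptBottleneck(feature_dim=512, n_concepts=50, n_classes=2)
logits, concept_scores = model(torch.randn(4, 512))
```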