Search for: All records

Creators/Authors contains: "Xu, Jia"

  1. Free, publicly-accessible full text available December 10, 2024
  2. The opacity of deep neural networks remains a challenge in deploying solutions where explanation is as important as precision. We present ConceptX, a human-in-the-loop framework for interpreting and annotating the latent representational space in pre-trained Language Models (pLMs). We use an unsupervised method to discover concepts learned in these models and provide a graphical interface for humans to generate explanations for the concepts. To facilitate the process, we provide auto-annotations of the concepts (based on traditional linguistic ontologies). Such annotations enable the development of a linguistic resource that directly represents latent concepts learned within deep NLP models. These include not just traditional linguistic concepts, but also task-specific or sensitive concepts (words grouped based on gender or religious connotation) that help annotators mark bias in the model. The framework consists of two parts: (i) concept discovery and (ii) an annotation platform.
    Free, publicly-accessible full text available June 27, 2024
  3. We introduce our Maximum-Entropy Rewarded Reinforcement Learning (MERRL) framework, which selects training data for more accurate Natural Language Processing (NLP). Because conventional data selection methods select training samples based on test domain knowledge rather than on real-life data, they frequently fail in unknown domains such as patents and Twitter. Our approach selects training samples that maximize information uncertainty measured by entropy, including observation entropy (such as empirical Shannon entropy, min-entropy, and Rényi entropy) and prediction entropy using mutual information, to cover more of the queries that may appear in unknown worlds. MERRL with regularized A2C and SAC achieves a perplexity decrease of up to 99.7 (43.4% relative) in language modeling, an accuracy increase of up to 25.0 (40.0% relative) in sentiment analysis, and an F1 score increase of up to 5.0 (30.8% relative) in named entity recognition over various domains, demonstrating strong generalization power on unknown test sets.
  4. Calzolari, Nicoletta ; Huang, Chu-Ren ; Kim, Hansaem ; Pustejovsky, James ; Wanner, Leo ; Choi, Key-Sun ; Ryu, Pum-Mo ; Chen, Hsin-Hsi ; Donatelli, Lucia ; Ji, Heng (Ed.)
    Multilingual neural machine translation (MNMT) jointly trains a shared model for translation with multiple language pairs. However, traditional subword-based MNMT approaches suffer from out-of-vocabulary (OOV) issues and a representation bottleneck, which often degrade translation performance on certain language pairs. While byte tokenization is used to tackle OOV problems in neural machine translation (NMT), its capability has not yet been validated in MNMT. Additionally, to our knowledge, existing work has not studied how byte encoding can benefit endangered language translation. We propose a byte-based multilingual neural machine translation system (BMNMT) to alleviate the representation bottleneck and improve translation performance in endangered languages. Furthermore, we design a random byte mapping method with an ensemble prediction to enhance the robustness of our model. Experimental results show that our BMNMT consistently and significantly outperforms subword/word-based baselines on twelve language pairs by up to 18.5 BLEU points, an 840% relative improvement.
  5. This paper introduces our Diversity Advanced Actor-Critic reinforcement learning (A2C) framework (DAAC) to improve the generalization and accuracy of Natural Language Processing (NLP). We show that the diversification of training samples alleviates overfitting and improves model generalization and accuracy. We quantify the diversity of a set of samples using the max dispersion, convex hull volume, and graph entropy based on sentence embeddings in a high-dimensional metric space. We also introduce A2C to select such a diversified training subset efficiently. Our experiments achieve an accuracy increase of up to 23.8 (38.0% relative) in sentiment analysis, a perplexity decrease of up to 44.7 (37.9% relative) in language modeling, and consistent improvements in named entity recognition over various domains. In particular, our method outperforms both domain adaptation and generalization baselines without using any target domain knowledge.
  6. Crowdsourcing has become an efficient paradigm for utilizing human intelligence to perform tasks that are challenging for machines. Many incentive mechanisms for crowdsourcing systems have been proposed. However, most existing incentive mechanisms assume that there are sufficient participants to perform crowdsourcing tasks. In large-scale crowdsourcing scenarios, this assumption may not hold. To address this issue, we diffuse the crowdsourcing tasks in a social network to increase the number of participants. To make the task diffusion more applicable to crowdsourcing systems, we enhance the classic Independent Cascade model so that the influence is strongly connected with both the types and topics of tasks. Based on the tailored task diffusion model, we formulate the Budget Feasible Task Diffusion (BFTD) problem for maximizing the value function of the platform under a constrained budget. We design a parameter estimation algorithm based on the Expectation Maximization algorithm to estimate the parameters of the proposed task diffusion model. Benefiting from the submodular property of the objective function, we apply a budget-feasible incentive mechanism, which satisfies the desirable properties of computational efficiency, individual rationality, budget feasibility, truthfulness, and guaranteed approximation, to stimulate the task diffusers. Simulation results based on two real-world datasets show that our incentive mechanism can improve the number of active users and the task completion rate by 9.8% and 11% on average, respectively.
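
As a rough illustration of the observation-entropy measures named in item 3 (empirical Shannon entropy, min-entropy, and Rényi entropy), the sketch below computes each one from the unigram distribution of a token list using their standard textbook definitions. It is not the MERRL implementation: the function names, the whitespace tokenization, and the choice of a unigram distribution are all assumptions made for this example.

```python
import math
from collections import Counter


def empirical_distribution(tokens):
    """Relative frequency of each distinct token (hypothetical helper)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [count / total for count in counts.values()]


def shannon_entropy(probs):
    """Shannon entropy in bits: -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def min_entropy(probs):
    """Min-entropy in bits: -log2 of the largest probability."""
    return -math.log2(max(probs))


def renyi_entropy(probs, alpha):
    """Renyi entropy of order alpha in bits, for alpha > 0 and alpha != 1."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    return math.log2(sum(p ** alpha for p in probs)) / (1 - alpha)


# Toy sample; whitespace splitting is an assumption of this sketch.
tokens = "a b a c a b a d".split()
probs = empirical_distribution(tokens)

print("Shannon:", shannon_entropy(probs))
print("Min-entropy:", min_entropy(probs))
print("Renyi (alpha=2):", renyi_entropy(probs, 2))
```

For any distribution, min-entropy is at most the Rényi entropy of order 2, which is at most the Shannon entropy, so a selection criterion built on min-entropy is the most conservative of the three.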