skip to main content


The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 10:00 PM ET on Friday, December 8 until 2:00 AM ET on Saturday, December 9 due to maintenance. We apologize for the inconvenience.

Title: On the Robustness of Language Encoders against Grammatical Errors
We conduct a thorough study to diagnose the behaviors of pre-trained language encoders (ELMo, BERT, and RoBERTa) when confronted with natural grammatical errors. Specifically, we collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data. We use this approach to facilitate debugging models on downstream applications. Results confirm that the performance of all tested models is affected but the degree of impact varies. To interpret model behaviors, we further design a linguistic acceptability task to reveal their abilities in identifying ungrammatical sentences and the position of errors. We find that fixed contextual encoders with a simple classifier trained on the prediction of sentence correctness are able to locate error positions. We also design a cloze test for BERT and discover that BERT captures the interaction between errors and specific tokens in context. Our results shed light on understanding the robustness and behaviors of language encoders against grammatical errors.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Page Range / eLocation ID:
3386 to 3403
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Language understanding involves processing text with both the grammatical and 2 common-sense contexts of the text fragments. The text “I went to the grocery store 3 and brought home a car” requires both the grammatical context (syntactic) and 4 common-sense context (semantic) to capture the oddity in the sentence. Contex5 tualized text representations learned by Language Models (LMs) are expected to 6 capture a variety of syntactic and semantic contexts from large amounts of training 7 data corpora. Recent work such as ERNIE has shown that infusing the knowl8 edge contexts, where they are available in LMs, results in significant performance 9 gains on General Language Understanding (GLUE) benchmark tasks. However, 10 to our knowledge, no knowledge-aware model has attempted to infuse knowledge 11 through top-down semantics-driven syntactic processing (Eg: Common-sense to 12 Grammatical) and directly operated on the attention mechanism that LMs leverage 13 to learn the data context. We propose a learning framework Top-Down Language 14 Representation (TDLR) to infuse common-sense semantics into LMs. In our 15 implementation, we build on BERT for its rich syntactic knowledge and use the 16 knowledge graphs ConceptNet and WordNet to infuse semantic knowledge. 
    more » « less
  2. Psychotic disorders are forms of severe mental illness characterized by abnormal social function and a general sense of disconnect with reality. The evaluation of such disorders is often complex, as their multifaceted nature is often difficult to quantify. Multimodal behavior analysis technologies have the potential to help address this need and supply timelier and more objective decision support tools in clinical settings. While written language and nonverbal behaviors have been previously studied, the present analysis takes the novel approach of examining the rarely-studied modality of spoken language of individuals with psychosis as naturally used in social, face-to-face interactions. Our analyses expose a series of language markers associated with psychotic symptom severity, as well as interesting interactions between them. In particular, we examine three facets of spoken language: (1) lexical markers, through a study of the function of words; (2) structural markers, through a study of grammatical fluency; and (3) disfluency markers, through a study of dialogue self-repair. Additionally, we develop predictive models of psychotic symptom severity, which achieve significant predictive power on both positive and negative psychotic symptom scales. These results constitute a significant step toward the design of future multimodal clinical decision support tools for computational phenotyping of mental illness. 
    more » « less
  3. Dataset discovery from data lakes is essential in many real application scenarios. In this paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes (with table union search as the main use case). Our proposed framework features a contrastive learning method to train column encoders from pre-trained language models in a fully unsupervised manner. The column encoder of Starmie captures the rich contextual semantic information within tables by leveraging a contrastive multi-column pre-training strategy. We utilize the cosine similarity between column embedding vectors as the column unionability score and propose a filter-and-verification framework that allows exploring a variety of design choices to compute the unionability score between two tables accordingly. Empirical results on real table benchmarks show that Starmie outperforms the best-known solutions in the effectiveness of table union search by 6.8 in MAP and recall. Moreover, Starmie is the first to employ the HNSW (Hierarchical Navigable Small World) index to accelerate query processing of table union search which provides a 3,000X performance gain over the linear scan baseline and a 400X performance gain over an LSH index (the state-of-the-art solution for data lake indexing). 
    more » « less
  4. Because meaning can often be inferred from lexical semantics alone, word order is often a redundant cue in natural language. For example, the words chopped, chef, and onion are more likely used to convey “The chef chopped the onion,” not “The onion chopped the chef.” Recent work has shown large language models to be surprisingly word order invariant, but crucially has largely considered natural prototypical inputs, where compositional meaning mostly matches lexical expectations. To overcome this confound, we probe grammatical role representation in English BERT and GPT-2, on instances where lexical expectations are not sufficient, and word order knowledge is necessary for correct classification. Such non-prototypical instances are naturally occurring English sentences with inanimate subjects or animate objects, or sentences where we systematically swap the arguments to make sentences like “The onion chopped the chef”. We find that, while early layer embeddings are largely lexical, word order is in fact crucial in defining the later-layer representations of words in semantically non-prototypical positions. Our experiments isolate the effect of word order on the contextualization process, and highlight how models use context in the uncommon, but critical, instances where it matters. 
    more » « less
  5. null (Ed.)
    Deep Neural Networks (DNNs) are becoming an integral part of most software systems. Previous work has shown that DNNs have bugs. Unfortunately, existing debugging techniques don't support localizing DNN bugs because of the lack of understanding of model behaviors. The entire DNN model appears as a black box. To address these problems, we propose an approach and a tool that automatically determines whether the model is buggy or not, and identifies the root causes for DNN errors. Our key insight is that historic trends in values propagated between layers can be analyzed to identify faults, and also localize faults. To that end, we first enable dynamic analysis of deep learning applications: by converting it into an imperative representation and alternatively using a callback mechanism. Both mechanisms allows us to insert probes that enable dynamic analysis over the traces produced by the DNN while it is being trained on the training data. We then conduct dynamic analysis over the traces to identify the faulty layer or hyperparameter that causes the error. We propose an algorithm for identifying root causes by capturing any numerical error and monitoring the model during training and finding the relevance of every layer/parameter on the DNN outcome. We have collected a benchmark containing 40 buggy models and patches that contain real errors in deep learning applications from Stack Overflow and GitHub. Our benchmark can be used to evaluate automated debugging tools and repair techniques. We have evaluated our approach using this DNN bug-and-patch benchmark, and the results showed that our approach is much more effective than the existing debugging approach used in the state-of-the-practice Keras library. For 34/40 cases, our approach was able to detect faults whereas the best debugging approach provided by Keras detected 32/40 faults. Our approach was able to localize 21/40 bugs whereas Keras did not localize any faults. 
    more » « less