Title: Evaluating the Morphosyntactic Well-formedness of Generated Texts
Text generation systems are ubiquitous in natural language processing applications. However, evaluation of these systems remains a challenge, especially in multilingual settings. In this paper, we propose L’AMBRE – a metric to evaluate the morphosyntactic well-formedness of text using its dependency parse and morphosyntactic rules of the language. We present a way to automatically extract various rules governing morphosyntax directly from dependency treebanks. To tackle the noisy outputs from text generation systems, we propose a simple methodology to train robust parsers. We show the effectiveness of our metric on the task of machine translation through a diachronic study of systems translating into morphologically-rich languages.
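The kind of check such a metric performs can be sketched in a few lines. The token representation, the single subject-verb agreement rule, and the scoring below are our own simplifications for illustration, not the paper's actual rule-extraction or scoring formulation.

```python
# Toy sketch of a morphosyntactic well-formedness score: given a
# dependency parse annotated with morphological features, count how
# often an agreement rule holds over the edges where it applies.
# Here a single rule is hard-coded: a verb and its nominal subject
# must agree in Number. (Illustrative assumption, not the paper's
# treebank-extracted rule set.)

def agreement_score(tokens, edges):
    """tokens: id -> feature dict; edges: list of (head, dep, label)."""
    checked = satisfied = 0
    for head, dep, label in edges:
        if label == "nsubj":  # the rule fires on subject-verb pairs
            checked += 1
            if tokens[head].get("Number") == tokens[dep].get("Number"):
                satisfied += 1
    # Fraction of applicable rule instances that are satisfied;
    # vacuously 1.0 when no rule applies.
    return satisfied / checked if checked else 1.0

# "The dogs barks": plural subject with a singular verb violates the rule.
toy_tokens = {
    1: {"pos": "NOUN", "Number": "Plur"},  # dogs
    2: {"pos": "VERB", "Number": "Sing"},  # barks
}
toy_edges = [(2, 1, "nsubj")]  # the verb heads its subject
print(agreement_score(toy_tokens, toy_edges))  # 0.0
```

In practice the parse and features would come from a trained dependency parser over Universal Dependencies-style annotations; the point here is only the shape of the rule-checking step.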
Award ID(s):
1761548 2203097 2125201 2125466
Page Range / eLocation ID:
7131 to 7150
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Lierler, Yuliya ; Morales, Jose F ; Dodaro, Carmine ; Dahl, Veronica ; Gebser, Martin ; Tekle, Tuncay (Ed.)
    Knowledge representation and reasoning (KRR) systems represent knowledge as collections of facts and rules. Like databases, KRR systems contain information about domains of human activity such as industrial enterprises, science, and business. KRRs can represent complex concepts and relations, and they can query and manipulate information in sophisticated ways. Unfortunately, KRR technology has been hindered by the fact that specifying the requisite knowledge requires skills that most domain experts do not have, and professional knowledge engineers are hard to find. One solution could be to extract knowledge from English text, and a number of works have attempted to do so (OpenSesame, Google's Sling, etc.). Unfortunately, at present, extraction of logical facts from unrestricted natural language is still too inaccurate to be used for reasoning, while restricting the grammar of the language (so-called controlled natural language, or CNL) is hard for users to learn and use. Nevertheless, some recent CNL-based approaches, such as the Knowledge Authoring Logic Machine (KALM), have been shown to achieve very high accuracy compared to others, and a natural question is to what extent the CNL restrictions can be lifted. In this paper, we address this issue by transplanting the KALM framework to a neural natural language parser, mStanza. Here we limit our attention to authoring facts and queries, so our focus is on what we call factual English statements. Authoring other types of knowledge, such as rules, will be considered in our follow-up work. As it turns out, neural-network-based parsers have problems of their own, and the mistakes they make range from part-of-speech tagging errors to lemmatization and dependency errors. We present a number of techniques for combating these problems and test the new system, KALMFL (i.e., KALM for factual language), on a number of benchmarks, which show that KALMFL achieves correctness in excess of 95%.
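The final fact-authoring step, going from a parsed factual English statement to a logical fact, can be sketched as follows. The triple format, function name, and predicate layout are our own illustrative assumptions, not the KALMFL pipeline.

```python
# Illustrative sketch (not the KALM/KALMFL implementation): turn the
# dependency triples of one factual English statement into a binary
# logical fact of the form verb(subject, object).

def triples_to_fact(verb, triples):
    """triples: list of (dependency_label, lemma) pairs for one sentence."""
    args = {label: lemma for label, lemma in triples}
    # Missing arguments are left as anonymous placeholders.
    return f'{verb}({args.get("nsubj", "_")}, {args.get("obj", "_")})'

# "Mary bought a car" parsed as nsubj=mary, obj=car:
print(triples_to_fact("bought", [("nsubj", "mary"), ("obj", "car")]))
# bought(mary, car)
```

The hard part the abstract describes is everything upstream of this step: getting the parser's part-of-speech tags, lemmas, and dependency labels right so that a mapping like this produces usable facts.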
  2. With their Discovery of Inference Rules from Text (DIRT) algorithm, Lin and Pantel (2001) made a seminal contribution to the field of rule acquisition from text, by adapting the distributional hypothesis of Harris (1954) to patterns that model binary relations such as X treat Y, where patterns are implemented as syntactic dependency paths. DIRT’s relevance is renewed in today’s neural era given the recent focus on interpretability in the field of natural language processing. We propose a novel take on the DIRT algorithm, where we implement the distributional hypothesis using the contextualized embeddings provided by BERT, a transformer-network-based language model (Vaswani et al., 2017; Devlin et al., 2018). In particular, we change the similarity measure between pairs of slots (i.e., the set of words matched by a pattern) from the original formula that relies on lexical items to a formula computed using contextualized embeddings. We empirically demonstrate that this new similarity method yields a better implementation of the distributional hypothesis, and this, in turn, yields patterns that outperform the original algorithm in the question answering-based evaluation proposed by Lin and Pantel (2001). 
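The change the abstract describes, replacing DIRT's lexical slot similarity with a similarity over contextualized vectors, can be sketched like this. The tiny stand-in vectors and the mean-pooling choice are our assumptions; a real implementation would pool actual BERT embeddings of the words filling each slot.

```python
# Sketch of a contextualized-embedding slot similarity: each slot is
# the set of vectors for the words a pattern matched in that position,
# and two slots are compared by the cosine of their mean vectors.
# (Pooling strategy and vectors are illustrative assumptions.)
import math

def mean_vec(vecs):
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # assumes non-zero vectors

def slot_similarity(slot_a_vecs, slot_b_vecs):
    return cosine(mean_vec(slot_a_vecs), mean_vec(slot_b_vecs))

# e.g. the X slots of "X treat Y" and "X cure Y", with placeholder vectors:
treat_x = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
cure_x = [[0.85, 0.15, 0.05]]
print(round(slot_similarity(treat_x, cure_x), 3))
```

The original DIRT formula instead scored slots by the mutual information of the shared lexical fillers; swapping in a vector similarity is exactly the substitution the abstract proposes.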
  3. We conduct a large-scale, systematic study to evaluate the existing evaluation methods for natural language generation in the context of generating online product reviews. We compare human-based evaluators with a variety of automated evaluation procedures, including discriminative evaluators that measure how well machine-generated text can be distinguished from human-written text, as well as word-overlap metrics that assess how similar the generated text is to human-written references. We determine to what extent these different evaluators agree on the ranking of a dozen state-of-the-art generators for online product reviews. We find that human evaluators do not correlate well with discriminative evaluators, raising the bigger question of whether adversarial accuracy is the correct objective for natural language generation. In general, distinguishing machine-generated text is challenging even for human evaluators, and human decisions correlate better with lexical overlap. We find lexical diversity an intriguing metric that is indicative of the assessments of different evaluators. A post-experiment survey of participants provides insights into how to evaluate and improve the quality of natural language generation systems.
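The word-overlap metrics compared in such studies can be as simple as the following sketch, a drastically reduced BLEU-1 (no clipping across multiple references, no brevity penalty), written by us for illustration.

```python
# Minimal word-overlap evaluator: unigram precision of generated text
# against a single human-written reference. A simplification of
# BLEU-1, not the exact metric used in the study.
from collections import Counter

def unigram_precision(generated, reference):
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    # Count each generated token at most as often as it appears in the reference.
    overlap = sum(min(count, ref[word]) for word, count in gen.items())
    return overlap / max(sum(gen.values()), 1)

print(unigram_precision("great phone and battery", "great battery life"))  # 0.5
```

Metrics of this family reward surface similarity to references, which is consistent with the abstract's finding that human judgments track lexical overlap better than adversarial discriminability.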
  4. Abstract. Driven by foundation models, recent progress in AI and machine learning has reached unprecedented complexity. For instance, the GPT-3 language model consists of 175 billion parameters and a training-data size of 570 GB. While it has achieved remarkable performance in generating text that is difficult to distinguish from human-authored content, a single training of the model is estimated to produce over 550 metric tons of CO2 emissions. Likewise, we see advances in GeoAI research improving large-scale prediction tasks like satellite image classification and global climate modeling, to name but a couple. While these models have not yet reached comparable complexity and emissions levels, spatio-temporal models differ from language and image-generation models in several ways that make it necessary to (re)train them more often, with potentially large implications for sustainability. While recent work in the machine learning community has started calling for greener and more energy-efficient AI alongside improvements in model accuracy, this trend has not yet reached the GeoAI community at large. In this work, we bring this issue to not only the attention of the GeoAI community but also present ethical considerations from a geographic perspective that are missing from the broader, ongoing AI-sustainability discussion. To start this discussion, we propose a framework to evaluate models from several sustainability-related angles, including energy efficiency, carbon intensity, transparency, and social implications. We encourage future AI/GeoAI work to acknowledge its environmental impact as a step towards a more resource-conscious society. Similar to the current push for reproducibility, future publications should also report the energy/carbon costs of improvements over prior work. 
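The energy/carbon reporting the authors call for reduces to a back-of-envelope calculation: energy consumed times the grid's carbon intensity. The function below sketches that arithmetic; every number in the example call is a placeholder, not a measurement of any real model.

```python
# Back-of-envelope training-emissions estimate:
#   energy (kWh) = GPU-hours x average GPU power (kW) x datacenter PUE
#   CO2 (kg)     = energy (kWh) x grid carbon intensity (kg CO2 per kWh)
# All inputs below are hypothetical placeholders for illustration.

def training_co2_kg(gpu_hours, gpu_power_kw, pue, grid_kg_per_kwh):
    energy_kwh = gpu_hours * gpu_power_kw * pue  # facility-adjusted energy
    return energy_kwh * grid_kg_per_kwh

# 1000 GPU-hours at 300 W average draw, PUE 1.1, on a 0.4 kg/kWh grid:
print(training_co2_kg(gpu_hours=1000, gpu_power_kw=0.3,
                      pue=1.1, grid_kg_per_kwh=0.4))  # 132.0 kg CO2
```

Reporting numbers like these alongside accuracy improvements is precisely the practice the abstract argues future publications should adopt; the grid carbon intensity term is also where the geographic perspective enters, since it varies widely by region.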
  5. Aerospace systems are inherently stochastic and increasingly data-driven, and thus hard to formally verify. Data-driven statistical models can be used to estimate the state and classify potentially anomalous conditions of aerospace systems from multiple heterogeneous sensors with high accuracy. In this paper, we consider the problem of precisely bounding the regions in the sensor input space of a stochastic system in which safe state classification can be formally proven. As an archetypal application, we consider a statistical model created to detect aerodynamic stall in a prototype wing retrofitted with piezoelectric sensors and used to generate data in a wind tunnel for different flight states. We formally define safety envelopes as regions parameterized by two values that respectively capture how model-predictable observed sensor values are and, given these values, how likely the model's state classification is to be accurate. Safety envelopes are formalized in the Agda proof assistant, which is also used to generate formally verified runtime monitors for sensor data stream analyses in the Haskell programming language. We further propose a new metric for model classification quality, evaluate it on our wing prototype model, and compare it to the model restricted to two different fixed airspeeds, and enhanced to a continuous Gaussian process regression model. Safety envelopes are an important step in formally verifying precise probabilistic properties of data-driven models used in stochastic aerospace systems and could be used by advanced control algorithms to maintain these systems well within safe operation boundaries.
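The two-threshold safety-envelope idea, accepting a classification only where the reading is model-predictable and the classification is confident, can be sketched with a toy Gaussian model per flight state. The class models, thresholds, and one-dimensional sensor reading below are our assumptions, not the paper's Agda formalization.

```python
# Toy sketch of a two-parameter safety envelope: a sensor reading is
# inside the envelope only if (1) its likelihood under the predicted
# state's model clears one threshold (model-predictability) and
# (2) the classifier's posterior for that state clears another
# (classification confidence). Gaussian per-state models and all
# numeric values are illustrative assumptions.
import math

def gauss(x, mu, sigma):
    """Gaussian probability density at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def in_safety_envelope(x, models, lik_min, post_min):
    """models: state -> (mu, sigma). Returns (predicted state, inside envelope?)."""
    liks = {s: gauss(x, mu, sg) for s, (mu, sg) in models.items()}
    state = max(liks, key=liks.get)
    posterior = liks[state] / sum(liks.values())  # uniform prior assumed
    return state, liks[state] >= lik_min and posterior >= post_min

models = {"stall": (5.0, 1.0), "no_stall": (0.0, 1.0)}
# A reading near the stall mode is inside; one halfway between modes is not.
print(in_safety_envelope(4.0, models, lik_min=1e-3, post_min=0.95))
```

A runtime monitor of the kind the abstract mentions would evaluate exactly such a predicate over the incoming sensor stream and flag readings that leave the envelope.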