Title: Can Synthetic Translations Improve Bitext Quality?
Synthetic translations have been used for a wide range of NLP tasks primarily as a means of data augmentation. This work explores, instead, how synthetic translations can be used to revise potentially imperfect reference translations in mined bitext. We find that synthetic samples can improve bitext quality without any additional bilingual supervision when they replace the originals based on a semantic equivalence classifier that helps mitigate NMT noise. The improved quality of the revised bitext is confirmed intrinsically via human evaluation and extrinsically through bilingual induction and MT tasks.
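The replacement step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the decision rule (replace only when the synthetic translation clears a threshold and also outscores the original), the `threshold` value, and the names `revise_bitext` and `toy_score` are all assumptions made for illustration. The toy scorer is a word-count ratio standing in for a trained semantic equivalence classifier.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def revise_bitext(
    bitext: List[Pair],
    synthetic_targets: List[str],
    equivalence_score: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[Pair]:
    """Swap in a synthetic translation only when the scorer rates it as
    equivalent to the source and strictly better than the original."""
    if len(bitext) != len(synthetic_targets):
        raise ValueError("need exactly one synthetic translation per pair")
    revised: List[Pair] = []
    for (src, ref), syn in zip(bitext, synthetic_targets):
        ref_score = equivalence_score(src, ref)
        syn_score = equivalence_score(src, syn)
        if syn_score >= threshold and syn_score > ref_score:
            revised.append((src, syn))  # synthetic sample replaces original
        else:
            revised.append((src, ref))  # keep the mined reference
    return revised


def toy_score(src: str, tgt: str) -> float:
    """Hypothetical stand-in for a classifier: word-count ratio in [0, 1]."""
    a, b = len(src.split()), len(tgt.split())
    return min(a, b) / max(a, b) if max(a, b) else 0.0


mined = [
    ("the cat sleeps", "le chat"),
    ("good morning", "bonjour tout le monde"),
]
synthetic = ["le chat dort", "bonjour"]
revised = revise_bitext(mined, synthetic, toy_score)
```

In this toy run the first pair is revised because the synthetic translation scores higher, while the second keeps its original reference because the synthetic candidate does not outscore it. Requiring both conditions is one way to keep noisy NMT output from overwriting acceptable references.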
Award ID(s):
1750695
NSF-PAR ID:
10399217
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Page Range / eLocation ID:
4753 to 4766
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Task-oriented Dialogue (ToD) agents are mostly limited to a few widely spoken languages, mainly due to the high cost of acquiring training data for each language. Existing low-cost approaches that rely on cross-lingual embeddings or naive machine translation sacrifice a lot of accuracy for data efficiency, and largely fail to create a usable dialogue agent. We propose automatic methods that use ToD training data in a source language to build a high-quality functioning dialogue agent in another target language that has no training data (i.e. zero-shot) or a small training set (i.e. few-shot). Unlike most prior work in cross-lingual ToD that only focuses on Dialogue State Tracking (DST), we build an end-to-end agent. We show that our approach closes the accuracy gap between few-shot and existing full-shot methods for ToD agents. We achieve this by (1) improving the dialogue data representation, (2) improving entity-aware machine translation, and (3) automatic filtering of noisy translations. We evaluate our approach on the recent bilingual dialogue dataset BiToD. In Chinese to English transfer, in the zero-shot setting, our method achieves 46.7% and 22.0% in Task Success Rate (TSR) and Dialogue Success Rate (DSR) respectively. In the few-shot setting where 10% of the data in the target language is used, we improve the state-of-the-art by 15.2% and 14.0%, coming within 5% of full-shot training.
  2. Neural Machine Translation (NMT) trains a neural network with an encoder-decoder architecture. However, the quality of neural translations depends predominantly on the availability of a large bilingual training dataset. In this paper, we explore the performance of translations predicted by attention-based NMT systems for the low-resource Spanish-to-Persian language pair. We analyze the errors that NMT systems make in Persian and provide an in-depth comparison of system performance across variations in sentence length and training dataset size. We evaluate our translation results using BLEU and human evaluation measures based on adequacy, fluency, and overall rating.
  3. This paper describes a systematic study of an approach to Farsi-Spanish low-resource Neural Machine Translation (NMT) that leverages monolingual data for joint learning of forward and backward translation models. As is standard for NMT systems, the training process begins using two pre-trained translation models that are iteratively updated by decreasing translation costs. In each iteration, either translation model is used to translate monolingual texts from one language to another, to generate synthetic datasets for the other translation model. Two new translation models are then learned from bilingual data along with the synthetic texts. The key distinguishing feature between our approach and standard NMT is an iterative learning process that improves the performance of both translation models, simultaneously producing a higher-quality synthetic training dataset upon each iteration. Our empirical results demonstrate that this approach outperforms baselines.
  4. Background/Context:

    Computer programming is rarely accessible to K–12 students, especially for those from culturally and linguistically diverse backgrounds. Middle school age is a transitioning time when adolescents are more likely to make long-term decisions regarding their academic choices and interests. Having access to productive and positive knowledge and experiences in computer programming can grant them opportunities to realize their abilities and potential in this field.

    Purpose/Focus of Study:

    This study focuses on the exploration of the kind of relationship that bilingual Latinx students developed with themselves and computer programming and mathematics (CPM) practices through their participation in a CPM after-school program, first as students and then as cofacilitators teaching CPM practices to other middle school peers.

    Setting:

    An after-school program, Advancing Out-of-School Learning in Mathematics and Engineering (AOLME), was held at two middle schools located in rural and urban areas in the Southwest. It was designed to support an inclusive cultural environment that nurtured students’ opportunities to learn CPM practices through the inclusion of languages (Spanish and English), tasks, and participants congruent to students in the program. Students learned how to represent, design, and program digital images and videos using a sequence of 2D arrays of hexadecimal numbers with Python on a Raspberry Pi computer. The six bilingual cofacilitators attended Levels 1 and 2 as students and were offered the opportunity to participate as cofacilitators in the next implementation of Level 1.

    Research Design:

    This longitudinal case study focused on analyzing the experiences and shifts (if any) of students who participated as cofacilitators in AOLME. Their narratives were analyzed collectively, and our analysis describes the experiences of the cofacilitators as a single case study (with embedded units) of what it means to be a bilingual cofacilitator in AOLME. Data included individual exit interviews of the six cofacilitators and their focus groups (30–45 minutes each), an adapted 20-item CPM attitude 5-point Likert scale, and self-report from each of them. Results from attitude scales revealed cofacilitators’ greater initial and posterior connections to CPM practices. The self-reports on CPM included two number lines (0–10) for before and after AOLME for students to self-assess their liking and knowledge of CPM. The numbers were used as interview prompts to converse with students about experiences. The interview data were analyzed qualitatively and coded through a contrast-comparative process regarding students’ description of themselves, their experiences in the program, and their perception of and relationship toward CPM practices.

    Findings:

    Findings indicated that students had continued/increased motivation and confidence in CPM as they engaged in a journey as cofacilitators, described through two thematic categories: (a) shifting views by personally connecting to CPM, and (b) affirming CPM practices through teaching. The shift in connecting to CPM practices evolved as students argued that they found a new way of learning mathematics, in that they used mathematics as a tool to create videos and images that they programmed by using Python while making sense of the process bilingually (Spanish and English). This mathematics was viewed by students as high level, which in turn helped students gain self-confidence in CPM practices. Additionally, students affirmed their knowledge and confidence in CPM practices by teaching them to others, a process in which they had to mediate beyond the understanding of CPM practices. They came up with new ways of explaining CPM practices bilingually to their peers. In this new role, cofacilitators considered the topic and language, and promoted communal support among the peers they worked with.

    Conclusions/Recommendations:

    Bilingual middle school students can not only program, but also teach bilingually and embrace new roles with nurturing support. Schools can promote new student roles, which can yield new goals and identities. There is a great need to redesign the school mathematics curriculum as a discipline that teenagers can use and connect with by creating and finding things they care about. In this way, school mathematics can support a closer “fit” with students’ identification with the world of mathematics. Cofacilitators learned more about CPM practices by teaching them, extending beyond what was given to them, and constructing new goals that were in line with a sophisticated knowledge and shifts in the practice. Assigned responsibility in a new role can strengthen students’ self-image, agency, and ways of relating to mathematics.

     
  5. The wide accessibility of social media has provided linguistically under-represented communities with an extraordinary opportunity to create content in their native languages. This, however, comes with certain challenges in script normalization, particularly where the speakers of a language in a bilingual community rely on another script or orthography to write their native language. This paper addresses the problem of script normalization for several such languages that are mainly written in a Perso-Arabic script. Using synthetic data with various levels of noise and a transformer-based model, we demonstrate that the problem can be effectively remediated. We conduct a small-scale evaluation of real data as well. Our experiments indicate that script normalization is also beneficial to improve the performance of downstream tasks such as machine translation and language identification. 