Title: Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities
The wide accessibility of social media has provided linguistically under-represented communities with an extraordinary opportunity to create content in their native languages. This, however, comes with certain challenges in script normalization, particularly where the speakers of a language in a bilingual community rely on another script or orthography to write their native language. This paper addresses the problem of script normalization for several such languages that are mainly written in a Perso-Arabic script. Using synthetic data with various levels of noise and a transformer-based model, we demonstrate that the problem can be effectively remediated. We conduct a small-scale evaluation of real data as well. Our experiments indicate that script normalization is also beneficial to improve the performance of downstream tasks such as machine translation and language identification.
Award ID(s): 2109578
PAR ID: 10451365
Author(s) / Creator(s): ;
Date Published:
Journal Name: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics
Volume: 1
Page Range / eLocation ID: 14466 to 14487
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Second language learners studying languages with a diverse set of prepositions often find preposition usage difficult to master, which can manifest in second language writing as preposition errors that appear to result from transfer from a native language, or interlingual errors. We envision a digital writing assistant for language learners and teachers that can provide targeted feedback on these errors. To address these errors, we turn to the task of preposition error detection, which remains an open problem despite the many methods that have been proposed. In this paper, we explore various classifiers, with and without neural network-based features, and finetuned BERT models for detecting preposition errors between verbs and their noun arguments. 
  2. The Indonesian language is heavily riddled with colloquialism, whether in written or spoken forms. In this paper, we identify a class of Indonesian colloquial words that have undergone morphological transformations from their standard forms, categorize their word formations, and propose a benchmark dataset of Indonesian Colloquial Lexicons (IndoCollex) consisting of informal words on Twitter expertly annotated with their standard forms and their word formation types/tags. We evaluate several models for character-level transduction to perform morphological word normalization on this testbed to understand their failure cases and provide baselines for future work. As IndoCollex catalogues word formation phenomena that are also present in the non-standard text of other languages, it can also provide an attractive testbed for methods tailored for cross-lingual word normalization and non-standard word formation.
  3. Recent advances in trusted execution environments, specifically with Intel's introduction of SGX on consumer processors, have provided unprecedented opportunities to create secure applications with a small TCB. While a large number of SGX solutions have been proposed, nearly all of them focus on protecting native code applications, leaving scripting languages unprotected. To fill this gap, this paper presents SCRIPTSHIELD, a framework capable of running legacy script code while simultaneously providing confidentiality and integrity for scripting code and data. In contrast to the existing schemes that either require tedious and time-consuming re-development or result in a large TCB by importing an entire library OS or container, SCRIPTSHIELD keeps the TCB small and provides backwards compatibility (i.e., no changes needed to the scripting code itself). The core idea is to customize the script interpreter to run inside an SGX enclave and pass scripts to it. We have implemented SCRIPTSHIELD and tested it with three popular scripting languages: Lua, JavaScript, and Squirrel. Our experimental results show that SCRIPTSHIELD does not cause noticeable overhead. The source code of SCRIPTSHIELD has been made publicly available as an open source project.
  4. Global teams frequently consist of language-based subgroups who put together complementary information to achieve common goals. Previous research outlines a two-step work communication flow in these teams. There are team meetings using a required common language (i.e., English); in preparation for those meetings, people have subgroup conversations in their native languages. Work communication at team meetings is often less effective than in subgroup conversations. In the current study, we investigate the idea of leveraging machine translation (MT) to facilitate global team meetings. We hypothesize that exchanging subgroup conversation logs before a team meeting offers contextual information that benefits teamwork at the meeting. MT can translate these logs, which enables comprehension at a low cost. To test our hypothesis, we conducted a between-subjects experiment where twenty quartets of participants performed a personnel selection task. Each quartet included two English native speakers (NS) and two non-native speakers (NNS) whose native language was Mandarin. All participants began the task with subgroup conversations in their native languages, then proceeded to team meetings in English. We manipulated the exchange of subgroup conversation logs prior to team meetings: with MT-mediated exchanges versus without. Analysis of participants' subjective experience, task performance, and depth of discussions as reflected through their conversational moves jointly indicates that team meeting quality improved when there were MT-mediated exchanges of subgroup conversation logs as opposed to no exchanges. We conclude with reflections on when and how MT could be applied to enhance global teamwork across a language barrier. 
  5. Badia, Rosa M; Mohror, Kathryn (Ed.)
    In contemporary high-performance computing architectures, the integration of GPU accelerators has become increasingly prevalent. To harness the full potential of these accelerators, developers often resort to vendor-specific kernel languages, such as CUDA. While this approach ensures optimal efficiency, it inherently compromises portability and engenders vendor dependency. Existing portable programming models, such as OpenMP, while promising, demand extensive code rewriting due to their fundamental difference from kernel languages. In this work, we introduce extensions to LLVM OpenMP, transforming it into a versatile and performance portable kernel language for GPU programming. These extensions allow for the seamless porting of programs from kernel languages to high-performance OpenMP GPU programs with minimal modifications. To evaluate our extension, we implemented a proof-of-concept prototype that contains a subset of extensions we proposed. We ported six established CUDA proxy and benchmark applications and evaluated their performance on both AMD and NVIDIA platforms. By comparing with native versions (HIP and CUDA), our results show that OpenMP, augmented with our extensions, can not only match but also in some cases exceed the performance of kernel languages, thereby offering performance portability with minimal effort from application developers.