NSF PAR Search | NSF Public Access Repository

Digital Documentation for Diasporic Data: challenges, opportunities, and solutions for working with Diaspora Communities

Taguchi, Chihiro; Liebl, J Elizabeth; Anastasopoulos, Antonios; Chiang, David; Walther, Géraldine (May 2025, 9th International Conference on Language Documentation & Conservation (ICLDC))

Language documentation involving diaspora communities presents a combination of challenges and opportunities for approaches leveraging large data collection and assisted transcription and annotation. Demonstrating our projects on Kichwa and Mapudungun, we will present a suite of our computational tools designed to effectively work with diaspora communities for language documentation.
Free, publicly-accessible full text available May 3, 2026
Findings of the WMT 2024 Shared Task of the Open Language Data Initiative

https://doi.org/10.18653/v1/2024.wmt-1.4

Maillard, Jean; Burchell, Laurie; Anastasopoulos, Antonios; Federmann, Christian; Koehn, Philipp; Wang, Skyler (November 2024, Association for Computational Linguistics)

Full Text Available
Language and Speech Technology for Central Kurdish Varieties

Ahmadi, Sina; Jaff, Daban; Ibn Alam, Md Mahfuz; Anastasopoulos, Antonios (May 2024, ELRA and ICCL)

Kurdish, an Indo-European language spoken by over 30 million people, is considered a dialect continuum and known for its diversity in language varieties. Previous studies addressing language and speech technology for Kurdish handle it in a monolithic way as a macro-language, resulting in disparities for dialects and varieties for which there are few resources and tools available. In this paper, we take a step towards developing resources for language and speech technology for varieties of Central Kurdish, creating a corpus by transcribing movies and TV series as an alternative to fieldwork. Additionally, we report the performance of machine translation, automatic speech recognition, and language identification as downstream tasks evaluated on Central Kurdish subdialects. Data and models are publicly available under an open license at https://github.com/sinaahmadi/CORDI.
Full Text Available
CODET: A Benchmark for Contrastive Dialectal Evaluation of Machine Translation

Ibn Alam, Md Mahfuz; Ahmadi, Sina; Anastasopoulos, Antonios (March 2024, Association for Computational Linguistics)

Neural machine translation (NMT) systems exhibit limited robustness in handling source-side linguistic variations. Their performance tends to degrade when faced with even slight deviations in language usage, such as different domains or variations introduced by second-language speakers. It is intuitive to extend this observation to encompass dialectal variations as well, but the work allowing the community to evaluate MT systems on this dimension is limited. To alleviate this issue, we compile and release CODET, a contrastive dialectal benchmark encompassing 891 different variations from twelve different languages. We also quantitatively demonstrate the challenges large MT models face in effectively translating dialectal variants. All the data and code have been released.
Full Text Available
Zambezi Voice: A Multilingual Speech Corpus for Zambian Languages

https://doi.org/10.21437/Interspeech.2023-1979

Sikasote, Claytone; Siaminwe, Kalinda; Mwape, Stanly; Zulu, Bangiwe; Phiri, Mofya; Phiri, Martin; Zulu, David; Nyirenda, Mayumbo; Anastasopoulos, Antonios (August 2023, Proc. INTERSPEECH 2023)

Full Text Available
Measure words are measurably different from sortal classifiers

Wang, Yamei; Walther, Géraldine (March 2023, Association for Computational Linguistics)

Full Text Available
Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities

https://doi.org/10.18653/v1/2023.acl-long.809

Ahmadi, Sina; Anastasopoulos, Antonios (January 2023, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics)

The wide accessibility of social media has provided linguistically under-represented communities with an extraordinary opportunity to create content in their native languages. This, however, comes with certain challenges in script normalization, particularly where the speakers of a language in a bilingual community rely on another script or orthography to write their native language. This paper addresses the problem of script normalization for several such languages that are mainly written in a Perso-Arabic script. Using synthetic data with various levels of noise and a transformer-based model, we demonstrate that the problem can be effectively remediated. We conduct a small-scale evaluation on real data as well. Our experiments indicate that script normalization also improves the performance of downstream tasks such as machine translation and language identification.
Full Text Available
PALI: A Language Identification Benchmark for Perso-Arabic Scripts

https://doi.org/10.18653/v1/2023.vardial-1.8

Ahmadi, Sina; Agarwal, Milind; Anastasopoulos, Antonios (January 2023, Association for Computational Linguistics)

Full Text Available
Mandarin classifier systems optimize to accommodate communicative pressures

https://doi.org/10.18653/v1/2023.findings-emnlp.843

Wang, Yamei; Walther, Géraldine (January 2023, Association for Computational Linguistics)

Full Text Available
Approaches to Corpus Creation for Low-Resource Language Technology: the Case of Southern Kurdish and Laki

https://doi.org/10.18653/v1/2023.fieldmatters-1.7

Ahmadi, Sina; Azin, Zahra; Belelli, Sara; Anastasopoulos, Antonios (January 2023, Association for Computational Linguistics)

Full Text Available
