NSF PAR Search | NSF Public Access Repository

MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization

Ahia, Orevaoghene; Kumar, Sachin; Gonen, Hila; Hofmann, Valentin; Limisiewicz, Tomasz; Tsvetkov, Yulia; Smith, Noah A (December 2024, NeurIPS)

Full Text Available
Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects

Ahia, Orevaoghene; Aremu, Anuoluwapo; Abagyan, Diana; Gonen, Hila; Adelani, David Ifeoluwa; Abolade, Daud; Smith, Noah A; Tsvetkov, Yulia (December 2024, EMNLP)

Full Text Available
Teaching LLMs to Abstain across Languages via Multilingual Feedback

Feng, Shangbin; Shi, Weijia; Wang, Yike; Ding, Wenxuan; Ahia, Orevaoghene; Li, Shuyue Stella; Balachandran, Vidhisha; Sitaram, Sunayana; Tsvetkov, Yulia (December 2024, EMNLP)

Full Text Available
Extracting Lexical Features from Dialects via Interpretable Dialect Classifiers

Xie, Roy; Ahia, Orevaoghene; Tsvetkov, Yulia; Anastasopoulos, Antonios (June 2024, NAACL)

Identifying linguistic differences between dialects of a language often requires expert knowledge and meticulous human analysis. This is largely due to the complexity and nuance involved in studying various dialects. We present a novel approach to extract distinguishing lexical features of dialects by utilizing interpretable dialect classifiers, even in the absence of human experts. We explore both post hoc and intrinsic approaches to interpretability, conduct experiments on Mandarin, Italian, and Low Saxon, and experimentally demonstrate that our method successfully identifies key language-specific lexical features that contribute to dialectal variations.
Full Text Available
Extracting Lexical Features from Dialects via Interpretable Dialect Classifiers

https://doi.org/10.18653/v1/2024.naacl-short.5

Xie, Roy; Ahia, Orevaoghene; Tsvetkov, Yulia; Anastasopoulos, Antonios (June 2024, Association for Computational Linguistics)

Full Text Available
DialectBench: An NLP Benchmark for Dialects, Varieties, and Closely-Related Languages

Faisal, Fahim; Ahia, Orevaoghene; Srivastava, Aarohi; Ahuja, Kabir; Chiang, David; Tsvetkov, Yulia; Anastasopoulos, Antonios (July 2024, ACL)

Full Text Available
DIALECTBENCH: An NLP Benchmark for Dialects, Varieties, and Closely-Related Languages

https://doi.org/10.18653/v1/2024.acl-long.777

Faisal, Fahim; Ahia, Orevaoghene; Srivastava, Aarohi; Ahuja, Kabir; Chiang, David; Tsvetkov, Yulia; Anastasopoulos, Antonios (January 2024, Association for Computational Linguistics)

Full Text Available
LEXPLAIN: Improving Model Explanations via Lexicon Supervision

https://doi.org/10.18653/v1/2023.starsem-1.19

Ahia, Orevaoghene; Gonen, Hila; Balachandran, Vidhisha; Tsvetkov, Yulia; Smith, Noah (July 2023, Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023))

Model explanations that shed light on the model’s predictions are becoming a desired additional output of NLP models, alongside their predictions. Challenges in creating these explanations include making them trustworthy and faithful to the model’s predictions. In this work, we propose a novel framework for guiding model explanations by supervising them explicitly. To this end, our method, LEXPLAIN, uses task-related lexicons to directly supervise model explanations. This approach consistently improves the plausibility of the model’s explanations without sacrificing performance on the task, as we demonstrate on sentiment analysis and toxicity detection. Our analyses show that our method also demotes spurious correlations (i.e., with respect to African American English dialect) on toxicity detection, improving fairness.
Full Text Available
Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models

https://doi.org/10.18653/v1/2023.emnlp-main.614

Ahia, Orevaoghene; Kumar, Sachin; Gonen, Hila; Kasai, Jungo; Mortensen, David; Smith, Noah; Tsvetkov, Yulia (January 2023, Association for Computational Linguistics)

Language models have graduated from being research prototypes to commercialized products offered as web APIs, and recent works have highlighted the multilingual capabilities of these products. The API vendors charge their users based on usage, more specifically on the number of “tokens” processed or generated by the underlying language models. What constitutes a token, however, is training data and model dependent with a large variance in the number of tokens required to convey the same information in different languages. In this work, we analyze the effect of this non-uniformity on the fairness of an API’s pricing policy across languages. We conduct a systematic analysis of the cost and utility of OpenAI’s language model API on multilingual benchmarks in 22 typologically diverse languages. We show evidence that speakers of a large number of the supported languages are overcharged while obtaining poorer results. These speakers tend to also come from regions where the APIs are less affordable, to begin with. Through these analyses, we aim to increase transparency around language model APIs’ pricing policies and encourage the vendors to make them more equitable.
Full Text Available
