Abstract Color encoding is foundational to visualizing quantitative data. Guidelines for colormap design have traditionally emphasized perceptual principles, such as order and uniformity. However, colors also evoke cognitive and linguistic associations whose role in data interpretation remains underexplored. We study how two linguistic factors, name salience and name variation, affect people's ability to draw inferences from spatial visualizations. In two experiments, we found that participants are better at interpreting visualizations when viewing colors with more salient names (e.g., prototypical ‘blue’, ‘yellow’, and ‘red’ over ‘teal’, ‘beige’, and ‘maroon’). The effect was robust across four visualization types, but was more pronounced in continuous (e.g., smooth geographical maps) than in similar discrete representations (e.g., choropleths). Participants' accuracy also improved as the number of nameable colors increased, although the latter had a less robust effect. Our findings suggest that color nameability is an important design consideration for quantitative colormaps, and may even outweigh traditional perceptual metrics. In particular, we found that the linguistic associations of color are a better predictor of performance than the perceptual properties of those colors. We discuss the implications and outline research opportunities. The data and materials for this study are available athttps://osf.io/asb7n
more »
« less
Reproducible research in linguistics: A position statement on data citation and attribution in our field
Abstract This paper is a position statement on reproducible research in linguistics, including data citation and attribution, that represents the collective views of some 41 colleagues. Reproducibility can play a key role in increasing verification and accountability in linguistic research, and is a hallmark of social science research that is currently under-represented in our field. We believe that we need to take time as a discipline to clearly articulate our expectations for how linguistic data are managed, cited, and maintained for long-term access.
more »
« less
- Award ID(s):
- 1447886
- PAR ID:
- 10055458
- Date Published:
- Journal Name:
- Linguistics
- Volume:
- 56
- Issue:
- 1
- ISSN:
- 0024-3949
- Page Range / eLocation ID:
- 1 to 18
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Linguistic analysis is a core task in the process of documenting, analyzing, and describing endangered and less-studied languages. In addition to providing insight into the properties of the language being studied, having tools to automatically label words in a language for grammatical category and morphological features can support a range of applications useful for language pedagogy and revitalization. At the same time, most modern NLP methods for these tasks require both large amounts of data in the language and compute costs well beyond the capacity of most research groups and language communities. In this paper, we present a gloss-to-gloss (g2g) model for linguistic analysis (specifically, morphological analysis and part-of-speech tagging) that is lightweight in terms of both data requirements and computational expense. The model is designed for the interlinear glossed text (IGT) format, in which we expect the source text of a sentence in a low-resource language, a translation of that sentence into a language of wider communication, and a detailed glossing of the morphological properties of each word in the sentence. We first produce silver standard parallel glossed data by automatically labeling the high-resource translation. The model then learns to transform source language morphological labels into output labels for the target language, mediated by a structured linguistic representation layer. We test the model on both low-resource and high-resource languages, and find that our simple CNN-based model achieves comparable performance to a state-of-the-art transformer-based model, at a fraction of the computational cost.more » « less
-
The data and compute requirements of current language modeling technology pose challenges for the processing and analysis of low-resource languages. Declarative linguistic knowledge has the potential to partially bridge this data scarcity gap by providing models with useful inductive bias in the form of language-specific rules. In this paper, we propose a retrieval augmented generation (RAG) framework backed by a large language model (LLM) to correct the output of a smaller model for the linguistic task of morphological glossing. We leverage linguistic information to make up for the lack of data and trainable parameters, while allowing for inputs from written descriptive grammars interpreted and distilled through an LLM. The results demonstrate that significant leaps in performance and efficiency are possible with the right combination of: a) linguistic inputs in the form of grammars, b) the interpretive power of LLMs, and c) the trainability of smaller token classification networks. We show that a compact, RAG-supported model is highly effective in data-scarce settings, achieving a new state-of-the-art for this task and our target languages. Our work also offers documentary linguists a more reliable and more usable tool for morphological glossing by providing well-reasoned explanations and confidence scores for each output.more » « less
-
In this paper, we seek to advance the state-of-the-art in code evolution analysis research and practice by statistically analyzing, interpreting, and formally describing the evolution of code lexicon in Open Source Software (OSS). The underlying hypothesis is that, similar to natural language, code lexicon falls under the remit of evolutionary principles. Therefore, adapting theories and statistical models of natural language evolution to code is expected to provide unique insights into software evolution. Our analysis in this paper is conducted using 2,000 OSS systems sampled from a broad range of application domains. Our results show that a) OSS projects exhibit a significant shift in their linguistic identity over time, b) different syntactic structures of code lexicon evolve differently, c) different factors of OSS development and different maintenance activities impact code lexicon differently. These insights lay out a preliminary foundation for modeling the linguistic history of OSS projects. In the long run, this foundation will be utilized to provide support for basic software maintenance and program comprehension activities, and gain new theoretical insights into the complex interplay between linguistic change and various system and human aspects of OSS development.more » « less
-
The American Sign Language Linguistic Research Project (ASLLRP) provides Internet access to high-quality ASL video data, generally including front and side views and a close-up of the face. The manual and non-manual components of the signing have been linguistically annotated using SignStream(R). The recently expanded video corpora can be browsed and searched through the Data Access Interface (DAI 2) we have designed; it is possible to carry out complex searches. The data from our corpora can also be downloaded; annotations are available in an XML export format. We have also developed the ASLLRP Sign Bank, which contains almost 6,000 sign entries for lexical signs, with distinct English-based glosses, with a total of 41,830 examples of lexical signs (in addition to about 300 gestures, over 1,000 fingerspelled signs, and 475 classifier examples). The Sign Bank is likewise accessible and searchable on the Internet; it can also be accessed from within SignStream(R) (software to facilitate linguistic annotation and analysis of visual language data) to make annotations more accurate and efficient. Here we describe the available resources. These data have been used for many types of research in linguistics and in computer-based sign language recognition from video; examples of such research are provided in the latter part of this article.more » « less
An official website of the United States government

