Title: Ticuna (tca) language documentation: A guide to materials in the California Language Archive
Ticuna (ISO: tca) is a language isolate spoken in the northwestern Amazon Basin (Brazil, Colombia, Peru). Ticuna has more speakers than almost all other Indigenous Amazonian languages and – unlike most languages of the area – is still learned by children. Yet academic linguists have given it relatively little research attention. Therefore, to raise the profile of this areally important language, I offer a guide to three collections of Ticuna language materials held in the California Language Archive. These materials are extensive, including over 1,396 hours of recordings – primarily of child language and everyday conversations between adults – and 33 hours of transcriptions. To contextualize the materials, I provide background on the Ticuna language and people; the research projects which produced the materials; the participants who appear in them; and the ethical and permissions issues involved in collecting them. I then discuss the nature and scope of the materials, showing how the content of each collection motivated collection-specific choices about recording, transcription, organization in the archive, and metadata. Last, I outline how other researchers could draw on the collections for comparative analysis.
Award ID(s):
1911762
PAR ID:
10275773
Author(s) / Creator(s):
Date Published:
Journal Name:
Language documentation and conservation
Volume:
15
ISSN:
1934-5275
Page Range / eLocation ID:
153-189
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Users of digital language archives face a number of barriers when trying to discover and reuse the materials preserved in the digital collections created by current language documentation projects. These barriers include sparse descriptive metadata throughout many collections and the prevalence of audio-video materials that are impervious to text-based search. Users could more easily evaluate, navigate, and use such a collection if it contained a guide that contextualized it, summarized its contents, and helped users identify and locate items within it. This article will discuss the importance of thorough collection descriptions and finding aids by synthesizing guidelines and best practices for archival description created for traditional archives and adapting these to the structure and makeup of today’s digital language documentation collections. To facilitate the iterative description of growing collections, the checklist of information to include is presented in three groups of descending priority.
  2. As theorized in language documentation, archives serve to make research reproducible and to make primary data accessible for multiple audiences (Himmelmann 2006; Berez-Kroeker et al. 2018). Scholars in the emerging mid-20th-century field of African history emphasized these same priorities. Mid-century Africanist historians assembled large text collections but failed in a clearly stated disciplinary project to preserve them in accessible archives. This paper explores the relationship between institutional and social factors in data preservation through the story of audio recordings and field notes documenting Soo (Uganda: Kuliak/Nilo-Saharan) collected in the mid-20th century by Makerere University history PhD student John M. Weatherby. For decades, Weatherby struggled and failed to find an institutional home for his materials, which were nearly lost amid changing disciplinary trends. I encountered them only through informal social interactions in 2018 and have subsequently been depositing them in a language archive. The slide of Weatherby’s data into obscurity shows how archiving is inherently a disciplinary practice. Institutions intending to preserve data rose and fell with changing disciplinary paradigms, but Weatherby’s data were preserved through personal relationships. Despite a common emphasis on technical and institutional initiatives for archiving, the relational contexts of legacy materials are central to their preservation. 
  3. Over the past three decades the field of linguistics has refocused attention on endangered languages, and enormous strides have been made to document these languages and develop archive infrastructure for language data. Although the potential for language archives to support language renewal efforts has often been tacitly assumed, much greater attention has been given to the preservation of data than to access and utilization. Documentation activities are imagined as a race against time to get language data into a lasting form before the last speakers pass away. Here I describe three examples of efforts which are working to engage with language communities and increase the accessibility and usability of language resources. Though not necessarily representative, these efforts suggest ways in which linguists, archivists, and communities can collaborate to support digital return. 
  4. Despite the major advances in NLP, significant disparities in NLP system performance across languages still exist. Arguably, these are due to uneven resource allocation and sub-optimal incentives to work on less resourced languages. To track and further incentivize the global development of equitable language technology, we introduce GlobalBench. Prior multilingual benchmarks are static and have focused on a limited number of tasks and languages. In contrast, GlobalBench is an ever-expanding collection that aims to dynamically track progress on all NLP datasets in all languages. Rather than solely measuring accuracy, GlobalBench also tracks the estimated per-speaker utility and equity of technology across all languages, providing a multi-faceted view of how language technology is serving people of the world. Furthermore, GlobalBench is designed to identify the most under-served languages, and rewards research efforts directed towards those languages. At present, the most under-served languages are the ones with a relatively high population, but nonetheless overlooked by composite multilingual benchmarks (like Punjabi, Portuguese, and Wu Chinese). Currently, GlobalBench covers 966 datasets in 190 languages, and has 1,128 system submissions spanning 62 languages. 
  5. Commonly recommended methods for documenting endangered languages are built around the assumption that a given documentary project will focus on a single language rather than a multilingual ecology. This hinders the potential usability of documentary materials for the study of language contact. Research in domains such as ethnography and sociolinguistics has developed conceptual and analytical tools for understanding patterns of multilingual usage, but the insights of such work have yet to be translated into concrete recommendations for enhancements to documentary practice. This paper considers how standard documentary approaches can be adapted to multilingual contexts with respect to activities such as the collection of metadata, the use of ethnographic methods, and the recording and annotation of naturalistic multilingual discourse. A particular focus of the discussion are ways in which documentary projects can create better records of multilingual practices even if these are not the focus of the work.