Title: Archival description for language documentation collections
Users of digital language archives face a number of barriers when trying to discover and reuse the materials preserved in the digital collections created by current language documentation projects. These barriers include sparse descriptive metadata throughout many collections and the prevalence of audio-video materials that are impervious to text-based search. Users could more easily evaluate, navigate, and use such a collection if it contained a guide that contextualized it, summarized its contents, and helped users identify and locate items within it. This article will discuss the importance of thorough collection descriptions and finding aids by synthesizing guidelines and best practices for archival description created for traditional archives and adapting these to the structure and makeup of today’s digital language documentation collections. To facilitate the iterative description of growing collections, the checklist of information to include is presented in three groups of descending priority.
Award ID(s):
1653380
PAR ID:
10200663
Author(s) / Creator(s):
Date Published:
Journal Name:
Language documentation and conservation
Volume:
14
ISSN:
1934-5275
Page Range / eLocation ID:
520-578
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Data users need relevant context and research expertise to effectively search for and identify relevant datasets. Leading data providers, such as the Inter‐university Consortium for Political and Social Research (ICPSR), offer standardized metadata and search tools to support data search. Metadata standards emphasize the machine‐readability of data and its documentation. There are opportunities to enhance dataset search by improving users' ability to learn about, and make sense of, information about data. Prior research has shown that context and expertise are two main barriers users face in effectively searching for, evaluating, and deciding whether to reuse data. In this paper, we propose a novel chatbot‐based search system, DataChat, that leverages a graph database and a large language model to provide novel ways for users to interact with and search for research data. DataChat complements data archives' and institutional repositories' ongoing efforts to curate, preserve, and share research data for reuse by making it easier for users to explore and learn about available research data. 
  2. As theorized in language documentation, archives serve to make research reproducible and to make primary data accessible for multiple audiences (Himmelmann 2006; Berez-Kroeker et al. 2018). Scholars in the emerging mid-20th-century field of African history emphasized these same priorities. Mid-century Africanist historians assembled large text collections but failed in a clearly stated disciplinary project to preserve them in accessible archives. This paper explores the relationship between institutional and social factors in data preservation through the story of audio recordings and field notes documenting Soo (Uganda: Kuliak/Nilo-Saharan) collected in the mid-20th century by Makerere University history PhD student John M. Weatherby. For decades, Weatherby struggled and failed to find an institutional home for his materials, which were nearly lost amid changing disciplinary trends. I encountered them only through informal social interactions in 2018 and have subsequently been depositing them in a language archive. The slide of Weatherby’s data into obscurity shows how archiving is inherently a disciplinary practice. Institutions intending to preserve data rose and fell with changing disciplinary paradigms, but Weatherby’s data were preserved through personal relationships. Despite a common emphasis on technical and institutional initiatives for archiving, the relational contexts of legacy materials are central to their preservation. 
  3. Ticuna (ISO: tca) is a language isolate spoken in the northwestern Amazon Basin (Brazil, Colombia, Peru). Ticuna has more speakers than almost all other Indigenous Amazonian languages and – unlike most languages of the area – is still learned by children. Yet academic linguists have given it relatively little research attention. Therefore, to raise the profile of this areally important language, I offer a guide to three collections of Ticuna language materials held in the California Language Archive. These materials are extensive, including over 1,396 hours of recordings – primarily of child language and everyday conversations between adults – and 33 hours of transcriptions. To contextualize the materials, I provide background on the Ticuna language and people; the research projects which produced the materials; the participants who appear in them; and the ethical and permissions issues involved in collecting them. I then discuss the nature and scope of the materials, showing how the content of each collection motivated collection-specific choices about recording, transcription, organization in the archive, and metadata. Last, I outline how other researchers could draw on the collections for comparative analysis.
  4. Over the past three decades the field of linguistics has refocused attention on endangered languages, and enormous strides have been made to document these languages and develop archive infrastructure for language data. Although the potential for language archives to support language renewal efforts has often been tacitly assumed, much greater attention has been given to the preservation of data than to access and utilization. Documentation activities are imagined as a race against time to get language data into a lasting form before the last speakers pass away. Here I describe three examples of efforts which are working to engage with language communities and increase the accessibility and usability of language resources. Though not necessarily representative, these efforts suggest ways in which linguists, archivists, and communities can collaborate to support digital return. 
  5. A wealth of information about how parasites interact with their hosts already exists in collections, scientific publications, specialized databases, and grey literature. The US National Science Foundation-funded Terrestrial Parasite Tracker Thematic Collection Network (TPT) project began in 2019 to help build a comprehensive picture of arthropod ectoparasites including the evolution of these parasite-host biotic associations, distributions, and the ecological interactions of disease vectors. TPT is a network of biodiversity collections whose data can assist scientists, educators, land managers, and policymakers to better understand the complex relationship between hosts and parasites including emergent properties that may explain the causes and frequency of human and wildlife pathogens. TPT member collections make their association information easier to access via Global Biotic Interactions (GloBI, Poelen et al. 2014), which is periodically archived through Zenodo to track progress in the TPT project. TPT leverages GloBI's ability to index biotic associations from specimen occurrence records that come from existing management systems (e.g., Arctos, Symbiota, EMu, Excel, MS Access) to avoid having to completely rework existing, or build new, cyber-infrastructures before collections can share data. TPT-affiliated collection managers use collection-specific translation tables to connect their verbatim (or original) terms used to describe associations (e.g., "ex", "found on", "host") to their interpreted, machine-readable terms in the OBO Relations Ontology (RO). These interpreted terms enable searches across previously siloed association record sets, while the original verbatim values remain accessible to help retain provenance and allow for interpretation improvements. TPT is an ambitious project, with the goal of databasing label data from over 1.2 million specimens of arthropod parasites of vertebrates coming from 22 collections across North America.
     In the first year of the project, the TPT collections created over 73,700 new records and 41,984 images. In addition, 17 TPT data providers and three other collaborators shared datasets that are now indexed by GloBI, visible on the TPT GloBI project page. These datasets came from collection specimen occurrence records and literature sources. Two TPT data archives that capture and preserve the changes in the data coming from TPT to GloBI were published through Zenodo (Poelen et al. 2020a, Poelen et al. 2020b). The archives document the changes in how data are shared by collections including the biotic association data format and quantity of data captured. The Poelen et al. 2020b report included all TPT collections and biotic interactions from Arctos collections in VertNet and the Symbiota Collection of Arthropods Network (SCAN). The total number of interactions included in this report was 376,671 records (500,000 interactions is the overall goal for TPT). In addition, close coordination with TPT collection data managers including many one-on-one conversations, a workshop, and a webinar (Sullivan et al. 2020) was conducted to help guide the data capture of biotic associations. GloBI is an effective tool to help integrate biotic association data coming from occurrence records into an openly accessible, global, linked view of existing species interaction records. The results gleaned from the TPT workshop and Zenodo data archives demonstrate that minimizing changes to existing workflows allows for custom interpretation of collection-specific interaction terms. In addition, including collection data managers in the development of the interaction term vocabularies is an important part of the process that may improve data sharing and the overall downstream data quality.
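The collection-specific translation tables described in the last abstract can be pictured with a minimal sketch. This is not the TPT or GloBI implementation: the table contents, the `Interaction` type, and the `interpret` function are hypothetical names introduced for illustration, and the choice to map all three verbatim terms to the RO term "has host" (given here as RO:0002454, which should be checked against the ontology) is an assumption. The sketch only shows the idea of attaching an interpreted term while keeping the verbatim value for provenance.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical translation table for one collection: verbatim term -> (RO id, RO label).
# The mappings and the identifier are assumptions for illustration only.
TRANSLATION_TABLE = {
    "ex": ("RO:0002454", "has host"),
    "found on": ("RO:0002454", "has host"),
    "host": ("RO:0002454", "has host"),
}


@dataclass(frozen=True)
class Interaction:
    verbatim_term: str        # original wording, kept unchanged for provenance
    ro_id: Optional[str]      # interpreted RO identifier, None if unmapped
    ro_label: Optional[str]   # interpreted RO label, None if unmapped


def interpret(verbatim: str) -> Interaction:
    """Look up a verbatim association term, ignoring case and outer whitespace."""
    key = verbatim.strip().lower()
    ro_id, ro_label = TRANSLATION_TABLE.get(key, (None, None))
    return Interaction(verbatim, ro_id, ro_label)


if __name__ == "__main__":
    for term in ["ex", "Found on", "collected near"]:
        print(interpret(term))
```

Unmapped terms come back with empty interpreted fields rather than being dropped, so a collection manager could later extend the table and reinterpret the same verbatim values.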