Title: CatMapper: user interface support for large complex categories and semantic data exploration
Scientists and policymakers are increasingly leveraging complex, multi-scale data from diverse, worldwide sources to understand the causes and consequences of economic development, social stratification, climate change, cultural diversity, and violent conflict. This work frequently requires integrating data across diverse datasets by complex, dynamic categories (e.g., ethnicities, languages, religions, subdistricts). However, different datasets encode corresponding categories in disparate formats and at different resolutions (e.g., Guatemala Indigenous vs. Maya vs. K’iche’). These diverse encodings must be translated across datasets before the data can be brought together for analysis. At global scales across thousands of categories, the combinatorial complexity creates thorny challenges for manual reconciliation and for transparent documentation and sharing of researcher decisions. There is a need to investigate direct and uncomplicated ways to support searching and exploring the semantics of complex and diverse datasets. We design and deploy such a tool, CatMapper, to support semantic discovery through exploration and manipulation of large, complex, and diverse datasets. CatMapper enables exploring contextual information about specific categories, translating new sets of categories from existing datasets and published studies, identifying and integrating novel combinations of datasets for researchers’ custom needs, including automatically generated syntax to merge datasets of interest, and publishing and sharing merging templates for public re-use and open science. CatMapper does not store observational data. Rather, it is a dynamic, interactive dictionary of keys to help users integrate observational data from diverse external datasets in disparate formats, thereby complementing and leveraging a fast-growing ecology of datasets storing observational data. We have conducted a heuristic evaluation of CatMapper’s usability. The results shed light on ways to enrich semantic data discovery.
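The "dictionary of keys" idea described in the abstract can be illustrated with a minimal sketch. It is not CatMapper's actual implementation or API: the dataset names, local codes, category IDs, and helper functions below are all hypothetical, invented only to show how a key dictionary can translate two differently encoded datasets to a shared category before merging them.

```python
from collections import defaultdict

# Hypothetical key dictionary: (dataset, local code) -> shared category ID.
# Every dataset name, code, and ID here is invented for illustration.
KEYS = {
    ("survey_a", "ETH_07"): "CAT_KICHE",
    ("survey_b", "k'iche'"): "CAT_KICHE",
    ("survey_a", "ETH_02"): "CAT_MAYA",
    ("survey_b", "maya"): "CAT_MAYA",
}


def translate(dataset, rows, key_field):
    """Attach the shared category ID to each row; unmapped rows get None."""
    return [
        {**row, "category_id": KEYS.get((dataset, row[key_field]))}
        for row in rows
    ]


def merge_on_category(left, right):
    """Inner-join two translated row lists on the shared category ID."""
    index = defaultdict(list)
    for row in right:
        if row["category_id"] is not None:
            index[row["category_id"]].append(row)
    merged = []
    for left_row in left:
        for right_row in index.get(left_row["category_id"], []):
            # On a field-name collision the left row's value wins.
            merged.append({**right_row, **left_row})
    return merged


# Two toy datasets that encode the same category in different ways.
survey_a = [
    {"code": "ETH_07", "literacy_rate": 0.61},
    {"code": "ETH_99", "literacy_rate": 0.80},  # no entry in KEYS
]
survey_b = [
    {"name": "k'iche'", "population": 1200},
    {"name": "maya", "population": 5000},
]

merged = merge_on_category(
    translate("survey_a", survey_a, "code"),
    translate("survey_b", survey_b, "name"),
)
```

In this sketch the observational values (`literacy_rate`, `population`) stay in the external datasets; only the key dictionary is shared, which mirrors the separation the abstract describes. Rows whose local code has no entry in the dictionary are left out of the merge rather than guessed at.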
Award ID(s):
2318505
PAR ID:
10590897
Publisher / Repository:
AHFE International
Date Published:
ISSN:
2771-0718
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. A key challenge in conducting comparative analyses across social units, such as religions, ethnicities, or cultures, is that data on these units are often encoded in distinct and incompatible formats across diverse datasets. This can involve simple differences in the variables and values used to encode these units (e.g., Roman Catholic is V130 = 1 vs. Q98A = 2 in two different datasets) or differences in the resolutions at which units are encoded (Maya vs. Kaqchikel Maya). These disparate encodings can create substantial challenges for the efficiency and transparency of data syntheses across diverse datasets. We introduce a user‐friendly set of tools to help users translate four kinds of categories (religion, ethnicity, language, and subdistrict) across multiple, external datasets. We outline the platform's key functions and current progress, as well as long‐range goals for the platform. 
  2. Health-related speech datasets are often small and varied in focus. This makes it difficult to leverage them to effectively support healthcare goals. Robust transfer of linguistic features across different datasets orbiting the same goal carries potential to address this concern. To test this hypothesis, we experiment with domain adaptation (DA) techniques on heterogeneous spoken language data to evaluate generalizability across diverse datasets for a common task: dementia detection. We find that adapted models exhibit better performance across conversational and task-oriented datasets. The feature-augmented DA method achieves a 22% increase in accuracy when adapting from a conversational to a task-specific dataset, compared to a jointly trained baseline. This suggests promising capacity of these techniques to allow for productive use of disparate data for a complex spoken language healthcare task. 
  3. Legumes, comprising one of the largest, most diverse, and most economically important plant families, are the subject of vibrant research and development worldwide. Continued improvement of legume crops will benefit from the recent proliferation of genetic (including genomic) resources; but the diversity, scale, and complexity of these resources present challenges to those managing and using them. A workshop held in March of 2019 addressed questions of data resources and priorities for the legumes. The workshop identified various needs and recommendations: (a) Develop strategies to effectively store, integrate, and relate genetic resources collected in different projects. (b) Leverage information collected across many legume species by standardizing data formats and ontologies, improving the state of metadata about datasets, and increasing use of the FAIR data principles. (c) Advocate for the critical role that curators exercise in integrating complex datasets into databases and adding high value metadata that enable downstream analytics and facilitate practical applications. (d) Implement standardized software and database development practices to best leverage limited developer time and expertise gained from the various legume (and other) species. (e) Develop tools and databases that can manage genetic information for the world's plant genetic resources, enabling efficient incorporation of important traits into breeding programs. (f) Centralize information on databases, tools, and training materials and establish funding streams to support training and outreach. 
  4. Knowledge discovery and information extraction from large and complex datasets have attracted great attention in wide-ranging areas from statistics and biology to medicine. Tools from machine learning, data mining, and neurocomputing have been extensively explored and utilized to accomplish such compelling data analytics tasks. However, for time-series data presenting active dynamic characteristics, many of the state-of-the-art techniques may not perform well in capturing the inherited temporal structures in these data. In this paper, integrating the Koopman operator and linear dynamical systems theory with support vector machines (SVMs), we develop a novel dynamic data mining framework to construct low-dimensional linear models that approximate the nonlinear flow of high-dimensional time-series data generated by unknown nonlinear dynamical systems. This framework then immediately enables pattern recognition, e.g., classification, of complex time-series data to distinguish their dynamic behaviors by using the trajectories generated by the reduced linear systems. Moreover, we demonstrate the applicability and efficiency of this framework through the problems of time-series classification in bioinformatics and healthcare, including cognitive classification and seizure detection with fMRI and EEG data, respectively. The developed Koopman dynamic learning framework then lays a solid foundation for effective dynamic data mining and promises a mathematically justified method for extracting the dynamics and significant temporal structures of nonlinear dynamical systems. 
  5. Discourse particles are crucial elements that subtly shape the meaning of text. These words, often polyfunctional, give rise to nuanced and often quite disparate semantic/discourse effects, as exemplified by the diverse uses of the particle *just* (e.g., exclusive, temporal, emphatic). This work investigates the capacity of LLMs to distinguish the fine-grained senses of English *just*, a well-studied example in formal semantics, using data meticulously created and labeled by expert linguists. Our findings reveal that while LLMs exhibit some ability to differentiate between broader categories, they struggle to fully capture more subtle nuances, highlighting a gap in their understanding of discourse particles. 