
Title: Using Machine Learning and Visualization for Qualitative Inductive Analyses of Big Data
Many domains require analyst expertise to determine what patterns and data are interesting in a corpus. However, most analytics tools attempt to prequalify “interestingness” using algorithmic approaches to provide exploratory overviews. This overview-driven workflow precludes the use of qualitative analysis methodologies in large datasets. This paper discusses a preliminary visual analytics approach demonstrating how visual analytics tools can instead enable expert-driven qualitative analyses at scale by supporting computer-in-the-loop mixed-initiative approaches. We argue that visual analytics tools can support rich qualitative inference by using machine learning methods to continually model and refine what features correlate with an analyst’s ongoing qualitative observations and by providing transparency into these features in order to aid analysts in navigating large corpora during qualitative analyses. We illustrate these ideas through an example from social media analysis and discuss open opportunities for designing visualizations that support qualitative inference through computer-in-the-loop approaches.
Journal Name:
Proceedings of the 2019 Workshop on Machine Learning from User Interaction
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Even highly motivated undergraduates drift off their STEM career pathways. In large introductory STEM classes, instructors struggle to identify and support these students. To address these issues, we developed co‐redesign methods in partnership with disciplinary experts to create high‐structure STEM courses that better support students and produce informative digital event data. To those data, we applied theory‐ and context‐relevant labels to reflect active and self‐regulated learning processes involving LMS‐hosted course materials, formative assessments, and help‐seeking tools. We illustrate the predictive benefits of this process across two cycles of model creation and reapplication. In cycle 1, we used theory‐relevant features from 3 weeks of data to inform a prediction model that accurately identified struggling students and sustained its accuracy when reapplied in future semesters. In cycle 2, we refit a model with temporally contextualized features that achieved superior accuracy using data from just two class meetings. This modelling approach can produce durable learning analytics solutions that afford scaled and sustained prediction and intervention opportunities that involve explainable artificial intelligence products. Those same products that inform prediction can also guide intervention approaches and inform future instructional design and delivery.

    Practitioner notes

    What is already known about this topic

    Learning analytics includes an evolving collection of methods for tracing and understanding student learning through their engagements with learning technologies.

    Prediction models based on demographic data can perpetuate systemic biases.

    Prediction models based on behavioural event data can produce accurate predictions of academic success, and validation efforts can enrich those data to reflect students' self‐regulated learning processes within learning tasks.

    What this paper adds

    Learning analytics can be successfully applied to predict performance in an authentic postsecondary STEM context, and the use of context and theory as guides for feature engineering can ensure sustained predictive accuracy upon reapplication.

    The consistent types of learning resources and cyclical nature of their provisioning from lesson to lesson are hallmarks of high‐structure active learning designs that are known to benefit learners. These designs also provide opportunities for observing and modelling contextually grounded, theory‐aligned and temporally positioned learning events that informed prediction models that accurately classified students upon initial and later reapplications in subsequent semesters.

    Co‐design relationships where researchers and instructors work together toward pedagogical implementation and course instrumentation are essential to developing unique insights for feature engineering and producing explainable artificial intelligence approaches to predictive modelling.

    Implications for practice and/or policy

    High‐structure course designs can scaffold student engagement with course materials to make learning more effective and products of feature engineering more explainable.

    Learning analytics initiatives can avoid perpetuation of systemic biases when methods prioritize theory‐informed behavioural data that reflect learning processes, sensitivity to instructional context and development of explainable predictors of success rather than relying on students' demographic characteristics as predictors.

    Prioritizing behaviours as predictors improves explainability in ways that can inform the redesign of courses and design of learning supports, which further informs the refinement of learning theories and their applications.

  2. Abstract Background

    Direct-sequencing technologies, such as Oxford Nanopore’s, are delivering long RNA reads with great efficacy and convenience. These technologies afford an ability to detect post-transcriptional modifications at a single-molecule resolution, promising new insights into the functional roles of RNA. However, realizing this potential requires new tools to analyze and explore this type of data.


    Here, we present Sequoia, a visual analytics tool that allows users to interactively explore nanopore sequences. Sequoia combines a Python-based backend with a multi-view visualization interface, enabling users to import raw nanopore sequencing data in Fast5 format, cluster sequences based on electric-current similarities, and drill down into signals to identify properties of interest. We demonstrate the application of Sequoia by generating and analyzing ~500k reads from direct RNA sequencing data from a human HeLa cell line. We focus on comparing signal features from m6A and m5C RNA modifications as the first step towards building automated classifiers. We show how, through iterative visual exploration and tuning of dimensionality reduction parameters, we can separate modified RNA sequences from their unmodified counterparts. We also document new, qualitative signal signatures that distinguish these modifications from otherwise normal RNA bases, which we were able to discover from the visualization.


    Sequoia’s interactive features complement existing computational approaches in nanopore-based RNA workflows. The insights gleaned through visual analysis should help users in developing rationales, hypotheses, and insights into the dynamic nature of RNA. Sequoia is available at

  3. Abstract  
  4. Data visualization provides a powerful way for analysts to explore and make data-driven discoveries. However, current visual analytic tools provide only limited support for hypothesis-driven inquiry, as their built-in interactions and workflows are primarily intended for exploratory analysis. Visualization tools notably lack capabilities that would allow users to visually and incrementally test the fit of their conceptual models and provisional hypotheses against the data. This imbalance could bias users to overly rely on exploratory analysis as the principal mode of inquiry, which can be detrimental to discovery. In this paper, we introduce Visual (dis) Confirmation, a tool for conducting confirmatory, hypothesis-driven analyses with visualizations. Users interact by framing hypotheses and data expectations in natural language. The system then selects conceptually relevant data features and automatically generates visualizations to validate the underlying expectations. Distinctively, the resulting visualizations also highlight places where one's mental model disagrees with the data, so as to stimulate reflection. The proposed tool represents a new class of interactive data systems capable of supporting confirmatory visual analysis, and responding more intelligently by spotlighting gaps between one's knowledge and the data. We describe the algorithmic techniques behind this workflow. We also demonstrate the utility of the tool through a case study. 
  5. As technology advances, data-driven work is becoming increasingly important across all disciplines. Data science is an emerging field that encompasses a large array of topics including data collection, data preprocessing, data visualization, and data analysis using statistical and machine learning methods. As undergraduates enter the workforce in the future, they will need to “benefit from a fundamental awareness of and competence in data science”[9]. This project has formed a research practice partnership that brings together STEM+C instructors and researchers from three universities and an education research and consulting group. We aim to use high-frequency monitoring data collected from real-world systems to develop and implement an interdisciplinary approach to enable undergraduate students to develop an understanding of data science concepts through individual STEM disciplines that include engineering, computer science, environmental science, and biology. In this paper, we perform an initial exploratory analysis on how data science topics are introduced into the different courses, with the ultimate goal of understanding how instructional modules and accompanying assessments can be developed for multidisciplinary use. We analyze information collected from instructor interviews and surveys, student surveys, and assessments from five undergraduate courses (243 students) at the three universities to understand aspects of data science curricula that are common across disciplines. Using a qualitative approach, we find commonalities in data science instruction and assessment components across the disciplines. These include topical content, data sources, pedagogical approaches, and assessment design. Preliminary analyses of instructor interviews also suggest factors that affect the content taught and the assessment material across the five courses. These factors include class size, students’ year of study, students’ reasons for taking the class, and students’ background expertise and knowledge. These findings indicate the challenges in developing data modules for multidisciplinary use. We hope that the analysis and reflections on our initial offerings have improved our understanding of these challenges, and how we may address them when designing future data science teaching modules. These are the first steps in a design-based approach to developing data science modules that may be offered across multiple courses.