skip to main content


Title: Supervised vs Unsupervised Learning on Gaze Data to Classify Student Distraction Level in an Educational VR Environment
Educational VR may help students by being more engaging or improving retention compared to traditional learning methods. However, a student can get distracted in a VR environment due to stress, mind-wandering, unwanted noise, external alerts, etc. Student eye gaze can be useful for detecting these distraction. We explore deep-learning-based approaches to detect distractions from gaze data. We designed an educational VR environment and trained three deep learning models (CNN, LSTM, and CNN-LSTM) to gauge a student’s distraction level from gaze data, using both supervised and unsupervised learning methods. Our results show that supervised learning provided better test accuracy compared to unsupervised learning methods.  more » « less
Award ID(s):
1815976
NSF-PAR ID:
10338215
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
2021 ACM Symposium on Spatial User Interaction
Page Range / eLocation ID:
1 to 2
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Educational VR may increase engagement and retention compared to traditional learning, for some topics or students. However, a student could still get distracted and disengaged due to stress, mind-wandering, unwanted noise, external alerts, etc. Student eye gaze can be useful for detecting distraction. For example, we previously considered gaze visualizations to help teachers understand student attention to better identify or guide distracted students. However, it is not practical for a teacher to monitor a large numbers of student indicators while teaching. To help filter students based on distraction level, we consider a deep learning approach to detect distraction from gaze data. The key aspects are: (1) we created a labeled eye gaze dataset (3.4M data points) from an educational VR environment, (2) we propose an automatic system to gauge a student's distraction level from gaze data, and (3) we apply and compare three deep neural classifiers for this purpose. A proposed CNN-LSTM classifier achieved an accuracy of 89.8\% for classifying distraction, per educational activity section, into one of three levels. 
    more » « less
  2. Emerging Virtual Reality (VR) displays with embedded eye trackers are currently becoming a commodity hardware (e.g., HTC Vive Pro Eye). Eye-tracking data can be utilized for several purposes, including gaze monitoring, privacy protection, and user authentication/identification. Identifying users is an integral part of many applications due to security and privacy concerns. In this paper, we explore methods and eye-tracking features that can be used to identify users. Prior VR researchers explored machine learning on motion-based data (such as body motion, head tracking, eye tracking, and hand tracking data) to identify users. Such systems usually require an explicit VR task and many features to train the machine learning model for user identification. We propose a system to identify users utilizing minimal eye-gaze-based features without designing any identification-specific tasks. We collected gaze data from an educational VR application and tested our system with two machine learning (ML) models, random forest (RF) and k-nearest-neighbors (kNN), and two deep learning (DL) models: convolutional neural networks (CNN) and long short-term memory (LSTM). Our results show that ML and DL models could identify users with over 98% accuracy with only six simple eye-gaze features. We discuss our results, their implications on security and privacy, and the limitations of our work. 
    more » « less
  3. Vedaldi, A. ; Bischof, H. ; Brox, T. ; Frahm, JM. (Ed.)
    The problem of action localization involves locating the action in the video, both over time and spatially in the image. The current dominant approaches use supervised learning to solve this problem. They require large amounts of annotated training data, in the form of frame-level bounding box annotations around the region of interest. In this paper, we present a new approach based on continual learning that uses feature-level predictions for self-supervision. It does not require any training annotations in terms of frame-level bounding boxes. The approach is inspired by cognitive models of visual event perception that propose a prediction-based approach to event understanding. We use a stack of LSTMs coupled with a CNN encoder, along with novel attention mechanisms, to model the events in the video and use this model to predict high-level features for the future frames. The prediction errors are used to learn the parameters of the models continuously. This self-supervised framework is not complicated as other approaches but is very effective in learning robust visual representations for both labeling and localization. It should be noted that the approach outputs in a streaming fashion, requiring only a single pass through the video, making it amenable for real-time processing. We demonstrate this on three datasets - UCF Sports, JHMDB, and THUMOS’13 and show that the proposed approach outperforms weakly-supervised and unsupervised baselines and obtains competitive performance compared to fully supervised baselines. Finally, we show that the proposed framework can generalize to egocentric videos and achieve state-of-the-art results on the unsupervised gaze prediction task. 
    more » « less
  4. null (Ed.)
    Egocentric perception has grown rapidly with the advent of immersive computing devices. Human gaze prediction is an important problem in analyzing egocentric videos and has primarily been tackled through either saliency-based modeling or highly supervised learning. We quantitatively analyze the generalization capabilities of supervised, deep learning models on the egocentric gaze prediction task on unseen, out-of-domain data. We find that their performance is highly dependent on the training data and is restricted to the domains specified in the training annotations. In this work, we tackle the problem of jointly predicting human gaze points and temporal segmentation of egocentric videos without using any training data. We introduce an unsupervised computational model that draws inspiration from cognitive psychology models of event perception. We use Grenander's pattern theory formalism to represent spatial-temporal features and model surprise as a mechanism to predict gaze fixation points. Extensive evaluation on two publicly available datasets - GTEA and GTEA+ datasets-shows that the proposed model can significantly outperform all unsupervised baselines and some supervised gaze prediction baselines. Finally, we show that the model can also temporally segment egocentric videos with a performance comparable to more complex, fully supervised deep learning baselines. 
    more » « less
  5. The commonsense natural language inference (CNLI) tasks aim to select the most likely follow-up statement to a contextual description of ordinary, everyday events and facts. Current approaches to transfer learning of CNLI models across tasks require many labeled data from the new task. This paper presents a way to reduce this need for additional annotated training data from the new task by leveraging symbolic knowledge bases, such as ConceptNet. We formulate a teacher-student framework for mixed symbolic-neural reasoning, with the large-scale symbolic knowledge base serving as the teacher and a trained CNLI model as the student. This hybrid distillation process involves two steps. The first step is a symbolic reasoning process. Given a collection of unlabeled data, we use an abductive reasoning framework based on Grenander's pattern theory to create weakly labeled data. Pattern theory is an energy-based graphical probabilistic framework for reasoning among random variables with varying dependency structures. In the second step, the weakly labeled data, along with a fraction of the labeled data, is used to transfer-learn the CNLI model into the new task. The goal is to reduce the fraction of labeled data required. We demonstrate the efficacy of our approach by using three publicly available datasets (OpenBookQA, SWAG, and HellaSWAG) and evaluating three CNLI models (BERT, LSTM, and ESIM) that represent different tasks. We show that, on average, we achieve 63% of the top performance of a fully supervised BERT model with no labeled data. With only 1000 labeled samples, we can improve this performance to 72%. Interestingly, without training, the teacher mechanism itself has significant inference power. The pattern theory framework achieves 32.7% accuracy on OpenBookQA, outperforming transformer-based models such as GPT (26.6%), GPT-2 (30.2%), and BERT (27.1%) by a significant margin. We demonstrate that the framework can be generalized to successfully train neural CNLI models using knowledge distillation under unsupervised and semi-supervised learning settings. Our results show that it outperforms all unsupervised and weakly supervised baselines and some early supervised approaches, while offering competitive performance with fully supervised baselines. Additionally, we show that the abductive learning framework can be adapted for other downstream tasks, such as unsupervised semantic textual similarity, unsupervised sentiment classification, and zero-shot text classification, without significant modification to the framework. Finally, user studies show that the generated interpretations enhance its explainability by providing key insights into its reasoning mechanism. 
    more » « less