skip to main content

Title: Supervised vs Unsupervised Learning on Gaze Data to Classify Student Distraction Level in an Educational VR Environment
Educational VR may help students by being more engaging or improving retention compared to traditional learning methods. However, a student can get distracted in a VR environment due to stress, mind-wandering, unwanted noise, external alerts, etc. Student eye gaze can be useful for detecting these distraction. We explore deep-learning-based approaches to detect distractions from gaze data. We designed an educational VR environment and trained three deep learning models (CNN, LSTM, and CNN-LSTM) to gauge a student’s distraction level from gaze data, using both supervised and unsupervised learning methods. Our results show that supervised learning provided better test accuracy compared to unsupervised learning methods.
; ;
Award ID(s):
Publication Date:
Journal Name:
2021 ACM Symposium on Spatial User Interaction
Page Range or eLocation-ID:
1 to 2
Sponsoring Org:
National Science Foundation
More Like this
  1. Educational VR may increase engagement and retention compared to traditional learning, for some topics or students. However, a student could still get distracted and disengaged due to stress, mind-wandering, unwanted noise, external alerts, etc. Student eye gaze can be useful for detecting distraction. For example, we previously considered gaze visualizations to help teachers understand student attention to better identify or guide distracted students. However, it is not practical for a teacher to monitor a large numbers of student indicators while teaching. To help filter students based on distraction level, we consider a deep learning approach to detect distraction from gaze data. The key aspects are: (1) we created a labeled eye gaze dataset (3.4M data points) from an educational VR environment, (2) we propose an automatic system to gauge a student's distraction level from gaze data, and (3) we apply and compare three deep neural classifiers for this purpose. A proposed CNN-LSTM classifier achieved an accuracy of 89.8\% for classifying distraction, per educational activity section, into one of three levels.
  2. Vedaldi, A. ; Bischof, H. ; Brox, T. ; Frahm, JM. (Ed.)
    The problem of action localization involves locating the action in the video, both over time and spatially in the image. The current dominant approaches use supervised learning to solve this problem. They require large amounts of annotated training data, in the form of frame-level bounding box annotations around the region of interest. In this paper, we present a new approach based on continual learning that uses feature-level predictions for self-supervision. It does not require any training annotations in terms of frame-level bounding boxes. The approach is inspired by cognitive models of visual event perception that propose a prediction-based approach to event understanding. We use a stack of LSTMs coupled with a CNN encoder, along with novel attention mechanisms, to model the events in the video and use this model to predict high-level features for the future frames. The prediction errors are used to learn the parameters of the models continuously. This self-supervised framework is not complicated as other approaches but is very effective in learning robust visual representations for both labeling and localization. It should be noted that the approach outputs in a streaming fashion, requiring only a single pass through the video, making it amenable for real-time processing. Wemore »demonstrate this on three datasets - UCF Sports, JHMDB, and THUMOS’13 and show that the proposed approach outperforms weakly-supervised and unsupervised baselines and obtains competitive performance compared to fully supervised baselines. Finally, we show that the proposed framework can generalize to egocentric videos and achieve state-of-the-art results on the unsupervised gaze prediction task.« less
  3. Egocentric perception has grown rapidly with the advent of immersive computing devices. Human gaze prediction is an important problem in analyzing egocentric videos and has primarily been tackled through either saliency-based modeling or highly supervised learning. We quantitatively analyze the generalization capabilities of supervised, deep learning models on the egocentric gaze prediction task on unseen, out-of-domain data. We find that their performance is highly dependent on the training data and is restricted to the domains specified in the training annotations. In this work, we tackle the problem of jointly predicting human gaze points and temporal segmentation of egocentric videos without using any training data. We introduce an unsupervised computational model that draws inspiration from cognitive psychology models of event perception. We use Grenander's pattern theory formalism to represent spatial-temporal features and model surprise as a mechanism to predict gaze fixation points. Extensive evaluation on two publicly available datasets - GTEA and GTEA+ datasets-shows that the proposed model can significantly outperform all unsupervised baselines and some supervised gaze prediction baselines. Finally, we show that the model can also temporally segment egocentric videos with a performance comparable to more complex, fully supervised deep learning baselines.
  4. Abstract: Modeling student learning processes is highly complex since it is influenced by many factors such as motivation and learning habits. The high volume of features and tools provided by computer-based learning environments confounds the task of tracking student knowledge even further. Deep Learning models such as Long-Short Term Memory (LSTMs) and classic Markovian models such as Bayesian Knowledge Tracing (BKT) have been successfully applied for student modeling. However, much of this prior work is designed to handle sequences of events with discrete timesteps, rather than considering the continuous aspect of time. Given that time elapsed between successive elements in a student’s trajectory can vary from seconds to days, we applied a Timeaware LSTM (T-LSTM) to model the dynamics of student knowledge state in continuous time. We investigate the effectiveness of T-LSTM on two domains with very different characteristics. One involves an open-ended programming environment where students can self-pace their progress and T-LSTM is compared against LSTM, Recent Temporal Pattern Mining, and the classic Logistic Regression (LR) on the early prediction of student success; the other involves a classic tutor-driven intelligent tutoring system where the tutor scaffolds the student learning step by step and T-LSTM is compared with LSTM, LR,more »and BKT on the early prediction of student learning gains. Our results show that TLSTM significantly outperforms the other methods on the self-paced, open-ended programming environment; while on the tutor-driven ITS, it ties with LSTM and outperforms both LR and BKT. In other words, while time-irregularity exists in both datasets, T-LSTM works significantly better than other student models when the pace is driven by students. On the other hand, when such irregularity results from the tutor, T-LSTM was not superior to other models but its performance was not hurt either.« less
  5. The development and application of deep learning method- ologies has grown within educational contexts in recent years. Perhaps attributable, in part, to the large amount of data that is made avail- able through the adoption of computer-based learning systems in class- rooms and larger-scale MOOC platforms, many educational researchers are leveraging a wide range of emerging deep learning approaches to study learning and student behavior in various capacities. Variations of recurrent neural networks, for example, have been used to not only pre- dict learning outcomes but also to study sequential and temporal trends in student data; it is commonly believed that they are able to learn high- dimensional representations of learning and behavioral constructs over time, such as the evolution of a students’ knowledge state while working through assigned content. Recent works, however, have started to dis- pute this belief, instead finding that it may be the model’s complexity that leads to improved performance in many prediction tasks and that these methods may not inherently learn these temporal representations through model training. In this work, we explore these claims further in the context of detectors of student affect as well as expanding on exist- ing work that explored benchmarks inmore »knowledge tracing. Specifically, we observe how well trained models perform compared to deep learning networks where training is applied only to the output layer. While the highest results of prior works utilizing trained recurrent models are found to be superior, the application of our untrained-versions perform compa- rably well, outperforming even previous non-deep learning approaches. Keywords: Deep Learning · LSTM · Echo State Network · Affect · Knowledge Tracing.« less