Title: A Clustering-based framework for Classifying Data Streams

The non-stationary nature of data streams strongly challenges traditional machine learning techniques. Although some solutions have been proposed to extend traditional machine learning techniques to data streams, these approaches either require an initial label set or rely on specialized design parameters. The overlap among classes and the labeling of data streams are further major challenges for classifying data streams. In this paper, we propose a clustering-based data stream classification framework that handles non-stationary data streams without an initial label set. A density-based stream clustering procedure with a dynamic threshold captures novel concepts, and an effective active label querying strategy continuously learns the new concepts from the data streams. The sub-cluster structure of each cluster is explored to handle the overlap among classes. Experimental results and quantitative comparison studies show that the proposed method performs statistically better than, or comparably to, existing methods.
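As a rough illustration of the general idea in the abstract (cluster the stream, and query a label only when a point looks like a new concept), the following Python sketch may help. It is not the authors' algorithm: the class name `StreamClusterClassifier`, the `oracle` callable, the `init_radius` and `scale` parameters, and the simple distance-based threshold rule are all assumptions made for illustration, and the density-based clustering and sub-cluster handling described above are replaced here by plain nearest-centroid assignment.

```python
import math


class StreamClusterClassifier:
    """Toy clustering-based stream classifier (illustrative assumptions only)."""

    def __init__(self, oracle, init_radius=1.0, scale=2.0):
        self.oracle = oracle            # hypothetical callable: point -> true label
        self.init_radius = init_radius  # assumed lower bound on the threshold
        self.scale = scale              # assumed multiplier on the mean distance
        self.clusters = []              # each cluster: {"center", "n", "label"}
        self.dist_sum = 0.0             # running sum of accepted distances
        self.dist_n = 0                 # number of accepted distances
        self.queries = 0                # number of labels requested so far

    def threshold(self):
        # Dynamic threshold: a multiple of the mean accepted distance,
        # never smaller than init_radius.
        if self.dist_n == 0:
            return self.init_radius
        return max(self.init_radius, self.scale * self.dist_sum / self.dist_n)

    def process(self, x):
        """Assign x to a cluster, querying a label only for novel points."""
        nearest, best = None, math.inf
        for cluster in self.clusters:
            d = math.dist(cluster["center"], x)
            if d < best:
                nearest, best = cluster, d
        if nearest is None or best > self.threshold():
            # Novel concept: start a new cluster and ask the oracle for its label.
            label = self.oracle(x)
            self.queries += 1
            self.clusters.append({"center": tuple(x), "n": 1, "label": label})
            return label
        # Known concept: move the centroid toward x and reuse the cluster label.
        nearest["n"] += 1
        n = nearest["n"]
        nearest["center"] = tuple(
            c + (xi - c) / n for c, xi in zip(nearest["center"], x)
        )
        self.dist_sum += best
        self.dist_n += 1
        return nearest["label"]


# Usage: two well-separated groups; the oracle stands in for a human labeler.
stream = [(0.0, 0.0), (0.1, 0.2), (10.0, 10.0), (10.2, 9.9), (0.2, 0.1), (9.8, 10.1)]
clf = StreamClusterClassifier(oracle=lambda p: "a" if p[0] < 5 else "b")
preds = [clf.process(p) for p in stream]
print(preds, clf.queries)
```

In this toy stream only the first point of each group triggers a label query; every later point inherits the label of its nearest cluster. A real density-based method would also need to forget stale clusters and handle overlapping classes, which this sketch does not attempt.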

 
Award ID(s):
2000320
PAR ID:
10328297
Journal Name:
International Joint Conferences on Artificial Intelligence Organization
Page Range / eLocation ID:
3257 to 3263
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Aidong Zhang; Huzefa Rangwala (Eds.)
    In many scenarios, 1) data streams are generated in real time; 2) labeled data are expensive and only limited labels are available at the beginning; 3) real-world data are not always i.i.d. and drift gradually over time; 4) the storage of historical streams is limited. This learning setting limits the applicability and availability of many Machine Learning (ML) algorithms. We generalize the learning task under this setting as a semi-supervised drifted stream learning with short lookback problem (SDSL). SDSL imposes two under-addressed challenges on existing methods in semi-supervised learning and continuous learning: 1) robust pseudo-labeling under gradual shifts and 2) anti-forgetting adaptation with short lookback. To tackle these challenges, we propose a principled and generic generation-replay framework to solve SDSL. To achieve robust pseudo-labeling, we develop a novel pseudo-label classification model that leverages supervised knowledge of previously labeled data, unsupervised knowledge of new data, and structural knowledge of invariant label semantics. To achieve adaptive anti-forgetting model replay, we propose to view the anti-forgetting adaptation task as a flat region search problem. We propose a novel minimax game-based replay objective function to solve the flat region search problem and develop an effective optimization solver. Experimental results demonstrate the effectiveness of the proposed method.
  2. Stream clustering is an important data mining technique for capturing the evolving patterns in real-time data streams. Today’s data streams, e.g., IoT events and Web clicks, are usually high-speed and contain dynamically changing patterns. Existing stream clustering algorithms usually follow an online-offline paradigm with a one-record-at-a-time update model, which was designed to run on a single machine. With this sequential update model, these algorithms cannot be efficiently parallelized and fail to deliver the high throughput that stream clustering requires. In this paper, we present DistStream, a distributed framework that can effectively scale out online-offline stream clustering algorithms. To parallelize these algorithms for high throughput, we develop a mini-batch update model with efficient parallelization approaches. To maintain high clustering quality, DistStream’s mini-batch update model preserves the update order in all the computation steps during parallel execution, so that it reflects recent changes in dynamically changing streaming data. We implement DistStream atop Spark Streaming, as well as four representative stream clustering algorithms based on DistStream. Our evaluation on three real-world datasets shows that DistStream-based stream clustering algorithms can achieve sublinear throughput gain and clustering quality comparable (99%) to that of their single-machine counterparts.
  3. Abstract
    Objective. This paper presents data-driven solutions to address two challenges in the problem of linking neural data and behavior: (1) unsupervised analysis of behavioral data and automatic label generation from behavioral observations, and (2) extraction of subject-invariant features for the development of generalized neural decoding models.
    Approach. For behavioral analysis and label generation, an unsupervised method is presented that employs an autoencoder to transform behavioral data into a cluster-friendly feature space. The model iteratively refines the assigned clusters with a soft clustering assignment loss and gradually improves the learned feature representations. To address subject variability in decoding neural activity, adversarial learning is combined with a long short-term memory-based adversarial variational autoencoder (LSTM-AVAE) model. By using an adversary network to constrain the latent representations, the model captures shared information among subjects’ neural activity, making it suitable for cross-subject transfer learning.
    Main results. The proposed approach is evaluated using cortical recordings of Thy1-GCaMP6s transgenic mice obtained via widefield calcium imaging during a motivational licking behavioral experiment. The results show that the proposed model achieves an accuracy of 89.7% in cross-subject neural decoding, outperforming other well-known autoencoder-based feature learning models. These findings suggest that incorporating an adversary network eliminates subject dependency in representations, leading to improved cross-subject transfer learning performance, while also demonstrating the effectiveness of LSTM-based models in capturing the temporal dependencies within neural data.
    Significance. Results demonstrate the feasibility of the proposed framework for unsupervised clustering and label generation of behavioral data, as well as for achieving high accuracy in cross-subject neural decoding, indicating its potential for relating neural activity to behavior.
  4. Abstract B-mode ultrasound (US) is often used to noninvasively measure skeletal muscle architecture, which contains human intent information. Features extracted from B-mode images can help improve closed-loop human–robotic interaction control when using rehabilitation/assistive devices. The traditional manual approach to inferring muscle structural features from US images is laborious, time-consuming, and subjective, varying among investigators. This paper proposes a clustering-based detection method that can mimic a well-trained human expert in identifying the fascicle and aponeurosis and, from these, compute the pennation angle. The clustering-based architecture assumes that muscle fibers have tubular characteristics. It is robust for low-frequency image streams. We compared the proposed algorithm to two mature benchmark techniques: UltraTrack and ImageJ. On our dataset (frame rate of 20 Hz), the proposed approach achieved higher accuracy, similar to that of the human expert. The proposed method shows promising potential for automatic muscle fascicle orientation detection to facilitate implementations in biomechanics modeling, rehabilitation robot control design, and neuromuscular disease diagnosis with low-frequency data streams.
  5. Data collected from real-world environments often contain multiple objects, scenes, and activities. In comparison to single-label problems, where each data sample defines only one concept, multi-label problems allow the co-existence of multiple concepts. To exploit the rich semantic information in real-world data, multi-label classification has seen many applications in a variety of domains. The traditional approaches to multi-label problems tend to have the side effects of increased memory usage, slow model inference speed, and, most importantly, under-utilization of the dependency across concepts. In this paper, we adopt multi-task learning to address these challenges. Multi-task learning treats the learning of each concept as a separate job while leveraging the shared representations among all tasks. We also propose a dynamic task balancing method that automatically adjusts the task weight distribution by taking both sample-level and task-level learning complexities into consideration. Our framework is evaluated on a disaster video dataset, and its performance is compared with several state-of-the-art multi-label and multi-task learning techniques. The results demonstrate the effectiveness and superiority of our approach.