Title: Efficient discovery of sequence outlier patterns
Modern Internet of Things (IoT) applications generate massive amounts of time-stamped data, much of it in the form of discrete, symbolic sequences. In this work, we present a new system called TOP that deTects Outlier Patterns from these sequences. To overcome the fundamental limitation of existing pattern mining semantics, which miss outlier patterns hidden inside larger frequent patterns, TOP offers new pattern semantics based on contextual patterns that distinguish the independent occurrence of a pattern from its occurrence as part of its super-pattern. We present efficient algorithms for the mining of this new class of contextual patterns. In particular, in contrast to the bottom-up strategy of state-of-the-art pattern mining techniques, our top-down Reduce strategy piggybacks pattern detection on the detection of the context in which a pattern occurs. Our approach achieves linear time complexity in the length of the input sequence. Effective optimization techniques such as context-driven search space pruning and inverted index-based outlier pattern detection are also proposed to further speed up contextual pattern mining. Our experimental evaluation demonstrates the effectiveness of TOP at capturing meaningful outlier patterns in several real-world IoT use cases. We also demonstrate the efficiency of TOP, showing it to be up to 2 orders of magnitude faster than adapting state-of-the-art mining to produce this new class of contextual outlier patterns, allowing us to scale outlier pattern mining to large sequence datasets.
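The distinction the abstract draws between an independent occurrence of a pattern and an occurrence that is part of a super-pattern can be illustrated with a small sketch. This is not the TOP algorithm or its Reduce strategy. It assumes contiguous matching on a single sequence, and the function names `find_occurrences` and `independent_occurrences` are hypothetical, chosen only for this illustration.

```python
def find_occurrences(sequence, pattern):
    """Return the start indices where `pattern` occurs contiguously in `sequence`."""
    n, m = len(sequence), len(pattern)
    return [i for i in range(n - m + 1) if sequence[i:i + m] == pattern]


def independent_occurrences(sequence, pattern, super_pattern):
    """Return start indices of `pattern` that are NOT fully contained in
    any single occurrence of `super_pattern`.

    Assumption for this sketch: an occurrence counts as "part of the
    super-pattern" when its whole span lies inside one super-pattern span.
    """
    # Half-open [start, end) spans of every super-pattern occurrence.
    spans = [
        (start, start + len(super_pattern))
        for start in find_occurrences(sequence, super_pattern)
    ]
    m = len(pattern)
    return [
        i
        for i in find_occurrences(sequence, pattern)
        if not any(lo <= i and i + m <= hi for lo, hi in spans)
    ]


if __name__ == "__main__":
    events = "ABCDxxABxABCD"
    print(find_occurrences(events, "AB"))                 # all occurrences of AB
    print(independent_occurrences(events, "AB", "ABCD"))  # AB outside ABCD
```

In this toy sequence, counting every occurrence of `AB` mixes the two kinds of occurrence, while the second call isolates the one that appears on its own. Under the abstract's semantics, that is the kind of occurrence conventional frequency counts can hide.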
Award ID(s):
1910880
NSF-PAR ID:
10251262
Journal Name:
Proceedings of the VLDB Endowment
Volume:
12
Issue:
8
ISSN:
2150-8097
Page Range / eLocation ID:
920 to 932
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Summary

    Computational methods to predict protein–protein interaction (PPI) typically segregate into sequence-based ‘bottom-up’ methods that infer properties from the characteristics of the individual protein sequences, or global ‘top-down’ methods that infer properties from the pattern of already known PPIs in the species of interest. However, a way to incorporate top-down insights into sequence-based bottom-up PPI prediction methods has been elusive. We thus introduce Topsy-Turvy, a method that newly synthesizes both views in a sequence-based, multi-scale, deep-learning model for PPI prediction. While Topsy-Turvy makes predictions using only sequence data, during the training phase it takes a transfer-learning approach by incorporating patterns from both global and molecular-level views of protein interaction. In a cross-species context, we show it achieves state-of-the-art performance, offering the ability to perform genome-scale, interpretable PPI prediction for non-model organisms with no existing experimental PPI data. In species with available experimental PPI data, we further present a Topsy-Turvy hybrid (TT-Hybrid) model which integrates Topsy-Turvy with a purely network-based model for link prediction that provides information about species-specific network rewiring. TT-Hybrid makes accurate predictions for both well- and sparsely-characterized proteins, outperforming both its constituent components as well as other state-of-the-art PPI prediction methods. Furthermore, running Topsy-Turvy and TT-Hybrid screens is feasible for whole genomes, and thus these methods scale to settings where other methods (e.g. AlphaFold-Multimer) might be infeasible. The generalizability, accuracy and genome-level scalability of Topsy-Turvy and TT-Hybrid unlock a more comprehensive map of protein interaction and organization in both model and non-model organisms.

    Availability and implementation

    https://topsyturvy.csail.mit.edu.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  2. Local outlier techniques are known to be effective for detecting outliers in skewed data, where subsets of the data exhibit diverse distribution properties. However, existing methods are not well equipped to support modern high-velocity data streams due to the high complexity of the detection algorithms and their sensitivity to data updates. To tackle these shortcomings, we propose local outlier semantics that operate at an abstraction level by leveraging kernel density estimation (KDE) to effectively detect local outliers from streaming data. We design KELOS, a strategy to continuously detect the top-N KDE-based local outliers over streams and the first streaming local outlier detection approach with linear time complexity. The first innovation of KELOS is the abstract kernel center-based KDE (aKDE) strategy. aKDE accurately yet efficiently estimates the data density at each point, which is essential for local outlier detection. This is based on the observation that points in a cluster that are close to each other tend to have a similar influence on a target point’s density estimation when used as kernel centers. These points can thus be represented by one abstract kernel center. Next, KELOS’s inlier pruning strategy prunes, at an early stage, points that have no chance of becoming top-N outliers. This lets KELOS avoid computing the data density and the outlier status of every data point. Together, aKDE and the inlier pruning strategy eliminate the performance bottleneck of streaming local outlier detection. The experimental evaluation demonstrates that KELOS is up to 6 orders of magnitude faster than existing solutions, while being highly effective in detecting local outliers from streaming data.
  3. Similarity search is the basis for many data analytics techniques, including k-nearest neighbor classification and outlier detection. Similarity search over large data sets relies on i) a distance metric learned from input examples and ii) an index to speed up search based on the learned distance metric. In interactive systems, input to guide the learning of the distance metric may be provided over time. As this new input changes the learned distance metric, a naive approach would adopt the costly process of re-indexing all items after each metric change. In this paper, we propose the first solution, called OASIS, to instantaneously adapt the index to conform to a changing distance metric without this prohibitive re-indexing process. To achieve this, we prove that locality-sensitive hashing (LSH) provides an invariance property, meaning that an LSH index built on the original distance metric is equally effective at supporting similarity search using an updated distance metric as long as the transform matrix learned for the new distance metric satisfies certain properties. This observation allows OASIS to avoid recomputing the index from scratch in most cases. Further, for the rare cases when an adaptation of the LSH index is shown to be necessary, we design an efficient incremental LSH update strategy that re-hashes only a small subset of the items in the index. In addition, we develop an efficient distance metric learning strategy that incrementally learns the new metric as inputs are received. Our experimental study using real-world public datasets confirms the effectiveness of OASIS at improving the accuracy of various similarity search-based data analytics tasks by instantaneously adapting the distance metric and its associated index in tandem, while achieving a speedup of up to 3 orders of magnitude over state-of-the-art techniques.
  4. Keywords: Green wireless networks; Wake-up radio; Energy harvesting; Routing; Markov decision process; Reinforcement learning

    1. Introduction

    With 14.2 billion connected things in 2019, over 41.6 billion expected by 2025, and a total spending on endpoints and services that will reach well over $1.1 trillion by the end of 2026, the Internet of Things (IoT) is poised to have a transformative impact on the way we live and on the way we work [1–3]. The vision of this ‘‘connected continuum’’ of objects and people, however, comes with a wide variety of challenges, especially for those IoT networks whose devices rely on some form of depletable energy support. This has prompted research on hardware and software solutions aimed at decreasing the dependence of devices on ‘‘pre-packaged’’ energy provision (e.g., batteries), leading to devices capable of harvesting energy from the environment, and to networks – often called green wireless networks – whose lifetime is virtually infinite. Despite the promising advances of energy harvesting technologies, IoT devices are still doomed to run out of energy due to their inherent constraints on resources such as storage, processing and communication, whose energy requirements often exceed what harvesting can provide. The communication circuitry of prevailing radio technology, especially, consumes a significant amount of energy even when in idle state, i.e., even when no transmissions or receptions occur. Even duty cycling, namely, operating with the radio in low energy consumption (sleep) mode for pre-set amounts of time, has been shown to only mildly alleviate the problem of making IoT devices durable [4].

    An effective answer to eliminate all possible forms of energy consumption that are not directly related to communication (e.g., idle listening) is provided by ultra low power radio triggering techniques, also known as wake-up radios [5,6]. Wake-up radio-based networks allow devices to remain in sleep mode by turning off their main radio when no communication is taking place. Devices continuously listen for a trigger on their wake-up radio, namely, for a wake-up sequence, to activate their main radio and participate in communication tasks. Therefore, devices wake up and turn their main radio on only when data communication is requested by a neighboring device. Further energy savings can be obtained by restricting the number of neighboring devices that wake up when triggered. This is obtained by allowing devices to wake up only when they receive specific wake-up sequences, which correspond to particular protocol requirements, including distance from the destination, current energy status, residual energy, etc. This form of selective awakening is called semantic addressing [7]. Use of low-power wake-up radio with semantic addressing has been shown to remarkably reduce the dominating energy costs of communication and idle listening of traditional radio networking [7–12]. This paper contributes to the research on enabling green wireless networks for long-lasting IoT applications. Specifically, we introduce a

    ∗ Corresponding author. E-mail address: koutsandria@di.uniroma1.it (G. Koutsandria). https://doi.org/10.1016/j.comcom.2020.05.046

    Abstract

    This paper presents G-WHARP, for Green Wake-up and HARvesting-based energy-Predictive forwarding, a wake-up radio-based forwarding strategy for wireless networks equipped with energy harvesting capabilities (green wireless networks). Following a learning-based approach, G-WHARP blends energy harvesting and wake-up radio technology to maximize energy efficiency and obtain superior network performance. Nodes autonomously decide on their forwarding availability based on a Markov Decision Process (MDP) that takes into account a variety of energy-related aspects, including the currently available energy and that harvestable in the foreseeable future. Solution of the MDP is provided by a computationally light heuristic based on a simple threshold policy, thus obtaining further computational energy savings. The performance of G-WHARP is evaluated via GreenCastalia simulations, where we accurately model wake-up radios, harvestable energy, and the computational power needed to solve the MDP. Key network and system parameters are varied, including the source of harvestable energy, the network density, wake-up radio data rate and data traffic. We also compare the performance of G-WHARP to that of two state-of-the-art data forwarding strategies, namely GreenRoutes and CTP-WUR. Results show that G-WHARP limits energy expenditures while achieving low end-to-end latency and high packet delivery ratio. In particular, it consumes up to 34% and 59% less energy than CTP-WUR and GreenRoutes, respectively.
  5. Metacognitive monitoring and scientific inquiry, through which self-regulated learning is conducted, can be influenced by many factors, such as emotions and motivation, and are necessary skills for engaging in efficient hypothesis testing during game-based learning. Although many studies have investigated metacognitive monitoring and scientific inquiry skills during game-based learning, few studies have investigated how the sequence of behaviors involved in hypothesis testing with game-based learning differs based on both efficiency level and emotions during gameplay. For this study, we analyzed 59 undergraduate students’ (59% female) metacognitive monitoring and hypothesis testing behavior during learning and gameplay with CRYSTAL ISLAND, a game-based learning environment that teaches students about microbiology. Specifically, we used sequential pattern mining and differential sequence mining to determine if there were sequences of hypothesis testing behaviors and to determine if the frequencies of occurrence of these sequences differed between high and low levels of efficiency at finishing the game and between high and low levels of facial expressions of emotions during gameplay. Results revealed that students with low levels of efficiency and high levels of facial expressions of emotions had the most sequences of testing behaviors overall, specifically engaging in more sequences that were indicative of less strategic hypothesis testing behavior than the other students, whereas students who were more efficient, at both levels of emotions, demonstrated strategic testing behavior. These results have implications for the strengths of using educational data mining techniques for determining the processes underlying patterns of engaging in self-regulated learning conducted through hypothesis testing as they unfold over time; for training students on how to engage in the self-regulation, scientific inquiry, and emotion regulation processes that can result in efficient gameplay; and for developing adaptive game-based learning environments that foster effective and efficient self-regulation and scientific inquiry during learning.
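The observation behind the abstract kernel centers in item 2 (nearby points influence a density estimate in a similar way, so one weighted center can stand in for them) can be sketched in a few lines. This is not the KELOS implementation. It assumes one-dimensional data, a Gaussian kernel, and fixed-width grid cells as a stand-in for clusters, and every function name here is hypothetical.

```python
import math
from collections import defaultdict


def gaussian_kernel(dist, bandwidth):
    """Gaussian kernel value for a given distance and bandwidth."""
    norm = bandwidth * math.sqrt(2 * math.pi)
    return math.exp(-0.5 * (dist / bandwidth) ** 2) / norm


def exact_density(x, points, bandwidth):
    """Standard KDE: every data point acts as a kernel center."""
    return sum(gaussian_kernel(abs(x - p), bandwidth) for p in points) / len(points)


def abstract_centers(points, cell_width):
    """Group points into grid cells (an assumed stand-in for clustering)
    and return one (centroid, count) pair per non-empty cell."""
    cells = defaultdict(list)
    for p in points:
        cells[math.floor(p / cell_width)].append(p)
    return [(sum(members) / len(members), len(members)) for members in cells.values()]


def approximate_density(x, centers, bandwidth):
    """KDE over weighted centers: each centroid contributes once,
    scaled by the number of points it represents."""
    total = sum(weight for _, weight in centers)
    return sum(w * gaussian_kernel(abs(x - c), bandwidth) for c, w in centers) / total


if __name__ == "__main__":
    data = [1.0, 1.1, 0.9, 1.05, 5.0, 5.1, 4.9, 20.0]
    centers = abstract_centers(data, cell_width=1.0)
    for query in (1.0, 20.0):
        print(query, exact_density(query, data, 1.0), approximate_density(query, centers, 1.0))
```

The approximate estimate evaluates one kernel per center instead of one per point, which is where the savings come from in this sketch. The isolated point at 20.0 receives a much lower density than the points near 1.0 under both estimates.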