Local outlier techniques are known to be effective for detecting outliers in skewed data, where subsets of the data exhibit diverse distribution properties. However, existing methods are not well equipped to support modern high-velocity data streams due to the high complexity of the detection algorithms and their volatility to data updates. To tackle these shortcomings, we propose local outlier semantics that operate at an abstraction level by leveraging kernel density estimation (KDE) to effectively detect local outliers from streaming data. A strategy to continuously detect top-N KDE-based local outliers over streams is designed, called KELOS – the first linear time complexity streaming local outlier detection approach. The first innovation of KELOS is the abstract kernel center-based KDE (aKDE) strategy. aKDE accurately yet efficiently estimates the data density at each point – essential for local outlier detection. This is based on the observation that a cluster of points close to each other tend to have a similar influence on a target point’s density estimation when used as kernel centers. These points thus can be represented by one abstract kernel center. Next, the KELOS’s inlier pruning strategy early prunes points that have no chance to become top-N outliers. This empowers KELOS to skip the computation of their data density and of the outlier status for every data point. Together aKDE and the inlier pruning strategy eliminate the performance bottleneck of streaming local outlier detection. The experimental evaluation demonstrates that KELOS is up to 6 orders of magnitude faster than existing solutions, while being highly effective in detecting local outliers from streaming data. 
                        more » 
                        « less   
                    
                            
                            Efficient discovery of sequence outlier patterns
                        
                    
    
            Modern Internet of Things ( IoT ) applications generate massive amounts of time-stamped data, much of it in the form of discrete, symbolic sequences. In this work, we present a new system called TOP that deTects Outlier Patterns from these sequences. To solve the fundamental limitation of existing pattern mining semantics that miss outlier patterns hidden inside of larger frequent patterns, TOP offers new pattern semantics based on contextual patterns that distinguish the independent occurrence of a pattern from its occurrence as part of its super-pattern. We present efficient algorithms for the mining of this new class of contextual patterns. In particular, in contrast to the bottom-up strategy for state-of-the-art pattern mining techniques, our top-down Reduce strategy piggy backs pattern detection with the detection of the context in which a pattern occurs. Our approach achieves linear time complexity in the length of the input sequence. Effective optimization techniques such as context-driven search space pruning and inverted index-based outlier pattern detection are also proposed to further speed up contextual pattern mining. Our experimental evaluation demonstrates the effectiveness of TOP at capturing meaningful outlier patterns in several real-world IoT use cases. We also demonstrate the efficiency of TOP, showing it to be up to 2 orders of magnitude faster than adapting state-of-the-art mining to produce this new class of contextual outlier patterns, allowing us to scale outlier pattern mining to large sequence datasets. 
        more » 
        « less   
        
    
                            - Award ID(s):
- 1910880
- PAR ID:
- 10251262
- Date Published:
- Journal Name:
- Proceedings of the VLDB Endowment
- Volume:
- 12
- Issue:
- 8
- ISSN:
- 2150-8097
- Page Range / eLocation ID:
- 920 to 932
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            Self-regulated learning conducted through metacognitive monitoring and scientific inquiry can be influenced by many factors, such as emotions and motivation, and are necessary skills needed to engage in efficient hypothesis testing during game-based learning. Although many studies have investigated metacognitive monitoring and scientific inquiry skills during game-based learning, few studies have investigated how the sequence of behaviors involved during hypothesis testing with game-based learning differ based on both efficiency level and emotions during gameplay. For this study, we analyzed 59 undergraduate students’ (59% female) metacognitive monitoring and hypothesis testing behavior during learning and gameplay with CRYSTAL ISLAND, a game-based learning environment that teaches students about microbiology. Specifically, we used sequential pattern mining and differential sequence mining to determine if there were sequences of hypothesis testing behaviors and to determine if the frequencies of occurrence of these sequences differed between high or low levels of efficiency at finishing the game and high or low levels of facial expressions of emotions during gameplay. Results revealed that students with low levels of efficiency and high levels of facial expressions of emotions had the most sequences of testing behaviors overall, specifically engaging in more sequences that were indicative of less strategic hypothesis testing behavior than the other students, where students who were more efficient with both levels of emotions demonstrated strategic testing behavior. These results have implications for the strengths of using educational data mining techniques for determining the processes underlying patterns of engaging in self-regulated learning conducted through hypothesis testing as they unfold over time; for training students on how to engage in the self-regulation, scientific inquiry, and emotion regulation processes that can result in efficient gameplay; and for developing adaptive game-based learning environments that foster effective and efficient self-regulation and scientific inquiry during learning.more » « less
- 
            null (Ed.)The detection of abnormal moving objects over high-volume trajectory streams is critical for real-time applications ranging from military surveillance to transportation management. Yet this outlier detection problem, especially along both the spatial and temporal dimensions, remains largely unexplored. In this work, we propose a rich taxonomy of novel classes of neighbor-based trajectory outlier definitions that model the anomalous behavior of moving objects for a large range of real-time applications. Our theoretical analysis and empirical study on two real-world datasets—the Beijing Taxi trajectory data and the Ground Moving Target Indicator data stream—and one generated Moving Objects dataset demonstrate the effectiveness of our taxonomy in effectively capturing different types of abnormal moving objects. Furthermore, we propose a general strategy for efficiently detecting these new outlier classes called the minimal examination (MEX) framework. The MEX framework features three core optimization principles, which leverage spatiotemporal as well as the predictability properties of the neighbor evidence to minimize the detection costs. Based on this foundation, we design algorithms that detect the outliers based on these classes of new outlier semantics that successfully leverage our optimization principles. Our comprehensive experimental study demonstrates that our proposed MEX strategy drives the detection costs 100-fold down into the practical realm for applications that analyze high-volume trajectory streams in near real time.more » « less
- 
            Nonoverlapping sequential pattern mining is an important type of sequential pattern mining (SPM) with gap constraints, which not only can reveal interesting patterns to users but also can effectively reduce the search space using the Apriori (anti-monotonicity) property. However, the existing algorithms do not focus on attributes of interest to users, meaning that existing methods may discover many frequent patterns that are redundant. To solve this problem, this article proposes a task called nonoverlapping three-way sequential pattern (NTP) mining, where attributes are categorized according to three levels of interest: strong, medium, and weak interest. NTP mining can effectively avoid mining redundant patterns since the NTPs are composed of strong and medium interest items. Moreover, NTPs can avoid serious deviations (the occurrence is significantly different from its pattern) since gap constraints cannot match with strong interest patterns. To mine NTPs, an effective algorithm is put forward, called NTP-Miner, which applies two main steps: support (frequency occurrence) calculation and candidate pattern generation. To calculate the support of an NTP, depth-first and backtracking strategies are adopted, which do not require creating a whole Nettree structure, meaning that many redundant nodes and parent–child relationships do not need to be created. Hence, time and space efficiency is improved. To generate candidate patterns while reducing their number, NTP-Miner employs a pattern join strategy and only mines patterns of strong and medium interest. Experimental results on stock market and protein datasets show that NTP-Miner not only is more efficient than other competitive approaches but can also help users find more valuable patterns. More importantly, NTP mining has achieved better performance than other competitive methods in clustering tasks. Algorithms and data are available at: https://github.com/wuc567/Pattern-Mining/tree/master/NTP-Miner .more » « less
- 
            Non-negative matrix factorization (NMF), the problem of finding two non-negative low-rank factors whose product approximates an input matrix, is a useful tool for many data mining and scientific applications such as topic modeling in text mining and unmixing in microscopy. In this paper, we focus on scaling algorithms for NMF to very large sparse datasets and massively parallel machines by employing effective algorithms, communication patterns, and partitioning schemes that leverage the sparsity of the input matrix. We consider two previous works developed for related problems,one that uses a fine-grained partitioning strategy using a point-to-point communication pattern and one that uses a Cartesian, or checkerboard, partitioning strategy using a collective-based communication pattern. We show that a combination of the previous approaches balances the demands of the various computations within NMF algorithms and achieves high efficiency and scalability. From the experiments, we see that our proposed strategy runs up to 10x faster than the state of the art on real-world datasets.more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
 
                                    