

Title: A Review of Local Outlier Factor Algorithms for Outlier Detection in Big Data Streams
Outlier detection is a statistical procedure that aims to find suspicious events or items that are different from the normal form of a dataset. It has drawn considerable interest in the field of data mining and machine learning. Outlier detection is important in many applications, including fraud detection in credit card transactions and network intrusion detection. There are two general types of outlier detection: global and local. Global outliers fall outside the normal range for an entire dataset, whereas local outliers may fall within the normal range for the entire dataset, but outside the normal range for the surrounding data points. This paper addresses local outlier detection. The best-known technique for local outlier detection is the Local Outlier Factor (LOF), a density-based technique. There are many LOF algorithms for a static data environment; however, these algorithms cannot be applied directly to data streams, which are an important type of big data. In general, local outlier detection algorithms for data streams are still deficient and better algorithms need to be developed that can effectively analyze the high velocity of data streams to detect local outliers. This paper presents a literature review of local outlier detection algorithms in static and stream environments, with an emphasis on LOF algorithms. It collects and categorizes existing local outlier detection algorithms and analyzes their characteristics. Furthermore, the paper discusses the advantages and limitations of those algorithms and proposes several promising directions for developing improved local outlier detection methods for data streams.
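To make the global/local distinction in the abstract concrete, the following is a minimal sketch of the classic static LOF computation in plain Python. It is not taken from the paper: the function name `lof_scores`, the toy dataset, and the choice `k = 3` are illustrative assumptions, and the quadratic-time loop is only suitable for small static datasets, not for streams. It also assumes the points contain no exact duplicates.

```python
import math


def lof_scores(points, k):
    """Return the Local Outlier Factor of every point (naive O(n^2) sketch).

    Assumes no duplicate points, so every mean reachability distance is > 0.
    """
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(points)")
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist = []      # distance from each point to its k-th nearest neighbor
    neighbors = []   # indices within the k-distance (ties are included)
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]
        k_dist.append(kd)
        neighbors.append([j for d, j in others if d <= kd])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(p, o) = max(k-distance(o), d(p, o)).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Toy data (hypothetical): a dense 3x3 grid, a sparse 3x3 grid, and one point
# that sits inside the overall data range but away from the dense grid.
dense = [(i * 0.1, j * 0.1) for i in range(3) for j in range(3)]
sparse = [(10.0 + 2 * i, 10.0 + 2 * j) for i in range(3) for j in range(3)]
points = dense + sparse + [(0.7, 0.7)]

scores = lof_scores(points, k=3)
suspect = max(range(len(points)), key=scores.__getitem__)
print("highest LOF at index", suspect, "->", points[suspect])
```

In this toy setup the point (0.7, 0.7) lies well within the coordinate range of the whole dataset, so a global range check would not flag it, yet its density is far lower than that of its nearest neighbors in the dense grid, which is what a LOF score well above 1 expresses. Points in the sparse grid score close to 1 because their neighbors are equally sparse.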
Award ID(s):
2019609
NSF-PAR ID:
10231456
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Big Data and Cognitive Computing
Volume:
5
Issue:
1
ISSN:
2504-2289
Page Range / eLocation ID:
1
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The detection of abnormal moving objects over high-volume trajectory streams is critical for real-time applications ranging from military surveillance to transportation management. Yet this outlier detection problem, especially along both the spatial and temporal dimensions, remains largely unexplored. In this work, we propose a rich taxonomy of novel classes of neighbor-based trajectory outlier definitions that model the anomalous behavior of moving objects for a large range of real-time applications. Our theoretical analysis and empirical study on two real-world datasets (the Beijing Taxi trajectory data and the Ground Moving Target Indicator data stream) and one generated Moving Objects dataset demonstrate that our taxonomy effectively captures different types of abnormal moving objects. Furthermore, we propose a general strategy for efficiently detecting these new outlier classes, called the minimal examination (MEX) framework. The MEX framework features three core optimization principles, which leverage the spatiotemporal as well as the predictability properties of the neighbor evidence to minimize the detection costs. Based on this foundation, we design algorithms that detect outliers under these new classes of outlier semantics and that successfully leverage our optimization principles. Our comprehensive experimental study demonstrates that our proposed MEX strategy reduces detection costs 100-fold, into the practical realm for applications that analyze high-volume trajectory streams in near real time.
  2. Local outlier techniques are known to be effective for detecting outliers in skewed data, where subsets of the data exhibit diverse distribution properties. However, existing methods are not well equipped to support modern high-velocity data streams due to the high complexity of the detection algorithms and their volatility under data updates. To tackle these shortcomings, we propose local outlier semantics that operate at an abstraction level by leveraging kernel density estimation (KDE) to effectively detect local outliers from streaming data. A strategy to continuously detect top-N KDE-based local outliers over streams is designed, called KELOS, the first linear time complexity streaming local outlier detection approach. The first innovation of KELOS is the abstract kernel center-based KDE (aKDE) strategy. aKDE accurately yet efficiently estimates the data density at each point, which is essential for local outlier detection. This is based on the observation that a cluster of points close to each other tends to have a similar influence on a target point’s density estimation when used as kernel centers. These points can thus be represented by one abstract kernel center. Next, KELOS’s inlier pruning strategy prunes, at an early stage, points that have no chance of becoming top-N outliers. This lets KELOS avoid computing the data density and the outlier status for every data point. Together, aKDE and the inlier pruning strategy eliminate the performance bottleneck of streaming local outlier detection. The experimental evaluation demonstrates that KELOS is up to 6 orders of magnitude faster than existing solutions, while being highly effective in detecting local outliers from streaming data.
  3. Smart grids face many challenges, including cyber-attacks, which can cause devastating damage to the grids. Existing machine learning based approaches for detecting cyber-attacks in smart grids are mainly based on supervised learning, which needs representative instances from various attack types to obtain good detection models. In this paper, we investigated semi-supervised outlier detection algorithms for this problem, which use only instances of normal events for model training. Data collected by phasor measurement units (PMUs) was used for training the detection model. The semi-supervised outlier detection algorithms were augmented with deep feature extraction for enhanced detection performance. Our results show that semi-supervised outlier detection algorithms can perform better than popular supervised algorithms. Deep feature extraction can significantly improve the performance of semi-supervised algorithms for detecting cyber-attacks in smart grids.
  4. Summary

    This paper presents a novel state estimation approach for linear dynamic systems when measurements are corrupted by outliers. Since outliers can degrade the performance of state estimation, outlier accommodation is critical. The standard approach combines outlier detection utilizing Neyman‐Pearson (NP) type tests with a Kalman filter (KF). This approach ignores all residuals greater than a designer‐specified threshold. When measurements with outliers are used (i.e., missed detections), both the state estimate and the error covariance matrix become corrupted. This corrupted state and covariance estimate are then the basis for all subsequent outlier decisions. When valid measurements are rejected (i.e., false alarms), potentially using the corrupted state estimate and error covariance, measurement information is lost. Either using invalid information or discarding too much valid information can result in divergence of the KF. An alternative approach is moving‐horizon (MH) state estimation, which maintains all recent measurement data within a moving window with a time horizon of length L. In MH approaches, the number of measurements available for state estimation is affected by both the number of measurements per time step and the number of time steps L over which measurements are retained. Risk‐averse performance‐specified (RAPS) state estimation works within an optimization setting to choose a set of measurements that achieves a performance specification with minimum risk of outlier inclusion. This paper derives and formulates the MH‐RAPS solution for outlier accommodation. The paper also presents implementation results. The MH‐RAPS application uses Global Navigation Satellite Systems measurements to estimate the state of a moving platform using a position, velocity, and acceleration model. In this application, MH‐RAPS performance is compared with MH‐NP state estimation.

     
  5. Premise

    Biological outliers (observations that fall outside of a previously understood norm, e.g., in phenology or distribution) may indicate early stages of a transformative change that merits immediate attention. Collectors of biodiversity specimens such as plants, fungi, and animals are on the front lines of discovering outliers, yet the role collectors currently play in providing such data is unclear.

    Methods

    We surveyed 222 collectors of a broad range of taxa, searched 47 training materials, and explored the use of 170 outlier terms in 75 million specimen records to determine the current state of outlier detection and documentation in this community.

    Results

    Collectors reported observing outliers (e.g., about 80% of respondents observed morphological and distributional outliers at least occasionally). However, relatively few specimen records include outlier terms, and imprecision in their use and handling in data records complicates data discovery by stakeholders. This current state appears to be at least partly due to the absence of protocols: only one of the training materials addressed documenting and reporting outliers.

    Conclusions

    We suggest next steps to mobilize this largely untapped, yet ideally suited, community for early detection of biotic change in the Anthropocene, including community activities for building relevant best practices.

     