Title: Action Sequence Augmentation for Early Graph-based Anomaly Detection
The proliferation of web platforms has created incentives for online abuse. Many graph-based anomaly detection techniques have been proposed to identify suspicious accounts and behaviors. However, most of them detect anomalies only after users have performed many such behaviors. Their performance is substantially hindered when a user's observed data is limited at an early stage, and early-stage detection needs to improve to minimize financial loss. In this work, we propose Eland, a novel framework that uses action sequence augmentation for early anomaly detection. Eland uses a sequence predictor to predict the next actions of every user and exploits the mutual enhancement between action sequence augmentation and user-action graph anomaly detection. Experiments on three real-world datasets show that Eland improves the performance of a variety of graph-based anomaly detection methods. With Eland, anomaly detection at an earlier stage outperforms non-augmented methods that need significantly more observed data, by up to 15% in area under the ROC curve.
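The augmentation idea in the abstract can be illustrated with a minimal sketch. It is not Eland's implementation: the abstract does not specify the sequence predictor, so a first-order Markov model stands in for it here, and the function names (`fit_transitions`, `augment`, `build_user_action_graph`) and the toy data are hypothetical. The sketch extends each user's short observed sequence with predicted actions, then builds the weighted user-action graph that a graph-based detector would consume.

```python
from collections import Counter, defaultdict


def fit_transitions(sequences):
    """Count action -> next-action transitions across all users.

    Assumption: a first-order Markov model is a stand-in for the
    learned sequence predictor described in the abstract.
    """
    transitions = defaultdict(Counter)
    for actions in sequences.values():
        for current, nxt in zip(actions, actions[1:]):
            transitions[current][nxt] += 1
    return transitions


def augment(sequences, transitions, steps=2):
    """Append up to `steps` predicted actions to each observed sequence."""
    augmented = {}
    for user, actions in sequences.items():
        extended = list(actions)
        for _ in range(steps):
            candidates = transitions.get(extended[-1]) if extended else None
            if not candidates:
                break  # no known continuation for the last action
            extended.append(candidates.most_common(1)[0][0])
        augmented[user] = extended
    return augmented


def build_user_action_graph(sequences):
    """Return weighted bipartite edges: (user, action) -> occurrence count."""
    edges = Counter()
    for user, actions in sequences.items():
        for action in actions:
            edges[(user, action)] += 1
    return edges


# Hypothetical toy data: u3 has only two observed actions (early stage).
observed = {
    "u1": ["login", "post", "post", "post"],
    "u2": ["login", "browse", "logout"],
    "u3": ["login", "post"],
}

transitions = fit_transitions(observed)
augmented = augment(observed, transitions, steps=2)
graph = build_user_action_graph(augmented)
```

In this toy run the early-stage user `u3` gains predicted `post` actions, so its edges in the augmented graph resemble those of the longer-observed `u1`. The mutual enhancement between augmentation and detection that the abstract mentions is not modeled in this sketch.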
Award ID(s):
1849816 1901059
PAR ID:
10334394
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the 30th ACM International Conference on Information & Knowledge Management
Page Range / eLocation ID:
2668 to 2678
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Anomaly detection methods have great potential to assist the detection of diseases in animal production systems. We used sequence data of Porcine Reproductive and Respiratory Syndrome (PRRS) to define the emergence of new strains at the farm level. We evaluated the performance of 24 anomaly detection methods based on machine learning, regression, time series techniques and control charts to identify outbreaks in time series of new strains, and compared the best methods using different time series: PCR positives, PCR requests and laboratory requests. We introduced synthetic outbreaks of different sizes and calculated the probability of detection of outbreaks (POD), sensitivity (Se), probability of detection of outbreaks in the first week of appearance (POD1w) and background alarm rate (BAR). The use of time series of new strains from sequence data outperformed the other types of data, but POD, Se and POD1w were high only when outbreaks were large. The methods based on Long Short-Term Memory (LSTM) and Bayesian approaches presented the best performance. Using anomaly detection methods with sequence data may help to identify the emergence of cases in multiple farms, but more work is required to improve detection in time series of high variability. Our results suggest a promising application of sequence data for early detection of diseases at a production system level. This may provide a simple way to extract additional value from routine laboratory analysis. Next steps should include validation of this approach in different settings and with different diseases.
  2. Fraud detection is of great importance because fraudulent behaviors may mislead consumers or bring huge losses to enterprises. Due to the lockstep feature of fraudulent behaviors, the fraud detection problem can be viewed as finding suspicious dense blocks in an attributed bipartite graph. In reality, existing attribute-based methods are not adversarially robust, because fraudsters can take camouflage actions to disguise their behavior attributes as normal. More importantly, existing methods based on structural information consider only shallow topological structure, making their effectiveness sensitive to the density of suspicious blocks. In this paper, we propose a novel deep structure learning model named DeepFD to differentiate normal users and suspicious users. DeepFD can preserve the non-linear graph structure and user behavior information simultaneously. Experimental results on different types of datasets demonstrate that DeepFD outperforms the state-of-the-art baselines.
  3. Internet of Things (IoT) cyber threats, exemplified by jackware and crypto mining, underscore the vulnerability of IoT devices. Due to the multi-step nature of many attacks, early detection is vital for a swift response and for preventing malware propagation. However, accurately detecting early-stage attacks is challenging, as attackers employ stealthy, zero-day, or adversarial machine learning techniques to evade detection. To enhance security, we propose ARIoTEDef, an Adversarially Robust IoT Early Defense system, which identifies early-stage infections and evolves autonomously. It models multi-stage attacks based on a cyber kill chain and maintains stage-specific detectors. When anomalies in the later action stage emerge, the system retroactively analyzes event logs using an attention-based sequence-to-sequence model to identify early infections. Then, the infection detector is updated with information about the identified infections. We have evaluated ARIoTEDef against multi-stage attacks, such as the Mirai botnet. Results show that the infection detector’s average F1 score increases from 0.31 to 0.87 after one evolution round. We have also conducted an extensive analysis of ARIoTEDef against adversarial evasion attacks. Our results show that ARIoTEDef is robust and benefits from multiple rounds of evolution.
  4. Lateral movement is a key stage of system compromise used by advanced persistent threats. Detecting it is no simple task. When network host logs are abstracted into discrete temporal graphs, the problem can be reframed as anomalous edge detection in an evolving network. Research in modern deep graph learning techniques has produced many creative and complicated models for this task. However, as is the case in many machine learning fields, the generality of models is of paramount importance for accuracy and scalability during training and inference. In this article, we propose a formalized approach to this problem with a framework we call Euler. It consists of a model-agnostic graph neural network stacked upon a model-agnostic sequence encoding layer such as a recurrent neural network. Models built according to the Euler framework can easily distribute their graph convolutional layers across multiple machines for large performance improvements. Additionally, we demonstrate that Euler-based models are as good as, or better than, every state-of-the-art approach to anomalous link detection and prediction that we tested. As anomaly-based intrusion detection systems, our models efficiently identified anomalous connections between entities with high precision and outperformed all other unsupervised techniques for anomalous lateral movement detection. Additionally, we show that as a piece of a larger anomaly detection pipeline, Euler models perform well enough for use in real-world systems. With more advanced, yet still lightweight, alerting mechanisms ingesting the embeddings produced by Euler models, precision is boosted from 0.243 to 0.986 on real-world network traffic.
  5. Attributed networks are a type of graph-structured data used in many real-world scenarios. Detecting anomalies on attributed networks has a wide spectrum of applications such as spammer detection and fraud detection. Although this research area has drawn increasing attention in the last few years, previous works are mostly unsupervised because of the high cost of labeling ground-truth anomalies. Many recent studies have shown that different types of anomalies are often mixed together on attributed networks, and that such invaluable human knowledge could provide complementary insights for advancing anomaly detection on attributed networks. To this end, we study the novel problem of modeling and integrating human knowledge of different anomaly types for attributed network anomaly detection. Specifically, we first model prior human knowledge through a novel data augmentation strategy. We then integrate the modeled knowledge in a Siamese graph neural network encoder through a well-designed contrastive loss. In the end, we train a decoder to reconstruct the original networks from the node representations learned by the encoder, and rank nodes according to their reconstruction error as the anomaly metric. Experiments on five real-world datasets demonstrate that the proposed framework outperforms the state-of-the-art anomaly detection algorithms.