
Title: Real-time Spread Burst Detection in Data Streaming

Data streaming has many applications in network monitoring, web services, e-commerce, stock trading, social networks, and distributed sensing. This paper introduces a new problem, real-time burst detection in flow spread, which differs from the traditional problem of burst detection in flow size. The problem is practically significant, with potential applications in cybersecurity, network engineering, and trend identification on the Internet. It is challenging for two reasons: estimating flow spread requires remembering all past data items so that duplicates are not counted again, and detecting bursts in real time requires minimizing spread estimation overhead, which was not the priority in most prior work. This paper provides the first efficient, real-time solution for spread burst detection. The solution builds on a new real-time super spreader identifier, which outperforms the state of the art in both accuracy and processing overhead. The super spreader identifier is in turn based on a new sketch design for real-time spread estimation, which outperforms the best existing sketches.
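To make the distinction between flow size and flow spread concrete, the following is a minimal sketch that uses exact counting, not the paper's sketch-based design. The epoch-based stream format, the function names, and the burst rule (spread at least `ratio` times the previous epoch's spread, with a `min_spread` floor) are all assumptions made for illustration only.

```python
from collections import defaultdict


def per_epoch_stats(stream):
    """Exact per-epoch flow size and spread.

    `stream` is an iterable of (epoch, flow, item) tuples (an assumed format).
    Size counts every item; spread counts distinct items only.
    Returns {epoch: {flow: (size, spread)}}.
    """
    sizes = defaultdict(lambda: defaultdict(int))
    items = defaultdict(lambda: defaultdict(set))
    for epoch, flow, item in stream:
        sizes[epoch][flow] += 1
        items[epoch][flow].add(item)  # the set remembers items, so duplicates are ignored
    return {
        e: {f: (sizes[e][f], len(items[e][f])) for f in sizes[e]}
        for e in sizes
    }


def spread_bursts(stats, ratio=2.0, min_spread=3):
    """Hypothetical burst rule: flag (epoch, flow) when the flow's spread is at
    least `min_spread` and at least `ratio` times its spread in the previous epoch."""
    bursts = []
    epochs = sorted(stats)
    for prev, cur in zip(epochs, epochs[1:]):
        for flow, (_, spread) in stats[cur].items():
            prev_spread = stats[prev].get(flow, (0, 0))[1]
            if spread >= min_spread and spread >= ratio * max(prev_spread, 1):
                bursts.append((cur, flow))
    return bursts


# Toy stream: srcA sends many items to one destination (large size, small spread);
# srcB contacts four distinct destinations in epoch 1 (its spread jumps).
stream = [
    (0, "srcA", "dst1"), (0, "srcA", "dst1"), (0, "srcA", "dst1"), (0, "srcA", "dst1"),
    (0, "srcB", "dst1"),
    (1, "srcA", "dst1"), (1, "srcA", "dst1"),
    (1, "srcB", "dst1"), (1, "srcB", "dst2"), (1, "srcB", "dst3"), (1, "srcB", "dst4"),
]
stats = per_epoch_stats(stream)
bursts = spread_bursts(stats)
```

Exact counting stores every distinct item per flow, which is the memory cost the abstract refers to; sketch-based estimators trade some accuracy for a much smaller footprint.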

Award ID(s):
2124858 1909077
Journal Name:
Proceedings of the ACM on Measurement and Analysis of Computing Systems
Page Range / eLocation ID:
1 to 31
Sponsoring Org:
National Science Foundation
More Like this
  1. Measuring flow spread in real time from large, high-rate data streams has numerous practical applications, where a data stream is modeled as a sequence of data items from different flows and the spread of a flow is the number of distinct items in the flow. Past decades have witnessed tremendous performance improvement for single-flow spread estimation. However, when dealing with numerous flows in a data stream, it remains a significant challenge to measure per-flow spread accurately while reducing the memory footprint. The goal of this paper is to introduce new multi-flow spread estimation designs that incur much smaller processing overhead and query overhead than the state of the art, yet achieve significant accuracy improvement in spread estimation. We formally analyze the performance of these new designs. We implement them in both hardware and software, and use real-world data traces to evaluate their performance in comparison with the state of the art. The experimental results show that our best sketch significantly improves over the best existing work in terms of estimation accuracy, data item processing throughput, and online query throughput.
  2. This paper introduces a hierarchical traffic model for spread measurement of network traffic flows. The hierarchical model, which aggregates lower-level flows into higher-level flows in a hierarchical structure, will allow us to measure network traffic at different granularities at once to support diverse traffic analysis from a grand view to fine-grained details. The spread of a flow is the number of distinct elements (under measurement) in the flow, where the flow label (that identifies packets belonging to the flow) and the elements (which are defined based on application need) can be found in packet headers or payload. Traditional flow spread estimators are designed without hierarchical traffic modeling in mind, and incur high overhead when they are applied to each level of the traffic hierarchy. In this paper, we propose a new Hierarchical Virtual bitmap Estimator (HVE) that performs simultaneous multi-level traffic measurement, at the same cost as a traditional estimator, without degrading measurement accuracy. We implement the proposed solution and perform experiments based on real traffic traces. The experimental results demonstrate that HVE improves measurement throughput by 43% to 155%, thanks to the reduction of per-packet processing overhead. For small to medium flows, its measurement accuracy is largely similar to traditional estimators that work at one level at a time. For large aggregate and base flows, its accuracy is better, with up to 97% smaller error in our experiments.
  3. The widespread use of GPS-enabled devices and the Internet of Things (IoT) has increased the amount of spatial data being generated every second. The current scale of spatial data cannot be handled using centralized systems. This has led to the development of distributed spatial data streaming systems that scale to process large amounts of streamed spatial data in real time. The performance of distributed streaming systems relies on how evenly the workload is distributed among their machines. However, it is challenging to estimate the workload of each machine because spatial data and query streams are skewed and change rapidly with time and users' interests. Moreover, a distributed spatial streaming system often does not maintain a global system workload state because collecting it from the machines in the system incurs high network and processing overhead. This paper introduces TrioStat, an online workload estimation technique that relies on a probabilistic model for estimating the workload of partitions and machines in a distributed spatial data streaming system. It is infeasible to collect and exchange statistics with a centralized unit because it requires high network overhead. Instead, TrioStat uses a decentralized technique to collect and maintain the required statistics in real time locally in each machine. TrioStat enables distributed spatial data streaming systems to compare the workloads of machines as well as the workloads of data partitions. TrioStat requires minimal network and storage overhead. Moreover, the required storage is distributed across the system's machines.
  4. The problem of state estimation for unobservable distribution systems is considered. A deep learning approach to Bayesian state estimation is proposed for real-time applications. The proposed technique consists of distribution learning of stochastic power injection, a Monte Carlo technique for the training of a deep neural network for state estimation, and a Bayesian bad-data detection and filtering algorithm. Structural characteristics of the deep neural networks are investigated. Simulations illustrate the accuracy of Bayesian state estimation for unobservable systems and demonstrate the benefit of employing a deep neural network. Numerical results show the robustness of Bayesian state estimation against modeling and estimation errors and the presence of bad and missing data. Compared with pseudo-measurement techniques, direct Bayesian state estimation via a deep neural network outperforms existing benchmarks.
  5. This paper addresses the problem of detecting pedestrians using an enhanced object detection method. In particular, the paper considers the occluded pedestrian detection problem in autonomous driving scenarios where the balance between accuracy and speed is crucial. Existing works focus on learning representations of unique persons independent of body-part semantics. To achieve real-time performance along with robust detection, we introduce a body-part-based pedestrian detection architecture where body parts are fused through a computationally effective constraint optimization technique. We demonstrate that our method significantly improves detection accuracy while adding negligible runtime overhead. We evaluate our method using a real-world dataset. Experimental results show that the proposed method outperforms existing pedestrian detection methods.