Title: Real-time Spread Burst Detection in Data Streaming
Data streaming has many applications in network monitoring, web services, e-commerce, stock trading, social networks, and distributed sensing. This paper introduces a new problem of real-time burst detection in flow spread, which differs from the traditional problem of burst detection in flow size. It is practically significant with potential applications in cybersecurity, network engineering, and trend identification on the Internet. It is a challenging problem because estimating flow spread requires us to remember all past data items and detecting bursts in real time requires us to minimize spread estimation overhead, which was not the priority in most prior work. This paper provides the first efficient, real-time solution for spread burst detection. It is designed based on a new real-time super spreader identifier, which outperforms the state of the art in terms of both accuracy and processing overhead. The super spreader identifier is in turn based on a new sketch design for real-time spread estimation, which outperforms the best existing sketches.
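To illustrate how a burst in flow spread differs from a burst in flow size, here is a minimal Python sketch. It is not the paper's method: it stores every item exactly (which is precisely the cost a sketch is meant to avoid), and the abstract gives no formal burst definition, so the rule used here (spread in the current epoch is at least `ratio` times the spread in the previous epoch) is an assumption. The names `spread_per_epoch`, `spread_bursts`, `ratio`, and `min_spread` are hypothetical.

```python
from collections import defaultdict

def spread_per_epoch(stream):
    """stream: iterable of (epoch, flow, item). Returns {epoch: {flow: spread}}."""
    seen = defaultdict(lambda: defaultdict(set))
    for epoch, flow, item in stream:
        seen[epoch][flow].add(item)  # duplicates of an item do not add to spread
    return {e: {f: len(s) for f, s in flows.items()} for e, flows in seen.items()}

def spread_bursts(stream, ratio=3.0, min_spread=3):
    """Flag (epoch, flow) pairs under the ASSUMED rule: spread >= ratio * previous spread."""
    spreads = spread_per_epoch(stream)
    bursts = []
    for epoch in sorted(spreads):
        prev = spreads.get(epoch - 1, {})  # epochs are assumed to be consecutive integers
        for flow, cur in sorted(spreads[epoch].items()):
            # A flow with no items in the previous epoch gets a baseline of 1.
            baseline = max(prev.get(flow, 0), 1)
            if cur >= min_spread and cur >= ratio * baseline:
                bursts.append((epoch, flow))
    return bursts

stream = (
    [(0, "a", "x1"), (0, "a", "x1"), (0, "b", "y1"), (0, "b", "y2")]
    # Flow "a" suddenly contacts four distinct items: a spread burst.
    + [(1, "a", x) for x in ("x1", "x2", "x3", "x4")]
    # Flow "b" sends 50 copies of one item: a size burst, but its spread is 1.
    + [(1, "b", "y1")] * 50
)
bursts = spread_bursts(stream)
print(bursts)
```

In this toy stream only flow `a` is flagged in epoch 1. Flow `b` carries far more data items but they are all the same item, so its spread does not grow.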
Wang, Haibo; Ma, Chaoyi; Odegbile, Olufemi O; Chen, Shigang; Peir, Jih-Kwon
(Proceedings of the VLDB Endowment)
Measuring flow spread in real time from large, high-rate data streams has numerous practical applications, where a data stream is modeled as a sequence of data items from different flows and the spread of a flow is the number of distinct items in the flow. Past decades have witnessed tremendous performance improvement for single-flow spread estimation. However, when dealing with numerous flows in a data stream, it remains a significant challenge to measure per-flow spread accurately while reducing memory footprint. The goal of this paper is to introduce new multi-flow spread estimation designs that incur much smaller processing overhead and query overhead than the state of the art, yet achieve significant accuracy improvement in spread estimation. We formally analyze the performance of these new designs. We implement them in both hardware and software, and use real-world data traces to evaluate their performance in comparison with the state of the art. The experimental results show that our best sketch significantly improves over the best existing work in terms of estimation accuracy, data item processing throughput, and online query throughput.
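The definition of spread above (the number of distinct items in a flow) can be made concrete with a small exact computation in Python. This is only a baseline for intuition, not one of the sketch designs from the paper; the function name `size_and_spread` and the sample flow and item labels are hypothetical.

```python
from collections import defaultdict

def size_and_spread(stream):
    """Return ({flow: size}, {flow: spread}) for an iterable of (flow, item) pairs."""
    sizes = defaultdict(int)      # size: total number of data items in the flow
    distinct = defaultdict(set)   # spread: number of distinct items in the flow
    for flow, item in stream:
        sizes[flow] += 1
        distinct[flow].add(item)
    spreads = {flow: len(items) for flow, items in distinct.items()}
    return dict(sizes), spreads

stream = [
    ("src_a", "dst_1"), ("src_a", "dst_1"), ("src_a", "dst_1"),
    ("src_b", "dst_1"), ("src_b", "dst_2"), ("src_b", "dst_3"),
]
sizes, spreads = size_and_spread(stream)
print(sizes, spreads)
```

Both flows have size 3, but `src_a` has spread 1 and `src_b` has spread 3. Keeping one set per flow is what makes exact per-flow measurement memory-hungry when there are many flows.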
Daghistani, Anas; Aref, Walid G.; Ghafoor, Arif
(SIGSPATIAL '20: Proceedings of the 28th International Conference on Advances in Geographic Information Systems)
The wide spread of GPS-enabled devices and the Internet of Things (IoT) has increased the amount of spatial data being generated every second. The current scale of spatial data cannot be handled using centralized systems. This has led to the development of distributed spatial data streaming systems that scale to process large amounts of streamed spatial data in real time. The performance of distributed streaming systems relies on how evenly the workload is distributed among their machines. However, it is challenging to estimate the workload of each machine because spatial data and query streams are skewed and rapidly change with time and users' interests. Moreover, a distributed spatial streaming system often does not maintain a global system workload state because collecting it from the machines in the system incurs high network and processing overhead. This paper introduces TrioStat, an online workload estimation technique that relies on a probabilistic model for estimating the workload of partitions and machines in a distributed spatial data streaming system. It is infeasible to collect and exchange statistics with a centralized unit because it requires high network overhead. Instead, TrioStat uses a decentralized technique to collect and maintain the required statistics in real time locally in each machine. TrioStat enables distributed spatial data streaming systems to compare the workloads of machines as well as the workloads of data partitions. TrioStat requires minimal network and storage overhead. Moreover, the required storage is distributed across the system's machines.
Odegbile, Olufemi; Ma, Chaoyi; Chen, Shigang; Melissourgos, Dimitrios; Wang, Haibo
(Proceedings of 6th International Conference on Networks, Communications, Wireless, and Mobile Computing)
This paper introduces a hierarchical traffic model for spread measurement of network traffic flows. The hierarchical model, which aggregates lower-level flows into higher-level flows in a hierarchical structure, will allow us to measure network traffic at different granularities at once to support diverse traffic analysis from a grand view to fine-grained details. The spread of a flow is the number of distinct elements (under measurement) in the flow, where the flow label (that identifies packets belonging to the flow) and the elements (which are defined based on application need) can be found in packet headers or payload. Traditional flow spread estimators are designed without hierarchical traffic modeling in mind, and incur high overhead when they are applied to each level of the traffic hierarchy. In this paper, we propose a new Hierarchical Virtual bitmap Estimator (HVE) that performs simultaneous multi-level traffic measurement, at the same cost as a traditional estimator, without degrading measurement accuracy. We implement the proposed solution and perform experiments based on real traffic traces. The experimental results demonstrate that HVE improves measurement throughput by 43% to 155%, thanks to the reduction of per-packet processing overhead. For small to medium flows, its measurement accuracy is largely similar to traditional estimators that work at one level at a time. For large aggregate and base flows, its accuracy is better, with up to 97% smaller error in our experiments.
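The abstract does not describe HVE in enough detail to reproduce it. As background only, the Python sketch below shows a plain single-flow bitmap estimator (linear counting), which is assumed here to be representative of the traditional single-level estimators mentioned above. Each element is hashed to one of `m` bits, and the spread is estimated as `-m * ln(V)`, where `V` is the fraction of bits still zero. The function name `bitmap_spread_estimate`, the bitmap size, and the choice of SHA-256 as the hash are assumptions for illustration.

```python
import hashlib
import math

def bitmap_spread_estimate(items, m=1024):
    """Estimate the number of distinct items with an m-bit bitmap (linear counting)."""
    bits = [0] * m
    for item in items:
        # Hash the element to a bit position; a repeated element sets the same bit.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        bits[int.from_bytes(digest[:8], "big") % m] = 1
    zeros = bits.count(0)
    if zeros == 0:
        return float("inf")  # bitmap is saturated; m is too small for this flow
    return -m * math.log(zeros / m)

# 500 distinct elements, each seen three times.
elements = [f"10.0.{i // 256}.{i % 256}" for i in range(500)] * 3
estimate = bitmap_spread_estimate(elements, m=4096)
print(round(estimate))
```

Memory stays fixed at `m` bits regardless of how many packets arrive, and duplicates never change the estimate. The trade-off is that accuracy degrades as the bitmap fills up.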
The Zika virus epidemic of 2015–16, which caused over 1 million confirmed or suspected human cases in the Caribbean and Latin America, was driven by a combination of movement of infected humans and availability of suitable habitat for mosquito species that are key disease vectors. Both human mobility and mosquito vector abundances vary seasonally, and the goal of our research was to analyze the interacting effects of disease vector densities and human movement across metapopulations on disease transmission intensity and the probability of super-spreader events. Our research uses the novel approach of combining geographical modeling of mosquito presence with network modeling of human mobility to offer a comprehensive simulation environment for Zika virus epidemics that considers a substantial number of spatial and temporal factors compared to the literature. Specifically, we tested the hypotheses that 1) regions with the highest probability of mosquito presence will have more super-spreader events during dry months, when mosquitoes are predicted to be more abundant, 2) regions reliant on tourism industries will have more super-spreader events during wet months, when they are more likely to contribute to network-level pathogen spread due to increased travel. We used the case study of Colombia, a country with a population of about 50 million people, with an annual calendar that can be partitioned into overlapping cycles of wet and dry seasons and peak tourism and off tourism seasons that drive distinct cyclical patterns of mosquito abundance and human movement. Our results show that whether the first infected human was introduced to the network during the wet versus dry season and during the tourism versus off tourism season profoundly affects the severity and trajectory of the epidemic. For example, Zika virus was first detected in Colombia in October of 2015. 
Had it originated in January, a dry season month with high rates of tourism, it likely could have infected up to 60% more individuals and up to 40% more super-spreader events may have occurred. In addition, popular tourism destinations such as Barranquilla and Cartagena have the highest risk of super-spreader events during the winter, whereas densely populated areas such as Medellín and Bogotá are at higher risk of sustained transmission during dry months in the summer. Our research demonstrates that public health planning and response to vector-borne disease outbreaks requires a thorough understanding of how vector and host patterns vary due to seasonality in environmental conditions and human mobility dynamics. This research also has strong implications for tourism policy and the potential response strategies in case of an emergent epidemic.
Mestav, Kursat Rasim; Luengo-Rozas, Jaime; Tong, Lang
(IEEE Transactions on Power Systems)
The problem of state estimation for unobservable distribution systems is considered. A deep learning approach to Bayesian state estimation is proposed for real-time applications. The proposed technique consists of distribution learning of stochastic power injection, a Monte Carlo technique for the training of a deep neural network for state estimation, and a Bayesian bad-data detection and filtering algorithm. Structural characteristics of the deep neural networks are investigated. Simulations illustrate the accuracy of Bayesian state estimation for unobservable systems and demonstrate the benefit of employing a deep neural network. Numerical results show the robustness of Bayesian state estimation against modeling and estimation errors and the presence of bad and missing data. Compared with pseudo-measurement techniques, direct Bayesian state estimation via a deep neural network outperforms existing benchmarks.
Wang, Haibo, Melissourgos, Dimitrios, Ma, Chaoyi, and Chen, Shigang. Real-time Spread Burst Detection in Data Streaming. Retrieved from https://par.nsf.gov/biblio/10471848. Proceedings of the ACM on Measurement and Analysis of Computing Systems 7.2 Web. doi:10.1145/3589979.
Wang, Haibo, Melissourgos, Dimitrios, Ma, Chaoyi, & Chen, Shigang. Real-time Spread Burst Detection in Data Streaming. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 7 (2). Retrieved from https://par.nsf.gov/biblio/10471848. https://doi.org/10.1145/3589979
Wang, Haibo, Melissourgos, Dimitrios, Ma, Chaoyi, and Chen, Shigang. "Real-time Spread Burst Detection in Data Streaming". Proceedings of the ACM on Measurement and Analysis of Computing Systems 7 (2). Country unknown/Code not available: ACM. https://doi.org/10.1145/3589979. https://par.nsf.gov/biblio/10471848.
@article{osti_10471848,
place = {Country unknown/Code not available},
title = {Real-time Spread Burst Detection in Data Streaming},
url = {https://par.nsf.gov/biblio/10471848},
DOI = {10.1145/3589979},
abstractNote = {Data streaming has many applications in network monitoring, web services, e-commerce, stock trading, social networks, and distributed sensing. This paper introduces a new problem of real-time burst detection in flow spread, which differs from the traditional problem of burst detection in flow size. It is practically significant with potential applications in cybersecurity, network engineering, and trend identification on the Internet. It is a challenging problem because estimating flow spread requires us to remember all past data items and detecting bursts in real time requires us to minimize spread estimation overhead, which was not the priority in most prior work. This paper provides the first efficient, real-time solution for spread burst detection. It is designed based on a new real-time super spreader identifier, which outperforms the state of the art in terms of both accuracy and processing overhead. The super spreader identifier is in turn based on a new sketch design for real-time spread estimation, which outperforms the best existing sketches.},
journal = {Proceedings of the ACM on Measurement and Analysis of Computing Systems},
volume = {7},
number = {2},
publisher = {ACM},
author = {Wang, Haibo and Melissourgos, Dimitrios and Ma, Chaoyi and Chen, Shigang},
}