The Internet of Things (IoT) is revolutionizing society by connect- ing people and devices seamlessly and providing enhanced user experience and functionalities. However, the unique properties of IoT networks, such as heterogeneity and non-standardized protocol, have created critical security holes and network mismanagement. We propose a new measurement tool for IoT network data to aid in analyzing and classifying such network traffic. We use evidence from both security and machine learning research, which suggests that the complexity of a dataset can be used as a metric to determine the trustworthiness of data. We test the complexity of IoT networks using Intrinsic Dimensionality (ID), a theoretical complexity mea- surement based on the observation that a few variables can often describe high dimensional datasets. We use ID to evaluate four mod- ern IoT network datasets empirically, showing that, for network and device-level data generated using IoT methodologies, the ID of the data fits into a low dimensional representation; this makes such data amenable to the use of machine learning algorithms for anomaly detection.
more »
« less
The intrinsic dimensionality of network datasets and its applications1
Modern network infrastructures are in a constant state of transformation, in large part due to the exponential growth of Internet of Things (IoT) devices. The unique properties of IoT-connected networks, such as heterogeneity and non-standardized protocol, have created critical security holes and network mismanagement. In this paper we propose a new measurement tool, Intrinsic Dimensionality (ID), to aid in analyzing and classifying network traffic. A proxy for dataset complexity, ID can be used to understand the network as a whole, aiding in tasks such as network management and provisioning. We use ID to evaluate several modern network datasets empirically. Showing that, for network and device-level data, generated using IoT methodologies, the ID of the data fits into a low dimensional representation. Additionally we explore network data complexity at the sample level using Local Intrinsic Dimensionality (LID) and propose a novel unsupervised intrusion detection technique, the Weighted Hamming LID Estimator. We show that the algortihm performs better on IoT network datasets than the Autoencoder, KNN, and Isolation Forests. Finally, we propose the use of synthetic data as an additional tool for both network data measurement as well as intrusion detection. Synthetically generated data can aid in building a more robust network dataset, while also helping in downstream tasks such as machine learning based intrusion detection models. We explore the effects of synthetic data on ID measurements, as well as its role in intrusion detection systems.
more »
« less
- PAR ID:
- 10499066
- Editor(s):
- Sural, Shamik; Lu, Haibing
- Publisher / Repository:
- IOS Press
- Date Published:
- Journal Name:
- Journal of Computer Security
- Volume:
- 31
- Issue:
- 6
- ISSN:
- 0926-227X
- Page Range / eLocation ID:
- 679 to 704
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Security monitoring is crucial for maintaining a strong IT infrastructure by protecting against emerging threats, identifying vulnerabilities, and detecting potential points of failure. It involves deploying advanced tools to continuously monitor networks, systems, and configurations. However, organizations face challenges in adapting modern techniques like Machine Learning (ML) due to privacy and security risks associated with sharing internal data. Compliance with regulations like GDPR further complicates data sharing. To promote external knowledge sharing, a secure and privacy-preserving method for organizations to share data is necessary. Privacy-preserving data generation involves creating new data that maintains privacy while preserving key characteristics and properties of the original data so that it is still useful in creating downstream models of attacks. Generative models, such as Generative Adversarial Networks (GAN), have been proposed as a solution for privacy preserving synthetic data generation. However, standard GANs are limited in their capabilities to generate realistic system data. System data have inherent constraints, e.g., the list of legitimate I.P. addresses and port numbers are limited, and protocols dictate a valid sequence of network events. Standard generative models do not account for such constraints and do not utilize domain knowledge in their generation process. Additionally, they are limited by the attribute values present in the training data. This poses a major privacy risk, as sensitive discrete attribute values are repeated by GANs. To address these limitations, we propose a novel model for Knowledge Infused Privacy Preserving Data Generation. A privacy preserving Generative Adversarial Network (GAN) is trained on system data for generating synthetic datasets that can replace original data for downstream tasks while protecting sensitive data. Knowledge from domain-specific knowledge graphs is used to guide the data generation process, check for the validity of generated values, and enrich the dataset by diversifying the values of attributes. We specifically demonstrate this model by synthesizing network data captured by the network capture tool, Wireshark. We establish that the synthetic dataset holds up to the constraints of the network-specific datasets and can replace the original dataset in downstream tasks.more » « less
-
Attack detection problems in industrial control systems (ICSs) are commonly known as a network traffic monitoring scheme for detecting abnormal activities. However, a network-based intrusion detection system can be deceived by attackers that imitate the system’s normal activity. In this work, we proposed a novel solution to this problem based on measurement data in the supervisory control and data acquisition (SCADA) system. The proposed approach is called measurement intrusion detection system (MIDS), which enables the system to detect any abnormal activity in the system even if the attacker tries to conceal it in the system’s control layer. A supervised machine learning model is generated to classify normal and abnormal activities in an ICS to evaluate the MIDS performance. A hardware-in-the-loop (HIL) testbed is developed to simulate the power generation units and exploit the attack dataset. In the proposed approach, we applied several machine learning models on the dataset, which show remarkable performances in detecting the dataset’s anomalies, especially stealthy attacks. The results show that the random forest is performing better than other classifier algorithms in detecting anomalies based on measured data in the testbed.more » « less
-
With the proliferation of IoT devices, securing the IoT-based network is of paramount importance. IoT-based networks consist of diversely purposed IoT devices. This diversity of IoT devices necessitates diverse dataset analysis to ensure effective implementation of Machine Learning (ML)-based cybersecurity. However, much-demanded real-world IoT data is still in short supply for active ML-based IoT data analysis. This paper presents an in-depth analysis of the real-time IoT-23 dataset [9]. Exhaustive analysis of all 20 scenarios of the IoT-23 dataset reveals the consistency between feature selection methods and detection algorithms. The proposed ML-based intrusion detection system (ML-IDS) achieves significant improvement in detection accuracy, which in some scenarios reaches 100%. Our analysis also demonstrates that the required number of features for a high detection rate of greater than 99% remains small, i.e., 2 or 3, enabling ML-IDS implementation even with resource-constrained IoT processors.more » « less
-
null (Ed.)Network intrusion detection systems (IDS) has efficiently identified the profiles of normal network activities, extracted intrusion patterns, and constructed generalized models to evaluate (un)known attacks using a wide range of machine learning approaches. In spite of the effectiveness of machine learning-based IDS, it has been still challenging to reduce high false alarms due to data misclassification. In this paper, by using multiple decision mechanisms, we propose a new classification method to identify misclassified data and then to classify them into three different classes, called a malicious, benign, and ambiguous dataset. In other words, the ambiguous dataset contains a majority of the misclassified dataset and is thus the most informative for improving the model and anomaly detection because of the lack of confidence for the data classification in the model. We evaluate our approach with the recent real-world network traffic data, Kyoto2006+ datasets, and show that the ambiguous dataset contains 77.2% of the previously misclassified data. Re-evaluating the ambiguous dataset effectively reduces the false prediction rate with minimal overhead and improves accuracy by 15%.more » « less
An official website of the United States government

