This research presents a hybrid intrusion detection approach that integrates Generative Adversarial Networks (GANs) for synthetic data generation with Random Forest (RF) as the primary classifier. The study aims to improve detection performance in cybersecurity applications by enhancing dataset diversity and addressing challenges in traditional models, particularly in detecting minority attack classes often underrepresented in real-world datasets. The proposed method employs GANs to generate synthetic attack samples that mimic real-world intrusions, which are then combined with real data from the UNSW-NB15 dataset to create a more balanced training set. By leveraging synthetic data augmentation, our approach mitigates issues related to class imbalance and enhances the generalization capability of the classifier. Extensive experiments demonstrate that RF trained on the combined dataset of real and synthetic data achieves superior detection performance compared to models trained exclusively on real data. Specifically, RF trained solely on the original dataset achieves an accuracy of 97.58%, whereas integrating GAN-generated synthetic data improves accuracy to 98.27%. The proposed methodology is further evaluated through comparative analysis against alternative classifiers, including Support Vector Machine (SVM), XGBoost, Gated Recurrent Unit (GRU), and related studies in the field. Our findings indicate that GAN-augmented training significantly enhances detection rates, particularly for rare attack types, while maintaining computational efficiency. Furthermore, RF outperforms other classifiers, including deep learning models, demonstrating its effectiveness as a lightweight yet robust classification method. Integrating GANs with RF offers a scalable and adaptable framework for intrusion detection, ensuring improved resilience against evolving cyber threats.
Boundless: Generating photorealistic synthetic data for object detection in urban streetscapes
We introduce Boundless, a photo-realistic synthetic data generation system for enabling highly accurate object detection in dense urban streetscapes. Boundless can replace massive real-world data collection and manual ground-truth object annotation (labeling) with an automated and configurable process. Boundless is based on the Unreal Engine 5 (UE5) City Sample project, with improvements that enable accurate collection of 3D bounding boxes across different lighting and scene variability conditions. We evaluate object detection models trained on the dataset generated by Boundless when they are used for inference on a real-world dataset acquired from medium-altitude cameras. We compare the Boundless-trained model against the CARLA-trained model and observe an improvement of 7.8 mAP. These results support the premise that synthetic data generation is a credible methodology for training and fine-tuning scalable object detection models for urban scenes.
- Award ID(s): 2148128
- PAR ID: 10582222
- Publisher / Repository: arXiv:2409.03022v2 [cs.CV]
- Date Published:
- Journal Name: arxiv
- ISSN: arXiv:2409.03022v2 [cs.CV]
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- The accurate detection of chemical agents promotes many national security and public safety goals, and robust chemical detection methods can prevent disasters and support effective response to incidents. Mass spectrometry is an important tool in detecting and identifying chemical agents. However, there are high costs and logistical challenges associated with acquiring sufficient lab-generated mass spectrometry data for training machine learning algorithms, including the skilled personnel, sample preparation, and analysis required for data generation. These high costs of mass spectrometry data collection hinder the development of machine learning and deep learning models to detect and identify chemical agents. Accordingly, the primary objective of our research is to create a mass spectrometry data generation model whose output (synthetic mass spectrometry data) would enhance the performance of downstream machine learning chemical classification models. Such a synthetic data generation model would reduce the need to generate costly real-world data and provide additional training data to use in combination with lab-generated mass spectrometry data when training classifiers. Our approach is a novel combination of autoencoder-based synthetic data generation with a fixed, a priori defined hidden-layer geometry. In particular, we train pairs of encoders and decoders with an additional loss term that forces the hidden layer passed from the encoder to the decoder to match the embedding provided by an external deep learning model designed to predict functional properties of chemicals. We have verified that incorporating our synthetic spectra into a lab-generated dataset enhances the performance of classification algorithms compared to using only the real data. Our synthetic spectra have been successfully matched to lab-generated spectra for their respective chemicals using library matching software, further demonstrating the validity of our work.
- Kohei, Arai (Ed.) This research explores practical applications of Transfer Learning and Spatial Attention mechanisms using pre-trained models from an open-source simulator, CARLA (Car Learning to Act). The study focuses on vehicle tracking using aerial images, utilizing transformers and graph algorithms for keypoint detection. The proposed detector training process optimizes model parameters without heavy reliance on manually set hyperparameters. The loss function considers both class distribution and position localization of ground truth data. The study utilizes a three-stage methodology: pre-trained model selection, fine-tuning with a custom synthetic dataset, and evaluation using real-world aerial datasets. The results demonstrate the effectiveness of our synthetic transformer-based transfer learning technique in enhancing object detection accuracy and localization. When tested with real-world images, our approach achieved an 88% detection rate, compared to only 30% when using YOLOv8. The findings underscore the advantages of incorporating graph-based loss functions in transfer learning and position-encoding techniques, demonstrating their effectiveness in realistic machine learning applications with unbalanced classes.
- Training fall detection systems is challenging due to the scarcity of real-world fall data, particularly from elderly individuals. To address this, we explore the potential of Large Language Models (LLMs) for generating synthetic fall data. This study evaluates text-to-motion (T2M, SATO, and ParCo) and text-to-text models (GPT4o, GPT4, and Gemini) in simulating realistic fall scenarios. We generate synthetic datasets and integrate them with four real-world baseline datasets to assess their impact on fall detection performance using a Long Short-Term Memory (LSTM) model. Additionally, we compare LLM-generated synthetic data with a diffusion-based method to evaluate their alignment with real accelerometer distributions. Results indicate that dataset characteristics significantly influence the effectiveness of synthetic data, with LLM-generated data performing best in low-frequency settings (e.g., 20 Hz) while showing instability in high-frequency datasets (e.g., 200 Hz). While text-to-motion models produce more realistic biomechanical data than text-to-text models, their impact on fall detection varies. Diffusion-based synthetic data demonstrates the closest alignment to real data but does not consistently enhance model performance. An ablation study further confirms that the effectiveness of synthetic data depends on sensor placement and fall representation. These findings provide insights into optimizing synthetic data generation for fall detection models.

