skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Toward Classifying Unknown Application Traffic
Determining the particular application associated with a given flow of internet traffic is an important security measure in computer networks. This practice is significant as it can aid in detecting intrusions and other anomalies, as well as identifying misuse associated with prohibited applications. Many efforts have been expended to create models for classifying internet traffic using machine learning techniques. While research so far has proven useful, studies have focused on machine learning techniques for detecting well-known and profiled applications. Some have focused only on particular transport layer traffic (e.g., TCP traffic only). In contrast, unknown traffic is much more difficult to classify and can appear as previously unseen applications or established applications exhibiting abnormal behavior. This work presents methods to address these gaps in other research. The methods utilize k-Nearest Neighbor machine learning approaches to model known application data with the Kolmogorov-Smirnov statistic as the distance function to computer nearest neighbors. The models identify incoming data which likely does not belong to the model, thus identifying unknown applica- tions. This study shows the potential of our approach by presenting results which show successful implementation for a controlled environment, such as an organization with a fixed number of approved applications. In this setting, our approach can distinguish unknown data from known data with accuracy up to 93 percent compared to an accuracy of 57 percent for a strawman k-Nearest Neighbors approach with Euclidean distance. In addition, there are no restrictions on particular protocols. Operational considerations are also discussed, with emphasis on future work that can be performed such as exploring processing of incoming data in real-time and updating the model in an automated way.  more » « less
Award ID(s):
1642158
PAR ID:
10095205
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
DYnamic and Novel Advances in Machine Learning and Intelligent Cyber Security (DYNAMICS) Workshop
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Diffusion State Distance (DSD) is a data-dependent metric that compares data points using a data-driven diffusion process and provides a powerful tool for learning the underlying structure of high-dimensional data. While finding the exact nearest neighbors in the DSD metric is computationally expensive, in this paper, we propose a new random-walk based algorithm that empirically finds approximate k-nearest neighbors accurately in an efficient manner. Numerical results for real-world protein-protein interaction networks are presented to illustrate the efficiency and robustness of the proposed algorithm. The set of approximate k-nearest neighbors performs well when used to predict proteins’ functional labels. 
    more » « less
  2. Abstract Invoking the manifold assumption in machine learning requires knowledge of the manifold's geometry and dimension, and theory dictates how many samples are required. However, in most applications, the data are limited, sampling may not be uniform, and the manifold's properties are unknown; this implies that neighborhoods must adapt to the local structure. We introduce an algorithm for inferring adaptive neighborhoods for data given by a similarity kernel. Starting with a locally conservative neighborhood (Gabriel) graph, we sparsify it iteratively according to a weighted counterpart. In each step, a linear program yields minimal neighborhoods globally, and a volumetric statistic reveals neighbor outliers likely to violate manifold geometry. We apply our adaptive neighborhoods to nonlinear dimensionality reduction, geodesic computation, and dimension estimation. A comparison against standard algorithms using, for example, k-nearest neighbors, demonstrates the usefulness of our approach. 
    more » « less
  3. The increasing ubiquity of network traffic and the new online applications’ deployment has increased traffic analysis complexity. Traditionally, network administrators rely on recognizing well-known static ports for classifying the traffic flowing their networks. However, modern network traffic uses dynamic ports and is transported over secure application-layer protocols (e.g., HTTPS, SSL, and SSH). This makes it a challenging task for network administrators to identify online applications using traditional port-based approaches. One way for classifying the modern network traffic is to use machine learning (ML) to distinguish between the different traffic attributes such as packet count and size, packet inter-arrival time, packet send–receive ratio, etc. This paper presents the design and implementation of NetScrapper, a flow-based network traffic classifier for online applications. NetScrapper uses three ML models, namely K-Nearest Neighbors (KNN), Random Forest (RF), and Artificial Neural Network (ANN), for classifying the most popular 53 online applications, including Amazon, Youtube, Google, Twitter, and many others. We collected a network traffic dataset containing 3,577,296 packet flows with different 87 features for training, validating, and testing the ML models. A web-based user-friendly interface is developed to enable users to either upload a snapshot of their network traffic to NetScrapper or sniff the network traffic directly from the network interface card in real time. Additionally, we created a middleware pipeline for interfacing the three models with the Flask GUI. Finally, we evaluated NetScrapper using various performance metrics such as classification accuracy and prediction time. Most notably, we found that our ANN model achieves an overall classification accuracy of 99.86% in recognizing the online applications in our dataset. 
    more » « less
  4. IntroductionAI fairness seeks to improve the transparency and explainability of AI systems by ensuring that their outcomes genuinely reflect the best interests of users. Data augmentation, which involves generating synthetic data from existing datasets, has gained significant attention as a solution to data scarcity. In particular, diffusion models have become a powerful technique for generating synthetic data, especially in fields like computer vision. MethodsThis paper explores the potential of diffusion models to generate synthetic tabular data to improve AI fairness. The Tabular Denoising Diffusion Probabilistic Model (Tab-DDPM), a diffusion model adaptable to any tabular dataset and capable of handling various feature types, was utilized with different amounts of generated data for data augmentation. Additionally, reweighting samples from AIF360 was employed to further enhance AI fairness. Five traditional machine learning models—Decision Tree (DT), Gaussian Naive Bayes (GNB), K-Nearest Neighbors (KNN), Logistic Regression (LR), and Random Forest (RF)—were used to validate the proposed approach. Results and discussionExperimental results demonstrate that the synthetic data generated by Tab-DDPM improves fairness in binary classification. 
    more » « less
  5. Shortest-path computation on graphs is one of the most well-studied problems in algorithmic theory. An aspect that has only recently attracted attention is the use of databases in combination with graph algorithms, so-called distance oracles, to compute shortest-path queries on large graphs. To this purpose, we propose a novel, efficient, pure-SQL framework for answering exact distance queries on large-scale graphs, implemented entirely on an open-source database engine. Our COLD framework (COmpressed Labels on the Database) may answer multiple distance queries (vertex-to-vertex, one-to-many, k-Nearest Neighbors, Reverse k-Nearest Neighbors, Reverse k-Farthest Neighbors and Top-k Range) not handled by previous methods, rendering it a complete database solution for a variety of practical large-scale graph applications. Our experimentation shows that COLD outperforms existing approaches (including popular graph databases) in terms of query time and efficiency, while requiring significantly less storage space than these methods. 
    more » « less