skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Automated Generation and Selection of Interpretable Features for Enterprise Security
We present an effective machine learning method for malicious activity detection in enterprise security logs. Our method involves feature engineering, or generating new features by applying operators on features of the raw data. We generate DNF formulas from raw features, extract Boolean functions from them, and leverage Fourier analysis to generate new parity features and rank them based on their highest Fourier coefficients. We demonstrate on real enterprise data sets that the engineered features enhance the performance of a wide range of classifiers and clustering algorithms. As compared to classification of raw data features, the engineered features achieve up to 50.6% improvement in malicious recall, while sacrificing no more than 0.47% in accuracy. We also observe better isolation of malicious clusters, when performing clustering on engineered features. In general, a small number of engineered features achieve higher performance than raw data features according to our metrics of interest. Our feature engineering method also retains interpretability, an important consideration in cyber security applications.  more » « less
Award ID(s):
1717634
PAR ID:
10097455
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
2018 IEEE International Conference on Big Data (Big Data)
Page Range / eLocation ID:
1258 to 1265
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In recent years, there has been a notable increase in the prevalence of malicious websites, leading to a majority of cyber-attacks and data breaches. Malicious websites often incorporate JavaScript code to execute attacks on web browsers. Despite existing methodologies documented in the literature, the analysis and detection of malicious JavaScript pose significant challenges due to the dynamic nature of JavaScript and the use of advanced evasion techniques. These challenges motivate the need for an innovative and efficient approach to comprehensively analyze the code to identify its malicious intent. In this paper, we introduce a monitoring approach for analyzing JavaScript code, which can capture all of the code’s features at runtime. Our method leverages the security reference monitor technique to mediate JavaScript security-sensitive executions, including function calls and property accesses. Therefore, the proposed method can capture behaviors at runtime regardless of how the code is written, even with recent advanced evasion techniques like WebAssembly diversification. We have implemented our approach as a JavaScript dynamic analysis framework called JSMBox in a Chromium-based browser extension. Our experiments demonstrated that JSMBox is capable of effectively countering sophisticated evasion techniques found in modern malicious JavaScript code, including WebAssembly diversification. We have also evaluated the framework’s ability to classify malicious behaviors based on a large-scale raw dataset comprising about 20,000 malicious and benign webpages. Our developed tool automatically launches the browser to execute these webpages, records JavaScript code execution events, and captures their execution frequency as extracted features. We have tested the extracted dataset with various machine-learning models, yielding promising experimental results that confirm the effectiveness of our approach and achieve a high accuracy rate. 
    more » « less
  2. There is a long history of research on the development of models to detect and study student behavior and affect. Developing computer-based models has allowed the study of learning constructs at fine levels of granularity and over long periods of time. For many years, these models were developed using features based on previous educational research from the raw log data. More recently, however, the application of deep learning models has often skipped this feature-engineering step by allowing the algorithm to learn features from the fine-grained raw log data. As many of these deep learning models have led to promising results, researchers have asked which situations may lead to machine-learned features performing better than expert-generated features. This work addresses this question by comparing the use of machine-learned and expert-engineered features for three previously-developed models of student affect, off-task behavior, and gaming the system. In addition, we propose a third feature-engineering method that combines expert features with machine learning to explore the strengths and weaknesses of these approaches to build detectors of student affect and unproductive behaviors. 
    more » « less
  3. There is a long history of research on the development of models to detect and study student behavior and affect. Developing computer-based models has allowed the study of learning constructs at fine levels of granularity and over long periods of time. For many years, these models were developed using features based on previous educational research from the raw log data. More recently, however, the application of deep learning models has often skipped this feature engineering step by allowing the algorithm to learn features from the fine-grained raw log data. As many of these deep learning models have led to promising results, researchers have asked which situations may lead to machine-learned features performing better than expert-generated features. This work addresses this question by comparing the use of machine-learned and expert-engineered features for three previously-developed models of student affect, off-task behavior, and gaming the system. In addition, we propose a third feature-engineering method that combines expert features with machine learning to explore the strengths and weaknesses of these approaches to build detectors of student affect and unproductive behaviors. 
    more » « less
  4. Android applications are extremely popular, as they are widely used for banking, social media, e-commerce, etc. Such applications typically leverage a series of Permissions, which serve as a convenient abstraction for mediating access to security-sensitive functionality within the Android Ecosystem, e.g., sending data over the Internet. However, several malicious applications have recently deployed attacks such as data leaks and spurious credit card charges by abusing the Permissions granted initially to them by unaware users in good faith. To alleviate this pressing concern, we present DyPolDroid, a dynamic and semi-automated security framework that builds upon Android Enterprise, a device-management framework for organizations, to allow for users and administrators to design and enforce so-called Counter-Policies, a convenient user-friendly abstraction to restrict the sets of Permissions granted to potential malicious applications, thus effectively protecting against serious attacks without requiring advanced security and technical expertise. Additionally, as a part of our experimental procedures, we introduce Laverna, a fully operational application that uses permissions to provide benign functionality at the same time it also abuses them for malicious purposes. To fully support the reproducibility of our results, and to encourage future work, the source code of both DyPolDroid and Laverna is publicly available as open-source. 
    more » « less
  5. null (Ed.)
    Visually similar characters, or homoglyphs, can be used to perform social engineering attacks or to evade spam and plagiarism detectors. It is thus important to understand the capabilities of an attacker to identify homoglyphs - particularly ones that have not been previously spotted - and leverage them in attacks. We investigate a deep-learning model using embedding learning, transfer learning, and augmentation to determine the visual similarity of characters and thereby identify potential homoglyphs. Our approach uniquely takes advantage of weak labels that arise from the fact that most characters are not homoglyphs. Our model drastically outperforms the Normal-ized Compression Distance approach on pairwise homoglyph identification, for which we achieve an average precision of 0.97. We also present the first attempt at clustering homoglyphs into sets of equivalence classes, which is more efficient than pairwise information for security practitioners to quickly lookup homoglyphs or to normalize confusable string encodings. To measure clustering performance, we propose a metric (mBIOU) building on the classic Intersection-Over-Union (IOU) metric. Our clustering method achieves 0.592 mBIOU, compared to 0.430 for the naive baseline. We also use our model to predict over 8,000 previously unknown homoglyphs, and find good early indications that many of these may be true positives. Source code and list of predicted homoglyphs are uploaded to Github: https://github.com/PerryXDeng/weaponizing_unicode. 
    more » « less