-
We describe version 2.0 of our benchmarking framework, PhishBench. With the added ability to dynamically load features, metrics, and classifiers, the new framework allows researchers to rapidly evaluate new features and methods for machine-learning-based phishing detection. Researchers can compare their contributions, under identical conditions, with numerous built-in features, ranking methods, and classifiers from the literature, using appropriate evaluation metrics. We will demonstrate PhishBench 2.0 and compare it against at least two other automated ML systems.
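To make the idea of dynamically loaded features concrete, here is a minimal sketch of a plug-in registry. It is an illustration only and does not reflect PhishBench's actual API: the names `FEATURES`, `register_feature`, `extract`, and the two example features are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a feature name to a function of the raw URL.
# This is an assumption for illustration, not PhishBench's real interface.
FEATURES: Dict[str, Callable[[str], float]] = {}


def register_feature(name: str):
    """Decorator that adds a feature function to the registry under `name`."""
    def decorator(func: Callable[[str], float]) -> Callable[[str], float]:
        FEATURES[name] = func
        return func
    return decorator


@register_feature("url_length")
def url_length(url: str) -> float:
    # Total number of characters in the URL.
    return float(len(url))


@register_feature("num_dots")
def num_dots(url: str) -> float:
    # Number of '.' characters anywhere in the URL.
    return float(url.count("."))


def extract(url: str) -> Dict[str, float]:
    """Run every registered feature on one URL and return name -> value."""
    return {name: func(url) for name, func in FEATURES.items()}
```

With this design, a researcher could add a new feature by defining one decorated function; the extraction loop itself would not need to change.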
-
Phishing is a serious challenge that remains largely unsolved despite the efforts of many researchers. In this paper, we present datasets and tools to help phishing researchers. First, we describe our efforts to create high-quality, diverse, and representative email and URL/website datasets for phishing and to make them publicly available. Second, we describe PhishBench, a benchmarking framework that automates the extraction of more than 200 features and implements more than 30 classifiers and 12 evaluation metrics for the detection of phishing emails, websites, and URLs. Using PhishBench, the research community can easily run their models and benchmark their work against the work of others who have used common dataset sources for emails (Nazario, SpamAssassin, WikiLeaks, etc.) and URLs (PhishTank, APWG, Alexa, etc.).
-
Our society is facing a growing threat from data breaches in which confidential information is stolen from computer servers. In order to steal data, hackers must first gain entry into the targeted systems. Commercial off-the-shelf intrusion detection systems are unable to defend against the intruders effectively. This research uses cyber behavior analytics to study and report how anomalies compare to normal behavior. In this paper, we present methods based on machine learning algorithms to detect intruders based on the file access patterns within a user file directory. We propose a set of behavioral features of the user's file access patterns in a file system. We validate the effectiveness of the features by conducting experiments on an existing file system dataset with four classification algorithms. To limit false alarms, we trained and tested the classifiers by optimizing the performance within the lower range of the false positive rate. The results from our experiments show that our approach was able to detect intruders with a 0.94 F1 score and a false positive rate of less than 3%.
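For readers unfamiliar with the two reported metrics, the sketch below computes the F1 score and false positive rate from binary labels, and picks a decision threshold that maximizes F1 subject to a cap on the false positive rate. This threshold search is a simple stand-in for "optimizing within the lower range of the false positive rate"; it is not the procedure used in the paper, and the function names and the default `max_fpr` are assumptions.

```python
def f1_and_fpr(y_true, y_pred):
    """Return (F1 score, false positive rate) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return f1, fpr


def best_threshold(y_true, scores, max_fpr=0.03):
    """Pick the score threshold with the highest F1 whose FPR <= max_fpr.

    Returns (threshold, f1, fpr), or None if no threshold satisfies the cap.
    A sample is predicted positive when its score is >= threshold.
    """
    best = None
    for thr in sorted(set(scores)):
        y_pred = [1 if s >= thr else 0 for s in scores]
        f1, fpr = f1_and_fpr(y_true, y_pred)
        if fpr <= max_fpr and (best is None or f1 > best[1]):
            best = (thr, f1, fpr)
    return best
```

Capping the false positive rate before maximizing F1 reflects the operational concern stated in the abstract: an intrusion detector that raises too many false alarms is unlikely to be usable, even if its overall accuracy is high.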
-
Techniques from data science are increasingly being applied by researchers to security challenges. However, challenges unique to the security domain necessitate painstaking care for the models to be valid and robust. In this paper, we explain key dimensions of data quality relevant for security, illustrate them with several popular datasets for phishing, intrusion detection, and malware, indicate operational methods for assuring data quality, and seek to inspire the audience to generate high-quality datasets for security challenges.
-
Workplace environments are characterized by frequent interruptions that can lead to stress. However, measures of stress due to interruptions are typically obtained through self-reports, which can be affected by memory and emotional biases. In this paper, we use a thermal imaging system to obtain objective measures of stress and investigate personality differences in contexts of high and low interruptions. Since a major source of workplace interruptions is email, we studied 63 participants while multitasking in a controlled office environment with two different email contexts: managing email in batch mode or with frequent interruptions. We discovered that people who score high in Neuroticism are significantly more stressed in batching environments than those low in Neuroticism. People who are more stressed finish emails faster. Last, using Linguistic Inquiry and Word Count (LIWC) on the email text, we find that more stressed people in multitasking environments use more anger in their emails. These findings help to disambiguate prior conflicting results on email batching and stress.
-
Even with many successful phishing email detectors, phishing emails still cost businesses and individuals millions of dollars per year. Most of these models seem to ignore features like word count, stopword count, and punctuation; they use features like n-grams and part-of-speech tagging. Previous phishing email research ignores or removes the stopwords, and features relating to punctuation count only as a minor part of the detector. Even with a strong, unconventional focus on features like word counts, stopwords, punctuation, and uniqueness factors, an ensemble learning model based on a linear kernel SVM gave a true positive rate of 83% and a true negative rate of 96%. Moreover, these features are robustly detected even in noisy email data. It is much easier to detect our features than to correctly identify part-of-speech tags or named entities in emails.
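As a rough illustration of the kind of surface features described above, the sketch below computes a word count, stopword count, punctuation count, and a uniqueness ratio for one email body. It is not the authors' feature set: the tiny `STOPWORDS` list, the whitespace tokenization, the function name `surface_features`, and the definition of the uniqueness ratio are all assumptions.

```python
import string

# Small illustrative stopword list (an assumption); a real system would
# likely use a much larger list from an NLP toolkit.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is",
             "your", "you", "for", "on", "this"}


def surface_features(text: str) -> dict:
    """Compute simple count-based features from raw email text."""
    # Whitespace tokenization, then strip surrounding ASCII punctuation.
    words = [tok.strip(string.punctuation).lower() for tok in text.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    return {
        "word_count": len(words),
        "stopword_count": sum(1 for w in words if w in STOPWORDS),
        "punctuation_count": sum(1 for ch in text if ch in string.punctuation),
        # Fraction of distinct words; 0.0 for empty text.
        "unique_word_ratio": len(set(words)) / len(words) if words else 0.0,
    }
```

These counts need only string operations, which is consistent with the abstract's point that such features are easier to extract from noisy email text than part-of-speech tags or named entities.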
