Phishing websites often imitate benign websites in order to lure unsuspecting users into visiting them. Visits may be driven by links in phishing emails, links on web pages, or web search results. Although the precise motivations behind phishing websites may differ, the common denominator is that unsuspecting users are usually required to take some action, e.g., clicking on a Uniform Resource Locator (URL). To identify phishing websites accurately, the cybersecurity community has relied on a variety of approaches, including blacklisting, heuristic techniques, and content-based approaches. These identification techniques are often enhanced using an array of methods, e.g., honeypots, feature recognition, manual reporting, and web crawlers. Nevertheless, a number of phishing websites still escape detection because they are not blacklisted, are too recent, or were incorrectly evaluated. It is therefore imperative to improve solutions that can mitigate the threat of phishing websites. In this study, the effectiveness of Bidirectional Encoder Representations from Transformers (BERT) is investigated as a possible tool for detecting phishing URLs. The experimental results show that the BERT transformer model achieves acceptable prediction results without requiring advanced URL feature selection techniques or the involvement of a domain specialist.
Diverse datasets and a customizable benchmarking framework for phishing
Phishing is a serious challenge that remains largely unsolved despite the efforts of many researchers. In this paper, we present datasets and tools to help phishing researchers. First, we describe our efforts on creating high quality, diverse and representative email and URL/website datasets for phishing and making them publicly available. Second, we describe PhishBench, a benchmarking framework, which automates the extraction of more than 200 features, implements more than 30 classifiers, and 12 evaluation metrics, for detection of phishing emails, websites and URLs. Using PhishBench, the research community can easily run their models and benchmark their work against the work of others, who have used common dataset sources for emails (Nazario, SpamAssassin, WikiLeaks, etc.) and URLs (PhishTank, APWG, Alexa, etc.).
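As a rough illustration of the two steps the abstract describes (feature extraction and metric computation), the following is a minimal sketch in plain Python. It is not PhishBench's API: the function names `extract_url_features` and `evaluate`, the five features, and the four metrics are assumptions chosen for illustration, and the framework itself is described as covering more than 200 features and 12 metrics.

```python
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Compute a few illustrative lexical URL features.

    Hypothetical feature set for illustration only; not the
    features that PhishBench actually extracts.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        "num_digits": sum(ch.isdigit() for ch in url),
    }


def evaluate(y_true: list, y_pred: list) -> dict:
    """Compute basic binary metrics, with 1 = phishing and 0 = legitimate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Reporting several metrics side by side matters in this setting because phishing datasets are often imbalanced, so accuracy alone can be misleading.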
- Award ID(s):
- 1659755
- PAR ID:
- 10224700
- Date Published:
- Journal Name:
- Proceedings of the Sixth International Workshop on Security and Privacy Analytics
- Page Range / eLocation ID:
- 35-41
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
Phishing emails have certain characteristics, including wording related to urgency and unrealistic promises (i.e., “too good to be true”), that attempt to lure victims. To test whether these characteristics affected users’ suspiciousness of emails, users participated in a phishing judgment task in which we manipulated 1) email type (legitimate, phishing), 2) consequence amount (small, medium, large), 3) consequence type (gain, loss), and 4) urgency (present, absent). We predicted users would be most suspicious of phishing emails that were urgent and offered large gains. Results supporting the hypotheses indicate that users were more suspicious of phishing emails with a gain consequence type or large consequence amount. However, urgency was not a significant predictor of suspiciousness for phishing emails, although it was for legitimate emails. These results have important cybersecurity-related implications for penetration testing and user training.
Existing studies have demonstrated that, using traditional machine learning techniques, phishing detection based simply on the features of URLs can be very effective. In this paper, we explore the deep learning approach and build four RNN (Recurrent Neural Network) models that use only lexical features of URLs for detecting phishing attacks. We collect 1.5 million URLs as the dataset and show that our RNN models can achieve a detection accuracy higher than 99% without the need for any expert knowledge to manually identify the features. However, it is well known that RNNs and other deep learning techniques are still largely black boxes. Understanding the internals of deep learning models is important and highly desirable for the improvement and proper application of the models. Therefore, in this work, we further develop several unique visualization techniques to interpret in depth how RNN models work internally in achieving the outstanding phishing detection performance. In particular, we identify and answer six important research questions, showing that our four RNN models (1) are complementary to each other and can be combined into an ensemble model with even better accuracy, (2) can capture well the relevant features that were manually extracted and used in the traditional machine learning approach for phishing detection, and (3) can help identify useful new features to enhance the accuracy of the traditional machine learning approach. Our techniques and experience in this work could help researchers to effectively apply deep learning techniques in addressing other real-world security or privacy problems.
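To make the idea of feeding raw URL text to a sequence model concrete, here is a minimal sketch of one common preprocessing step: mapping each URL to a fixed-length sequence of integer character IDs. This is an assumption about how such input could be prepared, not the encoding used by the authors; the vocabulary, the `max_len` default, and the name `encode_url` are hypothetical.

```python
import string

# Assumed vocabulary: lowercase letters, digits, and common URL punctuation.
VOCAB = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
PAD_ID = 0  # reserved for padding
UNK_ID = 1  # reserved for characters outside the vocabulary
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(VOCAB)}


def encode_url(url: str, max_len: int = 200) -> list:
    """Map a URL to a fixed-length list of integer character IDs.

    The URL is lowercased, truncated to max_len characters, and
    right-padded with PAD_ID so every sequence has the same length.
    """
    ids = [CHAR_TO_ID.get(ch, UNK_ID) for ch in url.lower()[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))
```

A sequence produced this way could be passed to an embedding layer followed by a recurrent layer, which is what would remove the need for hand-crafted features.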
Phishing scam emails are emails that pretend to be something they are not in order to get the recipient of the email to undertake some action they normally would not. While technical protections against phishing reduce the number of phishing emails received, they are not perfect, and phishing remains one of the largest sources of security risk in technology and communication systems. To better understand the cognitive process that end users can use to identify phishing messages, I interviewed 21 IT experts about instances where they successfully identified emails as phishing in their own inboxes. IT experts naturally follow a three-stage process for identifying phishing emails. In the first stage, the email recipient tries to make sense of the email and understand how it relates to other things in their life. As they do this, they notice discrepancies: little things that are “off” about the email. As the recipient notices more discrepancies, they feel a need for an alternative explanation for the email. At some point, some feature of the email (usually the presence of a link requesting an action) triggers them to recognize that phishing is a possible alternative explanation. At this point, they become suspicious (stage two) and investigate the email by looking for technical details that can conclusively identify the email as phishing. Once they find such information, they move to stage three and deal with the email by deleting it or reporting it. I discuss ways this process can fail, and implications for improving training of end users about phishing.
Phishing is the simplest form of cybercrime, with the objective of baiting people into giving away sensitive information such as personally identifiable data, banking and credit card details, or even credentials and passwords. This simple yet highly effective type of cyber-attack is usually launched through emails, phone calls, or instant messages. The stolen credentials or private data are then used to gain access to critical records of the victims and can result in extensive fraud and monetary loss. Hence, sending malicious messages to victims is a stepping stone in the phishing procedure. A phisher usually sets up a deceptive website, where the victims are conned into entering credentials and sensitive information. It is therefore important to detect these types of malicious websites before they cause harm to victims. Inspired by the evolving nature of phishing websites, this paper introduces a novel approach based on deep reinforcement learning to model and detect malicious URLs. The proposed model is capable of adapting to the dynamic behavior of phishing websites and thus learning the features associated with phishing website detection.

