skip to main content


Title: Detecting Phishing Websites through Deep Reinforcement Learning
Phishing is the simplest form of cybercrime with the objective of baiting people into giving away delicate information such as individually recognizable data, banking and credit card details, or even credentials and passwords. This type of simple yet most effective cyber-attack is usually launched through emails,phone calls, or instant messages. The credential or private data stolen are then used to get access to critical records of the victims and can result in extensive fraud and monetary loss.Hence, sending malicious messages to victims is a stepping stone of the phishing procedure. A phisher usually setups a deceptive website, where the victims are conned into entering credentials and sensitive information. It is therefore important to detect these types of malicious websites before causing any harmful damages to victims. Inspired by the evolving nature of the phishing websites, this paper introduces a novel approach based on deep reinforcement learning to model and detect malicious URLs. The proposed model is capable of adapting to the dynamic behavior of the phishing websites and thus learn the features associated with phishing website detection.  more » « less
Award ID(s):
1821560 1723765
NSF-PAR ID:
10186763
Author(s) / Creator(s):
;
Date Published:
Journal Name:
2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC)
Page Range / eLocation ID:
227 to 232
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Phishing is a ubiquitous and increasingly sophisticated online threat. To evade mitigations, phishers try to ""cloak"" malicious content from defenders to delay their appearance on blacklists, while still presenting the phishing payload to victims. This cat-and-mouse game is variable and fast-moving, with many distinct cloaking methods---we construct a dataset identifying 2,933 real-world phishing kits that implement cloaking mechanisms. These kits use information from the host, browser, and HTTP request to classify traffic as either anti-phishing entity or potential victim and change their behavior accordingly. In this work we present SPARTACUS, a technique that subverts the phishing status quo by disguising user traffic as anti-phishing entities. These intentional false positives trigger cloaking behavior in phishing kits, thus hiding the malicious payload and protecting the user without disrupting benign sites. To evaluate the effectiveness of this approach, we deployed SPARTACUS as a browser extension from November 2020 to July 2021. During that time, SPARTACUS browsers visited 160,728 reported phishing URLs in the wild. Of these, SPARTACUS protected against 132,274 sites (82.3%). The phishing kits which showed malicious content to SPARTACUS typically did so due to ineffective cloaking---the majority (98.4%) of the remainder were detected by conventional anti-phishing systems such as Google Safe Browsing or VirusTotal, and would be blacklisted regardless. We further evaluate SPARTACUS against benign websites sampled from the Alexa Top One Million List for impacts on latency, accessibility, layout, and CPU overhead, finding minimal performance penalties and no loss in functionality. 
    more » « less
  2. Millions of people use adblockers to remove intrusive and malicious ads as well as protect themselves against tracking and pervasive surveillance. Online publishers consider adblockers a major threat to the ad-powered “free” Web. They have started to retaliate against adblockers by employing antiadblockers which can detect and stop adblock users. To counter this retaliation, adblockers in turn try to detect and filter anti-adblocking scripts. This back and forth has prompted an escalating arms race between adblockers and anti-adblockers. We want to develop a comprehensive understanding of antiadblockers, with the ultimate aim of enabling adblockers to bypass state-of-the-art anti-adblockers. In this paper, we present a differential execution analysis to automatically detect and analyze anti-adblockers. At a high level, we collect execution traces by visiting a website with and without adblockers. Through differential execution analysis, we are able to pinpoint the conditions that lead to the differences caused by anti-adblocking code. Using our system, we detect anti-adblockers on 30.5% of the Alexa top10K websites which is 5-52 times more than reported in prior literature. Unlike prior work which is limited to detecting visible reactions (e.g., warning messages) by anti-adblockers, our system can discover attempts to detect adblockers even when there is no visible reaction. From manually checking one third of the detected websites, we find that the websites that have no visible reactions constitute over 90% of the cases, completely dominating the ones that have visible warning messages. Finally, based on our findings, we further develop JavaScript rewriting and API hooking based solutions (the latter implemented as a Chrome extension) to help adblockers bypass state-of-the-art anti-adblockers. 
    more » « less
  3. Phishing websites remain a persistent security threat. Thus far, machine learning approaches appear to have the best potential as defenses. But, there are two main concerns with existing machine learning approaches for phishing detection. The first is the large number of training features used and the lack of validating arguments for these feature choices. The second concern is the type of datasets used in the literature that are inadvertently biased with respect to the features based on the website URL or content. To address these concerns, we put forward the intuition that the domain name of phishing websites is the tell-tale sign of phishing and holds the key to successful phishing detection. Accordingly, we design features that model the relationships, visual as well as statistical, of the domain name to the key elements of a phishing website, which are used to snare the end-users. The main value of our feature design is that, to bypass detection, an attacker will find it very difficult to tamper with the visual content of the phishing website without arousing the suspicion of the end user. Our feature set ensures that there is minimal or no bias with respect to a dataset. Our learning model trains with only seven features and achieves a true positive rate of 98% and a classification accuracy of 97%, on sample dataset. Compared to the state-of-the-art work, our per data instance classification is 4 times faster for legitimate websites and 10 times faster for phishing websites. Importantly, we demonstrate the shortcomings of using features based on URLs as they are likely to be biased towards specific datasets. We show the robustness of our learning algorithm by testing on unknown live phishing URLs and achieve a high detection accuracy of 99.7%. 
    more » « less
  4. In 2019, the US Department of Homeland Security issued an emergency warning about DNS infrastructure tampering. This alert, in response to a series of attacks against foreign government websites, highlighted how a sophisticated attacker could leverage access to key DNS infrastructure to then hijack traffic and harvest valid login credentials for target organizations. However, even armed with this knowledge, identifying the existence of such incidents has been almost entirely via post hoc forensic reports (i.e., after a breach was found via some other method). Indeed, such attacks are particularly challenging to detect because they can be very short lived, bypass the protections of TLS and DNSSEC, and are imperceptible to users. Identifying them retroactively is even more complicated by the lack of fine-grained Internet-scale forensic data. This paper is a first attempt to make progress at this latter goal. Combining a range of longitudinal data from Internet-wide scans, passive DNS records, and Certificate Transparency logs, we have constructed a methodology for identifying potential victims of sophisticated DNS infrastructure hijacking and have used it to identify a range of victims (primarily government agencies), both those named in prior reporting, and others previously unknown. 
    more » « less
  5. null (Ed.)
    Phishing websites trick honest users into believing that they interact with a legitimate website and capture sensitive information, such as user names, passwords, credit card numbers, and other personal information. Machine learning is a promising technique to distinguish between phishing and legitimate websites. However, machine learning approaches are susceptible to adversarial learning attacks where a phishing sample can bypass classifiers. Our experiments on publicly available datasets reveal that the phishing detection mechanisms are vulnerable to adversarial learning attacks. We investigate the robustness of machine learning-based phishing detection in the face of adversarial learning attacks. We propose a practical approach to simulate such attacks by generating adversarial samples through direct feature manipulation. To enhance the sample’s success probability, we describe a clustering approach that guides an attacker to select the best possible phishing samples that can bypass the classifier by appearing as legitimate samples. We define the notion of vulnerability level for each dataset that measures the number of features that can be manipulated and the cost for such manipulation. Further, we clustered phishing samples and showed that some clusters of samples are more likely to exhibit higher vulnerability levels than others. This helps an adversary identify the best candidates of phishing samples to generate adversarial samples at a lower cost. Our finding can be used to refine the dataset and develop better learning models to compensate for the weak samples in the training dataset. 
    more » « less