skip to main content

Title: Detecting Cyber Threats in Non-English Hacker Forums: An Adversarial Cross-Lingual Knowledge Transfer Approach
The regularity of devastating cyber-attacks has made cybersecurity a grand societal challenge. Many cybersecurity professionals are closely examining the international Dark Web to proactively pinpoint potential cyber threats. Despite its potential, the Dark Web contains hundreds of thousands of non-English posts. While machine translation is the prevailing approach to process non-English text, applying MT on hacker forum text results in mistranslations. In this study, we draw upon Long-Short Term Memory (LSTM), Cross-Lingual Knowledge Transfer (CLKT), and Generative Adversarial Networks (GANs) principles to design a novel Adversarial CLKT (A-CLKT) approach. A-CLKT operates on untranslated text to retain the original semantics of the language and leverages the collective knowledge about cyber threats across languages to create a language invariant representation without any manual feature engineering or external resources. Three experiments demonstrate how A-CLKT outperforms state-of-the-art machine learning, deep learning, and CLKT algorithms in identifying cyber-threats in French and Russian forums.
Authors:
; ; ;
Award ID(s):
1936370
Publication Date:
NSF-PAR ID:
10181241
Journal Name:
IEEE Symposium on Security and Privacy (IEEE S&P), Deep Learning and Security Workshop (DLS)
Sponsoring Org:
National Science Foundation
More Like this
  1. International dark web platforms operating within multiple geopolitical regions and languages host a myriad of hacker assets such as malware, hacking tools, hacking tutorials, and malicious source code. Cybersecurity analytics organizations employ machine learning models trained on human-labeled data to automatically detect these assets and bolster their situational awareness. However, the lack of human-labeled training data is prohibitive when analyzing foreign-language dark web content. In this research note, we adopt the computational design science paradigm to develop a novel IT artifact for cross-lingual hacker asset detection(CLHAD). CLHAD automatically leverages the knowledge learned from English content to detect hacker assets in non-English dark web platforms. CLHAD encompasses a novel Adversarial deep representation learning (ADREL) method, which generates multilingual text representations using generative adversarial networks (GANs). Drawing upon the state of the art in cross-lingual knowledge transfer, ADREL is a novel approach to automatically extract transferable text representations and facilitate the analysis of multilingual content. We evaluate CLHAD on Russian, French, and Italian dark web platforms and demonstrate its practical utility in hacker asset profiling, and conduct a proof-of-concept case study. Our analysis suggests that cybersecurity managers may benefit more from focusing on Russian to identify sophisticated hacking assets. In contrast, financialmore »hacker assets are scattered among several dominant dark web languages. Managerial insights for security managers are discussed at operational and strategic levels.« less
  2. Cyber-defense systems are being developed to automatically ingest Cyber Threat Intelligence (CTI) that contains semi-structured data and/or text to populate knowledge graphs. A potential risk is that fake CTI can be generated and spread through Open-Source Intelligence (OSINT) communities or on the Web to effect a data poisoning attack on these systems. Adversaries can use fake CTI examples as training input to subvert cyber defense systems, forcing the model to learn incorrect inputs to serve their malicious needs. In this paper, we automatically generate fake CTI text descriptions using transformers. We show that given an initial prompt sentence, a public language model like GPT-2 with fine-tuning, can generate plausible CTI text with the ability of corrupting cyber-defense systems. We utilize the generated fake CTI text to perform a data poisoning attack on a Cybersecurity Knowledge Graph (CKG) and a cybersecurity corpus. The poisoning attack introduced adverse impacts such as returning incorrect reasoning outputs, representation poisoning, and corruption of other dependent AI-based cyber defense systems. We evaluate with traditional approaches and conduct a human evaluation study with cybersecurity professionals and threat hunters. Based on the study, professional threat hunters were equally likely to consider our fake generated CTI as true.
  3. Black hat hackers use malicious exploits to circumvent security controls and take advantage of system vulnerabilities worldwide, costing the global economy over $450 billion annually. While many organizations are increasingly turning to cyber threat intelligence (CTI) to help prioritize their vulnerabilities, extant CTI processes are often criticized as being reactive to known exploits. One promising data source that can help develop proactive CTI is the vast and ever-evolving Dark Web. In this study, we adopted the computational design science paradigm to design a novel deep learning (DL)-based exploit-vulnerability attention deep structured semantic model (EVA-DSSM) that includes bidirectional processing and attention mechanisms to automatically link exploits from the Dark Web to vulnerabilities. We also devised a novel device vulnerability severity metric (DVSM) that incorporates the exploit post date and vulnerability severity to help cybersecurity professionals with their device prioritization and risk management efforts. We rigorously evaluated the EVA-DSSM against state-of-the-art non-DL and DL-based methods for short text matching on 52,590 exploit-vulnerability linkages across four testbeds: web application, remote, local, and denial of service. Results of these evaluations indicate that the proposed EVA-DSSM achieves precision at 1 scores 20% - 41% higher than non-DL approaches and 4% - 10% higher than DL-basedmore »approaches. We demonstrated the EVA-DSSM’s and DVSM’s practical utility with two CTI case studies: openly accessible systems in the top eight U.S. hospitals and over 20,000 Supervisory Control and Data Acquisition (SCADA) systems worldwide. A complementary user evaluation of the case study results indicated that 45 cybersecurity professionals found the EVA-DSSM and DVSM results more useful for exploit-vulnerability linking and risk prioritization activities than those produced by prevailing approaches. Given the rising cost of cyberattacks, the EVA-DSSM and DVSM have important implications for analysts in security operations centers, incident response teams, and cybersecurity vendors.« less
  4. Black hat hackers use malicious exploits to circumvent security controls and take advantage of system vulnerabilities worldwide, costing the global economy over $450 billion annually. While many organizations are increasingly turning to cyber threat intelligence (CTI) to help prioritize their vulnerabilities, extant CTI processes are often criticized as being reactive to known exploits. One promising data source that can help develop proactive CTI is the vast and ever-evolving Dark Web. In this study, we adopted the computational design science paradigm to design a novel deep learning (DL)-based exploit-vulnerability attention deep structured semantic model (EVA-DSSM) that includes bidirectional processing and attention mechanisms to automatically link exploits from the Dark Web to vulnerabilities. We also devised a novel device vulnerability severity metric (DVSM) that incorporates the exploit post date and vulnerability severity to help cybersecurity professionals with their device prioritization and risk management efforts. We rigorously evaluated the EVA-DSSM against state-of-the-art non-DL and DL-based methods for short text matching on 52,590 exploit-vulnerability linkages across four testbeds: web application, remote, local, and denial of service. Results of these evaluations indicate that the proposed EVA-DSSM achieves precision at 1 scores 20%-41% higher than non-DL approaches and 4%-10% higher than DL-based approaches. We demonstrated themore »EVA-DSSM's and DVSM's practical utility with two CTI case studies: openly accessible systems in the top eight U.S. hospitals and over 20,000 Supervisory Control and Data Acquisition (SCADA) systems worldwide. A complementary user evaluation of the case study results indicated that 45 cybersecurity professionals found the EVA-DSSM and DVSM results more useful for exploit-vulnerability linking and risk prioritization activities than those produced by prevailing approaches. Given the rising cost of cyberattacks, the EVA-DSSM and DVSM have important implications for analysts in security operations centers, incident response teams, and cybersecurity vendors.« less
  5. Hands-on practice is a critical component of cybersecurity education. Most of the existing hands-on exercises or labs materials are usually managed in a problem-centric fashion, while it lacks a coherent way to manage existing labs and provide productive lab exercising plans for cybersecurity learners. With the advantages of big data and natural language processing (NLP) technologies, constructing a large knowledge graph and mining concepts from unstructured text becomes possible, which motivated us to construct a machine learning based lab exercising plan for cybersecurity education. In the research presented by this paper, we have constructed a knowledge graph in the cybersecurity domain using NLP technologies including machine learning based word embedding and hyperlink-based concept mining. We then utilized the knowledge graph during the regular learning process based on the following approaches: 1. We constructed a web-based front-end to visualize the knowledge graph, which allows students to browse and search cybersecurity-related concepts and the corresponding interdependence relations; 2. We created a personalized knowledge graph for each student based on their learning progress and status; 3.We built a personalized lab recommendation system by suggesting more relevant labs based on students’ past learning history to maximize their learning outcomes. To measure the effectiveness ofmore »the proposed solution, we have conducted a use case study and collected survey data from a graduate-level cybersecurity class. Our study shows that, by leveraging the knowledge graph for the cybersecurity area study, students tend to benefit more and show more interests in cybersecurity area.« less