Title: Detecting Cyber Threats in Non-English Hacker Forums: An Adversarial Cross-Lingual Knowledge Transfer Approach
The regularity of devastating cyber-attacks has made cybersecurity a grand societal challenge. Many cybersecurity professionals are closely examining the international Dark Web to proactively pinpoint potential cyber threats. Despite its potential, the Dark Web contains hundreds of thousands of non-English posts. While machine translation (MT) is the prevailing approach to processing non-English text, applying MT to hacker forum text results in mistranslations. In this study, we draw upon Long Short-Term Memory (LSTM), Cross-Lingual Knowledge Transfer (CLKT), and Generative Adversarial Network (GAN) principles to design a novel Adversarial CLKT (A-CLKT) approach. A-CLKT operates on untranslated text to retain the original semantics of the language and leverages the collective knowledge about cyber threats across languages to create a language-invariant representation without any manual feature engineering or external resources. Three experiments demonstrate how A-CLKT outperforms state-of-the-art machine learning, deep learning, and CLKT algorithms in identifying cyber threats in French and Russian forums.
Award ID(s):
1936370
PAR ID:
10181241
Journal Name:
IEEE Symposium on Security and Privacy (IEEE S&P), Deep Learning and Security Workshop (DLS)
Sponsoring Org:
National Science Foundation
More Like this
  1. International dark web platforms operating within multiple geopolitical regions and languages host a myriad of hacker assets such as malware, hacking tools, hacking tutorials, and malicious source code. Cybersecurity analytics organizations employ machine learning models trained on human-labeled data to automatically detect these assets and bolster their situational awareness. However, the lack of human-labeled training data is prohibitive when analyzing foreign-language dark web content. In this research note, we adopt the computational design science paradigm to develop a novel IT artifact for cross-lingual hacker asset detection (CLHAD). CLHAD automatically leverages the knowledge learned from English content to detect hacker assets in non-English dark web platforms. CLHAD encompasses a novel adversarial deep representation learning (ADREL) method, which generates multilingual text representations using generative adversarial networks (GANs). Drawing upon the state of the art in cross-lingual knowledge transfer, ADREL is a novel approach to automatically extract transferable text representations and facilitate the analysis of multilingual content. We evaluate CLHAD on Russian, French, and Italian dark web platforms, demonstrate its practical utility in hacker asset profiling, and conduct a proof-of-concept case study. Our analysis suggests that cybersecurity managers may benefit more from focusing on Russian to identify sophisticated hacking assets. In contrast, financial hacker assets are scattered among several dominant dark web languages. Managerial insights for security managers are discussed at operational and strategic levels.
  2. Cyber threat intelligence (CTI) is actionable information or insight that an organization uses to understand the vulnerabilities it has and the threats it faces. One important type of CTI for proactive cyber defense is the exploit type, with possible values system, web, network, website, or mobile. This study compares the performance of machine learning algorithms in predicting exploit types using forum posts, a semi-structured dataset collected from the dark web. The study uses the CRISP data science approach. The results show that function-based machine learning algorithms, including support vector machines and deep learning with artificial neural networks, are more accurate than tree-based algorithms, including Random Forest and Decision Tree, for CTI discovery from semi-structured datasets. Future research will include the use of high-performance computing and advanced deep-learning algorithms.
  3. Cyber-defense systems are being developed to automatically ingest Cyber Threat Intelligence (CTI) that contains semi-structured data and/or text to populate knowledge graphs. A potential risk is that fake CTI can be generated and spread through Open-Source Intelligence (OSINT) communities or on the Web to effect a data poisoning attack on these systems. Adversaries can use fake CTI examples as training input to subvert cyber-defense systems, forcing the model to learn incorrect inputs to serve their malicious needs. In this paper, we automatically generate fake CTI text descriptions using transformers. We show that, given an initial prompt sentence, a public language model like GPT-2 with fine-tuning can generate plausible CTI text capable of corrupting cyber-defense systems. We utilize the generated fake CTI text to perform a data poisoning attack on a Cybersecurity Knowledge Graph (CKG) and a cybersecurity corpus. The poisoning attack introduced adverse impacts such as returning incorrect reasoning outputs, representation poisoning, and corruption of other dependent AI-based cyber-defense systems. We evaluate with traditional approaches and conduct a human evaluation study with cybersecurity professionals and threat hunters. Based on the study, professional threat hunters were equally likely to consider our fake generated CTI as true.
  4. Hands-on practice is a critical component of cybersecurity education. Most existing hands-on exercises and lab materials are managed in a problem-centric fashion, with no coherent way to organize existing labs or to provide productive lab exercise plans for cybersecurity learners. Advances in big data and natural language processing (NLP) technologies make it possible to construct a large knowledge graph and mine concepts from unstructured text, which motivated us to construct a machine learning based lab exercise plan for cybersecurity education. In the research presented in this paper, we have constructed a knowledge graph in the cybersecurity domain using NLP technologies including machine learning based word embedding and hyperlink-based concept mining. We then utilized the knowledge graph during the regular learning process based on the following approaches: 1. We constructed a web-based front-end to visualize the knowledge graph, which allows students to browse and search cybersecurity-related concepts and the corresponding interdependence relations; 2. We created a personalized knowledge graph for each student based on their learning progress and status; 3. We built a personalized lab recommendation system that suggests more relevant labs based on students’ past learning history to maximize their learning outcomes. To measure the effectiveness of the proposed solution, we conducted a use case study and collected survey data from a graduate-level cybersecurity class. Our study shows that, when the knowledge graph is used for cybersecurity study, students tend to benefit more and show more interest in the cybersecurity area.
  5. As cyber threats grow in both frequency and sophistication, traditional cybersecurity measures struggle to keep pace with evolving attack methods. Artificial Intelligence (AI) has emerged as a powerful tool for enhancing threat detection, prevention, and response. AI-driven security systems offer the ability to analyze vast amounts of data in real-time, recognize subtle patterns indicative of cyber threats, and adapt to new attack strategies more efficiently than conventional approaches. However, despite AI’s potential, challenges remain regarding its effectiveness, ethical implications, and risks of adversarial manipulation. This research investigates the strengths and limitations of AI-driven cybersecurity by comparing AI-based security tools with traditional methods, identifying key advantages and vulnerabilities, and exploring ethical considerations. Additionally, a survey of cybersecurity professionals was conducted to assess expert opinions on AI’s role, effectiveness, and potential risks. By combining these insights with experimental testing and a comprehensive review of existing literature, this study provides a nuanced understanding of AI’s impact on cybersecurity and offers recommendations for optimizing its integration into modern security infrastructures. 