skip to main content

Title: Distilling Contextual Embeddings Into A Static Word Embedding For Improving Hacker Forum Analytics
Hacker forums provide malicious actors with a large database of tutorials, goods, and assets to leverage for cyber-attacks. Careful research of these forums can provide tremendous benefit to the cybersecurity community through trend identification and exploit categorization. This study aims to provide a novel static word embedding, Hack2Vec, to improve performance on hacker forum classification tasks. Our proposed Hack2Vec model distills contextual representations from the seminal pre-trained language model BERT to a continuous bag-of-words model to create a highly targeted hacker forum static word embedding. The results of our experimental design indicate that Hack2Vec improves performance over prominent embeddings in accuracy, precision, recall, and F1-score for a benchmark hacker forum classification task.
Authors:
;
Award ID(s):
1921485 1917117
Publication Date:
NSF-PAR ID:
10344545
Journal Name:
Proceedings of 2021 IEEE International Conference on Intelligence and Security Informatics (IEEE ISI 2021)
Page Range or eLocation-ID:
1 to 3
Sponsoring Org:
National Science Foundation
More Like this
  1. Cybersecurity experts have appraised the total global cost of malicious hacking activities to be $450 billion annually. Cyber Threat Intelligence (CTI) has emerged as a viable approach to combat this societal issue. However, existing processes are criticized as inherently reactive to known threats. To combat these concerns, CTI experts have suggested proactively examining emerging threats in the vast, international online hacker community. In this study, we aim to develop proactive CTI capabilities by exploring online hacker forums to identify emerging threats in terms of popularity and tool functionality. To achieve these goals, we create a novel Diachronic Graph Embedding Framework (D-GEF). D-GEF operates on a Graph-of-Words (GoW) representation of hacker forum text to generate word embeddings in an unsupervised manner. Semantic displacement measures adopted from diachronic linguistics literature identify how terminology evolves. A series of benchmark experiments illustrate D-GEF's ability to generate higher quality than state-of-the-art word embedding models (e.g., word2vec) in tasks pertaining to semantic analogy, clustering, and threat classification. D-GEF's practical utility is illustrated with in-depth case studies on web application and denial of service threats targeting PHP and Windows technologies, respectively. We also discuss the implications of the proposed framework for strategic, operational, and tactical CTI scenarios.more »All datasets and code are publicly released to facilitate scientific reproducibility and extensions of this work.« less
  2. Cybercrime was estimated to cost the global economy $945 billion in 2020. Increasingly, law enforcement agencies are using social network analysis (SNA) to identify key hackers from Dark Web hacker forums for targeted investigations. However, past approaches have primarily focused on analyzing key hackers at a single point in time and use a hacker’s structural features only. In this study, we propose a novel Hacker Evolution Identification Framework to identify how hackers evolve within hacker forums. The proposed framework has two novelties in its design. First, the framework captures features such as user statistics, node-level metrics, lexical measures, and post style, when representing each hacker with unsupervised graph embedding methods. Second, the framework incorporates mechanisms to align embedding spaces across multiple time-spells of data to facilitate analysis of how hackers evolve over time. Two experiments were conducted to assess the performance of prevailing graph embedding algorithms and nodal feature variations in the task of graph reconstruction in five timespells. Results of our experiments indicate that Text- Associated Deep-Walk (TADW) with all of the proposed nodal features outperforms methods without nodal features in terms of Mean Average Precision in each time-spell. We illustrate the potential practical utility of the proposed frameworkmore »with a case study on an English forum with 51,612 posts. The results produced by the framework in this case study identified key hackers posting piracy assets.« less
  3. Online forums are an integral part of modern day courses, but motivating students to participate in educationally beneficial discussions can be challenging. Our proposed solution is to initialize (or “seed”) a new course forum with comments from past instances of the same course that are intended to trigger discussion that is beneficial to learning. In this work, we develop methods for selecting high-quality seeds and evaluate their impact over one course instance of a 186-student biology class. We designed a scale for measuring the “seeding suitability” score of a given thread (an opening comment and its ensuing discussion). We then constructed a supervised machine learning (ML) model for predicting the seeding suitability score of a given thread. This model was evaluated in two ways: first, by comparing its performance to the expert opinion of the course instructors on test/holdout data; and second, by embedding it in a live course, where it was actively used to facilitate seeding by the course instructors. For each reading assignment in the course, we presented a ranked list of seeding recommendations to the course instructors, who could review the list and filter out seeds with inconsistent or malformed content. We then ran a randomized controlledmore »study, in which one group of students was shown seeds that were recommended by the ML model, and another group was shown seeds that were recommended by an alternative model that ranked seeds purely by the length of discussion that was generated in previous course instances. We found that the group of students that received posts from either seeding model generated more discussion than a control group in the course that did not get seeded posts. Furthermore, students who received seeds selected by the ML-based model showed higher levels of engagement, as well as greater learning gains, than those who received seeds ranked by length of discussion.« less
  4. Background: Addiction to drugs and alcohol constitutes one of the significant factors underlying the decline in life expectancy in the US. Several context-specific reasons influence drug use and recovery. In particular emotional distress, physical pain, relationships, and self-development efforts are known to be some of the factors associated with addiction recovery. Unfortunately, many of these factors are not directly observable and quantifying, and assessing their impact can be difficult. Based on social media posts of users engaged in substance use and recovery on the forum Reddit, we employed two psycholinguistic tools, Linguistic Inquiry and Word Count and Empath and activities of substance users on various Reddit sub-forums to analyze behavior underlining addiction recovery and relapse. We then employed a statistical analysis technique called structural equation modeling to assess the effects of these latent factors on recovery and relapse. Results: We found that both emotional distress and physical pain significantly influence addiction recovery behavior. Self-development activities and social relationships of the substance users were also found to enable recovery. Furthermore, within the context of self-development activities, those that were related to influencing the mental and physical well-being of substance users were found to be positively associated with addiction recovery. We alsomore »determined that lack of social activities and physical exercise can enable a relapse. Moreover, geography, especially life in rural areas, appears to have a greater correlation with addiction relapse. Conclusions: The paper describes how observable variables can be extracted from social media and then be used to model important latent constructs that impact addiction recovery and relapse. We also report factors that impact self-induced addiction recovery and relapse. To the best of our knowledge, this paper represents the first use of structural equation modeling of social media data with the goal of analyzing factors influencing addiction recovery.« less
  5. The regularity of devastating cyber-attacks has made cybersecurity a grand societal challenge. Many cybersecurity professionals are closely examining the international Dark Web to proactively pinpoint potential cyber threats. Despite its potential, the Dark Web contains hundreds of thousands of non-English posts. While machine translation is the prevailing approach to process non-English text, applying MT on hacker forum text results in mistranslations. In this study, we draw upon Long-Short Term Memory (LSTM), Cross-Lingual Knowledge Transfer (CLKT), and Generative Adversarial Networks (GANs) principles to design a novel Adversarial CLKT (A-CLKT) approach. A-CLKT operates on untranslated text to retain the original semantics of the language and leverages the collective knowledge about cyber threats across languages to create a language invariant representation without any manual feature engineering or external resources. Three experiments demonstrate how A-CLKT outperforms state-of-the-art machine learning, deep learning, and CLKT algorithms in identifying cyber-threats in French and Russian forums.