skip to main content


Title: iDetector: Automate Underground Forum Analysis Based on Heterogeneous Information Network
Online underground forums have been widely used by cybercriminals to trade the illicit products, resources and services, which have played a central role in the cybercrim-inal ecosystem. Unfortunately, due to the number of forums, their size, and the expertise required, it's infeasible to perform manual exploration to understand their behavioral processes. In this paper, we propose a novel framework named iDetector to automate the analysis of underground forums for the detection of cybercrime-suspected threads. In iDetector, to detect whether the given threads are cybercrime-suspected threads, we not only analyze the content in the threads, but also utilize the relations among threads, users, replies, and topics. To model this kind of rich semantic relationships (i.e., thread-user, thread-reply, thread-topic, reply-user and reply-topic relations), we introduce a structured heterogeneous information network (HIN) for representation, which is capable to be composed of different types of entities and relations. To capture the complex relationships (e.g., two threads are relevant if they were posted by the same user and discussed the same topic), we use a meta-structure based approach to characterize the semantic relatedness over threads. As different meta-structures depict the relatedness over threads at different views, we then build a classifier using Laplacian scores to aggregate different similarities formulated by different meta-structures to make predictions. To the best of our knowledge, this is the first work to use structural HIN to automate underground forum analysis. Comprehensive experiments on real data collections from underground forums (e.g., Hack Forums) are conducted to validate the effectiveness of our developed system iDetector in cybercrime-suspected thread detection by comparisons with other alternative methods.  more » « less
Award ID(s):
1650474
NSF-PAR ID:
10091250
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)
Page Range / eLocation ID:
1071 to 1078
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Variable names are critical for conveying intended program behavior. Machine learning-based program analysis methods use variable name representations for a wide range of tasks, such as suggesting new variable names and bug detection. Ideally, such methods could capture semantic relationships between names beyond syntactic similarity, e.g., the fact that the names average and mean are similar. Unfortunately, previous work has found that even the best of previous representation approaches primarily capture "relatedness" (whether two variables are linked at all), rather than "similarity" (whether they actually have the same meaning). We propose VarCLR, a new approach for learning semantic representations of variable names that effectively captures variable similarity in this stricter sense. We observe that this problem is an excellent fit for contrastive learning, which aims to minimize the distance between explicitly similar inputs, while maximizing the distance between dissimilar inputs. This requires labeled training data, and thus we construct a novel, weakly-supervised variable renaming dataset mined from GitHub edits. We show that VarCLR enables the effective application of sophisticated, general-purpose language models like BERT, to variable name representation and thus also to related downstream tasks like variable name similarity search or spelling correction. VarCLR produces models that significantly outperform the state-of-the-art on IdBench, an existing benchmark that explicitly captures variable similarity (as distinct from relatedness). Finally, we contribute a release of all data, code, and pre-trained models, aiming to provide a drop-in replacement for variable representations used in either existing or future program analyses that rely on variable names. 
    more » « less
  2. Online discussion forums have become an integral component of news, entertainment, information, and video-streaming websites, where people all over the world actively engage in discussions on a wide range of topics including politics, sports, music, business, health, and world affairs. Yet, little is known about their usability for blind users, who aurally interact with the forum conversations using screen reader assistive technology. In an interview study, blind users stated that they often had an arduous and frustrating interaction experience while consuming conversation threads, mainly due to the highly redundant content and the absence of customization options to selectively view portions of the conversations. As an initial step towards addressing these usability concerns, we designed PView - a browser extension that enables blind users to customize the content of forum threads in real time as they interact with these threads. Specifically, PView allows the blind users to explicitly hide any post that is irrelevant to them, and then PView automatically detects and filters out all subsequent posts that are substantially similar to the hidden post in real time, before the users navigate to those portions of the thread. In a user study with blind participants, we observed that compared to the status quo, PView significantly improved the usability, workload, and satisfaction of the participants while interacting with the forums.

     
    more » « less
  3. Over the last decade, userland memory forensics techniques and algorithms have gained popularity among practitioners, as they have proven to be useful in real forensics and cybercrime investigations. These techniques analyze and recover objects and artifacts from process memory space that are of critical importance in investigations. Nonetheless, the major drawback of existing techniques is that they cannot determine the origin and context within which the recovered object exists without prior knowledge of the application logic. Thus, in this research, we present a solution to close the gap between application-specific and application-generic techniques. We introduce OAGen, a post-execution and app-agnostic semantic analysis approach designed to help investigators establish concrete evidence by identifying the provenance and relationships between in-memory objects in a process memory image. OAGen utilizes Points-to analysis to reconstruct a runtime’s object allocation network. The resulting graph is then fed as an input into our semantic analysis algorithms to determine objects’ origin, context, and scope in the network. The results of our experiments exhibit OAGen’s ability to effectively create an allocation network even for memory-intensive applications with thousands of objects, like Facebook. The performance evaluation of our approach across fourteen different Android apps shows OAGen can efficiently search and decode nodes, and identify their references with a modest throughput rate. Further practical application of OAGen demonstrated in two case studies shows that our approach can aid investigators in the recovery of deleted messages and the detection of malware functionality in post-execution program analysis. 
    more » « less
  4. Cybercrime was estimated to cost the global economy $945 billion in 2020. Increasingly, law enforcement agencies are using social network analysis (SNA) to identify key hackers from Dark Web hacker forums for targeted investigations. However, past approaches have primarily focused on analyzing key hackers at a single point in time and use a hacker’s structural features only. In this study, we propose a novel Hacker Evolution Identification Framework to identify how hackers evolve within hacker forums. The proposed framework has two novelties in its design. First, the framework captures features such as user statistics, node-level metrics, lexical measures, and post style, when representing each hacker with unsupervised graph embedding methods. Second, the framework incorporates mechanisms to align embedding spaces across multiple time-spells of data to facilitate analysis of how hackers evolve over time. Two experiments were conducted to assess the performance of prevailing graph embedding algorithms and nodal feature variations in the task of graph reconstruction in five timespells. Results of our experiments indicate that Text- Associated Deep-Walk (TADW) with all of the proposed nodal features outperforms methods without nodal features in terms of Mean Average Precision in each time-spell. We illustrate the potential practical utility of the proposed framework with a case study on an English forum with 51,612 posts. The results produced by the framework in this case study identified key hackers posting piracy assets. 
    more » « less
  5. null (Ed.)
    Cyberbullying, identified as intended and repeated online bullying behavior, has become increasingly prevalent in the past few decades. Despite the significant progress made thus far, the focus of most existing work on cyberbullying detection lies in the independent content analysis of different comments within a social media session. We argue that such leading notions of analysis suffer from three key limitations: they overlook the temporal correlations among different comments; they only consider the content within a single comment rather than the topic coherence across comments; they remain generic and exploit limited interactions between social media users. In this work, we observe that user comments in the same session may be inherently related, e.g., discussing similar topics, and their interaction may evolve over time. We also show that modeling such topic coherence and temporal interaction are critical to capture the repetitive characteristics of bullying behavior, thus leading to better predicting performance. To achieve the goal, we first construct a unified temporal graph for each social media session. Drawing on recent advances in graph neural network, we then propose a principled graph-based approach for modeling the temporal dynamics and topic coherence throughout user interactions. We empirically evaluate the effectiveness of our approach with the tasks of session-level bullying detection and comment-level case study. Our code is released to public. 
    more » « less