Title: iDetector: Automate Underground Forum Analysis Based on Heterogeneous Information Network
Online underground forums have been widely used by cybercriminals to trade illicit products, resources, and services, and they play a central role in the cybercriminal ecosystem. Unfortunately, due to the number of forums, their size, and the expertise required, it is infeasible to understand their behavioral processes through manual exploration. In this paper, we propose a novel framework named iDetector to automate the analysis of underground forums for the detection of cybercrime-suspected threads. To detect whether given threads are cybercrime-suspected, iDetector not only analyzes the content of the threads but also exploits the relations among threads, users, replies, and topics. To model these rich semantic relationships (i.e., thread-user, thread-reply, thread-topic, reply-user, and reply-topic relations), we introduce a structured heterogeneous information network (HIN) for representation, which can be composed of different types of entities and relations. To capture complex relationships (e.g., two threads are relevant if they were posted by the same user and discuss the same topic), we use a meta-structure based approach to characterize the semantic relatedness over threads. Because different meta-structures depict the relatedness over threads from different views, we then build a classifier that uses Laplacian scores to aggregate the similarities formulated by the different meta-structures and make predictions. To the best of our knowledge, this is the first work to use a structural HIN to automate underground forum analysis. Comprehensive experiments on real data collected from underground forums (e.g., Hack Forums) validate the effectiveness of our developed system iDetector in cybercrime-suspected thread detection, by comparison with alternative methods.
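As a rough illustration of the meta-path idea behind iDetector, the sketch below builds thread-user and thread-topic incidence matrices for a toy HIN, turns two meta-paths (thread-user-thread and thread-topic-thread) into PathSim-style similarity matrices, and combines them with fixed weights. The toy data, the PathSim normalization, and the hand-set weights (standing in for the Laplacian-score aggregation described above) are all assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the iDetector implementation): meta-path similarities
# over a toy thread/user/topic HIN, combined with hand-set weights.
import numpy as np

# Toy incidence matrices: rows are threads, columns are users / topics.
# A real system would build these from crawled forum data.
thread_user = np.array([[1, 0],
                        [1, 0],
                        [0, 1]])          # 3 threads, 2 users
thread_topic = np.array([[1, 0, 0],
                         [1, 1, 0],
                         [0, 0, 1]])      # 3 threads, 3 topics

def pathsim(commute):
    """PathSim-style normalization of a thread-thread commuting matrix."""
    diag = np.diag(commute)
    denom = diag[:, None] + diag[None, :]
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom > 0, 2.0 * commute / denom, 0.0)

# Two meta-paths over the HIN:
#   thread -> user  -> thread   (posted by the same user)
#   thread -> topic -> thread   (discusses the same topic)
sim_user  = pathsim(thread_user  @ thread_user.T)
sim_topic = pathsim(thread_topic @ thread_topic.T)

# Aggregate the per-meta-path similarities with weights; in the paper these
# weights come from Laplacian scores, here they are fixed for illustration.
weights = {"user": 0.6, "topic": 0.4}
combined = weights["user"] * sim_user + weights["topic"] * sim_topic

# Score unlabeled threads by similarity-weighted votes of labeled neighbors.
labels = np.array([1, -1, 0])             # 1 = suspected, 0 = benign, -1 = unknown
for i in np.where(labels == -1)[0]:
    known = labels != -1
    score = combined[i, known] @ labels[known]
    print(f"thread {i}: suspicion score {score:.2f}")
```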
Award ID(s):
1650474
PAR ID:
10091250
Author(s) / Creator(s):
Date Published:
Journal Name:
IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)
Page Range / eLocation ID:
1071 to 1078
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In this paper, we propose a novel representation learning framework, namely HIN2Vec, for heterogeneous information networks (HINs). The core of the proposed framework is a neural network model, also called HIN2Vec, designed to capture the rich semantics embedded in HINs by exploiting the different types of relationships among nodes. Given a set of relationships specified in the form of meta-paths in an HIN, HIN2Vec carries out multiple prediction training tasks jointly based on a target set of relationships to learn latent vectors of nodes and meta-paths in the HIN. In addition to model design, several issues unique to HIN2Vec, including regularization of meta-path vectors, node type selection in negative sampling, and cycles in random walks, are examined. To validate our ideas, we learn latent vectors of nodes using four large-scale real HIN datasets, including Blogcatalog, Yelp, DBLP, and U.S. Patents, and use them as features for multi-label node classification and link prediction on those networks. Empirical results show that HIN2Vec soundly outperforms state-of-the-art representation learning models for network data, including DeepWalk, LINE, node2vec, PTE, HINE, and ESim, by 6.6% to 23.8% in micro-F1 for multi-label node classification and by 5% to 70.8% in MAP for link prediction.
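To make the training objective concrete, here is a heavily simplified sketch of the kind of joint node/meta-path embedding HIN2Vec describes: for each positive (node, node, meta-path) triple plus random negatives, a logistic model scores the pair through a Hadamard product gated by the meta-path vector, and plain SGD updates both node and meta-path embeddings. The toy triples, dimensions, learning rate, and the exact gating and gradient details are assumptions for illustration, not the released HIN2Vec code.

```python
# Simplified sketch of HIN2Vec-style joint training, not the original code.
import numpy as np

rng = np.random.default_rng(0)
num_nodes, num_paths, dim = 6, 2, 8

# Positive training triples (x, y, meta_path_id); toy data stands in for
# node pairs extracted from random walks over a real HIN.
positives = [(0, 1, 0), (1, 2, 0), (0, 2, 1), (3, 4, 1), (4, 5, 0)]

W_node = rng.normal(scale=0.1, size=(num_nodes, dim))
W_path = rng.normal(scale=0.1, size=(num_paths, dim))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr, epochs, neg_per_pos = 0.05, 200, 2
for _ in range(epochs):
    for x, y, r in positives:
        # Random negatives may occasionally collide with the positive;
        # ignored here for brevity.
        samples = [(y, 1.0)] + [(rng.integers(num_nodes), 0.0)
                                for _ in range(neg_per_pos)]
        for y_s, label in samples:
            # Score combines both node vectors, gated by the meta-path vector.
            gate = sigmoid(W_path[r])
            score = sigmoid(np.sum(W_node[x] * W_node[y_s] * gate))
            grad = score - label                      # logistic-loss gradient
            gx = grad * W_node[y_s] * gate
            gy = grad * W_node[x] * gate
            gr = grad * W_node[x] * W_node[y_s] * gate * (1 - gate)
            W_node[x]   -= lr * gx
            W_node[y_s] -= lr * gy
            W_path[r]   -= lr * gr

print("node embeddings shape:", W_node.shape)
print("meta-path embeddings shape:", W_path.shape)
```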
  2. Online discussion forums have become an integral component of news, entertainment, information, and video-streaming websites, where people all over the world actively engage in discussions on a wide range of topics including politics, sports, music, business, health, and world affairs. Yet, little is known about their usability for blind users, who aurally interact with forum conversations using screen reader assistive technology. In an interview study, blind users stated that they often had an arduous and frustrating experience consuming conversation threads, mainly due to highly redundant content and the absence of customization options for selectively viewing portions of the conversations. As an initial step towards addressing these usability concerns, we designed PView, a browser extension that enables blind users to customize the content of forum threads in real time as they interact with them. Specifically, PView allows blind users to explicitly hide any post that is irrelevant to them, and then automatically detects and filters out all subsequent posts that are substantially similar to the hidden post, before the users navigate to those portions of the thread. In a user study with blind participants, we observed that, compared to the status quo, PView significantly improved usability and satisfaction and reduced workload while participants interacted with the forums.
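The filtering step can be pictured with a small sketch: after a user hides one post, later posts whose token overlap with it exceeds a threshold are dropped before the screen reader ever reaches them. The Jaccard measure, the threshold, and the toy thread below are illustrative assumptions, not PView's actual similarity model.

```python
# Minimal sketch (not the PView implementation): once a user hides one post,
# filter later posts that are substantially similar to it.
def tokens(text):
    return set(text.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def filter_thread(posts, hidden_post, threshold=0.6):
    """Return the posts that stay visible after hiding `hidden_post`
    and everything whose token overlap with it exceeds the threshold."""
    hidden = tokens(hidden_post)
    return [p for p in posts
            if p != hidden_post and jaccard(tokens(p), hidden) < threshold]

thread = [
    "try rebooting the router before anything else",
    "rebooting the router fixed it for me",
    "the firmware update is what solved my problem",
    "try rebooting the router before anything else please",
]
for post in filter_thread(thread, thread[0]):
    print(post)   # the near-duplicate last post is filtered out
```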
  3. Scene graph generation refers to the task of automatically mapping an image into a semantic structural graph, which requires correctly labeling each extracted object and its interaction relationships. Despite the recent success in object detection using deep learning techniques, inferring complex contextual relationships and structured graph representations from visual data remains a challenging topic. In this study, we propose a novel Attentive Relational Network that consists of two key modules on top of an object detection backbone to approach this problem. The first is a semantic transformation module that captures semantically embedded relation features by translating visual features and linguistic features into a common semantic space. The other is a graph self-attention module that embeds a joint graph representation by assigning different importance weights to neighboring nodes. Finally, accurate scene graphs are produced by a relation inference module that recognizes all entities and the corresponding relations. We evaluate our proposed method on the widely adopted Visual Genome dataset, and the results demonstrate the effectiveness and superiority of our model.
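As a rough picture of what a graph self-attention module does, the sketch below performs one attention-weighted aggregation step over a small object graph: each node's updated embedding is a softmax-weighted sum of its neighbors' projected features. The random features, the adjacency matrix, and the single shared projection are illustrative assumptions and do not reproduce the authors' architecture.

```python
# Minimal sketch (not the authors' model): one graph self-attention step that
# re-weights neighbor features when updating each node embedding.
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim = 4, 5
features = rng.normal(size=(num_nodes, dim))     # per-object feature vectors
adjacency = np.array([[1, 1, 0, 0],              # 1 = edge (self-loops kept)
                      [1, 1, 1, 0],
                      [0, 1, 1, 1],
                      [0, 0, 1, 1]], dtype=float)

W = rng.normal(scale=0.1, size=(dim, dim))       # shared projection
projected = features @ W

def masked_softmax(scores, mask):
    scores = np.where(mask > 0, scores, -1e9)    # ignore non-neighbors
    scores -= scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

# Dot-product attention scores between node pairs, masked to graph edges.
attn = masked_softmax(projected @ projected.T / np.sqrt(dim), adjacency)

# Each node embedding becomes an attention-weighted sum of its neighbors.
updated = attn @ projected
print(updated.shape)   # (4, 5)
```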
  4. Abstract The organization of our knowledge about the world into an interconnected network of concepts linked by relations profoundly impacts many facets of cognition, including attention, memory retrieval, reasoning, and learning. It is therefore crucial to understand how organized semantic representations are acquired. The present experiment investigated the contributions of readily observable environmental statistical regularities to semantic organization in childhood. Specifically, we investigated whether co‐occurrence regularities with which entities or their labels more reliably occur together than with others (a) contribute to relations between concepts independently and (b) contribute to relations between concepts belonging to the same taxonomic category. Using child‐directed speech corpora to estimate reliable co‐occurrences between labels for familiar items, we constructed triads consisting of a target, a related distractor, and an unrelated distractor in which targets and related distractors consistently co‐occurred (e.g., sock‐foot), belonged to the same taxonomic category (e.g., sock‐coat), or both (e.g., sock‐shoe). We used an implicit, eye‐gaze measure of relations between concepts based on the degree to which children (N = 72, age 4–7 years) looked at related versus unrelated distractors when asked to look for a target. The results indicated that co‐occurrence both independently contributes to relations between concepts and contributes to relations between concepts belonging to the same taxonomic category. These findings suggest that sensitivity to the regularity with which different entities co‐occur in children's environments shapes the organization of semantic knowledge during development. Implications for theoretical accounts and empirical investigations of semantic organization are discussed. 
  5. Variable names are critical for conveying intended program behavior. Machine learning-based program analysis methods use variable name representations for a wide range of tasks, such as suggesting new variable names and bug detection. Ideally, such methods could capture semantic relationships between names beyond syntactic similarity, e.g., the fact that the names average and mean are similar. Unfortunately, previous work has found that even the best of previous representation approaches primarily capture "relatedness" (whether two variables are linked at all), rather than "similarity" (whether they actually have the same meaning). We propose VarCLR, a new approach for learning semantic representations of variable names that effectively captures variable similarity in this stricter sense. We observe that this problem is an excellent fit for contrastive learning, which aims to minimize the distance between explicitly similar inputs while maximizing the distance between dissimilar inputs. This requires labeled training data, and thus we construct a novel, weakly supervised variable renaming dataset mined from GitHub edits. We show that VarCLR enables the effective application of sophisticated, general-purpose language models like BERT to variable name representation, and thus also to related downstream tasks like variable name similarity search or spelling correction. VarCLR produces models that significantly outperform the state of the art on IdBench, an existing benchmark that explicitly captures variable similarity (as distinct from relatedness). Finally, we contribute a release of all data, code, and pre-trained models, aiming to provide a drop-in replacement for variable representations used in either existing or future program analyses that rely on variable names.
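The contrastive objective can be illustrated with a short PyTorch sketch of an InfoNCE-style loss: embeddings of a variable name and its rename are pulled together while the other names in the batch act as negatives. The encoder is stubbed out with random vectors, and the temperature and batch size are arbitrary; this is a generic contrastive loss, not the released VarCLR code.

```python
# Generic InfoNCE-style contrastive loss, illustrating the idea behind VarCLR.
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.07):
    """anchors, positives: (batch, dim) embeddings of paired variable names."""
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    logits = anchors @ positives.T / temperature     # (batch, batch) similarities
    targets = torch.arange(anchors.size(0))          # matching pair is on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with random vectors standing in for encoder outputs
# (e.g., an encoding of the names "average" and "mean").
batch, dim = 8, 16
loss = info_nce(torch.randn(batch, dim), torch.randn(batch, dim))
print(loss.item())
```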