Title: Identifying and Categorizing Malicious Content on Paste Sites: A Neural Topic Modeling Approach
Malicious cyber activities impose substantial costs on the U.S. economy and global markets. Cyber-criminals often use information-sharing social media platforms such as paste sites (e.g., Pastebin) to share vast amounts of plain-text content related to Personally Identifiable Information (PII), credit card numbers, exploit code, malware, and other sensitive content. Paste sites can provide targeted Cyber Threat Intelligence (CTI) about potential threats and prior breaches. In this research, we propose a novel Bidirectional Encoder Representations from Transformers (BERT) with Latent Dirichlet Allocation (LDA) model to categorize pastes automatically. Our proposed BERT-LDA model leverages a neural transformer architecture to capture sequential dependencies when representing each sentence in a paste. BERT-LDA replaces the Bag-of-Words (BoW) approach in conventional LDA with a Bag-of-Labels (BoL) that encompasses class labels at the sequence level. We compared the performance of the proposed BERT-LDA against conventional LDA and BERT-LDA variants (e.g., GPT2-LDA) on 4,254,453 pastes from three paste sites. Experiment results indicate that the proposed BERT-LDA outperformed the standard LDA and each variant in terms of perplexity on each paste site. Results of our BERT-LDA case study suggest that significant content relating to hacker community activities, malicious code, network and website vulnerabilities, and PII is shared on paste sites. The insights provided by this study could be used by organizations to proactively mitigate potential damage to their infrastructure.
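The Bag-of-Labels idea can be pictured in a few lines: assign sequence-level labels to each paste, then fit conventional LDA over label counts instead of word counts. The sketch below is not the authors' pipeline; the keyword labeler is a purely hypothetical stand-in for a fine-tuned BERT sequence classifier, and perplexity is printed only because it is the evaluation metric named in the abstract.

```python
# Hedged sketch of the Bag-of-Labels (BoL) idea, not the authors' code:
# label each paste at the sequence level, then run conventional LDA over
# the resulting label documents instead of raw words.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def label_paste(paste):
    """Hypothetical stand-in for a fine-tuned BERT sequence classifier."""
    keywords = {"password": "PII", "exploit": "EXPLOIT", "card": "CARD"}
    return [tag for word, tag in keywords.items() if word in paste.lower()]

pastes = [
    "dumped password list with emails",
    "new exploit for router firmware",
    "stolen card numbers and password hashes",
]
# Each document becomes a Bag-of-Labels instead of a Bag-of-Words.
docs = [" ".join(label_paste(p)) or "OTHER" for p in pastes]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # document-by-label count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.perplexity(X))                    # metric reported in the abstract
```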
Award ID(s):
1917117 1921485
PAR ID:
10336827
Author(s) / Creator(s):
Date Published:
Journal Name:
2021 IEEE International Conference on Intelligence and Security Informatics
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Hacker forums provide malicious actors with a large database of tutorials, goods, and assets to leverage for cyber-attacks. Careful research of these forums can provide tremendous benefit to the cybersecurity community through trend identification and exploit categorization. This study aims to provide a novel static word embedding, Hack2Vec, to improve performance on hacker forum classification tasks. Our proposed Hack2Vec model distills contextual representations from the seminal pre-trained language model BERT to a continuous bag-of-words model to create a highly targeted hacker forum static word embedding. The results of our experimental design indicate that Hack2Vec improves performance over prominent embeddings in accuracy, precision, recall, and F1-score for a benchmark hacker forum classification task. 
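One simplified way to picture turning contextual BERT representations into a static embedding (the general goal Hack2Vec pursues) is to pool a word's per-occurrence contextual vectors into a single vector. Note this is a hedged illustration, not the paper's method: Hack2Vec distills BERT into a continuous bag-of-words model, and the random vectors below merely stand in for real BERT output.

```python
# Simplified illustration (not Hack2Vec itself): mean-pool a word's
# contextual vectors across occurrences into one static vector. Random
# vectors stand in for per-occurrence BERT representations.
import numpy as np

rng = np.random.default_rng(0)
contextual = {
    "exploit": [rng.normal(size=8) for _ in range(3)],  # 3 occurrences
    "botnet": [rng.normal(size=8) for _ in range(2)],   # 2 occurrences
}
static_embedding = {w: np.mean(vs, axis=0) for w, vs in contextual.items()}
print(static_embedding["exploit"].shape)  # (8,)
```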
  2. Serverless Computing has quickly emerged as a dominant cloud computing paradigm, allowing developers to rapidly prototype event-driven applications using a composition of small functions that each perform a single logical task. However, many such application workflows are based in part on publicly available functions developed by third parties, creating the potential for functions to behave in unexpected, or even malicious, ways. At present, developers are not in total control of where and how their data is flowing, creating significant security and privacy risks in growth markets that have embraced serverless (e.g., IoT). As a practical means of addressing this problem, we present Valve, a serverless platform that enables developers to exert complete fine-grained control of information flows in their applications. Valve enables workflow developers to reason about function behaviors, and specify restrictions, through auditing of network-layer information flows. By proxying network requests and propagating taint labels across network flows, Valve is able to restrict function behavior without code modification. We demonstrate that Valve is able to defend against known serverless attack behaviors, including container reuse-based persistence and data exfiltration over cloud platform APIs, with less than 2.8% runtime overhead, 6.25% deployment overhead, and 2.35% teardown overhead.
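The network-layer control described above can be pictured as taint labels attached to flows and checked against a per-sink policy before data leaves the application. The sketch below is illustrative only: every name in it is hypothetical, and it omits the request proxying and cross-flow label propagation that the real platform performs.

```python
# Illustrative-only sketch of per-sink taint checking in the spirit of
# Valve. All names are hypothetical; real Valve proxies network requests
# and propagates taint labels across network flows without code changes.
POLICY = {"external_api": {"public"}}  # sink -> labels allowed to exit

def may_forward(flow_labels, sink):
    """Allow a flow only if all of its taint labels are permitted at the sink."""
    return flow_labels <= POLICY.get(sink, set())

print(may_forward({"public"}, "external_api"))         # True  (allowed)
print(may_forward({"public", "pii"}, "external_api"))  # False (blocked)
```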
  3. Systematic reviews are a time-consuming yet effective approach to understanding research trends. While researchers have investigated how to speed up the process of screening studies for potential inclusion, few have focused on the extent to which algorithms can extract data instead of human coders. In this study, we explore to what extent analyses and algorithms can produce results similar to human data extraction during a scoping review—a type of systematic review aimed at understanding the nature of the field rather than the efficacy of an intervention—in the context of a never-before-analyzed sample of studies that were intended for a scoping review. Specifically, we tested five approaches: bibliometric analysis with VOSviewer; latent Dirichlet allocation (LDA) with bag of words; k-means clustering with TF-IDF, Sentence-BERT, or SPECTER embeddings; hierarchical clustering with Sentence-BERT; and BERTopic. Our results showed that topic modeling approaches (LDA/BERTopic) and k-means clustering identified specific, but often narrow, research areas, leaving a substantial portion of the sample unclassified or in unclear topics. Meanwhile, bibliometric analysis and hierarchical clustering with SBERT were more informative for our purposes, identifying key author networks and categorizing studies into distinct themes as well as reflecting the relationships between themes, respectively. Overall, we highlight the capabilities and limitations of each method and discuss how these techniques can complement traditional human data extraction methods. We conclude that the analyses tested here likely cannot fully replace human data extraction in scoping reviews but serve as valuable supplements.
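A minimal sketch of the hierarchical-clustering step described above, assuming embedded study abstracts as input; the random vectors below are stand-ins for real Sentence-BERT encodings.

```python
# Hedged sketch: hierarchical (agglomerative) clustering over sentence
# embeddings. Random vectors stand in for Sentence-BERT encodings of
# study abstracts; a real pipeline would encode the texts first.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(1)
embeddings = rng.normal(size=(10, 16))   # 10 "abstracts", 16-dim stubs
labels = AgglomerativeClustering(n_clusters=3).fit_predict(embeddings)
print(sorted(set(labels)))               # [0, 1, 2]
```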
  4. This study investigates the hydration, microstructure, autogenous shrinkage, electrical resistivity, and mechanical properties of Portland cement pastes modified with PEG-PPG triblock copolymers of varied molecular weights. The early-age properties, including setting time and hydration heat, were examined using the Vicat test and isothermal calorimetry. The hydration products and pore size distribution were analyzed using thermogravimetric analysis (TGA) and nitrogen adsorption, respectively. Mechanical properties and electrical resistivity were evaluated using the compressive strength test and electrochemical impedance spectroscopy (EIS). It was shown that the addition of the copolymers reduced the surface tension of the cement paste pore solution due to the presence of a hydrophobic block (PPG) in the molecular structure of the copolymers. The setting time and hydration heat were relatively similar in the control paste and in the pastes modified with the copolymers. The results showed that the copolymers were able to reduce autogenous shrinkage in the paste, due primarily to the reduction in pore solution surface tension. TGA showed a slight increase in the hydration degree of the pastes modified with the copolymers. The compressive strength was reduced in the modified pastes that showed an increased volume of air voids. The addition of copolymers did not affect the electrical resistivity of the pastes except where there was a large volume of air voids, which acted as electrical insulators.
  5. The U.S. Government is developing a package label to help consumers access reliable security and privacy information about Internet of Things (IoT) devices when making purchase decisions. The label will include the U.S. Cyber Trust Mark, a QR code to scan for more details, and potentially additional information. To examine how label information complexity and educational interventions affect comprehension of security and privacy attributes and label QR code use, we conducted an online survey with 518 IoT purchasers. We examined participants’ comprehension and preferences for three labels of varying complexities, with and without an educational intervention. Participants favored and correctly utilized the two higher-complexity labels, showing a special interest in the privacy-relevant content. Furthermore, while the educational intervention improved understanding of the QR code’s purpose, it had a modest effect on QR scanning behavior. We highlight clear design and policy directions for creating and deploying IoT security and privacy labels. 