This content will become publicly available on May 1, 2023

Title: WtaGraph: Web Tracking and Advertising Detection using Graph Neural Networks
Web tracking and advertising (WTA) nowadays are ubiquitously performed on the web, continuously compromising users' privacy. Existing defense solutions, such as widely deployed blocking tools based on filter lists and alternative machine learning based solutions proposed in prior research, have limitations in terms of accuracy and effectiveness. In this work, we propose WtaGraph, a web tracking and advertising detection framework based on Graph Neural Networks (GNNs). We first construct an attributed homogenous multi-graph (AHMG) that represents HTTP network traffic, and formulate web tracking and advertising detection as a task of GNN-based edge representation learning and classification in AHMG. We then design four components in WtaGraph so that it can (1) collect HTTP network traffic, DOM, and JavaScript data, (2) construct AHMG and extract corresponding edge and node features, (3) build a GNN model for edge representation learning and WTA detection in the transductive learning setting, and (4) use a pre-trained GNN model for WTA detection in the inductive learning setting. We evaluate WtaGraph on a dataset collected from Alexa Top 10K websites, and show that WtaGraph can effectively detect WTA requests in both transductive and inductive learning settings. Manual verification results indicate that WtaGraph can detect new WTA requests that are missed by filter lists and recognize non-WTA requests that are mistakenly labeled by filter lists. Our ablation analysis, evasion evaluation, and real-time evaluation show that WtaGraph can have a competitive performance with flexible deployment options in practice.
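As a rough illustration of the graph-construction step described in the abstract, the sketch below builds a small multi-graph from HTTP requests using only the Python standard library. It is not the authors' implementation. The abstract states only that the AHMG represents HTTP traffic and that detection is framed as edge classification, so the choices here are assumptions: nodes are hostnames, each request is one edge from the initiating page's host to the requested host, and the three edge features (`url_length`, `num_query_params`, `is_third_party`) as well as the function name `build_multigraph` are hypothetical.

```python
from urllib.parse import parse_qsl, urlparse


def build_multigraph(requests):
    """Build a multi-graph from (initiator_url, request_url) pairs.

    Assumption: nodes are hostnames and every request becomes its own
    edge, so parallel edges between the same two hosts are preserved.
    Returns (nodes, edges), where nodes maps hostname -> integer id and
    edges is a list of (src_id, dst_id, features) tuples.
    """
    nodes = {}
    edges = []
    for initiator_url, request_url in requests:
        parsed = urlparse(request_url)
        src_host = urlparse(initiator_url).hostname
        dst_host = parsed.hostname
        # Assign the next free integer id to hosts not seen before.
        src = nodes.setdefault(src_host, len(nodes))
        dst = nodes.setdefault(dst_host, len(nodes))
        # Hypothetical edge attributes; comparing full hostnames is a
        # crude stand-in for a proper registrable-domain comparison.
        features = {
            "url_length": len(request_url),
            "num_query_params": len(parse_qsl(parsed.query)),
            "is_third_party": src_host != dst_host,
        }
        edges.append((src, dst, features))
    return nodes, edges


example_requests = [
    ("https://news.example/", "https://news.example/app.js"),
    ("https://news.example/", "https://tracker.example/pixel?uid=1&ref=home"),
    ("https://news.example/", "https://tracker.example/pixel?uid=1&ref=article"),
]
nodes, edges = build_multigraph(example_requests)
```

In this toy input the two requests to `tracker.example` become two separate edges between the same pair of nodes, which is what makes the structure a multi-graph. A classifier over such edges (a GNN in the paper's case) would then label each request individually rather than each host.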
Award ID(s):
1936968
NSF-PAR ID:
10287982
Journal Name:
IEEE Symposium on Security and Privacy
Sponsoring Org:
National Science Foundation
More Like this
  1. Today’s mobile apps employ third-party advertising and tracking (A&T) libraries, which may pose a threat to privacy. State-of-the-art approaches detect and block outgoing A&T HTTP/S requests by using manually curated filter lists (e.g., EasyList) and, more recently, machine learning approaches. The major bottleneck of both filter lists and classifiers is that they rely on experts and the community to inspect traffic and manually create filter list rules that can then be used to block traffic or label ground truth datasets. We propose NoMoATS, a system that removes this bottleneck by reducing the daunting task of manually creating filter rules to the much easier and more scalable task of labeling A&T libraries. Our system leverages stack trace analysis to automatically label which network requests are generated by A&T libraries. Using NoMoATS, we collect and label a new mobile traffic dataset. We use this dataset to train decision tree classifiers, which can be applied in real-time on the mobile device and achieve an average F-score of 93%. We show that both our automatic labeling and our classifiers discover thousands of requests destined to hundreds of different hosts, previously undetected by popular filter lists. To the best of our knowledge, our system is the first to (1) automatically label which mobile network requests are engaged in A&T, while requiring only that libraries be manually labeled by purpose, and (2) apply on-device machine learning classifiers that operate at the granularity of URLs, can inspect connections across all apps, and detect not only ads, but also tracking.
  2. Temporal networks serve as abstractions of many real-world dynamic systems. These networks typically evolve according to certain laws, such as the law of triadic closure, which is universal in social networks. Inductive representation learning of temporal networks should be able to capture such laws and further be applied to systems that follow the same laws but have not been seen during the training stage. Previous works in this area depend on either network node identities or rich edge attributes and typically fail to extract these laws. Here, we propose Causal Anonymous Walks (CAWs) to inductively represent a temporal network. CAWs are extracted by temporal random walks and work as automatic retrieval of temporal network motifs to represent network dynamics while avoiding the time-consuming selection and counting of those motifs. CAWs adopt a novel anonymization strategy that replaces node identities with the hitting counts of the nodes based on a set of sampled walks to keep the method inductive, and simultaneously establish the correlation between motifs. We further propose a neural-network model CAW-N to encode CAWs, and pair it with a CAW sampling strategy with constant memory and time cost to support online training and inference. CAW-N is evaluated to predict links over 6 real temporal networks and uniformly outperforms previous SOTA methods by an average 15% AUC gain in the inductive setting. CAW-N also outperforms previous methods in 5 out of the 6 networks in the transductive setting.
  3. The smart parking industry continues to evolve as an increasing number of cities struggle with traffic congestion and inadequate parking availability. For urban dwellers, few things are more irritating than anxiously searching for a parking space. Research results show that as much as 30% of traffic is caused by drivers driving around looking for parking spaces in congested city areas. There has been considerable activity among researchers to develop smart technologies that can help drivers find a parking spot with greater ease, reducing not only traffic congestion but also the resulting air pollution. Many existing solutions deploy sensors in every parking spot to address the automatic parking spot detection problem. However, the device and deployment costs are very high, especially for some large and old parking structures. A wide variety of other technological innovations are beginning to enable more adaptable systems, including license plate number detection, smart parking meters, and vision-based parking spot detection. In this paper, we propose to design a more adaptable and affordable smart parking system via distributed cameras, edge computing, data analytics, and advanced deep learning algorithms. Specifically, we deploy cameras with zoom lenses and motorized heads to capture license plate numbers by tracking the vehicles when they enter or leave the parking lot; cameras with wide-angle fish-eye lenses monitor the large parking lot via our custom-designed deep neural network. We further optimize the algorithm and enable real-time deep learning inference on an edge device. Through the intelligent algorithm, we can significantly reduce the cost of existing systems, while achieving a more adaptable solution. For example, our system can automatically detect when a car enters the parking space, identify the location of the parking spot, precisely charge the parking fee, and associate the fee with the license plate number.
  4. Existing Graph Neural Network (GNN) methods that learn inductive unsupervised graph representations focus on learning node and edge representations by predicting observed edges in the graph. Although such approaches have shown advances in downstream node classification tasks, they are ineffective in jointly representing larger k-node sets, k > 2. We propose MHM-GNN, an inductive unsupervised graph representation approach that combines joint k-node representations with energy-based models (hypergraph Markov networks) and GNNs. To address the intractability of the loss that arises from this combination, we endow our optimization with a loss upper bound using a finite-sample unbiased Markov Chain Monte Carlo estimator. Our experiments show that MHM-GNN produces better unsupervised representations than existing approaches from the literature.
  5. Recent work has demonstrated that geometric deep learning methods such as graph neural networks (GNNs) are well suited to address a variety of reconstruction problems in high-energy particle physics. In particular, particle tracking data are naturally represented as a graph by identifying silicon tracker hits as nodes and particle trajectories as edges; given a set of hypothesized edges, edge-classifying GNNs identify those corresponding to real particle trajectories. In this work, we adapt the physics-motivated interaction network (IN) GNN toward the problem of particle tracking in pileup conditions similar to those expected at the high-luminosity Large Hadron Collider. Assuming idealized hit filtering at various particle momenta thresholds, we demonstrate the IN’s excellent edge-classification accuracy and tracking efficiency through a suite of measurements at each stage of GNN-based tracking: graph construction, edge classification, and track building. The proposed IN architecture is substantially smaller than previously studied GNN tracking architectures; this is particularly promising as a reduction in size is critical for enabling GNN-based tracking in constrained computing environments. Furthermore, the IN may be represented as either a set of explicit matrix operations or a message passing GNN. Efforts are underway to accelerate each representation via heterogeneous computing resources towards both high-level and low-latency triggering applications.