As graph data grows increasingly complex, training graph neural networks (GNNs) on large-scale datasets presents significant challenges, including computational resource constraints, data redundancy, and transmission inefficiencies. While existing graph condensation techniques have shown promise in addressing these issues, they are predominantly designed for single-label datasets, where each node is associated with exactly one class label. However, many real-world applications, such as social network analysis and bioinformatics, involve multi-label graph datasets, where a single node can carry several related labels. To address this gap, we extend traditional graph condensation approaches to accommodate multi-label datasets by modifying synthetic dataset initialization and the condensation optimization objective. Through experiments on eight real-world multi-label graph datasets, we demonstrate the effectiveness of our method. In our experiments, the GCond framework, combined with K-Center initialization and binary cross-entropy loss (BCELoss), generally achieves the best performance. This benchmark for multi-label graph condensation not only enhances the scalability and efficiency of GNNs on multi-label graph data but also offers substantial benefits for diverse real-world applications. Code is available at https://github.com/liangliang6v6/Multi-GC.
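For readers unfamiliar with the two modifications, a minimal sketch follows. It assumes a hypothetical `gnn` model and a synthetic adjacency `adj_syn`, and shows only where K-Center initialization and BCELoss enter; the full GCond gradient-matching pipeline is omitted.

```python
import torch
import torch.nn.functional as F

def kcenter_init(feat: torch.Tensor, labels: torch.Tensor, n_syn: int):
    """K-Center initialization (sketch): greedily pick n_syn nodes that are
    maximally spread out in feature space, and copy their features and
    multi-hot label vectors into the synthetic dataset."""
    idx = [torch.randint(len(feat), (1,)).item()]           # random first center
    dist = torch.cdist(feat, feat[idx]).min(dim=1).values   # distance to nearest center
    for _ in range(n_syn - 1):
        nxt = dist.argmax().item()                          # farthest point becomes next center
        idx.append(nxt)
        dist = torch.minimum(dist, torch.cdist(feat, feat[[nxt]]).squeeze(1))
    return feat[idx].clone().requires_grad_(True), labels[idx].float()

def condense_step(gnn, adj_syn, feat_syn, labels_syn, optimizer):
    """One optimization step on the synthetic graph, using a multi-label
    objective (BCE over all label dimensions) in place of cross-entropy."""
    optimizer.zero_grad()
    logits = gnn(feat_syn, adj_syn)                         # hypothetical GNN forward pass
    loss = F.binary_cross_entropy_with_logits(logits, labels_syn)
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key difference from single-label condensation is that `labels_syn` is a multi-hot matrix rather than a class-index vector, which is why BCELoss replaces cross-entropy.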
VISION: Robust and Interpretable Code Vulnerability Detection Leveraging Counterfactual Augmentation
Automated detection of vulnerabilities in source code is an essential cybersecurity challenge, underpinning trust in digital systems and services. Graph Neural Networks (GNNs) have emerged as a promising approach as they can learn the structural and logical code relationships in a data-driven manner. However, the performance of GNNs is severely limited by training data imbalances and label noise. GNNs can often learn "spurious" correlations due to superficial code similarities in the training data, leading to detectors that do not generalize well to unseen real-world data. In this work, we propose a new unified framework for robust and interpretable vulnerability detection, which we call VISION, to mitigate spurious correlations by systematically augmenting a counterfactual training dataset. Counterfactuals are samples with minimal semantic modifications that have opposite prediction labels. Our complete framework includes: (i) generating effective counterfactuals by prompting a Large Language Model (LLM); (ii) targeted GNN model training on synthetically paired code examples with opposite labels; and (iii) graph-based interpretability to identify the truly crucial code statements relevant for vulnerability predictions while ignoring the spurious ones. We find that our framework reduces spurious learning and enables more robust and generalizable vulnerability detection, as demonstrated by improvements in overall accuracy (from 51.8% to 97.8%), pairwise contrast accuracy (from 4.5% to 95.8%), and worst-group accuracy (from 0.7% to 85.5%) on the widely popular Common Weakness Enumeration (CWE)-20 vulnerability. We also demonstrate improvements using our proposed metrics, namely, intra-class attribution variance, inter-class attribution distance, and node score dependency. We provide a new benchmark for vulnerability detection, CWE-20-CFA, comprising 27,556 samples from functions affected by the high-impact and frequently occurring CWE-20 vulnerability, including both real and counterfactual examples. Furthermore, our approach enhances the societal objectives of transparent and trustworthy AI-based cybersecurity systems through interactive visualization for human-in-the-loop analysis.
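The abstract does not define the proposed attribution metrics precisely; one plausible reading, assuming a hypothetical matrix `attr` of per-sample attribution vectors and binary labels `y`, is sketched below.

```python
import numpy as np

def intra_class_attribution_variance(attr: np.ndarray, y: np.ndarray) -> float:
    """Mean variance of attribution vectors within each class: lower values
    suggest the model explains same-class samples consistently."""
    return float(np.mean([attr[y == c].var(axis=0).mean() for c in np.unique(y)]))

def inter_class_attribution_distance(attr: np.ndarray, y: np.ndarray) -> float:
    """Distance between mean attributions of the two classes (vulnerable vs.
    benign): higher values suggest the classes rely on distinct evidence."""
    centroids = np.stack([attr[y == c].mean(axis=0) for c in np.unique(y)])
    return float(np.linalg.norm(centroids[0] - centroids[1]))
```

Under this reading, reduced spurious learning should show up as lower intra-class variance together with higher inter-class distance.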
- Award ID(s): 2340006
- PAR ID: 10655321
- Publisher / Repository: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES) 2025
- Date Published:
- Journal Name: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society
- Volume: 8
- Issue: 1
- ISSN: 3065-8365
- Page Range / eLocation ID: 812 to 823
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Training the deep neural networks that dominate NLP requires large datasets. These are often collected automatically or via crowdsourcing, and may exhibit systematic biases or annotation artifacts. By the latter we mean spurious correlations between inputs and outputs that do not represent a generally held causal relationship between features and classes; models that exploit such correlations may appear to perform a given task well, but fail on out-of-sample data. In this paper, we evaluate the use of different attribution methods for aiding identification of training data artifacts. We propose new hybrid approaches that combine saliency maps (which highlight important input features) with instance attribution methods (which retrieve training samples influential to a given prediction). We show that this proposed training-feature attribution can be used to efficiently uncover artifacts in training data when a challenging validation set is available. We also carry out a small user study to evaluate whether these methods are useful to NLP researchers in practice, with promising results. We make code for all methods and experiments in this paper available.
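A minimal sketch of how such a hybrid might be wired up, assuming a TracIn-style gradient dot product as the instance-attribution component and plain input-gradient saliency; this is an illustration, not the authors' exact formulation, and `model`, `loss_fn`, and the tensor arguments are hypothetical.

```python
import torch

def training_feature_attribution(model, loss_fn, test_x, test_y, train_x, train_y):
    """Score each (training sample, input feature) pair by combining instance
    attribution (gradient similarity between the test loss and each training
    loss) with a saliency map over the training example's features."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_test = torch.autograd.grad(loss_fn(model(test_x), test_y), params)
    scores = []
    for x, y in zip(train_x, train_y):
        x = x.clone().unsqueeze(0).requires_grad_(True)
        loss = loss_fn(model(x), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, [x] + params)
        saliency = grads[0].abs().squeeze(0)                # feature-level saliency
        influence = sum((gt * g).sum() for gt, g in zip(g_test, grads[1:]))
        scores.append(influence * saliency)                 # influential sample x salient features
    return torch.stack(scores)                              # [n_train, n_features]
```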
Training the large deep neural networks that dominate NLP requires large datasets. Many of these are collected automatically or via crowdsourcing, and may exhibit systematic biases or annotation artifacts. By the latter, we mean correlations between inputs and outputs that are spurious, insofar as they do not represent a generally held causal relationship between features and classes; models that exploit such correlations may appear to perform a given task well, but fail on out-of-sample data. In this paper, we propose methods to facilitate identification of training data artifacts, using new hybrid approaches that combine saliency maps (which highlight important input features) with instance attribution methods (which retrieve training samples influential to a given prediction). We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data, and we use it to identify previously unreported artifacts in a few standard NLP datasets. We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice, with promising results. We make code for all methods and experiments in this paper available.
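Building on such scores, a hedged usage example (with a hypothetical `scores` matrix like the one returned above and hypothetical `feature_names`) shows how aggregated training-feature attribution could surface candidate artifact tokens.

```python
import numpy as np

def top_artifact_candidates(scores: np.ndarray, feature_names: list, k: int = 10):
    """Aggregate attribution mass over many predictions and surface the
    features (e.g., tokens) that are consistently influential; recurring,
    class-correlated tokens are candidate annotation artifacts."""
    agg = np.abs(scores).sum(axis=0)         # total attribution mass per feature
    top = np.argsort(agg)[::-1][:k]          # highest-mass features first
    return [(feature_names[i], float(agg[i])) for i in top]
```

A human reviewer would then inspect the top-ranked tokens against a challenging validation set to decide which are genuine signal and which are artifacts.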
Graph Neural Networks (GNNs) are a popular machine learning framework for solving various graph processing applications. This framework exploits both the graph topology and the feature vectors of the nodes. One of the important applications of GNNs is the semi-supervised node classification task, where the accuracy depends on (i) the number and (ii) the choice of the training nodes. In this article, we demonstrate that increasing the training nodes by selecting nodes from the same class that are spread out across non-contiguous subgraphs can significantly improve the accuracy. We accomplish this by presenting a novel input intervention technique that can be used in conjunction with different GNN classification methods to increase the number of non-contiguous training nodes and thereby improve the accuracy. We also present an output intervention technique to identify misclassified nodes and relabel them with their potentially correct labels. We demonstrate on real-world networks that our proposed methods, both individually and collectively, significantly improve the accuracy in comparison to the baseline GNN algorithms. Both methods are agnostic to the underlying GNN: apart from the initial set of training nodes generated by the baseline GNN methods, our techniques need no extra knowledge about the classes of the nodes. Thus, our methods are modular and can be used as pre- and post-processing steps with many of the currently available GNN methods to improve their accuracy.
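A minimal sketch of the input intervention, under the assumption that "non-contiguous subgraphs" are connected components containing no current training node; the paper's exact selection criterion may differ, and `pred_labels` (model-predicted classes) is a stand-in.

```python
import networkx as nx

def expand_training_nodes(G: nx.Graph, pred_labels: dict, train_nodes: set,
                          per_class: int = 5) -> set:
    """Input intervention (sketch): add candidate training nodes drawn from
    subgraphs that contain no current training node, so each class's labeled
    nodes end up spread across non-contiguous regions of the graph."""
    new_nodes = set()
    for comp in nx.connected_components(G):
        if comp & train_nodes:
            continue                                   # region already covered
        by_class = {}
        for v in comp:                                 # group this region by predicted class
            by_class.setdefault(pred_labels[v], []).append(v)
        for nodes in by_class.values():
            new_nodes.update(nodes[:per_class])        # a few candidates per class
    return train_nodes | new_nodes
```

Because the function only reads the graph and predicted labels, it can wrap any baseline GNN as a pre-processing step, matching the modularity claim above.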
Graph-based anomaly detection is pivotal in diverse security applications, such as fraud detection in transaction networks and intrusion detection for network traffic. Standard approaches, including Graph Neural Networks (GNNs), often struggle to generalize across shifting data distributions. For instance, we observe that on a real-world eBay transaction dataset, fraud detection accuracy declined by over 50% when data from only a single new day was added to the graph, due to data distribution shifts. This highlights a critical vulnerability in purely data-driven approaches. Meanwhile, real-world domain knowledge, such as "simultaneous transactions in two locations are suspicious," is more stable and is already a common component of real-world detection strategies. To explicitly integrate such knowledge into data-driven models such as GCNs, we propose KnowGraph, which combines domain knowledge with data-driven learning for enhanced graph-based anomaly detection. KnowGraph comprises two principal components: (1) a statistical learning component that utilizes a main model for the overarching detection task, augmented by multiple specialized knowledge models that predict domain-specific semantic entities; and (2) a reasoning component that employs probabilistic graphical models to execute logical inferences based on model outputs, encoding domain knowledge through weighted first-order logic formulas. In addition, KnowGraph leverages the Predictability-Computability-Stability (PCS) framework for veridical data science to estimate and mitigate prediction uncertainties. Empirically, KnowGraph has been rigorously evaluated in two significant real-world scenarios: collusion detection in the online marketplace eBay and intrusion detection within enterprise networks. Extensive experiments on these large-scale real-world datasets show that KnowGraph consistently outperforms state-of-the-art baselines in both transductive and inductive settings, achieving substantial gains in average precision when generalizing to completely unseen test graphs. Further ablation studies demonstrate the effectiveness of the proposed reasoning component in improving detection performance, especially under extreme class imbalance. These results highlight the potential of integrating domain knowledge into data-driven models for high-stakes, graph-based security applications.
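KnowGraph's actual reasoning component uses probabilistic graphical models; the simplified sketch below instead fuses a main-model probability with knowledge-model outputs via weighted rules in an MLN-flavored way, purely to illustrate the idea. All names and the example rule are hypothetical.

```python
import math

def knowgraph_score(p_main: float, knowledge: dict, rules: list) -> float:
    """Reasoning step (sketch): start from the main detector's log-odds and
    shift the evidence by the weight of every domain rule that fires. Each
    rule is a (weight, predicate) pair, where the predicate inspects the
    knowledge-model outputs (e.g., 'simultaneous transactions in two
    locations are suspicious')."""
    score = math.log((p_main + 1e-9) / (1 - p_main + 1e-9))  # main-model log-odds
    for weight, predicate in rules:
        if predicate(knowledge):
            score += weight                  # fired rules add weighted evidence
    return 1 / (1 + math.exp(-score))        # back to a probability

# Hypothetical rule: trust the geo-velocity knowledge model when confident.
rules = [(2.0, lambda k: k.get("simultaneous_locations", 0.0) > 0.5)]
print(knowgraph_score(0.3, {"simultaneous_locations": 0.9}, rules))  # ~0.76
```

The weighted-rule structure is what keeps the knowledge stable under distribution shift: the rules do not change day to day even when the data distribution does.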