<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Fine-Grained Graph Rationalization</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>11/10/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10653700</idno>
					<idno type="doi">10.1145/3746252.3761307</idno>
					
					<author>Zhe Xu</author><author>Menghai Pan</author><author>Yuzhong Chen</author><author>Huiyuan Chen</author><author>Yuchen Yan</author><author>Mahashweta Das</author><author>Hanghang Tong</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Not Available]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Rationale refers to a subset of the input features that play a crucial role in model predictions for downstream tasks <ref type="bibr">[8,</ref><ref type="bibr">18,</ref><ref type="bibr">37,</ref><ref type="bibr">49,</ref><ref type="bibr">55,</ref><ref type="bibr">77,</ref><ref type="bibr">84]</ref>. In the context of graph machine learning, graph rationale is defined as a subgraph of the input graph containing the most taskrelevant semantics. The application of graph rationale is broad; for example, it can greatly enhance model performance for graph-level tasks <ref type="bibr">[33,</ref><ref type="bibr">55,</ref><ref type="bibr">61,</ref><ref type="bibr">67,</ref><ref type="bibr">81]</ref> by identifying the key components of the input graph. Additionally, the discovery of rationales can improve model explainability <ref type="bibr">[37]</ref>, as it highlights the parts of the input graph that significantly contribute to the final prediction.</p><p>Existing graph rationalization solutions <ref type="bibr">[37,</ref><ref type="bibr">55]</ref> employ a trainable augmenter to execute the rationale/environment decomposition. In this process, a node/edge mask is generated by the augmenter to decompose the given graph into a rationale graph and an environment graph. Inspired by the content-style decomposition <ref type="bibr">[27]</ref>, the key idea of graph rationalization is to preserve the utility of the graph rationale when combined with changing environment graphs (see Figure <ref type="figure">1</ref>). To achieve this, a technique named intervention is used, where the environment graph interacts with the rationale graph.</p><p>The intervention mechanism (named intervener) is essential in the graph rationalization process, as it must accurately represent the interaction between the rationale and the environment. Intuitively, the intervener should work in an adversarial behavior against the augmenter mentioned above, a point largely overlooked in the existing literature. If the intervener is more powerful, it can capture more detailed interactions between the rationale and environment subgraphs. Given such a powerful intervener, the augmenter is compelled to minimize these interactions between the graph rationale and the environment to obtain a "purer" graph rationale.</p><p>Unfortunately, existing works develop interveners in a coarse and non-parametric manner. After performing rationale/environment decomposition on the graph data, they compute graph-level embeddings for the rationale and environment subgraphs. The intervention is then designed as an interaction between these graph-level embeddings. For example, <ref type="bibr">[37]</ref> adds the environment embedding into the rationale embedding as the intervened rationale embedding; <ref type="bibr">[55]</ref> defines the intervened prediction as the Hadamard product between the predictions based on the rationale subgraph and the environment subgraph. We argue that such a graph-level non-parametric intervention is insufficient to fully represent the interaction between the rationale and environment graphs effectively.</p><p>In response to this limitation, we propose a fine-grained, parametric intervention named FIne-grained Graph rationalization <ref type="bibr">(FIG)</ref>. FIG draws inspiration from the self-attention module in the Transformer model, which captures interactions between input tokens. Building upon insights from Transformer <ref type="bibr">[51]</ref> and its linear variant Linformer <ref type="bibr">[54]</ref>, FIG formulates the interaction between the rationale and environment subgraphs at the node-level or the virtual node-level. The two variants are named FIG-N and FIG-VN. Furthermore, to maximize the effectiveness of the intervention, we formulate a min-max game involving the node encoder, augmenter, intervener, and predictor, forcing the rationale subgraph to be as informative as possible.</p><p>We conduct comprehensive experiments on 7 graph-level benchmarks to evaluate the proposed approach and compare FIG-N/VN against 13 state-of-the-art baseline methods. The results demonstrate that FIG and its variants outperform the baseline methods, validating their superior performance. Our primary contributions in this paper are summarized as follows: (1) we address the graph rationalization problem via a fine-grained node/virtual node-level model, FIG ; (2) a min-max game is proposed to boost the effectiveness of the proposed intervener; (3) extensive experiments covering 13 baseline methods and 7 real-world datasets demonstrate the efficacy of the proposed method.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Preliminaries</head><p>Section 2.1 introduces the notations used throughout this paper. Then, the classic Transformer architecture is introduced in Section 2.2, whose self-attention module is an important building block of our model. Last but not least, we introduce the overall ideas of existing graph rationalization works in Section 2.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Notations</head><p>We adopt the following notation conventions: bold uppercase letters for matrices and tensors (e.g., A), bold lowercase letters for column vectors (e.g., u), lowercase and uppercase letters in regular font for scalars (e.g., &#119889;, &#119870;), and calligraphic letters for sets (e.g., T ). To index vectors/matrices/tensors, we follow the syntax from NumPy (0-based). Specifically, A[&#119901;, :] and A[:, &#119902;] represent the &#119901;-th row and the &#119902;-th column of matrix A respectively; A[&#119901;, &#119902;] represents the entry at the &#119901;-th row and the &#119902;-th column. Similarly, u[&#119901;] denotes the &#119901;-th entry of vector u. In addition, the slicing syntax for vectors/matrices/tensors is used. For example, for a matrix A, A[&#119894; : &#119895;, :] denotes rows from the &#119894;-th row (included) to the &#119895;-th row (excluded) and A[:, : &#119896;] denotes all the columns before the &#119896;-th column. &#8868; denotes the transpose of matrices and vectors. &#8857; represents the Hadamard product, and &#8226; denotes function composition. We use || for the concatenation operation, and the specific dimension of concatenation will be clarified based on the context.</p><p>An attributed graph can be represented as G = (A, X, E), where A &#8712; R &#119899;&#215;&#119899; is the adjacency matrix, X &#8712; R &#119899;&#215;&#119889; &#119883; is the node feature matrix, and E &#8712; R &#119899;&#215;&#119899;&#215;&#119889; &#119864; is the edge feature tensor. Here, &#119899; denotes the number of nodes, and &#119889; &#119883; (&#119889; &#119864; ) represents the dimensions of node (edge) features, respectively. This paper assumes the node and edge feature dimensions are the same (i.e., &#119889; &#119883; = &#119889; &#119864; = &#119889;) for brevity; if they differ, a fully connected layer can map them into a shared feature space. Our main focus in this paper is on graph property prediction tasks. The ground truth of a graph is represented by &#119910;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Graph Transformer</head><p>The core modules of the Transformer architecture <ref type="bibr">[51]</ref> are the self-attention layer and the feed-forward network layer. Given the input as a sequence of symbol representations H &#8712; R &#119899;&#215;&#119889; &#119867; , it is first transformed into the query, key, and value matrices as</p><p>where</p><p>For brevity, we set &#119889; &#119867; = &#119889; &#119876; = &#119889; &#119870; = &#119889; &#119881; = &#119889;. Then, the self-attention module is,</p><p>Typically, the non-linearity &#120590; is Softmax. The feed-forward network (FFN) updates the representations H as:</p><p>Optional techniques such as normalization <ref type="bibr">[3,</ref><ref type="bibr">24]</ref>, dropout <ref type="bibr">[47]</ref>, and multi-head attention <ref type="bibr">[51]</ref> are omitted here for brevity. While Transformers were initially designed for sequence or set data with positional encoding, numerous techniques have since been introduced to adapt Transformers for graph data. Interested readers are referred to <ref type="bibr">[41]</ref> for a detailed taxonomy. Interestingly, it is well-known in both the graph learning <ref type="bibr">[9]</ref> and natural language processing communities <ref type="bibr">[54,</ref><ref type="bibr">78]</ref> that, from the message-passing perspective, the key idea of the Transformer architecture is to construct a weighted complete graph, whose adjacency matrix is P = &#120590; QK &#8868; &#8730; &#119889; with complexity &#119874; (&#119899; 2 ).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Invariant Rationale Discovery on Graphs</head><p>The graph rationale is a subgraph that encodes most downstream task-relevant semantics. A typical example is the functional groups in polymer graphs <ref type="bibr">[37,</ref><ref type="bibr">55]</ref>, which fundamentally determines the chemical property of polymers. Mathematically, a given graph is decomposed into a rationale graph and an environment graph:</p><p>Commonly, the graph embeddings on G ra and G env are computed as h ra and h env . To ensure the rationale graph is invariant w.r.t. the prediction results when combined with different environments, a utility loss is minimized given the rationale embedding h ra intervened by the environment embedding henv , i.e.,</p><p>Here, henv could be either from the same graph (i.e., henv = h env ), or could be environment embeddings from other graphs, such as those in the batch.</p><p>A key difference among existing methods lies in the intervention operation, of which we mention two: (1) GREA <ref type="bibr">[37]</ref> designs the intervention as sum, i.e., h ra + henv and (2) DIR <ref type="bibr">[55]</ref> designs an element-wise intervention: &#120579; pred (h ra ) &#8857; Sigmoid(&#120579; pred ( henv )), where &#8857; is the Hadamard product and &#120579; pred is a predictor. The above intervention is graph-level because h ra and fenv are graph embeddings. In Figure <ref type="figure">2a</ref>, an overview of the GREA <ref type="bibr">[37]</ref> is presented. As a comparison, we aim to design the intervention at finer grains (e.g., node level) to handle the interaction between rationale and environment graphs, which will be detailed as follows.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Proposed Model</head><p>This section introduces the proposed method, FIG. Figure <ref type="figure">2b</ref> provides an overview of FIG, highlighting its four parametric modules: the encoder, augmenter, intervener, and predictor. Encoder. The encoder, denoted as &#120579; enc : G &#8594; R &#119899;&#215;&#119889; , accepts a graph data as input and outputs a node embedding matrix. There are various graph encoders available, such as graph neural networks (GNNs) <ref type="bibr">[57]</ref> and graph Transformers <ref type="bibr">[41]</ref>. From the methodology perspective, the encoder module is not the main contribution of this paper, so in this section, we do not specify a specific graph encoder &#120579; enc . Predictor. The predictor, denoted as &#120579; pred : R &#119889; &#8594; R &#119888; takes as input a graph embedding and outputs a prediction vector/scalar. For graph regression tasks, &#119888; = 1; for graph classification tasks, &#119888; is the number of classes. A typical predictor is a multi-layer perceptron (MLP) with appropriate activation functions. Details of the encoder and predictor in our implementation are presented in Section 4.</p><p>In subsequent subsections, we will elaborate on the augmenter and intervener, two essential modules. Their detailed designs derive two variants of the proposed FIG.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Node-Level Variant: FIG-N</head><p>Node-level augmenter. The augmenter is a critical module of the proposed FIG. For the node-level variant, termed FIG-N, the augmenter's primary function is decomposing the node set into two distinct subsets: rationale nodes and environment nodes. This decomposition is operated by parameterizing the node-level augmenter as a learnable node partitioner, denoted by &#120579; aug-N ,</p><p>whose input is the node embedding matrix H &#8712; R &#119899;&#215;&#119889; , and its output is a partition vector m &#8712; [0, 1] &#119899; . MLP is a multi-layer perceptron. Each entry within m, such as m[&#119894;], denotes the probability of the &#119894;-th node being categorized as a rationale node.</p><p>For the partition vector m, its top-&#119870; entries are indexed as idx ra = argtopK(m) which is used to index the rationale nodes from the node embedding matrix H; naturally, the remaining nodes are categorized as the environment nodes whose indices are idx env = {1, . . . , &#119899;}/idx ra . &#119870; is a hyper-parameter whose impact is studied in Section 4.5. Also, in our implementation, we use a soft argtopK operation to maintain differentiability, whose details are in Section B.</p><p>Using the indices mentioned above, rationale and environment embeddings, denoted as H ra and H env , respectively, can be extracted from the node embedding matrix H:</p><p>Node-level intervener. The design of the fine-grained intervener draws inspiration from the Transformer architecture <ref type="bibr">[51]</ref>. Explicitly, the node-level intervener &#120601; is presented as,</p><p>In this representation, the operator || concatenates along the first dimension of the matrices H ra and H env . We dub the Eqs. ( <ref type="formula">1</ref>)-( <ref type="formula">3</ref>) as Transformer and P is the intermediate attention matrix from the self-attention layer (Eq. (6b)). Here, the self-attention module models the interactions between the rational nodes H ra and the environment nodes H env . &#120601; includes all the parameters of the Attn (Eq. (6b)) and FFN (Eq. ( <ref type="formula">3</ref>)) modules. In some contexts where the attention matrix P is not explicitly used as an output, input/output of the intervener &#120601; can be presented as</p><p>The utility loss is computed as</p><p>where L task is the task-specific objective; e.g., it is the mean squared error for regression or the cross-entropy for classification. Notice that we select the top-&#119870; rows of the output of &#120601;, i.e., the intervened rationale part and ideally, we expect</p><p>As Figure <ref type="figure">1</ref> shows, invariant rationale discovery is to find the graph rationale so that the utility loss is minimized given changing environments. Thus, the utility objective is</p><p>where Henv is the node embeddings from the changing environments. In practical implementations, Henv is the environment node embeddings from other graphs in the mini-batch. Additionally, to fully utilize the rich interactions from the fine-grained intervention module, we apply the following regularization term, where P &#8712; R &#119899;&#215;&#119899; is the self-attention matrix from Eq. (6b),</p><p>The binary s vector is used to designate whether a particular row of the matrix H ra ||H env originates from the rationale nodes or the environment nodes. The underlying notion of the regularization term Eq. ( <ref type="formula">8</ref>) is to impose penalties on interactions between the rationale nodes and the environment nodes. Namely, these two terms s &#8868; P(1-s) and (1-s) &#8868; Ps denote the total weights on the links (i.e., cut) between the rationale and environment subgraphs. To handle the changing environments, we introduce an additional regularization term on the changing environments as L reg (H ra || Henv ) where Henv is the environment node embeddings from another graph within the same mini-batch. Then, the total regularization term is</p><p>and the final objective function is L util + &#120573;L reg . To fully harness the capabilities of the fine-grained parametric intervener, it is crucial to note-as highlighted in the introduction-that the behavior of the intervener &#120601; operates in an adversarial fashion to the other modules. As a result, we formulate a min-max game that involves &#120579; = {&#120579; enc , &#120579; aug-N , &#120579; pred } and &#120601; as, min</p><p>Here, the intervener &#120601; is trained to decrease the utility of the graph rationale by facilitating interactions between the rationale nodes and the environment nodes. Conversely, the encoder, augmenter, and predictor (i.e., &#120579; ) are optimized in an opposing manner to the intervener's objectives.</p><p>Remark 1. The min-max objective is necessary. If it is replaced with the min objective, trivial solutions exist. E.g., if feature dimension &#119889; 3 concatenate H ra ||H env and H ra || Henv ; 4 compute L util and L reg via Eqs. (7), (9), and (10); 5 update &#120579; via gradient descent with &#120597; ( L util +&#120573; L reg ) &#120597;&#120579; ; 6 update &#120601; via gradient ascent with &#120597; ( L util +&#120573; L reg ) &#120597;&#120601; ; is large enough, &#120601; could be an identity function with P = I, so that (1) L reg = 0 and (2) we would always have &#120601; (H ra ||H env ) [: &#119870;] = H ra . However, in that case, the pipeline degenerates to a regular graph classification/regression model, with no intervener. Complexity of FIG-N. As the encoder &#120579; enc and the predictor &#120579; pred are off-the-shelf, the FIG-N introduces two new modules: the nodelevel augmenter, &#120579; aug-N , and the Transformer-based intervener, &#120601;. Notably, despite these additions, the increase in the number of parameters remains modest. The parameters for &#120579; aug-N originate from the MLP defined in Eq. (4). In a configuration where the MLP has 3 layers with a feature dimension of &#119889;, the parameter count is &#119874; (2&#119889; 2 ). The intervener &#120601;, driven by the Transformer layer in Eq. (6a), has its parameters confined to &#119874; (3&#119889; 2 + 2&#119889; 2 ) = &#119874; (5&#119889; 2 ), owing to its query, key, value projection matrices and the feedforward net (FFN from Eq. (3), typically a 3-layered MLP). A step-by-step algorithm for FIG-N is in Algorithm 1. In test phase, the output of &#120579; pred &#8226; Readout &#8226; &#120601; (H ra ||H env ) is evaluated. Algorithm 2: FIG-VN single training step for graph G Input : a labeled graph (G, &#119910;), a sampled graph G from the same batch as G, &#120579; = {&#120579; enc , &#120579; aug-VN , &#120579; pred }, &#120601;; Output : updated &#120579; and &#120601;; 1 compute H = &#120579; enc (&#119866;) and H = &#120579; enc ( G); 2 compute (H ra , H env ) = &#120579; aug-VN (H), ( Hra , Henv ) = &#120579; aug-VN ( H) via Eqs. (12), (13), and (14); 3 concatenate H ra ||H env and H ra || Henv ; 4 compute L util and L reg via Eqs. (7), (9), and (10); 5 update &#120579; via gradient descent with &#120597; ( L util +&#120573; L reg ) &#120597;&#120579; ; 6 update &#120601; via gradient ascent with &#120597; ( L util +&#120573; L reg ) &#120597;&#120601; ; 3.2 Virtual Node-Level Variant: FIG-VN In the previously introduced FIG-N, its augmenter decomposes the nodes into rationale nodes and environment nodes via a trainable node partitioner &#120579; aug-N so that the interaction is conducted at the node level (Eqs. (6a)) whose dense attention matrix's complexity is quadratic in terms of the node numbers. This section extends this idea to extract the graph rationale at the virtual node level, which has a lower computation complexity (linear in terms of the node numbers) compared to FIG-N and provides an intermediate intervention granularity between the node-level model (FIG <ref type="figure">-N</ref>) and the graph-level model (GREA <ref type="bibr">[37]</ref>).</p><p>Virtual node-level augmenter. Our idea is partly inspired by the speedup technique from Linformer <ref type="bibr">[54]</ref>, which reformulates both the attention matrix and node (token) embedding matrix to dimensions of R &#119899;&#215;&#119903; and R &#119903; &#215;&#119889; , respectively. This reformulation ensures that their multiplication scales linearly with the number of nodes (tokens) &#119899; because &#119903; &#8810; &#119899; and &#119889; represents the feature dimension.</p><p>Building upon this insight, given node embeddings H from the encoder, the virtual node embeddings are:</p><p>Here, the row-wise applied Softmax function, along with Softmax(W N-VN ) &#8712; R &#119903; &#215;&#119899; , yields a trainable matrix assigning &#119899; nodes to &#119903; virtual nodes, where &#119903; acts as a tunable hyper-parameter. More specifically, Softmax(W N-VN ) can be viewed as the adjacency matrix between &#119903; virtual nodes and &#119899; original nodes; the virtual node embedding, H VN , is the aggregation of original node embeddings after 1-step message passing. In experiments, we set &#119903; = 8. As all the virtual node embeddings are learned, a subset of the &#119903; virtual nodes can be designated as rationale virtual nodes, whose rationality is data-driven by the intervention procedure discussed in subsequent subsections. For brevity, the initial &#119870; virtual nodes are deemed as rationale virtual nodes, while the last &#119903; -&#119870; nodes are considered the environment virtual nodes; their embeddings are:</p><p>Like the FIG-N, here &#119870; is a hyperparameter whose impact is studied in Section 4.5. The parameter of &#120579; aug-VN is only W N-VN .</p><p>Virtual node-level intervener. This section discusses the design of a virtual node-level intervener, similar to the intervener introduced in Section 3.1. The main difference is that the intervention here functions on the virtual nodes rather than the given real ones.</p><p>We recall the rationale and environment virtual node embeddings</p><p>Thanks to the property of the Transformer that it can process sets with variable size, the design of the virtual node-level intervener &#120601; is similar to the nodelevel intervener as H inter , P = Transformer(H ra ||H env ) or short as H inter = &#120601; (H ra ||H env ) if the attention matrix P is not used. Notably, for FIG-VN, P &#8712; R &#119903; &#215;&#119903; describes the interaction among the &#119903; virtual nodes, whose complexity is only &#119874; (&#119903; 2 ). FIG-VN optimization objective. The output of &#120579; pred &#8226; Readout &#8226; &#120601; (&#8226;) is used for minimizing the utility loss L util = L util (H ra ||H env )+ &#120572; L util (H ra || Henv ), where L util (H ra ||H env ) = L task (&#120579; pred &#8226; Readout &#8226; &#120601; (H ra ||H env ) [: &#119870;], y) and L util (H ra || Henv ) is defined similarly. For modeling the changing environment, Henv is the virtual node embeddings from other graphs in the mini-batch. Additionally, the previously proposed regularization term Eq, (8) can be extended to the virtual node-level variant: L reg (H ra ||H env ) = s &#8868; P(1s) + (1s) &#8868; Ps. The total regularization term, considering the changing environment Henv , is L reg = L reg (H ra ||H env ) + L reg (H ra || Henv ). As the P depicts interactions among virtual nodes, we construct the rationale/environment indicator vector s analogously to Eq. (9). Put everything together, and the optimization objective of FIG-VN is min &#120579; max &#120601; L util + &#120573;L reg , where &#120579; = {&#120579; enc , &#120579; aug-VN , &#120579; pred }. Complexity of FIG-VN. As we mentioned the encoder &#120579; enc and the predictor &#120579; pred are off-the-shelf. Thus, the extra modules introduced by the FIG-VN are the virtual node-level augmenter &#120579; aug-VN and the Transformer-based intervener &#120601;. The parameters for &#120579; aug-VN originate from the matrix W N-VN , as defined in Eq. (12), whose complexity is &#119874; (&#119899;&#119903; ), where &#119899; denotes the number of nodes. For practical implementation purposes, &#119899; is pre-set; it is set to 10&#215; the average size of graphs from the dataset, and we truncate the input graphs if their sizes are larger than 10&#215; the average size. The intervener &#120601; parameters originate from the Transformer layer, outlined in Eq. (6a). The number of parameters here is &#119874; (5&#119889; 2 ), owing to its query, key, value projection matrices, and the feed-forward net (Eq. (3), typically a 3-layered MLP). A step-by-step algorithm for FIG-VN is in Algorithm 2. In test phase, the output of &#120579; pred &#8226; Readout &#8226; &#120601; (H ra ||H env ) is evaluated.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head><p>In this section, the datasets and baseline methods are detailed first. Subsequent subsections evaluate the effectiveness of FIG, supplemented by efficiency studies, ablation studies, sensitivity studies, training convergence, and a visualization of the attention matrix. The detected rationales are visualized in the Appendix, Section 4.6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Setup</head><p>In this paper, 7 publicly-accessible real-world datasets are used: (1) graph classification datasets molhiv <ref type="bibr">[23]</ref>, moltox21 <ref type="bibr">[23]</ref>, molbace <ref type="bibr">[23]</ref>, molbbbp <ref type="bibr">[23]</ref> and (2) graph regression datasets ZINC <ref type="bibr">[14]</ref>, AQSOL <ref type="bibr">[14]</ref>, and mollipo <ref type="bibr">[23]</ref>. We strictly follow the metrics and dataset split recommended by the given benchmarks. To be concrete, the area under the ROC curve (AUC) is the metric for datasets molhiv, moltox21, molbace, molbbbp; root-mean-square deviation (RMSE) is the metric for dataset mollipo; mean absolute error (MAE) is the metric for datasets ZINC and AQSOL. The detailed statistics of the datasets are given in Table <ref type="table">5</ref> (Appendix). We report the average result with the standard deviation in 10 runs 1 .</p><p>Our baseline methods include (1) 4 graph neural network models: GIN <ref type="bibr">[59]</ref>, GAT <ref type="bibr">[52]</ref>, GATv2 <ref type="bibr">[7]</ref>, and GatedGCN <ref type="bibr">[6]</ref> (2) 7 graph Transformers: GT <ref type="bibr">[13]</ref>, GraphiT <ref type="bibr">[40]</ref>, SAN <ref type="bibr">[29]</ref>, SAT <ref type="bibr">[9]</ref>, Graphormer <ref type="bibr">[74]</ref>, GraphTrans <ref type="bibr">[56]</ref>, GPS <ref type="bibr">[46]</ref>, and (3) 2 graph rationale discovery methods: DIR <ref type="bibr">[55]</ref> and GREA <ref type="bibr">[37]</ref>. <ref type="table">1</ref>. To ensure a fair comparison, some pre-trained models, such as the pre-trained Graphormer <ref type="bibr">[74]</ref>, are excluded. As DIR <ref type="bibr">[55]</ref> is designed to conduct 1 The code is available at <ref type="url">https://github.com/pricexu/FIG</ref> interventions on the label prediction vectors, it cannot be directly applied to graph regression tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Effectiveness Study</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>The effectiveness comparison between the proposed FIG-N, FIG-VN, and baseline methods is provided in Table</head><p>We have several observations. First </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Efficiency Study</head><p>A detailed comparison regarding the number of parameters and the FLOPs (floating point operations) is presented in Table <ref type="table">2</ref>, where we list 4 typical 5-layered encoders (GIN, SAT, GraphTrans, and GPS), and our proposed node-/virtual node-level augmenter, intervener modules (i.e., {&#120579; aug-N , &#120601; } and {&#120579; aug-VN , &#120601; }). The comparison shows that our proposed interveners are lightweight and only increase very minor computational costs.</p><p>We also compare the training efficiency ( iterations/second) of <ref type="table">3</ref> working with different encoders (GIN and GPS). The batch size is set as 32. We note that the inclusion of our proposed augmenter and intervener, represented as {&#120579; aug-N , &#120601; } or {&#120579; aug-VN , &#120601; }, introduces a slight reduction in training speed. That is because the proposed parametric augmenter and intervener increase the data pipeline steps, as presented in figure <ref type="figure">2b</ref>, and enlarge the computational graph for auto-gradient tools, such as PyTorch. Fortunately, the parameter count of the parametric augmenter and intervener is low, ensuring that the overall training speed of the model is not dramatically affected.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>FIG-N and FIG-VN in Table</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Ablation Study</head><p>We conducted an ablation study on the proposed models, FIG-N and FIG-VN. We designed two ablated variants as baselines: (</p><p>&#120579; enc &#8226; &#120579; pred which is a pure composition of the encoder &#120579; enc and the predictor &#120579; pred without any rationale discovery module. Many of the existing graph classifiers are in this form, and here we select the GraphGPS <ref type="bibr">[46]</ref>, which also serves as the backbone of our  <ref type="formula">10</ref>)) from the objective function. Our results in Table <ref type="table">4</ref> highlight that (1) equipped the proposed Transformer-based intervener, the model's performance improves across all the datasets; e.g., the AUC is improved from 79.6% to 84.3% (FIG-N) on the molbace dataset. ( <ref type="formula">2</ref>) With the proposed regularization term, the model's performance can be improved further; e.g., the AUC of the FIG-N is improved from 72.3% to 73.8% on the molbbbp dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Sensitivity Study</head><p>In this section, we carefully study the impact of hyperparameter &#119870; (from Eq. (5a), (5b), <ref type="bibr">(13)</ref>, and ( <ref type="formula">14</ref>)), which determines the ratio of the rationale and environment subgraphs. In our implementation, we set</p><p>. We evaluate the model performance across varying K on the molbace and molbbbp datasets in Figure <ref type="figure">3</ref>. We note that the model performance degrades if most nodes/virtual nodes are marked as the environment component. Similar performance degradation is observed if too many nodes/virtual nodes are marked as the rationale nodes (e.g., K = 0.875). That is because for a small K (e.g., 0.25), too few nodes/virtual nodes are involved for the downstream</p><p>(a) molbace (b) molbbbp tasks; for a large K (e.g., 1), the model degenerates to a vanilla graph encoder. The best performance is observed when K is 0.75 or 0.675.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6">Training Convergence</head><p>We </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.7">Attention Visualization</head><p>In this section, we aim to evaluate the significance of the regularization term by visualizing the attention matrix P of the intervener &#120601; in Figure <ref type="figure">5</ref>. For clarity in visualization, FIG-VN is chosen because its number of virtual nodes is predefined. Specifically, we set the number of virtual nodes &#119903; to 16 with &#119870; at 10, designating 10 virtual nodes to rationales and the remainder as environments. All the visualization results are obtained from the molbace dataset. It is worth noting that the attention matrix P is normalized row-wise by Softmax. From our observations, we highlight two insights:</p><p>&#8226; Interestingly, even in the absence of the regularization term, in Figure <ref type="figure">5</ref>(a), interactions between rationales and environments appear significantly weaker than those within the rationales themselves. One potential explanation is the changing of the environment. In optimizing the utility loss L util = L task (H ra ||H env ) + &#120572; L task (H ra || Henv ), the everchanging environment ( Henv ) might lead the model to minimize interactions between rationales and environments so that the utility of the rationale can be preserved. &#8226; The first observation aligns with the motivation to introduce the regularization term, which aims to penalize rationaleenvironment interactions. When the proposed regularization term (Eq. ( <ref type="formula">8</ref>)) is implemented, in Figure <ref type="figure">5</ref>(b), there is a noticeable decrease in rationale-environment interactions (the off-diagonal blocks in Figure <ref type="figure">5</ref>). As our earlier ablation study demonstrated, this leads to improved model performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Related Work</head><p>This section introduces two topics related to this paper: (1) invariant learning on graphs and (2) Transformer on graphs.  Invariant Learning on Graphs. Invariant learning, which has extensive applications in many fields such as computer vision <ref type="bibr">[2,</ref><ref type="bibr">8,</ref><ref type="bibr">20,</ref><ref type="bibr">27]</ref>, is gaining more attention in the community of graph machine learning <ref type="bibr">[4, 16, 34, 36, 39, 43, 50, 53, 58, 62-65, 68-72, 75, 76, 79, 80, 82, 83]</ref>. OOD-GNN <ref type="bibr">[31]</ref> applies random Fourier features to eliminate the statistical dependence between relevant and irrelevant graph representations. <ref type="bibr">[5]</ref> studies the graph invariant representation learning with a particular focus on the size discrepancies between the training/test graphs. DIR <ref type="bibr">[55]</ref> decomposes the given graph into a rationale subgraph and an environment subgraph; after that, it uses a graph classifier to conduct prediction on the above two graphs respectively, and its intervention is via the Hadamard product between the prediction vectors. Similarly, GREA <ref type="bibr">[37]</ref> conducts the rationale/environment decomposition at the node level, and its intervention operation is to directly add the environment graph embedding into the rationale graph embedding. In a similar vein, GIL <ref type="bibr">[32]</ref> decomposes the given graph into the invariant and variant subgraphs; based on that, it infers the environment label, as input of the invariance regularization term <ref type="bibr">[28]</ref>. Furthermore, invariant learning can also benefit graph contrastive learning <ref type="bibr">[25,</ref><ref type="bibr">26,</ref><ref type="bibr">48,</ref><ref type="bibr">85,</ref><ref type="bibr">86]</ref> to enhance data augmentation. Using invariant learning to address the OOD generalization challenges on graph <ref type="bibr">[12,</ref><ref type="bibr">19,</ref><ref type="bibr">21,</ref><ref type="bibr">30,</ref><ref type="bibr">49,</ref><ref type="bibr">73,</ref><ref type="bibr">77,</ref><ref type="bibr">87]</ref> is promising, but our paper does not concentrate on this setting, and we leave it as future work.</p><p>Transformer on Graphs. Transformer <ref type="bibr">[1,</ref><ref type="bibr">11,</ref><ref type="bibr">38,</ref><ref type="bibr">44,</ref><ref type="bibr">45,</ref><ref type="bibr">51]</ref> has achieved significant success in various domains, including natural language processing <ref type="bibr">[51]</ref>, computer vision <ref type="bibr">[22]</ref>, and more <ref type="bibr">[10,</ref><ref type="bibr">15,</ref><ref type="bibr">17,</ref><ref type="bibr">35,</ref><ref type="bibr">42,</ref><ref type="bibr">60,</ref><ref type="bibr">66]</ref>. In recent years, there has been a surge of interest in enhancing graph machine learning methods by leveraging the capabilities of Transformers. For example, GraphTrans <ref type="bibr">[56]</ref> concatenates the Transformer architecture after various messagepassing neural networks; GPS <ref type="bibr">[46]</ref> proposes a powerful layer which operates both a graph convolution layer and a Transformer layer in parallel; GT <ref type="bibr">[13]</ref> generalizes the attention module on graphs via concatenating the spectral vectors with the raw node features and computing the attention weights on all the existing edges; Graphormer <ref type="bibr">[74]</ref> encodes both the node centrality and node-pair shortest path into the Transformer architecture. A comprehensive survey about the graph Transformer can be found in <ref type="bibr">[41]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>This paper studies the invariant rationale discovery problem on graphs. Distinct from existing methods, our solutions (FIG-N and FIG-VN) are designed at more fine-grained levels, specifically at the node and virtual node levels, so that they can better model the interactions between the rationale and environment subgraphs. More importantly, we formulate the intervener and the other model modules in a min-max game, which can significantly improve the quality of the extracted graph rationales. Comprehensive experiments on 7 real-world datasets illustrate the effectiveness of the proposed method compared to 13 baseline methods. Our evaluation of the proposed augmenters' and interveners' efficiency shows that they can largely retain the overall training efficiency.</p><p>A Hardware and Implementations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>We implement FIG-N, FIG-VN, and all the baseline methods in</head><p>PyTorch and PyTorch-geometric. All the efficiency study results are from one NVIDIA Tesla V100 SXM2-32GB GPU on a server with 96 Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz processors and 1.5T RAM. We directly use those statistics when baseline methods have pre-existing results for specific datasets. In cases where such results are absent, we implement the models based on either the available code or details described in the associated publications. We first introduce some shared implementations among FIG-N/VN and baseline methods. The batch size is set as 32, and the weight decay is set as 0. The hidden dimension is set as 300. The dropout rate for both the encoder and the intervener is searched between {0, 0.5}. The pooling is searched between {mean, sum}. The normalization is searched between the batch normalization <ref type="bibr">[24]</ref> and layer normalization <ref type="bibr">[3]</ref>. The learning rate is initialized as 0.0001 The statistics of the datasets used are presented in Table <ref type="table">5</ref>. Implementation of FIG-N/VN. The encoder is set as GPS <ref type="bibr">[46]</ref> on ZINC, AQSOL, mollipo, molhiv, molbace, and set as GraphTrans <ref type="bibr">[56]</ref> on moltox21 and molbbbp. We follow the typical design for the predictor module as a 3-layered MLP with ReLU activation in the intermediate layers. In our implementation, we set &#120573; = Implementation of baseline methods. We search the number of layers of GIN <ref type="bibr">[59]</ref>, GAT <ref type="bibr">[52]</ref>, GATv2 <ref type="bibr">[7]</ref>, GatedGCN <ref type="bibr">[6]</ref>, DIR <ref type="bibr">[55]</ref>, and GREA <ref type="bibr">[37]</ref> between {2, 3, 5, 10} and report the best performance, considering configurations both with and without a virtual node connecting to all the given nodes.</p><p>Regarding the Transformer-based baselines (GT <ref type="bibr">[13]</ref>, GraphiT <ref type="bibr">[40]</ref>, SAN <ref type="bibr">[29]</ref>, SAT <ref type="bibr">[9]</ref>, Graphormer <ref type="bibr">[74]</ref>, GraphTrans <ref type="bibr">[56]</ref>, GPS <ref type="bibr">[46]</ref>), for the (absolute or relative) positional encoding, we adhere to the suggestions made in their original papers. We also searched the number of layers between {2, 3, 5, 10}.</p><p>Our GIN, GAT, GATv2, and GatedGCN implementations are from the PyTorch-geometric package. Our implementations of GT, GraphiT, SAN, SAT, Graphormer, GraphTrans, GPS, DIR, and GREA are adapted from publicly available code. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Soft argtop-K Trick</head><p>The hard argtopK breaks the differentiability of the model so that &#120579; aug-N has no gradient. Here, we present a soft argtop-&#119870; trick which is inspired by the soft top-K trick <ref type="foot">2</ref> . Overall, the key ideas are as follows, (1) The index vector idx ra &#8712; N &#119870; + can be viewed as a list of onehot index encoding idx &#8712; {0, 1} &#119870; &#215;&#119899; whose every row is a one-hot vector. If the &#119894;-th row's &#119895;-th element is 1, it means the &#119895;-th element in m is the &#119894;-th largest element in m. Then, we can use the matrix multiplication to index the matrix H:</p><p>(2) We aim to find a soft and differentiable matrix to approximate idx. The trick is to use the fact that every row of idx is onehot, which can be approximated by the output of the softmax function. Hence, the implementation is to call the softmax &#119870; times repeatedly. (3) The parameterization trick can be used: y_hard -y_soft.detach() + y_soft, whose main idea is to ensure that (1) the forward process uses the hard indexing and (2) the backpropagation updates the soft indices. We provide the exemplar code as follows. </p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>https://github.com/ZIB-IOL/merlin-arthur-classifiers/blob/main/soft-topk.py</p></note>
		</body>
		</text>
</TEI>
