<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Monotone Operator Theory-Inspired Message Passing for Learning Long-Range Interaction on Graphs</title></titleStmt>
			<publicationStmt>
				<publisher>PMLR</publisher>
				<date>05/02/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10557356</idno>
					<idno type="doi"></idno>
					
					<author>Justin M Baker</author><author>Qingsong Wang</author><author>Martin Berzins</author><author>Thomas Strohmer</author><author>Bao Wang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Learning long-range interactions (LRI) between distant nodes is crucial for many graph learning tasks. Predominant graph neural networks (GNNs) rely on local message passing and struggle to learn LRI. In this paper, we propose DRGNN to learn LRI leveraging monotone operator theory. DRGNN contains two key components: (1) we use a full node similarity matrix beyond adjacency matrix -drawing inspiration from the personalized PageRank matrix -as the aggregation matrix for message passing, and (2) we implement message-passing on graphs using Douglas-Rachford splitting to circumvent prohibitive matrix inversion. We demonstrate that DRGNN surpasses various advanced GNNs, including Transformerbased models, on several benchmark LRI learning tasks arising from different application domains, highlighting its efficacy in learning LRI. Code is available at https:  //github.com/Utah-Math-Data-Science/ PR-inspired-aggregation.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Graph that models interacting entities is ubiquitous in real-world applications, e.g., social networks <ref type="bibr">(Perozzi et al., 2014)</ref>, citation networks <ref type="bibr">(Hamilton et al., 2017)</ref>, and molecular modeling <ref type="bibr">(Gilmer et al., 2017)</ref>. The predominant message-passing graph neural networks (MP-GNNs) -based on local message-passing mechanisms -are effective in leveraging inductive bias of graph structures, excelling in graph deep learning <ref type="bibr">(Duvenaud et al., 2015;</ref><ref type="bibr">Kipf and Welling, 2017;</ref><ref type="bibr">Gilmer et al., 2017;</ref><ref type="bibr">Corso et al., 2020;</ref><ref type="bibr">Satorras et al., 2021;</ref><ref type="bibr">Thorpe et al., 2022)</ref>. MP-GNNs, e.g., graph convolutional network (GCN) <ref type="bibr">(Kipf and Welling, 2017)</ref>, usually update the features of each node by aggregating the features of its immediate (1-hop) neighbors. MP-GNNs encode the graph structure into their message-passing mechanism. Nevertheless, the locality of the message-passing in MP-GNNs can overlook the long-range interactions (LRI between distant nodes <ref type="bibr">(Ying et al., 2021)</ref>. The failure in learning LRI is problematic for many graph learning tasks. Taking the molecular property prediction as an example, while the molecular graph is typically represented as a covalent bond network with atoms directly linked through these bonds, the 3D folding of the molecule is also governed by long-range non-covalent interactions between distant atoms <ref type="bibr">(Gilmer et al., 2017)</ref>. This underlying principle of LRI's importance extends beyond molecules, finding relevance in areas like semantic segmentation label prediction in computer vision <ref type="bibr">(Dwivedi et al., 2020)</ref> or robotics <ref type="bibr">(Kurin et al., 2020)</ref>.</p><p>Increasing the depth of MP-GNNs appears to be an obvious and plausible solution for learning LRI, as it allows the message to iteratively propagate to distant nodes. However, various studies have shown that deep MP-GNNs perform much worse than shallow models <ref type="bibr">(Li et al., 2018;</ref><ref type="bibr">Oono and Suzuki, 2020)</ref>. Furthermore, deep MP-GNNs are expensive in computational and memory costs, especially in the backward pass, which linearly scales with the number of layers. In the context of LRI learning, MP-GNNs are shown to have unsatisfactory performance <ref type="bibr">(Ying et al., 2021)</ref>. Therefore, improving message-passing mechanisms for MP-GNNs that can learn LRI effectively is desirable.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Our Contribution</head><p>We present DRGNN -a new MP-GNN model for learning LRI effectively using a full (node) similarity matrix as the aggregation matrix for message passing rather than sparse adjacency matrices used in existing MP-GNNs. The full similarity matrix emerges from the Personalized PageRank matrix <ref type="bibr">(Page et al., 1998)</ref> augmented by a rank-one matrix; see Equation <ref type="formula">2</ref>in Section 3. However, directly using the full similarity matrix for message passing requires the prohibitive matrix inversion.</p><p>We address the computational challenges in matrix inversion by recognizing the monotonicity of the full similarity matrix from a monotone operator viewpoint; see Proposition 1 in Section 4. In particular, we reformulate the node representation -obtained from repeated message passing with the full similarity matrix -as the fixed point of an iterative equation. This idea aligns with deep equilibrium models (DEQs) <ref type="bibr">(Bai et al., 2019</ref><ref type="bibr">(Bai et al., , 2022;;</ref><ref type="bibr">Pokle et al., 2022;</ref><ref type="bibr">Winston and Kolter, 2020;</ref><ref type="bibr">Gu et al., 2020;</ref><ref type="bibr">Baker et al., 2023)</ref>, where the model's depth is task dependent. We notice that with a tailored design of the equilibrium equation (Equation <ref type="formula">3</ref>in Section 4) and leveraging the Douglas-Rachford (DR) splitting <ref type="bibr">(Douglas and Rachford, 1956)</ref>, we can guarantee the fixed point exists and is unique and the computation of the fixed point can be computed efficiently without matrix inversion (See Subection 4.2.1). As a result, our model's computational complexity aligns with that of GCN. Moreover, DRGNN enjoys a constant memory footprint for backpropagation using implicit differentiation <ref type="bibr">(Robinson, 1991)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Notations</head><p>We adopt the following notations throughout this paper: We represent vectors, matrices, and operators by lowercase boldface, uppercase boldface, and calligraphic letters, respectively. We use I n and 1 n to denote the identity matrix and all-one vectors of size n, respectively. We denote the adjacency matrix of graph G as A, and we use D to denote the diagonal matrix whose i-th diagonal element is the sum of the i-th column of A. For the augmented adjacency matrix, we introduce &#195; = A + I n . For the normalized adjacency matrix, we use &#256; = &#195; D-1 to denote the random walk matrix and &#194; = D-1/2 &#195; D-1/2 to denote the symmetrically normalized adjacency matrix. We use the notation diag(B) to represent the vector of diagonal elements of B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3">Organization</head><p>We organize this paper as follows: In Section 2, we discuss the related work. In Sections 3 and 4, we present the two key components of the proposed model. In Section 5, we present the experimental results on various benchmark graph learning tasks. Missing details and technical proofs are deferred to the appendix.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RELATED WORK</head><p>In this section, we review some related work. We first review some GNN models that expand the receptive field of message-passing beyond the immediate neighbors. We then discuss the fixed-point neural networks.</p><p>Multi-hop message passing: A particular idea to increase the receptive field of each node in MP-GNN is to employ higher orders of the adjacency matrices as the aggregation matrix <ref type="bibr">(Wu et al., 2019;</ref><ref type="bibr">Gasteiger et al., 2019a,b;</ref><ref type="bibr">Zhu and Koniusz, 2021;</ref><ref type="bibr">Chien et al., 2021)</ref>. A remarkable model is PPNP <ref type="bibr">(Gasteiger et al., 2019a)</ref>, which uses the personalized PageRank matrix (1 -&#945;)(I n -&#945; &#194;) -1 with &#945; &#8712; (0, 1) as its aggregation matrix. Using the Neumann expansion of the matrix inverse, represented as (1</p><p>we can deduce that PPNP's aggregation matrix attends to every node in the graph based on graph informed node similarities. However, the PPNP model is computationally expensive due to using the dense Personalized PageRank matrix. This challenge is further intensified when one seeks to stack multiple such layers to boost the expressivity of GNNs. To improve the efficiency of PPNP, <ref type="bibr">Gasteiger et al. (2019a)</ref> further propose APPNP using a truncated Neumann approximation of the Personalized PageRank matrix. However, the truncated Neumann approximation inherently limits the receptive field of each node. This becomes particularly problematic when the depth required by the task is unknown. APPNP has been generalized in <ref type="bibr">(Gasteiger et al., 2019b;</ref><ref type="bibr">Zhu and Koniusz, 2021;</ref><ref type="bibr">Chien et al., 2021)</ref> to incorporate generalized PageRank filters.</p><p>Graph Transformer: Transformer-based GNNs, e.g., <ref type="bibr">(Dwivedi and Bresson, 2020;</ref><ref type="bibr">Ying et al., 2021;</ref><ref type="bibr">Kreuzer et al., 2021)</ref>, leverage the self-attention mechanism between nodes and graph-related positional embeddings to learn node representations. Transformerbased GNNs enable full node connectivity and show remarkable performance in learning LRI <ref type="bibr">(Ying et al., 2021)</ref>. However, Transformer-based GNNs possess a weak inductive bias <ref type="bibr">(Dosovitskiy et al., 2020)</ref>; the positional embeddings may not reflect the graph structure, resulting in less meaningful attention weights. Moreover, the quadratic complexity of the self-attention mechanism makes Transformer-based GNNs computationally expensive even for moderately sized graphs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Fixed-point neural networks:</head><p>There has been a line of works that learn representations as the fixed point of some equilibrium equation; see, e.g., <ref type="bibr">(Bai et al., 2019;</ref><ref type="bibr">Gu et al., 2020;</ref><ref type="bibr">Bai et al., 2022)</ref>. This approach results in expressive models with adaptive depth while maintaining a constant memory footprint using implicit differentiation <ref type="bibr">(Robinson, 1991;</ref><ref type="bibr">Bolte et al., 2021)</ref>. For the correct implementation and training stability of the fixed-point neural networks, one needs to ensure the well-posedness of the fixedpoint iteration -the existence and uniqueness of the fixed point. A general well-posedness result is established by <ref type="bibr">Winston and Kolter (2020)</ref> that utilizes the monotone operator theory and allows us to use operator splitting schemes to find the fixed point. The fixed-point neural networks are versatile and competitive across various tasks including language modeling <ref type="bibr">(Bai et al., 2019)</ref>, semantic segmentation <ref type="bibr">(Bai et al., 2020)</ref>, optical flow <ref type="bibr">(Bai et al., 2022)</ref>, and diffusion models <ref type="bibr">(Pokle et al., 2022)</ref>.</p><p>In the context of graph learning, IGNN <ref type="bibr">(Gu et al., 2020)</ref> is a fixed-point GNN where each iteration in the fixed-point solving process is modeled as a messagepassing step using the normalized adjacency matrix.</p><p>Given the fact that the memory cost in backward propagation in each message-passing layer directly scales with the number of edges and is linearly proportional to the number of layers, IGNN's constant memory cost is favorable for graph learning tasks. This is particularly true for graph learning tasks that require LRI as the necessarily increased depth of a typical MP-GNN model will incur a significant memory However, it is pointed out that IGNN's well-posedness constrains the choices of weight matrix and poses instability in learning LRI <ref type="bibr">(Baker et al., 2023)</ref>.</p><p>Spectral GNNs: Spectral GNNs have also been designed to learn LRI; remarkable examples include BernNet <ref type="bibr">He et al. (2021)</ref> and <ref type="bibr">ChebyNet Defferrard et al. (2016)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">FULL SIMILARITY MATRICES</head><p>In this section, we discuss the motivation of the new aggregation matrix for message passing in our model. Consider a connected graph G = (V, E) with n nodes<ref type="foot">foot_1</ref> , one way to establish full node connectivity with the minimal extra computational cost is to utilize rank one matrices, e.g. 1 n 1 n 1 &#8868; n , which aggregates the averaged features of all nodes. However, the averaged feature is graph agnostic. Instead, one can turn to use the matrix J = 1 tr( D) diag( D1/2 )diag( D1/2 ) &#8868; to establish full node connectivity. Notice that matrix J accounts for node degrees; see the middle panel in Figure 1(b) for a visualization of using matrix J for a toy graph. We notice that rank-one modeling of the pairwise similarity has been used in efficient Transformers <ref type="bibr">(Katharopoulos et al., 2020;</ref><ref type="bibr">Catania et al., 2023)</ref>.</p><p>To further refine the above approach, we take inspiration from the Personalized PageRank matrix. Instead of directly relying on J for message passing, we integrate it into the Personalized PageRank matrix. This integration results in graph structure-aware full node similarities that are suitable for message passing between arbitrary nodes, as we will see later.</p><p>Modified Personalized PageRank matrix: Let &#256; := &#195; D-1 be the random walk matrix of the augmented graph G. The Personalized PageRank matrix &#928; ppr is characterized by the following equation:</p><p>where i x is the indicator vector of node x and &#945; &#8712; (0, 1) balances the random walk and restart probabilities. The Personalized PageRank matrix can be solved as &#928; ppr = (1 -&#945;) I n -&#945; &#256; -1 . The entries of &#928; ppr represent nodes similarity characterized by the probability of reaching one node from another in a random walk with restart probability 1 -&#945; <ref type="bibr">(Fouss et al., 2016)</ref>.</p><p>The Personalized PageRank matrix has been used in MP-GNN models, such as PPNP and APPNP <ref type="bibr">(Gasteiger et al., 2019a)</ref>. In these models, the random walk matrix &#256; is replaced by its symmetrically normalized variant &#194; = D-1/2 &#195; D-1/2 . In our proposed DRGNN, we modify the Personalized PageRank matrix by additionally incorporating the rank one matrix J to enhance direct communication between distant nodes and to benefit the computational efficiency; see Section 4 for details. The modified Personalized PageRank matrix &#928; ppr satisfies</p><p>which resolves to:</p><p>(2)</p><p>The entities of &#928; ppr characterize node similarities that are influenced by the graph structure with coefficients that balance the random walk, restart, and (degree weighted) global average. The matrix &#928; ppr enables multi-hop message-passing using a single MP-GNN layer, which follows from Neumann series expansion:</p><p>We illustrate matrix &#928; ppr in a simple graph in the right panel of Figure <ref type="figure">1</ref>(b). However, as encountered in PPNP <ref type="bibr">(Gasteiger et al., 2019a)</ref>, computing the inverse of (I n -&#945;J -&#946; &#194;) is prohibitive for large graphs. We address this computational bottleneck using monotone operator theory in the next section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">DRGNN</head><p>DRGNN learns node features by finding the fixed point H * that satisfies the following equilibrium equation:</p><p>where |&#946;| &lt; 1, &#945; + &#946; + 1 &gt; 0, and &#947; &gt; 1 + |&#945;| + |&#946;| are learnable. Here, &#963;(&#8226;) is the activation function (e.g., ReLU), and f &#920; (X) is a transformation of the input node features X parameterized by &#920;. The rationale of the design in Equation 3 will be discussed in the rest of this section.</p><p>Starting from H (0) , one approach to find the fixed point of Equation 3 is through the (damped) iteration using the Forward-Backward (FB) splitting <ref type="bibr">(Passty, 1979)</ref>, with a certain damping value t &gt; 0, one iterates:</p><p>for i = 0, 1, . . . , K. This iteration terminates when the difference between two consecutive iterations is smaller than some threshold &#1013; and then H (K+1) is taken as the numerical fixed point H * . Each iteration can be viewed as one message-passing layer, then fixed point H * is a result of multi-step message passing involving full node similarity matrix (I n + &#945;J + &#946; &#194;) -1 with its number of steps (depth) adaptive to the learning task. Therefore, H * captures LRI between distant nodes based on node similarities and also incorporates the nonlinearities between message-passing steps. We verify the improved LRI learning ability of DRGNN over APPNP in Section 5; see also Figure <ref type="figure">2</ref> for a visual comparison of the learned features in APPNP and DRGNN.</p><p>However, it remains unclear whether the dampened Picard iteration converges to the fixed point H * and the computation of the matrix inverse (I n +&#945;J +&#946; &#194;) -1 in each iteration is computationally expensive. We will address these two challenges in the next two subsections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Well-posedness of DRGNN</head><p>We discuss the existence and uniqueness of the fixed point of Equation <ref type="formula">3</ref>, i.e., the well-posedness of DRGNN. Our solution comes from a monotone operator theory viewpoint; see <ref type="bibr">(Bauschke et al., 2011)</ref> or Appendix B for background on monotone operator theory. We will show that the fixed point H * of Equation 3 exists and is unique, even when directly interpreting Equation 3 as an iterative scheme results in a non-contractive operator.</p><p>According to the monotone operator theory, finding the fixed point H * for Equation 3 is equivalent to solving the following monotone inclusion problem (See Appendix B for details): find 0 &#8712; (F + G)(H) with F and G being two set-valued functions, given as follows</p><p>where &#8706;g denotes the subgradient of a convex closed proper function g that satisfies &#963; = prox t g for some t &gt; 0. Here, the proximal operator is defined as prox t g (x) := arg min z 1 2 &#8741;x -z&#8741; 2 + t g(z) for any t &gt; 0. Most commonly used activation functions, such as ReLU, satisfy this condition. Specifically, ReLU = prox t g for &#8704;t &gt; 0 with g being the indicator of the positive octant, i.e. g(x) = I{x &#8805; 0}.</p><p>It's worth noting that the inclusion of 2H in equation 3 results in -I n in F and it marks a significant difference from what is used in the previous work that connects monotone inclusion problem with fixed-point neural networks <ref type="bibr">(Winston and Kolter, 2020;</ref><ref type="bibr">Baker et al., 2023)</ref>. This specific design will be a key factor in achieving computational efficiency; see Equation <ref type="formula">8</ref>in the subsequent section.</p><p>According to the monotone operator theory, the monotone inclusion problem Equation 5 admits a unique solution if the linear operator F is strongly monotone.</p><p>In particular, we have the following result whose proof can be found in Appendix A.</p><p>Proposition 1. The operator F is strongly monotone if |&#946;| &lt; 1, &#945;+&#946;+1 &gt; 0, and &#947; &gt; 1+|&#945;|+|&#946;|. Moreover, the monotone inclusion problem find 0 &#8712; (F + G)(H) with F and G defined in Equation 5 admits a unique solution, and Equation 3 admits a unique fixed point.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Efficient Implementation of DRGNN</head><p>In this subsection, we present an efficient implementation of DRGNN. Specifically, in the forward pass, the fixed-point nature of the hidden encoding allows us to use any root finding solver for H * not necessarily the Picard iteration (Equation <ref type="formula">4</ref>) derived from FB splitting. Peaceman-Rachford (PR) splitting (Peaceman and Rachford, 1955) has also been studied in the context of fixed-point neural networks <ref type="bibr">(Winston and Kolter, 2020;</ref><ref type="bibr">Baker et al., 2023)</ref>. PR splitting in general converges faster than FB splitting, but it requires the computation of the matrix inverse that originates from finding the resolvent for the operator F . As we will see in this subsection, our tailored design of F allows us to compute the resolvent efficiently by circumventing the computation of the matrix inverse, which is a sharp contrast to the previous work <ref type="bibr">(Winston and Kolter, 2020;</ref><ref type="bibr">Baker et al., 2023)</ref>. We additionally opt for DR splitting <ref type="bibr">(Douglas and Rachford, 1956)</ref>, which is an averaged version of PR splitting that has a better convergence guarantee in general <ref type="bibr">(Ryu and Boyd, 2016)</ref>, and we observed more robust performance in comparison to PR splitting for our setting. For the backward pass, the implicit differentiation method from <ref type="bibr">(Geng et al., 2021)</ref> is employed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1">Forward Pass via DR Splitting</head><p>We now elaborate on DR splitting for solving the monotone inclusion problem Equation 5 and hence to obtain the fixed point H * of Equation 3. DR splitting is a powerful tool for solving monotone inclusion problems <ref type="bibr">(Ryu and Boyd, 2016)</ref>. With operators F and G defined in Equation <ref type="formula">5</ref>, DR splitting solves the monotone inclusion problem "find 0 &#8712; (F + G)(H)" by starting with an initialization<ref type="foot">foot_2</ref> of the intermediate state U (0) , and then updates as follows:</p><p>where I denotes the identity operator, and the operators R G and C G denote the resolvent and Cayley operators of G, respectively. They are defined as</p><p>Similarly, C F is the Cayley operator of the operator F .</p><p>Then the fixed point H * of Equation 3 is determined by H * = &#963;(U * ). It is worth noting that both resolvent and Cayley operators are non-expansive for any strongly monotone operator. DR splitting converges linearly to H * for any choice of t &gt; 0 when F is strongly monotone and Lipschitz; see <ref type="bibr">(Ryu and Boyd, 2016, Section 6)</ref> or Appendix B for details.</p><p>Plugging the detailed form of F and G in Equation <ref type="formula">5</ref>into Equation <ref type="formula">6</ref>, we obtain the following iterative schemes for the intermediate state U (i) : <ref type="formula">7</ref>) where</p><p>As discussed above, DR splitting converges to H * linearly for any t &gt; 0, and when t = 1, we have</p><p>Therefore, when using DR splitting in Equation <ref type="formula">7</ref>to compute the fixed point of Equation 3, the needs of computing I n + &#945;J + &#946; &#194; -1 and in the linear part V of the resolvent operator R F cancel out. This is a particular advantage of having -I n in the operator F which comes from our tailored design of Equation <ref type="formula">3</ref>.</p><p>Initializing with the fixed point of the previous forward pass: In training, the learned function f &#920; (X) tends to stabilize as the model converges. Consequently, the difference between fixed points in consecutive forward passes become negligible. Therefore, it is natural to leverage the numerical fixed point from the previous forward pass to initialize the iteration of the new forward pass. This strategy has been adopted to accelerate training fixed-point networks <ref type="bibr">(Bai et al., 2022;</ref><ref type="bibr">Pokle et al., 2022)</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2">Backward Pass</head><p>By utilizing fixed-point encoding for learning node features, the memory consumption for backpropagation becomes independent of the number of iterations (layers) by using implicit differentiation. For simplicity, we denote the DR splitting iteration in Equation <ref type="formula">7</ref>as</p><p>). Let &#952; represent a parameter in &#920;. For any loss function &#8467;, by using the implicit differentiation, we have</p><p>where J F DR (U * ) is the Jacobian matrix of F DR evaluated at the fixed-point U * . Note that the matrix I n -J F DR (U * ) is invertible because the iteration of DR splitting is contractive. According to Equation <ref type="formula">9</ref>, backpropagation only requires the gradient at the final fixed point. We can disable automatic differentiation during the fixed-point finding process and only run one more iteration with automatic differentiation to use the stored gradients for computing the true gradients through Equation <ref type="formula">9</ref>. However, the inverse Jacobian term (I n -J F DR (U * )) -1 can be computationally expensive <ref type="bibr">(Bai et al., 2019;</ref><ref type="bibr">Winston and Kolter, 2020)</ref>.</p><p>In our implementation, we employ the inexact gradient method <ref type="bibr">(Geng et al., 2021)</ref>, which can maintain the descent direction. This method, also adopted in <ref type="bibr">(Pokle et al., 2022;</ref><ref type="bibr">Anil et al., 2022)</ref>, significantly accelerates backpropagation. This process involves locating the fixed-point H * with automatic differentiation off, enabling it for a few more iterations, and using the stored gradient to update DRGNN directly.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Time and Memory Complexities</head><p>In this section, we analyze the time and memory complexities of DRGNN. Let n and |E| denote the number of nodes and edges in the graph, respectively. We use d to denote the feature dimension. We assume the graph is sparse, which is common in graph learning. During the forward pass, DRGNN has a time complexity of O(|E|M d), where M represents the maximum number of iterations. Correspondingly, its space complexity is O(|E| + nd). DRGNN leverages the inexact gradients to accelerate the backpropagation. Consequently, the time complexity is O(|E|P d), where P indicates the number of inexact gradient computations. The space complexity during the backward pass is O(P |E| + nd).</p><p>Our experiments align with findings in <ref type="bibr">(Pokle et al., 2022;</ref><ref type="bibr">Anil et al., 2022)</ref> that small values of P (less than five) maintain high performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">EXPERIMENTAL RESULTS</head><p>In this section, we validate the performance of DRGNN on various graph learning tasks, including both synthetic datasets and real-world peptide molecular modeling datasets for LRI learning, as well as graph transductive node classifications. We implement DRGNN using PyG <ref type="bibr">(Fey and Lenssen, 2019)</ref> and GraphGym <ref type="bibr">(You et al., 2020)</ref> frameworks and conduct all experiments on a single NVIDIA RTX 3090 GPU.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Learning LRI In Synthetic Datasets</head><p>In this subsection, we consider two synthetic datasets that are designed to test the model's ability to learn LRI; the corresponding tasks provide controlled environments where LRI is crucial for the model to achieve high performance. We further provide a visual understanding of the learning process to demonstrate the effectiveness of DRGNN in learning LRI.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.1">Synthetic Chains Dataset</head><p>The Synthetic Chains node classification dataset <ref type="bibr">(Gu et al., 2020)</ref> consists of a collection of chains where nodes within the same chain share identical labels. However, only the starting node on each chain has a meaningful feature as the one-hot indicator of the class, while all other nodes carry zero-valued features.</p><p>To accurately classify all nodes on the chains, a model must be able to propagate the class indicator feature from the initial node to the distant nodes in the forward pass without information loss. Consequently, the Synthetic Chains task poses a significant challenge for numerous GNN models, particularly when the depth of GNN is not sufficient or the model is unstable in learning LRI. We consider both a binary Synthetic Chain task consisting of 2 classes of 10 chains with 80 nodes each and a multi-class Synthetic Chain task consisting of 10 classes of 10 chains each. In both tasks, we follow the training procedure in <ref type="bibr">(Gu et al., 2020)</ref>.</p><p>A visualization of the learning process in binary classification: In Figure <ref type="figure">2</ref> We first train each model for 100 epochs and then use the learned parameters to predict the node labels from the learned features on the test graph at each iteration step. Each model is provided with a total of 80 iteration steps to propagate the node features. We observe that IGNN fails to correctly classify all nodes and the predicted probabilities of the two classes are not wellseparated. In contrast, both APPNP and DRGNN are able to correctly classify all nodes, which shows the effectiveness of using the personalized PageRank-style matrix for message passing in learning LRI. Additionally, DRGNN quickly obtains the correct prediction for all nodes while APPNP requires more iterations to achieve the same result. This demonstrates the effectiveness of the learned fixed point in DRGNN.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Multi-class classification:</head><p>The multi-class classification task poses a more challenging problem for the models to learn LRI. Figure <ref type="figure">3</ref> illustrates the mean test accuracy, calculated over 10 random initializations, for baseline models trained on multi-class Synthetic Chains with chain lengths varying from 10 to 100. In this experiment, each architecture is tuned at chain length 100 with 200 iterations. Although each model is capable of fully propagating the node information, only DRGNN consistently classifies chains accurately across all investigated lengths. Additional results for various models can be found in Appendix C.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.2">Color-counting Dataset</head><p>The Color-counting dataset <ref type="bibr">(Liu et al., 2022)</ref> introduces a node classification task that is more challenging than the Synthetic Chains. This dataset uses the same chain-like structure but rather than determining node labels based on the starting node's color, node labels within each chain are assigned according to the majority node color, encoded as one-hot feature vectors. This setup emphasizes the need for GNNs to acquire multiscale information in LRI learning for optimal performance. In light of this, Liu et al. ( <ref type="formula">2022</ref>) presented MGNNI, which surpasses other baseline models on this dataset.</p><p>We compare DRGNN with MGNNI as well as APPNP and GCN on this task. Therefore, the necessity of identifying the majority color during message passing raises additional challenges to learn LRI. We report the mean test accuracy over 10 random initializations for Color-counting chains of length 10 to 200 in Figure <ref type="figure">4</ref>. We observe that DRGNN obtains close to full accuracy across all chain lengths, while all other models fail to learn LRI or identify the majority color. This validates the effectiveness of the DRGNN message-passing scheme in complicated LRI learning tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Learning LRI In Molecular Modeling</head><p>We explore the effectiveness of DRGNN in realworld scenarios by evaluating its performance on the Peptides-Func and Peptides-Struct molecular modeling datasets <ref type="bibr">(Dwivedi et al., 2022)</ref> Table 2: Performance comparison of different models on transductive node classification of Citation Networks.</p><p>Results marked with * are reported in <ref type="bibr">(Chen et al., 2023)</ref>. Models that run out of memory are reported with OOM and models without existing results with NA. GCN and APPNP are MP-GNN baselines, Graphormer, SAN, and GraphGPS are Transformer-based models, and NAGphormer is a mixture of MP-GNN and Transformer-style models. DRGNN achieves state-of-the-art results on Amazon-Computers and outperforms the baseline MP-GNN models on each task.</p><p>is shown that Transformer-based models are more effective than MP-GNNs. Both datasets contain 15,535 graphs with a total of 2.3 million nodes, each representing a peptide with atoms as nodes and covalent bonds as edges; see Figure <ref type="figure">5</ref> for an illustration. On average, peptide graphs from these datasets consist of about 150 nodes with a typical node degree nearing 2, making them chain-like structures. Peptide-func is a multi-label graph classification task and the goal is to predict the functional properties of peptides. Peptidestruct is a graph regression task and the goal is to predict the 3D structure of peptides. In both cases, capturing non-covalent LRI becomes vital for a GNN model to excel in this setting.</p><p>We compare DRGNN against several strong baseline models, including GCN <ref type="bibr">(Kipf and Welling, 2017)</ref>, GINE <ref type="bibr">(Xu et al., 2019)</ref>, GatedGCN <ref type="bibr">(Bresson and Laurent, 2017</ref>) and its variant with random walk structural encoding (RWSE) GatedGCN+RWSE, Transformer <ref type="bibr">(Vaswani et al., 2017)</ref> with Laplacian position encoding (LapPE), SAN variants <ref type="bibr">(Kreuzer et al., 2021)</ref> (SAN+LapPE and SAN+RWSE), and GraphGPS <ref type="bibr">(Ramp&#225;&#353;ek et al., 2022)</ref>. We summarize the mean training and test performance over 10 random initializations in Table <ref type="table">1</ref>, showing that DRGNN outperforms all baseline models on both datasets in terms of both training and test performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Learning Transductive Datasets</head><p>We further demonstrate the effectiveness of DRGNN in learning LRI across five transductive node classification tasks, including the citation networks Coauthor-  CS/Physics (Shchur et al., 2018), ogbn-arxiv (Hu et al., 2020), and the co-purchasing networks Amazon-Computers/Photo (Shchur et al., 2018). These tasks consist of single large graphs with up to 170K nodes, 1.2M edges, and 23 length diameters, and contrasting the prior mini-batch training on multiple molecular graphs. Comprehensive dataset details can be found in Appendix D.1. Consequently, the transductive tasks bring to light memory challenges for several mainstream Transformer-based architectures and illustrate the memory efficiency of DRGNN. We compare against existing results from (Chen et al., 2023) which serves to illustrate the difference between MP-GNN and Transformer-based models. In particular, the MP-GNN baselines are GCN and APPNP, the Transformer-based architectures include Graphormer, SAN, and GraphGPS, and the NAGphormer (Chen et al., 2023) is a mixture of MP-GNN and Transformerstyle models.</p><p>Table <ref type="table">2</ref> verifies the memory issues for prominent Transformer-based models and illustrates the strength of DRGNN. In particular, DRGNN achieves the best performance on ogbn-arxiv and Amazon-Computers/Photo and outperforms the baseline MP-GNNs on all five tasks. DRGNN also outperforms many Transformer-based models (only slightly inferior to NAGphormer on Coauthor-CS/Physics), highlighting its superior memory efficiency and effectiveness in learning LRI for inductive tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">CONCLUSION</head><p>The 1-hop message-passing nature of many MP-GNNs limits their ability to learn LRI, which is crucial for many graph learning tasks. To effectively learn LRI, We propose DRGNN leveraging a modified Personalized PageRank matrix -a full node similarity matrix. DRGNN is designed and efficiently implemented through the lens of monotone operator theory, which allows us to circumvent the matrix inversion that comes from the use of a full node similarity matrix. We demonstrate the effectiveness of DRGNN over existing strong baseline models on various graph learning tasks, including synthetic datasets and benchmark real-world datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Checklist</head><p>1. For all models and algorithms presented, check if you include:</p><p>(a) A clear description of the mathematical setting, assumptions, algorithm, and/or model.</p><p>[Yes] (b) An analysis of the properties and complexity (time, space, sample size) of any algorithm.</p><p>[Yes] (c) (Optional) Anonymized source code, with specification of all dependencies, including external libraries.</p><p>[Yes]</p><p>2. For any theoretical claim, check if you include:</p><p>(a) Statements of the full set of assumptions of all theoretical results. [Yes] (b) Complete proofs of all theoretical results.</p><p>[Yes] (c) Clear explanations of any assumptions. <ref type="bibr">[Yes]</ref> 3. For all figures and tables that present empirical results, check if you include:</p><p>(a) The code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL).</p><p>[Yes] (b) All the training details (e.g., data splits, hyperparameters, how they were chosen).</p><p>[Yes] (c) A clear definition of the specific measure or statistics and error bars (e.g., with respect to the random seed after running experiments multiple times).</p><p>[Yes] (d) A description of the computing infrastructure used. (e.g., type of GPUs, internal cluster, or cloud provider).</p><p>[Yes] 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets, check if you include: (a) Citations of the creator If your work uses existing assets. [Yes] (b) The license information of the assets, if applicable. [Yes] (c) New assets either in the supplemental material or as a URL, if applicable. [Yes] (d) Information about consent from data providers/curators. [Yes] (e) Discussion of sensible content if applicable, e.g., personally identifiable information or offensive content. [Not Applicable] 5. If you used crowdsourcing or conducted research with human subjects, check if you include: (a) The full text of instructions given to participants and screenshots. [Not Applicable] (b) Descriptions of potential participant risks, with links to Institutional Review Board (IRB) approvals if applicable. [Not Applicable] (c) The estimated hourly wage paid to participants and the total amount spent on participant compensation. [Not Applicable] &#8226; Meanwhile, a subdifferential operator &#8706;f is maximal monotone if and only if f is a convex closed proper function.</p><p>Recall the resolvent and Cayley operator of an operator G are defined by</p><p>for any t, where I is the identity operator. The following results for the resolvent of linear and subdifferential operators are used in Section 4:</p><p>&#8226; when F (x) = Gx + h is a linear operator, then</p><p>&#8226; when F = &#8706;f for some CCP function f , then the resolvent is given by the following proximal operator</p><p>If an operator F is strongly monotone, its resolvent R F is contractive, and its Cayley operator is non-expansive (for any t &gt; 0). If F is strongly monotone and Lipschitz, then its Cayley operator is also contractive, see <ref type="bibr">(Ryu and Boyd, 2016, Section 6)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2 Forward-Backward splitting</head><p>This section provides additional details for Section 4 about the monotone inclusion problem and the Forward-Backward splitting method. We will first introduce the Forward-Backward splitting method and then show that the monotone inclusion problem is equivalent to the fixed-point problem of the layer update Equation <ref type="formula">3</ref>. Then we derive the damped Picard iteration used in Equation <ref type="formula">4</ref>.</p><p>Forward-backward (FB) splitting is a technique used to find a zero in the sum of operators, i.e., locate x such that 0 &#8712; (F + G)(x),</p><p>where F and G are maximal monotone, and F is single-valued. Then for any t &gt; 0, we have</p><p>Where the last step uses the definition of the resolvent operator R G := (I + tG) -1 . Thus, x is a solution if and only if it is a fixed point of R G (I -tF ) for any t.</p><p>Equivalence between the fixed-point problem (Equation <ref type="formula">3</ref>) and the monotone inclusion problem (Equation <ref type="formula">5</ref>). Recall that the monotone inclusion problem used in Equation <ref type="formula">5</ref>reads</p><p>where G is such that R G is the activation function &#963;. Then for any t &gt; 0, the FB splitting updates as the following damped variant of Picard iteration</p><p>Therefore, the FB splitting iteration recovers exactly the fixed-point formulation in Equation 3 when t = 1 and the fixed-point in Equation 3 can be computed via a suitable damping factor t.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.3 Douglas-Rachford splitting</head><p>The Douglas-Rachford splitting method is derived by averaging the composition of two Cayley operators. For any t &gt; 0, the following equivalence relation holds:</p><p>This can be further expanded as:</p><p>Then the fixed point iteration for this setup is:</p><p>C Additional Experimental Results on Multi-class Chain Classification</p><p>100 ( &#33449; ) &gt; U &#24052; n u u v 75 50 25 &#12290; ------&#19969;--------____&#65281;&#19968;&#19968;&#19968;&#19968;&#19968;&#19968;&#19968;&#19968;&#19968;&#19968;&#19968;r__________ ! !</p><p>nodes per chain, the total number of nodes in the chain dataset is c &#215; n c &#215; l. In this task, we use the 5/15/80% train/valid/test splits.</p><p>Color-counting. The Color-counting dataset <ref type="bibr">(Liu et al., 2022)</ref> is introduced to further test the model's ability to learn LRI while processing multiscale information. The dataset contains a collection of chains, with each chain containing 10 randomly selected nodes that are colored into one of two colors. Each node's label is determined by the majority color of the chain containing it. The dataset is balanced with 50% of the nodes in each class. In this task, we use the 5/15/80% train/valid/test splits.</p><p>Peptides-func. The peptides-func is a multi-label graph classification task introduced in <ref type="bibr">Dwivedi et al. (2022)</ref>.</p><p>The dataset consists of 15,535 graphs with a total of 2.3 million nodes. Each graph is constructed from a peptide sequence, where each node represents the heavy atom while the edges represent the chemical bonds between them. The OGB library <ref type="bibr">(Hu et al., 2020)</ref> molecular features are used in this dataset. There is a total of 10 classes that represent peptide functions, e.g., anti-bacterial, anti-viral, cell-cell communication, etc. on average to 1.65 of the 10 classes. The labels are imbalanced, only 16.5% of the data is in the positive class, with the richest class having 62.7% positives and the poorest 1.9%. The license of this dataset is CC BY-NC 4.0.</p><p>Peptide-struct. The peptide-struct is also introduced in Dwivedi et al. ( <ref type="formula">2022</ref>) and contains the same graphs as in the peptides-func dataset. This graph regression task aims to predict the peptide's graph level 3D structure, including inertia mass, inertia valence, length, sphericity, etc. The license of this dataset is CC BY-NC 4.0.</p><p>Transductive Datasets. The ogbn-arxiv is a large citation network with 169343 nodes and 1,166,243 edges introduced in <ref type="bibr">(Hu et al., 2020)</ref>. Each node is an arXiv computer science paper, and each directed edge indicates the source node cites the target node. The node feature is a 128-dimensional vector obtained from the word embedding of the title and abstract of its corresponding paper. The Coauthor CS and Physics datasets are introduced in <ref type="bibr">(Shchur et al., 2018)</ref> represent co-authorship graphs from Microsoft Academic Graph. Each node represents an author, and each edge indicates two authors have co-authored a paper. The node feature represents the keywords of the corresponding authors' papers, and the label of the node is the author's most active field of study. The Amazon Computers and Amazon Photo datasets are introduced in <ref type="bibr">(Shchur et al., 2018)</ref> and represent products bought together on Amazon. Each node represents a product, and each edge indicates two products are frequently bought together. The node feature is constructed from the bag-of-words of the product's reviews, and the label of the node is the product's category.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.2 Model architectures</head><p>This section details the encoder, DRGNN, and decoder architectures used in Section 5. Additional hyperparameter details are provided in Appendix D.4.</p><p>The general architecture for DRGNN includes three auxiliary learnable parameters &#945;, &#946;, &#947; as in Equation <ref type="formula">3</ref>and a bias term f &#920; (&#8226;). The forward method is then performed via Equation <ref type="formula">7</ref>. Fixed point iteration is first calculated without gradients until convergence or the maximum number of iterations is reached. The iteration is called again with gradients until the number of phantom gradients is reached. The maximum number of forward iterations and the number of phantom gradient steps may be adjusted. The forward method also allows for an optional initial guess to the iterative solver, which can significantly improve the computational efficiency of DRGNN (see Section 5). DRGNN automatically stores an initial guess for static graphs. For dynamic graphs, it is recommended that this value is augmented to the data and passed to the model with the batched data. We provide a data augmentation transform using the PyG library <ref type="bibr">(Fey and Lenssen, 2019)</ref> in the publically available code.</p><p>Synthetic Chains. In the Synthetic Chains dataset, we use a single linear layer without bias followed by a dropout layer for encoding. We then use the DRGNN layer with ReLU activation and a bias term of two sequential linear layers without bias separated by a ReLU activation function. The DRGNN layer is followed by a decoding process of a ReLU activation function followed by a linear layer without bias. Finally, model predictions are generated using a logarithmic softmax activation.</p><p>Color-counting. The color-counting dataset uses the same encoder and decoder architecture as the Synthetic Chains dataset. The DRGNN layer uses a ReLU activation function and a bias term of two sequential linear layers without bias separated by a ReLU activation function.</p><p>Peptides-func. The atomic information provided by the peptides-func dataset is first encoded using the Atom encoder as in <ref type="bibr">(Dwivedi et al., 2022)</ref>. Then the feature is passed through four layers of DRGNN with LeakyReLU activation and a GCN-style bias term. The decoder architecture is a two sequential linear layer. The hidden dimension is chosen to be 300 for all layers to fit within the 500k parameter limit. The key hyperparameters are provided in Table <ref type="table">3</ref>.</p><p>Peptides-struct. Similar to the peptides-func dataset, the atomic information provided by the peptides-struct dataset is first encoded using the Atom encoder as in <ref type="bibr">(Dwivedi et al., 2022)</ref>. Then the feature is passed through three layers of DRGNN with LeakyReLU activation and a GCN-style bias term. The decoder architecture is a three-sequential linear layer. The hidden dimension is chosen to be 300 for all layers to fit within the 500k parameter limit. The key hyperparameters are provided in Table <ref type="table">3</ref>.</p><p>Transductive Datasets. The transductive datasets use the same encoder and decoder architecture as the Synthetic Chains dataset. The DRGNN layer uses a LeakyReLU activation function.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.3 Training procedure details</head><p>In this section, we provide the training details for the experiments in Section 5. All experiments are run using 8 NVIDIA RTX3090 GPUs and timing and memory experiments run on single A100 GPUs provided by Google Colab.</p><p>Synthetic Chains. The dataset is pre-processed by taking the symmetric normalized adjacency matrix. In these tasks, we utilize the Adam optimizer <ref type="bibr">(Kingma and Ba, 2014)</ref>, cross entropy loss, and early stopping based on the validation accuracy. Specifically, we train with a maximum of 2000 forward epochs with early stopping if the validation accuracy does not improve after 100 epochs. We save the model at the epoch with the best validation accuracy and use the saved model for testing.</p><p>Color-counting. The training procedure for the color-counting dataset is identical to the synthetic chains dataset.</p><p>Peptides-func and Peptides-struct. We follow the same training procedure as in <ref type="bibr">(Dwivedi et al., 2022)</ref>. The optimizer is AdamW <ref type="bibr">(Loshchilov and Hutter, 2017)</ref> with a total of 250 epochs. The learning rate is halved when the validation loss does not improve after 20 epochs. The fixed 75/15/15% train/valid/test splits are used. The batch size is 128 for Peptides-func and Peptides-struct.</p><p>Transductive Datasets. The optimizer is AdamW <ref type="bibr">(Loshchilov and Hutter, 2017)</ref> with a total of 100 epochs.</p><p>The learning rate follows a cosine scheduler with 5 warmup epochs. The random 60/20/20% train/valid/test splits are used.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.4 Hyperparameter details</head><p>To find the optimal hyperparameters of each model we use the hyperparameter searching range reported in Table <ref type="table">3</ref>. In the interest of space, when the spacing between values is linear, we use the notation {min:max:step}.</p><p>For instance dropout may be abbreviated {0.0:0.8:0.1}. We use lr to denote the initial learning rate and wd to denote the weight decay. For DRGNN, we use &#945; init to denote the initialization of &#945;, &#946; init to denote the initialization of &#946;, &#947; init to denote the initialization of &#947;. max iter to denote the maximum number of forward iterations, phantom grad to denote the number of steps used at the Phantom gradient stage, tol to denote the tolerance used for fixed-point iteration.</p><p>Synthetic Chains. For all models we use 16 hidden features. Each model is tuned using Bayesian metalearning to maximize the test accuracy for 50 iterations. For binary classification, the models are tuned with a</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Proceedings of the 27 th International Conference on Artificial Intelligence and Statistics (AISTATS) 2024, Valencia, Spain. PMLR: Volume 238. Copyright 2024 by the author(s). Correspond to: wangbaonj@gmail.com.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1"><p>The results in this section apply to graphs with multiple connected components by considering the relevant matrix as a block diagonal for each component.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_2"><p>In practice, we initialize the solver with the fixed point found in the previous forward pass.</p></note>
		</body>
		</text>
</TEI>
