<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Towards Learning Generalities Across Graphs via Task-trees</title></titleStmt>
			<publicationStmt>
				<publisher>The International Conference on Machine Learning (ICML)</publisher>
				<date>08/15/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10637679</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of Machine Learning Research</title>
<idno>2640-3498</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Zehong Wang</author><author>Zheyuan Zhang</author><author>Tianyi Ma</author><author>Nitesh Chawla</author><author>Chuxu Zhang</author><author>Yanfang Ye</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
<abstract><ab><![CDATA[Foundation models are pretrained on large-scale corpora to learn generalizable patterns across domains and tasks, such as contours, textures, and edges in images, or tokens and sentences in text. In contrast, discovering such generalities in graph-structured data, especially across heterogeneous graph tasks, remains an open challenge. To address this, we propose a novel approach to cross-task generalization in graphs via task-trees, which serve as unified learning instances aligning node-, edge-, and graph-level tasks. We theoretically analyze the stability, transferability, and generalization properties of task-trees, showing that pretraining a graph neural network (GNN) on diverse task-trees with a reconstruction objective induces transferable knowledge. This enables efficient adaptation to downstream tasks with minimal fine-tuning. To validate our framework, we introduce Graph Generality Identifier on Task-Trees (GIT), a graph foundation model that demonstrates strong performance on over 30 graphs across five domains via fine-tuning, in-context learning, and zero-shot generalization. Code and data are available at https://github.com/Zehong-Wang/GIT.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Foundation models have emerged as a cornerstone of general-purpose machine learning, enabling cross-task and cross-domain generalization. Representative examples include large language models (LLMs) for text <ref type="bibr">(Achiam et al., 2023;</ref><ref type="bibr">Touvron et al., 2023)</ref> and large vision models (LVMs) for images <ref type="bibr">(He et al., 2022;</ref><ref type="bibr">Yuan et al., 2021)</ref>. Pretrained on massive datasets, these models capture transferable patterns, such as contours and textures in images or tokens and sentences in text, that reflect modality-specific generalities. This broad knowledge base allows for efficient adaptation to downstream tasks via in-context learning <ref type="bibr">(Xie et al., 2022;</ref><ref type="bibr">Chen et al., 2024c)</ref> and zero-shot generalization <ref type="bibr">(Wei et al., 2021)</ref>.</p><p>Despite the success of foundation models in text and vision, their extension to graph-structured data remains nascent <ref type="bibr">(Liu et al., 2024a)</ref>, primarily due to the high variability across graph datasets <ref type="bibr">(Mao et al., 2024)</ref>. Graphs from different domains often encode distinct phenomena (e.g., social networks model human relationships <ref type="bibr">(Freeman, 2004)</ref>, whereas molecular graphs represent chemical structures <ref type="bibr">(Zeng et al., 2022)</ref>), leading to both feature <ref type="bibr">(Wang et al., 2024b)</ref> and structural heterogeneity <ref type="bibr">(Qiu et al., 2020;</ref><ref type="bibr">Wang et al., 2024c)</ref>. Crucially, graph tasks operate on different learning units, such as nodes, edges, or entire graphs, limiting cross-task compatibility within a unified model <ref type="bibr">(Wang et al., 2024b)</ref>. These challenges hinder the development of graph foundation models capable of capturing transferable generalities. 
In this work, we specifically address the challenge of task heterogeneity.</p><p>Is it possible to identify cross-task generalities across graphs? Despite inherent challenges, prior work has explored this question via two main approaches. (1) A graph-theoretic perspective employs the concept of graphons <ref type="bibr">(Ruiz et al., 2020)</ref> to model transferable patterns across graphs. If graphs are sampled from the same graphon, they are expected to share structural properties, enabling effective transfer <ref type="bibr">(Ruiz et al., 2020;</ref><ref type="bibr">Cao et al., 2023)</ref>. However, graphon-based methods rely on strong generative assumptions that rarely hold in real-world settings <ref type="bibr">(Levie et al., 2021)</ref>, and inferring a shared graphon from diverse graphs remains computationally intractable. (2) A substructure-based perspective seeks recurring motifs, such as triangles, stars, or k-cliques, across domains <ref type="bibr">(Zhao et al., 2023;</ref><ref type="bibr">Mao et al., 2024)</ref>. These motifs appear in various contexts (e.g., social, citation, and molecular networks), motivating methods that sample subgraphs consisting of substructures and encode them via GNNs <ref type="bibr">(Sun et al., 2023;</ref><ref type="bibr">Liu et al., 2024a)</ref>. However, message-passing GNNs are fundamentally limited in capturing such substructures <ref type="bibr">(Garg et al., 2020;</ref><ref type="bibr">Esser et al., 2021;</ref><ref type="bibr">Zhang et al., 2024a)</ref>, restricting their efficacy in learning transferable subgraph representations.</p><p>Given the limitations of prior approaches, we introduce a novel perspective centered on the learning dynamics of message-passing GNNs <ref type="bibr">(Kipf &amp; Welling, 2017;</ref><ref type="bibr">Hamilton et al., 2017)</ref>. 
In such models, predictions are made based on task-relevant nodes: the target node in node-level tasks, edge endpoints in edge-level tasks, and all nodes in graph-level tasks <ref type="bibr">(Srinivasan &amp; Ribeiro, 2020)</ref>. Regardless of task type, GNNs aggregate embeddings over these task-relevant nodes, which can be conceptualized as introducing a virtual task node connected to all task-relevant nodes. We define the computation tree rooted at this virtual node as a task-tree (Figure <ref type="figure">1</ref>). Task-trees offer three key advantages: (1) Learnability: tree-structured information can be effectively captured by message-passing GNNs <ref type="bibr">(Gupta et al., 2024)</ref>; (2) Uniformity: task-trees apply seamlessly across node-, edge-, and graph-level tasks, mitigating task heterogeneity; (3) Efficiency: encoding task-trees is operationally equivalent to encoding the virtual nodes appended to the original graphs. Analogous to images in vision or sentences in language, task-trees serve as unified learning instances and may encode transferable patterns across graph tasks, leading to the following assumption:</p><p>Task-Tree Generality Assumption. The generalities shared across graphs are (at least partially) preserved within the task-trees of the involved graphs.</p><p>To evaluate this assumption, we conduct a theoretical analysis of task-trees with respect to stability, transferability, and generalization. Our main result shows that pretraining a GNN on diverse task-trees via a reconstruction objective yields transferable representations that adapt well to downstream tasks with moderate fine-tuning. 
Furthermore, the model can be specialized to specific domains via post-training <ref type="bibr">(Wei et al., 2021)</ref> on domain-specific task-trees.</p><p>To empirically validate our theoretical insights, we introduce Graph Generality Identifier on Task-Trees (GIT), a graph foundation model pretrained on task-trees extracted from diverse graphs spanning multiple domains and tasks. GIT is evaluated on 32 graphs across 5 domains under three paradigms: fine-tuning, in-context learning (few-shot without fine-tuning), and zero-shot learning. Results show that pretraining on a small set of graphs significantly improves performance on a broad range of downstream tasks, supporting the hypothesis that task-trees capture transferable generalities. Additionally, we propose an instruction tuning method to adapt the general model to specific domains, yielding performance comparable to or exceeding domain-specific expert models. Our key contributions are:</p><p>&#8226; We introduce task-trees as unified learning instances for aligning heterogeneous graph tasks, demonstrating advantages over conventional units such as subgraphs.</p><p>&#8226; We present the first theoretical framework addressing task heterogeneity in graph learning, establishing the effectiveness of task-trees for cross-task generalization.</p><p>&#8226; We propose GIT, a graph foundation model pretrained on task-trees to acquire generalizable knowledge and support domain specialization.</p><p>&#8226; Extensive experiments across 32 graphs and five domains validate the effectiveness of GIT under fine-tuning, in-context, and zero-shot settings.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Task-Trees: Rethinking Basic Learning Instances on Graphs</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Preliminary</head><p>We begin with a brief introduction to message-passing GNNs and some related concepts. Let G = (V, E) represent a graph with node set V and edge set E, where each node v &#8712; V is associated with a feature vector x &#8712; R^d.</p><p>A GNN encoder &#981; takes the graph as input and performs message passing to learn node embeddings Z = &#981;(V, E). Specifically, a GNN layer can be defined as:</p><p>z_i^(l) = g(&#963;(W_1 z_i^(l-1) + W_2 &#961;({z_j^(l-1) : j &#8712; N_i}))),</p><p>where N_i denotes the 1-hop neighbors of node i, z^(l) represents the node embedding at the l-th GNN layer with z^(0) = x, and W_1, W_2 are learnable matrices. The functions &#963;, &#961;, and g are the activation, aggregation, and update functions, respectively. To simplify the analysis, we assume &#961; is an averaging operation and g is the identity function. Without loss of generality (WLOG), these functions can be replaced with any permutation-invariant and Lipschitz-continuous functions without affecting the analysis in this paper.</p><p>Definition 2.1 (Task-Relevant Nodes). Graph tasks can be roughly categorized into node-level, edge-level, and graph-level tasks, where the basic learning instances are nodes, edges, and entire graphs, respectively. For node classification, the task-relevant node v_i^t is the node to be classified. In edge classification, the task-relevant nodes are the start and end nodes {v_i^t, v_j^t} of the target edge e_ij. For graph classification, the task-relevant nodes {v_i^t}_{i=1}^{|V|} include all nodes in the target graph G.</p><p>For any graph task instance, the prediction relies solely on the embeddings of the corresponding task-relevant nodes. These node embeddings capture the surrounding subtree structures, which are also known as computation trees.</p><p>Definition 2.2 (Computation Trees <ref type="bibr">(Chuang &amp; Jegelka, 2022)</ref>). Given a node v in graph G, the L-layer computation tree T_v^L is constructed by recursively expanding the subtrees of its neighboring nodes, starting with T_v^1 = v.</p></div>
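The simplified message-passing update described in the preliminary (mean aggregation for &#961;, identity update for g) can be sketched in NumPy as follows. This is our illustration, not code from the paper; the function name `gnn_layer` and the dict-of-neighbor-lists representation are assumptions:

```python
import numpy as np

def gnn_layer(z, neighbors, W1, W2):
    """One simplified message-passing layer:
    z_i' = sigma(W1 @ z_i + W2 @ mean_{j in N_i} z_j),
    with rho = mean aggregation and g = identity, as assumed in the analysis."""
    out = np.zeros((z.shape[0], W1.shape[0]))
    for i in range(z.shape[0]):
        if neighbors[i]:
            agg = z[neighbors[i]].mean(axis=0)  # rho: average neighbor embeddings
        else:
            agg = np.zeros(z.shape[1])          # isolated node: zero message
        out[i] = np.maximum(0.0, W1 @ z[i] + W2 @ agg)  # sigma = ReLU
    return out
```

Stacking L such layers gives each node the embedding of its L-hop computation tree, which is the object the later analysis works with.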
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Task-Tree Construction and Encoding</head><p>The learning process of message-passing GNNs can be interpreted as recursive aggregation over computation trees, where the representation of a node v produced by an L-layer GNN corresponds to the embedding of its L-hop computation tree T_v^L. Since predictions in graph tasks rely exclusively on the embeddings of task-relevant nodes, and those embeddings are determined by their respective computation trees, we construct a unified task-tree for each learning instance, whether a node, edge, or entire graph, by merging the relevant computation trees, as illustrated in Figure <ref type="figure">1</ref>.</p><p>Definition 2.3 (Task-Trees). For any graph instance, whether a node, edge, or graph, we have a set of task-relevant nodes {v_1^t, ..., v_n^t} and their corresponding L-layer computation trees {T_1, ..., T_n}. These computation trees can be reformulated into a larger task-tree T_t by introducing a virtual node that connects all task-relevant nodes.</p><p>To encode a task-tree, we adopt a simple yet effective aggregation strategy. Given a task-tree T_t composed of a virtual node v_t and task-relevant nodes {v_1^t, ..., v_n^t}, we compute its representation using a MEAN aggregator over the embeddings of the individual computation trees: z_t = (1/n) &#8721;_{i=1}^{n} &#981;(T_i), where T_i denotes the computation tree rooted at v_i^t, and &#981; is a shared GNN encoder. This representation serves as the input for downstream objectives such as reconstruction, classification, or alignment.</p></div>
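Given node embeddings produced by an L-layer GNN (each row being the embedding of that node's computation tree), the MEAN aggregation above reduces to a one-liner; this is an illustrative sketch of the encoding step, not the paper's released implementation:

```python
import numpy as np

def encode_task_tree(node_embeddings, task_relevant_nodes):
    """Task-tree embedding: MEAN over the embeddings of the computation trees
    rooted at the task-relevant nodes (equivalently, the virtual node's state
    after one round of mean aggregation)."""
    return np.mean(node_embeddings[task_relevant_nodes], axis=0)

# Node-, edge-, and graph-level instances reduce to the same call:
#   node v:      encode_task_tree(Z, [v])
#   edge (u, v): encode_task_tree(Z, [u, v])
#   graph G:     encode_task_tree(Z, list(range(len(Z))))
```

The uniformity advantage is visible here: all three task levels differ only in which rows are averaged.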
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Comparison to Existing Works</head><p>Unlike our proposed task-trees, several existing approaches <ref type="bibr">(Qiu et al., 2020;</ref><ref type="bibr">Sun et al., 2023;</ref><ref type="bibr">Huang et al., 2023;</ref><ref type="bibr">Liu et al., 2024a;</ref><ref type="bibr">He &amp; Hooi, 2024)</ref> utilize k-hop subgraphs extracted from graphs as the basic learning instances. For instance, in node classification, ego-graphs are constructed around each node, where the label of the central node is assigned to the induced subgraph, effectively reformulating node classification as a subgraph classification task. A similar transformation can be applied to edge-level and graph-level tasks by converting them into subgraph-level learning problems. This method involves: (1) extracting ego-graphs centered around task-relevant nodes and (2) applying GNNs to learn graph-level embeddings for classification. However, this subgraph extraction process incurs substantial computational overhead, increasing both time and memory requirements due to the necessity of storing and processing induced subgraphs. Moreover, information within these subgraphs is not always effectively captured by message-passing GNNs, as GNNs may struggle to learn essential substructures preserved in graphs <ref type="bibr">(Garg et al., 2020;</ref><ref type="bibr">Chen et al., 2020;</ref><ref type="bibr">Zhang et al., 2024a)</ref>, thereby limiting the efficacy of subgraphs as learning instances. In contrast, our proposed task-trees offer both greater efficiency and improved learnability (Table <ref type="table">1</ref>). Specifically, encoding task-trees for node-, link-, or graph-level tasks involves: (1) augmenting the original graph by adding virtual task nodes and connecting them to task-relevant nodes and (2) applying GNNs over the augmented graph to encode the embeddings of these virtual nodes for prediction. 
For example, in node classification, we first introduce virtual nodes connected to each node in the original graph and subsequently apply GNNs to this augmented structure, allowing virtual node embeddings to be learned for classification. Consequently, our method requires only the addition of nodes and edges to the existing graph, making it significantly more efficient than extracting and storing subgraphs. Furthermore, encoding task-trees is equivalent to directly encoding virtual nodes through message passing, ensuring that task-tree information remains fully learnable by standard GNNs. Empirically, task-trees consistently outperform subgraphs in both effectiveness and efficiency (Section 5.6 and Appendix I).</p><p>The most closely related work to our task-tree framework is GFT <ref type="bibr">(Wang et al., 2024b)</ref>, which introduces computation trees to align heterogeneous graph tasks. While both approaches share the core intuition of structuring task-specific trees, GFT adopts a model-centric perspective, featuring a learnable vocabulary, a multi-faceted reconstruction objective, and specialized adaptation classifiers. In contrast, our task-tree framework is theory-driven and emphasizes the design of learning instances rather than model complexity. Notably, GFT empirically demonstrates the potential of computation trees for transferability, while our work complements it with a formal theoretical foundation, offering a principled understanding of task alignment in graph learning. Together, these works represent complementary advances toward general-purpose graph foundation models. We present additional discussions on related work in Appendix A. </p></div>
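The two-step procedure above (add virtual task nodes, then run the GNN on the augmented graph) can be sketched as follows. The function name and the edge-list representation are our assumptions; the point is that only nodes and edges are added, with no subgraphs extracted or stored:

```python
def add_virtual_task_nodes(num_nodes, edges, instances):
    """Augment a graph with one virtual task node per learning instance.

    edges: list of (u, v) pairs; instances: list of task-relevant node lists,
    e.g. [[v]] for a node instance, [[u, v]] for an edge instance, or
    [list(range(num_nodes))] for a whole-graph instance.
    Returns (new_num_nodes, new_edges, virtual_ids)."""
    new_edges = list(edges)
    virtual_ids = []
    next_id = num_nodes
    for relevant in instances:
        for v in relevant:
            new_edges.append((next_id, v))  # virtual node -> task-relevant node
        virtual_ids.append(next_id)
        next_id += 1
    return next_id, new_edges, virtual_ids
```

Running any message-passing GNN over the augmented graph then yields the virtual-node embeddings used for prediction, in contrast with extracting and batching one induced subgraph per instance.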
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Theoretical Analysis of Task-Trees</head><p>In this section, we present a theoretical analysis of task-trees, focusing on their stability, transferability, and generalization as foundational learning instances. This analysis provides formal support for the Task-Tree Generality Assumption, which posits that transferable patterns across graph tasks are preserved within task-tree structures.</p><p>Our goal is not to assert the universal superiority of task-trees over other learning units such as subgraphs, but rather to establish the theoretical plausibility of using task-trees to capture cross-task generalities. By grounding the construction and use of task-trees in formal guarantees, we lay the foundation for principled pretraining and transfer learning across heterogeneous graph tasks.</p><p>We begin by examining the stability of GNNs in learning task-tree representations, showing that task-trees with similar subtree structures produce analogous embeddings. To facilitate this analysis, we first define the notation for describing subtree information: x_i^(0) = x_i denotes the original node feature, and x^(l) denotes the subtree information of nodes at the l-th layer, as illustrated in Figure <ref type="figure">2</ref>. In this figure, for l = 1, only the nodes in the first layer of the tree are considered, and for l = 2, only the nodes in the second layer are considered.</p><p>Theorem 3.1 (Stability on Task-Trees). Given two L-layer task-trees T_t^1 and T_t^2, with task-relevant nodes {v_1, ..., v_n} and {v_1, ..., v_m}, respectively, the distance between task-trees is defined as</p><p>where &#981; is the GNN encoder, T_i is the computation tree corresponding to node i, C_1, C_2 are constants related to the encoder, and B_x represents the bounded norm of x.</p><p>Theorem 3.1 (proved in Appendix D.1) suggests that two task-trees are likely to have similar representations if their subtrees are similar. 
This theorem highlights the significance of similarity between pairs of subtrees, while downplaying the impact of the number of subtrees (i.e., the width of the task-trees), even though having more subtrees could potentially increase diversity and thus magnify the discrepancy. The theorem also implies that increasing the number of GNN layers may lead to a looser bound, which aligns with previous analyses <ref type="bibr">(Garg et al., 2020;</ref><ref type="bibr">Ju et al., 2023a)</ref>.</p><p>Illustration 3.2. This theorem provides theoretical support for using task-trees as basic learning instances in graph tasks. Consider two task-trees: one representing a node (with a single subtree) and the other representing a graph (with multiple subtrees). While the widths of these task-trees differ significantly, if their subtrees share some degree of similarity, they can produce similar representations. Thus, this theorem ensures that task-trees of nodes, edges, or graphs can potentially be similar, making it possible to use a GNN encoder to capture the shared patterns among them.</p><p>We now examine the transferability of task-trees. Specifically, assuming a model is pretrained on a task-tree reconstruction task<ref type="foot">foot_2</ref>, we aim to quantify how the knowledge acquired during pretraining can be transferred to downstream tasks. The pretraining objective is defined as L_P(g &#8226; &#981;) := E_{(T&#771;, T)&#8764;P} &#8741;g(&#981;(T&#771;)) &#8722; &#981;(T)&#8741;&#178;, where P represents the task-tree distribution used for pretraining, &#981; &#8712; &#934; and g &#8712; G are the GNN encoder and the reconstruction head, respectively, T denotes the task-tree, and T&#771; is the corrupted version of T generated using arbitrary augmentations. Note that the reconstruction head g is used only during pretraining and is discarded during fine-tuning. 
Then, we define the risk on a downstream task as R_T(f &#8226; &#981;) := E_{(T,y)&#8764;T} &#954;(f(&#981;(T)), y), where f &#8712; F is a linear head for predictions, T represents the downstream task distribution with task-tree T and label y, and &#954; denotes the loss function.</p><p>Theorem 3.3 (Transferability on Task-Trees). Given two task-tree encoders &#981;, &#981;&#8242; &#8712; &#934;, we have</p><p>|min_{f&#8712;F} R_T(f &#8226; &#981;) &#8722; min_{f&#8712;F} R_T(f &#8226; &#981;&#8242;)| &#8804; C_&#948; |min_{g&#8712;G} L_P(g &#8226; &#981;) &#8722; min_{g&#8242;&#8712;G} L_P(g&#8242; &#8226; &#981;&#8242;)|^&#948;,</p><p>where C_&#948; &#8776; O(1) and &#948; = 1/2.</p><p>The proof is provided in Appendix D.2. In summary, Theorem 3.3 demonstrates that knowledge gained through pretraining on task-tree reconstruction tasks is transferable to downstream tasks, and it quantifies the extent of this transfer. The left-hand side (LHS) of the theorem shows how different representations impact performance on downstream tasks, while the right-hand side (RHS) reflects the difference in pretraining losses between the two encoders. Therefore, if two encoders exhibit similar losses during pretraining, their transferability to a new task should be comparable.</p><p>Illustration 3.4. To give a better understanding of why Theorem 3.3 implies that a model pretrained on task-trees can bring transferable information to downstream tasks, we present an example. Consider the case where &#981; is the pretrained encoder and &#981;&#8242; is a randomly initialized encoder. The LHS term min_{f&#8712;F} R_T(f &#8226; &#981;&#8242;) &#8722; min_{f&#8712;F} R_T(f &#8226; &#981;) measures the amount of knowledge that is acquired during pretraining and can be transferred to downstream tasks, and the RHS term min_{g&#8712;G} L_P(g &#8226; &#981;) &#8722; min_{g&#8242;&#8712;G} L_P(g&#8242; &#8226; &#981;&#8242;) measures the total knowledge acquired during pretraining. Thus, the constants C_&#948; and &#948; quantify how much of this knowledge is transferable to downstream tasks. 
Since both C_&#948; and &#948; are reasonably small, we conclude that pretraining on task-trees provides sufficient knowledge to benefit downstream tasks.</p><p>To further explain why the task-tree-based pretraining and fine-tuning framework is effective for downstream tasks, we derive the following generalization bound.</p><p>Theorem 3.5 (Generalization on Task-Trees). Given two task-tree distributions, P for pretraining and T for fine-tuning, suppose the encoder &#981; is pretrained on a set of task-trees {T_i}_{i=1}^m sampled from P and fine-tuned on task-trees {T_i}_{i=1}^n sampled from T. The generalization bound of the fine-tuned model, with probability at least 1 &#8722; v, is</p><p>where &#981;* = arg min_{&#981;&#8712;&#934;} min_{g&#8712;G} L_P(g &#8226; &#981;) is the optimal task-tree encoder obtained on P, and E_P(g, &#981;) = L_P(g &#8226; &#981;) &#8722; min_{g&#8242;&#8712;G, &#981;&#8242;&#8712;&#934;} L_P(g&#8242; &#8226; &#981;&#8242;) defines the excess risk during pretraining. Constants C_1 and C_2 are related to downstream tasks, while C_&#948; &#8776; O(1) and &#948; = 1/2 are the same as in Theorem 3.3. X_&#981; denotes the distribution of task-tree embeddings encoded via &#981;, and &#8741;T_&#981;(x) &#8722; P_&#981;(x)&#8741; measures the distance between the task-tree distributions of the pretraining and fine-tuning data.</p><p>The proof can be found in Appendix D.3. This theorem outlines key factors affecting model generalization on downstream tasks, such as the transferability of task-trees (C_&#948; (E_P(g, &#981;))^&#948;) and the quality of the pretrained encoder (E_P(g, &#981;)). With regard to the number of task-trees, we find that while increasing the number of fine-tuning samples contributes to more stable optimization ((4C_1/n) &#8721;_{i=1}^n &#8741;&#981;(T_i)&#8741;&#178;), it does not significantly reduce the generalization bound.</p><p>
This provides theoretical evidence that a reasonable number of fine-tuning samples can be sufficient for training a model with strong generalization capabilities. Moreover, the discrepancy between the pretraining and fine-tuning distributions (&#8721;_{x&#8712;X_&#981;} &#8741;T_&#981;(x) &#8722; P_&#981;(x)&#8741;) is crucial: smaller distribution gaps lead to better generalization. This highlights the importance of increasing the diversity of pretraining data, which provides a broader pretraining distribution P. It also supports the potential of developing specialized models for specific domains from a pretrained general model, as discussed in Section 4.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Graph Generality Identifier on Task-Trees</head><p>The theoretical analysis establishes the feasibility of constructing graph foundation models based on task-trees. Building on these insights, we develop the GIT model to empirically validate the Task-Tree Generality Assumption.</p><p>To focus on aligning task spaces across heterogeneous graph tasks, we adopt widely used text-attributed graph benchmarks <ref type="bibr">(Chen et al., 2024b;</ref><ref type="bibr">Zhang et al., 2024b;</ref><ref type="bibr">Feng et al., 2024)</ref>, which simplify feature alignment across datasets. Specifically, we follow <ref type="bibr">Liu et al. (2024a)</ref> and encode all node features into a shared 768-dimensional embedding space using Sentence-BERT <ref type="bibr">(Reimers &amp; Gurevych, 2019)</ref>. This design allows us to isolate and examine the effect of task-tree-based pretraining while holding node features consistent across domains.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">GIT-G: Pretraining to Acquire General Knowledge</head><p>We propose a task-tree reconstruction task as a pretext for pretraining. The key idea is to use two corrupted task-trees to reconstruct each other, thereby capturing corruption-invariant semantics. Given a set of task-trees {T_1^t, ..., T_n^t} sampled from a graph database, we apply corruption techniques to generate two views of each task-tree, denoted as {T&#770;_1^t, ..., T&#770;_n^t} and {T&#771;_1^t, ..., T&#771;_n^t}. For corruption, we use random edge masking and random attribute masking <ref type="bibr">(Zhu et al., 2020;</ref><ref type="bibr">Wang et al., 2025c)</ref> due to their computational efficiency. We then use an encoder &#981; to obtain embeddings for the corrupted task-trees, resulting in {&#7825;_1, ..., &#7825;_n} and {z&#771;_1, ..., z&#771;_n}. Inspired by <ref type="bibr">(Thakoor et al., 2022)</ref>, we perform reconstruction as</p><p>where g is a non-linear MLP projector, &#961;(z) = z/&#8741;z&#8741; performs normalization, sg is the stop-gradient operation, and h is the average of all instances z. The reconstruction loss captures the semantics of the task-trees in a predictive manner, while the KL regularizer ensures the embeddings are projected into a shared space by minimizing the KL divergence between individual instances and their center.</p></div>
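As a rough illustration of predictive reconstruction between two corrupted views, the following BGRL-style sketch uses a projector g, L2 normalization &#961;(z) = z/&#8741;z&#8741;, and a stop-gradient target. It is our assumption-laden simplification: it omits the KL regularizer and is not the paper's exact objective:

```python
import numpy as np

def reconstruction_loss(z_a, z_b, projector):
    """Illustrative sketch (not the paper's exact loss): embeddings of one
    corrupted view predict the other through projector g, with a stop-gradient
    on the target and row-wise L2 normalization rho(z) = z / ||z||."""
    p = projector(z_a)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)          # rho(g(z_a))
    target = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)  # sg(rho(z_b))
    # negative cosine similarity between prediction and (constant) target
    return -np.mean(np.sum(p * target, axis=1))
```

With a symmetric second term (views swapped) and the KL term toward the batch center h, this follows the shape of the objective described above.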
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">GIT-S: Specification via Instruction Tuning</head><p>Theorem 3.5 highlights the relationship between model generalization and the distribution gap between the pretraining data P and fine-tuning data T, showing that a smaller gap leads to better generalization. Based on this finding, it is feasible to develop a specialized model for specific domains from a pretrained general model. This rests on the mild assumption that graphs from the same domain have similar task-tree distributions {T_1, ..., T_n}. If the pretrained model is post-trained on a task-tree distribution P_post sampled from {T_1, ..., T_n}, the pretraining data distribution P can be adjusted towards these task-tree distributions. This reduces the discrepancy &#8721;_{x&#8712;X_&#981;} &#8741;T_&#981;(x) &#8722; P_&#981;(x)&#8741; in Theorem 3.5, thereby improving model generalization on the target domain. To achieve this, we propose an instruction-tuning method for post-training the pretrained model.</p><p>Instruction tuning is a supervised fine-tuning (SFT) technique designed to enhance the capabilities of a pretrained model by post-training it on a small dataset. Our goal is to fine-tune the model using instructions to specialize it for a particular domain of interest. Given a pretrained model &#981;* and a set of task-trees {T_1, ..., T_n} from the target domain, we post-train the model using the SFT loss L_SFT(&#981;) = (1/n) &#8721;_{i=1}^n &#954;(&#981;(T_i), &#968;(T_i)), where &#968; is the instruction generation function for each task-tree, and &#954; is the corresponding loss function. In this paper, as we use text-attributed graphs in our experiments, we define instructions as the embeddings of label descriptions encoded by an LLM, similar to <ref type="bibr">Liu et al. (2024a)</ref>, and we use mean squared error as the loss function &#954;. Additional model analysis is provided in Appendix B.</p></div>
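Under the stated choices (instructions &#968;(T) as LLM embeddings of label descriptions, &#954; as mean squared error), the post-training loss can be sketched as below; the function and argument names are illustrative, and the label-description embeddings are assumed to be given:

```python
import numpy as np

def sft_loss(tree_embeddings, label_text_embeddings, labels):
    """Instruction-tuning sketch: pull each task-tree embedding toward the
    LLM embedding of its label description (kappa = mean squared error).
    label_text_embeddings[y] is the encoded description of class y."""
    targets = label_text_embeddings[labels]  # psi(T_i): instruction per tree
    diff = tree_embeddings - targets
    return np.mean(np.sum(diff ** 2, axis=1))
```

Minimizing this on a small domain-specific set of task-trees shifts the encoder toward the target task-tree distribution, which is the mechanism Theorem 3.5 rewards.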
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experiment</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Experimental Setup</head><p>Datasets. We conduct experiments on over 30 text-attributed graphs spanning five domains: academic networks, e-commerce networks, knowledge graphs, molecular graphs, and temporal graphs. Pretraining is performed on a diverse subset including Arxiv (academic), Products (e-commerce), WN18RR and FB15K237 (knowledge), and Chemblpre and PCBA (molecular). Specialization is evaluated on representative datasets for each domain: Arxiv, Products, FB15K237, and PCBA. For temporal graphs, which are e-commerce temporal graphs, we also use Products for SFT to assess robustness under temporal distribution shifts. We provide the full dataset details in Appendix E.1.</p><figure type="table"><head>Table 2: We report the model performance across five graph domains: academia, e-commerce, knowledge base, molecular, and temporal graphs, with results averaged over all graphs within each domain. Note that -G and -S represent the general and specialized versions of GIT, respectively. The comprehensive results can be found in Appendix F.</head><table>
<row role="label"><cell>Setting</cell><cell>Method</cell><cell>Academic</cell><cell>E-commerce</cell><cell>KG</cell><cell>Molecule</cell><cell>Temporal</cell><cell>Held-out Avg.</cell><cell>Avg.</cell></row>
<row><cell>0-shot</cell><cell>Sup. GNN</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell></row>
<row><cell>0-shot</cell><cell>GraphMAE</cell><cell>15.42</cell><cell>8.19</cell><cell>-</cell><cell>47.19</cell><cell>-</cell><cell>26.67</cell><cell>25.11</cell></row>
<row><cell>0-shot</cell><cell>OFA</cell><cell>13.98</cell><cell>8.73</cell><cell>-</cell><cell>50.49</cell><cell>-</cell><cell>27.20</cell><cell>26.14</cell></row>
<row><cell>0-shot</cell><cell>GIT-G</cell><cell>14.88</cell><cell>8.79</cell><cell>-</cell><cell>53.34</cell><cell>-</cell><cell>28.56</cell><cell>27.50</cell></row>
<row><cell>0-shot</cell><cell>GIT-S</cell><cell>23.45</cell><cell>17.06</cell><cell>-</cell><cell>62.83</cell><cell>-</cell><cell>35.19</cell><cell>36.32</cell></row>
<row><cell>3-shot</cell><cell>Sup. GNN</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell></row>
<row><cell>3-shot</cell><cell>GraphMAE</cell><cell>49.25</cell><cell>48.20</cell><cell>56.56</cell><cell>56.01</cell><cell>40.31</cell><cell>50.15</cell><cell>52.07</cell></row>
<row><cell>3-shot</cell><cell>OFA</cell><cell>45.93</cell><cell>57.06</cell><cell>56.97</cell><cell>57.03</cell><cell>38.92</cell><cell>51.84</cell><cell>53.70</cell></row>
<row><cell>3-shot</cell><cell>GIT-G</cell><cell>54.00</cell><cell>57.22</cell><cell>67.55</cell><cell>55.96</cell><cell>39.95</cell><cell>56.09</cell><cell>57.82</cell></row>
<row><cell>3-shot</cell><cell>GIT-S</cell><cell>55.18</cell><cell>58.01</cell><cell>67.80</cell><cell>62.82</cell><cell>41.38</cell><cell>58.69</cell><cell>60.15</cell></row>
<row><cell>Finetune</cell><cell>Sup. GNN</cell><cell>73.57</cell><cell>78.21</cell><cell>66.86</cell><cell>73.65</cell><cell>62.61</cell><cell>71.14</cell><cell>72.25</cell></row>
<row><cell>Finetune</cell><cell>GraphMAE</cell><cell>73.81</cell><cell>76.57</cell><cell>72.61</cell><cell>71.41</cell><cell>62.75</cell><cell>71.37</cell><cell>72.79</cell></row>
<row><cell>Finetune</cell><cell>OFA</cell><cell>72.18</cell><cell>76.64</cell><cell>72.38</cell><cell>74.03</cell><cell>62.31</cell><cell>71.48</cell><cell>73.08</cell></row>
<row><cell>Finetune</cell><cell>GIT-G</cell><cell>75.82</cell><cell>78.55</cell><cell>75.73</cell><cell>74.57</cell><cell>64.59</cell><cell>73.84</cell><cell>75.37</cell></row>
<row><cell>Finetune</cell><cell>GIT-S</cell><cell>75.88</cell><cell>78.83</cell><cell>76.15</cell><cell>75.20</cell><cell>64.68</cell><cell>74.19</cell><cell>75.72</cell></row>
</table></figure><p>Figure 3: The model performance on all datasets in the fine-tuning setting. For GIT, the best result is selected between GIT-G and GIT-S, while the best baseline performance is chosen among GCN/GAT/GIN, BGRL, GraphMAE, and OFA.</p><p>Baselines. We compare against a broad spectrum of baselines, including supervised GNNs (GCN <ref type="bibr">(Kipf &amp; Welling, 2017)</ref>, GAT <ref type="bibr">(Veli&#269;kovi&#263; et al., 2018)</ref>, GIN <ref type="bibr">(Xu et al., 2019)</ref>), self-supervised models (BGRL <ref type="bibr">(Thakoor et al., 2022)</ref>, GraphMAE <ref type="bibr">(Hou et al., 2022)</ref>), and graph foundation models (OFA <ref type="bibr">(Liu et al., 2024a)</ref>, GraphPrompt+ <ref type="bibr">(Liu et al., 2023)</ref>, All in One <ref type="bibr">(Sun et al., 2023)</ref>, OpenGraph <ref type="bibr">(Xia et al., 2024)</ref>, AnyGraph <ref type="bibr">(Xia &amp; Huang, 2024)</ref>). We also include domain-specific expert models for comparison. See Appendix E.2 for full descriptions.</p><p>Experimental Protocols. We use GraphSAGE <ref type="bibr">(Hamilton et al., 2017)</ref> as the encoder and repeat each experiment five times with different random seeds. We evaluate three paradigms: fine-tuning, in-context learning, and zero-shot learning. Fine-tuning updates all model parameters on downstream data. In in-context learning (few-shot without fine-tuning), we randomly sample k instances per class, compute prototype embeddings via averaging, and classify using nearest-prototype inference.</p><p>Following Liu et al. (2024a); He &amp; Hooi (2024), we sample 500 5-way 3-shot tasks; if the number of classes is fewer than 5, we use the actual number of classes as the number of ways. Zero-shot learning replaces prototypes with class description embeddings generated by an LLM. 
For evaluation, we report accuracy for node and edge classification, and AUC for graph classification and link prediction. Additional details are provided in Appendix E.</p><p>Dataset-Wise Performance. Figure <ref type="figure">3</ref> shows fine-tuning performance across all datasets. GIT consistently outperforms the strongest baseline in the majority of cases, validating the effectiveness of task-trees as generalizable learning units across heterogeneous graph tasks.</p><p>Effect of Specialization. We summarize three key observations: (1) Specialization (GIT-S) enhances the performance of GIT-G across most settings (Table <ref type="table">2</ref>), although the impact of specialization varies depending on the dataset used (Appendix Figure <ref type="figure">9</ref> and Table <ref type="table">25</ref>). (2) Specialization does not significantly degrade model performance on other domains (Table <ref type="table">27</ref>). (3) Applying the proposed specialization approach to other models also yields performance improvements in specific domains (Table <ref type="table">4</ref>), demonstrating that this post-training step generalizes as a means of enhancing model capacity.</p></div>
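The in-context learning protocol above (average the k sampled support embeddings of each class into a prototype, then classify queries by nearest prototype) can be sketched as follows. This is a minimal illustration, not the paper's code; the function name, array layout, and Euclidean-distance choice are our assumptions.

```python
import numpy as np

def prototype_classify(support_emb, support_labels, query_emb):
    """Nearest-prototype inference: average the k support embeddings of
    each class into a prototype, then assign every query embedding to
    the class of the closest prototype (Euclidean distance here)."""
    classes = np.unique(support_labels)
    protos = np.stack([support_emb[support_labels == c].mean(axis=0)
                       for c in classes])
    # pairwise distances: (num_queries, num_classes)
    d = np.linalg.norm(query_emb[:, None, :] - protos[None, :, :], axis=-1)
    return classes[d.argmin(axis=1)]
```

In the zero-shot variant described above, the averaged prototypes would simply be replaced by LLM-generated class-description embeddings, with the same nearest-neighbor rule.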
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Comparison to State-of-the-Art Methods</head><p>To evaluate the effectiveness of GIT, we compare it against recent graph foundation models, including GraphPrompt+ <ref type="bibr">(Liu et al., 2023)</ref>, All in One <ref type="bibr">(Sun et al., 2023)</ref>, OpenGraph <ref type="bibr">(Xia et al., 2024)</ref>, and AnyGraph <ref type="bibr">(Xia &amp; Huang, 2024)</ref>, under the pretrain-then-finetune paradigm. As shown in Table <ref type="table">3</ref>, we benchmark GIT-G and all baselines across three representative domains: academic networks (node classification), knowledge graphs (edge classification), and molecular graphs (graph classification). GIT-G consistently outperforms all competing methods across tasks and domains. While many existing models pursue broad generalization by addressing multiple aspects (e.g., domain shifts, modality fusion, and prompt engineering), GIT adopts a targeted approach centered on task alignment. Its core innovation lies in the use of task-trees, which provide a unified abstraction across graph tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Ablation on Training Strategies</head><p>We conduct an ablation study on academic networks to evaluate the impact of different training strategies. Specifically, we examine four approaches: (1) Base Model: Pretraining on the target graph. (2) Expert Model: Pretraining on all academic networks. (3) General Model: Pretraining on the default pretraining datasets. (4) Specialized Model: Pretraining on the default datasets followed by specialization on Arxiv. The results, averaged across all academic graphs, are reported in Table <ref type="table">4</ref>. Notably, the general model of GIT maintains stable performance relative to both the base and expert models. In contrast, GraphMAE and OFA exhibit performance degradation when transitioning from the base and expert models to the general model. This finding underscores the potential of task-trees in mitigating such negative transfer. Furthermore, specialization in GIT allows its performance to closely approximate that of expert models trained exclusively on academic graphs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.">Comparison to Domain Experts</head><p>We compare GIT to domain experts in molecular and knowledge graphs, as these domains require domain-specific knowledge. Our findings indicate that the specialized GIT (GIT-S) closely approximates the performance of these experts. In molecular graphs, GIT-S achieves an average performance of 62.83, ranking second to the state-of-the-art GIMLET (64.15) <ref type="bibr">(Zhao et al., 2023)</ref> while outperforming other domain experts <ref type="bibr">(Zeng et al., 2022;</ref><ref type="bibr">Su et al., 2022;</ref><ref type="bibr">Taylor et al., 2022)</ref>, whose best performance is 55.82 (Table <ref type="table">24</ref>). Similarly, in knowledge graphs, GIT-S attains an average score of 67.80, approaching the KG expert ULTRA (68.53) <ref type="bibr">(Galkin et al., 2024)</ref>, as shown in Table <ref type="table">23</ref> in the Appendix.</p><p>We compare the efficiency and effectiveness of task-trees against subgraph-based learning units in the fine-tuning setting, as shown in Figure <ref type="figure">4</ref> and Table <ref type="table">5</ref>. For a fair comparison, we implement a subgraph-based variant of our model, denoted GIT-SubG, by substituting task-trees with subgraphs while keeping all other components unchanged.</p><p>Experimental results across three domains, academic (node classification), knowledge graphs (edge classification), and molecular graphs (graph classification), demonstrate that task-trees consistently outperform subgraphs in both computational efficiency (Figure <ref type="figure">4</ref>) and predictive performance (Table <ref type="table">5</ref>). This highlights the advantage of task-trees as compact, expressive, and structurally aligned learning instances, especially in cross-task and cross-domain generalization scenarios.</p></div>
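To make the contrast between the two learning units concrete, the following minimal sketch (a hypothetical adjacency-list representation, not the paper's implementation) shows what each unit collects around a root node: a computation/task-tree keeps one branch per message path, so nodes may repeat across branches, while a k-hop subgraph deduplicates nodes.

```python
from collections import deque

def task_tree_nodes(adj, root, depth):
    """Level-by-level nodes of the depth-`depth` computation (task-)tree
    rooted at `root`: each level expands every node's neighbors, so a
    node can appear once per message path."""
    levels, frontier = [[root]], [root]
    for _ in range(depth):
        frontier = [n for u in frontier for n in adj[u]]
        levels.append(frontier)
    return levels

def khop_subgraph_nodes(adj, root, k):
    """Distinct nodes within k hops of `root` (BFS): the subgraph unit."""
    seen, q = {root}, deque([(root, 0)])
    while q:
        u, d = q.popleft()
        if d == k:
            continue
        for n in adj[u]:
            if n not in seen:
                seen.add(n)
                q.append((n, d + 1))
    return seen
```

On a triangle graph `{0: [1, 2], 1: [0, 2], 2: [0, 1]}`, the depth-2 tree rooted at node 0 has four leaves (one per message path), whereas the 2-hop subgraph contains just the three distinct nodes; this duplication-free, tree-shaped view is what makes the unit cheap to batch.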
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.7.">Results on Non-Text-Attributed Graphs</head><p>While prior experiments use text-attributed graphs, GIT does not inherently rely on textual information. We adopt text-attributed datasets to isolate the impact of task heterogeneity while sidestepping the confounding effects of feature heterogeneity. Textual attributes enable feature alignment across graphs via a shared encoder, allowing us to more clearly assess the benefits of task-tree generalization. Importantly, GIT is compatible with non-textual graphs. To demonstrate this, we introduce a lightweight module that addresses feature heterogeneity by applying SVD to project features into a shared space. As shown in Table 6, we pretrain GIT-G on non-text-attributed datasets, namely PubMed (node classification), Citeseer (link prediction), and IMDB-B (graph classification), and fine-tune on Cora (link prediction). Results, evaluated via AUC, confirm that GIT-G remains effective without relying on textual information, validating its broader applicability.</p></div>
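The lightweight SVD module can be sketched as follows; this is an illustrative implementation under our own assumptions (truncation dimension `d`, per-graph application, and zero-padding for graphs with fewer raw dimensions), not the paper's exact code.

```python
import numpy as np

def svd_align(X, d):
    """Project node features X (n x f) into a shared d-dimensional space
    via truncated SVD. Each graph is projected independently, so graphs
    with different raw feature dimensions become comparable."""
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    k = min(d, S.shape[0])
    Z = U[:, :k] * S[:k]            # n x k coordinates in the singular basis
    if k < d:                        # pad if the graph has fewer dimensions
        Z = np.pad(Z, ((0, 0), (0, d - k)))
    return Z
```

Because `Z` keeps only the top singular directions, its Frobenius norm never exceeds that of the original features, and every graph ends up with the same feature width `d` before entering the shared encoder.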
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion and Limitations</head><p>We introduce task-trees as unified learning instances for aligning heterogeneous graph-based tasks, supported by both theoretical and empirical analyses. Based on this abstraction, we develop GIT, a pretrained graph foundation model that leverages a small set of source graphs to generalize across over 30 datasets spanning five diverse domains.</p><p>Our results demonstrate that task-trees preserve transferable patterns across tasks, offering a scalable and principled pathway for cross-domain generalization in graph learning.</p><p>Limitations and Future Work. While task-trees serve as a promising abstraction (analogous to images in vision and sentences in language), several challenges remain. First, although our results suggest that task-trees capture generalities across tasks, pinpointing the precise transferable patterns remains an open question. Future work should explore this from both theoretical and empirical perspectives. Second, transforming graph neighborhoods into tree structures may lead to information loss due to the limitations of message-passing GNNs. This can affect model expressiveness, especially for graphs with complex higher-order structures. Incorporating more expressive architectures, such as higher-order GNNs <ref type="bibr">(Morris et al., 2019;</ref><ref type="bibr">2023)</ref> or advanced graph modeling <ref type="bibr">(Wang et al., 2025b)</ref>, may help mitigate this issue. Finally, while GIT demonstrates strong performance with a lightweight design, exploring more sophisticated variants, e.g., incorporating attention mechanisms, graph prompting, or task-conditioned decoders, offers exciting directions for extending the generality and adaptability of graph foundation models. We outline several of these opportunities in Appendix C.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Related Work</head><p>Graph Neural Networks. GNNs are a class of learning models specifically designed to operate on graph-structured data and have demonstrated substantial success across a variety of domains. Their strength lies in their ability to perform relational learning, where information from neighboring nodes is aggregated and used to enhance node representations. For instance, GCN <ref type="bibr">(Kipf &amp; Welling, 2017)</ref> utilizes message-passing to aggregate information from neighboring nodes to central nodes.</p><p>Building on this, models such as GraphSAGE <ref type="bibr">(Hamilton et al., 2017)</ref> and GAT <ref type="bibr">(Veli&#269;kovi&#263; et al., 2018)</ref> introduce innovative techniques like neighborhood sampling and attention mechanisms, respectively, further advancing performance on graph learning tasks. However, these methods are limited to solving a single task by training from scratch <ref type="bibr">(Zhang et al., 2019;</ref><ref type="bibr">Fan et al., 2022;</ref><ref type="bibr">Qian et al., 2025;</ref><ref type="bibr">Ma et al., 2023;</ref><ref type="bibr">Qian et al., 2024)</ref>.</p><p>Transferability of GNNs. Existing works that analyze the shared concepts (generalities) across different graphs primarily follow two approaches. The first is graphon theory, which provides bounds on the distance between graphs generated from the same graphon. This method has been used to study transferability in pretraining and fine-tuning settings <ref type="bibr">(Cao et al., 2023)</ref>, to develop more expressive fine-tuning techniques <ref type="bibr">(Sun et al., 2024)</ref>, and to design new model architectures <ref type="bibr">(Ruiz et al., 2020)</ref>. 
However, despite their theoretical advantages, graphon-based approaches face practical challenges, particularly the strong assumptions required and the difficulty of identifying graphons in large-scale graphs, which limits their applicability in building graph foundation models. The second approach involves leveraging substructures within graphs to identify transferable patterns <ref type="bibr">(Mao et al., 2024)</ref>. This method focuses on extracting subgraphs composed of meaningful substructures for prediction tasks. While this approach offers theoretical insights into stability <ref type="bibr">(Levie et al., 2019;</ref><ref type="bibr">Zhu et al., 2021)</ref>, it struggles to fully capture substructures that are beneficial for downstream tasks <ref type="bibr">(Zhang et al., 2024a)</ref>.</p><p>Graph Foundation Models. Foundation models are designed as general-purpose solvers capable of handling various tasks across different domains. For instance, LLMs, the foundation models in natural language processing, are capable of performing tasks such as summarization, translation, entity recognition, and question-answering. However, building such versatile foundation models for graphs presents unique challenges due to the inherent feature, structural, and task heterogeneity across different graph domains and tasks. To address these challenges, <ref type="bibr">Qiu et al. (2020)</ref> pretrained GNNs using subgraphs as basic units, mitigating structural heterogeneity. Building on this, <ref type="bibr">Sun et al. (2023)</ref> reformulated node-, edge-, and graph-level tasks into subgraph-level tasks, tackling task heterogeneity. Additionally, <ref type="bibr">Huang et al. (2023)</ref> and <ref type="bibr">Liu et al. (2024a)</ref> applied LLMs to unify the feature spaces of cross-domain graphs, addressing feature heterogeneity. These approaches enable models to operate on cross-domain and cross-task graphs. 
Further advancements, such as <ref type="bibr">He &amp; Hooi (2024)</ref> and <ref type="bibr">Li et al. (2024)</ref>, improve node embeddings by jointly optimizing GNN and LLM encoders, facilitating various downstream tasks like few-shot learning and zero-shot learning. However, most of these approaches rely on subgraphs as the primary learning instances, which can result in inefficient training and reduced expressiveness, as discussed in the main paper. Other efforts to resolve feature heterogeneity include methods like singular value decomposition (SVD) <ref type="bibr">(Zhao et al., 2024a;</ref><ref type="bibr">Yu et al., 2024)</ref>, non-parametric encoders <ref type="bibr">(Zhao et al., 2024b)</ref>, or synthesizing text-attributed graphs <ref type="bibr">(Wang et al., 2025a)</ref>. Notably, a recent study by <ref type="bibr">Wang et al. (2024b)</ref> explores computation trees as transferable patterns in graphs. However, our work differs in three key aspects: (1) While <ref type="bibr">Wang et al. (2024b)</ref> focus on identifying transferable patterns across graphs, our approach is centered on designing fundamental learning instances. (2) Our theoretical framework also serves as a foundational basis for <ref type="bibr">Wang et al. (2024b)</ref>. (3) Our proposed GIT significantly simplifies the model proposed by <ref type="bibr">Wang et al. (2024b)</ref>, demonstrating a minimal design that retains effectiveness.</p><p>Another line of research focuses on designing GFMs for single tasks or domains, thereby avoiding the complexities of feature, structural, or task heterogeneity. For example, <ref type="bibr">Galkin et al. (2024)</ref> propose a foundation model for reasoning tasks on knowledge graphs, using triplets as basic transferable patterns. Other studies have explored using LLMs as graph reasoners to solve graph tasks, similar to their role in vision-language models. 
While these methods excel at specific tasks or domains, they are not suitable as general graph solvers across diverse tasks. In contrast to these approaches, our proposed GIT model is pretrained on diverse task-trees to acquire general reasoning capabilities, allowing it to quickly specialize in specific domains through instruction tuning.</p><p>It is challenging for a single graph model to handle tasks across various domains due to pattern conflicts, where the same structural pattern can carry different meanings in different domains. To illustrate this issue, we provide an intuitive example<ref type="foot">foot_3</ref>. Consider a pretraining process involving datasets from multiple domains, such as social networks, molecular networks, academic networks, and knowledge graphs. Suppose the model learns triangle structures during pretraining. In social networks, the semantic meaning of these triangles is stable, following the principle that "the friend of my friend is my friend". However, in molecular graphs, the meaning of triangle patterns may be unstable due to chemical properties. This pattern conflict can significantly degrade the performance of graph models <ref type="bibr">(Cao et al., 2023;</ref><ref type="bibr">Mao et al., 2024)</ref>. Specialization helps resolve this issue by aligning the meanings of certain structural patterns with the semantics specific to the target domain.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2. More Analysis on Domain Regularizer</head><p>The Necessity of Domain Alignment. Datasets from multiple domains are often projected into different subspaces, potentially due to misaligned node attributes <ref type="bibr">(Chen et al., 2024b)</ref> and differing frequent patterns across domains. As a result, the model may "memorize" information specific to each domain rather than learning transferable patterns. This can lead to misunderstandings when the same pattern, appearing across different graphs, is projected into different subspaces.</p><p>Consequently, the model struggles to acquire transferable knowledge that would benefit unseen tasks and specialized domains. Properly aligning the embedding spaces of different domains is therefore crucial for obtaining transferable knowledge and improving performance on unseen graphs and specialized domains.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>How to Regulate Domain Distances?</head><p>We propose a domain regularizer that controls domain distances by projecting cross-domain graphs with different characteristics into a shared embedding space. Specifically, we define a shared subspace across domains and pull the subspaces of the individual domains into alignment with it. The shared subspace should be positioned at the center of the cross-domain datasets to minimize the effort required to adjust the subspaces of all domains.</p><p>In particular, the basis vector of the shared subspace is defined as the average of all instances:</p><formula>h_Basis = E_{D ∼ P(D)} E_{T_i ∼ T_D} [ϕ(T_i)]</formula><p>where P(D) represents the domain distribution, T_D is the distribution of task-trees within domain D, and ϕ(T_i) is the embedding of the task-tree T_i. Given the shared subspace basis, we optimize the KL divergence between each instance and the basis. However, obtaining the global basis vector h_Basis directly is impractical due to dataset size, so we approximate it with the local basis ĥ_Basis, computed by averaging the embeddings of all instances within a batch. We then optimize the KL divergence for all instances in the batch; to mitigate randomness, we empirically use a relatively large batch size (4,096). Formally, the domain regularizer is defined as</p><formula>L_reg = (1/|B|) Σ_{i ∈ B} KL(H ∥ Z_i)</formula><p>where B denotes the batch, and H and Z_i represent the distributions of the local basis vector ĥ_Basis and the instance embedding z_i, respectively.</p><p>How Does the Domain Regularizer Work? To better understand how the domain regularizer functions, we conduct an analysis to demonstrate its benefits in regulating domain distances while preserving the task structure of each dataset. We use eight datasets provided by <ref type="bibr">Liu et al. (2024a)</ref> for pretraining: Cora, Pubmed, Arxiv, WikiCS, WN18RR, FB15K237, CHEMHIV, and CHEMPCBA. 
The analysis results are presented in Figure <ref type="figure">5</ref>. In the figure, we display (a) a heatmap of similarity between different datasets under varying regularizer weights, and visualizations of the embedding space before (b) and after (c) applying the domain regularizer. The results show that the domain regularizer effectively adjusts the distances between datasets by pushing apart overly similar graphs and bringing closer those that are too distinct. Furthermore, the regularizer does not significantly alter the task structure of each dataset, as illustrated in subfigure (d). In this subfigure, we apply the k-means algorithm to each dataset, setting k to the number of classes, and compare the resulting clusters to the ground-truth labels using NMI as the metric. The assumption is that if two sets of vectors yield similar clustering results, the classification outcomes of the same classifier will be similar, indicating that the task structure across the two sets is consistent. The results demonstrate that changing the regularizer weight does not significantly affect task structures. This may be because the regularizer acts by translating vectors toward a central point without altering the relationships between individual pairs of vectors. To further evaluate the impact of the domain regularizer on downstream tasks, we report the model performance averaged over the eight pretraining datasets across all settings in Table <ref type="table">7</ref>. We observe that the domain regularizer boosts model performance, especially in the in-context and zero-shot settings. In addition, we empirically find that λ = 10 leads to better performance, so we set λ = 10 as the default weight in this paper.</p></div>
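The batch-level approximation described above can be sketched as follows. The use of a softmax to turn embeddings into distributions is our illustrative assumption (the paper specifies KL divergence between distributions but not their exact construction), and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def domain_regularizer(batch_emb):
    """Batch approximation of the domain regularizer: average the batch
    embeddings into a local basis, convert basis and instances into
    distributions (softmax, an illustrative choice), and return the mean
    KL divergence KL(H || Z_i) over the batch."""
    basis = batch_emb.mean(axis=0)        # local basis vector
    H = softmax(basis)                     # basis distribution
    Z = softmax(batch_emb, axis=1)         # per-instance distributions
    kl = (H * (np.log(H) - np.log(Z))).sum(axis=1)
    return kl.mean()
```

When all instances already share the basis (e.g., a batch of identical embeddings) the penalty is zero, which matches the intuition that the regularizer only translates instances toward the shared center.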
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.3. Discussion on Homophily and Heterophily</head><p>In node-level tasks, it is important to consider both graph homophily and heterophily <ref type="bibr">(Ma et al., 2022)</ref>. Homophily describes close relationships between connected entities, while heterophily refers to distant relationships between connected entities. Empirically, basic message-passing GNNs tend to perform well on homophilous graphs but struggle with heterophilous graphs <ref type="bibr">(Luan et al., 2022)</ref>. Although GIT uses GraphSAGE as its backbone, it still performs well on heterophilous graphs such as Children and Ratings<ref type="foot">foot_4</ref>. The experimental results for node classification and link prediction on these two graphs are presented in Table <ref type="table">17</ref> and Table <ref type="table">18</ref>, where GIT generally achieves the best performance. We hypothesize that the proposed task-tree structure captures not only homophilous relationships but also heterophilous relationships. A remaining question is whether our message-passing GNN can effectively capture these heterophilous relationships, although <ref type="bibr">Ma et al. (2022)</ref> suggest that basic GNNs may handle heterophilous graphs by memorizing the patterns between a target node and its neighbors. We plan to use more advanced GNNs capable of encoding heterophilous structures to further validate our hypothesis.</p><p>We did not observe a clear trend in which more pretraining data leads to better performance. We consider three potential reasons. (1) From a model perspective, we use a GraphSAGE encoder with limited layers and parameters, which may not fully capture the knowledge contained in the pretraining data. 
Additionally, we apply basic mean pooling to derive task-tree embeddings from task-relevant node embeddings, which may prevent the model from identifying the relative importance of task-relevant nodes, thereby limiting its expressiveness.</p><p>(2) From a training paradigm perspective, we employ a negative-free contrastive learning framework similar to <ref type="bibr">Thakoor et al. (2022)</ref>, but this basic approach may not be expressive enough to extract meaningful knowledge from the pretraining graphs.</p><p>(3) From a data perspective, despite using over 30 graphs in this study, the number of instances is still significantly lower than that of textual or visual instances extracted from the Internet. Furthermore, the pretraining datasets may not be well-aligned. Although we used a textual encoder to align node features, we cannot guarantee that the encoded node features are in the same embedding space <ref type="bibr">(Chen et al., 2024b)</ref>.</p></div>
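Point (2) above refers to a negative-free contrastive framework in the style of Thakoor et al. (2022). A minimal sketch of such an objective is given below; this is our illustrative reconstruction of the general idea (cosine alignment between an online view and a corrupted target view, with no negative pairs), not the paper's exact loss.

```python
import numpy as np

def negative_free_loss(online, target):
    """Negative-free objective in the style of BGRL: align the online
    embedding of each task-tree with the (stop-gradient) target embedding
    of its corrupted view via cosine similarity. Returns
    2 - 2 * mean cosine similarity, which is 0 at perfect alignment."""
    o = online / np.linalg.norm(online, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    return float((2 - 2 * (o * t).sum(axis=1)).mean())
```

Because the loss only pulls the two views of the same instance together, it needs no negative sampling, which is what makes it "basic" in the sense criticized in point (2): a richer objective might extract more knowledge from the pretraining graphs.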
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Potential Model Extensions</head><p>C.1. Pretraining</p><p>How to Design Reconstruction Tasks? Theorem 3.5 suggests that a well-designed encoder, capable of effectively handling reconstruction tasks during pretraining, can improve the model's generalization ability. One approach is to use more powerful encoders to enhance reconstruction performance. Another is to introduce additional reconstruction losses to further refine the encoder, for example by adopting objectives such as those proposed by <ref type="bibr">Qiu et al. (2020)</ref>, <ref type="bibr">Hou et al. (2022)</ref>, <ref type="bibr">Ju et al. (2023b)</ref>, and <ref type="bibr">Wen et al. (2024)</ref>, or by designing more comprehensive reconstruction objectives.</p><p>How to Improve Transferability? The pretraining task, i.e., task-tree reconstruction, differs from the downstream task of task-tree classification, and this task heterogeneity may hinder model transferability <ref type="bibr">(Hu et al., 2020)</ref>. To mitigate this, one could develop more effective adaptation methods, such as graph prompt learning <ref type="bibr">(Sun et al., 2022)</ref>, to reduce task heterogeneity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.2. Specialization via Instruction Tuning</head><p>How to Define Instructions? In this paper, as we focus on experiments with text-attributed graphs, we define instructions as label descriptions encoded by LLMs. However, this approach is not applicable to non-textual graphs. Other methods could be explored to define instructions, such as using proxy models <ref type="bibr">(Hu et al., 2019)</ref> or graph heuristics <ref type="bibr">(Jin et al., 2020)</ref> to generate instructions.</p><p>How to Choose SFT Data? We manually select graphs as supervised fine-tuning datasets for each domain, though the selected graphs may not be fully representative. Unlike textual data, evaluating the quality of graph datasets poses a challenge. Improved dataset selection methods could enhance the SFT process by identifying more representative or diverse data from graph databases. Additionally, while we perform instruction tuning over entire graphs, it is possible that only specific subgraphs are beneficial <ref type="bibr">(Hashemi et al., 2024)</ref>. Developing data selection methods that focus on high-quality subgraphs within a single SFT dataset could improve task-tree selection. Another worthwhile research direction is selecting SFT data that aligns with user preferences <ref type="bibr">(Song et al., 2024)</ref>.</p><p>How to Leverage SFT Data? In scenarios with limited instructions, standard supervised fine-tuning may struggle to capture sufficient knowledge of the target domain. To address this, methods could be proposed to better utilize the unlabeled instances in the SFT dataset, thus enhancing model adaptation <ref type="bibr">(Sohn et al., 2020;</ref><ref type="bibr">Ni et al., 2025)</ref>.</p><p>How to Maintain General Inference Capability? While instruction tuning specializes the model for a specific domain, it may compromise the model's general inference capabilities across other domains. 
This could hinder the model's performance when it needs to function both as a domain expert and a general reasoner. To mitigate this, regularization techniques could be designed to preserve the general knowledge encoded in the model during the instruction tuning process.</p><p>Why Does SFT Work on Graphs? Instruction tuning is a common post-training process in modern large language models (e.g., LLaMA, GPT) that significantly improves instruction-following capabilities. The success of this method in LLMs may stem from the fact that natural language serves as an interface between humans and models <ref type="bibr">(Wei et al., 2021)</ref>. However, the reason instruction tuning works for graphs remains an open question and presents a potential direction for future research.</p><p>C.3. More Scenarios.</p><p>This paper leverages text-attributed graphs to align node features. However, the pre-processing of TAGs can be time-consuming, raising the challenge of how to effectively apply the model to graphs without aligned node features. Furthermore, while we primarily focus on homogeneous graphs in this work, most real-world applications involve heterogeneous graphs. Addressing the question of how to design a single model capable of handling various types of graphs remains an open challenge. Finally, applying the model to specific applications <ref type="bibr">Zhang et al. (2024c)</ref>, which may exhibit unique characteristics, is another important consideration for future research.</p><p>D. Proof</p><p>D.1. Proof of Theorem 3.1</p><p>Proof. We begin by introducing the basic GNN architecture used in the proof. Given a GNN encoder ϕ(·) with parameters W = (W_1, W_2), we use a GraphSAGE-like architecture, defined as follows (with some notation abuse):</p><formula>ϕ(T_i^L) = σ( W_1 x_i + W_2 · (1/|N_i|) Σ_{j ∈ N_i} ϕ(T_j^{L−1}) )</formula><p>where σ is the non-linear activation function, x_i is the node feature of node i, and N_i represents the neighbors of node i, corresponding to its children in the computation tree. 
T_i^L denotes the computation tree of node i with L layers. Neighborhood information is incorporated by averaging the embeddings of neighboring nodes. WLOG, the averaging operation can be replaced with any permutation-invariant set operation without affecting the analysis in this paper. For simplicity, we assume all GNN layers share the same parameters; this assumption does not affect the validity of our proofs. Since these functions and neural networks exhibit Lipschitz continuity, we denote the Lipschitz constant of σ(·) as C_σ. Additionally, we assume the norm of node features is bounded by ∥x∥ ≤ B_x, and the model weights by ∥W_1∥ ≤ B_W1 and ∥W_2∥ ≤ B_W2. While real-world graphs typically exhibit varied node features, standard techniques (as employed in this paper) like normalization can ensure that B_x remains small. We define the distance between task-trees T_1^t with n task-relevant nodes {v_1, ..., v_n} and T_2^t with m task-relevant nodes {v_1, ..., v_m} as:</p><p>where ∥·∥ is the L2 distance. Next, we expand the stability term Δ as:</p><p>Then, we separately analyze term (a) and term (b). Term (a) can be bounded as follows:</p><p>That is, term (a) is bounded by the distance between the average features of nodes in the first layer of the task-trees (i.e., the nodes directly connected to the root). Next, we bound term (b):</p><p>Term (c) describes the distance between the average features of nodes in the second layer of the task-trees, while term (d) follows a recursive formula, similar to term (b). By combining terms (a) and (b), we have:</p><p>We can extend the formula recursively through all layers until the final layer. The recursive nature of the formula allows us to reformulate the bound more easily:</p><p>where Δ_l denotes the distance between task-trees at the l-th layer. 
For clarity, we can interpret C_1 Δ_1 as corresponding to term (a) and C_2 Δ_2 as corresponding to term (c). Next, we explain how to determine C_l and Δ_l for each layer. By analyzing the recursive formula, we determine C_l as follows:</p><p>We then define Δ_l. For a concise definition, we introduce additional notation describing the subtree information:</p><p>Using this notation, we define Δ_l as:</p><p>With a formulation like expression 11, we can decompose the impact of different layers, facilitating further analysis of the upper bound on the distance between two task-trees. Next, we analyze the upper bound of each term. To begin, we first introduce a lemma.</p><p>Lemma D.1. Given two sets of random vectors S_1 = {v_1, ..., v_n} and S_2 = {v_1, ..., v_m}, the following holds:</p><p>Proof. Let us first consider two sets A and B of equal size. We have:</p><p>WLOG, this analysis can be extended to cases where the size of A is n and the size of B is m.</p><p>Based on the Lemma, we have</p><p>which is displayed in Theorem 3.1. We then use the Lemma to bound Δ_l. For example, the upper bound of Δ_1 is:</p><p>Similarly, the upper bound of Δ_2 is:</p><p>where d_i = |N_i| and d_j = |N_j| represent the number of children (i.e., the degree) of nodes i and j, respectively. Thus, the upper bound of Δ_l is:</p><p>Combining the layer-wise bounds, we obtain the overall upper bound of Δ:</p><p>Note that this upper bound is an extreme case; the bound can be tightened by adding supplementary information or making additional assumptions. Additionally, the distance between task-trees can be reduced by applying techniques like normalization, which lowers the values of the constants.</p></div>
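A form of Lemma D.1 consistent with how it is used in the proof (bounding the distance between the averages of two vector sets by their pairwise distances) can be sketched as follows; this is a plausible reconstruction under our own reading, not necessarily the lemma's original statement.

```latex
\left\| \frac{1}{n}\sum_{i=1}^{n} v_i \;-\; \frac{1}{m}\sum_{j=1}^{m} v'_j \right\|
= \left\| \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} \bigl(v_i - v'_j\bigr) \right\|
\;\le\; \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} \bigl\| v_i - v'_j \bigr\|
\;\le\; \max_{i,j} \bigl\| v_i - v'_j \bigr\| .
```

The first equality holds because each $v_i$ appears $m$ times and each $v'_j$ appears $n$ times in the double sum, and the inequalities follow from the triangle inequality.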
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.2. Proof of Theorem 3.3</head><p>Proof. We begin by restating the notations used in the theorem. Let P represent the task-tree distribution, T the downstream task distribution, ϕ ∈ Φ the GNN encoder, g ∈ G the predictor head used during pretraining, and f ∈ F the predictor head for the downstream task. The pretraining objective is defined as</p><p>where T is a task-tree and T̃ is the corresponding corrupted task-tree obtained via arbitrary corruption functions. The risk on the downstream task is defined as</p><formula>R_T(f ∘ ϕ) = E_{(T, y) ∼ T} [κ(f(ϕ(T)), y)]</formula><p>where T is the task-tree, y is the associated label, and κ denotes the loss function. Before we begin the proof, we present an additional helper proposition.</p><p>Proposition D.2 (Ruhe's Trace Inequality <ref type="bibr">(Ruhe, 1970)</ref>). If X and Y are positive semi-definite Hermitian matrices with eigenvalues x_1 ≥ ... ≥ x_n and y_1 ≥ ... ≥ y_n, then tr(XY) ≤ Σ_{i=1}^n x_i y_i.</p><p>Note that we are going to prove</p><p>where C_δ ≈ O(1) and δ = 1/2. The proof involves deriving the upper bound for the term min_{f ∈ F} R_T(f ∘ ϕ), followed by the lower bound for min_{g ∈ G} L_P(g ∘ ϕ) − min_{g′ ∈ G} L_P(g′ ∘ ϕ′). To simplify the proof, we assume the downstream task is a binary classification task, though this approach can be extended to multi-class scenarios.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>We analyze the upper bound of min f &#8712;F R T (f &#8226; &#981;) -min f &#8242; &#8712;F R T (f &#8242; &#8226; &#981; &#8242; )</head><p>where (T, y) &#8764; T and &#954;(f (&#981;(T ))) is shorthand for &#954;(f (&#981;(T )), y) for notational convenience. Since we define f &#8712; F as a linear predictor for binary classification, we can rewrite this equation in the following form:</p><p>Note that in the previous formula, we define &#952; following <ref type="bibr">(Deng et al., 2024)</ref>. We can select a sufficiently large B &#952; to ensure an adequately large function space. Additionally, we define &#952; &#8242; as the optimal head for the encoder &#981; &#8242; . The expression &#8730; &#952; &#8242;&#8868; &#923;&#952; &#8242; is equivalent to tr(&#923;&#952; &#8242;&#8868; &#952; &#8242; ), which can be further simplified using Proposition D.2.</p><p>where &#963; i is the i-th eigenvalue for the matrix and &#963; max denotes the maximum eigenvalue.</p><p>Then, we demonstrate the lower bound of min g&#8712;G L P (g &#8226; &#981;) -min g &#8242; &#8712;G L P (g &#8242; &#8226; &#981; &#8242; ),</p><p>where ( T , T ) &#8764; P. Here we also consider the predictor g as a linear function, so that we have the following form:</p><p>where C P = E P &#8741;&#981; &#8242; (T ) -&#981;(T )&#8741; 2 is a constant, and W &#8242; is defined as the optimal transformation matrix for the encoder &#981; &#8242; . Based on this formula, we can further simplify the bound on</p><p>Now that we have the upper bound for min f &#8712;F R T (f &#8226; &#981;) -min f &#8242; &#8712;F R T (f &#8242; &#8226; &#981; &#8242; ) and the lower bound for min g&#8712;G L P (g &#8226; &#981;) -min g &#8242; &#8712;G L P (g &#8242; &#8226; &#981; &#8242; ), we can establish the relationship between them as follows:</p><p>Since, by our definition, &#952; &#8242; and W &#8242; are the optimal heads for the encoder &#981; &#8242; , the complexity term</p><p>is a constant on the order of O(1).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.3. Proof of Theorem 3.5</head><p>Proof. We begin by introducing some essential notations. Let P represent the pretraining task-tree distribution and T the downstream task-tree distribution. The pair (T, y) &#8764; T denotes a task-tree and its corresponding label, where we define the labeling function as &#968;, meaning y = &#968;(T ). The GNN encoder is &#981; &#8712; &#934;, the pretraining predictor is g &#8712; G, and the downstream predictor head is f &#8712; F . As in the previous proof, we consider a binary classification task for simplicity, though this can be extended to multi-class settings. The downstream risk is given by R T (f &#8226; &#981;),</p><p>where &#954; is a loss function.</p><p>Then, we define the excess risk on the downstream distribution T as</p><p>where &#981; and f represent the encoder obtained during pretraining and the prediction head learned in the downstream task, respectively, while &#981; &#8242; and f &#8242; are the optimal encoder and predictor head on the downstream distribution. &#981; * is the optimal encoder obtained during pretraining, defined as &#981; * = arg min &#981;&#8712;&#934; min g&#8712;G L P (g &#8226; &#981;). We will analyze these four terms separately.</p><p>To bound term (a), we need to introduce the empirical Rademacher complexity (Definition 1, <ref type="bibr">(Deng et al., 2024)</ref>) as</p><p>where the &#949; i are i.i.d. with P(&#949; i = 1) = P(&#949; i = -1) = 1/2.</p><p>Using this definition, we can bound term (a):</p><p>where f * is the optimal predictor head over the distribution T , defined as f * = arg min f &#8712;F R T (f &#8226; &#981;). The term (a.3) represents the empirical risk gap between the learned head f and the best head f * , which implies that the term is a constant greater than or equal to 0. Terms (a.1) and (a.2) describe the gap between the risk and the empirical risk. 
According to uniform convergence, these two terms can be expressed in terms of empirical Rademacher complexity. Thus, term (a) can be bounded as:</p><p>where B &#954; is the upper bound of the Lipschitz constant of the loss function &#954;. We then further simplify the empirical Rademacher complexity into a more interpretable expression.</p><p>As the &#949; i are i.i.d. with zero mean by definition, the corresponding term cancels, thus</p><p>Then, we bound term (b). To do this, we introduce a notation</p><p>where B &#954; represents the upper bound of the Lipschitz constant of &#954;, and X &#981; denotes the distribution of task-tree embeddings produced by the encoder &#981;. This term measures the distributional distance of task-trees between the pretraining and downstream distributions.</p><p>Next, we bound term (c):</p><p>The term L P (g &#8226; &#981;) -min g &#8242; &#8712;G,&#981; &#8242; &#8712;&#934; L P (g &#8242; &#8226; &#981; &#8242; ) describes the excess risk on the pretraining task, which we denote by E P (g, &#981;).</p><p>Lastly, we bound term (d):</p><p>By combining the four terms, we obtain the generalization bound for a model pretrained on the task-tree distribution P and fine-tuned on the task-tree distribution T :</p><p>We can set C 1 = C &#954; C f and C 2 = B &#954; as two downstream task-related constants.</p></div>
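The empirical Rademacher complexity used to bound term (a) can be estimated by Monte Carlo for a simple hypothesis class. The sketch below is illustrative only: the paper's function classes are a GNN encoder plus prediction head, whereas this toy uses a norm-bounded linear class, for which the supremum over the class has a closed form.

```python
import numpy as np

rng = np.random.default_rng(2)

def empirical_rademacher_linear(X, B=1.0, n_draws=2000):
    """Monte Carlo estimate of the empirical Rademacher complexity of the
    class {x -> <w, x> : ||w|| <= B} on a sample X of shape (n, d).
    For this class the inner supremum has a closed form:
        sup_w (1/n) sum_i eps_i <w, x_i> = (B/n) * ||sum_i eps_i x_i||.
    """
    n = X.shape[0]
    total = 0.0
    for _ in range(n_draws):
        eps = rng.choice([-1.0, 1.0], size=n)   # i.i.d. Rademacher signs
        total += B / n * np.linalg.norm(eps @ X)
    return total / n_draws

X = rng.normal(size=(50, 16))
r_small = empirical_rademacher_linear(X)
X_big = rng.normal(size=(800, 16))
r_big = empirical_rademacher_linear(X_big)
assert 0 < r_big < r_small  # complexity shrinks roughly as O(1/sqrt(n))
```

The shrinking estimate with larger samples is exactly why terms (a.1) and (a.2), the risk versus empirical-risk gaps, vanish as the downstream sample grows.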
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Experimental Settings</head><p>E.1. Datasets</p><p>Dataset Statistics. We utilize 32 datasets spanning five domains in this paper. Since these datasets are text-attributed graphs, we use Sentence-BERT <ref type="bibr">(Reimers &amp; Gurevych, 2019)</ref> to align the node textual features into 768-dimensional vectors. The dataset statistics are presented in Table <ref type="table">8</ref>. For the temporal graphs, we split each graph into 10 snapshots, with the statistics shown in Figure <ref type="figure">8</ref>. We classify Children and Ratings as heterophily graphs due to their relatively low homophily ratios <ref type="bibr">(Chen et al., 2024b)</ref>.</p><p>Splitter. For each dataset, we use the same splitting strategy as provided in the original papers <ref type="bibr">(Chen et al., 2024b;</ref><ref type="bibr">Galkin et al., 2024;</ref><ref type="bibr">Feng et al., 2024;</ref><ref type="bibr">Zhang et al., 2024b)</ref>. If multiple splits are provided, we evaluate model performance on each split using different random seeds. For datasets with a single split, we repeat the experiments five times with different random seeds. For GDELT and ICEWS1819, which are originally temporal knowledge graphs, we apply an 80%/10%/10% split based on timestamps for the train/validation/test sets. For the temporal graphs Enron and Googlemap CT used for edge classification, we split each snapshot by timestamps, using the first 70% for training, the next 15% for validation, and the remaining 15% for testing.</p></div>
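The timestamp-based 70%/15%/15% split described above can be sketched as follows; `temporal_split` is an illustrative helper, not the authors' released code.

```python
import numpy as np

def temporal_split(timestamps, train=0.70, val=0.15):
    """Split edge indices chronologically: earliest 70% for training, the
    next 15% for validation, and the remaining 15% for testing (as used
    for the Enron and Googlemap CT snapshots)."""
    order = np.argsort(timestamps)          # indices in chronological order
    n = len(order)
    n_train = int(n * train)
    n_val = int(n * val)
    return (order[:n_train],
            order[n_train:n_train + n_val],
            order[n_train + n_val:])

ts = np.array([5, 1, 9, 3, 7, 2, 8, 4, 6, 0])   # toy edge timestamps
tr, va, te = temporal_split(ts)
assert len(tr) == 7 and len(va) == 1 and len(te) == 2
assert ts[tr].max() < ts[va].min() < ts[te].min()   # no temporal leakage
```

Splitting by sorted timestamps rather than at random is what makes the evaluation a genuine test of robustness under temporal distribution shift.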
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E.2. Baselines</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>BASELINES APPLICABLE FOR ALL GRAPHS</head><p>GCN <ref type="bibr">(Kipf &amp; Welling, 2017)</ref>. A supervised message-passing GNN trained from scratch for each task. As a result, it cannot be applied to in-context learning or zero-shot learning. GAT <ref type="bibr">(Veli&#269;kovi&#263; et al., 2018)</ref>. A supervised GNN that uses an attention mechanism to learn the importance of received messages.</p><p>GIN <ref type="bibr">(Xu et al., 2019)</ref>. A supervised GNN specifically designed for graph-level tasks.</p><p>BGRL <ref type="bibr">(Thakoor et al., 2022)</ref>. A popular self-supervised learning framework for graphs that employs a contrastive learning loss without negative samples.</p><p>GraphMAE <ref type="bibr">(Hou et al., 2022)</ref>. A graph learning framework pretrained in a masked auto-encoder fashion.</p><p>OFA <ref type="bibr">(Liu et al., 2024a)</ref>. A cross-task and cross-domain graph foundation model that treats subgraphs as the basic learning instances. It introduces a graph prompt learning framework to enable in-context and zero-shot learning.</p><p>EXPERT MODELS DESIGNED FOR SPECIFIC DOMAINS</p><p>ULTRA <ref type="bibr">(Galkin et al., 2024)</ref>. A foundation model designed specifically for knowledge graphs, which we treat as the domain expert for KGs.</p><p>KVPLM <ref type="bibr">(Zeng et al., 2022)</ref>. A language model based on SMILES representations of molecules, serving as an expert model for molecular graphs.</p><p>MoMu <ref type="bibr">(Su et al., 2022)</ref>. Another expert model for molecules that leverages GNNs to improve molecular representations.</p><p>Galactica <ref type="bibr">(Taylor et al., 2022)</ref>. A foundation model for molecular graphs that utilizes multi-task learning with instructions.</p><p>GIMLET <ref type="bibr">(Zhao et al., 2023)</ref>. A foundation model for molecules that incorporates advanced models and instruction-based learning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E.3. Evaluation Protocol</head><p>Pretraining Datasets. We select six datasets for pretraining, including Arxiv, Products, WN18RR, FB15K237, Chemblpre, and PCBA, due to their diversity in domains and tasks. For self-supervised learning methods, these six datasets are used for pretraining unless otherwise specified.</p><p>SFT Datasets. For specialization via SFT in each domain, we use Arxiv for academic networks, Products for e-commerce networks, FB15K237 for knowledge graphs, and PCBA for molecular networks. For temporal graphs, which are e-commerce-based, we also use Products for SFT to evaluate robustness under temporal distribution shifts.</p><p>Backbone. We use a GraphSAGE-like encoder <ref type="bibr">(Hamilton et al., 2017)</ref>. Following the encoding of task-trees, we add an additional linear transformation as the projector. Note that we do not leverage edge features, in order to make the task harder, except for Enron and Googlemap CT, where node features are IDs and edges carry messages; edge information may significantly benefit some tasks, such as knowledge graph completion and molecular property prediction.</p></div>
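A minimal sketch of a GraphSAGE-style layer with mean aggregation followed by a linear projector, as described above. The dimensions, weight names, and the NumPy implementation are assumptions for illustration, not the paper's released architecture.

```python
import numpy as np

rng = np.random.default_rng(3)

def sage_layer(H, neighbors, W_self, W_neigh):
    """One GraphSAGE-style layer: combine each node's own features with
    the mean of its neighbors' features, then apply ReLU."""
    agg = np.stack([
        H[nbrs].mean(axis=0) if len(nbrs) else np.zeros(H.shape[1])
        for nbrs in neighbors
    ])
    return np.maximum(H @ W_self + agg @ W_neigh, 0.0)

d_in, d_hid = 8, 16
H = rng.normal(size=(4, d_in))                  # 4 nodes with d_in features
neighbors = [[1, 2], [0], [0, 3], [2]]          # adjacency lists
W1 = rng.normal(size=(d_in, d_hid)) * 0.1
W2 = rng.normal(size=(d_in, d_hid)) * 0.1
W_proj = rng.normal(size=(d_hid, d_hid)) * 0.1  # linear projector on top

Z = sage_layer(H, neighbors, W1, W2) @ W_proj
assert Z.shape == (4, d_hid)
```

Note that this sketch consumes node features only; consistent with the protocol above, no edge features enter the aggregation.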
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E.4. Evaluation Settings</head><p>Finetune. This is the basic setting, which directly fine-tunes the full parameters of the pretrained model by appending a linear classifier on top of the model encoder.</p><p>In-context Learning. This is a form of few-shot learning without fine-tuning the model parameters. We randomly select samples from each class and average them to form prototypes, which are used for classification. We follow existing GFM works <ref type="bibr">(Liu et al., 2024a;</ref><ref type="bibr">He &amp; Hooi, 2024)</ref> in conducting 500 randomly sampled 5-way 3-shot learning tasks. If the number of classes is less than 5, the number of ways is set to the number of classes.</p><p>Zero-shot Learning. Zero-shot learning is similar to in-context learning, except that we use LLM-encoded class description embeddings as the prototypes for prediction. As in in-context learning, we randomly sample 500 tasks for evaluation. Another zero-shot setting involves using an additional LLM for zero-shot inference <ref type="bibr">(Chen et al., 2024a)</ref>; we leave this for future work.</p></div>
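The prototype-based protocol above can be sketched as follows; `prototype_predict` is an illustrative helper rather than the released evaluation code. The same routine covers the zero-shot setting when the class prototypes are replaced by LLM-encoded class description embeddings instead of averaged support samples.

```python
import numpy as np

rng = np.random.default_rng(4)

def prototype_predict(support_emb, support_y, query_emb):
    """In-context classification without fine-tuning: average the support
    embeddings of each class to form prototypes, then assign each query
    to the nearest prototype."""
    classes = np.unique(support_y)
    protos = np.stack([support_emb[support_y == c].mean(axis=0)
                       for c in classes])
    dists = np.linalg.norm(query_emb[:, None, :] - protos[None, :, :], axis=-1)
    return classes[dists.argmin(axis=1)]

# Toy 2-way 3-shot episode: two well-separated clusters in R^8.
support = np.vstack([rng.normal(0.0, 0.1, size=(3, 8)),
                     rng.normal(5.0, 0.1, size=(3, 8))])
labels = np.array([0, 0, 0, 1, 1, 1])
queries = np.vstack([rng.normal(0.0, 0.1, size=(2, 8)),
                     rng.normal(5.0, 0.1, size=(2, 8))])
pred = prototype_predict(support, labels, queries)
assert pred.tolist() == [0, 0, 1, 1]
```

Since no parameters are updated, the quality of the prototypes rests entirely on how well the pretrained encoder separates the classes.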
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E.5. Hyper-Parameters</head><p>Baselines. For the baseline methods, we follow the hyperparameters reported in <ref type="bibr">(Liu et al., 2024a;</ref><ref type="bibr">Chen et al., 2024b)</ref>. If the hyperparameters are not provided, we set the number of epochs to 1,000, the batch size to 4,096, early stopping at 200, and the hidden dimension to 768, using a 2-layer GraphSAGE as the backbone with batch normalization and ReLU activation. For optimization, we use AdamW with a weight decay of 1e-6 and tune the learning rate over {1e-3, 1e-4, 1e-5}, reporting the best performance. For methods with attention mechanisms, we use 4 attention heads.</p><p>GIT. The model architecture and pretraining parameters of our GIT are presented in Table <ref type="table">9</ref>. The specific fine-tuning hyperparameters are presented in Tables <ref type="table">10</ref>, <ref type="table">11</ref>, 12, 13, and 14. For in-context learning and zero-shot learning results without fine-tuning, the general model does not involve any hyperparameters. For the specialized model, we tune the SFT epochs from 10 to 500, in steps of 10, the SFT learning rate over {1e-4, 1e-5, 1e-6, 1e-7, 1e-8}, and the normalization method from {None, BN}.</p></div>
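The specialized model's SFT search space described above (epochs 10 to 500 in steps of 10, five learning rates, two normalization options) can be enumerated directly; the training loop itself is omitted, and this grid is only a sketch of the stated ranges.

```python
from itertools import product

sft_epochs = range(10, 501, 10)            # 10 to 500, in steps of 10
sft_lrs = [1e-4, 1e-5, 1e-6, 1e-7, 1e-8]   # SFT learning rates
norms = [None, "BN"]                       # normalization options

grid = list(product(sft_epochs, sft_lrs, norms))
assert len(grid) == 50 * 5 * 2             # 500 candidate configurations
```

Each tuple in `grid` would be passed to an SFT run, with the best configuration selected on the validation split.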
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F. Comprehensive Model Results</head><p>F.1. Domain: Academia</p><p>Node Classification. We perform node classification on academic networks across three settings: basic fine-tuning, 3-shot in-context learning, and zero-shot learning. The comprehensive node classification results on academic networks, measured in terms of accuracy, are presented in Table <ref type="table">15</ref>. Notably, the specialized model (GIT-S) does not always outperform the general model (GIT-G). This may be because the manually selected SFT data does not adequately capture the underlying distribution of the domain. It would be valuable to explore dataset selection or instance selection methods to better optimize the choice of SFT data. * indicates the results from paper <ref type="bibr">(Zhao et al., 2023)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>H. Ablation on Specialization (SFT)</head><p>Ablation Study on SFT Data used for Specialization. We also evaluate the impact of the SFT dataset used for specialization. The model's zero-shot performance is reported in Table <ref type="table">25</ref>, comparing the default SFT dataset PCBA with another SFT dataset, HIV. We find that the model performance with HIV as the SFT data is lower than with PCBA. We hypothesize that this is due to HIV having fewer graphs and tasks, which may provide less information for reducing the distribution discrepancy. Nevertheless, HIV still improves the model's performance over the general model on 5 out of 8 datasets.</p><p>Ablation Study on SFT Data used for Specialization. We analyze the impact of SFT data in the experiments, as shown in Figure <ref type="figure">9</ref>. The results show that changes in SFT data do not significantly affect model performance, particularly in fine-tuning and in-context learning settings. Even when the SFT and downstream data are the same, the model does not necessarily outperform models fine-tuned on other SFT datasets. This observation supports the motivation behind our proposed specialization method, which aims to shift the pretraining distribution P toward the distribution of target domains. It also highlights the importance of designing an instance selection method to identify the most effective SFT data.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>University of Notre Dame</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>University of Connecticut. Correspondence to: Zehong Wang &lt;zwang43@nd.edu&gt;, Yanfang Ye &lt;yye7@nd.edu&gt;. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_2"><p>The scope of the reconstruction task is broad; we consider contrastive learning to be a form of reconstruction.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_3"><p>This example was first illustrated in <ref type="bibr">Cao et al. (2023)</ref>.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_4"><p>Our collected graphs include two heterophily graphs, Children and Ratings, with homophily ratios of 0.42 and 0.38, respectively, whereas other graphs generally have a homophily ratio greater than 0.60 (Table 12, <ref type="bibr">(Chen et al., 2024b)</ref>).</p></note>
		</body>
		</text>
</TEI>
