<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Iterative active learning strategies for subgraph matching</title></titleStmt>
			<publicationStmt>
				<publisher>Elsevier</publisher>
				<date>02/01/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10524389</idno>
					<idno type="doi">10.1016/j.patcog.2024.110797</idno>
					<title level='j'>Pattern Recognition</title>
<idno>0031-3203</idno>
<biblScope unit="volume">158</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Yurun Ge</author><author>Dominic Yang</author><author>Andrea L Bertozzi</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The subgraph matching problem arises in a variety of domains including pattern recognition for segmented images, meshes of 3D objects, biochemical reactions, and security applications. Large and complex solution spaces are common for this graph-based problem, especially when the world graph contains many more nodes and edges in comparison to the template graph. Researchers have focused on the task of finding one match or many matches, however a real use-case scenario can necessitate identifying specific matches from a combinatorially complex solution space. Our work directly addresses this challenge. We propose to introduce additional queries to the subgraph that iteratively reduce the size of the solution space, and consider the optimal strategy for doing so. We formalize this problem and demonstrate that it is NP-complete. We compare different quantitative criteria for choosing nodes to query. We introduce a new method based on a spanning tree that outperforms other graph-based criteria for the multichannel datasets. Finally, we present numerical experiments for single channel and multichannel subgraph matching problems created from both synthetic and real world datasets.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The subgraph matching problem poses a simple question: given a small graph (the template) and a large graph (the world), find all the subgraphs of the world isomorphic to the template. For multichannel graphs, an isomorphic subgraph should preserve the direction of edges as well as the labels of edges and nodes in all channels. Figure <ref type="figure">1</ref> depicts a small version of the problem for which there exists four distinct isomorphic subgraphs. Despite its simple statement, the problem has been proven to be NP-complete <ref type="bibr">[10]</ref> and has applications to a wide variety of domains (protein matching, plagiarism detection, adversarial activity detection, to name a few). For certain matching problems, a significant source of difficulty arises from the extremely large number of solutions, mostly related to symmetries in graphs. (In <ref type="bibr">[44]</ref>, one instance had 10 384 solutions!). If a user has in mind one particular match but finds that their template matches named as "active learning for subgraph matching," involves the iteration between the following procedures:</p><p>We obtain the candidate set for all template nodes using a filter based subgraph matching algorithm.</p><p>Then, an acquisition function determines the optimal nodes for subject matter experts to obtain additional information. After that, the additional information is fed back into the subgraph matching algorithm to get reduced candidate sets. A flowchart describing how this active learning system might be used in a real world setting in shown in Figure <ref type="figure">3</ref>. Our contributions to the development of this system are twofold. First, we delve into the higher level mathematical constructions, including the general formulation of the problem and the proof of inherent complexity to design query strategies. Second, we discuss the details to design various acquisition functions and compare their applicability to datasets of different scales.</p><p>The need for the active learning system for multichannel subgraph matching is also illustrated quite well by the work under the DARPA MAA (Modeling Adversarial Activity) program <ref type="bibr">[30]</ref>. The theme of this program involves adversarial activity detection by constructing template graphs that describe a series of actions and the world graph constructed from relevant data. The benchmark datasets developed in this program also demonstrate that need of input from a human expert is necessary. Our research in this paper proposes theories and methods to incorporate such expert input, which not only builds upon the initiatives from the DARPA project and has direct application to solving the difficult benchmark datasets, but also offers a wider application in addressing analogous large scale subgraph matching problems where expert intervention is crucial.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Related Work</head><p>Classical approaches to the subgraph matching problem generally fall into three categories: tree search, constraint propagation, and graph indexing. Tree search algorithms iteratively build a partial match by assigning template vertices to world vertices. If at any point, the current partial matching cannot be extended to a full matching, the algorithm backtracks by unassigning vertices and reassigning them to new world vertices. Ullmann's algorithm <ref type="bibr">[37]</ref> represents an early application of the tree search methodology in subgraph matching algorithms. This foundational approach was subsequently refined and expanded upon through the development of the VF2 algorithm <ref type="bibr">[7]</ref>, which established itself as a cornerstone in the field.</p><p>Constraint propagation approaches differ from tree search approaches by explicitly maintaining candidate sets for each template vertex which are iteratively refined using methods derived from constraint satisfaction solvers. LAD <ref type="bibr">[33]</ref> is one of the first solvers of this kind, and PathLAD <ref type="bibr">[18]</ref> and Glasgow <ref type="bibr">[22]</ref> are a few examples that extend this approach. Graph indexing approaches build indexes of graph features by which large graph databases can be quickly searched for subgraph matches. These methods generally take a filterverify scheme whereby a significant portion of templates which have no corresponding match in the world may be quickly filtered using the index. Recent work in this framework include TurboISO <ref type="bibr">[13]</ref>, BoostISO <ref type="bibr">[29]</ref>, CFL-Match <ref type="bibr">[4]</ref>, and CNI-Match <ref type="bibr">[26]</ref>. These works focus on utilizing different network structures to optimize the order of tree search. All of the above methods may lead to a final tree search to identify all matches.</p><p>Learning methods for graph matching include convolutional neural networks, graph neural networks, and affinity models. In a recent work <ref type="bibr">[19]</ref>, the authors developed efficient learning methods for obtaining a sufficiently similar solution for the induced subgraph matching problem. In <ref type="bibr">[19]</ref>, the sizes of graphs in their datasets are 20 times the sizes in prior learning-based subgraph matching methods. However, compared with the datasets considered in this paper, they are still relatively small.</p><p>There are few papers in the literature that address problems related to active learning for similar problems to subgraph matching. For example, researchers introduce active learning into the network alignment problem. In <ref type="bibr">[31]</ref>, researchers compare three probability matrix based query strategies. In <ref type="bibr">[6]</ref>, the cost function of network alignment is updated after interaction with human. In <ref type="bibr">[21]</ref>, the authors examine the case where experts only provide partial information about the mapping of certain nodes instead of the exact answer. In <ref type="bibr">[28]</ref>, the authors study vertex nomination in the inexact subgraph matching problem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.">Paper Outline</head><p>The outline of this paper is as follows: In section 2, we introduce the key concepts and terminologies of the subgraph matching problem and the formal definitions of the active learning framework. In section 3, we study the complexity of the active learning problem and prove its NP-completeness. In section 4 and 5, we formulate different active learning strategies to query template nodes as well as an original algorithm to quickly estimate the frequency of solutions associated with each node. In section 6, we present the results of numerical experiments conducted on benchmark datasets generated from both single channel and multichannel networks, and compare the performance of the different active learning strategies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Definitions and Problem Statement</head><p>In the first subsection, we introduce the definition of a multichannel graph and the subgraph matching problem for multichannel graphs. The second subsection provides the intuition and algorithm for our active learning framework. The third subsection formalizes the theoretical problems associated to this framework.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Subgraph Matching Problem</head><p>The following definition is from <ref type="bibr">[25]</ref>:</p><p>) is a set of nodes (or vertices), directed edges between the nodes, labels on the nodes, and channels on the edges. The number of nodes is denoted n. Each node v &#8712; V has a label L(v) belonging to some arbitrary set of labels. There can be any number of edges between each pair of nodes (u, v) in either direction. Each edge belongs to one of the channels C. Edges between the same pair of nodes in the same channel with the same direction are indistinguishable. The function E : V &#215; V &#8594; N |C| describes the number of edges in each channel between each pair of nodes. In particular, E(u, v) can be represented as a |C|-dimensional vector the k th element of which is the number of edges from node u to node v in the k th channel.</p><p>We will refer to graphs with only one channel as single-channel graphs. Now we introduce some standard graph terminology which we extend to the multichannel setting. Two vertices are adjacent or neighboring if there is an edge between them in either direction in any channel.</p><p>An edge is incident to a vertex if the vertex is one of the edge's endpoints. We define the degree of a vertex v, denoted deg(v), in a multichannel graph as the number of edges incident to a vertex. The neighborhood of a vertex v, denoted N (v), is the set of all adjacent vertices. A subgraph of a multichannel</p><p>The subgraph matching problem (SMP) can be succinctly stated: given two multichannel networks, a template G t = (V t , E t , L t , C) and a world G w = (V w , E w , L w , C), we wish to find all subgraphs of the world that match the template. To formalize what we mean by match, we introduce the subgraph isomorphism (SI) as defined in <ref type="bibr">[25]</ref>.</p><p>The set of all SIs from G t to G w is denoted</p><p>This definition allows for isomorphisms in which the world graph has more edges than the template. If the function additionally satisfies (2) at equality, it is an induced subgraph isomorphism. A function which satisfies (1) and (2) but is not necessarily injective is an edge preserving map (EPM).</p><p>With these definitions, we can establish the general subgraph matching problem:</p><p>Problem 1 (Subgraph Matching Problem). Given a template G t and world G w , produce the full set of subgraph isomorphisms F(G t , G w ).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Active Learning Framework</head><p>In this subsection, we will present the general algorithmic framework we will be using to test various active learning strategies. For each experiment, we will assume we have a template graph G t , world graph G w , and there exists a valid subgraph matching f which we are trying to determine. f (t) will represent the ground truth world vertex which is associated with a given template vertex t. We do not have immediate access to f (t) for any template vertex t, but after a potentially expensive template query, we can produce f (t).</p><p>As a toy example of how the active learning problem may proceed, we consider the example template and world presented in Figure <ref type="figure">4</ref>. Initially, there are four possible subgraph matchings in the world graph and the ground truth matching maps nodes 1 and 2 to A and B, respectively. At this stage, we query template vertex 1 and determine its true assignment, A. From this knowledge, we can rule out any solution to uniquely identify a solution.</p><p>Algorithm 1 Active Learning Template Query Loop Input: Template G t , World G w , Matching f C &#8592; InitializeCands(G t , G w ) &#9655; Initially all world vertices are candidates C &#8592; Filter(G t , G w , C) &#9655; Initial pass reducing candidate sets count &#8592; 0 while Not all t in V t have 1 candidate do t &#8592; QueryStrategy(G t , G w , C) &#9655; Determine which vertex to query Query t to determine f (t) &#9655; Add information about template vertex t Match(t, f (t)) C &#8592; Filter(G t , G w , C) &#9655; Using new information, reduce candidate sets count &#8592; count + 1 Return count We now formally state the problem addressed in this paper: Problem 2 (Optimal Template Query Problem). Given G t , G w , candidate sets C(t) for t &#8712; V t , and a fixed choice of filter, determine a query strategy which will require the fewest template queries as given by Algorithm 1?</p><p>The definition of a query strategy we leave intentionally vague so as to allow any manner of criterion for deciding template vertices. The only stipulation that we impose on the query strategy is that it only makes choices based on information derived from the properties of the graphs, candidate information, and the results of template queries. Our metric for comparing query strategies is the number of queries required to fully determine a subgraph matching. We envision this as the most expensive operation as it potentially requires expert involvement to perform the query. In this paper, each template node requires the same amount of work to query, but future work may study an extension of this problem which varies the work required on a per-node basis. We also note that this problem depends on the choice of filter. Stronger filters (filters which eliminate more candidates) will give us more information and it is possible that strategies which make better use of this information may perform better in that context. We will sidestep this problem instead opting to fix the choice of filter to that of the LAD filter introduced in <ref type="bibr">[33]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Associated Theoretical Problems</head><p>Within this framework, we can consider two associated theoretical problems given a template G t and a world G w . The first problem involves a candidate solution f &#8712; F(G t , G w ), and we wish to determine the minimal number of template vertices we need to query to verify that f is the true solution. This is formalized in the following definitions:</p><p>Obviously the entire set V t is a solution verifying set, but smaller solution verifying sets are possible.</p><p>In the example in Figure <ref type="figure">4</ref>, we observe that for any of the solutions, we need only to query one template vertex to verify a given solution. For solutions 1 and 2, we query vertex 1 and for solutions 3 and 4, we query vertex 2. Hence for any subgraph isomorphism f , the minimal solution verifying set is of size 1.</p><p>For the second problem, we do not have a candidate solution. Rather, we wish to know the minimal size of a subset of V t for which if we knew the images of these vertices, we could uniquely identify the images of the remaining vertices. We introduce the following definition and problem to this end:</p><p>Problem 4 (Minimal Determining Set Problem). Given G t and G w , produce a determining set of minimal cardinality.</p><p>For the problem in Figure <ref type="figure">4</ref>, we need to query both template vertices 1 and 2 to determine the subgraph isomorphism in all cases. If we query template vertex 1 and 1 maps to world vertex C, there are still two SIs which are possible. Similarly, if we query template vertex 2 and find that 2 maps to world vertex D, there are also two possible SIs. Hence, the minimal determining set is {1, 2} and is of size 2. Note that this is a counterexample to the statement that the size of the minimal determining set is the maximal size of a minimal verifying set over all SIs f &#8712; F(G t , G w ).</p><p>The decision variants of both of these problems are NP-complete. We will prove that the solution verification set problem is NP-complete in Section 3 by reduction from the minimum set cover problem. As for the minimal determining set problem, we note that in the case that G t = G w , finding a determining set of a certain size is equivalent to finding a base for the automorphism group of G t , which is already known to be NP-complete <ref type="bibr">[5]</ref>. In practice, the optimal template query problem is intermediate to both of these theoretical problems as we will generally not have a solution f at our disposal (and almost certainly not the whole set of SIs F(G t , G w )). However, we may have some prior information and in the process of querying template vertices, we will be gathering additional information which will aid us in determining the ground truth SI.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">NP-Completeness of the Minimal Solution Verification Set Problem</head><p>In this section, we will present results on the complexity of the optimal template query problem. We first introduce the set cover problem which is well known for being NP-complete. Then we prove that the minimum set cover problem is reducible to the minimal solution verifying set problem proving NPcompleteness of this problem. As the optimal template query problem is an extension of the minimal solution verifying set problem, it is NP-complete as well..</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Reduction of Minimum Set Cover to Minimum Solution Verification Set</head><p>We define the minimum set cover problem as follows:</p><p>Problem 5 (Minimum Set Cover Problem). Suppose S is a set S = {1, 2, . . . , m}, and we have k subsets S i &#8838; S, 1 &#8804; i &#8804; k. The minimum set cover problem is to find an index set I &#8838; {1, 2 . . . , k} with minimum cardinality, such that S &#8838; &#8746; i&#8712;I S i .</p><p>We assume the problem is non-trivial and has at least one set cover that consists of n subsets, i.e.,</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#8746; n</head><p>i=1 S i = S. We denote the members of each subset S i = {a</p><p>ri }. The associated decision problem (i.e., determining if there is a set cover with fewer than N elements) is NP-complete <ref type="bibr">[16]</ref> and therefore believed to be unsolvable in polynomial time with k. We will prove the solution verification set problem is NP-complete via reduction from the minimum set cover problem.</p><p>Given an instance of the minimum set cover problem (S, {S 1 , . . . , S n }), we will produce an equivalent instance of the solution verification set problem. Let V t = {v 1 , v 2 , . . . , v k } be the vertices of complete single channel simple graph with k nodes. That is to say</p><p>We impose that each node has a unique label:</p><p>ar i } and ground truth vertex v i gt . The labels of these vertices are equal to i. That is to say</p><p>The vertices of world graph are defined to be the union of all the vertex sets and ground truth vertices:</p><p>We also define a mapping f on the world vertices:</p><p>where P(A) is the power set of A. The edges in the world graph are defined as followed:</p><p>The ground truth is set to be: SI(v i ) = v i gt . Lemma 1. Using the above definitions, g is a subgraph isomorphism from the template graph G t to the world graph G w if and only if:</p><p>So g is edge-preserving. Finally, L(g(v i )) = i, so g preserves labels. So g is a subgraph isomorphism ( &#8656;= ) Let g be a subgraph isomorphism. Then g preserves labels. So L(g</p><p>for some a</p><p>q } has only one element. Since g is edge-preserving, if</p><p>Lemma 2. Suppose S has a set cover S &#8712; &#8746; i&#8712;K S i , where K &#8834; {1, 2, . . . , k}, then &#8746; i&#8712;K {v i } is a valid solution verifying set. The converse is also true.</p><p>Proof. For any subgraph isomorphism g, suppose g has the same image as the ground truth SI for the queried nodes, i.e. &#8704;i &#8712; K, g</p><p>On the other hand, suppose&#8746; i&#8712;K is not the set cover of S, then &#8707;t &#8712; S, t / &#8712; &#8746; i&#8712;K S i . For each i, suppose t &#8712; f (v i ax ), then let g(v i ) = v i ax ,by the previous theorem, g is a subgraph isomorphism and g is different from SI.</p><p>From the above arguments, we have our main result for this section: Theorem 1. The minimum set cover problem is reducible to solving the minimum solution verification set problem.</p><p>This demonstrates that the decision variant of the minimum solution verification problem (finding a verification set with fewer than N queries) is NP-hard. If we note that any such set is a certificate for the problem, it follows the problem is in NP, and we have NP-completeness of the solution verification problem.</p><p>Theorem 2. The decision variant of the solution verification problem is NP-complete.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Solving the Minimal Solution Verification Set Problem</head><p>In spite of the fact that determining if there is a solution verification set of a given size is NP-complete, it may still be possible to solve this problem if the solution space F := F(G t , G w ) can be computed and is of a manageable size.</p><p>Given F and a solution f &#8712; F, we can reduce finding a minimal verification set to the minimum set cover problem in the following manner. For each subset of template vertices S &#8834; V t , we define F S := {g &#8712; F : &#8707;t &#8712; S, g(t) &#824; = f (t)}, the set of SIs that we can rule out if we know f (t) for all t &#8712; S. A solution verification set for f is any set S &#8834; V t for which F S = F \ {f }, i.e., with the set, we can rule out all solutions but the ground truth mapping f . We then define |V t | sets, S 1 = F t1 , . . . , S |Vt| = F t |V t | , the solution sets ruled out if we queried vertices t 1 , . . . , t |Vt| , respectively. Our goal is then to determine an index set I &#8834; {1, 2, . . . , |V t |} of minimal cardinality such that i&#8712;I S i = F \ {f }. This is precisely the minimum set cover problem as defined in Definition 5.</p><p>We can solve the minimum set cover using a binary program where we introduce binary variables z 1 , . . . , z k &#8712; {0, 1} where z i represents whether or not we are using subset S i in our set cover. We define a matrix A &#8712; {0, 1} k&#215;m with entries a ij for i = 1, . . . , k and j = 1, . . . , m where a ij = 1 if element j is in subset S i and 0 otherwise. We can then write the optimization problem we wish to solve with the following formulation:</p><p>where the objective that we are minimizing in ( <ref type="formula">4</ref>) is exactly the count of how many subsets we use. The constraints in <ref type="bibr">(5)</ref> verify that each element in S is in at least one chosen subset. In practice, the matrix A often has a significant number of duplicate columns which we can safely drop without changing the solution set of the problem. We can then solve this integer program using any standard integer program solver; for this paper, we use the state-of-the-art solver Gurobi <ref type="bibr">[12]</ref>. Once we have solved this problem, the size of the given solution verifying set can then be used as a lower bound on the number of queries under any given template query strategy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Methods</head><p>In this section, we introduce query strategies listed in Table <ref type="table">2</ref>. We divide these methods into roughly two categories. First in Section 4.1, we list methods from our initial study <ref type="bibr">[11]</ref> which considered more central vertices in the template. Section 4.2 presents methods based on estimating the probability that any given template vertex is mapped to a given world vertex. Finally, in Section 4.3, we utilize symmetry in the problem to better inform our template query selection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Local template centrality based strategies</head><p>These strategies prioritize the complexity of graph structures within local regions that encompass the neighbors of template nodes, and thus they exhibit a preference for nodes with higher degrees. In <ref type="bibr">[11]</ref>, these strategies are shown to be effective at reducing the solution space with a limited number of queries.</p><p>&#8226; Maximum degree: choose the template node with the highest degree.</p><p>t = arg max t&#8712;Vt deg(t). (7) Query Strategy Description Most Candidates (MC) Node with the most candidates Min Degree (MD) Node with the lowest degree Max Neighbor Candidate Sum (MNCS) Node with most candidates in neighborhood by (8) Edge Entropy (EE) Node with the highest edge entropy as defined in (9) Max Entropy (ME) Node given by (19) with P SI defined as in (11) Max Local Entropy (MLE) Node given by (19) with P LSI defined as in (13) Max Local Entropy Approx. (MLE&#8764;) Node given by (19) with P LSH defined as in (15) Spanning Tree Entropy (STE) Node given by (19) with P ST EP M as defined in (16) Random (R) Random template node Optimal (O)</p><p>The optimal node to query as given by solving (4) </p><p>&#8226; Edge entropy: Let t &#8712; V t and t &#8242; &#8712; N (t). Then let N tw t &#8242; be the number of candidates for t</p><p>Then we can interpret P EE t &#8242; (t = w) := N tw t &#8242; /N t t &#8242; as an estimate for the probability that t is mapped to w based on the possible mappings of the edge with endpoints t &#8242; and t. The edge entropy formula is then given as follows:</p><p>Figure <ref type="figure">5</ref> shows a computation of the edge entropy.</p><p>In addition, we also introduce a simplified version of 8 which simply selects the template vertex with the most candidates. As we will see in Section 6, this method actually performs quite well in practice.</p><p>&#8226; Max candidates: Choose the node with the most candidate world vertices.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Probabilistic Query Strategies</head><p>There are a variety of ways of establishing a probability space on the set of subgraph isomorphisms but the simplest involves granting all subgraph isomorphisms for a given template graph and world graph equal probability. The probability that a template vertex t &#8712; V t and world vertex w &#8712; V w are paired together is simply the proportion of subgraph isomorphisms where they are matched:</p><p>In the case that this problem is still too costly to solve, we can compute a crude approximation to the number of LSIs by removing the injectivity requirement in which case we are counting the number of local edge preserving mappings (LEPM). This quantity is very easy to count as it is just given by the product of the sizes of each of the candidate sets for all the neighbors of t:</p><p>Then our probability is defined in the same manner as before:</p><p>We can also obtain a rough approximation of the number of SIs by instead considering EPMs for a spanning tree of the template. As we will discuss in Section 5, it is tractable to compute this and we present an algorithm for doing so. We denote the number of EPMs between a spanning tree T of template G t world G w where t is mapped to w by ST EP M (G t , G w , t, w, T ). We then have an associated probability computed in the same manner:</p><p>Once we have computed the associated probabilities, we can devise a variety of strategies for selecting query vertices. There are well-established strategies in active learning (surveyed here <ref type="bibr">[32]</ref>) and three popular choices for determining queries are minimum confidence sampling, margin sampling, and maximum entropy sampling. We define these as follows:</p><p>&#8226; Minimum Confidence Sampling: Select the template vertex which is least confident about its most likely assignment.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>t = arg min</head><p>t&#8712;V T max w&#8712;C(t)</p><p>&#8226; Margin Sampling: Select the template vertex whose two most probable assignments are closest together in probability. If w * = arg max w&#8712;C(t) P (t = w), then this is given by t = arg min</p><p>&#8226; Maximum Entropy Sampling: Select the template vertex which maximizes entropy. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>t = arg max</head><p>Results from preliminary experiments on the biochemical_reactions dataset (to be introduced in full detail in Section 6.1) are depicted in Figure <ref type="figure">6</ref>. In these experiments, we compare methods by the average amount of queries needed to uniquely determine a solution for a random selection of subgraph matching of these nodes. All of these isomorphisms are valid based on the information apparent in the problem. To discern which permutation is the true solution necessarily requires querying at least M -1 of these nodes (the last may be determined by process of elimination). This discussion naturally leads to the following proposition:</p><p>Proposition 1. Suppose we have a satisfiable subgraph isomorphism problem with template G t = (V t , E t ) and world G w = (V w , E w ). Let V t = N i=1 S i partition the template vertices into structural equivalence classes. Then at least N i=1 (|S i | -1) template queries are needed to identify a unique solution.</p><p>Discerning between isomorphisms which are generated by applying an automorphism to the template graph is more challenging. Relevant to this task is the notion of a base of the automorphism group of a template graph. A base of an automorphism group is a sequence of vertices (v 1 , v 2 , . . . , v n ) such that for any pair of automorphisms &#981; 1 &#824; = &#981; 2 , the two sequences</p><p>are distinct. This definition implies that the values of &#981; on v 1 , . . . , v n uniquely identify &#981;. Hence, to distinguish isomorphisms generated by the automorphism group would involve querying all vertices of a base of the automorphism group. The problem of generating a base for the automorphism group is difficult but the library nauty <ref type="bibr">[23]</ref> in the process of finding generators for the automorphism group of a general graph will also find a base. Querying nodes which constitute a base for the automorphism group then may be a useful strategy for active learning which exploits symmetry.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Spanning-Tree-Based Method for Template Vertex Queries</head><p>In this section, we develop a new method that is computationally tractable on larger multichannel networks, especially those with long paths, such as transportation networks. The idea is motivated by the estimation of the number of subgraph isomorphisms in the CFL-Match algorithm <ref type="bibr">[4]</ref>. The authors of this algorithm proposed a new data structure called the compact path index (CPI). Using our notation, the structure of CPI is defined by the rule: There is an edge between v &#8712; C(u) and v 0 &#8712; C(u 0 ) for adjacent nodes u and u 0 in CPI if and only if E(v, v 0 ) = 1 in G. We use this data structure to estimate the frequency of solutions when the template is a path graph, and extended the method to the case when the template is a tree. In such a case, the estimation is the number of edge-preserving mappings generated by the current candidate list. For a more general template, we use a breadth first search (BFS) starting from the selected node to generate a spanning tree of the template, and do the estimation using the spanning tree.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Solution Frequency Estimation Algorithm</head><p>Solving the subgraph isomorphism problem is hard, even when the template structure is simple. This is in part because we need to solve an all-different problem due to the injectivity requirement on the nodes, and the counting of all assignments under all-different constraint is P# complete. Research conducted by <ref type="bibr">[42]</ref> and <ref type="bibr">[27]</ref> discussed methods for estimating the cardinality of the solution space. However, these methods cannot be easily adapted to estimate the number of mappings that each world node participates in within the solution space, primarily due to the large size of the world graph. A good estimation of this number is needed to inform the active learning problem. We use the number of possible edge-preserving mappings to the current candidate set to approximate the number of subgraph isomorphisms. By removing the injectivity constraint, we no longer need to solve the all-different problem, and complete the estimation in polynomial time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.1.">Estimation for Path Graphs</head><p>We start from the estimation on the root of a template which is a path graph. First, we give the definition of a path graph for multichannel networks: Definition 5 (Path graph). T is called a path graph if there exists an indexing of template nodes V t = {x 1 , x 2 , . . . , x k }, such that the edge of the template satisfies the following conditions:</p><p>Here, x k is defined as the root of this path graph.</p><p>From the definition, if x k is the root of the path graph template, then the template nodes can be indexed by x 1 , x 2 , . . . , x k such that edges only exist in nodes with consecutive indices. Problem 6. The solution frequency estimation problem is defined as followed: Suppose x k is a given node of a template G t . Suppose C(x i )[j] represents the jth element in the candidate set C(x i ). The solution frequency estimation problem is to approximate the cardinality of the following sets of subgraph isomorphism solutions</p><p>If the template is a path graph with a root x k , the solution frequency estimation problem with given node x k is called the path graph root estimation problem. Definition 6. In the setting of the path graph root estimation problem, we suppose the template is a path graph. the CPI adjacency matrix M i,i+1 is a 0 -</p><p>In the setting of the path graph root estimation problem, the CPI graph is defined as follows: Definition 7. The CPI graph is a single-channel graph with k i=1 |C(x i )| nodes. The nodes are indexed by the following set:</p><p>This means that each node corresponds to a unique template-candidate pair. The edge set is defined to be:</p><p>This means the edges of the CPI graph contains all the template-candidate pairs such that the template node and candidate are simultaneously neighbouring to each other.</p><p>The following theorem gives a formula to calculate the exact number of edge preserving mappings for path graph templates.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Theorem 3. In the setting of the path graph template root estimation problem, the cardinality of edgepreserving mappings</head><p>Proof. We first show that the mapping m is a bijection between paths of length k in the CPI graph and edge-preserving mappings (EPMs). Define the mapping m as follows:</p><p>By the definition 7 of the CPI graph, {(i, g(x i )) | i = 1, 2, . . . , k} forms a path in the CPI graph if and only if g is an edge-preserving mapping. This implies that m is injective. Conversely, for any node list {(i, g(x i )) | i = 1, 2, . . . , k} that forms a path in the CPI graph, by the definition of the CPI graph, this node list satisfies the edge-preserving condition, making m -1 an edge-preserving mapping. Thus, m is also surjective. Since m is both injective and surjective, it is a bijection. Then we calculate the number of paths of length k in the CPI graph.</p><p>Let the adjacency matrix of the CPI graph be denoted by ADJ. This matrix has dimensions</p><p>From the definition of the edge set in the CPI graph, we have:</p><p>, and all other blocks are zero matrices. By performing block matrix multiplication, we obtain:</p><p>For any s &#8712; {1, 2, . . . , |C(x 1 )|} and j &#8712; {1, 2, . . . , |C(x k )|}, the number of paths of length k from (x 1 , C(x 1 )[s]) to (x k , C(x k )[j]) is given by:</p><p>Summing over all possible starting points s, the total number of edge-preserving mappings is:</p><p>This completes the proof.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.2.">Estimation for Trees</head><p>A tree in graph theory is defined for single channel graphs as a connected acyclic undirected graph. A tree is also a notion in data structure usually visualized by a tree in the graph theory sense, and stores the connection and hierarchy of the nodes. These two definitions are linked together by assigning a root to the tree. The following definitions extend the definition of a tree to multichannel graphs and define the tree data structure similarly by assigning a root node. Definition 8 (Underlying graph). Given a directed or undirected multichannel graph G, the underlying graph G u is a single channel, undirected simple graph which is defined as followed:</p><p>Definition 9. A multichannel graph G has a tree structure if and only if the underlying graph G u is a tree.</p><p>The definition of rooted tree is generalized from <ref type="bibr">[3]</ref> is equivalent to the tree data structure.</p><p>Definition 10 (Rooted tree). Suppose G is a tree or G is a multichannel graph with tree structure, and u &#8712; V G , then the pair (G, u) is called a rooted tree with root u.</p><p>Definition 11 (Parent, Ancestor, Child, Descendant, and Sibling). For a rooted tree(T, r), suppose w is any node other than r. If G is a tree, then there is a unique path P in G that connects w and r. If G has tree structure, then there is a unique path P in G u that connects w and r. Suppose P = (v 0 , v 1 , . . . , v k ), where v 0 = r, v k = w. Then v k-1 is called the parent of w, w is called the child of v k-1 . v 0 , v 1 , . . . , v k-1 are called the ancestors of w, and w is the descendant of v 0 , v 1 , . . . , v k-1 . Vertices with same parent are called siblings to each other. Definition 12 (Subtree for rooted tree). Suppose (T, r) is a rooted tree, and w &#8712; V T . The subgraph T w generated by w and all its descendants also have a tree structure. (T w , w) is called the subtree of (T, r) with root w.</p><p>We call the formula given in Theorem 5.1.1 pathEP M , and use the following divide and conquer algorithm which we call T reeEP M to calculate the number of EPMs. To avoid repeat calculations, we pre-calculate all the CPI matrices and store them in a hash map M . We denote by M (x, y) the CPI matrix for template nodes x and y.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 2 TreeEPM</head><p>Input: Template T with tree structure, root x, CPI matrix dictionary M y 1 . . . , y m &#8592; Children(x) (T 1 , y1), . . . , (T m , y m ) &#8592; Subtrees if T is a path graph then:</p><p>In Algorithm 2, &#8857; denotes the element-wise product of two vectors. This algorithm returns the exact number of EPMs regarding each candidate node of the selected root x.</p><p>Theorem 4. The cardinality of edge-preserving mappings can be calculated by the following formula:</p><p>Proof. We proceed by induction on the number of nodes in the tree T .</p><p>Base Case: When |V T | = 1, T is a path graph. TreeEPM degenerates to pathEPM, which returns the expected result.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Inductive</head><p>Step: Assume that for |V T | &lt; k, the proposition is true. Now consider a tree T with |V T | = k. Suppose x is the vertex in the input of TreeEPM that we select as root, for the rooted tree (T, x) as we defined in definition 10, suppose x has children y 1 , y 2 , . . . , y m . The subtrees rooted at y 1 , y 2 , . . . , y m are (T 1 , y 1 ), (T 2 , y 2 ), . . . , (T m , y m ) following definition 12. Then T i are the subtrees of T with number of vertexes strictly less than k. By the induction hypothesis,</p><p>Then we calculate the number of edge preserving mappings containing T i &#8746; {x}. For each subtree T i , there are C(y i ) choices for g(x). Thus,</p><p>We can use 20 to simplify 21 and get: This completes the proof.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.3.">Extension to general templates</head><p>For a general template, we do not have a tree structure naturally generated by the template. In this setting, we use a breadth first search to produce a spanning tree of our template. We then can estimate the number of EPMs on our general template by using the algorithm in the previous section on the spanning tree. Then we give the following estimation algorithm for a general template. For a selected node x, we use the TreeEPM algorithm on the rooted tree structure with root x. The algorithm returns a vector of length |C(x)| recording an estimate of the number of EPMs that map x to each element in C(x). In Figure <ref type="figure">7</ref>, we</p><p>show the example of a template and its related data structures in each step of the EPM estimation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Complexity analysis of Spanning Tree Entropy method</head><p>We introduce first some terminology to be used in the complexity analysis. 6. Experiments</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Single Channel Networks</head><p>In this section, we describe a series of experiments performed on a collection of real and synthetic single channel subgraph isomorphism problems. We consider ten strategies for querying nodes listed in Table <ref type="table">2</ref>.</p><p>In the event that multiple nodes tie on one of these criteria, the node with minimal index is selected. We consider the benchmark dataset compiled by Solnon <ref type="bibr">[34]</ref>: The LV dataset is composed of a variety of graphs with theoretically interesting properties (e.g., biconnected, triconnected, highly symmetric, etc.). The SI dataset has four different graph matching problem sets involving bounded valence graphs, modified bounded valence graphs, meshes, and Erd&#337;s-R&#233;nyi graphs. There are two sets of instances representing images for pattern recognition problems, images-cv and images-pr which we combine into one dataset images. The biochemical_reactions dataset represents matching problems on systems of biochemical reactions. The phase dataset are instances involving randomly generated Erd&#337;s-R&#233;nyi graphs, known to be very difficult.</p><p>For our experiments, we only consider problems for which there are at least ten subgraph isomorphisms.</p><p>From these, we divide our datasets into "easy" and "hard" instances. Easy instances are those for which we can fully enumerate the solution space and therefore can compute the optimal amount of queries as in Section 3. Hard instances are the subgraph matching problems where we do not have the full solution space and as such we cannot compute the number of queries using the O or ME methods. We also include in this set, problems for which the MLE does not terminate within ten minutes. A compilation of basic graph statistics for these datasets is presented in Table <ref type="table">3</ref>.</p><p>For these problems, we select ten random solutions from the solution space for a given subgraph matching problem. For the easy instances, this is uniformly chosen over all solutions, and for hard instances, it is randomly chosen from a selection of solutions found by the Glasgow solver <ref type="bibr">[22]</ref> within one minute. Then for each solution, we proceed using the framework described in Algorithms 1.</p><p>The results for the easy data sets are presented in Figure <ref type="figure">8</ref>. The optimal query selection can vary significantly across different randomly selected solutions of the same subgraph matching problem. We count</p><p>Dataset |V t | |V w | Log Search Space MaxD MNCS STE MD R EE MC BTN 53 262377 75.02 25.07 21.32 17.41 17.92 21.44 20.96 17.4 BTN100 100 262377 110.57 42.06 36.64 30.25 30.66 35.93 35.88 30.15 Ivysysv7 94 2488 262.11 78.96 77.32 69.64 69.98 72.73 76.46 69.81 Covid 28 87580 29.86 15.28 15.28 12.96 13.0 14.1 15.16 13.15</p><p>Table 4: This table shows the average of the statistics of different datasets and the average number of nodes to query for different active learning results. The search space of a dataset is defined to be the product of the size of candidate sets. We take the logarithm with base 10 and calculate the average. See Section 6.1 for definition of the methods.</p><p>degree, we often get better results than most methods that need much more calculation.</p><p>The STE strategy gives similar query numbers as the minimum degree. But the way of selecting nodes is different. In our experiment, for a given ground truth, we include the nodes in the query list of any active learning method if querying all the other nodes can not reduce the solution space to one. In addition, for any non-trivial equivalent class of size k, we need to include at least k -1 nodes in the query list. We show a case study in Figure <ref type="figure">13</ref> to illustrate how this works with a particular instance from the BTN dataset; the theoretical optimal query is reached by an exhaustive search on all the selections. The optimal query is one node less than both methods. Notice that the STE strategy queried an unnecessary node (the node with 1 inside the circle). That node can be determined by the all-different constraint and the information from other nodes. minimum degree method does not differ between the nodes with same degree, so the performance of this method depends on some randomness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusions and Future Work</head><p>In this paper, we present a rigorous mathematical treatment of the active learning query problem for subgraph matching where only one ground truth solution is desired. This has not been considered in the prior literature; however, it is an important problem for real-world problems in which a subgraph-matching problem may have a multitude of solutions yet the solution space can be greatly reduced with additional information brought by a subject matter expert. Such active learning problems are well-known in semisupervised machine learning but have not been studied in the subgraph matching setting. We prove that finding the optimal template vertices to query to be NP-complete even if the solution space and the ground truth are known. We present several strategies for determining template queries, some based on simple graph structure and others that estimate the probability of matching template and world nodes. We design an inexpensive and fast method to estimate the solution space based on the spanning tree. We assess our results on benchmark data and compare the performance of different strategies on single channel and multichannel network datasets with varying graph size and number of solutions. Our experiments show that for single channel networks, the maximum entropy strategy which requires the calculation of the solution space is nearly optimal and other methods that approximate this approach are close but with room for that the template graph is sparse with a tree-like structure. This strategy may be less effective for templates that are not sparse or that have a cyclical structure. Since the template is known a priori, this strategy could be restricted to problems with a tree-like structure.</p><p>Also, we did not focus on an efficient strategy for finding the optimal solution to the active learning problem, noting that this problem is NP-complete. Additionally, we find that the sequential strategies mentioned here have a total number of queries that are of the same order of magnitude as the number of template graph nodes. The example in Figure <ref type="figure">13</ref> shows an optimal choice that also has the same order of magnitude, indicating that for such combinatorially complex solution spaces, one will need to provide additional information about a substantial fraction of the template nodes to reduce the solution space to a single isomorphism. This issue is a limitation of the problem rather than our suggested strategies.</p><p>Finally, our research are limited on finding exact matches for subgraph matching. However, the exact subgraph matching problem could be extended to related problems. One example is the inexact subgraph matching problem, which replaces the strict edge-preserving constraints of exact isomorphism with a penalty for edge and label mismatches. The exact form of the penalty varies across applications, and there are numerous approaches, many taking inspiration from the exact matching problem. These approaches involve a variety of techniques including filtering <ref type="bibr">[36,</ref><ref type="bibr">17]</ref>, A* search <ref type="bibr">[15]</ref>, indexing <ref type="bibr">[20]</ref>, and continuous techniques <ref type="bibr">[35]</ref>. Another related problem is the maximum common subgraph detection problem, which deals with finding the largest common subgraphs between two input graphs. In <ref type="bibr">[2]</ref>, the authors designed a "branch and bound" algorithm called GLSearch based on graph neural networks. Their work shows that combining search methods with reinforcement learning and deep learning improves efficiency and is promising in solving these combinatorial problems.</p></div></body>
		</text>
</TEI>
