<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Using Active Queries to Infer Symmetric Node Functions of Graph Dynamical Systems</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2022 August</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10376919</idno>
					<idno type="doi"></idno>
					<title level='j'>Journal of machine learning research</title>
<idno>1533-7928</idno>
<biblScope unit="volume">23</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>A. Adiga</author><author>C. Kuhlman</author><author>M. Marathe</author><author>D. Rosenkrantz</author><author>S.S. Ravi</author><author>R. Stearns</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Developing techniques to infer the behavior of networked social systems has attracted alot of attention in the literature. Using a discrete dynamical system to model a networkedsocial system, the problem of inferring the behavior of the system can be formulated as theproblem of learning the local functions of the dynamical system. We investigate the problemassuming an active form of interaction with the system through queries. We consider twoclasses of local functions (namely, symmetric and threshold functions) and two interactionmodes, namely batch (where all the queries must be submitted together) and adaptive(where the set of queries submitted at a stage may rely on the answers to previous queries).We establish bounds on the number of queries under both batch and adaptive query modesusing vertex coloring and probabilistic methods. Our results show that a small number ofappropriately chosen queries are provably sufficient to correctly learn all the local functions.We develop complexity results which suggest that, in general, the problem of generatingquery sets of minimum size is computationally intractable. We present efficient heuristicsthat produce query sets under both batch and adaptive query modes. Also, we present a query compaction algorithm that identifies and removes redundant queries from a givenquery set. Our algorithms were evaluated through experiments on over 20 well-knownnetworks.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Discrete dynamical systems and their significance</head><p>Discrete dynamical systems are used in a host of settings to understand population-level social contagion dynamics in terms of individual (human) agent behavior. Examples include the spread of health behaviors <ref type="bibr">(Valente, 2010)</ref> such as needle sharing in drug use <ref type="bibr">(Latkin, 1998;</ref><ref type="bibr">Valente, 2012)</ref> and overdose prevention <ref type="bibr">(Sherman et al., 2009)</ref>; segregation <ref type="bibr">(Schelling, 1971)</ref>; financial contagions <ref type="bibr">(Gai and Kapadia, 2010)</ref>; starting to use online communication tools <ref type="bibr">(Karsai et al., 2014)</ref>; and coordination <ref type="bibr">(Rosenthal et al., 2015)</ref>. The frameworks in these papers and our paper are network representations of populations, where nodes and edges represent entities (such as humans) and pairwise interactions, respectively. Our focus is on a particular class of discrete dynamical systems, called Synchronous Dynamical Systems (SyDSs). A formal definition of this model is given in Section 2.</p><p>Each of the works cited above can be viewed as capturing influence through threshold models <ref type="bibr">(Granovetter, 1978;</ref><ref type="bibr">Schelling, 1978)</ref>, where a node v i contracts a contagion if at least a particular number of its neighbors have already contracted it. This number for node v i is called its threshold &#964; i . We are interested in complex contagions <ref type="bibr">(Centola and Macy, 2007)</ref> that are characteristic of social contagions. In this case, some agents may need multiple reinforcing interactions to adopt a contagion; that is, some nodes have thresholds of 2 or more. These models are used to study many problems, such as those defined above, and other social phenomena (e.g., <ref type="bibr">Granovetter (1978)</ref>; <ref type="bibr">Schelling (1978)</ref>; <ref type="bibr">Centola and Macy (2007)</ref>; <ref type="bibr">Valente (1995)</ref>; <ref type="bibr">Easley and Kleinberg (2010)</ref>). Furthermore, <ref type="bibr">Watts (2002)</ref> argues that threshold models are used in a host of settings where incomplete information exists or when there is insufficient time to make more deliberate decisions. Thus, threshold models are used widely in the social sciences for understanding social contagions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Motivation for local function inference</head><p>Our work is focused on inferring threshold (or more generally, symmetric) functions of nodes in dynamical systems. It is known that small changes in the thresholds of agents can make a large difference in population dynamics. An example is provided in <ref type="bibr">(Granovetter, 1978)</ref>, where a change in one agent's threshold, by a value of 1, changes population-level collective action from non-existent to full. Several papers have used mined data to infer thresholds for applications ranging from protests, to Twitter messaging, to joining social media (e.g., <ref type="bibr">Gonz&#225;lez-Bail&#243;n et al. (2011)</ref>; <ref type="bibr">Romero et al. (2011b)</ref>; <ref type="bibr">Ugander et al. (2012)</ref>; <ref type="bibr">Easley and Kleinberg (2010)</ref>). Importantly, in all of these cases, heterogeneous (i.e., non-uniform) thresholds among agents were inferred. Also, researchers have identified different types of influence based on agent attributes. For example, there are gender-based differences in contracting obesity and depression from peers <ref type="bibr">(Christakis and Fowler, 2007;</ref><ref type="bibr">Stevens and Prinstein, 2005)</ref>. Thus, agent thresholds must be determined based on an agent's (local) neighborhood, behavior (and states) of agents in this neighborhood, and individual behavior. Symmetric functions, which generalize threshold functions, also serve as natural models in game theoretic settings <ref type="bibr">(Papadimitriou and Roughgarden, 2003)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3">Active querying and its significance</head><p>In general, an inference algorithm may receive observed data about the dynamical system or submit queries to the system and obtain responses from the system. The former approach, which has been studied by many researchers (e.g., <ref type="bibr">Gonz&#225;lez-Bail&#243;n et al. (2011)</ref>; <ref type="bibr">Adiga et al. (2017)</ref>; <ref type="bibr">Romero et al. (2011a)</ref>), will be discussed briefly in Section 1.5. In this paper, we pursue the latter approach. We study inference by active querying, where the user has some control over what information is extracted from the dynamical system by querying it. Several behavioral studies have been conducted in a network setting in the context of public goods games, collective identity, and team-building exercises <ref type="bibr">(Broere et al., 2019;</ref><ref type="bibr">Cedeno-Mieles et al., 2020;</ref><ref type="bibr">Charness et al., 2014)</ref>. In all these works, human subjects (agents) are allowed or required to coordinate with one another (thus forming a network) to achieve their goals. The principal who conducts these games designs situations by providing incentives for agents to act in a certain manner. The individual and collective behaviors resulting from these experiments are then analyzed. Such settings have been effectively modeled as network dynamical systems <ref type="bibr">(Santos et al., 2008;</ref><ref type="bibr">Cedeno-Mieles et al., 2020)</ref>. Active querying of a dynamical system is akin to such controlled experimentation in the real-world experiments to understand agent behavior in a network setting. Since every experiment has a cost associated with it, a natural objective is to minimize the number of queries used to infer a system.</p><p>Here, we focus on a model where each query is a system configuration C, with each configuration being an ordered tuple of states of the nodes of a dynamical system at a particular time t. The response from the system is the successor C of C, that is, the configuration of the system at time t + 1. The goal is to infer the node functions of the system from the responses. Our query model is motivated by the following two lines of research.</p><p>1. First, learning Boolean functions is an important area of learning theory (e.g., <ref type="bibr">Abasi et al. (2014)</ref>; <ref type="bibr">Angluin and Slonim (1994)</ref>; <ref type="bibr">Sloan et al. (2013)</ref>; <ref type="bibr">Angluin et al. (1992</ref><ref type="bibr">Angluin et al. ( , 1993))</ref>). In this setting, each query (usually referred to as a membership query) specifies an input &#945; to an unknown function f and an oracle returns the value f (&#945;). Our query model is an extension of membership queries to a network setting. In our case, since multiple Boolean functions (one for each node of the dynamical system) must be inferred, each query (i.e., a configuration) provides inputs to all the functions, and the output is another configuration which provides the values of all the functions for the specified inputs. Our work is similar in form to the teacher model (e.g., <ref type="bibr">Goldman and Kearns (1995)</ref>; <ref type="bibr">Dasgupta et al. (2019)</ref>) in the context of concept learning, where a teacher selects instances so that a student can learn the target concept from a concept class. In all these cases (including ours), the objective is to find a minimum set of queries that is sufficient to learn the object.</p><p>2. Second, our query model can also be thought of as an idealized version of hypothetical queries (i.e., alternate scenarios) that are useful in inferring local behaviors of individuals in the context of social contagions such as opinions or health issues (e.g., <ref type="bibr">Sloan et al. (2013)</ref>; <ref type="bibr">Centola (2010</ref><ref type="bibr">Centola ( , 2011))</ref>). For concreteness, we consider opinion propagation over a network, where the values 0 and 1 represent respectively positive and negative opinions on a certain topic. To understand when an individual (corresponding to a node in a social network) will change her opinion, a principal (e.g., a social scientist or a healthcare professional) may pose a query of the form "If the opinions of your neighbors are represented by this Boolean vector v, what will be your opinion?". Such queries do not set the opinions of the nodes in the social network to true (or actual) values. Instead, the principal tries to determine the situations that cause changes in an individual's behavior by analyzing the responses of the individual to a set of queries where the opinions of the neighbors of the individual are hypothetical values. While this can be done by issuing such queries to each node separately <ref type="bibr">(Centola (2010</ref><ref type="bibr">(Centola ( , 2011))</ref>), the underlying network provides a way to combine these individual queries into a single configuration so that the number of queries can be further reduced. Thus, our query model exploits the network structure in constructing a small set of queries to learn the local functions.</p><p>Our work is similar in spirit-but quite different in problem domain and results-to some of the recent works on inference (e.g., <ref type="bibr">Kleinberg et al. (2017)</ref>) or the popular and well-studied area of active learning <ref type="bibr">(Settles, 2009;</ref><ref type="bibr">Dasgupta and Langford, 2009)</ref>. We will discuss this in more detail in Section 1.5. To the best of our knowledge, this is the first work which approaches the inference problem for dynamical systems from a combinatorial and algorithmic perspective. In doing so, we relate it to well-studied graph theoretic approaches such as node coloring and probabilistic methods. The formulation also enables us to quantify rigorously the complexity of inferring such systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.4">Summary of Results</head><p>Our focus is on the following problem: given the underlying graph of a dynamical system, construct queries to identify all the local functions. We study two query modes, namely batch and adaptive modes, that differ in their degrees of control. Under the batch mode, all the queries must be submitted together. In the adaptive mode, queries can be submitted in several stages, and queries at a stage can depend on the answers to previous queries; this strategy is similar to the one used in games such as "Twenty Questions" 1 . The optimization goal is to minimize the number of queries. We present both theoretical and experimental results as summarized below. In the following, we use G to denote the underlying network of the SyDS with n nodes and &#8710; to denote the maximum node degree of G.</p><p>(a) Generating concise query sets for symmetric node functions under the batch mode. We develop an algorithm for generating query sets under the batch mode to correctly identify all the symmetric local functions of the dynamical system. Such a set is called a complete query set. The main idea used in this algorithm is distance-two vertex coloring of the network G. We show that the size of a complete query set is at most &#967;(G 2 ) + 1 &#8804; min{&#8710; 2 + 2, n + 1}, where &#967;(G 2 ) is the minimum number of colors required for distance-two coloring of G (Theorem 10). We also establish a lower bound of &#8710; + 2 on the size of any complete query set (Proposition 3).</p><p>(b) Complexity of generating optimal query sets. The query set generated by the method in (a) satisfies a monotonicity property (defined in Section 4.1). We show that the problem of finding an optimal monotone complete query set is . This provides an indication of the difficulty of efficiently generating optimal query sets.</p><p>(c) Sharper bounds based on probabilistic methods. We show that a query set of size O(&#8710; 1.5 log n) which is complete with probability at least 1 -1 n can be obtained using a sampling technique (Theorem 12). Using more sophisticated techniques based on Lov&#225;sz Local Lemma <ref type="bibr">(Mitzenmacher and Upfal, 2005)</ref>, we show that there exists a complete query set of size O(&#8710;(log &#8710;) 2.5 ) (Theorem 13). We note that this bound is asymptotically better than the &#8710; 2 + 2 bound based on distance-two coloring mentioned above.</p><p>(d) Query set compaction. We formulate the problem of query set compaction, i.e., minimizing the size of a query set by deleting redundant queries. We show that the problem is, in general, NP-complete. We present an approximation algorithm for the problem and prove that it achieves the best performance guarantee to within constant factors, under the common assumption that P = NP (Theorems 23 and 26).</p><p>(e) Inferring threshold functions under the adaptive query mode. As can be expected, adaptive query mode, when applicable, can produce significantly smaller query sets compared to the batch mode. Our results for adaptive querying are based on binary search. This enables us to find the threshold of a node whose local function has k inputs using O(log k) queries. Using an extension of this technique, we show that the thresholds of all the nodes in a scale-free (i.e., power law) graph with exponent &#947; &#8805; 1 can be found using O [n log n] 2 &#947;+1 queries (Theorem 22). In addition, we show that for a dynamical system where the underlying network is a clique and each node is associated with a threshold function, n + 1 queries are necessary even under the adaptive query mode (Theorem 5). For general graphs, we develop a greedy heuristic based on binary search and distance-two coloring to construct query sets.</p><p>(f ) Extensive experimentation on synthetic and real-world networks. We evaluated the various query generation algorithms under the batch mode on more than 20 real-world and synthetic networks. For most real-world networks, our algorithm based on distance-two coloring generates query sets of minimum size. We present experimental results to show that the combination of randomized query generation followed by query set compaction produces small query sets under the batch mode for many well known social networks. We evaluated the greedy heuristic under the adaptive mode (mentioned in Item (e) above) for various settings of networks and threshold assignments. Our results (Section 7) show that for most cases, it significantly outperforms the batch mode algorithms. We also study the effect of network structure and threshold assignments on the size of query sets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.5">Related Work</head><p>The problem of inferring unknown components of systems has received attention in the literature. There are several works on the passive mode of inference. For instance, many researchers have studied the problem of learning automata (e.g., <ref type="bibr">Murphy (1996)</ref>). <ref type="bibr">Kearns and Vazirani (1994)</ref> study the problem of learning normal forms and Boolean functions. Works such as <ref type="bibr">(Gonz&#225;lez-Bail&#243;n et al., 2011;</ref><ref type="bibr">Romero et al., 2011b)</ref> infer thresholds from social media data. <ref type="bibr">Abrahao et al. (2013</ref><ref type="bibr">), Gomez-Rodriguez et al. (2010)</ref> and <ref type="bibr">Soundarajan and Hopcroft (2010)</ref> consider the problem of inferring the network structure given the contagion propagation model. Learning the source nodes of infection for contagion spreading is addressed in <ref type="bibr">(Zhu et al., 2017)</ref>. Many of these problems are formally hard even for simple local functions. The work of <ref type="bibr">(Adiga et al., 2017)</ref> provides several problems aimed at inferring thresholds in threshold-based discrete dynamical systems. Recently, there have been several works on learnability of dynamical system properties under the Probably Approximately Correct (PAC) learning framework. <ref type="bibr">Lokhov (2016)</ref> uses a dynamic message-passing algorithm to reconstruct parameters of a spreading model given infection cascades. <ref type="bibr">Narasimhan et al. (2015)</ref> and <ref type="bibr">He et al. (2016)</ref> have studied the learnability of the influence function of popular stochastic propagation models, and <ref type="bibr">Adiga et al. (2019)</ref> consider the problem of learning node thresholds. Inference problems for a class of local functions, motivated by user behavior in Facebook like networks, are studied in <ref type="bibr">Adiga et al. (2020)</ref>. Several works show the sensitivity of results to other dynamical systems features, such as network structure <ref type="bibr">(Allen and Gale, 2000;</ref><ref type="bibr">Ganesh et al., 2005;</ref><ref type="bibr">Chakrabarti et al., 2008;</ref><ref type="bibr">Prakash et al., 2012)</ref>.</p><p>Active querying is studied in <ref type="bibr">(Kleinberg et al., 2017)</ref> in the context of determining user choices from a finite set of ranked options-the choice set problem. The goal is to minimize the number of queries of arbitrary subsets S of size k, of a universal set U , to learn a user's choice from among the elements of each set S. With these results, the algorithm can then predict the user's choice for any subset S &#8838; U of size k. They show that this can be accomplished with O(n log n) queries where n = |U |. The independent cascade model under the active query model has also been studied <ref type="bibr">(Adiga et al., 2018a)</ref>.</p><p>We note that our work is different from the area of active learning (e.g., <ref type="bibr">Dasgupta and Langford (2009)</ref>; <ref type="bibr">Settles (2009)</ref>). An active learning algorithm has access to a limited number of labeled instances and a large number of unlabeled instances initially. The algorithm submits appropriately chosen unlabeled instances as queries to an oracle which returns the labels of those instances. This process allows an algorithm to learn more effectively using fewer labeled instances since the algorithm is allowed to select the instances used in the learning process. The main difference in our setting is that there are no labels; instead, our inference algorithms use queries, with each query specifying a configuration C and requiring the oracle to return the successor configuration C . Like active learning, our algorithms also choose queries appropriately so that the local functions can be inferred with a small number of queries.</p><p>There are several challenges in determining individual node thresholds in realistic settings: (i) data are collected at discrete time intervals (not continuously), (ii) there may be time delay effects in agents observing their neighborhoods, and (iii) inherent stochasticity <ref type="bibr">(Valente, 1996;</ref><ref type="bibr">Berry and Cameron, 2017)</ref>. Practical guidelines and issues for threshold measurement are discussed in <ref type="bibr">(Berry and Cameron, 2017)</ref>. Here, we investigate problems of inferring local functions using rigorous formulations, supplementing them with experimental results from heuristics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.6">Organization</head><p>The remainder of this paper is organized as follows. Section 2 presents the formal definition of a synchronous dynamical system and also explains our query model. Section 3 presents lower bounds on the sizes of query sets under both batch and adaptive modes. Sections 4 and 5 present upper bounds on the query set sizes under the batch and adaptive query models respectively. Section 6 presents results for the query set compaction problem where the goal is to remove redundant queries. Section 7 presents experimental results using both synthetic and real-world social networks. Conclusions and some directions for future work appear in Section 8.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Synchronous Dynamical Systems (SyDSs) and Query Model</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Formal Definitions</head><p>Let B denote the Boolean domain {0,1}. A Synchronous Dynamical System (SyDS) S over B is specified as a pair S = (G, F ), where (a) G(V, E), an undirected graph with |V | = n, represents the underlying graph of the SyDS, with node set V and edge set E, and (b) F = {f 1 , f 2 , . . . , f n } is a collection of functions in the system, with f i denoting the local function associated with node v i , 1 &#8804; i &#8804; n.</p><p>Each node of G has a state value from B. Each function f i specifies the local interaction between node v i and its neighbors in G. The inputs to function f i are the state of v i and those of the neighbors of v i in G; function f i maps each combination of inputs to a value in B. This value becomes the next state of node v i .</p><p>At any time t, the configuration C of a SyDS is the n-vector (s t 1 , s t 2 , . . . , s t n ), where s t i &#8712; B is the state of node v i at time t (1 &#8804; i &#8804; n). In a SyDS, all nodes compute and update their next state synchronously.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Classes of Local Functions</head><p>We consider two classes of local functions, namely threshold and symmetric functions. They are defined below. (i) Threshold functions: The local function f v associated with node v of a SyDS S is a &#964; v -threshold function for some integer &#964; v &#8805; 0 if the following condition holds: the value of f v is 1 if the number of ones in the input to f v is at least &#964; v ; otherwise, the value of the function is 0. Let d v denote the degree of node v, and let &#964; v denote the threshold of node v.</p><p>The number of inputs to the function</p><p>(The threshold values 0 and d v + 2 allow us to realize local functions that always output 1 and 0 respectively.) (ii) Symmetric functions: A local function f v at node v is symmetric if the value of the function depends only on the number of ones in the input. Thus, a symmetric function f v with k inputs can be specified using a table with k + 1 rows, with row i specifying the value of the function when the number of ones in the input to the function is exactly i, 0 &#8804; i &#8804; k. Note that each threshold function is also a symmetric function; however, symmetric functions are more general than threshold functions. For example, consider the following symmetric function f with five inputs: let the value of f be 1 when the number of ones in its input is (exactly) 1; otherwise, the value of f is 0. This is not a threshold function since the value of the function is 1 when the number of ones in the input is 1 but changes to 0 when the number of ones in the input is larger than 1. We will use the term "symmetric SyDS" ("threshold SyDS") to refer to a SyDS whose local functions are all symmetric (threshold).</p><p>Example: Consider the graph of a threshold SyDS shown in Figure <ref type="figure">1</ref>(a). The thresholds are assigned in Figure <ref type="figure">1(b)</ref>. Also, we show an example trajectory (i.e., time sequence of configurations) of the system given that initially nodes 4, 6 and 7 are in state 1 and the rest are in state 0. Once the system reaches the configuration C = (1, 1, 1, 0, 0, 1, 0) at time step 3, it remains in that configuration forever; that is, C is a fixed point for this system.  Additional Terminology: If a given SyDS can transition in one step from a configuration C to a configuration C , then C is a successor of C and C is a predecessor of C . Since our local functions are deterministic, each configuration has a unique successor; however, a configuration may have zero or more predecessors. Given a graph G(V, E) and a node</p><p>Thus, the inputs to the function f i at v i are the states of the nodes in N [v i ].</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Query Model</head><p>The general problem addressed in this paper is that of correctly identifying the local functions of a SyDS by querying the system. We assume that the underlying network is known. Each query specifies a configuration C and the response from the system is the successor C of C. Since the state of each node is either 0 or 1, each query q and the response to q are bit vectors. We consider two query modes. In the batch query mode, a user must submit all the queries at the same time as a single batch. In the adaptive query mode, a user may submit the queries in several batches; the queries chosen in a batch may rely on the responses received from the system for the previous batches. As will be seen, for threshold SyDSs, the adaptive query mode can significantly decrease the number of queries. Figure <ref type="figure">2</ref> summarizes our active query framework. User defined query q</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Propagation model over a network</head><p>Unknown SyDS</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Successor q</head><p>Inference algorithm Inferred propagation model</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Adaptive querying</head><p>Figure <ref type="figure">2</ref>: Active querying framework. The dashed line corresponds to active querying where the inference algorithm constructs the current query based on the response received from the system for the past queries. In batch querying, there is no such feedback.</p><p>The following additional definitions regarding queries will be used throughout this paper. Given a query q and a node v i , the score of q with respect to v i , denoted by score(q,v i ), is the number of nodes in the closed neighborhood N [v i ] of v i that are set to 1 by q. Thus, score(q,v i ) gives the number of ones in the input provided by q to the function f i at v i .</p><p>Definition 1 Let S be a symmetric SyDS. For any node v i , let</p><p>When a query set Q covers a node v, the local symmetric function f v can be correctly inferred from the responses to the queries in Q. Thus, complete query sets have the following property.</p><p>Observation 2 Let S be a symmetric SyDS. If Q is a complete query set for S, then each local function of S can be determined given the successor of each query in Q.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Lower Bounds on Sizes of Query Sets</head><p>Here, we present lower bounds under batch and adaptive query modes. We begin with a result that provides a lower bound for any symmetric SyDS under the batch mode.</p><p>Proposition 3 Let S be a symmetric SyDS where the underlying graph G(V, E) has a maximum node degree &#8710;. Under the batch query model, every complete query set must contain at least &#8710; + 2 queries.</p><p>Proof: Under the batch mode, suppose a complete query set Q has fewer than &#8710;+2 queries. Since G has a node v of degree &#8710;, the number of ones in the input to the symmetric function f v at v varies from 0 to &#8710; + 1, a total of &#8710; + 2 values. Thus, there is at least one value k such that none of the queries in Q has a score of k with respect to v. Hence, the query set cannot correctly determine the value of the function f v when the number of ones in the input to f v is exactly k. This contradicts the assumption that Q is a complete query set. The proposition follows.</p><p>As a simple consequence of the above proposition, the following result points out that there are SyDSs with n nodes for which every complete query set must have n + 1 queries. This lower bound matches the upper bound of n + 1 given by Theorem 10 (Section 4) for all graphs.</p><p>Corollary 4 For symmetric SyDSs whose underlying graph is a clique on n nodes, every complete query set under the batch mode must have at least n + 1 queries.</p><p>We now establish a lower bound under the adaptive query model to show that there are threshold SyDSs for which a large number of queries are needed even under the adaptive query mode. However, this result does not rule out the possibility of smaller query sets for special graph classes.</p><p>Theorem 5 For every n &#8805; 1, there is a threshold SyDS whose underlying graph is a clique on n nodes such that at least n + 1 queries are necessary under the adaptive query mode to correctly identify all the threshold values.</p><p>Proof: Consider a threshold SyDS S whose underlying graph G(V, E) is a clique on n nodes. Let V = {v 1 , v 2 , . . . , v n }. We will tentatively choose the threshold of node v i to be i to answer queries under the adaptive model, We will show that if the total number of queries is less than n + 1, the answers to the queries cannot distinguish between this tentative assignment and a slightly different assignment of threshold values.</p><p>Suppose Q is a sequence of queries under the adaptive model, with |Q| &#8804; n. We generate the responses to the queries using the chosen tentative assignment of threshold values to nodes. The threshold of any node of S is in the range 0 through n + 1 (where the threshold value n + 1 indicates the function which is zero for every input). Since G is a clique, the the closed neighborhood of each node is the node set V . Thus, each query q &#8712; Q provides the same score to each node of G. There are n + 1 scores in the range 0 through n. Since Q has at most n queries, there is at least one value k, 0 &#8804; k &#8804; n, such that none of the queries in Q provides the score k. We have three cases depending on the value of k.</p><p>Case 1: k = 0. In this case, from the responses to the queries in Q, one cannot distinguish between the case where the threshold of node v 1 is 0 and the case where the threshold of v 1 is 1. (In both cases, the new state of v 1 is 1 in the response to each query in Q.)</p><p>Case 2: k = n. In this case, from the responses to the queries in Q, one cannot distinguish between the case where the threshold of node v n is n and the case where the threshold of v n is n + 1. (In both cases, the new state of v n is 0 in the response to each query in Q.)</p><p>Case 3: 1 &#8804; k &#8804; n -1. In this case, from the responses to the queries in Q, one cannot distinguish between the case where the threshold of node v k is k and the case where the threshold of v k is k + 1. (In both cases, the responses have the following property. For any query q &#8712; Q where score(q,v k ) &#8804; k -1, the new state of v k is 0 in the response. For any query q &#8712; Q where score(q,v k ) &#8804; k + 1, the new state of v k is 1 in the response.) Thus, under the adaptive model, a query set with n or fewer queries cannot correctly identify all the thresholds for the chosen SyDS. This completes the proof of Theorem 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Querying Under the Batch Mode: Results for Symmetric Node Functions</head><p>In this section, we first present an algorithm for generating query sets under the batch mode for symmetric SyDSs and derive an upper bound on the optimal query set size. We then derive an asymptotically better bound using probabilistic methods. Finally, we present complexity results that suggest that in general, generating complete query sets of minimum size is computationally intractable.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Generating Query Sets Based on Node Coloring</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1">Obtaining a Query Sequence From Node Coloring</head><p>Overview: We begin by defining the notion of a monotone query sequence. The sequence of queries constructed can be submitted as a batch to learn all the local functions of a symmetric SyDS. Using the notion of "sequence" allows us to point out an interesting connection between the problem of identifying symmetric local functions and a variant of the node coloring problem for the underlying graph. We remind the reader that each query is a bit vector.</p><p>Definition 6 (a) Given two queries q 1 and q 2 , we use the notation q 1 &#8804; q 2 to mean that every bit which is 1 in q 1 is also 1 in q 2 . (b) A query sequence q 1 , q 2 , . . . , q r is monotone if for each i, 1 &#8804; i &#8804; r -1, q i &#8804; q i+1 . (c) Let S be a symmetric SyDS and let M be a monotone query sequence. If M is also a complete query set for S (i.e., each node v of S is covered by M ), then M is a complete monotone query sequence.</p><p>The following lemma points out a property of complete monotone query sequences.</p><p>Lemma 7 Let S be a symmetric SyDS and let M = q 1 , q 2 , . . . q r be a complete monotone query sequence with 2 or more queries for S. Then q 1 is the vector of all zeros and q r is the vector with all ones.</p><p>Proof: Suppose q 1 is not the vector with all zeros. Let v i be a node such that the value assigned by q 1 to v i is 1. Thus, score(q 1 ,v i ) &#8805; 1. Since M is a monotone query sequence, score(q j ,v i ) &#8805; 1 for 2 &#8804; j &#8804; r. In other words, there is no query q j &#8712; M for which score(q j , v i ) = 0, contradicting the assumption that M covers v i . Likewise, if q r is not the vector with all ones, there is a node v k such that the value assigned to v k by q r is 0. Thus, there is no query q j in M such that score(q j ,v k ) = degree(v k )+1. In other words, M does not cover v k .</p><p>We now present an algorithm to show that if the underlying graph G has n nodes, then there is a monotone complete query sequence M for S with at most min{&#8710; 2 + 2, n + 1} queries, where &#8710; is the maximum node degree of G. This sequence of queries can be submitted as a batch to learn all the symmetric local functions. To establish this result, we recall the following definitions.</p><p>Definition 8 (a) Given an undirected graph H(V H , E H ) and an integer k &#8805; 1, a k-coloring of H assigns a color from {1, 2, . . . , k} to each node of H such that for each edge {u, v} &#8712; E H , the colors assigned to u and v are different. (b) Given an undirected graph G(V, E), the square of G, denoted by G 2 (V, E ), is an undirected graph on the same vertex set V . The edge set E is defined as follows: {u, v} &#8712; E iff there is a path with at most 2 edges between u and v in G.</p><p>We will also use the following result called Brooks' Theorem <ref type="bibr">(West, 2001, Theorem 5.1.22)</ref>.</p><p>Lemma 9 Let H(V H , E H ) be a graph with maximum node degree &#8710; H . Then, H can be colored efficiently using at most &#8710; H + 1 colors.</p><p>. ., C k denote the color classes created in the above coloring step.</p><p>(Color class C j consists of all nodes assigned color j, 1 &#8804; j &#8804; k.) 4 Create the query sequence M = q 0 , q 1 , . . . , q k with k + 1 queries as follows: query q 0 is a bit vector where every element is 0. 5 for j = 1, 2, . . . , k do 6 Create query q j by choosing the value 1 for all the nodes in C 1 &#8746; . . . &#8746; C j and 0 for the other nodes.</p><p>7 end 8 Output the query sequence M .</p><p>Our algorithm Alg-Monotone-Seq for generating a monotone complete query sequence M for the given SyDS S is shown in Algorithm 1. It is easy to see that the algorithm runs in polynomial time. The following theorem shows its correctness and estimates the number of queries generated.</p><p>Theorem 10 Let S be a symmetric SyDS whose graph G(V, E) has n nodes and maximum node degree &#8710;. Algorithm Alg-Monotone-Seq (Algorithm 1) produces a monotone complete query sequence M with at most min{&#8710; 2 + 2, n + 1} queries.</p><p>Proof: We first show that Step 2 of the algorithm can indeed color G 2 using at most min{&#8710; 2 + 1, n} colors. Since the maximum node degree in G is &#8710;, each node v of G has at most &#8710; neighbors and at most &#8710;(&#8710; -1) nodes at a distance of 2 from v. Thus, the maximum node degree in G 2 is at most &#8710;(&#8710; -1) + &#8710; = &#8710; 2 . Hence, by Lemma 9, G 2 can be colored using at most &#8710; 2 + 1 colors. Since G 2 has n nodes, n colors are always sufficient. Thus, G 2 can be colored with at most k = min{&#8710; 2 + 1, n} colors. Hence, the number of queries in M = k + 1 is at most min{&#8710; 2 + 2, n + 1}.</p><p>We now argue that the query sequence M = q 0 , q 1 , . . . , q k is monotone. Query q 0 is the bit vector with all zeros. For any j &#8805; 1, query q j sets all the nodes in color classes C 1 through C j to 1 and the remaining nodes to 0. Thus, each node that is set to 1 in query q j remains 1 in all the subsequent queries q j+1 , . . ., q k . In other words, the sequence is monotone.</p><p>Thus, we are left with the proof that M is complete; that is, for each node v with degree &#945; in G and each value , 0 &#8804; &#8804; &#945; + 1, there is a query q in M such that score(q,v) =</p><p>. Query q 0 ensures that score(q,v) = 0. For the other values of , consider the closed neighborhood</p><p>, there is a path consisting of at most two edges in G. Thus, the nodes in</p><p>appear, and assume without loss of generality that j 1 &lt; j 2 &lt; . . . &lt; j &#945;+1 . It is easy to see that for 1 &#8804; &#8804; &#945; + 1, query q j ensures that score(q j ,v) = . This completes the proof of Theorem 10.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.2">Generating Coloring from a Monotone Complete Query Sequence</head><p>The above algorithm and Theorem 10 show that a monotone complete query sequence can be constructed from the coloring of the graph G 2 . Theorem 11 shows that this relationship is not accidental; indeed, a valid coloring of G 2 can be generated from any monotone complete query sequence. This result is useful in proving that it is NP-hard to generate complete monotone query sequences of minimum length.</p><p>Theorem 11 Let G(V, E) be the underlying graph of a symmetric SyDS S. Suppose there is a monotone complete query sequence M with queries for S. Then, G 2 can be colored using -1 colors.</p><p>Proof: Let M = q 1 , q 2 , . . . , q denote the given monotone complete query sequence for S. Let C i denote the set of nodes of G which have the value 0 in q i and the value 1 in q i+1 , 1 &#8804; i &#8804; -1. Assign color i to all the nodes in C i , 1 &#8804; i &#8804; -1. We now prove that this scheme assigns a color to each node and that this is a valid coloring of G 2 .</p><p>First, we prove that each node is assigned a color. To see this, note that since M is a monotone complete query sequence, by Lemma 7, query q 1 has all its bits set to 0 and q has all its bits set to 1. Therefore, for each node v, there is an index r such that the value assigned to v in q r is 0 and that in q r+1 is 1. Thus, v appears in set C r and receives color r. The monotonicity of M ensures that v remains 1 in queries q r+1 through q .</p><p>We now prove by contradiction that the above method produces a valid coloring of G 2 . So, suppose that v i and v j are two nodes that receive the same color, say k, but G 2 has the edge {v i , v j }. By our coloring scheme, both v i and v j had the value 0 in q k and the value 1 in q k+1 . There are two cases to consider.</p><p>Case 1: The edge {v i , v j } is in G.</p><p>Note that both v i and v j have color k. Let score(q k ,v i ) = &#945;. Since both v i and v j changed from 0 in q k to 1 in q k+1 and {v i , v j } is an edge in G, score(q k+1 ,v i ) &#8805; &#945; + 2. Because M is monotone, none of the other queries in M provides a score of &#945; + 1 to v i . This contradicts the assumption that M is complete.</p><p>Case 2: The edge</p><p>In this case, there is a node v x such that the edges {v i , v x } and {v j , v x } are in G. Let score(q k ,v x ) = &#946;. Since both v i and v j changed from 0 to 1 in q k+1 and both {v i , v x } and {v j , v x } are edges in G, score(q k+1 ,v x ) &#8805; &#946; + 2. Because M is monotone, none of the other queries in M provides a score of &#946; + 1 to v x . Again, this contradicts the assumption that M is complete, and Theorem 11 follows. Figure <ref type="figure">3</ref> gives an example of a batch query set generated using the distance-2 coloring method discussed above. The figure also includes a query set generated using an adaptive query algorithm discussed in Section 7. Color &#8594; a b c d 1 0 1 1 1 1 2 0 0 1 1 1 3 0 0 0 1 1 4 0 0 0 0 1 5 0 1 1 1 1 6 0 0 0 1 1 7 0 0 1 1 1 (c) Batch mode query set 1 2 3 4 5 1 0,4 0 0,2 0 0,1 0 0,1 0 1,1 0 1,1 2 0,4 1 0,2 0 0,1 0 0,1 0 0,0 0 0,0 3 0,5 1 0,2 1 2,2 1 2,2 0 2,2 1 2,2 4 0,5 0 2,5 <ref type="table">1 3,</ref><ref type="table">5 1 4,</ref><ref type="table">5 0 4,</ref><ref type="table">5 1 4,</ref><ref type="table">4  5 0,</ref><ref type="table">5 0 2,</ref><ref type="table">5 1 3,</ref><ref type="table">4 1 4,</ref><ref type="table">4 1 4,</ref><ref type="table">4 1 4,</ref><ref type="table">4  6 0,</ref><ref type="table">4 1 0,</ref><ref type="table">2 0 1,</ref><ref type="table">1 0 1,</ref><ref type="table">1 1 1,</ref><ref type="table">1 0 1,</ref><ref type="table">1  7 0,</ref><ref type="table">5 1 2,</ref><ref type="table">5 0 3,</ref><ref type="table">4 1 3,</ref><ref type="table">4 1 3,</ref><ref type="table">3 1 3,</ref><ref type="table">3</ref> (d) Adaptive querying Figure <ref type="figure">3</ref>: Query sets for the threshold SyDS of in Figure <ref type="figure">1</ref>. In (c), the shaded rows correspond to nodes which have been assigned the color indicated by the column heading. In (d), the queries were generated adaptively using Algorithm 3 described in Section 7. There are five iterations and therefore, five queries. For each iteration, the corresponding query as well as the uncertainty (range) in the thresholds of the vertices are shown.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Generating query sets based on probabilistic methods</head><p>In Section 4.1, we showed that for any symmetric SyDS whose underlying graph has n nodes and maximum node degree &#8710;, one can obtain a complete query set with at most min{&#8710; 2 + 2, n + 1} queries. For some graphs with maximum node degree &#8710;, this method may generate a query set with &#8486;(&#8710; 2 ) queries. Here, using probabilistic methods, we provide better bounds. First, using a simple sampling technique, we will show that there exists a query set of size at most O(&#8710; 1.5 log n) that is complete with high probability. This is asymptotically better than the previously established bounds for graphs where &#8710; &#8805; (log n) 2 . Moreover, such a query set can be generated efficiently. Next, using more sophisticated techniques from <ref type="bibr">F&#252;redi and Kahn (1986)</ref> based on the Lov&#225;sz Local Lemma, we show an upper bound of O(&#8710;(log &#8710;) 2.5 ) on the size of complete query sets.</p><p>Theorem 12 Let S be a symmetric SyDS with graph G(V, E) where |V | = n and maximum node degree = &#8710;. A query set Q of size O(&#8710; 1.5 log (n)) which is complete with probability at least 1 -1 n can be constructed for S.</p><p>The proof of the above theorem appears in Appendix A. Here, we provide the general idea for the query set construction method used in proving the above theorem. Let D(p) be a probability distribution defined on the set of configurations of a SyDS such that the state of each vertex is set to 1 independently with probability p. For a query q, we use the notation q &#8764; D(p) to mean that query q is drawn from the distribution D(p). We will generate a query set</p><p>for some number of repetitions r. This construction enables us to prove that for any vertex v and positive integer k &#8804; &#8710; + 1, with high probability, there is a query q ik &#8712; Q for which score(q ik , v) = k. The reader is referred to Appendix A for details. Now, we show an upper bound of O(&#8710;(log &#8710;) 2.5 ) on the size of complete query sets under the batch mode for SyDSs with symmetric local functions.</p><p>Theorem 13 Let G(V, E) be the underlying graph of a symmetric SyDS S. Let &#8710; denote the maximum node degree in G. Under the batch mode, for any &#8710; &#8805; 2, there is a complete query set Q for S with |Q| = O &#8710;(log &#8710;) 2.5 .</p><p>The following lemma is a key ingredient for the proof of Theorem 13. Suppose G(V, E) is a graph and</p><p>The following notation is used in our proof of the lemma. Given a positive real number x, we let x to denote the integer obtained by rounding x to the nearest integer; that is, x = x if the fractional part of x is less than 0.5; otherwise,</p><p>Proof: The all zeros query 0 is such that &#8704;v &#8712; V , score(v, 0) = 0. The configuration q with all vertices of V in state 1 yields &#8704;v &#8712; V , score(v, q) = |N [v, V ]|. These two queries account for the additive term of 2. Let D(p, V ) be the distribution where each q &#8764; D(p, V ) is constructed as follows: &#8704;v &#8712; V , Pr(q(v) = 1) = p and &#8704;v &#8712;</p><p>where the second inequality follows from Lemma 27 which is given in Appendix A.</p><p>Pr score(v, q) = b for any</p><p>Note that if &#8707;v &#8712; V and b &#8712; {1, . . . , } such that score(v, q) = b for any q &#8712; Q, then there is a subset of V of size &#8804; for which in no query, b vertices are in state 1. Therefore, using union bound,</p><p>Hence, there exists a query set of size at most |Q| = ( -1) &#215; 22 &#8730; + 1 log |V | + 2 &lt; 22 3/2 log |V | + 2 that satisfies the conditions in the statement of the lemma.</p><p>We also use the following two lemmas of F&#252;redi and Kahn <ref type="bibr">(F&#252;redi and Kahn, 1986)</ref> that are based on the Lov&#225;sz Local Lemma <ref type="bibr">(Mitzenmacher and Upfal, 2005)</ref>.</p><p>Lemma 15 Let H(V, E H ) be a hypergraph on a set of n elements V such that each hyperedge has at most b elements and each element belongs to at most b hyperedges, where b &#8805; 500.</p><p>Proof of Theorem 13: We will prove that for any &#8710; &#8805; 2, the query set size is at most 2500&#8710;(log &#8710;) 2.5 . We first note that Lemma 15 has a technical requirement that b be at least 500. In order to apply that result, we require &#8710; &#8805; 500. Hence, we first consider the case &#8710; &#8804; 500 separately. From Theorem 10, we have |Q| &#8804; &#8710; 2 + 2. For any constant c &#8805; 42, it can be verified that |Q| &#8804; &#8710; 2 +2 &#8804; c&#8710;(log &#8710;) 2.5 for the entire range 2 &#8804; &#8710; &#8804; 500. Therefore, the bound holds trivially for this case. So, we will assume for the remainder of this proof that &#8710; &gt; 500.</p><p>For any graph G, let H be the hypergraph where each hyperedge H v corresponds to the closed neighborhood of vertex v. Since the maximum degree is &#8710;, for all v, |H v | &#8804; &#8710; + 1 and v belongs to at most &#8710; + 1 hyperedges. By Lemma 15, for any &#8710; &#8805; 500, the vertices of G can be partitioned into &#945; &#8804; &#8710;+1 log(&#8710;+1) subsets X 1 , X 2 , . . . , X &#945; , such that every vertex is adjacent to at most = 4.7 log(&#8710; + 1) vertices in any X i . For 1 &#8804; i &#8804; &#945;, let n &#8804;i (v) denote the number of neighbors of v in j&#8804;i X j . The query set Q is structured in the following manner. It is partitioned into</p><p>for each v, where n &#8804;0 (v) = 0 by definition. In the remaining part, we will show that this can be achieved with |Q i | &#8804; 22 &#8730; &#8710; log 2 &#8710; + 2 for each i. We will partition each X i into subsets X i1 , X i2 , . . . , X ik such that every vertex in V is adjacent to at most one vertex in X ij for any j.</p><p>we note that there exist at most 22 2 &#8730; log k + 2 &#8804; 23 2 &#8730; log( &#8710;) queries q such that &#8704;v &#8712; V , q(v) = 0 and for each b = 0, 1, ..., |N [v, X i ]|, there exists a query q such that v is adjacent to exactly b vertices in {x ij | 1 &#8804; j &#8804; k}. Let this set of configurations be denoted by Q i .</p><p>For each q &#8712; Q i , we construct a query q &#8712; Q i as follows: &#8704;v &#8712; X j , j &lt; i, we set q(v) = 1, and &#8704;v &#8712; X j , j &gt; i, we set q(v) = 0. For v &#8712; X i , q(v) = q(X ij ), where X ij is the set to which v belongs. Suppose in a configuration q, v &#8712; V has b neighbors in state 1. We will now show that score(v, q) = n &#8804;i-1 (v) + b. By definition of q, for any u &#8712; X i , q(u) = 1 if and only if q(X ij ) = 1. Since v is adjacent to at most one vertex in X ij for any j, there are exactly b vertices of N [v, X i ] with state 1 in q. Further, recalling that every vertex with color &lt; i is in state 1 in q, score(v, q) = n &#8804;i-1 (v) + b. Finally, since for every b = n &#8804;i-1 (v) + 1, . . . , n &#8804;i (v), there exists a query q &#8712; Q with b neighbors of v in state 1, the proof follows.</p><p>Thus,</p><p>, where the last inequality is due to the fact that = 4.7 log(&#8710; + 1) . Theorem 13 follows.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Complexity of Generating Small Monotone Complete Query Sequences</head><p>Here, we present a result that provides an indication of the difficulty of efficiently generating small query sets. In particular, we will show the NP-completeness of the following problem.</p><p>Short Monotone Complete Query Sequence (SMCQS) Given: The underlying graph G(V, E) of a symmetric SyDS S and a positive integer k. Question: Is there a monotone complete query sequence Q with at most k queries for S?</p><p>Theorem 17 Problem SMCQS is NP-complete.</p><p>Proof: It is easy to see that SMCQS is in NP. To prove NP-hardness, we use a reduction from the Distance-2 Coloring (D2C) problem defined as follows: given an undirected graph G(V, E) and an integer r, is G 2 r-colorable? It is known that D2C is NP-complete <ref type="bibr">(McCormick, 1983</ref>) even for planar graphs <ref type="bibr">(Lloyd and Ramanathan, 1992</ref>).</p><p>The reduction is straightforward. Given an instance of the D2C problem consisting of graph G and integer r, we obtain an instance of the SMCQS problem where the graph is G itself and the length k of the query sequence is r + 1. It was shown in the proof of Theorem 10 that when G 2 is r-colorable, there is a monotone complete query sequence with k = r + 1 queries. Also, from Theorem 11, from any monotone complete query sequence of length r + 1, one can obtain a valid coloring of G 2 with r colors. Thus, there is a solution to the SMCQS problem iff there is a solution to the D2C problem, and this completes the proof.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Generating Concise Query Sets: A Generalized Version</head><p>We now consider the complexity of a more general version of the problem of generating concise complete query sets under the batch model. Recall that for a symmetric SyDS, for each node v and each integer in {0, 1, . . ., degree(v)+1}, a complete query set Q must have a query q such that score(q,v) = . In the version of the problem considered below, we will only require the query set to provide a small subset of score values. This problem can be formulated as follows.</p><p>Concise Query Set for Specified Scores (CQS-SS) Instance: The underlying graph G(V, E) of a SyDS S where each local function is symmetric, a set L of score values and a positive integer k.</p><p>Question: Is there a query set Q with at most k queries for S such that for each node v and each score value s &#8712; L, there is a query q &#8712; Q for which score(q,v) = s?</p><p>We now show the CQS-SS problem is NP-complete even when the set L of scores contains only one value and the underlying graph is highly restricted (e.g., graphs of maximum node degree 4, planar graphs). To prove this result, we use a reduction from the One-In-Three 3SAT (OIT-3SAT) problem which is defined as follows: given a set of Boolean variables X = {x 1 , x 2 , . . . , x n } and a collection C = {C 1 , C 2 , . . . , C m } of clauses where each clause C j is a disjunction of exactly three variables from X, is there a truth assignment to the variables such that exactly one variable is set to true in each clause? In the definition of OIT-3SAT, note that none of the clauses contains negated literals. It is well known that OIT-3SAT is NP-complete <ref type="bibr">(Garey and Johnson, 1979)</ref>. The NP-completeness holds even under some restrictions as indicated in the following theorem.</p><p>Theorem 18 OIT-3SAT is NP-complete even when either of the following conditions hold.</p><p>(a) Each variable occurs in exactly three clauses <ref type="bibr">(Schmidt, 2010)</ref>.</p><p>(b) Consider the following graph H(V H , E H ) constructed from an OIT-3SAT instance. V H contains one node for each variable in X and one node for each clause in C. For a variable x i and clause C j , there is an edge between the corresponding nodes if x i appears in C j . OIT-3SAT is NP-hard even when graph H is planar <ref type="bibr">(Hunt III et al., 1998;</ref><ref type="bibr">Mulzer and Rote, 2008)</ref>.</p><p>Theorem 19 Problem CQS-SS is NP-complete even when set of scores has only one value. The problem remains NP-complete even when the underlying graph is constrained to belong to one of the following classes: (a) graphs with maximum node degree 4 or (b) planar graphs.</p><p>Proof: It is easy to see that CQS-SS is in NP. To prove NP-hardness, we use a reduction from OIT-3SAT defined above. Consider an instance of OIT-3SAT consisting of the variable set X and the clause set C. We construct an instance of the CQS-SS problem as follows.</p><p>We first describe the construction of the underlying graph G(V, E).</p><p>1. For each variable x i , V has a node v i , 1 &#8804; i &#8804; n. For each clause C j , V has a node w j , 1 &#8804; j &#8804; m.</p><p>2. For each variable x i and clause C j , if x i occurs in C j , then E contains the edge {v i , w j }, 1 &#8804; i &#8804; n and 1 &#8804; j &#8804; m.</p><p>3. For each node v i , G has another node z i and the edge {v i , z i }.</p><p>4. For each node w j , G has 4 additional nodes, denoted by y 1 j , y 2 j , y 3 j and y 4 j , and the following four edges: {w j , y 1 j }, {y 1 j , y 2 j }, {y 1 j , y 4 j }, {y 2 j , y 3 j }.</p><p>An example of this graph construction is shown in Figure <ref type="figure">4</ref>. The local function at each node is symmetric. The set L of scores contains just one value, namely 1. The number of queries k is set to 1. This completes the construction of the CQS-SS instance, and it can be seen that the construction can be done in polynomial time. We now prove that the resulting CQS-SS instance has a solution iff the OIT-3SAT instance has a solution.</p><p>Suppose the OIT-3SAT instance has a solution. We construct the query q which provides a score of 1 to each node of G as follows. Note that G has a total of N = 2n + 5m nodes. For each node v of G, we will use the notation q(v) to denote the value of v in q.</p><p>(a) Consider each variable x i , 1 &#8804; i &#8804; n. If it is set to true in the solution to OIT-3SAT, set q(v i ) to 1 and q(z i ) to 0; otherwise, set q(v i ) to 0 and q(z i ) to 1.</p><p>(b) For each clause C j , 1 &#8804; j &#8804; m, set q(w j ) to 0.</p><p>(c) For each clause C j , 1 &#8804; j &#8804; m, the values in q for the nodes y 1 j through y 4 j are chosen as follows: q(y 1 j ) = q(y 2 j ) = 0 and q(y 3 j ) = q(y 4 j ) = 1.</p><p>It is easy to verify that for each v &#8712; V , score(q,v) = 1. In other words, we have a solution to the CQS-SS instance.</p><p>Now suppose the CQS-SS instance has a solution. Let q be the corresponding query. We have the following claim.</p><p>Claim 1: (a) For 1 &#8804; j &#8804; m, q(w j ) = 0. (b) For 1 &#8804; j &#8804; m, q(y 1 j ) = 0. Proof of Claim 1: Part (a): Suppose q(w j ) = 1 for some j, 1 &#8804; j &#8804; m. Then since {w j , y 1 j } is an edge in E and q provides a score of 1 to each node, we must have q(y 1 j ) = 0. Now, since y 1 j is the only neighbor of y 4 j and score(q,y 4 j ) = 1, we must have q(y 4 j ) = 1. Now, however, score(q,y 1 j ) &gt; 2 (since q(w j ) = 1 and q(y 4 j ) = 1). This contradicts the validity of q and establishes Part (a) of Claim 1.</p><p>Part (b): Suppose q(y 1 j ) = 1 for some j, 1 &#8804; j &#8804; m. Then since {y 1 j , y 2 j } is an edge in E and q provides a score of 1 to each node, we must have q(y 2 j ) = 0. Now, since y 2 j is the only neighbor of y 3 j and score(q,y 3 j ) = 1, we must have q(y 3 j ) = 1. Now, however, score(q,y 2 j ) = 2 (since q(y 1 j ) = 1 and q(y 3 j ) = 1). This contradicts the validity of q and establishes Part (b) of Claim 1.</p><p>We now continue with the main proof. We choose an assignment to the variables of X as follows. Consider each variable x i , 1 &#8804; i &#8804; n. If q(v i ) = 1, set x i to true; otherwise, set x i to false. To prove that this is a solution to the OIT-3SAT instance, we need to prove that each clause has exactly one variable set to true. Consider any clause C j and the corresponding node w j of G. We know that score(q,w j ) = 1. Since q(w j ) = 0 and q(y 1 j ) = 0 (Claim 1), the q provides a score of 1 to w j , there is exactly one node v r such that q(v r ) = 1 and the variable x r appears in clause C j . Since we set x r to true, exactly one of the variables in C j is set to true. In other words, we have a solution to the OIT-3SAT instance. This completes the proof of NP-hardness of CQS-SS for general graphs.</p><p>To obtain the NP-hardness of CQS-SS for graphs with maximum node degree of 4, we carry out the above reduction from the version of OIT-3SAT in which each variable occurs in exactly three clauses (Part (a) of Theorem 18). To obtain the NP-hardness of CQS-SS for planar graphs, we use the a reduction from the version of OIT-3SAT mentioned in Part (b) of Theorem 18 and note that the additional vertices and edges used in the construction do not affect planarity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Query Sets Under the Adaptive Mode: Results for Threshold Functions</head><p>We now consider the adaptive mode of queries. We show that for threshold SyDSs, the adaptive query mode can reduce the number of queries significantly. To illustrate this, consider a SyDS whose underlying graph is a star graph with n nodes; that is, there is one node v 1 with degree n -1 which is the root of the tree and each of the other nodes v 2 through v n is child of the root. Under the adaptive mode, using the following method, O(log n) queries are sufficient.</p><p>The idea is simple: use binary search to identify the threshold of node v 1 whose degree is n -1 using O(log n) queries. Note that the threshold of v 1 can vary from 0 to n + 1. We first try a query q for which score(q,v 1 ) = n/2 . Depending on the response to the query, the range of possible threshold values for v 1 changes to either 0 through n/2 -1 or n/2 + 1 through n + 1. When this is repeated, after O(log n) queries under the adaptive model, we can identify the threshold of v 1 . After this, the following 3 additional queries are sufficient to identify the thresholds of the remaining n -1 nodes: a query with all zeros, a second query with all ones and a third one in which v 1 has the value 1 and all the remaining nodes have the value 0. Thus, all the thresholds can be identified using O(log n) queries under the adaptive mode. The following proposition summarizes the above discussion.</p><p>Proposition 20 For a SyDS where the underlying graph is the star graph with n nodes and each local function is a threshold function, O(log n) queries are sufficient under the adaptive query mode to determine all the local functions.</p><p>Since the center node of a star graph with n nodes has a degree of n -1, its threshold can be any one of the n + 1 values in {0, 1, . . . , n + 1}. Thus, a simple information theoretic argument points out that &#8486;(log n) is a lower bound on the number of adaptive queries needed to infer the threshold values of all the nodes of a star graph. Thus, the upper bound given by Proposition 20 is tight to within a constant factor.</p><p>As was shown in Section 3, in the batch mode, n + 1 queries are sometimes necessary to identify all the thresholds. Thus, the adaptive query mode can reduce the number of queries significantly.</p><p>The above idea can be applied to a more general class of graphs. Let &#945; and &#946; be nonnegative integers. We say that a graph is (&#945;, &#946;)-simple, if &#945; nodes have degree &gt; &#946; and the remaining n&#945; nodes have a degree of at most &#946;. For example, any star graph on n nodes is a (1, 1)-simple graph. Trivially, every graph is a (0, n -1)-simple graph. However, when &#945; and &#946; are required to be constants independent of n, a clique on n nodes is not an (&#945;, &#946;)-simple graph. Under the adaptive query model, we can obtain small query sets for (&#945;, &#946;)-simple graphs when &#945; and &#946; are small. In what follows, we show that the number of adaptive queries required for an (&#945;, &#946;)-simple graph is at most &#945;( log n + 1) + &#946; 2 + 2. When &#945; and &#946; are constants, this bound is O(log n).</p><p>Theorem 21 For any threshold SyDS whose underlying graph G belongs to the class of (&#945;, &#946;)-simple graphs, &#945;( log n + 1) + &#946; 2 + 2 queries are sufficient in the adaptive mode to identify all the threshold values. In particular, if &#945; and &#946; are constants independent of n, the number of queries used in this approach is O(log n).</p><p>Proof: We present an approach that uses at most &#945;( log n + 1) + &#946; 2 + 2 queries under the adaptive query mode. To discuss this approach, let V 0 denote the subset of nodes of G such that each node in V 0 has a degree larger than &#946;. Thus, |V 0 | = &#945;. Let V 1 denote the set of remaining n&#945; nodes (each of which has a degree of at most &#946;). The steps of our method are as follows.</p><p>1. We first construct a set Q 1 of queries which are submitted in a single batch to identify the thresholds of all the nodes in V 1 as follows.</p><p>(a) Let G 1 be the subgraph of</p><p>Since the degree of each node in G 1 is at most &#946;, the maximum node degree in G 2 1 is at most &#946; 2 . Hence, by Lemma 9, G 2 1 can be efficiently colored using at most k = &#946; 2 + 1 colors. Let C 1 , C 2 , . . . , C k denote the color classes generated in this step.</p><p>(c) Choose an arbitrary order, say v i 1 , v i 1 , . . ., v i&#945; , for the nodes in V 0 . The first &#945; + 1 queries of Q 1 , denoted by Q j 1 , 0 &#8804; j &#8804; &#945;, are chosen as follows. In Q 0 1 , all the nodes have the value 0. In Q j 1 , 1 &#8804; j &#8804; &#945;, nodes v i 1 through v i j have the value 1 and all other nodes have the value 0.</p><p>(d) The remaining k queries of Q 1 , denoted by Q j 1 , &#945; + 1 &#8804; j &#8804; &#945; + k, are chosen as follows. For all these queries, the values for all the &#945; nodes of V 0 are 1. The values for the nodes in V 1 are chosen as follows. In Q j 1 , all the nodes in color classes C 1 through C j have the value 1 and all the remaining nodes of V 1 have the value 0, 1 &#8804; j &#8804; k.</p><p>(e) The total number of queries in Q 1 generated in Steps 1(c) and 1(d</p><p>2. Recall that |V 0 | = &#945;. We use the adaptive mode for the nodes in V 0 ; that is, we use a separate binary search for each of the &#945; nodes in V 0 . For each node v &#8712; V 0 with degree d v , the batch queries in Q 1 include a query with scores 0 and d v + 1. Therefore, the binary search needs to consider only scores 1 through d v for v. This uses at most log (d v + 1) queries for v; this number is at most log n since d v &lt; n. Thus, for all the &#945; nodes in V 0 , the number of adaptive queries is at most &#945; log n . Q 0 denote the set of queries used in this step.</p><p>As argued above, the queries in Q 0 identify the thresholds of all the nodes in V 0 . We now argue that the resulting query set Q 1 covers all the nodes of V 1 . To see this, consider any node v &#8712; V 1 , and let d v be the degree of V in G. Thus, we need to show that for every value &#947; &#8712; {0, 1, . . . , d v + 1}, there is a query q &#8712; Q 1 such that score(q,v) = &#947;. Let r 0 and r 1 denote the number of neighbors of v in V 0 and V 1 respectively. Thus, d v = r 0 + r 1 . For 0 &#8804; j &#8804; &#945;, in query Q j 1 , exactly j nodes from V 0 are 1 and all the other nodes are 0. Thus, for each value &#947;, where 0 &#8804; &#947; &#8804; r 0 , there is a query q &#8712; Q 1 such that score(q,v) = &#947;. Now consider any value &#947; in the range r 0 + 1 through</p><p>may appear in the same color class. Let C p 1 &lt; C p 2 &lt; . . . &lt; C p r 1 +1 denote the color classes in which nodes of N 1 [v] appear. It can be seen that for 1 &#8804; &#8804; r 0 + r 1 + 1, query Q 1 is such that score(Q 1 ,v) = . Thus, Q 1 covers all the nodes in V 1 , and this completes the proof.</p><p>From the above discussion, |Q 0 | &#8804; &#945; log n and |Q 1 | &#8804; &#945; + &#946; 2 + 2. Thus, the total number of queries used by the above method is at most &#945;( log n + 1) + &#946; 2 + 2. This completes the proof of Theorem 21.</p><p>The above result can be used to establish an upper bound on the number of queries under the adaptive mode for scale-free graphs. Following <ref type="bibr">Easley and Kleinberg (2010)</ref>, we say that a graph with n nodes and maximum degree &#8710; is scale-free with exponent &#947; if there are constants c &gt; 0 and &#947; &gt; 1 such that for any integer d, 0 &#8804; d &#8804; &#8710;, the fraction of nodes with degree d in G is given by c /d &#947; . Our upper bound for the number of queries for scale-free graphs is stated below.</p><p>Theorem 22 For a threshold SyDS whose underlying graph G(V, E) is scale-free with exponent &#947; &gt; 1, the thresholds can be found using O [n log n] 2 &#947;+1 queries under the adaptive query mode.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Proof:</head><p>For a positive integer &#946; (to be chosen), let V 0 &#8838; V denote the set of nodes of G with degree &gt; &#946; and let V 1 &#8838; V denote the remaining nodes (each of which has degree at most &#946;). By Theorem 21, the number of queries g(&#946;) required satisfies g(&#946;) &#8804; |V 0 |( log n + 1) + &#946; 2 + 2. Since |V 0 | is the number of nodes with degree &#8805; &#946; + 1, we have</p><p>for some constants c, c and c . Therefore, g(&#946;) = c n &#946; &#947;-1 log n + &#946; 2 + 2. For &#946; &gt; 0, g(&#946;) is a convex function. Equating its first derivative to 0 and rearranging, we note that g(&#946;) attains a minimum value for &#946; satisfying</p><p>, and this completes our proof of Theorem 22.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Query Set Compaction</head><p>Overview: Some methods for generating queries in the batch mode (e.g., the randomized query generation method discussed in Section 4.2) may a generate complete query set Q with redundant queries. Since our goal is to construct query sets of minimum size, it is of interest to consider the problem of eliminating such redundant queries while preserving the property of completeness (i.e., the query set can correctly identify all the local functions). We call this the query set compaction problem; a precise formulation appears below. We show that it is NP-hard to obtain a performance guarantee of o(log n) for this problem, where n is the number of nodes of a given SyDS. We also present an approximation algorithm that provides a performance guarantee of O(log n) for the problem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Problem Definition</head><p>A precise formulation of the query set compaction problem is as follows.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Query Set Compaction (QSC)</head><p>Given: The underlying graph G(V, E) of a symmetric SyDS S, a complete query set Q for S and an integer k &#8804; |Q|.</p><p>Question: Is there a subset Q &#8838; Q such that (i) |Q | &#8804; k and (ii) Q is also a complete query set for S?</p><p>Our hardness results for QSC rely on known results for the Minimum Set Cover (MSC) problem which is defined as follows: given a base set X = {x 1 , x 2 , . . . , x n }, a collection Y = {Y 1 , Y 2 , . . . , Y m }, where each Y j is a subset of X, 1 &#8804; j &#8804; m, and an integer &#945; &#8804; m, is there a subcollection Y &#8838; Y such that (i) |Y | &#8804; &#945; and (ii) the union of all the sets in Y is equal to X? It is well known that MSC is NP-complete <ref type="bibr">(Garey and Johnson, 1979)</ref> and that unless P = NP, it cannot be approximated to within the factor (1 -) ln (n), for any , 0 &lt; &lt; 1, where n is the size of the base set <ref type="bibr">(Vazirani, 2001)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Complexity Results for QSC</head><p>Theorem 23 (a) The problem QSC is NP-complete even when the underlying graph has no edges. (b) Unless P = NP, QSC cannot be approximated to within the factor o(log n) in polynomial time, where n is the number of nodes in the underlying graph of the SyDS.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Proof:</head><p>Part (a): It is easy to see that QSC is in NP. We prove NP-hardness through a reduction from MSC. Let the given instance of MSC consist of base set X = {x 1 , x 2 , . . . , x n }, collection Y = {Y 1 , Y 2 , . . . , Y m } of nonempty subsets of X and integer &#945; &#8804; m. Without loss of generality, we may assume that each element of X appears in some subset in Y ; otherwise, there is no solution to the MSC instance. We will construct the underlying graph G(V, E) of a SyDS S and a complete query set Q for S as follows.</p><p>one correspondence with the base set X = {x 1 , x 2 , . . . , x n } of the MSC instance and V 2 = {v n+1 } consists of just one node. (Thus, V has a total of n + 1 nodes.)</p><p>2. The edge set E of G is empty; that is, the degree of each node is 0. Thus, for each node v &#8712; V , the number of ones in the input to the local function f v at v can only be either 0 or 1.</p><p>3. The query set Q consists of m + 1 queries (where m = |Y |) constructed as discussed below. Note that each query is an (n + 1)-bit vector, where the i th bit specifies the value of node v i , 1 &#8804; i &#8804; n + 1.</p><p>(a) For each Y j &#8712; Y , 1 &#8804; j &#8804; m, Q contains a query q j constructed as follows. Let Y j = {x j 1 , x j 2 , . . . , x jr }. Then, in query q j , the bits corresponding to the nodes v j 1 , v j 2 , . . . , v jr are all 1 and the other bits are 0.</p><p>(b) We add one more query q m+1 to Q; in query q m+1 , bits 1 through n are set to 0 and bit n + 1 is set to 1.</p><p>4. The upper bound on the size of the required subset Q of queries is set to &#945; + 1.</p><p>This completes the construction of the QSC instance. It can be seen that the construction can be carried out in polynomial time. We now show that Q is a complete query set for S.</p><p>Claim 1: The query set Q constructed above is a complete query set for S.</p><p>Proof of Claim 1: We must show that for each node v i &#8712; V , Q contains two queries, say q i 0 and q i 1 , such that score(q i 0 ,v i ) = 0 and that score(q i 1 ,v i ) = 1. First, consider any node v i , where 1 &#8804; i &#8804; n. Query q m+1 sets the value of v i to 0; thus score(q m+1 ,v i ) = 0. Suppose the element x i (corresponding to node v i ) appears in subset Y j . By our construction, query q j sets the value of v i to 1; thus, score(q j ,v i ) = 1. For node v n+1 , each query q j created from Y j sets the value of v n+1 to 0; that is, score(q j ,v n+1 ) = 0; Also, query q m+1 sets the value of v n+1 to 1; thus, score(q m+1 ,v n+1 ) = 1. The claim follows.</p><p>We now prove that there is a solution to the QSC instance if and only if there is a solution to the MSC instance.</p><p>Part 1: Suppose there is a solution Y to the MSC instance consisting of sets Y j 1 , Y j 2 , . . ., Y j , for some &#8804; &#945;. Consider the query set Q = {q j 1 , q j 2 , . . . , q j , q m+1 }, which includes the queries corresponding to the sets in Y along with query q m+1 . Note that Q &#8838; Q. Also, since &#8804; &#945;, |Q | &#8804; &#945; + 1. Thus, we only need to show that Q is a complete query set. Consider any node v i , where 1 &#8804; i &#8804; n. Query q m+1 sets the value of v i to 0; thus, score(q m+1 ,v i ) = 0. Further, Since Y is a set cover, the element x i (corresponding to node v i ) appears in some subset Y jz &#8712; Y . By our construction, in query q jz , the value of v i is 1; thus, score(q jz ,v i ) = 1. For node v n+1 , each query q &#8712; Q -{q m+1 } sets the value of v n+1 to 0; thus, score(q,v n+1 ) = 0. Further, query q m+1 sets the value of v n+1 to 1; thus, score(q m+1 ,v n+1 ) = 1. Hence, Q is a complete query set.</p><p>Part 2: Let Q be a solution to the QSC instance with |Q | &#8804; &#945; + 1. We claim that q m+1 &#8712; Q . This is because q m+1 is the only query for which score(q m+1 , v n+1 ) is 1. Define Q = Q -{q m+1 }. Let |Q | = and note that &#8804; &#945;. Further, let Q = {q j 1 , q j 2 , . . . , q j }. Consider the following subcollection Y of Y given by Y = {Y j 1 , Y j 2 , . . . , Y j }. We now show that Y is a solution to the MSC instance. To see this, consider any element x i &#8712; X, where 1 &#8804; i &#8804; n. Query q m+1 &#8712; Q sets node v i to 0. Since Q is a complete query set, some query q jz &#8712; Q must set v i to 1. By our construction, the subset Y jz contains x i . Thus, Y is a set cover. Since |Y | &#8804; &#945;, Y is a solution to the MSC instance, and this completes the NP-hardness proof.</p><p>We use the same reduction to prove the non-approximability result. Suppose A is an approximation algorithm that provides a performance guarantee of &#961; = o(log n) for the QSC problem, where n is the number of nodes in the underlying graph of the SyDS. We will show that A can be used to construct a 2&#961; = o(log n) approximation algorithm for the MSC problem, contradicting the known non-approximability result for MSC. Towards this proof, consider any instance of the MSC optimization problem. Let OPT(MSC) denote the value of an optimal solution (i.e., the minimum size of a set cover) for the MSC instance and let OPT(QSC) denote the value of an optimal solution (i.e., the minimum size of a complete query subset) for the QSC instance constructed from the MSC instance. In the NP-hardness proof, we showed that OPT(QSC) &#8804; OPT(MSC) + 1.</p><p>(2)</p><p>Suppose we run Algorithm A on the resulting QSC instance. Since A provides a &#961; approximation, the solution produced by A has at most &#961; OPT(QSC) queries. From this query set, it was shown in the NP-hardness proof that a solution to the MSC instance with &#961; OPT(QSC) -1 subsets can be constructed. Letting APPROX(MSC) denote the resulting number of subsets, we have</p><p>Thus, if A provides a performance guarantee of &#961; = o(log n) for the QSC problem, then there is a 2&#961; = o(log n) approximation algorithm for the MSC problem. This completes the proof of Theorem 23.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">An Approximation Algorithm for QSC</head><p>To complement the non-approximability result of the previous section, we will present an efficient approximation algorithm with a performance guarantee of O(log n) for the QSC problem. The basic idea is to use a reduction from the QSC problem to the MSC problem and then to use a known (greedy) approximation algorithm <ref type="bibr">(Vazirani, 2001)</ref> for the MSC problem.</p><p>Algorithm 2: Steps of Algorithm Approx-QSC Input: The underlying graph G(V, E) of a symmetric SyDS S and a complete query set Q. Output: A subset Q &#8838; Q such that Q is also a complete query set and |Q | is a good approximation for a complete query set of minimum size. 1 To construct the base set X of the MSC instance, consider each node v i ; let d i denote the degree of v i . Create a set A i of d i + 1 elements, given by</p><p>3 Use the greedy algorithm <ref type="bibr">(Vazirani, 2001)</ref> to get an approximate solution Y to the resulting MSC instance. 4 Construct the query set Q by choosing the query corresponding to each subset in Y and output Q .</p><p>The steps of our approximation algorithm Approx-QSC for QSC are shown in Algorithm 2. We assume that an instance of the QSC problem is specified by an underlying graph G(V, E) and a complete query set Q. Let V = {v 1 , v 2 , . . . , v n } and Q = {q 1 , q 2 , . . . , q m }. Note that each query q j &#8712; Q is an n-bit vector, with bit i specifying the value of node</p><p>Steps 1 and 2 of the algorithm construct an instance of the MSC problem consisting of the base set X and a collection</p><p>Step 3 applies a well known greedy algorithm (which, in each iteration adds a subset Y j from Y such that Y j contains the maximum number of elements of X which have not yet been covered) to construct an approximate solution Y to the MSC problem. The final step chooses a query corresponding to each subset in Y and outputs the resulting query subset Q . It can be seen that the approximation algorithm runs in polynomial time. To establish the performance guarantee provided by Approx-QSC, we need the following lemma.</p><p>Lemma 24 The reduction from QSC to MSC used in Steps 1 and 2 of Approx-QSC (Algorithm 2) produces an instance of MSC such that any solution with r subsets to the MSC instance is a solution with r queries to the QSC instance and vice versa.</p><p>Proof: First, consider any solution Y = {Y j 1 , Y j 2 , . . . Y jr } with r subsets to the MSC instance. Let Q = {q j 1 , q j 2 , . . . , q jr } be the corresponding query set with r queries. We need to show that Q is a complete query set; that is, for any node v i (1 &#8804; i &#8804; n) and any integer k, 0 &#8804; k &#8804; d i , there is a query q &#8712; Q such that score(q, v i ) = k. To see this, note that Y is a set cover. Thus, there is a set Y jz &#8712; Y such that the element a ik &#8712; X appears in Y jz . By our construction, a ik was added to Y jz because score(q jz , v i ) = k.</p><p>To prove the converse, let Q = {q j 1 , q j 2 , . . . , q jr } be a solution with r queries to the QSC instance. Consider the collection Y of sets given by Y = {Y j 1 , Y j 2 , . . . Y jr }. We claim that Y is a solution to the MSC instance. To see this, consider any element a ik &#8712; X. Since Q is a complete query set, there is some query q jz &#8712; Q such that score(q jz , v i ) = k. By our construction, set Y jz contains the element a ik . In other words, Y is a solution to the MSC instance with r sets.</p><p>The following is an immediate consequence of the above lemma.</p><p>Observation 25 Let OPT(QSC) denote the size of an optimal query set for a given QSC instance and let OPT(MSC) denote the size of an optimal solution to the MSC instance obtained at the end of Step 2 of Approx-QSC. Then, OPT(QSC) = OPT(MSC).</p><p>We can now establish the performance guarantee provided by Approx-QSC.</p><p>Theorem 26 Algorithm Approx-QSC provides a performance guarantee of O(log n) for the QSC problem, where n is the number of nodes in the underlying graph of the SyDS.</p><p>Proof: Let OPT(QSC) denote the size of an optimal query set for the QSC instance and let OPT(MSC) denote the size of an optimal solution to the MSC instance. We observe that the size of the base set X is 2|E| + n. To see this, note that for each node v i &#8712; V with degree d i , the the number of elements added to X in Step 1 of the algorithm in Algorithm 2 is d i + 1. Therefore, |X| = n i=1 (d i + 1) = ( n i=1 d i ) + n = 2|E| + n since the sum of all the node degrees is 2|E|. Since |E| &lt; n 2 , |X| &lt; 3n 2 . The greedy algorithm for MSC provides a performance guarantee of O(log |X|), which is O(log n) since |X| &lt; 3n 2 . Thus, the approximation algorithm for MSC produces a solution with at most O(log n) OPT(MSC) sets. By Lemma 24, any solution to MSC with r subsets leads to a solution with r queries for QSC. Thus, the size of the resulting query set is at most O(log n) OPT(MSC) which is equal to O(log n) OPT(QSC) by Observation 25. Thus, Approx-QSC has a performance guarantee of O(log n).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Experimental Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1">Overview</head><p>The theoretical results of the previous sections establish that finding a smallest set of queries to infer the local functions of threshold or symmetric SyDSs is non-trivial. In this section, through extensive experimentation<ref type="foot">foot_0</ref> on more than 20 real-world and synthetic networks, we seek to answer the following questions.</p><p>1. We study the performance of the proposed algorithms for batch mode. How do the the resulting query sets compare with the lower bounds and thus the optimal query set sizes for these networks? We studied an approach based on coloring G 2 for inferring threshold and symmetric SyDS.</p><p>2. While we were able to develop algorithms for batch mode with provable performance guarantees, coming up with a technique for adaptive mode for general graphs turns out to be hard. We propose a greedy heuristic inspired by distance-two coloring to infer thresholds of a SyDS (namely, Algorithm 3 presented in Section 7.4). How does this heuristic perform relative to the developed lower bounds? We experimentally analyze this question.</p><p>3. The bounds established above are in terms of maximum degree or the size of the graph. However, many real-world networks are scale-free, having a low average degree, but a very high maximum degree. How do our algorithms perform on these networks? Does network density (number of edges) have any role in determining the query set size?</p><p>4. We note that the query set size depends not only on the network structure, but also on threshold assignments. How does the uncertainty in the range of values the threshold can take influence the query set size? Our experiments also study this question.</p><p>Networks and threshold assignment. The diverse real-world and synthetic networks considered in this work are listed in Table <ref type="table">1</ref> along with some of their properties. As indicated in that table, all the mined networks used in our experiments are from the SNAP library <ref type="bibr">(Leskovec and Krevl, 2014)</ref>. We present representative results for selected networks, with other networks exhibiting the same behavior unless stated otherwise. We assigned thresholds in the following manner. Let 0 &#8804; &#952; &#8804; 1 be a real number. For a fixed value of &#952;, each node v was assigned a threshold value uniformly at random from the interval (d(v) + 2)(1&#952;)/2, (d(v) + 2)(1 + &#952;)/2 . Note that for &#952; = 0, the interval corresponds to the fixed threshold of (d(v) + 2)/2 and for &#952; = 1, any value from 0 to d(v) + 2 is possible.</p><p>Our theoretical results indicate that both network structure and the threshold assignments influence the number of queries required to infer the system. The experiments conducted were designed to further explore these aspects. Details regarding the computing environment used for our experiments are presented in Appendix B which also includes plots showing wall clock times for generating queries. We studied the performance of Alg-Monotone-Seq (Algorithm 1). For most real world networks considered in this paper, this algorithm gives the best possible performance of &#8710; + 2 (Proposition 3). This is due to the fact that in all these cases, n c (G 2 ), the number of colors used to color G 2 is equal to &#8710; + 1, the lower bound on the number of required colors. Compare the values in Table <ref type="table">1</ref> for columns "max. deg. &#8710;" and "Results: Query set size, Method 1." For synthetic networks (random regular and Erd&#337;s-R&#233;ny though, n c (G 2 ) is significantly higher than &#8710; + 2, yet much lower than &#8710; 2 + 2, the upper bound (Theorem 10). The should note that the observed performance is due to a combination of the structure of G 2 and the nature of the greedy coloring scheme. We observe that unlike the synthetic networks considered, most of the real-world networks are scalefree with maximum degree being much larger than average degree d avg . This is a possible reason for the superior performance of this approach. We also compared the results to the spectral radius bound, that is, the number of colors needed to color G 2 is at most 1 + &#955; 2 max <ref type="bibr">(Miao and Fan, 2014)</ref>, where &#955; max is the largest eigenvalue of the adjacency matrix of G.</p><p>Again, see Table <ref type="table">1</ref> and "&#955; max ." It is a well-known fact that &#8730; &#8710; &#8804; &#955; max &#8804; &#8710;, and for the real-world networks considered, &#955; max is indeed much less than &#8710;. However, despite this fact, we observe that &#955; 2 max + 1 is much larger than n c (G 2 ) + 1 in these cases.</p><p>Compaction. We note that the query set generated by this approach is already compact, i.e., no proper subset of queries can be complete. To see this, suppose this set is not compact. Then, there exists at least one query which is not required. We recall that by construction, the query set can be arranged as a monotone increasing sequence q 0 , q 1 , . . . , such that in query q i all nodes of color i are set to state 1. Suppose query q k is not required for the set to be complete. Then, using the arguments in the proof of Theorem 11, it can be seen that if all vertices of color k were assigned color k + 1, the coloring would still be valid. This means that in the greedy strategy, all nodes colored k + 1 could actually have been assigned color k. Since the greedy strategy always gives the minimum color available to nodes, this is a contradiction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3">Experiments on Query Compaction</head><p>Here, we present some experimental results obtained by first generating query sets using the probabilistic method discussed in Section 4.2 and then applying the query set compaction heuristic presented in Section 6.3. To generate the complete query sets, we used the method of Theorem 12. Recall that the query set contains the configurations of all zeros, of all ones and &#8710; random queries where queries are sampled from distributions D(i/&#8710;) for 1 &#8804; i &#8804; &#8710;.</p><p>The query set generation process may have to be repeated a number of times to obtain a complete set. The resulting set, even though large, can be compressed using the compaction algorithm.</p><p>We constructed 50 such query sets for three values of (2, 5 and 10) and checked whether each of them is a complete set. For = 2, out of the 50 sets constructed, none of them was complete. However, for = 10, from 5 to 50 query sets turned out to be complete sets depending on the network. We applied the compaction algorithm on the complete sets generated by the randomized algorithm. The results for = 10 are in Table <ref type="table">2</ref>. The compaction ratio depends on the size of complete set which was given as input. We note that compaction of query sets generated by this method yields around 80% reduction in the size of the query set in most cases (Table <ref type="table">2</ref>). On the average, the combination of randomized algorithm and compaction gives query sets of size around 1.5 to 2 times that of the monotone query sequence method discussed in Section 4.1. However, the method of random query set generation followed by compaction is comparatively much easier to implement.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4">Method 2: An Adaptive Algorithm</head><p>While the previous methods are for the batch mode, here we develop an adaptive algorithm to infer the thresholds. We give an outline of the approach. The steps of the algorithm are shown in Algorithm 3. In this algorithm, we use the notation N [v, G 2 ] to denote the set of nodes of v that are at distance of at most 2 in G (including v). We note that N [v, G 2 ] can be computed by a two-step breadth-first search of G starting at v.</p><p>For every node, let t L (v) and t H (v) be the minimum and maximum possible values of threshold that v can be assigned. These values quantify the uncertainty about the threshold. The threshold is said to have been inferred when t H (v) = t L (v). In a query q, if score(q, v) falls in the range [t L (v), t H (v)-1], the uncertainty reduces to either [score(q, v)+1, t H (v)-1] or [t L (v), score(q, v)] depending on the state observed in the successor configuration. In this heuristic, we use a greedy adaptive approach where the current query is constructed iteratively in the following way. Initially, all the nodes are in state 0. We first choose a vertex, say v max , for which the threshold range is maximum. We set exactly (t L (v) + t H (v))/2 of nodes in its closed neighborhood to state 1. This guarantees a reduction in the range by half. In the next iteration, we ignore all nodes in G within distance-2 of v max and repeat this process. The query is fully constructed when there are no more vertices to consider. After each query, the range [t L (v), t H (v) -1] for every v is updated based on its state in the Algorithm 3: Greedy heuristic to infer the thresholds.</p><p>Data: Network G(V, E); true thresholds t v for every node v. (These thresholds are not available to the inference algorithm; they are used only to compute the successor for a query.) Result: Infer the thresholds of all the nodes of G. (For data analysis purposes, the algorithm also returns the set Q of all the queries used in the inference procedure.)</p><p>the set of nodes for which threshold needs to be inferred; 5 Let Q = &#8709; be the query set; successor. We terminate this process when for all v, t L (v) = t H (v). Results are summarized in the last column of Table <ref type="table">1</ref>. The analysis of our experimental results follows.</p><p>Influence of threshold values and ranges. In general, the number of queries required is highly dependent on the possible threshold values the nodes can be assigned. We conducted experiments in the following manner. Recall the threshold assignment procedure described in Section 7.1. The results are in Figures <ref type="figure">5(a</ref>) and 6(a) for random k-regular and real-world networks respectively. For the random-regular graphs, the number of queries (averaged over 10 instances of graphs for each k) increases from an order of log k to as high as n, the size of the graph. We note that for k = n -1, this is in accordance with Theorem 5. For the real-world graphs, we see that increasing the range of threshold has the effect of gradually increasing the number of queries, but the number is less than 1.5&#8710;. In Figure <ref type="figure">5</ref>(b), we investigate the influence of the threshold value on query set size. Again, we considered random k-regular graphs with varying k. Every node was assigned the same threshold. We see that the number of queries used is a maximum when the threshold is around &#8710;/2, and it decreases as the threshold approaches either 0 or &#8710;.</p><p>Influence of network structure. The theoretical results developed in the previous sections provide bounds with respect to size of the graph and maximum degree. Here, our objectives are two-fold. Firstly, we compare our adaptive approaches to the non-adaptive bounds, particularly the number of queries required relative to &#8710;. Secondly, we investigate the effect of graph density and degree distribution on the performance of the heuristic. We note that graph density plays an important role in the performance of the algorithm. First, we consider synthetic networks. In Figure <ref type="figure">5</ref>(b), we see that for low values of k the number of queries required is very small, but it increases rapidly (for higher values of thresholds). When the graph is sparse, for every node, the number of nodes within distance two (i.e., k 2 + 1) is small. Therefore, for every query constructed by the heuristic, the uncertainty range of around n/k 2 nodes (the "v max " vertices) is halved. However, as k increases, this number decreases drastically. Hence we see that the number of queries used increases. However, as the graph density increases, the intersection of neighborhoods of any two nodes is large which has the effect of reducing the variation in the scores of nodes. Therefore, particularly when the range of threshold values is limited, the thresholds can be inferred with fewer queries than for sparser networks.</p><p>Progress towards inferring thresholds. In Figure <ref type="figure">6</ref>(b), we plot the accumulated threshold ranges for all vertices as the algorithm moves from one query to the next. We note that for most of the networks considered, within one-tenth of the total query size, the total accumulated threshold range decreases to 5% of its original value for all the studied networks. Thus, the uncertainty in the threshold range can be decreased significantly using a small number of adaptive queries.</p><p>In Appendix B, we have provided plots (Figure <ref type="figure">7</ref>) for the wall clock time taken by Method 2 to generate queries.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Directions for Future Work</head><p>In this work, we studied the problem of inferring the behavior of a networked dynamical system through active querying. We now discuss several avenues for future work. We first mention an open problem that arises directly from our results. Theorem 13 shows the existence of a small complete query set for SyDSs with symmetric functions. Developing an efficient algorithm that can construct such a query set is an interesting research question.</p><p>In our framework, the user has complete knowledge of the network, and queries and the responses from the system are assumed to specify the states of all the nodes in the system. In a typical real-world scenario, only partial knowledge of the system is available. Besides, the user might only require partial information from the system. For example, it may be sufficient for a user to know whether the threshold of every node is at most some chosen integer &#945;. Methods that can accomplish threshold inference with a small number of queries in such contexts are of practical interest. Another useful direction to explore is to develop inference methods when the system response consists of successors for k &#8805; 2 time steps (instead of a single time step). It is also of interest to explore the use of queries to infer other components of a dynamical system (e.g., the network topology).</p><p>where the inequality in the second of the three lines above follows from Lemma 27 (below). Now, Pr f v (b) is not queried by Q &#8804; Pr f v (b) is not queried by q zj , 1 &#8804; j &#8804; 22 &#8730; &#8710; + 2 log(n&#8710;) .</p><p>The quantity on the right hand side of the above inequality is at most 1 -</p><p>&lt; 1 (n&#8710;) 2 .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Note that</head><p>Pr Q is not a complete set = Pr &#8707;v, b such that f v (b) is not queried by Q .</p><p>By the union bound, the quantity on the right hand side of the above equation is at most We will first prove the following claims. . Let b = 2k + 1.</p><p>The inequality follows from Claim 28. Therefore, when d is odd,</p><p>.</p><p>Claim 30 For any positive x &#8804; 1 2 , 1x &#8805; e -2x . Proof: e 2x (1x) &gt; (1 + 2x)(1x) = 1 + x(1 -2x) &#8805; 1.    This completes the proof of Lemma 27.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix B. Implementation, computing environment and time</head><p>Implementations were done in Python 2.7 and 3.8. Compute nodes with 2 x Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz processors with 20 cores per CPU and 386GB memory were used. In Figure <ref type="figure">7</ref>, we have provided plots for (i) total time taken and (ii) average time per query as a function of graph size (nodes and edges) and threshold interval &#952; for ER graphs. Also, for reference, we have provided the number of queries generated for each case. We observe that both total time as well as average time per query are super-linear in the number of nodes even though the number of queries generated is close to being linear in the number of nodes. This is because, the time taken depends on the total number of queries as well as the number of operations required to generate a query. Also, as &#952; increases, the time taken decreases for a fixed network. This can be mainly attributed to the decrease in the number of queries (last row).</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>The code is available in https://github.com/NSSAC/active_queries_threshold_gds_published_ code.git.</p></note>
		</body>
		</text>
</TEI>
