skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Found graph data and planted vertex covers
A typical way in which network data is recorded is to measure all interactions involving a specified set of core nodes, which produces a graph containing this core together with a potentially larger set of fringe nodes that link to the core. Interactions between nodes in the fringe, however, are not present in the resulting graph data. For example, a phone service provider may only record calls in which at least one of the participants is a customer; this can include calls between a customer and a non-customer, but not between pairs of non-customers. Knowledge of which nodes belong to the core is crucial for interpreting the dataset, but this metadata is unavailable in many cases, either because it has been lost due to difficulties in data provenance, or because the network consists of “found data” obtained in settings such as counter-surveillance. This leads to an algorithmic problem of recovering the core set. Since the core is a vertex cover, we essentially have a planted vertex cover problem, but with an arbitrary underlying graph. We develop a framework for analyzing this planted vertex cover problem, based on the theory of fixed-parameter tractability, together with algorithms for recovering the core. Our algorithms are fast, simple to implement, and out-perform several baselines based on core-periphery structure on various real-world datasets.  more » « less
Award ID(s):
1830274
PAR ID:
10173816
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Advances in neural information processing systems
ISSN:
1049-5258
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. ABSTRACT. A typical way in which network data is recorded is to measure all the interactions among a specified set of core nodes; this produces a graph containing this core together with a potentially larger set of fringe nodes that have links to the core. Interactions between pairs of nodes in the fringe, however, are not recorded by this process, and hence not present in the resulting graph data. For example, a phone service provider may only have records of calls in which at least one of the participants is a customer; this can include calls between a customer and a non-customer, but not between pairs of non-customers. Knowledge of which nodes belong to the core is an important piece of metadata that is crucial for interpreting the network dataset. But in many cases, this metadata is not available, either because it has been lost due to difficulties in data provenance, or because the network consists of “found data” obtained in settings such as counter-surveillance. This leads to a natural algorithmic problem, namely the recovery of the core set. Since the core set forms a vertex cover of the graph, we essentially have a planted vertex cover problem, but with an arbitrary underlying graph. We develop a theoretical framework for analyzing this planted vertex cover problem, based on results in the theory of fixed- parameter tractability, together with algorithms for recovering the core. Our algorithms are fast, simple to implement, and out-perform several methods based on network core-periphery structure on various real-world datasets. 
    more » « less
  2. Data collection often involves the partial measurement of a larger system. A common example arises in collecting network data: we often obtain network datasets by recording all of the interactions among a small set of core nodes, so that we end up with a measurement of the network consisting of these core nodes along with a potentially much larger set of fringe nodes that have links to the core. Given the ubiquity of this process for assembling network data, it is crucial to understand the role of such a “core-fringe” structure. Here we study how the inclusion of fringe nodes affects the standard task of network link prediction. One might initially think the inclusion of any additional data is useful, and hence that it should be beneficial to include all fringe nodes that are available. However, we find that this is not true; in fact, there is substantial variability in the value of the fringe nodes for prediction. Once an algorithm is selected, in some datasets, including any additional data from the fringe can actually hurt prediction performance; in other datasets, including some amount of fringe information is useful before prediction performance saturates or even declines; and in further cases, including the entire fringe leads to the best performance. While such variety might seem surprising, we show that these behaviors are exhibited by simple random graph models. 
    more » « less
  3. Modern graph or network datasets often contain rich structure that goes beyond simple pairwise connections between nodes. This calls for complex representations that can capture, for instance, edges of different types as well as so-called “higher-order interactions” that involve more than two nodes at a time. However, we have fewer rigorous methods that can provide insight from such representations. Here, we develop a computational framework for the problem of clustering hypergraphs with categorical edge labels — or different interaction types — where clusters corresponds to groups of nodes that frequently participate in the same type of interaction. Our methodology is based on a combinatorial objective function that is related to correlation clustering on graphs but enables the design of much more efficient algorithms that also seamlessly generalize to hypergraphs. When there are only two label types, our objective can be optimized in polynomial time, using an algorithm based on minimum cuts. Minimizing our objective becomes NP-hard with more than two label types, but we develop fast approximation algorithms based on linear programming relaxations that have theoretical cluster quality guarantees. We demonstrate the efficacy of our algorithms and the scope of the model through problems in edge-label community detection, clustering with temporal data, and exploratory data analysis. 
    more » « less
  4. We give an algorithm to find a minimum cut in an edge-weighted directed graph with n vertices and m edges in O ̃(n · max{m^{2/3}, n}) time. This improves on the 30 year old bound of O ̃(nm) obtained by Hao and Orlin for this problem. Using similar techniques, we also obtain O ̃ (n^2 /ε^2 )-time (1+ε)-approximation algorithms for both the minimum edge and minimum vertex cuts in directed graphs, for any fixed ε. Before our work, no (1+ε)-approximation algorithm better than the exact runtime of O ̃(nm) is known for either problem. Our algorithms follow a two-step template. In the first step, we employ a partial sparsification of the input graph to preserve a critical subset of cut values approximately. In the second step, we design algorithms to find the (edge/vertex) mincut among the preserved cuts from the first step. For edge mincut, we give a new reduction to O ̃ (min{n/m^{1/3} , √n}) calls of any maxflow subroutine, via packing arborescences in the sparsifier. For vertex mincut, we develop new local flow algorithms to identify small unbalanced cuts in the sparsified graph. 
    more » « less
  5. Meka, Raghu (Ed.)
    We study the stochastic minimum vertex cover problem for general graphs. In this problem, we are given a graph G=(V, E) and an existence probability p_e for each edge e ∈ E. Edges of G are realized (or exist) independently with these probabilities, forming the realized subgraph H. The existence of an edge in H can only be verified using edge queries. The goal of this problem is to find a near-optimal vertex cover of H using a small number of queries. Previous work by Derakhshan, Durvasula, and Haghtalab [STOC 2023] established the existence of 1.5 + ε approximation algorithms for this problem with O(n/ε) queries. They also show that, under mild correlation among edge realizations, beating this approximation ratio requires querying a subgraph of size Ω(n ⋅ RS(n)). Here, RS(n) refers to Ruzsa-Szemerédi Graphs and represents the largest number of induced edge-disjoint matchings of size Θ(n) in an n-vertex graph. In this work, we design a simple algorithm for finding a (1 + ε) approximate vertex cover by querying a subgraph of size O(n ⋅ RS(n)) for any absolute constant ε > 0. Our algorithm can tolerate up to O(n ⋅ RS(n)) correlated edges, hence effectively completing our understanding of the problem under mild correlation. 
    more » « less