We consider the problem of counting motifs in bipartite affiliation networks, such as author-paper, user-product, and actor-movie relations. We focus on counting the number of occurrences of a "butterfly", a complete 2x2 biclique, the simplest cohesive higher-order structure in a bipartite graph. Our main contribution is a suite of randomized algorithms that can quickly approximate the number of butterflies in a graph with a provable guarantee on accuracy. An experimental evaluation on large real-world networks shows that our algorithms return accurate estimates within a few seconds, even for networks with trillions of butterflies and hundreds of millions of edges.
more »
« less
This content will become publicly available on June 1, 2026
Fast Counting and Utilizing Induced 6-Cycles in Bipartite Networks
Bipartite graphs are a powerful tool for modeling the interactions between two distinct groups. These bipartite relationships often feature small, recurring structural patterns called motifs which are building blocks for community structure. One promising structure is the induced 6-cycle which consists of three nodes on each node set forming a cycle where each node has exactly two edges. In this paper, we study the problem of counting and utilizing induced 6-cycles in large bipartite networks. We first consider two adaptations inspired by previous works for cycle counting in bipartite networks. Then, we introduce a new approach for node triplets which offer a systematic way to count the induced 6-cycles, used in BATCHTRIPLETJOIN. Our experimental evaluation shows that BATCHTRIPLETJOIN is significantly faster than the other algorithms while being scalable to large graph sizes and number of cores. On a network with 112M edges, BATCHTRIPLETJOIN is able to finish the computation in 78 mins by using 52 threads. In addition, we provide a new way to identify anomalous node triplets by comparing and contrasting the butterfly and induced 6-cycle counts of the nodes. We showcase several case studies on real-world networks from Amazon Kindle ratings, Steam game reviews, and Yelp ratings.
more »
« less
- Award ID(s):
- 2107089
- PAR ID:
- 10634343
- Publisher / Repository:
- IEEE
- Date Published:
- Journal Name:
- IEEE Transactions on Knowledge and Data Engineering
- Volume:
- 37
- Issue:
- 6
- ISSN:
- 1041-4347
- Page Range / eLocation ID:
- 3386 to 3398
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
We consider the ad-hoc networks consisting of n wireless nodes that are located on the plane. Any two given nodes are called neighbors if they are located within a certain distance (communication range) from one another. A given node can be directly connected to any one of its neighbors, and picks its connections according to a unique topology control algorithm that is available at every node. Given that each node knows only the indices (unique identification numbers) of its one and two-hop neighbors, we identify an algorithm that preserves connectivity and can operate without the need of any synchronization among nodes. Moreover, the algorithm results in a sparse graph with at most 5n edges and a maximum node degree of 10. Existing algorithms with the same promises further require neighbor distance and/or direction information at each node. We also evaluate the performance of our algorithm for random networks. In this case, our algorithm provides an asymptotically connected network with n(1+o(1)) edges with a degree less than or equal to 6 for 1-o(1) fraction of the nodes. We also introduce another asynchronous connectivity-preserving algorithm that can provide an upper bound as well as a lower bound on node degrees.more » « less
-
This work considers necessary conditions for privacy guarantees under active fingerprinting attacks in power-law bipartite networks. The scenario arises naturally in social network analysis, tracking user mobility in wireless networks, and forensics applications, among others. A stochastic growing network generation model — called the popularity-based model — is investigated, where the bipartite network is generated iteratively, and in each iteration vertices attract new edges based on their assigned popularity values. It is shown that using the appropriate choice of initial popularity values, the node degree distribution follows a power-law distribution with arbitrary parameter α > 2, i.e. fraction of nodes with degree d is proportional to d −α . An active fingerprinting deanonymization attack strategy called the augmented information threshold attack strategy (A-ITS) is proposed which uses the attacker’s knowledge of the node degree distribution along with the concept of information values for deanonymization. Sufficient conditions for the success of the A-ITS, based on network parameters, are derived. It is shown through simulations that the proposed attack significantly outperforms the state-of-the-art attack strategies.more » « less
-
Abstract In recommender systems, users rate items, and are subsequently served other product recommendations based on these ratings. Even though users usually rate a tiny percentage of the available items, the system tries to estimate unobserved preferences by finding similarities across users and across items. In this work, we treat the observed ratings data as partially observed, dense, weighted, bipartite networks. For a class of systems without outside information, we adapt an approach developed for dense, weighted networks to account for unobserved edges and the bipartite nature of the problem. The approach begins with clustering both users and items into communities, and locally estimates the patterns of ratings within each subnetwork induced by restricting attention to one community of users and one community of items community. The local fitting procedure relies on estimating local sociability parameters for every user and item, and selecting the function that determines the degree correction contours which best models the underlying data. We compare the performance of our proposed approach to existing methods on a simulated data set, as well as on a data set of joke ratings, examining model performance in both cases at differing levels of sparsity. On the joke ratings data set, our proposed model performs better than existing alternatives in relatively sparse settings, though other approaches achieve better results when more data is available. Collectively, the results indicate that despite struggling to pick up subtler signals, the proposed approach’s recovery of large scale, coarse patterns may still be useful in practical settings where high sparsity is typical.more » « less
-
Heterformer: Transformer-based Deep Node Representation Learning on Heterogeneous Text-Rich NetworksProc. 2023 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (Ed.)Representation learning on networks aims to derive a meaningful vector representation for each node, thereby facilitating downstream tasks such as link prediction, node classification, and node clustering. In heterogeneous text-rich networks, this task is more challenging due to (1) presence or absence of text: Some nodes are associated with rich textual information, while others are not; (2) diversity of types: Nodes and edges of multiple types form a heterogeneous network structure. As pretrained language models (PLMs) have demonstrated their effectiveness in obtaining widely generalizable text representations, a substantial amount of effort has been made to incorporate PLMs into representation learning on text-rich networks. However, few of them can jointly consider heterogeneous structure (network) information as well as rich textual semantic information of each node effectively. In this paper, we propose Heterformer, a Heterogeneous Network-Empowered Transformer that performs contextualized text encoding and heterogeneous structure encoding in a unified model. Specifically, we inject heterogeneous structure information into each Transformer layer when encoding node texts. Meanwhile, Heterformer is capable of characterizing node/edge type heterogeneity and encoding nodes with or without texts. We conduct comprehensive experiments on three tasks (i.e., link prediction, node classification, and node clustering) on three large-scale datasets from different domains, where Heterformer outperforms competitive baselines significantly and consistently.more » « less
An official website of the United States government
