

Title: Making social networks more human: A topological approach
Abstract

A key problem in social network analysis is to identify nonhuman interactions. State-of-the-art bot-detection systems like Botometer train machine-learning models on user-specific data. Unfortunately, these methods do not work on data sets in which only topological information is available. In this paper, we propose a new, purely topological approach. Our method removes, from publicly available electronic-social-network datasets (for example, those in the Stanford Network Analysis Project (SNAP) repository), edges that connect nodes exhibiting strong evidence of nonhuman activity. Our methodology is inspired by classic work in evolutionary psychology by Dunbar that posits upper bounds on the total strength of the set of social connections in which a single human can be engaged. We model edge strength with Easley and Kleinberg's topological estimate; label nodes as “violators” if the sum of their incident edge strengths exceeds a Dunbar-inspired bound; and then remove the violator-to-violator edges. We run our algorithm on multiple social networks and show that our Dunbar-inspired bound appears to hold for social networks, but not for nonsocial networks. Our cleaning process classifies 0.04% of the nodes of the Twitter-2010 follower graph as violators, and we find that more than 80% of these violator nodes have Botometer scores of 0.5 or greater. Furthermore, after we remove the roughly 15 million violator-violator edges from the 1.2-billion-edge Twitter-2010 follower graph, 34% of the violator nodes experience a factor-of-two decrease in PageRank. PageRank is a key component of many graph algorithms, such as node/edge ranking and graph sparsification, so this artificial inflation would bias algorithmic output and lead to incorrect decisions based on that output.
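To make the cleaning step concrete, the following Python/NetworkX sketch flags violators and strips violator-to-violator edges, assuming Easley and Kleinberg's neighborhood-overlap measure as the topological estimate of edge strength and a hypothetical per-node budget; the exact strength estimate and threshold used in the paper may differ.

    # Sketch of the Dunbar-style cleaning step: estimate each edge's strength with a
    # neighborhood-overlap measure, flag nodes whose total tie strength exceeds a
    # Dunbar-inspired budget, and drop violator-to-violator edges.
    # DUNBAR_BUDGET is a hypothetical parameter, not the value used in the paper.
    import networkx as nx

    DUNBAR_BUDGET = 150.0  # hypothetical per-node bound on total tie strength

    def neighborhood_overlap(G, u, v):
        """Shared neighbors divided by neighbors of either endpoint, excluding u and v."""
        nu, nv = set(G[u]) - {u, v}, set(G[v]) - {u, v}
        union = nu | nv
        return len(nu & nv) / len(union) if union else 0.0

    def clean(G, budget=DUNBAR_BUDGET):
        strength = {n: 0.0 for n in G}
        for u, v in G.edges():
            w = neighborhood_overlap(G, u, v)
            strength[u] += w
            strength[v] += w
        violators = {n for n, s in strength.items() if s > budget}
        H = G.copy()
        H.remove_edges_from([(u, v) for u, v in G.edges()
                             if u in violators and v in violators])
        return H, violators

    if __name__ == "__main__":
        G = nx.barabasi_albert_graph(1000, 5, seed=1)   # stand-in for a follower graph
        H, violators = clean(G)
        print(len(violators), "violators;",
              G.number_of_edges() - H.number_of_edges(), "edges removed")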

 
NSF-PAR ID: 10459555
Publisher / Repository: Wiley Blackwell (John Wiley & Sons)
Journal Name: Statistical Analysis and Data Mining: The ASA Data Science Journal
Volume: 12
Issue: 6
ISSN: 1932-1864
Page Range / eLocation ID: p. 449-464
Sponsoring Org: National Science Foundation
More Like this
  1.
    The Twitter-Based Knowledge Graph for Researchers project is an effort to construct a knowledge graph of computation-based tasks and corresponding outputs. It will be utilized by subject matter experts, statisticians, and developers. A knowledge graph is a directed graph of knowledge accumulated from a variety of sources. For our application, Subject Matter Experts (SMEs) are experts in their respective non-computer-science fields, but are not necessarily experienced with running heavy computation on datasets. As a result, they find it difficult to generate workflows for their projects involving Twitter data and advanced analysis. Workflow management systems and libraries that facilitate computation are only practical when the users of these systems understand what analysis they need to perform. Our goal is to bridge this gap in understanding. Our queryable knowledge graph will generate a visual workflow for these experts and researchers to achieve their project goals. After meeting with our client, we established two primary deliverables. First, we needed to create an ontology of all Twitter-related questions that an SME might want to answer. Second, we needed to build a knowledge graph based on this ontology and produce a set of APIs that trigger network algorithms based on the information queried from the graph. An ontology is simply the class structure/schema for the graph. In subsequent meetings, we established some more specific additional requirements. Most importantly, the client stressed that users should be able to bring their own data and add it to our knowledge graph. As more research is completed and new technologies are released, it will be important to be able to edit and add to the knowledge graph. Next, we must be able to provide metrics about the data itself. These metrics will be useful both for our own work and for future research surrounding graph search problems and search optimization. Additionally, our system should provide users with information regarding the original domain that the algorithms and workflows were run against, so that they can choose the best workflow for their data. The project team first conducted a literature review, reading reports from the CS5604 Information Retrieval courses in 2016 and 2017 to extract information related to Twitter data and algorithms. This information was used to construct our raw ontology in Google Sheets, which contained a set of dataset-algorithm-dataset tuples. The raw ontology was then converted into nodes and edges CSV files for building the knowledge graph. After implementing our original solution on a CentOS virtual machine hosted by the Virginia Tech Department of Computer Science, we transitioned our solution to Grakn, an open-source knowledge graph database that supports hypergraph functionality. When finalizing our workflow paths, we noted that some nodes depended on the completion of two or more inputs, representing an “AND” edge. This phenomenon is modeled as a hyperedge in Grakn, motivating our transition from Neo4J to Grakn. Currently, our system supports queries through the console, where a user can type a Graql statement to retrieve information about data in the graph, from relationships to entities to derived rules. The user can also interact with the data via Grakn's data visualizer, Workbase, where Graql queries can be entered to visualize connections within the knowledge graph.
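    A rough illustration of the nodes-and-edges CSV representation described above, using Python and NetworkX rather than Grakn; the file layout, column names, and the example dataset/algorithm names are hypothetical:

        # Build a toy dataset/algorithm knowledge graph from nodes and edges CSV files
        # and recover a workflow (a directed path) from a raw dataset to a desired output.
        # File layout and node names are hypothetical, not the project's actual schema.
        import csv
        import networkx as nx

        G = nx.DiGraph()
        with open("nodes.csv") as f:          # columns: id,type (dataset or algorithm)
            for row in csv.DictReader(f):
                G.add_node(row["id"], type=row["type"])
        with open("edges.csv") as f:          # columns: source,target (dataset -> algorithm -> dataset)
            for row in csv.DictReader(f):
                G.add_edge(row["source"], row["target"])

        # A workflow for an SME is a directed path from an input dataset to a target output.
        workflow = nx.shortest_path(G, source="raw_tweets", target="community_labels")
        print(" -> ".join(workflow))

    Plain directed edges like these cannot capture the “AND” dependencies mentioned above, where a node requires two or more completed inputs; representing such dependencies as hyperedges is what motivated the move to Grakn.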
  2. The 2020 United States (US) presidential election was, and has continued to be, the focus of pervasive and persistent mis- and disinformation spreading through our media ecosystems, including social media. This event has driven the collection and analysis of large, directed social network datasets, but such datasets can resist intuitive understanding. In such large datasets, the overwhelming number of nodes and edges present in typical representations creates visual artifacts, such as densely overlapping edges and tightly packed formations of low-degree nodes, which obscure many features of more practical interest. We apply a method, coengagement transformations, to convert such networks of social data into tractable images. Intuitively, this approach allows for parameterized network visualizations that make shared audiences of engaged viewers visually salient. Using the interpretative capabilities of this method, we perform an extensive case study of the 2020 US presidential election on Twitter, contributing an empirical analysis of coengagement. By creating and contrasting different networks at different parameter sets, we define and characterize several structures in this discourse network, including bridging accounts, satellite audiences, and followback communities. We discuss the importance and implications of these empirical network features in this context. In addition, we release open-source code for creating coengagement networks from Twitter and other structured interaction data.
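    A minimal sketch of one plausible coengagement transformation, assuming the interaction data is a list of (user, account) engagement records; the thresholds and exact parameterization below are illustrative, not necessarily those used in the paper or its released code:

        # Toy coengagement transformation: connect two accounts when enough distinct
        # users engaged with both of them at least a minimum number of times.
        # MIN_ENGAGEMENTS and MIN_SHARED_USERS are illustrative parameters.
        from collections import Counter, defaultdict
        from itertools import combinations

        MIN_ENGAGEMENTS = 2    # a user must engage with an account at least this many times
        MIN_SHARED_USERS = 3   # accounts are linked if this many users engage with both

        def coengagement_edges(records):
            """records: iterable of (user, account) engagement events."""
            counts = Counter(records)          # (user, account) -> number of engagements
            audiences = defaultdict(set)       # account -> set of sufficiently engaged users
            for (user, account), n in counts.items():
                if n >= MIN_ENGAGEMENTS:
                    audiences[account].add(user)
            edges = []
            for a, b in combinations(sorted(audiences), 2):
                shared = len(audiences[a] & audiences[b])
                if shared >= MIN_SHARED_USERS:
                    edges.append((a, b, shared))   # weight = size of shared audience
            return edges

        records = [("u1", "A"), ("u1", "A"), ("u1", "B"), ("u1", "B"),
                   ("u2", "A"), ("u2", "A"), ("u2", "B"), ("u2", "B"),
                   ("u3", "A"), ("u3", "A"), ("u3", "B"), ("u3", "B")]
        print(coengagement_edges(records))         # [('A', 'B', 3)]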
  3. Many network/graph structures are continuously monitored by various sensors that are placed at a subset of nodes and edges. The multidimensional data collected from these sensors over time create large-scale graph data in which the data points are highly dependent. Monitoring large-scale attributed networks with thousands of nodes and heterogeneous sensor data to detect anomalies and unusual events is a complex and computationally expensive process. This paper introduces a new generic approach inspired by state-space models for network anomaly detection that can utilize the information from the network topology, the node attributes (sensor data), and the anomaly propagation sets in an integrated manner to analyze the entire network all at once. This article presents how heterogeneous network sensor data can be analyzed to locate the sources of anomalies as well as the anomalous regions in a network, which can be impacted by one or multiple anomalies at any time instance. Experimental results demonstrate the superior performance of our proposed framework in detecting anomalies in attributed graphs. Summary of Contribution: With the increasing availability of large-scale network sensors and rapid advances in artificial intelligence methods, fundamentally new analytical tools are needed that can integrate data collected from sensors across the networks for decision making while taking into account the stochastic and topological dependencies between nodes, sensors, and anomalies. This paper develops a framework to intelligently and efficiently analyze complex and highly dependent data collected from disparate sensors across large-scale network/graph structures to detect anomalies and abnormal behavior in real time. Unlike general purpose (often black-box) machine learning models, this paper proposes a unique framework for network/graph structures that incorporates the complexities of networks and interdependencies between network entities and sensors. Because of the multidisciplinary nature of the paper that involves optimization, machine learning, and system monitoring and control, it can help researchers in both operations research and computer science domains to develop new network-specific computing tools and machine learning frameworks to efficiently manage large-scale network data. 
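    The toy sketch below is not the state-space framework proposed in the paper; it only illustrates the general idea of combining node-level sensor readings with network topology, by scoring each node against its own history and then smoothing scores over graph neighborhoods so that anomalous regions, rather than isolated noisy nodes, stand out (all thresholds are hypothetical):

        # Toy attributed-network anomaly localization: z-score each node's latest sensor
        # reading against its own history, then average scores over each node's
        # neighborhood so that connected anomalous regions stand out.
        # Illustration only, not the state-space approach of the paper.
        import numpy as np
        import networkx as nx

        def anomalous_nodes(G, history, latest, threshold=3.0):
            """history: node -> array of past readings; latest: node -> current reading."""
            raw = {}
            for n in G:
                mu, sigma = history[n].mean(), history[n].std() + 1e-9
                raw[n] = abs(latest[n] - mu) / sigma
            smoothed = {}
            for n in G:
                neigh = list(G[n]) + [n]
                smoothed[n] = np.mean([raw[m] for m in neigh])   # neighborhood-averaged score
            return {n for n, s in smoothed.items() if s > threshold}

        if __name__ == "__main__":
            G = nx.grid_2d_graph(10, 10)                 # stand-in for a monitored network
            rng = np.random.default_rng(0)
            history = {n: rng.normal(0, 1, 200) for n in G}
            latest = {n: rng.normal(0, 1) for n in G}
            for n in [(4, 4), (4, 5), (5, 4)]:           # inject a small anomalous region
                latest[n] += 8.0
            print(sorted(anomalous_nodes(G, history, latest)))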
  4. A graph spanner is a fundamental graph structure that faithfully preserves the pairwise distances in the input graph up to a small multiplicative stretch. The common objective in the computation of spanners is to achieve the best-known existential size-stretch trade-off efficiently. Classical models and algorithmic analysis of graph spanners essentially assume that the algorithm can read the input graph, construct the desired spanner, and write the answer to the output tape. However, when considering massive graphs containing millions or even billions of nodes, not only the input graph but also the output spanner might be too large for a single processor to store. To tackle this challenge, we initiate the study of local computation algorithms (LCAs) for graph spanners in general graphs, where the algorithm should locally decide whether a given edge (u,v) ∈ E belongs to the output spanner. Such LCAs give the user the 'illusion' that a specific sparse spanner for the graph is maintained, without ever fully computing it. We present the following results:
    - For general n-vertex graphs and r ∈ {2,3}, there exists an LCA for (2r-1)-spanners with Õ(n^{1+1/r}) edges and sublinear probe complexity of Õ(n^{1-1/(2r)}). These size/stretch tradeoffs are best possible (up to polylogarithmic factors).
    - For every k ≥ 1 and n-vertex graph with maximum degree Δ, there exists an LCA for O(k^2)-spanners with Õ(n^{1+1/k}) edges, probe complexity of Õ(Δ^4 n^{2/3}), and a random seed of size polylog(n). This improves upon, and extends, the work of [Lenzen-Levi, 2018].
    We also complement our results by providing a polynomial lower bound on the probe complexity of LCAs for graph spanners that holds even for the simpler task of computing a sparse connected subgraph with o(m) edges.
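    For contrast with the local model studied above, the sketch below gives the classical, global greedy construction of a (2k-1)-spanner of an unweighted graph in Python/NetworkX; this is the textbook baseline that must read the entire input, not the paper's LCA:

        # Classic greedy (2k-1)-spanner for an unweighted graph: keep an edge only if the
        # spanner built so far cannot already connect its endpoints within 2k-1 hops.
        # The result has stretch at most 2k-1 and O(n^{1+1/k}) edges; it requires reading
        # the whole graph, which is exactly what local computation algorithms avoid.
        import networkx as nx

        def greedy_spanner(G, k):
            H = nx.Graph()
            H.add_nodes_from(G)
            for u, v in G.edges():
                try:
                    if nx.shortest_path_length(H, u, v) <= 2 * k - 1:
                        continue                     # endpoints already well-approximated
                except nx.NetworkXNoPath:
                    pass
                H.add_edge(u, v)
            return H

        if __name__ == "__main__":
            G = nx.gnp_random_graph(300, 0.1, seed=7)
            H = greedy_spanner(G, k=2)               # a 3-spanner
            print(G.number_of_edges(), "->", H.number_of_edges(), "edges")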
  5.
    Mammalian genomes are folded into a hierarchy of compartments, topologically associating domains (TADs), subTADs, and long-range looping interactions. The higher-order folding patterns of chromatin contacts within TADs, and how they localize to disease-associated single nucleotide variants (daSNVs), remain an open area of investigation. Here, we analyze high-resolution Hi-C data with graph theory to understand possible mesoscale network architecture within chromatin domains. We identify a subset of TADs exhibiting strong core-periphery mesoscale structure in embryonic stem cells, neural progenitor cells, and cortical neurons. Hyper-connected core nodes co-localize with genomic segments engaged in multiple looping interactions and enriched for occupancy of the architectural protein CCCTC-binding factor (CTCF). CTCF knockdown and in silico deletion of CTCF-bound core nodes disrupt core-periphery structure, whereas in silico mutation of cell type-specific enhancer or gene nodes has a negligible effect. Importantly, neuropsychiatric daSNVs are significantly more likely to localize within TADs folded into core-periphery networks than within domains devoid of such structure. Together, our results reveal that a subset of TADs encompasses looping interactions connected into a core-periphery mesoscale network. We hypothesize that daSNVs in the periphery of genome folding networks might preserve global nuclear architecture but cause local topological and functional disruptions contributing to human disease. By contrast, daSNVs co-localized with hyper-connected core nodes might cause severe topological and functional disruptions. Overall, these findings shed new light on the mesoscale network structure of fine-scale genome folding within chromatin domains and its link to common genetic variants in human disease.
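    As a rough computational illustration (not the graph-theoretic pipeline used in the study), one can build a contact network from significant within-TAD Hi-C interactions and use a k-core decomposition as a crude proxy for core versus periphery bins, then ask whether candidate core bins overlap annotated CTCF sites; the input format, score threshold, and CTCF annotations below are hypothetical:

        # Toy core/periphery read-out for one TAD: nodes are genomic bins, edges are
        # significant Hi-C contacts, and the innermost k-core serves as a crude stand-in
        # for the hyper-connected core (the study's actual core-periphery fit differs).
        import networkx as nx

        def core_bins(contacts, min_score=1.0):
            """contacts: iterable of (bin_i, bin_j, score) for a single TAD."""
            G = nx.Graph()
            G.add_edges_from((i, j) for i, j, s in contacts if s >= min_score and i != j)
            if G.number_of_edges() == 0:
                return set(), G
            core = nx.core_number(G)             # k-core index of every bin
            k_max = max(core.values())
            return {n for n, k in core.items() if k == k_max}, G

        contacts = [(1, 2, 2.0), (1, 3, 1.5), (2, 3, 1.8), (1, 4, 1.2),
                    (2, 4, 1.1), (3, 4, 1.4), (4, 5, 0.6), (5, 6, 1.3)]
        core, G = core_bins(contacts)
        ctcf_bound = {1, 2, 3}                   # hypothetical CTCF-occupied bins
        print("core bins:", core, "| CTCF overlap:", core & ctcf_bound)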

     