<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>MCPNet: a parallel maximum capacity-based genome-scale gene network construction framework</title></titleStmt>
			<publicationStmt>
				<publisher>Oxford Academic</publisher>
				<date>06/01/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10506368</idno>
					<idno type="doi">10.1093/bioinformatics/btad373</idno>
					<title level='j'>Bioinformatics</title>
<idno>1367-4811</idno>
<biblScope unit="volume">39</biblScope>
<biblScope unit="issue">6</biblScope>					

					<author>Tony C Pan</author><author>Sriram P Chockalingam</author><author>Maneesha Aluru</author><author>Srinivas Aluru</author><author>Lenore Cowen</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<title>Abstract</title> <sec><title>Motivation</title><p>Gene network reconstruction from gene expression profiles is a compute- and data-intensive problem. Numerous methods based on diverse approaches including mutual information, random forests, Bayesian networks, correlation measures, as well as their transforms and filters such as data processing inequality, have been proposed. However, an effective gene network reconstruction method that performs well in all three aspects of computational efficiency, data size scalability, and output quality remains elusive. Simple techniques such as Pearson correlation are fast to compute but ignore indirect interactions, while more robust methods such as Bayesian networks are prohibitively time consuming to apply to tens of thousands of genes.</p></sec> <sec><title>Results</title><p>We developed maximum capacity path (MCP) score, a novel maximum-capacity-path-based metric to quantify the relative strengths of direct and indirect gene–gene interactions. We further present MCPNet, an efficient, parallelized gene network reconstruction software based on MCP score, to reverse engineer networks in unsupervised and ensemble manners. Using synthetic and real Saccharomyces cervisiae datasets as well as real Arabidopsis thaliana datasets, we demonstrate that MCPNet produces better quality networks as measured by AUPRC, is significantly faster than all other gene network reconstruction software, and also scales well to tens of thousands of genes and hundreds of CPU cores. Thus, MCPNet represents a new gene network reconstruction tool that simultaneously achieves quality, performance, and scalability requirements.</p></sec> <sec><title>Availability and implementation</title><p>Source code freely available for download at https://doi.org/10.5281/zenodo.6499747 and https://github.com/AluruLab/MCPNet, implemented in C++ and supported on Linux.</p></sec>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Gene network inference from high-throughput gene expression data is a compute intensive task. Although several different inference algorithms based on Bayesian networks <ref type="bibr">(Hartemink 2005)</ref>, mutual information (MI) theory <ref type="bibr">(Faith et al. 2007</ref><ref type="bibr">, Aluru et al. 2013)</ref> Pearson correlation <ref type="bibr">(Langfelder and Horvath 2008)</ref>, regression <ref type="bibr">(Bonneau et al. 2006</ref><ref type="bibr">, Huynh-Thu et al. 2010)</ref>, and random forest <ref type="bibr">(Aibar et al. 2017</ref>) have been developed over the past two decades, scalability remains a critical challenge when working with tens of thousands of genes and/or observations. <ref type="bibr">Aluru et al. (2022)</ref> found that 9 of the 15 methods analysed failed to infer networks from a dataset with $18 000 genes. Furthermore, of the six that succeeded, four required between 4 and 49 days to complete. To construct large networks with tens of thousands of genes in reasonable time, scalable algorithms and efficient parallel implementations are necessary. Arboreto <ref type="bibr">(Moerman et al. 2019)</ref> recently introduced distributed computing capability to construct random forest-based transcription factor (TF)-target gene regulatory networks (GRNs); however, this is still not scalable for large networks. TINGe <ref type="bibr">(Zola et al. 2010)</ref>, using a parallel MI-based approach can construct large genome-scale gene networks in a significantly shorter amount of time.</p><p>Past surveys conducted with established benchmark data suggest that Bayesian and MI-based methods are among the best performing <ref type="bibr">(Marbach et al. 2012</ref><ref type="bibr">, Lachmann et al. 2016</ref><ref type="bibr">, Aluru et al. 2022)</ref> with respect to the quality and accuracy of inferred interactions. However, unlike Bayesian networks, MI-based methods are more amenable to large scale network reconstruction <ref type="bibr">(Chockalingam et al. 2017)</ref>. One key caveat though is that these methods require post-processing such as Stouffer Transform in CLR <ref type="bibr">(Faith et al. 2007)</ref> or MI value filtering based on P-values and data processing inequality (DPI) measures in ARACNe-adaptive partitioning (AP) <ref type="bibr">(Lachmann et al. 2016)</ref> and TINGe. Such filtering complicates network evaluation by standard metrics such as the area under the precision-recall curve (AUPRC) as it introduces discontinuities in the value range. Additionally, DPI requires a usersupplied tolerance parameter that is challenging to determine a priori. As DPI reflects the information transmission capacity, it is a special case of the maximum capacity path (MCP) problem in graph theory, which has previously found applications in network routing <ref type="bibr">(Pollack 1960)</ref>, image compositing <ref type="bibr">(Fernandez et al. 1998)</ref>, and metabolic pathway analysis <ref type="bibr">(Ullah et al. 2009)</ref>. Formulated as such, the tolerance parameter is no longer required.</p><p>In this paper, we present a novel gene network reconstruction approach based on MCP to characterize and compare indirect gene interactions to identify significant gene-gene interactions or TF-target relationships without thresholding or user-specified parameters. Our unsupervised approach examines all indirect paths between two genes to compute the maximum indirect interaction capacity and allows efficient investigation of paths of arbitrary lengths. Using the same framework, we also established an ensemble method that combines interactions from multiple path lengths using optimized weights identified with partial groundtruth. We further created an efficient parallel implementation of the algorithm, called MCPNet, that can scale to hundreds of cores. We evaluated performance of our unsupervised and ensemble approaches using MI as the direct interaction measure and paths with length up to four for both gene-gene and TF-target interactions and demonstrate that our method delivers higher network quality and superior computational performance when compared to other state-of-the-art gene network reconstruction software.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Materials and methods</head><p>A gene expression dataset with samples S is formulated as a matrix P with jVj rows and jSj columns, each row corresponding to expression levels for a gene, and each column corresponding to the gene expression profile of a sample in S.</p><p>Here, we reconstruct and examine two different types of gene networks-gene co-expression networks (GCN) and GRN. GCN is defined as an undirected weighted graph G &#188;hV; E; Wi, where V &#188;fv 1 ...v N g is the set of N genes, E &#188;f&#240;v i ; v j &#222;jv i ; v j 2 Vg is the set of edges representing interacting genes, and W is a matrix of size N &#194; N, where the entry W ij is an edge weight that quantifies the interaction strength of v i with v j . GRN is directed in contrast to a GCN, and can be defined as G &#188;hU; V; E; Wi, where U &#188;fu 1 ...u M g represent the regulators, V is the set of N target genes. The edge set is E &#188;f&#240;u i ; v j &#222;ju i 2 U; v j 2 Vg, and W ij contains the weights corresponding to the edges in E. Since a GRN is directed, W is rectangular.</p><p>Different simplifying mathematical models have been adopted to compute the edge weights. Correlation measures such as Pearson computes W ij from the expression profiles of genes v i and v j . Correlation measures are commutative, W ij &#188; W ji , thus the square weight matrix of GCN is symmetric but may contain negative elements. In the Arboreto model <ref type="bibr">(Huynh-Thu et al. 2010)</ref>, W ij indicates the degree to which the expression of gene v i can predict v j 's gene expression, thus GCN W is non-negative but may not be symmetric. An alternative to correlation is MI, where the expression profiles for genes v i ; v j 2 V, rows i and j in matrix P are modeled as random variables X i and X j , from which MI is estimated. Since MI is non-negative and commutative, GCN W is symmetric and positive semi-definite. As these measures compute pairwise interaction strengths, the weight matrix W of a GRN can be computed element-wise in the same way as W for a GCN.</p><p>Our proposed method transforms the input edge weight matrix W to produce a new MCP score that accounts for indirect interaction strengths and can leverage different interaction strength measures while preserving the directionality of the measures. In this paper, MI is used to demonstrate the performance of our proposed method, though our method is agnostic of the semantics of the W matrix or its symmetry and thus is suited for both undirected GCN and directed GRN reconstruction. For W with negative values, our proposed method can use the magnitude, i.e. absolute value, of the interaction strength instead.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">MCP score</head><p>We define an indirect interaction between two genes as one that is mediated through one or more intermediary genes and can be modeled as a length-L path &#240;v s ; ...; v ir ; ...; v t &#222; in G, where v s and v t are source and target genes, v ir is an intermediary gene in V indexed by i r 2f1 ...jVjg, and r 2 f1 ...L &#192; 1g indicates position along the path. Indirect biological interactions are well-known in eukaryotic organisms-e.g. the gene networks of biochemical, cellular, and signal transduction pathways where a change in the structure or function of a gene/protein can have indirect effects on pathway genes that are not adjacent to each other through regulatory positive or negative feedback <ref type="bibr">(Harris and Levine 2005</ref><ref type="bibr">, Lu et al. 2007</ref><ref type="bibr">, Mitrophanov and Groisman 2008</ref><ref type="bibr">, Itzhack et al. 2013</ref><ref type="bibr">, Vermeirssen et al. 2014</ref><ref type="bibr">, Rittschof and Robinson 2016</ref><ref type="bibr">, Cowen et al. 2017)</ref>. Reconstructing the network from gene expression profiles is guided by two goals: identifying strong direct interactions between gene pairs, and simplifying the network by removing direct interactions made redundant by strong indirect interactions. Our proposed method seeks to accomplish both by first identifying the strongest indirect interactions between two genes, then compare it to the direct interaction strength.</p><p>The maximum indirect interaction strength is determined by solving the MCP problem. Figure <ref type="figure">1</ref> illustrates different length-2 and length-3 paths between genes v s and v t as dashed lines and their direct interactions as solid lines. For each path, the minimum edge weight is its maximum capacity. Among all the possible length-L paths between v s and v t in G, the MCP is the path with the highest minimum edge weight. We use g L st to denote the capacity of such a path between v s and v t , and refer to it as L-path Capacity or Path Capacity. Formally,</p><p>Figure 1. Computing g 2 st and g 3 st in a network requires exploring different paths with one and two intermediate vertices, respectively, as shown above. g L st &#188; max i 1 ;...; i L&#192;1 2f1...jVjg &#240;min&#240;W si 1 ; W i 1 i 2 ; ...; W i L&#192;1 t &#222;&#222; (1) The maximum operation finds a combination of indices i 1 ; ...; i L&#192;1 with the maximum path capacity, with each index having range f1 ...jVjg. Multiple indices may refer to the same gene, thus forming a cycle in the L-path, representing, e.g. feedback loops for gene regulation. In subsequent sections, the range f1 ...jVjg for the indices is implied in the maximum operation.</p><p>Once computed, g L st is compared to the direct interaction strength W st . We define the L-path MCP score between the genes v s and v t , denoted as q l st , as the following ratio:</p><p>We use both the terms L-path MCP Score and MCP Score to refer to q L st . The output of our proposed method is denoted R L , a matrix of all the q L st 's for every pair of s and t. As a ratio of the edge weight to the path capacity, we expect q L st to be an indicator of the interaction strength of genes v s and v t while factoring in the effect of other interactions in the L-neighborhood of v s and v t . Strong direct interactions relative to indirect ones suggest that the edge should be kept in the network. We note that the DPI-based approach used by ARACNe-AP and TINGe can be modeled as a specialization of the L-path MCP score, as shown in Supplementary Section S1.1. Whereas ARACNe-AP and TINGe uses a DPI threshold to filter the MI matrix thus removing edges representing weak direct interactions compared to their indirect counterparts, our approach quantifies the relative strengths of the direct and indirect interactions and allows for post-processing according to application needs, e.g. thresholding to retain relatively strong direct interactions by examining the precision-recall curve.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">MCP score algorithm</head><p>The 2-path capacity g 2 st &#188; max i &#240;min&#240;W si ; W it &#222;&#222; can be computed explicitly by enumerating all possible length-2 paths between v s and v t . Algorithm 1 shows the pseudo-code for this approach. The 2-path MCP score defined in Section 2.1, q 2 st , can be calculated for the complete network directly by first computing g 2 st for a pair of genes via Algorithm 1, iterating over the two gene expression profiles in O&#240;jVj&#222; time. The algorithm has a run-time complexity of O&#240;jVj 3 &#222; as it iterates over all elements of a jVj&#194;jVj matrix (Algorithm 2).</p><p>Algorithm 2 applies for L-path MCP scores for any arbitrary L. Naive extension of Algorithm 1 to longer range interactions is exponential in computational complexity, however, as an increase of L by 1 increases computation by jVjfold. For an indirect interaction of length L, the naive g L st computation complexity is O&#240;jVj L&#192;1 &#222;, leading to O&#240;jVj l&#254;1 &#222; to compute the scores for all gene-gene pairs. <ref type="bibr">Pollack (1960)</ref> showed that the maximum capacity g L st of all length-L paths between two vertices in a graph can be computed efficiently via recursive path bisection. Formally,</p><p>where h &#188;bL=2c. The length-2 realization, g 2 st ,f o r m st h eb a s e case for the recursion. This partitioning leverages the associative properties of the maximum and minimum operations. Let c si h &#188; min&#240;W si1 ; ...;</p><p>An L-path does not require unique genes on the path. This implies that c si h does not depend on i h&#254;1 ...i L&#192;1 , and c si h is effectively a constant with respect to the max i h&#254;1 ;...;iL&#192;1 &#240;&#222; operation. Since the maximum and minimum operators are distributive over each other with respect to a constant, i.e. max&#240;min&#240;z; x&#222;; min&#240;z; y&#222;&#222; &#188; min&#240;z; max&#240;x; y&#222;&#222;,</p><p>Applying the identity again leads to Equation (3).The recursive path bisection allows L-length MCP scores to be computed in O&#240;jVj log 2 L&#222; for a single gene-gene pair, and the L-path capacity for all gene pairs to be computed in O&#240;jVj 3 log 2 L&#222; time. A realization of Equation ( <ref type="formula">3</ref>) is shown in Algorithm 3. For a gene pair (v s , v t ), the g values for half-length indirect interactions between v s and v t with all possible intermediary genes v r 2 V are computed. The g value for (v s , v t ) is then computed following the MCP problem solution as the twopath capacity case.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Parallel implementation</head><p>MCPNet pipeline consists of a sequence of functions. We first compute the MI between pairs of genes, using the gene expression profile matrix as input. The resulting MI matrix was then transformed to MCP score via the algorithm described in First, we note that Algorithm 2 has data access pattern identical to matrix-matrix multiplication, with Algorithm 1 replacing vector dot product as the computational kernel. Indeed, fast matrix multiply algorithms have been applied to the MCP problem <ref type="bibr">(Vassilevska et al. 2007, Duan and</ref><ref type="bibr">Pettie 2009)</ref> to achieve sub-cubic run-time. For simplicity and ease of parallelization, we compute each g directly in O&#240;jVj&#222; time. In contrast to the dominant product algorithm by <ref type="bibr">Vassilevska et al. (2007)</ref> that requires random memory access, our algorithm incurs only sequential memory access and therefore is cache-friendly and can leverage hardware memory prefetching.</p><p>Since each element can be computed independently from the other elements of the output matrix, a simple parallelization scheme is sufficient. The output matrix is partitioned into 8 &#194; 8 tiles and tiles are assigned to the compute nodes and CPUs. Use of tiles improves cache reuse as the eight rows and eight columns of the input matrices needed to compute one tile are likely to remain in cache memory during computation.</p><p>To ensure that the required input matrix row and column to compute one g 2 st value are in local memory, the MI matrix is replicated on each compute node, and shared between cores of the same node. The MI computation is similarly organized across all nodes, with the gene expression profile matrix replicated and the MI matrix tiles partitioned across cores. One round of all-to-all personalized communication via message passing interface reconstructs the MI or MCP score matrices on all compute nodes. The input replication reduces communication during computation. The overall parallelization approach is shown in Figure <ref type="figure">2</ref>.</p><p>Finally, the computation to produce g 2 st is readily vectorized using element-wise minimum and maximum operators. For reconstruction where the input weight matrix W is symmetric, the MCP score matrix is symmetric as well thus half of the computation can be avoided.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Ensemble multipath MCP scores</head><p>MCP scores as defined in Section 2.1 are for a fixed path length. To account for indirect interactions of multiple lengths, we consider a linear combination of the g values in computing an ensemble DPI score:</p><p>where a 2 , a 3 , ..., a L are coefficients for the different indirect interaction path lengths, with the standard constraint that the weights add to 1.0. The optimal ensemble parameters are generally challenging to determine. We propose to use available partial groundtruth to help optimize the ensemble parameters. For large gene networks, there are known and experimentally validated genegene and TF-target interactions. Using the known interactions as partial groundtruth for optimizing the ensemble parameters, the proposed ensemble approach promises to improve predictions of unknown interactions in the remainder of the network.</p><p>For the ensemble multi-path MCP score, we compute a weighted average of the 2-, 3-, and 4-path capacities by iterating over combinations of coefficients at a default interval of 0.1 for each coefficient. The AUPRC of the resulting networks are computed based on known interactions. The coefficient combination with the maximum AUPRC is used as the optimal combination for the ensemble. The ensemble approach allows the network inferring process to adjust to the dataset available, the organism of interest, and the available partial groundtruth.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Evaluation methodology</head><p>We evaluate performance of MCPNet using both simulated and real datasets, and compare it with seven other popular gene network reconstruction software-four MI-based methods [ARACNe-AP <ref type="bibr">(Lachmann et al. 2016)</ref>, CLR <ref type="bibr">(Faith et al. 2007)</ref>, MRNET <ref type="bibr">(Meyer et al. 2008)</ref>, TINGe <ref type="bibr">(Aluru et al. 2013)</ref>], a Pearson correlation-based method (WGCNA; Langfelder and Horvath 2008), a random forest-based method (Arboreto; <ref type="bibr">Aibar et al. 2017)</ref>, and a regression-based method (Inferelator; <ref type="bibr">Bonneau et al. 2006)</ref>. Arboreto and Inferelator are ensemble methods, in contrast to the other methods referenced above which are standalone. Each method evaluated is executed with its default parameters, or published parameters for the target dataset <ref type="bibr">(Tchourine et al. 2018)</ref>. We applied the standard statistical measure AUPRC  for assessing the inferred networks. AUPRC is the area under the precision-recall curve, and is recommended in cases where there is an imbalance of positive and negative edges in the underlying network <ref type="bibr">(Davis and Goadrich 2006)</ref>. Using AUPRC for both ensemble optimization and final network evaluation leads to the ensemble network having the maximal AUPRC for the given partial groundtruth. However, when different groundtruth are used to parameterize the AUPRC function, the generalization performance of the MCPNet ensemble method can be assessed as demonstrated in Section 4.4.</p><p>For in silico evaluations, we used NetBenchmark <ref type="bibr">(Bellot et al. 2015)</ref>, an R Bioconductor package that internally employs the GeneNetWeaver simulator <ref type="bibr">(Schaffter et al. 2011)</ref> to generate synthetic yeast datasets. NetBenchmark uses 5196 real interactions from the Yeast Transcriptional Regulatory Network <ref type="bibr">(Balaji et al. 2006</ref>) and employs non-linear ordinary differential expression simulation to produce 2000 observations for 2000 genes. Noise is then injected at five different levels in ten replicas to create 51 total datasets. The 5196 real interactions are used as true positives in the groundtruth, while the remaining 1 993 804 gene-pairs are considered as true negatives. The groundtruth for the simulated, as well as real Yeast and Arabidopsis thaliana are majority TF-target interactions and are used for both GCN and GRN evaluations. For synthetic TF-target network reconstruction, 187 TFs from Inferelator source repository <ref type="bibr">(Castro et al. 2019)</ref> were included.</p><p>To determine accuracy and scalability of gene network reconstruction with real-world data, we used gene expression data from two different organisms-S. cerevisiae and A. thaliana. The S. cerevisiae (yeast) dataset is a compilation of multiple yeast RNA-seq expression studies, and contains 2577 observations and 5716 genes <ref type="bibr">(Tchourine et al. 2018)</ref>. To determine the quality of the real-world yeast networks, we adopted the groundtruth from <ref type="bibr">Castro et al. (2019)</ref>, consisting of a matrix with 97 TFs and 956 targets. The 1360 non-zero entries are considered as true positives, while the remaining 91 372 are true negatives. Castro et al. generated the matrix by combining TF-binding and TF-knockout data collected by <ref type="bibr">Tchourine et al. (2018)</ref> from YEASTRACT <ref type="bibr">(Teixeira et al. 2006)</ref> and SGD <ref type="bibr">(Costanzo et al. 2014)</ref> databases, and TF-target interactions derived from chromatin accessibility data <ref type="bibr">(Boyle et al. 2008</ref><ref type="bibr">, Buenrostro et al. 2013)</ref> and TF binding sites <ref type="bibr">(Weirauch et al. 2014</ref>) by associating TFs with the closest downstream genes. For real yeast TF-target network reconstruction, the 563 TFs from the Inferelator source repository <ref type="bibr">(Castro et al. 2019)</ref> were included.</p><p>The A. thaliana microarray data was downloaded from public repositories <ref type="bibr">(Aluru et al. 2022)</ref>, and processed according to <ref type="bibr">(Chockalingam et al. 2016)</ref>. Data from six different tissues/conditions with varying sizes of gene expression datasets (Table <ref type="table">1</ref>) were used for gene network reconstruction. Microarray data was used as it was readily available from our previous studies, and it demonstrated the applicability of the MCPNet methods to both RNAseq and microarray data. Arabidopsis thaliana network(s) were evaluated using interactions from the following two known networks as groundtruth: (i) Arabidopsis Transcriptional Regulatory Map (ATRM) constructed from a priori biological knowledge by <ref type="bibr">Jin et al. (2015)</ref>, which includes 1359 TF-target interactions; (ii) 295 highest confidence TF-target interactions with scores greater than 9.5 selected from the 6863N-response dynamic factor graph (DFG) network edges identified in Brooks et al.</p><p>(2019, Supplementary File 10). These 6863 edges are themselves considered high confidence by Brooks et al., as they were selected from the network generated by DFG with timeseries nitrogen-treatment response data <ref type="bibr">(Marbach et al. 2012)</ref> based on the 71 836 validated TF-target interactions from TARGET assays. This approach of incorporating highconfidence computed interactions have been used previously to validate gene networks <ref type="bibr">(Vermeirssen et al. 2014</ref><ref type="bibr">, Castro et al. 2019)</ref>. The interactions from these two given networks can be considered as positives, i.e. interactions that are expected to be highly weighted edges in any predicted network. For true negatives, we used a set of 4347 interaction pairs between chloroplast-encoded and mitochondriaencoded genes <ref type="bibr">(Aluru et al. 2022)</ref> given the unlikelihood of a direct transcriptional interaction occurring between such genes in A. thaliana tissues <ref type="bibr">(Woodson and Chory 2008)</ref>. For A. thaliana TF-target network reconstruction, 2015 TFs from Chen et al. (2017) were included.</p><p>All software were run on a computing cluster with each node having two 2.7 GHz 12-Core Intel Xeon Gold 6226 Processor processors and 256 GB of main memory and running RedHat Enterprise Linux (RHEL) 7.0 operating system. For the simulated and Saccharomyces cervisiae RNA-seq datasets, all 24 cores of a node were used for software that could be run with multiple cores. For example, Arboreto, WGCNA, TINGE, Inferelator, and MCPNet are capable of using all the cores in a node. For A. thaliana datasets, we used up to eight nodes and all of the 192 cores distributed across these eight nodes for Arboreto, TINGE and MCPNet, which are capable of using multiple distributed cores. While Inferelator is also capable of utilizing multiple shared cores, it was not used in this case due to extremely long run-times. Scripts used for the evaluations presented here can be accessed at <ref type="url">https://doi.org/10.5281/zenodo.6499756</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Results</head><p>The proposed MCPNet method takes as input correlation measures, which for the purpose of evaluation presented here are the MI values between pairs of gene expression profiles. We evaluated three common methods for estimating MI values from observed data and chose adaptive partitioning (AP) with rank transform for its accuracy and speed (see Supplementary Section S1.2). We use our implementation of the AP MI estimation algorithm based on the hybrid multithread, multi-node parallelization approach in Section 2.3. The existing tools utilized their respective built-in MI algorithms where applicable.</p><p>The unsupervised MCP score has a single parameter, path length L. Empirical testing with simulated Yeast data showed that the network quality reached maximum at L &#188; 4 and improvements diminished beyond 4 (Supplementary Section S1.2). Subsequent evaluations of MCPNet, including the ensemble multipath scores, were conducted with L of 2, 3, and 4. The ensemble method optimizes the multipath score via a global coefficient search at a regular interval, which is a useradjustable parameter. We used 0.1 as the search interval as it represents a good balance between search quality and computation time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Evaluation using simulated yeast expression data</head><p>We begin by evaluating the effect of noise in the gene expression profile data on GCN quality. Simulated yeast datasets were generated with varying noise levels using the NetBenchmark package.</p><p>For each of the datasets MCPNet generates three different sets of networks: (i) L-path MCP score networks (i.e. MCP score matrices R 2 ; R 3 , and R 4 from Section 2.1); (ii) ensemble networks constructed using ensemble method discussed in Section 2.4 [i.e. M 4 matrix in Equation ( <ref type="formula">4</ref>)] constructed as the weighted combination of R 2 ; R 3 , and R 4 matrices; and (iii) Stouffer-transformed ensemble network, where Stouffer's Z-transform is applied to each of the M 4 matrices and the AUPRC for the best combination is reported as M 4 Z . This is similar to post-MI processing in CLR <ref type="bibr">(Faith et al. 2007</ref>) to reduce background contribution to the score.</p><p>Table <ref type="table">2</ref> shows the average AUPRC for networks constructed by MCPNet and other network construction methods. Results for all of the simulated data instances are given in Supplementary Table <ref type="table">S2</ref>. The rows R 2 ; R 3 , and R 4 are networks constructed using the path scores [Equation ( <ref type="formula">2</ref>)] with path lengths of 2, 3, and 4, respectively. Note that for M 4 and M 4 Z networks, the average AUPRCs for the best combination is reported.</p><p>As expected, the AUPRC values, and hence the network performance, decreases with increasing noise levels. Our results also show that ensemble MCP Score achieves the best performance for low and moderately noisy datasets when compared to all the other gene network reconstruction methods. At noise levels higher than 50%, MCPNet is the second best performing method. The selection of best coefficients for each of the L-path capacities (i.e. R 2 ; R 3 , and R 4 ) results in an improvement over the individual MCP score networks. By exploiting known interaction information from user supplied partial groundtruth, the ensemble approach is able to identify the best combination for each of the noise levels. Thus, MCPNet's ensemble method is able to adjust to the unique data and noise characteristics of each dataset through fine tuning of its coefficients.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Evaluation with real gene expression datasets</head><p>We next evaluated performance of MCPNet using real-world data from two different organisms-yeast (S. cervisiae) and A. thaliana. For GCNs inferred from the real yeast data, our results show that Arboreto is the highest performing method followed by M 4 Z (Table <ref type="table">3</ref>). For this dataset, all methods showed low AUPRC scores, and the absolute AUPRC differences between Arboreto and M 4 Z can be considered marginal. The improvement of M 4 Z over M 4 and the relatively high AUPRC for CLR suggest that Stouffer transformation as a post-processing step is beneficial for this particular dataset. While Arboreto achieved the best network quality, it does so in $29 h and 7 GB of memory using 24 CPU cores whereas MCPNet completed the computation in 38 s and less than half of the memory. The quality and run-time balance compared to existing tools strongly favors MCPNet for its ability to reverse-engineer a good-quality network in reasonable time. An interesting observation is the uniformly low AUPRC values when compared to those in Section 4.1, and A. thaliana. The same observation can be seen in <ref type="bibr">Castro et al. (2019)</ref> which suggest this maybe inherent in the expression profile data or the groundtruth.</p><p>Arabidopsis thaliana networks were constructed using data from the Development category (Table <ref type="table">1</ref>) containing the largest number of microarray datasets. Data were processed both on one node and eight distributed nodes for all network inference methods, with the exception of Inferelator which does not support multi-node execution and whose run-time for the development dataset is expected to exceed the cluster's job Z networks are (0.7, 0.0, 0.0, 0.3) and (0.875, 0.0, 0.0, 0.125), respectively.</p><p>c ARACNe-AP has identical memory usage for 1C and 24C, as does WGCNA.</p><p>d CLR and MRNET are not run using 24 cores as they do not support multi-threading.</p><p>e M 4 and M 4 Z are computed together in the same pass and share memory usage.</p><p>f Arboreto and Inferelator are expected to exceed run-time limit for single-core run. Z networks vary for each randomly sampled network. scheduler time limit. Table <ref type="table">4</ref> shows that for the A. thaliana dataset, R 4 and M 4 produced the best results when compared to all other methods. Importantly, all the MCPNet algorithm variants (R 2 ; R 3 , and R 4 ) and ensemble methods (M 4 ; M 4 z ) outperformed all existing network inference methods in terms of AUPRC. With the large A. thaliana data, MCPNet's performance advantage over existing tools is even more evident. The two tools with the next best AUPRC values, CLR and Arboreto, required 18.5 h using 1 CPU core and 44.1 h and 437 GB of memory using eight nodes, respectively, while MCPNet completed the computation in &lt;10 min on one node and 84 s on eight nodes with moderate per node memory footprints.</p><p>Finally, we evaluate the proposed methods using six A. thaliana datasets of different tissues and experimental conditions, each with differing number of samples and genes (see Table <ref type="table">1</ref>). Table <ref type="table">5</ref> shows the AUPRC and run-times for the multipath ensemble MCP score network (M 4 ) and the next best results amongst ARACNe-AP, CLR, MRNET, TINGe, WGCNA, and Arboreto. Complete Results for all the methods are given in Supplementary Table <ref type="table">S3</ref>. For all of the datasets analysed, MCPNet M 4 shows higher or equivalent AUPRC values while the next best AUPRC is achieved by CLR, MRNET, or Arboreto depending on the datasets, while being consistently faster by up to three orders of magnitude.</p><p>MCPNet consistently produces networks that are amongst the highest in quality as assessed by AUPRC and fastest in performance by orders of magnitude (see Supplementary Section S1.3 for a more in-depth computational performance discussion). Our results show that dataset characteristics can significantly influence the quality of the reconstructed network. The coefficients of the weighted average of L-path capacities for the best performing M 4 and M 4 z networks also varied by dataset as noted in Tables <ref type="table">3</ref> and <ref type="table">4</ref>. The optimization approach therefore is well-suited for the ensemble methods as it adjusts automatically to data characteristics. In addition, since MCPNet is capable of efficiently generating multiple networks in one run (R 2 ; R 3 ; R 4 ; M 4 , and M 4 z ), the user can efficiently choose and further evaluate the networks with an appropriately chosen, biologically relevant downstream metric.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">GRNs with synthetic and real yeast data sets</head><p>In this section, we evaluate MCPNet for the task of reconstructing GRNs, namely to compute the maximum path capacities between TFs and their target genes. By the definition of indirect interactions in Section 2.1, indirect TF-target interaction path may be modeled as a combinatorial sequence of TF-TF, TF-gene, and gene-gene interactions. For this evaluation, we focus on the mathematical path model consisting of a sequence of TF-TF interactions followed by one TF-target interactions as a simplification, similar to ARACNe-AP's DPI modeling for length 2 indirect interactions. We note that an alternative simplification, TF-target gene...gene-gene, is trivially contained in the GCN output matrix. GRNs were reconstructed MCPNet, ARACNe-AP, Arboreto, and Inferelator using real yeast and noise-free simulated datasets. CLR, MRNET, TINGe, and WGCNA were not included as they do not support GRN inferencing. Arabidopsis thaliana is not used for the evaluation as its true negative groundtruth do not involve the genomic TFs, thus precluding evaluation by AUPRC.</p><p>Table <ref type="table">6</ref> shows the qualities of the reconstructed GCNs and GRNs. All of the reconstructed TF-target GRNs have significantly higher AUPRC than their GCN counterparts. MCPNet methods, M 4 and M 4 Z particularly, outperformed existing methods for both GCN and GRN reconstructions using simulated datasets. For the real yeast dataset, MCPNet's M 4 Z method is second only to Arboreto in network quality, reflecting the same trend as for the GCN reconstruction. The high AUPRC values suggest that the mathematical formulation and algorithm implementation of MCPNet is well suited for both GCN and GRN inference.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Ensemble optimization with partial ground truths</head><p>The MCPNet ensemble method optimizes the coefficients of g 2 s;t ; g 3 s;t ; g 4 s;t to maximize the AUPRC given available partial groundtruth and gene expression data as input. The M 4 results in Table <ref type="table">2</ref> were generated by optimizing the ensemble network AUPRC using the full groundtruth. For real-world Z networks are &#240;0:2; 0:2; 0:2; 0:4&#222; and &#240;0:0; 1:0; 0:0; 0:0&#222;, respectively.</p><p>Total memory usage for eight nodes is reported. c ARACNe-AP, CLR, MRNET, and WGCNA do not support multiple nodes.</p><p>d CLR and MRNET do not support multi-threading so the experiments used only 1 core of the node.</p><p>e M 4 and M 4 Z are computed together in the same pass and share memory usage.</p><p>f Arboreto is expected to exceed run-time limit for single-node run.</p><p>data, often only partial groundtruth is available, and the size and membership of the partial groundtruth affect network quality.</p><p>To assess the impact of the partial groundtruth, we divide the full yeast groundtruth randomly into training and testing sets, and use the training set to optimize for an ensemble network from the noise-free simulated yeast gene expression data. The ensemble network is then evaluated using the test set and the full groundtruth. As the AUPRC is parameterized with different groundtruth, effectively the AUPRC functions for optimization and final network evaluation are distinct and independent. This process is repeated 100 times to characterize the distribution of the AUPRC values and the robustness of the ensemble algorithm with varying groundtruth. We evaluated training-testing split proportions of 0.5%, 1%, 2%, 5%, and 10% of the full groundtruth. The yeast groundtruth contain 92 732 gene-gene interactions, representing 0.57% of all possible gene-gene interactions between 5716 yeast genes.</p><p>Figure <ref type="figure">3</ref> shows the distributions of the AUPRC values for each training-testing splits and full groundtruth. As expected, smaller training sets produced greater AUPRC value dispersion, as the ensemble network optimization relies on fewer groundtruth elements, and AUPRC computation is more influenced by variations in the gene-gene or TF-target interactions.</p><p>In all GCN cases, the AUPRC values are tight in dispersion, with standard deviations for the test AUPRCs being 0.0118, 0.0045, 0.0040, 0.0023, and 0.0026 for the 0.5-10% training-testing splits, respectively. Even for networks generated from only 0.5% of the full set, the AUPRC distribution lies completely above the ARACNe-AP's AUPRC value of 0.3503 and nearly all above R 4 's AUPRC of 0.3938, and the majority of the networks have AUPRC more than 0.4. The means, 0.4135, 0.4169, 0.4173, 0.4184, and 0.4201, are remarkably close to the 0.4198 AUPRC of the network optimized with the full groundtruth, despite the networks being generated using small training sets. The similarity of the AUPRC distributions between test sets and full groundtruth reflects the fact that the test sets are 90-99.5% identical.</p><p>For the GRN cases, the AUPRC values are significantly higher than their GCN counterparts, likely due to better curated groundtruth. The AUPRC distribution standard deviation tightens (0.0088, 0.0063, 0.0050, 0.0036, 0.0036) as the size of training set increases. Similarly, the mean AUPRCs (0.4979, 0.4999, 0.5025, 0.5054, 0.5070) for the trainingtesting splits are higher than that of R 4 , 0.4988, and converges to the AUPRC of the ensemble network optimized using the full TF-only groundtruth, 0.5069, even when training set represents a small fraction of the already small set of TF-  These results suggest that even when the ensemble network is optimized using a small partial groundtruth set, with high probability it will be of higher quality than the R 4 network. Furthermore, a small partial groundtruth, even at 1% of full groundtruth, will on average produce a network with similar quality as one optimized using the full groundtruth. Finally, the ensemble method is robust to varying make-up of the training set, and thus the choice of ensemble weights.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Discussion</head><p>The efficient algorithm and parallel implementation of MCPNet creates opportunities for data exploration and biological hypothesis testing. Its scalability to hundreds of CPU cores enables analyses of data with sample and gene counts that are previously prohibitive, and exploration of parameter spaces and multiple methods at the same time, e.g. computing length-2, -3, and -4 MCP scores, our ensemble network and its Stouffer transform in one pass. This is useful where the individual methods are suspected to behave in a data-dependent manner, e.g. with Stouffer transform in CLR and M 4 Z for the real yeast data, and inspecting multiple networks is desirable. In addition, as the MCP score formulation differs from existing algorithms, MCPNet provides complementary network that can contribute to multi-method ensemble approaches, i.e. combining networks generated by the tools evaluated in this work.</p><p>There are multiple aspects of the MCPNet algorithm and implementation that deserve further exploration. An important consideration for MCPNet development is the usability of the software. While the evaluations in this paper utilized nodes in a high performance cluster, the real yeast and the A. thaliana experiments showed that commodity personal computers are well suited for smaller datasets, while professional workstations should easily exceed the requirements for processing large datasets using MCPNet. At the same time, user friendly integration into common, existing bioinformatic programming environments and pipelines will ease downstream processing and biological interpretation of the reconstructed network and promote user adoption. Memory footprint reduction and language binding for Python and R/Bioconductor represent two future software engineering efforts for MCPNet.</p><p>The MCP mathematical model underlying MCPNet is agnostic of the edge weight formulation. While we focused on MI values as edge weights, other measures such as Boolean values and Pearson correlations may allow different network modeling and incorporate information such as up-and downregulations. Our evaluations showed that MCPNet performed well with MI of RNAseq and microarray data as averaging improves signal to noise ratio, but such may not be the case for single cell transcriptomic data as MI estimation with highly sparse data may present significant challenges for MCPNet and gene network reconstruction in general, and information regarding groundtruth for individual cell types is still limited for validation. Single cell transcriptomic data, promises to allow cell-type specific GCNs and GRNs, thus MCPNet applicability to such data is of significant interest.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>In this paper, we present a new method MCPNet for GCN and GRN reconstruction that utilizes a novel edge scoring metric, the MCP score, based on the MCP problem in network and graph analysis. MCP score characterizes indirect gene-gene and regulator-target interactions to identify significant direct interactions in an unsupervised manner and with minimal user-specified parameters. We further present an ensemble approach that uses partial network groundtruth to optimize the weighted average of MCP scores from multiple Lpaths scores, with the objective of inferring higher quality novel interactions. The MCP score methods have been shown to generate networks with competitive or superior AUPRC scores on real-world S. cerevisiae and A. athaliana datasets, while reducing run-time by several orders of magnitude relative to existing best-in-class methods.</p><p>While we have applied MCPNet for GRN and GCN reconstruction, it operated on gene expression profile from a single time point, thus cannot infer causality, similar to the other existing tools evaluated in this work. As a consequence, the predicted gene co-expressions and regulatory interactions must be considered as candidates for further validation. With this understanding, the combination of network quality and computational performance suggests MCPNet to be a viable first-choice tool for gene network reconstruction and candidate gene and TF-target interactions.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>Pan et al.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>Downloaded from https://academic.oup.com/bioinformatics/article/39/6/btad373/7192172 by guest on 11 May 2024</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_2"><p>Pan et al. Downloaded from https://academic.oup.com/bioinformatics/article/39/6/btad373/7192172 by guest on 11 May 2024</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_3"><p>Pan et al.</p></note>
		</body>
		</text>
</TEI>
