<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Median quartet tree search algorithms using optimal subtree prune and regraft</title></titleStmt>
			<publicationStmt>
				<publisher>Springer Nature</publisher>
				<date>12/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10510814</idno>
					<idno type="doi">10.1186/s13015-024-00257-3</idno>
					<title level='j'>Algorithms for Molecular Biology</title>
<idno>1748-7188</idno>
<biblScope unit="volume">19</biblScope>
<biblScope unit="issue">1</biblScope>					

					<author>Shayesteh Arasti</author><author>Siavash Mirarab</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<title>Abstract</title> <p>Gene trees can be different from the species tree due to biological processes and inference errors. One way to obtain a species tree is to find one that maximizes some measure of similarity to a set of gene trees. The number of shared quartets between a potential species tree and gene trees provides a statistically justifiable score; if maximized properly, it could result in a statistically consistent estimator of the species tree under several statistical models of discordance. However, finding the median quartet score tree, one that maximizes this score, is NP-Hard, motivating several existing heuristic algorithms. These heuristics do not follow the hill-climbing paradigm used extensively in phylogenetics. In this paper, we make theoretical contributions that enable an efficient hill-climbing approach. Specifically, we show that a subtree of size<italic>m</italic>can be placed optimally on a tree of size<italic>n</italic>in quasi-linear time with respect to<italic>n</italic>and (almost) independently of<italic>m</italic>. This result enables us to perform subtree prune and regraft (SPR) rearrangements as part of a hill-climbing search. We show that this approach can slightly improve upon the results of widely-used methods such as ASTRAL in terms of the optimization score but not necessarily accuracy.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>The NP-Hard <ref type="bibr">[1]</ref> problem of finding a tree that minimizes the total quartet distance to a set of given trees has found wide-ranging applications in recent years <ref type="bibr">[2]</ref>. The quartet distance between two unrooted trees is obtained by dividing each tree into all its quartets (choices of four taxa) and counting quartet topologies that do not match <ref type="bibr">[3]</ref>. While studying this quartet median tree problem is not new <ref type="bibr">[4,</ref><ref type="bibr">5]</ref>, its renewed popularity is a result of its connection to a broader trend in phylogenomics -the embrace of methods that account for discordance between gene trees and species trees <ref type="bibr">[6]</ref>. While various approaches exist for accounting for such discordance when inferring a species tree <ref type="bibr">[7]</ref>, many of these methods rely on quartets. There is a reason for the use of quartets. As originally noted by Allman et al. <ref type="bibr">[8]</ref> for the multispecies coalescent model (MSC) <ref type="bibr">[9,</ref><ref type="bibr">10]</ref> of incomplete lineage sorting (ILS) and later extended to models of duplication and loss (GDL) <ref type="bibr">[11,</ref><ref type="bibr">12]</ref>, HGT <ref type="bibr">[13]</ref>, and even ILS+GDL <ref type="bibr">[14]</ref>, on a quartet species tree, the unrooted gene tree topology matching the species tree has a higher probability of being observed than the two alternative topologies. Some methods (e.g., ASTRAL <ref type="bibr">[15]</ref>) have used this observation to directly use the median quartet tree problem as a way of estimating species trees. Others (e.g., <ref type="bibr">[16]</ref><ref type="bibr">[17]</ref><ref type="bibr">[18]</ref>) have used this observation to infer individual quartet species trees using some criterion and then combine them. Either way, taxa are divided into quartets.</p><p>Dividing n taxa naively into quartets will require &#65533;(n 4 ) time just to list the quartets, making any result- ing algorithm impractical on large datasets. Some methods still use this approach but subsample quartets to an asymptotically smaller size (e.g., a quadratic or even O(n log(n)) number <ref type="bibr">[19]</ref>). However, subsampling can be avoided while achieving high scalability using improved data structures and algorithms. For the simplest problem of computing the quartet distance between two trees, which naively would require &#65533;(n 4 ) time, a straightfor- ward algorithm can achieve quadratic time <ref type="bibr">[20]</ref>. This requires the post-traversal of one tree and comparing each node versus each node of the other tree, keeping track of the number of shared children below these nodes. This approach has been at the heart of ASTRAL since version 2 <ref type="bibr">[21]</ref>. If we allow ourselves to use much more sophisticated data structures, we can do even better. Brodal et al. <ref type="bibr">[22]</ref> have designed a complex data structure called Hierarchical Decomposition Tree (HDT), which, along with a host of counters and other algorithmic tricks, enable a O(n log 2 (n)) algorithm for comput- ing quartet distance. Mai and Mirarab <ref type="bibr">[23]</ref> later extended this approach to solve the problem of adding one taxon to a tree so that the updated tree has the minimum possible distance to a set of k input trees. Zhang and Mirarab <ref type="bibr">[24]</ref> used similar ideas to infer a tree by successively adding one taxon to a growing tree in a tool called ASTER. Thus, quadratic and sub-quadratic quartet methods without subsampling are widely available.</p><p>Available quartet-based median tree inference methods differ from most other phylogenetic inference methods in their approach. ASTRAL <ref type="bibr">[15]</ref>, which is perhaps the most widely used method for this problem, uses a dynamic programming algorithm (an approach with long history; see <ref type="bibr">[25]</ref><ref type="bibr">[26]</ref><ref type="bibr">[27]</ref>) to solve the problem exactly in exponential time or under some constraints in polynomial time ( O((nk) 2.726 ) in the worst case and close to n 2 k 2 empirically <ref type="bibr">[28]</ref>). The ASTER package <ref type="bibr">[24,</ref><ref type="bibr">29]</ref> uses several rounds of step-wide addition with random orders of adding taxa, followed by a dynamic programming step similar to ASTRAL to combine these greedy results. Earlier methods such as wQMC use graph-based techniques <ref type="bibr">[30]</ref>. Thus, none of these methods use the hill-climbing search algorithms used by most other phylogenetic inference tools. While ASTRAL has been scalable, it is not clear if the reason is the use of a constrained dynamic programming algorithm or if an efficient hill-climbing could be as efficient or perhaps even more. If the improvements are due to constrained dynamic programming, perhaps we should explore similar methods for other problems. On the other hand, it is possible that hill climbing can improve quartet-based estimation in terms of running time, accuracy, or both.</p><p>Hill climbing tree search requires efficient methods of updating the score after a rearrangement. This is often straightforward for Nearest Neighbour Interchange (NNI) moves around the current tree T. However, many modern methods use Subtree Prune and Regraft (SPR) rearrangements in addition to NNI. SPR rearrangement is defined on an edge (u, u &#8242; ) , selecting one end, say u, as the pruning point. The (u, u &#8242; ) edge is pruned at u and is grafted back on an edge (v, v &#8242; ) by creating new edges (v, u) and (u, v &#8242; ) . Assume we know the quartet distance of a tree to another tree before an SPR move. How should we update the distance after the move? We could use the Brodal et al. <ref type="bibr">[22]</ref> (called B13 hereafter) method and simply recompute the score in O(n log 2 (n)) time. Doing so, we would need O(n 2 log 2 (n)) time to find the optimal SPR move for a given (u, u &#8242; ) ; one "round" of SPR would in the worst case require trying pruning all edges of a tree, which would need O(n 3 log 2 (n)) . Thus, a single SPR round can start to become infeasible, and we need many rounds.</p><p>Our goal in this paper is to enable SPR-based hill climbing for the quartet median tree given a set of k input trees. Mai and Mirarab <ref type="bibr">[23]</ref> extended the B13 algorithm to optimally add a single taxon to a tree in O(n log 2 (n)k) time. For a pruned sub- tree of size m, we can repeatedly use this algorithm to find the optimal grafting destination in</p><p>In this paper, we show that we can do even better: In O((nm) log(nm) log(n)k) time, we can find the optimal position for the pruned subtree of size m. Surprisingly, this time does not increase with m and is only O(n log 2 (n)k) in the worst case when m and n are of the same order. This worst-case for finding the optimal grafting position is surprisingly the same as the time needed for computing the quartet score. With this algorithm, a full SPR round requires only O(n 2 log 2 (n)k) time because O(n) SPR sources need to be tested.</p><p>Our theoretical results enable us to design a hillclimbing algorithm for the median quartet tree problem. We built such a tool, called Q-SPR. In simulation and on real data, we show that starting from ASTRAL-III trees, SPR moves can improve the quartet score marginally; however, these improvements do not result in meaningful improvements in accuracy. Starting from a tree built using a stepwise addition performed using tripVote leads to a complete hill-climbing software, which, while competitive with ASTRAL-III in terms of accuracy, is substantially slower in practice under the conditions we tested here. Our results indicate that the dynamic programming strategy of ASTRAL is indeed beneficial for achieving fast running time. However, we note that Q-SPR still is useful for further refining ASTRAL-III output. Moreover, its memory and running time depend on k only linearly, which is better than ASTRAL-III, which depends on k super-quadratically, for handling tens or hundreds of thousands of genes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Materials and methods</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Notations</head><p>We denote a tree by T = (V T , E T ) and let L T &#8838; V T be the leaftset. An edge (v, u) &#8712; E T is directed from the parent v to the child u. We refer to the root of a tree T by r T and use d T to denote its maximum node degree, omitting the subscript when clear. We use u &#8593; to denote the parent of a node u&#8712; V T \{r T } . We let L u denote the set of leaves below u. Removing the edge (u &#8593; , u) from the tree T cre- ates two subtrees, one with and one without the vertex u, denoted by T &#8744; u and T &#8743; u , respectively. Note that any resulting degree-2 node in T &#8743; u is suppressed, connecting its child to its parent. We use T u &#8226; T &#8242; to denote the placement of a rooted subtree T &#8242; on the edge (u &#8593; , u) of T: we divide (u &#8593; , u) to (u &#8593; , v) and (v, u) by adding v and then connect T &#8242; to v by adding the edge (v, r T &#8242; ) . When u is the root r T , we add a new root r * and create two new edges (r * , u) and (r * , r T &#8242; ) . The tree T can be rerooted at an edge e = (v, u) to obtain the rerooted tree T &#8853; u ; to do so, we divide e into (v, r) and (r, u) where r is the new root and reverse the direction of all edges in the path from r to r T . Finally, we remove all degree-2 nodes other than the new root by connecting their children to their parents in the rerooted tree T &#8853; u . A tree T can be restricted to any arbitrary set of three leaves in L T (suppressing nodes with a single child in the process) <ref type="bibr">[31]</ref>; we call each of those a triplet of T. We call the least common ancestor (LCA) of any three leaves in T its anchor. We similarly define a rooted quartet and its anchor by restricting T to any set of four leaves. Unrooting this tree gives us the unrooted quartet; when not specified, the term quartet refers to the unrooted case. For a triplet of leaves {x, y, z} , we say T 1 and T 2 match, or the triplet is matching or shared between T 1 and T 2 , iff {x, y, z} &#8838; L T 1 &#8745; L T 2 , and T 1 and T 2 have the same triplet topologies when restricted to {x, y, z} . Similarly, a quartet of leaves {w, x, y, z} is called a matching or shared quartet of T 1 and T 2 iff {w, x, y, z} &#8838; L T 1 &#8745; L T 2 , and the restricted trees have the same unrooted topology in both trees.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Problem definition</head><p>We use S 3 (T 1 , T 2 ) to denote the number of triplets that match between the two trees and use S 4 (T 1 , T 2 ) to denote the number of quartets matching between two trees. Definition 1 (Q-SPR Problem) Given a rooted query tree T = (V T , E T ) , a set of arbitrarily rooted reference trees R = {R 1 , R 2 , ..., R k } , and p &#8712; V T \{r T } , find where u * is the optimal placement of the pruned subtree</p><p>&#8226; T &#8744; p gives the optimal output (Fig. <ref type="figure">1A</ref>). Let n = |L T | and m = |L p | , and thus</p><p>The Maximum-matching Quartet Placement (MQP) problem of Mai and Mirarab <ref type="bibr">[23]</ref> is a special case of Q-SPR where m = 1 and a single query taxon q is placed on a given tree. They also defined a Maximum Triplet Rooting (MTR) problem where the goal is to root a given tree T based on a set of rooted reference trees R so that the rooted tree T * maximizes k i=1 S 3 (T * , R i ) . MQP and MTR turn out to be equivalent: All reference trees can be rooted at q, and q can be placed at the root of the output of MTR to solve MQP.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Background: tripVote algorithm</head><p>Our solution to Q-SPR builds on the tripVote <ref type="bibr">[23]</ref> algorithm that solves MTR and MQP using an extension of the B13 algorithm. Recall tripVote seeks to optimally root a query tree T with respect to a reference tree R. tripVote is based on a recursion: where R is a single reference tree R, and the &#964; values are counters defined in Table <ref type="table">1</ref>. Effectively, this recursion computes the score of each rerooting T &#8853; u of T based on precomputed counters and the score of the rerooting on the parent of the current node. The &#964; i u , &#964; o u , and &#964; r u counters are computed in a top-down traversal of T, and Equation (1) makes it trivial to find the optimal rooting once these counters are computed.</p><p>Coloring scheme. To compute the counters, tripVote follows the coloring scheme of B13. As it traverses T in preorder, on visiting a node u, it recolors the leaves so that each side of u has a different color. Thus, we need d T colors, which we refer to with an index in [0, d T -1] . Around a node u of T with degree d and children v 1 , . . . , v d-1 and parent u &#8593; , we can obtain d subtrees:</p><p>, and T &#8743; u . After recoloring, the leaves in T &#8743; u are colored 0, and the leaves in the subtrees</p><p>are colored 1, . . . , d -1 , respectively. Colors are marked on leaves of both T and R, which are linked using bidirectional pointers. Conceptually, the node u partitions leaves of T into d parts, and leaves of each part can be located anywhere in the tree R (i.e., do not necessarily form a clade). The coloring of a leaf in R helps us track which part (i.e., side of u in T) it belongs to. Colors are</p><p>updated one leaf at a time. After all leaves are colored to match u, we traverse R from recolored leaves to the root and update a set of counters kept at each node of R. These counters are functions of the colors of the leaves in R and are used in counting the number of shared triplets between T and R that match certain criteria specific to node u, given in Table <ref type="table">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Hierarchical Decomposition Tree (HDT).</head><p>To make recoloring fast, B13 represent R using a data structure called HDT (Hierarchical Decomposition Tree). HDT is a locally balanced tree that is built on a rooted tree R with n nodes in O(n). Each node (also called component) of the HDT corresponds to a set of connected nodes in R. There are four possible types of components in an HDT listed in Table <ref type="table">2</ref>. An HDT is constructed in an iterative process. Initially, every leaf and internal node of R is represented with an L or I type component in the HDT, respectively. During each round of construction, two components are merged to form a new HDT component, which is their parent (i.e., includes them). The type of the new node depends on what types are merged (Table <ref type="table">2</ref>). The order of these merge operations is specified by B13. The process stops when we build the root component of the HDT ( R ), which corresponds to the set of all nodes of R.</p><p>Once the HDT is built, its nodes are assigned counters that keep getting updated. Each component X of the &#8226; P for short) with the maximum number of shared unrooted quartets with the reference tree(s). B The recursive equation of Lemma 1: each of the counters corresponds to a certain type of solo quartet, color-coded here. Subtracted counters are to fix double-counting by other counters, as shown in Table <ref type="table">3</ref>. C Reference tree R is represented as an HDT. For each node u of B , the leaves are colored such that they correspond to the sides of that node. Outside u is colored 0, the larger child is colored 1, and the rest are colored 2 . . . d u . The recoloring of B results in the recoloring of the HDT nodes representing R. Note that query nodes are always colored -1 and are never recolored. Overall, we need O(n log(n)) leaf recoloring during the top-down traversal of B , each of which can require a O(log(n)) bottom-up traversal of the HDT HDT has two main counters &#961; X and &#960; X j , defined in Table <ref type="table">1</ref> (of these, &#961; X is similar to B13 counters and &#960; X j is specific to tripVote). Note that these counters keep changing value as we traverse T and recolor its leaves. After we visit a node u of T, leaf recoloring and counter updates are triggered; after this is done, counters of the HDT ( &#961; X and &#960; X j ) have values defined in Table <ref type="table">1</ref> specific to u. Each HDT counter of a component X is the sum over all its children, plus extra triplets counted specifically at X. Due to this cumulative nature and based on their definition (Table <ref type="table">1</ref>), &#961; X and &#960; X j at the root of the HDT ( R ) give &#964; i u , &#964; o u , and &#964; r u values of T; that is,</p><p>To recolor each leaf of the HDT, we update &#961; X and &#960; X j counters for all its ancestors; at the root R , we can update the corresponding &#964; i u , &#964; o u , and &#964; r u counters of T. The &#961; X and &#960; X j counters, in turn, are based on O(d 2 ) other auxiliary counters kept for each node of the HDT and more complex recursive equations to update these counters (we revisit these recursions later). These updates can be done with a time complexity that, per node, does not depend on n, requiring O(d 2 ) amor- tized over all calculations.</p><p>Running time. In terms of the running time, recoloring one node of the HDT requires only O(d 2</p><p>T log(n)) calculations because of the local balance property of the HDT (i.e., each component of the HDT with m leaves has O(log(m)) height). Also, as we traverse T, using a smaller-half trick adopted from B13, we perform at most O(n log(n)) leaf recoloring on the HDT (a naive recolor- ing scheme would require O(n 2 ) recolorings). Since each counter is updated in constant time, the total recoloring time is O(d 2</p><p>T n log 2 (n)) for one reference tree. If we have k reference tree, we process them independently, obtaining O(kd 2 T n log 2 (n)) . Note that the HDT structure can be constructed from R in O(n) time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Optimal quartet SPR placement in sub-quadratic time</head><p>For ease of notation, we let B denote the backbone tree after pruning (i.e., T &#8743; p ), let P denote the pruned subtree (i.e., T &#8744; p ), and let d = d T . First, we note:</p><p>Observation 1 To find the number of shared quartets between any potential placement of P on B and the ref- erence tree R, we only need to count those quartets that have a single leaf from P and three leaves from B.</p><p>Proof The topologies for quartets with zero or more than one leaves from P do not depend on the placement of P . Those with zero or four from P clearly have no rela- tion to its placement. Those with two from P always have these two leaves together, regardless of the placement of P . Since P is rooted, those with three will have the topol- ogy implied by the triplet from P regardless of its place- ment.</p><p>Based on this simple observation, we define:</p><p>Definition 2 For any instance of the Q-SPR problem, a quartet is called solo if it includes exactly one leaf from P . For a solo quartet, removing the sole leaf from P creates its associated triplet. A solo quartet is said to belong to an HDT component X when the L-type component corresponding to each of its four leaves is nested within X.</p><p>Observation 2 Q-SPR problem can be solved by repeated applications of tripVote.</p><p>Proof Since only solo quartets matter, we can place each single-leaf tree q of P independently using tripVote to obtain the quartet score</p><p>where C is a constant (shared quartets with zero or more than one leaf from P ), which is independent of the node v and thus can be ignored. It is easy to see that by Observation 1, maximizing this score returns the optimal placement.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Using tripVote to solve the</head><p>Table <ref type="table">1</ref> Each counter gives the number of triplets shared between a query tree T (or a rerooting of T; 2nd column) that are anchored at a particular node of T (e.g., u; 3rd column) and a component X of the HDT (4th column)</p><p>HDT represents the reference tree R and is rooted at R</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Rerooting T at Shared triplets anchored at Under HDT component</head><p>Counters of a node u of T</p><p>for large enough n. Instead, we propose a new method to directly solve Q-SPR, as summarized in Algorithm 1. This algorithm borrows ideas from tripVote but moves it from being based on triplets to quartets; that is, we directly count the number of shared quartets between the query tree and an arbitrary rooting of the reference tree. Thus, we have to mark the leaves of P on reference tree(s). We assign a new color -1 to the leaves of the query subtree P (Fig. <ref type="figure">1C</ref>). These leaves, which are missing from B , have a fixed color. We will also need new counters. Below, we focus on the case with one reference tree R but handling multiple reference trees follows trivially.</p><p>Algorithm 1 Solution to the Q-SPR problem. d u is the degree of u.</p><p>Algorithm 1 operates through a pre-order traversal of B . Visiting each node u, it colors leaves of the HDT representing R such that each side of u has a different color, with 1 assigned to the largest child and 0 assigned to leaves not under u (Fig. <ref type="figure">1C</ref>). Once leaves have the right color, it updates a set of auxiliary counters available on the HDT for all nodes that have been recolored (UpdateHDTCounters function). When these updates are all done, it has computed (line: 11) a set of counters for u, as defined in Lemma 1 and depicted in Fig. <ref type="figure">1B</ref>. Once the traversal of B is finished, using these counters, and the recursion given in Lemma 1 (see Fig. <ref type="figure">1B</ref>), the algorithm is able to count each quartet shared between R and B u &#8226; P . In the end, finding the edge with the maximum score is trivial as the score for each edge is computed.</p><p>Lemma 1 For any non-root node u &#8712; V B \{r B }, the quar- tet score can be calculated recursively using where &#981; i u , &#981; r u , and &#981; o u are defined as the number of shared solo quartets between the placement B u &#8226; P and R such that the associated triplet would be anchored at</p><p>The quartet score for the root of B, r B , is:</p><p>where C is a constant that can be ignored.</p><p>(</p><p>Proof Recall that according to Observation 1, only the solo quartets play a role in calculating the quartet score when comparing different placements of the query subtree P on the backbone tree B . We let the constant C include all non-solo quartets that are shared. Thus, we only count shared solo quartets in the other terms. At the root of the backbone tree B , it is evident that each shared solo quartet is precisely counted once at the anchor of its associated triplet. Thus, Eq. 3 corresponds to the quartet score resulting from placing P on r B .</p><p>For any non-root node u &#8712; V B \{r B } , we partition the leaves of B into three sets (those under it, under its sister, and the rest):</p><p>. We examine all possible scenarios in which the leaves of the associated triplets for the shared quartets between B u</p><p>&#8226; P and R can be distributed among these three sets. For each case, Table <ref type="table">3</ref> establishes that the shared quartets are counted precisely once in Eq. 2 (see Fig. <ref type="figure">1B</ref>). A quartet that is a shared quartet for both B u &#8226; P and B u &#8593;</p><p>&#8226; P (all but last case) is counted by S 4 (B u &#8593; &#8226; P, R) . In cases given in the second, third, and fourth lines, a quartet is also counted once with a positive sign and once with a negative sign by other counters. In the last case, only &#981; r u counts the quartet.</p><p>We update the meaning of HDT counters &#961; X and &#960; X j compared to tripVote. Assume in the preorder traversal of B , we are on node u. Then, &#961; X and &#960; X j count the number of solo quartets that belong to the component X of the HDT and match the solo quartets in the placement B u &#8226; P where the associated triplet is anchored at</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2 HDT components types</head><p>Each component in the HDT corresponds to a set of connected nodes in the reference tree R, and can have four types. To construct HDT, B13 follows the rules provided here to compose new components and transform existing ones. B13 show that these rules applied with the appropriate order result in a locally balanced hierarchy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Type Rule Description</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I initial</head><p>Corresponds to an internal node in R. Every I component is a leaf in the HDT.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>L initial</head><p>Corresponds to a leaf node in R. Every L component is a leaf of the HDT and is considered a C type as well.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G</head><p>Corresponds to a subtree of R (i.e., every node of R descendent from any node in a G is also in that component.)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C</head><p>Corresponds to a connected subset of nodes in R. The children of a C satisfy this: there exists an R node in one child that is the ancestor of an R node in the other child.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IG &#8594; C</head><p>An I and a G component can merge when the root of G is a child of the internal node in R represented by I.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>CC &#8594; C</head><p>Two C components can merge if the root of one in R is the ancestor of the other in R and they form a connected subset of nodes.</p><p>&#8226; &#960; X j (j &#8805; 1) : the node u in the alternative rerooting</p><p>where v j is a child of u.</p><p>Note that rerootings of B are hypothetical (no actual rerooting is performed). Each unrooted quartet in B will have some arbitrary rooting in R. Once again, it is easy to see that at the HDT root R , we can update the counters of the B from the HDT counters:</p><p>We are now ready to state a key result:</p><p>Lemma 2 The &#961; X and &#960; X j counters of the HDT component X can be updated in O(d 2 ) time assuming the coun- ters for children of X are already calculated.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Proof</head><p>We derive a set of recursions for each component, which depend on their type (Table <ref type="table">2</ref>). Since L types include a single leaf and I include no leaf, no quartet can belong to them; thus, &#961; X and &#960; X j are set to zero for these components.</p><p>Consider a component X of types IG &#8594; C , CC &#8594; C , or GG &#8594; G , each of which has two child components, denoted by X 1 and X 2 . A quartet q belongs to X of these types if q belongs to X1, q belongs to X2, or q has at least one leaf from each child of X. Thus, if we let &#961; X comb and &#960; X comb j count the part of &#961; X and &#960; X j coming from those quartets with at least one leaf from each child, then, clearly:</p><p>For IG &#8594; C components, &#961; X comb and &#960; X comb j are both 0 because the I component is a single internal node and cannot contain any leaves. For other types, we use auxiliary counters, which lead to 62 distinct cases. In short, (4)</p><p>&#8226; &#961; X comb : For these, the rooted quartets must have either ((i, j), (-1, 0)) or ((0, 0), (-1, i)) unrooted topologies, where i, j &#8712; <ref type="bibr">[1, d]</ref> . The recursive equations for these cases are given in Additional file 1: Tables <ref type="table">S1</ref> and<ref type="table">S2</ref>, respectively. Additional file 1: Fig. <ref type="figure">S1A</ref> demonstrates an example of how &#961; X is computed and also how &#981; r u is computed using &#961; X . &#8226; &#960; X comb j : For these, the rooted quartets must have the unrooted topology ((i, i), (-1, k)) where i, k &#8712; [0, d] and i, k = j . The recursions for auxiliary counters are given in Additional file 1: Table <ref type="table">S3</ref>.</p><p>Note that we have counted only resolved quartets. And all solo quartets have exactly a single -1 . Thus, for each HDT component, we have O(d 2 ) counters each of which is of the form (0, -1, i, j) for i, j &#8712; <ref type="bibr">[1, d]</ref> . Following all recursions in Additional file 1: Tables S1-S3, we can easily check that the cost of updating all the counters amortized over all counters is constant. This is because most counters require a constant time while a constant number of counters require O(d 2 ) time.</p><p>We next formally state that the number of leaf colorings is subquadratic.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 3</head><p>The total traversal of a tree B with N nodes using Algorithm 1 requires a total of N log(N ) leaf color- ing steps.</p><p>Proof The proof follows from B13 results and in particular its smaller-half trick. Algorithm 1 first colors all nodes as 1, using O(N) operations. Then, during the traversal, it recolors each node u of B only if u has a sibling that is larger or the same size; these nodes are colored as 1 when u is visited (line 25), as 2 &#8804; i &#8804; d when u &#8593; is visited (line 27), and as 0 right before existing u &#8593; (line 29). For a tree with N leaves, the sum of the number of leaves under all nodes that have a larger or equalsized sibling is O N log(N ) ; consider the worst case, a fully balanced tree, where the number is the solution to Table <ref type="table">3</ref> Cases for an associated triplet L t = {t 1 , t 2 , t 3 } of a shared solo quartet in recursive equation <ref type="bibr">(2)</ref>. In each case, the quartet is counted exactly once, as shown in column</p><p>, and</p><p>, and</p><p>Thus, the worst-case scenario for the total number of recolored nodes is O(N log(N )) .</p><p>where n is the size of the full tree and m is the size of the pruned subtree.</p><p>Proof The HDT data structure has a height of O(log(n)) by design <ref type="bibr">[22]</ref>. According to Lemma . By Lemma 1, we obtain the score for placing P on each branch using &#981; i u , &#981; r u , and &#981; o u values and a simple tree traversal; finally, we simply choose the placement branch (u &#8593; , u) that max- imizes the score.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Tree search using quartet SPR moves</head><p>Having access to an optimal and efficient SPR move, we can now design a standard hill-climbing tree search to find the quartet median tree with respect to a set of reference trees R. The algorithm begins with a starting tree T, obtained using any method. We draw non-root nodes of T randomly (without replacement) from some distribution resulting in a random permutation of non-root nodes of T, denoted as O T . For each node p in the order O T , we prune T &#8744; p from T to obtain T &#8743; p . We solve the Q-SPR problem to determine the optimal placement node p * of T &#8744; p . If p * &#65533; = p , we place T &#8744; p on p * to obtain the improved tree</p><p>p and use T * as the starting tree in the next round. If not, we apply the same approach to the complementary subtree of T &#8744; p , namely (T &#8853; p ) &#8744; p &#8593; . Let p &#8902; be the optimal placement of (T &#8853; p )</p><p>&#8744; p &#8593; on T &#8744; p . Similarly, if p &#8902; = p , we define the improved tree as</p><p>&#8744; p &#8593; , using T &#8902; as the starting tree in the subsequent round. We repeat this process for every node in O T until an improvement is obtained. The search stops if, in a round, we perform SPR moves on every node of T without finding an improvement in the quartet score. Given that the starting tree T has 2n -1 nodes, and we may need to find the optimal placement for all subtrees and their complementary subtrees, the total running time for one round of SPR for all nodes of T has a complexity of O(kd 2 n 2 log 2 (n)).</p><p>To fully specify this algorithm, we need to specify the distribution under which we draw without replacement (i.e., permutate) nodes of the tree. A simple choice is a uniform distribution, and we will explore that choice. However, our experiments in section E3. Design of heuristics reveal intriguing empirical patterns regarding the probability of an SPR move improving the quartet score. To make our search algorithm faster, we explored heuristics that assign higher probabilities to nodes with a higher empirical probability of improving the tree. Each node u &#8712; V T is assigned a weight w u calculated using a combination of various heuristics explained below. Each heuristic assigns a value h(u) &#8712; (0, 1] to u. When multiple heuristics are employed, these values are summed to yield the final weight w u of a node. Subsequently, these weights are normalized using &#373;u = w u / v&#8712;V T w v . At the beginning of each round of the search algorithm, to obtain the random permutation O T , determining the order in which nodes of T are visited in that round, each node u is drawn without replacement with probability &#373;u . We use three heuristics, as explained below.</p><p>Size of the subtree and its surrounding subtrees. The outcomes of our experiments reveal an interesting trend in the early stages of the search: Subtrees with sizes closer to n/2 or possessing siblings with this characteristic tend to have a higher probability of enhancing the tree. Nevertheless, as the score of the tree improves, only very small or very large subtrees appear to significantly influence the quartet score. For a non-root node u &#8712; V T and its siblings {s 1 , ...,</p><p>| be the size of the subtree corresponding to u and its siblings, respectively. We define the subtree impact i u as:</p><p>The heuristic h 1 (u) is defined using the subtree impact i u . In practice, we noticed that prioritizing subtrees with larger i u score in the initial round and subtrees with smaller subtree impact in the later rounds (e.g., round &gt; 10 ) results in the best running time. Thus, we assign h 1 (u) using a sigmoid function as follows:</p><p>Distance from the source and destination of the previous round. Our findings in section E3. Design of heuristics</p><p>also indicate that the subtrees around the two nodes associated with the applied SPR move in a given round have a higher likelihood of enhancing the tree in the subsequent round. Let T &#8744; p be the subtree that moved in the previous SPR round. Let ps and pd &#8712; V T correspond to the parent of the root of T &#8744; p before and after the SPR move, respectively. For a node u &#8712; V T , we define D T (u, ps) as the number of edges on the undirected path from the sister of u to ps in T. Similarly, D T (u, pd) is defined as the nodal distance between the sister of u and pd. The reason we use distance to the sister is that after a node u is pruned, u and its parent u &#8593; will be absent, so, the sister to u is the closest remaining node. Similarly, ps and pd correspond to the nodes of T that were unchanged when performing SPR move on T &#8744; p and are the sister to p before and after being moved, respectively. Therefore, D T (p, pd) = 0.</p><p>We define two heuristics h 2 (u) and h 3 (u) based on D T (u, ps) and D T (u, pd) , respectively, as:</p><p>For the initial round ( round = 1 ), we set h 2 (u) = 1 and h 3 (u) = 1 for all u &#8712; V T . In a specific scenario where (Distance from ps)</p><p>(Distance from pd)</p><p>D T (u, pd) = 0 , signifying that either T &#8744; u or T &#8743; u is the subtree on which the last round's SPR move was applied, leading to a minimal chance of improvement for the node u, we assign a very small weight to u by setting D T (u, pd) = 2 log n.</p><p>Starting tree. The user can input a starting tree, or, in cases where such a tree is not provided, we employ the following strategy to construct one: Beginning with a rooted three-taxon tree, where the taxa are randomly selected from the leaf set of the reference trees, we iteratively employ tripVote to place each additional taxon with respect to the reference trees until a complete tree is formed. We opt for tripVote in this process because, as evidenced by our experiments (refer to Fig. <ref type="figure">2</ref>), it demonstrates faster performance when the query subtree is particularly small ( |L P | &#8804; 2 ). We use tripVote in the default setting which also subsamples short quartets.</p><p>Implementation details. Our implementation of the Q-SPR algorithm, along with a comprehensive searchbased method for finding the median quartet tree given a set of reference trees is publicly accessible on GitHub. This code is built upon tripVote, which, in turn, relies on the tqDist library <ref type="bibr">[32]</ref>. tqDist is a library that calculates triplet and quartet scores between two trees, employing the B13 algorithm. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experimental setup</head><p>We include four experiments. E1) An analysis of the time complexity of the Q-SPR algorithm, comparing it to a modified version of tripVote capable of quartet-based SPR moves. E2) A comparison of the hill-climbing search method for finding the median quartet tree with the state-of-the-art tool ASTRAL-III. E3) Exploration of various heuristic approaches to enhance Q-SPR speed. E4) The use of Q-SPR as a subsequent step to widely-used tree estimation tools like ASTRAL-III and ASTER to further enhance their optimality and accuracy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E1: Running time comparison</head><p>We compared Q-SPR to a modified version of tripVote, as described in Observation 2. We used an existing 10,000taxon simulated dataset <ref type="bibr">[33]</ref> including 10 replicates with gene trees disagreeing with the species tree due to both ILS and Horizontal Gene Transfer (HGT), as simulated by SimPhy <ref type="bibr">[34]</ref>. We used the true species tree as the query tree T and the available gene trees estimated using FastTree-II <ref type="bibr">[35]</ref> as the reference set. Note that the gene trees include polytomies. We randomly selected n &#8712; {50, 100, 200, 500, 1000, 10000} taxa for each replicate and pruned both query trees and the reference trees to contain only the selected n taxa. We also subsampled the gene trees randomly to obtain k = 100 trees per repli- cate. For each replicate, we applied a single round of SPR on every subtree P of the query tree T and measured the time each method took to find the optimal placement. For trees of size n &#8805; 1000 , we restricted these analyses to subtrees of size m &#8804; 70 . In addition to the running times, we computed the quartet score S 4 (T &#8743; p v</p><p>&#8226; P, R) for every node v of T &#8743; p and compared the scores of Q-SPR to the scores of modified tripVote to ensure the correct implementation of Q-SPR.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E2. Full quartet median tree search using heuristics</head><p>In this experiment, we tested the performance and accuracy of our hill-climbing search method in comparison to ASTRAL-III. The dataset used for this experiment was an existing Simphy-simulated ILS-only 200-taxon dataset <ref type="bibr">[21]</ref>, simulated under three levels of ILS (tree height &#8712; {5 &#215; 10 5 , 2 &#215; 10 6 , 10 7 } corresponding to high, medium, and low levels of ILS, respectively). For each ILS level, we considered the 50 replicates with the speciation rate of 10 -6 and a set of either 50 or 200 estimated gene trees as our reference tree set. These gene trees can have polytomies, and trees that have less than twice the number of nodes of a fully resolved tree are removed, as done by Mirarab and Warnow <ref type="bibr">[21]</ref>; thus, the actual input can include fewer than 50 or 200 gene trees. We generated the starting trees from the reference set using the method described in section Tree search using quartet SPR moves. For each replicate, we evaluated our optimization method compared to ASTRAL-III in terms of the normalized quartet score of the final tree with respect to the gene trees (i.e., the optimization score). We also compared the accuracy of the tree produced by the two methods, comparing them to the true species tree in terms of quartet score <ref type="bibr">[3]</ref> and RF (Robinson-Foulds) distance <ref type="bibr">[36]</ref>. Finally, we compared the running time of our method, including the building of the starting tree, with the running time of ASTRAL-III.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E3. Exploring heuristic approaches</head><p>To develop heuristics for improving the effectiveness of moving each subtree, we investigated how characteristics of each subtree predict the likelihood of enhancing the quartet score when moved to the optimal position. We changed the search algorithm to execute all feasible SPR moves in each round, recording the improvement in the quartet score if any was achieved without updating the query tree. At the conclusion of each round, the SPR move with the greatest improvement was applied to the query tree. For each subtree, we examined characteristics such as the size of the subtree and its neighboring subtrees, the nodal distance from the root of the subtree to the nodes associated with the optimal SPR move in the preceding round, and the number of applied SPR moves (i.e., the number of completed rounds). These were chosen among a larger set of metrics examined (not shown) which did not show as much predictive power. Subsequently, we explored whether these characteristics can predict the likelihood of improvements in the score resulting from an SPR move. Based on the outcomes of this experiment (section E3. Design of heuristics), we formulated three heuristic functions outlined in section Tree search using quartet SPR moves. We investigated how any combination of these functions affects the running time of our search algorithm. Where not explicitly specified, the combination of all three methods was utilized as the default method. Additionally, we explored whether the use of heuristics accelerates the convergence of the search algorithm to the optimal score.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E4. Improving ASTER and ASTRAL-III trees</head><p>We conducted an experiment using the highly optimized trees produced by ASTER and ASTRAL-III as inputs for our search algorithm to assess whether the optimality and accuracy of these trees could be further improved through additional SPR moves. In this experiment, we evaluated ASTRAL-III and ASTER, with the latter being a newer algorithm shown to outperform ASTRAL-III in handling missing data <ref type="bibr">[24]</ref>. We used the gene trees made available by Zhang and Mirarab <ref type="bibr">[24]</ref> who modified the 200-taxon dataset of Mirarab and Warnow <ref type="bibr">[21]</ref> used in E2 to remove approximately 5% of taxa at random from each estimated gene tree. We used this version with missing data because Zhang and Mirarab <ref type="bibr">[24]</ref> documented that the presence of only a small number of missing data can impact the optimality of ASTRAL-III. Other settings of the dataset are identical to E2.</p><p>To measure improvements after running Q-SPR, we compared the starting tree and its output against the gene trees and the true known species tree. We report the quartet score and the normalized Robinson-Foulds (nRF) <ref type="bibr">[36]</ref> metric. In addition, we used ASTRAL-III to compute the local posterior probability (PP) <ref type="bibr">[37]</ref> and the coalescent unit length for each internal branch for each tree with respect to the gene tree. ASTRAL-III sets branch lengths to zero when the frequency of the species tree quartet topology is less than 1/3 among gene tees, a pat- tern that is unexpected under MSC with a large enough number of gene trees. Similarly, localPP would be set to less than 1/3 under those conditions. We evaluated sup- port and length for internal branches, with a particular focus on the unexpected cases with zero branch length or localPP less than 1/3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E1: running time versus m and n</head><p>Theoretical asymptotic results match our empirical running time measurements (Fig. <ref type="figure">2</ref>). Recall that the running  time of Q-SPR for moving a subtree of size m from a tree of size n is O(k(nm) log(nm) log(n)) compared to O(km(nm) log(nm) log(n)) for tripVote. For fixed n = 500 and changing m, the asymptotic advantage of Q-SPR over tripVote is clear (Fig. <ref type="figure">2A</ref>). Matching theory, the running time of Q-SPR is nearly independent of m (ranging between 9 and 16 seconds with a mean of 13.5). The running time of tripVote increases linearly with m, again, as expected. Interestingly, tripVote is faster for subtrees of size one or two ( m &#8804; 2 ). This is because tripVote has fewer counters to maintain than Q-SPR, and thus has a smaller constant factor. The benefit of Q-SPR appears only for larger m values; e.g., for m &gt; 100 , Q-SPR is 73 times faster than tripVote on average. Note that the Q-SPR running time reduces slightly with m. To understand why, note that increasing m decreases the size of the backbone tree B , which is nm . Thus, as m increases, Q-SPR can become faster because it depends on the size of B and not m.</p><p>When we change the tree size n and apply SPR to mid-size subclades ( 30 &lt; m &lt; 70 ), tripVote and Q-SPR have similar running time growth rates (Fig. <ref type="figure">2B</ref>). This is because the theoretical running time of both methods is quasi-linear with respect to n; empirically, the observed log-log slope is slightly above 1.0, matching these expectations. However, note that for all n, Q-SPR is faster than tripVote by 10 to 31 times (mean: 16). This pattern also matches the theoretical expectations because Q-SPR is faster than tripVote by a factor of &#65533;(m).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E2: Tree estimation using Q-SPR search</head><p>Q-SPR obtains a better optimization score than ASTRAL-III in 66 out of 294 cases, while ASTRAL-III has a better score in 15 cases (Fig. <ref type="figure">3A</ref>). Moreover, Q-SPR improvements are more substantial than ASTRAL. Averaged over all cases, Q-SPR achieves a 0.012% higher normalized quartet score than ASTRAL, with this difference being close to 0.061% when considering only the cases where Q-SPR outperforms ASTRAL. These improvements are particularly significant for the high ILS (tree height = 5 &#215; 10 5 ) and k = 50 model condition, showing a 0.068% improvement overall and 0.092% for the cases with a higher score. These improvements in quartet scores are despite the fact that the starting trees of Q-SPR have substantially lower scores-often 1-3% (Fig. <ref type="figure">3B</ref>). In all but a handful of cases, Q-SPR manages to reach scores close to or above ASTRAL-III. The number of rounds of SPR needed ranges from as few as 4 and as high as 271, with more rounds needed for higher ILS and fewer genes.</p><p>Enhancing the quartet score with respect to the gene trees does not meaningfully impact the tree accuracy (Table <ref type="table">4</ref>). Exact conditions where one method outperforms the other depend on the choice of the metric, but in all cases, changes in accuracy are small compared to variation across replicates. Only 35% and 59% of the cases with improved quartet scores also exhibit higher accuracy compared to ASTRAL-III in terms of nRF and quartet score.</p><p>In terms of running time, ASTRAL-III is substantially faster than the Q-SPR search (Additional file 1: Table <ref type="table">S4</ref>). On average, ASTRAL is 124 and 306 times faster than Q-SPR search for k = 50 and k = 200 , respectively. It is important to note that although ASTRAL-III is faster in practice, all our inputs had low number of genes k. The running time of one round of Q-SPR grows linearly with the number of genes k, while ASTRAL-III running time grows quadratically with k.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E3. Design of heuristics</head><p>The size of the pruned subtree has predictive power for the probability of improvement of an SPR move in ways that change with the number of passed rounds (Fig. <ref type="figure">4A</ref>). In the initial rounds, the subtrees with a size closer to n/2 have a higher likelihood of improving the tree when moved by an SPR move. However, as the rounds progress and the tree becomes closer to optimal, this pattern reverses, and subtrees with either very small or very large sizes become more likely to improve the score. This observation can be explained. An SPR move on midsized subtrees contributes the most to change in tree topology and change in the quartet score. In the initial rounds, when the tree is far from optimal, moving these subtrees exerts the most impact. However, as the search progresses and the tree becomes closer to optimal, these mid-sized subtrees become less likely to move. Consequently, very small or very large subtrees gain higher chances of improvement in later rounds. Note that very small and very large subtrees are equivalent because at each node p, we examine moving both T &#8744; p and T &#8743; p . The same pattern is observed when considering the impact of the size of the sibling of the pruned node on the likelihood of improvement (Fig. <ref type="figure">4B</ref>). These observations are the impetus for the first heuristic given in <ref type="bibr">(5)</ref>.</p><p>Another intriguing pattern is the impact of the distance of a subtree from the areas affected by the SPR move in the previous round (Fig. <ref type="figure">4C</ref>). Our experiment suggests a clear correlation between these distances and the likelihood of improvement. Nodes close to either the source or the destination of the previous successful SPR move have a higher chance of having a successful SPR of their own. This result is the basis for heuristics h 2 and h 3 .</p><p>The set of three heuristics results in moderate reductions in running time (Fig. <ref type="figure">4D</ref>). Using all of the three heuristics described in section Tree search using quartet SPR moves reduces the running time by 15 min on average (Additional 1: Table <ref type="table">S4</ref>). Interestingly, it appears that combining at least two of the three metrics is needed to obtain improved speeds. Our results from E2 also show The impact of the size of a subtree (A) or its sister (B) on the probability that an SPR applied to that subtree leads to an improved quartet score. Panels show the impact on the first round, middle rounds <ref type="bibr">(1,</ref><ref type="bibr">10]</ref>, and final runs <ref type="bibr">(10,</ref><ref type="bibr">50]</ref>. The starting tree is the result of step-wise additions. In the first round, subtrees with a size around n/2 have a higher probability of improvement while in the final rounds, small and larger subtrees are likely to improve speed. C Improvement probability of an SPR move compared to the distance to the source (ps) or the destination (pd) of the previous move. For each node p, we show the distance from its sister (i.e., the closest node left in the tree after we remove T &#8744; p ) to the node above which the previous SPR move was placed (pd) or the sister of the node that was moved in the previous SPR move (ps). Subtrees close to the previous source or destination have a higher probability of improving the score. The reduction at distance 0 for pd is because this case represents an attempt to move the previously moved node p, or its complement T &#8743; p , and the former by construction has 0 probability of moving because it is already in its optimal position. D Comparison of the running time for a full Q-SPR search run between different combinations of heuristic methods. The building time for the starting tree is also included.   that using heuristic approaches results in a faster optimization of the tree in terms of the number of rounds (Fig. <ref type="figure">3B</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E4. Improving ASTER and ASTRAL-III trees</head><p>Out of 600 replicate runs tested, Q-SPR improves the quartet score (i.e., the optimization criterion) compared to the ASTRAL-III or ASTER starting trees in 129 replicates. However, patterns of improvement in the quartet score depend heavily on the level of ILS and the starting tree method (Fig. <ref type="figure">5A</ref>). ASTRAL-III is improved more than ASTER, consistent with results of <ref type="bibr">[24]</ref>, showing that ASTRAL-III output can be suboptimal for cases with missing data and relatively few input trees. Also, improvements are more pronounced for 50 genes compared to 200 and for medium ILS compared to low ILS. While the improvement in quartet score can be up to 6% in rare cases, in most cases it is under 0.5% (Additional file 1: Fig. <ref type="figure">S3</ref>).</p><p>Better optimization scores do not consistently lead to substantially improved species trees (Fig. <ref type="figure">5B</ref>). For low and medium ILS, accuracy rarely changes, while some improvements are observed for high ILS, in particular with respect to ASTER (Fig. <ref type="figure">5C</ref>). Out of the 129 cases with improved quartet scores, the Q-SPR tree was more accurate in only 45 or 68 cases, in terms of nRF or quartet distance, respectively. However, in a substantial number of cases (84 or 61, for nRF and quartet distance, resp.), the improved optimization score led to reduced accuracy. It should be noted that when the quartet score improves but accuracy degrades, the reductions are small (mean: 0.96% nRF). When the quartet score does improve, the improvements in accuracy can be up to 11.9% (mean: 1.91% nRF). Cases with high accuracy improvement tend to be those with higher increases in the quartet score (Fig. <ref type="figure">5B</ref> and Additional file 1: Fig. <ref type="figure">S3</ref>). Beyond accuracy, local support values also change as a result of running Q-SPR (Additional file 1: Fig. <ref type="figure">S4</ref>). In particular, the output trees include fewer branches that have support below 1/3 and branch length 0, which is not expected under the MSC model. Interestingly, it appears that compared to ASTER, Q-SPR has more branches with 100% support as well.</p><p>The progress of Q-SPR across rounds shows high variation across replicates and model conditions (Additional file 1: Fig. <ref type="figure">S2</ref>). While the mean number of rounds is 1 for low ILS, 200 genes condition, for high ILS, 50 genes, we need 13.2 rounds on average. It also appears that improved accuracy is often obtained in challenging datasets where the quartet score is low, to begin with, while substantial improvements in quartet score often do not improve accuracy if the initial quartet score is high (Fig. <ref type="figure">5E</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Discussion</head><p>Our main algorithmic contribution in this paper was showing how to find the optimal SPR move for quartet distance in time that grows quasi-linearly with the size of the tree. The best previous method for solving this method was repeated applications of tripVote (Observation 2), which is asymptotically slower than our method by a factor of n. Using our efficient algorithm for SPR moves, we were able to build a hill-climbing method for inferring species trees from gene trees.</p><p>Our resulting method, Q-SPR, was slower than ASTRAL-III and no more accurate than it. While this observation reduces the immediate impact of Q-SPR in practice, it does get us close to answering a fundamental question: Is the combined scalability and accuracy of ASTRAL-III due to its dynamic programming algorithm?</p><p>The answer seems to be yes, as employing the traditional hill-climbing approach achieves essentially the same accuracy but at a much higher running time. The implication of this observation for future work is that perhaps using a dynamic programming algorithm constrained to a predefined search space for phylogenetic inference problems other than quartet median tree can improve their scalability and accuracy as well.</p><p>We showed that Q-SPR can improve on ASTRAL-III and ASTER in terms of the objective function if those are used as the starting tree. On a practical level, these improvements are useful as they eliminate cases with very low local posterior probability support, particularly those with support below 1/3 . However, the fact that topological accuracy does not improve despite improvements in quartet score is interesting. The reason seems to be that branches that change tend to be low support branches that are uncertain in both resolutions. In other words, the imperfect correlation between the quartet score and accuracy given a limited number of genes has reduced the impact of improving the quartet score beyond what heuristics such as ASTER and ASTRAL-III achieve. In practice, the main benefit of following ASTRAL-III or ASTER with Q-SPR is to 1) test whether differences between outputs of these methods and alternative analyses (e.g., concatenation) can be explained by lack of optimality as opposed to other explanations, and 2) eliminate problematic branches with 0 length or support &lt; 1/3.</p><p>In a hill-climbing search, if one can prioritize what branches to examine, the search may converge faster. We identified some potential ways of making such predictions and observed moderate improvements in running time as a result. We leave it to future work to examine whether more elaborate methods, such as those proposed by Azouri et al. <ref type="bibr">[38]</ref> can further improve accuracy. Such future work should also examine the impact of starting</p></div></body>
		</text>
</TEI>
