<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Phylogenetic trees</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10253595</idno>
					<idno type="doi">10.2140/jsag.2021.11.1</idno>
					<title level='j'>The journal of software for algebra and geometry</title>
<idno>1948-7916</idno>
<biblScope unit="volume">11</biblScope>
<biblScope unit="issue">1</biblScope>					

					<author>Nathaniel Bushek Hector Banos</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We introduce the package PhylogeneticTrees for Macaulay2, which allows users to compute phylogenetic invariants for group-based tree models. We provide some background information on phylogenetic algebraic geometry and show how the package PhylogeneticTrees can be used to calculate a generating set for a phylogenetic ideal as well as a lower bound for its dimension. Finally, we show how methods within the package can be used to compute a generating set for the join of any two ideals.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>MOTIVATION. A central problem in phylogenetics is to describe the evolutionary history of n species from their aligned DNA sequences. One way to do this is through a model-based approach. A phylogenetic model is a statistical model, specified parametrically, of molecular evolution at a single DNA site. We can regard the aligned sequences as a collection of n-tuples of the four DNA bases, one from each site. Each choice of parameters results in a probability distribution on the 4 n possible n-tuples. The goal of model-based reconstruction is to find a choice of parameters that yields a distribution close to the empirical distribution. If we are able to do so then it is reasonable to assume that the model is an accurate reflection of the underlying evolutionary process. Most significantly, we can infer that the underlying tree parameter of the phylogenetic model is the evolutionary tree of the species under consideration. MATHEMATICAL BACKGROUND. In phylogenetic algebraic geometry, the statistical models under consideration are tree-based Markov models. This means that we assume a Markov process proceeds along a tree with a transition matrix associated to each edge. A &#954;-state phylogenetic model on an n-leaf tree T induces a polynomial map from the parameter space T &#8838; &#8235;&#1938;&#8236; m to the probability simplex &#954; n -1 ,</p><p>The image of this map is the set of all probability distributions we obtain by varying the entries of the transition matrices; we refer to the image as the model M T . For phylogenetic applications, usually &#954; = 2 or &#954; = 4. The Zariski closure of the model V T := M T &#8838; &#8235;&#1923;&#8236; &#954; n is an affine algebraic variety. For the models we consider, the entries of &#968; T are homogeneous polynomials of uniform degree. Thus, V T can be viewed 2 Ba&#241;os, Bushek, Davidson, Gross, Harris, Krone, Long, Stewart and Walker &#172;&#172;&#172;&#172; Phylogenetic trees as the affine cone of a projective variety in &#8235;&#1936;&#8236; &#954; n -1 . The ideal I T := I (V T ) &#8838; &#8235;[&#1923;&#8236;x 1 , . . . , x &#954; n ] of all polynomials vanishing on V T is a homogeneous ideal called the ideal of phylogenetic invariants; this ideal carries useful information about the model that can be used for determining model identifiability and performing model selection [E. S. <ref type="bibr">Allman and Sullivant 2011;</ref><ref type="bibr">Allman et al. 2012;</ref><ref type="bibr">Cavender and Felsenstein 1987;</ref><ref type="bibr">Casanellas and Fern&#225;ndez-S&#225;nchez 2006;</ref><ref type="bibr">A. 1987;</ref><ref type="bibr">Long and Sullivant 2015;</ref><ref type="bibr">Matsen et al. 2008;</ref><ref type="bibr">Matsen and Steel 2007;</ref><ref type="bibr">Rhodes and Sullivant 2012;</ref><ref type="bibr">Rusinko and Hipp 2012]</ref>.</p><p>Given two models, M T 1 and M T 2 , we define the mixture model M T 1 * M T 2 to be the image of the map</p><p>(1)</p><p>As before, we take the Zariski closure M T 1 * M T 2 &#8838; &#8235;&#1923;&#8236; &#954; n , and now we obtain the algebraic variety</p><p>The join of two algebraic varieties V and W embedded in a common ambient space is the variety</p><p>In the special case when V = W, the join variety V * V is called the secant variety of V . Similarly, given two ideals</p><p>x n ] is the join ideal. As for varieties, if I 1 = I 2 , the ideal I 1 * I 2 is the secant ideal. We refer to <ref type="bibr">[Sturmfels and Sullivant 2006, Section 2]</ref> for the definition of I 1 * I 2 , but note the following important property:</p><p>Our package provides a means of computing invariants for those working in phylogenetic algebraic geometry. We handle a class of commonly used models called group-based models that have special restrictions on the entries of the transition matrices. These entries are indexed by elements of a group and thus are subject to the Fourier-Hadamard coordinate transformation, which makes the parametrization monomial and the ideals toric <ref type="bibr">[Evans and Speed 1993;</ref><ref type="bibr">Sz&#233;kely et al. 1993</ref>]. We will refer to the original coordinates, which represent leaf probabilities, as probability coordinates and the transformed coordinates as Fourier coordinates; furthermore, following the literature, we will use p for probability coordinates and q for Fourier coordinates. For these group-based models, we implement a theoretical construction for inductively determining the ideal of phylogenetic invariants for any tree from the invariants for claw trees <ref type="bibr">[Sturmfels and Sullivant 2005]</ref>. We also handle the join and secant ideals formed from these ideals, which allows for computations involving mixture models.</p><p>FUNCTIONALITY FOR TORIC PHYLOGENETIC VARIETIES. As an example, let T be the four-leaf tree illustrated in Figure <ref type="figure">1</ref> and consider the Cavender-Farris-Neyman (CFN) model, a two-state group-based model, on T . Then the toric ideal I T is generated in degree 2. Using our package, we can compute a generating set for I T using two different methods, phyloToric42 and phyloToricFP. The first method phyloToric42 calls FourTiTwo.m2, the <ref type="bibr">[Macaulay2]</ref> interface to [4ti2], a software package with functionality for computing generating sets of toric ideals. We input T by its set of nontrivial splits. In this instance, 01|23 is the only nontrivial split of T , which we can enter as either {0, 1} or {2, 3}. The indices on the q's correspond to two-state labelings of the four leaves of T :</p><p>Macaulay2, version 1.7 i1: load "PhylogeneticTrees.m2" i2: n = 4; T = {{0,1}}; M = CFNmodel; i5: toString phyloToric42(n,T,M) o5: = ideal(-q_(0,1,1,0)*q_(1,0,0,1)+q_(0,1,0,1)*q_(1,0,1,0), -q_(0,0,1,1)*q_(1,1,0,0)+q_(0,0,0,0)*q_(1,1,1,1))</p><p>The second method phyloToricFP computes generators of I T using Theorem 24 in <ref type="bibr">[Sturmfels and Sullivant 2005]</ref>; the "FP" in the method name stands for fiber product <ref type="bibr">[Sullivant 2007]</ref>. For this example, this theorem allows us to explicitly construct a generating set of I T from generators of I K 1,3 , the ideal associated to the CFN model on the claw tree K 1,3 . While our example here is binary, we note that this method is implemented for all trees, binary or not.</p><p>i6: = toString phyloToricFP(n,T,M) o6: = ideal(-q_(0,0,1,1)*q_(1,1,0,0)+q_(0,0,0,0)*q_(1,1,1,1), q_(0,0,1,1)*q_(1,1,0,0)-q_(0,0,0,0)*q_(1,1,1,1), q_(0,0,1,1)*q_(1,1,0,0)-q_(0,0,0,0)*q_(1,1,1,1), -q_(0,0,1,1)*q_(1,1,0,0)+q_(0,0,0,0)*q_(1,1,1,1), -q_(0,1,1,0)*q_(1,0,0,1)+q_(0,1,0,1)*q_(1,0,1,0), q_(0,1,1,0)*q_(1,0,0,1)-q_(0,1,0,1)*q_(1,0,1,0), q_(0,1,1,0)*q_(1,0,0,1)-q_(0,1,0,1)*q_(1,0,1,0), -q_(0,1,1,0)*q_(1,0,0,1)+q_(0,1,0,1)*q_(1,0,1,0))</p><p>The algorithm used by phyloToricFP returns more polynomials than are required to generate the ideal. If we wish to directly compare this ideal to that returned by phyloToric42 we must reconstruct both ideals in the same ring. Thus, we use the function qRing to define the ring of Fourier coordinates and use the option of specifying the ring for our ideals: In our experiments, for most cases, phyloToric42 runs much faster than phyloToricFP. This is likely because we have implemented a naive version of the toric fiber product algorithm from <ref type="bibr">[Sturmfels and Sullivant 2005]</ref> with no attempt to avoid producing redundant polynomials. It would be worth investigating if there is a faster implementation of this algorithm. Still, one advantage offered by the fiber product method is the ability to inductively construct a single invariant when computing the entire ideal is infeasible. The method phyloToricRandom returns such a randomly constructed invariant.</p><p>The polynomials that are returned by both methods are in Fourier coordinates, however, they can be converted to probability coordinates using the function fourierToProbability. To do so, we must first construct the ring of probability coordinates using pRing. Then the method fourierToProbability returns a ring map that converts polynomials in Fourier coordinates to probability coordinates:</p><p>FUNCTIONALITY FOR SECANT VARIETIES. Mixtures of group-based phylogenetic models correspond to secants and joins of toric ideals, objects that are of interest in combinatorial commutative algebra, but are notoriously hard to compute. In the methods joinIdeal and secant, we implement the elimination method described in <ref type="bibr">[Sturmfels and Sullivant 2006, Section 2]</ref> for computing the join of two homogeneous ideals or the secant of one homogeneous ideal. Consider now the Jukes-Cantor model on T from Figure <ref type="figure">1</ref>. The phylogenetic ideal for the mixture of M T with itself is the second secant ideal of the homogeneous ideal I T , denoted I T * I T . For secants, the method secant takes as input a homogenous ideal and an integer k and returns a generating set for the k-th secant ideal. The method also accepts the optional argument DegreeLimit=&gt;{l}, which computes generators of the ideal only up to degree l. Thus, we can obtain generators of degree 3 or less of I T * I T with the following commands. The minimal generating set of SecI3 contains 49 linear invariants; we only print the generators with degree greater than one:</p><p>i13: = I = phyloToric42(n,T,JCmodel); i14: SecI3 = secant(I,2,DegreeLimit={3}); i15: toString for i in flatten entries mingens SecI3 list (if (degree i)#0 == 1 then continue; i) o15 = {q_(0,3,3,0)*q_(3,0,2,1)*q_(3,2,0,1)-q_(0,3,2,1)*q_(3,0,3,0)*q_(3,2,0,1) +q_(0,3,2,1)*q_(3,0,0,3)*q_(3,2,1,0)-q_(0,3,0,3)*q_(3,0,2,1)*q_(3,2,1,0) -q_(0,3,3,0)*q_(3,0,0,3)*q_(3,2,3,2)+q_(0,3,0,3)*q_(3,0,3,0)*q_(3,2,3,2)}</p><p>The degree bound allows for the possibility of obtaining some invariants when computing a generating set for the secant ideal is infeasible. Let (I T * I T ) l be the ideal generated by the elements of I T * I T of degree less than or equal to l. In some instances (I T * I T ) l may be equal to I T * I T . To prove this, we must verify that dim((I T * I T ) l ) = dim(I T * I T ) and that (I T * I T ) l is prime. Assuming we are able to compute (I T * I T ) l , we can compute its dimension and verify that it is prime. We then know that dim((I T * I T ) l ) &#8805; dim(I T * I T ), leaving the inequality dim((I T * I T ) l ) &#8804; dim(I T * I T ) to show. The method toricSecantDim enables us to do this using a probabilistic method based on Terracini's lemma <ref type="bibr">[1911]</ref> to compute a lower bound on dim(I T * I T ).</p><p>Using this method, we can show that the secant of the ideal from the previous example is in fact generated in degree less than three: In the code above, we used phyloToricAMatrix(n,T,JCmodel) to construct the defining integral matrix of the toric ideal. For more details, see the documentation and <ref type="bibr">[Sturmfels 1996</ref>]. In this instance, the method outlined is substantially faster than using the secant method without a degree bound. ADDITIONAL FUNCTIONALITY. Although this package was developed with toric ideals from phylogenetics in mind, the methods secant and joinIdeal can be used for any homogeneous ideals. Thus, these can be employed for computations outside of phylogenetic algebraic geometry.</p><p>The following models are loaded with the package: the Cavender-Farris-Neyman model, the Jukes-Cantor model, the Kimura 2-parameter model, and the Kimura 3-parameter model. Additionally, some functionality for working with trees is available in this package, which includes the methods edgeCut, vertexCut, edgeContract, internalEdges, internalVertices. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>REFERENCES.</head><p>[4ti2] 4ti2 team, "4ti2: a software package for algebraic, geometric and combinatorial problems on linear spaces", available at www.4ti2.de.</p></div></body>
		</text>
</TEI>
