<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Aether: leveraging linear programming for optimal cloud computing in genomics</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>12/08/2017</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10145584</idno>
					<idno type="doi">10.1093/bioinformatics/btx787</idno>
					<title level='j'>Bioinformatics</title>
<idno type="issn">1367-4803</idno>
<biblScope unit="volume">34</biblScope>
<biblScope unit="issue">9</biblScope>					

					<author>Jacob M Luber</author><author>Braden T Tierney</author><author>Evan M Cofer</author><author>Chirag J Patel</author><author>Aleksandar D Kostic</author><author>Bonnie Berger</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Motivation: Across biology, we are seeing rapid developments in scale of data production without a corresponding increase in data analysis capabilities. Results: Here, we present Aether (http://aether.kosticlab.org), an intuitive, easy-to-use, cost-effective and scalable framework that uses linear programming to optimally bid on and deploy combinations of underutilized cloud computing resources. Our approach simultaneously minimizes the cost of data analysis and provides an easy transition from users' existing HPC pipelines.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Aether</head><p>Data accumulation is outpacing Moore's law, which itself continues to hold only because of advances in parallel chip architecture <ref type="bibr">(Esmaeilzadeh et al., 2013)</ref>. Fortunately, the shift away from in-house computing clusters to cloud infrastructure has yielded approaches to computational challenges in biology that both make science more reproducible and eliminate time lost in high-performance computing queues <ref type="bibr">(Beaulieu-Jones and Greene, 2017;</ref><ref type="bibr">Garg et al., 2011)</ref>; however, existing off-the-shelf tools built for cloud computing often remain inaccessible, cumbersome and, in some instances, costly.</p><p>Solutions to parallelizable compute problems in computational biology are increasingly necessary; however, batch job-oriented cloud computing systems, such as Amazon Web Services (AWS) Batch, Google preemptible Virtual Machines (VMs), and Apache Spark and MapReduce implementations, are either closed source, restrictively licensed or locked into their own ecosystems, making them inaccessible to many bioinformatics labs <ref type="bibr">(Shvachko et al., 2010;</ref><ref type="bibr">Yang et al., 2007)</ref>. Other approaches for bidding on cloud resources exist, but they neither provide implementations nor interface with a distributed batch job process with a backend implementation of all necessary networking <ref type="bibr">(Andrzejak et al., 2010;</ref><ref type="bibr">Tordsson et al., 2012;</ref><ref type="bibr">Zheng et al., 2015)</ref>.</p><p>Our proposed tool, Aether, leverages a linear programming (LP) approach to minimize cloud compute cost subject to constraints on user needs and cloud capacity, parameterized by core count, RAM and in-node solid-state drive space. Specifically, certain types of instances are allocated to large web service providers (e.g. 
Netflix) and auctioned on a secondary market when they are not fully utilized <ref type="bibr">(Zheng et al., 2015)</ref>. Users bid against each other for this already purchased but unused compute time at extremely low rates (up to 90% off the listed price; <ref type="url">https://aws.amazon.com/ec2/pricing/</ref>). However, this market is not without its complexities. For instance, significant price fluctuations, up to an order of magnitude, can lead to early termination of multi-hour compute jobs (Fig. <ref type="figure">1A</ref>). Clearly, bidding strategies must be dynamic to overcome such hurdles.</p><p>Aether consists of bidder and batch job processing command line tools that query instance metadata from the vendor application programming interfaces (APIs) to formulate the LP problem. LP is an optimization method that simultaneously solves a large system of equations to determine the best outcome of a scenario that can be described by linear relationships. The Aether bidder, described in detail in the Supplementary Methods, generates and solves a system of 140 inequalities using the simplex algorithm (Fig. <ref type="figure">1B</ref>). For the purposes of reproducibility, an implementation of the bidder using CPLEX is also provided as an optional command line flag.</p><p>Subsequently, the replica nodes specified by the LP result are placed under the control of a primary node, which assigns batch processing jobs over the transmission control protocol, monitors for failures, gathers all logs, sends all results to a specified cloud storage location and terminates all compute nodes once processing is complete (Fig. <ref type="figure">1C</ref>). Additionally, Aether is able to distribute compute across multiple cloud providers. Sample code for this is provided with the Aether implementation, although it was not used in our reported tests for cost reasons. 
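The bidder's cost-minimization step can be sketched as a small linear program. The instance catalog, prices, capacities and job requirements below are illustrative assumptions, not Aether's actual 140-inequality formulation, and the solver shown (SciPy's HiGHS-backed linprog) stands in for the simplex/CPLEX implementations named above:

```python
# Hypothetical LP sketch of spot-instance bidding: choose how many of each
# instance type to request so total spot cost is minimized while aggregate
# cores, RAM and SSD cover the job's needs. All numbers are illustrative.
from scipy.optimize import linprog

# instance type: (spot price USD/h, cores, RAM GB, SSD GB, market capacity)
catalog = {
    "r4.8xlarge": (0.70, 32, 244, 0, 40),
    "i3.4xlarge": (0.55, 16, 122, 3800, 60),
    "c4.8xlarge": (0.45, 36, 60, 0, 80),
}
price = [v[0] for v in catalog.values()]
need = {"cores": 400, "ram": 3000, "ssd": 8000}  # job requirements

# linprog minimizes c @ x subject to A_ub @ x being at most b_ub, so the
# "resources supplied must cover resources needed" rows are negated.
A_ub = [
    [-v[1] for v in catalog.values()],  # negated cores per instance
    [-v[2] for v in catalog.values()],  # negated RAM per instance
    [-v[3] for v in catalog.values()],  # negated SSD per instance
]
b_ub = [-need["cores"], -need["ram"], -need["ssd"]]
bounds = [(0, v[4]) for v in catalog.values()]  # market capacity per type

res = linprog(price, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
allocation = dict(zip(catalog, res.x))  # instances of each type to bid on
```

A production bidder would additionally round to integer instance counts and fold in pricing-history constraints, which this toy model omits.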
Our implementation runs on any Unix-like system; we ran our pipeline and cost analysis using AWS but have provided code to spin up compute nodes on either Microsoft Azure or a user's local physical clusters.</p><p>To test our bidding approach and batch job pipeline at scale, we used our framework to de novo assemble and annotate 1572 metagenomic, longitudinal samples from the stool of 222 infants in Northern Europe (Supplementary Fig. <ref type="figure">S1</ref>; Bäckhed et al., 2015; <ref type="bibr">Kostic et al., 2015;</ref><ref type="bibr">Vatanen et al., 2016;</ref><ref type="bibr">Yassour et al., 2016)</ref>. The sequencing data within datasets from the DIABIMMUNE consortium ranged from 4680 to 22,435,430 reads/sample with a median of 19,020,036 reads/sample. Assemblies were performed with MEGAHIT and annotations with PROKKA <ref type="bibr">(Li et al., 2015;</ref><ref type="bibr">Seemann, 2014)</ref>.</p><p>Metagenomic data, typically shotgun DNA sequencing of microbial communities, is difficult to analyze because of the enormous amount of compute required to naively assemble short sequence reads into large contiguous spans (contigs) of DNA. To accomplish our assemblies, our bidding algorithm suggested that the optimal strategy would be to spin up 30 TB of RAM across underutilized compute nodes. Our networked batch job processing module utilized these nodes for 13 h and yielded an assembly and annotation cost of US$0.30 per sample (Supplementary Fig. <ref type="figure">S2</ref>). Theoretically, the pipeline can complete in the time it takes for the longest sub-process (i.e. assembly in this case) to finish (7 h). Spinning up the same nodes for this long without a bidding approach would cost US$1.60 per sample (Supplementary Fig. <ref type="figure">S3</ref>). 
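As a back-of-envelope check, the per-sample figures just quoted (US$0.30 with bidding versus US$1.60 without, over 1572 samples) imply roughly an 81% saving, in line with the up-to-90% spot discount described earlier:

```python
# Sanity check of the reported savings, using only figures stated in the
# text: 1572 samples, USD 0.30/sample with bidding, USD 1.60/sample without.
samples = 1572
spot_total = 0.30 * samples       # total cost with Aether's bidding
on_demand_total = 1.60 * samples  # same nodes at the listed price
savings_fraction = 1 - spot_total / on_demand_total
print(round(savings_fraction, 4))  # prints 0.8125
```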
In order for on-site hardware to achieve the same cost efficiency as our pipeline, one would have to carry out on the order of 1 million assemblies over the lifespan of the servers, a practically insurmountable task (Supplementary Fig. <ref type="figure">S2</ref>). Such efficiency in both time and cost at scale is unprecedented. In fact, owing to this resource paucity, computational costs have forced the field of metagenomics to rely on algorithmic approaches that map reads back to reference genomes rather than on de novo methods <ref type="bibr">(Truong et al., 2015)</ref>.</p><p>Additional testing of Aether showed marginally better relative cost savings (compared to the assembly example) when tasked with aligning raw reads to the previously assembled genomes with BWA-MEM <ref type="bibr">(Li, 2009</ref>; <ref type="url">http://arxiv.org/abs/1303.3997</ref>); this is not surprising, as shorter computational tasks are less sensitive to the risk of early spot instance termination. Additionally, in simulated runs of the bidder incorporating pricing history from periods where ask prices were approximately an order of magnitude higher than normal on the east coast of the United States (Fig. <ref type="figure">1A</ref>), Aether suggested different instance types that would have resulted in similar cost and time to completion as our actual run. To let users make optimal use of these benefits, the ability to simulate bidding for different timeframes is included as a feature. Because users need not re-run analysis pipelines after being outbid on compute during runtime, we argue that utilizing Aether reduces market inefficiencies. We have both qualitatively and empirically compared Aether to existing AWS tools such as AWS Batch and Spot Fleet Pricing (Supplementary Fig. <ref type="figure">S1</ref>.3 and Supplementary Methods). 
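Returning to the on-site hardware comparison above, the order-of-1-million break-even figure can be illustrated with a simple amortization sketch; the cluster purchase price used here is a hypothetical assumption (not a figure from the paper), and on-site operating costs are ignored for simplicity:

```python
# Hypothetical break-even sketch: at USD 0.30 per assembled sample in the
# cloud, how many assemblies must fixed on-site hardware amortize over its
# lifespan to match? The capex figure below is an illustrative assumption.
per_sample_cloud_cost = 0.30   # USD, from the reported Aether run
cluster_capex = 300_000.0      # assumed purchase price of comparable servers
break_even_samples = cluster_capex / per_sample_cloud_cost
print(int(break_even_samples))  # prints 1000000
```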
Additionally, where empirical validation of benefit was possible, we have iterated on previous work and incorporated strategies such as basing a subset of constraints on service level agreements <ref type="bibr">(Andrzejak et al., 2010)</ref>. Future directions include training the bidding algorithm to predict its own effect on pricing variability when being utilized at massive scale as well as distributing compute nodes across datacenters when enough resources are being spun up to strongly influence the market.</p><p>To our knowledge, this is the first implementation of a bidding algorithm for cloud compute resources that is tied both to an easy-to-use front-end as well as a distributed backend that allows for spinning up purchased compute nodes across multiple providers. Conceivably, this tool can be applied to any number of disciplines, bringing cost-effective cloud computing into the hands of scientists in fields beyond biology. </p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Downloaded from https://academic.oup.com/bioinformatics/article-abstract/34/9/1565/4708304 by guest on 21 April 2020</p></note>
		</body>
		</text>
</TEI>
