<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A Chromosome-level Genome Assembly of the Highly Heterozygous Sea Urchin &lt;i&gt;Echinometra&lt;/i&gt; sp. EZ Reveals Adaptation in the Regulatory Regions of Stress Response Genes</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>10/01/2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10444163</idno>
					<idno type="doi">10.1093/gbe/evac144</idno>
					<title level='j'>Genome Biology and Evolution</title>
<idno>1759-6653</idno>
<biblScope unit="volume">14</biblScope>
<biblScope unit="issue">10</biblScope>					

					<author>Remi N Ketchum</author><author>Phillip L Davidson</author><author>Edward G Smith</author><author>Gregory A Wray</author><author>John A Burt</author><author>Joseph F Ryan</author><author>Adam M Reitzel</author><author>Rachel O’Neill</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Abstract            Echinometra is the most widespread genus of sea urchin and has been the focus of a wide range of studies in ecology, speciation, and reproduction. However, available genetic data for this genus are generally limited to a few select loci. Here, we present a chromosome-level genome assembly based on 10x Genomics, PacBio, and Hi-C sequencing for Echinometra sp. EZ from the Persian/Arabian Gulf. The genome is assembled into 210 scaffolds totaling 817.8Mb with an N50 of 39.5Mb. From this assembly, we determined that the E. sp. EZ genome consists of 2n=42 chromosomes. BUSCO analysis showed that 95.3% of BUSCO genes were complete. Ab initio and transcript-informed gene modeling and annotation identified 29,405 genes, including a conserved Hox cluster. E. sp. EZ can be found in high-temperature and high-salinity environments, and we therefore compared E. sp. EZ gene families and transcription factors associated with environmental stress response (“defensome”) with other echinoid species with similar high-quality genomic resources. While the number of defensome genes was broadly similar for all species, we identified strong signatures of positive selection in E. sp. EZ noncoding elements near genes involved in environmental response pathways as well as losses of transcription factors important for environmental response. These data provide key insights into the biology of E. sp. EZ as well as the diversification of Echinometra more widely and will serve as a useful tool for the community to explore questions in this taxonomic group and beyond.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>The expansion of sequencing technologies, particularly long-read sequencing, has dramatically improved the assembly of genomes for all species, particularly those with high heterozygosity and/or repeat content. These advances are particularly impactful for species where no or few closely related reference genomes are available and for species with large population sizes where nucleotide diversity is typically high. Both of these factors are common for marine invertebrates. Within the phylum Echinodermata, the purple sea urchin, Strongylocentrotus purpuratus, was the first genome to be sequenced <ref type="bibr">(Sodergren et al. 2006)</ref>. This genome not only revealed a number of critical expansions of gene families and conservation of core genes in developmental regulatory networks, but also provided an essential tool for future studies of cis-regulation and genetic adaptation <ref type="bibr">(N&#232;gre et al. 2011;</ref><ref type="bibr">Technau and Schwaiger 2015;</ref><ref type="bibr">Clucas et al. 2019)</ref>. Since the release of the S. purpuratus genome, the genomes for other echinoderms have been published <ref type="bibr">(Long et al. 2016;</ref><ref type="bibr">Hall et al. 2017;</ref><ref type="bibr">Kinjo et al. 2018)</ref> and genomic resources such as SpBase <ref type="bibr">(Cameron et al. 2009)</ref>, Echinobase <ref type="bibr">(Kudtarkar and Cameron 2017)</ref>, and EchinoDB <ref type="bibr">(Janies et al. 2016</ref>) have been established for data sharing and analyses. More recently, the genomes of Lytechinus variegatus <ref type="bibr">(Davidson et al. 2020)</ref> and Lytechinus pictus <ref type="bibr">(Warner et al. 2021</ref>) have been assembled to chromosome-level. To date, there are no sea urchin genome assemblies available from urchins that inhabit extreme environments such as the Persian/Arabian Gulf (herein the PAG), yet such efforts to expand the taxonomic representation of existing assemblies would help make inferences about evolutionary processes.</p><p>Echinometra is a widespread genus of pantropical sea urchins that have been the focus of many genetic <ref type="bibr">(Matsuoka and Toshihiko 1991;</ref><ref type="bibr">Palumbi et al. 1997;</ref><ref type="bibr">McCartney et al. 2000;</ref><ref type="bibr">McCartney and Lessios 2004;</ref><ref type="bibr">Bronstein and Loya 2013)</ref>, ecological <ref type="bibr">(Khamala 1971;</ref><ref type="bibr">McClanahan and Muthiga 2007;</ref><ref type="bibr">Sangil and Guzman 2016)</ref>, and reproductive studies <ref type="bibr">(Metz et al. 1994;</ref><ref type="bibr">Aslan and Uehara 1997;</ref><ref type="bibr">Rahman et al. 2000</ref><ref type="bibr">Rahman et al. , 2001;;</ref><ref type="bibr">Mita et al. 2004</ref>). Echinometra are found in the Indo-Pacific, Caribbean, and Atlantic Ocean and they are often the most prevalent urchins on the reefs they inhabit <ref type="bibr">(McClanahan and Muthiga 2007;</ref><ref type="bibr">Bronstein and Loya 2013)</ref>. Although widely studied, the species-level taxonomy for this genus is incomplete and while some studies have referenced eight species, there is more likely to be at least nine species of Echinometra <ref type="bibr">(Bronstein and Loya 2013;</ref><ref type="bibr">Ketchum et al. 2018)</ref>. <ref type="bibr">Bronstein and Loya (2013)</ref> were the first to describe Echinometra species EE and ZE (these abbreviations reference their collection site: Eilat and Zanzibar, respectively), which had been historically misidentified as Echinometra mathaei. Following the authors' analyses, these sea urchins were then inferred to be a single species (hence the combined name Echinometra sp. EZ). The misidentification for this new species also occurred in the PAG and Gulf of Oman but has since been corrected <ref type="bibr">(Ketchum et al. 2020</ref><ref type="bibr">(Ketchum et al. , 2021))</ref>. The PAG populations are of particular interest as they inhabit the world's warmest reefs with daily mean summer temperatures regularly &gt;35 &#176;C and daily extremes exceeding 37 &#176;C <ref type="bibr">(Smith et al. 2017;</ref><ref type="bibr">Burt et al. 2019)</ref>. Additionally, the PAG is highly saline (salinities often &gt;45 ppt; <ref type="bibr">Burt et al. 2008</ref>) and frequently experiences hypoxia in the summer <ref type="bibr">(De Verneil et al. 2021)</ref>. The E. sp. EZ populations inhabiting this sea live in an extreme environment with warming episodes that are predicted to increase in severity and frequency <ref type="bibr">(Sheppard et al. 2010</ref>).</p><p>Here we present the chromosome-level assembly for E. sp. EZ from the PAG and the associated gene annotation set. This new assembly is a marked improvement on the previous draft genome assembled <ref type="bibr">(Ketchum et al. 2020)</ref>. With the genome in hand, we provide insight into the evolution of several gene families related to environmental stress responses and identify instances of strong positive selection in the putative regulatory regions for transcription factors involved in regulating molecular pathways for homeostasis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Results and Discussion</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Genome Assembly and Annotation</head><p>The initial Echinometra sp. EZ draft genome <ref type="bibr">(Ketchum et al. 2020)</ref> was assembled from Illumina short (150 bp) paired-end reads, had an assembly size of 1,589 Mb, comprised 4,487,317 scaffolds, and had a BUSCO complete score of 34.8% (table <ref type="table">1</ref>). The improved assembly generated with 10x Genomics and PacBio long reads produced a genome assembly size of 817.5 Mb with 3,800 scaffolds and an overall BUSCO completeness score of 94.4%. We used Hi-C reads to assemble these 3,800 scaffolds into an assembly that was 817.8 Mb in length with 210 scaffolds. The longest scaffold was 65.8 Mb and the scaffold N50 length was 39.5 Mb. The contact map is indicative of the presence of 21 chromosomes in the haploid E. sp. EZ genome (supplementary fig. <ref type="figure">1</ref>, Supplementary Material online). This is consistent with estimates for other Echinometra species based on karyotyping <ref type="bibr">(Uehara et al. 2020</ref>). The assembly is highly complete with very little duplicated or missing content and has a BUSCO complete score of 95.3%. The high contiguity and BUSCO scores are in line with other recently published sea urchin genomes <ref type="bibr">(Davidson et al. 2020</ref>;</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Significance</head><p>We present the first chromosome-level genome assembly for the sea urchin Echinometra sp. EZ from the environmentally extreme Persian/Arabian Gulf. There are currently no high-quality sea urchin genomes available for urchins that inhabit extreme environments, despite the fact that expanding the taxonomic representation of genome assemblies would help make inferences about evolutionary processes, especially where it pertains to adaptation. Here, we compared the E. sp. EZ genome to several other sea urchin genomes to provide insight into the evolution of key developmental and environmental stress response genes. Echinometra is a well-studied and widespread genus of sea urchin that plays critical ecological roles in coral reef communities, and this genome represents a crucial tool for future studies across multiple fields. <ref type="bibr">Warner et al. 2021)</ref>. Finally, we performed gene prediction in BRAKER v2. <ref type="bibr">1.5 (Bru&#778; na et al. 2021</ref>) using E. sp. EZ RNA-seq data and S. purpuratus v5.0 protein models as inputs, and identified 29,405 genes in this assembly (Gene Model BUSCO Results: Complete 82.2%, Partial 4.6%, Missing 13.2%).</p><p>A notable challenge in assembling this genome was the high heterozygosity (supplementary fig. <ref type="figure">2</ref>, Supplementary Material online). The estimated heterozygosity was 4.4% which is extremely high (e.g., heterozygosity of 101 drosophilids ranged from 0.00035% to 2.1% in <ref type="bibr">Kim et al. 2021</ref>) but is consistent with the draft genome estimates (4.5%). These levels of heterozygosity are comparable with the S. purpuratus genome (4-5%) but are higher than the values for L. variegatus (2.9%). Prior to running PurgeHaplotigs, the percentage of core duplicated genes as predicted by BUSCO was 71.1%, indicating that regions of the genome where allelic variation exceeded CANU thresholds were assembled into multiple haplotigs rather than a single haplotype-fused representation of the haploid genome. With the application of PurgeHaplotigs, we reduced the duplication levels to 2.8%. Despite the higher heterozygosity, these duplication levels are within the ranges observed in the Lytechinus assemblies (0.6% and 1.36% in L. variegatus <ref type="bibr">[Davidson et al. 2020</ref>] and L. pictus <ref type="bibr">[Warner et al. 2021]</ref>, respectively). As such, these results indicate that we have generated a high-quality single pseudo-haploid reference assembly containing sequences from both parental haplotypes.</p><p>The availability of a high-quality E. sp. EZ genome allows for a comparison of key developmental genes between other sequenced echinoid genomes. The Hox cluster of E. sp. EZ consists of 11 Hox genes as well as Hox-associated homeobox gene Evx. The 11 E. sp. EZ Hox genes span 736 kb and the cluster is organized in the same order and orientation as the genes in the Hox clusters of S. purpuratus <ref type="bibr">(Cameron et al. 2006)</ref>  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Chemical Defensome</head><p>Adaptive evolution can occur rapidly in species as they respond to changing local environments or as they encounter environmental challenges during range expansion <ref type="bibr">(Chen et al. 2018</ref>). Gene family expansions offer one potential mechanism for rapid adaptation to novel environments <ref type="bibr">(Kondrashov 2012;</ref><ref type="bibr">Meerupati et al. 2013)</ref>. The E. sp. EZ individual sampled for this genome is hypothesized to be a member of a population likely to have undergone rapid adaptation considering the young age of the PAG (modern coastlines formed 3,000-6,000 years ago in the wake of the marine transgression following the most recent ice age; Riegl and Purkis 2012) and the extreme thermal environment. Therefore, we hypothesized that key gene families underwent expansion events and would be evident in the E. sp. EZ genome when compared with four sea urchin species from other geographic regions.</p><p>We investigated a suite of gene families included in the chemical defensome; nuclear receptors, biotransformation enzymes, stress response, and transporter genes. Genomic analyses of five species of sea urchins showed that the number of chemical defensome genes ranged from 800 in E. sp. EZ to 944 in Heliocidaris erythrogramma (fig. <ref type="figure">2</ref>). The variability in the number of genes could be a result of missing annotations, however, this is likely to be a random process so is unlikely to only effect one subfamily of genes. Each gene subfamily is represented in each species and although the number of genes in each subfamily varies across species, they are relatively consistent, with the exception of the nuclear receptors (E. sp. EZ had 21 compared with an average of about 45 from the other echinoid species). The nuclear receptor ROR-alpha-like and the nuclear receptor subfamily 2 group E member 1 (nr2e1) genes are missing from the E. sp. EZ genome (no hits were returned when using a local TBLASTN against the genome with default parameters). The nuclear receptor ROR-alpha-like gene performs a diverse array of biological functions which includes regulation of glucose and inflammation cytokines, and free fatty acid metabolism <ref type="bibr">(Zueva et al. 2018</ref>). Nuclear receptor </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>FIG.</head><p>1.-Echinometra sp. EZ Hox cluster organization is exactly the same as the organization of the Strongylocentrotus purpuratus Hox cluster in terms of membership and orientation. The names on the S. purpuratus Hox genes reflect the names assigned in an earlier study <ref type="bibr">(Cameron et al. 2006)</ref>. Except for Hox1 and Evx, we omitted names on the E. sp. EZ Hox genes to reflect the uncertainty of the orthology of echinoderm Hox genes relative to the vertebrate Hox genes of the same name (supplementary fig. <ref type="figure">3</ref>, Supplementary Material online).</p><p>subfamily 2 group E member 1 is a gene involved in human retinal development and also regulates adult neural stem cell proliferation <ref type="bibr">(O'Loghlen et al. 2015)</ref>. Adaptive loss-of-function mutations or gene loss events are likely to occur more frequently than amino acid substitutions and have been documented in a wide variety of organisms <ref type="bibr">(Wang et al. 2006;</ref><ref type="bibr">Borges et al. 2015)</ref>. It is currently unclear what functions nr2e1 performs in sea urchins or why these two nuclear receptors are missing from E. sp. EZ's genome. Overall, we did not observe any evidence for gene family expansions in E. sp. EZ that would be consistent with adaptation to a high temperature environment. However, changes in the coding regions or regulation of these genes may better account for E. sp. EZ's mechanism of adaptation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Positive Selection</head><p>We tested whether there is evidence of positive selection acting on noncoding elements near transcription factors commonly associated with maintaining homeostasis in order to assess whether putative regulatory elements of stress response genes may be evolving faster in E. sp. EZ. For this set of analyses we used urchins from the Family Echinometridae (E. sp. EZ, H. erythrogramma, and H. tuberculata) with L. variegatus as an outgroup. Transcription factors regulate the expression of large suites of downstream genes and thus are potentially important targets of positive selection for adaptation. Previous studies of selection for transcription factors involved in developmental gene regulatory networks have revealed particular cis-regulatory regions under positive selection, which may explain differences in developmental pathways <ref type="bibr">(Erkenbrack and Davidson 2015)</ref>.</p><p>We used adaptiPhy to identify elevated rates of sequence divergence associated with noncoding sequences 25 kb upstream and downstream of 25 key transcription factors that are broadly involved in mediating responses to environmental variation (supplementary table 1, Supplementary Material online). The divergence in these regions are compared with the background rates of sequence evolution across the phylogeny to identify evidence of branch-specific positive selection. Ten of 25 genes were discarded during filtering due to a lack of long syntenic fragments or possible duplications. AdaptiPhy returned 1,350 sites across the three species of sea urchins compared in this analysis. We detected evidence of positive selection in a median of 16.2% of sites in E. sp. EZ, 1.9% in H. erythrogramma, and 6.0% in H. tuberculata. Recent work on Heliocidaris <ref type="bibr">(Davidson et al. 2022a</ref>) measured instances of positive selection in noncoding, open chromatin regions across the whole genome and found that when genes are categorized by function, a median of 1.18% and 1.74% of nearby noncoding sites show selection in H. erythrogramma and H. tuberculata, respectively. These values include sites near "Defensome" genes (1.90% and 1.22%, respectively) and sites near "Immune" genes (2.34% and 1.10%, respectively). While these data sets are not identical, the analyses are comparable and suggest that the enrichment of selection near these genes in E. sp. EZ may represent a significant signal of adaptation in E. sp. EZ relative to the other species. In particular, there were five genes (hif1a, nfe2, nfkb, nr1h6b, and nr1h6c) with nearby noncoding elements exhibiting significant signatures of positive selection predominantly in E. sp. EZ (fig. <ref type="figure">3</ref> and supplementary fig. <ref type="figure">4</ref>, Supplementary Material online).</p><p>The first gene, hif1a, is a transcription factor that serves as a gene expression modulator in response to hypoxia and other environmental factors <ref type="bibr">(Wenger 2002)</ref>. A recent study showed that hypoxia occurs repeatedly on a southern PAG reef in the summer and can last for several hours at a time with occasional anoxic events <ref type="bibr">(De Verneil et al. 2021</ref>). In the sea anemone Exaiptasia pallida, hif1a expression changes rapidly and significantly early on in the heat stress response <ref type="bibr">(Cleves et al. 2020)</ref>. A link between temperature and HIF-1 activity has also been documented in fishes (carp; <ref type="bibr">Rissanen et al. (2006)</ref>, and Atlantic salmon; <ref type="bibr">Olsvik et al. (2013)</ref>) where temperature stress resulted in impaired binding activity of HIF-1. These findings point towards a critical relationship between hif1a and stress response. Further, our observation of positive selection in the noncoding sequences neighboring hif1a could be indicative of altered expression of this gene in E. sp. EZ, which in turn could potentially impact stress response genes under the regulatory control of this transcription factor. We also identified sites under selection for nuclear factor erythroid 2, which activates gene expression in response to oxidative and xenobiotic stress <ref type="bibr">(Reitzel et al. 2008)</ref>  Heat responsive genes regulator of innate immunity and linked to resistance to environmental stress (Gilmore and Wolenski 2012). The last two genes are nuclear receptors, nr1h6b and nr1h6c, whose biological function are not clearly defined but are predicted to be involved in the regulation of biotransformation enzymes and transporters based on the sequence similarity with vertebrate nr1h subfamily (e.g., FXR, LXR) <ref type="bibr">(Goldstone et al. 2006)</ref>. Although our analysis presented for this selection of transcription factors is relatively small compared with the nearly 30,000 annotated genes in the genome, we anticipate that many other loci will also reveal signatures of positive selection in future analyses of the E. sp. EZ genome.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusion</head><p>Echinometra is a widespread genus of sea urchin that plays important ecological roles in tropical benthic communities. This is the first chromosome-level genome assembly for a species from this genus and represents a crucial advance with implications across multiple fields. First, it will allow comparisons across multiple sea urchin genera to elucidate fundamental evolutionary patterns in genome evolution. Our initial analyses reported here suggest the genome size, number of genes, and gene family representation are fairly consistent across these echinoid species. Second, the completeness of the genome will facilitate ongoing research investigating the role of adaptation in these urchins, particularly with respect to identifying structural variation, positive selection, and gene arrangements along ecological gradients and contrasting habitats. We provide some examples of elevated signatures of positive selection for putative regulatory elements of orthologous transcription factors likely involved in environmental stress responses. Last, the Echinometra genus is a classically studied example of speciation and therefore this highly contiguous genome assembly will provide a tool elucidate the genomic underpinnings of speciation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Materials and Methods</head><p>Tissue Collection and DNA/RNA Extraction Samples were collected from the southern PAG which is characterized by extreme and highly variable temperatures and salinity <ref type="bibr">(Vaughan et al. 2019)</ref>. One Echinometra sp. EZ adult was collected from Dhabiya Reef <ref type="bibr">(lat-long: 24.36549992, 54.10079998)</ref> in December 2017 for DNA isolation for genome assembly. Gonadal tissue was dissected from this individual and immediately placed on dry ice and sent to the Dovetail Genomics Center who performed the subsequent DNA extraction using the Qiagen Genomic-tip DNA kit. In order to generate a transcriptome, RNA was isolated from an additional E. sp. EZ adult that was collected from Saadiyat Reef <ref type="bibr">(lat-long: 24.59899996, 54.42149998)</ref>  ice. RNA was extracted using the RNAqueous Total RNA Isolation Kit.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Sequencing and Genome Assembly</head><p>RNA was sent to DHMRI (Kannapolis, NC, USA) for library preparation and sequencing with an Illumina HiSeq2500. Total RNA was quantified using the Quant-iT RiboGreen RNA Assay Kit (Thermo Fisher) and RNA integrity assessed using an Agilent Bioanalyzer (Santa Clara, CA, USA). RNA sequencing libraries were generated using the Illumina TruSeq RNA Library Prep Kit following the manufacturer's protocol and quantitated using qPCR and fragments were visualized using an Agilent Bioanalyzer. The library was run on one flow cell for a 125 bp paired-end sequencing run.</p><p>Gonadal tissue was sent to the Dovetail Genomics Center who performed all the library construction and sequencing for the genome. The assembly and scaffolding was accomplished through a joint effort between the authors and Dovetail Genomics. Library construction, sequencing, assembly, and scaffolding are described in detail below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>10x Library Preparation and Sequencing</head><p>Whole-genome sequencing libraries were prepared with 1.0 ng of genomic DNA using the Chromium Genome Library and Gel Bead Kit v.2 (10x Genomics, cat. 120262). Link-read based technology using 10x Genomics Chromium was performed according to the manufacturer's instructions with one modification. Briefly, in order to create Gel Bead-in-Emulsions (GEMs), gDNA was combined with Master Mix, a library of Genome Gel Beads, and partitioning oil on a Chromium Genome Chip. The GEMs were isothermally amplified with primers containing an Illumina Read 1 sequencing primer, a unique 16 bp 10x barcode and a 6 bp random primer sequence, and barcoded DNA fragments were recovered for Illumina library construction. The amount and fragment size of the post-GEM DNA was quantified prior to using a Bioanalyzer 2100 with an Agilent High sensitivity DNA kit (Agilent, cat. 5067-4626). The GEM amplification product was sheared on an E220 Focused Ultrasonicator (Covaris, Woburn, MA, USA) to &#8764;350 bp (55 s at peak power = 175, duty factor = 10, and cycle/burst = 200). Next, the sheared GEMs were converted to a sequencing library following the 10x standard operating procedure and the library was quantified by qPCR with a Kapa Library Quant kit (Kapa Biosystems-Roche). Finally, the library was sequenced on a partial lane (472M reads) of the NovaSeq6000 sequencer (Illumina, San Diego, CA, USA) with paired-end 150 bp reads.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>PacBio Library Preparation and Sequencing</head><p>A SMRTbell library (&#8764;20 kb) for PacBio Sequel II was constructed using a SMRTbell Template Prep Kit 1.0 (PacBio, Menlo Park, CA, USA) according to the manufacturer's recommended protocol. The Sequel Binding Kit 2.0 (PacBio) was used to bind the pooled library to the polymerase and then loaded onto the PacBio Sequel using the MagBead Kit V2 (PacBio). Sequencing was performed on one PacBio Sequel SMRT cell (Instrument Control Software Version 5.0.0.6235, Primary Analysis Software Version 5.0.0.6236 and SMRT Link Version 5.0.0.6792).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Dovetail Hi-C Library Preparation and Sequencing</head><p>A Dovetail Hi-C library was prepared in a similar manner as described in <ref type="bibr">Lieberman-Aiden et al. (2009)</ref>. For each library, chromatin was fixed with formaldehyde in the nucleus and then extracted. This was then digested with DpnII and the 5&#8242; overhangs were filled with biotinylated nucleotides. After the free blunt ends were ligated, crosslinks were reversed and the DNA was purified from protein. Biotin that was not internal to ligated fragments was removed and DNA was then sheared to a mean fragment size of 350 bp. Sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters. Biotin-containing fragments were isolated using streptavidin beads before the PCR enrichment of each library. Finally, the libraries were sequenced on an Illumina HiSeq X to produce 200 million 2x150 bp paired-end reads.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Genome Assembly</head><p>CANU v.2.0 <ref type="bibr">(Koren et al. 2017</ref>) was used for genome assembly with the following parameters to generate a de novo assembly using the PacBio reads: genomeSize = 1,300, minReadLength = 3,000, corOutCoverage = 300, useGrid = False. 10x sequences were aligned to the CANU assembly using longRanger v.2.2.2 <ref type="bibr">(Marks et al. 2019)</ref> and then Pilon was used to polish the CANU assembly with the 10x reads. In order to reduce the percentage of duplication (due to the high levels of heterozygosity), PurgeHaplotigs <ref type="bibr">(Roach et al. 2018</ref>) was run multiple times with the percent cutoff for identifying a contig as a haplotig ranging from 35% to 55%. A cutoff of 40% yielded the most complete assembly and this genome assembly was subsequently run through the PurgeHaplotigs "clip" option that identifies and trims overlapping contig ends.</p><p>The purged genome assembly and the Dovetail Hi-C library reads were input into HiRise. HiRise is a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies <ref type="bibr">(Putnam et al. 2016)</ref>. The Dovetail Hi-C library sequences were aligned to the draft input assembly using the modified SNAP read mapper (<ref type="url">http:// snap.cs.berkeley.edu</ref>). HiRise analyzed the separations of Dovetail Hi-C read pairs mapped within the draft scaffolds to produce a likelihood model for genomic distance between read pairs. The model was used to identify and break Downloaded from <ref type="url">https://academic.oup.com/gbe/article/14/10/evac144/6717576</ref> by UNC Charlotte user on 21 August misjoins, to score potential joins, and make joins above a threshold.</p><p>Jellyfish v2.3.0 <ref type="bibr">(Mar&#231;ais and Kingsford 2011)</ref> and GenomeScope v1.0.0 <ref type="bibr">(Vurture et al. 2017)</ref> were used to calculate genome size, heterozygosity, and repetitiveness by performing a nonlinear least-squares optimization to fit a mixture of four evenly spaced negative binomial distributions to the k-mer profile. The GenomeScope profile for E. sp. EZ was generated from linked-read sequencing data (10x Genomics) and a k-mer length of 21 (supplementary fig. <ref type="figure">2</ref>, Supplementary Material online).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Gene Prediction and Annotation</head><p>Prior to gene prediction and annotation, genomic repetitive elements were identified with RepeatModeler v1.0.11 <ref type="bibr">(Smit and</ref><ref type="bibr">Hubley 2008-2015)</ref> to generate a E. sp. EZ-specific repeat element library. Repetitive regions were soft-masked prior to gene prediction and annotation using RepeatMasker v4.0.8 <ref type="bibr">(Smit et al. 2013</ref><ref type="bibr">(Smit et al. -2015) )</ref> with the -s and -xsmall flags. Transcriptome sequences were cleaned and trimmed using Trimmomatic v0.38 <ref type="bibr">(Bolger et al. 2014)</ref> with a phred quality score of 33. Leading and trailing bases with a quality score &lt;3 were removed, a four-base wide sliding window was used to cut where the average quality per base dropped below 15, and reads that were &lt;36 bp long were removed. BWA-mem v0.7.17 <ref type="bibr">(Li 2013)</ref> was used to align the transcriptome sequences the genome with default parameters. Aligned RNA-seq data and S. purpuratus v5.0 protein models (<ref type="url">https://www.ncbi.nlm.nih. gov/assembly/GCF_000002235.5</ref>, accessed February 23, 2021) were input into the BRAKER v2.1.5 <ref type="bibr">(Bru&#778; na et al. 2021</ref>) pipeline, which relies on GeneMark v4.62 <ref type="bibr">(Lomsadze et al. 2005)</ref> and Augustus v3.4.0 <ref type="bibr">(Stanke et al. 2006)</ref> to generate gene models.</p><p>To identify transposable elements (TEs) in the E. sp. EZ genome, a master TE file with 26,470 TEs was generated by combining TEs from RepeatModeler and the default set of TEs in Maker <ref type="bibr">(Campbell et al. 2014)</ref>. Potential TEs were inspected with BlastP v2.5.0+ <ref type="bibr">(Altschul et al. 1990)</ref> against the TE master file and the S. purpuratus v5.0 gene models. Gene models were filtered by removing those that 1) had a hit to the TE master file but no hit to the S. purpuratus gene models, 2) had a highly similar hit (BLASTX E-value &lt;1e-20) to a TE but a weak hit to the S. purpuratus gene models (BLASTX E-value &gt; 1), and 3) had a match to the S. purpuratus gene models and were labeled as retrotransposons. Further TE analysis was performed in extensive de novo TE annotator <ref type="bibr">(Ou et al. 2019</ref>) using default parameters (supplementary table 2, Supplementary Material online).</p><p>Assembled protein models were functionally annotated using BLASTP v2.5.0+ with three protein databases: S. purpuratus v4.2, UniProt Knowledgebase Swiss-Prot protein models v2021-03, and RefSeq invertebrate protein models with S. purpuratus excluded. Lastly, BUSCO v5 <ref type="bibr">(Manni et al. 2021</ref>) was used to measure the completeness of the genome assembly and the protein models using the metazoan database.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Hox Phylogeny</head><p>To maximize transparency and minimize bias, we generated and posted to GitHub (<ref type="url">https://github.com/josephryan/ SpezHox</ref>) a phylotocol (DeBiasse and Ryan 2018), which specified our planned phylogenetic analyses prior to running these analyses. The alignment, command lines, and treefiles are also available at this GitHub repository.</p><p>We used hmm2aln.pl version 0.05 (<ref type="url">https://github.com/ josephryan/hmm2aln.pl</ref>), which uses hmmsearch from HMMer version 3.3 <ref type="bibr">(Potter et al. 2018)</ref> to identify target domains in protein sequences and aligns resulting sequences to the specified hidden Markov model (HMM). We used the Hox HMM (hd60.hmm) from <ref type="bibr">Zwarycz et al. (2015)</ref> to search transcriptomes from the following species: E. sp. EZ, S. purpuratus, L. variegatus, Patiria miniata, Parastichopus parvimensis, and Holothuria glaberrima. To this we added the curated set of homeodomains of the HOXL subclass from HomeoDB <ref type="bibr">(Zhong and Holland 2011)</ref> as well as the published Hox/ ParaHox homeodomains from Crassostrea gigas <ref type="bibr">(Paps et al. 2015)</ref> and Capitella teleta <ref type="bibr">(Zwarycz et al. 2015)</ref>. From this set, we removed the three Hox3-related genes of Drosophila melanogaster (zen1, zen2, and bicoid), as these have been shown to be highly derived relative to the ancestral Hox3 sequence <ref type="bibr">(Stauber et al. 1999;</ref><ref type="bibr">Hughes et al. 2004</ref>) as was done in <ref type="bibr">Ryan et al (2007)</ref>.</p><p>As expected, our HMM approach led to the inclusion of non-HOXL homeodomains. To create a dataset consisting only of HOXL class homeodomains, we generated an initial tree using IQTREE version 1.6.12 <ref type="bibr">(Nguyen et al. 2015)</ref> under the LG model with otherwise default parameters. We then used the make_subalignment script version 0.06 (<ref type="url">https:// github.com/josephryan/make_subalignment</ref>) to retain (in our alignment) only sequences that were included in the smallest clade that included all of the human HOXL homeodomains in our preliminary tree, essentially pruning all non-HOXL homeodomains. At this point we removed three S. purpuratus sequences (XP_011675355, XP_011682234, and XP_011682243), which were copies of three other sequences (NP_999816, NP_999774, and XP_793141, respectively) that were retained in the final alignment. We then ran a final tree using IQTREE with the LG + G4 model and otherwise default parameters on this pruned data set.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>PFAM Analysis</head><p>We conducted a comparison of gene family representation in the E. sp. EZ assembly with other echinoids to compare the gene annotations and identify potential expansions and contractions across species. We focused this comparison on Genome Biol. Evol. 14 <ref type="bibr">(10)</ref> <ref type="url">https://doi.org/10.1093/gbe/evac144</ref> Advance Access publication 26 September 2022 genes and gene families related to environmental responses. The list of genes included for these comparisons were from the chemical defensome developed from S. purpuratus <ref type="bibr">(Goldstone et al. 2006)</ref>. We performed hidden Markov model searches using HMMER <ref type="bibr">(Finn et al. 2011)</ref> and PFAM profiles to compare gene family representation in five species of sea urchins: E. sp. EZ, L. variegatus <ref type="bibr">(Davidson et al. 2020)</ref>, S. purpuratus (v5.0 gene models), H. erythrogramma, and H. tuberculata <ref type="bibr">(Davidson et al. 2022b)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Positive Selection</head><p>We used adaptiPhy <ref type="bibr">(Berrio et al. 2020)</ref>, which infers regions of the genome that are targets of branch-specific positive selection, to identify specific noncoding regions of the E. sp. EZ genome under selection (see <ref type="url">https://github.com/ wodanaz/adaptiPhy</ref> for command lines). This method controls for branch length when comparing substitution rates of query and neutral sequences between species and therefore should not be effected by comparing two relatively closely related species to a more distantly related one. Therefore, our analyses compared E. sp. EZ to the H. erythrogramma and H. tuberculata genomes with L. variegatus used as the outgroup.</p><p>First, we curated a list of 25 transcription factors based on their roles in the defensome, general stress response, as well as metabolism, cell growth, immunity, and wound healing (supplementary table 1, Supplementary Material online). Using a whole-genome alignment between all four species, we extracted 25 kb upstream and downstream of the transcription start site for each of the 25 genes (median and mean intergenic distance for E. sp. EZ is 18.1 and kb, respectively). We used the sliding window approach in bedtools to split these regions into 300 bp fragments and excluded exons, UTRs, and repetitive elements. We then identified one-to-one orthologous sequences shared across all urchins and ran adaptiPhy to test for selection. After filtering for gaps and alignment lengths of at least 75 bp in each species, a P-value was calculated based on a likelihood ratio test comparing models of branch-specific positive selection and neutrally evolving sequence <ref type="bibr">(Davidson et al. 2022b)</ref>, and then adjusted using the false discovery rate estimation. Finally, we calculated the zeta statistic (ratio of substitution rate of query to neutral substitution rate) to visualize the strength of selection. The branch substitution rates for the zeta scores were calculated separately in phyloFit <ref type="bibr">(Hubisz et al. 2011)</ref> for the query and reference sequences. A zeta statistic &gt;1.25 and supported by a P-value &lt;10% for a noncoding element in one lineage indicates evidence of positive selection.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Genome Biol. Evol. 14<ref type="bibr">(10)</ref> https://doi.org/10.1093/gbe/evac144 Advance Access publication 26 September 2022</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>Genome Biol. Evol. 14<ref type="bibr">(10)</ref> https://doi.org/10.1093/gbe/evac144 Advance Access publication 26 September 2022 E. sp. EZ H. tuberculata H. erythrogramma L. variegatus S. purpuratus</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_2"><p>Genome Biol. Evol. 14<ref type="bibr">(10)</ref> https://doi.org/10.1093/gbe/evac144 Advance Access publication 26 September 2022</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_3"><p>Downloaded from https://academic.oup.com/gbe/article/14/10/evac144/6717576 by UNC Charlotte user on 21 August</p></note>
		</body>
		</text>
</TEI>
