Title: Mitigating assembly and switch errors in phased genomes of polar fishes reveals haplotype diversity in copy number of antifreeze protein genes
Abstract Phased genomes and pangenomes are enhancing our understanding of genetic variation. Accurate phasing and assembly in repetitive regions of the genome remain challenging, however. Addressing this obstacle is crucial for studying structural genomic variation, such as copy number variations (CNVs) common to repetitive regions. Polar fishes, for example, evolved repetitive tandem arrays of antifreeze protein (AFP) genes that facilitate adaptation to freezing and expanded in copy number in colder environments. AFP CNVs remain poorly characterized in polar fishes and may be illuminated by haplotype-aware approaches. We performed long-read sequencing for two polar fishes in the suborder Zoarcoidei and leveraged additional published long-read data to assemble phased genomes. We developed a workflow to measure haplotype diversity in CNV while controlling for misassembly and switch errors (changes from one parental haplotype to another within a contiguous assembly). We present gfa_parser, which computes and extracts all possible contiguous sequences for phased or primary assemblies from graphical fragment assembly (GFA) files, and switch_error_screen, which flags potential switch errors. gfa_parser revealed that assembly uncertainty was ubiquitous across AFP array haplotypes and that standard processing of graphical fragment assemblies can bias measurement of haplotype CNVs. We detected no switch errors in AFP arrays. After controlling for misassembly and switch error, we detected haplotype diversity of AFP CNVs in all studied polar Zoarcoidei species and in 60% of AFP arrays. Intraindividual haplotype diversity spanned differences of 3–16 copies. Our workflow revealed intraspecific genomic diversity in zoarcoids that likely fueled the evolution of AFP copy number across temperature.
Award ID(s):
2312253
PAR ID:
10645188
Author(s) / Creator(s):
; ;
Publisher / Repository:
Nature Publishing Group
Date Published:
Journal Name:
Heredity
ISSN:
0018-067X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Hodgins, Kathryn (Ed.)
    Abstract Antifreeze proteins (AFPs) have enabled teleost fishes to repeatedly colonize polar seas. Four AFP types have convergently evolved in several fish lineages. AFPs inhibit ice crystal growth and lower tissue freezing point. In lineages with AFPs, species inhabiting colder environments may possess more AFP copies. Elucidating how differences in AFP copy number evolve is challenging due to the genes’ tandem array structure and consequently poor resolution of these repetitive regions. Here, we explore the evolution of type III AFPs (AFP III) in the globally distributed suborder Zoarcoidei, leveraging six new long-read genome assemblies. Zoarcoidei has fewer genomic resources relative to other polar fish clades while it is one of the few groups of fishes adapted to both the Arctic and Southern Oceans. Combining these new assemblies with additional long-read genomes available for Zoarcoidei, we conducted a comprehensive phylogenetic test of AFP III evolution and modeled the effects of thermal habitat and depth on AFP III gene family evolution. We confirm a single origin of AFP III via neofunctionalization of the enzyme sialic acid synthase B. We also show that AFP copy number increased under low temperature but decreased with depth, potentially because pressure lowers freezing point. Associations between the environment and AFP III copy number were driven by duplications of paralogs that were translocated out of the ancestral locus at which AFP III arose. Our results reveal novel environmental effects on AFP evolution and demonstrate the value of high-quality genomic resources for studying how structural genomic variation shapes convergent adaptation. 
  2. Abstract The high sequencing error rate has impeded the application of long noisy reads for diploid genome assembly. Most existing assemblers failed to generate high-quality phased assemblies using long noisy reads. Here, we present PECAT, a Phased Error Correction and Assembly Tool, for reconstructing diploid genomes from long noisy reads. We design a haplotype-aware error correction method that can retain heterozygote alleles while correcting sequencing errors. We combine a corrected read SNP caller and a raw read SNP caller to further improve the identification of inconsistent overlaps in the string graph. We use a grouping method to assign reads to different haplotype groups. PECAT efficiently assembles diploid genomes using Nanopore R9, PacBio CLR or Nanopore R10 reads only. PECAT generates more contiguous haplotype-specific contigs compared to other assemblers. Especially, PECAT achieves nearly haplotype-resolved assembly on B. taurus (Bison × Simmental) using Nanopore R9 reads and phase block NG50 with 59.4/58.0 Mb for HG002 using Nanopore R10 reads.
  3. Abstract Long‐read sequencing is driving a new reality for genome science in which highly contiguous assemblies can be produced efficiently with modest resources. Genome assemblies from long‐read sequences are particularly exciting for understanding the evolution of complex genomic regions that are often difficult to assemble. In this study, we utilized long‐read sequencing data to generate a high‐quality genome assembly for an Antarctic eelpout, Ophthalmolycus amberensis, the first for the globally distributed family Zoarcidae. We used this assembly to understand how O. amberensis has adapted to the harsh Southern Ocean and compared it to another group of Antarctic fishes: the notothenioids. We showed that selection has largely acted on different targets in eelpouts relative to notothenioids. However, we did find some overlap; in both groups, genes involved in membrane structure, thermal tolerance and vision have evidence of positive selection. We found evidence for historical shifts of transposable element activity in O. amberensis and other polar fishes, perhaps reflecting a response to environmental change. We were specifically interested in the evolution of two complex genomic loci known to underlie key adaptations to polar seas: haemoglobin and antifreeze proteins (AFPs). We observed unique evolution of the haemoglobin MN cluster in eelpouts and related fishes in the suborder Zoarcoidei relative to other Perciformes. For AFPs, we identified the first species in the suborder with no evidence of afpIII sequences (Cebidichthys violaceus) in the genomic region where they are found in all other Zoarcoidei, potentially reflecting a lineage‐specific loss of this cluster. Beyond polar fishes, our results highlight the power of long‐read sequencing to understand genome evolution.
  4. Abstract Improvements in long-read data and scaffolding technologies have enabled rapid generation of reference-quality assemblies for complex genomes. Still, an assessment of critical sequence depth and read length is important for allocating limited resources. To this end, we have generated eight assemblies for the complex genome of the maize inbred line NC358 using PacBio datasets ranging from 20 to 75 × genomic depth and with N50 subread lengths of 11–21 kb. Assemblies with ≤30 × depth and N50 subread length of 11 kb are highly fragmented, with even low-copy genic regions showing degradation at 20 × depth. Distinct sequence-quality thresholds are observed for complete assembly of genes, transposable elements, and highly repetitive genomic features such as telomeres, heterochromatic knobs, and centromeres. In addition, we show high-quality optical maps can dramatically improve contiguity in even our most fragmented base assembly. This study provides a useful resource allocation reference to the community as long-read technologies continue to mature. 
  5. The rapid evolution of repetitive DNA sequences, including satellite DNA, tandem duplications, and transposable elements, underlies phenotypic evolution and contributes to hybrid incompatibilities between species. However, repetitive genomic regions are fragmented and misassembled in most contemporary genome assemblies. We generated highly contiguous de novo reference genomes for the Drosophila simulans species complex (D. simulans, D. mauritiana, and D. sechellia), which speciated ∼250,000 yr ago. Our assemblies are comparable in contiguity and accuracy to the current D. melanogaster genome, allowing us to directly compare repetitive sequences between these four species. We find that at least 15% of the D. simulans complex species genomes fail to align uniquely to D. melanogaster owing to structural divergence, twice the number of single-nucleotide substitutions. We also find rapid turnover of satellite DNA and extensive structural divergence in heterochromatic regions, whereas the euchromatic gene content is mostly conserved. Despite the overall preservation of gene synteny, euchromatin in each species has been shaped by clade- and species-specific inversions, transposable elements, expansions and contractions of satellite and tRNA tandem arrays, and gene duplications. We also find rapid divergence among Y-linked genes, including copy number variation and recent gene duplications from autosomes. Our assemblies provide a valuable resource for studying genome evolution and its consequences for phenotypic evolution in these genetic model species.