skip to main content


Title: OMGS: Optical Map-based Genome Scaffolding
Due to the current limitations of sequencing technologies, de novo genome assembly is typically carried out in two stages, namely contig (sequence) assembly and scaffolding. While scaffolding is computationally easier than sequence assembly, the scaffolding problem can be challenging due to the high repetitive content of eukaryotic genomes, possible mis-joins in assembled contigs and inaccuracies in the linkage information. Genome scaffolding tools either use paired-end/mate-pair/linked/Hi-C reads or genome-wide maps (optical, physical or genetic) as linkage information. Optical maps (in particular Bionano Genomics maps) have been extensively used in many recent large-scale genome assembly projects (e.g., goat, apple, barley, maize, quinoa, sea bass, among others). However, the most commonly used scaffolding tools have a serious limitation: they can only deal with one optical map at a time, forcing users to alternate or iterate over multiple maps. In this paper, we introduce a novel scaffolding algorithm called OMGS that for the first time can take advantages of multiple optical maps. OMGS solves several optimization problems to generate scaffolds with optimal contiguity and correctness. Extensive experimental results demonstrate that our tool outperforms existing methods when multiple optical maps are available, and produces comparable scaffolds using a single optical map. OMGS can be obtained from GIT.  more » « less
Award ID(s):
1814359 1543963 1526742 1646333
NSF-PAR ID:
10094706
Author(s) / Creator(s):
Date Published:
Journal Name:
RECOMB 2019 - ACM Annual Conference on Research in Computational Molecular Biology
Page Range / eLocation ID:
190-207
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Background The helmeted honeyeater (Lichenostomus melanops cassidix) is a Critically Endangered bird endemic to Victoria, Australia. To aid its conservation, the population is the subject of genetic rescue. To understand, monitor, and modulate the effects of genetic rescue on the helmeted honeyeater genome, a chromosome-length genome and a high-density linkage map are required. Results We used a combination of Illumina, Oxford Nanopore, and Hi-C sequencing technologies to assemble a chromosome-length genome of the helmeted honeyeater, comprising 906 scaffolds, with length of 1.1 Gb and scaffold N50 of 63.8 Mb. Annotation comprised 57,181 gene models. Using a pedigree of 257 birds and 53,111 single-nucleotide polymorphisms, we obtained high-density linkage and recombination maps for 25 autosomes and Z chromosome. The total sex-averaged linkage map was 1,347 cM long, with the male map being 6.7% longer than the female map. Recombination maps revealed sexually dimorphic recombination rates (overall higher in males), with average recombination rate of 1.8 cM/Mb. Comparative analyses revealed high synteny of the helmeted honeyeater genome with that of 3 passerine species (e.g., 32 Hi-C scaffolds mapped to 30 zebra finch autosomes and Z chromosome). The genome assembly and linkage map suggest that the helmeted honeyeater exhibits a fission of chromosome 1A into 2 chromosomes relative to zebra finch. PSMC analysis showed a ∼15-fold decline in effective population size to ∼60,000 from mid- to late Pleistocene. Conclusions The annotated chromosome-length genome and high-density linkage map provide rich resources for evolutionary studies and will be fundamental in guiding conservation efforts for the helmeted honeyeater. 
    more » « less
  2. Sharakhov, Igor V. (Ed.)
    Rubus idaeus L. (red raspberry), is a perennial woody plant species of the Rosaceae family that is widely cultivated in the temperate regions of world and is thus an economically important soft fruit species. It is prized for its flavour and aroma, as well as a high content of healthful compounds such as vitamins and antioxidants. Breeding programs exist globally for red raspberry, but variety development is a long and challenging process. Genomic and molecular tools for red raspberry are valuable resources for breeding. Here, a chromosome-length genome sequence assembly and related gene predictions for the red raspberry cultivar ‘Anitra’ are presented, comprising PacBio long read sequencing scaffolded using Hi-C sequence data. The assembled genome sequence totalled 291.7 Mbp, with 247.5 Mbp (84.8%) incorporated into seven sequencing scaffolds with an average length of 35.4 Mbp. A total of 39,448 protein-coding genes were predicted, 75% of which were functionally annotated. The seven chromosome scaffolds were anchored to a previously published genetic linkage map with a high degree of synteny and comparisons to genomes of closely related species within the Rosoideae revealed chromosome-scale rearrangements that have occurred over relatively short evolutionary periods. A chromosome-level genomic sequence of R . idaeus will be a valuable resource for the knowledge of its genome structure and function in red raspberry and will be a useful and important resource for researchers and plant breeders. 
    more » « less
  3. null (Ed.)
    The blue crab, Callinectes sapidus (Rathbun, 1896) is an economically, culturally, and ecologically important species found across the temperate and tropical North and South American Atlantic coast. A reference genome will enable research for this high-value species. Initial assembly combined 200× coverage Illumina paired-end reads, a 60× 8 kb mate-paired library, and 50× PacBio data using the MaSuRCA assembler resulting in a 985 Mb assembly with a scaffold N50 of 153 kb. Dovetail Chicago and HiC sequencing with the 3d DNA assembler and Juicebox assembly tools were then used for chromosome scaffolding. The 50 largest scaffolds span 810 Mb are 1.5–37 Mb long and have a repeat content of 36%. The 190 Mb unplaced sequence is in 3921 sequences over 10 kb with a repeat content of 68%. The final assembly N50 is 18.9 Mb for scaffolds and 9317 bases for contigs. Of arthropod BUSCO, ∼88% (888/1013) were complete and single copies. Using 309 million RNAseq read pairs from 12 different tissues and developmental stages, 25,249 protein-coding genes were predicted. Between C. sapidus and Portunus trituberculatus genomes, 41 of 50 large scaffolds had high nucleotide identity and protein-coding synteny, but 9 scaffolds in both assemblies were not clear matches. The protein-coding genes included 9423 one-to-one putative orthologs, of which 7165 were syntenic between the two crab species. Overall, the two crab genome assemblies show strong similarities at the nucleotide, protein, and chromosome level and verify the blue crab genome as an excellent reference for this important seafood species. 
    more » « less
  4. de los Campos, G (Ed.)
    Abstract De novo genome assembly is essential for genomic research. High-quality genomes assembled into phased pseudomolecules are challenging to produce and often contain assembly errors because of repeats, heterozygosity, or the chosen assembly strategy. Although algorithms that produce partially phased assemblies exist, haploid draft assemblies that may lack biological information remain favored because they are easier to generate and use. We developed HaploSync, a suite of tools that produces fully phased, chromosome-scale diploid genome assemblies, and performs extensive quality control to limit assembly artifacts. HaploSync scaffolds sequences from a draft diploid assembly into phased pseudomolecules guided by a genetic map and/or the genome of a closely related species. HaploSync generates a report that visualizes the relationships between current and legacy sequences, for both haplotypes, and displays their gene and marker content. This quality control helps the user identify misassemblies and guides Haplosync’s correction of scaffolding errors. Finally, HaploSync fills assembly gaps with unplaced sequences and resolves collapsed homozygous regions. In a series of plant, fungal, and animal kingdom case studies, we demonstrate that HaploSync efficiently increases the assembly contiguity of phased chromosomes, improves completeness by filling gaps, corrects scaffolding, and correctly phases highly heterozygous, complex regions. 
    more » « less
  5. INTRODUCTION To faithfully distribute genetic material to daughter cells during cell division, spindle fibers must couple to DNA by means of a structure called the kinetochore, which assembles at each chromosome’s centromere. Human centromeres are located within large arrays of tandemly repeated DNA sequences known as alpha satellite (αSat), which often span millions of base pairs on each chromosome. Arrays of αSat are frequently surrounded by other types of tandem satellite repeats, which have poorly understood functions, along with nonrepetitive sequences, including transcribed genes. Previous genome sequencing efforts have been unable to generate complete assemblies of satellite-rich regions because of their scale and repetitive nature, limiting the ability to study their organization, variation, and function. RATIONALE Pericentromeric and centromeric (peri/centromeric) satellite DNA sequences have remained almost entirely missing from the assembled human reference genome for the past 20 years. Using a complete, telomere-to-telomere (T2T) assembly of a human genome, we developed and deployed tailored computational approaches to reveal the organization and evolutionary patterns of these satellite arrays at both large and small length scales. We also performed experiments to map precisely which αSat repeats interact with kinetochore proteins. Last, we compared peri/centromeric regions among multiple individuals to understand how these sequences vary across diverse genetic backgrounds. RESULTS Satellite repeats constitute 6.2% of the T2T-CHM13 genome assembly, with αSat representing the single largest component (2.8% of the genome). By studying the sequence relationships of αSat repeats in detail across each centromere, we found genome-wide evidence that human centromeres evolve through “layered expansions.” Specifically, distinct repetitive variants arise within each centromeric region and expand through mechanisms that resemble successive tandem duplications, whereas older flanking sequences shrink and diverge over time. We also revealed that the most recently expanded repeats within each αSat array are more likely to interact with the inner kinetochore protein Centromere Protein A (CENP-A), which coincides with regions of reduced CpG methylation. This suggests a strong relationship between local satellite repeat expansion, kinetochore positioning, and DNA hypomethylation. Furthermore, we uncovered large and unexpected structural rearrangements that affect multiple satellite repeat types, including active centromeric αSat arrays. Last, by comparing sequence information from nearly 1600 individuals’ X chromosomes, we observed that individuals with recent African ancestry possess the greatest genetic diversity in the region surrounding the centromere, which sometimes contains a predominantly African αSat sequence variant. CONCLUSION The genetic and epigenetic properties of centromeres are closely interwoven through evolution. These findings raise important questions about the specific molecular mechanisms responsible for the relationship between inner kinetochore proteins, DNA hypomethylation, and layered αSat expansions. Even more questions remain about the function and evolution of non-αSat repeats. To begin answering these questions, we have produced a comprehensive encyclopedia of peri/centromeric sequences in a human genome, and we demonstrated how these regions can be studied with modern genomic tools. Our work also illuminates the rich genetic variation hidden within these formerly missing regions of the genome, which may contribute to health and disease. This unexplored variation underlines the need for more T2T human genome assemblies from genetically diverse individuals. Gapless assemblies illuminate centromere evolution. ( Top ) The organization of peri/centromeric satellite repeats. ( Bottom left ) A schematic portraying (i) evidence for centromere evolution through layered expansions and (ii) the localization of inner-kinetochore proteins in the youngest, most recently expanded repeats, which coincide with a region of DNA hypomethylation. ( Bottom right ) An illustration of the global distribution of chrX centromere haplotypes, showing increased diversity in populations with recent African ancestry. 
    more » « less