Title: ModDotPlot—rapid and interactive visualization of tandem repeats
Motivation: A common method for analyzing genomic repeats is to produce a sequence similarity matrix visualized via a dot plot. Innovative approaches such as StainedGlass have improved upon this classic visualization by rendering dot plots as a heatmap of sequence identity, enabling researchers to better visualize multi-megabase tandem repeat arrays within centromeres and other heterochromatic regions of the genome. However, computing the similarity estimates for heatmaps requires high computational overhead and can suffer from decreasing accuracy. Results: In this work, we introduce ModDotPlot, an interactive and alignment-free dot plot viewer. By approximating average nucleotide identity via a k-mer-based containment index, ModDotPlot produces accurate plots orders of magnitude faster than StainedGlass. We accomplish this through the use of a hierarchical modimizer scheme that can visualize the full 128 Mb genome of Arabidopsis thaliana in under 5 min on a laptop. ModDotPlot is bundled with a graphical user interface supporting real-time interactive navigation of entire chromosomes. Availability and implementation: ModDotPlot is available at https://github.com/marbl/ModDotPlot.
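The containment-based identity estimate mentioned in the abstract can be illustrated with a minimal Python sketch. It is not ModDotPlot's implementation: the function names, the BLAKE2b hash, and the parameters k (k-mer length) and s (sampling modulus) are hypothetical choices, and the conversion identity ≈ C^(1/k) is the commonly used approximation for turning a containment index C into an identity estimate, assumed here rather than taken from the abstract.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mod_sample(kmer_set, s):
    """Keep only k-mers whose hash is 0 modulo s (a simple 'modimizer' style sample).

    With s = 1 every k-mer is kept; larger s keeps roughly 1/s of them.
    """
    kept = set()
    for km in kmer_set:
        digest = hashlib.blake2b(km.encode(), digest_size=8).digest()
        if int.from_bytes(digest, "big") % s == 0:
            kept.add(km)
    return kept


def containment(a, b):
    """Containment index of set a in set b: |a & b| / |a|."""
    if not a:
        return 0.0
    return len(a & b) / len(a)


def identity_estimate(seq_a, seq_b, k=21, s=1):
    """Approximate sequence identity from k-mer containment (assumed C ** (1/k))."""
    a = mod_sample(kmers(seq_a, k), s)
    b = mod_sample(kmers(seq_b, k), s)
    return containment(a, b) ** (1.0 / k)
```

Because the same hash rule is applied to both sequences, a k-mer shared by both is either sampled in both or in neither, which is what lets a sparse sample stand in for the full k-mer sets when estimating containment.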
Shumate, Alaina; Salzberg, Steven L
(Bioinformatics)
Valencia, Alfonso
(Ed.)
Motivation: Improvements in DNA sequencing technology and computational methods have led to a substantial increase in the creation of high-quality genome assemblies of many species. To understand the biology of these genomes, annotation of gene features and other functional elements is essential; however, for most species, only the reference genome is well-annotated. Results: One strategy to annotate new or improved genome assemblies is to map or ‘lift over’ the genes from a previously annotated reference genome. Here, we describe Liftoff, a new genome annotation lift-over tool capable of mapping genes between two assemblies of the same or closely related species. Liftoff aligns genes from a reference genome to a target genome and finds the mapping that maximizes sequence identity while preserving the structure of each exon, transcript and gene. We show that Liftoff can accurately map 99.9% of genes between two versions of the human reference genome with an average sequence identity >99.9%. We also show that Liftoff can map genes across species by successfully lifting over 98.3% of human protein-coding genes to a chimpanzee genome assembly with 98.2% sequence identity. Availability and implementation: Liftoff can be installed via bioconda and PyPI. In addition, the source code for Liftoff is available at https://github.com/agshumate/Liftoff. Supplementary information: Supplementary data are available at Bioinformatics online.
Heusser, Andrew C; Ziman, Kirsten; Owen, Lucy LW; Manning, Jeremy R
(arXiv.org)
Data visualizations can reveal trends and patterns that are not otherwise obvious from the raw data or summary statistics. While visualizing low-dimensional data is relatively straightforward (for example, plotting the change in a variable over time as (x,y) coordinates on a graph), it is not always obvious how to visualize high-dimensional datasets in a similarly intuitive way. Here we present HyperTools, a Python toolbox for visualizing and manipulating large, high-dimensional datasets. Our primary approach is to use dimensionality reduction techniques (Pearson, 1901; Tipping & Bishop, 1999) to embed high-dimensional datasets in a lower-dimensional space, and plot the data using a simple (yet powerful) API with many options for data manipulation [e.g. hyperalignment (Haxby et al., 2011), clustering, normalizing, etc.] and plot styling. The toolbox is designed around the notion of data trajectories and point clouds. Just as the position of an object moving through space can be visualized as a 3D trajectory, HyperTools uses dimensionality reduction algorithms to create similar 2D and 3D trajectories for time series of high-dimensional observations. The trajectories may be plotted as interactive static plots or visualized as animations. These same dimensionality reduction and alignment algorithms can also reveal structure in static datasets (e.g. collections of observations or attributes). We present several examples showcasing how using our toolbox to explore data through trajectories and low-dimensional embeddings can reveal deep insights into datasets across a wide variety of domains.
Improvements in DNA sequencing technology and computational methods have led to a substantial increase in the creation of high-quality genome assemblies of many species. To understand the biology of these genomes, annotation of gene features and other functional elements is essential; however, for most species, only the reference genome is well-annotated. One strategy to annotate new or improved genome assemblies is to map or ‘lift over’ the genes from a previously annotated reference genome. Here we describe Liftoff, a new genome annotation lift-over tool capable of mapping genes between two assemblies of the same or closely related species. Liftoff aligns genes from a reference genome to a target genome and finds the mapping that maximizes sequence identity while preserving the structure of each exon, transcript, and gene. We show that Liftoff can accurately map 99.9% of genes between two versions of the human reference genome with an average sequence identity >99.9%. We also show that Liftoff can map genes across species by successfully lifting over 98.4% of human protein-coding genes to a chimpanzee genome assembly with 98.7% sequence identity. Availability: The source code for Liftoff is available at https://github.com/agshumate/Liftoff.
Long-term monitoring of habitat occupancy can reveal patterns of habitat use, population dynamics, and factors controlling species distribution. The American pika (Ochotona princeps), a small mammal found in rocky habitats throughout western North America, has been targeted for occupancy studies due to its relatively conspicuous behavior and its unusual adaptations for surviving long, cold winters without hibernation. These adaptations include an unusually high resting metabolic rate and maintenance of body temperatures near the lethal maximum for this species, which would appear to compromise the pika's ability to survive warmer summers. Recent monitoring as well as projections based on future climate scenarios have suggested this species is experiencing a period of range retraction due to warming summers and/or loss of insulating winter snow cover. Niwot Ridge is situated ideally to test competing hypotheses about the trajectory and drivers of pika range shift. The pika is still common throughout the Colorado Rockies, but published models differ markedly regarding projections of the pika’s future distribution in this region. Niwot Ridge has experienced warmer summers as well as shorter periods of insulating snow cover in recent years, and there is evidence that pikas are now less common than they once were in at least one area on the ridge. This study is designed to provide robust data on pika population trends through long-term monitoring of occupancy in a spatially balanced random sample of pika habitat patches centered on Niwot Ridge. Survey plots (n = 72) were selected according to a Generalized Random-Tessellation Stratified (GRTS) algorithm, stratified dichotomously by elevation, average annual snow accumulation (SWE), and probabilities of pika occurrence based on previous data. Each plot extends 12 m in radius from a GRTS point. 
To ensure that each plot contains at least 10% cover of talus, plot coordinates were adjusted (usually less than 50 m) or replaced using the GRTS oversample to select the next available and suitable plot within the same categories of elevation, SWE and probability of occurrence (see "pika-survey-GRTS-plot-tracking-record.cr.data.csv" for plot strata, survey schedules, GRTS sequence, and records of plot replacement or location adjustments). Trained technicians survey plots for pikas and fresh pika sign (food caches and fecal pellets) as well as metrics of habitat quality. Each year, 48 of the 72 plots are surveyed in a rotating panel design (24 plots are surveyed annually, 24 in even years and 24 in odd years). Plots are surveyed in August when pikas are engaged in food caching and other conspicuous behaviors related to territory establishment and defense. Data collected at each plot are detailed in a survey manual ("pika_survey.cr.methods.docx"). Each plot is outfitted with a data logger (sensor) to record sub-surface temperature several times each day. Photos of plot and sensor locations are used in navigation and sensor retrieval. Each survey is completed during a brief (half-hour) visit to the plot to service the sensor and to record habitat and pika data. A subset of plots (n = 12) are selected for double surveys each year to allow estimation of pika detection probability. Estimates of detection probability are also informed by data on time to detection of pikas and pika sign recorded during each survey. Samples of fresh pika fecal pellets are collected from occupied plots and are stored as vouchers of pika presence and for use in studies of population genetics and physiology, including studies of physiological stress in relation to habitat quality and microclimate.
Sweeten, Alexander P, Schatz, Michael C, and Phillippy, Adam M. ModDotPlot—rapid and interactive visualization of tandem repeats. Retrieved from https://par.nsf.gov/biblio/10561744. Bioinformatics 40.8 Web. doi:10.1093/bioinformatics/btae493.
Sweeten, Alexander P, Schatz, Michael C, & Phillippy, Adam M. ModDotPlot—rapid and interactive visualization of tandem repeats. Bioinformatics, 40 (8). Retrieved from https://par.nsf.gov/biblio/10561744. https://doi.org/10.1093/bioinformatics/btae493
@article{osti_10561744,
title = {ModDotPlot—rapid and interactive visualization of tandem repeats},
url = {https://par.nsf.gov/biblio/10561744},
DOI = {10.1093/bioinformatics/btae493},
abstractNote = {Motivation: A common method for analyzing genomic repeats is to produce a sequence similarity matrix visualized via a dot plot. Innovative approaches such as StainedGlass have improved upon this classic visualization by rendering dot plots as a heatmap of sequence identity, enabling researchers to better visualize multi-megabase tandem repeat arrays within centromeres and other heterochromatic regions of the genome. However, computing the similarity estimates for heatmaps requires high computational overhead and can suffer from decreasing accuracy. Results: In this work, we introduce ModDotPlot, an interactive and alignment-free dot plot viewer. By approximating average nucleotide identity via a k-mer-based containment index, ModDotPlot produces accurate plots orders of magnitude faster than StainedGlass. We accomplish this through the use of a hierarchical modimizer scheme that can visualize the full 128 Mb genome of Arabidopsis thaliana in under 5 min on a laptop. ModDotPlot is bundled with a graphical user interface supporting real-time interactive navigation of entire chromosomes. Availability and implementation: ModDotPlot is available at https://github.com/marbl/ModDotPlot.},
journal = {Bioinformatics},
volume = {40},
number = {8},
publisher = {Oxford University Press},
author = {Sweeten, Alexander P and Schatz, Michael C and Phillippy, Adam M},
editor = {Ponty, Yann}
}