Abstract The search for the genetic basis of phenotypes has primarily focused on single nucleotide polymorphisms, often overlooking structural variants (SVs). SVs can significantly affect gene function, but detecting and characterizing them is challenging, even with long-read sequencing. Moreover, traditional single-reference methods can fail to capture many genetic variants. Using long reads, we generated a Capuchino Seedeater (Sporophila) pangenome, including 16 individuals from 7 species, to investigate how SVs contribute to species and coloration differences. Leveraging this pangenome, we mapped short-read data from 127 individuals, genotyped variants identified in the pangenome graph, and subsequently performed FST scans and genome-wide association studies. Species divergence primarily arises from SNPs and indels (< 50 bp) in non-coding regions of melanin-related genes, as larger SVs rarely overlap with divergence peaks. One exception was a 55 bp deletion near the OCA2 and HERC2 genes, associated with feather pheomelanin content. These findings support the hypothesis that the reshuffling of small regulatory alleles, rather than larger species-specific mutations, accelerated plumage evolution leading to prezygotic isolation in Capuchinos.
Genetic Variant Detection Over Generations: Sparsity-Constrained Optimization Using Block-Coordinate Descent
Structural variants (SVs) are rearrangements of regions of an individual’s genome. SVs are an important source of genetic diversity and disease in humans and other mammalian species. SV detection is susceptible to sequencing and mapping errors, especially in low-coverage settings, where the average number of reads supporting each variant is low, and these errors lead to high false-positive rates. SVs are rare in the human genome and are shared between related individuals, so it is advantageous to devise algorithms that focus on close relatives. In this paper, we develop a constrained-optimization method to detect germline SVs in genetic signals by jointly considering multiple related individuals. First, we exploit familial relationships by considering a biologically realistic scenario of three generations of related individuals (a grandparent, a parent, and a child). Second, we pose the problem as a constrained optimization problem regularized by a sparsity-promoting penalty. Our framework demonstrates improvements in predicting SVs in related individuals and in distinguishing true SVs from false positives on both simulated and real low-coverage genetic signals from the 1000 Genomes Project. Further, our block-coordinate descent approach matches the accuracy of the 3D projections of the solution, demonstrating feasibility for more complex and higher-dimensional pedigrees.
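The general idea of block-coordinate descent under a sparsity penalty and familial constraints can be illustrated with a minimal sketch. It is not the authors' formulation: the least-squares data term, the inheritance ordering child ≤ parent ≤ grandparent, and the names `fit_three_generations`, `coverage`, and `tau` are all assumptions made for illustration only.

```python
import numpy as np


def fit_three_generations(y_g, y_p, y_c, coverage=4.0, tau=1.0, sweeps=10):
    """Toy block-coordinate descent over three related individuals.

    Assumed model (illustrative, not taken from the paper): each
    individual's read support y is roughly coverage * x, where x in [0, 1]
    is an SV indicator per candidate site. We minimize, per individual,
        0.5 * (coverage * x - y)**2 + tau * x
    subject to the assumed ordering 0 <= x_c <= x_p <= x_g <= 1.
    """
    y_g, y_p, y_c = (np.asarray(v, dtype=float) for v in (y_g, y_p, y_c))
    # Start from the all-zero (feasible, fully sparse) point.
    x_g = np.zeros_like(y_g)
    x_p = np.zeros_like(y_p)
    x_c = np.zeros_like(y_c)

    def block_min(y, lo, hi):
        # Closed-form minimizer of the separable quadratic plus the linear
        # sparsity term, clipped to the box [lo, hi] for this block.
        return np.clip(y / coverage - tau / coverage**2, lo, hi)

    for _ in range(sweeps):
        # Update one block at a time while the other two are held fixed.
        x_g = block_min(y_g, x_p, 1.0)   # grandparent: x_p <= x_g <= 1
        x_p = block_min(y_p, x_c, x_g)   # parent:      x_c <= x_p <= x_g
        x_c = block_min(y_c, 0.0, x_p)   # child:       0   <= x_c <= x_p
    return x_g, x_p, x_c
```

For example, with `y_g = [4, 0, 4, 0.3, 0]`, `y_p = [4, 0, 0.2, 0, 0]`, and `y_c = [4, 0, 0, 0, 3.9]`, the strong child signal at the last site has no support in the parent, so under the assumed ordering it is suppressed to zero, while the first site, supported in all three individuals, is retained.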
- PAR ID: 10505329
- Publisher / Repository: IEEE
- Date Published:
- Journal Name: 2023 IEEE Conference on Medical Measurements and Applications (MeMeA)
- ISBN: 978-1-6654-9384-0
- Page Range / eLocation ID: 1 to 5
- Format(s): Medium: X
- Location: Jeju, Korea, Republic of
- Sponsoring Org: National Science Foundation
More Like this
-
Purugganan, Michael (Ed.) Abstract Structural variants (SVs) are a largely unstudied feature of plant genome evolution, despite the fact that SVs contribute substantially to phenotypes. In this study, we discovered SVs across a population sample of 347 high-coverage, resequenced genomes of Asian rice (Oryza sativa) and its wild ancestor (O. rufipogon). In addition to this short-read data set, we also inferred SVs from whole-genome assemblies and long-read data. Comparisons among data sets revealed different features of genome variability. For example, genome alignment identified a large (∼4.3 Mb) inversion in indica rice varieties relative to japonica varieties, and long-read analyses suggest that ∼9% of genes from the outgroup (O. longistaminata) are hemizygous. We focused, however, on the resequencing sample to investigate the population genomics of SVs. Clustering analyses with SVs recapitulated the rice cultivar groups that were also inferred from SNPs. However, the site-frequency spectrum of each SV type—which included inversions, duplications, deletions, translocations, and mobile element insertions—was skewed toward lower frequency variants than synonymous SNPs, suggesting that SVs may be predominantly deleterious. Among transposable elements, SINE and mariner insertions were found at especially low frequency. We also used SVs to study domestication by contrasting between rice and O. rufipogon. Cultivated genomes contained ∼25% more derived SVs and mobile element insertions than O. rufipogon, indicating that SVs contribute to the cost of domestication in rice. Peaks of SV divergence were enriched for known domestication genes, but we also detected hundreds of genes gained and lost during domestication, some of which were enriched for traits of agronomic interest.
-
Structural variants (SVs) account for a large amount of sequence variability across genomes and play an important role in human genomics and precision medicine. Despite intense efforts over the years, the discovery of SVs in individuals remains challenging due to the diploid and highly repetitive structure of the human genome, and by the presence of SVs that vastly exceed sequencing read lengths. However, the recent introduction of low-error long-read sequencing technologies such as PacBio HiFi may finally enable these barriers to be overcome. Here we present SV discovery with sample-specific strings (SVDSS)—a method for discovery of SVs from long-read sequencing technologies (for example, PacBio HiFi) that combines and effectively leverages mapping-free, mapping-based and assembly-based methodologies for overall superior SV discovery performance. Our experiments on several human samples show that SVDSS outperforms state-of-the-art mapping-based methods for discovery of insertion and deletion SVs in PacBio HiFi reads and achieves notable improvements in calling SVs in repetitive regions of the genome.
-
Fewer than half of individuals with a suspected Mendelian or monogenic condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control data sets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project (1KGP) Oxford Nanopore Technologies Sequencing Consortium aims to generate LRS data from at least 800 of the 1KGP samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37× and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.
-
Abstract Structural variants (SVs) are a major source of genetic variation, and descriptions in natural populations and connections with phenotypic traits are beginning to accumulate in the literature. We integrated advances in genomic sequencing and animal tracking to begin filling this knowledge gap in the Eurasian blackcap. Specifically, we (a) characterized the genome-wide distribution, frequency, and overall fitness effects of SVs using haplotype-resolved assemblies for 79 birds, and (b) used these SVs to study the genetics of seasonal migration. We detected >15 K SVs. Many SVs overlapped repetitive regions and exhibited evidence of purifying selection, suggesting they have overall deleterious effects on fitness. We used estimates of genomic differentiation to identify SVs exhibiting evidence of selection in blackcaps with different migratory strategies. Insertions and deletions dominated the SVs we identified and were associated with genes that are either directly (e.g., regulatory motifs that maintain circadian rhythms) or indirectly (e.g., through immune response) related to migration. We also broke migration down into individual traits (direction, distance, and timing) using existing tracking data and tested whether genetic variation at the SVs we identified could account for phenotypic variation at these traits. This was only the case for 1 trait—direction—and 1 specific SV (a deletion on chromosome 27) accounted for much of this variation. Our results highlight the evolutionary importance of SVs in natural populations and provide insight into the genetic basis of seasonal migration.

