

Title: More for less: predicting and maximizing genomic variant discovery via Bayesian nonparametrics
Abstract: While the cost of sequencing genomes has decreased dramatically in recent years, this expense often remains non-trivial. Under a fixed budget, scientists face a natural trade-off between quantity and quality: spending resources to sequence a greater number of genomes or spending resources to sequence genomes with increased accuracy. Our goal is to find the allocation of resources between quantity and quality that reveals as many new variants in the genome as possible. In this paper, we introduce a Bayesian nonparametric methodology to predict the number of new variants in a follow-up study based on a pilot study. When experimental conditions are kept constant between the pilot and follow-up, we find that our prediction is competitive with the best existing methods. Unlike current methods, though, our method allows practitioners to change experimental conditions between the pilot and the follow-up. We demonstrate how this distinction allows our method to be used for more realistic predictions and for optimal allocation of a fixed budget between quality and quantity.
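To give a concrete sense of the prediction problem, the sketch below implements the classical Good–Toulmin extrapolation from pilot frequency counts in Python. This is a standard baseline for unseen-variant prediction, not the Bayesian nonparametric estimator developed in the paper, and the pilot counts are invented for illustration.

```python
import numpy as np

def good_toulmin(freq_counts, t):
    """Predict how many new variants a follow-up study would discover.

    freq_counts[k] is the number of variants observed in exactly k+1 pilot
    samples; t is the follow-up size as a multiple of the pilot size.
    The alternating-series estimator is reliable for t <= 1; smoothed
    variants are needed to extrapolate further.
    """
    f = np.asarray(freq_counts, dtype=float)
    k = np.arange(1, len(f) + 1)
    return float(np.sum((-1.0) ** (k + 1) * t**k * f))

# Hypothetical pilot: 5,000 variants seen once, 1,200 twice, 400 three times.
pilot_counts = [5000, 1200, 400]
# Predicted new variants if the follow-up adds half as many samples as the pilot.
print(good_toulmin(pilot_counts, t=0.5))
```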
Award ID(s): 1750286
NSF-PAR ID: 10282763
Journal Name: Biometrika
ISSN: 0006-3444
Sponsoring Org: National Science Foundation
More Like this
  1. Obeid, Iyad; Picone, Joseph; Selesnick, Ivan (Eds.)
    The Neural Engineering Data Consortium (NEDC) is developing a large open source database of high-resolution digital pathology images known as the Temple University Digital Pathology Corpus (TUDP) [1]. Our long-term goal is to release one million images. We expect to release the first 100,000-image corpus by December 2020. The data is being acquired at the Department of Pathology at Temple University Hospital (TUH) using a Leica Biosystems Aperio AT2 scanner [2] and consists entirely of clinical pathology images. More information about the data and the project can be found in Shawki et al. [3]. We currently have a National Science Foundation (NSF) planning grant [4] to explore how the community can best leverage this resource. One goal of this poster presentation is to stimulate community-wide discussions about this project and determine how this valuable resource can best meet the needs of the public.

    The computing infrastructure required to support this database is extensive [5] and includes two HIPAA-secure computer networks, dual petabyte file servers, and Aperio’s eSlide Manager (eSM) software [6]. We currently have digitized over 50,000 slides from 2,846 patients and 2,942 clinical cases. There is an average of 12.4 slides per patient and 10.5 slides per case, with one report per case. The data is organized by tissue type as shown below:

    Filenames:
    tudp/v1.0.0/svs/gastro/000001/00123456/2015_03_05/0s15_12345/0s15_12345_0a001_00123456_lvl0001_s000.svs
    tudp/v1.0.0/svs/gastro/000001/00123456/2015_03_05/0s15_12345/0s15_12345_00123456.docx

    Explanation:
    - tudp: root directory of the corpus
    - v1.0.0: version number of the release
    - svs: the image data type
    - gastro: the type of tissue
    - 000001: six-digit sequence number used to control directory complexity
    - 00123456: 8-digit patient MRN
    - 2015_03_05: the date the specimen was captured
    - 0s15_12345: the clinical case name
    - 0s15_12345_0a001_00123456_lvl0001_s000.svs: the actual image filename, consisting of a repeat of the case name, a site code (e.g., 0a001), the type and depth of the cut (e.g., lvl0001), and a token number (e.g., s000)
    - 0s15_12345_00123456.docx: the filename for the corresponding case report

    (A short, illustrative parsing sketch for this naming convention appears at the end of this entry.)

    We currently recognize fifteen tissue types in the first installment of the corpus. The raw image data is stored in Aperio’s “.svs” format, which is a multi-layered compressed JPEG format [3,7]. Pathology reports containing a summary of how a pathologist interpreted the slide are also provided in a flat text file format. A more complete summary of the demographics of this pilot corpus will be presented at the conference.

    Another goal of this poster presentation is to share our experiences with the larger community, since many of these details have not been adequately documented in scientific publications. There are quite a few obstacles in collecting this data that have slowed down the process and need to be discussed publicly. Our backlog of slides dates back to 1997, meaning there are many that need to be sifted through and discarded for peeling or cracking. Additionally, during scanning a slide can get stuck, stalling a scan session for hours and resulting in a significant loss of productivity. Over the past two years, we have accumulated significant experience with how to scan a diverse inventory of slides using the Aperio AT2 high-volume scanner. We have been working closely with the vendor to resolve many problems associated with the use of this scanner for research purposes.
    This scanning project began in January of 2018 when the scanner was first installed. The scanning process was slow at first since there was a learning curve with how the scanner worked and how to obtain samples from the hospital. From its start date until May of 2019, ~20,000 slides were scanned. In the past six months, from May to November, we have tripled that number and now hold ~60,000 slides in our database. This dramatic increase in productivity was due to additional undergraduate staff members and an emphasis on efficient workflow. The Aperio AT2 scans 400 slides a day, requiring at least eight hours of scan time. The efficiency of these scans can vary greatly. When our team first started, approximately 5% of slides failed the scanning process due to focal point errors. We have been able to reduce that to 1% through a variety of means: (1) best practices regarding daily and monthly recalibrations, (2) tweaking the software, such as the tissue finder parameter settings, and (3) experience with how to clean and prep slides so they scan properly. Nevertheless, this is not a completely automated process, making it very difficult to reach our production targets. With a staff of three undergraduate workers spending a total of 30 hours per week, we find it difficult to scan more than 2,000 slides per week using a single scanner (400 slides per night x 5 nights per week). The main limitation in achieving this level of production is the lack of a completely automated scanning process; it takes a couple of hours to sort, clean, and load slides. We have streamlined all other aspects of the workflow required to database the scanned slides so that there are no additional bottlenecks.

    To bridge the gap between hospital operations and research, we are using Aperio’s eSM software. Our goal is to provide pathologists access to high-quality digital images of their patients’ slides. eSM is a secure website that holds the images with their metadata labels, patient report, and path to where the image is located on our file server. Although eSM includes significant infrastructure to import slides into the database using barcodes, TUH does not currently support barcode use. Therefore, we manage the data using a mixture of Python scripts and manual import functions available in eSM. The database and associated tools are based on proprietary formats developed by Aperio, making this another important point of community-wide discussion on how best to disseminate such information.

    Our near-term goal for the TUDP Corpus is to release 100,000 slides by December 2020. We hope to continue data collection over the next decade until we reach one million slides. We are creating two pilot corpora using the first 50,000 slides we have collected. The first corpus consists of 500 slides with a marker stain and another 500 without it. This set was designed to let people debug their basic deep learning processing flow on these high-resolution images. We discuss our preliminary experiments on this corpus and the challenges in processing these high-resolution images using deep learning in [3]. We are able to achieve a mean sensitivity of 99.0% for slides with pen marks, and 98.9% for slides without marks, using a multistage deep learning algorithm. While this dataset was very useful in initial debugging, we are in the midst of creating a new, more challenging pilot corpus using actual tissue samples annotated by experts. The task will be to detect ductal carcinoma in situ (DCIS) or invasive breast cancer tissue.
    There will be approximately 1,000 images per class in this corpus. Based on the number of features annotated, we can train on a two-class problem of DCIS versus benign, or increase the difficulty by expanding the classes to include DCIS, benign, stroma, pink tissue, non-neoplastic, etc. Those interested in the corpus or in participating in community-wide discussions should join our listserv, nedc_tuh_dpath@googlegroups.com, to be kept informed of the latest developments in this project. You can learn more from our project website: https://www.isip.piconepress.com/projects/nsf_dpath.
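    As a quick illustration of the naming convention described in this entry, the following Python sketch extracts the case name, site code, patient MRN, cut level, and token number from an image filename. The regular expression and example path are assumptions based on the description above, not project code, and real site codes may not always look like "0a001".

```python
import re
from pathlib import Path

# Example path following the convention described above (hypothetical data).
path = Path("tudp/v1.0.0/svs/gastro/000001/00123456/2015_03_05/"
            "0s15_12345/0s15_12345_0a001_00123456_lvl0001_s000.svs")

# case name _ site code _ 8-digit MRN _ cut level _ token number .svs
pattern = re.compile(
    r"(?P<case>[^_]+_[^_]+)_(?P<site>[0-9a-z]{5})_(?P<mrn>\d{8})"
    r"_lvl(?P<level>\d{4})_s(?P<token>\d{3})\.svs$"
)

match = pattern.search(path.name)
if match:
    print(match.groupdict())
    # {'case': '0s15_12345', 'site': '0a001', 'mrn': '00123456',
    #  'level': '0001', 'token': '000'}
```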
  2. INTRODUCTION: A major challenge in genomics is discerning which bases among billions alter organismal phenotypes and affect health and disease risk. Evidence of past selective pressure on a base, whether highly conserved or fast evolving, is a marker of functional importance. Bases that are unchanged in all mammals may shape phenotypes that are essential for organismal health. Bases that are evolving quickly in some species, or changed only in species that share an adaptive trait, may shape phenotypes that support survival in specific niches. Identifying bases associated with exceptional capacity for cellular recovery, such as in species that hibernate, could inform therapeutic discovery.

    RATIONALE: The power and resolution of evolutionary analyses scale with the number and diversity of species compared. By analyzing genomes for hundreds of placental mammals, we can detect which individual bases in the genome are exceptionally conserved (constrained) and likely to be functionally important in both coding and noncoding regions. By including species that represent all orders of placental mammals and aligning genomes using a method that does not require designating humans as the reference species, we explore unusual traits in other species.

    RESULTS: Zoonomia’s mammalian comparative genomics resources are the most comprehensive and statistically well-powered produced to date, with a protein-coding alignment of 427 mammals and a whole-genome alignment of 240 placental mammals representing all orders. We estimate that at least 10.7% of the human genome is evolutionarily conserved relative to neutrally evolving repeats and identify about 101 million significantly constrained single bases (false discovery rate < 0.05). We cataloged 4552 ultraconserved elements at least 20 bases long that are identical in more than 98% of the 240 placental mammals. Many constrained bases have no known function, illustrating the potential for discovery using evolutionary measures. Eighty percent are outside protein-coding exons, and half have no functional annotations in the Encyclopedia of DNA Elements (ENCODE) resource. Constrained bases tend to vary less within human populations, which is consistent with purifying selection. Species threatened with extinction have few substitutions at constrained sites, possibly because severely deleterious alleles have been purged from their small populations. By pairing Zoonomia’s genomic resources with phenotype annotations, we find genomic elements associated with phenotypes that differ between species, including olfaction, hibernation, brain size, and vocal learning. We associate genomic traits, such as the number of olfactory receptor genes, with physical phenotypes, such as the number of olfactory turbinals. By comparing hibernators and nonhibernators, we implicate genes involved in mitochondrial disorders, protection against heat stress, and longevity in this physiologically intriguing phenotype. Using a machine learning–based approach that predicts tissue-specific cis-regulatory activity in hundreds of species using data from just a few, we associate changes in noncoding sequence with traits for which humans are exceptional: brain size and vocal learning.

    CONCLUSION: Large-scale comparative genomics opens new opportunities to explore how genomes evolved as mammals adapted to a wide range of ecological niches and to discover what is shared across species and what is distinctively human.
    High-quality data for consistently defined phenotypes are necessary to realize this potential. Through partnerships with researchers in other fields, comparative genomics can address questions in human health and basic biology while guiding efforts to protect the biodiversity that is essential to these discoveries.

    [Figure: Comparing genomes from 240 species to explore the evolution of placental mammals. The phylogeny distinguishes mammalian orders; rings around it annotate species phenotypes; seven species with diverse traits are illustrated; and sequence conservation across species is summarized. Image credit: K. Morrill]
  3. In consulting, finance, and other service industries, customers represent a revenue stream and must be acquired and retained over time. In this paper, we study the resource allocation problem of a profit-maximizing service firm that dynamically allocates its resources toward acquiring new clients and retaining unsatisfied existing ones. The interaction between acquisition and retention in our model is reflected in the cash constraint on total expected spending on acquisition and retention in each period. We formulate this problem as a dynamic program in which the firm makes decisions on both acquisition and retention after observing the current size of its customer base and receiving information about customers in danger of attrition, and we characterize the structure of the optimal acquisition and retention strategy. We show that when the firm's customer base size is relatively low, the firm should spend heavily on acquisition and try to retain every unhappy customer. However, as its customer base grows, the firm should gradually shift its emphasis from acquisition to retention, and it should also aim to strike a balance between acquisition and retention while spending its available resources. Finally, when the customer base is large enough, it may be optimal for the firm to begin spending less on both acquisition and retention. We also extend our analysis to situations where the acquisition or retention success rate, as a function of resource allocation, is uncertain, and we show that the optimal acquisition and retention policy can be surprisingly complex; however, we develop an effective heuristic for that case. This paper aims to provide service managers with analytical principles and effective guidelines on resource allocation between these two significant activities based on their firm's customer base size.
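    To make the flavor of this kind of dynamic program concrete, here is a minimal Python sketch of finite-horizon value iteration over the customer base size, with a fixed per-period budget split between acquisition and retention. All functional forms and parameter values below are assumptions for illustration; this is a toy sketch, not the model or analysis from the paper.

```python
import numpy as np

# Toy dynamic program: state = customer base size, action = how much of a
# fixed per-period budget B to spend on acquisition (the rest on retention).
N_MAX, B, HORIZON = 60, 10, 30
REVENUE, CHURN, GAMMA = 1.0, 0.2, 0.95

def acquired(a):
    # assumed diminishing returns on acquisition spend
    return int(round(3 * np.sqrt(a)))

def saved_fraction(b):
    # assumed fraction of at-risk customers kept by retention spend
    return min(1.0, 0.3 * np.sqrt(b))

V = np.zeros(N_MAX + 1)
policy = np.zeros(N_MAX + 1, dtype=int)
for _ in range(HORIZON):
    V_new = np.empty_like(V)
    for n in range(N_MAX + 1):
        values = []
        for a in range(B + 1):                     # acquisition spend
            at_risk = CHURN * n
            kept = n - at_risk + saved_fraction(B - a) * at_risk
            n_next = min(N_MAX, int(round(kept)) + acquired(a))
            values.append(REVENUE * n - B + GAMMA * V[n_next])
        V_new[n] = max(values)
        policy[n] = int(np.argmax(values))         # best acquisition spend at n
    V = V_new

# Inspect how the budget split changes with the size of the customer base.
for n in (5, 20, 40, 60):
    print(f"base={n:2d}  acquisition={policy[n]}  retention={B - policy[n]}")
```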

     
  4. Dynamic network topology can pose important challenges to communication and control protocols in networks of autonomous vehicles. For instance, maintaining connectivity is a key challenge in unmanned aerial vehicle (UAV) networks. However, the tracking and computational resources of the observer module might not be sufficient for constant monitoring of all surrounding nodes in large-scale networks. In this paper, we propose an optimal measurement policy for network topology monitoring under constrained resources. To this end, we formulate the localization of multiple objects in terms of linear networked systems and solve it using Kalman filtering with intermittent observations. The proposed policy includes two sequential steps. We first find optimal measurement attempt probabilities for each target using numerical optimization methods to assign the limited number of resources among targets. The optimal resource allocation follows a waterfall-like solution that assigns more resources to targets with lower measurement success probability. This provides a 10% to 60% gain in prediction accuracy. The second step is finding optimal on-off patterns for measurement attempts for each target over time. We show that a regular measurement pattern that evenly distributes resources over time outperforms the two extreme cases of using all measurement resources either at the beginning or at the end of the measurement cycle. Our proof is based on characterizing the fixed-point solution of the error covariance matrix for regular patterns. Extensive simulation results confirm the optimality of the most alternating pattern, with up to 10-fold prediction improvement in different scenarios. These two guidelines define a general policy for target tracking under constrained resources, with applications to network topology prediction for autonomous systems.
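    The covariance intuition behind the measurement-scheduling result can be illustrated with a scalar Kalman-filter recursion. The Python sketch below compares an evenly spaced measurement schedule against a front-loaded one over a single cycle; all system parameters are assumed for illustration rather than taken from the paper.

```python
import numpy as np

# Scalar error-covariance recursion with intermittent measurements.
A, Q, R = 1.05, 1.0, 0.5        # state transition, process noise, measurement noise
CYCLE, SLOTS = 20, 5            # cycle length and number of measurement attempts

def final_covariance(schedule, p0=1.0):
    P = p0
    for k in range(CYCLE):
        P = A**2 * P + Q            # prediction step (covariance grows)
        if k in schedule:           # measurement update only when scheduled
            P = P * R / (P + R)
    return P

evenly_spaced = set(range(0, CYCLE, CYCLE // SLOTS))   # attempts at k = 0, 4, 8, 12, 16
front_loaded = set(range(SLOTS))                       # attempts at k = 0, 1, 2, 3, 4

print("evenly spaced:", final_covariance(evenly_spaced))
print("front-loaded :", final_covariance(front_loaded))
# The evenly spaced schedule ends the cycle with a much smaller error covariance.
```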
  5. Summary

    Food chain efficiency (FCE), the proportion of primary production converted to production of the top trophic level, can influence several ecosystem services as well as the biodiversity and productivity of each trophic level. Aquatic FCE is affected by light and nutrient supply, largely via effects on primary producer stoichiometry that propagate to herbivores and then carnivores. Here, we test the hypothesis that the identity of the top carnivore mediates FCE responses to changes in light and nutrient supply. (A brief worked example of the FCE calculation appears at the end of this entry.)

    We conducted a large-scale, 6-week mesocosm experiment in which we manipulated light and nutrient (nitrogen and phosphorus) supply and the identity of the carnivore in a 2 × 2 × 2 factorial design. We quantified the response of FCE and the biomass and productivity of each trophic level (phytoplankton, zooplankton, and carnivore). We used an invertebrate, Chaoborus americanus, and a vertebrate, bluegill sunfish (Lepomis macrochirus), as the two carnivores in this study because of the large difference in phosphorus requirements between these taxa.

    We predicted that bluegill would be more likely to experience P-limitation due to their higher P requirements, and hence that FCE would be lower in the bluegill treatments than in the Chaoborus treatments. We also expected the interactive effect of light and nutrients to be stronger in the bluegill treatments. Within a carnivore treatment, we predicted highest FCE under low light and high nutrient supply, as these conditions would produce high-quality (low C:nutrient) algal resources. In contrast, if food quantity had a stronger effect on carnivore production than food quality, carnivore production would increase proportionally with primary production, and thus FCE would be similar across light and nutrient treatments.

    Carnivore identity mediated the effects of light and nutrients on FCE, and as predicted, FCE was higher in food chains with Chaoborus than with bluegill. Also as predicted, FCE in Chaoborus treatments was higher under low light. However, FCE in bluegill treatments was higher at high light supply, opposite to our predictions. In addition, bluegill production increased proportionally with primary production, while Chaoborus production was not correlated with primary production, suggesting that bluegill responded more strongly to food quantity than to food quality. These carnivore taxa differ in traits other than body stoichiometry, for example, feeding selectivity, which may have contributed to the observed differences in FCE between carnivores.

    Comparison of our results with those from previous experiments showed that FCE responds similarly to light and nutrients in food chains with Chaoborus and larval fish (gizzard shad: Clupeidae), but very differently in food chains with bluegill. These findings warrant further investigation into the mechanisms related to carnivore identity (e.g., developmental stage, feeding selectivity) underlying these responses, and highlight the importance of considering both top-down and bottom-up effects when evaluating food chain responses to changing light and nutrient conditions.
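    As a brief worked example of the FCE definition used in this entry, the following Python lines compute FCE as the ratio of top-carnivore production to primary production; the production values are invented for illustration, not measurements from the experiment.

```python
# Made-up production values (e.g., mg C per mesocosm over the experiment).
primary_production = 120.0      # phytoplankton
carnivore_production = 1.8      # top trophic level (Chaoborus or bluegill)

fce = carnivore_production / primary_production
print(f"FCE = {fce:.3f}")       # proportion of primary production reaching the top
```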

     