Abstract Large-scale genotype and phenotype data have been increasingly generated to identify genetic markers, understand gene function and evolution and facilitate genomic selection. These datasets hold immense value for both current and future studies, as they are vital for crop breeding, yield improvement and overall agricultural sustainability. However, integrating these datasets from heterogeneous sources presents significant challenges and hinders their effective utilization. We established the Genotype-Phenotype Working Group in November 2021 as a part of the AgBioData Consortium (https://www.agbiodata.org) to review current data types and resources that support archiving, analysis and visualization of genotype and phenotype data, and to understand the needs and challenges of the plant genomic research community. For 2021–22, we identified different types of datasets and examined metadata annotations related to experimental design/methods/sample collection, etc. Furthermore, we thoroughly reviewed publicly funded repositories for raw and processed data as well as secondary databases and knowledgebases that enable the integration of heterogeneous data in the context of the genome browser, pathway networks and tissue-specific gene expression. Based on our survey, we recommend (i) additional infrastructural support for archiving many new data types, (ii) development of community standards for data annotation and formatting, (iii) resources for biocuration and (iv) analysis and visualization tools to connect genotype data with phenotype data to enhance knowledge synthesis and to foster translational research. Although this paper only covers the data and resources relevant to the plant research community, we expect that similar issues and needs are shared by researchers working on animals. Database URL: https://www.agbiodata.org.
                            The evolving privacy and security concerns for genomic data analysis and sharing as observed from the iDASH competition
                        
                    
    
            Abstract Concerns regarding inappropriate leakage of sensitive personal information as well as unauthorized data use are increasing with the growth of genomic data repositories. Therefore, privacy and security of genomic data have become increasingly important and need to be studied. With many proposed protection techniques, their applicability in support of biomedical research should be well understood. For this purpose, we have organized a community effort over the past 8 years through the integrating Data for Analysis, Anonymization and SHaring (iDASH) consortium to address this practical challenge. In this article, we summarize our experience from the iDASH competitions, report lessons learned from the events in 2020/2021 as examples, and discuss potential future research directions in this emerging field.
- Award ID(s): 1838083
- PAR ID: 10388923
- Date Published:
- Journal Name: Journal of the American Medical Informatics Association
- Volume: 29
- Issue: 12
- ISSN: 1067-5027
- Page Range / eLocation ID: 2182 to 2190
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- 
            DNA sequencing plays an important role in the bioinformatics research community. DNA sequencing is important to all organisms, especially to humans and from multiple perspectives. These include understanding the correlation of specific mutations that play a significant role in increasing or decreasing the risk of developing a disease or condition, and finding the implications and connections between genotype and phenotype. Advancements in high-throughput sequencing techniques, tools, and equipment have helped to generate big genomic datasets owing to the tremendous decrease in DNA sequencing costs. However, these advancements have posed great challenges to genomic data storage, analysis, and transfer. Accessing, manipulating, and sharing the generated big genomic datasets present major challenges in terms of time and size, as well as privacy. Data size plays an important role in addressing these challenges. Accordingly, data minimization techniques have recently attracted much interest in the bioinformatics research community, and it is critical to develop new ways to minimize data size. This paper presents a new real-time data minimization mechanism for big genomic datasets that shortens the transfer time in a more secure manner, despite the potential occurrence of a data breach. Our method applies the random sampling of Fourier transform theory to real-time generated big genomic datasets in both FASTA and FASTQ formats and assigns the lowest possible codeword to the most frequent characters of the datasets. Our results indicate that the proposed data minimization algorithm reduces the size of FASTA datasets by up to 79% and is 98-fold faster and more secure than the standard data-encoding method. The results also show a size reduction of up to 45% for FASTQ datasets, 57-fold faster than the standard data-encoding approach. Based on our results, we conclude that the proposed data minimization algorithm provides the best performance among current data-encoding approaches for big real-time generated genomic datasets.
- 
            Abstract Montane species endemic to the “sky islands” of the North American southwest were significantly impacted by changing climates during the Pleistocene. We combined mitochondrial and genomic data with species distribution modelling to determine whether Aphonopelma marxi, a large tarantula from the nearby Colorado Plateau, was similarly impacted by glacial climates. Genetic analyses revealed that the species comprises three main clades that diverged in the Pleistocene. A clade distributed along the Mogollon Rim appears to have persisted in place during glacial conditions, whereas the other two clades probably colonized central and northeastern portions of the species' range from refugia in canyons. Climate models support this hypothesis for the Mogollon Rim, but late glacial climate data appear too coarse to detect suitable areas in canyons. Locations of canyon refugia could not be inferred from genomic analyses due to missing data, encouraging us to explore the effect of missing loci in phylogeographical inferences using RADseq. Results from analyses with varying amounts of missing data suggest that samples with large amounts of missing data can still improve inferences, and that the specific loci that are missing matter more than the number of missing loci. This study highlights the profound impact of Pleistocene climates on tarantulas endemic to the Colorado Plateau, as well as the mixed nature of the region's fauna. Some animals recently colonized from nearby deserts as glacial climates receded, whereas others, like tarantulas, appear to have persisted on the Mogollon Rim and in refugia associated with the region's famous river‐cut canyons.
- 
            Abstract The Biorepository and Integrative Genomics (BIG) Initiative in Tennessee has developed a pioneering resource to address gaps in genomic research by linking genomic, phenotypic, and environmental data from a diverse Mid-South population, including underrepresented groups. We analyzed 13,152 exomes from BIG and found significant genetic diversity, with 50% of participants inferred to have non-European or several types of admixed ancestry. Ancestry within the BIG cohort is stratified, with distinct geographic and demographic patterns, as African ancestry is more common in urban areas, while European ancestry is more common in suburban regions. We observe ancestry-specific rates of novel genetic variants, which are enriched for functional or clinical relevance. Disease prevalence analysis linked ancestry and environmental factors, showing higher odds ratios for asthma and obesity in minority groups, particularly in the urban area. Finally, we observe discrepancies between self-reported race and genetic ancestry, with related individuals self-identifying in differing racial categories. These findings underscore the limitations of race as a biomedical variable. BIG has proven to be an effective model for community-centered precision medicine. We integrated genomics education and fostered great trust among the contributing communities. Future goals include cohort expansion and enhanced genomic analysis to ensure equitable healthcare outcomes.
- 
            Abstract The genomics revolution continues to change how ecologists and evolutionary biologists study the evolution and maintenance of biodiversity. It is now easier than ever to generate large molecular data sets consisting of hundreds to thousands of independently evolving nuclear loci to estimate a suite of evolutionary and demographic parameters. However, any inferences will be incomplete or inaccurate if incorrect taxonomic identities are perpetuated throughout the analytical pipeline. Due to decades of research and comprehensive online databases, sequencing and analysis of mitochondrial DNA (mtDNA), chloroplast DNA (cpDNA) and select nuclear genes can provide researchers with a cost-effective and simple means to verify the species identity of samples prior to subsequent phylogeographic and population genomic analysis. The addition of these sequences to genomic studies can also shed light on other important evolutionary questions such as explanations for gene tree‐species tree discordance, species limits, sex‐biased dispersal patterns, adaptation, and mtDNA introgression. Although the mtDNA and cpDNA genomes often should not be used exclusively to make historical inferences given their well‐known limitations, the addition of these data to modern genomic studies adds little cost and effort while simultaneously providing a wealth of useful data that can have significant implications for both basic and applied research.