Ensembl Genomes (https://www.ensemblgenomes.org) provides access to non-vertebrate genomes and analysis, complementing the vertebrate resources developed by the Ensembl project (https://www.ensembl.org). Together, the two resources present genome annotation through a consistent set of interfaces spanning the tree of life, covering genome sequence, annotation, variation, transcriptomic data and comparative analysis. Here, we present our largest increase in plant, metazoan and fungal genomes since the project's inception, creating one of the world's most comprehensive genomic resources, and describe our efforts to reduce genome redundancy in our Bacteria portal. We detail our new efforts in gene annotation, our emerging support for pangenome analysis, our efforts to accelerate data dissemination through the Ensembl Rapid Release resource, and our new AlphaFold visualization. Finally, we present details of our future plans, including updates on our integration with Ensembl and how we plan to improve our support for the microbial research community. Software and data are made available without restriction via our website, online tools platform and programmatic interfaces (available under an Apache 2.0 license). Data updates are synchronised with Ensembl's release cycle.
- NSF-PAR ID:
- 10306503
- Journal Name:
- Nucleic Acids Research
- Volume:
- 50
- Issue:
- D1
- Page Range or eLocation-ID:
- p. D996-D1003
- ISSN:
- 0305-1048
- Publisher:
- Oxford University Press
- Sponsoring Org:
- National Science Foundation
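The abstract notes that data are available through programmatic interfaces. As a minimal sketch, the public Ensembl REST API exposes a `/lookup/symbol` route that returns basic gene annotation as JSON; the species and gene symbol below are illustrative, and live queries require network access.

```python
# Sketch of querying the Ensembl REST API's /lookup/symbol route.
# The base URL and route are the public Ensembl REST service; the
# species/symbol arguments in the example are illustrative only.
import json
import urllib.request

BASE_URL = "https://rest.ensembl.org"


def build_lookup_url(species: str, symbol: str, base: str = BASE_URL) -> str:
    """Construct a /lookup/symbol request URL that returns JSON."""
    return f"{base}/lookup/symbol/{species}/{symbol}?content-type=application/json"


def lookup_gene(species: str, symbol: str) -> dict:
    """Fetch basic annotation (location, biotype, description) for a gene."""
    with urllib.request.urlopen(build_lookup_url(species, symbol)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires network access; prints the request URL only.
    print(build_lookup_url("homo_sapiens", "BRCA2"))
```

The same pattern applies to Ensembl Genomes species by pointing the base URL at the corresponding division's REST endpoint.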
More Like this
-
The contemporary capacity of genome sequence analysis significantly lags behind the rapidly evolving sequencing technologies. Retrieving biologically meaningful information from an ever-increasing amount of genome data would significantly benefit functional genomic studies. For example, the duplication, organization, evolution, and function of superfamily genes are arguably important in many aspects of life. However, the incompleteness of annotations in many sequenced genomes often results in biased conclusions in comparative genomic studies of superfamilies. Here, we present a Perl program, Closing Target Trimming (CTT), for automatically identifying most, if not all, members of a gene family in any sequenced genome on the CentOS 7 platform. To benefit a broader application on other operating systems, we also created a Docker application package, CTTdocker. Our test data on the F-box gene superfamily showed gene-finding accuracies of 78.2% and 79% in two well-annotated plant genomes, Arabidopsis thaliana and rice, respectively. To further demonstrate the effectiveness of this program, we ran it through 18 plant genomes and five non-plant genomes to compare the expansion of the F-box and the BTB superfamilies. The program discovered that, on average, 12.7% and 9.3% of the total F-box and BTB members, respectively, are new loci in plant genomes, …
-
Fueled by the explosion of (meta)genomic data, genome mining of specialized metabolites has become a major technology for drug discovery and studying microbiome ecology. In these efforts, computational tools like antiSMASH have played a central role through the analysis of Biosynthetic Gene Clusters (BGCs). Thousands of candidate BGCs from microbial genomes have been identified and stored in public databases. Interpreting the function and novelty of these predicted BGCs requires comparison with a well-documented set of BGCs of known function. The MIBiG (Minimum Information about a Biosynthetic Gene Cluster) Data Standard and Repository was established in 2015 to enable curation and storage of known BGCs. Here, we present MIBiG 2.0, which encompasses major updates to the schema, the data, and the online repository itself. Over the past five years, 851 new BGCs have been added. Additionally, we performed extensive manual data curation of all entries to improve the annotation quality of our repository. We also redesigned the data schema to ensure the compliance of future annotations. Finally, we improved the user experience by adding new features such as query searches and a statistics page, and enabled direct link-outs to chemical structure databases. The repository is accessible online at https://mibig.secondarymetabolites.org/.
-
Obeid, I. (Ed.) The Neural Engineering Data Consortium (NEDC) is developing the Temple University Digital Pathology Corpus (TUDP), an open source database of high-resolution images from scanned pathology samples [1], as part of its National Science Foundation-funded Major Research Instrumentation grant titled “MRI: High Performance Digital Pathology Using Big Data and Machine Learning” [2]. The long-term goal of this project is to release one million images. We have currently scanned over 100,000 images and are in the process of annotating breast tissue data for our first official corpus release, v1.0.0. This release contains 3,505 annotated images of breast tissue including 74 patients with cancerous diagnoses (out of a total of 296 patients). In this poster, we will present an analysis of this corpus and discuss the challenges we have faced in efficiently producing high-quality annotations of breast tissue. It is well known that state-of-the-art algorithms in machine learning require vast amounts of data. Fields such as speech recognition [3], image recognition [4] and text processing [5] are able to deliver impressive performance with complex deep learning models because they have developed large corpora to support training of extremely high-dimensional models (e.g., billions of parameters). Other fields that do not …
-
Obeid, I.; Selesnick, I. (Eds.) The Neural Engineering Data Consortium at Temple University has been providing key data resources to support the development of deep learning technology for electroencephalography (EEG) applications [1-4] since 2012. We currently have over 1,700 subscribers to our resources and have been providing data, software and documentation from our web site [5] since 2012. In this poster, we introduce additions to our resources that have been developed within the past year to facilitate software development and big data machine learning research. Major resources released in 2019 include: ● Data: The most current release of our open source EEG data is v1.2.0 of TUH EEG and includes the addition of 3,874 sessions and 1,960 patients from mid-2015 through 2016. ● Software: We have recently released a package, PyStream, that demonstrates how to correctly read an EDF file and access samples of the signal. This software demonstrates how to properly decode channels based on their labels and how to implement montages. Most existing open source packages to read EDF files do not directly address the problem of channel labels [6]. ● Documentation: We have released two documents that describe our file formats and data representations: (1) electrodes and channels [6]: describes how to …
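The montage problem described above can be sketched without any EDF tooling: once per-channel signals have been decoded by label, a bipolar montage is just a set of pairwise channel differences. The channel labels and the single pair below are illustrative; this is not PyStream's API.

```python
# Hedged sketch of the channel-label/montage issue PyStream addresses:
# given per-channel signals keyed by their decoded labels, derive a
# bipolar montage as pairwise channel differences.  Labels and the
# montage pair list here are illustrative examples only.


def apply_bipolar_montage(signals, pairs):
    """Return {'A-B': samples(A) - samples(B)} for each (A, B) pair.

    signals: dict mapping channel label -> list of samples
    pairs:   list of (anode, cathode) label tuples
    """
    montage = {}
    for anode, cathode in pairs:
        a, b = signals[anode], signals[cathode]
        montage[f"{anode}-{cathode}"] = [x - y for x, y in zip(a, b)]
    return montage


if __name__ == "__main__":
    signals = {"FP1": [1.0, 2.0, 3.0], "F7": [0.5, 1.0, 1.5]}
    print(apply_bipolar_montage(signals, [("FP1", "F7")]))
```

Keying channels by label rather than by position is the point: EDF channel ordering varies across files, so positional indexing silently produces wrong montages.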
-
Obeid, I.; Selesnick, I. (Eds.) The Temple University Hospital EEG Corpus (TUEG) [1] is the largest publicly available EEG corpus of its type and currently has over 5,000 subscribers (we currently average 35 new subscribers a week). Several valuable subsets of this corpus have been developed, including the Temple University Hospital EEG Seizure Corpus (TUSZ) [2] and the Temple University Hospital EEG Artifact Corpus (TUAR) [3]. TUSZ contains manually annotated seizure events and has been widely used to develop seizure detection and prediction technology [4]. TUAR contains manually annotated artifacts and has been used to improve machine learning performance on seizure detection tasks [5]. In this poster, we will discuss recent improvements made to both corpora that are creating opportunities to improve machine learning performance. Two major concerns that were raised when v1.5.2 of TUSZ was released for the Neureka 2020 Epilepsy Challenge were: (1) the subjects contained in the training, development (validation) and blind evaluation sets were not mutually exclusive, and (2) high-frequency seizures were not accurately annotated in all files. Regarding (1), there were 50 subjects in dev, 50 subjects in eval, and 592 subjects in train. There was one subject common to dev and eval, five subjects common to dev and …
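Concern (1) above is a subject-leakage check: no subject ID should appear in more than one of the train/dev/eval splits. A minimal sketch of that check, with hypothetical subject IDs:

```python
# Sketch of the subject-leakage check behind concern (1): the train,
# dev and eval splits of a corpus should not share subjects.  The
# split names and subject IDs used in the example are hypothetical.


def split_overlaps(splits):
    """Return {(name_a, name_b): shared_ids} for each overlapping pair.

    splits: dict mapping split name -> iterable of subject IDs
    An empty result means the splits are mutually exclusive.
    """
    overlaps = {}
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = set(splits[a]) & set(splits[b])
            if shared:
                overlaps[(a, b)] = shared
    return overlaps


if __name__ == "__main__":
    splits = {"train": {"s001", "s002"}, "dev": {"s002", "s003"}, "eval": {"s004"}}
    print(split_overlaps(splits))
```

Running such a check before a release would have flagged the shared dev/eval and dev/train subjects described above.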