Background Transposable element (TE) polymorphisms are important components of population genetic variation. The functional impacts of TEs in gene regulation and generating genetic diversity have been observed in multiple species, but the frequency and magnitude of TE variation are underappreciated. Inexpensive and deep sequencing technology has made it affordable to apply population genetic methods to whole genomes with methods that identify single nucleotide and insertion/deletion polymorphisms. However, identifying TE polymorphisms, particularly transposition events or non-reference insertion sites, can be challenging due to the repetitive nature of these sequences, which hampers both the sensitivity and specificity of analysis tools. Methods We have developed the tool RelocaTE2 for identification of TE insertion sites at high sensitivity and specificity. RelocaTE2 searches for known TE sequences in whole genome sequencing reads from second generation sequencing platforms such as Illumina. These sequence reads are used as seeds to pinpoint chromosome locations where TEs have transposed. RelocaTE2 detects target site duplication (TSD) of TE insertions, allowing it to report TE polymorphism loci with single base pair precision. Results and Discussion The performance of RelocaTE2 is evaluated using both simulated and real sequence data. RelocaTE2 demonstrates a high level of sensitivity and specificity, particularly when the sequence coverage is not shallow. In comparison to other tools tested, RelocaTE2 achieves the best balance between sensitivity and specificity. In particular, RelocaTE2 performs best in prediction of TSDs for TE insertions. Even in highly repetitive regions, such as those tested on rice chromosome 4, RelocaTE2 is able to report up to 95% of simulated TE insertions with less than 0.1% false positive rate using 10-fold genome coverage resequencing data.
RelocaTE2 provides a robust solution to identify TE insertion sites and can be incorporated into analysis workflows in support of describing the complete genotype from light coverage genome sequencing.
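The idea of using a target site duplication to place an insertion at single-base resolution can be illustrated with a minimal sketch. This is not RelocaTE2's implementation; the function name `infer_tsd` and the 0-based, half-open coordinate convention are assumptions made for illustration. It relies only on the fact that, when a TSD is present, the upstream flank of the insertion maps through the duplicated bases while the downstream flank starts at the beginning of those same bases, so the two flanks overlap on the reference by exactly the TSD.

```python
def infer_tsd(reference, left_flank_end, right_flank_start):
    """Infer a target site duplication (TSD) from two junction coordinates.

    Hypothetical helper for illustration only.
    reference         -- reference sequence as a string
    left_flank_end    -- 0-based index one past the last reference base
                         covered by the flank upstream of the TE
    right_flank_start -- 0-based index of the first reference base
                         covered by the flank downstream of the TE

    Returns (start, end, tsd_sequence) in half-open coordinates, or None
    when the two flanks do not overlap on the reference (no TSD evidence).
    """
    if right_flank_start >= left_flank_end:
        return None
    tsd = reference[right_flank_start:left_flank_end]
    return right_flank_start, left_flank_end, tsd


# Toy example: the flanks overlap on reference bases 4..6 ("TTA").
reference = "ACGTTTAGGCA"
call = infer_tsd(reference, left_flank_end=7, right_flank_start=4)
print(call)
```

In this toy case the overlap of the two flanks is the three-base TSD, and its coordinates give the insertion site directly.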
Reproducible and accessible analysis of transposon insertion sequencing in Galaxy for qualitative essentiality analyses
Abstract Background Significant progress has been made in advancing and standardizing tools for human genomic and biomedical research. Yet, the field of next-generation sequencing (NGS) analysis for microorganisms (including multiple pathogens) remains fragmented, lacks accessible and reusable tools, is hindered by local computational resource limitations, and does not offer widely accepted standards. One such “problem area” is the analysis of Transposon Insertion Sequencing (TIS) data. TIS allows probing of almost the entire genome of a microorganism by introducing random insertions of transposon-derived constructs. The impact of the insertions on survival and growth under specific conditions provides precise information about genes affecting specific phenotypic characteristics. A wide array of tools has been developed to analyze TIS data. Among the variety of options available, it is often difficult to identify which one can provide a reliable and reproducible analysis. Results Here we sought to understand the challenges and propose reliable practices for the analysis of TIS experiments. Using data from two recent TIS studies, we have developed a series of workflows that include multiple tools for data de-multiplexing, promoter sequence identification, transposon flank alignment, and read count repartition across the genome. Particular attention was paid to quality control procedures, such as determining the optimal tool parameters for the analysis and removal of contamination. Conclusions Our work provides an assessment of the currently available tools for TIS data analysis. It offers ready-to-use workflows that can be invoked by anyone in the world using our public Galaxy platform ( https://usegalaxy.org ). To lower the entry barriers, we have also developed interactive tutorials explaining details of TIS data analysis procedures at https://bit.ly/gxy-tis .
- Award ID(s):
- 1661497
- PAR ID:
- 10301926
- Date Published:
- Journal Name:
- BMC Microbiology
- Volume:
- 21
- Issue:
- 1
- ISSN:
- 1471-2180
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Abstract Background Biological mutagens (such as transposons) with inserted sequences play a crucial role in linking observed phenotype and genotype in reverse genetic studies. For this reason, accurate and efficient software tools for identifying insertion sites based on the analysis of sequencing reads are desired. Results We developed a bioinformatics tool, a Finder, to identify genome-wide Insertions in Mutagenesis (named “InMut-Finder”), based on target sequences and flanking sequences from long reads, such as those from Oxford Nanopore sequencing. InMut-Finder succeeded in identifying > 100 insertion sites in Medicago truncatula and soybean mutants based on sequencing reads of whole-genome DNA or enriched insertion-site DNA fragments. Insertion sites discovered by InMut-Finder were validated by PCR experiments. Conclusion InMut-Finder is a comprehensive and powerful tool for automated insertion detection from Nanopore long reads. The simplicity, efficiency, and flexibility of InMut-Finder make it a valuable tool for functional genomics and forward and reverse genetics. InMut-Finder was implemented with Perl, R, and Shell scripts, which are independent of the OS. The source code and instructions can be accessed at https://github.com/jsg200830/InMut-Finder .
-
As transposon sequencing (TnSeq) assays have become prolific in the microbiology field, it is of interest to scrutinize their potential drawbacks. TnSeq data consist of millions of nucleotide sequence reads that are generated by PCR amplification of transposon-genomic junctions. Reads mapping to the junctions are enumerated thus providing information on the number of transposon insertion mutations in each individual gene. Here we explore the possibility that PCR amplification of transposon insertions in a TnSeq library skews the results by introducing bias into the detection and/or enumeration of insertions. We compared the detection and frequency of mapped insertions when altering the number of PCR cycles, and when including a nested PCR, in the enrichment step. Additionally, we present nCATRAs - a novel, amplification-free TnSeq method where the insertions are enriched via CRISPR/Cas9-targeted transposon cleavage and subsequent Oxford Nanopore MinION sequencing. nCATRAs achieved 54 and 23% enrichment of the transposons and transposon-genomic junctions, respectively, over background genomic DNA. These PCR-based and PCR-free experiments demonstrate that, overall, PCR amplification does not significantly bias the results of TnSeq insofar as insertions in the majority of genes represented in our library were similarly detected regardless of PCR cycle number and whether or not PCR amplification was employed. However, the detection of a small subset of genes which had been previously described as essential is sensitive to the number of PCR cycles. We conclude that PCR-based enrichment of transposon insertions in a TnSeq assay is reliable, but researchers interested in profiling putative essential genes should carefully weigh the number of amplification cycles employed in their library preparation protocols. 
In addition, nCATRAs is comparable to traditional PCR-based methods (Kendall’s correlation=0.896–0.897) although the latter remain superior owing to their accessibility and high sequencing depth.
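The enumeration step described in this abstract, where reads mapped to transposon-genomic junctions are tallied per gene, can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function name `count_insertions_per_gene`, the 1-based inclusive gene coordinates, and the toy gene names and positions are all assumptions. It reports, for each gene, both the number of junction reads and the number of distinct insertion sites, since PCR amplification can inflate the former without changing the latter.

```python
from collections import defaultdict


def count_insertions_per_gene(junction_positions, genes):
    """Tally junction reads and distinct insertion sites per gene.

    Hypothetical helper for illustration only.
    junction_positions -- iterable of mapped junction coordinates, one per read
    genes              -- dict mapping gene name -> (start, end), 1-based inclusive

    Returns a dict mapping gene name -> (read_count, distinct_site_count).
    """
    reads = defaultdict(int)
    sites = defaultdict(set)
    for pos in junction_positions:
        for name, (start, end) in genes.items():
            if start <= pos <= end:
                reads[name] += 1
                sites[name].add(pos)
    return {name: (reads[name], len(sites[name])) for name in genes}


# Toy data: position 500 falls outside both genes and is ignored.
genes = {"geneA": (100, 200), "geneB": (300, 400)}
positions = [150, 150, 180, 350, 500]
per_gene = count_insertions_per_gene(positions, genes)
print(per_gene)
```

The nested loop is adequate for a toy example; a real dataset with millions of reads would call for an interval index instead.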
-
Stajich, J (Ed.) Abstract Bioinformatic analysis (such as genome assembly quality assessment, alignment summary statistics, relative synonymous codon usage, file format conversion, and processing and analysis) is integrated into diverse disciplines in the biological sciences. Several command-line pieces of software have been developed to conduct some of these individual analyses, but unified toolkits that conduct all these analyses are lacking. To address this gap, we introduce BioKIT, a versatile command line toolkit that has, upon publication, 42 functions, several of which were community-sourced, that conduct routine and novel processing and analysis of genome assemblies, multiple sequence alignments, coding sequences, sequencing data, and more. To demonstrate the utility of BioKIT, we conducted a comprehensive examination of relative synonymous codon usage across 171 fungal genomes that use alternative genetic codes, showed that the novel metric of gene-wise relative synonymous codon usage can accurately estimate gene-wise codon optimization, evaluated the quality and characteristics of 901 eukaryotic genome assemblies, and calculated alignment summary statistics for 10 phylogenomic data matrices. BioKIT will be helpful in facilitating and streamlining sequence analysis workflows. BioKIT is freely available under the MIT license from GitHub (https://github.com/JLSteenwyk/BioKIT), PyPi (https://pypi.org/project/jlsteenwyk-biokit/), and the Anaconda Cloud (https://anaconda.org/jlsteenwyk/jlsteenwyk-biokit). Documentation, user tutorials, and instructions for requesting new features are available online (https://jlsteenwyk.com/BioKIT).
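Relative synonymous codon usage (RSCU), mentioned in this abstract, is conventionally defined as a codon's observed count divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below computes it for a single coding sequence. It does not use or reproduce BioKIT's code; the function name `rscu` is hypothetical, and the standard genetic code is assumed, whereas the fungal genomes discussed above use alternative codes.

```python
from collections import Counter
from itertools import product

# Standard genetic code (an assumption; alternative codes would need a
# different amino acid string), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def rscu(cds):
    """Return {codon: RSCU} for amino acids observed in the coding sequence.

    Hypothetical helper for illustration only. Stop codons and codons
    containing non-ACGT characters are ignored.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # drop any trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    values = {}
    for aa in set(AMINO_ACIDS) - {"*"}:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid not present in this sequence
        expected = total / len(family)  # equal usage across synonyms
        for codon in family:
            values[codon] = counts[codon] / expected
    return values


# Toy sequence: four alanine codons (GCT twice, GCC once, GCA once).
alanine = rscu("GCTGCTGCCGCA")
print(alanine)
```

A value above 1 means the codon is used more often than expected under equal usage, and a value below 1 means it is used less often.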
-
Summary From a single transgenic line harboring five Tnt1 transposon insertions, we generated a near‐saturated insertion population in Medicago truncatula. Using thermal asymmetric interlaced‐polymerase chain reaction followed by sequencing, we recovered 388 888 flanking sequence tags (FSTs) from 21 741 insertion lines in this population. FST recovery from 14 Tnt1 lines using the whole‐genome sequencing (WGS) and/or Tnt1‐capture sequencing approaches suggests an average of 80 insertions per line, which is more than the previous estimation of 25 insertions. Analysis of the distribution pattern and preference of Tnt1 insertions showed that Tnt1 is overall randomly distributed throughout the M. truncatula genome. At the chromosomal level, Tnt1 insertions occurred on both arms of all chromosomes, with insertion frequency negatively correlated with the GC content. Based on 174 546 filtered FSTs that show exact insertion locations in the M. truncatula genome version 4.0 (Mt4.0), 0.44 Tnt1 insertions occurred per kb, and 19 583 genes contained Tnt1 with an average of 3.43 insertions per gene. Pathway and gene ontology analyses revealed that Tnt1‐inserted genes are significantly enriched in processes associated with ‘stress’, ‘transport’, ‘signaling’ and ‘stimulus response’. Surprisingly, gene groups with higher methylation frequency were more frequently targeted for insertion. Analysis of 19 583 Tnt1‐inserted genes revealed that 59% (1265) of 2144 transcription factors, 63% (765) of 1216 receptor kinases and 56% (343) of 616 nucleotide‐binding site‐leucine‐rich repeat genes harbored at least one Tnt1 insertion, compared with the overall 38% of Tnt1‐inserted genes out of 50 894 annotated genes in the genome.
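The percentages quoted at the end of this summary follow directly from the counts given alongside them, which a short calculation makes explicit. The dictionary layout and the helper name `percent` are assumptions for illustration; all counts are taken from the summary itself.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# (genes with at least one Tnt1 insertion, genes in the category),
# as stated in the summary above.
categories = {
    "transcription factors": (1265, 2144),
    "receptor kinases": (765, 1216),
    "NBS-LRR genes": (343, 616),
    "all annotated genes": (19583, 50894),
}

shares = {name: percent(hit, total) for name, (hit, total) in categories.items()}
for name, value in shares.items():
    print(f"{name}: {value}%")
```

Each of the three gene families is hit at a higher rate than the genome-wide share, which is the comparison the summary draws.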