skip to main content


Title: RTExtract: Time-series NMR spectra quantification based on 3D surface ridge tracking
Abstract Motivation Time-series NMR has advanced our knowledge about metabolic dynamics. Before analyzing compounds through modeling or statistical methods, chemical features need to be tracked and quantified. However, because of peak overlap and peak shifting, the available protocols are time consuming at best or even impossible for some regions in NMR spectra. Results We introduce RTExtract (Ridge Tracking based Extract), a computer vision-based algorithm, to quantify time-series NMR spectra. The NMR spectra of multiple time points were formulated as a 3D surface. Candidate points were first filtered using local curvature and optima, then connected into ridges by a greedy algorithm. Interactive steps were implemented to refine results. Among 173 simulated ridges, 115 can be tracked (RMSD < 0.001). For reproducing previous results, RTExtract took less than two hours instead of ∼48 hours, and two instead of seven parameters need tuning. Multiple regions with overlapping and changing chemical shifts are accurately tracked. Availability Source code is freely available within Metabolomics toolbox GitHub repository (https://github.com/artedison/Edison_Lab_Shared_Metabolomics_UGA/tree/master/metabolomics_toolbox/code/ridge_tracking) and is implemented in MATLAB and R. Supplementary information Supplementary data are available at Bioinformatics online.  more » « less
Award ID(s):
1713746
NSF-PAR ID:
10156043
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
Bioinformatics
ISSN:
1367-4803
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This dataset contains raw data, processed data, and the codes used for data processing in our manuscript from our Fourier-transform infrared (FTIR) spectroscopy, Nuclear magnetic resonance (NMR), Raman spectroscopy, and X-ray diffraction (XRD) experiments. The data and codes for the fits of our unpolarized Raman spectra to polypeptide spectra is also included. The following explains the folder structure of the data provided in this dataset, which is also explained in the file ReadMe.txt. Browsing the data in Tree view is recommended. Folder contents Codes Raman Data Processing: The MATLAB script file RamanDecomposition.m contains the code to decompose the sub-peaks across different polarized Raman spectra (XX, XZ, ZX, ZZ, and YY), considering a set of pre-determined restrictions. The helper functions used in RamanDecomposition.m are included in the Helpers folder. RamanDecomposition.pdf is a PDF printout of the MATLAB code and output. P Value Simulation: 31_helix.ipynb and a_helix.ipynb: These two Jupyter Notebook files contain the intrinsic P value simulation for the 31-helix and alpha-helix structures. The simulation results were used to prepare Supplementary Table 4. See more details in the comments contained. Vector.py, Atom.py, Amino.py, and Helpers.py: These python files contains the class definitions used in 31_helix.ipynb and a_helix.ipynb. See more details in the comments contained. FTIR FTIR Raw Transmission.opj: This Origin data file contains the raw transmission data measured on single silk strand and used for FTIR spectra analysis. FTIR Deconvoluted Oscillators.opj: This Origin data file was generated from the data contained in the previous file using W-VASE software from J. A. Woollam, Inc. FTIR Unpolarized MultiStrand Raw Transmission.opj: This Origin data file contains the raw transmission data measured on multiple silk strands. The datasets contained in the first two files above were used to plot Figure 2a-b and the FTIR data points in Figure 4a, and Supplementary Figure 6. The datasets contained in the third file above were used to plot Supplementary Figure 3a. The datasets contained in the first two files above were used to plot Figure 2a-b, FTIR data points in Figure 4a, and Supplementary Figure 6. NMR Raw data files of the 13C MAS NMR spectra: ascii-spec_CP.txt: cross-polarized spectrum ascii-spec_DP.txt: direct-polarized spectrum Data is in ASCII format (comma separated values) using the following columns: Data point number Intensity Frequency [Hz] Frequency [ppm] Polypeptide Spectrum Fits MATLAB scripts (.m files) and Helpers: The MATLAB script file Raman_Fitting_Process_Part_1.m and Raman_Fitting_Process_Part_2.m contains the step-by-step instructions to perform the fitting process of our calculated unpolarized Raman spectrum, using digitized model polypeptide Raman spectra. The Helper folder contains two helper functions used by the above scripts. See the scripts for further instruction and information. Data aPA.csv, bPA.csv, GlyI.csv, GlyII.csv files: These csv files contain the digitized Raman spectra of poly-alanine, beta-alanine, poly-glycine-I, and poly-glycine-II. Raman_Exp_Data.mat: This MATLAB data file contains the processed, polarized Raman spectra obtained from our experiments. Variable freq is the wavenumber information of each collected spectrum. The variables xx, yy, zz, xz, zx represent the polarized Raman spectra collected. These variables are used to calculate the unpolarized Raman spectrum in Raman_Fitting_Process_Part_2.m. See the scripts for further instruction and information. Raman Raman Raw Data.mat: This MATLAB data file contains all the raw data used for Raman spectra analysis. All variables are of MATLAB structure data type. Each variable has fields called Freq and Raw, with Freq contains the wavenumber information of the measured spectra and Raw contains 5 measured Raman signal strengths. Variable XX, XZ, ZX, ZZ, and YY were used to plot and sub-peak analysis for Figure 2c-d, Raman data points in Figure 4a, Figure 5b, Supplementary Figure 2, and Supplementary Figure 7. Variable WideRange was used to plot and identify the peaks for Supplementary Figure 3b. X-Ray X-Ray.mat: This MATLAB data file contains the raw X-ray data used for the diffraction analysis in Supplementary Figure 5. 
    more » « less
  2. Background subtraction is a general problem in spectroscopy often addressed with application-specific techniques, or methods that introduce a variety of implementation barriers such as having to specify peak-free regions of the spectrum. An iterative dual-tree complex wavelet transform-based background subtraction method (DTCWT-IA) was recently developed for the analysis of ultrafast electron diffraction patterns. The method was designed to require minimal user intervention, to support streamlined analysis of many diffraction patterns with complex overlapping peaks and time-varying backgrounds, and is implemented in an open-source computer program. We examined the performance of DTCWT-IA for the analysis of spectra acquired by a range of optical spectroscopies including ultraviolet–visible spectroscopy (UV–Vis), X-ray photoelectron spectroscopy (XPS), and surface-enhanced Raman spectroscopy (SERS). A key benefit of the method is that the user need not specify regions of the spectrum where no peaks are expected to occur. SER spectra were used to investigate the robustness of DTCWT-IA to signal-to-noise levels in the spectrum and to user operation, specifically to two of the algorithm parameter settings: decomposition level and iteration number. The single, general DTCWT-IA implementation performs well in comparison to the different conventional approaches to background subtraction for UV–Vis, XPS, and SERS, while requiring minimal input. The method thus holds the same potential for optical spectroscopy as for ultrafast electron diffraction, namely streamlined analysis of spectra with complex distributions of peaks and varying signal levels, thus supporting real-time spectral analysis or the analysis of data acquired from different sources. 
    more » « less
  3. Millet, Oscar (Ed.)
    System biology relies on holistic biomolecule measurements, and untangling biochemical networks requires time-series metabolomics profiling. With current metabolomic approaches, time-series measurements can be taken for hundreds of metabolic features, which decode underlying metabolic regulation. Such a metabolomic dataset is untargeted with most features unannotated and inaccessible to statistical analysis and computational modeling. The high dimensionality of the metabolic space also causes mechanistic modeling to be rather cumbersome computationally. We implemented a faster exploratory workflow to visualize and extract chemical and biochemical dependencies. Time-series metabolic features (about 300 for each dataset) were extracted by Ridge Tracking-based Extract (RTExtract) on measurements from continuous in vivo monitoring of metabolism by NMR (CIVM-NMR) in Neurospora crassa under different conditions. The metabolic profiles were then smoothed and projected into lower dimensions, enabling a comparison of metabolic trends in the cultures. Next, we expanded incomplete metabolite annotation using a correlation network. Lastly, we uncovered meaningful metabolic clusters by estimating dependencies between smoothed metabolic profiles. We thus sidestepped the processes of time-consuming mechanistic modeling, difficult global optimization, and labor-intensive annotation. Multiple clusters guided insights into central energy metabolism and membrane synthesis. Dense connections with glucose 1-phosphate indicated its central position in metabolism in N . crassa . Our approach was benchmarked on simulated random network dynamics and provides a novel exploratory approach to analyzing high-dimensional metabolic dynamics. 
    more » « less
  4. In this paper, we develop structure assisted nonnegative matrix factorization (NMF) methods for blind source separation of degenerate data. The motivation originates from nuclear magnetic resonance (NMR) spectroscopy, where a multiple mixture NMR spectra are recorded to identify chemical compounds with similar structures. Consider the linear mixing model (LMM), we aim to identify the chemical compounds involved when the mixing process is known to be nearly singular. We first consider a class of data with dominant interval(s) (DI) where each of source signals has dominant peaks over others. Besides, a nearly singular mixing process produces degenerate mixtures. The DI condition implies clustering structures in the data points. Hence, the estimation of the mixing matrix could be achieved by data clustering. Due to the presence of the noise and the degeneracy of the data, a small deviation in the estimation may introduce errors in the output. To resolve this problem and improve robustness of the separation, methods are developed in two aspects. One is to find better estimation of the mixing matrix by allowing a constrained perturbation to the clustering output, and it can be achieved by a quadratic programming. The other is to seek sparse source signals by exploiting the DI condition, and it solves an 1 optimization. If no source information is available, we propose to adopt the nonnegative matrix factorization approach by incorporating the matrix structure (parallel columns of the mixing matrix) into the cost function and develop multiplicative iteration rules for the numerical solutions. We present experimental results of NMR data to show the performance and reliability of the method in the applications arising in NMR spectroscopy. 
    more » « less
  5. Abstract Motivation

    Cancer subtypes were usually defined based on molecular characterization of single omic data. Increasingly, measurements of multiple omic profiles for the same cohort are available. Defining cancer subtypes using multi-omic data may improve our understanding of cancer, and suggest more precise treatment for patients.

    Results

    We present NEMO (NEighborhood based Multi-Omics clustering), a novel algorithm for multi-omics clustering. Importantly, NEMO can be applied to partial datasets in which some patients have data for only a subset of the omics, without performing data imputation. In extensive testing on ten cancer datasets spanning 3168 patients, NEMO achieved results comparable to the best of nine state-of-the-art multi-omics clustering algorithms on full data and showed an improvement on partial data. On some of the partial data tests, PVC, a multi-view algorithm, performed better, but it is limited to two omics and to positive partial data. Finally, we demonstrate the advantage of NEMO in detailed analysis of partial data of AML patients. NEMO is fast and much simpler than existing multi-omics clustering algorithms, and avoids iterative optimization.

    Availability and implementation

    Code for NEMO and for reproducing all NEMO results in this paper is in github: https://github.com/Shamir-Lab/NEMO.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less