Title: Improving Prediction of Peroxide Value of Edible Oils Using Regularized Regression Models
We present four prediction techniques, combined with multiple data pre-processing methods, applied to a wide range of oil types and oil peroxide values (PV), with peroxides generated by natural aging. Sample PVs were assayed using a standard starch titration method, AOCS Method Cd 8-53, which served as the verified reference method for PV determination. Near-infrared (NIR) spectra were collected from each sample at two optical pathlengths (OPLs), 2 and 24 mm, then fused into a third distinct set. All three sets were used to calculate partial least squares (PLS) regression, ridge regression, LASSO regression, and elastic net regression models. While no individual regression model was established as the best, global models for each regression type and pre-processing method show good agreement across all regression types when each is run in its optimal scenario. Furthermore, boxcar averaging with small spectral window sizes improves prediction accuracy for edible oil PVs. The best-performing models for each regression type are: PLS regression, 25-point boxcar window, fused OPL spectral information, RMSEP = 2.50; ridge regression, 5-point boxcar window, 24 mm OPL, RMSEP = 2.20; LASSO, raw spectral information, 24 mm OPL, RMSEP = 1.80; and elastic net, 10-point boxcar window, 24 mm OPL, RMSEP = 1.91. The results show promising progress toward a full global model for PV determination of edible oils.
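The pre-processing and scoring steps named in the abstract (boxcar averaging, fusing the two pathlength sets, ridge regression, RMSEP) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' pipeline. It assumes that "boxcar averaging" means averaging non-overlapping blocks of adjacent spectral points (a moving-window variant is equally plausible), and that "fusing" means concatenating the two spectra side by side. The function names `boxcar_average`, `fuse_pathlengths`, `ridge_fit`, and `rmsep` are hypothetical, as are all numeric settings in the demo.

```python
import numpy as np


def boxcar_average(spectra, window):
    """Average non-overlapping blocks of `window` adjacent points (assumed variant).

    `spectra` has shape (n_samples, n_points); trailing points that do not
    fill a complete block are dropped.
    """
    if window < 1:
        raise ValueError("window must be a positive integer")
    spectra = np.asarray(spectra, dtype=float)
    n_samples, n_points = spectra.shape
    n_bins = n_points // window
    trimmed = spectra[:, : n_bins * window]
    return trimmed.reshape(n_samples, n_bins, window).mean(axis=2)


def fuse_pathlengths(short_opl, long_opl):
    """Fuse two spectral sets by column-wise concatenation (assumed meaning of 'fused')."""
    return np.hstack([short_opl, long_opl])


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) coef = Xc^T yc
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


def rmsep(y_true, y_pred):
    """Root mean squared error of prediction on held-out samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


if __name__ == "__main__":
    # Synthetic stand-in data: every value below is made up for illustration.
    rng = np.random.default_rng(0)
    n_samples, n_points = 60, 200
    pv = rng.uniform(0.0, 50.0, size=n_samples)  # pretend reference PVs
    band = np.exp(-0.5 * ((np.arange(n_points) - 120) / 10.0) ** 2)
    noise = lambda: rng.normal(0.0, 0.05, size=(n_samples, n_points))
    opl_2mm = 0.01 * np.outer(pv, band) + noise()   # weak signal, short path
    opl_24mm = 0.12 * np.outer(pv, band) + noise()  # stronger signal, long path

    # Smooth each pathlength separately, then fuse.
    X = fuse_pathlengths(boxcar_average(opl_2mm, 5), boxcar_average(opl_24mm, 5))

    train, test = slice(0, 40), slice(40, None)
    coef, intercept = ridge_fit(X[train], pv[train], alpha=1.0)
    print("RMSEP:", rmsep(pv[test], X[test] @ coef + intercept))
```

Averaging each pathlength before fusing keeps a window from straddling the seam between the two spectra. In practice, a library implementation such as scikit-learn's `Ridge`, `Lasso`, and `ElasticNet` with cross-validated regularization strength would likely replace the hand-written `ridge_fit`.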
Award ID(s):
2003839
PAR ID:
10390188
Journal Name:
Molecules
Volume:
26
Issue:
23
ISSN:
1420-3049
Page Range / eLocation ID:
7281
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Grapevine rootstocks are gaining importance in viticulture as a strategy to combat abiotic challenges and to enhance scion physiology. Direct leaf-level physiological parameters such as net assimilation rate, stomatal conductance to water vapor, quantum yield of PSII, and transpiration can illuminate the rootstock effect on scion physiology. However, these measures are time-consuming and limited to leaf-level analysis. This study used different rootstocks to investigate the potential application of aerial hyperspectral imagery in the estimation of canopy-level measurements. A statistical framework was developed as an ensemble stacked regression (REGST) that aggregated five individual machine learning algorithms: least absolute shrinkage and selection operator (Lasso), partial least squares regression (PLSR), ridge regression (RR), elastic net (ENET), and principal component regression (PCR), to optimize high-throughput assessment of vine physiology. In addition, a convolutional neural network (CNN) algorithm was integrated into the existing REGST, forming a hybrid CNN-REGST model with the aim of capturing patterns from the hyperspectral signal. The individual base models exhibited variable prediction accuracies. In most cases, ridge regression (RR) demonstrated the lowest test root mean squared error (RMSE). The ensemble stacked regression model (REGST) outperformed the individual machine learning algorithms, increasing R² by 0.03 to 0.1. The performances of CNN-REGST and REGST were similar in estimating the four different traits. Overall, these models were able to explain approximately 55–67% of the variation in the actual ground-truth data. This study suggests that hyperspectral features integrated with powerful AI approaches show great potential in tracing functional traits in grapevines.
  2. The preprocessing of infrared spectra can significantly improve predictive accuracy for protein, carbohydrate, lipid, or other nutrition components, yet optimal preprocessing selection is typically empirical, tedious, and dataset specific. This study introduces a Bayesian optimization-based framework designed for the automated selection of optimal spectral preprocessing pipelines within a chemometric modeling context. The framework was applied to mid-infrared spectra of milk to predict compositional parameters for fat, protein, lactose, and total solids. A total of 385 averaged spectra corresponding to 198 unique samples was split into a 70/30 ratio (training/test) using a group-aware Kennard-Stone algorithm, resulting in 269 averaged spectra (135 unique samples) for training and 116 spectra (58 unique samples) for testing. Six regression models (Elastic Net, Gradient Boosting Machines (GBM), Partial Least Squares (PLS), RidgeCV Regression, LassoLarsCV, and Support Vector Regression (SVR)) were evaluated across three preprocessing conditions: (1) no preprocessing, (2) literature-derived custom preprocessing (e.g., MSC, SNV, and first and second derivatives), and (3) optimized preprocessing via the proposed Bayesian framework. Optimized preprocessing consistently outperformed the other conditions, with RidgeCV achieving the best performance for all components except lactose, where PLS slightly outperformed it. Improvements in predictive accuracy, particularly in terms of RMSEP, were observed across all milk components. The best RMSEP results were achieved for protein (RMSEP = 0.054, R² = 0.981) and lactose (RMSEP = 0.026, R² = 0.917), followed by fat (RMSEP = 0.139, R² = 0.926) and total solids (RMSEP = 0.154, R² = 0.960). Literature-based pipelines demonstrated inconsistent effectiveness, highlighting the limitations of transferring preprocessing methods between datasets. The Bayesian optimization approach identified relatively simple yet highly effective preprocessing pipelines, typically involving few steps. By eliminating manual trial and error, this data-driven strategy offers a robust and generalizable solution that streamlines spectral modeling in dairy analysis and can be readily applied to other types of spectroscopic data across various domains.
  3. The fused lasso, also known as total-variation denoising, is a locally adaptive function estimator over a regular grid of design points. In this article, we extend the fused lasso to settings in which the points do not occur on a regular grid, leading to a method for nonparametric regression. This approach, which we call the $K$-nearest-neighbours fused lasso, involves computing the $K$-nearest-neighbours graph of the design points and then performing the fused lasso over this graph. We show that this procedure has a number of theoretical advantages over competing methods: specifically, it inherits local adaptivity from its connection to the fused lasso, and it inherits manifold adaptivity from its connection to the $K$-nearest-neighbours approach. In a simulation study and an application to flu data, we show that excellent results are obtained. For completeness, we also study an estimator that makes use of an $\epsilon$-graph rather than a $K$-nearest-neighbours graph and contrast it with the $K$-nearest-neighbours fused lasso.
  4. A potential method to determine whether two varieties of edible oils can be differentiated by Fourier transform infrared (FTIR) spectroscopy is proposed using digitally generated data of adulterated edible oils from an infrared (IR) spectral library. The first step is the evaluation of digitally blended data sets. Specifically, IR spectra of adulterated edible oils are computed by digitally blending experimental IR spectra of an edible oil and the corresponding adulterant, using the appropriate mixing coefficients to achieve the desired level of adulteration. To determine whether two edible oils can be differentiated by FTIR spectroscopy, pure IR spectra of the two edible oils are compared with IR spectra of their digital mixtures, using a genetic algorithm for pattern recognition to solve a ternary classification problem. If the IR spectra of the two edible oils and their binary mixtures are differentiable in principal component plots of the spectral data, then differences between the IR spectra of these two edible oils are of sufficient magnitude to ensure that a reliable classification by FTIR spectroscopy can be obtained. Using this approach, the feasibility of authenticating edible oils such as extra virgin olive oil (EVOO) directly from library spectra is demonstrated. For this study, digital and experimental data are combined to generate training and validation data sets to assess detection limits in FTIR spectroscopy for the adulterants.
  5. Streamflow prediction plays a vital role in water resources planning, helping to understand dramatic changes in climatic and hydrologic variables over different time scales. In this study, we used machine learning (ML)-based prediction models, including Random Forest Regression (RFR), Long Short-Term Memory (LSTM), Seasonal Autoregressive Integrated Moving Average (SARIMA), and Facebook Prophet (PROPHET), to predict natural streamflow 24 months ahead at the Lees Ferry site, located at the bottom part of the Upper Colorado River Basin (UCRB) of the US. First, we used only historic streamflow data to predict 24 months ahead. Second, we considered meteorological components such as temperature and precipitation as additional features. We tested the models on a monthly test dataset spanning 6 years, where 24-month predictions were repeated 50 times to ensure the consistency of the results. Moreover, we performed a sensitivity analysis to identify our best-performing model. We then analyzed the effects of different span window sizes on the quality of predictions made by our best model. Finally, we applied our best-performing model, RFR, to two more rivers in different states in the UCRB to test the model's generalizability. We evaluated the performance of the predictive models using multiple evaluation measures. The predictions of the multivariate time-series models were found to be more accurate, with RMSE less than 0.84 mm per month, R-squared more than 0.8, and MAPE less than 0.25. Therefore, we conclude that including the temperature and precipitation of the UCRB increases the accuracy of the predictions. Ultimately, we found that multivariate RFR performs the best among the four models and is generalizable to other rivers in the UCRB.