Abstract: Many recent studies have explored remote sensing approaches to facilitate non-destructive sampling of aboveground biomass (AGB). Lidar platforms (e.g., iPhone and iPad Pro models) have recently made remote sensing technologies widely available and present an alternative to traditional approaches for estimating AGB. Lidar approaches can be completed within a fraction of the time required by many analog methods. However, it is unknown if handheld sensors are capable of accurately predicting AGB or how different modeling techniques affect prediction accuracy. Here, we collected AGB from 0.25-m² plots (N = 45) at three sites along an elevational gradient within rangelands surrounding Flagstaff, Arizona, USA. Each plot was scanned with a mobile laser scanner (MLS) and iPad before plants were clipped, dried, and weighed. We compared the capability of iPad and MLS sensors to estimate AGB via minimization of model normalized root mean square error (NRMSE). This process was performed on predictor subsets describing structural, spectral, and field-based characteristics across a suite of modeling approaches including simple linear, stepwise, lasso, and random forest regression. We found that models developed from MLS and iPad data were equally capable of predicting AGB (NRMSE 26.6% and 29.3%, respectively) regardless of the variable subsets considered. We also found that stepwise regression regularly resulted in the lowest NRMSE. Structural variables were consistently selected during each modeling approach, while spectral variables were rarely included. Field-based variables were important in linear regression models but were not included after variable selection within random forest models. These findings support the notion that remote sensing techniques offer a valid alternative to analog field-based data collection methods. Together, our results demonstrate that data collected using a more widely available platform will perform similarly to a more costly option and outline a workflow for modeling AGB using remote sensing systems alone.
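The model-comparison workflow described in the abstract above (fit several regression models to the same predictor table and rank them by normalized RMSE) can be sketched in a few lines. This is a minimal illustration, not the study's code: the predictors, biomass values, and the range-based NRMSE normalization are assumptions, and the stepwise-selection step is omitted.

```python
# Illustrative sketch: compare regression models for aboveground biomass (AGB)
# by cross-validated normalized RMSE. Data and normalization are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(45, 6))                                # assumed structural/spectral/field predictors
y = 5.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=45)    # assumed AGB values (toy)

models = {
    "linear": LinearRegression(),
    "lasso": LassoCV(cv=5),
    "random_forest": RandomForestRegressor(n_estimators=500, random_state=0),
}

for name, model in models.items():
    pred = cross_val_predict(model, X, y, cv=5)
    rmse = np.sqrt(np.mean((y - pred) ** 2))
    nrmse = 100 * rmse / (y.max() - y.min())   # one common normalization: percent of observed range
    print(f"{name}: NRMSE = {nrmse:.1f}%")
```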
Bridging Breiman's Brook: From Algorithmic Modeling to Statistical Learning
In 2001, Leo Breiman wrote of a divide between "data modeling" and "algorithmic modeling" cultures. Twenty years later this division feels far more ephemeral, both in terms of assigning individuals to camps, and in terms of intellectual boundaries. We argue that this is largely due to the "data modelers" incorporating algorithmic methods into their toolbox, particularly driven by recent developments in the statistical understanding of Breiman's own Random Forest methods. While this can be simplistically described as "Breiman won", these same developments also expose the limitations of the prediction-first philosophy that he espoused, making careful statistical analysis all the more important. This paper outlines these exciting recent developments in the random forest literature which, in our view, occurred as a result of a necessary blending of the two ways of thinking Breiman originally described. We also ask what areas statistics and statisticians might currently overlook.
- Award ID(s): 1712554
- PAR ID: 10298553
- Date Published:
- Journal Name: Observational Studies
- Volume: 7
- Issue: 1
- ISSN: 2767-3324
- Page Range / eLocation ID: 107-125
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
We provide a novel statistical perspective on a classical problem at the intersection of computer science and information theory: recovering the empirical frequency of a symbol in a large discrete dataset using only a compressed representation, or sketch, obtained via random hashing. Departing from traditional algorithmic approaches, recent works have proposed Bayesian nonparametric (BNP) methods that can provide more informative frequency estimates by leveraging modeling assumptions about the distribution of the sketched data. In this paper, we propose a smoothed-Bayesian method, inspired by existing BNP approaches but designed in a frequentist framework to overcome the computational limitations of the BNP approaches when dealing with large-scale data from realistic distributions, including those with power-law tail behaviors. For sketches obtained with a single hash function, our approach is supported by rigorous frequentist properties, including unbiasedness and optimality under a squared error loss function within an intuitive class of linear estimators. For sketches with multiple hash functions, we introduce an approach based on multi-view learning to construct computationally efficient frequency estimators. We validate our method on synthetic and real data, comparing its performance to that of existing alternatives. (A minimal sketch of the classical hashing-and-recovery baseline this work departs from appears after this list.)
-
Canonical correlation analysis (CCA) is a classic statistical method for discovering latent co-variation that underpins two or more observed random vectors. Several extensions and variations of CCA have been proposed that have strengthened our capabilities in terms of revealing common random factors from multiview datasets. In this work, we first revisit the most recent deterministic extensions of deep CCA and highlight the strengths and limitations of these state-of-the-art methods. Some methods allow trivial solutions, while others can miss weak common factors. Others also seek to reveal what is not common among the views, i.e., the private components that are needed to fully reconstruct each view, which tends to overload the problem and inflate its computational and sample complexities. Aiming to improve upon these limitations, we design a novel and efficient formulation that alleviates some of the current restrictions. The main idea is to model the private components as conditionally independent given the common ones, which enables the proposed compact formulation. In addition, we also provide a sufficient condition for identifying the common random factors. Judicious experiments with synthetic and real datasets showcase the validity of our claims and the effectiveness of the proposed approach.
-
Random forest is considered one of the most successful machine learning algorithms and has been widely used to construct microbiome-based predictive models. However, its use as a statistical testing method has not been explored. In this study, we propose the "Random Forest Test" (RFtest), a global (community-level) test based on random forest for high-dimensional and phylogenetically structured microbiome data. RFtest is a permutation test that uses the generalization error of random forest as the test statistic. Our simulations demonstrate that RFtest has controlled type I error rates, that its power is superior to competing methods for phylogenetically clustered signals, and that it is robust to outliers and adaptive to interaction effects and non-linear associations. Finally, we apply RFtest to two real microbiome datasets to ascertain whether microbial communities are associated with the outcome variables. (A short permutation-test sketch in this spirit appears after this list.)
-
Algorithmic rankers are ubiquitously applied in automated decision systems such as hiring, admission, and loan-approval systems. Without appropriate explanations, decision-makers often cannot audit or trust algorithmic rankers' outcomes. In recent years, XAI (explainable AI) methods have focused on classification models, but state-of-the-art explanation methods for algorithmic rankers have yet to be developed. Moreover, explanations are sensitive to changes in data and ranker properties, and decision-makers need transparent model diagnostics for calibrating the degree and impact of ranker sensitivity. To fulfill these needs, we take a dual approach of: i) designing explanations by transforming Shapley values for the simple form of a ranker based on linear weighted summation, and ii) designing a human-in-the-loop sensitivity analysis workflow by simulating data whose attributes follow user-specified statistical distributions and correlations. We leverage a visualization interface to validate the transformed Shapley values and draw inferences from them by leveraging multi-factorial simulations, including data distributions, ranker parameters, and rank ranges. (A brief sketch of Shapley contributions for a linear weighted-sum ranker appears after this list.)
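For the sketching abstract above (recovering symbol frequencies from hashed summaries), a minimal count-min sketch illustrates the classical algorithmic baseline that the smoothed-Bayesian estimator departs from. The parameters (three hash functions, width 16) and data are illustrative assumptions, and the paper's estimator itself is not reproduced here.

```python
# Minimal count-min sketch: compress a stream into a small hashed table, then
# recover a symbol's frequency with the classical min-over-rows estimator.
import random
from collections import Counter

def make_hashes(num_hashes, width, seed=0):
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    # Each hash maps an item to a bucket in [0, width) using a salted tuple hash.
    return [lambda x, s=s: hash((s, x)) % width for s in salts]

def build_sketch(data, hashes, width):
    table = [[0] * width for _ in hashes]
    for x in data:
        for row, h in zip(table, hashes):
            row[h(x)] += 1
    return table

def estimate(x, table, hashes):
    # Classical estimate: minimum over rows (an upper bound on the true count).
    return min(row[h(x)] for row, h in zip(table, hashes))

random.seed(1)
data = [random.choice("abcdefgh") for _ in range(10_000)]
hashes = make_hashes(num_hashes=3, width=16)
table = build_sketch(data, hashes, width=16)
print("true count of 'a':", Counter(data)["a"], "estimate:", estimate("a", table, hashes))
```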
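For the RFtest abstract above, the core mechanism (a permutation test that uses random-forest generalization error as its test statistic) can be sketched as follows. This is a toy illustration that uses out-of-bag error on simulated data; it does not reproduce the paper's treatment of phylogenetic structure or its exact error definition.

```python
# Hedged sketch: permutation test with random-forest out-of-bag (OOB) error
# as the test statistic. Data and error definition are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_oob_error(X, y, seed=0):
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=seed)
    rf.fit(X, y)
    return 1.0 - rf.oob_score_

def rf_permutation_test(X, y, n_perm=200, seed=0):
    rng = np.random.default_rng(seed)
    observed = rf_oob_error(X, y)
    # Null distribution: refit on outcome labels permuted to break any association.
    null = [rf_oob_error(X, rng.permutation(y), seed=i) for i in range(n_perm)]
    # Small p-value when the observed error is lower than most permuted errors.
    return (1 + sum(e <= observed for e in null)) / (1 + n_perm)

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))             # toy "abundance" matrix
y = (X[:, 0] + rng.normal(size=60) > 0)   # outcome weakly linked to one feature
print("p-value:", rf_permutation_test(X, y, n_perm=50))
```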
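For the ranking-explanation abstract above, the base case it builds on (exact Shapley-style contributions for a ranker that scores items by a linear weighted sum) looks like the sketch below. The weights, features, and min-max scaling are hypothetical; the paper's transformed Shapley values and sensitivity-analysis workflow are not reproduced.

```python
# Hedged sketch: for a linear weighted-sum ranker, the Shapley contribution of
# feature i for an item is w_i * (x_i - mean_i), relative to the average candidate.
import numpy as np

weights = np.array([0.5, 0.3, 0.2])              # hypothetical ranker weights
X = np.array([[3.8, 710.0, 2.0],                 # hypothetical candidates: GPA, test score, essays
              [3.2, 760.0, 4.0],
              [3.9, 680.0, 3.0]])
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # scale attributes to [0, 1]

scores = X_norm @ weights
ranking = np.argsort(-scores)                    # higher score = better rank

# Per-item, per-feature contributions explaining why an item scores above or below average.
shapley = weights * (X_norm - X_norm.mean(axis=0))
for idx in ranking:
    print(f"item {idx}: score={scores[idx]:.3f}, contributions={shapley[idx].round(3)}")
```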