
This content will become publicly available on July 28, 2022

Title: Study becomes insight: Ecological learning from machine learning
1. The ecological and environmental science communities have embraced machine learning (ML) for empirical modelling and prediction. However, going beyond prediction to draw insights into underlying functional relationships between response variables and environmental ‘drivers’ is less straightforward. Deriving ecological insights from fitted ML models requires techniques to extract the ‘learning’ hidden in the ML models.
2. We revisit the theoretical background and effectiveness of four approaches for ranking independent variable importance (Gini importance, GI; permutation importance, PI; split importance, SI; and conditional permutation importance, CPI), and two approaches for inference of bivariate functional relationships (partial dependence plots, PDP; and accumulated local effect plots, ALE). We also explore the use of a surrogate model for visualization and interpretation of complex multi-variate relationships between response variables and environmental drivers. We examine the challenges and opportunities for extracting ecological insights with these interpretation approaches. Specifically, we aim to improve interpretation of ML models by investigating how effectiveness relates to (a) interpretation algorithm, (b) sample size and (c) the presence of spurious explanatory variables.
3. We base the analysis on simulations with known underlying functional relationships between response and predictor variables, with added white noise and the presence of correlated but non-influential variables. The results indicate that deriving ecological insight is strongly affected by interpretation algorithm and spurious variables, and moderately impacted by sample size. Removing spurious variables improves interpretation of ML models. Meanwhile, increasing sample size has limited value in the presence of spurious variables, but increasing sample size does improve performance once spurious variables are omitted. Among the four ranking methods, SI is slightly more effective than the other methods in the presence of spurious variables, while GI and SI yield higher accuracy when spurious variables are removed. PDP is more effective in retrieving underlying functional relationships than ALE, but its reliability declines sharply in the presence of spurious variables. Visualization and interpretation of the interactive effects of predictors and the response variable can be enhanced using surrogate models, including three-dimensional visualizations and use of loess planes to represent independent variable effects and interactions.
4. Machine learning analysts should be aware that including correlated independent variables in ML models with no clear causal relationship to response variables can interfere with ecological inference. When ecological inference is important, ML models should be constructed with independent variables that have clear causal effects on response variables. While interpreting ML models for ecological inference remains challenging, we show that careful choice of interpretation methods, exclusion of spurious variables and adequate sample size can provide more and better opportunities to ‘learn from machine learning’.
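As an illustration of two of the interpretation approaches named in the abstract, the following is a minimal sketch of permutation importance (PI) and a partial dependence curve (PDP), written against a generic `predict` function. It is not the authors' code. The simulated data, the coefficient 2.0, the noise levels, and the names `true_model`, `permutation_importance` and `partial_dependence` are all assumptions made for this example, and the known generating function stands in for a fitted ML model.

```python
import random
from statistics import mean


def mse(predict, rows, targets):
    """Mean squared error of `predict` over the data set."""
    return mean((predict(r) - t) ** 2 for r, t in zip(rows, targets))


def permutation_importance(predict, rows, targets, n_repeats=10, seed=0):
    """Average increase in MSE when one predictor column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(predict, rows, targets)
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild the data with only column j shuffled.
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mse(predict, shuffled, targets) - baseline)
        importances.append(mean(increases))
    return importances


def partial_dependence(predict, rows, j, grid):
    """Mean prediction with predictor j fixed at each grid value."""
    return [mean(predict(r[:j] + [v] + r[j + 1:]) for r in rows) for v in grid]


# Hypothetical simulation (assumed, not from the paper): x0 drives the
# response, x1 is correlated with x0 but has no effect (a spurious
# variable), and x2 is unrelated noise.
rng = random.Random(42)
rows, targets = [], []
for _ in range(500):
    x0 = rng.uniform(0, 1)
    x1 = x0 + rng.gauss(0, 0.1)
    x2 = rng.uniform(0, 1)
    rows.append([x0, x1, x2])
    targets.append(2.0 * x0 + rng.gauss(0, 0.1))


def true_model(row):
    """Stand-in for a fitted model: the known generating function."""
    return 2.0 * row[0]


importance = permutation_importance(true_model, rows, targets)
pdp = partial_dependence(true_model, rows, 0, [0.0, 0.5, 1.0])
print("permutation importance per predictor:", importance)
print("partial dependence of x0:", pdp)
```

Because `true_model` ignores x1 and x2, shuffling them leaves the error unchanged and their importance is zero. A model fitted to the same data could instead split credit between x0 and the correlated x1, which is the kind of interference from spurious variables the abstract describes.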
Windecker, Saras
Award ID(s):
1832194 2025166
Publication Date:
Journal Name:
Methods in Ecology and Evolution
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Scientific and engineering problems often require the use of artificial intelligence to aid understanding and the search for promising designs. While Gaussian processes (GP) stand out as easy-to-use and interpretable learners, they have difficulties in accommodating big data sets, categorical inputs, and multiple responses, which has become a common challenge for a growing number of data-driven design applications. In this paper, we propose a GP model that utilizes latent variables and functions obtained through variational inference to address the aforementioned challenges simultaneously. The method is built upon the latent-variable Gaussian process (LVGP) model where categorical factors are mapped into a continuous latent space to enable GP modeling of mixed-variable data sets. By extending variational inference to LVGP models, the large training data set is replaced by a small set of inducing points to address the scalability issue. Output response vectors are represented by a linear combination of independent latent functions, forming a flexible kernel structure to handle multiple responses that might have distinct behaviors. Comparative studies demonstrate that the proposed method scales well for large data sets with over 10^4 data points, while outperforming state-of-the-art machine learning methods without requiring much hyperparameter tuning. In addition, an interpretable latent space is obtained to draw insights into the effect of categorical factors, such as those associated with “building blocks” of architectures and element choices in metamaterial and materials design. Our approach is demonstrated for machine learning of ternary oxide materials and topology optimization of a multiscale compliant mechanism with aperiodic microstructures and multiple materials.
  2. This paper synthesizes multiple methods for machine learning (ML) model interpretation and visualization (MIV) focusing on meteorological applications. ML has recently exploded in popularity in many fields, including meteorology. Although ML has been successful in meteorology, it has not been as widely accepted, primarily due to the perception that ML models are “black boxes,” meaning the ML methods are thought to take inputs and provide outputs but not to yield physically interpretable information to the user. This paper introduces and demonstrates multiple MIV techniques for both traditional ML and deep learning, to enable meteorologists to understand what ML models have learned. We discuss permutation-based predictor importance, forward and backward selection, saliency maps, class-activation maps, backward optimization, and novelty detection. We apply these methods at multiple spatiotemporal scales to tornado, hail, winter precipitation type, and convective-storm mode. By analyzing such a wide variety of applications, we intend for this work to demystify the black box of ML, offer insight in applying MIV techniques, and serve as a MIV toolbox for meteorologists and other physical scientists.
  3. Reliable statistical inference is central to forest ecology and management, much of which seeks to estimate population parameters for forest attributes and ecological indicators for biodiversity, functions and services in forest ecosystems. Many populations in nature such as plants or animals are characterized by aggregation tendencies, introducing a big challenge to sampling. Regardless, a biased or imprecise inference would mislead analysis, hence the conclusion and policymaking. Systematic adaptive cluster sampling (SACS) is design-unbiased and particularly efficient for inventorying spatially clustered populations. However, (1) oversampling is common for nonrare variables, making SACS a difficult choice for inventorying common forest attributes or ecological indicators; (2) a SACS sample is not completely specified until the field campaign is completed, making advance budgeting and logistics difficult; (3) even for rare variables, uncertainty regarding the final sample still persists; and (4) a SACS sample may be variable-specific as its formation can be adapted to a particular attribute or indicator, thus risking imbalance or non-representativeness for other jointly observed variables. Consequently, to solve these challenges, we aim to develop a generalized SACS (GSACS) with respect to the design and estimators, and to illustrate its connections with systematic sampling (SS) as has been widely employed by national forest inventories and ecological observation networks around the world. In addition to theoretical derivations, empirical sampling distributions were validated and compared for GSACS and SS using sampling simulations that incorporated a comprehensive set of forest populations exhibiting different spatial patterns. Five conclusions are relevant: (1) in contrast to SACS, GSACS explicitly supports inventorying forest attributes and ecological indicators that are nonrare, and solves the SACS problems of oversampling, uncertain sample form, and sample imbalance for alternative attributes or indicators; (2) we demonstrated that SS is a special case of GSACS; (3) even with fewer sample plots, GSACS gives estimates identical to SS; (4) GSACS outperforms SS with respect to inventorying clustered populations and for making domain-specific estimates; and (5) the precision in design-based inference is negatively correlated with the prevalence of a spatial pattern, the range of spatial autocorrelation, and the sample plot size, in a descending order.
  4. Background Ecological communities tend to be spatially structured due to environmental gradients and/or spatially contagious processes such as growth, dispersion and species interactions. Data transformation followed by usage of algorithms such as Redundancy Analysis (RDA) is a fairly common approach in studies searching for spatial structure in ecological communities, despite recent suggestions advocating the use of Generalized Linear Models (GLMs). Here, we compared the performance of GLMs and RDA in describing spatial structure in ecological community composition data. We simulated realistic presence/absence data typical of many β-diversity studies. For model selection we used standard methods commonly used in most studies involving RDA and GLMs. Methods We simulated communities with known spatial structure, based on three real spatial community presence/absence datasets (one terrestrial, one marine and one freshwater). We used spatial eigenvectors as explanatory variables. We varied the number of non-zero coefficients of the spatial variables, and the spatial scales with which these coefficients were associated and then compared the performance of GLMs and RDA frameworks to correctly retrieve the spatial patterns contained in the simulated communities. We used two different methods for model selection, Forward Selection (FW) for RDA and the Akaike Information Criterion (AIC) for GLMs. The performance of each method was assessed by scoring overall accuracy as the proportion of variables whose inclusion/exclusion status was correct, and by distinguishing which kind of error was observed for each method. We also assessed whether errors in variable selection could affect the interpretation of spatial structure. Results Overall GLM with AIC-based model selection (GLM/AIC) performed better than RDA/FW in selecting spatial explanatory variables, although under some simulations the methods performed similarly. In general, RDA/FW performed unpredictably, often retaining too many explanatory variables and selecting variables associated with incorrect spatial scales. The spatial scale of the pattern had a negligible effect on GLM/AIC performance but consistently affected RDA’s error rates under almost all scenarios. Conclusion We encourage the use of GLM/AIC for studies searching for spatial drivers of species presence/absence patterns, since this framework outperformed RDA/FW in situations most likely to be found in natural communities. It is likely that such recommendations might extend to other types of explanatory variables.
  5. Student perceptions of the complete online transition of two CS courses in response to the COVID-19 pandemic Due to the COVID-19 pandemic, universities across the globe switched from traditional Face-to-Face (F2F) course delivery to completely online. Our university declared during our Spring break that students would not return to campus, and that all courses must be delivered fully online starting two weeks later. This was challenging to both students and instructors. In this evidence-based practice paper, we present results of end-of-semester student surveys from two Spring 2020 CS courses: a programming intensive CS2 course, and a senior theory course in Formal Languages and Automata (FLA). Students indicated course components they perceived as most beneficial to their learning, before and then after the online transition, and preferences for each regarding online vs. F2F. By comparing student reactions across courses, we gain insights on which components are easily adapted to online delivery, and which require further innovation. COVID was unfortunate, but gave a rare opportunity to compare students’ reflections on F2F instruction with online instructional materials for half a semester vs. entirely online delivery of the same course during the second half. The circumstances are unique, but we were able to acquire insights for future instruction. Some course components were perceived to be more useful either before or after the transition, and preferences were not the same in the two courses, possibly due to differences in the courses. Students in both courses found prerecorded asynchronous lectures significantly less useful than in-person lectures. For CS2, online office hours were significantly less useful than in-person office hours, but we found no significant difference in FLA. CS2 students felt less supported by their instructor after the online transition, but no significant difference was indicated by FLA students. FLA students found unproctored online exams offered through Canvas more stressful than in-person proctored exams, but the opposite was indicated by CS2 students. CS2 students indicated that visual materials from an eTextbook were more useful to them after going online than before, but FLA students indicated no significant difference. Overall, students in FLA significantly preferred the traditional F2F version of the course, while no significant difference was detected for CS2 students. We did not find significant effects from gender on the preference of one mode over the other. A serendipitous outcome was learning that some changes forced by circumstance should be considered for long term adoption. Offering online lab sessions and online exams where the questions are primarily multiple choice are possible candidates. However, we found that students need to feel the presence of their instructor to feel properly supported. To determine what course components need further improvement before transitioning to fully online mode, we computed a logistic regression model. The dependent variable is the student's preference for F2F or fully online. The independent variables are the course components before and after the online transition. For both courses, in-person lectures were a significant factor negatively affecting students' preferences of the fully online mode. Similarly, for CS2, in-person labs and in-person office hours were significant factors pushing students’ preferences toward F2F mode.