Title: ASSESSING LIDAR TRAINING DATA QUANTITIES FOR CLASSIFICATION MODELS
Abstract. Classifying objects within aerial Light Detection and Ranging (LiDAR) data is an essential task to which machine learning (ML) is increasingly applied. ML has been shown to be more effective on LiDAR than on imagery for classification, yet most efforts have focused on imagery because of the challenges presented by LiDAR data: LiDAR datasets are of higher dimensionality, discontinuous, heterogeneous, spatially incomplete, and often scarce. As such, there has been little examination of the fundamental properties of the training data required for acceptable performance of classification models tailored for LiDAR data. The quantity of training data is one such crucial property, because training on different data sizes provides insight into how a model’s performance scales. This paper assesses the impact of training data size on the accuracy of PointNet, a widely used ML approach for point cloud classification. Subsets of ModelNet ranging from 40 to 9,843 objects were validated on a test set of 400 objects. Accuracy improved logarithmically: it decelerated from 45 objects onwards and slowed significantly at a training size of 2,000 objects, corresponding to 20,000,000 points. This work contributes to the theoretical foundation for the development of LiDAR-focused models by establishing a learning curve, suggesting the minimum quantity of manually labelled data necessary for satisfactory classification performance, and providing a path for further analysis of the effects of modifying training data characteristics.
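The logarithmic learning curve the abstract describes can be fit with ordinary least squares on log-transformed training size. A minimal sketch, assuming hypothetical accuracy observations (the numbers below are illustrative, not the paper's results):

```python
import math

def fit_log_curve(sizes, accuracies):
    """Least-squares fit of accuracy = a + b*ln(n) to observed points.

    This is plain linear regression on x = ln(n), using the
    closed-form slope/intercept formulas.
    """
    xs = [math.log(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(accuracies) / count
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

def predict(a, b, size):
    """Predicted accuracy at a given training-set size."""
    return a + b * math.log(size)

# Hypothetical accuracy figures at four of the training sizes used.
sizes = [40, 400, 2000, 9843]
accs = [0.55, 0.75, 0.85, 0.88]
a, b = fit_log_curve(sizes, accs)
```

A curve like this makes the "diminishing returns" point concrete: each doubling of training data buys a roughly constant accuracy increment, so gains flatten out at large sizes.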
Award ID(s):
1940145
PAR ID:
10522681
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
ISPRS
Date Published:
Journal Name:
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Volume:
XLVI-4/W4-2021
ISSN:
2194-9034
Page Range / eLocation ID:
101 to 106
Subject(s) / Keyword(s):
Lidar, laser scanning, training data, validation data, benchmarking, machine learning
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    In recent years, rapid improvements in both remote sensing methods and satellite image availability have the potential to massively improve burn severity assessments of the Alaskan boreal forest. In this study, we utilized recent pre- and post-fire Sentinel-2 satellite imagery of the 2019 Nugget Creek and Shovel Creek burn scars located in Interior Alaska both to assess burn severity across the burn scars and to test the effectiveness of several remote sensing methods for generating accurate map products: Normalized Difference Vegetation Index (NDVI), Normalized Burn Ratio (NBR), and Random Forest (RF) and Support Vector Machine (SVM) supervised classification. We used 52 Composite Burn Index (CBI) plots from the Shovel Creek burn scar and 28 from the Nugget Creek burn scar for training classifiers and product validation. For the Shovel Creek burn scar, the RF and SVM machine learning (ML) classification methods outperformed the traditional spectral indices that use linear regression to separate burn severity classes (RF and SVM accuracy, 83.33%, versus NBR accuracy, 73.08%). However, for the Nugget Creek burn scar, the NDVI product (accuracy: 96%) outperformed the other indices and ML classifiers. We demonstrated that ML classifiers can be very effective for reliable mapping of burn severity in the Alaskan boreal forest when sufficient ground truth data are available. Since the performance of ML classifiers depends on the quantity of ground truth data, ML classification methods are better suited for assessing burn severity when sufficient ground truth data are available, whereas traditional spectral indices are better suited when such data are limited. We also examined the relationship between burn severity, fuel type, and topography (aspect and slope) and found that it is site-dependent.
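The spectral indices named above are simple band ratios. A minimal per-pixel sketch; the reflectance values in the example are hypothetical, not data from the study:

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index."""
    return (nir - red) / (nir + red)

def nbr(nir, swir):
    """Normalized Burn Ratio; healthy vegetation scores high, burns low."""
    return (nir - swir) / (nir + swir)

def dnbr(pre_nir, pre_swir, post_nir, post_swir):
    """Differenced NBR: pre-fire NBR minus post-fire NBR.

    Larger dNBR values indicate a larger loss of healthy vegetation,
    i.e., higher burn severity.
    """
    return nbr(pre_nir, pre_swir) - nbr(post_nir, post_swir)

# Example pixel: hypothetical NIR/SWIR reflectance before and after a fire.
severity = dnbr(0.45, 0.15, 0.20, 0.35)  # large drop in NBR
```

For Sentinel-2, NIR is typically taken from band 8, red from band 4, and SWIR from band 12; mapping dNBR to severity classes then requires thresholds calibrated against field plots such as the CBI plots used here.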
  2. Sustainability has become a critical focus area across the technology industry, most notably in cloud data centers. In such shared-use computing environments, there is a need to account for the power consumption of individual users. Prior work on power prediction of individual user jobs in shared environments has often focused on workloads that stress a single resource, such as CPU or DRAM. These works typically employ a specific machine learning (ML) model to train and test on the target workload for high accuracy. However, modern workloads in data centers can stress multiple resources simultaneously, and cannot be assumed to always be available for training. This paper empirically evaluates the performance of various ML models under different model settings and training data assumptions for the per-job power prediction problem using a range of workloads. Our evaluation results provide key insights into the efficacy of different ML models. For example, we find that linear ML models suffer from poor prediction accuracy (as much as 25% prediction error), especially for unseen workloads. Conversely, non-linear models, specifically XGBoost and Random Forest, provide reasonable accuracy (7–9% error). We also find that data normalization and the power-prediction model formulation affect the accuracy of individual ML models in different ways.
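The error figures quoted above are percentage errors, and the abstract notes that normalization choices matter. A minimal sketch of both pieces, with hypothetical per-job power numbers (the values and the two model predictions are invented for illustration):

```python
def min_max_normalize(values):
    """Scale a feature column to [0, 1]; one common preprocessing choice."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def mape(actual, predicted):
    """Mean absolute percentage error across jobs."""
    return 100.0 * sum(abs(a - p) / a
                       for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical per-job power draw (watts) vs. two models' predictions.
actual = [120.0, 250.0, 95.0, 310.0]
linear_pred = [150.0, 200.0, 120.0, 250.0]   # a poorly fitting linear model
boosted_pred = [126.0, 242.0, 99.0, 298.0]   # a better non-linear model
```

Comparing `mape(actual, linear_pred)` with `mape(actual, boosted_pred)` reproduces the shape of the paper's finding: the non-linear predictions land within single-digit percentage error while the linear ones do not.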
  3. Woody plant encroachment (WPE) is transforming grasslands globally, yet accurately mapping this process remains challenging. State-funded, publicly available high-resolution aerial imagery offers a potential solution, including the USDA’s National Agriculture Imagery Program (NAIP) and NSF’s National Ecological Observatory Network (NEON) Aerial Observation Platform (AOP). We evaluated the accuracy of land cover classification using NAIP, NEON, and both sources combined. We compared two machine learning models—support vector machines and random forests—implemented in R using large training and evaluation data sets. Our study site, Konza Prairie Biological Station, is a long-term experiment in which variable fire and grazing have created mosaics of herbaceous plants, shrubs, deciduous trees, and evergreen trees (Juniperus virginiana). All models achieved high overall accuracy (>90%), with NEON slightly outperforming NAIP. NAIP underperformed in detecting evergreen trees (52–78% vs. 83–86% accuracy with NEON). NEON models relied on LiDAR-based canopy height data, whereas NAIP relied on multispectral bands. Combining data from both platforms yielded the best results, with 97.7% overall accuracy. Vegetation indices, including the normalized difference vegetation index (NDVI) and enhanced vegetation index (EVI), contributed little to model accuracy. Both machine learning methods achieved similar accuracy. Our results demonstrate that free, high-resolution imagery and open-source tools can enable accurate, high-resolution, landscape-scale WPE monitoring. Broader adoption of such approaches could substantially improve the monitoring and management of grassland biodiversity, ecosystem function, ecosystem services, and environmental resilience.
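The gain from combining platforms comes from stacking NAIP multispectral features with NEON LiDAR canopy height into one feature vector per pixel. A toy stand-in for the RF/SVM classifiers, using a nearest-centroid model over hypothetical [NDVI, canopy height in m] features (all values invented for illustration):

```python
import math

def centroid(vectors):
    """Per-dimension mean of a list of equal-length feature vectors."""
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

def fit(samples):
    """samples: {label: [feature vectors]} -> {label: class centroid}."""
    return {label: centroid(vecs) for label, vecs in samples.items()}

def predict(model, vec):
    """Assign the label whose centroid is nearest in Euclidean distance."""
    return min(model, key=lambda label: math.dist(model[label], vec))

# Hypothetical training pixels: [NDVI, canopy height (m)].
training = {
    "herbaceous": [[0.65, 0.2], [0.60, 0.4]],
    "shrub":      [[0.50, 1.2], [0.55, 1.8]],
    "evergreen":  [[0.40, 7.0], [0.42, 9.0]],
}
model = fit(training)
```

The height dimension is what separates evergreen trees cleanly here; dropping it (the NAIP-only case) would leave classes overlapping in NDVI alone, mirroring NAIP's weaker evergreen accuracy.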
  4. In the past decade, academia and industry have embraced machine learning (ML) for database management system (DBMS) automation. These efforts have focused on designing ML models that predict DBMS behavior to support picking actions (e.g., building indexes) that improve the system's performance. Recent developments in ML have created automated methods for finding good models. Such advances shift the bottleneck from DBMS model design to obtaining the training data necessary for building these models. But generating good training data is challenging and requires encoding subject matter expertise into DBMS instrumentation. Existing methods for training data collection are bespoke to individual DBMS components and do not account for (1) how workload trends affect the system and (2) the subtle interactions between internal system components. Consequently, the models created from this data do not support holistic tuning across subsystems and require frequent retraining to boost their accuracy. This paper presents the architecture of a database gym, an integrated environment that provides a unified API of pluggable components for obtaining high-quality training data. The goal of a database gym is to simplify ML model training and evaluation to accelerate autonomous DBMS research. But unlike gyms in other domains that rely on custom simulators, a database gym uses the DBMS itself to create simulation environments for ML training. Thus, we discuss and prescribe methods for overcoming challenges in DBMS simulation, which include demanding requirements for performance, simulation fidelity, and DBMS-generated hints for guiding training processes. 
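The "database gym" idea centers on a unified, pluggable interface for generating (state, action, reward) training data against the DBMS itself. The class below is purely a hypothetical sketch of that reset/step contract — the paper's actual API and components are not shown here, and the latency "measurement" is a placeholder:

```python
class DatabaseGymEnv:
    """Hypothetical sketch of a gym-style interface around a DBMS.

    A real database gym would replay the workload on the live DBMS and
    measure performance; here the reward is a trivial placeholder so the
    reset/step data-collection loop is visible end to end.
    """

    def __init__(self, workload):
        self.workload = workload      # list of SQL query strings
        self.indexes = set()
        self.step_count = 0

    def reset(self):
        """Return the DBMS to a clean state and emit the initial observation."""
        self.indexes.clear()
        self.step_count = 0
        return self._observe()

    def step(self, action):
        """Apply a tuning action, then return (observation, reward, done)."""
        kind, target = action         # e.g. ("create_index", "customer_id")
        if kind == "create_index":
            self.indexes.add(target)
        self.step_count += 1
        reward = self._simulated_latency_gain(target)
        done = self.step_count >= 10  # fixed-length episodes for simplicity
        return self._observe(), reward, done

    def _observe(self):
        return {"indexes": frozenset(self.indexes),
                "queries": len(self.workload)}

    def _simulated_latency_gain(self, target):
        # Placeholder reward: credit the action only if the indexed column
        # actually appears somewhere in the workload.
        return 1.0 if any(target in q for q in self.workload) else 0.0
```

The point of the shape, per the abstract, is that the environment wraps the DBMS itself rather than a custom simulator, so tuples logged from `step` reflect real system behavior.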
  5. Abstract. Remote sensing measurements have been widely used to estimate the planetary boundary layer height (PBLHT). Each remote sensing approach offers unique strengths and faces different limitations. In this study, we use machine learning (ML) methods to produce a best-estimate PBLHT (PBLHT-BE-ML) by integrating four PBLHT estimates derived from remote sensing measurements at the Department of Energy (DOE) Atmospheric Radiation Measurement (ARM) Southern Great Plains (SGP) observatory. Three ML models – random forest (RF) classifier, RF regressor, and light gradient-boosting machine (LightGBM) – were trained on a dataset from 2017 to 2023 that included radiosonde, various remote sensing PBLHT estimates, and atmospheric meteorological conditions. Evaluations indicated that PBLHT-BE-ML from all three models improved alignment with the PBLHT derived from radiosonde data (PBLHT-SONDE), with LightGBM demonstrating the highest accuracy under both stable and unstable boundary layer conditions. Feature analysis revealed that the most influential input features at the SGP site were the PBLHT estimates derived from (a) potential temperature profiles retrieved using Raman lidar (RL) and atmospheric emitted radiance interferometer (AERI) measurements (PBLHT-THERMO), (b) vertical velocity variance profiles from Doppler lidar (PBLHT-DL), and (c) aerosol backscatter profiles from micropulse lidar (PBLHT-MPL). The trained models were then used to predict PBLHT-BE-ML at a temporal resolution of 10 min, effectively capturing the diurnal evolution of PBLHT and its significant seasonal variations, with the largest diurnal variation observed over summer at the SGP site. We applied these trained models to data from the ARM Eastern Pacific Cloud Aerosol Precipitation Experiment (EPCAPE) field campaign (EPC), where the PBLHT-BE-ML, particularly with the LightGBM model, demonstrated improved accuracy against PBLHT-SONDE. 
Analyses of model performance at both the SGP and EPC sites suggest that expanding the training dataset to include various surface types, such as ocean and ice-covered areas, could further enhance ML model performance for PBLHT estimation across varied geographic regions. 
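The core operation is combining several instrument-derived PBLHT estimates into one best estimate. The study trains RF and LightGBM models for this; as a much simpler hedged sketch, the combiner below uses a weighted average with a median fallback, and the instrument keys, weights, and heights (km) are illustrative only:

```python
from statistics import median

def best_estimate_pblht(estimates, weights=None):
    """Combine per-instrument PBLHT estimates (km) into a single value.

    estimates: {instrument: height_km or None for missing retrievals}.
    With weights, returns a weighted average over the available
    estimates; without, falls back to the median.
    """
    available = {k: v for k, v in estimates.items() if v is not None}
    if not available:
        raise ValueError("no valid PBLHT estimates")
    if weights is None:
        return median(available.values())
    num = sum(weights[k] * v for k, v in available.items())
    den = sum(weights[k] for k in available)
    return num / den

# Illustrative inputs: one retrieval missing for this 10 min sample.
est = {"THERMO": 1.2, "DL": 1.0, "MPL": 1.4, "OTHER": None}
```

Handling missing retrievals gracefully matters here because each remote sensing method fails under different conditions, which is exactly why the paper learns the combination instead of fixing weights by hand.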