skip to main content

Title: Weakly Supervised Spatial Deep Learning for Earth Image Segmentation Based on Imperfect Polyline Labels
In recent years, deep learning has achieved tremendous success in image segmentation for computer vision applications. The performance of these models heavily relies on the availability of large-scale high-quality training labels (e.g., PASCAL VOC 2012). Unfortunately, such large-scale high-quality training data are often unavailable in many real-world spatial or spatiotemporal problems in earth science and remote sensing (e.g., mapping the nationwide river streams for water resource management). Although extensive efforts have been made to reduce the reliance on labeled data (e.g., semi-supervised or unsupervised learning, few-shot learning), the complex nature of geographic data such as spatial heterogeneity still requires sufficient training labels when transferring a pre-trained model from one region to another. On the other hand, it is often much easier to collect lower-quality training labels with imperfect alignment with earth imagery pixels (e.g., through interpreting coarse imagery by non-expert volunteers). However, directly training a deep neural network on imperfect labels with geometric annotation errors could significantly impact model performance. Existing research that overcomes imperfect training labels either focuses on errors in label class semantics or characterizes label location errors at the pixel level. These methods do not fully incorporate the geometric properties of label location errors in the vector representation. To fill the gap, this article proposes a weakly supervised learning framework to simultaneously update deep learning model parameters and infer hidden true vector label locations. Specifically, we model label location errors in the vector representation to partially reserve geometric properties (e.g., spatial contiguity within line segments). Evaluations on real-world datasets in the National Hydrography Dataset (NHD) refinement application illustrate that the proposed framework outperforms baseline methods in classification accuracy.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ; ; ; ;
Date Published:
Journal Name:
ACM Transactions on Intelligent Systems and Technology
Page Range / eLocation ID:
1 to 20
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Given raster imagery features and imperfect vector training labels with registration uncertainty, this paper studies a deep learning framework that can quantify and reduce the registration uncertainty of training labels as well as train neural network parameters simultaneously. The problem is important in broad applications such as streamline classification on Earth imagery or tissue segmentation on medical imagery, whereby annotating precise vector labels is expensive and time-consuming. However, the problem is challenging due to the gap between the vector representation of class labels and the raster representation of image features and the need for training neural networks with uncertain label locations. Existing research on uncertain training labels often focuses on uncertainty in label class semantics or characterizes label registration uncertainty at the pixel level (not contiguous vectors). To fill the gap, this paper proposes a novel learning framework that explicitly quantifies vector labels' registration uncertainty. We propose a registration-uncertainty-aware loss function and design an iterative uncertainty reduction algorithm by re-estimating the posterior of true vector label locations distribution based on a Gaussian process. Evaluations on real-world datasets in National Hydrography Dataset refinement show that the proposed approach significantly outperforms several baselines in the registration uncertainty estimations performance and classification performance. 
    more » « less
  2. Given earth imagery with spectral features on a terrain surface, this paper studies surface segmentation based on both explanatory features and surface topology. The problem is important in many spatial and spatiotemporal applications such as flood extent mapping in hydrology. The problem is uniquely challenging for several reasons: first, the size of earth imagery on a terrain surface is often much larger than the input of popular deep convolutional neural networks; second, there exists topological structure dependency between pixel classes on the surface, and such dependency can follow an unknown and non-linear distribution; third, there are often limited training labels. Existing methods for earth imagery segmentation often divide the imagery into patches and consider the elevation as an additional feature channel. These methods do not fully incorporate the spatial topological structural constraint within and across surface patches and thus often show poor results, especially when training labels are limited. Existing methods on semi-supervised and unsupervised learning for earth imagery often focus on learning representation without explicitly incorporating surface topology. In contrast, we propose a novel framework that explicitly models the topological skeleton of a terrain surface with a contour tree from computational topology, which is guided by the physical constraint (e.g., water flow direction on terrains). Our framework consists of two neural networks: a convolutional neural network (CNN) to learn spatial contextual features on a 2D image grid, and a graph neural network (GNN) to learn the statistical distribution of physics-guided spatial topological dependency on the contour tree. The two models are co-trained via variational EM. Evaluations on the real-world flood mapping datasets show that the proposed models outperform baseline methods in classification accuracy, especially when training labels are limited. 
    more » « less
  3. Abstract

    Advances in visual perceptual tasks have been mainly driven by the amount, and types, of annotations of large-scale datasets. Researchers have focused on fully-supervised settings to train models using offline epoch-based schemes. Despite the evident advancements, limitations and cost of manually annotated datasets have hindered further development for event perceptual tasks, such as detection and localization of objects and events in videos. The problem is more apparent in zoological applications due to the scarcity of annotations and length of videos-most videos are at most ten minutes long. Inspired by cognitive theories, we present a self-supervised perceptual prediction framework to tackle the problem of temporal event segmentation by building a stable representation of event-related objects. The approach is simple but effective. We rely on LSTM predictions of high-level features computed by a standard deep learning backbone. For spatial segmentation, the stable representation of the object is used by an attention mechanism to filter the input features before the prediction step. The self-learned attention maps effectively localize the object as a side effect of perceptual prediction. We demonstrate our approach on long videos from continuous wildlife video monitoring, spanning multiple days at 25 FPS. We aim to facilitate automated ethogramming by detecting and localizing events without the need for labels. Our approach is trained in an online manner on streaming input and requires only a single pass through the video, with no separate training set. Given the lack of long and realistic (includes real-world challenges) datasets, we introduce a new wildlife video dataset–nest monitoring of the Kagu (a flightless bird from New Caledonia)–to benchmark our approach. Our dataset features a video from 10 days (over 23 million frames) of continuous monitoring of the Kagu in its natural habitat. We annotate every frame with bounding boxes and event labels. Additionally, each frame is annotated with time-of-day and illumination conditions. We will make the dataset, which is the first of its kind, and the code available to the research community. We find that the approach significantly outperforms other self-supervised, traditional (e.g., Optical Flow, Background Subtraction) and NN-based (e.g., PA-DPC, DINO, iBOT), baselines and performs on par with supervised boundary detection approaches (i.e., PC). At a recall rate of 80%, our best performing model detects one false positive activity every 50 min of training. On average, we at least double the performance of self-supervised approaches for spatial segmentation. Additionally, we show that our approach is robust to various environmental conditions (e.g., moving shadows). We also benchmark the framework on other datasets (i.e., Kinetics-GEBD, TAPOS) from different domains to demonstrate its generalizability. The data and code are available on our project page:

    more » « less
  4. null (Ed.)
    Text categorization is an essential task in Web content analysis. Considering the ever-evolving Web data and new emerging categories, instead of the laborious supervised setting, in this paper, we focus on the minimally-supervised setting that aims to categorize documents effectively, with a couple of seed documents annotated per category. We recognize that texts collected from the Web are often structure-rich, i.e., accompanied by various metadata. One can easily organize the corpus into a text-rich network, joining raw text documents with document attributes, high-quality phrases, label surface names as nodes, and their associations as edges. Such a network provides a holistic view of the corpus’ heterogeneous data sources and enables a joint optimization for network-based analysis and deep textual model training. We therefore propose a novel framework for minimally supervised categorization by learning from the text-rich network. Specifically, we jointly train two modules with different inductive biases – a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning. Each module generates pseudo training labels from the unlabeled document set, and both modules mutually enhance each other by co-training using pooled pseudo labels. We test our model on two real-world datasets. On the challenging e-commerce product categorization dataset with 683 categories, our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%, significantly outperforming all compared methods; our accuracy is only less than 2% away from the supervised BERT model trained on about 50K labeled documents. 
    more » « less
  5. Cloud masking is both a fundamental and a critical task in the vast majority of Earth observation problems across social sectors, including agriculture, energy, water, etc. The sheer volume of satellite imagery to be processed has fast-climbed to a scale (e.g., >10 PBs/year) that is prohibitive for manual processing. Meanwhile, generating reliable cloud masks and image composite is increasingly challenging due to the continued distribution-shifts in the imagery collected by existing sensors and the ever-growing variety of sensors and platforms. Moreover, labeled samples are scarce and geographically limited compared to the needs in real large-scale applications. In related work, traditional remote sensing methods are often physics-based and rely on special spectral signatures from multi- or hyper-spectral bands, which are often not available in data collected by many -- and especially more recent -- high-resolution platforms. Machine learning and deep learning based methods, on the other hand, often require large volumes of up-to-date training data to be reliable and generalizable over space. We propose an autonomous image composition and masking (Auto-CM) framework to learn to solve the fundamental tasks in a label-free manner, by leveraging different dynamics of events in both geographic domains and time-series. Our experiments show that Auto-CM outperforms existing methods on a wide-range of data with different satellite platforms, geographic regions and bands.

    more » « less