When performing remote sensing image segmentation, practitioners often encounter challenges such as strong foreground–background imbalance, tiny objects, high object density, intra-class heterogeneity, and inter-class homogeneity. To overcome these challenges, this paper introduces AerialFormer, a hybrid model that strategically combines the strengths of Transformers and Convolutional Neural Networks (CNNs). AerialFormer integrates a CNN Stem module to preserve low-level, high-resolution features, enhancing the model's capability to process fine details of aerial imagery. AerialFormer has a hierarchical design in which a Transformer encoder generates multi-scale features and a multi-dilated CNN (MDC) decoder aggregates information across those scales. As a result, both local and global context are taken into account, yielding powerful representations and high-resolution segmentation. AerialFormer was evaluated on three benchmark datasets: iSAID, LoveDA, and Potsdam. Comprehensive experiments and extensive ablation studies show that AerialFormer remarkably outperforms state-of-the-art methods.
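The abstract does not reproduce the MDC decoder's internals, so the PyTorch sketch below only illustrates the general idea of a multi-dilated convolution block: parallel 3x3 convolutions with different dilation rates see local detail and wider context at the same spatial resolution, and their outputs are fused. The class name `MultiDilatedBlock`, the channel sizes, and the dilation rates (1, 2, 3) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiDilatedBlock(nn.Module):
    """Illustrative multi-dilated convolution block (not the authors' code).

    Splits the output channels across parallel 3x3 convolutions with
    different dilation rates, then fuses them so the block captures both
    local detail (dilation 1) and wider context (dilations 2 and 3).
    """

    def __init__(self, in_channels: int, out_channels: int,
                 dilations=(1, 2, 3)):
        super().__init__()
        branch_out = out_channels // len(dilations)
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, branch_out, kernel_size=3,
                      padding=d, dilation=d, bias=False)
            for d in dilations
        ])
        self.fuse = nn.Sequential(
            nn.BatchNorm2d(branch_out * len(dilations)),
            nn.ReLU(inplace=True),
            nn.Conv2d(branch_out * len(dilations), out_channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # All branches keep the same spatial size because padding
        # equals dilation for a 3x3 kernel, so they concatenate cleanly.
        out = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.fuse(out)

if __name__ == "__main__":
    block = MultiDilatedBlock(in_channels=96, out_channels=96)
    feats = torch.randn(1, 96, 64, 64)   # one multi-scale encoder feature map
    print(block(feats).shape)            # torch.Size([1, 96, 64, 64])
```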
- Award ID(s): 2345176
- PAR ID: 10556613
- Publisher / Repository: MDPI
- Date Published:
- Journal Name: Remote Sensing
- Volume: 16
- Issue: 16
- ISSN: 2072-4292
- Page Range / eLocation ID: 2930
- Subject(s) / Keyword(s): remote sensing; semantic segmentation; transformers; dilated convolution
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Messinger, David W.; Velez-Reyes, Miguel (Eds.): Recently, multispectral and hyperspectral data fusion models based on deep learning have been proposed to generate images with high spatial and spectral resolution. The general objective is to obtain images that improve spatial resolution while preserving high spectral content. In this work, two deep learning data fusion techniques are characterized in terms of classification accuracy. These methods fuse a high spatial resolution multispectral image with a lower spatial resolution hyperspectral image to generate a hyperspectral image with high spatial and spectral resolution. The first model is based on a multi-scale long short-term memory (LSTM) network. The LSTM approach performs the fusion in multiple steps that transition from low to high spatial resolution, using an intermediate step that reduces spatial information loss while preserving spectral content. The second fusion model is based on a convolutional neural network (CNN) data fusion approach. We present fused images for four multi-source datasets with different spatial and spectral resolutions. Both models produce fused images with spatial resolution increased from 8 m to 1 m. The fused images obtained with the two models are evaluated in terms of classification accuracy with several classifiers: minimum distance, support vector machines, class-dependent sparse representation, and CNN classification. The classification results show better overall and average accuracy for the images generated with the multi-scale LSTM fusion than for those generated with the CNN fusion.
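As a rough illustration of the CNN fusion idea in the abstract above, the sketch below upsamples a low-resolution hyperspectral cube to the multispectral grid, concatenates the two inputs along the channel axis, and lets a small convolutional network predict the high spatial-spectral cube. The architecture, channel counts, and the name `FusionCNN` are assumptions for illustration, not the paper's model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionCNN(nn.Module):
    """Toy multispectral/hyperspectral fusion network (illustrative only)."""

    def __init__(self, ms_bands: int = 4, hs_bands: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ms_bands + hs_bands, 128, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1),
            nn.ReLU(inplace=True),
            # Predict a hyperspectral cube at the multispectral resolution.
            nn.Conv2d(128, hs_bands, 3, padding=1),
        )

    def forward(self, ms: torch.Tensor, hs: torch.Tensor) -> torch.Tensor:
        # Bring the coarse hyperspectral cube up to the fine spatial grid
        # (e.g., 8 m -> 1 m), then fuse by channel concatenation.
        hs_up = F.interpolate(hs, size=ms.shape[-2:], mode="bilinear",
                              align_corners=False)
        return self.net(torch.cat([ms, hs_up], dim=1))

if __name__ == "__main__":
    ms = torch.randn(1, 4, 256, 256)   # high spatial resolution, few bands
    hs = torch.randn(1, 64, 32, 32)    # low spatial resolution, many bands
    print(FusionCNN()(ms, hs).shape)   # torch.Size([1, 64, 256, 256])
```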
Convolutional Neural Network (CNN) based image segmentation has made great progress in recent years. However, video object segmentation remains a challenging task due to its high computational complexity. Most previous methods employ a two-stream CNN framework to handle spatial and motion features separately. In this paper, we propose an end-to-end encoder-decoder style 3D CNN that aggregates spatial and temporal information simultaneously for video object segmentation. To process video efficiently, we propose a 3D separable convolution for the pyramid pooling module and decoder, which dramatically reduces the number of operations while maintaining performance. Moreover, we extend our framework to video action segmentation by adding an extra classifier that predicts the action label of actors in videos. Extensive experiments on several video datasets demonstrate the superior performance of the proposed approach for action and object segmentation compared to the state of the art.
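One common way to make a 3D convolution separable is to factor a full 3x3x3 kernel into a spatial 1x3x3 step followed by a temporal 3x1x1 step, which is what the sketch below shows; whether this matches the paper's exact factorization is an assumption on my part.

```python
import torch
import torch.nn as nn

class Separable3DConv(nn.Module):
    """A common 3D-separable factorization (illustrative, not necessarily
    the paper's exact design): a spatial 1x3x3 convolution followed by a
    temporal 3x1x1 convolution, replacing a full 3x3x3 kernel and cutting
    the per-output multiply-accumulates roughly from 27*C to (9+3)*C."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.spatial = nn.Conv3d(in_channels, out_channels,
                                 kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1), bias=False)
        self.temporal = nn.Conv3d(out_channels, out_channels,
                                  kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), bias=False)
        self.bn = nn.BatchNorm3d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Spatial filtering first, then mixing across adjacent frames.
        return self.act(self.bn(self.temporal(self.spatial(x))))

if __name__ == "__main__":
    clip = torch.randn(1, 16, 8, 112, 112)       # (batch, C, T, H, W)
    print(Separable3DConv(16, 32)(clip).shape)   # torch.Size([1, 32, 8, 112, 112])
```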
In this paper, we present a novel terrain classification framework for large-scale remote sensing images. A well-performing multi-scale superpixel tessellation based segmentation approach is employed to generate homogeneous and irregularly shaped regions, and a transfer learning technique is then deployed to derive representative deep features by utilizing successful pre-trained convolutional neural network (CNN) models. This design aims to overcome the scarcity of available ground-truth data and to increase the generalization power of the multi-pixel descriptor. In the subsequent classification step, we train a fast and robust support vector machine (SVM) to assign pixel-level labels; its maximum-margin property can be easily combined with a graph Laplacian propagation approach. Moreover, we analyze the advantages of applying a feature selection technique to the deep CNN features extracted by transfer learning. In the experiments, we evaluate the whole framework on different geographical types. Compared with other region-based classification methods, the results show that our framework achieves state-of-the-art performance in both classification accuracy and computational efficiency.
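The pipeline above, pre-trained CNN features fed to a fast linear SVM, can be sketched in a few lines. The ResNet-18 backbone and scikit-learn `LinearSVC` below are stand-ins chosen for illustration, not necessarily the models used in the paper, and `train_regions` / `region_labels` are hypothetical inputs.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.svm import LinearSVC

# Use a pre-trained CNN as a frozen feature extractor (transfer learning).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # drop the classifier, keep 512-d features
backbone.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(pil_regions):
    """Deep features for a batch of superpixel-region image crops."""
    batch = torch.stack([preprocess(img) for img in pil_regions])
    return backbone(batch).numpy()

# Train a fast linear SVM on deep features of labeled regions, then
# predict labels for unlabeled regions (hypothetical variable names):
# X_train = extract_features(train_regions)
# svm = LinearSVC(C=1.0).fit(X_train, region_labels)
# labels = svm.predict(extract_features(test_regions))
```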
Micro-CT, also known as X-ray micro-computed tomography, has emerged as the primary instrument for studying pore-scale properties of geological materials. Several studies have used deep learning to achieve super-resolution reconstruction in order to balance the trade-off between the resolution of CT images and the field of view. Nevertheless, most existing methods work only with single-scale CT scans, ignoring the possibility of using multi-scale image features for image reconstruction. In this study, we propose a super-resolution approach via multi-scale fusion using a residual U-Net for rock micro-CT image reconstruction (MS-ResUnet). The residual U-Net provides an encoder-decoder structure. Each encoder layer uses several residual sequential blocks and improved residual blocks. The decoder is composed of convolutional ReLU residual blocks and residual chained pooling blocks. During encoding and decoding, information transferred between neighboring multi-resolution images is fused, resulting in richer rock characteristic information. Qualitative and quantitative comparisons on sandstone, carbonate, and coal CT images demonstrate that our proposed algorithm surpasses existing approaches. Our model accurately reconstructs the intricate details of pores in carbonate and sandstone, as well as clearly visible coal cracks.
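The residual blocks mentioned above follow the standard pattern of adding a block's input back to its output. The abstract does not specify the exact composition of MS-ResUnet's blocks, so the minimal 2D version below is only an assumed, generic form of that pattern.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Generic conv-BN-ReLU residual block of the kind used as a
    building unit in residual U-Nets (illustrative sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The identity shortcut lets the block learn a residual
        # correction, which eases training of deep encoder-decoders.
        return self.act(x + self.body(x))

if __name__ == "__main__":
    patch = torch.randn(1, 64, 64, 64)      # a micro-CT feature map
    print(ResidualBlock(64)(patch).shape)   # torch.Size([1, 64, 64, 64])
```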
7T magnetic resonance imaging (MRI) has the potential to advance our understanding of human brain function through new contrast and enhanced resolution. Whole brain segmentation is a key neuroimaging technique that allows for region-by-region analysis of the brain. Segmentation is also an important preliminary step that provides spatial and volumetric information for running other neuroimaging pipelines. Spatially localized atlas network tiles (SLANT) is a popular 3D convolutional neural network (CNN) tool that breaks the whole brain segmentation task into localized sub-tasks. Each sub-task involves a specific spatial location handled by an independent 3D convolutional network, providing high-resolution whole brain segmentation results. SLANT has been widely used to generate whole brain segmentations from structural scans acquired on 3T MRI. However, the use of SLANT for whole brain segmentation from structural 7T MRI scans has not been successful due to the inhomogeneous image contrast usually seen across the brain in 7T MRI. For instance, we demonstrate that the mean percent difference of SLANT label volumes between a 3T scan-rescan pair is approximately 1.73%, whereas its 3T-7T counterpart shows much higher differences of around 15.13%. Our approach to this problem is to register the whole brain segmentation performed on 3T MRI to 7T MRI and use this information to fine-tune SLANT for structural 7T MRI. With the fine-tuned SLANT pipeline, we observe a lower mean relative difference of approximately 8.43% in label volumes acquired from structural 7T MRI data. The Dice similarity coefficient between the SLANT segmentation of the 3T MRI scan and the fine-tuned SLANT segmentation of the 7T MRI increased from 0.79 to 0.83 (p < 0.01). These results suggest that fine-tuning SLANT is a viable solution for improving whole brain segmentation on high-resolution 7T structural imaging.
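The Dice similarity coefficient reported above measures the overlap between two label maps as 2|A ∩ B| / (|A| + |B|). A minimal NumPy version for a single label is sketched below; the function name and toy inputs are illustrative, not from the paper.

```python
import numpy as np

def dice_coefficient(seg_a: np.ndarray, seg_b: np.ndarray, label: int) -> float:
    """Dice overlap for one label: 2 * |A ∩ B| / (|A| + |B|).

    seg_a, seg_b: integer label volumes of identical shape
    (e.g., SLANT segmentations of co-registered 3T and 7T scans).
    """
    a = seg_a == label
    b = seg_b == label
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # label absent from both volumes: perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom

# Toy 3-voxel "volumes" agreeing on 1 of the voxels carrying label 1:
print(dice_coefficient(np.array([1, 1, 0]), np.array([1, 0, 1]), label=1))
# 0.5  ->  2 * 1 overlapping voxel / (2 + 2 labeled voxels)
```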