Search for: All records
Creators/Authors contains: "Chuah, Mooi Choo"


  1. Avidan, S. (Ed.)
    In this paper, we tackle the problem of RGB-D semantic segmentation. The key challenges in solving this problem lie in (1) how to extract features from depth sensor data and (2) how to effectively fuse the features extracted from the two modalities. For the first challenge, we found that the depth information obtained from the sensor is not always reliable (e.g., objects with reflective or dark surfaces typically have inaccurate or void sensor readings), and existing methods that extract depth features using ConvNets do not explicitly consider the reliability of the depth value at different pixel locations. To tackle this challenge, we propose a novel mechanism, namely Uncertainty-Aware Self-Attention, that explicitly controls the information flow from unreliable depth pixels to confident depth pixels during feature extraction. For the second challenge, we propose an effective and scalable fusion module based on Cross-Attention that can adaptively fuse and exchange information between the RGB encoder and the depth encoder. Our proposed framework, namely UCTNet, is an encoder-decoder network that naturally incorporates these two key designs for robust and accurate RGB-D segmentation. Experimental results show that UCTNet outperforms existing works and achieves state-of-the-art performance on two RGB-D semantic segmentation benchmarks.
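    To make the uncertainty-gating idea concrete, here is a minimal PyTorch sketch of self-attention whose logits are biased by a per-pixel depth-confidence map. The module name, the confidence input, and the log-bias gating are illustrative assumptions; the abstract does not specify UCTNet's exact formulation.

        # Illustrative sketch only: attention logits are biased by a per-pixel
        # depth-confidence map so that unreliable pixels contribute little as
        # sources (keys) of information. Not the paper's exact UCTNet design.
        import torch
        import torch.nn as nn

        class UncertaintyGatedSelfAttention(nn.Module):
            def __init__(self, dim: int, heads: int = 4):
                super().__init__()
                self.heads = heads
                self.scale = (dim // heads) ** -0.5
                self.qkv = nn.Linear(dim, dim * 3, bias=False)
                self.proj = nn.Linear(dim, dim)

            def forward(self, x: torch.Tensor, confidence: torch.Tensor) -> torch.Tensor:
                # x: (B, N, C) depth-feature tokens; confidence: (B, N) in (0, 1],
                # low where the sensor reading is void or unreliable.
                B, N, C = x.shape
                q, k, v = self.qkv(x).chunk(3, dim=-1)
                q, k, v = (t.view(B, N, self.heads, -1).transpose(1, 2) for t in (q, k, v))
                attn = (q @ k.transpose(-2, -1)) * self.scale          # (B, H, N, N)
                # Log-confidence bias on the key axis: after softmax, the weight
                # of an unreliable pixel is scaled down by its confidence.
                attn = attn + torch.log(confidence.clamp(min=1e-6))[:, None, None, :]
                out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
                return self.proj(out)

    Setting confidence to 1 everywhere recovers standard self-attention, which is a useful sanity check for such a gate.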
  2.
    The task of instance segmentation in videos aims to consistently identify objects at the pixel level throughout an entire video sequence. Existing state-of-the-art methods either follow the tracking-by-detection paradigm, employing multi-stage pipelines, or directly train a complex deep model to process entire video clips as 3D volumes. However, these methods are typically slow and resource-consuming, and are therefore often limited to offline processing. In this paper, we propose SRNet, a simple and efficient framework for joint segmentation and tracking of object instances in videos. The key to achieving both high efficiency and accuracy in our framework is to formulate the instance segmentation and tracking problem as a unified spatial-relation learning task, where each pixel in the current frame relates to its object center, and each object center relates to its location in the previous frame. This unified formulation allows our framework to perform joint instance segmentation and tracking in a single stage while maintaining low overheads among the different learning tasks. Our proposed framework can handle two different task settings and demonstrates performance comparable with state-of-the-art methods on two different benchmarks while running significantly faster.
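    The spatial-relation formulation can be sketched as two nearest-center assignments: pixels vote for object centers within a frame, and centers vote for their previous-frame locations across frames. The tensor shapes and the greedy matching below are illustrative assumptions, not SRNet's actual decoding procedure.

        # Illustrative sketch of center-voting decoding; not SRNet's exact method.
        import torch

        def assign_pixels_to_centers(center_offsets: torch.Tensor,
                                     centers: torch.Tensor) -> torch.Tensor:
            # center_offsets: (2, H, W) per-pixel (dy, dx) vectors pointing to the
            # object center; centers: (K, 2) candidate center coordinates.
            _, H, W = center_offsets.shape
            ys = torch.arange(H).view(H, 1).expand(H, W)
            xs = torch.arange(W).view(1, W).expand(H, W)
            coords = torch.stack([ys, xs]).float()                 # (2, H, W)
            voted = coords + center_offsets                        # where each pixel points
            # Distance from every pixel's vote to every candidate center: (K, H, W)
            dist = (voted.unsqueeze(0) - centers.view(-1, 2, 1, 1)).norm(dim=1)
            return dist.argmin(dim=0)                              # (H, W) instance ids

        def match_centers_to_previous(centers: torch.Tensor,
                                      track_offsets: torch.Tensor,
                                      prev_centers: torch.Tensor) -> torch.Tensor:
            # Each center plus its predicted temporal offset should land near its
            # previous-frame location; match to the nearest previous center.
            predicted_prev = centers + track_offsets               # (K, 2)
            return torch.cdist(predicted_prev, prev_centers).argmin(dim=1)

    Both steps reuse the same nearest-neighbor primitive, which is what lets segmentation and tracking share one learning task.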
  3.
    Training a semantic segmentation model requires large densely-annotated image datasets that are costly to obtain. Once the training is done, it is also difficult to add new object categories to such segmentation models. In this paper, we tackle the few-shot semantic segmentation problem, which aims to perform image segmentation on unseen object categories based merely on one or a few support example(s). The key to solving this few-shot segmentation problem lies in effectively utilizing object information from support examples to separate target objects from the background in a query image. While existing methods typically generate object-level representations by averaging local features in support images, we demonstrate that such object representations are typically noisy and less distinguishing. To solve this problem, we design an object representation generator (ORG) module which can effectively aggregate local object features from support image(s) and produce a better object-level representation. The ORG module can be embedded into the network and trained end-to-end in a weakly-supervised fashion without extra human annotation. We incorporate this design into a modified encoder-decoder network to present a powerful and efficient framework for few-shot semantic segmentation. Experimental results on the Pascal-VOC and MS-COCO datasets show that our approach achieves better performance compared to existing methods under both one-shot and five-shot settings.
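    The baseline the abstract argues against, masked average pooling of support features, fits in a few lines; a learned weighting over object pixels is one plausible direction for better aggregation. The weighting head below is a guess for illustration, not the actual ORG module.

        # Baseline masked-average prototype vs. an assumed learned-weight variant.
        import torch
        import torch.nn as nn

        def masked_average_prototype(feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
            # feat: (B, C, H, W) support features; mask: (B, 1, H, W) object mask
            # resized to feature resolution. Returns a (B, C) object prototype.
            return (feat * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp(min=1e-6)

        class WeightedPrototype(nn.Module):
            # Hypothetical variant: per-location scores replace the uniform average,
            # so noisy or background-leaking locations contribute less.
            def __init__(self, channels: int):
                super().__init__()
                self.score = nn.Conv2d(channels, 1, kernel_size=1)

            def forward(self, feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
                w = self.score(feat)                                # (B, 1, H, W)
                w = w.masked_fill(mask < 0.5, float('-inf'))        # keep object pixels only
                # assumes the mask contains at least one foreground location
                w = torch.softmax(w.flatten(2), dim=-1).view_as(w)
                return (feat * w).sum(dim=(2, 3))                   # (B, C) prototype

    The learned variant needs no extra annotation, since the scoring head is trained end-to-end from the segmentation loss, consistent with the weakly-supervised setup the abstract describes.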
  4. In recent years, robotic technologies, e.g., drones or autonomous cars, have been applied to the agricultural sector to improve the efficiency of typical agricultural operations. Some agricultural tasks that are ideal for robotic automation are yield estimation and robotic harvesting. For these applications, an accurate and reliable image-based detection system is critically important. In this work, we present a low-cost strawberry detection system based on convolutional neural networks. Ablation studies are presented to validate the choice of hyperparameters, framework, and network structure. Additional modifications to both the training data and the network structure that improve precision and execution speed, e.g., input compression, image tiling, color masking, and network compression, are discussed. Finally, we present a final network implementation on a Raspberry Pi 3B that demonstrates a detection speed of 1.63 frames per second and an average precision of 0.842.
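    Of the speed and precision modifications listed, image tiling is easy to illustrate: split a large frame into overlapping tiles so a compact detector sees berries at a usable scale, then shift per-tile boxes back into frame coordinates. The detector interface and tile sizes below are placeholder assumptions.

        # Illustrative tiling helper; the detector callable is a placeholder.
        import numpy as np

        def tile_image(frame: np.ndarray, tile: int = 320, overlap: int = 64):
            # Yield (crop, x0, y0) tiles covering the frame, with overlap so that
            # fruit on a tile boundary appears whole in at least one tile.
            h, w = frame.shape[:2]
            step = tile - overlap
            for y0 in range(0, max(h - overlap, 1), step):
                for x0 in range(0, max(w - overlap, 1), step):
                    yield frame[y0:y0 + tile, x0:x0 + tile], x0, y0

        def detect_tiled(frame, detector, tile=320, overlap=64):
            # detector(crop) -> [(x, y, w, h, score), ...] in crop coordinates.
            boxes = []
            for crop, x0, y0 in tile_image(frame, tile, overlap):
                for (x, y, bw, bh, score) in detector(crop):
                    boxes.append((x + x0, y + y0, bw, bh, score))
            return boxes  # cross-tile non-maximum suppression would follow in practice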
  5. The rapid pace of urbanization and socioeconomic development encourages people to spend more time together, which makes the monitoring of human dynamics increasingly important, especially for elder-care facilities and venues hosting multiple activities. Traditional approaches are limited by their high deployment costs and privacy concerns (e.g., camera-based surveillance or sensor-attachment-based solutions). In this work, we propose to provide a fine-grained, comprehensive view of human dynamics using the existing WiFi infrastructure already available in many indoor venues. Our approach is low-cost and device-free, and does not require any active human participation. Our system aims to provide smart human dynamics monitoring through participant number estimation, human density estimation, and walking speed and direction derivation. A semi-supervised learning approach leveraging a non-linear regression model is developed to significantly reduce training efforts and accommodate different monitoring environments. We further derive participant number and density estimation based on the statistical distribution of Channel State Information (CSI) measurements. In addition, people's walking speed and direction are estimated using a frequency-based mechanism. Extensive experiments over 12 months demonstrate that our system can perform fine-grained, effective human dynamics monitoring with over 90% accuracy in estimating participant number, density, and walking speed and direction in various indoor environments.
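    The statistical-distribution idea can be sketched as windowed features over CSI amplitudes feeding a non-linear regressor. The specific features and the regressor below are assumptions for illustration; the paper's semi-supervised model is not reproduced here.

        # Illustrative CSI feature extraction; not the paper's actual pipeline.
        import numpy as np

        def csi_window_features(csi_amp: np.ndarray) -> np.ndarray:
            # csi_amp: (T, S) amplitudes over T packets and S subcarriers within
            # one time window. More (and faster-moving) people perturb more subcarriers.
            x = csi_amp - csi_amp.mean(axis=0, keepdims=True)    # remove the static path
            per_sub_var = x.var(axis=0)                          # motion energy per subcarrier
            hist, _ = np.histogram(x, bins=32)
            p = hist[hist > 0] / hist.sum()
            entropy = -np.sum(p * np.log(p))                     # spread of the amplitude distribution
            return np.array([per_sub_var.mean(), per_sub_var.std(), entropy])

        # With a handful of labeled windows, any non-linear regressor can map these
        # features to a participant count, e.g. (hypothetical usage):
        #   from sklearn.ensemble import GradientBoostingRegressor
        #   X = np.stack([csi_window_features(w) for w in windows])
        #   model = GradientBoostingRegressor().fit(X, counts)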