Multi-scale Hierarchical Vision Transformer with Cascaded Attention Decoding for Medical Image Segmentation

M. M. Rahman, R. Marculescu

Citation Details

Transformers have shown great success in medical image segmentation. However, transformers may exhibit a limited generalization ability due to the underlying single-scale selfattention (SA) mechanism. In this paper, we address this issue by introducing a Multiscale hiERarchical vIsion Transformer (MERIT) backbone network, which improves the generalizability of the model by computing SA at multiple scales. We also incorporate an attention-based decoder, namely Cascaded Attention Decoding (CASCADE), for further refinement of the multi-stage features generated by MERIT. Finally, we introduce an effective multi-stage feature mixing loss aggregation (MUTATION) method for better model training via implicit ensembling. Our experiments on two widely used medical image segmentation benchmarks (i.e., Synapse Multi-organ and ACDC) demonstrate the superior performance of MERIT over state-of-the-art methods. Our MERIT architecture and MUTATION loss aggregation can be used with other downstream medical image and semantic segmentation tasks. more »

Award ID(s):: 2007284

PAR ID:: 10468128

Author(s) / Creator(s):: M. M. Rahman, R. Marculescu

Publisher / Repository:: Medical Imaging with Deep Learning (MIDL) conference

Date Published:: 2023-07-01

Subject(s) / Keyword(s):: Medical image segmentation, Vision transformer, Multi-scale transformer, Feature-mixing augmentation, Self-attention

Format(s):: Medium: X

Sponsoring Org:: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript1.0
Conference Paper:
The DOI is not currently available.

More Like this