Non-autoregressive automatic speech recognition (NASR) models have gained attention due to their parallelism and fast inference. Encoder-based NASR, e.g., connectionist temporal classification (CTC), can be initialized from speech foundation models (SFM) but does not account for any dependencies among intermediate tokens. Encoder-decoder-based NASR, such as the CTC alignment-based single-step non-autoregressive transformer (CASS-NAT), can mitigate the dependency problem but cannot efficiently integrate an SFM. Inspired by the success of recent work on speech-text joint pre-training with a shared transformer encoder, we propose a new encoder-based NASR, UniEnc-CASSNAT, to combine the advantages of CTC and CASS-NAT. UniEnc-CASSNAT consists of only an encoder as the major module, which can be the SFM. The encoder plays the role of both the CASS-NAT encoder and decoder through two forward passes. The first pass of the encoder accepts the speech signal as input, while the concatenation of the speech signal and the token-level acoustic embedding is used as the input for the second pass. Examined on the Librispeech 100 h, MyST, and Aishell1 datasets, the proposed UniEnc-CASSNAT achieves state-of-the-art NASR results and performs better than or comparably to CASS-NAT while using only an encoder and, hence, fewer model parameters.
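To make the CTC limitation mentioned above concrete, the following is a minimal sketch of standard greedy CTC decoding, not the UniEnc-CASSNAT method itself. The blank index, the function name `ctc_greedy_decode`, and the toy posteriors are assumptions chosen for illustration. Each frame is decided independently, which is why no output token can condition on another.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores):
    """Pick the best token per frame independently, then collapse the path."""
    # Every frame is decided on its own: no token depends on any other token.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    # Merge consecutive repeats, then drop blanks.
    collapsed = [token for token, _ in groupby(best_path)]
    return [token for token in collapsed if token != BLANK]


# Toy posteriors over {blank, 1, 2} for five frames (made-up numbers).
frames = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.1, 0.8],
    [0.1, 0.1, 0.8],
]
print(ctc_greedy_decode(frames))  # [1, 2]
```

The best path here is 1, 1, blank, 2, 2, which collapses to the token sequence 1, 2.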
Parallel Scale-wise Attention Network for Effective Scene Text Recognition
The paper proposes a new text recognition network for scene-text images. Many state-of-the-art methods employ the attention mechanism in either the text encoder or the decoder for text alignment. Although encoder-based attention yields promising results, these schemes have noticeable limitations. They perform feature extraction (FE) and visual attention (VA) sequentially, which restricts the attention mechanism to the final single-scale output of the FE. Moreover, the attention process is underused because it is applied directly only to the single-scale feature maps. To address these issues, we propose a new multi-scale, encoder-based attention network for text recognition that performs multi-scale FE and VA in parallel. The multi-scale channels are also regularly fused with each other to develop coordinated knowledge. Quantitative evaluation and robustness analysis on the standard benchmarks demonstrate that the proposed network outperforms the state-of-the-art in most cases.
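The contrast between sequential and parallel FE and VA can be sketched on a 1-D toy signal. This is a loose illustration under stated assumptions, not the network from the paper: average pooling stands in for a learned feature extractor, a softmax over activations stands in for visual attention, fusion is a plain mean applied once at the end, and all function names are hypothetical.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def downsample(signal, factor):
    """Average-pool a 1-D signal (stand-in for a learned feature extractor)."""
    return [
        sum(signal[i:i + factor]) / factor
        for i in range(0, len(signal) - factor + 1, factor)
    ]


def attend(features):
    """Weight each position by a softmax over its activation, then sum."""
    weights = softmax(features)
    return sum(w * f for w, f in zip(weights, features))


def multi_scale_attention(signal, factors=(1, 2, 4)):
    # Each scale gets its own features AND its own attention, instead of
    # attention being applied once to the final single-scale output.
    per_scale = [attend(downsample(signal, f)) for f in factors]
    # Fuse the scales; a plain mean is used here purely for illustration.
    return per_scale, sum(per_scale) / len(per_scale)


per_scale, fused = multi_scale_attention([0.0, 1.0, 4.0, 1.0, 0.0, 0.0, 2.0, 0.0])
print(per_scale, fused)
```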
- Award ID(s): 1955535
- PAR ID: 10285115
- Date Published:
- Journal Name: International Joint Conference on Neural Networks (IJCNN)
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Digital pathology has played a key role in replacing glass slides with digital images, enhancing various pathology workflows. Whole slide images are digitized pathological images that improve the capabilities of digital pathology and reduce the overall turnaround time for diagnoses. The digitized images have been successfully integrated with artificial intelligence algorithms that assist pathologists in many tasks, but there is still demand for new algorithms that improve the diagnostic process. In this paper, we propose a new deep convolutional neural network model that integrates a feature pyramid network with a self-attention mechanism in three pathways (encoder, decoder, and nested self-attention) to provide accurate tumor region segmentation on whole slide images. The encoder pathway adopts the ResNet50 architecture for the bottom-up network. The decoder pathway adopts the feature pyramid network for the top-down network. The nested self-attention pathway forms an attention map, represented by the distribution of attention scores, that focuses on localizing tumor regions and avoiding irrelevant information. The results of our experiment show that the proposed model outperforms state-of-the-art deep convolutional neural network models in terms of tumor and stromal region segmentation. Moreover, the proposed model was equipped with various encoder networks, which were compared with each other. The results indicate that the ResNet series using the proposed model outperforms other encoder networks.
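The bottom-up and top-down pathways described above follow the general feature pyramid pattern, which can be sketched on 1-D toy features. This is a simplified illustration rather than the proposed model: the lateral 1x1 convolutions, the smoothing convolutions, and the self-attention pathway are all omitted, and the function names are hypothetical.

```python
def upsample_nearest(values, factor=2):
    """Repeat each value `factor` times (nearest-neighbour upsampling)."""
    return [v for v in values for _ in range(factor)]


def top_down_merge(bottom_up_levels):
    """Merge encoder outputs from coarse to fine.

    `bottom_up_levels` lists features finest first; each level is assumed
    to be half the length of the previous one.
    """
    merged = [bottom_up_levels[-1]]  # start from the coarsest level
    for lateral in reversed(bottom_up_levels[:-1]):
        upsampled = upsample_nearest(merged[0])
        # Element-wise sum of the lateral feature and the upsampled coarser one.
        merged.insert(0, [l + u for l, u in zip(lateral, upsampled)])
    return merged


levels = [[1, 1, 1, 1], [2, 2], [3]]
print(top_down_merge(levels))  # [[6, 6, 6, 6], [5, 5], [3]]
```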
-
An efficient and effective decoding mechanism is crucial in medical image segmentation, especially in scenarios with limited computational resources. However, these decoding mechanisms usually come with high computational costs. To address this concern, we introduce EMCAD, a new efficient multi-scale convolutional attention decoder, designed to optimize both performance and computational efficiency. EMCAD leverages a unique multi-scale depth-wise convolution block, significantly enhancing feature maps through multi-scale convolutions. EMCAD also employs channel, spatial, and grouped (large-kernel) gated attention mechanisms, which are highly effective at capturing intricate spatial relationships while focusing on salient regions. By employing group and depth-wise convolution, EMCAD is very efficient and scales well (e.g., only 1.91M parameters and 0.381G FLOPs are needed when using a standard encoder). Our rigorous evaluations across 12 datasets that belong to six medical image segmentation tasks reveal that EMCAD achieves state-of-the-art (SOTA) performance with 79.4% and 80.3% reduction in #Params and #FLOPs, respectively. Moreover, EMCAD’s adaptability to different encoders and versatility across segmentation tasks further establish EMCAD as a promising tool, advancing the field towards more efficient and accurate medical image analysis. Our implementation is available at https://github.com/SLDGroup/EMCAD.
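The reason group and depth-wise convolutions save parameters can be seen from a short weight count. The sketch below uses the generic formula for a bias-free 2-D convolution. The channel count and kernel size are arbitrary illustrative values, not EMCAD's configuration, and the resulting ratio is unrelated to the 79.4% figure reported above.

```python
def conv_params(c_in, c_out, kernel, groups=1):
    """Weight count of a 2-D convolution without bias."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * kernel * kernel


channels, kernel = 64, 3  # illustrative values only

standard = conv_params(channels, channels, kernel)                    # 36864
depthwise = conv_params(channels, channels, kernel, groups=channels)  # 576
pointwise = conv_params(channels, channels, 1)                        # 4096
separable = depthwise + pointwise                                     # 4672

print(standard, separable, round(separable / standard, 3))
```

A depth-wise convolution sets `groups` equal to the channel count, so each filter sees a single input channel; the pointwise 1x1 convolution then mixes channels at a much lower cost than a full 3x3 convolution.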
-
In recent years, face recognition systems have achieved exceptional success due to promising advances in deep learning architectures. However, they still fail to achieve the expected accuracy when matching profile images against a gallery of frontal images. Current approaches either perform pose normalization (i.e., frontalization) or disentangle pose information for face recognition. We instead propose a new approach to utilize pose as auxiliary information via an attention mechanism. In this paper, we hypothesize that pose-attended information using an attention mechanism can guide contextual and distinctive feature extraction from profile faces, which further benefits representation learning in an embedded domain. To achieve this, first, we design a unified coupled profile-to-frontal face recognition network. It learns the mapping from faces to a compact embedding subspace via a class-specific contrastive loss. Second, we develop a novel pose attention block (PAB) to specifically guide the pose-agnostic feature extraction from profile faces. To be more specific, PAB is designed to explicitly help the network focus on important features along both “channel” and “spatial” dimensions while learning discriminative yet pose-invariant features in an embedding subspace. To validate the effectiveness of our proposed method, we conduct experiments on both controlled and in-the-wild benchmarks, including Multi-PIE, CFP, and IJB-C, and show superiority over the state-of-the-art.
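The abstract does not spell out its class-specific contrastive loss, so the sketch below shows only the generic margin-based pairwise contrastive loss as a stand-in. The function name, the margin value, and the toy embeddings are assumptions. Matching pairs (for example, a profile and a frontal image of the same identity) are pulled together, and non-matching pairs are pushed apart until they are at least `margin` away.

```python
import math


def contrastive_loss(embedding_a, embedding_b, same_identity, margin=1.0):
    """Generic pairwise contrastive loss (not the paper's exact formulation)."""
    distance = math.dist(embedding_a, embedding_b)
    if same_identity:
        # Matching pairs are penalised by their squared distance.
        return distance ** 2
    # Non-matching pairs are penalised only while closer than the margin.
    return max(0.0, margin - distance) ** 2


print(contrastive_loss([0.0, 0.0], [0.0, 0.5], same_identity=True))
print(contrastive_loss([0.0, 0.0], [0.0, 0.5], same_identity=False))
```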
-
We propose GourmetNet, a single-pass, end-to-end trainable network for food segmentation that achieves state-of-the-art performance. Food segmentation is an important problem as the first step for nutrition monitoring and for food volume and calorie estimation. Our novel architecture incorporates both channel attention and spatial attention information in an expanded multi-scale feature representation using our advanced Waterfall Atrous Spatial Pooling module. GourmetNet refines the feature extraction process by merging features from multiple levels of the backbone through the two attention modules. The refined features are processed with the advanced multi-scale waterfall module that combines the benefits of cascade filtering and pyramid representations without requiring a separate decoder or post-processing. Our experiments on two food datasets show that GourmetNet significantly outperforms existing state-of-the-art methods.
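The atrous (dilated) convolution that underlies atrous spatial pooling can be sketched in 1-D. This is a generic illustration with hypothetical function names. It does not reproduce the waterfall module or the attention modules described above.

```python
def effective_kernel(kernel, dilation):
    """Span covered by a kernel once gaps of size (dilation - 1) are inserted."""
    return kernel + (kernel - 1) * (dilation - 1)


def dilated_conv1d(signal, weights, dilation):
    """Valid (no padding) 1-D correlation with a dilated kernel."""
    span = effective_kernel(len(weights), dilation)
    return [
        sum(w * signal[start + j * dilation] for j, w in enumerate(weights))
        for start in range(len(signal) - span + 1)
    ]


# The same three weights cover a wider context as the dilation grows.
for rate in (1, 2, 4):
    print(rate, effective_kernel(3, rate))  # spans 3, 5, 9

print(dilated_conv1d([1, 2, 3, 4, 5], [1, 0, -1], dilation=2))  # [-4]
```

Increasing the dilation rate enlarges the field of view without adding weights, which is what lets a pooling module sample several scales from the same feature map.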

