On Separate Normalization in Self-supervised Transformers

Chen, X; Wang, Y; Du, Y; Hassoun, S; Liu, L

Citation Details

Self-supervised training methods for transformers have demonstrated remarkable performance across various domains. Previous transformer-based models, such as masked autoencoders (MAE), typically use a single normalization layer for both the [CLS] token and the normal tokens. In this paper we propose a simple new normalization method that normalizes the embedding vectors of the normal tokens and of the [CLS] token separately, in order to better capture their distinct characteristics and improve downstream task performance. Our empirical study shows that the [CLS] embeddings learned with our separate normalization layer better encode global contextual information and are distributed more uniformly in their anisotropic space. When the conventional normalization layer is replaced with a separate normalization layer, we observe an average 2.7% performance improvement in learning tasks from the image, natural language, and graph domains.
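The idea of normalizing the [CLS] embedding and the normal token embeddings with separate layers can be illustrated with a minimal sketch. The abstract does not say which normalization each branch uses, so the code below assumes plain layer normalization for both, with independent scale and shift parameters; the names `SeparateNorm` and `layer_norm`, and the convention that [CLS] sits at position 0, are hypothetical choices made for illustration and are not taken from the paper.

```python
import math


def layer_norm(vec, gain, bias, eps=1e-5):
    """Normalize one embedding to zero mean and unit variance, then scale and shift."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (x - mean) * inv_std + b for x, g, b in zip(vec, gain, bias)]


class SeparateNorm:
    """Hypothetical sketch: one parameter set for [CLS], another for normal tokens."""

    def __init__(self, dim):
        # Two independent sets of learnable parameters (assumption: both
        # branches use layer normalization; the abstract does not specify).
        self.cls_gain = [1.0] * dim
        self.cls_bias = [0.0] * dim
        self.tok_gain = [1.0] * dim
        self.tok_bias = [0.0] * dim

    def __call__(self, sequence):
        # Assumption: position 0 holds the [CLS] embedding.
        cls_vec, token_vecs = sequence[0], sequence[1:]
        out = [layer_norm(cls_vec, self.cls_gain, self.cls_bias)]
        out += [layer_norm(v, self.tok_gain, self.tok_bias) for v in token_vecs]
        return out


norm = SeparateNorm(dim=4)
# Pretend training has moved the [CLS] scale away from the token scale.
norm.cls_gain = [2.0] * 4
# A [CLS] vector and one normal token with identical input values.
sequence = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
normalized = norm(sequence)
```

Because the two branches hold their own parameters, identical inputs at the [CLS] position and at a token position can map to different outputs, which a single shared normalization layer cannot do.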

Award ID(s): 1909536

PAR ID: 10575560

Author(s) / Creator(s): Chen, X; Wang, Y; Du, Y; Hassoun, S; Liu, L

Publisher / Repository: Curran Associates

Date Published: 2023-12-12

ISBN: 9781713899921

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
The DOI is not currently available.
