Knowledge distillation aims to reduce model size without substantially compromising performance. Recent work has applied it to large vision-language (VL) Transformers and has shown that the attention maps in the multi-head attention modules of vision-language Transformers contain extensive intra-modal and cross-modal co-reference relations worth distilling. The standard approach is a one-to-one attention map distillation loss, i.e., the Teacher's first attention head instructs the Student's first head, the second teaches the second, and so forth, but this only works when the Teacher and Student have the same number of attention heads. To remove this constraint, we propose a new Attention Map Alignment Distillation (AMAD) method for Transformers with multi-head attention, which works for a Teacher and a Student with different numbers of attention heads. Specifically, we soft-align the heads in the Teacher and Student attention maps using cosine-similarity weighting: each Teacher head contributes more to the Student heads with which it is more similar. Each Teacher head contributes to all the Student heads by minimizing the divergence between the attention activation distributions of the soft-aligned heads. No head is left behind. This distillation approach operates like cross-attention. We experiment with distilling VL-T5 and BLIP, applying the AMAD loss to their T5, BERT, and ViT sub-modules. We show, in the vision-language setting, that AMAD outperforms conventional distillation methods on the VQA-2.0, COCO captioning, and Multi30K translation datasets. We further show that even without VL pre-training, the distilled VL-T5 models outperform corresponding VL pre-trained VL-T5 models that are further fine-tuned with ground-truth signals, and that fine-tuning distillation can also compensate to some degree for the absence of VL pre-training in BLIP models.
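To make the head-alignment mechanism concrete, below is a minimal PyTorch sketch of how an AMAD-style loss could be computed from the description above. The function name amad_loss, the softmax over cosine similarities, and the use of a KL divergence between attention rows are our assumptions for illustration; the paper's exact weighting and divergence may differ. The key point is that every Teacher head supervises every Student head with a soft weight, which removes the requirement that the two models share a head count.

```python
import torch
import torch.nn.functional as F

def amad_loss(teacher_attn: torch.Tensor,
              student_attn: torch.Tensor,
              eps: float = 1e-8) -> torch.Tensor:
    """Sketch of a soft-aligned attention-map distillation (AMAD-style) loss.

    Assumed inputs (not taken from the paper's code):
      teacher_attn: (B, H_t, L, L) softmax attention maps from the Teacher.
      student_attn: (B, H_s, L, L) softmax attention maps from the Student.
    H_t and H_s may differ; each Teacher head supervises all Student heads,
    weighted by the cosine similarity between their attention maps.
    """
    B, H_t, L, _ = teacher_attn.shape
    H_s = student_attn.shape[1]

    # Flatten each head's attention map into a vector for similarity scoring.
    t_flat = F.normalize(teacher_attn.reshape(B, H_t, -1), dim=-1)  # (B, H_t, L*L)
    s_flat = F.normalize(student_attn.reshape(B, H_s, -1), dim=-1)  # (B, H_s, L*L)

    # Cosine similarity between every Teacher/Student head pair.
    sim = torch.einsum('btd,bsd->bts', t_flat, s_flat)              # (B, H_t, H_s)

    # Soft alignment weights: each Teacher head spreads its supervision over
    # all Student heads, favouring the more similar ones (cross-attention-like).
    align = F.softmax(sim, dim=-1)                                  # (B, H_t, H_s)

    # Pairwise KL divergence between Teacher and Student attention rows.
    t = teacher_attn.unsqueeze(2)                                   # (B, H_t, 1, L, L)
    s = student_attn.unsqueeze(1)                                   # (B, 1, H_s, L, L)
    kl = (t * (torch.log(t + eps) - torch.log(s + eps))).sum(-1)    # (B, H_t, H_s, L)
    kl = kl.mean(-1)                                                # avg over query positions

    # Weight each pair's divergence by its alignment weight, so no Teacher
    # head is left behind even when H_t != H_s.
    return (align * kl).sum(dim=(-1, -2)).mean()
```

In this sketch, a one-to-one distillation loss would correspond to replacing the soft alignment weights with an identity matrix, which is only well defined when H_t equals H_s.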

