Compression with Attention: Learning in Lower Dimensions

Singh, Gaurav; Bazargan, Kia

With deep learning models ever ballooning in size to push state-ofthe- art accuracy improvements, efforts to find compact models have become necessary. To meet such an objective, we propose a novel operation called Personal Self-Attention (PSA). It is designed specifically to learn non-linear 1-D functions faster than existing architectures like Multi-Layer Perceptron (MLP) and Polynomial-based methods, while being highly compatible with gradient backpropagation. We show that by stacking and combining these non-linear functions with linear transformations, we can achieve the same accuracy as a larger model but with a hidden dimension that is significantly smaller. To test our contribution, we implemented PSA on an MLP-based vision model called ResMLP and tested it against vision classification tasks on SVHN, and CIFAR-10 datasets. We show how PSA pushes the pareto-front, achieving the same accuracy with 2 − 6× smaller hidden-dimension sizes compared to the conventional MLP structures. Further, by quantizing our non-linear function, the PSA can be mapped to a simple lookup table, allowing for very efficient translation to FPGA hardware. We demonstrate this by designing an unrolled high-throughput accelerator for ResMLP using nearly 1.5× fewer DSPs with PSA compared to a conventional MLP architecture while achieving the same accuracy of 86% and throughput of 29k FPS.

More Like this