Transformers are expensive to train due to the quadratic time and space complexity
in the self-attention mechanism. On the other hand, although kernel machines
suffer from the same computation bottleneck in pairwise dot products, several
approximation schemes have been successfully incorporated to considerably reduce
their computational cost without sacrificing too much accuracy. In this work,
we leverage the computational methods developed for kernel machines to alleviate the high
computational cost and introduce Skyformer, which replaces the softmax structure
with a Gaussian kernel to stabilize the model training and adapts the Nyström
method to a non-positive semidefinite matrix to accelerate the computation. We
further conduct theoretical analysis by showing that the matrix approximation
error of our proposed method is small in the spectral norm. Experiments on the Long
Range Arena benchmark show that the proposed method achieves comparable or even
better performance than full self-attention while requiring fewer computational
resources.
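
To make the two ingredients above concrete, here is a minimal NumPy sketch: attention scores computed with a Gaussian kernel instead of the softmax of scaled dot products, and a Nyström-style low-rank factorization that avoids forming the full n × n score matrix. This is an illustrative sketch, not the Skyformer implementation; the function names, the number of landmarks, uniform landmark sampling, the particular bandwidth choice, and the omission of score normalization and of the paper's further corrections are all assumptions made for brevity.

```python
# Illustrative sketch only (not the authors' code): Gaussian-kernel attention
# scores combined with a Nystrom-style low-rank approximation.
import numpy as np


def gaussian_kernel(X, Y, scale):
    # exp(-||x - y||^2 / (2 * scale)) for every pair of rows of X and Y.
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-sq_dists / (2.0 * scale))


def nystrom_kernel_attention(Q, K, V, num_landmarks=32, seed=0):
    """Approximate kernel(Q, K) @ V without forming the full n x n score matrix."""
    n, d = Q.shape
    scale = np.sqrt(d)  # bandwidth playing the role of the 1/sqrt(d) temperature
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=min(num_landmarks, n), replace=False)
    Q_l, K_l = Q[idx], K[idx]  # landmark queries and keys (uniform sampling is an assumption)

    C1 = gaussian_kernel(Q, K_l, scale)    # n x m
    W = gaussian_kernel(Q_l, K_l, scale)   # m x m; not PSD in general since Q_l != K_l
    C2 = gaussian_kernel(Q_l, K, scale)    # m x n

    # Nystrom-style factorization: kernel(Q, K) ~= C1 @ pinv(W) @ C2.
    # Multiplying right-to-left keeps the cost at O(n * m * d) rather than O(n^2 * d).
    return C1 @ (np.linalg.pinv(W) @ (C2 @ V))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Q, K, V = rng.normal(size=(3, 512, 64))
    out = nystrom_kernel_attention(Q, K, V)
    print(out.shape)  # (512, 64)
```

The right-to-left multiplication is the point of the factorization: with m landmarks the memory and time scale linearly in the sequence length n, which is what makes the approximation attractive for long inputs.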