Title: Faster Walsh-Hadamard and Discrete Fourier Transforms from Matrix Non-rigidity
We give algorithms with lower arithmetic operation counts for both the Walsh-Hadamard Transform (WHT) and the Discrete Fourier Transform (DFT) on inputs of power-of-2 size N. For the WHT, our new algorithm has an operation count of (23/24) N log N + O(N). To our knowledge, this gives the first improvement on the N log N operation count of the simple, folklore Fast Walsh-Hadamard Transform algorithm. For the DFT, our new FFT algorithm uses (15/4) N log N + O(N) real arithmetic operations. Our leading constant 15/4 = 3.75 improves on the leading constant of 5 from the Cooley-Tukey algorithm from 1965, leading constant 4 from the split-radix algorithm of Yavne from 1968, leading constant 34/9 = 3.7777 from a modification of the split-radix algorithm by Van Buskirk from 2004, and leading constant 3.76875 from a theoretically optimized version of Van Buskirk’s algorithm by Sergeev from 2017. Our new WHT algorithm takes advantage of a recent line of work on the non-rigidity of the WHT: we decompose the WHT matrix as the sum of a low-rank matrix and a sparse matrix, and then analyze the structures of these matrices to achieve a lower operation count. Our new DFT algorithm comes from a novel reduction, showing that parts of the previous best FFT algorithms can be replaced by calls to an algorithm for the WHT. Replacing the folklore WHT algorithm with our new improved algorithm leads to our improved FFT.
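For orientation, the folklore baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of the standard butterfly-based Fast Walsh-Hadamard Transform, not the improved algorithm from the paper; the function name `fwht` and the choice to return an operation counter are assumptions made for this example, and the logarithm is taken base 2.

```python
def fwht(x):
    """Unnormalized Walsh-Hadamard transform via the folklore butterfly scheme.

    Returns (transformed_list, ops), where ops counts the additions and
    subtractions performed. Illustrative baseline only: this is NOT the
    (23/24) N log N algorithm described in the abstract.
    """
    a = list(x)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of 2")
    ops = 0
    h = 1
    while h < n:  # log2(n) stages
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                u, v = a[i], a[i + h]
                a[i], a[i + h] = u + v, u - v  # one butterfly: 2 operations
                ops += 2
        h *= 2
    return a, ops


if __name__ == "__main__":
    # For N = 8 the count is N * log2(N) = 24 additions/subtractions.
    print(fwht([1, 0, 1, 0, 0, 1, 1, 0]))  # ([4, 2, 0, -2, 0, 2, 0, 2], 24)
```

Each of the log N stages performs N/2 butterflies of two operations each, which is where the N log N count of the folklore algorithm comes from.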
Award ID(s):
2238221
PAR ID:
10488392
Author(s) / Creator(s):
;
Publisher / Repository:
ACM
Date Published:
Journal Name:
STOC 2023: Proceedings of the 55th Annual ACM Symposium on Theory of Computing
ISBN:
9781450399135
Page Range / eLocation ID:
455 to 462
Format(s):
Medium: X
Location:
Orlando FL USA
Sponsoring Org:
National Science Foundation
More Like this
  1. Extending computational harmonic analysis tools from the classical setting of regular lattices to the more general setting of graphs and networks is very important, and much research has been done recently. Our previous Generalized Haar-Walsh Transform (GHWT) is a multiscale transform for signals on graphs, which is a generalization of the classical Haar and Walsh-Hadamard Transforms. This article proposes the extended Generalized Haar-Walsh Transform (eGHWT). The eGHWT and its associated best-basis selection algorithm for graph signals significantly improve on the performance of the previous GHWT at a similar computational cost, O(N log N), where N is the number of nodes of an input graph. While the previous GHWT/best-basis algorithm seeks the most suitable orthonormal basis for a given task among more than 1.5^N possible bases, the eGHWT/best-basis algorithm can find a better one by searching through more than 0.618 ⋅ (1.84)^N possible bases. This article describes the details of the eGHWT/best-basis algorithm and demonstrates its superiority using several examples including genuine graph signals as well as conventional digital images viewed as graph signals. Keywords: Multiscale basis dictionaries, wavelets on graphs, graph signal processing, adapted time-frequency analysis, the best-basis algorithm
  2. Fast linear transforms are ubiquitous in machine learning, including the discrete Fourier transform, discrete cosine transform, and other structured transformations such as convolutions. All of these transforms can be represented by dense matrix-vector multiplication, yet each has a specialized and highly efficient (subquadratic) algorithm. We ask to what extent hand-crafting these algorithms and implementations is necessary, what structural prior they encode, and how much knowledge is required to automatically learn a fast algorithm for a provided structured transform. Motivated by a characterization of fast matrix-vector multiplication as products of sparse matrices, we introduce a parameterization of divide-and-conquer methods that is capable of representing a large class of transforms. This generic formulation can automatically learn an efficient algorithm for many important transforms; for example, it recovers the O(N logN) Cooley-Tukey FFT algorithm to machine precision, for dimensions N up to 1024. Furthermore, our method can be incorporated as a lightweight replacement of generic matrices in machine learning pipelines to learn efficient and compressible transformations. On a standard task of compressing a single hidden-layer network, our method exceeds the classification accuracy of unconstrained matrices on CIFAR-10 by 3.9 points—the first time a structured approach has done so—with 4X faster inference speed and 40X fewer parameters. 
  3. Polynomial approximations for e^{-x} and e^{x} have applications to the design of algorithms for many problems, and our degree bounds show both the power and limitations of these algorithms. We focus in particular on the Batch Gaussian Kernel Density Estimation problem for n sample points in Θ(log n) dimensions with error δ = n^{-Θ(1)}. We show that the running time one can achieve depends on the square of the diameter of the point set, B, with a transition at B = Θ(log n) mirroring the corresponding transition in d_{B;δ}(e^{-x}): (1) When B = o(log n), we give the first algorithm running in time n^{1+o(1)}. (2) When B = κ log n for a small constant κ > 0, we give an algorithm running in time n^{1+O(log log κ^{-1} / log κ^{-1})}. The log log κ^{-1} / log κ^{-1} term in the exponent comes from analyzing the behavior of the leading constant in our computation of d_{B;δ}(e^{-x}). (3) When B = ω(log n), we show that time n^{2-o(1)} is necessary assuming SETH.
  4. Extending computational harmonic analysis tools from the classical setting of regular lattices to the more general setting of graphs and networks is very important, and much research has been done recently. The generalized Haar–Walsh transform (GHWT) developed by Irion and Saito (2014) is a multiscale transform for signals on graphs, which is a generalization of the classical Haar and Walsh–Hadamard transforms. We propose the extended generalized Haar–Walsh transform (eGHWT), which is a generalization of the adapted time–frequency tilings of Thiele and Villemoes (1996). The eGHWT examines not only the efficiency of graph-domain partitions but also that of “sequency-domain” partitions simultaneously. Consequently, the eGHWT and its associated best-basis selection algorithm for graph signals significantly improve the performance of the previous GHWT with a similar computational cost, O(N log N), where N is the number of nodes of an input graph. While the GHWT best-basis algorithm seeks the most suitable orthonormal basis for a given task among more than (1.5)^N possible orthonormal bases in ℝ^N, the eGHWT best-basis algorithm can find a better one by searching through more than 0.618 · (1.84)^N possible orthonormal bases in ℝ^N. This article describes the details of the eGHWT best-basis algorithm and demonstrates its superiority using several examples including genuine graph signals as well as conventional digital images viewed as graph signals. Furthermore, we also show how the eGHWT can be extended to 2D signals and matrix-form data by viewing them as a tensor product of graphs generated from their columns and rows and demonstrate its effectiveness on applications such as image approximation.
  5. The binary-forking model is a parallel computation model, formally defined by Blelloch et al., in which a thread can fork a concurrent child thread, recursively and asynchronously. The model incurs a cost of Θ(log n) to spawn or synchronize n tasks or threads. The binary-forking model realistically captures the performance of parallel algorithms implemented using modern multithreaded programming languages on multicore shared-memory machines. In contrast, the widely studied theoretical PRAM model does not consider the cost of spawning and synchronizing threads, and as a result, algorithms achieving optimal performance bounds in the PRAM model may not be optimal in the binary-forking model. Often, algorithms need to be redesigned to achieve optimal performance bounds in the binary-forking model, and the non-constant synchronization cost makes the task challenging. In this paper, we show that in the binary-forking model we can achieve optimal or near-optimal span with negligible or no asymptotic blowup in work for comparison-based sorting, Strassen's matrix multiplication (MM), and the Fast Fourier Transform (FFT). Our major results are as follows: (1) A randomized comparison-based sorting algorithm with optimal O(log n) span and O(n log n) work, both w.h.p. in n. (2) An optimal O(log n) span algorithm for Strassen's matrix multiplication (MM) with only a log log n-factor blow-up in work, as well as a near-optimal O(log n log log log n) span algorithm with no asymptotic blow-up in work. (3) A near-optimal O(log n log log log n) span Fast Fourier Transform (FFT) algorithm with less than a log n-factor blow-up in work for all practical values of n (i.e., n ≤ 10^10,000).