skip to main content

Title: SMASH: Structured matrix approximation by separation and hierarchy

This paper presents an efficient method to perform structured matrix approximation by separation and hierarchy (SMASH), when the original dense matrix is associated with a kernel function. Given the points in a domain, a tree structure is first constructed based on an adaptive partition of the computational domain to facilitate subsequent approximation procedures. In contrast to existing schemes based on either analytic or purely algebraic approximations, SMASH takes advantage of both approaches and greatly improves efficiency. The algorithm follows a bottom‐up traversal of the tree and is able to perform the operations associated with each node on the same level in parallel. A strong rank‐revealing factorization is applied to the initial analytic approximation in theseparationregime so that a special structure is incorporated into the final nested bases. As a consequence, the storage is significantly reduced on one hand and a hierarchy of the original grid is constructed on the other hand. Due to this hierarchy, nested bases at upper levels can be computed in a similar way as the leaf level operations but on coarser grids. The main advantages of SMASH include its simplicity of implementation, its flexibility to construct various hierarchical rank structures, and a low storage cost. The efficiency and robustness of SMASH are demonstrated through various test problems arising from integral equations, structured matrices, etc.

more » « less
Author(s) / Creator(s):
 ;  ;  ;  ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Numerical Linear Algebra with Applications
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Large systematic revisionary projects incorporating data for hundreds or thousands of taxa require an integrative approach, with a strong biodiversity-informatics core for efficient data management to facilitate research on the group. Our original biodiversity informatics platform, 3i (Internet-accessible Interactive Identification) combined a customized MS Access database backend with ASP-based web interfaces to support revisionary syntheses of several large genera of leafhopers (Hemiptera: Auchenorrhyncha: Cicadellidae). More recently, for our National Science Foundation sponsored project, “GoLife: Collaborative Research: Integrative genealogy, ecology and phenomics of deltocephaline leafhoppers (Hemiptera: Cicadellidae), and their microbial associates”, we selected the new open-source platform TaxonWorks as the cyberinfrastructure. In the scope of the project, the original “3i World Auchenorrhyncha Database” was imported into TaxonWorks. At the present time, TaxonWorks has many tools to automatically import nomenclature, citations, and specimen based collection data. At the time of the initial migration of the 3i database, many of those tools were still under development, and complexity of the data in the database required a custom migration script, which is still probably the most efficient solution for importing datasets with long development history. At the moment, the World Auchenorrhyncha Database comprehensively covers nomenclature of the group and includes data on 70 valid families, 6,816 valid genera, 47,064 valid species as well as synonymy and subsequent combinations (Fig. 1). In addition, many taxon records include the original citation, bibliography, type information, etymology, etc. The bibliography of the group includes 37,579 sources, about 1/3 of which are associated with PDF files. Species have distribution records, either derived from individual specimens or as country and state level asserted distribution, as well as biological associations indicating host plants, predators, and parasitoids. Observation matrices in TaxonWorks are designed to handle morphological data associated with taxa or specimens. The matrices may be used to automatically generate interactive identification keys and taxon descriptions. They can also be downloaded to be imported, for example, into Lucid builder, or to perform phylogenetic analysis using an external application. At the moment there are 36 matrices associated with the project. The observation matrix from GoLife project covers 798 taxa by 210 descriptors (most of which are qualitative multi-state morphological descriptors) (Fig. 2). Illustrations are provided for 9,886 taxa and organized in the specialized image matrix and could be used as a pictorial key for determination of species and taxa of a higher rank. For the phylogenetic analysis, a dataset was constructed for 730 terminal taxa and >160,000 nucleotide positions obtained using anchored hybrid enrichment of genomic DNA for a sample of leafhoppers from the subfamily Deltocephalinae and outgroups. The probe kit targets leafhopper genes, as well as some bacterial genes (endosymbionts and plant pathogens transmitted by leafhoppers). The maximum likelihood analyses of concatenated nucleotide and amino acid sequences as well as coalescent gene tree analysis yielded well-resolved phylogenetic trees (Cao et al. 2022). Raw sequence data have been uploaded to the Sequence Read Archive on GenBank. Occurrence and morphological data, as well as diagnostic images, for voucher specimens have been incorporated into TaxonWorks. Data in TaxonWorks could be exported in raw format, get accessed via Application Programming Interface (API), or be shared with external data aggregators like Catalogue of Life, GBIF, iDigBio. 
    more » « less
  2. Matrices are exceptionally useful in various fields of study as they provide a convenient framework to organize and manipulate data in a structured manner. However, modern matrices can involve billions of elements, making their storage and processing quite demanding in terms of computational resources and memory usage. Although prohibitively large, such matrices are often approximately low rank. We propose an algorithm that exploits this structure to obtain a low rank decomposition of any matrix A as A≈LR, where L and R are the low rank factors. The total number of elements in L and R can be significantly less than that in A. Furthermore, the entries of L and R are quantized to low precision formats −− compressing A by giving us a low rank and low precision factorization. Our algorithm first computes an approximate basis of the range space of A by randomly sketching its columns, followed by a quantization of the vectors constituting this basis. It then computes approximate projections of the columns of A onto this quantized basis. We derive upper bounds on the approximation error of our algorithm, and analyze the impact of target rank and quantization bit-budget. The tradeoff between compression ratio and approximation accuracy allows for flexibility in choosing these parameters based on specific application requirements. We empirically demonstrate the efficacy of our algorithm in image compression, nearest neighbor classification of image and text embeddings, and compressing the layers of LlaMa-7b. Our results illustrate that we can achieve compression ratios as aggressive as one bit per matrix coordinate, all while surpassing or maintaining the performance of traditional compression techniques. 
    more » « less
  3. Abstract

    Beginning with the work of Landau, Pollak and Slepian in the 1960s on time‐band limiting, commuting pairs of integral and differential operators have played a key role in signal processing, random matrix theory, and integrable systems. Previously, such pairs were constructed by ad hoc methods, which essentially worked because a commuting operator of low order could be found by a direct calculation. We describe a general approach to these problems that proves that every pointWof Wilson's infinite dimensional adelic Grassmannian gives rise to an integral operator , acting on for a contour , which reflects a differential operator with rational coefficients in the sense that on a dense subset of . By using analytic methods and methods from integrable systems, we show that the reflected differential operator can be constructed from the Fourier algebra of the associated bispectral function . The exact size of this algebra with respect to a bifiltration is in turn determined using algebro‐geometric methods. Intrinsic properties of four involutions of the adelic Grassmannian naturally lead us to consider the reflecting property above in place of plain commutativity. Furthermore, we prove that the time‐band limited operators of the generalized Laplace transforms with kernels given by the rank one bispectral functions always reflect a differential operator. A 90° rotation argument is used to prove that the time‐band limited operators of the generalized Fourier transforms with kernels admit a commuting differential operator. These methods produce vast collections of integral operators with prolate‐spheroidal properties, associated to the wave functions of all rational solutions of the KP hierarchy vanishing at infinity, introduced by Krichever in the late 1970s.

    more » « less
  4. Abstract

    Extending computational harmonic analysis tools from the classical setting of regular lattices to the more general setting of graphs and networks is very important, and much research has been done recently. The generalized Haar–Walsh transform (GHWT) developed by Irion and Saito (2014) is a multiscale transform for signals on graphs, which is a generalization of the classical Haar and Walsh–Hadamard transforms. We propose theextendedgeneralized Haar–Walsh transform (eGHWT), which is a generalization of the adapted time–frequency tilings of Thiele and Villemoes (1996). The eGHWT examines not only the efficiency of graph-domain partitions but also that of “sequency-domain” partitionssimultaneously. Consequently, the eGHWT and its associated best-basis selection algorithm for graph signals significantly improve the performance of the previous GHWT with the similar computational cost,$$O(N \log N)$$O(NlogN), whereNis the number of nodes of an input graph. While the GHWT best-basis algorithm seeks the most suitable orthonormal basis for a given task among more than$$(1.5)^N$$(1.5)Npossible orthonormal bases in$$\mathbb {R}^N$$RN, the eGHWT best-basis algorithm can find a better one by searching through more than$$0.618\cdot (1.84)^N$$0.618·(1.84)Npossible orthonormal bases in$$\mathbb {R}^N$$RN. This article describes the details of the eGHWT best-basis algorithm and demonstrates its superiority using several examples including genuine graph signals as well as conventional digital images viewed as graph signals. Furthermore, we also show how the eGHWT can be extended to 2D signals and matrix-form data by viewing them as a tensor product of graphs generated from their columns and rows and demonstrate its effectiveness on applications such as image approximation.

    more » « less
  5. Models in which the covariance matrix has the structure of a sparse matrix plus a low rank perturbation are ubiquitous in data science applications. It is often desirable for algorithms to take advantage of such structures, avoiding costly matrix computations that often require cubic time and quadratic storage. This is often accomplished by performing operations that maintain such structures, for example, matrix inversion via the Sherman–Morrison–Woodbury formula. In this article, we consider the matrix square root and inverse square root operations. Given a low rank perturbation to a matrix, we argue that a low‐rank approximate correction to the (inverse) square root exists. We do so by establishing a geometric decay bound on the true correction's eigenvalues. We then proceed to frame the correction as the solution of an algebraic Riccati equation, and discuss how a low‐rank solution to that equation can be computed. We analyze the approximation error incurred when approximately solving the algebraic Riccati equation, providing spectral and Frobenius norm forward and backward error bounds. Finally, we describe several applications of our algorithms, and demonstrate their utility in numerical experiments. 
    more » « less