

Title: Codestitcher: inter-procedural basic block layout optimization
Modern software executes a large amount of code. Previous techniques of code layout optimization were developed one or two decades ago and have become inadequate to cope with the scale and complexity of new types of applications such as compilers, browsers, interpreters, language VMs, and shared libraries. This paper presents Codestitcher, an inter-procedural basic block code layout optimizer which reorders basic blocks in an executable to benefit from better cache and TLB performance. Codestitcher provides a hierarchical framework which can be used to improve locality in various layers of the memory hierarchy. Our evaluation shows that Codestitcher improves the performance of the original program by 3% to 25% (on average, by 10%) on 5 widely used applications with large code sizes: MySQL, Clang, Firefox, Apache, and Python. It gives an additional improvement of 4% over LLVM's PGO and 3% over PGO combined with the best function reordering technique.
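To make the idea concrete, here is a minimal, hypothetical sketch of profile-guided basic-block chaining in the spirit of classic layout optimizers (Pettis-Hansen style); it illustrates the general technique, not Codestitcher's actual hierarchical algorithm. Hot control-flow edges are greedily merged into chains so that frequently taken branches become fall-throughs:

```python
# Sketch of greedy, profile-guided basic-block chaining (Pettis-Hansen
# style). Illustrates the general idea behind layout optimizers like
# Codestitcher; it is NOT the paper's actual algorithm.
# `edges` is a hypothetical profile: (src_block, dst_block, exec_count).

def chain_blocks(blocks, edges):
    # Each block starts in its own chain.
    chain_of = {b: [b] for b in blocks}

    # Visit CFG edges from hottest to coldest.
    for src, dst, _count in sorted(edges, key=lambda e: -e[2]):
        c_src, c_dst = chain_of[src], chain_of[dst]
        # Merge only if src ends one chain and dst starts another:
        # the branch src -> dst then becomes a fall-through.
        if c_src is not c_dst and c_src[-1] == src and c_dst[0] == dst:
            c_src.extend(c_dst)
            for b in c_dst:
                chain_of[b] = c_src

    # Deduplicate chains, preserving discovery order.
    seen, layout = set(), []
    for c in chain_of.values():
        if id(c) not in seen:
            seen.add(id(c))
            layout.extend(c)
    return layout

# Hypothetical profile: B0 -> B2 is the hot path.
blocks = ["B0", "B1", "B2", "B3"]
edges = [("B0", "B2", 900), ("B0", "B1", 100), ("B2", "B3", 850)]
print(chain_blocks(blocks, edges))  # ['B0', 'B2', 'B3', 'B1']
```

Per the abstract, Codestitcher goes further than this single-level sketch: it chains blocks across function boundaries and applies its framework hierarchically to improve locality at multiple layers of the memory hierarchy.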
Award ID(s):
1717877 1629376 1319617 1116104
PAR ID:
10122262
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Proceedings of the 28th International Conference on Compiler Construction (CC 2019)
Page Range / eLocation ID:
65 to 75
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We consider the problem of distributed pose graph optimization (PGO) that has important applications in multi-robot simultaneous localization and mapping (SLAM). We propose the majorization minimization (MM) method for distributed PGO (MM−PGO) that applies to a broad class of robust loss kernels. The MM−PGO method is guaranteed to converge to first-order critical points under mild conditions. Furthermore, noting that the MM−PGO method is reminiscent of proximal methods, we leverage Nesterov's method and adopt adaptive restarts to accelerate convergence. The resulting accelerated MM methods for distributed PGO, both with a master node in the network (AMM−PGO∗) and without (AMM−PGO#), converge faster than the MM−PGO method without sacrificing theoretical guarantees. In particular, the AMM−PGO# method, which needs no master node and is fully decentralized, features a novel adaptive restart scheme and has a rate of convergence comparable to that of the AMM−PGO∗ method using a master node to aggregate information from all the nodes. The efficacy of this work is validated through extensive applications to 2D and 3D SLAM benchmark datasets and comprehensive comparisons against existing state-of-the-art methods, indicating that our MM methods converge faster and result in better solutions to distributed PGO. The code is available at https://github.com/MurpheyLab/DPGO.
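As a concrete illustration of the acceleration pattern described above, the following minimal sketch applies majorization-minimization steps with Nesterov extrapolation and a function-value adaptive restart to a toy smooth quadratic. It is a single-machine stand-in: the paper's methods operate on distributed pose graphs with robust kernels, and all names here are illustrative.

```python
import numpy as np

# Toy objective: f(x) = 0.5 x^T A x - b^T x, with A positive definite.
rng = np.random.default_rng(0)
M = rng.standard_normal((20, 20))
A = M.T @ M + np.eye(20)          # positive definite
b = rng.standard_normal(20)

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
L = np.linalg.eigvalsh(A)[-1]     # Lipschitz constant of the gradient

def mm_step(x):
    # For an L-smooth f, minimizing the quadratic majorant
    # f(y) + <grad(y), x - y> + (L/2)||x - y||^2 is a gradient step.
    return x - grad(x) / L

def accelerated_mm(x0, iters=200):
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        x_new = mm_step(y)                   # MM step at extrapolated point
        if f(x_new) > f(x):                  # adaptive restart: momentum hurt
            y, t = x.copy(), 1.0             # drop momentum, restart from x
            x_new = mm_step(y)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_new + (t - 1.0) / t_new * (x_new - x)  # Nesterov extrapolation
        x, t = x_new, t_new
    return x

x_star = accelerated_mm(np.zeros(20))
print(np.linalg.norm(grad(x_star)))  # near zero at a first-order critical point
```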
  2. This article presents a code generator for sparse tensor contraction computations. It leverages a mathematical representation of loop nest computations in the sparse polyhedral framework (SPF), which extends the polyhedral model to support non-affine computations, such as those that arise in sparse tensors. SPF is extended to perform layout specification, optimization, and code generation of sparse tensor code: (1) We develop a polyhedral layout specification that decouples iteration spaces for layout and computation; and (2) we develop efficient co-iteration of sparse tensors by combining polyhedra scanning over the layout of one sparse tensor with the synthesis of code to find corresponding elements in other tensors through an SMT solver. We compare the generated code with that produced by a state-of-the-art tensor compiler, TACO. We achieve on average 1.63× faster parallel performance than TACO on sparse-sparse co-iteration and describe how to improve that to 2.72× average speedup by switching the find algorithms. We also demonstrate that decoupling iteration spaces of layout and computation enables additional layout and computation combinations to be supported. 
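The following minimal sketch contrasts the two co-iteration strategies on a sparse dot product: the classic two-pointer merge that tensor compilers like TACO emit, and the scan-plus-find pattern described above, with a dictionary lookup standing in for the solver-synthesized find code (data and names are illustrative):

```python
# Two ways to co-iterate two sparse vectors stored as sorted
# (coordinate, value) lists, as in a sparse dot product a . b.
# Illustrative only; the article generates such loops automatically.

def dot_merge(a, b):
    # Classic two-pointer merge: advance whichever coordinate is smaller.
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        ca, va = a[i]
        cb, vb = b[j]
        if ca == cb:
            total += va * vb
            i += 1
            j += 1
        elif ca < cb:
            i += 1
        else:
            j += 1
    return total

def dot_scan_and_find(a, b):
    # SPF-style alternative: scan the layout of `a` and *find* the
    # matching coordinate in `b` (here a dict lookup stands in for the
    # synthesized find code).
    b_index = dict(b)
    return sum(v * b_index.get(c, 0.0) for c, v in a)

a = [(0, 1.0), (3, 2.0), (7, 4.0)]
b = [(3, 5.0), (7, 0.5), (9, 1.0)]
print(dot_merge(a, b), dot_scan_and_find(a, b))  # 12.0 12.0
```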
  3. In this paper, we consider the problem of distributed pose graph optimization (PGO) that has extensive applications in multi-robot simultaneous localization and mapping (SLAM). We propose majorization minimization methods for distributed PGO and show that our proposed methods are guaranteed to converge to first-order critical points under mild conditions. Furthermore, since our proposed methods rely on a proximal operator of distributed PGO, the convergence rate can be significantly accelerated with Nesterov's method, and more importantly, the acceleration induces no compromise of theoretical guarantees. In addition, we also present accelerated majorization minimization methods for the distributed chordal initialization that have quadratic convergence and can be used to compute an initial guess for distributed PGO. The efficacy of this work is validated through applications on a number of 2D and 3D SLAM datasets and comparisons with existing state-of-the-art methods, which indicate that our proposed methods converge faster and result in better solutions to distributed PGO.
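For intuition about the chordal initialization mentioned above, here is a minimal, centralized sketch for the 2D rotation part of PGO. The paper instead computes this initialization with distributed, accelerated iterative methods, so everything below (names and data included) is an illustrative stand-in.

```python
import numpy as np

# Minimal sketch of chordal initialization for the 2D rotation part of
# PGO. A planar rotation is a unit complex number z_i; dropping the
# |z_i| = 1 constraint makes the residuals z_i * z_ij - z_j linear, so
# the relaxed problem is a linear least squares. Anchor z_0 = 1, solve,
# then normalize back onto the unit circle. `rel` is illustrative data.

def chordal_init_2d(n, rel):
    # rel: list of (i, j, z_ij) with z_ij the measured relative rotation.
    A = np.zeros((len(rel), n - 1), dtype=complex)
    c = np.zeros(len(rel), dtype=complex)
    for row, (i, j, zij) in enumerate(rel):
        if i == 0:
            c[row] += zij            # z_0 * z_ij is a known constant
        else:
            A[row, i - 1] += zij     # coefficient of the unknown z_i
        if j == 0:
            c[row] -= 1.0            # -z_0 is a known constant
        else:
            A[row, j - 1] -= 1.0     # coefficient of the unknown z_j
    # Residuals are A @ z + c, so the least-squares solve is A z = -c.
    z, *_ = np.linalg.lstsq(A, -c, rcond=None)
    z = np.concatenate(([1.0 + 0j], z))
    return z / np.abs(z)             # project back onto the unit circle

# Measurements consistent with z_1 = e^{0.3i} and z_2 = e^{0.2i}.
rel = [(0, 1, np.exp(0.3j)), (1, 2, np.exp(-0.1j)), (0, 2, np.exp(0.2j))]
print(np.angle(chordal_init_2d(3, rel)))  # ~ [0.0, 0.3, 0.2]
```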
  4. In this paper, we present CPL-Sync, a certifiably correct algorithm to solve planar pose graph optimization (PGO) using the complex number representation. We formulate planar PGO as maximum likelihood estimation (MLE) on the product of unit complex numbers, and relax this nonconvex quadratic complex optimization problem to complex semidefinite programming (SDP). Furthermore, we simplify the corresponding semidefinite program to Riemannian staircase optimization (RSO) on complex oblique manifolds that can be solved with the Riemannian trust region (RTR) method. In addition, we prove that the SDP relaxation and RSO simplification are tight as long as the noise magnitude is below a certain threshold. The efficacy of this work is validated through comparisons with existing methods as well as applications on planar PGO in simultaneous localization and mapping (SLAM), which indicate that the proposed algorithm is more efficient and capable of solving planar PGO certifiably. The C++ code for CPL-Sync is available at https://github.com/fantaosha/CPL-Sync.
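To illustrate the complex-number representation that CPL-Sync builds on, the sketch below composes and inverts planar poses, where a rotation by theta is the unit complex number e^{i*theta} and a translation is an ordinary complex number (values are illustrative):

```python
import cmath

# Sketch of the planar pose representation CPL-Sync builds on: a 2D
# rotation by theta is the unit complex number e^{i*theta}, and a pose
# is (rotation z, translation t) with t also a complex number. Pose
# composition then needs only complex multiply-adds.

def compose(pose_a, pose_b):
    # (z_a, t_a) * (z_b, t_b) = (z_a * z_b, t_a + z_a * t_b)
    z_a, t_a = pose_a
    z_b, t_b = pose_b
    return (z_a * z_b, t_a + z_a * t_b)

def inverse(pose):
    z, t = pose
    z_inv = z.conjugate()            # |z| = 1, so conj is the inverse
    return (z_inv, -z_inv * t)

a = (cmath.exp(1j * 0.5), 1.0 + 2.0j)   # rotate 0.5 rad, translate (1, 2)
b = (cmath.exp(1j * 0.25), 0.5 + 0.0j)

ab = compose(a, b)
print(cmath.phase(ab[0]))                # 0.75: rotations add
print(compose(inverse(a), a))            # ~ (1+0j, 0j): a^{-1} a = identity
```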
  5. Random Forests (RFs) are a commonly used machine learning method for classification and regression tasks spanning a variety of application domains, including bioinformatics, business analytics, and software optimization. While prior work has focused primarily on improving performance of the training of RFs, many applications, such as malware identification, cancer prediction, and banking fraud detection, require fast RF classification. In this work, we accelerate RF classification on GPU and FPGA. In order to provide efficient support for large datasets, we propose a hierarchical memory layout suitable to the GPU/FPGA memory hierarchy. We design three RF classification code variants based on that layout, and we investigate GPU- and FPGA-specific considerations for these kernels. Our experimental evaluation, performed on an Nvidia Xp GPU and on a Xilinx Alveo U250 FPGA accelerator card using publicly available datasets on the scale of millions of samples and tens of features, covers various aspects. First, we evaluate the performance benefits of our hierarchical data structure over the standard compressed sparse row (CSR) format. Second, we compare our GPU implementation with cuML, a machine learning library targeting Nvidia GPUs. Third, we explore the performance/accuracy tradeoff resulting from the use of different tree depths in the RF. Finally, we perform a comparative performance analysis of our GPU and FPGA implementations. Our evaluation shows that, while reporting the best performance on GPU, our code variants outperform the CSR baseline both on GPU and FPGA. For high accuracy targets, our GPU implementation yields a 5-9× speedup over CSR, and up to a 2× speedup over Nvidia's cuML library.
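As a rough illustration of why a flat, accelerator-friendly layout helps, the sketch below stores one decision tree in structure-of-arrays form and classifies with pure index arithmetic. This is a generic layout, not the paper's hierarchical one, and all data is illustrative.

```python
import numpy as np

# Sketch of a flat, structure-of-arrays decision-tree layout of the
# kind that maps well onto GPU/FPGA memory (illustrative; not the
# paper's hierarchical layout). Each node i stores its split feature,
# threshold, and left/right child indices; leaves are marked with
# feature = -1 and store the class label in `threshold`.

feature   = np.array([0,  1, -1, -1, -1])       # -1 marks a leaf
threshold = np.array([0.5, 0.3, 0.0, 1.0, 1.0])
left      = np.array([1,  2, -1, -1, -1])
right     = np.array([4,  3, -1, -1, -1])

def predict(x):
    # Iterative traversal: no pointers, no recursion, just index math,
    # which is what makes the layout friendly to accelerators.
    node = 0
    while feature[node] != -1:
        if x[feature[node]] <= threshold[node]:
            node = left[node]
        else:
            node = right[node]
    return int(threshold[node])                  # leaf: class label

print(predict(np.array([0.2, 0.1])))  # root left, then left leaf -> class 0
print(predict(np.array([0.9, 0.0])))  # root right -> class 1
```

Keeping all node fields in contiguous arrays gives coalesced memory access when many threads classify samples in parallel, which is the basic motivation behind the accelerator-specific layouts the paper evaluates.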