skip to main content

Title: COBRA: A Framework for Evaluating Compositions of Hardware Branch Predictors
We present COBRA, a framework which enables a realistic hardware-guided methodology for evaluating compositions of hardware branch predictors. COBRA provides a common interface for developing RTL implementations of predictor subcomponents, as well as a predictor composer that automatically generates hardware predictor pipelines from sub-components based on a high-level topological model of a desired algorithm. We demonstrate how COBRA aids in the design and evaluation of diverse predictor architectures and how our hardware-centric approach captures concerns in predictor characterization that are not exposed in software-based algorithm development. Using COBRA, we generate three superscalar pipelined branch predictors with diverse architectures, synthesize them to run at 1 GHz on a commercial FinFET process, integrate them with the open-source BOOM out-of-order core, and evaluate their endto- end performance on workloads over trillions of cycles. The COBRA generator system has been open-sourced as part of the SonicBOOM out-of-order core.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
Page Range / eLocation ID:
310 to 320
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Predicting coarse-grain variations in workload behavior during execution is essential for dynamic resource optimization of processor systems. Researchers have proposed various methods to first classify workloads into phases and then learn their long-term phase behavior to predict and anticipate phase changes. Early studies on phase prediction proposed table-based phase predictors. More recently, simple learning-based techniques such as decision trees have been explored. However, more recent advances in machine learning have not been applied to phase prediction so far. Furthermore, existing phase predictors have been studied only in connection with specific phase classifiers even though there is a wide range of classification methods. Early work in phase classification proposed various clustering methods that required access to source code. Some later studies used performance monitoring counters, but they only evaluated classifiers for specific contexts such as thermal modeling. In this work, we perform a comprehensive study of source-oblivious phase classification and prediction methods using hardware counters. We adapt classification techniques that were used with different inputs in the past and compare them to state-of-the-art hardware counter based classifiers. We further evaluate the accuracy of various phase predictors when coupled with different phase classifiers and evaluate a range of advanced machine learning techniques, including SVMs and LSTMs for workload phase prediction. We apply classification and prediction approaches to SPEC workloads running on an Intel Core-i9 platform. Results show that a two-level kmeans clustering combined with SVM-based phase change prediction provides the best tradeoff between accuracy and long-term stability. Additionally, the SVM predictor reduces the average prediction error by 80% when compared to a table-based predictor. 
    more » « less
  2. Neural architecture search (NAS) is a promising technique to design efficient and high-performance deep neural networks (DNNs). As the performance requirements of ML applications grow continuously, the hardware accelerators start playing a central role in DNN design. This trend makes NAS even more complicated and time-consuming for most real applications. This paper proposes FLASH, a very fast NAS methodology that co-optimizes the DNN accuracy and performance on a real hardware platform. As the main theoretical contribution, we first propose the NN-Degree, an analytical metric to quantify the topological characteristics of DNNs with skip connections (e.g., DenseNets, ResNets, Wide-ResNets, and MobileNets). The newly proposed NN-Degree allows us to do training-free NAS within one second and build an accuracy predictor by training as few as 25 samples out of a vast search space with more than 63 billion configurations. Second, by performing inference on the target hardware, we fine-tune and validate our analytical models to estimate the latency, area, and energy consumption of various DNN architectures while executing standard ML datasets. Third, we construct a hierarchical algorithm based on simplicial homology global optimization (SHGO) to optimize the model-architecture co-design process, while considering the area, latency, and energy consumption of the target hardware. We demonstrate that, compared to the state-of-the-art NAS approaches, our proposed hierarchical SHGO-based algorithm enables more than four orders of magnitude speedup (specifically, the execution time of the proposed algorithm is about 0.1 seconds). Finally, our experimental evaluations show that FLASH is easily transferable to different hardware architectures, thus enabling us to do NAS on a Raspberry Pi-3B processor in less than 3 seconds. 
    more » « less
  3. null (Ed.)
    State-of-the-art value predictors either use control-flow context or data context to predict values. Predictors based on control-flow context use branch histories to remember past values, but these predictors require lengthy histories to predict anything other than constant and strided values. Predictors that use data context---also known as Finite Context Method (FCM) predictors---use a history of past values to predict a broader class of values, but such predictors achieve low coverage due to long training times, and they can become complex due to speculative value histories. We observe that the combination of branch and value history provides better predictability than the use of each history separately because it can predict values in control-dependent sequences of values. Furthermore, the combination improves training time by enabling accurate predictions to be made with shorter history, and it simplifies the hardware design by removing the need for speculative value histories. Based on these observations, we propose a new unlimited budget value predictor, Heterogeneous-Context Value Predictor (HCVP), that when hybridized with E-Stride, achieves a geometric mean IPC of 3.88 on the 135 public traces, as compared to 3.81 for the current leader of the Championship Value Prediction. 
    more » « less
  4. Convolutional neural networks (CNNs) are used in numerous real-world applications such as vision-based autonomous driving and video content analysis. To run CNN inference on various target devices, hardware-aware neural architecture search (NAS) is crucial. A key requirement of efficient hardware-aware NAS is the fast evaluation of inference latencies in order to rank different architectures. While building a latency predictor for each target device has been commonly used in state of the art, this is a very time-consuming process, lacking scalability in the presence of extremely diverse devices. In this work, we address the scalability challenge by exploiting latency monotonicity --- the architecture latency rankings on different devices are often correlated. When strong latency monotonicity exists, we can re-use architectures searched for one proxy device on new target devices, without losing optimality. In the absence of strong latency monotonicity, we propose an efficient proxy adaptation technique to significantly boost the latency monotonicity. Finally, we validate our approach and conduct experiments with devices of different platforms on multiple mainstream search spaces, including MobileNet-V2, MobileNet-V3, NAS-Bench-201, ProxylessNAS and FBNet. Our results highlight that, by using just one proxy device, we can find almost the same Pareto-optimal architectures as the existing per-device NAS, while avoiding the prohibitive cost of building a latency predictor for each device. 
    more » « less
  5. We present a learning-enabled Task and Motion Planning (TAMP) algorithm for solving mobile manipulation problems in environments with many articulated and movable obstacles. Our idea is to bias the search procedure of a traditional TAMP planner with a learned plan feasibility predictor. The core of our algorithm is PIGINet, a novel Transformer-based learning method that takes in a task plan, the goal, and the initial state, and predicts the probability of finding motion trajectories associated with the task plan. We integrate PIGINet within a TAMP planner that generates a diverse set of high-level task plans, sorts them by their predicted likelihood of feasibility, and refines them in that order. We evaluate the runtime of our TAMP algorithm on seven families of kitchen rearrangement problems, comparing its performance to that of non-learning baselines. Our experiments show that PIGINet substantially improves planning efficiency, cutting down runtime by 80\% on problems with small state spaces and 10\%-50\% on larger ones, after being trained on only 150-600 problems. Finally, it also achieves zero-shot generalization to problems with unseen object categories thanks to its visual encoding of objects. 
    more » « less