

Search for: All records

Award ID contains: 2016701


  1. Current state-of-the-art resource management systems leverage Machine Learning (ML) methods to enable the efficient use of the heterogeneous memory hardware deployed across emerging computing platforms. While machine intelligence can be effectively used to learn and predict the complex data access patterns of modern analytics, applying ML over today's exploded data sizes and memory footprints is prohibitively expensive for practical system-level integration. For this reason, recent solutions use existing lightweight historical information to predict the access behavior of the majority of the application pages, and train ML models over only a small page subset. To maximize application performance improvements, the pages selected for machine learning-based management are identified with elaborate page selection methods. These methods involve the calculation of detailed performance estimates that depend on the configuration of the hybrid memory platform. This paper aims to reduce these vast operational overheads, which further exacerbate the already high overheads of using machine intelligence in return for high performance and efficiency. To this end, we build Cronus, an image-based pipeline for selecting pages for ML-based management. We visualize memory access patterns and reveal spatial and temporal correlations among the selected pages that current methods fail to leverage. We then use the created images to detect patterns and select page groups for machine learning model deployment. Cronus drastically reduces the operational costs while preserving the effectiveness of the page selection and the performance achieved by machine-intelligent hybrid memory management. This work makes the case that visualization and computer vision methods can unlock new insights and reduce the operational complexity of emerging systems solutions.
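As a rough, hypothetical illustration of the first stage of such an image-based pipeline (not the Cronus implementation), the sketch below bins a page-access trace into a time × page intensity grid that later pattern-detection passes could consume; the Access fields, bin counts, and function names are invented for the example.

```cpp
// Hypothetical sketch: bucket a page-access trace into a time_bins x page_bins
// grid; each cell counts how often pages in that address range were touched
// during that time window. The resulting "image" could feed pattern detection.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Access { uint64_t timestamp; uint64_t page; };

std::vector<std::vector<uint32_t>>
render_access_image(const std::vector<Access>& trace,
                    size_t time_bins, size_t page_bins) {
    std::vector<std::vector<uint32_t>> img(time_bins,
                                           std::vector<uint32_t>(page_bins, 0));
    if (trace.empty()) return img;

    // Find the time and page ranges covered by the trace.
    auto [t_lo, t_hi] = std::minmax_element(trace.begin(), trace.end(),
        [](const Access& a, const Access& b) { return a.timestamp < b.timestamp; });
    auto [p_lo, p_hi] = std::minmax_element(trace.begin(), trace.end(),
        [](const Access& a, const Access& b) { return a.page < b.page; });
    const double t_span = std::max<double>(1.0, double(t_hi->timestamp - t_lo->timestamp));
    const double p_span = std::max<double>(1.0, double(p_hi->page - p_lo->page));

    for (const Access& a : trace) {
        size_t ti = std::min(time_bins - 1,
            size_t((a.timestamp - t_lo->timestamp) / t_span * time_bins));
        size_t pi = std::min(page_bins - 1,
            size_t((a.page - p_lo->page) / p_span * page_bins));
        img[ti][pi]++;   // brighter cell == hotter page region in that window
    }
    return img;
}
```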
  2. Emerging workloads benefit from the massive memory capacities provided by hybrid memory platforms. Recent system-level hybrid memory management solutions integrate machine learning methods to better predict complex data access behaviors. Given the substantial associated learning overheads, such solutions train parallel recurrent neural networks to learn access patterns at the granularity of a page, for a carefully selected page subset. Our observations reveal that the size of this subset varies immensely across workload classes, sizes, and patterns. Increasing the granularity to the level of a page group would help reduce the aggregate learning overheads, yet unsupervised machine learning clustering methods are not practical to use in this context. Instead, this paper builds Coeus, a page grouping mechanism for machine learning-based hybrid memory management. Coeus is simple, robust, and efficient. It leverages data reuse insights to fine-tune the granularity at which patterns are interpreted by the system. As a result, Coeus creates, in a practical way, large clusters of pages that share the same access behavior. Coeus reduces the associated learning overheads by almost 3x. In addition, compared to configurations of existing hybrid memory managers, Coeus achieves 3x higher application performance through the combined effects of applying machine learning to more pages and performing management operations at a better granularity.
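The abstract does not spell out Coeus's grouping criteria; the sketch below is only a generic illustration of the underlying idea of clustering contiguous pages whose observed behavior is similar, so that one model can cover the whole group. The PageStats fields and tolerance parameters are hypothetical.

```cpp
// Hypothetical sketch: walk pages in address order and extend the current
// group while the next page is contiguous and behaves similarly; otherwise
// start a new group. Fields and tolerances are illustrative only.
#include <cmath>
#include <cstdint>
#include <vector>

struct PageStats {
    uint64_t page;         // page number, sorted in ascending order
    double   access_rate;  // accesses observed per monitoring interval
    double   mean_reuse;   // mean reuse distance observed for this page
};

struct PageGroup { uint64_t first_page; uint64_t last_page; };

std::vector<PageGroup> group_pages(const std::vector<PageStats>& pages,
                                   double rate_tol, double reuse_tol) {
    std::vector<PageGroup> groups;
    size_t i = 0;
    while (i < pages.size()) {
        PageGroup g{pages[i].page, pages[i].page};
        const double rate  = pages[i].access_rate;
        const double reuse = pages[i].mean_reuse;
        size_t j = i + 1;
        while (j < pages.size() &&
               pages[j].page == pages[j - 1].page + 1 &&              // contiguous
               std::fabs(pages[j].access_rate - rate)  <= rate_tol && // similar rate
               std::fabs(pages[j].mean_reuse  - reuse) <= reuse_tol) {// similar reuse
            g.last_page = pages[j].page;
            ++j;
        }
        groups.push_back(g);   // one prediction model can cover this whole group
        i = j;
    }
    return groups;
}
```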
  3. Disaggregated memory is being proposed as a way to provide efficient memory scaling for data-intensive applications. High-performance interconnect technologies, such as CXL, make disaggregated, fabric-attached memory (FAM) a viable secondary tier of memory. Previous work on remote memory relies on extending kernel-level paging to utilize FAM as an additional storage tier after local memory. These approaches have the advantage of exposing remote memory in an application-transparent way that requires no code changes, but they incur large overheads due to the mismatch between the abstraction of a flat virtual address space and the tiered nature of FAM. In this paper, we present an alternative approach to remote memory based on application-specific objects. We design FAM-Graph, a semi-external graph processing system that leverages application-level properties, such as read-only edge data, to efficiently tier data between local and remote memory and to prefetch remote data for local computation. Using several graph algorithms and datasets, we demonstrate that FAM-Graph achieves end-to-end performance within factors of 1–6× of Galois, the state-of-the-art shared-memory graph processing system, while using up to 20× less local memory. When Galois is used in conjunction with an OS-level FAM solution, we show that FAM-Graph achieves better end-to-end performance by up to 9× when both systems are configured with the same amount of local memory.
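As a conceptual sketch of the semi-external split described above, with mutable vertex state kept in local DRAM and read-only edge partitions streamed from fabric-attached memory, the code below overlaps fetching of the next partition with computation on the current one. fam_fetch_partition() is a hypothetical stand-in for whatever FAM access path the platform exposes; this is not FAM-Graph's code.

```cpp
// Conceptual sketch: vertex state stays in local DRAM; read-only edge
// partitions live in fabric-attached memory and are streamed in, with the
// next partition prefetched while the current one is processed.
#include <cstdint>
#include <future>
#include <vector>

struct Edge { uint32_t src; uint32_t dst; };

// Hypothetical stand-in: a real system would copy the partition from FAM
// (e.g., over CXL) into a local buffer here.
std::vector<Edge> fam_fetch_partition(size_t /*partition_id*/) { return {}; }

void pagerank_step(std::vector<double>& rank_next,
                   const std::vector<double>& rank_cur,
                   const std::vector<uint32_t>& out_degree,
                   size_t num_partitions) {
    // Kick off the first fetch, then overlap compute with the next fetch.
    std::future<std::vector<Edge>> inflight =
        std::async(std::launch::async, fam_fetch_partition, size_t{0});

    for (size_t p = 0; p < num_partitions; ++p) {
        std::vector<Edge> edges = inflight.get();          // current partition, now local
        if (p + 1 < num_partitions)                        // prefetch the next one
            inflight = std::async(std::launch::async, fam_fetch_partition, p + 1);

        for (const Edge& e : edges)                        // compute on local vertex state
            rank_next[e.dst] += rank_cur[e.src] / out_degree[e.src];
    }
}
```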
  4. Centered on modern C++ and the SYCL standard for heterogeneous programming, Data Parallel C++ (DPC++) and Intel's oneAPI software ecosystem aim to lower the barrier to entry for the use of accelerators like FPGAs in diverse applications. In this work, we consider the usage of FPGAs for scientific computing, in particular with the multigrid solver MueLu. We report on early experiences implementing kernels of the solver in DPC++ for execution on Stratix 10 FPGAs, and we evaluate several algorithmic design and implementation choices. These choices not only impact performance, but also shed light on the capabilities and limitations of DPC++ and oneAPI.
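MueLu's kernels are not shown in the abstract; for a flavor of what a DPC++ port involves, here is a minimal SYCL 2020 sketch (not MueLu code) of a CSR sparse matrix-vector product, a building block of multigrid smoothers, written with buffers and accessors so the same source can target CPU, GPU, or FPGA backends.

```cpp
// Minimal SYCL 2020 sketch (not MueLu code): y = A * x for a CSR matrix,
// one work-item per row. Buffers hand the data to whatever device the queue
// targets; the destructor of yb copies the result back into y.
#include <sycl/sycl.hpp>
#include <vector>

void spmv(sycl::queue& q,
          const std::vector<int>&    row_ptr,   // CSR row offsets, size n+1
          const std::vector<int>&    col_idx,   // CSR column indices
          const std::vector<double>& values,    // CSR nonzero values
          const std::vector<double>& x,
          std::vector<double>&       y) {
    const size_t n = y.size();
    sycl::buffer<int>    rp(row_ptr.data(), sycl::range<1>(row_ptr.size()));
    sycl::buffer<int>    ci(col_idx.data(), sycl::range<1>(col_idx.size()));
    sycl::buffer<double> va(values.data(),  sycl::range<1>(values.size()));
    sycl::buffer<double> xb(x.data(),       sycl::range<1>(x.size()));
    sycl::buffer<double> yb(y.data(),       sycl::range<1>(y.size()));

    q.submit([&](sycl::handler& h) {
        sycl::accessor a_rp(rp, h, sycl::read_only);
        sycl::accessor a_ci(ci, h, sycl::read_only);
        sycl::accessor a_va(va, h, sycl::read_only);
        sycl::accessor a_x (xb, h, sycl::read_only);
        sycl::accessor a_y (yb, h, sycl::write_only, sycl::no_init);
        h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> idx) {
            const size_t row = idx[0];
            double sum = 0.0;
            for (int k = a_rp[row]; k < a_rp[row + 1]; ++k)
                sum += a_va[k] * a_x[a_ci[k]];
            a_y[row] = sum;
        });
    });
}
```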
  5. Arm HPC has succeeded in scaling up in the supercomputing space with the deployment of systems like RIKEN’s Fugaku supercomputer and Sandia’s Astra cluster. At the same time, onboarding new users to the Arm HPC ecosystem has never been more complex, due to an overabundance of compilers, libraries, and build options for tools and applications. This work investigates one particular method to ease the integration of new users into the Arm HPC space: using Open OnDemand to provide a consistent and easy-to-use front-end for Georgia Tech’s A64FX cluster, Octavius. We detail the motivations for this deployment as well as the potential pitfalls of integrating with an Arm A64FX environment. User-motivated applications that incorporate interactive usage of virtual desktops and Jupyter notebooks are discussed as motivating workflows, and we provide some context on how future deployments might look with combined Arm and Open OnDemand integration.
  6. Current state-of-the-art systems for hybrid memory management are enriched with machine intelligence. To enable the practical use of Machine Learning (ML), system-level page schedulers focus the ML model training on a small subset of the applications’ memory footprint, while using existing lightweight historical information to predict the access behavior of the majority of the pages. To maximize application performance improvements, the pages selected for machine learning-based management are identified with elaborate page selection methods. These methods involve the calculation of detailed performance estimates that depend on the configuration of the hybrid memory platform. This paper explores opportunities to reduce the operational overheads of machine learning-based hybrid memory page schedulers by using visualization techniques to depict memory access patterns and to reveal spatial and temporal correlations among the selected pages that current methods fail to leverage. We propose an initial version of a visualization pipeline for prioritizing pages for machine learning that is independent of the hybrid memory configuration. Our approach selects pages whose ML-based management delivers, on average, performance within 5% of current solutions, while reducing the page selection time by 75×. We discuss future directions and make the case that visualization and computer vision methods can unlock new insights and reduce the operational complexity of emerging systems solutions.
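Continuing the access-image sketch shown after the first entry, one generic way to detect patterns and select page groups from such an image is a connected-components pass over the thresholded grid, where the page-bin span of each hot region becomes a candidate group. This is only an illustration of the idea, not the pipeline's actual detector, and the threshold is arbitrary.

```cpp
// Generic illustration: threshold the time x page intensity image and label
// 4-connected regions of hot cells; the page-bin span of each region is a
// candidate page group to hand to the ML model.
#include <algorithm>
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

struct Region { size_t first_page_bin; size_t last_page_bin; size_t cells; };

std::vector<Region>
find_hot_regions(const std::vector<std::vector<uint32_t>>& img, uint32_t threshold) {
    const size_t T = img.size(), P = T ? img[0].size() : 0;
    std::vector<std::vector<bool>> seen(T, std::vector<bool>(P, false));
    std::vector<Region> regions;

    for (size_t t = 0; t < T; ++t)
        for (size_t p = 0; p < P; ++p) {
            if (seen[t][p] || img[t][p] < threshold) continue;
            Region r{p, p, 0};
            std::queue<std::pair<size_t, size_t>> q;
            q.push({t, p});
            seen[t][p] = true;
            while (!q.empty()) {                            // flood fill one region
                auto [ct, cp] = q.front(); q.pop();
                r.first_page_bin = std::min(r.first_page_bin, cp);
                r.last_page_bin  = std::max(r.last_page_bin, cp);
                ++r.cells;
                const long dts[4] = {1, -1, 0, 0}, dps[4] = {0, 0, 1, -1};
                for (int k = 0; k < 4; ++k) {
                    const long nt = long(ct) + dts[k], np = long(cp) + dps[k];
                    if (nt < 0 || np < 0 || nt >= long(T) || np >= long(P)) continue;
                    if (seen[nt][np] || img[nt][np] < threshold) continue;
                    seen[nt][np] = true;
                    q.push({size_t(nt), size_t(np)});
                }
            }
            regions.push_back(r);
        }
    return regions;
}
```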
  7.
  8. Emerging hybrid memory systems that comprise technologies such as Intel's Optane DC Persistent Memory exhibit disparities in the access speeds and capacity ratios of their heterogeneous memory components. This breaks many assumptions and heuristics designed for traditional DRAM-only platforms. High application performance is feasible via dynamic data movement across memory units, which maximizes the capacity use of DRAM while ensuring efficient use of the aggregate system resources. Newly proposed solutions use performance models and machine intelligence to optimize which data to move dynamically and how much. However, the decision of when to move this data is based on empirical selection of time intervals, or is left to the applications. Our experimental evaluation shows that failure to properly configure the data movement frequency can lead to 10%-100% performance degradation for a given data movement policy; yet, there is no established methodology for properly configuring this value for a given workload, platform, and policy. We propose Cori, a system-level tuning solution that identifies and extracts the necessary application-level data reuse information and guides the selection of the data movement frequency to deliver gains in application performance and system resource efficiency. Experimental evaluation shows that Cori configures data movement frequencies that provide application performance within 3% of the optimal, and that it can do so up to 5× more quickly than random or brute-force approaches. System-level validation of Cori on a platform with DRAM and Intel's Optane DC PMEM confirms its practicality and tuning efficiency.
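Cori's tuning procedure is not given in the abstract; the sketch below only illustrates the underlying intuition of tying the data-movement period to observed data reuse: measure per-page reuse times from an access trace and pick a period that covers most of them. The 90% coverage target and the helper names are placeholders.

```cpp
// Illustrative only: derive a candidate data-movement period from observed
// per-page reuse times, so that most pages touched in one interval are also
// reused within it. The 90% coverage target is an arbitrary placeholder.
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Access { uint64_t timestamp; uint64_t page; };

// Time elapsed between successive accesses to the same page ("reuse time").
std::vector<uint64_t> reuse_times(const std::vector<Access>& trace) {
    std::unordered_map<uint64_t, uint64_t> last_seen;
    std::vector<uint64_t> reuse;
    for (const Access& a : trace) {
        auto it = last_seen.find(a.page);
        if (it != last_seen.end()) reuse.push_back(a.timestamp - it->second);
        last_seen[a.page] = a.timestamp;
    }
    return reuse;
}

// Pick the smallest period that covers the chosen fraction of reuse times.
uint64_t suggest_period(std::vector<uint64_t> reuse, double coverage = 0.9) {
    if (reuse.empty()) return 0;   // no reuse observed; caller falls back to a default
    const size_t k = size_t(coverage * double(reuse.size() - 1));
    std::nth_element(reuse.begin(), reuse.begin() + k, reuse.end());
    return reuse[k];
}
```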
  9. Application performance improvements in emerging systems with hybrid memory components, such as DRAM and Intel’s Optane DC persistent memory, are possible via periodic data movements that maximize DRAM use and system resource efficiency. Similarly, the predominantly used NUMA DRAM-only systems benefit from data balancing solutions, such as AutoNUMA, which periodically remap an application and its data onto the same NUMA node. Although there has been a significant body of research focused on the clever selection of the data to be moved periodically, there is little insight into how to select the frequency of the data movements, i.e., the duration of the monitoring period. Our experimental analysis shows that fine-tuning the period frequency can boost application performance on average by 70% for systems with locally attached memory units, and by 5x when accessing remote memory via interconnection networks. Thus, there is potential for significant performance improvements simply by cleverly selecting the frequency of the data movements, apart from choosing the data itself. While existing solutions set the duration of the period empirically, our work provides insights into the application-level properties that influence the choice of the period. More specifically, we show that there is a correlation between the application-level data reuse distance and the migration frequency. Future work aims to solidify this correlation and build a profiling solution that provides users with the data movement frequency that dynamic data management solutions can then use to enhance performance.
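The reuse-distance signal mentioned above can be computed with a classic LRU stack-distance pass; the naive version below favors clarity over speed (production profilers use tree-based structures) and is not the authors' profiler.

```cpp
// Naive LRU stack-distance sketch: for each access, the reuse distance is the
// number of distinct pages touched since the previous access to the same page
// (SIZE_MAX marks a first-time, "cold" access). O(n*m); clarity over speed.
#include <algorithm>
#include <cstdint>
#include <deque>
#include <vector>

std::vector<size_t> reuse_distances(const std::vector<uint64_t>& page_trace) {
    std::deque<uint64_t> lru_stack;            // most recently used page at the front
    std::vector<size_t> dist;
    dist.reserve(page_trace.size());
    for (uint64_t page : page_trace) {
        auto it = std::find(lru_stack.begin(), lru_stack.end(), page);
        if (it == lru_stack.end()) {
            dist.push_back(SIZE_MAX);          // cold access: infinite reuse distance
        } else {
            dist.push_back(size_t(it - lru_stack.begin()));
            lru_stack.erase(it);
        }
        lru_stack.push_front(page);            // this page is now most recently used
    }
    return dist;
}
```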