

Title: Efficient Heap Data Management on Software Managed Manycore Architectures
Software Managed Manycore (SMM) architectures have been proposed as a solution for scaling the memory architecture. In a typical SMM architecture, Scratch Pad Memories (SPMs) are used instead of caches, and data must be explicitly managed in software. While all code and data need to be managed, heap management on SMMs is especially challenging due to the highly dynamic nature of heap data access. Existing techniques spend over 90% of execution time on heap data management, which largely compromises the power efficiency of SMM architectures. This paper presents efficient compiler-based techniques that reduce heap management overhead. Experimental results on benchmarks from MiBench executing on an SMM processor modeled in gem5 demonstrate that our approach, implemented in LLVM 3.8, can improve execution time by an average of 80% compared to the state of the art.
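The kind of management the abstract describes replaces each heap access with a call into a small runtime that keeps recently used heap blocks resident in the SPM. The sketch below is a minimal, single-core illustration of that idea in plain C: a direct-mapped software cache in front of main-memory heap data, with memcpy standing in for DMA. The function name heap_access, the block geometry, and the write-back policy are assumptions for illustration, not the paper's actual runtime interface.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative sketch only: a direct-mapped software cache for heap data,
 * standing in for the access wrapper an SMM compiler inserts before every
 * heap load/store.  Names, sizes, and memcpy-for-DMA are assumptions. */

#define BLOCK_SIZE 64                      /* bytes per managed heap block   */
#define NUM_BLOCKS 32                      /* blocks kept in the SPM region  */

_Alignas(64) static uint8_t spm_heap[NUM_BLOCKS][BLOCK_SIZE]; /* SPM space  */
static uintptr_t tag[NUM_BLOCKS];          /* main-memory block addresses    */
static int       valid[NUM_BLOCKS];
static int       dirty[NUM_BLOCKS];

/* Translate a main-memory heap address to an SPM address, fetching (and
 * evicting) blocks as needed.  A real SMM runtime would use DMA transfers. */
static void *heap_access(void *global_ptr, int is_write)
{
    uintptr_t addr = (uintptr_t)global_ptr;
    uintptr_t base = addr & ~(uintptr_t)(BLOCK_SIZE - 1);
    size_t    idx  = (base / BLOCK_SIZE) % NUM_BLOCKS;     /* direct-mapped */

    if (!valid[idx] || tag[idx] != base) {
        if (valid[idx] && dirty[idx])                       /* write back   */
            memcpy((void *)tag[idx], spm_heap[idx], BLOCK_SIZE);
        memcpy(spm_heap[idx], (void *)base, BLOCK_SIZE);    /* fetch block  */
        tag[idx]   = base;
        valid[idx] = 1;
        dirty[idx] = 0;
    }
    dirty[idx] |= is_write;
    return &spm_heap[idx][addr - base];
}

int main(void)
{
    int *a = aligned_alloc(BLOCK_SIZE, 128 * sizeof *a);    /* main memory  */
    for (int i = 0; i < 100; i++)
        *(int *)heap_access(&a[i], 1) = i;   /* wrapper before each access  */
    printf("a[42] via SPM = %d\n", *(int *)heap_access(&a[42], 0));
    free(a);
    return 0;
}
```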
Award ID(s):
1723476 1525855
NSF-PAR ID:
10108944
Author(s) / Creator(s):
Date Published:
Journal Name:
2019 32nd International Conference on VLSI Design and 2019 18th International Conference on Embedded Systems (VLSID)
Page Range / eLocation ID:
269 to 274
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Multicore systems should support both speculative and non-speculative parallelism. Speculative parallelism is easy to use and is crucial to scale many challenging applications, while non-speculative parallelism is more efficient and allows parallel irrevocable actions (e.g., parallel I/O). Unfortunately, prior techniques are far from this goal. Hardware transactional memory (HTM) systems support speculative (transactional) and non-speculative (non-transactional) work, but lack coordination mechanisms between the two, and are limited to unordered parallelism. Prior work has extended HTMs to avoid the limitations of speculative execution, e.g., through escape actions and open-nested transactions. But these mechanisms are incompatible with systems that exploit ordered parallelism, which parallelize a broader range of applications and are easier to use. We contribute two techniques that enable seamlessly composing and coordinating speculative and non-speculative work in the context of ordered parallelism: (i) a task-based execution model that efficiently coordinates concurrent speculative and non-speculative ordered tasks, allowing them to create tasks of either kind and to operate on shared data; and (ii) a safe way for speculative tasks to invoke software-managed speculative actions that avoid hardware version management and conflict detection. These contributions improve efficiency and enable new capabilities. Across several benchmarks, they allow the system to dynamically choose whether to execute tasks speculatively or non-speculatively, avoid needless conflicts among speculative tasks, and allow speculative tasks to safely invoke irrevocable actions. 
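As a rough illustration of the execution model that abstract describes, the sketch below shows an ordered task queue in which each task is dispatched either speculatively (with an undo log so the runtime could roll it back) or non-speculatively (writing shared data directly and free to perform irrevocable I/O). It is a single-threaded, made-up sketch: the API names are assumptions, no conflicts ever arise here, and the real system runs tasks in parallel with hardware conflict detection.

```c
#include <stdio.h>

/* Minimal single-threaded sketch of an ordered task model with speculative
 * and non-speculative tasks.  Names and API shape are illustrative only. */

typedef struct Task {
    unsigned long ts;                 /* program-order timestamp            */
    int speculative;                  /* 1 = run with undo logging          */
    void (*fn)(struct Task *);
    long arg;
} Task;

#define MAX_TASKS 64
static Task queue[MAX_TASKS];
static int  n_tasks;

static void spawn(unsigned long ts, int speculative,
                  void (*fn)(Task *), long arg)
{
    queue[n_tasks++] = (Task){ ts, speculative, fn, arg };
}

/* Shared state plus a one-entry undo log for speculative writes. */
static long counter;
static long undo_old;

static void add_to_counter(Task *t)
{
    if (t->speculative)
        undo_old = counter;           /* version the write for rollback     */
    counter += t->arg;
    if (t->speculative) {
        /* A conflict would restore undo_old here; this sequential sketch
         * never conflicts, so the task simply commits. */
        (void)undo_old;
    }
}

static void log_progress(Task *t)
{
    /* Irrevocable action (I/O): only issued from a non-speculative task.  */
    printf("counter is now %ld (task ts=%lu)\n", counter, t->ts);
}

int main(void)
{
    spawn(1, 1, add_to_counter, 10);  /* speculative ordered task           */
    spawn(2, 1, add_to_counter, 32);  /* speculative ordered task           */
    spawn(3, 0, log_progress, 0);     /* non-speculative, does the printing */

    /* Dispatch in timestamp order: always run the earliest remaining task. */
    while (n_tasks > 0) {
        int best = 0;
        for (int i = 1; i < n_tasks; i++)
            if (queue[i].ts < queue[best].ts) best = i;
        Task t = queue[best];
        queue[best] = queue[--n_tasks];
        t.fn(&t);
    }
    return 0;
}
```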
  2. Applications of neural networks have gained significant importance in embedded mobile devices and Internet of Things (IoT) nodes. In particular, convolutional neural networks have emerged as one of the most powerful techniques in computer vision, speech recognition, and AI applications that can improve the mobile user experience. However, satisfying all power and performance requirements of such low power devices is a significant challenge. Recent work has shown that binarizing a neural network can significantly improve the memory requirements of mobile devices at the cost of minor loss in accuracy. This paper proposes MB-CNN, a memristive accelerator for binary convolutional neural networks that performs XNOR convolution in situ within novel 2R memristive data blocks to improve power, performance, and memory requirements of embedded mobile devices. The proposed accelerator achieves at least 13.26 × , 5.91 × , and 3.18 × improvements in the system energy efficiency (computed as energy × delay) over the state-of-the-art software, GPU, and PIM architectures, respectively. The solution architecture, which integrates CPU, GPU, and MB-CNN, outperforms every other configuration in terms of system energy and execution time. 
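Binary CNNs replace multiply-accumulate with XNOR and popcount over bit-packed ±1 weights and activations; the accelerator above evaluates that operation inside memristive arrays. A minimal software reference of the XNOR-popcount dot product follows, with the 32-bit packing and toy operands chosen purely for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the XNOR-popcount arithmetic that binary CNNs use in place of
 * multiply-accumulate.  Each bit encodes +1 (bit = 1) or -1 (bit = 0). */

static int popcount32(uint32_t x)
{
    int n = 0;
    while (x) { n += x & 1u; x >>= 1; }
    return n;
}

/* Dot product of two length-32 binarized vectors packed into one word:
 * matches = popcount(~(a ^ w)); result = matches - mismatches = 2*matches - 32 */
static int xnor_dot32(uint32_t a, uint32_t w)
{
    int matches = popcount32(~(a ^ w));
    return 2 * matches - 32;
}

int main(void)
{
    uint32_t act = 0xF0F0F0F0u;      /* binarized activations (toy values) */
    uint32_t wgt = 0xFF00FF00u;      /* binarized weights                  */
    printf("binary dot product = %d\n", xnor_dot32(act, wgt));
    return 0;
}
```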
  3. With the end of Dennard scaling, power constraints have led to increasing compute specialization in the form of differently specialized accelerators integrated at various levels of the general-purpose system hierarchy. The result is that the most common general-purpose computing platform is now a heterogeneous mix of architectures even within a single die. Consequently, mapping application code regions into available execution engines has become a challenge due to different interfaces and increased software complexity. At the same time, the energy costs of data movement have become increasingly dominant relative to computation energy. This has inspired a move towards data-centric systems, where computation is brought to data, in contrast to traditional processing-centric models. However, enabling compute nearer memory entails its own challenges, including the interactions between distance-specialization and compute-specialization. The granularity of any offload to near(er) memory logic would impact the potential data transmission reduction, as smaller offloads will not be able to amortize the transmission costs of invocation and data return, while very large offloads can only be mapped onto logic that can support all of the necessary operations within kernel-scale codes, which exacerbates both area and power constraints. For better energy efficiency, each set of related operations should be mapped onto the execution engine that, among those capable of running the set of operations, best balances the data movement and the degree of compute specialization of that engine for this code. Further, this offload should proceed in a decentralized way that keeps both the data and control movement low for all transitions among engines and transmissions of operands and results. To enable such a decentralized offload model, we propose an architecture interface that enables a common offload model for accelerators across the memory hierarchy and a tool chain to automatically identify (in a distance-aware fashion) and map profitable code regions on specialized execution engines. We evaluate the proposed architecture for a wide range of workloads and show energy reduction compared to an energy-efficient in-order core. We also demonstrate better area efficiency compared to kernel-scale offloads. 
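The amortization argument in that abstract (small offloads cannot cover the fixed cost of invocation and data return) can be made concrete with a back-of-the-envelope model. The constants below are made up for illustration only; they are not measurements from the paper.

```c
#include <stdio.h>

/* Toy model: an offload to near-memory logic pays a fixed invocation/return
 * cost and saves some data-movement energy per element processed.  All
 * numbers are invented to show where the break-even granularity falls. */

int main(void)
{
    double invoke_cost_nj    = 50.0;  /* fixed cost to invoke + return      */
    double saved_per_elem_nj = 0.8;   /* data-movement energy saved/element */

    for (int elems = 16; elems <= 1024; elems *= 2) {
        double net = saved_per_elem_nj * elems - invoke_cost_nj;
        printf("%5d elements: net benefit %+7.1f nJ (%s)\n",
               elems, net,
               net > 0 ? "offload pays off" : "too small to amortize");
    }
    return 0;
}
```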
  4. Karim Ali and Jan Vitek (Eds.)
    Using a stack for managing the local state of procedures, as popularized by Algol, is a simple but effective way to achieve a primitive form of automatic memory management. Hence, the call stack remains the backbone of most programming language runtimes to the present day. However, the appealing simplicity of the call stack model comes at the price of strictly enforced limitations: since every function return pops the stack, it is difficult to return stack-allocated data from a callee upwards to its caller – especially variable-size data such as closures. This paper proposes a solution by introducing a small tweak to the usual stack semantics. We design a type system that tracks the underlying storage mode of values, and when a function returns a stack-allocated value, we just don’t pop the stack! Instead, the stack frame is de-allocated together with its parent the next time a heap-allocated value or primitive is returned. We identify a range of use cases where this delayed-popping strategy is beneficial, ranging from closures to trait objects to other types of variable-size data. Our evaluation shows that this execution model reduces heap and GC pressure and recovers the spatial locality of programs, improving execution time by between 10% and 25% with respect to standard execution. 
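The delayed-popping idea can be simulated with a manual bump allocator: a callee that returns stack-allocated data leaves its frame in place, and the frame is reclaimed together with its parent's once some enclosing call returns a primitive. The sketch below is a hand-rolled C approximation of that protocol; the real system enforces it in the language runtime via a type system tracking storage modes, and the names here are assumptions.

```c
#include <stdio.h>
#include <string.h>

/* Hand-simulated "don't pop on stack-returned values" protocol. */

#define STACK_BYTES 4096
static char   stack_mem[STACK_BYTES];
static size_t sp;                      /* bump pointer: next free byte       */

static void *salloc(size_t n)          /* allocate in the current frame      */
{
    void *p = &stack_mem[sp];
    sp += n;
    return p;
}

/* Returns variable-size, stack-allocated data: its frame is NOT popped on
 * return, so the buffer stays valid for the caller. */
static char *make_greeting(const char *name)
{
    char *buf = salloc(strlen(name) + 8);
    sprintf(buf, "hello, %s", name);
    return buf;                        /* return without popping             */
}

/* Returns a primitive: popping back to the frame base seen on entry also
 * frees any un-popped frames left behind by earlier callees. */
static size_t greeting_length(const char *name)
{
    size_t base = sp;                  /* frame base at entry                 */
    char *g = make_greeting(name);     /* callee leaves its frame in place    */
    size_t len = strlen(g);
    sp = base;                         /* delayed pop: frees both frames      */
    return len;
}

int main(void)
{
    size_t base = sp;
    char *g = make_greeting("world");  /* survives the call: frame not popped */
    printf("%s (still live, sp=%zu)\n", g, sp);
    sp = base;                         /* caller eventually reclaims it       */

    size_t len = greeting_length("manycore");
    printf("length = %zu, sp back to %zu\n", len, sp);
    return 0;
}
```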
  5.
    Software traceability establishes and leverages associations between diverse development artifacts. Researchers have proposed the use of deep learning trace models to link natural language artifacts, such as requirements and issue descriptions, to source code; however, their effectiveness has been restricted by the availability of labeled data and by runtime efficiency. In this study, we propose a novel framework called Trace BERT (T-BERT) to generate trace links between source code and natural language artifacts. To address data sparsity, we leverage a three-step training strategy to enable trace models to transfer knowledge from a closely related Software Engineering challenge, which has a rich dataset, to produce trace links with much higher accuracy than has previously been achieved. We then apply the T-BERT framework to recover links between issues and commits in Open Source Projects. We comparatively evaluated the accuracy and efficiency of three BERT architectures. Results show that a Single-BERT architecture generated the most accurate links, while a Siamese-BERT architecture produced comparable results with significantly less execution time. Furthermore, by learning and transferring knowledge, all three models in the framework outperform classical IR trace models. On the three evaluated real-world OSS projects, the best T-BERT consistently outperformed the VSM model with average improvements of 60.31% measured using Mean Average Precision (MAP). The RNN model severely underperformed on these projects due to insufficient training data, while T-BERT overcame this problem by using pretrained language models and transfer learning. 
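The improvements above are reported in Mean Average Precision (MAP), which averages, over all queries (e.g., issues), the precision measured at each rank where a true trace link appears in the candidate list. A small sketch of that computation follows; the relevance judgments are made-up toy data, not results from the paper.

```c
#include <stdio.h>

/* Mean Average Precision over ranked trace-link candidate lists. */

/* Average precision for one query: mean of precision@k at each relevant hit. */
static double average_precision(const int *relevant, int n)
{
    int hits = 0;
    double sum = 0.0;
    for (int k = 0; k < n; k++) {
        if (relevant[k]) {
            hits++;
            sum += (double)hits / (k + 1);   /* precision at this rank */
        }
    }
    return hits ? sum / hits : 0.0;
}

int main(void)
{
    /* Each row: ranked candidates for one issue, 1 = true linked commit. */
    int q1[] = { 1, 0, 1, 0, 0 };
    int q2[] = { 0, 1, 0, 0, 1 };
    double map = (average_precision(q1, 5) + average_precision(q2, 5)) / 2.0;
    printf("MAP over 2 queries = %.3f\n", map);
    return 0;
}
```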