- Award ID(s):
- 1527692
- PAR ID:
- 10026255
- Date Published:
- Journal Name:
- Proceedings of the 22Nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
- Page Range / eLocation ID:
- 145 to 161
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Cybercrime scene reconstruction that aims to reconstruct a previous execution of the cyber attack delivery process is an important capability for cyber forensics (e.g., post mortem analysis of the cyber attack executions). Unfortunately, existing techniques such as log-based forensics or record-and-replay techniques are not suitable to handle complex and long-running modern applications for cybercrime scene reconstruction and post mortem forensic analysis. Specifically, log-based cyber forensics techniques often suffer from a lack of inspection capability and do not provide details of how the attack unfolded. Record-and-replay techniques impose significant runtime overhead, often require significant modifications on end-user systems, and demand to replay the entire recorded execution from the beginning. In this paper, we propose C2SR, a novel technique that can reconstruct an attack delivery chain (i.e., cybercrime scene) for post-mortem forensic analysis. It provides a highly desired capability: interactable partial execution reconstruction. In particular, it reproduces a partial execution of interest from a large execution trace of a long-running program. The reconstructed execution is also interactable, allowing forensic analysts to leverage debugging and analysis tools that did not exist on the recorded machine. The key intuition behind C2SR is partitioning an execution trace by resources and reproducing resource accesses that are consistent with the original execution. It tolerates user interactions required for inspections that do not cause inconsistent resource accesses. Our evaluation results on 26 real-world programs show that C2SR has low runtime overhead (less than 5.47%) and acceptable space overhead. We also demonstrate with four realistic attack scenarios that C2SR successfully reconstructs partial executions of long-running applications such as web browsers, and it can remarkably reduce the user’s efforts to understand the incident.more » « less
-
Abstract Motivation General-purpose processors can now contain many dozens of processor cores and support hundreds of simultaneous threads of execution. To make best use of these threads, genomics software must contend with new and subtle computer architecture issues. We discuss some of these and propose methods for improving thread scaling in tools that analyze each read independently, such as read aligners.
Results We implement these methods in new versions of Bowtie, Bowtie 2 and HISAT. We greatly improve thread scaling in many scenarios, including on the recent Intel Xeon Phi architecture. We also highlight how bottlenecks are exacerbated by variable-record-length file formats like FASTQ and suggest changes that enable superior scaling.
Availability and implementation Experiments for this study: https://github.com/BenLangmead/bowtie-scaling.
Bowtie http://bowtie-bio.sourceforge.net .
Bowtie 2 http://bowtie-bio.sourceforge.net/bowtie2 .
HISAT http://www.ccb.jhu.edu/software/hisat
Supplementary information Supplementary data are available at Bioinformatics online.
-
null (Ed.)Because of its many desirable properties, such as its ability to control effects and thus potentially disastrous race conditions, functional programming offers a viable approach to programming modern multicore computers. Over the past decade several parallel functional languages, typically based on dialects of ML and Haskell, have been developed. These languages, however, have traditionally underperformed procedural languages (such as C and Java). The primary reason for this is their hunger for memory, which only grows with parallelism, causing traditional memory management techniques to buckle under increased demand for memory. Recent work opened a new angle of attack on this problem by identifying a memory property of determinacy-race-free parallel programs, called disentanglement, which limits the knowledge of concurrent computations about each other’s memory allocations. The work has showed some promise in delivering good time scalability. In this paper, we present provably space-efficient automatic memory management techniques for determinacy-race-free functional parallel programs, allowing both pure and imperative programs where memory may be destructively updated. We prove that for a program with sequential live memory of R * , any P -processor garbage-collected parallel run requires at most O ( R * · P ) memory. We also prove a work bound of O ( W + R * P ) for P -processor executions, accounting also for the cost of garbage collection. To achieve these results, we integrate thread scheduling with memory management. The idea is to coordinate memory allocation and garbage collection with thread scheduling decisions so that each processor can allocate memory without synchronization and independently collect a portion of memory by consulting a collection policy, which we formulate. The collection policy is fully distributed and does not require communicating with other processors. We show that the approach is practical by implementing it as an extension to the MPL compiler for Parallel ML. Our experimental results confirm our theoretical bounds and show that the techniques perform and scale well.more » « less
-
As the era of high frequency, single core processors have come to a close, the new paradigm of many core processors has come to dominate. In response to these systems, asynchronous multitasking runtime systems have been developed as a promising solution to efficiently utilize these newly available hardware. Asynchronous multitasking runtime systems work by dividing a problem into a large number of fine grained tasks. However, as the number of tasks created increase, the overheads associated with task creation and management cannot be ignored. Task inlining, a method where the parent thread consumes a child thread, enables the runtime system to achieve the balance between parallelism and its overhead. As largely impacted by different processor architectures, the decision of task inlining is dynamic in nature. In this research, we present adaptive techniques for deciding, at runtime, whether a particular task should be inlined or not. We present two policies, a baseline policy that makes inlining decision based on a fixed threshold and an adaptive policy which decides the threshold dynamically at runtime. We also evaluate and justify the performance of these policies on different processor architectures. To the best of our knowledge, this is the first study of the impacts of adaptive policy at runtime for task inlining in an asynchronous multitasking runtime system on different processor architectures. From experimentation, we find that the baseline policy improves the execution time from 7.61% to 54.09%. Furthermore, the adaptive policy improves over the baseline policy by up to 74%.more » « less
-
Heterogeneity in computing systems is clearly increasing, especially as “accelerators” burrow deeper and deeper into different parts of an architecture. What is new, however, is a rapid change in not only the number of such heterogeneous processors, but in their connectivity to other structures, such as cores with different ISAs or smart memory interfaces. Technologies such as chiplets are accelerating this trend. This paper is focused on the problem of how to architect efficient systems that combine multiple heterogeneous concurrent threads, especially when the underlying heterogeneous cores are separated by networks or have no shared-memory access paths. The goal is to eliminate today’s need to invoke significant software stacks to cross any of these boundaries. A suggestion is made of using migrating threads as the glue. Two experiments are described: using a heterogeneous platform where all threads share the same memory to solve a rich ML problem, and a fast PageRank approximation that mirrors the kind of computation for which thread migration may be useful. Architectural “lessons learned” are developed that should help guide future development of such systems.}more » « less