Memory profiling captures programs’ dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation effort. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a framework for streamlined development of fast memory profilers. With PROMPT, developers need only specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported with PROMPT where all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling overhead was improved by 5.3× and 7.1× respectively. To further underscore PROMPT’s impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In 570 lines of code, this redesigned workflow satisfies the client’s memory profiling needs while achieving more than 90% reduction in profiling overhead and improved robustness compared to the original profilers.
more »
« less
Sampling optimized code for type feedback
To efficiently execute dynamically typed languages, many language implementations have adopted a two-tier architecture. The first tier aims for low-latency startup times and collects dynamic profiles, such as the dynamic types of variables. The second tier provides high-throughput using an optimizing compiler that specializes code to the recorded type information. If the program behavior changes to the point that not previously seen types occur in specialized code, that specialized code becomes invalid, it is deoptimized, and control is transferred back to the first tier execution engine which will start specializing anew. However, if the program behavior becomes more specific, for instance, if a polymorphic variable becomes monomorphic, nothing changes. Once the program is running optimized code, there are no means to notice that an opportunity for optimization has been missed. We propose to employ a sampling-based profiler to monitor native code without any instrumentation. The absence of instrumentation means that when the profiler is not active, no overhead is incurred. We present an implementation is in the context of the Ř just-in-time, optimizing compiler for the R language. Based on the sampled profiles, we are able to detect when the native code produced by Ř is specialized for stale type feedback and recompile it to more type-specific code. We show that sampling adds an overhead of less than 3% in most cases and up to 9% in few cases and that it reliably detects stale type feedback within milliseconds.
more »
« less
- Award ID(s):
- 1759736
- PAR ID:
- 10309561
- Date Published:
- Journal Name:
- Proceedings of the 16th ACM SIGPLAN International Symposium on Dynamic Languages
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Coverage-guided fuzzing is one of the most effective software security testing techniques. Fuzzing takes on one of two forms: compiler-based or binary-only, depending on the availability of source code. While the fuzzing community has improved compiler-based fuzzing with performance- and feedback-enhancing program transformations, binary-only fuzzing lags behind due to the semantic and performance limitations of instrumenting code at the binary level. Many fuzzing use cases are binary-only (i.e., closed source). Thus, applying fuzzing-enhancing program transformations to binary-only fuzzing—without sacrificing performance—remains a compelling challenge. This paper examines the properties required to achieve compiler-quality binary-only fuzzing instrumentation. Based on our findings, we design ZAFL: a platform for applying fuzzing-enhancing program transformations to binary-only targets—maintaining compiler-level performance. We showcase ZAFL's capabilities in an implementation for the popular fuzzer AFL, including five compiler-style fuzzing-enhancing transformations, and evaluate it against the leading binary-only fuzzing instrumenters AFL-QEMU and AFL-Dyninst. Across LAVA-M and real-world targets, ZAFL improves crash-finding by 26–96% and 37–131%; and throughput by 48– 78% and 159–203% compared to AFL-Dyninst and AFL-QEMU, respectively—while maintaining compiler-level of overhead of 27%. We also show that ZAFL supports real-world open- and closed-source software of varying size (10K– 100MB), complexity (100–1M basic blocks), platform (Linux and Windows), and format (e.g., stripped and PIC).more » « less
-
Gradually typed programming languages permit the incremental addition of static types to untyped programs. To remain sound, languages insert run-time checks at the boundaries between typed and untyped code. Unfortunately, performance studies have shown that the overhead of these checks can be disastrously high, calling into question the viability of sound gradual typing. In this paper, we show that by building on existing work on soft contract verification, we can reduce or eliminate this overhead. Our key insight is that while untyped code cannot be trusted by a gradual type system, there is no need to consider only the worst case when optimizing a gradually typed program. Instead, we statically analyze the untyped portions of a gradually typed program to prove that almost all of the dynamic checks implied by gradual type boundaries cannot fail, and can be eliminated at compile time. Our analysis is modular, and can be applied to any portion of a program. We evaluate this approach on a dozen existing gradually typed programs previously shown to have prohibitive performance overhead—with a median overhead of 2.5× and up to 80.6× in the worst case—and eliminate all overhead in most cases, suffering only 1.5× overhead in the worst case.more » « less
-
Compilers face an intrinsic tradeoff between compilation speed and code quality. The tradeoff is particularly stark in a dynamic setting where JIT compilation time contributes to application runtime. Many systems now employ multiple compilation tiers, where one tier offers fast compile speed while another has much slower compile speed but produces higher quality code. With proper heuristics on when to use each, the overall performance is better than using either compiler in isolation. At the introduction of WebAssembly into the Web platform in 2017, most engines employed optimizing compilers and pre-compiled entire modules before execution. Yet since that time, all Web engines have introduced new “baseline” compiler tiers for Wasm to improve startup time. Further, many new non-web engines have appeared, some of which also employ simple compilers. In this paper, we demystify single-pass compilers for Wasm, explaining their internal algorithms and tradeoffs, as well as providing a detailed empirical study of those employed in production. We show the design of a new single-pass compiler for a research Wasm engine that integrates with an in-place interpreter and host garbage collector using value tags, while also supporting flexible instrumentation. In experiments, we measure the effectiveness of optimizations targeting value tags and find, somewhat surprisingly, that the runtime overhead can be reduced to near zero. We also assess the relative compile speed and execution time of six baseline compilers and place these baseline compilers in a two-dimensional tradeoff space with other execution tiers for Wasm.more » « less
-
A key strength of managed runtimes over hardware is the ability to gain detailed insight into the dynamic execution of programs with instrumentation. Analyses such as code coverage, execution frequency, tracing, and debugging, are all made easier in a virtual setting. As a portable, low-level byte-code, WebAssembly offers inexpensive in-process sandboxing with high performance. Yet to date, Wasm engines have not offered much insight into executing programs, supporting at best bytecode-level stepping and basic source maps, but no instrumentation capabilities. In this paper, we show the first non-intrusive dynamic instrumentation system for WebAssembly in the open-source Wizard Research Engine. Our innovative design offers a flexible, complete hierarchy of instrumentation primitives that support building high-level, complex analyses in terms of low-level, programmable probes. In contrast to emulation or machine code instrumentation, injecting probes at the bytecode level increases expressiveness and vastly simplifies the implementation by reusing the engine's JIT compiler, interpreter, and deoptimization mechanism rather than building new ones. Wizard supports both dynamic instrumentation insertion and removal while providing consistency guarantees, which is key to composing multiple analyses without interference. We detail a fully-featured implementation in a high-performance multi-tier Wasm engine, show novel optimizations specifically designed to minimize instrumentation overhead, and evaluate performance characteristics under load from various analyses. This design is well-suited for production engine adoption as probes can be implemented to have no impact on production performance when not in use.more » « less
An official website of the United States government

