Reinforcement learning has seen wide success in finetuning large language models to better align with instructions via human feedback. The so-called algorithm, Reinforcement Learning with Human Feedback (RLHF) demonstrates impressive performance on the GPT series models. However, the underlying reinforcement learning algorithm is complex and requires additional training for reward and value networks. In this paper, we consider an alternative approach: converting feedback to instruction by relabeling the original one and training the model for better alignment in a supervised manner. Such an algorithm doesn’t require any additional parameters except for the original language model and maximally reuses the pretraining pipeline. To achieve this, we formulate instruction alignment problem for language models as a goal-reaching problem in decision making. We propose Hindsight Instruction Relabeling (HIR), a novel algorithm for aligning language models with instructions. The resulting two-stage algorithm shed light to a family of reward-free approaches that utilize the hindsightly relabeled instructions based on feedback. We evaluate the performance of HIR extensively on 12 challenging BigBench reasoning tasks and show that HIR outperforms the baseline algorithms and is comparable to or even surpasses supervised fine-tuning. The implementation of HIR is available at https://github.com/tianjunz/HIR.
more »
« less
PDL: a high-level hardware design language for pipelined processors
Processors are typically designed in Register Transfer Level (RTL) languages, which give designers low-level control over circuit structure and timing. To achieve good performance, processors are pipelined, with multiple instructions executing concurrently in different parts of the circuit. Thus even though processors implement a fundamentally sequential specification (the instruction set architecture), the implementation is highly concurrent. The interactions of multiple instructions---potentially speculative---can cause incorrect behavior. We present PDL, a novel hardware description language targeted at the construction of pipelined processors. PDL provides one instruction at a time semantics: the first language to enforce that the generated pipelined circuit has the same behavior as a sequential specification. This enforcement facilitates design-space exploration. Adding or removing pipeline stages, moving operations across stages, or otherwise changing pipeline structure normally requires careful analysis of bypass paths and stall logic; with PDL, this analysis is handled by the PDL compiler. At the same time, PDL still offers designers fine-grained control over performance-critical microarchitectural choices such as timing of operations, data forwarding, and speculation. We demonstrate PDL's expressive power and ease of design exploration by implementing several RISC-V cores with differing microarchitectures. Our results show that PDL does not impose significant performance or area overhead compared to a standard HDL.
more »
« less
- Award ID(s):
- 1717554
- PAR ID:
- 10389500
- Date Published:
- Journal Name:
- ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI)
- Page Range / eLocation ID:
- 719 to 732
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Demonstrations and natural language instructions are two common ways to specify and teach robots novel tasks. However, for many complex tasks, a demonstration or language instruction alone contains ambiguities, preventing tasks from being specified clearly. In such cases, a combination of both a demonstration and an instruction more concisely and effectively conveys the task to the robot than either modality alone. To instantiate this problem setting, we train a single multi-task policy on a few hundred challenging robotic pick-and-place tasks and propose DeL-TaCo (Joint Demo-Language Task Conditioning), a method for conditioning a robotic policy on task embeddings comprised of two components: a visual demonstration and a language instruction. By allowing these two modalities to mutually disambiguate and clarify each other during novel task specification, DeL-TaCo (1) substantially decreases the teacher effort needed to specify a new task and (2) achieves better generalization performance on novel objects and instructions over previous task-conditioning methods. To our knowledge, this is the first work to show that simultaneously conditioning a multi-task robotic manipulation policy on both demonstration and language embeddings improves sample efficiency and generalization over conditioning on either modality alone. See additional materials at https://sites.google.com/view/del-taco-learningmore » « less
-
null (Ed.)This paper presents a novel approach to robot task learning from language-based instructions, which focuses on increasing the complexity of task representations that can be taught through verbal instruction. The major proposed contribution is the development of a framework for directly mapping a complex verbal instruction to an executable task representation, from a single training experience. The method can handle the following types of complexities: 1) instructions that use conjunctions to convey complex execution constraints (such as alternative paths of execution, sequential or nonordering constraints, as well as hierarchical representations) and 2) instructions that use prepositions and multiple adjectives to specify action/object parameters relevant for the task. Specific algorithms have been developed for handling conjunctions, adjectives and prepositions as well as for translating the parsed instructions into parameterized executable task representations. The paper describes validation experiments with a PR2 humanoid robot learning new tasks from verbal instruction, as well as an additional range of utterances that can be parsed into executable controllers by the proposed system.more » « less
-
Energy-efficient bitcoin mining cores have gained significant attention since the energy cost for computing dominates the mining expenses [1]. Ultra-low-voltage (ULV) digital circuits have emerged as an attractive approach to improve the energy-efficiency. However, they demand a large timing margin for the worst-case process, voltage, and temperature (PVT) variations, undermining a significant portion of energy savings. Recent works, including multi-phase latch pipeline [1], tunable replica circuits [2]–[3], in-situ error detection and correction (EDAC) [4]–[6], and dynamic timing enhancement [7], can reduce the pessimistic margin. However, it is not straightforward to adopt those techniques in mining cores due to their deeply-pipelined architecture (up to 128 stages [1]). For example, to adopt EDAC, the deep pipeline requires inserting many bulky error detectors as it has many critical paths. Our experiment with a 0.3V 28-nm mining core shows >18.9% registers need to be replaced with error detectors, considering 6σ local process variation only. Also, multiple stages can have timing errors simultaneously, making an error correction process (e.g., clock gating [5], VDD boosting [6]) complex and costly.more » « less
-
The tidal waves of modern electronic/electrical devices have led to increasing demands for ubiquitous application-specific power converters. A conventional manual design procedure of such power converters is computation- and labor-intensive, which involves selecting and connecting component devices, tuning component-wise parameters and control schemes, and iteratively evaluating and optimizing the design. To automate and speed up this design process, we propose an automatic framework that designs custom power converters from design specifications using Monte Carlo Tree Search. Specifically, the framework embraces the upper-confidence-bound-tree (UCT), a variant of Monte Carlo Tree Search, to automate topology space exploration with circuit design specification-encoded reward signals. Moreover, our UCT-based approach can exploit small offline data via the specially designed default policy and can run in parallel to accelerate topology space exploration. Further, it utilizes a hybrid circuit evaluation strategy to substantially reduce design evaluation costs. Empirically, we demonstrated that our framework could generate energy-efficient circuit topologies for various target voltage conversion ratios. Compared to existing automatic topology optimization strategies, the proposed method is much more computationally efficient—the sequential version can generate topologies with the same quality while being up to 67% faster. The parallelization schemes can further achieve high speedups compared to the sequential version.more » « less
An official website of the United States government

