We propose CURE, a deep reinforcement learning (DRL)-based NoC design framework that simultaneously reduces network latency, improves energy efficiency, and tolerates both transient errors and permanent faults. CURE combines several architectural innovations with a DRL-based hardware controller that manages design complexity and optimizes trade-offs. First, we propose reversible multi-function adaptive channels (RMCs) to reduce NoC power consumption and network latency. Second, we implement new fault-secure adaptive error-correction hardware in each router to enhance reliability against both transient errors and permanent faults. Third, we propose a router power-gating and bypass design that powers off NoC components to reduce power and extend chip lifespan. Finally, to manage the complex dynamic interactions among these techniques, we use DRL to train a proactive control policy that improves fault tolerance, reduces power consumption, and increases performance. Simulations using the PARSEC benchmarks show that CURE reduces end-to-end packet latency by 39%, improves energy efficiency by 92%, and lowers static and dynamic power consumption by 24% and 38%, respectively, over conventional solutions. Using mean time to failure, we show that CURE is 7.7x more reliable than a conventional NoC design.
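To make the control loop concrete, the sketch below uses tabular Q-learning as a stand-in for the paper's DRL agent. The state discretization, action set, and reward weights are all illustrative assumptions, not CURE's actual design, and apply_action stands in for the simulated NoC environment.

```python
import numpy as np

# Assumed discretization: traffic level (0-2) x error-rate level (0-2) x router on/off.
N_STATES = 3 * 3 * 2
ACTIONS = ["keep", "power_gate", "wake_and_bypass", "raise_ecc"]  # assumed action set

rng = np.random.default_rng(0)
q_table = np.zeros((N_STATES, len(ACTIONS)))
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1  # learning rate, discount factor, exploration rate

def encode(traffic, errors, powered):
    # Map observed per-router metrics to a discrete state index.
    return (traffic * 3 + errors) * 2 + powered

def reward(latency, power, faults):
    # Assumed reward: jointly penalize packet latency, power draw, and fault events.
    return -(latency + 0.5 * power + 5.0 * faults)

def q_step(state, apply_action):
    # Epsilon-greedy choice over the per-router actions.
    a = rng.integers(len(ACTIONS)) if rng.random() < EPS else int(q_table[state].argmax())
    latency, power, faults, next_state = apply_action(ACTIONS[a])  # simulated NoC step
    target = reward(latency, power, faults) + GAMMA * q_table[next_state].max()
    q_table[state, a] += ALPHA * (target - q_table[state, a])      # Q-learning update
    return next_state
```

A deep network would replace the Q-table once the state space (per-router traffic, error, and power readings) grows too large to enumerate, which is the regime the paper targets.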
DDPC: Automated Data-Driven Power-Performance Controller Design on-the-fly for Latency-sensitive Web Services
Traditional power-reduction techniques such as DVFS or RAPL are challenging to use with web services because they significantly affect the services' latency and throughput. Previous work suggested the use of controllers based on control theory or machine learning to reduce performance degradation under constrained power. However, generating these controllers is challenging, as every web service application running in a data center requires a power-performance model and a fine-tuned controller. In this paper, we present DDPC, a system for autonomic, data-driven controller generation for power-latency management. DDPC automates the process of designing and deploying controllers for dynamic power allocation to manage the power-performance trade-offs of latency-sensitive web applications such as a social network. For each application, DDPC uses system-identification techniques to learn an adaptive power-performance model that captures the application's power-latency trade-offs, which is then used to generate and deploy a Proportional-Integral (PI) power controller with gain scheduling that dynamically manages the power allocated, via RAPL, to the server running the application. We evaluate DDPC with two realistic latency-sensitive web applications under varying load scenarios. Our results show that DDPC can autonomically generate and deploy controllers within a few minutes, reducing the active power allocation of a web server by more than 50% compared to state-of-the-art techniques while keeping latency well below the application's target.
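A minimal sketch of the runtime side of such a controller appears below: a gain-scheduled PI loop that reads a tail-latency measurement and writes a package power limit through the Linux powercap (RAPL) interface. The latency target, gain values, load threshold, and the get_p99_ms/get_load probes are assumptions for illustration; only the sysfs path follows the standard powercap layout.

```python
import time

# Standard Linux powercap interface for package power limits (requires root).
RAPL = "/sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw"
TARGET_MS = 50.0                                 # assumed p99 latency target
GAINS = {"low": (0.8, 0.2), "high": (1.5, 0.4)}  # assumed (Kp, Ki) per load regime

def set_power_limit(watts):
    with open(RAPL, "w") as f:
        f.write(str(int(watts * 1e6)))           # RAPL limits are in microwatts

def control_loop(get_p99_ms, get_load, period=1.0, p_min=20.0, p_max=120.0):
    integ = 0.0
    while True:
        err = get_p99_ms() - TARGET_MS                                 # >0: over budget
        kp, ki = GAINS["high"] if get_load() > 0.5 else GAINS["low"]   # gain scheduling
        integ = max(-100.0, min(100.0, integ + err * period))          # anti-windup clamp
        power = max(p_min, min(p_max, p_min + kp * err + ki * integ))  # clamped PI law
        set_power_limit(power)
        time.sleep(period)
```

In DDPC, the gains themselves would come from the learned power-performance model rather than being hard-coded as they are in this sketch.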
- PAR ID: 10433380
- Journal Name: ACM Web Conference
- Page Range / eLocation ID: 3067 to 3076
- Sponsoring Org: National Science Foundation
More Like this
Modern fifth-generation (5G) networks are increasingly moving towards architectures characterized by softwarization and virtualization. This paper addresses the complexities and challenges in deploying applications and services in the emerging multi-tiered 5G network architecture, particularly in the context of microservices-based applications. These applications, characterized by their structure as directed graphs of interdependent functions, are sensitive to the deployment tiers and resource allocation strategies, which can result in performance degradation and susceptibility to failures. Additionally, the threat of deploying potentially malicious applications exacerbates resource allocation inefficiencies. To address these issues, we propose a novel optimization framework that incorporates a probabilistic approach for assessing the risk of malicious applications, leading to a more resilient resource allocation strategy. Our framework dynamically optimizes both computational and networking resources across various tiers, aiming to enhance key performance metrics such as latency, accuracy, and resource utilization. Through detailed simulations, we demonstrate that our framework not only satisfies strict performance requirements but also surpasses existing methods in efficiency and security.
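The risk-aware allocation idea can be pictured with a toy greedy allocator that discounts each application's utility by its estimated probability of being malicious. This is a simplification for illustration, not the paper's optimization formulation; the field names and the risk threshold are assumptions.

```python
def allocate(apps, cpu_capacity, risk_threshold=0.5):
    """apps: list of dicts with assumed fields: name, cpu, utility, p_malicious."""
    # Rank by risk-discounted utility per unit of CPU demanded.
    ranked = sorted(apps,
                    key=lambda a: (1 - a["p_malicious"]) * a["utility"] / a["cpu"],
                    reverse=True)
    plan, used = [], 0.0
    for app in ranked:
        # Refuse high-risk applications outright; pack the rest greedily.
        if app["p_malicious"] < risk_threshold and used + app["cpu"] <= cpu_capacity:
            plan.append(app["name"])
            used += app["cpu"]
    return plan
```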
Many cloud services have Quality-of-Service (QoS) requirements; most requests have to complete within a given latency constraint. Recently, researchers have begun to investigate whether it is possible to meet QoS while attempting to save power on a per-request basis. Existing work shows that one can indeed hand-tune a request latency predictor offline for a particular cloud application, and consult it at runtime to modulate CPU voltage and frequency, resulting in substantial power savings. In this paper, we propose ReTail, an automated and general solution for request-level power management of latency-critical services with QoS constraints. We present a systematic process to select the features of any given application that best correlate with its request latency. ReTail uses these features to predict latency and adjust the CPU's power consumption. ReTail's predictor is trained fully at runtime. We show that, contrary to previous findings, simple techniques perform better than complex machine learning models when using the right input features. For a web search engine, ReTail outperforms prior mechanisms based on complex hand-tuned predictors for that application domain. Furthermore, ReTail's systematic approach also yields superior power savings across a diverse set of cloud applications.
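The sketch below illustrates the shape of such a per-request controller: a simple linear model, updated online, predicts request latency from cheap features, and the slowest CPU frequency whose predicted latency still meets the QoS target is selected. The feature handling, frequency list, and the 1/frequency scaling model are illustrative assumptions, not ReTail's exact design.

```python
import numpy as np

FREQS_GHZ = [1.2, 1.8, 2.4, 3.0]   # assumed available DVFS states
QOS_MS = 20.0                      # assumed request latency constraint

class LatencyModel:
    """Online linear predictor: request features -> latency at max frequency."""
    def __init__(self, n_features):
        self.w = np.zeros(n_features + 1)           # weights plus bias term
    def predict(self, x):
        return float(self.w[:-1] @ x + self.w[-1])
    def update(self, x, observed_ms, lr=1e-3):      # one SGD step at runtime
        err = self.predict(x) - observed_ms
        self.w[:-1] -= lr * err * x
        self.w[-1] -= lr * err

def pick_frequency(model, x):
    base = model.predict(x)                         # latency at FREQS_GHZ[-1]
    for f in FREQS_GHZ:                             # assume latency ~ 1/frequency
        if base * (FREQS_GHZ[-1] / f) <= QOS_MS:
            return f                                # slowest frequency meeting QoS
    return FREQS_GHZ[-1]                            # fall back to max frequency
```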
Controllers of security-critical cyber-physical systems, like the power grid, are a very important class of computer systems. Attacks against the control code of a power-grid system, especially zero-day attacks, can be catastrophic. Earlier detection of anomalies can prevent further damage. However, detecting zero-day attacks is extremely challenging because they have no known code and exhibit unknown behavior. Furthermore, if data collected from the controller is transferred over the network to a server for analysis and anomaly detection, this both creates a very large attack surface and delays detection. In order to address this problem, we propose Reconstruction Error Distribution (RED) of Hardware Performance Counters (HPCs), and a data-driven defense system based on it. Specifically, we first train a temporal deep learning model, using only normal HPC readings from legitimate processes that run daily in these power-grid systems, to model the normal behavior of the power-grid controller. Then, we run this model using real-time data from commonly available HPCs. We use the proposed RED to enhance the temporal deep learning detection of anomalous behavior, by estimating distribution deviations from the normal behavior with an effective statistical test. Experimental results on a real power-grid controller show that we can detect anomalous behavior with high accuracy (>99.9%), nearly zero false positives, and short (<360ms) latency.
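A minimal sketch of the RED idea follows: compare the distribution of recent reconstruction errors against errors observed on known-normal HPC traces using a two-sample statistical test. Here reconstruct() stands in for the paper's temporal deep learning model, and the error metric, window shape, and significance level are assumptions; the paper's specific statistical test may differ from the KS test used here.

```python
import numpy as np
from scipy.stats import ks_2samp

def red_alarm(reconstruct, normal_errors, window, alpha=0.01):
    """window: recent HPC readings, shape (T, n_counters).
    normal_errors: reconstruction errors collected on known-normal traces."""
    # Per-timestep reconstruction error of the temporal model on live data.
    errors = np.linalg.norm(window - reconstruct(window), axis=1)
    # Two-sample KS test: has the error distribution shifted from normal?
    _, p_value = ks_2samp(errors, normal_errors)
    return p_value < alpha   # distribution deviation => flag anomalous behavior
```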
The control of cryogenic qubits in today's superconducting quantum computer prototypes presents significant scalability challenges due to the massive costs of generating/routing the analog control signals that need to be sent from a classical controller at room temperature to the quantum chip inside the dilution refrigerator. Thus, researchers in industry and academia have focused on designing in-fridge classical controllers in order to mitigate these challenges. Due to the maturity of CMOS logic, many industrial efforts (Microsoft, Intel) have focused on Cryo-CMOS as a near-term solution for in-fridge classical controllers. Meanwhile, Superconducting Single Flux Quantum (SFQ) is an alternative, less mature classical logic family proposed for large-scale in-fridge controllers. SFQ logic has the potential to maximize scalability thanks to its ultra-high speed and very low power consumption. However, architecture design for SFQ logic poses challenges due to its unconventional pulse-driven nature and lack of dense memory and logic. Thus, research at the architecture level is essential to guide architects in designing SFQ-based classical controllers for large-scale quantum machines. In this paper, we present DigiQ, the first system-level design of a Noisy Intermediate Scale Quantum (NISQ)-friendly SFQ-based classical controller. We perform a design space exploration of SFQ-based controllers and co-design the quantum gate decompositions and SFQ-based implementation of those decompositions to find an optimal SFQ-friendly design point that trades area and power for latency and control while ensuring good quantum algorithmic performance. Our co-design results in a single instruction, multiple data (SIMD) controller architecture, which has high scalability, but imposes new challenges on the calibration of control pulses. We present software-level solutions to address these challenges, which if unaddressed would degrade quantum circuit fidelity given the imperfections of qubit hardware. To validate and characterize DigiQ, we first implement it using hardware description languages and synthesize it using state-of-the-art, validated SFQ synthesis tools. Our synthesis results show that DigiQ can operate within the tight power and area budget of dilution refrigerators at >42,000-qubit scales. Second, we confirm the effectiveness of DigiQ in running quantum algorithms by modeling the execution time and fidelity of a variety of NISQ applications. We hope that the promising results of this paper motivate experimentalists to further explore SFQ-based quantum controllers to realize large-scale quantum machines with maximized scalability.
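One way to picture the SIMD calibration challenge: every qubit in a group shares one instruction stream, so a broadcast pulse cannot be tuned per qubit in hardware. A hypothetical software-level mitigation, sketched below, groups qubits whose calibrated gate amplitudes lie within a tolerance so a shared pulse serves the whole group; the parameter choice, tolerance, and greedy clustering are illustrative assumptions, not DigiQ's actual calibration scheme.

```python
def simd_groups(qubit_amps, tol=0.01):
    """qubit_amps: {qubit_id: calibrated gate amplitude} (assumed inputs).
    Returns groups of qubits close enough in calibration to share one pulse."""
    groups, current = [], []
    # Sort by amplitude, then cut a new group whenever the spread exceeds tol.
    for q, amp in sorted(qubit_amps.items(), key=lambda kv: kv[1]):
        if current and amp - current[0][1] > tol:
            groups.append([q for q, _ in current])
            current = []
        current.append((q, amp))
    if current:
        groups.append([q for q, _ in current])
    return groups

# Example: q0/q1 are close enough to share a pulse; q2 gets its own group.
print(simd_groups({"q0": 0.500, "q1": 0.505, "q2": 0.620}))
```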