Training LLMs in distributed environments presents significant challenges due to the complexity of model execution, deployment systems, and the vast space of configurable strategies. Although various optimization techniques exist, achieving high efficiency in practice remains difficult. Accurate performance models that effectively characterize and predict a model’s behavior are essential for guiding optimization efforts and system-level studies. We propose Lumos, a trace-driven performance modeling and estimation toolkit for large-scale LLM training, designed to accurately capture and predict the execution behaviors of modern LLMs. We evaluate Lumos on a production ML cluster with up to 512 NVIDIA H100 GPUs using various GPT-3 variants, demonstrating that it can replay execution time with an average error of just 3.3%, along with other runtime details, across different models and configurations. Additionally, we validate its ability to estimate performance for new setups from existing traces, facilitating efficient exploration of model and deployment configurations.
more »
« less
Cross-Layer Interactions in CPS for Performance and Certification
A central challenge in designing embedded control systems or cyber-physical systems (CPS) is that of translating high-level models of control algorithms into efficient implementations, while ensuring that model-level semantics are preserved. While a large body of techniques for designing provably correct control strategies exist in the control theory literature, when it comes to transforming mathematical descriptions of these strategies to an efficient implementation, the available means are surprisingly ad hoc in nature. Among other reasons, this is because of (i) implementation platform details not sufficiently being accounted for in controller models, (ii) side effects introduced in the code generation process, (iii) various compiler optimizations whose impact on the dynamics of the plant being controlled not being properly understood, (iv) the presence of analog components on the implementation platform whose behavior is difficult to model, (v) computation and communication delays that exist in an implementation but were not accounted for in the model, and (vi) also the effects of image/video processing whose accuracy and timing behavior are difficult to model. As we move towards designing autonomous systems, these issues become biting problems on the path to certification, and striking a balance between performance and certification. In this position paper, we discuss some of these challenges - that we formulate as the need for modeling the interactions between various implementation layers in a CPS - and potential research directions to address them.
more »
« less
- PAR ID:
- 10108230
- Date Published:
- Journal Name:
- 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)
- Page Range / eLocation ID:
- 1439 to 1444
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Cyber-Physical Systems (CPS) consist of embedded computers with sensing and actuation capability, and are integrated into and tightly coupled with a physical system. Because the physical and cyber components of the system are tightly coupled, cyber-security is important for ensuring the system functions properly and safely. However, the effects of a cyberattack on the whole system may be difficult to determine, analyze, and therefore detect and mitigate. This work presents a model based software development framework integrated with a hardware-in-the-loop (HIL) testbed for rapidly deploying CPS attack experiments. The framework provides the ability to emulate low level attacks and obtain platform specific performance measurements that are difficult to obtain in a traditional simulation environment. The framework improves the cybersecurity design process which can become more informed and customized to the production environment of a CPS. The developed framework is illustrated with a case study of a railway transportation system.more » « less
-
Simulation-based analysis is essential in the model-based design process of Cyber-Physical Systems (CPS). Since heterogeneity is inherent to CPS, virtual prototyping of CPS designs and the simulation of their behavior in various environments typically involve a number of physical and computation/communication domains interacting with each other. Affordability of the model-based design process makes the use of existing domain-specific modeling and simulation tools all but mandatory. However, this pressure establishes the requirement for integrating the domain-specific models and simulators into a semantically consistent and efficient system-of-system simulation. The focus of the paper is the interoperability of popular integration platforms supporting heterogeneous multi-model simulations. We examine the relationship among three existing platforms: the High-Level Architecture (HLA)-based CPS Wind Tunnel (CPSWT), MOSAIK, and the Functional Mockup Unit (FMU). We discuss approaches to establish interoperability and present results of ongoing work in the context of an example.more » « less
-
Discrete manufacturing systems are complex cyber-physical systems (CPS) and their availability, performance, and quality have a big impact on the economy. Smart manufacturing promises to improve these aspects. One key approach that is being pursued in this context is the creation of centralized software-defined control (SDC) architectures and strategies that use diverse sensors and data sources to make manufacturing more adaptive, resilient, and programmable. In this paper, we present SDCWorks-a modeling and simulation framework for SDC. It consists of the semantic structures for creating models, a baseline controller, and an open source implementation of a discrete event simulator for SDCWorks models. We provide the semantics of such a manufacturing system in terms of a discrete transition system which sets up the platform for future research in a new class of problems in formal verification, synthesis, and monitoring. We illustrate the expressive power of SDCWorks by modeling the realistic SMART manufacturing testbed of University of Michigan. We show how our open source SDCWorks simulator can be used to evaluate relevant metrics (throughput, latency, and load) for example manufacturing systems.more » « less
-
Most cyber–physical systems (CPS) encounter a large volume of data which is added to the system gradually in real time and not altogether in advance. In this paper, we provide a theoretical framework that yields optimal control strategies for such CPS at the intersection of control theory and learning. In the proposed framework, we use the actual CPS, i.e., the ‘‘true" system that we seek to optimally control online, in parallel with a model of the CPS that is available. We then institute an information state for the system which does not depend on the control strategy. An important consequence of this independence is that for any given choice of a control strategy and a realization of the system’s variables until time t, the information states at future times do not depend on the choice of the control strategy at time t but only on the realization of the decision at time t, and thus they are related to the concept of separation between estimation of the state and control. Namely, the future information states are separated from the choice of the current control strategy. Such control strategies are called separated control strategies. Hence, we can derive offline the optimal control strategy of the system with respect to the information state, which might not be precisely known due to model uncertainties or complexity of the system, and then use standard learning approaches to learn the information state online while data are added gradually to the system in real time. We show that after the information state becomes known, the separated control strategy of the CPS model derived offline is optimal for the actual system. We illustrate the proposed framework in a dynamic system consisting of two subsystems with a delayed sharing information structure.more » « less
An official website of the United States government

