skip to main content

Title: Secure control of networked cyber-physical systems
For networked cyber-physical systems to proliferate, it is important to ensure that the resulting control system is secure. We consider a physical plant, abstracted as a single input- single-output stochastic linear dynamical system, in which a sensor node can exhibit malicious behavior. A malicious sensor may report false or distorted sensor measurements. For such compromised systems, we propose a technique which ensures that malicious sensor nodes cannot introduce any significant distortion without being detected. The crux of our technique consists of the actuator node superimposing a random signal, whose realization is unknown to the sensor, on the control law-specified input. We show that in spite of a background of process noise, the above method can detect the presence of malicious nodes. Specifically, we establish that by injecting an arbitrarily small amount of such random excitation into the system, one can ensure that either the malicious sensor is detected, or it is restricted to add distortion that is only of zero-power to the noise entering the system. The proposed technique is potentially usable in applications such as smart grids, intelligent transportation, and process control.
Award ID(s):
1646449 1619085 1302182
Publication Date:
Journal Name:
55th IEEE Conference on Decision and Control
Page Range or eLocation-ID:
283 to 289
Sponsoring Org:
National Science Foundation
More Like this
  1. We address the problem of security of cyber-physical systems where some sensors may be malicious. We consider a multiple-input, multiple-output stochastic linear dynamical system controlled over a network of communication and computational nodes which contains (i) a controller that computes the inputs to be applied to the physical plant, (ii) actuators that apply these inputs to the plant, and (iii) sensors which measure the outputs of the plant. Some of these sensors, however, may be malicious. The malicious sensors do not report the true measurements to the controller. Rather, they report false measurements that they fabricate, possibly strategically, so as to achieve any objective that they may have, such as destabilizing the closed-loop system or increasing its running cost. Recently, it was shown that under certain conditions, an approach of “dynamic watermarking” can secure such a stochastic linear dynamical system in the sense that either the presence of malicious sensors in the system is detected, or the malicious sensors are constrained to adding a distortion that can only be of zero power to the noise already entering the system. The first contribution of this paper is to generalize this result to partially observed MIMO systems with both process and observationmore »noises, a model which encompasses some of the previous models for which dynamic watermarking was established to guarantee security. This result, similar to the prior ones, is shown to hold when the controller subjects the reported sequence of measurements to two particular tests of veracity. The second contribution of this paper is in showing, via counterexamples, that both of these tests are needed in order to secure the control system in the sense that if any one of these two tests of sensor veracity is dropped, then the above guarantee does not hold. The proposed approach has several potential applications, including in smart grids, automated transportation, and process control.« less
  2. Byzantine Fault Tolerant (BFT) protocols are designed to ensure correctness and eventual progress in the face of misbehaving nodes [1]. However, this does not prevent negative effects an adversary may have on performance: a faulty node may significantly affect the latency and throughput of the system without being detected. This is especially true in speculative protocols optimized for the best-case where a single leader can force the protocol into the worst case [3]. Systems like Aardvark [2] that are designed to maximize worst-case performance tolerate byzantine behavior without necessarily detecting who the perpetrator is. By forcing regular view changes, for example, they mitigate the effects of leaders who deliberately delay dissemination of messages, even if this behavior would be difficult to prove to a third party. Byzantine faults, by definition, can be difficult to detect. An error of 'commission', such as a message with a mismatching digest, can be proven. Errors of 'omission', such as delaying or failing to relay a message, as a rule cannot be proven, and the node responsible for these types of omission faults may not appear faulty to all observers. Nevertheless, we observe that they can reliably be detected. Designing protocols that detect and ejectmore »nodes is challenging for two reasons. First, some behaviors are observed by a subset of honest nodes and cannot be objectively proven to a third party. Second, any mechanism capable of ejecting nodes could be subverted by Byzantine nodes to eject honest nodes. This paper presents the Protocol for Ejecting All Corrupted Hosts (Peach, a mechanism for detecting and ejecting faulty nodes in Byzantine fault tolerant (BFT) protocols. Nodes submit votes to a trusted configuration manager that replaces faulty nodes once a threshold of votes are received. We implement Peach for two BFT protocol variants, a traditional pbft-style three-phase protocol and a speculative protocol, and evaluate its ability to respond to Byzantine behavior. This work makes the following contributions: (1) We present and prove a necessary and sufficient constraint on cluster membership guaranteeing that any nodes causing performance degradation via acts of omission will be detected. (2) We present an agreement protocol, PEACHes, in which replicas pass votes about their subjective local observations of possible omissions to a TTP. (3) We show how the separation of detection and effectuation allows fine-grained detection of malicious behavior that is compatible and easily integrated with existing systems. (4) We present DecentBFT, an extension of BFT-Smart to which we added a speculative fast path (similar to Zyzzva) and integrated PEACHes. (5) We show DecentBFT rapidly detects and mitigates a variety of performance attacks that would have gone undetected by the state of the art.« less
  3. Obeid, I. ; Selesnik, I. ; Picone, J. (Ed.)
    The Neuronix high-performance computing cluster allows us to conduct extensive machine learning experiments on big data [1]. This heterogeneous cluster uses innovative scheduling technology, Slurm [2], that manages a network of CPUs and graphics processing units (GPUs). The GPU farm consists of a variety of processors ranging from low-end consumer grade devices such as the Nvidia GTX 970 to higher-end devices such as the GeForce RTX 2080. These GPUs are essential to our research since they allow extremely compute-intensive deep learning tasks to be executed on massive data resources such as the TUH EEG Corpus [2]. We use TensorFlow [3] as the core machine learning library for our deep learning systems, and routinely employ multiple GPUs to accelerate the training process. Reproducible results are essential to machine learning research. Reproducibility in this context means the ability to replicate an existing experiment – performance metrics such as error rates should be identical and floating-point calculations should match closely. Three examples of ways we typically expect an experiment to be replicable are: (1) The same job run on the same processor should produce the same results each time it is run. (2) A job run on a CPU and GPU should producemore »identical results. (3) A job should produce comparable results if the data is presented in a different order. System optimization requires an ability to directly compare error rates for algorithms evaluated under comparable operating conditions. However, it is a difficult task to exactly reproduce the results for large, complex deep learning systems that often require more than a trillion calculations per experiment [5]. This is a fairly well-known issue and one we will explore in this poster. Researchers must be able to replicate results on a specific data set to establish the integrity of an implementation. They can then use that implementation as a baseline for comparison purposes. A lack of reproducibility makes it very difficult to debug algorithms and validate changes to the system. Equally important, since many results in deep learning research are dependent on the order in which the system is exposed to the data, the specific processors used, and even the order in which those processors are accessed, it becomes a challenging problem to compare two algorithms since each system must be individually optimized for a specific data set or processor. This is extremely time-consuming for algorithm research in which a single run often taxes a computing environment to its limits. Well-known techniques such as cross-validation [5,6] can be used to mitigate these effects, but this is also computationally expensive. These issues are further compounded by the fact that most deep learning algorithms are susceptible to the way computational noise propagates through the system. GPUs are particularly notorious for this because, in a clustered environment, it becomes more difficult to control which processors are used at various points in time. Another equally frustrating issue is that upgrades to the deep learning package, such as the transition from TensorFlow v1.9 to v1.13, can also result in large fluctuations in error rates when re-running the same experiment. Since TensorFlow is constantly updating functions to support GPU use, maintaining an historical archive of experimental results that can be used to calibrate algorithm research is quite a challenge. This makes it very difficult to optimize the system or select the best configurations. The overall impact of all of these issues described above is significant as error rates can fluctuate by as much as 25% due to these types of computational issues. Cross-validation is one technique used to mitigate this, but that is expensive since you need to do multiple runs over the data, which further taxes a computing infrastructure already running at max capacity. GPUs are preferred when training a large network since these systems train at least two orders of magnitude faster than CPUs [7]. Large-scale experiments are simply not feasible without using GPUs. However, there is a tradeoff to gain this performance. Since all our GPUs use the NVIDIA CUDA® Deep Neural Network library (cuDNN) [8], a GPU-accelerated library of primitives for deep neural networks, it adds an element of randomness into the experiment. When a GPU is used to train a network in TensorFlow, it automatically searches for a cuDNN implementation. NVIDIA’s cuDNN implementation provides algorithms that increase the performance and help the model train quicker, but they are non-deterministic algorithms [9,10]. Since our networks have many complex layers, there is no easy way to avoid this randomness. Instead of comparing each epoch, we compare the average performance of the experiment because it gives us a hint of how our model is performing per experiment, and if the changes we make are efficient. In this poster, we will discuss a variety of issues related to reproducibility and introduce ways we mitigate these effects. For example, TensorFlow uses a random number generator (RNG) which is not seeded by default. TensorFlow determines the initialization point and how certain functions execute using the RNG. The solution for this is seeding all the necessary components before training the model. This forces TensorFlow to use the same initialization point and sets how certain layers work (e.g., dropout layers). However, seeding all the RNGs will not guarantee a controlled experiment. Other variables can affect the outcome of the experiment such as training using GPUs, allowing multi-threading on CPUs, using certain layers, etc. To mitigate our problems with reproducibility, we first make sure that the data is processed in the same order during training. Therefore, we save the data from the last experiment and to make sure the newer experiment follows the same order. If we allow the data to be shuffled, it can affect the performance due to how the model was exposed to the data. We also specify the float data type to be 32-bit since Python defaults to 64-bit. We try to avoid using 64-bit precision because the numbers produced by a GPU can vary significantly depending on the GPU architecture [11-13]. Controlling precision somewhat reduces differences due to computational noise even though technically it increases the amount of computational noise. We are currently developing more advanced techniques for preserving the efficiency of our training process while also maintaining the ability to reproduce models. In our poster presentation we will demonstrate these issues using some novel visualization tools, present several examples of the extent to which these issues influence research results on electroencephalography (EEG) and digital pathology experiments and introduce new ways to manage such computational issues.« less
  4. Data aggregation is a key primitive in wireless sensor networks and refers to the process in which the sensed data are processed and aggregated en-route by intermediate sensor nodes. Since sensor nodes are commonly resource constrained, they may be compromised by attackers and instructed to launch various attacks. Despite the rich literature on secure data aggregation, most of the prior work focuses on detecting intermediate nodes from modifying partial aggregation results with two security challenges remaining. First, a compromised sensor node can report arbitrary reading of its own, which is fundamentally difficult to detect but widely considered to have limited impact on the final aggregation result. Second, a compromised sensor node can repeatedly attack the aggregation process to prevent the base station from receiving correct aggregation results, leading to a special form of Denial-of-Service attack. VMAT [1] (published in ICDCS 2011) is a representative secure data aggregation scheme with the capability of pinpointing and revoking compromised sensor nodes, which relies on a secure MIN aggregation scheme and converts other additive aggregation functions such as SUM and COUNT to MIN aggregations. In this paper, we introduce a novel enumeration attack against VMAT to highlight the security vulnerability of a sensor nodemore »reporting an arbitrary reading of its own. The enumeration attack allows a single compromised sensor node to significantly inflate the final aggregation result without being detected. As a countermeasure, we also introduce an effective defense against the enumeration attack. Theoretical analysis and simulation studies confirm the severe impact of the enumeration attack and the effectiveness of the countermeasure.« less
  5. The occurrence of extreme events, either natural or man-made, puts stress on both the physical infrastructure, causing damages and failures, and the financial system. The following recovery process requires a large amount of resources from financial agents, such as insurance companies. If the demand for funds overpasses their capacity, these financial agents cannot fulfill their obligations, thus defaulting, without being able to deliver the requested funds. However, agents can share risk among each other, according to specific agreements. Our goal is to investigate the relationship between these agreements and the overall response of the physical/financial systems to extreme events and to identify the optimal set of agreements, according to some risk-based metrics. We model the system as a directed and weighted graph, where nodes represent financial agents and links agreements among these. Each node faces an external demand of funds coming from the physical assets, modeled as a random variable, that can be transferred to other nodes, via the directed edges. For a given probabilistic model of demands and structure of the graph, we evaluate metrics such as the expected number of defaults, and we identify the graph configuration which optimizes the metric. The identified graph suggests to the agentsmore »a set of agreements to minimize global risk.« less