Title: Reasoning about modern datacenter infrastructures using partial histories
Modern datacenter infrastructures are increasingly architected as a cluster of loosely coupled services. The cluster states are typically maintained in a logically centralized, strongly consistent data store (e.g., ZooKeeper, Chubby and etcd), while the services learn about the evolving state by reading from the data store, or via a stream of notifications. However, it is challenging to ensure services are correct, even in the presence of failures, networking issues, and the inherent asynchrony of the distributed system. In this paper, we identify that partial histories can be used to effectively reason about correctness for individual services in such distributed infrastructure systems. That is, individual services make decisions based on observing only a subset of changes to the world around them. We show that partial histories, when applied to distributed infrastructures, have immense explanatory power and utility over the state of the art. We discuss the implications of partial histories and sketch tooling for reasoning about distributed infrastructure systems.
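The idea that a service observes only a subset of the changes recorded in the data store can be illustrated with a small sketch. This is not the paper's tooling or formalism; it is a minimal illustration under stated assumptions. The `Event` type, the `replay` and `is_partial_history` helpers, and the `pod-a`/`node-1` names are all hypothetical and chosen only for the example. A partial history is modeled here as an order-preserving subsequence of the store's totally ordered full history.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One committed change in the data store (hypothetical model)."""
    revision: int          # position in the store's total order
    key: str
    value: Optional[str]   # None marks a deletion


def replay(events):
    """Build the state a service would hold after seeing `events` in order."""
    state = {}
    for e in events:
        if e.value is None:
            state.pop(e.key, None)
        else:
            state[e.key] = e.value
    return state


def is_partial_history(observed, full):
    """True if `observed` is an order-preserving subsequence of `full`."""
    remaining = iter(full)
    # Each `any` call consumes `remaining` up to and including the match,
    # so later observed events must appear later in the full history.
    return all(any(o == f for f in remaining) for o in observed)


# Full history as committed by the store (assumed example data).
full_history = [
    Event(1, "pod-a", "node-1"),
    Event(2, "pod-a", None),
    Event(3, "pod-a", "node-2"),
]

# One service missed the deletion; another stopped receiving notifications.
missed_delete = [full_history[0], full_history[2]]
stale = [full_history[0]]

print(is_partial_history(missed_delete, full_history))
print(replay(missed_delete) == replay(full_history))
print(replay(stale) == replay(full_history))
```

In this toy model, both services hold valid partial histories, yet only one of them ends in the same state as the store. The service that missed the deletion never learns that `pod-a` was briefly absent, and the stale service still believes `pod-a` is on `node-1`. Reasoning about which partial histories a service can observe is the kind of question the abstract describes.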
Award ID(s):
2029049 1816615
PAR ID:
10293053
Journal Name:
In Proceedings of the 18th Workshop on Hot Topics in Operating Systems (HotOS-XVIII)
Page Range / eLocation ID:
213 to 220
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. A majority of today's cloud services are independently operated by individual cloud service providers. In this approach, the locations of cloud resources are strictly constrained by the distribution of cloud service providers' sites. As the popularity and scale of cloud services increase, we believe this traditional paradigm is about to change toward further federated services, a.k.a. multi-cloud, due to the improved performance, reduced cost of compute, storage and network resources, as well as increased user demands. In this paper, we present COMET, a lightweight, distributed storage system for managing metadata on large-scale, federated cloud infrastructure providers, end users, and their applications (e.g., HTCondor Cluster or Hadoop Cluster). We showcase use cases from NSF's Chameleon, ExoGENI, and JetStream research cloud testbeds to show the effectiveness of the COMET design and deployment.
  2. In recent years, there has been an increasing need to understand the SCADA networks that oversee our essential infrastructures. While previous studies have focused on networks in a single sector, few have taken a comparative approach across multiple critical infrastructures. This paper dissects operational SCADA networks of three essential services: power grids, gas distribution, and water treatment systems. Our analysis reveals some distinct and shared behaviors of these networks, shedding light on their operation and network configuration. Our findings challenge some of the previous perceptions about the uniformity of SCADA networks and emphasize the need for specialized approaches tailored to each critical infrastructure. With this research, we pave the way for better network characterization for cybersecurity measures and more robust designs in intrusion detection systems. 
  3. High performance computing systems are typically built with high-throughput and infrastructural uniformity in mind, but generally do not easily accommodate diverse data security requirements on a single cluster. Rather than fracturing that infrastructure by building many network isolated storage "islands" to secure each dataset covered by an individual data use agreement, we explore using the Ceph distributed storage system with client-side encryption to provision secure storage from a single, untrusted data lake. 
  4. De_Vita, R; Espinal, X; Laycock, P; Shadura, O (Ed.)
    Predicting the performance of various infrastructure design options in complex federated infrastructures with computing sites distributed over a wide area network that support a plethora of users and workflows, such as the Worldwide LHC Computing Grid (WLCG), is not trivial. Due to the complexity and size of these infrastructures, it is not feasible to deploy experimental test-beds at large scales merely for the purpose of comparing and evaluating alternate designs. An alternative is to study the behaviours of these systems using simulation. This approach has been used successfully in the past to identify efficient and practical infrastructure designs for High Energy Physics (HEP). A prominent example is the Monarc simulation framework, which was used to study the initial structure of the WLCG. New simulation capabilities are needed to simulate large-scale heterogeneous computing systems with complex networks, data access and caching patterns. A modern tool to simulate HEP workloads that execute on distributed computing infrastructures based on the SimGrid and WRENCH simulation frameworks is outlined. Studies of its accuracy and scalability are presented using HEP as a case-study. Hypothetical adjustments to prevailing computing architectures in HEP are studied providing insights into the dynamics of a part of the WLCG and candidates for improvements. 
  5. Unlike aboveground utility systems, for which very detailed and accurate information exists, there is generally a dearth of good-quality data about underground utility infrastructures that provide vital services. To identify key strategies to improve the resilience of these underground systems, this paper presents mechanisms for successful engagement and collaboration among stakeholders and shared cross-sector system vulnerability concerns (including data availability) based on the innovative use of focus groups. Outputs from two virtual focus groups were used to obtain information from New York City area utilities and other stakeholders affected by underground infrastructure. There was strong agreement among participants that (1) a trusted agency in New York City government should manage a detailed map of underground infrastructure that would allow stakeholders to securely access appropriate information about underground systems on a need-to-know basis; (2) environmental risk factors, such as infrastructure age and condition, as well as location should be included; and (3) improved mechanisms for collaboration and sharing information are needed, especially during non-emergency situations. Stakeholders also highlighted the need for a regularly updated central database of relevant contacts at key organizations, since institutions often have a high employee turnover rate, which creates knowledge loss. The focus group script developed as part of this research was designed to be transferable to other cities to assess data needs and potential obstacles to stakeholder collaboration in the areas of underground infrastructure mapping and modeling.