skip to main content


Title: Invited Paper: Failure is (literally) an Option: Atomic Commitment vs Optionality in Decentralized Finance
Many aspects of blockchain-based decentralized finance can be understood as an extension of classical distributed computing. In this paper, we trace the evolution of two interrelated notions: failure and fault-tolerance. In classical distributed computing, a failure to complete a multi-party protocol is typically attributed to hardware malfunctions. A fault-tolerant protocol is one that responds to such failures by rolling the system back to an earlier consistent state. In the presence of Byzantine failures, a failure may be the result of an attack, and a fault-tolerant protocol is one that ensures that attackers will be punished and victims compensated. In modern decentralized finance however, failure to complete a protocol can be considered a legitimate option, not a transgression. A fault-tolerant protocol is one that ensures that the party offering the option cannot renege, and the party purchasing the option provides fair compensation (in the form of a fee) to the offering party. We sketch the evolution of such protocols, starting with two-phase commit, and finishing with timed hashlocked smart contracts.  more » « less
Award ID(s):
1917990
NSF-PAR ID:
10300609
Author(s) / Creator(s):
;
Date Published:
Journal Name:
SSS 2021: 23rd International Symposium on Stabilization, Safety, and Security of Distributed Systems
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This paper presents a formulation of multiparty session types (MPSTs) for practical fault-tolerant distributed programming. We tackle the challenges faced by session types in the context of distributed systems involving asynchronous and concurrent partial failures – such as supporting dynamic replacement of failed parties and retrying failed protocol segments in an ongoing multiparty session – in the presence of unreliable failure detection. Key to our approach is that we develop a novel model of event-driven concurrency for multiparty sessions. Inspired by real-world practices, it enables us to unify the session-typed handling of regular I/O events with failure handling and the combination of features needed to express practical fault-tolerant protocols. Moreover, the characteristics of our model allow us to prove a global progress property for well-typed processes engaged in multiple concurrent sessions, which does not hold in traditional MPST systems. To demonstrate its practicality, we implement our framework as a toolchain and runtime for Scala, and use it to specify and implement a session-typed version of the cluster management system of the industrial-strength Apache Spark data analytics framework. Our session-typed cluster manager composes with other vanilla Spark components to give a functioning Spark runtime; e.g., it can execute existing third-party Spark applications without code modification. A performance evaluation using the TPC-H benchmark shows our prototype implementation incurs an average overhead below 10%. 
    more » « less
  2. Fault-tolerant coordination services have been widely used in distributed applications in cloud environments. Recent years have witnessed the emergence of time-sensitive applications deployed in edge computing environments, which introduces both challenges and opportunities for coordination services. On one hand, coordination services must recover from failures in a timely manner. On the other hand, edge computing employs local networked platforms that can be exploited to achieve timely recovery. In this work, we first identify the limitations of the leader election and recovery protocols underlying Apache ZooKeeper, the prevailing open-source coordination service. To reduce recovery latency from leader failures, we then design RT-Zookeeper with a set of novel features including a fast-convergence election protocol, a quorum channel notification mechanism, and a distributed epoch persistence protocol. We have implemented RT-Zookeeper based on ZooKeeper version 3.5.8. Empirical evaluation shows that RT-ZooKeeper achieves 91% reduction in maximum recovery latency in comparison to ZooKeeper. Furthermore, a case study demonstrates that fast failure recovery in RT-ZooKeeper can benefit a common messaging service like Kafka in terms of message latency. 
    more » « less
  3. Abstract

    Modern distributed data management systems face a new challenge: how can autonomous, mutually distrusting parties cooperate safely and effectively? Addressing this challenge brings up familiar questions from classical distributed systems: how to combine multiple steps into a single atomic action, how to recover from failures, and how to synchronize concurrent access to data. Nevertheless, each of these issues requires rethinking when participants are autonomous and potentially adversarial. We propose the notion of across-chain deal, a new way to structure complex distributed computations that manage assets in an adversarial setting. Deals are inspired by classical atomic transactions, but are necessarily different, in important ways, to accommodate the decentralized and untrusting nature of the exchange. We describe novel safety and liveness properties, along with two alternative protocols for implementing cross-chain deals in a system of independent blockchain ledgers. One protocol, based on synchronous communication, is fully decentralized, while the other, based on semi-synchronous communication, requires a globally shared ledger. We also prove that some degree of centralization is required in the semi-synchronous communication model.

     
    more » « less
  4. Abstract

    We conceptualize adecentralizedsoftware application as one constituted fromautonomousagents that communicate viaasynchronousmessaging. Modern software paradigms such as microservices and settings such as the Internet of Things evidence a growing interest in decentralized applications. Constructing a decentralized application involves designing agents as independent local computations that coordinate successfully to realize the application’s requirements. Moreover, a decentralized application is susceptible to faults manifested as message loss, delay, and reordering. We contributeMandrake, a programming model for decentralized applications that tackles these challenges without relying on infrastructure guarantees. Specifically, we adopt the construct of aninformation protocolthat specifies messaging between agents purely in causal terms and can be correctly enacted by agents in a shared-nothing environment over nothing more than unreliable, unordered transport. Mandrake facilitates (1) implementing protocol-compliant agents by introducing a programming model; (2) transforming protocols into fault-tolerant ones with simple annotations; and (3) a declarative policy language that makes it easy to implement fault-tolerance in agents based on the capabilities in protocols. Mandrake’s significance lies in demonstrating a straightforward approach for constructing decentralized applications without relying on coordination mechanisms in the infrastructure, thus achieving some of the goals of the founders of networked computing from the 1970s.

     
    more » « less
  5. The addition of geometric reconfigurability in a cable driven parallel robot (CDPR) introduces kinematic redundancies which can be exploited for manipulating structural and mechanical properties of the robot through redundancy resolution. In the event of a cable failure, a reconfigurable CDPR (rCDPR) can also realign its geometric arrangement to overcome the effects of cable failure and recover the original expected trajectory and complete the trajectory tracking task. In this paper we discuss a fault tolerant control (FTC) framework that relies on an Interactive Multiple Model (IMM) adaptive estimation filter for simultaneous fault detection and diagnosis (FDD) and task recovery. The redundancy resolution scheme for the kinematically redundant CDPR takes into account singularity avoidance, manipulability and wrench quality maximization during trajectory tracking. We further introduce a trajectory tracking methodology that enables the automatic task recovery algorithm to consistently return to the point of failure. This is particularly useful for applications where the planned trajectory is of greater importance than the goal positions, such as painting, welding or 3D printing applications. The proposed control framework is validated in simulation on a planar rCDPR with elastic cables and parameter uncertainties to introduce modeled and unmodeled dynamics in the system as it tracks a complete trajectory despite the occurrence of multiple cable failures. As cables fail one by one, the robot topology changes from an over-constrained to a fully constrained and then an under-constrained CDPR. The framework is applied with a constant-velocity kinematic feedforward controller which has the advantage of generating steady-state inputs despite dynamic oscillations during cable failures, as well as a Linear Quadratic Regulator (LQR) feedback controller to locally dampen these oscillations. 
    more » « less