A Cost-Benefit Approach to Fault Tolerant Communication and Information Access

A DARPA/ITO grant (May 2000 - September 2003) to Johns Hopkins Univesity. A component of the DARPA Tolerant Networking effort.
Principal Investigators: Yair Amir and Baruch Awerbuch. Co-PI: Jonathan Stanton.

Relevant Papers

N-Way Fail-Over Infrastructure for Survivable Servers and Routers.
To appear in the Proceedings of the IEEE International Conference on Dependable Systems and Networks (DSN03), San Francisco, June 2003.
Yair Amir, Ryan Caudy, Ashima Munjal, Theo Schlossnagle and Ciprian Tutu.

Maintaining the availability of critical servers and routers is an important concern for many organizations. At the lowest level, IP addresses represent the global namespace by which services are accessible on the Internet.
We introduce Wackamole, a completely distributed software solution based on a provably correct algorithm that negotiates the assignment of IP addresses among the currently available servers upon detection of faults. This reallocation ensures that at any given time any public IP address of the server cluster is covered exactly once, as long as at least one physical server survives the network fault. The same technique is extended to support highly available routers.
The paper presents the design considerations, algorithm specification and correctness proof, discusses the practical usage for server clusters and for routers, and evaluates the performance of the system.
Reliable Communication in Overlay Networks ps, ps.gz, pdf.
In the Proceedings of the IEEE International Conference on Dependable Systems and Networks (DSN03), San Francisco, June 2003.
Yair Amir and Claudiu Danilov.
Reliable point-to-point communication is usually achieved in overlay networks by applying TCP/IP on the end nodes of a connection. This paper presents an hop-by-hop reliability approach that considerably reduces the latency and jitter of reliable connections. Our approach is feasible and beneficial in overlay networks that do not have the scalability and interoperability requirements of the global Internet.

The effects of the hop-by-hop reliability approach are quantified in simulation as well as in practice using a newly developed overlay network software that is fair with the external traffic on the Internet. The experimental results show that the overhead associated with overlay network processing at the application level does not play an important factor compared with the considerable gain of the approach.
On the Performance of Synchronous Wide-Area Database Replication
Technical Report CNDS-2002-4, September 2002.
Yair Amir, Claudiu Danilov, Michal Miskin-Amir, Jonathan Stanton and Ciprian Tutu.

A fundamental challenge in database replication is maintaining a low cost of updates while assuring global system consistency. The problem is magnified for wide-area replication due to the high latency and the increased likelihood of network partitions. As a consequence, most database replication research moved away from strictly consistent models to update models with weaker semantics, relying on application knowledge to resolve conflicts.

This paper explores a synchronous replication architecture for local and wide-area networks that provides strong consistency and performs considerably better than previous consistent approaches. As a proof of concept, we implemented transparanet replication for the Postgres database system.

Our results show that sophisticated algorithms and careful distributed systems design can make symmetric, synchronous, peer database replication a reality over both local and wide-area networsk.
An On-Demand Secure Routing Protocol Resilient to Byzantine Failures ps, ps.gz, pdf.
In ACM Workshop on Wireless Security (WiSe) , Atlanta, Georgia, September 28 2002.
Baruch Awerbuch, Dave Holmer, Cristina Nita-Rotaru, and Herbert Rubens.

An ad hoc wireless network is an autonomous self-organizing system of mobile nodes connected by wireless links where nodes not in direct range can communicate via intermediate nodes. A common technique used in routing protocols for ad hoc wireless networks is to establish the routing paths on-demand, as opposed to continually maintaining a complete routing table. A significant concern in routing is the ability to function in the presence of byzantine failures which include nodes that drop, modify, or mis-route packets in an attempt to disrupt the routing service.

We propose an on-demand routing protocol for ad hoc wireless networks that provides resilience to byzantine failures caused by individual or colluding nodes. Our adaptive probing technique detects a malicious link after log faults have occurred, where n is the length of the path. These links are then avoided by multiplicatively increasing their weights and by using an on-demand route discovery protocol that finds a least weight path to the destination.
Flow Control for Many-to-Many Multicast: A Cost-Benefit Approach
Technical Report CNDS-2001-1
Yair Amir, Baruch Awerbuch, Claudiu Danilov, Jonathan Stanton,
We present a protocol that is analytically grounded, yet also achieves real world goals, such as simplicity, fairness and minimal resource usage. We base our flow control protocol on the Cost-Benefit algorithmic framework for resource management. We base decisions on the "opportunity" costs of network resources, comparing the cost of each individual resource to the benefit it provides. As opposed to existing window-based flow control schemes, we avoid end-to-end feedback by basing decisions on the state of the links between participating nodes. This produces control traffic proportional only to the number of overlay network links and independent of the number of groups.
A Low Latency, Loss Tolerant Architecture and Protocol for Wide Area Group Communication
Published in the International Conference on Dependable Systems and Networks (FTCS-30, DCCA-8), New York, New York, June 25-28, 2000.
Yair Amir, Claudiu Danilov, Jonathan Stanton,
This paper presents the design of the transport protocols of the Spread wide area group communication system. We focus on two aspects of the system. First, the value of using overlay networks for application level group communication services. Second, the requirements and design of effective low latency link protocols used to construct wide area group communication.

Overview

This project develop the theory and algorithms required to overcome extremely strong network attacks, while providing theoretically provable performance bounds. We are building a system that incorporates these algorithms, and that exhibits good performance in practice.

Key innovations

Analysis of strong adversary models: In order to understand how robust our solutions must be, we need to first understand what is the model of possible attacks and errors. We introduce a collection of adversary models. They range from a simple predictable slow adversary, to a somewhat limited "stable path" adversary (which allows communication over a path to be successfully completed), to a totally unpredictable adversary (which can selectively block traffic based on type, source, destination, etc.). Many of these models have not been considered in the literature so far.
New routing and dissemination protocols: We design a suite of novel routing protocols tailored to the above adversary models and prove that these protocols perform in a near-optimal manner. Specifically, our solution, in case an operational path exists, will be able to select such a path with high probability and, in case such a path does not exist, will store and forward the packets. The performance and correctness properties of these protocols will be supported by rigorous analysis. Our analysis will not assume anything about either the topology or the traffic patterns in the network, and will not assume a known correlation between past and future behavior of the adversary.
New replication protocols: When there is no information-theoretic possibility of communication, say in the case of a cut in the network, one can still continue the operation by making sure that the data is replicated in most areas, or at least in the areas where disconnection is likely. We design a suite of replication protocols that can handle a range of adversaries and can gracefully degrade performance and semantics as the network hostility increases.
A Cost-Benefit decision framework: This framework is used to select the most suitable protocol as network conditions change, both for network-level protocols such as routing and dissemination, and data-level protocols such as replication. The main idea is to consider the marginal benefit obtained by the application when consuming a given resource, versus the "opportunity cost" of using this resource. The latter is the benefit that may potentially be lost by other applications if this resource is committed. In the network level, the decision is based on application tolerance to delay, and the reliability of the network. In the data level, the decision is based on the cost of inaccessibility of data, the cost of updating replicas, and the synchronization cost of replication.
An overlay network architecture: We rely on an overlay network architecture that makes these protocols practical since they are too complex to have any hope to be implemented in general Internet routers anytime soon.

A Cost-Benefit Approach to Fault Tolerant Communication and Information Access

Research Activity

Presentations

Reports

Relevant Papers

Overview

Key innovations