A Cost-Benefit Approach to Fault Tolerant Communication and Information Access

Yearly Technical Report, July 2003

Objective:

Our goal is to develop a Cost-Benefit framework for fault tolerant communication and information access that addresses extremely powerful adversaries that were never handled in the past. The project will develop the theory and algorithms required to overcome strong network attacks, while providing theoretically provable performance bounds. We will build a system that incorporates these algorithms, and that exhibits good performance in practice.

Approach:

Our technical approach includes the following innovative topics:

Analysis of strong adversary models: In order to understand how robust our solutions must be, we need to first understand what is the model of possible attacks and errors. This project introduces a collection of adversary models. They range from a simple predictable slow adversary, to a somewhat limited "stable path" adversary (which allows communication over a path to be successfully completed), to a totally unpredictable adversary (which can selectively block traffic based on type, source, destination, etc.). Many of these models have not been considered in the literature so far.
New routing and dissemination protocols: We present a suite of novel routing protocols tailored to the above adversary models and prove that these protocols perform in a near-optimal manner. Specifically, we present novel solutions that, in case an operational path exists, will be able to find it. Even when no path exist for more than a very short time (shorter than network round-trip time) we still are able to pass packets between source and destination. Our goal is to support the performance and correctness properties of these protocols by rigorous analysis. We aim our analysis not to assume anything about either the topology or the traffic patterns in the network, and not to assume a known correlation between past and future behavior of the adversary.
New replication protocol: When there is no theoretic possibility of communication, say in the case of a cut in the network, one can still continue the operation by making sure that the data is replicated in most areas, or at least in the areas where disconnection is likely. We will develop a suite of replication protocols that can handle a range of adversaries and can gracefully degrade performance and semantics as the network hostility increases. We aim at being able to replicate an ACID database as this is the most demanding replication problem.
A Cost-Benefit decision framework: This framework is used to select the most suitable protocol as network conditions change, both for network-level protocols such as routing and flow control, and data-level protocols such as replication. The main idea is to consider the marginal benefit obtained by the application when consuming a given resource, versus the "opportunity cost" of using this resource. The latter is the benefit that may potentially be lost by other applications if this resource is committed. In the network level, the decision is based on application tolerance to delay, and the reliability of the network. In the data level, the decision is based on the cost of inaccessibility of data, the cost of updating replicas, and the synchronization cost of replication.
An overlay network architecture: We develop an overlay network architecture that will make these protocols practical since they are too complex to have any hope to be implemented in general Internet routers anytime soon.

Accomplishments:

Overlay network infrastructure

In order to better analyze and understand the overlay networks paradigm in an environment defined by weaker semantics, we developed a stand alone prototype called Spines using the client-daemon architecture that is able to build and automatically configure a dynamic overlay network. Our Overlay Network aims to be very scalable, as it does not have any limitation in number of nodes or links, other than what the routing protocol used can support.

We focused our work on the client-daemon communication in Spines. We develpoped a socket-like API for unreliable, best-effort UDP communication, and also for the session-based TCP reliable communication. This brings us the level of transparency necessary for making Spines easy to use in the socket programming paradigm, in a first step towoards complete transparency. We also analyzed the possibilities of poviding IP multicast service using Spines while using only simple unicast communication at the network level.

We developed an end-to-end reliability over our hop-by-hop reliability approach. We have a complete socket capability, similar to a TCP socket that flows over the overlay end-to-end. As a by product of our approach, we can now provide a TCP-fair implementation of an efficient user-level reliable protocol.

We demonstrated that employing hop-by-hop reliability techniques considerably reduces the average latency and jitter of reliable communication while still being fair with external Internet traffic. In order to deploy our protocols over the Internet we considered networking aspects such as congestion control, internal and external fairness, flow control and end-to-end reliability.

We showed that the benefit of hop-by-hop reliability greatly overcomes the overhead associated with reliable overlay routing given by factors such as processing overhead and CPU scheduling, and achieves much better performance compared to standard end-to-end TCP connections deployed on the same overlay network.

We designed a framework for application level, transparent reliable multicast using the hop-by-hop reliability in Spines. The framework includes end-to-end reliablility, congestion and flow control, and relaxed semantics over reliable multicast that handle partitions, merges, crashes and recoveries. We started the implementation of this framework in our overlay infrastructure.

We investigated some of the survivability aspects of Spines, both in wireless and wired environments. We developed a mechanism of trust based on monitoring the abnormal behaviour of overlay nodes, and an acusation system that would eventually reroute packets to avoid untrusted nodes. We released the first version of Spines (www.spines.org) under a standard BSD licence.

We implemented best-effort multicast in Spines, with an interface that resembles the standard IP Multicast service. Our preliminary tests show that Spines is very scalable with the number of senders, receivers and groups. We plan to release a new version of Spines that incorporates best effort multicast soon.

New replication protocol

We continued to work on optimizing and evaluating the replication architecture. We discovered and corrected several performance issues with the engine itself and designed a significant latency optimization to Safe messages in the Spread Toolkit that improved the performance of the replication system as a whole. A complete replicated database solution for the PostgreSQL database was produced and formed the basic version upon which we ran experiments.

We have benchmarked the replication server with and without the PostgreSQL database both in a local area cluster and on general wide area networks using the Emulab facility hosted by the University of Utah.

We were able to obtain performance results that show the efficiency of our replication architecture due to the use of an enhanced synchronization algorithm. We show that latency is not a limiting factor in attaining high throughput in wide-area network environments. We are able to sustain similar aggregate throughput on both local area and wide-area setups, outperforming existing synchronous replication solutions and providing grounds for a wide range of applications to adopt replication as a measure for fault tolerance and high availability.

Wackamole

We have developed and released Wackamole, a software tool that allows N-Way Fail Over for IP Addresses in a cluster.

Wackamole is a tool that helps with making a cluster highly available. It manages a bunch of virtual IPs that should be available to the outside world at all times. Wackamole ensures that exactly one machine within the cluster is listening on each virtual IP address that Wackamole manages. If it discovers that particular machines within the cluster are not alive, it will almost immediately ensure that other machines acquire the virtual IP addresses the down machines were managing. At no time will more than one connected machine be responsible for any virtual IP.

Wackamole also works toward achieving a balanced distribution of the public IPs within the cluster it manages.

Wackamole uses the membership notifications provided by the Spread Toolkit , also developed in our lab, to generate a consistent state that is agreed upon among all of the connected Wackamole instances. Wackamole uses this knowledge to ensure that all of the public IP addresses served by the cluster will be covered by exactly one Wackamole instance. We have designed and formally proven the correctness of the algorithm used by Wackamole.

Wackamole now supports four platforms, Linux, FreeBSD, Solaris 8, and Mac OSX. Development has also focused on making Wackamole more robust and fixing deployment issues we received from users. Based on email queries and downloads Wackamole has started to make an impact as a different model for IP failover for clusters and to be used in practice.

Papers:

An On-Demand Secure Routing Protocol Resilient to Byzantine Failures

ps, ps.gz, pdf. In ACM Workshop on Wireless Security (WiSe) , Atlanta, Georgia, September 28 2002.

Baruch Awerbuch, Dave Holmer, Cristina Nita-Rotaru, and Herbert Rubens.

An ad hoc wireless network is an autonomous self-organizing system of mobile nodes connected by wireless links where nodes not in direct range can communicate via intermediate nodes. A common technique used in routing protocols for ad hoc wireless networks is to establish the routing paths on-demand, as opposed to continually maintaining a complete routing table. A significant concern in routing is the ability to function in the presence of byzantine failures which include nodes that drop, modify, or mis-route packets in an attempt to disrupt the routing service.

We propose an on-demand routing protocol for ad hoc wireless networks that provides resilience to byzantine failures caused by individual or colluding nodes. Our adaptive probing technique detects a malicious link after log faults have occurred, where n is the length of the path. These links are then avoided by multiplicatively increasing their weights and by using an on-demand route discovery protocol that finds a least weight path to the destination.

On the Performance of Consistent Wide-Area Database Replication

Technical Report CNDS-2003-1, September 2002.

Yair Amir, Claudiu Danilov, Michal Miskin-Amir, Jonathan Stanton and Ciprian Tutu.

In this paper we design a generic, consistent replication architecture that enables transparent database replication and we present the optimizations and tradeoffs of the chosen design. We demonstrate the practicality of our approach by building a prototype that replicates a PostgreSQL database system. We provide experimental results for consistent wide-area database replication. We claim that the use of an optimized synchronization engine is the key to building a practical synchronous replication system for wide-area network settings.

Reliable Communication in Overlay Networks

ps, ps.gz, pdf. In the Proceedings of the IEEE International Conference on Dependable Systems and Networks (DSN03), San Francisco, June 2003.
Yair Amir and Claudiu Danilov.

Reliable point-to-point communication is usually achieved in overlay networks by applying TCP/IP on the end nodes of a connection. This paper presents an hop-by-hop reliability approach that considerably reduces the latency and jitter of reliable connections. Our approach is feasible and beneficial in overlay networks that do not have the scalability and interoperability requirements of the global Internet.

The effects of the hop-by-hop reliability approach are quantified in simulation as well as in practice using a newly developed overlay network software that is fair with the external traffic on the Internet. The experimental results show that the overhead associated with overlay network processing at the application level does not play an important factor compared with the considerable gain of the approach.

N-Way Fail-Over Infrastructure for Survivable Servers and Routers.

To appear in the Proceedings of the IEEE International Conference on Dependable Systems and Networks (DSN03), San Francisco, June 2003.

Yair Amir, Ryan Caudy, Ashima Munjal, Theo Schlossnagle and Ciprian Tutu.

Maintaining the availability of critical servers and routers is an important concern for many organizations. At the lowest level, IP addresses represent the global namespace by which services are accessible on the Internet.

We introduce Wackamole, a completely distributed software solution based on a provably correct algorithm that negotiates the assignment of IP addresses among the currently available servers upon detection of faults. This reallocation ensures that at any given time any public IP address of the server cluster is covered exactly once, as long as at least one physical server survives the network fault. The same technique is extended to support highly available routers.

The paper presents the design considerations, algorithm specification and correctness proof, discusses the practical usage for server clusters and for routers, and evaluates the performance of the system.

Software:

We have released version 1.0 of Spines, an Overlay Network Research Platform. Version 1.0 supports unicast best effort and reliable communication with an interface similar with the Unix socket interface. Spines is available at www.spines.org

On November 15, 2002 we released version 2.0.0 of Wackamole, an NxWay fail-over for IP addresses in a cluster. Version 2.0.0 supports the Linux, FreeBSD, Solaris 8, and Mac OSX operating systems. Wackmole is available at www.wackamole.org.

Technology Transfer:

Both Spines and Wackamole project have experienced a steady stream of downloads from our website including commercial, individual, and academic users over the last year. All in all we registered 60 distinct downloads for Spines and over 1000 downloads of Wackamole. We know of several organizations that use Wackamole in production both as NxWay failover for servers and as NxWay failover for routers (which is interesting because we never thought about it ourselves).

The OASIS Dem/Val project used Wackamole to provide failover for edge routers. The same project also used the replication technology to maintain consistent state among different wide area sites. The replication technology is currently evaluated by the Future Combat System (FCS) project.