A Cost Benefit Approach to Fault Tolerant Communication and Information Access

Quarterly Technical Report, January 2003

Progress:

This quarter we focused on the Spines overlay network infrastructure and on Wackamole, our N-Way fail-over infrastructure for servers and routers.

The Spines overlay network infrastructure.
We built end-to-end reliability on top of our hop-by-hop reliability approach. We now have a complete socket capability, similar to a TCP socket, that flows over the overlay end-to-end. As a by-product of this approach, we can now provide a TCP-fair implementation of an efficient user-level reliable protocol.
We demonstrated that employing hop-by-hop reliability techniques considerably reduces the average latency and jitter of reliable communication while remaining fair to external Internet traffic. To deploy our protocols over the Internet, we addressed networking aspects such as congestion control, internal and external fairness, flow control, and end-to-end reliability.
We showed that the benefit of hop-by-hop reliability greatly outweighs the overhead of reliable overlay routing, which comes from factors such as processing and CPU scheduling, and that it achieves much better performance than standard end-to-end TCP connections deployed on the same overlay network.
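The intuition behind the latency reduction can be illustrated with a small back-of-the-envelope model. The sketch below is not taken from Spines; the function name, its parameters, and the assumption that recovering a single loss costs roughly one extra round trip (over the lossy link for hop-by-hop recovery, over the whole path for end-to-end recovery) are ours, and loss detection time, queuing, and processing overhead are ignored.

```python
def delivery_latency_ms(hop_latencies_ms, lost_hop=None, hop_by_hop=True):
    """Approximate one-way delivery latency of a packet that is lost at most once.

    hop_latencies_ms: one-way latency of each overlay link on the path.
    lost_hop: index of the link that drops the packet, or None for no loss.
    hop_by_hop: True to recover on the lossy link, False to recover end-to-end.
    (Hypothetical model for illustration only.)
    """
    one_way = sum(hop_latencies_ms)
    if lost_hop is None:
        return one_way
    if hop_by_hop:
        # Assumption: the node after the lossy link requests the packet from
        # its neighbor, costing about one round trip on that link only.
        recovery = 2 * hop_latencies_ms[lost_hop]
    else:
        # Assumption: the destination requests the packet from the source,
        # costing about one round trip over the entire path.
        recovery = 2 * one_way
    return one_way + recovery


path = [10, 10, 10, 10, 10]  # five overlay links of 10 ms each
print(delivery_latency_ms(path))                                  # 50
print(delivery_latency_ms(path, lost_hop=2, hop_by_hop=True))     # 70
print(delivery_latency_ms(path, lost_hop=2, hop_by_hop=False))    # 150
```

Under these assumptions the penalty for a loss grows with the length of the lossy link rather than with the length of the whole path, which is consistent with the lower average latency and jitter reported above.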
We are preparing to release Spines as an open source project in the near future. It should be available at Spines.org.
Wackamole: N-Way fail-over infrastructure for servers and routers.
We evaluated Wackamole's performance while varying the number of servers in the cluster and tuning the timeouts of the Spread toolkit to optimize performance. We achieve N-Way fail-over in a cluster within 12 seconds using the standard timeouts of the Spread toolkit. This improves to under 2 seconds using a tuned version of Spread that is suited to non-congested local area networks. We have specified the Wackamole algorithm and proved its correctness; the specification and proof can be found in the technical report below.

Papers:

N-Way Fail-Over Infrastructure for Survivable Servers and Routers.

Technical Report CNDS-2002-5, December 2002.

Yair Amir, Ryan Caudy, Ashima Munjal, Theo Schlossnagle and Ciprian Tutu.

Maintaining the availability of critical servers and routers is an important concern for many organizations. At the lowest level, IP addresses represent the global namespace by which services are accessible on the Internet.

We introduce Wackamole, a completely distributed software solution based on a provably correct algorithm that negotiates the assignment of IP addresses among the currently available servers upon detection of faults. This reallocation ensures that at any given time any public IP address of the server cluster is covered exactly once, as long as at least one physical server survives the network fault. The same technique is extended to support highly available routers.

The paper presents the design considerations, algorithm specification and correctness proof, discusses the practical usage for server clusters and for routers, and evaluates the performance of the system.
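
The invariant stated in the abstract, that every public IP address is covered exactly once while at least one server survives, can be illustrated with a short sketch. This is not the Wackamole algorithm from the technical report; the function `reallocate`, its arguments, and the least-loaded placement rule are hypothetical choices made for illustration. The sketch assumes every surviving server already agrees on the same membership view, which in the real system comes from the group communication layer.

```python
def reallocate(vips, owner, alive):
    """Return a mapping from each virtual IP to exactly one surviving server.

    vips:  all public IP addresses of the cluster.
    owner: current mapping of IP address -> server (may name failed servers).
    alive: servers in the agreed membership view.
    (Hypothetical helper; deterministic so every member computes the same result.)
    """
    if not alive:
        raise ValueError("no surviving server to cover the addresses")
    members = sorted(alive)
    # Addresses held by surviving servers stay where they are.
    new_owner = {v: s for v, s in owner.items() if v in vips and s in alive}
    load = {s: 0 for s in members}
    for s in new_owner.values():
        load[s] += 1
    # Orphaned addresses go to the least-loaded survivor, ties broken by name.
    for v in sorted(vips):
        if v not in new_owner:
            target = min(members, key=lambda s: (load[s], s))
            new_owner[v] = target
            load[target] += 1
    return new_owner


vips = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]
owner = {"192.0.2.1": "a", "192.0.2.2": "a", "192.0.2.3": "b", "192.0.2.4": "c"}
print(reallocate(vips, owner, alive={"a", "c"}))
```

Because the result is a dictionary keyed by address, each address has exactly one owner by construction, and the function only fails when no server is left, mirroring the condition in the abstract.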

Software:

We released Wackamole version 2.0.0 in November 2002. The system is now supported on Linux, FreeBSD, Solaris 8, and MacOS-X. One of the main improvements is new support for N-Way fail-over for routers. So far, we have registered over 800 downloads of the software from our web site.

Plans for Next Quarter:

We plan to release the first version of Spines as an overlay network research tool and make it available as open source. Our focus for the next quarter will be on providing multicast functionality, similar to IP Multicast, over the overlay network.