Quarterly Technical Report, April 2003

Progress:

This quarter we continued our work on the Spines overlay network infrastructure and on the Wackamole N-Way fail-over for servers and routers. We have also begun exploring issues related to Domain Name System (DNS) availability.

Papers:

Reliable Communication in Overlay Networks
To appear in the Proceedings of the IEEE International Conference on Dependable Systems and Networks (DSN03), San Francisco, June 2003.
Yair Amir and Claudiu Danilov.

Reliable point-to-point communication is usually achieved in overlay networks by applying TCP/IP at the end nodes of a connection. This paper presents a hop-by-hop reliability approach that considerably reduces the latency and jitter of reliable connections. Our approach is feasible and beneficial in overlay networks that do not have the scalability and interoperability requirements of the global Internet.

The effects of the hop-by-hop reliability approach are quantified in simulation as well as in practice, using newly developed overlay network software that is fair to external traffic on the Internet. The experimental results show that the overhead of overlay network processing at the application level is not a significant factor compared with the considerable gain of the approach.
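
To give a rough sense of why recovering losses hop by hop can reduce latency, the following sketch compares the delivery time of a single lost packet under the two schemes. It is a toy model and not the paper's simulation or the Spines implementation: the function names, the per-hop delays, and the assumption that a loss is detected immediately and repaired by exactly one request and one retransmission (no timeouts, no queueing) are all illustrative assumptions.

```python
def end_to_end_latency(hop_delays_ms):
    """Delivery latency (ms) when the lost packet is repaired by the source.

    Assumed model: the receiver notices the gap at the time the packet
    would have arrived, a retransmission request crosses the whole path
    back to the source, and the retransmission crosses the whole path again.
    """
    path = sum(hop_delays_ms)
    return path + 2 * path


def hop_by_hop_latency(hop_delays_ms, lost_hop):
    """Delivery latency (ms) when the loss is repaired on the hop where it occurred.

    Assumed model: the node just after the lossy link asks its upstream
    neighbor for the packet, so recovery costs one round trip on that
    link only; the packet then continues along the rest of the path.
    """
    path = sum(hop_delays_ms)
    return path + 2 * hop_delays_ms[lost_hop]


if __name__ == "__main__":
    # Hypothetical overlay path: five links of 10 ms one-way delay each.
    delays = [10, 10, 10, 10, 10]
    print("end-to-end:", end_to_end_latency(delays), "ms")
    print("hop-by-hop:", hop_by_hop_latency(delays, lost_hop=2), "ms")
```

Under these assumptions the lost packet arrives after 150 ms with end-to-end recovery and after 70 ms with hop-by-hop recovery, against a loss-free latency of 50 ms. Real gains depend on loss detection, congestion control, and processing overhead, which this model ignores.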


N-Way Fail-Over Infrastructure for Survivable Servers and Routers
To appear in the Proceedings of the IEEE International Conference on Dependable Systems and Networks (DSN03), San Francisco, June 2003.

Yair Amir, Ryan Caudy, Ashima Munjal, Theo Schlossnagle and Ciprian Tutu.

Maintaining the availability of critical servers and routers is an important concern for many organizations. At the lowest level, IP addresses represent the global namespace by which services are accessible on the Internet.

We introduce Wackamole, a completely distributed software solution based on a provably correct algorithm that negotiates the assignment of IP addresses among the currently available servers upon detection of faults. This reallocation ensures that at any given time any public IP address of the server cluster is covered exactly once, as long as at least one physical server survives the network fault. The same technique is extended to support highly available routers.

The paper presents the design considerations, algorithm specification and correctness proof, discusses the practical usage for server clusters and for routers, and evaluates the performance of the system.
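
The idea of reassigning IP addresses so that each one is covered exactly once can be illustrated with a small sketch. This is not the Wackamole algorithm or its code: the function name `reallocate`, the tie-breaking rule, and the example addresses are hypothetical, and the sketch assumes that all surviving servers already agree on the same list of live members, so that each of them computes the same result independently.

```python
def reallocate(vips, assignment, live_servers):
    """Return a new address-to-server map covering every address exactly once.

    vips         -- iterable of virtual IP addresses that must stay reachable
    assignment   -- dict mapping address -> server that held it before the fault
    live_servers -- servers that survived (assumed agreed upon by all members)
    """
    if not live_servers:
        raise ValueError("no surviving server to cover the addresses")

    live = sorted(live_servers)
    load = {server: 0 for server in live}
    new_assignment = {}

    # Addresses whose owner is still alive stay where they are,
    # which avoids disturbing healthy servers.
    for vip in sorted(vips):
        owner = assignment.get(vip)
        if owner in load:
            new_assignment[vip] = owner
            load[owner] += 1

    # Orphaned addresses go to the least-loaded survivor; ties are
    # broken by server name so every member picks the same target.
    for vip in sorted(vips):
        if vip not in new_assignment:
            target = min(live, key=lambda s: (load[s], s))
            new_assignment[vip] = target
            load[target] += 1

    return new_assignment


if __name__ == "__main__":
    # Hypothetical cluster using documentation-range addresses.
    vips = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]
    before = {"192.0.2.1": "a", "192.0.2.2": "a",
              "192.0.2.3": "b", "192.0.2.4": "c"}
    # Server "b" fails; "a" and "c" survive.
    print(reallocate(vips, before, {"a", "c"}))
```

In this example the address previously held by the failed server `b` moves to `c`, the less loaded survivor, while the other addresses keep their owners. Because the result is a map keyed by address, no address can be claimed twice, and the function refuses to run only when no server survives.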

Software:

We released the first version of Spines (www.spines.org) under a standard BSD licence. The current version offers both best-effort and reliable communication. For reliable sessions in an overlay network setup, it achieves better performance than end-to-end reliable communication.

Plans for Next Quarter:

Our focus for the next quarter will be on providing reliable multicast functionality in overlay networks and on adding survivability features to our overlay network platform. We will continue exploring aspects related to DNS availability.