We designed a framework for application-level, transparent reliable multicast built on the hop-by-hop reliability of Spines. The framework provides end-to-end reliability, congestion and flow control, and relaxed semantics over reliable multicast that handle network partitions and merges as well as node crashes and recoveries. We have started implementing this framework in our overlay infrastructure.
We investigated several survivability aspects of Spines in both wireless and wired environments. We developed a trust mechanism based on monitoring overlay nodes for abnormal behaviour, and an accusation system that eventually reroutes packets around untrusted nodes. We released the first version of Spines (www.spines.org) under a standard BSD licence.
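The rerouting idea behind the accusation system can be illustrated with a minimal sketch. It assumes a hypothetical rule that a node becomes untrusted once a threshold number of distinct nodes have accused it, and it uses ordinary Dijkstra shortest-path routing with untrusted intermediate nodes pruned from the overlay graph. The names `AccusationTable`, `ACCUSATION_THRESHOLD` and `route_avoiding_untrusted` are illustrative only; this is not the Spines implementation.

```python
import heapq
from collections import defaultdict

# Hypothetical rule: distinct accusers required before a node is distrusted.
ACCUSATION_THRESHOLD = 2


class AccusationTable:
    """Records which nodes accused which suspects (illustrative model only)."""

    def __init__(self):
        self.accusers = defaultdict(set)

    def accuse(self, accuser, suspect):
        # A set counts each accuser once, so one node cannot exclude a peer alone.
        self.accusers[suspect].add(accuser)

    def untrusted(self):
        return {n for n, a in self.accusers.items() if len(a) >= ACCUSATION_THRESHOLD}


def route_avoiding_untrusted(links, src, dst, untrusted):
    """Dijkstra over an undirected weighted overlay, never relaying through untrusted nodes."""
    graph = defaultdict(list)
    for u, v, cost in links:
        graph[u].append((v, cost))
        graph[v].append((u, cost))
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in graph[u]:
            if v in untrusted and v != dst:
                continue  # prune untrusted relays
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None  # no trusted path exists
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


links = [("A", "B", 1), ("B", "D", 1), ("A", "C", 2), ("C", "D", 2)]
table = AccusationTable()
print(route_avoiding_untrusted(links, "A", "D", table.untrusted()))  # ['A', 'B', 'D']
table.accuse("A", "B")
table.accuse("D", "B")
print(route_avoiding_untrusted(links, "A", "D", table.untrusted()))  # ['A', 'C', 'D']
```

Requiring several distinct accusers is one simple way to keep a single malicious node from cutting honest nodes out of the overlay. The actual trust criteria used in Spines may differ.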
The current DNS infrastructure suffers from several major drawbacks that affect the reliability and quality of the service it provides. Each DNS zone is served by a set of servers organized in a single-master, multiple-slave architecture. Under this model, zone updates can be performed only at the master server, and they reach the slaves passively, through a pull mechanism based on polling. If the master server of a zone becomes unavailable, zone updates can no longer be applied. Furthermore, the entire infrastructure depends heavily on the availability of the 13 root servers: a recent denial-of-service attack disabled 9 of the 13 root servers, exposing the vulnerability of the whole system. We have begun exploring a peer zone management system to replace the current master-slave architecture. Such a system will maintain replicated copies of the DNS records at all servers and will allow dynamic zone updates to be submitted to any peer. Each update is propagated as soon as possible to all other servers, minimizing the time needed for an update to reach every replica and improving the overall availability of the system.
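A minimal sketch of the peer model makes the contrast with master-slave polling concrete. It assumes, since the text does not say how concurrent updates are ordered, that every update is stamped with a Lamport-style `(counter, peer_id)` version and that each replica keeps the highest-versioned value for a record. Direct method calls stand in for network messages, and `ZonePeer`, `submit`, `receive` and `lookup` are hypothetical names, not part of any DNS software.

```python
from dataclasses import dataclass, field


@dataclass
class ZonePeer:
    """One DNS server in a hypothetical peer zone-management group."""
    peer_id: int
    peers: list = field(default_factory=list, repr=False)  # the other ZonePeer objects
    records: dict = field(default_factory=dict)            # name -> (version, value)
    clock: int = 0                                         # Lamport-style counter

    def submit(self, name, value):
        # Any peer may accept an update: there is no master.
        self.clock += 1
        version = (self.clock, self.peer_id)  # peer_id breaks ties between concurrent updates
        self.apply(name, version, value)
        for p in self.peers:                  # push immediately instead of waiting for a poll
            p.receive(name, version, value)

    def receive(self, name, version, value):
        self.clock = max(self.clock, version[0])
        self.apply(name, version, value)

    def apply(self, name, version, value):
        current = self.records.get(name)
        if current is None or version > current[0]:
            self.records[name] = (version, value)

    def lookup(self, name):
        entry = self.records.get(name)
        return entry[1] if entry else None


servers = [ZonePeer(i) for i in range(3)]
for s in servers:
    s.peers = [p for p in servers if p is not s]

servers[2].submit("www.example.com", "192.0.2.10")
print([s.lookup("www.example.com") for s in servers])
# ['192.0.2.10', '192.0.2.10', '192.0.2.10']
```

Because the update is pushed at submission time, every replica holds it after one round of messages rather than after the next refresh poll. A real system would also need to handle peers that are down or partitioned when the update is sent.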
Reliable Communication in Overlay Networks
To appear in the Proceedings of the IEEE International Conference on
Dependable Systems and Networks (DSN03), San Francisco, June 2003.
Yair Amir and Claudiu Danilov.
Reliable point-to-point communication is usually achieved in overlay
networks by applying TCP/IP on the end nodes of a connection.
This paper presents a hop-by-hop reliability approach that
considerably reduces the latency and jitter of reliable connections.
Our approach is feasible and beneficial in overlay networks that
do not have the scalability and interoperability requirements of
the global Internet.
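Back-of-the-envelope arithmetic shows where the latency benefit comes from. This sketch uses a simplified loss-recovery model, not the protocol described in the paper: a loss is detected as soon as the packet should have arrived, one NACK travels back and one retransmission travels forward, and timeouts, queuing and processing are ignored. The function name `extra_recovery_delay` and the link latencies are hypothetical.

```python
def extra_recovery_delay(link_latencies_ms, lossy_link):
    """Added delay (ms) to recover one lost packet under a simplified model.

    Assumes the loss is detected as soon as the packet should have arrived,
    then one NACK travels back and one retransmission travels forward.
    Timeouts, queuing and processing delays are ignored.
    """
    path = sum(link_latencies_ms)
    end_to_end = 2 * path                           # NACK to the source, resend over the whole path
    hop_by_hop = 2 * link_latencies_ms[lossy_link]  # NACK and resend on the lossy link only
    return end_to_end, hop_by_hop


# Hypothetical 4-hop overlay path with a loss on the third link.
print(extra_recovery_delay([10, 20, 15, 5], lossy_link=2))  # (100, 30)
```

Under this model a lost packet is delayed by twice the lossy link's latency instead of twice the whole path's, so recovered packets arrive much closer to the timing of packets that were never lost, which is one way to read the reduction in jitter.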
N-Way Fail-Over Infrastructure for Survivable Servers and Routers
To appear in the Proceedings of the IEEE International Conference on
Dependable Systems and Networks (DSN03), San Francisco, June 2003.
Yair Amir, Ryan Caudy, Ashima Munjal, Theo Schlossnagle and Ciprian Tutu.
Maintaining the availability of critical servers and routers is an
important concern for many organizations. At the lowest level, IP
addresses represent the global namespace by which services are
accessible on the Internet. We introduce Wackamole, a completely
distributed software solution based on a provably correct algorithm
that negotiates the assignment of IP addresses among the currently
available servers upon detection of faults. This reallocation ensures
that at any given time any public IP address of the server cluster is
covered exactly once, as long as at least one physical server survives
the network fault. The same technique is extended to support highly
available routers. The paper presents the design considerations,
algorithm specification and correctness proof, discusses the practical
usage for server clusters and for routers, and evaluates the
performance of the system.
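The "covered exactly once" property can be illustrated with a minimal sketch. It assumes that all surviving servers already agree on the same membership view, so each can run the same deterministic function and reach the same result without further messages. Addresses whose previous owner survived stay put, and orphaned addresses go to the least-loaded survivor. The function `reallocate` and its balancing rule are hypothetical; this is not the Wackamole algorithm from the paper.

```python
def reallocate(vips, previous_owner, alive):
    """Deterministically assign every virtual IP to exactly one live server.

    vips: public IP addresses to cover.
    previous_owner: dict vip -> server that held it before the fault (may be missing).
    alive: servers in the agreed membership view; must be non-empty.
    """
    if not alive:
        raise ValueError("at least one server must survive")
    members = sorted(alive)  # identical order on every server
    load = {s: 0 for s in members}
    assignment = {}
    # Keep addresses whose owner survived, to avoid needless moves.
    for vip in sorted(vips):
        owner = previous_owner.get(vip)
        if owner in load:
            assignment[vip] = owner
            load[owner] += 1
    # Hand orphaned addresses to the least-loaded server (ties broken by name).
    for vip in sorted(vips):
        if vip not in assignment:
            target = min(members, key=lambda s: (load[s], s))
            assignment[vip] = target
            load[target] += 1
    return assignment


vips = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
before = {"10.0.0.1": "s1", "10.0.0.2": "s2", "10.0.0.3": "s3", "10.0.0.4": "s3"}
after = reallocate(vips, before, {"s1", "s2"})  # server s3 has failed
print(after)
# {'10.0.0.1': 's1', '10.0.0.2': 's2', '10.0.0.3': 's1', '10.0.0.4': 's2'}
```

Since the result is a dictionary keyed by address, each address maps to exactly one live server. The hard part the paper addresses, agreeing on the membership view in the presence of faults, is assumed away here.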