In the current implementation we provide only unreliable, best-effort semantics, similar to UDP. The overlay network configures itself automatically, dynamically grows or shrinks as nodes join or leave, and supports partitions, merges, crashes and recoveries, and any cascading combination of such events. Applications that use the overlay network rely on a simple API consisting of four calls (connect, disconnect, send and receive), very similar to UDP socket functions.
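The four-call API above can be pictured with a minimal sketch, mirrored onto plain UDP sockets for illustration. The class and method names here are assumptions chosen to match the description, not the actual Spines API.

```python
# Hypothetical sketch of a four-call overlay API (connect, disconnect,
# send, receive) with UDP-like best-effort semantics. Names and signatures
# are illustrative assumptions, not the real Spines interface.
import socket

class OverlayNode:
    def connect(self, local_port):
        # Join the overlay by binding a local endpoint (best-effort, like UDP).
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.bind(("127.0.0.1", local_port))

    def send(self, dest, data):
        # Best-effort send: no delivery guarantee, as with UDP datagrams.
        self.sock.sendto(data, dest)

    def receive(self, max_len=65536):
        # Blocking receive of one datagram; returns payload and sender address.
        data, addr = self.sock.recvfrom(max_len)
        return data, addr

    def disconnect(self):
        # Leave the overlay; the network reconfigures without this node.
        self.sock.close()
```

As with UDP sockets, a sender needs no handshake with the receiver before transmitting, and a lost datagram is simply never delivered.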
We continued developing Spines by adding reliability semantics, both hop-by-hop and end-to-end, similar to TCP. We showed in simulations how using hop-by-hop reliability in an overlay improves the latency of point-to-point TCP connections.
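A back-of-the-envelope calculation shows why hop-by-hop retransmission can reduce latency on a lossy multi-hop path: an end-to-end scheme pays a full path traversal for every retry, while a hop-by-hop scheme retries only the single link that dropped the packet. The loss rate and per-hop delay below are illustrative assumptions, not measured Spines numbers.

```python
# Expected delivery latency under end-to-end vs. hop-by-hop retransmission
# on a path of several lossy links. Simplified model: retry timeouts are
# ignored and losses are independent per hop.

def expected_latency_end_to_end(hops, hop_delay, loss):
    # A packet survives the whole path with probability (1-loss)^hops;
    # the number of attempts is geometric with that success probability,
    # and each attempt costs a full path traversal.
    p_ok = (1 - loss) ** hops
    attempts = 1 / p_ok
    return attempts * hops * hop_delay

def expected_latency_hop_by_hop(hops, hop_delay, loss):
    # Each hop retransmits locally: expected attempts per hop is 1/(1-loss),
    # and a retry costs only one hop's delay.
    return hops * hop_delay / (1 - loss)

# Illustrative path: 5 hops, 10 ms per hop, 5% loss per link.
e2e = expected_latency_end_to_end(hops=5, hop_delay=10.0, loss=0.05)
hbh = expected_latency_hop_by_hop(hops=5, hop_delay=10.0, loss=0.05)
```

Under these assumptions the hop-by-hop scheme yields a lower expected latency, and the gap widens as the path grows longer or lossier.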
We implemented the first mechanisms needed to support multipath routing in Spines. These include detecting the level of link congestion, which is necessary for a TCP-friendly pricing mechanism in the Cost-Benefit framework.
We have started to experiment with the replication server we developed. We are now benchmarking the replication server with and without the Postgres database on a local area cluster located in our lab, on the CAIRN wide area network and on general wide area networks using the Emulab facility hosted by the University of Utah.
We have completed a full set of experiments on local and wide area networks. We were able to accurately emulate the physical topology of the CAIRN network on the Emulab machines. The Emulab machines have processing and disk I/O resources comparable to those of our local cluster, and we obtained excellent results for the replication engine that showed the efficiency of the replication architecture and the practical feasibility of wide area database replication.
To build a framework that allows us to clearly identify the tradeoffs involved in replicating databases over wide area networks, we developed a more modular version of the replication algorithm (Maintaining Database Consistency in P2P Networks). We are investigating a new metric that quantifies the benefit of adding new replicas to a replicated system. We are also studying ways to enhance the current replication schemes to improve their fault tolerance and scalability in the context of dynamic networks.
Wackamole is a tool that makes a cluster highly available. It manages a set of virtual IP addresses that should be available to the outside world at all times. Wackamole ensures that exactly one machine within the cluster is listening on each virtual IP address it manages. If it discovers that particular machines within the cluster are no longer alive, it almost immediately ensures that other machines acquire the virtual IP addresses the failed machines were serving. At no time will more than one connected machine be responsible for any virtual IP address.
Wackamole also works toward achieving a balanced distribution of the public IPs within the cluster it manages.
Wackamole uses the membership notifications provided by the Spread Toolkit, also developed in our lab, to generate a consistent state that is agreed upon among all of the connected Wackamole instances. Wackamole uses this knowledge to ensure that every public IP address served by the cluster is covered by exactly one Wackamole instance.
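The invariant described above can be sketched as a deterministic assignment rule: every instance that agrees on the same membership list computes the same virtual-IP-to-owner map, so each address is covered by exactly one machine and the addresses are spread roughly evenly. The round-robin rule below is an illustrative assumption, not the actual Wackamole algorithm.

```python
# Sketch of the "exactly one owner per virtual IP" invariant. Every node
# that agrees on the same membership computes the same map locally, so no
# coordination beyond agreed membership is needed. The round-robin
# assignment is a hypothetical rule for illustration only.

def assign_vips(members, vips):
    """Return {vip: owner}, computed identically by every member."""
    members = sorted(members)  # agreed, deterministic order
    return {vip: members[i % len(members)]
            for i, vip in enumerate(sorted(vips))}

# When a machine fails, the survivors recompute the same map from the new
# membership, and the orphaned addresses move to other machines.
vips = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
before = assign_vips(["node-a", "node-b"], vips)
after = assign_vips(["node-a"], vips)  # node-b crashed
```

Because the map is a pure function of the agreed membership, any two instances that received the same Spread membership notification reach the same assignment without exchanging further messages.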
Wackamole now supports four platforms: Linux, FreeBSD, Solaris 8, and Mac OS X. Development has also focused on making Wackamole more robust and on fixing deployment issues reported by users. Judging by email queries and downloads, Wackamole has started to make an impact as a different model for IP failover in clusters and is being used in practice.
Later in the year we developed a third-generation version of the Archipelago system. This version completely reimplements the system with a modular design that allows pluggable protocols and services such as routing, transport, and security. This will allow us to use Archipelago as a flexible platform for experimentation with specialized routing protocols and the cost-benefit framework. It also allows us to use it in wired or hybrid wired-wireless environments.
Also available in an extended form as CNDS Technical Report CNDS-2001-3.
Yair Amir, Baruch Awerbuch, Claudiu Danilov, Jonathan Stanton
This paper presents a flow control protocol for multi-sender multi-group multicast and unicast in wide area overlay networks. The protocol is analytically grounded and achieves real-world goals such as simplicity, fairness and minimal resource usage. Flows are regulated based on the "opportunity" costs of the network resources used and the benefit provided by the flow. In contrast to existing window-based flow control schemes, we avoid end-to-end per-sender or per-group feedback by looking only at the state of the virtual links between participating nodes. This produces control traffic proportional only to the number of overlay network links and independent of the number of groups, senders or receivers. We show the effectiveness of the resulting protocol through simulations and validate the simulations with live Internet experiments.
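The opportunity-cost idea in the abstract can be illustrated with a small sketch: each overlay link advertises a cost that rises with its utilization, and a flow is admitted only when its benefit exceeds the aggregate cost of the links it would traverse. The exponential cost curve and the specific numbers are assumptions for illustration, not the paper's exact functions.

```python
# Illustrative cost-benefit admission rule for overlay links. A link's
# "opportunity cost" grows steeply as it fills, so congested links price
# out low-benefit traffic first. The cost function and parameters are
# hypothetical, chosen only to show the shape of the mechanism.
import math

def link_cost(utilization, capacity):
    # Exponential in utilization: near-zero when idle, very large when full.
    return math.exp(4 * utilization / capacity) - 1

def admit(flow_benefit, path_utilizations, capacity=100.0):
    # Forward only when the flow's benefit exceeds the sum of the
    # opportunity costs of the links on its path.
    total_cost = sum(link_cost(u, capacity) for u in path_utilizations)
    return flow_benefit > total_cost

# The same flow is admitted on an idle path but rejected on a congested one.
idle = admit(flow_benefit=5.0, path_utilizations=[10.0, 5.0])
congested = admit(flow_benefit=5.0, path_utilizations=[90.0, 80.0])
```

Note that this decision needs only per-link state, which is the property the abstract highlights: control traffic scales with the number of overlay links, not with the number of groups, senders or receivers.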
Also available in extended form as Technical Report CNDS-2001-6, November 2001.
Yair Amir and Ciprian Tutu.
This paper presents in detail an efficient and provably correct algorithm for database replication over partitionable networks. Our algorithm avoids the need for end-to-end acknowledgments for each action while supporting network partitions and merges and allowing dynamic instantiation of new replicas. One round of end-to-end acknowledgments is required only upon a membership change event such as a network partition. New actions may be introduced to the system at any point, not only while in a primary component. We show how performance can be further improved for applications that allow relaxation of consistency requirements. We provide experimental results that demonstrate the superiority of this approach.
Baruch Awerbuch and Ciprian Tutu.
We present an algorithm for persistent consistent distributed commit (distributed database commit) in a dynamic, asynchronous, peer-to-peer network. The algorithm has constant overhead in time and space and almost constant communication complexity, allowing it to scale to very large networks. Previous solutions required linear overhead in communication and space, making them unscalable. We introduce a modular solution based on several well-defined building blocks with clear formal specifications. These blocks can be implemented in a variety of ways, and we give examples of possible implementations. Most of the existing solutions require acknowledgments from every participant for each action; our algorithm is highly efficient because it aggregates these acknowledgments. Also, in contrast to existing solutions, our algorithm does not require any membership knowledge. Components are detected based on local information, and that information is disseminated on an overlay spanning tree. Because of its simplicity, the algorithm may prove better suited for practical implementation than existing ones.
Yair Amir, Claudiu Danilov, Michal Miskin-Amir, Jonathan Stanton and Ciprian Tutu.
This paper explores the architecture, implementation and performance of a wide and local area database replication system. The architecture provides synchronous, peer replication, supporting diverse application semantics, based on a group communication paradigm. Network partitions and merges, computer crashes and recoveries, and message omissions are all handled. Using a generic replication engine and the Spread group communication toolkit, we provide replication services for the PostgreSQL database system. We define three different environments to be used as test-beds: a local area cluster, a wide area network that spans the U.S.A., and an emulated wide area test bed. We conduct an extensive set of experiments on these environments, varying the number of replicas and clients, the mix of updates and queries, and the network latency. Our results show that sophisticated algorithms and careful distributed systems design can make symmetric, synchronous, peer database replication a reality for both local and wide area networks.
We are evaluating how our replication method can be combined with other existing techniques to create the best-performing replication system for a given setup.
We released version 1.0.0 of Wackamole, an NxWay fail-over for IP addresses in a cluster, in August 2001. Version 1.0.0 supports the Linux operating system. Wackamole is available at www.backhand.org/wackamole.
On November 5, 2001 we released version 1.2.0 of Wackamole, an NxWay fail-over for IP addresses in a cluster. Version 1.2.0 supports the Linux, FreeBSD, Solaris 8, and Mac OS X operating systems. Wackamole is available at www.backhand.org/wackamole.
On December 9, 2001 we released version 1.2.1 of mod_backhand, an Apache web-server module that enables cluster management and request load balancing and control for heterogeneous clusters. This version now supports Windows NT as well as several versions of Unix. More information about mod_backhand is available at www.backhand.org/mod_backhand.
The Wackamole project has seen a steady stream of downloads from our website over the last year, including commercial, individual, and academic users. All in all, we registered 550 distinct downloads of the software. We know of several organizations that use it in production, both as NxWay failover for servers and as NxWay failover for routers (which is interesting, because we never thought of that use ourselves).