Fault Tolerance in K3

This semester we added fault-tolerance capabilities to K3, a programming framework for building shared-nothing main-memory data systems, and explored several example applications with varying degrees of fault-tolerance.

We used the Spread toolkit to detect membership changes during the execution of a K3 program. In our implementation, each K3 executable acts as a Spread client. When a K3 program is deployed, all of the participating 'peers' join a public Spread group, and any subsequent membership changes are forwarded to the K3 application for processing. We enable programmers to register a 'trigger' that specifies application-specific logic for reacting to a change in membership.

During this project, we considered three models for fault tolerance. The first, and the most simple, model detects any failure and terminates the processes gracefully. The second model allows the programs to continue with the peers remaining after the crash in order to arrive at an approximate solution. The third, more sophisticated form of fault-tolerance does recovery by replaying the work lost by the crashed peer.

We demonstrate a proof of concept application that implements the 3rd form of fault tolerance described above. The application is a multi-stage SQL query from the Amplab Big Data Benchmark. Our implementation tolerates as many faults as there are replicas of the input datasets, and is also resilient to the 'master' node crashing.

More detailed description can be found in our presentation here. The source code for the work done in our project can be found in our here.