Introduction

Unexpected failures and outages will continue to affect the operation of cyber infrastructures like Amazon EC2 and network infrastructures like GENI. For many applications running in such infrastructures, such as long-running scientific jobs and networked system emulations, failure recovery means re-running the application from the beginning, which loses the (partial) work already done and wastes system resources. It is desirable for the infrastructure to provide an efficient, application-transparent failure recovery capability that takes live "snapshots" of the infrastructure for future recovery or replay.

With advances in virtualization technologies, live snapshotting is feasible for a single virtual machine. However, current techniques are not adequate for suspending and resuming distributed experiments that run on GENI. GENI-VIOLIN's goal is to provide fast "live snapshotting" that allows an entire GENI experiment, distributed across multiple sites and spanning multiple networks, to be suspended and resumed. This project is part of the GENI-alpha plenary demos planned for GENI Engineering Conference 9 (GEC9). GENI-VIOLIN can be used for:

  • Fault Tolerance: If the infrastructure fails, GENI experiments can be resumed from a previously checkpointed state.
  • Debugging: GENI experimenters can go back in time and inspect past checkpoints to debug their software.
  • Slice Management: GENI operators can invoke suspend and resume to better allocate and manage resources among multiple GENI slices.

Design and Implementation

The key challenge in suspending and resuming a distributed experiment is coordinating the multiple independent checkpoints performed at the end-hosts. We leverage Purdue University's earlier work, VNSNAP, which is built on top of VIOLIN. Our primary contribution is the development of a distributed live snapshot algorithm that performs the snapshot coordination entirely in the network, with minimal changes to end-host systems and minimal performance degradation.

We have implemented Mattern's snapshotting algorithm using Xen's live migration and OpenFlow. There are two key components in the implementation.

  1. Snapshotting the end-host: A VM running on an end-host is snapshotted by issuing a "fake" live migration from the end-host to another node called the snapshot server. The snapshot server stores the memory image of the snapshotted VM on disk.
  2. In-network coordination using OpenFlow: We have implemented a transaction coordinator and an OpenFlow controller (based on NOX) that coordinate the traffic between VMs in pre-snapshot and post-snapshot states. In the original VNSNAP implementation, this coordination is done at the end-hosts.
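
The coordination between VMs in pre-snapshot and post-snapshot states can be illustrated with a small sketch. It is not the GENI-VIOLIN controller code: the class and method names are hypothetical, no NOX or OpenFlow API is used, and the forwarding rule is an assumption made for illustration. The assumed rule is that a packet sent by a VM that has already been snapshotted must not reach a VM that has not, so it is dropped and recovery is left to transport-layer retransmission; all other packets are forwarded.

```python
from dataclasses import dataclass, field
from enum import Enum


class Color(Enum):
    PRE = "pre-snapshot"
    POST = "post-snapshot"


class Action(Enum):
    FORWARD = "forward"
    DROP = "drop"


@dataclass
class SnapshotCoordinator:
    """Hypothetical tracker of per-VM snapshot state for one snapshot round."""
    vms: set
    colors: dict = field(default_factory=dict)

    def __post_init__(self):
        # Every VM starts the round in the pre-snapshot state.
        self.colors = {vm: Color.PRE for vm in self.vms}

    def mark_snapshotted(self, vm):
        # Called once the VM's memory image has been captured.
        self.colors[vm] = Color.POST

    def decide(self, src, dst):
        # Assumed rule: a post-snapshot sender must not influence a
        # pre-snapshot receiver, otherwise the receiver's snapshot would
        # record a message that the sender's snapshot never sent.
        if self.colors[src] is Color.POST and self.colors[dst] is Color.PRE:
            return Action.DROP
        return Action.FORWARD

    def round_complete(self):
        # The round ends when every VM has been snapshotted.
        return all(c is Color.POST for c in self.colors.values())


coord = SnapshotCoordinator({"vm1", "vm2"})
coord.mark_snapshotted("vm1")
decision_post_to_pre = coord.decide("vm1", "vm2")  # Action.DROP
decision_pre_to_post = coord.decide("vm2", "vm1")  # Action.FORWARD
```

Under these assumptions, the decision depends only on the sender's and receiver's snapshot state, which is why it can be made by a controller in the network rather than inside each end-host.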

GEC9 Demo

During GEC9, we demonstrated live snapshotting over two ProtoGENI sites (Utah and GPO) connected through a wide area network.

GEC8 Demo

In the GEC8 demo, we showed how to create a VIOLIN with 4 VMs on 4 different nodes (on Utah Emulab) and ran a distributed Mandelbrot application written using MPI. The application was snapshotted, and when a failure occurred, it was restored from a previously stored consistent snapshot. Please see the demo video below for more details.

People

Documents