Introduction
Unexpected failures and outages will continue to affect the operation of cyber
infrastructures like Amazon EC2 and network infrastructures like GENI. For
many applications running in such infrastructures, such as long-running
scientific jobs and networked system emulations, failure recovery means
re-running the application from the beginning, thus losing (partial) work done
and wasting system resources. It is desirable for the infrastructure to
provide an efficient, application-transparent failure recovery capability that
takes live "snapshots" of the infrastructure for future recovery or replay.
With advances in virtualization technologies, live snapshotting is feasible for a single virtual machine. However, current techniques are not adequate for suspending and resuming distributed experiments that run on GENI. GENI-VIOLIN's goal is to provide fast "live snapshotting" that allows an entire GENI experiment, distributed across multiple sites and spanning multiple networks, to be suspended and resumed. This project is part of the GENI-alpha plenary demos planned for GENI Engineering Conference 9 (GEC9). GENI-VIOLIN can be used for
Design and Implementation
The key challenge in suspending/resuming a distributed experiment is
coordinating the multiple independent checkpoints performed at the
end hosts. We leverage Purdue University's earlier work, VNSNAP,
which is built on top of VIOLIN.
Our primary contribution is the development of a distributed live snapshot
algorithm that allows snapshotting entirely in the network with minimal
changes to end-host systems and minimal performance degradation.
We have implemented Mattern's snapshotting algorithm using Xen's live migration and OpenFlow. There are two key components in the implementation.
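To make the coordination idea concrete, here is a minimal sketch of the message-coloring scheme that Mattern's snapshot algorithm is based on. It is not the GENI-VIOLIN implementation: there is no Xen or OpenFlow in it, and the names `Node`, `Message`, `run_demo`, and `recorded_total` are hypothetical, introduced only for illustration. Each node is "white" before its snapshot and "red" after it, every message carries its sender's color, a white node snapshots before it accepts a red message, and a white message arriving at a red node is recorded as in-transit state. The sketch omits the bookkeeping that detects when all in-transit messages have been collected.

```python
from dataclasses import dataclass, field
from typing import List, Optional

WHITE, RED = "white", "red"  # pre-snapshot / post-snapshot colors


@dataclass
class Message:
    payload: int
    color: str  # color of the sender at the time of sending


@dataclass
class Node:
    name: str
    state: int  # toy application state: a number of tokens
    color: str = WHITE
    snapshot: Optional[int] = None
    in_flight: List[int] = field(default_factory=list)

    def take_snapshot(self) -> None:
        # Record local state once and turn red; later calls are no-ops.
        if self.color == WHITE:
            self.snapshot = self.state
            self.color = RED

    def send(self, payload: int) -> Message:
        # Outgoing messages are tagged with the sender's current color.
        self.state -= payload
        return Message(payload, self.color)

    def receive(self, msg: Message) -> None:
        if msg.color == RED:
            # A post-snapshot message must not reach pre-snapshot state.
            self.take_snapshot()
        elif self.color == RED:
            # Pre-snapshot message reaching a post-snapshot node: it was
            # in transit when the snapshot was taken, so record it.
            self.in_flight.append(msg.payload)
        self.state += msg.payload


def recorded_total(nodes: List[Node]) -> int:
    # Tokens captured by the global snapshot: node states plus channel state.
    return sum(n.snapshot + sum(n.in_flight) for n in nodes)


def run_demo():
    a, b = Node("a", 100), Node("b", 100)
    m1 = a.send(30)      # white message, sent before a's snapshot
    a.take_snapshot()    # a records 70 tokens and turns red
    m2 = a.send(10)      # red message, sent after a's snapshot
    b.receive(m2)        # arrives first: b snapshots (100) before accepting it
    b.receive(m1)        # white message at a red node: recorded as in transit
    return a, b
```

In `run_demo`, the two messages arrive out of order, yet the recorded snapshot (70 at `a`, 100 at `b`, and 30 in transit) still accounts for all 200 tokens, which is the consistency property a distributed snapshot needs. The coloring rule does not rely on FIFO delivery, which is why the reordering is harmless here.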
GEC9 Demo
During GEC9, we demonstrated live snapshotting over two ProtoGENI sites (Utah and GPO) connected through a wide area network.
GEC8 Demo
In the GEC8 demo, we showed how to create a VIOLIN with 4 VMs on 4 different nodes (on Utah Emulab) and ran a distributed Mandelbrot application written using MPI. The application was snapshotted and, when a failure occurred, restored from a previously stored consistent snapshot. Please see the demo video below for more details.
People
- Pradeep Padala, DOCOMO USA Labs
- Bob Lantz, DOCOMO USA Labs
- Ulas Kozat, DOCOMO USA Labs
- Ken Igarashi, DOCOMO USA Labs
- Ardalan Kangarlou, Purdue University
- Sahan Gamage, Purdue University
- Dongyan Xu, Purdue University
Documents
- Ardalan Kangarlou, Ulas C. Kozat, Pradeep Padala, Bob Lantz, Ken Igarashi, Dongyan Xu. In-Network Live Snapshot Service for Recovering Virtual Infrastructures. IEEE Network Magazine. Special Issue on Cloud Computing. Jul 2011. [HTML Abstract and Link to PDF]
- In-Network Suspend and Resume for GENI Experiments. 9th GENI Engineering Conference (GEC9). Nov 2010. Poster. [PDF]
- Distributed Suspend and Resume for GENI Experiments. 8th GENI Engineering Conference (GEC8). Jul 2010. Poster. [PDF]