
Live migration, draft


This topic contains raw thoughts about possible ways of implementing the feature. The page is located in the obsolete "subos" wiki only because it is currently the most complete place for sysnet documentation.


Currently, we have the subutai checkpoint command, which is fully based on lxc-checkpoint functions and their representation in the go-lxc bindings. The "engine" for checkpoint and restore operations is the CRIU project, which is all about dumping/freezing/restoring running processes.

Now we are able to create a fully independent container archive which contains not only the data from disk but also a memory dump. This archive may be manually copied to another Subutai peer and deployed there, producing a running container with data and memory identical to the original container.
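For reference, the same operation can be reproduced with plain lxc-checkpoint; the container name and dump directory below are just examples, and on a Subutai host the lxcpath may differ from the LXC default:

```sh
# on the source host: dump the container memory and stop the container
lxc-checkpoint -n c1 -s -v -D /tmp/c1-dump

# copy the dump directory and the container rootfs to the other peer, then
# on the destination host restore the container from the dump
lxc-checkpoint -r -n c1 -v -D /tmp/c1-dump
```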

Currently, there are several limitations in the Subutai checkpoint option, caused by various reasons and currently being worked on:

  • state restore doesn't work for Subutai systemd-based containers. This looks like a known issue that was solved in CRIU and LXC a few releases ago, but for some reason it reappeared in Subutai
  • the Subutai checkpoint operation works correctly only with certain combinations of pc-kernel and liblxc versions
  • memory dump doesn't work for shell jobs started in an "attached" console; it looks like this option is not exposed in go-lxc/lxc-checkpoint

To make the checkpoint operation work as live migration, we need to carefully consider two steps:

  1. Container dump creation

A memory dump may be restored only over a rootfs that is identical to the original rootfs at the moment of the memory dump creation. If the data and memory dumps are created at different times, there is a big chance that the data structure will differ from what the memory "expects", and the state restore will fail in this case. Therefore, to ensure successful container migration, the memory dump process must "freeze" the container to prevent data changes, and only after that should the data dump be created. It follows that, after creation of a valid dump, the source container will be in the stopped state, so live migration will include a short downtime: the sum of the time to create the data dump, the time to transfer the memory and data dumps to the destination host, and the time to restore this dump.

To minimize downtime we will use an iterative approach for the data dump: first, a full and independent data archive is created and transferred to the destination without a memory dump and without freezing the container. After this archive is deployed on the destination host, a second, incremental backup with the memory dump is created (this time stopping the container) and sent to the destination to be restored over the data from the first step.
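A rough sketch of that sequence with generic tools (tar, rsync, scp); the paths are assumptions and the real implementation uses the subutai commands described later:

```sh
# phase 1: full data archive while the container keeps running
tar -C /var/lib/lxc/c1/rootfs -czf /tmp/c1-full.tar.gz .          # rootfs path is an assumption
scp /tmp/c1-full.tar.gz root@<IP of host B>:/tmp/                 # transfer, no downtime yet

# phase 2 (downtime starts here): memory dump, then sync only the data changed since phase 1
lxc-checkpoint -n c1 -s -D /tmp/c1-dump                           # dump memory, container is stopped
rsync -a --delete /var/lib/lxc/c1/rootfs/ root@<IP of host B>:/var/lib/lxc/c1/rootfs/
scp -r /tmp/c1-dump root@<IP of host B>:/tmp/

# on the destination host: restore the memory dump over the synced rootfs
# lxc-checkpoint -r -n c1 -D /tmp/c1-dump
```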

Note: CRIU has a pre-dump command that "...would generate another set of pre-dump images which will contain memory changed after previous pre-dump", but it looks like lxc-checkpoint and go-lxc do not expose this option. For now, we can experiment with a full memory dump and, if it takes too much time, switch to using the CRIU binary or API directly.
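If we do switch to driving CRIU directly, the iterative memory dump would look roughly like this (the init PID and image directories are placeholders):

```sh
# first pre-dump: full memory image, container keeps running
criu pre-dump -t <container init PID> -D /tmp/c1-img/1 --track-mem

# later pre-dumps contain only the memory changed since the previous one
criu pre-dump -t <container init PID> -D /tmp/c1-img/2 --prev-images-dir ../1 --track-mem

# final dump: only the remaining delta; the container is frozen/stopped at this point
criu dump -t <container init PID> -D /tmp/c1-img/final --prev-images-dir ../2 --track-mem
```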

Note: currently, the subutai checkpoint command leaves the original container in the "stopped" state; this should be changed to "frozen" so that, in case the restore operation fails, a quick roll back to the original container is possible.
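With plain LXC tools the intended rollback path would look like this (default lxcpath assumed):

```sh
lxc-freeze -n c1     # keep the source frozen, not stopped, while the dump is transferred and restored
# ... restore attempted on the destination ...
lxc-unfreeze -n c1   # if the restore fails, the source container resumes immediately
```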

  2. Dump transfer to the recipient

Considering network data transfer between remote peers, we assume that the peers are located behind NAT and are not directly accessible to each other. In this case, only two options seem possible: using a third machine that is accessible to both peers (e.g. the CDN), or creating a p2p tunnel. Although using the CDN to distribute the container dump has its advantages, such as the possibility to store backups, our objective is to reduce container downtime, and data transfer time is a significant part of it; pushing and pulling data through a third party is obviously not the fastest way.
Using direct p2p tunnels between remote peers looks even more suitable considering that the same mechanism is already implemented and tested for the port mapping feature.

Note: a feature to sync backups and dumps to the CDN may be implemented as a separate function in the future.


In conclusion, an approximate algorithm for live migration may be described as follows:

  1. Checking peers' hardware and software for CRIU support
  2. Creating a p2p tunnel between the source and destination peers
  3. Creating a full backup of the source container without stopping it
  4. Transferring the backup archive to the destination peer
  5. Restoring the container from the full backup on the destination peer without starting it
  6. Creating an incremental backup and memory dump of the source container, freezing it
  7. Transferring the backup with the memory dump to the destination
  8. Applying the incremental data backup and restoring the memory dump
  9. If step 6 or 7 fails, unfreezing the source container and reporting the migration failure

The algorithm with particular commands, and the commands themselves, are not completed yet and will be added soon.
Besides, we will meet lots of additional issues to solve during implementation, such as container port mapping, networking and credentials migration, etc. Given the complexity of what is already described, it is better to solve secondary issues as they come.


Update. Command implementation.

An initial version of the migration binding was added to the dev branch and may now be tested under the following conditions:

  • pc-kernel version is 4.4.0-67.88 (run snap list pc-kernel to check)
  • master-based containers
  • source RH has root key-based ssh access to destination RH
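The first and third conditions can be verified from host A, for example:

```sh
snap list pc-kernel                                  # expect version 4.4.0-67.88
ssh root@<IP of host B> true && echo "root ssh access to B works"
```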

The first two conditions, as already described, are temporarily unsolved bugs; the third one is a requirement. Test scenario (a consolidated script is sketched after the list):

  1. Deploy two resource hosts - let's call them A and B
  2. Generate an SSH key pair on A and add the public key to the authorized keys on B
  3. Clone the master template into a new container "c1" on A
  4. On peer A execute subutai migrate c1 -s prepare-data -d <IP of host B> - this command will prepare and send a full data backup of the container to host B. The container is still running on A.
  5. If the previous step succeeded, on host B run the command subutai migrate c1 -s import-data - the data backup will be deployed on B, the container is still "stopped"
  6. If the previous step succeeded, on host A execute subutai migrate c1 -s create-dump -d <IP of host B> - a memory dump will be created, the container frozen, and a differential data backup with the memory dump sent to host B
  7. If the previous step succeeded, on host B run subutai migrate c1 -s restore-dump - the differential backup with the memory dump will be restored and the container will continue running on host B
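Since host A has root ssh access to host B, the whole scenario can be driven from A. A rough consolidated sketch (the clone syntax is an assumption; the migrate sub-commands are exactly the ones above):

```sh
#!/bin/sh -e
# run on resource host A; with -e the script stops at the first failing step,
# matching the "continue only if the previous step succeeded" rule
DEST="<IP of host B>"

ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa          # step 2: key pair on A ...
ssh-copy-id root@"$DEST"                              # ... public key into B's authorized keys

subutai clone master c1                               # step 3 (assumed clone syntax)

subutai migrate c1 -s prepare-data -d "$DEST"         # step 4: full data backup, c1 keeps running on A
ssh root@"$DEST" subutai migrate c1 -s import-data    # step 5: deploy backup on B, container stays stopped
subutai migrate c1 -s create-dump -d "$DEST"          # step 6: freeze c1, dump memory, send differential backup
ssh root@"$DEST" subutai migrate c1 -s restore-dump   # step 7: restore on B, c1 continues running there
```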

It is possible to continue the migration process only if each step completes successfully. The migration has not been completely tested and will probably fail in many different cases; the current version of the migration algorithm is not "production ready", it is just a starting point for development.
Since it's not possible to dump a shell job inside the container, the easiest way to test the "liveness" of this migration is, for example, to apply non-persistent iptables rules, create files in /tmp, etc., and check that the container runtime state persists, as sketched below.
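These commands run inside c1; the port number and file name are arbitrary examples:

```sh
# inside c1 on host A, before migration
iptables -A INPUT -p tcp --dport 12345 -j ACCEPT    # non-persistent rule, would be lost on a normal reboot
touch /tmp/live-migration-marker                    # marker a freshly booted container would not keep

# inside c1 on host B, after migration
iptables -S INPUT | grep 12345                      # rule is still there => runtime state survived
ls /tmp/live-migration-marker
```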
