Backup to AWS EBS via Rsync and Boto

Update 5/15/2013:

  • Preserve file ownership in the backup
  • Maintain proper directory structure in the backup
  • Added workaround for the SSH “Connection refused” error
  • Terminate the EC2 instance when finished
  • Added section on automation

Overview

Amazon Web Services Elastic Block Store (EBS) provides cheap, reliable storage that is well suited to backups. The idea is to temporarily spin up an EC2 instance, attach your EBS volume to it, and upload your files. Transferring the data with rsync allows for incremental backups, which are fast and keep transfer costs down. Once the backup is complete, the EC2 instance is terminated. The whole process can be repeated as often as needed by attaching a new EC2 instance to the same EBS volume. I back up 8 GB from my own server weekly using this method; the backup takes about 3 minutes, and my monthly bill from Amazon is around $1.
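To see where the ~$1/month figure comes from, here is a rough estimate using 2013-era us-east prices (the per-GB and per-hour rates are assumptions from memory, not current quotes):

```python
# Rough monthly cost estimate for this backup scheme (assumed 2013 prices):
ebs_gb = 8                       # size of the backed-up data
ebs_price_per_gb_month = 0.10    # standard EBS volume, per GB-month
micro_price_per_hour = 0.02      # t1.micro on-demand, per hour
runs_per_month = 4               # weekly backups
hours_per_run = 1                # partial hours are billed as full hours

storage = ebs_gb * ebs_price_per_gb_month
compute = runs_per_month * hours_per_run * micro_price_per_hour
total = storage + compute
print(round(total, 2))
```

Storage dominates; the few minutes of t1.micro time per run are almost free.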

Setup

  1. If you don’t already have one, create an account with AWS.
  2. Take note of your access key ID and secret access key. You will need to place them in the script in order to connect to the AWS EC2 API.
  3. Create an Amazon EC2 key pair. You need this to launch and connect to your EC2 instance. Download the private key and store it on your system. In my example, I have the private key stored at /home/takaitra/.ec2/takaitra-aws-key.pem
  4. Create an EBS volume in your preferred zone (location). Make sure it is large enough to store your backups.
  5. Create a security group called “rsync” that allows connections on two inbound TCP ports: 22 (for SSH) and 873 (for the rsync daemon; strictly speaking only port 22 is needed here, since the script runs rsync over SSH).
  6. Ensure recent versions of Python and boto are installed on your system. In Debian, this is accomplished by running ‘apt-get install python-boto’

The Script

The script below automates the entire backup process via boto (a Python interface to AWS). Make sure to configure the VOLUME_ID, ZONE, and BACKUP_DIRS variables with your own values. Also update SSH_OPTS to point to the private key of your EC2 key pair, and fill in <aws access key> and <aws secret key> in the EC2Connection call.

#!/usr/bin/env python

import os
from boto.ec2.connection import EC2Connection
import time

IMAGE           = 'ami-3275ee5b' # Basic 64-bit Amazon Linux AMI
KEY_NAME        = 'takaitra-key'
INSTANCE_TYPE   = 't1.micro'
VOLUME_ID       = 'vol-########'
ZONE            = 'us-east-1a' # Availability zone must match the volume's
SECURITY_GROUPS = ['rsync'] # Security group allows SSH
SSH_OPTS        = '-o StrictHostKeyChecking=no -i /home/takaitra/.ec2/takaitra-aws-key.pem'
BACKUP_DIRS     = ['/etc/', '/opt/', '/root/', '/home/', '/usr/local/', '/var/www/']
DEVICE          = '/dev/sdh'

# Create the EC2 instance
print 'Starting an EC2 instance of type {0} with image {1}'.format(INSTANCE_TYPE, IMAGE)
conn = EC2Connection('<aws access key>', '<aws secret key>')
reservation = conn.run_instances(IMAGE, instance_type=INSTANCE_TYPE, key_name=KEY_NAME, placement=ZONE, security_groups=SECURITY_GROUPS)
instance = reservation.instances[0]
time.sleep(10) # Sleep so Amazon recognizes the new instance
while not instance.update() == 'running':
    time.sleep(3) # Let the instance start up
time.sleep(10) # Still feeling sleepy
print 'Started the instance: {0}'.format(instance.dns_name)
# Re-fetch the instance to pick up its public DNS name
reservation = conn.get_all_instances(instance_ids=[instance.id])[0]
instance = reservation.instances[0]

# Attach and mount the backup volume
print 'Attaching volume {0} to device {1}'.format(VOLUME_ID, DEVICE)
volume = conn.get_all_volumes(volume_ids=[VOLUME_ID])[0]
volumestatus = volume.attach(instance.id, DEVICE)
while not volume.status == 'in-use':
    time.sleep(3) # Wait for the volume to attach
    volume.update()
time.sleep(60) # Still feeling sleepy
print 'Volume is attached'
os.system("ssh -t -t {0} ec2-user@{1} \"sudo mkdir /mnt/data-store && sudo mount {2} /mnt/data-store && echo 'Defaults !requiretty' | sudo tee /etc/sudoers.d/rsync > /dev/null\"".format(SSH_OPTS, instance.public_dns_name, DEVICE))

# Rsync
print 'Beginning rsync'
for backup_dir in BACKUP_DIRS:
    os.system("rsync -e \"ssh {0}\" -avz --delete --rsync-path=\"sudo rsync\" {2} ec2-user@{1}:/mnt/data-store{2}".format(SSH_OPTS, instance.dns_name, backup_dir))
print 'Rsync complete'

# Unmount and detach the volume, terminate the instance
print 'Unmounting and detaching volume'
os.system("ssh -t -t {0} ec2-user@{1} \"sudo umount /mnt/data-store\"".format(SSH_OPTS, instance.dns_name))
volume.detach()
while not volume.status == 'available':
    time.sleep(3) # Wait for the volume to detach
    volume.update()
print 'Volume is detached'
print 'Terminating instance'
instance.terminate()
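For reference, the exact rsync invocation the loop in the script produces can be reproduced with a small helper (hypothetical, for illustration only; not part of the script):

```python
def build_rsync_command(ssh_opts, host, backup_dir):
    """Build the rsync command the script runs for one backup directory.

    Each directory is mirrored to the same path under /mnt/data-store on the
    instance. --delete keeps the copy exact, and --rsync-path runs the remote
    rsync under sudo so that file ownership is preserved.
    """
    return ('rsync -e "ssh {0}" -avz --delete --rsync-path="sudo rsync" '
            '{2} ec2-user@{1}:/mnt/data-store{2}').format(ssh_opts, host, backup_dir)
```

Note the trailing slash on each entry in BACKUP_DIRS: it makes rsync copy the directory's contents into the matching destination path rather than nesting a new directory inside it.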

Automation

Follow these steps to automate the backup. The steps may vary slightly depending on which distro you are running.

  1. Save the script to a file without a file extension, such as “rsync_to_ec2”. Cron (at least in Debian) ignores scripts with extensions.
  2. Configure the script as explained above.
  3. Make the script executable (chmod +x rsync_to_ec2)
  4. Check that the script is working by running it manually (./rsync_to_ec2). This may take a long time if this is your initial backup.
  5. Copy the script to /etc/cron.daily/ or /etc/cron.weekly/ depending on how often you want the backup to run.
  6. Profit!

About Takaitra

Matthew has a B.A. in computer science and is currently working toward a masters in software engineering at the University of St. Thomas. Currently working as a Java developer for Enclarity, his professional background is Java EE and other web technologies. His computer-related interests include web application development and Linux administration. Other interests are motorcycling, jogging, photography, small electronics and traveling. He lives in Minnesota with his beautiful wife, Tally, and his daughter Natalie.

5 thoughts on “Backup to AWS EBS via Rsync and Boto”

  1. @Jayson Reis
    I wonder if you can retrieve the CloudWatch ‘network availability’ status of an instance. If you could, you could use that to reliably determine if the instance is available on the network.

  2. @Jayson Reis
    I did notice that the instance wasn’t always usable even if the status was ‘running’. The time.sleep(10) is working for me although I’m sure your socket connection to ssh would be more reliable.

  3. instance.update() == ‘running’ is not enough to check if it is really running.
    Maybe you should do a connection to ssh with socket:

    import socket
    while True:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.connect((instance.dns_name, 22))
        except socket.error:
            time.sleep(3)
            continue
        break
