Incremental Backups using Rsync

Posted:	2007-06-02 23:43
Tags:	Debian

If you simply back up the filesystem on a remote server once a day via a cron job, you will be able to restore the data in the event of a hardware failure, which is clearly very useful. This isn't the only reason for performing a backup though. Another possibility is that you lose some data through an accident, a virus or a malicious user. If you have a backup you can simply restore the missing or modified files from it, but it is possible that your cron job will have run again before you get the chance. In these circumstances your backup will be a replica of the damaged filesystem, which will be no good at all.

The solution to this problem is to take backups of the filesystem at regular intervals, perhaps once a day or even once an hour, and to keep the older ones. If you do this simply by copying all the files each time, in a similar way to the one described in my last rsync article, you will quickly run out of disk space on the backup server.

Luckily you can reduce disk usage by making sure that, wherever possible, files in a new backup are created as hard links to an existing copy of the same file from a previous backup. This way each backup behaves like a full copy, but only files which have changed since the previous backup are physically copied. The rest are simply hard linked to the last backup which did contain a full copy of the file.

The example below shows how hard linking works. We have already set up a directory called orig containing two files, then we create a copy using hard links:

james@bose:~/hard$ cp -al orig copy

The a option preserves the directory structure and file attributes, and the l option hard links the files instead of actually copying them.

Notice that the inodes of the files in each directory are the same because they are the same physical files, and that du attributes nearly all of the disk usage to just one of the two directories.

james@bose:~/hard$ du -h
3.9M    ./copy
4.0K    ./orig
4.0M    .
james@bose:~/hard$ ls -li orig/
total 3988
379977 -rw-r--r-- 2 james james 4074531 2007-06-02 22:06 test1
380075 -rw-r--r-- 2 james james      29 2007-06-02 21:59 test2
james@bose:~/hard$ ls -li copy/
total 3988
379977 -rw-r--r-- 2 james james 4074531 2007-06-02 22:06 test1
380075 -rw-r--r-- 2 james james      29 2007-06-02 21:59 test2
james@bose:~/hard$

The du program counts a hard linked file only once, against the first directory in which it comes across it, so here it reports copy as using 3.9M while orig appears to use only 4.0K. If you ran du on each of the directories separately they would both report a size of 3.9M.
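If you want to try this yourself, the steps above can be reproduced with something like the following sketch. It assumes GNU cp (for the -al options) and works in a throwaway temporary directory; the file contents are made up and much smaller than in the listing above.

```shell
#!/bin/sh
# Sketch only: recreate the orig/copy experiment in a temporary directory.
demo=$(mktemp -d)
cd "$demo" || exit 1

mkdir orig
printf 'some test data\n' > orig/test1
printf 'more test data\n' > orig/test2

# -a preserves attributes and recurses, -l hard links instead of copying.
cp -al orig copy

# The first column (inode) is identical in both listings and the link
# count after the permissions is 2.
ls -li orig copy

# -ef is true when both names refer to the same file on disk.
if [ orig/test1 -ef copy/test1 ]; then
    echo "orig/test1 and copy/test1 are the same physical file"
fi

# du counts each hard linked file once, in whichever directory it visits first.
du -h .
```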

Rsync version 2.5.6 and above supports an option called --link-dest. It instructs rsync to use the specified directory (on the destination machine) as an additional hierarchy to compare against when a file is missing from the destination directory. Unchanged files are then hard linked from the link-dest directory into the destination directory rather than copied from the source location. The files must be identical in all preserved attributes (e.g. permissions, possibly ownership) in order to be linked together. If the link-dest directory is a relative path, it is relative to the destination directory, not the current working directory as you might have expected. Because the link-dest directory is not consulted if a file already exists in the destination, it is usually best to run rsync into a new, empty directory.
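To see --link-dest in action without involving a remote server, a small local experiment along these lines should do. This is only a sketch: the function name link_dest_demo and the file names are invented for illustration, and a local directory stands in for the server.

```shell
#!/bin/sh
# Sketch only: link_dest_demo is a hypothetical name and a local
# directory (src) plays the part of the remote filesystem.
link_dest_demo() {
    work=$(mktemp -d)
    mkdir "$work/src"
    echo "unchanged" > "$work/src/same.txt"
    echo "version 1" > "$work/src/changing.txt"

    # First backup: a plain full copy.
    rsync -a "$work/src/" "$work/backup.1/"

    # Change one file so that its size differs from the backed-up copy.
    echo "version 2, now longer" > "$work/src/changing.txt"

    # Second backup into a new, empty directory. ../backup.1 is resolved
    # relative to the destination (backup.0), not the current directory.
    rsync -a --link-dest=../backup.1 "$work/src/" "$work/backup.0/"

    # same.txt shares one inode across both backups, changing.txt does not.
    ls -li "$work/backup.0" "$work/backup.1"
}

# Example call:
#   link_dest_demo
```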

Anyway, imagine you have already run rsync to create a backup of a remote server filesystem in a directory called backup.0. An hour later you might want to create an incremental backup using the hard linking approach just mentioned, so rather than issue this command again:

rsync -aHxvz --delete --progress --numeric-ids -e "ssh -c arcfour -o Compression=no -x" root@example.com:/ backup.0/

you would issue these commands:

mv backup.0 backup.1
rsync -aHxvz --delete --progress --numeric-ids -e "ssh -c arcfour -o Compression=no -x" --link-dest=../backup.1 root@example.com:/ backup.0

It is important to get the / characters correct at the end of the paths to ensure rsync copies everything as you expect. This time backup.0 will contain the latest copy of the filesystem, but any files which have not changed will simply be hard linked to the corresponding file in backup.1, so the second backup takes up far less space than the first.
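The two commands above only ever keep one older backup. A minimal sketch of how the same idea might be stretched to several older backups follows. The function names rotate_backups and run_backup, the limit of three older backups and the /path/to/Backup directory are all assumptions made for illustration. The ssh cipher option is left out because newer OpenSSH releases no longer include arcfour.

```shell
#!/bin/sh
# Sketch only: rotate_backups and run_backup are hypothetical helper names.

# Shift backup.0 -> backup.1 -> ... -> backup.$max, deleting the oldest.
# $1 = directory holding the backups, $2 = highest backup number to keep.
rotate_backups() {
    dir=$1
    max=$2
    # Refuse to run with an empty directory argument.
    [ -n "$dir" ] || return 1
    rm -rf "$dir/backup.$max"
    i=$max
    while [ "$i" -gt 0 ]; do
        prev=$((i - 1))
        if [ -d "$dir/backup.$prev" ]; then
            mv "$dir/backup.$prev" "$dir/backup.$i"
        fi
        i=$prev
    done
}

# Rotate, then pull a fresh backup.0 hard linked against backup.1.
# $1 = rsync source such as root@example.com:/, $2 = backup directory.
run_backup() {
    rotate_backups "$2" 3 || return 1
    rsync -aHx --delete --numeric-ids -e ssh \
        --link-dest=../backup.1 "$1" "$2/backup.0/"
}

# Example call (not run here):
#   run_backup root@example.com:/ /path/to/Backup
```

Deleting the oldest backup does not damage the newer ones, because the data of a hard linked file is only freed once the last link to it has been removed.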

As an example here is the disk usage from the backups of one of my servers:

bose:/home/james/files/james/files/Backup# du -hsc backup.0 backup.1
4.9G    backup.0
141M    backup.1
5.0G    total

Note: Once again, although technically backup.1 is the older copy, the du command reports that backup.0 is using more disk space. As mentioned earlier, this is simply because du assigns the size of a hard linked file to the directory it comes across first, so backup.0 looks bigger than backup.1. Of course the only number that actually matters is the total. In this case it is 5.0G, which is a lot less than the 9.8G which would be needed to store two full copies of the filesystem if we weren't using hard links:

bose:/home/james/files/james/files/Backup# du -hs backup.0
4.9G    backup.0
bose:/home/james/files/james/files/Backup# du -hs backup.1
4.9G    backup.1

The beauty of this setup is that if you ever need to restore a file, it is as simple as copying it back to the server from whichever backup you need.
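As a rough illustration, restoring a single file could look something like the sketch below. The helper name restore_file and the etc/hosts path are assumptions; root@example.com is the same example server used in the commands earlier.

```shell
#!/bin/sh
# Sketch only: restore_file is a hypothetical helper name.
# $1 = backup directory to restore from, e.g. backup.1
# $2 = file path relative to the root of the backed-up filesystem
# $3 = rsync destination root (must end in /), defaulting to the example server
restore_file() {
    snapshot=$1
    path=$2
    dest=${3:-root@example.com:/}
    # -a and --numeric-ids keep permissions and ownership as they were backed up.
    rsync -a --numeric-ids -e ssh "$snapshot/$path" "$dest$path"
}

# Example call (not run here): copy /etc/hosts back from the previous backup.
#   restore_file backup.1 etc/hosts
```

The directory that the file is restored into has to exist on the server already.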
