I've been using rdiff-backup to create reverse-incremental backups, and the super-cool archfs to mount these backup repositories and easily browse past revisions. It's elegant and efficient and hackerish and Unixy and all good, but the increments are new data that also need backing up. So how do you back up a ~70GB backup repository?
A future of infinite storage is just over the horizon, but we're not quite there yet, and so backups require some kind of volumization process, meaning splitting that 70GB up into DVD-sized chunks. No problem, says tar, but the repository will grow with every backup, and the new files could be in any subdirectory of the repository. This means that once you've burnt DVDs 1 to 15, there's no trivial way to fill DVD 16 with just the new files; you have to re-volumize the whole lot and burn a fresh set of DVDs.
The solution, I believe, is duplicity, a sister project of rdiff-backup. Its mission in life is to create encrypted remote backups, but almost as a side-effect it creates a forward-incremental repository.
The difference between forward and reverse incremental is this: rdiff-backup creates an exact mirror of the latest copy of the data, and stores old versions of files - in fact, diffs between new and old - in a special subdirectory. To recover a file from 4 revisions ago, you start with the latest version and apply the 4 diffs in reverse order. Duplicity, on the other hand, starts with a full copy of the oldest version of the data, and stores new versions of files as diffs between old and new. So to recreate the latest version of a file, you start with the original, and apply all the diffs required to bring it up to date.
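The two schemes can be sketched with a toy model. This is only an illustration, not how either tool stores data: here a revision is a plain dict of filename to contents, a "diff" is a dict of changed entries (with None meaning the file is absent), and the names make_diff, apply_diff, restore_reverse and restore_forward are made up for this example.

```python
def make_diff(old, new):
    """Return the changes needed to turn revision `old` into revision `new`."""
    diff = {}
    for name in old.keys() | new.keys():
        if old.get(name) != new.get(name):
            diff[name] = new.get(name)  # None marks a file absent in `new`
    return diff


def apply_diff(state, diff):
    """Return a copy of `state` with `diff` applied."""
    result = dict(state)
    for name, contents in diff.items():
        if contents is None:
            result.pop(name, None)
        else:
            result[name] = contents
    return result


# Three successive revisions of the same directory.
revisions = [
    {"a.txt": "v1"},
    {"a.txt": "v2", "b.txt": "new"},
    {"a.txt": "v3"},
]

# Reverse-incremental (rdiff-backup style): latest mirror plus backward diffs.
mirror = revisions[-1]
reverse_diffs = [make_diff(revisions[i + 1], revisions[i])
                 for i in range(len(revisions) - 1)]

# Forward-incremental (duplicity style): oldest full copy plus forward diffs.
full = revisions[0]
forward_diffs = [make_diff(revisions[i], revisions[i + 1])
                 for i in range(len(revisions) - 1)]


def restore_reverse(mirror, reverse_diffs, steps_back):
    """Start at the latest mirror and walk backwards `steps_back` revisions."""
    state = mirror
    wanted = reverse_diffs[len(reverse_diffs) - steps_back:]
    for diff in reversed(wanted):
        state = apply_diff(state, diff)
    return state


def restore_forward(full, forward_diffs, index):
    """Start at the oldest full copy and walk forwards to revision `index`."""
    state = full
    for diff in forward_diffs[:index]:
        state = apply_diff(state, diff)
    return state
```

In this model the latest revision costs nothing to read in the reverse scheme and every diff in the forward scheme. That is the trade-off described above.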
The crucial difference between the two schemes is that duplicity repositories only ever grow. Even if you delete a file, then run duplicity, it simply stores a note that the file was deleted; it doesn't actually remove any files from the repository. This means you can use duplicity to create neatly volumized chunks of an rdiff-backup repository that only grow forward in time, which is perfect for burning a series of DVDs.
They're strongly GPG-encrypted too.
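Here is a rough sketch of why a grow-only repository suits DVDs. It assumes a simplified model in which each backup run packs whole entries into fresh fixed-size volumes. The function add_session and the (name, size) entries are hypothetical, and real duplicity volumes are archive files that can split data across volumes, which this ignores.

```python
def add_session(volumes, entries, capacity):
    """Pack `entries` (name, size) into new volumes of at most `capacity`.

    Existing volumes are returned unchanged, so anything already burnt
    to disc stays valid; each run only ever adds volumes at the end.
    """
    new_volumes = []
    current, used = [], 0
    for name, size in entries:
        if size > capacity:
            raise ValueError(f"{name} does not fit in one volume")
        if current and used + size > capacity:
            new_volumes.append(current)
            current, used = [], 0
        current.append((name, size))
        used += size
    if current:
        new_volumes.append(current)
    return volumes + new_volumes


# First run: three files packed into volumes of capacity 5.
first = add_session([], [("a", 3), ("b", 3), ("c", 2)], capacity=5)

# Second run: one new file, plus a zero-size note that "a" was deleted.
second = add_session(first, [("d", 4), ("a (deleted)", 0)], capacity=5)
```

The volumes from the first run appear unchanged at the start of the second result, so only the trailing volumes need burning.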
My scheme is thus to keep an rdiff-backup repository for Time Machine-like functionality, and back it up with duplicity. (And then burn DVD sets and give them to trusted accomplices across the globe, in case some level of catastrophe befalls Europe, destroying my data but leaving me alive.)
Two problems have arisen:
- Duplicity repositories seem to be extremely fragile. If you lose one volume in the backup chain, duplicity refuses to restore the entire chain.
- Duplicity claims to be in beta and acknowledges that there may still be bugs. Should I trust it?
The first problem is probably just a limitation of the current version of duplicity, or perhaps I'm just not using it properly. It makes sense that all volumes after the missing volume will be rendered doubtful, but you should be able to at least restore the volumes up to the missing one. Assuming this is fixed, the system is still rather fragile... but then the whole point of making backups is that you make lots of copies of them, isn't it?
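The behaviour I'd expect can be stated in a few lines. This describes what I think should happen, not what duplicity does: the chain is modelled as increments numbered from 0 (the full backup), and restorable_prefix is a made-up helper.

```python
def restorable_prefix(chain_length, missing):
    """Return how many leading increments can still be applied.

    Each increment depends on all earlier ones, so the chain is usable
    only up to (not including) the first missing index.
    """
    missing = set(missing)
    for index in range(chain_length):
        if index in missing:
            return index
    return chain_length
```

With 15 volumes and volume 7 lost, restorable_prefix(15, {7}) gives 7: volumes 0 to 6 restore a consistent older state, and everything from 7 onwards is doubtful.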
The second problem has only one solution: I must source dive, dive, dive! If I'm going to entrust the security and survival of my digital life's work to a system, I want a supremely comprehensive understanding of that system.
And so the point of this post is to announce Life Goal #1315013: to understand and destruction-test the rdiff-backup and duplicity source code. They're both written in Python, which I'm quite familiar with, so it'll be slightly harder than Life Goal #1315012 (get an Amazon S3 account), and slightly easier than Life Goal #1315011 (accidentally cause and defeat a zombie uprising).
Comments
I'm currently implementing an rdiff-backup/duplicity backup solution. I'm using rdiff-backup for local revision histories and duplicity as my offsite backup. How did your source digging go? Are you currently using these two programs? If so, have you had any issues, and what versions are you using?
Hello cake. I'm still using rdiff-backup 1.2.7 and duplicity 0.5.09 (the versions in the Ubuntu 9.04 repositories) with no real problems. I've not done any source digging due to a combination of laziness and busyness. The development of both programs appears to be continuing at a brisk pace, so I have quite a bit of faith in both now. Some time soon I intend to run an alternate backup process occasionally - something like BoxBackup - just to avoid the eggs/baskets problem.
James, I've been looking at implementing a very similar backup scheme. I was wondering how the restore process works, though. If I have a rdiff-backup repository for the last 30 days locally on my home network, then I back that up using duplicity to S3, can I restore the rdiff-backup diffs for say, 90 days ago and then use rdiff-backup to restore them to the appropriate machine? Or is your duplicity repository simply a mirror of your rdiff-backup repository? It would be really nice if the catalogs for rdiff-backup worked so that it could just take any diffs and apply them if they exist, then the former scheme could work.
dcode, I'm not sure I completely understand your scenario. rdiff-backup stores a complete mirror, plus diffs; I haven't even thought about splitting the diffs out from the mirror. My duplicity repository is just a mirror of the mirror+diffs.
With rdiff-backup, you can, for example, only keep the diffs for the last 30 days. If you back up every day, you'll have 29 sets of diffs and one mirror copy of the most recent revision. If you run duplicity once per week with a full backup every 30 days, you will have backed up each set of diffs, say for the last year. You could feasibly restore a file on your desktop that is 9 months old. You would have to restore all the rdiff-backup diffs from your duplicity mirror, then apply the diffs to restore the file to that point in time. The only thing is, I'm not sure if the catalogs in rdiff-backup would allow you to do this.
You mention that duplicity mirrors only ever grow. I'm just trying to figure out how I can have a rolling backup scheme with a local mirror for things that I'm more likely to restore (last 30 days) and a long-term offsite backup on S3 using duplicity.
Thanks for this James - I wouldn't have thought of using rdiff-backup and duplicity.
It's been a couple of years; do you have any more insights into this approach? I'm currently deciding how to do my backups.
dcode, if you are still there after all this time, apologies for the slightly late response. I now understand what you mean, but unfortunately I don't believe rdiff-backup will do it out of the box. Whether or not you can do it manually - applying the extra reverse diffs by hand - would be an interesting exercise for the reader.
David, I used this approach for a year or so, but eventually grew weary of it. The end product was always a set of volumes destined for DVDs, and I found I just didn't have the time or patience to sit down and go through the burning and disc shuffling process. Amazon EC2, despite being considerably more expensive than DVDs, is now my preferred backup solution. It takes a huge amount of time to upload many gigabytes of data over my home broadband connection, but it can be trickled overnight without any human intervention. Volumisation is unnecessary because:
Reverse-increments are provided by EC2's built-in EBS-to-S3 snapshots.
Tools used:
I'm only just getting started using this new scheme, so I'll probably write it up properly once I've been using it for a few years.
If you have a small enough amount of data that the DVD burning process doesn't scare you, then I'd say rdiff-backup and duplicity is a fine solution.
Thanks for your helpful reply, James. EC2/S3 is something I'll definitely consider for the future, though I'm not familiar with the basics of the systems themselves. I think for now I'll focus on getting local backups to HDs and DVDs done well, and then see if I can get something sorted for cloud backups, too.
This tutorial is the best-looking that I've found so far: http://blog.blackpepper.co.uk/black-pepper-blog/Using-Amazon-EC2-EBS-S3-for-automated-backups.html
Thanks again!
Btw, I've subscribed to your feed - interesting writing!
Hi James,
What was the reason for dropping duplicity and going to rsync? The current version supports many backends, including WebDAV. By using rsync alone you lose the ability to have incremental backups (unless you script it yourself).
Thanks.
John Doe,
I became less interested in having incremental backups across the whole of my data. The bulk of it (for example, photographs) changes rarely if at all, and the rest I tend to keep in version control repositories. The key issue is that storing history for a file is a different task than backing it up, even though systems like Time Machine and rdiff-backup conflate the two, and they are admittedly similar. Files for which I wish to retain a history, I explicitly put under version control. I then worry about keeping the repository (and the rest of the unversioned data) safe using other means, like RAID-1 and offsite backups to EC2.
I probably shouldn't have mentioned Amazon's EBS-to-S3 snapshot feature as a tool for incremental backup; it's more about providing pure redundancy, and a guard against the occasional accidental deletion that might get propagated to the offsite backup. It's also much, much slower than I had expected.