I've been using rdiff-backup to create reverse-incremental backups, and the super-cool archfs to mount these backup repositories and easily browse past revisions. It's elegant and efficient and hackerish and Unixy and all good, but the increments are new data that also need backing up. So how do you back up a ~70GB backup repository?
A future of infinite storage is just over the horizon, but we're not quite there yet, and so backups require some kind of volumization process, meaning splitting that 70GB up into DVD-sized chunks. No problem, says tar, but the repository will grow with every backup, and the new files could be in any subdirectory of the repository. This means that once you've burnt DVDs 1 to 15, there's no trivial way to fill DVD 16 with just the new files; you have to re-volumize the whole lot and burn a fresh set of DVDs.
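The re-volumization problem can be sketched in a few lines. This is a toy model, not how tar actually packs multi-volume archives: `volumize` is a hypothetical helper that greedily packs (name, size) pairs into fixed-size volumes, but it shows why one new file in the middle of the tree shifts every volume boundary after it.

```python
def volumize(files, volume_size):
    """Greedily pack (name, size) pairs into fixed-size volumes."""
    volumes, current, used = [], [], 0
    for name, size in files:
        if used + size > volume_size and current:
            volumes.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        volumes.append(current)
    return volumes

# First backup: four files packed into 1000-unit volumes.
old = [("a", 700), ("b", 400), ("c", 500), ("d", 300)]
print(volumize(old, 1000))   # [['a'], ['b', 'c'], ['d']]

# A new file lands in the middle of the tree, and every volume
# boundary after it shifts -- volumes 2 and 3 must be re-burnt.
new = [("a", 700), ("aa", 500), ("b", 400), ("c", 500), ("d", 300)]
print(volumize(new, 1000))   # [['a'], ['aa', 'b'], ['c', 'd']]
```

Only the first volume survives intact; everything downstream of the new file has to be re-cut.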
The solution, I believe, is duplicity, a sister project of rdiff-backup. Its mission in life is to create encrypted remote backups, but almost as a side-effect it creates a forward-incremental repository.
The difference between forward and reverse incremental is this: rdiff-backup creates an exact mirror of the latest copy of the data, and stores old versions of files - in fact, diffs between new and old - in a special subdirectory. To recover a file from 4 revisions ago, you start with the latest version and apply the 4 diffs in reverse order. Duplicity, on the other hand, starts with a full copy of the oldest version of the data, and stores new versions of files as diffs between old and new. So to recreate the latest version of a file, you start with the original, and apply all the diffs required to bring it up to date.
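The bookkeeping of the two schemes can be modelled in a few lines of Python. The "deltas" here are just stored (old, new) pairs; real rdiff-backup and duplicity use rsync-style binary deltas, but the shape of the restore process is the same.

```python
def make_delta(old, new):
    return (old, new)                    # placeholder for a real binary diff

def apply_delta(base, delta):
    old, new = delta
    assert base == old, "delta does not apply to this base"
    return new

versions = ["v1", "v2", "v3", "v4", "v5"]

# Reverse-incremental (rdiff-backup): keep a mirror of the latest version,
# plus deltas that each step one version backwards in time.
mirror = versions[-1]
back_deltas = [make_delta(new, old) for old, new in zip(versions, versions[1:])]

# Restore the version 4 revisions ago: apply the 4 deltas in reverse order.
state = mirror
for delta in reversed(back_deltas):
    state = apply_delta(state, delta)
print(state)                             # v1

# Forward-incremental (duplicity): keep the oldest full copy, plus deltas
# that each step one version forwards in time.
base = versions[0]
fwd_deltas = [make_delta(old, new) for old, new in zip(versions, versions[1:])]

latest = base
for delta in fwd_deltas:
    latest = apply_delta(latest, delta)
print(latest)                            # v5
```

Same deltas, opposite anchors: rdiff-backup anchors at the newest version, duplicity at the oldest.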
The crucial difference between the two schemes is that duplicity repositories only ever grow. If you delete a file and then run duplicity, it simply records a note that the file was deleted; it doesn't actually remove anything from the repository. This means you can use duplicity to create neatly volumized chunks of an rdiff-backup repository that only grow forward in time, which is perfect for burning a series of DVDs.
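The append-only property can be sketched like so. This is a toy structure, not duplicity's actual volume format: each backup run appends one increment, and a deletion just appends a tombstone record.

```python
repository = []  # list of increments; each increment is a dict of changes

def backup(changes):
    repository.append(changes)            # the repository only ever appends

def current_state():
    state = {}
    for increment in repository:          # replay increments oldest-first
        for path, content in increment.items():
            if content is None:           # tombstone: file was deleted
                state.pop(path, None)
            else:
                state[path] = content
    return state

backup({"notes.txt": "draft", "todo.txt": "buy dvds"})
backup({"notes.txt": "final"})
backup({"todo.txt": None})                # deletion recorded, nothing removed

print(current_state())                    # {'notes.txt': 'final'}
print(len(repository))                    # 3 -- all three increments kept
```

Because nothing is ever rewritten, increments already burnt to DVD never change; each new backup run only adds volumes to the end of the series.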
The volumes are strongly GPG-encrypted too.
My scheme is thus to keep an rdiff-backup repository for Time Machine-like functionality, and back it up with duplicity. (And then burn DVD sets and give them to trusted accomplices across the globe, in case some catastrophe befalls Europe which destroys my data but leaves me alive).
Two problems have arisen:
- Duplicity repositories seem to be extremely fragile. If you lose one volume in the backup chain, duplicity refuses to restore anything from the entire chain.
- Duplicity claims to be in beta and acknowledges that there may still be bugs. Should I trust it?
The first problem is probably just a limitation of the current version of duplicity, or perhaps I'm just not using it properly. It makes sense that all volumes after the missing volume will be rendered doubtful, but you should be able to at least restore the volumes up to the missing one. Assuming this is fixed, the system is still rather fragile... but then the whole point of making backups is that you make lots of copies of them, isn't it?
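The reasoning behind "restore up to the missing volume" is easy to sketch. In a forward-incremental chain, every version before the break depends only on volumes that still exist; a hypothetical recovery routine could just stop replaying at the gap (duplicity doesn't currently do this, as far as I can tell).

```python
def restorable_versions(chain):
    """Given a chain [full, inc1, inc2, ...] where a lost volume is None,
    return how many versions can still be restored from the start."""
    count = 0
    for volume in chain:
        if volume is None:               # the chain is broken here
            break                        # everything after is doubtful
        count += 1
    return count

chain = ["full", "inc1", "inc2", None, "inc4"]   # inc3 was lost
print(restorable_versions(chain))                # 3: full, +inc1, +inc2
```

Losing one volume should cost you the versions after it, not the whole history.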
The second problem has only one solution: I must source dive, dive, dive! If I'm going to entrust the security and survival of my digital life's work to a system, I want a supremely comprehensive understanding of that system.
And so the point of this post is to announce Life Goal #1315013: to understand and destruction-test the rdiff-backup and duplicity source code. They're both written in Python, which I'm quite familiar with, so it'll be slightly harder than Life Goal #1315012 (get an Amazon S3 account), and slightly easier than Life Goal #1315011 (accidentally cause and defeat a zombie uprising).