Save bandwidth and disk space with rdiff

The rdiff command line tool is rather poorly documented if you ask me, so maybe it's a good idea to write some more about it.

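Usually you get the functionality provided by rdiff through rdiff-backup, which is built on the same library, or through rsync, which has its own implementation of the same algorithm. In these cases there's no need to know how rdiff works. But in some cases you want to use the rdiff tool directly.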

Say you create a tar archive of some directory every day. This is not really space efficient: a lot of data which hasn't changed is duplicated each day. Besides, the data sits on your hard disk at least twice: the real data and the tar archive.
But here rdiff (part of librsync) comes to the rescue: it can operate on a signature of the original data instead of the data itself. The signature is much smaller than the original data, typically by some orders of magnitude.

Say the first archive is called B.tar. We create a signature for it first:

$ rdiff signature B.tar B.tar.sig

Now you can stuff B.tar somewhere safe; you don't need it anymore for making differential backups. You still need it for restoring, of course. The signature file will take care of all differential backups in the future.

So, what's inside B.tar.sig now? When you run this command, B.tar is chopped up into blocks. A checksum is calculated for each block and stored in the signature. When a newer file is almost the same as the original, most of its blocks will match one of these checksums. Blocks of the newer file that match no checksum in the signature are put into the delta file as data.
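
To make the idea a bit more concrete, here is a deliberately simplified sketch of it in plain shell. It is not how rdiff is implemented: it assumes fixed, aligned 4-byte blocks and uses cksum as the only checksum, whereas rdiff can also find a block that has shifted to a different offset. The file names old.bin and new.bin are made up for the demo.

```shell
#!/bin/sh
# Simplified illustration only: fixed, aligned 4-byte blocks and cksum,
# not the checksums or the block matching that rdiff really uses.
work=$(mktemp -d) || exit 1
cd "$work" || exit 1

printf 'aaaabbbbccccdddd' > old.bin   # hypothetical "original" file
printf 'aaaaXXXXccccdddd' > new.bin   # same file with one block changed

# Chop each file into 4-byte blocks: old.aa, old.ab, ... and new.aa, new.ab, ...
split -b 4 old.bin old.
split -b 4 new.bin new.

# The "signature": one checksum line per block of the old file.
for f in old.??; do cksum < "$f"; done > old.sig

# The "delta": blocks of the new file whose checksum is not in the signature.
changed=0
for f in new.??; do
    sum=$(cksum < "$f")
    if ! grep -qxF "$sum" old.sig; then
        changed=$((changed + 1))
        echo "block $f has no match, its data goes into the delta: $(cat "$f")"
    fi
done
echo "$changed of 4 blocks would go into the delta"

# Clean up the scratch directory.
cd / && rm -rf "$work"
```

Only the block containing XXXX is reported. The other three blocks are already described by the signature, so a delta only has to refer to them.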

Life goes on and the next day a new tar archive has been created, B-20080124.tar. Now we want to calculate the differences compared to the original B.tar. Given its signature and the new file, you can create a delta now:

$ rdiff delta B.tar.sig B-20080124.tar B-20080124.delta

or

$ rdiff delta B.tar.sig B-20080124.tar | bzip2 -c > B-20080124.delta.bz2

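The latter command does the same, only it compresses the delta with bzip2. You have to decompress it again (with bunzip2, for example) before you can use it with rdiff patch.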

So the delta file contains only the data of those blocks that could not be matched against the signature, plus references to the blocks that did match. You can safely throw away B-20080124.tar now, because it can be reconstructed from the original file and the delta. Like this:

$ rdiff patch B.tar B-20080124.delta B-20080124.tar

The last argument is the destination file. So we end up with an exact copy of a file we've just thrown away.
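
If you don't want to take that on faith, you can try the whole round trip on throwaway data first. The script below is only a sketch: the file names are made up, the test data is random, and it assumes rdiff, mktemp, head and cmp are available on your system.

```shell
#!/bin/sh
# Round trip in a scratch directory; all file names here are made up for the demo.
roundtrip() {
    command -v rdiff >/dev/null 2>&1 || { echo "rdiff is not installed"; return 2; }
    dir=$(mktemp -d) || return 1

    # A 200 kB "original" and a "new" version with one line appended.
    head -c 200000 /dev/urandom > "$dir/old.bin"
    cp "$dir/old.bin" "$dir/new.bin"
    echo "one more line" >> "$dir/new.bin"

    # signature -> delta -> patch, exactly as in the commands above.
    rdiff signature "$dir/old.bin" "$dir/old.sig"                  || return 1
    rdiff delta "$dir/old.sig" "$dir/new.bin" "$dir/new.delta"     || return 1
    rdiff patch "$dir/old.bin" "$dir/new.delta" "$dir/rebuilt.bin" || return 1

    # cmp -s is silent and exits 0 only if both files are byte-for-byte identical.
    cmp -s "$dir/new.bin" "$dir/rebuilt.bin"
    status=$?
    rm -rf "$dir"
    return $status
}

if roundtrip; then
    result=identical
else
    result=failed
fi
echo "round trip: $result"
```

The same cmp check works on real archives, as long as you run it before you delete the new tar file.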

So instead of having approx. n*s MB of backup data, with n the number of backups and s the size of one backup, we now have about s+(n*s*0.1) MB of data (assuming each differential backup is 10% of the original archive size). So it's worth the effort when we're talking about hundreds of MB.
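
To put numbers on that, here is the same estimate worked out in shell arithmetic. The values n=30 and s=500 are made up for illustration, and the 10% figure is the assumption from above.

```shell
#!/bin/sh
n=30     # number of backups (hypothetical)
s=500    # size of one full archive in MB (hypothetical)

full=$((n * s))            # every backup stored as a full archive
diff=$((s + n * s / 10))   # one full archive plus n deltas of 10% each

echo "full archives:   $full MB"
echo "archive + delta: $diff MB"
```

With these numbers that is 15000 MB against 2000 MB.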

A simpler use case would be to synchronize a very big file that is almost identical on two PCs. Just copying it would be a waste of bandwidth. Thanks to rdiff you only have to transfer a signature and a delta, which is usually much faster.

First create a signature of the file you want to replace, send the signature to the PC with the newer version, create a delta there and send back the delta. Now you can reconstruct (patch) the file and both files should be identical. We have only sent a signature and a delta, so that was quite cheap.
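
Those steps could be wrapped in a small shell function like the one below. Treat it as a sketch under a few assumptions: the function name sync_big_file is made up, the file lives at the same path on both machines, the path contains no spaces, and ssh, scp and rdiff are available on both sides.

```shell
#!/bin/sh
# sync_big_file HOST FILE: update the local FILE so it matches HOST's copy.
# Hypothetical helper; assumes the same path on both machines, no spaces in it.
sync_big_file() {
    host=$1
    file=$2

    # 1. Signature of the local, outdated copy.
    rdiff signature "$file" "$file.sig" || return 1
    # 2. Send the signature to the machine with the newer version.
    scp "$file.sig" "$host:$file.sig" || return 1
    # 3. Create the delta over there, against the newer file.
    ssh "$host" rdiff delta "$file.sig" "$file" "$file.delta" || return 1
    # 4. Send the delta back.
    scp "$host:$file.delta" "$file.delta" || return 1
    # 5. Patch the old copy into a new file, then replace the old one.
    rdiff patch "$file" "$file.delta" "$file.new" || return 1
    mv "$file.new" "$file"
}
```

You would call it as, say, sync_big_file otherpc /data/big.img, where both the host name and the path are placeholders. Only the .sig and .delta files cross the network.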

So as you have seen in the two examples above, rdiff can save you a lot of disk space and bandwidth.