How to make backups with Git using gibak

Due to circumstances I had to look for a new way to back up my home directory. I was using rdiff-backup(1) before, but I wasn't completely satisfied with it. The overhead of incremental data was quite large, especially when files were moved around. Overall, though, it does its job quite well.

However, for a year or so I've been quite charmed by Git, the stupid content tracker. It's a great tool for software development, but thanks to its advantages over other VCSs it's also suitable for backing up structures like a user's home directory. Git is quite smart at detecting file changes and uses disk space efficiently. It also allows you to copy your backup wherever you want and keep those copies up to date easily (granted, rdiff-backup is also capable of storing your backups on a remote machine without much hassle).

To make Git-based backups, I decided to use gibak, which is basically a wrapper for Git. With a single command you can update your backup; it also deals with ignored files, other Git repositories inside your home directory and some other things.

Gentoo users can find the ebuild here or at the attachment below. Or look in your favorite distro's package repository or build it from source.

Making backups

After installation, you can create a repository by running the following from your home directory:

$ gibak init

This will create a ~/.git directory, where your backup files will reside. At this point, you can create a ~/.gitignore file to exclude certain directories from the backup (of course you can also put ~/.gitignore files on deeper levels, it's just Git you're dealing with after all).
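
As an illustration, ignore rules along the following lines would keep caches and downloads out of the backup. The ignored paths (.cache/, Downloads/, *.tmp) are only examples I made up, not something gibak prescribes, and the sketch uses plain git in a scratch directory so it can be tried without touching your real home directory.

```shell
# Sketch: try out ignore rules in a scratch "home" directory.
# The ignored paths below are illustrative assumptions.
demo_home=$(mktemp -d)
cd "$demo_home" || exit 1
git init -q .

cat > .gitignore <<'EOF'
# Caches can always be regenerated
.cache/
# Large downloads are not worth backing up
Downloads/
# Scratch files
*.tmp
EOF

mkdir -p .cache Downloads Documents
touch .cache/index Downloads/big.iso Documents/letter.txt Documents/draft.tmp

# Stage everything; ignored paths are skipped automatically.
git add .

# List what would end up in the backup.
git ls-files
```

Only .gitignore itself and Documents/letter.txt should be listed; everything matched by the rules stays out of the repository.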

If you like, you can move the ~/.git directory somewhere else, to another partition for example. After all, if you lose your home directory, you'll definitely lose ~/.git too, so putting it on another partition reduces the chance of losing your backup:

$ mv ~/.git /mnt/files/backup
$ ln -s /mnt/files/backup ~/.git

Later on we'll discuss how we get this backup on remote machines as well.

If everything is in place, we can proceed to make our first commit. From your home directory, invoke:

$ gibak commit

Although Git is quite efficient, this command can take several minutes to finish. For me, the initial commit took over 30 minutes. Later commits won't take that long, as only the changes are processed.

Restoring

So now your home directory has turned into a full-fledged Git repository. If you messed up a file (or accidentally deleted something), a simple git checkout will immediately rescue it for you:

$ rm foo.txt
$ git checkout foo.txt

If you wish to restore a file which was deleted before the last backup commit, you have to do a little more. With git whatchanged we can see which files have been added, modified and deleted during a directory's lifetime. Its output gives us pointers on where to dig in the repository to get a file back, in this example tmp/note13.txt:

$ cd tmp
$ git whatchanged .
commit 3131f9661ec1739f72c213ec5769bc0abefa85a9
Author: Bram Schoenmakers <bram@example.com>
Date: Mon Nov 10 00:50:31 2008 +0100

Committed on Mon, 10 Nov 2008 00:50:27 +0100

:100644 000000 7e59ff2... 0000000... D tmp/note13.txt

commit 94d242ff165c3a6042084b918ed92f7895276de2
Author: Bram Schoenmakers <bram@example.com>
Date: Sun Nov 9 13:33:37 2008 +0100

Committed on Sun, 09 Nov 2008 13:33:26 +0100

:100644 100644 2efa8bf... 7e59ff2... M tmp/note13.txt

commit 611d4e358150fd009d9a1659004ab01db5d3e94c
Author: Bram Schoenmakers <bram@example.com>
Date: Sun Nov 9 13:30:07 2008 +0100

Committed on Sun, 09 Nov 2008 13:29:59 +0100

:000000 100644 0000000... 2efa8bf... A tmp/note13.txt

You can see three commits in which this particular file was involved: one adding it, one modifying it and finally one deleting it (that's what the characters A, M and D indicate).
Let's explain the other columns in front of the action type and file name. The first column describes the file mode before this commit (or just 000000 when the file didn't exist before that), the second column describes the file mode introduced by this commit. The third and fourth columns each contain an identifier. These point to the contents of the file before and after the commit, respectively (in Git terms, these are blob identifiers).
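
To make the column layout concrete, here is a small sketch that splits the deletion line from the output above into named fields. The variable names are my own; nothing here calls Git, it only takes the line apart.

```shell
# Sketch: split one raw-format line from `git whatchanged` into its columns.
line=':100644 000000 7e59ff2... 0000000... D tmp/note13.txt'

# Strip the leading colon, then let `read` split on whitespace.
read -r old_mode new_mode old_blob new_blob action path <<EOF
${line#:}
EOF

# The blob identifiers are abbreviated and end in "..."; drop that suffix.
old_blob_id=${old_blob%...}
new_blob_id=${new_blob%...}

echo "action:       $action"
echo "path:         $path"
echo "mode before:  $old_mode"
echo "mode after:   $new_mode"
echo "blob before:  $old_blob_id"
echo "blob after:   $new_blob_id"
```

For a deletion, the mode and blob after the commit are all zeros, so the blob before the commit (7e59ff2 here) is the one that still holds the contents.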

If we wish to restore tmp/note13.txt as it was just before deletion, we use the identifier 7e59ff2, so that the restored contents include the modifications I made in the meantime. To restore this file, execute:

$ git cat-file -p 7e59ff2 > note13.txt

We could also have used git checkout with the ID of the commit in which the file was last modified (94d242f...). The difference is that checkout also updates the index, so the next commit records the restored file.
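
The checkout variant can be sketched as follows. This is a self-contained demo in a scratch repository with made-up file contents, not a transcript from my home directory; it assumes a reasonably recent Git.

```shell
# Sketch: restore a deleted file with git checkout, in a scratch repository.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
mkdir tmp

# Recreate the add / modify / delete history from the example.
echo "first version" > tmp/note13.txt
git add tmp/note13.txt && git commit -q -m "add note"
echo "second version" > tmp/note13.txt
git commit -q -a -m "modify note"
git rm -q tmp/note13.txt && git commit -q -m "delete note"

# The most recent commit that touched the file is the deletion ...
deleting_commit=$(git rev-list -n 1 HEAD -- tmp/note13.txt)

# ... so its parent is the last commit that still contained the file.
git checkout -q "$deleting_commit^" -- tmp/note13.txt

restored=$(cat tmp/note13.txt)
echo "restored contents: $restored"
```

After the checkout the file is back in the working directory and already staged, so the next backup commit records it again.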

To restore a complete directory structure, use git reset. For example, if we wish to bring back the tmp folder to the state it was just before our last backup commit, we proceed as follows:

$ cd tmp
$ git reset HEAD^ .
$ git commit
$ git clean -df .

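The clean command is recommended because git reset does not remove the files that become untracked as a result of the reset. Also note that git reset with a path only rewrites the index: files that are still tracked keep their current contents in the working directory until you run git checkout -- . in that directory as well.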
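
Here is the directory restore as a self-contained sketch in a scratch repository, with made-up file names. It contains one step beyond the commands listed above: a git checkout -- . to bring the contents of still-tracked files back, since git reset with a path only changes the index. Treat that extra step as my assumption about what a complete restore needs.

```shell
# Sketch: roll a directory back to the previous backup commit.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
mkdir tmp

# First backup: one file.
echo "old" > tmp/a.txt
git add tmp && git commit -q -m "first backup"

# Second backup: the file changed and a new one appeared.
echo "new" > tmp/a.txt
echo "extra" > tmp/b.txt
git add tmp && git commit -q -m "second backup"

cd tmp || exit 1
git reset -q HEAD^ -- .          # index of tmp/ now matches the first backup
git commit -q -m "restore tmp"   # record the rollback
git checkout -q -- .             # extra step: sync tracked files with the index
git clean -q -df .               # remove files that became untracked

ls
cat a.txt
```

Afterwards tmp should contain only a.txt, with its old contents.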

Remote machines

As said before, your home directory has turned into a Git repository. So you can also share it like any Git repository, with the clone, push and/or pull subcommands. For example, if you have an SSH server running on the machine where your home directory resides, you can do a git clone from somewhere else:

Git 1.5.x:

$ git clone --bare bram@192.168.1.100:/home/bram
$ git remote add --mirror origin bram@192.168.1.100:/home/bram

Git 1.6.x:

$ git clone --mirror bram@192.168.1.100:/home/bram

We use --bare because we're not interested in a full checkout of your data; that would only waste space on the remote side. A bare clone also doesn't know its origin, so that's why we specify it afterwards.
With --mirror we say that the new repository shouldn't treat the branches as 'remote', but as 'local'. If we don't supply this, the head of the branch is not updated when we do a push.

To update your mirrored repository, execute a push from your local machine:

$ git push brambo@192.168.1.150:/home/brambo/backup

You can also run

$ git fetch

from the remote machine in order to update your copy.
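
The mirror workflow can be tried locally before involving a second machine. The sketch below uses two scratch directories instead of SSH URLs (an assumption made purely to keep it self-contained) and relies on a reasonably recent Git for the -C option.

```shell
# Sketch: mirror a repository and update the mirror with git fetch.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q "$work/home"
cd "$work/home" || exit 1
echo one > file.txt
git add file.txt && git commit -q -m "first backup"

# Mirror clone: bare, and branches are copied one-to-one.
git clone -q --mirror "$work/home" "$work/mirror.git"

# A new backup commit on the source side ...
echo two > file.txt
git commit -q -a -m "second backup"

# ... is picked up by a fetch run from the mirror.
git -C "$work/mirror.git" fetch -q

source_head=$(git rev-parse HEAD)
mirror_head=$(git -C "$work/mirror.git" rev-parse HEAD)
echo "source: $source_head"
echo "mirror: $mirror_head"
```

Both lines should show the same commit ID, which confirms that the mirror's own branch head moved along with the source.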

If you don't have shell access to the machine hosting your mirror, pass the --mirror option with push. This is because the remote repository was probably not created with mirroring in mind.

$ git push --mirror git://192.168.1.150/home/brambo/backup

Git submodules

Those who have other Git repositories in their home directory should be aware of one thing: gibak registers these repositories as submodules. A push will not transfer the repositories to the remote machine, and all your Git repositories appear empty there (only in case of a non-bare repository).

On your local machine, gibak creates a folder ~/.git/git-repositories. This folder contains up-to-date copies of your Git repositories. You should transfer it manually to the remote repositories, with rsync for example.

To get your repository data back in your remote checkout, make sure that ~/.git/git-repositories contains your Git repositories and invoke the following commands:

$ git submodule init
$ git submodule update

If you wish to update your Git repositories on the remote side you should update ~/.git/git-repositories with rsync first and then git submodule update.

Removing old data

In principle, your repository keeps growing with every commit you make. This may be undesirable if you don't have the storage for three years of history. There are several ways to solve this, as described on the gibak website.

  • Use:
    $ gibak rm-older-than date-spec

    The date-spec can be something like 5 days ago or 2007-12-14 13:05:00. But be warned: the author of this code cannot guarantee that this way of removing old data works 100% reliably.

  • Make a shallow copy of your ~/.git directory and replace the original with it. A shallow copy means that you cut off all history beyond a certain commit. You can make a clone by running:
    $ git clone --depth 5

    This will preserve the last 5 commits. Note that this cripples your repository somewhat: you cannot clone or fetch from it, which makes it harder to share with other machines.

  • The last option is quite rigorous, but simple. Simply delete your ~/.git folder and start from scratch.
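
The shallow clone from the second option can be sketched like this, in a scratch directory with a made-up history of eight commits. One detail to be aware of: Git ignores --depth when cloning from a plain local path, so the sketch uses a file:// URL.

```shell
# Sketch: make a shallow clone that keeps only the last 5 commits.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q "$work/full"
cd "$work/full" || exit 1

# Build a history of eight backup commits.
for i in 1 2 3 4 5 6 7 8; do
    echo "$i" > counter.txt
    git add counter.txt
    git commit -q -m "backup $i"
done

# --depth is ignored for plain local paths, hence the file:// URL.
git clone -q --depth 5 "file://$work/full" "$work/shallow"

full_count=$(git rev-list --count HEAD)
shallow_count=$(git -C "$work/shallow" rev-list --count HEAD)
echo "full history:    $full_count commits"
echo "shallow history: $shallow_count commits"
```

The clone is a normal repository with a working copy; replacing your real ~/.git with its .git directory is left out of the sketch on purpose.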

The gibak site also suggests using git rebase. It's not really a way of getting rid of history, but rather a way of rewriting it. With this subcommand it should be possible to get rid of certain files which were accidentally recorded in the repository. However, I shall not discuss this part because the steps given on the site aren't entirely correct, and I couldn't find the right way of getting rid of files. Besides, you'll run into more trouble (conflicts) if you have later commits which modified the file. Your home directory should also be entirely clean when you do a git rebase, a condition which is hard to satisfy in such a dynamic directory structure. So this part is a bit tricky to get right.

Space optimizations

This repository is not intended to be read from much, so personally I want it to take as little space as possible (at the expense of speed). The following commands help to accomplish this. Please note that not all options are available in older Git versions, but if you use the latest version from your distribution's repository you should be safe.

  • The following command tells Git to compress all data as much as possible (it's actually the compression level passed to zlib):
    $ git config core.compression 9
  • git gc cleans up your repository and packs all objects into a single pack with delta compression. By default, Git does this automatically when it has 6700 or more loose objects in its repository. Since a home directory has many and considerably large files, we want this to happen sooner, so we set the threshold to 1000.
    $ git config gc.auto 1000
  • This command makes your repository take slightly less space. However, please note that this only works for Git versions newer than 1.4.3. If you intend to share your repository on a machine with Git 1.4.3 or lower your repository becomes unreadable.
    $ git config repack.usedeltabaseoffset true
  • It is generally a good idea to run the following command once in a while:
    $ git gc --aggressive

    This will try harder to squeeze your objects efficiently in a pack. This may be very memory and disk intensive. If you really care about disk space, you could adjust the following option:

    $ git config gc.aggressiveWindow 15

    This defaults to 10, and with a higher value it tries harder to find common objects in your repository.
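
To see what these settings do, the following sketch applies them to a scratch repository with a single made-up file and compares the number of loose objects before and after an aggressive gc. On a real home directory you would run the git config and git gc commands in ~ instead.

```shell
# Sketch: apply the space settings and check the effect of git gc.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

# The settings discussed above.
git config core.compression 9
git config gc.auto 1000
git config gc.aggressiveWindow 15

echo "some sample data" > sample.txt
git add sample.txt && git commit -q -m "sample backup"

# "count" in the output of count-objects is the number of loose objects.
loose_before=$(git count-objects -v | sed -n 's/^count: //p')
git gc --aggressive --quiet
loose_after=$(git count-objects -v | sed -n 's/^count: //p')

echo "loose objects before gc: $loose_before"
echo "loose objects after gc:  $loose_after"
git count-objects -v
```

The size-pack line in the last command's output is the one to watch over time if you care about disk usage.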

One last warning

If you're already a Git user, you know that git reset can be a dangerous tool. In this context, always double-check before you make a hard reset. If you're not familiar with Git: a hard reset brings a repository back to a specified point in time (index and all your files), thereby erasing all changes made since the specified commit. This warning also holds for invoking git checkout master: you will lose all changes since your last commit.
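
One way to build in that double check is a small wrapper that refuses to reset while tracked files have uncommitted changes. The function name safe_hard_reset is hypothetical (it is not part of Git or gibak), and the demo around it runs in a scratch repository.

```shell
# Hypothetical helper: only do a hard reset when no tracked file has
# uncommitted changes. Untracked files are ignored by this check.
safe_hard_reset() {
    target=$1
    if [ -n "$(git status --porcelain --untracked-files=no)" ]; then
        echo "Refusing hard reset: tracked files have uncommitted changes." >&2
        return 1
    fi
    git reset -q --hard "$target"
}

# Demo in a scratch repository.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
echo "committed" > notes.txt
git add notes.txt && git commit -q -m "backup"

# An edit that has not been committed yet.
echo "edited" > notes.txt

if safe_hard_reset HEAD; then
    outcome=reset
else
    outcome=refused
fi
echo "outcome: $outcome"
echo "notes.txt contains: $(cat notes.txt)"
```

In the demo the reset is refused and the edit survives; after a gibak commit the same call would go through.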

Update 26 November 2008

At some point I got the following error message:

*** Project description file hasn't been set
error: hooks/update exited with error code 1
error: hook declined to update refs/heads/master
To /media/WD USB 2/Backup/gibak
! [remote rejected] master -> master (hook declined)
error: failed to push some refs to '/media/WD USB 2/Backup/gibak'

This is fixed by copying the ~/.git/description file to the remote Git repository.

not working

Hey, I tried your ebuild. The installation went without any problem. However, the file "git-init" is not included in my PATH by default. Maybe you could tweak that for a future version! I solved it by just adding

/usr/libexec/git-core

to the PATH.

Indeed, since Git 1.6 the git-foo commands are deprecated in favor of git foo.

You can also use the ebuild+patch at Gentoo Bugzilla, which should take care of this.

http://bugs.gentoo.org/show_bug.cgi?id=220833