Re: [Hampshire] Re: Backup solutions for Ubuntu

Author: Andy Smith
Date:  
To: hampshire
Subject: Re: [Hampshire] Re: Backup solutions for Ubuntu

Hi Alan,

On Mon, Feb 25, 2008 at 07:43:23AM +0000, alan c wrote:
> Andy Smith wrote:
> >- even given two files of identical content, rsync will not consider
> > them identical if metadata (times, ownership, permissions, etc.)
> > differs. You do find yourself collecting many copies of such
> > files.
>
> I think rsync has a 'size' option which uses size only to determine state.

That only works in very specific cases though.
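
(For reference, I assume you mean the --size-only flag; the host and
paths below are invented:

    rsync -a --size-only /home/ backuphost:/srv/backups/host1/home/

which skips any file whose size matches the copy on the other end,
even if its timestamp differs.)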

Say you are trying to back up 20 machines, each of which has an
average of 50,000 files, and to keep 10 historical backups spanning
various time periods. That is 10 million inodes - a typical use
case for rsnapshot/dirvish/etc.
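
To make that concrete, a skeletal rsnapshot.conf for that kind of
setup might look like this (hosts and paths invented; note that
rsnapshot requires the fields to be tab-separated, and newer
versions spell "interval" as "retain"):

    snapshot_root   /srv/backups/
    interval        daily   10
    backup          root@host1:/    host1/
    backup          root@host2:/    host2/
    # ... and so on up to host20

Each of the 10 retained snapshots hardlinks unchanged files against
the previous one, but only at the same path on the same machine.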

Firstly, these tools only consider files at the exact same path -
all the dupes that live elsewhere within each machine, and across
machines, will still be transferred and stored separately.

Next, you probably can't assume that a file is identical just
because its size and path are the same.
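
The only way to be sure is to compare content, which rsync can do
with -c/--checksum:

    rsync -a --checksum source/ backuphost:dest/

but then it has to read every file in full on both ends, which
rather defeats the point of skipping the comparison in the first
place.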

Thirdly, even if you *did* decide they were identical, you still
have to script up a way to save the metadata of the two files
separately so that you can restore them properly should you ever
need to. You still end up with millions of files that take rsync
ages to check through - or even make it crash because it needs so
much memory.
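
On the metadata point, the sort of thing you would have to script
is a dump you could replay at restore time - roughly this, with GNU
find (path invented):

    find /srv/backups/host1/daily.0 \
        -printf '%m %u %g %T@ %p\n' > host1-daily.0.meta

i.e. permissions, owner, group and mtime recorded per path,
separately from the file content. None of which helps with the
storage problem itself.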

This can be partially mitigated by running a tool called "freedups"
or "hardlink" over the whole backup tree (which contains all the
different machines and their historical backups). It checks file
content for duplicates and will find and hardlink together the
identical files that are on different machines.
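
Invocation is simple enough - something like this (I believe -n is
a dry run in the version I have, but the flags vary between
versions, so check your man page first):

    hardlink -n /srv/backups    # report what would be linked
    hardlink /srv/backups       # actually merge the duplicates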

It has the same performance problem as anything that must walk a
directory tree containing millions of inodes, though - even my home
setup now can't run hardlink across the whole tree in less time
than the interval between rsnapshot runs. This is not running on
poor hardware either; it's got six 7200RPM SATA disks in RAID-10
and 2GB of RAM.

The point is that BackupPC, in the above scenario, would store each
file's content once; every copy of it anywhere, under any name, on
any of the machines, would just be logically pointed at that single
copy, with the metadata kept in a database somewhere else for
restore time.
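
Here's a toy shell sketch of that pooling idea - nothing to do with
BackupPC's actual on-disk format, just the principle:

    # Store each unique file content once, keyed by its hash.
    sum=$(md5sum < "$f" | awk '{print $1}')
    pool="/srv/pool/$sum"
    [ -e "$pool" ] || cp "$f" "$pool"       # first copy: store it
    ln -f "$pool" "/srv/backups/host1$f"    # later copies: just link
    # owner/perms/times for each path go in a separate database

So a file that exists on all 20 machines costs you one lot of disk
space, not 20 (or 200, once you count the historical copies).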

Cheers,
Andy