Re: [Hampshire] Server crashing during backup

Author: James Courtier-Dutton
Date:  
To: Hampshire LUG Discussion List
Subject: Re: [Hampshire] Server crashing during backup
On 26/11/06, Daniel Pope <mauve@???> wrote:
> My Dad's office server has apparently been freezing during backup,
> requiring a restart. I've now managed to get a complete backup out of
> it, which is a relief, but I have not isolated the problem and therefore
> I thought I'd mention the issue and explain my theory on the subject, to
> see if anyone has any other theories or suggestions for diagnostics.
>
> The system is a Novatech P4 server, and it backs up (initiated manually)
> using rsync to a USB2 hard disk.
>
> The first call I got was saying that they couldn't back up. dmesg
> reported all manner of errors, which seems to happen on occasion with
> USB drives - all that I've tried, at least. ReiserFS would appear
> to be a bad choice for a removable drive, bombing badly if the link
> drops while the FS is mounted. Anyway, there had been a kernel panic
> from somewhere in Reiser code and there were some dead processes
> preventing me umounting/remounting the drive, so I rebooted the machine.
>
> It was running a custom 2.6.9 kernel and I wondered if 2.6.18 would
> contain updated Reiser code with better error recovery. It couldn't hurt
> to try. So I installed a prepackaged build and got that up and running
> (Incidentally, Debian's 2.6.18 kernel package dependencies are wrong. I
> needed to manually upgrade initramfs-tools and rebuild the initrd before
> it would boot.)
>
> The backup also wouldn't mount so I just formatted it and started a new
> backup. All appeared to be working. But a few days later I got a call
> from them saying the server was freezing every time they started a backup.
>
> There was nothing useful in the logs. There was a logger entry saying
> the backup had started and then the next entry was syslogd coming back
> up. I did notice that the RAID had resynced too, which is worrying when
> your backup isn't working. I fscked the backup and it was indeed
> frakked, but that might have been an effect rather than a cause.
> reiserfsck --rebuild-tree fixed it but did have to be run twice.
>
> I did get some console messages of the form "CPU over temperature:
> throttling" as I fscked. So my guess is overheating? I don't know what
> the threshold for throttling is or how bad it is to receive that error
> message. Something that I do find strange is that even though the kernel
> knows the CPU is running hot, I can't seem to find out how hot unless I
> go through all the hassle of installing and configuring lm-sensors.
> There's nothing in /proc or /sys afaict.
>
> My working theory is that swapping from the old kernel to the stock new
> one enabled hyperthreading. I think the old one was compiled for
> uniprocessor because, well, it's nominally a uniprocessor box. The new
> one is compiled for SMP.
>
> Hyperthreading would presumably make more use of the execution units
> plus a whole other pipeline thus making the CPU run much hotter? Maybe
> Novatech didn't bother to bench test these boxes with HT enabled.
>
> I've now managed to get a backup by using the -W flag to rsync, which
> should have been used in the first place tbh, but I was lazy and didn't
> bother to look it up.
>
> Presumably there's some boot prompt option to disable HT?
>
> Dan
>


I would really avoid reiserfs if you can. Use ext3 instead. ext3 is
much better at error recovery than reiserfs.
It is also a big no-no to use reiserfs on anything other than a hard
disk. It is not designed for anything else, and it works really badly
on a USB memory stick.
With regard to HT, there is normally a BIOS option to disable it.
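
On the HT question, before hunting through the BIOS it may help to confirm
whether the running kernel actually sees hyperthreading. The sketch below
is only a rough check and rests on assumptions: it compares the "siblings"
and "cpu cores" fields of /proc/cpuinfo for the first processor entry, and
treats a missing field as 1, since older or uniprocessor kernels may not
report them at all. The function name is made up for illustration.

```python
def hyperthreading_active(cpuinfo_text):
    """Guess whether HT is in use from the text of /proc/cpuinfo.

    Assumption: HT is active when one physical package exposes more
    logical CPUs ("siblings") than physical cores ("cpu cores").
    Fields that are missing are treated as 1.
    """
    fields = {}
    for line in cpuinfo_text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            # Keep only the first processor's values.
            fields.setdefault(key.strip(), value.strip())
    siblings = int(fields.get("siblings", 1))
    cores = int(fields.get("cpu cores", 1))
    return siblings > cores
```

On the server itself one would pass it open("/proc/cpuinfo").read(). A
False result on a kernel that does not print these fields proves nothing,
so treat it as a hint only.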

I would normally use LVM on the hard disk and use the snapshot mode of
LVM to take a snapshot, then back up from the snapshot.
Basically one would do the following:
1) Send all processes the signal to complete their current
transactions and then halt.
2) Do an LVM snapshot.
3) Send all processes the signal to continue.
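
As a rough illustration of the steps above, here is a small sketch of how
the sequence could be scripted. Everything specific in it is an
assumption, not something from this server: the volume group and volume
names, the snapshot size, the mount point and the destination are all
placeholders, and quiesce/resume are hypothetical hooks standing in for
whatever tells your databases and samba to pause and continue. It
defaults to a dry run that only records the commands.

```python
import subprocess


def run(cmd, dry_run, log):
    """Record a command and, unless dry_run is set, execute it."""
    log.append(cmd)
    if not dry_run:
        subprocess.run(cmd, check=True)


def snapshot_backup(vg, lv, dest, snap_name="backup_snap", snap_size="1G",
                    mount_point="/mnt/backup_snap",
                    quiesce=lambda: None, resume=lambda: None,
                    dry_run=True):
    """Snapshot /dev/<vg>/<lv>, rsync the snapshot to dest, clean up."""
    origin = f"/dev/{vg}/{lv}"
    snap = f"/dev/{vg}/{snap_name}"
    log = []

    # Steps 1-3: pause writers, take the snapshot, let writers continue.
    quiesce()
    try:
        run(["lvcreate", "--snapshot", "--size", snap_size,
             "--name", snap_name, origin], dry_run, log)
    finally:
        resume()

    # Back up from the snapshot, which no longer changes, and always
    # unmount and remove it afterwards, even if rsync fails.
    try:
        run(["mount", "-o", "ro", snap, mount_point], dry_run, log)
        try:
            run(["rsync", "-a", "-W", mount_point + "/", dest],
                dry_run, log)
        finally:
            run(["umount", mount_point], dry_run, log)
    finally:
        run(["lvremove", "-f", snap], dry_run, log)
    return log


if __name__ == "__main__":
    # Dry run: print the commands that would be executed.
    for cmd in snapshot_backup("vg0", "data", "/mnt/usb/backup/"):
        print(" ".join(cmd))
```

The snapshot only needs enough space to hold whatever changes on the
original volume while the backup runs, so the 1G here is a guess that
would need adjusting for a busy server.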

Alternatively, do the backup when all the system activity is off.
I.e. databases halted and samba stopped.

Essentially, one should only use rsync on a file tree that is not changing.
Doing the snapshot is one way to achieve that.