[Hampshire] Server crashing during backup

Author: Daniel Pope
Date:
To: Hampshire LUG Discussion List
Subject: [Hampshire] Server crashing during backup

My Dad's office server has apparently been freezing during backup,
requiring a restart. I've now managed to get a complete backup out of
it, which is a relief, but I have not isolated the problem and therefore
I thought I'd mention the issue and explain my theory on the subject, to
see if anyone has any other theories or suggestions for diagnostics.

The system is a Novatech P4 server, and it backs up (initiated manually)
using rsync to a USB2 hard disk.

The first call I got was saying that they couldn't back up. dmesg
reported all manner of errors, which seems to happen on occasion with
USB drives - all that I've tried, at least. ReiserFS would appear
to be a bad choice for a removable drive, bombing badly if the link
drops while the FS is mounted. Anyway, there had been a kernel panic
from somewhere in Reiser code and there were some dead processes
preventing me umounting/remounting the drive, so I rebooted the machine.

It was running a custom 2.6.9 kernel and I wondered if 2.6.18 would
contain updated Reiser code with better error recovery. It couldn't hurt
to try. So I installed a prepackaged build and got that up and running.
(Incidentally, Debian's 2.6.18 kernel package dependencies are wrong. I
needed to manually upgrade initramfs-tools and rebuild the initrd before
it would boot.)

The backup drive also wouldn't mount, so I just reformatted it and
started a new backup. All appeared to be working. But a few days later
I got a call from them saying the server was freezing every time they
started a backup.

There was nothing useful in the logs. There was a logger entry saying
the backup had started and then the next entry was syslogd coming back
up. I did notice that the RAID had resynced too, which is worrying when
your backup isn't working. I fscked the backup and it was indeed
frakked, but that might have been an effect rather than a cause.
reiserfsck --rebuild-tree fixed it but did have to be run twice.

I did get some console messages of the form "CPU over temperature:
throttling" as I fscked. So my guess is overheating? I don't know what
the threshold for throttling is or how bad it is to receive that error
message. Something that I do find strange is that even though the kernel
knows the CPU is running hot, I can't seem to find out how hot unless I
go through all the hassle of installing and configuring lm-sensors.
There's nothing in /proc or /sys afaict.
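
A rough way to check, as a sketch only: the paths below are the usual
places ACPI and hwmon drivers expose readings, but which of them exist
depends on the kernel version and on which modules are loaded, so treat
the list as an assumption. show_temps and its optional root argument are
names made up for illustration.

```shell
#!/bin/sh
# Print every temperature reading the kernel exposes without lm-sensors.
# The optional first argument is a hypothetical root prefix, only there
# so the function can be pointed at a test tree instead of the real /.
show_temps() {
    root="${1:-}"
    found=1
    for f in "$root"/proc/acpi/thermal_zone/*/temperature \
             "$root"/sys/class/thermal/thermal_zone*/temp \
             "$root"/sys/class/hwmon/hwmon*/temp*_input \
             "$root"/sys/class/hwmon/hwmon*/device/temp*_input; do
        # An unmatched glob stays literal and simply fails this test.
        [ -r "$f" ] || continue
        printf '%s: %s\n' "$f" "$(cat "$f")"
        found=0
    done
    # Exit status 0 if at least one reading was printed, 1 otherwise.
    return "$found"
}

show_temps || echo "no temperature files found"
```

If this prints nothing but the fallback message, the kernel is not
exporting a reading through any of these interfaces and lm-sensors is
probably the only route.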

My working theory is that swapping from the old kernel to the stock new
one enabled hyperthreading. I think the old one was compiled for
uniprocessor because, well, it's nominally a uniprocessor box. The new
one is compiled for SMP.
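
This theory ought to be checkable from /proc/cpuinfo. A minimal sketch,
assuming an x86 kernel that reports the "processor" and "siblings"
fields; logical_cpus and siblings are made-up helper names, and the
optional file argument is only there for testing.

```shell
#!/bin/sh
# Count the logical CPUs the running kernel has brought up.
logical_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# "siblings" (x86) is the number of logical CPUs per physical package.
# Only the first entry is printed; the field may be absent on some kernels.
siblings() {
    sed -n 's/^siblings[[:space:]]*:[[:space:]]*//p' "${1:-/proc/cpuinfo}" | head -n 1
}

echo "logical CPUs: $(logical_cpus)"
echo "siblings per package: $(siblings)"
```

On a single-socket P4, two logical CPUs with a siblings value of 2 would
suggest HT is active; one logical CPU would suggest it is not.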

Hyperthreading would presumably keep the execution units busier, since
two threads share them, thus making the CPU run much hotter? Maybe
Novatech didn't bother to bench test these boxes with HT enabled.

I've now managed to get a backup by using the -W (--whole-file) flag to
rsync, which copies files whole rather than computing deltas. It should
have been used in the first place tbh, but I was lazy and didn't bother
to look it up.
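
For reference, a sketch of the sort of invocation meant here. The
function name backup is made up for illustration, and -a is an
assumption about the other options in use; only -W comes from what was
actually changed.

```shell
#!/bin/sh
# Usage: backup SRC/ DEST/
backup() {
    # -a: archive mode (recurse, keep permissions, times, owners, symlinks)
    # -W: --whole-file, copy files whole instead of using rsync's
    #     delta-transfer algorithm
    rsync -a -W "$1" "$2"
}
```

It would be called as something like "backup /home/ /mnt/backup/home/",
where both paths are hypothetical.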

Presumably there's some boot prompt option to disable HT?

Dan
