Re: [Hampshire] [OT] MTBF

Author: James Courtier-Dutton
Date:  
To: Hampshire LUG Discussion List
Subject: Re: [Hampshire] [OT] MTBF
2009/7/20 Stephen Rowles <stephen@???>:
> James Courtier-Dutton wrote:
>> I think people don't seem to realize that HDs have very low resistance
>> to shock while switched on, and this is the main cause of HD failures.
>> Most (all?) modern HDDs have a whole raft of sensors and store lifetime
>> information about read errors, temperature range, etc. This is SMART
>> (you might see it on the BIOS screen). In Linux you can query it using
>> smartctl:
>
> ~]# smartctl --all /dev/sda
>
> For example the stats from my current drive here at work:
>
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   108   093   006    Pre-fail  Always       -       16203744
>   3 Spin_Up_Time            0x0003   098   095   070    Pre-fail  Always       -       0
>   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       69
>   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000f   081   060   030    Pre-fail  Always       -       139114079
>   9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       11202
>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       99
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
> 189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
> 190 Airflow_Temperature_Cel 0x0022   061   057   045    Old_age   Always       -       39 (Lifetime Min/Max 21/43)
> 194 Temperature_Celsius     0x0022   039   043   000    Old_age   Always       -       39 (0 19 0 0)
> 195 Hardware_ECC_Recovered  0x001a   064   060   000    Old_age   Always       -       164354431
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
> 202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0
>
> You can see all sorts of interesting things here, which can easily be
> used to warn of the pending failure of a drive.
>
> Also most laptop drives now have accelerometers which will detect any
> dangerous shock conditions and park the drive heads to prevent further
> damage to the drive. I cannot find it now but I watched a video on the


None of the SMART parameters above give any indication of what the
accelerometers saw, so one has no way of telling whether shock was a
contributing factor to an HD failure.
It would be nice to see SMART stats saying how much shock the drive
took before it managed to park the heads.

Another thing: for the error-rate SMART attributes like:
Raw_Read_Error_Rate 16203744
Seek_Error_Rate 139114079
Hardware_ECC_Recovered 164354431

What is an acceptable value, and what indicates that things are
starting to go wrong?
My laptop HD has these values at zero!
On my desktop, they keep increasing over time. So, what is an
acceptable "rate"?
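
As far as I understand it, the raw numbers are vendor-specific, and the
figure the drive is actually judged on is the normalized VALUE column
compared against THRESH: an attribute counts as failed once VALUE drops
to THRESH or below. Here is a minimal Python sketch of that comparison,
assuming the attribute-table layout that "smartctl -A" prints with one
attribute per line. The names parse_attributes, margin and SAMPLE are
made up for illustration, and the sample rows are copied from the
output quoted above.

```python
# Sketch: compare normalized SMART values against their thresholds.
# Assumes one attribute per line, in the "smartctl -A" column order:
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE

def parse_attributes(text):
    """Return a list of dicts, one per SMART attribute row found in text."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # An attribute row starts with a numeric ID and has at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "type": fields[6],
            # The raw value may contain spaces, e.g. "39 (Lifetime Min/Max 21/43)".
            "raw": " ".join(fields[9:]),
        })
    return rows


def margin(row):
    """Headroom between normalized VALUE and THRESH (0 or less means failed)."""
    return row["value"] - row["thresh"]


# Three rows copied from the quoted smartctl output (illustrative sample).
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000f   108   093   006    Pre-fail  Always       -       16203744
  7 Seek_Error_Rate         0x000f   081   060   030    Pre-fail  Always       -       139114079
195 Hardware_ECC_Recovered  0x001a   064   060   000    Old_age   Always       -       164354431
"""

if __name__ == "__main__":
    for row in parse_attributes(SAMPLE):
        print(f'{row["name"]:<24} value={row["value"]:3d} '
              f'thresh={row["thresh"]:3d} margin={margin(row):3d}')
```

For the three rows above this gives margins of 102, 51 and 64, so by
the drive's own thresholds none of them is close to failing, however
large the raw values look.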

James