
disk read errors



Hi guys,

 

I just checked in on the web gui for my server yesterday and noticed that disk1 was showing 128 in the errors column.

 

Looking in the syslog I saw that a few days before it reported a load of disk read errors. I ran smartctl -a on the drive and it didn't show anything obvious to me, reallocated and pending sector counts are both 0.

 

So I decided to kick off a parity check overnight. That ran and reported no errors, but now the error count column shows 256 for disk1 and 128 for disk4. Again, smartctl for both drives doesn't show anything, but disk4 is showing similar messages in the syslog.

 

Is there anything to worry about? Are the disks about to start failing?

 

syslog - http://pastebin.com/LfyFZEhE

 

smartctl for disk1 - http://pastebin.com/xzWZbtG8

 

smartctl for disk4 - http://pastebin.com/FcNxhbWv

 

Screenshot of web gui

va0nIyvl.png

 

The hardware is my newly built server: a Xeon 1240v3, Supermicro X10SL7, 24GB RAM. All drives are connected to the motherboard's onboard LSI 2308. The 2 drives in question are both WD 3TB Greens (the original 2 drives that I built my first unRAID server with).

 

Any thoughts, help, suggestions?

 

Peter

 

Update:

 

smartctl for disk1 after long test http://pastebin.com/HhHDEwA0

 

smartctl for disk4 after long test http://pastebin.com/NSkfcTUh

 

Both still show no errors that I can see

Link to comment

Interesting.

 

Looks like disk1 had 2 separate series of read errors, each one affecting a series of consecutive sectors, and disk4 had one such event.

 

For each of these, unRAID would have (or should have) reconstructed the sector from parity and written it back to the disk, which should have forced a reallocation. Yet the SMART reports are clean.

 

Read errors are pretty rare here, so we don't have a lot of data. I do not believe that a cabling issue would result in a condition like this. It is too precise (128/256). (BTW, 128 consecutive sectors is 1/2 Meg.) Cabling issues tend to generate CRC errors, not read errors. Not that it is impossible that cabling is the problem, just unlikely IMO. But having two drives suddenly exhibit this same behavior seems very unlikely too, which points to some common cause. But what?

 

The fact that 2 disks are involved leads me to ask if they are coming off of a common splitter. Power seems like it could be a common cause and we have seen bad splitters create havoc. I believe Joe L. had this happen. And it does take good power for the drive to operate.

 

The second thing I'd check is if they are coming off the same controller. You reported they are attached to the MB, but sometimes the motherboard ports are provided by different controllers. I'd check if these two drives are on a different controller than the others.

 

If no common cause is found, I agree with Dale that running disk diagnostics is the next step: a short and a long SMART test, with a report after each. If they are clean, I believe WD has some disk checking tools (Lifeguard?).

 

Seems very strange we are not seeing SMART errors if it's the drives. But you never know. The firmware is not infallible. It could easily contain bugs.

 

Do some checking and report back.

Link to comment

Ok the long tests are running on both drives - it says they'll be finished later this evening. I'll post the results then.

 

The drives are both on the same controller - it's an LSI 2308 8 port controller that is built in to the motherboard. All of the drives are connected to that controller. No drives are connected to the "standard"/non-LSI ports.

 

Power is coming through splitters; there are 2 splitters powering 4 drives each. Offhand I can't remember which drives are connected to which splitter. Those 2 drives could well be on the same splitter, and I have a feeling that they are. I'll check when I get the chance to power the server down and pull the drives out.

 

For more details about the system, and the splitters/cables I used, see my build thread http://lime-technology.com/forum/index.php?topic=32508.0

 

To be fair I've never been entirely happy with the splitters I used. Will have to start having a look for some others just in case.

Link to comment

BTW, 128 consecutive sectors is 1/2 Meg.

 

Are you certain that column reports physical sectors? My understanding is that Linux uses LBA48 to address logical sectors (hdparm reports them this way), in which case this would be 64KB.

 

Regardless of the actual amount of data, it's certainly interesting that the errors seem to come in blocks of 128.
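The two figures in this exchange can be reconciled with a little arithmetic. The sketch below assumes the error column counts sectors at all (nothing in the thread confirms that) and compares the two sector sizes being discussed: 512-byte logical sectors and 4096-byte physical sectors. The function name is made up for illustration.

```python
# Size of a run of consecutive sectors under two sector-size assumptions.
# Which one (if either) the error column reflects is not confirmed in the thread.
SECTOR_SIZES = {"512-byte logical": 512, "4096-byte physical": 4096}


def run_size_kib(sector_count: int, sector_bytes: int) -> float:
    """Return the size in KiB of `sector_count` consecutive sectors."""
    return sector_count * sector_bytes / 1024


for label, size in SECTOR_SIZES.items():
    for count in (128, 256):
        print(f"{count} x {label} sectors = {run_size_kib(count, size):.0f} KiB")
```

With 512-byte sectors, 128 sectors come to 64 KiB; with 4096-byte sectors they come to 512 KiB, which is the "1/2 Meg" figure.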

 

 

Link to comment

This could be caused by a variety of things. It seems likely that a block read is failing and generating the errors, but the block then works fine once corrected, so the disk doesn't require any sector reallocation. This could be caused by a power glitch (changing your splitter may help); a data glitch in the cabling (any chance the cables to the two disks that had this issue are wrapped or tied together?); or a "hiccup" in your memory subsystem, although the latter should be automatically corrected by the ECC. Nevertheless, I'd reduce your memory to 16GB so you only have 2 modules installed and see if that helps, since the signaling is much cleaner with only 2 modules.

 

 

Link to comment

Actually your server is just trying to give you something to tinker with in reaction to your comment:

Kind of disappointed that I've run out of "tinkering" to do  ;D

 

:) :)

That'll teach me to keep my mouth shut  :)

 

None of the cables are bundled together, some of them run quite close to each other, but they're not tied together.

 

Just checked the IPMI event log to see if there were any ECC related errors reported - that's empty. I would have thought that memory hiccups would present themselves in more ways, and more often than the errors I'm seeing?

 

Is there any way, given the sector numbers reported with the errors, to find out what file(s) are occupying those sectors? If I could find that out, perhaps I could do a read of the file under normal usage conditions (i.e. not a parity check) and see if further errors crop up.
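Part of that question is plain arithmetic, sketched below under stated assumptions: that the sector numbers in the syslog are whole-disk 512-byte logical sectors, and that the filesystem uses 4096-byte blocks. The partition start of 64 is a placeholder; the real value has to be read from the partition table. The function name is made up for illustration. Mapping the resulting block to a file is filesystem-specific and not shown: for ext2/3/4, debugfs has `icheck` and `ncheck` commands for this, but this sketch does not assume an equivalent exists for the filesystem on these disks.

```python
def sector_to_fs_block(
    disk_sector: int,
    partition_start_sector: int,
    sector_bytes: int = 512,      # assumption: syslog reports 512-byte logical sectors
    fs_block_bytes: int = 4096,   # assumption: filesystem block size
) -> int:
    """Convert a whole-disk sector number to a block number inside the partition."""
    if disk_sector < partition_start_sector:
        raise ValueError("sector lies before the start of the partition")
    byte_offset = (disk_sector - partition_start_sector) * sector_bytes
    return byte_offset // fs_block_bytes


# Placeholder numbers, not taken from the syslog in this thread.
print(sector_to_fs_block(disk_sector=1_000_064, partition_start_sector=64))
```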

Link to comment

Is there any way, given the sector numbers reported with the errors, to find out what file(s) are occupying those sectors? If I could find that out, perhaps I could do a read of the file under normal usage conditions (i.e. not a parity check) and see if further errors crop up.

 

I'm not aware of any Linux tools that will do this (but I'm decidedly NOT a "Linux guy").

 

You've got a lot of data on disk1, so this would take a long time, but one test you could do is copy all of that data to another system. If you don't have the spare space, copy 100-200GB at a time and then delete it from the destination. While it copies, WATCH the GUI to see whether the error column changes AND whether all the other disks spin up (a sign that unRAID had to do a data rebuild). You don't need to watch continuously, of course, but you do need to check often enough that you'd see a spinup, i.e. more frequently than your spindown time. I'd change that to 2-3 hours so you don't need to check too often.

 

In fact, if you break the copies into "chunks", you could isolate the specific file(s) involved by binary searching the chunk that causes the errors (assuming they still occur).
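The binary search over a chunk can be sketched as follows. This is only an illustration and assumes the errors reproduce every time the affected file is read and that a single file is responsible. `triggers_errors` is a hypothetical callback: in practice it would copy the given files and report whether the error column in the GUI went up. The file names in the demo are made up.

```python
from typing import Callable, Optional, Sequence


def find_failing_file(
    files: Sequence[str],
    triggers_errors: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Return one file whose read triggers errors, or None if the set reads clean."""
    if not files or not triggers_errors(files):
        return None
    candidates = list(files)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        # If the first half reproduces the errors, keep it;
        # otherwise the culprit must be in the remaining files.
        candidates = first_half if triggers_errors(first_half) else candidates[mid:]
    return candidates[0]


# Demo with a stand-in checker instead of a real copy-and-watch step.
bad = "movie_07.mkv"
files = [f"movie_{i:02d}.mkv" for i in range(16)]
print(find_failing_file(files, lambda subset: bad in subset))
```

Each round halves the number of files to re-copy, so a chunk of 16 files needs about four extra copy passes rather than sixteen.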

 

Link to comment
  • 2 weeks later...

Bit of an update, there haven't been any further errors showing so far.

 

I had the chance to open the server up and as suspected the 2 WD 3TB drives were on the same power splitter.

 

I've rearranged them so they are now on different ones, which also involved connecting one of them to a different SATA port.

 

I also noticed that in a couple of places the SATA cables (including to those drives) ran very close to the power cables, so I rerouted them to put a bit of distance between them.

 

I'll keep an eye on things and see if the errors reoccur. I also have some new power splitters to swap in when the opportunity presents itself.

Link to comment

Archived

This topic is now archived and is closed to further replies.
