More problems with Supermicro server - three failed drives, one parity one data,



So just now, with about 7 hrs left to complete the parity check, my second parity drive and one of my 8TB data drives dropped out of the array with red X's. The parity check cancelled itself. I attached diags before I stopped the array, and I am now trying to reboot the server, although the console is stuck at 'Unmounting remote filesystems:'.

 

Not sure what is going on......

tower-diagnostics-20170115-1341.zip

Link to comment

I've stopped the array, and next to the parity disk that has the red X it now says that all data on this disk will be erased when the array is started.

 

I don't actually think these disks are bad, but that something has forced them to drop out of the array.

 

I'm guessing that if I start the array it's going to rebuild parity and fix both the 8TB data disk with a red X and the second parity disk with a red X?

Link to comment

OK, I am in my BIOS now and I have a couple of questions.

 

Should I have EFI optimized boot enabled? It's currently disabled.

Should I have processor C3 enabled? It's currently disabled.

Should processor C6 be enabled? It's currently enabled.

Hyper threading is enabled

Core Multi-processing is enabled

Execute Disabled Bit is enabled

VT for Direct I/O is enabled, I assume this is what you want me to disable

Hardware prefetcher is enabled

Adjacent Cache Line Prefetch is enabled

Direct Cache Access (DCA) is enabled

 

In the PCI configuration of the BIOS there are two settings:

 

Maximize memory below 4GB, currently disabled

Maximize Mapped I/O above 4GB currently disabled (should this be enabled?)

 

Thanks

Link to comment

It was your two Seagate VX disks that dropped off line - Disk 24 and Parity 2. Does that tally with what you saw?

 

ST4000DM000-1F2168_Z303SBXG has 4794 UDMA errors so check cable or seating into backplane. Other SMART reports are ok.

 

I see a lot of messages like this in your syslog:

 

Jan 14 00:25:27 Tower kernel: sas: ata30: end_device-1:1:35: dev error handler

Jan 14 00:25:27 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1

 

I don't recognise them, but then I've never used a SAS expander. However "failed:1" doesn't look good.

 

Also SAS broadcasts:

 

Jan 14 00:36:48 Tower kernel: sas: broadcast received: 0

 

that are also outside of my experience.

 

Immediately after array start and loading of Ubooquity I see this:

 

Jan 14 00:25:22 Tower emhttp: Start failed: PID created but no process exists

Jan 14 00:25:25 Tower root: Fix Common Problems Version 2016.12.16

Jan 14 00:25:27 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1

 

which doesn't bode well.

 

There's an awful lot going on in your syslog and as well as having a lot of hardware you have a complex configuration too. My approach would be to strip it right back to a basic NAS, and get that stable first. Then add back the bells and whistles.

 

Link to comment

After your reboot, with two disks disabled, your syslog still has a lot of this:

 

Jan 15 13:54:28 Tower vsftpd[10523]: connect from 127.0.0.1 (127.0.0.1)

Jan 15 13:54:33 Tower emhttp: Start failed: PID created but no process exists

Jan 15 13:54:36 Tower root: Fix Common Problems Version 2016.12.16

Jan 15 13:54:36 Tower vsftpd[10584]: connect from 127.0.0.1 (127.0.0.1)

Jan 15 13:54:37 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1

Jan 15 13:54:37 Tower kernel: sas: ata5: end_device-1:0:20: cmd error handler

Jan 15 13:54:37 Tower kernel: sas: ata1: end_device-1:0:13: dev error handler

 

immediately after array start, which I find troubling but may be expected.
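To see whether the errors cluster on particular devices, the handler lines can be tallied per port. This is only a rough sketch: the regular expression is written against the message format quoted above, and the sample list simply reuses those three lines. On a live server you would feed it the lines of the syslog file instead (assumed here to be /var/log/syslog).

```python
import re
from collections import Counter

# Matches lines such as:
#   "... kernel: sas: ata5: end_device-1:0:20: cmd error handler"
HANDLER_RE = re.compile(
    r"kernel: sas: (ata\d+): (end_device-[\d:]+): (cmd|dev) error handler"
)

def tally_sas_errors(lines):
    """Count error-handler messages per (ata port, end device) pair."""
    counts = Counter()
    for line in lines:
        match = HANDLER_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts

# Sample lines copied from the syslog excerpt in this post.
sample = [
    "Jan 15 13:54:37 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1",
    "Jan 15 13:54:37 Tower kernel: sas: ata5: end_device-1:0:20: cmd error handler",
    "Jan 15 13:54:37 Tower kernel: sas: ata1: end_device-1:0:13: dev error handler",
]

for (port, device), count in tally_sas_errors(sample).most_common():
    print(port, device, count)
```

If the same few end devices dominate the tally, that would point at their cable or backplane position rather than at the controller as a whole.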

 

Link to comment

I've now got file system errors on my console that I have no idea what to do with, and I think the parity check was cancelled. I've got problems with more drives now; I don't know if this is coincidence or bad luck. I cancelled the parity check because it looked like it was still going although there was no drive light activity on my drives. The array is still mounted. Diags attached.

 

tower-diagnostics-20170115-2012.zip

Link to comment

Three more disks dropped off-line but the others look OK and the one ending SBXG hasn't accumulated any more UDMA errors.

 

I see the same SAS messages as before and lots and lots of this:

 

Jan 15 18:02:43 Tower vsftpd[5373]: connect from 192.168.111.121 (192.168.111.121)

Jan 15 18:02:43 Tower vsftpd[5373]: [ashman] OK LOGIN: Client "192.168.111.121"

Jan 15 18:02:44 Tower vsftpd[5377]: connect from 192.168.111.120 (192.168.111.120)

Jan 15 18:02:44 Tower vsftpd[5377]: [ashman] OK LOGIN: Client "192.168.111.120"

 

Do you have any idea what that is all about? I doubt that it has much to do with your problem (though it might be a symptom, rather than a cause). As I said, I'd stop dockers and anything else that's trying to access the array.

 

I think the disks are fine in themselves. You may well have file system corruption and you now have too many disabled disks to be able to rebuild. Ultimately you'll probably want to do a New Config and rebuild both parities but before you do that you need to investigate the SAS problems. It looks to me as though it's something between the controllers and the disks themselves - cables/SAS expander/back-plane, maybe. Since I'm seeing messages I haven't seen before, I suspect the expander, though that's purely a guess.

 

If it was my server I'd have to break the problem down into simpler pieces and tackle each piece at a time. I'd put the pile of disks to one side and pick up a spare and do some testing, but someone else might have a better suggestion.

 

Link to comment

I did some testing with this setup before I swapped my disks. When I did that I had the HBA in the only x16 slot on this MB, but johnnie.black found that my board only runs the x16 slot at x4 electrically, so I put the card in an x8 slot, where it has been since yesterday. Up until then it was running fine, although I hadn't run a parity check. I had started to run a parity check in the x16 slot, but it was running so slowly that unRAID said it was going to take a week. What he found about my board is at the end of this post. I am just wondering if changing slots had anything to do with it.

 

What do you suggest should be my next course of action? I don't want to do a new config and rebuild parity if disks are going to drop off the array again.

 

Intel® Server Board S5500HCV: Five expansion slots

o One PCI Express* Gen 2 slot (x16 Mechanically, x4 Electrically)

o Two PCI Express* Gen 2 x8 slots

o One PCI Express* Gen 1 slot (x8 Mechanically, x4 Electrically) shared with SAS Module slot. This PCI Express* Gen 1 slot is not available when the SAS module slot is in use and vice versa
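
For a rough idea of what those slot widths mean for a parity check, here is a back-of-the-envelope sketch. It uses the standard theoretical per-lane figures (250 MB/s for Gen 1 and 500 MB/s for Gen 2, per direction, after 8b/10b encoding). The disk count of 24 is a hypothetical placeholder, and it assumes every disk is read at once through a single HBA; real throughput will be lower than these ceilings.

```python
# Theoretical usable throughput per PCIe lane, in MB/s per direction,
# after 8b/10b encoding overhead.
PER_LANE_MB_S = {1: 250, 2: 500}

def slot_bandwidth_mb_s(gen, lanes):
    """Theoretical ceiling for a slot of the given generation and electrical width."""
    return PER_LANE_MB_S[gen] * lanes

def per_disk_share_mb_s(gen, lanes, disks):
    """Even share of the slot ceiling when all disks are read simultaneously."""
    return slot_bandwidth_mb_s(gen, lanes) / disks

DISKS = 24  # hypothetical number of disks behind the one HBA

slots = [
    ("Gen 2, x4 electrical (the x16 slot)", 2, 4),
    ("Gen 2, x8", 2, 8),
    ("Gen 1, x4 electrical", 1, 4),
]

for label, gen, lanes in slots:
    total = slot_bandwidth_mb_s(gen, lanes)
    share = per_disk_share_mb_s(gen, lanes, DISKS)
    print(f"{label}: {total} MB/s total, about {share:.0f} MB/s per disk")
```

These are upper bounds only, so a parity check that is estimated at a week is much slower than the slot width alone would explain.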

Link to comment

I'd wait and see if anyone who has experience of the SAS Expander you're using can make any suggestions. I've used SAS HBAs and simple port multipliers but nothing more sophisticated than that. I'm afraid I would have used more HBAs instead if I had been building such a large array.

 

In the meantime I'd investigate hosts 192.168.111.120 and 192.168.111.121 to see why they are spamming your ftp daemon.
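
A quick way to see how heavy that FTP traffic is would be to count the vsftpd "connect from" lines per client address. This is a minimal sketch: the pattern is based on the log lines quoted earlier in this thread, the sample list reuses a few of them, and on the server itself you would pass in the lines of the syslog file instead.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... vsftpd[5373]: connect from 192.168.111.121 (192.168.111.121)"
CONNECT_RE = re.compile(r"vsftpd\[\d+\]: connect from (\S+)")

def ftp_connects_by_client(lines):
    """Count vsftpd connection attempts per client address."""
    counts = Counter()
    for line in lines:
        match = CONNECT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Sample lines copied from the syslog excerpts quoted in this thread.
sample = [
    "Jan 15 13:54:28 Tower vsftpd[10523]: connect from 127.0.0.1 (127.0.0.1)",
    "Jan 15 18:02:43 Tower vsftpd[5373]: connect from 192.168.111.121 (192.168.111.121)",
    'Jan 15 18:02:43 Tower vsftpd[5373]: [ashman] OK LOGIN: Client "192.168.111.121"',
    "Jan 15 18:02:44 Tower vsftpd[5377]: connect from 192.168.111.120 (192.168.111.120)",
]

for client, count in ftp_connects_by_client(sample).most_common():
    print(client, count)
```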

Link to comment

One more disk got disabled. The file system issues are probably because they can't both be correctly emulated, since parity2 is invalid.

 

The problem seems to be the SAS2LP: it's timing out. It could be a cable/backplane issue. Maybe try the other x8 slot and make sure the controller is well seated. The first 2 disabled disks are on the back backplane but the 3rd one is on the front.

 

You'll need to do a new config, and although parity1 should be mostly in sync it would need a parity check, so it's faster to just sync both.

 

 

 

I don't remember seeing these before:

Jan 15 19:17:53 Tower kernel: sas: broadcast received: 0
Jan 15 19:17:53 Tower kernel: sas: broadcast received: 0
Jan 15 19:17:53 Tower kernel: sas: broadcast received: 0
Jan 15 19:17:53 Tower kernel: sas: broadcast received: 0

 

But they look harmless, maybe someone else has more info.

Link to comment

It's not that I think it's damaged, but that there was so much tension from it being stretched that it's likely the cause of the drives dropping off.

 

The thing is if I get another HBA I am not sure how to connect it, the rear backplane has two connectors and the front backplane has three. I may email Supermicro tech support and ask them.

Link to comment

Supermicro manual is not very clear but from what I could find it does support dual link.

 

https://forums.servethehome.com/index.php?threads/playing-with-my-sas-expander.7420/

 

The HBA needs to support it as well. I know the SASLP and the LSI9211-8i (H310/M1015) support it because I tested those, but I never tested the SAS2LP and can't find anything about it on the net. You can try it, or I may test mine when I get the chance.

Link to comment
