Unraid partially boots but connection quickly times out (//tower lost)



Hi team!

 

I'm running 4.7 and everything's been working well. (Would've upgraded but too many other priorities in life over the past little while).

 

Today, I tried to access the array, and other than the first level of share names, when I dove in a bit deeper (in Finder on Mac or FileBrowser on iOS) it was just a spinning pinwheel.  Went into //Tower and no response.

 

Went directly to the machine, logged in, and ran "powerdown".  Unfortunately, it hung, and after 20 minutes I manually powered off.

 

Restarted and it came up (//tower) with a parity check, but still could not deep dive into the shares.  And, after maybe 10-15 minutes, I lost connection to //tower as well.

 

Attached is my syslog.

 

Help is much appreciated...little bit of panic over here... :P:-[

 

syslog-2015-Apr-19.txt.zip

Link to comment

It's an older motherboard with limited ports, but with that nice CPU it has probably been quite usable!  I do see some configuration issues, which have hampered you.  It's currently reserving 256MB for video, fine for graphical screens but unnecessary for unRAID.  Change that to the lowest possible, to give yourself possibly over 240MB of additional system memory.  The latest Seagate 2TB is attached to the Silicon Image card, which will slow it down.  I'd swap it with the 500GB drive, as you want the slowest drives on the slowest ports.  But the big problem is the BIOS configuration: the extra 2 SATA ports are configured to use the ancient emulated IDE support, not the SATA support they should be using.  In the BIOS settings, look for a native SATA mode, preferably AHCI if available, anything but IDE.

 

Your Disk 4 has a problem, possibly serious.  Because it's connected to one of those IDE configured ports, when it first reported a problem (about 14 minutes into the parity check), the IDE module did not seem to handle it well, and it then reported a Device Fault, which fatally caused the IDE support to disable the DMA's for both drives, Disk 4 and Disk 3.  It was never able to regain a good contact with Disk 4, so reported numerous read errors.  The read errors may be spurious, because it couldn't actually talk to the drive.

 

Power off first, then please correct the BIOS, so that the normal SATA error handlers are handling the problems.  Then obtain a SMART report for Disk 4.  You will have to determine what the new drive symbol for it is (currently hdd, but that will change once you correct the BIOS).

 

UnMENU is also causing trouble, something is not configured right for its startup.  So it fails to start, and then keeps looping, trying over and over to restart.

 

We can't advise further without knowing what the SMART report says about Disk 4.  The very first error was a "SeekComplete Error", which could be minor, or could indicate a major failure of the drive.  All errors after that were "DeviceFault Error", which I suspect mean that the drive firmware had crashed.  Powering down completely should bring it back up, if it hasn't completely failed.

Link to comment

Great info, RobJ. Thank you very much!

 

I'm trying to switch to AHCI on SATA ports 5 &6 but the BIOS says it's not possible.  When I do set it to AHCI, it won't boot.

 

If I physically rewire to put those drives on the main board (SATA 1-4) will I prevent full recovery? I'm pretty sure no, but just want to confirm before I start rewiring.

 

thanks!

Link to comment

I'm trying to switch to AHCI on SATA ports 5 &6 but the BIOS says it's not possible.  When I do set it to AHCI, it won't boot.

So you're saying there is an AHCI choice, but it won't let you select it?  Or that you can select it, but it won't boot?  Any choice is better than IDE anything.  But make sure that you check the boot order first.  Some (many?) motherboards try to *help* you by switching the boot order when a major change is made to the attached drives and/or drive controllers.  That often causes the system to fail to boot.  Try selecting AHCI then setting once more the boot order so that your flash drive is first to boot.  BUT see below...

 

If I physically rewire to put those drives on the main board (SATA 1-4) will I prevent full recovery? I'm pretty sure no, but just want to confirm before I start rewiring.

I forgot you are still on v4.7.  In newer versions, drives are identified by their serial number no matter where they have moved, so you can shut down and scramble the drives and cables all over, and when you boot unRAID will find them and properly install them in their correct disk number assignments.  Unfortunately, that wasn't true with earlier versions.  You would need to make the changes, then do the 'Trust Parity' procedure.  And right now, that's a risk.  So better to leave everything just where it is, including the IDE BIOS choice, until any disk recovery is complete.  At least, it makes finding the drive symbol easy for Disk 4, it's hdd.  So the SMART command would be

 

  smartctl -a -d ata /dev/hdd >/boot/smart_hdd.txt

Link to comment
So you're saying there is an AHCI choice, but it won't let you select it?  Or that you can select it, but it won't boot?  Any choice is better than IDE anything.  But make sure that you check the boot order first.  Some (many?) motherboards try to *help* you by switching the boot order when a major change is made to the attached drives and/or drive controllers.  That often causes the system to fail to boot.  Try selecting AHCI then setting once more the boot order so that your flash drive is first to boot.

 

Please see the literal screenshots I've attached. Look to the top right corner about the warnings about switching.  After I switched to AHCI then it won't boot (goes to flashing cursor in top left corner). When I switch back to IDE it boots.

 

So better to leave everything just where it is, including the IDE BIOS choice, until any disk recovery is complete.  At least, it makes finding the drive symbol easy for Disk 4, it's hdd.

 

Done. Please find it attached.

IMG_5986-1.JPG

IMG_5987-1.JPG

smart_hdd.txt

Link to comment

Unfortunately, I have to be at work shortly ...

 

The SMART report is very worrying.  If no one else is able to help, I'll be back later this afternoon.  I think you need a non-destructive badblocks command, but run the SMART short and long tests first, the long one only if the short one passes.

Link to comment

So you're saying there is an AHCI choice, but it won't let you select it?  Or that you can select it, but it won't boot?  Any choice is better than IDE anything.  But make sure that you check the boot order first.  Some (many?) motherboards try to *help* you by switching the boot order when a major change is made to the attached drives and/or drive controllers.  That often causes the system to fail to boot.  Try selecting AHCI then setting once more the boot order so that your flash drive is first to boot.

 

Please see the literal screenshots I've attached. Look to the top right corner about the warnings about switching.  After I switched to AHCI then it won't boot (goes to flashing cursor in top left corner). When I switch back to IDE it boots.

What does the warning say? The screenshot is too small to read. In BIOS, after changing to SATA make sure that the flash is still selected as the boot device.

 

So better to leave everything just where it is, including the IDE BIOS choice, until any disk recovery is complete.  At least, it makes finding the drive symbol easy for Disk 4, it's hdd.

 

Done. Please find it attached.

 

Switching to AHCI should make the system faster and more stable but should not be required for recovery. The flashing cursor is the BIOS looking for a boot device. Changing to AHCI requires that the boot device be selected again on your MB.

 

Disk 4 has pending sectors. See here: http://lime-technology.com/wiki/index.php/Troubleshooting#Resolving_a_Pending_Sector

Link to comment

Just curious what you have tried or done so far ...

 

I do recommend a SMART short test first:

  smartctl -t short /dev/hdd

Wait 2 and a half minutes, then grab another SMART report:

  smartctl -a -d ata /dev/hdd >/boot/smart_hdd2.txt

Check the test section for test results, should indicate if short test passed or failed.  If failed, drive needs to be rebuilt.

If passed, then turn off spin down for the drive, and run the SMART long test:

  smartctl -t long /dev/hdd

It's best to avoid using Disk 4 at all during this test.  Wait 4 to 6 hours, then grab another SMART report:

  smartctl -a -d ata /dev/hdd >/boot/smart_hdd3.txt

Check the test section again for test results.  If test not complete, then wait a few more hours and grab the SMART report again (until test complete).  Check whether either Current_Pending_Sector or Multi_Zone_Error_Rate have increased.  Report back!
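For anyone following along, the "check whether it increased" step above can be sketched as a small shell helper.  This is only a sketch, assuming the reports were saved with smartctl -a as shown and that the attribute table has smartctl's usual ten-column layout (attribute name in column 2, raw value in column 10).  The function names smart_raw and smart_changed are made up for illustration.

```shell
# smart_raw NAME FILE
# Print the RAW_VALUE of one attribute from a saved "smartctl -a" report.
# Assumes the usual table layout: ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST,
# THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE (hypothetical helper).
smart_raw() {
    awk -v name="$1" '$2 == name { print $10; exit }' "$2"
}

# smart_changed NAME OLD_REPORT NEW_REPORT
# Say whether the raw value went up, went down, or stayed the same.
smart_changed() {
    old=$(smart_raw "$1" "$2")
    new=$(smart_raw "$1" "$3")
    if [ -z "$old" ] || [ -z "$new" ]; then
        echo "$1: not found in one of the reports"
    elif [ "$new" -gt "$old" ]; then
        echo "$1: increased from $old to $new"
    elif [ "$new" -lt "$old" ]; then
        echo "$1: decreased from $old to $new"
    else
        echo "$1: unchanged at $old"
    fi
}
```

With the file names used in this thread, it would be called like this:

  smart_changed Current_Pending_Sector /boot/smart_hdd2.txt /boot/smart_hdd3.txt
  smart_changed Multi_Zone_Error_Rate /boot/smart_hdd2.txt /boot/smart_hdd3.txt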

Link to comment

Ok, back from work and trying a few things.

 

First I tried setting to AHCI again, and then updated the Boot device back to the USB drive so it would start up (thanks dgaschk).

However, the client then shows that two disks are missing, so the whole thing doesn't mount.  I've attached two screenshots of the main page of the BIOS, one when I switch to AHCI and one when I switch to IDE.  IDE allows me to start the array, but things go awry like I note above.

 

OK, it will not let me post the files; they're too big, so I had to find another way to send the screenshots (big enough to read).  Here they are on Google Drive:  https://drive.google.com/folderview?id=0B6aGoLYlJ5VjfmdpSEFfM1A0NnZRS0J5Q3U0eXV5Ylkyam1zdDFhSk1TNlRRdW5YVVV6MVU&usp=sharing

 

Here's the result of the first SMART test.

When_I_switch_Drive_5_and_6_to_IDE.jpg

smart_hdd2.txt

Link to comment

The read failure is not unusual for a disk with pending sectors. After the pending sectors are resolved, repeat the test and it should pass.

 

The warning in the BIOS is not a problem because unRAID has AHCI drivers. The warning is meant for Windows XP users.

 

If the SATA ports are set to AHCI and the flash is then re-assigned as the boot device, does the system start up?

Link to comment

Just a thought - since the OP is running 4.7 and that release identifies disks by how they are connected (rather than by their serial number like later releases do) I am not sure that it is possible to switch a drive from IDE mode to AHCI mode and still get it recognised as the same drive by unRAID.  However as I have no practical experience of 4.7 (I started my unRAID system with a v5 beta as I had 3TB disks) I could be wrong.  This might be relevant in any recovery scenario.

Link to comment
If the SATA ports are set to AHCI and the flash is then re-assigned as the boot device, does the system start up?

Sorry if what I said before was unclear, but no, it does not.

 

It's best to avoid using Disk 4 at all during this test.  Wait 4 to 6 hours, then grab another SMART report:

Ran overnight. Attached

 

This morning, the browser client is showing that parity is valid, though it was "last checked" yesterday morning. (This is odd as I had been switching the tower on and off, and did not have it running for any length of time.  I don't *believe* I had it running overnight two days ago, as my parity check would've taken a long time with my setup.)  I've poked around in the shares, and after I perform a little bit of file access the system "hangs", meaning I lose connection (from the browser) to the Tower and can no longer see the shares either.  So no miracle cure has happened (not that I was expecting one).

 

smart_hdd3.txt

syslog-2015-04-21.txt

Link to comment

Various observations, not necessarily related -

 

* We're agreed, leave the BIOS where it's set (using IDE), for now.  Once the system is recovered, then perhaps it could be looked into again.  It *should* be able to boot with all AHCI, don't know why it can't now.

 

* SMART reports - there was a real surprise on the new one.  Multi_Zone_Error_Rate dropped from 33218 to 11, and more importantly, its VALUE increased from 34 (quite low) to 200, perfect!  I have NEVER seen that happen before.  So at one point, it was having serious problems with whatever this is monitoring, but now it's completely fine, obviously a very good sign for the drive.  No change in the Current_Pending_Sector count of 1402, that still HAS to be fixed, but at least it did not get worse.  So the drive appears recoverable, usable, pending a full PreClear test.

 

* I really wanted to try something new, a non-destructive badblocks run to see if we could save all data in place, while forcing the drive to deal with all of the pending sectors.  I did a lot of research, but couldn't find a safe way to do it.  To clear a pending sector, you have to write to it, so you want to save the data first, then thoroughly test writing to it, then write the correct data back to it.  The drive will have either remapped the sector to a good sector, or recognized it as good now, reducing the number of pending sectors still awaiting testing.  A non-destructive pass saves the data first, then writes test patterns to it, then writes the good data back.  That forces the drive to deal once and for all with the sector, and it ends up with the correct data, now in a good sector.  Unfortunately, I could not find any discussion of what actually happens when badblocks tries to read a bad sector and gets a read error.  It will either force the drive to deal with the sector (but it has no good data to write back, so sector is good but data is corrupt!), OR it skips it, lists it on the bad sector list it creates, which doesn't accomplish anything for us.  Someday, I or someone should write a script that detects and lists all the bad sectors, then generates the correct data from all other drives including parity, then force writes it to those sectors.  Should not actually be too hard to do, much better than current workarounds.

 

* UnMENU - there's a problem with the way UnMENU is starting, and it is filling the syslog with garbage for about a minute each boot up.  I don't recognize the error message myself, but perhaps someone else will.  HOWEVER, it is consistent with one possible scenario.  There has been at least one user who misunderstood the directions for starting UnMENU, the part about using an echo command to insert the 'start UnMENU' command into the go file.  They thought it meant to insert the 'echo' command into the go file, which causes the go file to gain an additional line starting UnMENU every time it boots.  If you have booted UnMENU 100 times, then you will have a go file with 100 lines all trying to start UnMENU!  Check your go file, and if this is the case, remove all but one of those UnMENU start lines, and especially remove that echo command!

 

* You mentioned having issues this morning.  The end of the syslog unfortunately does not seem to indicate any problems, with one possible exception.  The Mover ran at 3:40am, with no issues.  Then at 6:47am, the system tried to send an email through gmail, but it failed.  There are no other email attempts, so I cannot tell if this is a network problem or an email configuration problem.  At about 7:25am, you logged into the server from another station, so that means the network was fine.  There are no further messages, so no real indication of trouble (assuming you captured this syslog AFTER the "hangs" and lost connection).

 

* Your choice, but if it were me, I would disable Transmission for now, and anything else that is not core functionality, until problem is solved.

 

* The drive will have to be rebuilt, either onto itself or onto a new replacement drive.  If you want to avoid buying another, your system is going to be partially down for awhile, with one drive missing.  Disk 4 needs to be removed from the array, then PreCleared, and if it is successful (clean report with no more pending sectors), then it can be reassigned and unRAID will rebuild Disk 4 data onto it.  Or you could consider this a good time to upgrade a disk, buy a 2TB drive, PreClear it, and rebuild Disk 4 onto it.  Your choice, both time-consuming, unless you happen to have a PreCleared replacement waiting already...
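On the UnMENU point above, a quick way to see whether the go file has picked up repeated start lines is sketched below.  It assumes the go file is at /boot/config/go and that the start lines contain the text "unmenu".  Both are assumptions about this particular setup, and the function names are hypothetical.  Nothing is written back to the go file, so the output can be reviewed first.

```shell
# count_matching PATTERN FILE
# Count the lines in FILE that contain PATTERN, ignoring case.
# Prints 0 when nothing matches.
count_matching() {
    grep -ci -- "$1" "$2"
}

# dedupe_lines FILE
# Print FILE with repeated non-blank lines dropped, keeping the first copy
# of each and leaving blank lines alone.  FILE itself is not modified.
dedupe_lines() {
    awk 'NF == 0 || !seen[$0]++' "$1"
}
```

Possible usage, writing the cleaned copy to a new file rather than over the original:

  count_matching unmenu /boot/config/go
  dedupe_lines /boot/config/go > /boot/config/go.new

The stray echo line is a different line from the start lines it generates, so it would survive this and still needs to be removed by hand.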

Link to comment
* Your choice, but if it were me, I would disable Transmission for now, and anything else that is not core functionality, until problem is solved.

Done.

 

* The drive will have to be rebuilt, either onto itself or onto a new replacement drive.  If you want to avoid buying another, your system is going to be partially down for awhile, with one drive missing.  Disk 4 needs to be removed from the array, then PreCleared, and if it is successful (clean report with no more pending sectors), then it can be reassigned and unRAID will rebuild Disk 4 data onto it.  Or you could consider this a good time to upgrade a disk, buy a 2TB drive, PreClear it, and rebuild Disk 4 onto it. 

I bought a new drive, precleared it 3X (2TB took 68 Hrs), installed and happy to report that things are stable and running again.  I've attached a new logfile as well.

 

* We're agreed, leave the BIOS where it's set (using IDE), for now.  Once the system is recovered, then perhaps it could be looked into again.  It *should* be able to boot with all AHCI, don't know why it can't now.

Did you have a chance to review my screenshots?

https://drive.google.com/folderview?id=0B6aGoLYlJ5VjfmdpSEFfM1A0NnZRS0J5Q3U0eXV5Ylkyam1zdDFhSk1TNlRRdW5YVVV6MVU&usp=sharing

Any other suggestions for how to get this to work? I can now start to rewire (by un-assigning the drives and then re-assigning them after physically rewiring them)

 

* SMART reports - there was a real surprise on the new one.  Multi_Zone_Error_Rate dropped from 33218 to 11, and more importantly, its VALUE increased from 34 (quite low) to 200, perfect!  I have NEVER seen that happen before.  So at one point, it was having serious problems with whatever this is monitoring, but now it's completely fine, obviously a very good sign for the drive.  No change in the Current_Pending_Sector count of 1402, that still HAS to be fixed, but at least it did not get worse.  So the drive appears recoverable, usable, pending a full PreClear test.

I'm already at 6 drives.  If I plug in the extra drive (the one I removed), will I be able to "see" it using the pre-clear scripts as per normal (I use Screen etc.)?  Once finished I'll upgrade my smaller 500GB drive to this one.

 

* UnMENU - there's a problem with the way UnMENU is starting, and it is filling the syslog with garbage for about a minute each boot up.  I don't recognize the error message myself, but perhaps someone else will.  HOWEVER, it is consistent with one possible scenario.  There has been at least one user who misunderstood the directions for starting UnMENU, the part about using an echo command to insert the 'start UnMENU' command into the go file.  They thought it meant to insert the 'echo' command into the go file, which causes the go file to gain an additional line starting UnMENU every time it boots.  If you have booted UnMENU 100 times, then you will have a go file with 100 lines all trying to start UnMENU!  Check your go file, and if this is the case, remove all but one of those UnMENU start lines, and especially remove that echo command!

Guilty as charged ::). I have cleaned this all up now.  Should hopefully be reflected in the new syslog

syslog-2015-04-27.txt

Link to comment

I bought a new drive, precleared it 3X (2TB took 68 Hrs), installed and happy to report that things are stable and running again.

That's terrific!

 

I've attached a new logfile as well.

Not so terrific, somehow all line endings were stripped from the syslog, leaving it as one VERY long line!  There are no Linux line endings, or DOS line endings.  If you could send it just the way you captured it, that would be better.
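If it helps to check a captured syslog before attaching it, something like the following sketch would show whether the line endings survived.  It only counts LF and CR bytes; the function name count_line_endings is made up for illustration.

```shell
# count_line_endings FILE
# Print how many LF and CR bytes FILE contains.  A normal Linux syslog has
# one LF per line and no CRs; a DOS-style file has a CR before every LF;
# a file with its line endings stripped shows zero for both.
count_line_endings() {
    lf=$(tr -cd '\n' < "$1" | wc -c | tr -d '[:space:]')
    cr=$(tr -cd '\r' < "$1" | wc -c | tr -d '[:space:]')
    echo "LF=$lf CR=$cr"
}
```

For example:

  count_line_endings syslog-2015-04-27.txt

A result of LF=0 CR=0 on a file of any real size would match what I'm seeing here.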

 

* We're agreed, leave the BIOS where it's set (using IDE), for now.  Once the system is recovered, then perhaps it could be looked into again.  It *should* be able to boot with all AHCI, don't know why it can't now.

Did you have a chance to review my screenshots?

https://drive.google.com/folderview?id=0B6aGoLYlJ5VjfmdpSEFfM1A0NnZRS0J5Q3U0eXV5Ylkyam1zdDFhSk1TNlRRdW5YVVV6MVU&usp=sharing

Any other suggestions for how to get this to work? I can now start to rewire (by un-assigning the drives and then re-assigning them after physically rewiring them)

The screens look great, just what we expected and wanted, so it's surprising it doesn't boot.  Can you double-check once more the boot order for your flash drive, make sure it is still first?

Link to comment
