Failed unRAID build



I will preface this by saying that I don't have any logs to share at this time.  I can provide some in the future, however.

 

Hardware basics:

 

Motherboard - ASUS X99-A II Socket 2011-V3, Intel X99

Processor - i7-6850k

RAM - G.Skill Ripjaws DDR4, 32GB

Video - Asus GeForce 210 & Gigabyte GTX 1070

SAS cards (x2) - SUPERMICRO AOC-SAS2LP-MV8 PCI-Express 2.0 x8 SATA / SAS 8-Port Controller Card

HD cages (x3) - ICY DOCK FatCage MB155SP-B, 5x 3.5" in 3x 5.25" hot-swap SATA HDD cage

 

Drives:

- 8TB WD Red x4

- 500GB Intel NVMe + 500GB Intel SSD for cache

 

I had one parity drive and three data drives. The two cache disks were used in a pool.

 

The fail:

 

My goal was to build a combination media server and headless gaming VM that I could use Steam Play with.

 

After a few false starts, I finally managed to get everything working. I copied about 8TB worth of data to it and then decided to make sure that everything was working as expected, so I ran a parity check. That was where the fun started.

 

The parity drive logged 52482 errors and unRAID took it offline. I tried to run a SMART test on it while it was still in the machine, but it refused. I pulled the drive out, put it in another machine, and ran a full SMART test there. It passed.
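For the record, the test on the other machine was just smartmontools from the command line. Something like this, with /dev/sdX standing in for wherever the drive showed up:

smartctl -t long /dev/sdX    # start the extended self-test (takes hours on an 8TB drive)
smartctl -a /dev/sdX         # once it finishes, check the self-test log and attributes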

 

In the meantime, I ran a read check on the remaining disks. It passed, but I saw XFS errors in the log for /dev/md1. I had also picked up a new drive to use as the replacement parity, so I shut down, installed the new drive, and restarted. Now md1 was missing. So I put the system in maintenance mode and ran xfs_repair. It said that I had to toss a bunch of journal entries. I gritted my teeth and did so. I ended up losing about 1TB of data from this, much of it files that I had transferred in the very first hours of loading up the machine.
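For anyone following along, the repair went roughly like this, from the console with the array started in maintenance mode (my filesystem was on /dev/md1; the -L step is the one that zeroes the log and tosses the journal entries):

xfs_repair -n /dev/md1    # dry run: report problems without writing anything
xfs_repair /dev/md1       # refused, complaining about a dirty log
xfs_repair -L /dev/md1    # zero the log and repair -- this is where the data loss happened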

 

A bit of googling showed that the SAS2LP cards aren't well loved by unRAID. So I turned off VT-d (which irks me, as I had been expecting gaming-class performance for my VM) as well as disabling INT13 on the cards.

 

I also ran a memtest, which passed (not that surprising, since I don't think I've ever seen a memtest fail, no matter how defective the RAM was).

 

So... I don't trust this build at all. Opinions? Options? The data is trashed badly enough that I will have to start over anyway.


As far as I know, all the SAS2LP issues were with older versions of unRAID. I've had 6 SAS2LPs running in 2 servers for over 3 years. I have INT13 & VT-d disabled, but enabling them shouldn't corrupt an XFS filesystem!

 

I'd assume you have a bad drive since you already did a memory test. Did you preclear the drives to ensure they aren't faulty? Throwing in 4 new drives without testing them is not a good idea.


I did not preclear the drives. Since I have to blow everything away and start over anyway, is there a simple way to do this to drives that are already initialized?

 

I have a feeling that the drives are fine, since the odds of getting two bad new WD Reds are vanishingly small.  But I am willing to try anything at this point.


 

I would post the SMART results of all drives.

 

Most people here will tell you preclearing drives is mandatory. 2 bad drives out of 4 is not unheard of. When I was buying WD drives from Newegg I had nearly a 25% DOA rate with roughly 30 drives ordered. I've since switched to Amazon and that DOA rate has been 0% with well over 30 drives purchased over the years. I don't trust Newegg anymore.
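To answer the question above: the preclear script doesn't care whether a drive is already initialized; it wipes whatever is on it. Assuming the preclear_disk.sh script (or the plugin version of it), one full cycle per drive would be something like:

preclear_disk.sh -c 1 /dev/sdX    # pre-read, zero, write preclear signature, post-read -- destroys all data on the drive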


Just the SMART attributes, like this:

1	Raw read error rate	0x000f	118	099	006	Pre-fail	Always	Never	188369784
3	Spin up time	0x0003	091	090	000	Pre-fail	Always	Never	0
4	Start stop count	0x0032	100	100	020	Old age	Always	Never	305
5	Reallocated sector count	0x0033	100	100	010	Pre-fail	Always	Never	0
7	Seek error rate	0x000f	078	060	030	Pre-fail	Always	Never	67167292
9	Power on hours	0x0032	093	093	000	Old age	Always	Never	6146 (8m, 12d, 2h)
10	Spin retry count	0x0013	100	100	097	Pre-fail	Always	Never	0
12	Power cycle count	0x0032	100	100	020	Old age	Always	Never	37
183	Runtime bad block	0x0032	100	100	000	Old age	Always	Never	0
184	End-to-end error	0x0032	100	100	099	Old age	Always	Never	0
187	Reported uncorrect	0x0032	100	100	000	Old age	Always	Never	0
188	Command timeout	0x0032	100	100	000	Old age	Always	Never	0
189	High fly writes	0x003a	072	072	000	Old age	Always	Never	28
190	Airflow temperature cel	0x0022	077	047	045	Old age	Always	Never	23 (min/max 23/37)
191	G-sense error rate	0x0032	100	100	000	Old age	Always	Never	0
192	Power-off retract count	0x0032	100	100	000	Old age	Always	Never	918
193	Load cycle count	0x0032	099	099	000	Old age	Always	Never	2093
194	Temperature celsius	0x0022	023	053	000	Old age	Always	Never	23 (0 19 0 0 0)
195	Hardware ECC recovered	0x001a	118	099	000	Old age	Always	Never	188369784
197	Current pending sector	0x0012	100	100	000	Old age	Always	Never	0
198	Offline uncorrectable	0x0010	100	100	000	Old age	Offline	Never	0
199	UDMA CRC error count	0x003e	200	200	000	Old age	Always	Never	0
240	Head flying hours	0x0000	100	253	000	Old age	Offline	Never	521 (217 78 0)
241	Total lbas written	0x0000	100	253	000	Old age	Offline	Never	30647098880
242	Total lbas read	0x0000	100	253	000	Old age	Offline	Never	180970282749
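That's just the attribute section of a smartctl report, if you want to pull it from the console:

smartctl -A /dev/sdX    # attribute table only
smartctl -a /dev/sdX    # full report, including the self-test log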


Diagnostics archive attached.  Here are the SMART results.

 

NEW parity drive (added yesterday and not part of this mess...yet):

 

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   054    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       6
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   020    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       13
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       6
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       10
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       10
194 Temperature_Celsius     0x0002   176   176   000    Old_age   Always       -       34 (Min/Max 21/37)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

 

 

Old FAILED parity drive:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   129   129   054    Pre-fail  Offline      -       124
  3 Spin_Up_Time            0x0007   149   149   024    Pre-fail  Always       -       446 (Average 438)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       31
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       234
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       22
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       81
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       81
194 Temperature_Celsius     0x0002   181   181   000    Old_age   Always       -       33 (Min/Max 22/44)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

 

Disk 1 (XFS corruption):

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   130   130   054    Pre-fail  Offline      -       120
  3 Spin_Up_Time            0x0007   145   145   024    Pre-fail  Always       -       465 (Average 439)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       25
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       250
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       25
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       116
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       116
194 Temperature_Celsius     0x0002   171   171   000    Old_age   Always       -       35 (Min/Max 21/45)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0

hex-diagnostics-20170109-1812.zip


Here is the preclear report of the drive that failed a parity check with 52482 errors.  Everything looks good to me.  Any thoughts?

 

############################################################################################################################
#                                                                                                                          #
#                                        unRAID Server Preclear of disk /dev/sdh                                           #
#                                       Cycle 1 of 1, partition start on sector 64.                                        #
#                                                                                                                          #
#                                                                                                                          #
#   Step 1 of 5 - Pre-read verification:                                                  [15:35:14 @ 142 MB/s] SUCCESS    #
#   Step 2 of 5 - Zeroing the disk:                                                       [15:22:55 @ 144 MB/s] SUCCESS    #
#   Step 3 of 5 - Writing unRAID's Preclear signature:                                                          SUCCESS    #
#   Step 4 of 5 - Verifying unRAID's Preclear signature:                                                        SUCCESS    #
#   Step 5 of 5 - Post-Read verification:                                                 [15:36:45 @ 142 MB/s] SUCCESS    #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
############################################################################################################################
#                              Cycle elapsed time: 46:34:57 | Total elapsed time: 46:34:57                                 #
############################################################################################################################


############################################################################################################################
#                                                                                                                          #
#                                               S.M.A.R.T. Status default                                                  #
#                                                                                                                          #
#                                                                                                                          #
#   ATTRIBUTE                    INITIAL  CYCLE 1  STATUS                                                                  #
#   5-Reallocated_Sector_Ct      0        0        -                                                                       #
#   9-Power_On_Hours             234      280      Up 46                                                                   #
#   194-Temperature_Celsius      34       38       Up 4                                                                    #
#   196-Reallocated_Event_Count  0        0        -                                                                       #
#   197-Current_Pending_Sector   0        0        -                                                                       #
#   198-Offline_Uncorrectable    0        0        -                                                                       #
#   199-UDMA_CRC_Error_Count     0        0        -                                                                       #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
#                                                                                                                          #
############################################################################################################################
#   SMART overall-health self-assessment test result: PASSED                                                               #
############################################################################################################################


--> ATTENTION: Please take a look into the SMART report above for drive health issues.

--> RESULT: Preclear Finished Successfully!.


root@Hex:/usr/local/emhttp#


This syslog is after a reboot, only runs for 13 minutes, and the array never starts, but the SMART reports are all fine, and there are no issues evident (except parity is not valid yet).  I don't think you have any drive issues, more likely an interface issue, such as a drive losing its connection, causing the parity check to fail with numerous errors.

 

The other problem was file system corruption in an XFS file system (which has nothing to do with the drive), which you have now fixed.  Without the syslog that covered that period, I cannot even speculate what went wrong.  Possibly a power or SATA cable disconnection during a write to a data drive?  That could explain both the file system corruption and the failed parity check.  I'd recheck all cable connections to the drives, both ends of the cables.

 

If you see any more errors, make sure you grab the diagnostics then, so we can see what errors are being reported.
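If I remember right, you can grab them from Tools -> Diagnostics in the webGui, or straight from the console:

diagnostics    # writes something like hex-diagnostics-YYYYMMDD-HHMM.zip to /boot/logs on the flash drive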


 

I still don't trust this system. I am going to do a preclear on the drive that had the XFS corruption (well, all the drives, I guess) and then try another memtest with the latest version of the tool. Given that it takes 48 hours to do a preclear, I suppose I will be checking back in again in a week or so...


Also, one thing that I was going to ask about (before the array died) is that I am getting fairly slow read speeds from this machine. On my Synology devices, I get about the max for a 1Gbit connection, i.e. 100-110 MB/s when I copy a single large file. On this server, I get around 60-70 MB/s. This is for the mechanical drives, mind you; for the cache SSDs, transfers are where they should be.

 

I am using the on-board NIC, although I suppose I could add a second NIC if I had to. I also have managed switches that support link aggregation / bonding, but I don't believe this is a bandwidth problem (given the SSD transfer speeds). The CPU showed very little utilization during this period, too.
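To rule the network in or out completely, a raw throughput test that leaves the disks out of the picture should settle it. Something like iperf3, assuming it's available on both ends:

iperf3 -s                      # on the unRAID box
iperf3 -c <server-ip> -t 30    # on a client; ~940 Mbit/s is about the practical ceiling for 1GbE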

 

Before things blew up, I ran this test script and all the drives showed at least a 150 MB/s transfer rate.
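The same check can be done by hand with a timed direct read from each device, something along these lines:

dd if=/dev/sdb of=/dev/null bs=1M count=4096 iflag=direct    # prints MB/s when done; iflag=direct bypasses the page cache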

 

Thoughts?


Further fun.  I am rebuilding the new parity drive and have a log full of these:

 

Jan 14 12:53:33 Hex kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Jan 14 12:53:33 Hex kernel: sas: trying to find task 0xffff88080fac5400
Jan 14 12:53:33 Hex kernel: sas: sas_scsi_find_task: aborting task 0xffff88080fac5400
Jan 14 12:53:33 Hex kernel: sas: sas_scsi_find_task: task 0xffff88080fac5400 is aborted
Jan 14 12:53:33 Hex kernel: sas: sas_eh_handle_sas_errors: task 0xffff88080fac5400 is aborted
Jan 14 12:53:33 Hex kernel: sas: ata10: end_device-8:3: cmd error handler
Jan 14 12:53:33 Hex kernel: sas: ata7: end_device-8:0: dev error handler
Jan 14 12:53:33 Hex kernel: sas: ata8: end_device-8:1: dev error handler
Jan 14 12:53:33 Hex kernel: sas: ata9: end_device-8:2: dev error handler
Jan 14 12:53:33 Hex kernel: sas: ata10: end_device-8:3: dev error handler
Jan 14 12:53:33 Hex kernel: sas: ata11: end_device-8:4: dev error handler
Jan 14 12:53:33 Hex kernel: ata10.00: exception Emask 0x0 SAct 0x800000 SErr 0x0 action 0x6 frozen
Jan 14 12:53:33 Hex kernel: ata10.00: failed command: READ FPDMA QUEUED
Jan 14 12:53:33 Hex kernel: ata10.00: cmd 60/00:00:e0:4d:11/04:00:5e:02:00/40 tag 23 ncq 524288 in
Jan 14 12:53:33 Hex kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 14 12:53:33 Hex kernel: ata10.00: status: { DRDY }
Jan 14 12:53:33 Hex kernel: ata10: hard resetting link
Jan 14 12:53:34 Hex kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!
Jan 14 12:53:36 Hex kernel: drivers/scsi/mvsas/mv_sas.c 1430:mvs_I_T_nexus_reset for device[3]:rc= 0
Jan 14 12:53:36 Hex kernel: ata10.00: configured for UDMA/133
Jan 14 12:53:36 Hex kernel: ata10: EH complete
Jan 14 12:53:36 Hex kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1


You really need to post your diagnostics. And can you confirm that IOMMU is still disabled in the BIOS? There is still an issue with certain Marvell SAS/SATA controller chips (including yours) that affects current versions of unRAID. The problem is with the driver, not with unRAID itself or the hardware. http://lime-technology.com/forum/index.php?topic=40683.0

 

Yes, VT-d is disabled in the BIOS. I have attached the latest diagnostics.

hex-diagnostics-20170114-1543.zip


I have made a bunch of changes.  So far, things appear to be better.  I am copying 9TB to it right now and the log is free of errors and warnings.

 

Changes:

 

- Removed the SAS2LP cards and moved the connections back to the main board

- Re-enabled VT-D in the BIOS

- Removed iommu=pt from syslinux.cfg

- Set "md_write_method" to "Reconstruct Write"

 

My write speeds are fantastic with "Reconstruct Write": around 100 MB/s. The funny thing is, copying files back from the array only gets around 40 MB/s.

 

I don't get this at all...
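For anyone else who wants to try Reconstruct Write: the setting is under Settings -> Disk Settings, or, I believe, from the console:

mdcmd set md_write_method 1    # 1 = reconstruct write ("turbo write"); 0 = default read/modify/write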
