Excessus Posted January 9, 2017

I will preface this by saying that I don't have any logs to share at this time. I can provide some in the future, however.

Hardware basics:
- Motherboard: ASUS X99-A II, Socket 2011-v3, Intel X99
- Processor: i7-6850K
- RAM: G.Skill Ripjaws DDR4, 32 GB
- Video: ASUS GeForce 210 & Gigabyte GTX 1070
- SAS cards (x2): Supermicro AOC-SAS2LP-MV8 PCI-Express 2.0 x8 SATA/SAS 8-port controller card
- HD cages (x3): ICY DOCK MB155SP-B FatCage, 5x 3.5" in 3x 5.25" hot-swap SATA HDD cage
- Drives: 8TB WD Red x4; 500GB Intel NVMe + 500GB Intel SSD for cache

I had one parity drive and three data drives. The two cache disks were used in a pool.

The fail: My goal was to build a combination media server and headless gaming VM that I could use Steamplay with. After a few false starts, I finally managed to get everything working. I copied about 8 TB worth of data to it, then ran a parity check to make sure everything was working as expected. That was where the fun started.

The parity drive detected 52482 errors and unRAID took it offline. I tried to run a SMART test on it while it was still in the machine, but it refused. I pulled the drive out, put it in another machine, and ran a full SMART test there. It passed. In the meantime, I ran a read check on the remaining disks. That passed too, but I saw XFS errors in the log for /dev/md1.

I also picked up a new drive to use as the replacement parity, so I shut down, installed the new drive, and restarted. Now md1 was missing. So I put the system in maintenance mode and ran xfs_repair. It said that I had to toss a bunch of journal entries. I gritted my teeth and did so. I ended up losing about 1TB of data from this, much of which was files I had transferred in the very first hours of loading up the machine.

A bit of googling showed that the SAS2LP cards aren't well loved by unRAID. So I turned off VT-d (which irks me, as I had been expecting gaming-class performance from my VM) and disabled INT13 on the cards. I also ran a memtest, which passed (which isn't super surprising, since I don't think I've ever seen a memtest fail, no matter how defective the RAM was).

So... I don't trust this build at all. Opinions? Options? The data is trashed well enough that I will have to start over anyway.
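For readers who land here in the same missing-md1 state: the xfs_repair pass described above is run with the array started in Maintenance mode. This is only a sketch of the sequence (the /dev/md1 device name follows the post; substitute your own md device), and the log-zeroing step is the one that throws journal entries away:

```shell
# Start the array in Maintenance mode from the webGUI first.

# Dry run: -n reports problems without modifying the filesystem.
xfs_repair -n /dev/md1

# Actual repair (-v for verbose output).
xfs_repair -v /dev/md1

# If xfs_repair refuses because the log is dirty and a mount/unmount is not
# possible, -L zeroes the log. That discards pending journal entries and can
# lose recently written files, as happened above. Last resort only.
# xfs_repair -L /dev/md1
```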
tyrindor Posted January 9, 2017

As far as I know, all the SAS2LP issues were with older versions of unRAID. I've had 6 SAS2LP cards running in 2 servers for over 3 years. I have INT13 & VT-d disabled, but enabling them shouldn't corrupt an XFS filesystem! I'd assume you have a bad drive, since you already did a memory test. Did you preclear the drives to ensure they aren't faulty? Throwing in 4 new drives without testing them is not a good idea.
Excessus Posted January 9, 2017 (Author)

> Did you preclear the drives to ensure they aren't faulty? Throwing in 4 new drives without testing them is not a good idea.

I did not preclear the drives. Since I have to blow everything away and start over anyway, is there a simple way to do this for drives that are already initialized? I have a feeling that the drives are fine, since the odds of getting two bad new WD Reds are vanishingly small. But I am willing to try anything at this point.
tyrindor Posted January 9, 2017

> Since I have to blow everything away and start over anyway, is there a simple way to do this for drives that are already initialized?

I would post the SMART results of all drives. Most people here will tell you preclearing drives is mandatory. 2 bad drives out of 4 is not unheard of. When I was buying WD drives from Newegg I had nearly a 25% DOA rate across roughly 30 drives ordered. Since switching to Amazon, my DOA rate has been 0% over well more than 30 drives purchased over the years. I don't trust Newegg anymore.
Excessus Posted January 9, 2017 (Author)

Full or short SMART test? The full tests will take a day or so to run.
tdallen Posted January 9, 2017

Run Tools -> Diagnostics. It will include the current SMART data as well as other logs to help with diagnosing the issue.
tyrindor Posted January 9, 2017

Just the SMART attributes, like this:

1 Raw read error rate 0x000f 118 099 006 Pre-fail Always Never 188369784
3 Spin up time 0x0003 091 090 000 Pre-fail Always Never 0
4 Start stop count 0x0032 100 100 020 Old age Always Never 305
5 Reallocated sector count 0x0033 100 100 010 Pre-fail Always Never 0
7 Seek error rate 0x000f 078 060 030 Pre-fail Always Never 67167292
9 Power on hours 0x0032 093 093 000 Old age Always Never 6146 (8m, 12d, 2h)
10 Spin retry count 0x0013 100 100 097 Pre-fail Always Never 0
12 Power cycle count 0x0032 100 100 020 Old age Always Never 37
183 Runtime bad block 0x0032 100 100 000 Old age Always Never 0
184 End-to-end error 0x0032 100 100 099 Old age Always Never 0
187 Reported uncorrect 0x0032 100 100 000 Old age Always Never 0
188 Command timeout 0x0032 100 100 000 Old age Always Never 0
189 High fly writes 0x003a 072 072 000 Old age Always Never 28
190 Airflow temperature cel 0x0022 077 047 045 Old age Always Never 23 (min/max 23/37)
191 G-sense error rate 0x0032 100 100 000 Old age Always Never 0
192 Power-off retract count 0x0032 100 100 000 Old age Always Never 918
193 Load cycle count 0x0032 099 099 000 Old age Always Never 2093
194 Temperature celsius 0x0022 023 053 000 Old age Always Never 23 (0 19 0 0 0)
195 Hardware ECC recovered 0x001a 118 099 000 Old age Always Never 188369784
197 Current pending sector 0x0012 100 100 000 Old age Always Never 0
198 Offline uncorrectable 0x0010 100 100 000 Old age Offline Never 0
199 UDMA CRC error count 0x003e 200 200 000 Old age Always Never 0
240 Head flying hours 0x0000 100 253 000 Old age Offline Never 521 (217 78 0)
241 Total lbas written 0x0000 100 253 000 Old age Offline Never 30647098880
242 Total lbas read 0x0000 100 253 000 Old age Offline Never 180970282749
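When eyeballing a full `smartctl -A` dump, it can help to filter down to the handful of attributes this thread keeps coming back to (reallocated, pending, uncorrectable, CRC). This is just a sketch with a trimmed sample inlined; on a live system you would pipe `smartctl -A /dev/sdX` (the device name is yours to substitute) into the function instead.

```shell
#!/bin/sh
# Filter a smartctl attribute table down to the usual health indicators.
key_smart_attrs() {
    grep -E 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable|UDMA_CRC'
}

# Trimmed sample modeled on the output quoted in this thread.
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0'

# Prints the matching attribute lines (all four of the sample lines here).
printf '%s\n' "$sample" | key_smart_attrs

# Live usage would be: smartctl -A /dev/sdb | key_smart_attrs
```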
Excessus Posted January 9, 2017 (Author)

Diagnostics archive attached. Here are the SMART results.

NEW parity drive (added yesterday and not part of this mess... yet):

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 100 100 054 Pre-fail Offline - 0
3 Spin_Up_Time 0x0007 100 100 024 Pre-fail Always - 0
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 6
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 100 100 020 Pre-fail Offline - 0
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 13
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 6
22 Helium_Level 0x0023 100 100 025 Pre-fail Always - 100
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 10
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 10
194 Temperature_Celsius 0x0002 176 176 000 Old_age Always - 34 (Min/Max 21/37)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0

Old FAILED parity drive:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 129 129 054 Pre-fail Offline - 124
3 Spin_Up_Time 0x0007 149 149 024 Pre-fail Always - 446 (Average 438)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 31
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 128 128 020 Pre-fail Offline - 18
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 234
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 22
22 Helium_Level 0x0023 100 100 025 Pre-fail Always - 100
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 81
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 81
194 Temperature_Celsius 0x0002 181 181 000 Old_age Always - 33 (Min/Max 22/44)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0

Disk 1 (XFS corruption):

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 130 130 054 Pre-fail Offline - 120
3 Spin_Up_Time 0x0007 145 145 024 Pre-fail Always - 465 (Average 439)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 25
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 128 128 020 Pre-fail Offline - 18
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 250
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 25
22 Helium_Level 0x0023 100 100 025 Pre-fail Always - 100
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 116
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 116
194 Temperature_Celsius 0x0002 171 171 000 Old_age Always - 35 (Min/Max 21/45)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0

hex-diagnostics-20170109-1812.zip
Excessus Posted January 11, 2017 (Author)

Here is the preclear report of the drive that failed a parity check with 52482 errors. Everything looks good to me. Any thoughts?

unRAID Server Preclear of disk /dev/sdh
Cycle 1 of 1, partition start on sector 64.

Step 1 of 5 - Pre-read verification: [15:35:14 @ 142 MB/s] SUCCESS
Step 2 of 5 - Zeroing the disk: [15:22:55 @ 144 MB/s] SUCCESS
Step 3 of 5 - Writing unRAID's Preclear signature: SUCCESS
Step 4 of 5 - Verifying unRAID's Preclear signature: SUCCESS
Step 5 of 5 - Post-Read verification: [15:36:45 @ 142 MB/s] SUCCESS

Cycle elapsed time: 46:34:57 | Total elapsed time: 46:34:57

S.M.A.R.T. Status (default)

ATTRIBUTE                    INITIAL  CYCLE 1  STATUS
5-Reallocated_Sector_Ct      0        0        -
9-Power_On_Hours             234      280      Up 46
194-Temperature_Celsius      34       38       Up 4
196-Reallocated_Event_Count  0        0        -
197-Current_Pending_Sector   0        0        -
198-Offline_Uncorrectable    0        0        -
199-UDMA_CRC_Error_Count     0        0        -

SMART overall-health self-assessment test result: PASSED

--> ATTENTION: Please take a look into the SMART report above for drive health issues.
--> RESULT: Preclear Finished Successfully!
RobJ Posted January 12, 2017

This syslog is after a reboot, only runs for 13 minutes, and the array never starts, but the SMART reports are all fine, and there are no issues evident (except parity is not valid yet). I don't think you have any drive issues, more likely an interface issue, such as a drive losing its connection, causing the parity check to fail with numerous errors. The other problem was file system corruption in an XFS file system (which has nothing to do with the drive), which you have now fixed. Without the syslog that covered that period, I cannot even speculate what went wrong. Possibly a power or SATA cable disconnection during a write to a data drive? That could explain both the file system corruption and the failed parity check. I'd recheck all cable connections to the drives, both ends of the cables. If you see any more errors, make sure you grab the diagnostics then, so we can see what errors are being reported.
Excessus Posted January 12, 2017 (Author)

> I'd recheck all cable connections to the drives, both ends of the cables. If you see any more errors, make sure you grab the diagnostics then, so we can see what errors are being reported.

I still don't trust this system. I am going to do a preclear on the drive that had the XFS corruption (well, all the drives, I guess) and then try another memtest with the latest version of the tool. Given that it takes 48 hours to do a preclear, I suppose I will be checking back in again in a week or so...
Excessus Posted January 12, 2017 (Author)

Also, one thing that I was going to ask about (before the array died) is that I am getting fairly slow read transfer speeds from this machine. On my Synology devices, I get about the max for a 1Gbit connection, i.e. 100-110 megs a second when I copy a single large file. On this server, I get around 60-70 megs a second. This is for the mechanical drives, mind you; for the cache SSDs, transfers are where they should be. I am using the on-board NIC, although I suppose I could add a second NIC card if I had to. I also have managed switches that support link aggregation / bonding, but I don't believe this is a bandwidth problem (given the SSD transfer speeds). The CPU showed very little utilization during this period, too. Before things blew up, I ran a test script and all the drives showed at least a 150 megs/sec transfer rate. Thoughts?
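For anyone wanting to reproduce per-drive numbers like the ones above, a single-stream read test is just dd to /dev/null. A rough sketch (the device name is an assumption; substitute your own, and note that reading a raw device needs root):

```shell
#!/bin/sh
# Rough single-stream read-speed check. Works on a regular file or,
# as root, on a raw device such as /dev/sdb.
# $1 = path to read, $2 = number of 1 MiB blocks (default 1024).
read_speed() {
    dd if="$1" of=/dev/null bs=1M count="${2:-1024}" 2>&1 | tail -n 1
}

# Example (assumed device name, requires root):
# read_speed /dev/sdb 2048
```

One caveat: a repeat run against a file will be served from the page cache and report RAM speed, so read more data than you have RAM, or test the raw device.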
Excessus Posted January 14, 2017 (Author)

Further fun. I am rebuilding the new parity drive and have a log full of these:

Jan 14 12:53:33 Hex kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Jan 14 12:53:33 Hex kernel: sas: trying to find task 0xffff88080fac5400
Jan 14 12:53:33 Hex kernel: sas: sas_scsi_find_task: aborting task 0xffff88080fac5400
Jan 14 12:53:33 Hex kernel: sas: sas_scsi_find_task: task 0xffff88080fac5400 is aborted
Jan 14 12:53:33 Hex kernel: sas: sas_eh_handle_sas_errors: task 0xffff88080fac5400 is aborted
Jan 14 12:53:33 Hex kernel: sas: ata10: end_device-8:3: cmd error handler
Jan 14 12:53:33 Hex kernel: sas: ata7: end_device-8:0: dev error handler
Jan 14 12:53:33 Hex kernel: sas: ata8: end_device-8:1: dev error handler
Jan 14 12:53:33 Hex kernel: sas: ata9: end_device-8:2: dev error handler
Jan 14 12:53:33 Hex kernel: sas: ata10: end_device-8:3: dev error handler
Jan 14 12:53:33 Hex kernel: sas: ata11: end_device-8:4: dev error handler
Jan 14 12:53:33 Hex kernel: ata10.00: exception Emask 0x0 SAct 0x800000 SErr 0x0 action 0x6 frozen
Jan 14 12:53:33 Hex kernel: ata10.00: failed command: READ FPDMA QUEUED
Jan 14 12:53:33 Hex kernel: ata10.00: cmd 60/00:00:e0:4d:11/04:00:5e:02:00/40 tag 23 ncq 524288 in
Jan 14 12:53:33 Hex kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 14 12:53:33 Hex kernel: ata10.00: status: { DRDY }
Jan 14 12:53:33 Hex kernel: ata10: hard resetting link
Jan 14 12:53:34 Hex kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!
Jan 14 12:53:36 Hex kernel: drivers/scsi/mvsas/mv_sas.c 1430:mvs_I_T_nexus_reset for device[3]:rc= 0
Jan 14 12:53:36 Hex kernel: ata10.00: configured for UDMA/133
Jan 14 12:53:36 Hex kernel: ata10: EH complete
Jan 14 12:53:36 Hex kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
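If it helps anyone watching for a recurrence: error bursts like the above are easy to count from a saved syslog, so you can tell whether a cabling or controller change actually made them stop. A sketch (the log path is an assumption; point it at wherever you saved the log):

```shell
#!/bin/sh
# Count link-reset / NCQ-timeout events in a saved syslog.
# The patterns match the mvsas error-handler lines quoted in this post.
count_sas_errors() {
    grep -cE 'hard resetting link|failed command: READ FPDMA QUEUED|sas_scsi_recover_host' "$1"
}

# Usage (assumed path): count_sas_errors /var/log/syslog
```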
John_M Posted January 14, 2017

You really need to post your diagnostics. And can you confirm that IOMMU is still disabled in the BIOS? There is still an issue with certain Marvell SAS/SATA controller chips (including yours) that affects current versions of unRAID. The problem is with the driver, not with unRAID itself or the hardware. http://lime-technology.com/forum/index.php?topic=40683.0
Excessus Posted January 14, 2017 (Author)

> And can you confirm that IOMMU is still disabled in the BIOS?

Yes, VT-d is disabled in the BIOS. I have attached the latest diagnostics.

hex-diagnostics-20170114-1543.zip
Excessus Posted January 14, 2017 (Author)

I should mention that I have already done the "append iommu=pt initrd=/bzroot" fix about a week ago.
Excessus Posted January 15, 2017 (Author)

I have made a bunch of changes. So far, things appear to be better. I am copying 9TB to it right now and the log is free of errors and warnings. Changes:

- Removed the SAS2LP cards and moved the connections back to the main board
- Re-enabled VT-d in the BIOS
- Removed iommu=pt from syslinux.cfg
- Set "md_write_method" to "Reconstruct Write"

My write speeds are fantastic with "Reconstruct Write"; around 100 megs a second. The funny thing is, copying files back from the array gets around 40 megs a second. I don't get this at all...
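For anyone hunting for the same toggle: "Reconstruct Write" lives under Settings > Disk Settings (Tunable: md_write_method) in the unRAID GUI. It can reportedly also be flipped from the console via mdcmd; the exact invocation below is an assumption based on unRAID 6-era forum posts, so prefer the GUI if unsure:

```shell
# Assumed console equivalent of the GUI setting (unRAID 6-era mdcmd).
# 1 selects reconstruct ("turbo") write; 0 restores the default
# read/modify/write behavior. Takes effect until the next reboot.
/root/mdcmd set md_write_method 1
```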
JorgeB Posted January 15, 2017

> The funny thing is, copying files back from the array gets around 40 megs a second.

Try this FAQ entry to see if it helps.
Excessus Posted January 15, 2017 (Author)

> Try this FAQ entry to see if it helps.

Thanks! I'm in the middle of letting the file integrity plugin do its thing, so I don't want to restart the array just now. But I have made the change and will report back.
Excessus Posted February 4, 2017 (Author)

Just a final follow-up. Everything is working fine now. I replaced the Supermicro cards with Dell PERC H310 cards and I haven't had another weird error since. I'd have to come down on the firm NOT RECOMMENDED side for these cards in current unRAID builds.
tyoung5ND Posted February 4, 2017

FWIW - I have also had many issues with the SUPERMICRO AOC-SAS2LP-MV8 card. See my post here: https://lime-technology.com/forum/index.php?topic=44248.0