Drive error in syslog


joshpond

Hi all,

 

Just inserted a new drive into a Supermicro AOC-SASLP-MV8. There is another drive on the card already with no errors.

Started to see the syslog fill up with this error listed below:

 

Feb 13 18:27:33 Tower kernel: ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00 (Drive related)
Feb 13 18:27:33 Tower kernel: ata2: status=0x51 { DriveReady SeekComplete Error } (Errors)
Feb 13 18:27:33 Tower kernel: ata2: error=0x04 { DriveStatusError } (Errors)
Feb 13 18:27:33 Tower kernel: ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00 (Drive related)
Feb 13 18:27:33 Tower kernel: ata2: status=0x51 { DriveReady SeekComplete Error } (Errors)
Feb 13 18:27:33 Tower kernel: ata2: error=0x04 { DriveStatusError } (Errors)
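For reference, the names in braces are just the set bits of the ATA status and error registers spelled out. A minimal sketch of that decoding, assuming the standard ATA register bit layout; the labels for bits that do not appear in the log above are illustrative names, not necessarily what the kernel prints:

```python
# Bit positions follow the standard ATA status/error register layout.
# Labels for bits not seen in the log above are illustrative only.
STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",
    0x10: "SeekComplete",
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",
}
ERROR_BITS = {
    0x80: "BadCRC",
    0x40: "UncorrectableError",
    0x20: "MediaChanged",
    0x10: "SectorIdNotFound",
    0x08: "MediaChangeRequested",
    0x04: "DriveStatusError",  # ABRT: the drive aborted the command
    0x02: "TrackZeroNotFound",
    0x01: "AddrMarkNotFound",
}

def decode(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [name for mask, name in sorted(table.items(), reverse=True)
            if value & mask]

print(decode(0x51, STATUS_BITS))  # ['DriveReady', 'SeekComplete', 'Error']
print(decode(0x04, ERROR_BITS))   # ['DriveStatusError']
```

So 0x51/0x04 reads as "drive ready, seek complete, error flag set, command aborted". On its own, the aborted-command bit says the drive rejected a command; it does not say a sector was unreadable.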

 

Started preclear and the errors seem to have stopped. From what I could find on Google, is this error basically saying that it can't find a start sector because the drive is unformatted, and that it should go away after preclearing the drive and adding it to the array?

 

Running unRAID 4.7.

 

Thanks Josh

Ok, a lot more errors now,

 

Tried to play an AVI through the Popcorn Hour and it would only play for a few seconds and then cut out.

 

thanks Josh

 

Feb 13 18:27:33 Tower kernel: ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00 (Drive related)
Feb 13 18:27:33 Tower kernel: ata2: status=0x51 { DriveReady SeekComplete Error } (Errors)
Feb 13 18:27:33 Tower kernel: ata2: error=0x04 { DriveStatusError } (Errors)
Feb 13 18:47:02 Tower kernel: ------------[ cut here ]------------
Feb 13 18:47:02 Tower kernel: WARNING: at drivers/ata/libata-core.c:5186 ata_qc_issue+0x10b/0x308() (Minor Issues)
Feb 13 18:47:02 Tower kernel: Hardware name: System Product Name
Feb 13 18:47:02 Tower kernel: Modules linked in: md_mod xor r8169 ahci mvsas libsas scst scsi_transport_sas (Drive related)
Feb 13 18:47:02 Tower kernel: Pid: 29001, comm: hdparm Not tainted 2.6.32.9-unRAID #8 (Errors)
Feb 13 18:47:02 Tower kernel: Call Trace: (Errors)
Feb 13 18:47:02 Tower kernel:  [<c102449e>] warn_slowpath_common+0x60/0x77 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c10244c2>] warn_slowpath_null+0xd/0x10 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c11b624d>] ata_qc_issue+0x10b/0x308 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c11ba260>] ata_scsi_translate+0xd1/0xff (Errors)
Feb 13 18:47:02 Tower kernel:  [<c11a816c>] ? scsi_done+0x0/0xd (Errors)
Feb 13 18:47:02 Tower kernel:  [<c11a816c>] ? scsi_done+0x0/0xd (Errors)
Feb 13 18:47:02 Tower kernel:  [<c11baa40>] ata_sas_queuecmd+0x120/0x1d7 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c11bc6df>] ? ata_scsi_pass_thru+0x0/0x21d (Errors)
Feb 13 18:47:02 Tower kernel:  [<f844769a>] sas_queuecommand+0x65/0x20d [libsas] (Errors)
Feb 13 18:47:02 Tower kernel:  [<c11a816c>] ? scsi_done+0x0/0xd (Errors)
Feb 13 18:47:02 Tower kernel:  [<c11a82c0>] scsi_dispatch_cmd+0x147/0x181 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c11ace4d>] scsi_request_fn+0x351/0x376 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c1126798>] __blk_run_queue+0x78/0x10c (Errors)
Feb 13 18:47:02 Tower kernel:  [<c1124446>] elv_insert+0x67/0x153 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c11245b8>] __elv_add_request+0x86/0x8b (Errors)
Feb 13 18:47:02 Tower kernel:  [<c1129343>] blk_execute_rq_nowait+0x4f/0x73 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c11293dc>] blk_execute_rq+0x75/0x91 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c11292cc>] ? blk_end_sync_rq+0x0/0x28 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c112636f>] ? get_request+0x204/0x28d (Errors)
Feb 13 18:47:02 Tower kernel:  [<c11269d6>] ? get_request_wait+0x2b/0xd9 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c112c2bf>] sg_io+0x22d/0x30a (Errors)
Feb 13 18:47:02 Tower kernel:  [<c112c5a8>] scsi_cmd_ioctl+0x20c/0x3bc (Errors)
Feb 13 18:47:02 Tower kernel:  [<c11b3257>] sd_ioctl+0x6a/0x8c (Errors)
Feb 13 18:47:02 Tower kernel:  [<c112a420>] __blkdev_driver_ioctl+0x50/0x62 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c112ad1c>] blkdev_ioctl+0x8b0/0x8dc (Errors)
Feb 13 18:47:02 Tower kernel:  [<c1131e2d>] ? kobject_get+0x12/0x17 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c112b0f8>] ? get_disk+0x4a/0x61 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c101b028>] ? kmap_atomic+0x14/0x16 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c11334a5>] ? radix_tree_lookup_slot+0xd/0xf (Errors)
Feb 13 18:47:02 Tower kernel:  [<c104a179>] ? filemap_fault+0xb8/0x305 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c1048c43>] ? unlock_page+0x18/0x1b (Errors)
Feb 13 18:47:02 Tower kernel:  [<c1057c63>] ? __do_fault+0x3a7/0x3da (Errors)
Feb 13 18:47:02 Tower kernel:  [<c105985f>] ? handle_mm_fault+0x42d/0x8f1 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c108b6c6>] block_ioctl+0x2a/0x32 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c108b69c>] ? block_ioctl+0x0/0x32 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c10769d5>] vfs_ioctl+0x22/0x67 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c1076f33>] do_vfs_ioctl+0x478/0x4ac (Errors)
Feb 13 18:47:02 Tower kernel:  [<c105dcdd>] ? do_mmap_pgoff+0x232/0x294 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c1076f93>] sys_ioctl+0x2c/0x45 (Errors)
Feb 13 18:47:02 Tower kernel:  [<c1002935>] syscall_call+0x7/0xb (Errors)
Feb 13 18:47:02 Tower kernel: ---[ end trace 823ba0e3f82c0bce ]---

 

Edit:

1) Stopped preclear, powered down, moved the drive to a motherboard port, started preclear again. No more errors in the syslog, but still the same playback errors.

2) Stopped preclear, powered down, removed the drive. All seems to be OK again.

3) Reinserted the drive into the motherboard port and it is still running fine.

 

Is it possible that running the preclear script whilst watching movies is causing the playback errors?

I don't recall it doing that in the past.

 

Thanks Josh

Does anyone have any ideas or anywhere I can start trying please?

The error seems to have occurred while mapping memory. My best first guess is that you did not have enough free memory available at that moment. (Less likely, your memory voltage, timing, or clock speed is not set correctly for your specific memory sticks.)

 

I'd use the -r, -w, and/or -b options to the preclear script to limit its memory use as described here:

http://lime-technology.com/forum/index.php?topic=2817.msg104768;topicseen#msg104768

 

I'd try something like

-r32768 -w32768 -b10

Thanks Joe,

You could be right about the memory. I run 2GB, and at idle this is the free memory info:

Memory Info

 

(from /usr/bin/free)

 

             total       used       free     shared    buffers     cached
Mem:       1813316    1239720     573596          0      47844    1076436
-/+ buffers/cache:     115440    1697876
Swap:            0          0          0
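The "-/+ buffers/cache" row is derived from the "Mem:" row, and it is the one that matters when asking how much RAM a new process can get. A small sketch of the arithmetic, using the figures from the output above (the function name is made up for illustration):

```python
def adjust_for_cache(used, free, buffers, cached):
    """Recompute used/free the way the '-/+ buffers/cache' row does.

    Buffers and page cache can be reclaimed when programs need RAM,
    so they are subtracted from 'used' and added to 'free'.
    """
    return used - buffers - cached, free + buffers + cached

# Values in kB, copied from the "Mem:" row of the free output.
really_used, really_free = adjust_for_cache(
    used=1239720, free=573596, buffers=47844, cached=1076436
)
print(really_used, really_free)  # 115440 1697876
```

In other words, only about 115 MB is held by programs at idle; most of the "used" figure is cache.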

 

I think I have a lot of files, as cache_dirs is caching the jukebox/picture files. I might try tidying it up and getting that SSD for the jukebox stuff.

All the memory settings are on auto and have been fine for months, so it may be the memory usage. Do you know how much memory preclear will use?

 

Thanks bcbgboy13

 

The full syslog since the restart is attached. No errors this time; the drive precleared fine, but nothing else was running.

The new drive is a Samsung 1.5TB EcoGreen, HD154UI. I was running 6 of them with a WD20EARS as parity before I added another Samsung.

I added it to one of the SAS ports on the Supermicro card, and that was when it started spitting out the first lot of errors. Those stopped when I started the preclear, but the preclear brought on the video cutting out and the errors in my second post.

The new HDD was then moved to a mobo port and the preclear started again. The video was still cutting out, so I stopped the preclear and all was fine again.

 

I suspect that:

1) preclear used enough memory to cause the video dropouts

2) the HDD on the SAS port caused the syslog errors because it was unformatted

 

What do you think?

 

I'll try moving the precleared drive back to the SAS port and see if I get any issues.

 

Thanks Josh

Syslog.txt

Josh, make sure you have the latest BIOS. The 8xx chipset is the latest AMD chipset, and the early motherboard BIOSes are bound to have "bugs" and to be in need of "compatibility improvements". One of the BIOSes, for example, fixes "samsung hard drive compatibility" - BIOS 1301

 

I also see you have already had a lot of problems with this particular board (http://lime-technology.com/forum/index.php?topic=8560.0). Other people here were not happy with it either (http://lime-technology.com/forum/index.php?topic=7683.0) and later found that their board was defective, and some people on Newegg are not happy with it too.

 

1. Disable that cache_dirs to eliminate the memory problems.

 

2. It is possible that the new Samsung you added is/was defective to some degree (you should disable spin-down and run both the short and long SMART tests on it), as there was a change in attribute 195:

Before:

Feb 16 18:28:28 Tower preclear_disk-diff[15552]: 195 Hardware_ECC_Recovered  0x001a  100  100  000    Old_age  Always      -      716 (Misc)

After:

Feb 16 18:28:28 Tower preclear_disk-diff[15552]: 195 Hardware_ECC_Recovered  0x001a  100  100  000    Old_age  Always      -      450238210 (Misc)

 

I do not own any Samsungs and cannot say whether this RAW value is important or not (it may be meaningful only to the drive manufacturer). You have a lot of similar drives, so you can run the short test on them and compare the output.

 

3. You may have a bad second breakout cable (the first Samsung on the SM card is on port 1 and the other is on port 5) so I guess you are using two different cables here:

Feb 13 20:17:51 Tower emhttp: pci-0000:02:00.0-sas-phy1:1-0x0100000000000000:1-lun0 host1 (sdb) SAMSUNG_HD154UI_S1XWJDWZ602362 (Drive related)

Feb 13 20:17:51 Tower emhttp: pci-0000:02:00.0-sas-phy5:1-0x0500000000000000:5-lun0 host1 (sdc) SAMSUNG_HD154UIS1XWJDWZ602350 (Drive related)

I know you moved the "suspected" HD to a motherboard port (but I am not sure where you reconnected the drive that was originally on the motherboard).

 

4. You have three possible slots for the SM card - try moving it around to see if the problem persists (and consult your manual regarding shared resources - IRQs)

 

 

If nothing helps then I do have reasons to believe your board may be a dud. You have a "premium" board with a lot of PCIe devices.

It is normal when your computer boots for you to see them initialized/enumerated in the syslog. Now we had a lot of releases in the last two-three months and the different kernels may have different ways of doing that (and generating corresponding lines in the syslog) but you have only two devices in yours:

 

Feb 13 20:17:43 Tower kernel: pcieport 0000:00:03.0: irq 25 for MSI/MSI-X

Feb 13 20:17:43 Tower kernel: pcieport 0000:00:03.0: setting latency timer to 64

Feb 13 20:17:43 Tower kernel: pcieport 0000:00:15.0: irq 26 for MSI/MSI-X

Feb 13 20:17:43 Tower kernel: pcieport 0000:00:15.0: setting latency timer to 64

 

Shawn (who later found his board to be defective) had three:

Sep  6 22:54:29 Tower kernel: pcieport 0000:00:09.0: irq 25 for MSI/MSI-X

Sep  6 22:54:29 Tower kernel: pcieport 0000:00:09.0: setting latency timer to 64

Sep  6 22:54:29 Tower kernel: pcieport 0000:00:0a.0: irq 26 for MSI/MSI-X

Sep  6 22:54:29 Tower kernel: pcieport 0000:00:0a.0: setting latency timer to 64

Sep  6 22:54:29 Tower kernel: pcieport 0000:00:15.0: irq 27 for MSI/MSI-X

Sep  6 22:54:29 Tower kernel: pcieport 0000:00:15.0: setting latency timer to 64

 

and I found in these posts a guy (perfessor101), apparently with the same motherboard, who has four:

 

Dec  5 20:07:39 Tower kernel: pcieport 0000:00:03.0: irq 25 for MSI/MSI-X

Dec  5 20:07:39 Tower kernel: pcieport 0000:00:03.0: setting latency timer to 64

Dec  5 20:07:39 Tower kernel: pcieport 0000:00:09.0: irq 26 for MSI/MSI-X

Dec  5 20:07:39 Tower kernel: pcieport 0000:00:09.0: setting latency timer to 64

Dec  5 20:07:39 Tower kernel: pcieport 0000:00:0a.0: irq 27 for MSI/MSI-X

Dec  5 20:07:39 Tower kernel: pcieport 0000:00:0a.0: setting latency timer to 64

Dec  5 20:07:39 Tower kernel: pcieport 0000:00:15.0: irq 28 for MSI/MSI-X

Dec  5 20:07:39 Tower kernel: pcieport 0000:00:15.0: setting latency timer to 6

and apparently without problems.
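Counting ports by eye gets tedious across several syslogs. A rough sketch of how the distinct pcieport addresses could be pulled out of a syslog, assuming the line format shown above (the function name is made up for illustration):

```python
import re

def pcie_ports(syslog_text):
    """Return the distinct PCI addresses that logged a pcieport line, in order."""
    seen = []
    for addr in re.findall(r"pcieport (\S+?): ", syslog_text):
        if addr not in seen:
            seen.append(addr)
    return seen

# The four lines from Josh's syslog quoted above.
sample = """\
Feb 13 20:17:43 Tower kernel: pcieport 0000:00:03.0: irq 25 for MSI/MSI-X
Feb 13 20:17:43 Tower kernel: pcieport 0000:00:03.0: setting latency timer to 64
Feb 13 20:17:43 Tower kernel: pcieport 0000:00:15.0: irq 26 for MSI/MSI-X
Feb 13 20:17:43 Tower kernel: pcieport 0000:00:15.0: setting latency timer to 64
"""

ports = pcie_ports(sample)
print(len(ports), ports)  # 2 ['0000:00:03.0', '0000:00:15.0']
```

Running the same function over two syslogs taken on the same unRAID release would show directly which port addresses are missing from one of them.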

 

So you can contact him to compare syslogs (make sure that you both use the same unRAID release), search here more extensively for users with the same board and compare their syslogs if available, and then see how you can RMA the board if this does turn out to be a defective board.

 

Good luck

Wow, thanks bcbgboy13.

 

I've just returned the precleared HDD to the SAS card slot and the parity check ran fine last night. No errors in the syslog either, so it was possibly memory causing the video dropouts and the unformatted HDD causing the SAS card errors. (I think the other drives on the SAS card were already formatted when I added them.)

 

Josh, make sure you have the latest BIOS. The 8xx chipset is the latest AMD chipset, and the early motherboard BIOSes are bound to have "bugs" and to be in need of "compatibility improvements". One of the BIOSes, for example, fixes "samsung hard drive compatibility" - BIOS 1301

Currently running 1606, not the latest but the second latest. Will give the latest one a go.

 

2. It is possible that the new Samsung you added is/was defective to some degree (you should disable spin-down and run both the short and long SMART tests on it), as there was a change in attribute 195:

Before:

Feb 16 18:28:28 Tower preclear_disk-diff[15552]: 195 Hardware_ECC_Recovered  0x001a  100  100  000    Old_age  Always      -      716 (Misc)

After:

Feb 16 18:28:28 Tower preclear_disk-diff[15552]: 195 Hardware_ECC_Recovered  0x001a  100  100  000    Old_age  Always      -      450238210 (Misc)

 

I do not own any Samsungs and cannot say whether this RAW value is important or not (it may be meaningful only to the drive manufacturer). You have a lot of similar drives, so you can run the short test on them and compare the output.

Will start running those tests now.
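To compare attribute 195 across the drives without reading each line by eye, something like the following could help. It is only a sketch: it assumes the column layout of the two lines quoted above (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw), and `parse_attr` is a made-up helper name, not part of any tool.

```python
# The two attribute rows quoted above, minus the syslog prefix and "(Misc)" tag.
before = "195 Hardware_ECC_Recovered  0x001a  100  100  000    Old_age  Always      -      716"
after = "195 Hardware_ECC_Recovered  0x001a  100  100  000    Old_age  Always      -      450238210"

def parse_attr(row):
    """Split one SMART attribute row into the fields compared below."""
    f = row.split()
    return {
        "id": int(f[0]),
        "name": f[1],
        "value": int(f[3]),   # normalized value
        "worst": int(f[4]),
        "thresh": int(f[5]),
        "raw": int(f[9]),     # vendor-specific raw counter
    }

b, a = parse_attr(before), parse_attr(after)
print("normalized:", b["value"], "->", a["value"], "(threshold", a["thresh"], ")")
print("raw:", b["raw"], "->", a["raw"])
```

For these two rows the normalized value stays at 100 against a threshold of 0, and only the raw counter changes (716 to 450238210). Seeing whether the other HD154UI drives show raw values in the same range would help judge whether that jump means anything.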

 

3. You may have a bad second breakout cable (the first Samsung on the SM card is on port 1 and the other is on port 5) so I guess you are using two different cables here:

Feb 13 20:17:51 Tower emhttp: pci-0000:02:00.0-sas-phy1:1-0x0100000000000000:1-lun0 host1 (sdb) SAMSUNG_HD154UI_S1XWJDWZ602362 (Drive related)

Feb 13 20:17:51 Tower emhttp: pci-0000:02:00.0-sas-phy5:1-0x0500000000000000:5-lun0 host1 (sdc) SAMSUNG_HD154UIS1XWJDWZ602350 (Drive related)

I know you moved the "suspected" HD to a motherboard port (but I am not sure where you reconnected the drive that was originally on the motherboard).

I have 2 SAS cables connected to the backplanes and 1 hdd on each, also had a spare motherboard port. I spread everything out to test when I first added the SM card due to the problems listed.
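A quick way to see which cable each drive hangs off is to pull the phy number out of those emhttp lines. This sketch assumes the card's eight phys are split four per connector (phy 0-3 on one cable, 4-7 on the other); that split is an assumption, not something the log states, and `phy_and_device` is a hypothetical helper.

```python
import re

# The two emhttp entries quoted above, minus the syslog prefix.
entries = [
    "pci-0000:02:00.0-sas-phy1:1-0x0100000000000000:1-lun0 host1 (sdb) SAMSUNG_HD154UI_S1XWJDWZ602362",
    "pci-0000:02:00.0-sas-phy5:1-0x0500000000000000:5-lun0 host1 (sdc) SAMSUNG_HD154UIS1XWJDWZ602350",
]

PHYS_PER_CONNECTOR = 4  # assumption: two 4-lane connectors on an 8-port card

def phy_and_device(entry):
    """Return (phy number, device name) from one emhttp device entry."""
    m = re.search(r"sas-phy(\d+):.*\((sd[a-z]+)\)", entry)
    return int(m.group(1)), m.group(2)

for entry in entries:
    phy, dev = phy_and_device(entry)
    print(dev, "phy", phy, "connector", phy // PHYS_PER_CONNECTOR)
```

Under that assumption sdb (phy 1) and sdc (phy 5) land on different connectors, which matches having one drive on each of the two SAS cables.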

 

If nothing helps then I do have reasons to believe your board may be a dud. You have a "premium" board with a lot of PCIe devices.

I have a lot of things disabled from trying to get the SAS card to work on my board, so that is possibly the reason. I've been through all the posts where people have had problems with this board and the SAS card, and I've managed to get to the bottom of it. It's due to a lack of option ROM space: it gets taken up by the motherboard's AHCI settings, which prevents the latest firmware on the SAS card from loading. Downgrading the firmware allows it to work, as the older firmware is smaller.

 

At the moment everything is working well with no errors in the syslog. Streaming was fine last night too. I'll run some of the tests still though just to check.

 

Thanks Josh
