limetech Posted July 21, 2012 For those of you who cannot use the latest -rc6-r8168-test because of I/O errors occurring on spun-down disks, please download unRAID Server Release 5.0-rc6-r8168-test2 and see if this solves the problem. It includes a patch from one of the linux-scsi devs. I expect an "official" fix for this to be incorporated in a future Linux kernel, but whatever they end up doing, I will back-port it to our kernel.
BRiT Posted July 21, 2012 According to the Linux mailing list discussion of this, a similar change to manage_start_stop did not fix their issue. I'm not certain how that ties in with allow_restart, but I'm willing to give this a try on Sunday should I have sufficient time then. For those wanting to see the discussion on the Linux SCSI mailing list, here's one link to the related message: http://permalink.gmane.org/gmane.linux.scsi/76474
limetech (Author) Posted July 21, 2012 Here's where you can follow the dev discussion, and where I got the workaround: http://article.gmane.org/gmane.linux.scsi/76474 Looking through the code this morning, I agree with Matthias' analysis. I actually have another test release with a code change that works around this in a different manner (it backs out the aforementioned commit), but this approach is better.
BRiT Posted July 21, 2012 Agreed. It would be preferable not to have to roll patches to the Linux kernel. Also, that's a much nicer link to the discussion than I could find. Any ideas on what would be needed on a system affected by this issue if it were put to sleep and then resumed? Would allow_restart have to be set to 1 again?
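On the sleep/resume question: nothing in this thread confirms whether allow_restart survives a suspend, so the following is only a sketch of how the setting could be re-applied if it turns out to be lost. It assumes the sd driver exposes the flag at /sys/class/scsi_disk/<H:C:T:L>/allow_restart. The function name set_allow_restart and its optional directory argument are made up for illustration, and this is not necessarily the line used in the test2 release.

```shell
#!/bin/sh
# Sketch: write 1 to allow_restart for every SCSI disk found under a sysfs directory.
# Assumption: the flag lives at /sys/class/scsi_disk/<H:C:T:L>/allow_restart.
# The optional argument (hypothetical) exists only so the function can be tried
# against a scratch directory instead of the real sysfs tree.
set_allow_restart() {
    root="${1:-/sys/class/scsi_disk}"
    count=0
    for attr in "$root"/*/allow_restart; do
        # Skip the unexpanded glob (no disks found) and read-only entries.
        [ -w "$attr" ] || continue
        if echo 1 > "$attr"; then
            count=$((count + 1))
        fi
    done
    # Report how many disks were updated.
    echo "$count"
}
```

Called with no argument (as root) from a resume hook or the go script, it would touch the real sysfs tree and print the number of disks it updated.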
gfjardim Posted July 21, 2012 If applied to RC6, the drives on the BR10i card don't spin down at all. The emhttp UI freezes trying to spin them down.
PeterB Posted July 21, 2012 This certainly appears to prevent the error but, testing with a non-array drive, either the drive does not spin down, or it spins up again very quickly. I will try moving an array drive back onto the LSI controller.
EDIT: Okay, I moved one of my array drives back onto the LSI controller and started up. I clicked the 'spin down' button and all the drives appeared to be spun down, but there was a repetitive clicking coming from the server. On inspecting the syslog there was an "mdcmd spindown" for the drive on the LSI controller every eleven seconds.
Jul 22 07:47:02 Tower emhttp: Spinning down all drives...
Jul 22 07:47:02 Tower kernel: mdcmd (28): spindown 0
Jul 22 07:47:03 Tower kernel: mdcmd (29): spindown 1
Jul 22 07:47:03 Tower kernel: mdcmd (30): spindown 2
Jul 22 07:47:04 Tower kernel: mdcmd (31): spindown 3
Jul 22 07:47:05 Tower kernel: mdcmd (32): spindown 4
Jul 22 07:47:07 Tower emhttp: shcmd (61): /usr/sbin/hdparm -y /dev/sdb &> /dev/null
Jul 22 07:47:10 Tower kernel: mdcmd (33): spindown 2
Jul 22 07:47:12 Tower kernel: mdcmd (34): spindown 2
Jul 22 07:47:21 Tower kernel: mdcmd (35): spindown 2
Jul 22 07:47:31 Tower kernel: mdcmd (36): spindown 2
Jul 22 07:47:48 Tower kernel: mdcmd (37): spindown 2
Jul 22 07:48:11 Tower emhttp: Spinning down all drives...
Jul 22 07:48:11 Tower kernel: mdcmd (38): spindown 0
Jul 22 07:48:12 Tower kernel: mdcmd (39): spindown 1
Jul 22 07:48:12 Tower kernel: mdcmd (40): spindown 2
Jul 22 07:48:13 Tower kernel: mdcmd (41): spindown 3
Jul 22 07:48:14 Tower kernel: mdcmd (42): spindown 4
Jul 22 07:48:16 Tower emhttp: shcmd (62): /usr/sbin/hdparm -y /dev/sdb &> /dev/null
Jul 22 07:48:17 Tower kernel: mdcmd (43): spindown 2
Jul 22 07:48:19 Tower kernel: mdcmd (44): spindown 2
Jul 22 07:48:31 Tower kernel: mdcmd (45): spindown 2
Jul 22 07:48:33 Tower kernel: mdcmd (46): spindown 2
Jul 22 07:48:43 Tower kernel: mdcmd (47): spindown 2
Jul 22 07:48:45 Tower kernel: mdcmd (48): spindown 2
Jul 22 07:51:26 Tower emhttp: Spinning down all drives...
Jul 22 07:51:26 Tower kernel: mdcmd (49): spindown 0
Jul 22 07:51:26 Tower kernel: mdcmd (50): spindown 1
Jul 22 07:51:26 Tower kernel: mdcmd (51): spindown 2
Jul 22 07:51:27 Tower kernel: mdcmd (52): spindown 3
Jul 22 07:51:28 Tower kernel: mdcmd (53): spindown 4
Jul 22 07:51:30 Tower emhttp: shcmd (63): /usr/sbin/hdparm -y /dev/sdb &> /dev/null
Jul 22 07:51:31 Tower kernel: mdcmd (54): spindown 2
Jul 22 07:51:34 Tower kernel: mdcmd (55): spindown 2
Jul 22 07:51:39 Tower kernel: mdcmd (56): spindown 2
Jul 22 07:51:45 Tower kernel: mdcmd (57): spindown 2
Jul 22 07:51:48 Tower kernel: mdcmd (58): spindown 2
Jul 22 07:51:56 Tower kernel: mdcmd (59): spindown 2
Jul 22 07:52:07 Tower kernel: mdcmd (60): spindown 2
Jul 22 07:52:18 Tower kernel: mdcmd (61): spindown 2
Jul 22 07:52:29 Tower kernel: mdcmd (62): spindown 2
Jul 22 07:52:40 Tower kernel: mdcmd (63): spindown 2
Jul 22 07:52:51 Tower kernel: mdcmd (64): spindown 2
Jul 22 07:53:02 Tower kernel: mdcmd (65): spindown 2
Jul 22 07:53:13 Tower kernel: mdcmd (66): spindown 2
Jul 22 07:53:24 Tower kernel: mdcmd (67): spindown 2
Jul 22 07:53:35 Tower kernel: mdcmd (68): spindown 2
Jul 22 07:53:46 Tower kernel: mdcmd (69): spindown 2
Jul 22 07:53:57 Tower kernel: mdcmd (70): spindown 2
Jul 22 07:54:08 Tower kernel: mdcmd (71): spindown 2
Jul 22 07:54:12 Tower mountd[3993]: authenticated mount request from 10.2.1.15:936 for /mnt/user/Movies (/mnt/user/Movies)
Jul 22 07:54:12 Tower mountd[3993]: authenticated mount request from 10.2.1.15:994 for /mnt/disk1 (/mnt/disk1)
Jul 22 07:54:19 Tower kernel: mdcmd (72): spindown 2
Jul 22 07:54:19 Tower mountd[3993]: authenticated mount request from 10.2.1.15:994 for /mnt/disk1 (/mnt/disk1)
Jul 22 07:54:30 Tower kernel: mdcmd (73): spindown 2
Jul 22 07:54:37 Tower login[3499]: ROOT LOGIN on '/dev/tty1'
Jul 22 07:54:41 Tower kernel: mdcmd (74): spindown 2
Jul 22 07:54:44 Tower shutdown[9032]: shutting down for system reboot
Jul 22 07:55:10 Tower init: Switching to runlevel: 6
Jul 22 07:55:18 Tower rc.unRAID[9331]: Stopping unRAID.
I quickly reverted to my previous configuration. It is clear that, with the line added in the go script, the original errors go away. Whether this is because the spindown never happens, or because the fault is really circumvented, I'm not sure. However, I was not happy with the repeated clicking and the spindown commands every 11 seconds. Also, during my testing, I had a lockup of all user interfaces: emhttp and unRAID both became unresponsive, the system console wouldn't produce a prompt, and I couldn't get a telnet connection. Whether anything is learned from my testing, I'm not sure, but this is definitely NOT a solution!
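To put numbers on the repeated spindown commands in a log like PeterB's, the gaps between consecutive entries for one disk can be computed. A rough sketch, assuming the syslog has one entry per line in the format shown above; spindown_gaps is a hypothetical helper name, and the arithmetic ignores date changes, so it is only valid within a single day.

```shell
#!/bin/sh
# Sketch: print the number of seconds between consecutive "spindown N" entries
# for one disk. Reads syslog lines on stdin; $1 is the disk number.
# Assumption: lines look like
#   Jul 22 07:52:07 Tower kernel: mdcmd (60): spindown 2
# so field 3 is the HH:MM:SS timestamp and the last field is the disk number.
spindown_gaps() {
    awk -v disk="$1" '
        /mdcmd \([0-9]+\): spindown/ && $NF == disk {
            split($3, t, ":")
            now = t[1] * 3600 + t[2] * 60 + t[3]   # seconds since midnight
            if (seen) print now - prev             # gap to the previous entry
            prev = now
            seen = 1
        }'
}
```

For example, `spindown_gaps 2 < /var/log/syslog` prints one gap per line; a long run of identical values is the steady retry pattern described in the post.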
PeterB Posted July 22, 2012 "The conclusion reached there is not the last word, however IMO, because only the mpt2sas driver exhibits this behavior." Indeed. It seems logical to me that the commit which caused this problem may not be wrong, but has exposed a bug (or shortcoming) in the mpt2sas driver.
pantner Posted July 23, 2012 The 3.5 kernel has been released: http://kernelnewbies.org/Linux_3.5#head-19c15f3d7710d6ae3c549480b002e1288af5c8e4 and here http://kernelnewbies.org/Linux_3.5_DriverArch#head-0d4ff4acd978ab7dfc2a8fac025e449a0335b99e it lists as a change: "mpt2sas: Added multisegment mode support for Linux BSG Driver (commit)". No idea if you already know or if that means anything (doesn't mean much to me!)
PeterB Posted July 23, 2012 "mpt2sas: Added multisegment mode support for Linux BSG Driver (commit)" That doesn't suggest to me that it addresses our problem.
WingmanNZ Posted July 23, 2012 More follow-up on this issue: http://article.gmane.org/gmane.linux.scsi/76481
From: Matthias Prager <linux <at> matthiasprager.de>
Subject: Re: 'Device not ready' issue on mpt2sas since 3.1.10
Newsgroups: gmane.linux.scsi
Date: 2012-07-22 23:14:00 GMT
Hello Tejun,
On 22.07.2012 19:31, Tejun Heo wrote:
> I haven't consulted SAT but it seems like a bug in SAS driver or
> firmware. If it's a driver bug, we better fix it there. If a
> firmware bug, working around those is one of major roles of drivers,
> so I think setting allow_restart is fine.
as it turns out my workaround (setting allow_restart=1) isn't all that useful after all. There are no more i/o errors because the drive just never goes to standby mode anymore (at least 'hdparm -y /dev/sda' does not seem to have any effect anymore). I don't really understand why - do sas drives ever get to standby mode? (they have allow_restart=1 set by default) And is this desired or expected behavior for sata disk on sas controllers?
For the moment the only way for me to have my sata drives sleeping without i/o errors is to revert your original commit (85ef06d1d252f6a2e73b678591ab71caad4667bb - tested with kernels 3.1.10, 3.4.4, 3.4.5, 3.4.6 and 3.5.0)
--
Matthias
P.S. I hope I'm not getting on everybody's nerves here (especially yours Tejun)
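Matthias's observation that 'hdparm -y' no longer has any effect can be checked directly, since 'hdparm -C' reports a drive's current power state. A small sketch, assuming hdparm prints its usual "drive state is:" line; parse_drive_state and check_standby are hypothetical helper names, and the 5-second default wait is an arbitrary choice.

```shell
#!/bin/sh
# Extract the power state ("standby", "active/idle", ...) from `hdparm -C`
# output supplied on stdin.
parse_drive_state() {
    awk -F': *' '/drive state is/ { print $2 }'
}

# Ask a drive to enter standby, wait, then report the state it is actually in.
# Needs root and a real device, e.g.: check_standby /dev/sdb
check_standby() {
    dev="$1"
    hdparm -y "$dev" > /dev/null || return 1   # request standby
    sleep "${2:-5}"                            # settle time, an assumption
    hdparm -C "$dev" | parse_drive_state
}
```

If check_standby keeps printing "active/idle" right after the standby request, the drive is either ignoring the command or being woken again immediately, which matches what is described in the mail.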
madshi Posted July 24, 2012 Maybe this is a stupid question, but is that original commit which is causing all this trouble so very important for unRAID? Wouldn't it be an option for unRAID to simply revert this commit? That may not be the ideal solution, but isn't it better than what we have right now?
limetech (Author) Posted July 24, 2012 Yes, this is what I'm going to do. I just wanted to understand the code before doing it.
BRiT Posted July 24, 2012 I'm curious whether the commit in question can be partially implemented. Specifically, what I was thinking might work is, instead of masking out the one event that was added, to simply mask out no events. That is, on this commit diff [ http://git.opencores.org/?a=commitdiff&p=linux&h=85ef06d1d252f6a2e73b678591ab71caad4667bb ] in the "fs/block_dev.c" code, instead of using:
    /*
     * Trigger event checking and tell drivers to flush MEDIA_CHANGE
     * event. This is to ensure detection of media removal commanded
     * from userland - e.g. eject(1).
     */
    disk_flush_events(bdev->bd_disk, DISK_EVENT_MEDIA_CHANGE);
    mutex_unlock(&bdev->bd_mutex);
Use:
    /* WTF: Don't flush any events.
     * Trigger event checking and tell drivers to flush MEDIA_CHANGE
     * event. This is to ensure detection of media removal commanded
     * from userland - e.g. eject(1).
     */
    disk_flush_events(bdev->bd_disk, 0);
    mutex_unlock(&bdev->bd_mutex);
I think that if you revert the full commit, there is another commit downstream that would need to be patched, since it uses the function that this original commit renamed.
BRiT Posted July 26, 2012 All kinds of goodness with experimental fixes from the Linux SCSI folks here: http://thread.gmane.org/gmane.linux.scsi/75915/focus=76560
PeterB Posted July 26, 2012 ... and I see that Tom has now joined the discussion!
PeterB Posted July 26, 2012 LSI has just released Phase 14 firmware for the 2008 controller cards. There seem to be a lot of fixes to do with Sense command handling, for instance:
SCGCQ00286160 (DFCT) - SAS2 IT - Phase 14 - When SATA Drive is Moved to Standby State via Mode Page(0x1a), ASC/ASCQ is Incorrect for REQUEST SENSE
also:
SCGCQ00280727 (DFCT) - (Sata Only) Executing Start Stop Unit to move SATA Drive from Standby to Active doesn't work
and:
SCGCQ00283334 (DFCT) - (Sata Only) Request Sense Command is showing incorrect data during stopped condition
My guess is that they've got involved in this issue, and have found a number of holes in the firmware. I'll try flashing P14 and see whether there's any improvement.
Edit: No joy - the fault still persists with P14 firmware.
limetech (Author) Posted July 26, 2012 Please refer to the topmost post for new instructions.
PeterB Posted July 27, 2012 Tests on a non-array drive connected to the LSI controller seem to be positive. I can read/write/spin down/read/write/read with no problem and no errors recorded in the syslog. I will try returning some array drives to the LSI controller.
gfjardim Posted July 27, 2012 Here, I've run the same test every time: 1) started the array in maintenance mode; 2) spun down all drives, spun them up, then spun them down again; 3) ran a read-only reiserfsck on all disks; 4) spun the drives up and down another 3 times. This time, no errors were detected. The workaround apparently solved the spinup bug.
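The spin-down/spin-up part of a check like this can be approximated from the command line for a single drive. This is a sketch under assumptions, not the procedure used above: it relies on 'hdparm -y' to request standby and on a direct-I/O read with dd to force a spin-up, and spin_cycle, its arguments, and the SPIN_WAIT variable are all hypothetical. Errors would still have to be looked for in the syslog afterwards.

```shell
#!/bin/sh
# Sketch: put one drive in standby and wake it again, several times in a row.
# Assumptions: run as root; hdparm and GNU dd are available; reading one block
# with iflag=direct bypasses the page cache, so the drive should have to spin
# up to answer the read.
spin_cycle() {
    dev="$1"
    cycles="${2:-3}"
    i=0
    while [ "$i" -lt "$cycles" ]; do
        hdparm -y "$dev" > /dev/null || return 1    # request standby
        sleep "${SPIN_WAIT:-10}"                    # give the drive time to stop
        # A failed read here is the symptom being tested for.
        dd if="$dev" of=/dev/null bs=1M count=1 iflag=direct 2> /dev/null || return 1
        i=$((i + 1))
    done
    echo "$cycles cycles completed on $dev"
}
```

A call such as `spin_cycle /dev/sdb 3` stops at the first failing command and returns non-zero, so it can be chained with a look at the syslog.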
PeterB Posted July 27, 2012 Yes, I've returned four drives to the LSI controller, and all appears to be well - I can spin all the drives down, then access them. They come back up, supplying the requested data, with no errors in the syslog, just the hdparm command logged:
Jul 27 10:52:39 Tower mountd[3988]: authenticated mount request from 10.2.1.15:701 for /mnt/user/Movies (/mnt/user/Movies)
Jul 27 10:52:46 Tower emhttp: Spinning down all drives...
Jul 27 10:52:46 Tower kernel: mdcmd (25): spindown 0
Jul 27 10:52:46 Tower kernel: mdcmd (26): spindown 1
Jul 27 10:52:46 Tower kernel: mdcmd (27): spindown 2
Jul 27 10:52:47 Tower kernel: mdcmd (28): spindown 3
Jul 27 10:52:48 Tower kernel: mdcmd (29): spindown 4
Jul 27 10:52:51 Tower emhttp: shcmd (60): /usr/sbin/hdparm -y /dev/sdb &> /dev/null
Also, my parity check speed is back - I see 110MB/s+ at 1% completion. I wonder what has changed, apart from my use of the Rocket controller. I need to investigate the performance of that card!
chickensoup Posted July 27, 2012 That's a good sign. Looks like we might see an rc7 soon, or will rc7 likely be a rename of -8168-test2 in the hopes for a -final?
jaybee Posted July 27, 2012 Could this be it.... final around the corner?
peter_sm Posted July 27, 2012 @Tom When the next release is on the way, do you plan to update Netatalk and Samba?
Netatalk 3.0
Samba 3.6.6
madburg Posted July 27, 2012 This looks very promising from the LSI standpoint. Tom, does this mean 5 RC may move to a later kernel version (I believe this LSI issue was the one holding back the move forward), or are you leaning toward back-porting the patch (once it's official) to a previous kernel? I have not seen what the devs' official standpoint on this issue is, though (the official fix). Hope it's still progressing along. Checked on it this morning.
JM2005 Posted July 29, 2012 3 x IBM BR10i LSI cards running in IT mode. FW: 1.32.00.00, BIOS 6.34.00.00 / 20-DEC-10 (LSI P20). 5.0-rc6-r8168-test2: 1 day, 8 hours, 45 minutes. Been running just fine! Thanks Tom! Edited to reflect info on cards.