Author Topic: Experimental fix/test for mpt2sas controllers (LSI)  (Read 20954 times)

Offline limetech

  • Administrator
  • Hero Member
  • *****
  • Posts: 3395
    • Lime Technology
Experimental fix/test for mpt2sas controllers (LSI)
« on: July 21, 2012, 11:56:04 AM »
For those of you who cannot use the latest -rc6-r8168-test because of I/O errors occurring on spun-down disks, please download unRAID Server Release 5.0-rc6-r8168-test2 and see if this solves the problem.  It includes a patch from one of the linux-scsi devs.

I expect an "official" fix for this to be incorporated into a future Linux kernel; whatever they end up doing, I will back-port it to our kernel.
« Last Edit: July 26, 2012, 11:30:11 AM by limetech »

Offline BRiT

  • Hero Member
  • *****
  • Posts: 2701
    • WTF.com
Re: Experimental fix/test for mpt2sas controllers (LSI)
« Reply #1 on: July 21, 2012, 12:10:17 PM »
According to the Linux mailing list discussion of this, a similar change to manage_start_stop did not fix their issue. I'm not certain how that ties in with allow_restart.

But I'm willing to give this a try on Sunday should I have sufficient time then.

For those wanting to see the discussion on the Linux SCSI mailing list, here's one link to the related message: http://permalink.gmane.org/gmane.linux.scsi/76474

Offline limetech

  • Administrator
  • Hero Member
  • *****
  • Posts: 3395
    • Lime Technology
Re: Experimental fix/test for mpt2sas controllers (LSI)
« Reply #2 on: July 21, 2012, 12:25:02 PM »
Quote from: BRiT on July 21, 2012, 12:10:17 PM
From the Linux mailing list discussing this, a similar change on manage_start_stop did not fix their issue. I'm not certain how that ties in with allow_restart.

But I'm willing to give this a try on Sunday should I have sufficient time then.

For those wanting to see the discussion on the Linux SCSI mailing list, here's one link to the related message: http://permalink.gmane.org/gmane.linux.scsi/76474

Here's where you can follow the dev discussion, and where I got the workaround:
http://article.gmane.org/gmane.linux.scsi/76474

Looking through the code this morning, I agree with Matthias' analysis.  I actually have another test release with a code change that works around this in a different manner (it backs out the aforementioned commit), but this approach is better.

Offline BRiT

  • Hero Member
  • *****
  • Posts: 2701
    • WTF.com
Re: Experimental fix/test for mpt2sas controllers (LSI)
« Reply #3 on: July 21, 2012, 12:27:32 PM »
Agreed. It would be preferable to not have to roll patches to the Linux kernel. Also, that's a much nicer link to the discussion than I could find.

Any ideas on what would be needed on a system affected by this issue if it were put to sleep and then resumed? Would allow_restart have to be set to 1 again?
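
If the setting did turn out to be lost across a suspend/resume cycle, reapplying it could be scripted. Below is a minimal sketch in Python. It assumes the attribute is exposed as /sys/class/scsi_disk/<H:C:T:L>/allow_restart, which is the usual sysfs location but worth verifying on the running kernel. The function name and the idea of calling it from a resume hook are hypothetical, and the sketch does not answer whether the reset is actually needed.

```python
import glob
import os


def set_allow_restart(root="/sys/class/scsi_disk", value=1):
    """Write `value` to every <root>/<device>/allow_restart file.

    `root` is an assumption: the usual sysfs location for SCSI disks.
    Returns the sorted list of attribute files that were written.
    """
    written = []
    for attr in sorted(glob.glob(os.path.join(root, "*", "allow_restart"))):
        with open(attr, "w") as f:
            f.write(f"{value}\n")
        written.append(attr)
    return written


# On a real system this needs root privileges, e.g.:
#     for path in set_allow_restart():
#         print("allow_restart=1 ->", path)
```

Passing a different `root` makes it possible to try the function against a scratch directory before pointing it at sysfs.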

Offline gfjardim

  • Sr. Member
  • ****
  • Posts: 357
Re: Experimental fix/test for mpt2sas controllers (LSI)
« Reply #4 on: July 21, 2012, 02:56:38 PM »
If applied to RC6, the drives on the BR10i card don't spin down at all.

The emhttp UI freezes trying to spin them down.

Offline PeterB

  • Hero Member
  • *****
  • Posts: 1622
Re: Experimental fix/test for mpt2sas controllers (LSI)
« Reply #5 on: July 21, 2012, 04:20:56 PM »
This certainly appears to prevent the error but, testing with a non-array drive, either the drive does not spin down, or it spins up again very quickly.

I will try moving an array drive back onto the LSI controller.


EDIT:

Okay, I moved one of my array drives back onto the LSI controller and started up.  I clicked the 'spin down' button and all the drives appeared to be spun down, but there was a repetitive clicking coming from the server.  On inspecting the syslog there was an "mdcmd spindown" for the drive on the LSI controller every eleven seconds.

Code: [Select]
Jul 22 07:47:02 Tower emhttp: Spinning down all drives...
Jul 22 07:47:02 Tower kernel: mdcmd (28): spindown 0
Jul 22 07:47:03 Tower kernel: mdcmd (29): spindown 1
Jul 22 07:47:03 Tower kernel: mdcmd (30): spindown 2
Jul 22 07:47:04 Tower kernel: mdcmd (31): spindown 3
Jul 22 07:47:05 Tower kernel: mdcmd (32): spindown 4
Jul 22 07:47:07 Tower emhttp: shcmd (61): /usr/sbin/hdparm -y /dev/sdb &> /dev/null
Jul 22 07:47:10 Tower kernel: mdcmd (33): spindown 2
Jul 22 07:47:12 Tower kernel: mdcmd (34): spindown 2
Jul 22 07:47:21 Tower kernel: mdcmd (35): spindown 2
Jul 22 07:47:31 Tower kernel: mdcmd (36): spindown 2
Jul 22 07:47:48 Tower kernel: mdcmd (37): spindown 2
Jul 22 07:48:11 Tower emhttp: Spinning down all drives...
Jul 22 07:48:11 Tower kernel: mdcmd (38): spindown 0
Jul 22 07:48:12 Tower kernel: mdcmd (39): spindown 1
Jul 22 07:48:12 Tower kernel: mdcmd (40): spindown 2
Jul 22 07:48:13 Tower kernel: mdcmd (41): spindown 3
Jul 22 07:48:14 Tower kernel: mdcmd (42): spindown 4
Jul 22 07:48:16 Tower emhttp: shcmd (62): /usr/sbin/hdparm -y /dev/sdb &> /dev/null
Jul 22 07:48:17 Tower kernel: mdcmd (43): spindown 2
Jul 22 07:48:19 Tower kernel: mdcmd (44): spindown 2
Jul 22 07:48:31 Tower kernel: mdcmd (45): spindown 2
Jul 22 07:48:33 Tower kernel: mdcmd (46): spindown 2
Jul 22 07:48:43 Tower kernel: mdcmd (47): spindown 2
Jul 22 07:48:45 Tower kernel: mdcmd (48): spindown 2
Jul 22 07:51:26 Tower emhttp: Spinning down all drives...
Jul 22 07:51:26 Tower kernel: mdcmd (49): spindown 0
Jul 22 07:51:26 Tower kernel: mdcmd (50): spindown 1
Jul 22 07:51:26 Tower kernel: mdcmd (51): spindown 2
Jul 22 07:51:27 Tower kernel: mdcmd (52): spindown 3
Jul 22 07:51:28 Tower kernel: mdcmd (53): spindown 4
Jul 22 07:51:30 Tower emhttp: shcmd (63): /usr/sbin/hdparm -y /dev/sdb &> /dev/null
Jul 22 07:51:31 Tower kernel: mdcmd (54): spindown 2
Jul 22 07:51:34 Tower kernel: mdcmd (55): spindown 2
Jul 22 07:51:39 Tower kernel: mdcmd (56): spindown 2
Jul 22 07:51:45 Tower kernel: mdcmd (57): spindown 2
Jul 22 07:51:48 Tower kernel: mdcmd (58): spindown 2
Jul 22 07:51:56 Tower kernel: mdcmd (59): spindown 2
Jul 22 07:52:07 Tower kernel: mdcmd (60): spindown 2
Jul 22 07:52:18 Tower kernel: mdcmd (61): spindown 2
Jul 22 07:52:29 Tower kernel: mdcmd (62): spindown 2
Jul 22 07:52:40 Tower kernel: mdcmd (63): spindown 2
Jul 22 07:52:51 Tower kernel: mdcmd (64): spindown 2
Jul 22 07:53:02 Tower kernel: mdcmd (65): spindown 2
Jul 22 07:53:13 Tower kernel: mdcmd (66): spindown 2
Jul 22 07:53:24 Tower kernel: mdcmd (67): spindown 2
Jul 22 07:53:35 Tower kernel: mdcmd (68): spindown 2
Jul 22 07:53:46 Tower kernel: mdcmd (69): spindown 2
Jul 22 07:53:57 Tower kernel: mdcmd (70): spindown 2
Jul 22 07:54:08 Tower kernel: mdcmd (71): spindown 2
Jul 22 07:54:12 Tower mountd[3993]: authenticated mount request from 10.2.1.15:936 for /mnt/user/Movies (/mnt/user/Movies)
Jul 22 07:54:12 Tower mountd[3993]: authenticated mount request from 10.2.1.15:994 for /mnt/disk1 (/mnt/disk1)
Jul 22 07:54:19 Tower kernel: mdcmd (72): spindown 2
Jul 22 07:54:19 Tower mountd[3993]: authenticated mount request from 10.2.1.15:994 for /mnt/disk1 (/mnt/disk1)
Jul 22 07:54:30 Tower kernel: mdcmd (73): spindown 2
Jul 22 07:54:37 Tower login[3499]: ROOT LOGIN  on '/dev/tty1'
Jul 22 07:54:41 Tower kernel: mdcmd (74): spindown 2
Jul 22 07:54:44 Tower shutdown[9032]: shutting down for system reboot
Jul 22 07:55:10 Tower init: Switching to runlevel: 6
Jul 22 07:55:18 Tower rc.unRAID[9331]: Stopping unRAID.

I quickly reverted to my previous configuration.

It is clear that, with the line added in the go script, the original errors go away - whether this is because the spindown never happens, or the fault is really circumvented, I'm not sure.  However, I was not happy with the repeated clicking and 11 second spindown commands.
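
The interval between the repeated spindown commands can be checked mechanically rather than by eye. A minimal Python sketch follows, assuming syslog lines in exactly the format shown above. The function name is hypothetical, and the year has to be supplied because syslog timestamps omit it.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Jul 22 07:52:07 Tower kernel: mdcmd (60): spindown 2
SPINDOWN = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: mdcmd \(\d+\): spindown (\d+)$"
)


def spindown_intervals(lines, disk, year=2012):
    """Return the seconds between consecutive 'spindown <disk>' entries."""
    times = []
    for line in lines:
        m = SPINDOWN.match(line.strip())
        if m and int(m.group(2)) == disk:
            # syslog omits the year, so the caller provides it.
            stamp = f"{year} {m.group(1)}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


sample = [
    "Jul 22 07:51:56 Tower kernel: mdcmd (59): spindown 2",
    "Jul 22 07:52:07 Tower kernel: mdcmd (60): spindown 2",
    "Jul 22 07:52:18 Tower kernel: mdcmd (61): spindown 2",
    "Jul 22 07:52:29 Tower kernel: mdcmd (62): spindown 2",
]
print(spindown_intervals(sample, disk=2))  # [11, 11, 11]
```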

Also, during my testing, I had a lock up of all user interfaces - emhttp and unRAID both became unresponsive, the system console wouldn't produce a prompt and I couldn't get a telnet connection.

Whether anything is learned from my testing, I'm not sure, but this is definitely NOT a solution!
« Last Edit: July 21, 2012, 09:05:10 PM by PeterB »
unRAID 5.0-rc12a on X9SCM-iiF/Xeon E3-1230v2, 8GB (2*4GB Kingston DDR3 1600), Thermaltake V5, Seasonic X-650, Kingston MobileLite G2 with 2GB SD card, Supermicro AOC-USAS2-L8i, HighPoint RocketRaid 620, 3*iStarUSA BPN-350V2-SS 5in3 cages (fans removed),  2*2TB WD EARS, 2TB WD EARX, 2TB Samsung HD204, 1TB Hitachi 5K1000, 1TB Samsung HD103, 500GB Samsung HD502 cache.  Powered through an APC BK650-AS.

Offline PeterB

  • Hero Member
  • *****
  • Posts: 1622
Re: Experimental fix/test for mpt2sas controllers (LSI)
« Reply #6 on: July 21, 2012, 11:08:17 PM »
Quote
The conclusion reached there is not the last word, however IMO, because only the mpt2sas driver exhibits this behavior.

Indeed.  It seems logical, to me, that the commit which caused this problem may not be wrong, but has exposed a bug (or shortcoming) in the mpt2sas driver.

Offline pantner

  • Full Member
  • ***
  • Posts: 185
Re: Experimental fix/test for mpt2sas controllers (LSI)
« Reply #7 on: July 22, 2012, 10:32:13 PM »
The 3.5 kernel has been released
http://kernelnewbies.org/Linux_3.5#head-19c15f3d7710d6ae3c549480b002e1288af5c8e4

and here
http://kernelnewbies.org/Linux_3.5_DriverArch#head-0d4ff4acd978ab7dfc2a8fac025e449a0335b99e

it states as a change
Quote
mpt2sas: Added multisegment mode support for Linux BSG Driver (commit)

No idea if you already know or if that means anything (doesn't mean much to me!)

Offline PeterB

  • Hero Member
  • *****
  • Posts: 1622
Re: Experimental fix/test for mpt2sas controllers (LSI)
« Reply #8 on: July 23, 2012, 01:11:50 AM »
Quote from: pantner on July 22, 2012, 10:32:13 PM
mpt2sas: Added multisegment mode support for Linux BSG Driver (commit)

Doesn't suggest to me that it addresses our problem.

Offline WingmanNZ

  • Member
  • **
  • Posts: 64
Re: Experimental fix/test for mpt2sas controllers (LSI)
« Reply #9 on: July 23, 2012, 12:20:54 PM »
More follow-up on this issue.

http://article.gmane.org/gmane.linux.scsi/76481

Quote
From: Matthias Prager <linux <at> matthiasprager.de>
Subject: Re: 'Device not ready' issue on mpt2sas since 3.1.10
Newsgroups: gmane.linux.scsi
Date: 2012-07-22 23:14:00 GMT (20 hours and 4 minutes ago)
Hello Tejun,

On 22.07.2012 19:31, Tejun Heo wrote:
>
> I haven't consulted SAT but it seems like a bug in SAS driver or
> firmware.  If it's a driver bug, we better fix it there.  If a
> firmware bug, working around those is one of major roles of drivers,
> so I think setting allow_restart is fine.

as it turns out my workaround (setting allow_restart=1) isn't all that
useful after all. There are no more i/o errors because the drive just
never goes to standby mode anymore (at least 'hdparm -y /dev/sda' does
not seem to have any effect anymore). I don't really understand why - do
sas drives ever get to standby mode? (they have allow_restart=1 set by
default) And is this desired or expected behavior for sata disk on sas
controllers?

For the moment the only way for me to have my sata drives sleeping
without i/o errors is to revert your original commit
(85ef06d1d252f6a2e73b678591ab71caad4667bb - tested with kernels 3.1.10,
3.4.4, 3.4.5, 3.4.6 and 3.5.0)

--
Matthias

P.S. I hope I'm not getting on everybody's nerves here (especially yours
Tejun)

Offline madshi

  • Member
  • **
  • Posts: 97
Re: Experimental fix/test for mpt2sas controllers (LSI)
« Reply #10 on: July 23, 2012, 11:34:38 PM »
Maybe this is a stupid question, but is that original commit which is causing all this trouble so very important for unRAID? Wouldn't it be an option for unRAID to simply revert this commit? That may not be the ideal solution, but isn't it better than what we have right now?

Offline limetech

  • Administrator
  • Hero Member
  • *****
  • Posts: 3395
    • Lime Technology
Re: Experimental fix/test for mpt2sas controllers (LSI)
« Reply #11 on: July 24, 2012, 08:30:50 AM »
Quote from: madshi on July 23, 2012, 11:34:38 PM
Maybe this is a stupid question, but is that original commit which is causing all this trouble so very important for unRAID? Wouldn't it be an option for unRAID to simply revert this commit? That may not be the ideal solution, but isn't it better than what we have right now?

Yes, this is what I'm going to do.  I just wanted to understand the code before doing it  ;)

Offline BRiT

  • Hero Member
  • *****
  • Posts: 2701
    • WTF.com
Re: Experimental fix/test for mpt2sas controllers (LSI)
« Reply #12 on: July 24, 2012, 02:15:13 PM »
I'm curious whether the commit in question can be partially applied. Specifically, what I was thinking might work is, instead of flushing the one event the commit added, to flush no events at all.

That is, in the "fs/block_dev.c" code of this commit diff [ http://git.opencores.org/?a=commitdiff&p=linux&h=85ef06d1d252f6a2e73b678591ab71caad4667bb ], instead of using:

Code: [Select]
/*
* Trigger event checking and tell drivers to flush MEDIA_CHANGE
* event.  This is to ensure detection of media removal commanded
* from userland - e.g. eject(1).
*/
disk_flush_events(bdev->bd_disk, DISK_EVENT_MEDIA_CHANGE);

mutex_unlock(&bdev->bd_mutex);

Use:
Code: [Select]
/*
 * WTF: Don't flush any events (mask changed from
 * DISK_EVENT_MEDIA_CHANGE to 0).
 * Original comment: Trigger event checking and tell drivers to flush
 * MEDIA_CHANGE event.  This is to ensure detection of media removal
 * commanded from userland - e.g. eject(1).
 */
disk_flush_events(bdev->bd_disk, 0);

mutex_unlock(&bdev->bd_mutex);

I think that if you revert the full commit, another commit downstream would also need to be patched, because it uses the function that this original commit renamed.
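
To illustrate what changing the mask argument does, here is a toy Python model (not kernel code). It assumes that disk_flush_events() simply ORs its mask into a set of events pending clearing, and that the flag values mirror the kernel's DISK_EVENT_* definitions. Both are assumptions to check against the kernel source, and the class name is hypothetical. Whether passing 0 actually avoids the mpt2sas errors is untested.

```python
# Flag values assumed to mirror the kernel's DISK_EVENT_* enum.
DISK_EVENT_MEDIA_CHANGE = 1 << 0
DISK_EVENT_EJECT_REQUEST = 1 << 1


class DiskEventsModel:
    """Toy stand-in for the kernel's per-disk event state (hypothetical)."""

    def __init__(self):
        self.clearing = 0  # events the driver will be asked to flush

    def flush_events(self, mask):
        # Assumed behaviour: add the requested events to the clearing set.
        self.clearing |= mask
        return self.clearing


original = DiskEventsModel()
original.flush_events(DISK_EVENT_MEDIA_CHANGE)

proposed = DiskEventsModel()
proposed.flush_events(0)

print(bin(original.clearing), bin(proposed.clearing))  # 0b1 0b0
```

In this model the proposed call leaves the clearing set empty, which is the intent of the change; the real function may also trigger an immediate event check regardless of the mask.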

Offline BRiT

  • Hero Member
  • *****
  • Posts: 2701
    • WTF.com
Re: Experimental fix/test for mpt2sas controllers (LSI)
« Reply #13 on: July 25, 2012, 06:45:20 PM »
All kinds of goodness with experimental fixes from the Linux SCSI folks here: http://thread.gmane.org/gmane.linux.scsi/75915/focus=76560

Offline PeterB

  • Hero Member
  • *****
  • Posts: 1622
Re: Experimental fix/test for mpt2sas controllers (LSI)
« Reply #14 on: July 25, 2012, 08:04:23 PM »
... and I see that Tom has now joined the discussion!