dangil Posted May 5, 2012

Since beta6a, and until at least beta14 (testing rc2 now), I have had an intermittent issue while stopping the array: the syslog reports "BUG: unable to handle kernel NULL pointer dereference at (null)" when unmounting sleeping disks.

I have a Supermicro X8SIL-V motherboard with 6 onboard SATA ports. If I only use these ports, I don't see the issue. I had an LSI 3081E-R controller that I thought was the cause, but I replaced it with a JMicron JMB362-based 2-port PCIe x1 card and installed the 7th drive of my array on it.

One hour after power-on, the 7th disk (array disk6), which is connected to this JMB362 controller, went to sleep. Ten minutes later I tried to stop the array, but the interface returned without the array stopped, and after that I needed to reboot the server to regain control.

So I think the issue appears when there is more than one SATA controller.

Attached is the syslog with the kernel BUG info.

syslog.zip
dangil Posted May 6, 2012

Running 5.0-rc2, and it happened again. I let the array sleep for some time, then spun up the disks first and checked that everything was accessible. Then I hit STOP to stop the array, and the webGui returned with the array still running. If I hit Stop again, the webGui freezes, and from then on everything that tries to access the disks freezes too. Any suggestions?

Again, this only happens when I add a secondary SATA controller: first the LSI 3081E-R, and now, with that removed, a JMB362-based controller. Without a secondary SATA controller this error doesn't occur.

I tried this:

root@Tower:/mnt# echo stop > /proc/mdcmd
-bash: echo: write error: Device or resource busy

Samba is running. lsof didn't return anything useful. /mnt is empty. df outputs:

root@Tower:/mnt# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdf1       1.9G  128M  1.8G   7% /boot
df: `/mnt/disk6': No such file or directory

I will leave the server in this state so I can test any suggestions.

syslog.zip
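For the record, here is a sketch of the checks worth running whenever the stop fails like this. The mount point is an assumption (the disk that failed to unmount on my box); the script only prints the commands instead of running them, so it is safe to paste anywhere.

```shell
#!/bin/sh
# Diagnostic checklist sketch for a stuck unmount. MNT is an assumption
# (the disk that failed to unmount); adjust for your array. The loop only
# prints each command, so nothing on the system is touched.
MNT=/mnt/disk6
count=0
for cmd in \
    "grep '$MNT' /proc/mounts" \
    "fuser -vm '$MNT'" \
    "lsof +D '$MNT'" \
    "dmesg | tail -n 40"
do
    echo "run: $cmd"
    count=$((count + 1))
done
```

fuser -vm shows which processes hold the mount busy, which is what "Device or resource busy" usually means; dmesg catches the kernel BUG trace before a reboot loses it.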
limetech Posted May 6, 2012

Are you using a Xeon processor?
dangil Posted May 6, 2012

No. I am using a Core i3-540 with 4GB of DDR3 unbuffered ECC memory.
dangil Posted May 8, 2012

Does nobody have any other suggestions?
limetech Posted May 8, 2012

Is this problem intermittent, or can you make it happen all the time?
dangil Posted May 8, 2012

Intermittent, unfortunately, but it only happens when I have a secondary SATA controller. I tried to reproduce it several times by sleeping all disks and then stopping the array, but I couldn't. I also tried sleeping only the disk attached to the secondary SATA controller, but after several attempts I couldn't reproduce it that way either.

I suspect it has something to do with how long the disks stay asleep, in conjunction with the presence of a secondary controller. If I remove the secondary controller and leave all disks on the onboard controller, the error doesn't happen (after a few months of using the server with long sleep cycles). After I attached a new disk to this JMB362 controller, the error happened the same day, once the disks went to sleep after 1 hour as configured. When I was using the LSI-based controller, the same error also appeared occasionally.

If I log on to the server, /mnt is empty, which suggests the disks were unmounted. SMB services are restarted, but the stop command to mdcmd fails with the "device is busy" message. If I try to stop the array again or try to shut down the server, a sync() call is issued, which locks up everything. lsof doesn't list any of the /dev/sdX devices or /mnt disks. If I try to unload the md_cmd module, it says it's in use and can't be unloaded.

Why did you mention a Xeon processor? I think the i3-540 running on the Intel 3420 chipset behaves like a Xeon, being able to use ECC memory. Could this indicate something?
limetech Posted May 8, 2012

I was working with someone via email a few months ago who was using your motherboard with a Xeon processor and also exhibited this problem fairly reliably. Eventually I think one of the kernel updates made it "go away". I think this is a race condition in the Linux kernel related to ReiserFS unmounting before the Linux buffer cache is fully flushed (exacerbated by a very fast processor). We were trying to set up a definitive test case that I could post to the kernel development mailing list, but we just could not make it reliably repeatable. I now have a similar m/b in house, and I'll order a fast processor and try to go down this rabbit hole again.
dangil Posted May 8, 2012

Great news! Thanks for the info. I will wait for a possible fix then.
dangil Posted May 11, 2012

An update: I did some testing, and this bug isn't related to disk sleeping. I booted the array and did a few stop/start cycles in a row, and doing this I could reproduce the bug consistently. It really does look like a race condition or timing issue, but the bug always manifests on the same disk: the one attached to the offboard PCIe SATA controller. Since I already switched SATA controllers, I will try to test with different disks, because I suspect I am always testing with the same 1.5TB Seagate drive. If this bug shows up with other HDs too, I guess I will have to limit my array to the 6 onboard SATA ports for now...

Does anybody else have the same hardware configuration as I do and is running without issues? (Supermicro X8SIL-V + Core i3-540 + 4GB DDR3 ECC + secondary PCIe SATA controller)

Searching the forums for the call trace output (queue_delayed_work, do_journal_end, journal_end_sync), I found a few other members with the same symptoms, like users madburg, nezil, and gfjardim. It appears madburg ran reiserfsck on the disks and the problem was fixed? Is that correct? How can I do this without invalidating the parity on my array?

syslog.zip
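The stop/start soak test could be sketched as a loop like the one below. It is hypothetical: only the "stop" command to /proc/mdcmd appears earlier in this thread, so "start" is an assumption, and the real Stop button also unmounts the disks, which this loop does not do. DRY_RUN=1 (the default) only prints the sequence.

```shell
#!/bin/sh
# Soak-test sketch: drive repeated stop/start cycles against /proc/mdcmd.
# DRY_RUN=1 (default) only prints; set DRY_RUN=0 on a real unRAID box at
# your own risk. NOTE: "start" as an mdcmd command is an assumption; only
# "stop" is confirmed earlier in this thread.
DRY_RUN=${DRY_RUN:-1}
CYCLES=${CYCLES:-5}
done_cycles=0
i=1
while [ "$i" -le "$CYCLES" ]; do
    for op in stop start; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "cycle $i: would echo $op > /proc/mdcmd"
        else
            echo "$op" > /proc/mdcmd
            sleep 15   # let mounts/unmounts settle between commands
        fi
    done
    done_cycles=$((done_cycles + 1))
    i=$((i + 1))
done
```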
dgaschk Posted May 11, 2012

See Check_Disk_Filesystems in my sig.
limetech Posted May 11, 2012

It appears madburg ran reiserfsck on the disks and the problem was fixed? Is that correct? How can I do this without invalidating the parity on my array?

It's not clear if this only happens with a corrupted file system, but the procedure is below:

With the array Stopped, check the "Maintenance mode" box and then Start the array. Maintenance mode starts the array but does not mount any of the hard drives. (The reason you want the array Started is so that any changes made by reiserfsck update parity, keeping parity consistent.) You can then check the individual file systems via a telnet session like this:

reiserfsck /dev/md1   <-- corresponds to disk1
reiserfsck /dev/md2   <-- corresponds to disk2
etc.

To check the Cache disk, you need to look on the Main page and see which Linux device identifier has been assigned to it. This will be the string inside parentheses, e.g., (sde). Then use this command:

reiserfsck /dev/sde1  <-- substitute "sde" for the identifier on your system, and add a "1"

The reiserfsck utility will ask you to type "Yes" to continue. Type it exactly like that (without the quotes). If the utility finds a problem, it will ask you to re-run with a switch specified, typically "--fix-fixable", but don't do this unless the utility says to. reiserfsck can take a long time to run, depending on how large the file system is.
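The per-disk checks above can be wrapped in a loop. This is only a sketch, assuming the array is in Maintenance mode and that disks 1..6 exist as /dev/md1../dev/md6 (adjust DISKS for your array); DRY_RUN=1 (the default) prints the commands instead of running them, since reiserfsck prompts interactively and can run for hours.

```shell
#!/bin/sh
# Sketch: check every array disk's file system in sequence. Assumes the
# array was started in Maintenance mode so reiserfsck's writes keep
# parity consistent. DRY_RUN=1 (default) only prints the commands.
DRY_RUN=${DRY_RUN:-1}
DISKS=${DISKS:-6}
checked=0
n=1
while [ "$n" -le "$DISKS" ]; do
    dev=/dev/md$n
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: reiserfsck $dev"
    else
        # reiserfsck prompts for a literal "Yes" before it runs
        reiserfsck "$dev"
    fi
    checked=$((checked + 1))
    n=$((n + 1))
done
```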
dangil Posted May 11, 2012

I ran reiserfsck on all array disks. Only disk6, the one the unmount was failing on, had 5 transactions replayed, but no errors were detected on any disk. After that, I couldn't reproduce the bug... I tried several start/stop array cycles in a row, and nothing. Let's wait a few days and see...
dangil Posted May 12, 2012

The bug reappeared. After the initial testing, I let the disks sleep and waited a few hours. After that, I hit the stop button and the same kernel BUG appeared... This time disk6 had 0 transactions replayed during the reiserfsck run, so I guess it's not related to ReiserFS corruption after all.

Is there a way to remove an empty disk from the array without rebuilding parity?
dgaschk Posted May 12, 2012

You could zero the drive, but why? If the system can't properly calculate parity, it has no protection.
dangil Posted May 12, 2012

I am testing something different. I started the array in Maintenance mode and manually mounted disk6. I will sleep all the disks and let it sit for a few hours; then I will try to unmount the disk manually and see what happens.
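For anyone wanting to repeat this manual test, a sketch of the steps. The device names are assumptions for this setup (disk6 as /dev/md6 mounted at /mnt/disk6, with /dev/sdg as the underlying drive on the add-on controller); DRY_RUN=1, the default, only prints the steps.

```shell
#!/bin/sh
# Manual mount / sleep / umount test sketch. All device names are
# assumptions; adjust for your array. DRY_RUN=1 (default) prints the
# steps instead of running them.
DRY_RUN=${DRY_RUN:-1}
MD_DEV=/dev/md6      # array device for disk6 (assumption)
RAW_DEV=/dev/sdg     # underlying drive on the add-on controller (assumption)
MNT=/mnt/disk6
steps=0
for cmd in \
    "mkdir -p $MNT" \
    "mount -t reiserfs $MD_DEV $MNT" \
    "hdparm -y $RAW_DEV" \
    "sleep 14400" \
    "umount $MNT"
do
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $cmd"
    else
        $cmd || { echo "FAILED: $cmd"; break; }
    fi
    steps=$((steps + 1))
done
```

hdparm -y puts the drive into standby immediately instead of waiting for the 1-hour spin-down timer, and the 14400-second sleep stands in for the "let it sit for a few hours" step.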
dangil Posted May 13, 2012

Well, I could mount and umount disk6 manually several times, even after several hours of disk sleep. Inconclusive for now...
dangil Posted May 15, 2012

Can I unmount a disk while the array is started in normal mode (not Maintenance mode)? I assume I must stop Samba first.
limetech Posted May 15, 2012

Yes, or you can set SMB Export to 'No'.
dangil Posted May 17, 2012

I did several cycles of mount/umount on disk6, the disk attached to the PCIe controller, and none failed.

I have a hunch: could this bug be triggered because the umount of disk5, on the onboard SATA controller, has not completely finished when the umount for disk6, attached to the PCIe controller, is started next in the sequence? What if a few seconds were added between all the umount commands? Could someone create a test case with 2 disks, one attached to an onboard SATA controller and the other to an offboard SATA controller, and a script that mounts and unmounts them in sequence?

What led me to this hunch is that in my syslog, 2 disks remain busy after umount crashes the kernel. The conclusion is that the unmount of the second-to-last disk and the unmount of the last disk conflict with each other, perhaps through a race condition between them. I don't have the capability to debug this low-level kernel stuff... but perhaps someone else does.
dangil Posted May 21, 2012

Tom, could you implement a slight delay (3 seconds, for example) between each unmount when the Stop button is pressed on the webGui?
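The suggested workaround could look something like the sketch below: unmount the array disks one by one with a pause and a sync between each, instead of firing the umounts back to back. The disk count and delay are assumptions; DRY_RUN=1 (the default) just prints the sequence so the logic can be checked safely.

```shell
#!/bin/sh
# Staggered-unmount sketch for the race-condition theory above. Disk
# count and delay are assumptions; DRY_RUN=1 (default) only prints.
DRY_RUN=${DRY_RUN:-1}
DELAY=${DELAY:-3}
unmounted=0
for n in 1 2 3 4 5 6; do
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: umount /mnt/disk$n && sync; sleep $DELAY"
    else
        umount "/mnt/disk$n" && sync
        sleep "$DELAY"   # let the previous unmount fully settle
    fi
    unmounted=$((unmounted + 1))
done
```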