Another hurricane problem - network keeps dropping

Oddwunn · August 31, 2011

Hi guys,

I am running a 20-disk unRAID version 4.5.6 system, a machine that has been running without a hiccup for the last 9 months or so. During the hurricane we had some power fluctuations (voltage drops), and we even lost power for about 3 seconds. I left Tower1 off, and when I turned it back on later everything came up and looked normal. Then I noticed that my 2 SageTV HD300 boxes (also running Linux) no longer saw Tower1 on the network. My Windows machines could write to it and read from it normally, although it did not show up in Network Neighborhood. I thought I had a problem with the Sage boxes, though it was quite a coincidence that both boxes developed the exact same problem at the same time. Then I noticed that my Windows machines would drop connectivity to the server for no apparent reason, and at the same time I would lose the web interface, and even Putty would not give me any control. The drop takes place randomly, sometimes after 10 minutes, sometimes after an hour, and sometimes after several hours.

Since I have a second server, Tower2, which has been working 100% correctly since the hurricane, I figured that the rest of my LAN hardware is fine and that I probably developed a flaky LAN port on my Supermicro X8SIL motherboard. I decided to try adding a PCI 1 Gb Realtek addon card, but I can not find anywhere in the mobo's BIOS to disable the onboard LAN, so my assumption is that unRAID is still trying to use the onboard LAN instead of the addon card. I somewhat verified this theory by plugging the CAT6 cable back into the onboard LAN port: I could once again gain control through the web GUI and access files, though ONLY through my Windows machines, not the 2 Sage boxes.

1. Do you think that I am correct in assuming that my onboard LAN is damaged?

2. Does anyone know how to disable the onboard LAN port on a Supermicro X8SIL (or is it port(S)? - there are 2 LAN connectors on this mobo)?

3. If I can not disable the onboard LAN, is there a way I can config unRAID to recognize and use the addon LAN card instead of the onboard LAN, and assign it the old IP of 192.168.1.110?

4. Am I barking up the wrong tree entirely?

Thanks in advance!

Oddwunn · August 31, 2011

Oh yeah, and even though I was not able to capture a syslog once I lost the network, I did capture a syslog on this most recent boot up...Right now Tower1 is working fine with Windows (both in file transfer and showing up in Network Neighborhood), but the 2 Sage boxes do not see Tower1 at all.

syslog.txt

Johnm · August 31, 2011

I believe that Supermicro board has a jumper on the mobo to disable the onboard NICs.

Is your server set to static IP or DHCP? Possibly your IP changed?

I didn't read your syslog.

Oddwunn · August 31, 2011

I believe that Supermicro board has a jumper on the mobo to disable the onboard NICs.

Ahhh...I didn't think of that. I will look for them...

is your server set to static IP or DHCP?

Static - 192.168.1.110

Funny thing is that the last 3 times I booted Tower1 and entered the command "ifcong eth0", no IP was reported. But this last time that I booted and entered the same command, it reported the IP of 192.168.1.110. I don't know what was different this time.

Oddwunn · September 1, 2011

Oops...in my last post I wrote "ifcong eth0"...it should have read "ifconfig eth0"

Oddwunn · September 1, 2011

Can anyone help? Does my syslog show anything abnormal?

Oddwunn · September 2, 2011

I finally figured out how to disable the 2 onboard LAN connectors (using jumpers...thanks Johnm!) and then I rebooted with the Ethernet cable connected to the addon NIC. Even though ifconfig eth0 reported an IP of 192.168.1.110, I could not connect to the server in any way (web interface, Putty, or by mapping the drives from a Windows machine). Here is the syslog:

syslog1.txt

Oddwunn · September 2, 2011

Since I could not access Tower1 by any method, I safely shut down the unit from the console. I re-enabled the onboard LAN connectors, connected my Ethernet cable to one of them, and now I am up and running again, though I don't know for how long. Here is the syslog after boot up:

syslog2.txt

mbryanr · September 2, 2011

Sep 2 10:20:39 Tower1 kernel: e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX

Suspect you wanted 1000Mbps...
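For anyone who wants to check a saved syslog for the same thing, here is a rough Python sketch that pulls the negotiated speed out of an e1000e "Link is Up" line like the one quoted above. The function name `link_speed` and the regular expression are my own, written against that one log line only, so other drivers or kernel versions may word the message differently.

```python
import re

# Pattern written against the single e1000e line quoted in this thread;
# treat it as an assumption for any other driver or kernel version.
LINK_RE = re.compile(r"(\w+) NIC Link is Up (\d+) Mbps (Full|Half) Duplex")


def link_speed(line):
    """Return (interface, speed_in_mbps, duplex) or None if the line does not match."""
    match = LINK_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), match.group(3)


line = ("Sep 2 10:20:39 Tower1 kernel: e1000e: eth0 NIC Link is Up "
        "100 Mbps Full Duplex, Flow Control: RX/TX")

iface, mbps, duplex = link_speed(line)
if mbps < 1000:
    print(f"{iface} negotiated only {mbps} Mbps ({duplex} duplex)")
```

A link that falls back from gigabit to 100 Mbps can point at a cable, a switch port, or the NIC itself, so this only tells you where to start looking.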

pengrus · September 2, 2011

Just a suggestion, try setting tower1 to use just straight DHCP. Just to see if it works. I have an x8sil-fo, and the ONLY configuration I could get it to work in was DHCP. Static was a no-go, even DHCP reservation didn't work. Couldn't divine a reason why this was happening, and since then have just been happy to let it ask for the same address every 86400 seconds. Works fine.

If nothing else, it's a data point.

Oddwunn · September 2, 2011

Here is a capture from Putty of my currently running Tower1. Can you tell what is going on from this log? (Currently the machine is working, but with very slow transfers).

Tower1 login: root

Linux 2.6.32.9-unRAID.

root@Tower1:~# tail -f /var/log/syslog

Sep 2 10:22:11 Tower1 kernel: [<c108b69c>] ? block_ioctl+0x0/0x32

Sep 2 10:22:11 Tower1 kernel: [<c10769d5>] vfs_ioctl+0x22/0x67

Sep 2 10:22:11 Tower1 kernel: [<c1076f33>] do_vfs_ioctl+0x478/0x4ac

Sep 2 10:22:11 Tower1 kernel: [<c10282cd>] ? __do_softirq+0xf0/0xf8

Sep 2 10:22:11 Tower1 kernel: [<c1076f93>] sys_ioctl+0x2c/0x45

Sep 2 10:22:11 Tower1 kernel: [<c1002935>] syscall_call+0x7/0xb

Sep 2 10:22:11 Tower1 kernel: ---[ end trace 364a40f68d879b2c ]---

Sep 2 10:22:12 Tower1 ata_id[2538]: HDIO_GET_IDENTITY failed for '/dev/block/65:96'

Sep 2 10:22:18 Tower1 in.telnetd[2595]: connect from 192.168.1.107 (192.168.1.107)

Sep 2 10:22:21 Tower1 login[2596]: ROOT LOGIN on `pts/0' from `192.168.1.107'

Sep 2 10:23:34 Tower1 ata_id[2774]: HDIO_GET_IDENTITY failed for '/dev/block/65:96'

Sep 2 10:23:34 Tower1 ata_id[2791]: HDIO_GET_IDENTITY failed for '/dev/block/65:96'

Sep 2 10:35:34 Tower1 kernel: mdcmd (148): spindown 0

Sep 2 10:35:44 Tower1 kernel: mdcmd (150): spindown 1

Sep 2 10:35:47 Tower1 kernel: mdcmd (151): spindown 2

Sep 2 10:35:49 Tower1 kernel: mdcmd (152): spindown 3

Sep 2 10:35:50 Tower1 kernel: mdcmd (153): spindown 4

Sep 2 10:35:51 Tower1 kernel: mdcmd (154): spindown 5

Sep 2 10:35:51 Tower1 kernel: mdcmd (155): spindown 6

Sep 2 10:35:54 Tower1 kernel: mdcmd (156): spindown 7

Sep 2 10:35:56 Tower1 kernel: mdcmd (157): spindown 8

Sep 2 10:35:59 Tower1 kernel: mdcmd (158): spindown 9

Sep 2 10:36:01 Tower1 kernel: mdcmd (159): spindown 10

Sep 2 10:36:04 Tower1 kernel: mdcmd (160): spindown 11

Sep 2 10:36:04 Tower1 kernel: mdcmd (161): spindown 12

Sep 2 10:36:05 Tower1 kernel: mdcmd (162): spindown 13

Sep 2 10:36:07 Tower1 kernel: mdcmd (163): spindown 14

Sep 2 10:36:10 Tower1 kernel: mdcmd (164): spindown 15

Sep 2 10:36:12 Tower1 kernel: mdcmd (165): spindown 16

Sep 2 10:36:15 Tower1 kernel: mdcmd (166): spindown 17

Sep 2 10:36:17 Tower1 kernel: mdcmd (167): spindown 18

Sep 2 10:36:20 Tower1 kernel: mdcmd (168): spindown 19

Sep 2 10:36:22 Tower1 kernel: mdcmd (169): spindown 20

Sep 2 12:21:17 Tower1 kernel: mdcmd (797): spindown 0

Sep 2 12:21:17 Tower1 kernel: mdcmd (798): spindown 9

Sep 2 12:21:20 Tower1 kernel: mdcmd (799): spindown 10

Sep 2 12:21:22 Tower1 kernel: mdcmd (800): spindown 11

Sep 2 12:21:23 Tower1 kernel: mdcmd (801): spindown 12

Sep 2 12:21:23 Tower1 kernel: mdcmd (802): spindown 13

Sep 2 12:21:26 Tower1 kernel: mdcmd (803): spindown 14

Sep 2 12:21:28 Tower1 kernel: mdcmd (804): spindown 15

Sep 2 12:21:31 Tower1 kernel: mdcmd (805): spindown 16

Sep 2 13:59:17 Tower1 emhttp: shcmd (76): /usr/local/sbin/set_ncq sdt 1 >/dev/null

Sep 2 13:59:17 Tower1 emhttp: shcmd (77): /usr/local/sbin/set_ncq sdk 1 >/dev/null

Sep 2 13:59:17 Tower1 emhttp: shcmd (78): /usr/local/sbin/set_ncq sdl 1 >/dev/null

Sep 2 13:59:17 Tower1 emhttp: shcmd (79): /usr/local/sbin/set_ncq sdu 1 >/dev/null

Sep 2 13:59:17 Tower1 emhttp: shcmd (80): /usr/local/sbin/set_ncq sdv 1 >/dev/null

Sep 2 13:59:17 Tower1 emhttp: shcmd (81): /usr/local/sbin/set_ncq sds 1 >/dev/null

Sep 2 13:59:17 Tower1 emhttp: shcmd (82): /usr/local/sbin/set_ncq sdh 1 >/dev/null

Sep 2 13:59:17 Tower1 emhttp: shcmd (83): /usr/local/sbin/set_ncq sdi 1 >/dev/null

Sep 2 13:59:17 Tower1 emhttp: shcmd (84): /usr/local/sbin/set_ncq sdj 1 >/dev/null

Sep 2 13:59:17 Tower1 emhttp: shcmd (85): /usr/local/sbin/set_ncq sdb 1 >/dev/null

Sep 2 13:59:17 Tower1 emhttp: shcmd (86): /usr/local/sbin/set_ncq sdc 1 >/dev/null

Sep 2 13:59:17 Tower1 emhttp: shcmd (87): /usr/local/sbin/set_ncq sdm 1 >/dev/null

Sep 2 13:59:17 Tower1 emhttp: shcmd (88): /usr/local/sbin/set_ncq sdn 1 >/dev/null

Sep 2 13:59:17 Tower1 emhttp: shcmd (89): /usr/local/sbin/set_ncq sdd 1 >/dev/null

Sep 2 13:59:17 Tower1 emhttp: shcmd (90): /usr/local/sbin/set_ncq sde 1 >/dev/null

Sep 2 13:59:17 Tower1 emhttp: shcmd (91): /usr/local/sbin/set_ncq sdf 1 >/dev/null

Sep 2 13:59:17 Tower1 emhttp: shcmd (92): /usr/local/sbin/set_ncq sdg 1 >/dev/null

Sep 2 13:59:17 Tower1 emhttp: shcmd (93): /usr/local/sbin/set_ncq sdo 1 >/dev/null

Sep 2 13:59:17 Tower1 emhttp: shcmd (94): /usr/local/sbin/set_ncq sdp 1 >/dev/null

Sep 2 13:59:17 Tower1 emhttp: shcmd (95): /usr/local/sbin/set_ncq sdq 1 >/dev/null

Sep 2 13:59:17 Tower1 emhttp: shcmd (96): /usr/local/sbin/set_ncq sdr 1 >/dev/null

Sep 2 13:59:17 Tower1 kernel: mdcmd (1391): set md_num_stripes 1280

Sep 2 13:59:17 Tower1 kernel: mdcmd (1392): set md_write_limit 768

Sep 2 13:59:17 Tower1 kernel: mdcmd (1393): set md_sync_window 288

Sep 2 13:59:17 Tower1 kernel: mdcmd (1394): set spinup_group 0 0

Sep 2 13:59:17 Tower1 kernel: mdcmd (1395): set spinup_group 1 508

Sep 2 13:59:17 Tower1 kernel: mdcmd (1396): set spinup_group 2 506

Sep 2 13:59:17 Tower1 kernel: mdcmd (1397): set spinup_group 3 502

Sep 2 13:59:17 Tower1 kernel: mdcmd (1398): set spinup_group 4 494

Sep 2 13:59:17 Tower1 kernel: mdcmd (1399): set spinup_group 5 478

Sep 2 13:59:17 Tower1 kernel: mdcmd (1400): set spinup_group 6 446

Sep 2 13:59:17 Tower1 kernel: mdcmd (1401): set spinup_group 7 382

Sep 2 13:59:17 Tower1 kernel: mdcmd (1402): set spinup_group 8 254

Sep 2 13:59:17 Tower1 kernel: mdcmd (1403): set spinup_group 9 130048

Sep 2 13:59:17 Tower1 kernel: mdcmd (1404): set spinup_group 10 129536

Sep 2 13:59:17 Tower1 kernel: mdcmd (1405): set spinup_group 11 128512

Sep 2 13:59:17 Tower1 kernel: mdcmd (1406): set spinup_group 12 126464

Sep 2 13:59:17 Tower1 kernel: mdcmd (1407): set spinup_group 13 122368

Sep 2 13:59:17 Tower1 kernel: mdcmd (1408): set spinup_group 14 114176

Sep 2 13:59:17 Tower1 kernel: mdcmd (1409): set spinup_group 15 97792

Sep 2 13:59:17 Tower1 kernel: mdcmd (1410): set spinup_group 16 65024

Sep 2 13:59:17 Tower1 kernel: mdcmd (1411): set spinup_group 17 1835008

Sep 2 13:59:17 Tower1 kernel: mdcmd (1412): set spinup_group 18 1703936

Sep 2 13:59:17 Tower1 kernel: mdcmd (1413): set spinup_group 19 1441792

Sep 2 13:59:17 Tower1 kernel: mdcmd (1414): set spinup_group 20 917504

Sep 2 14:03:21 Tower1 shfs: duplicate object: /mnt/disk11/Classics/Battle of the Bulge (1965)/mymovies.xml

Sep 2 14:03:21 Tower1 shfs: duplicate object: /mnt/disk11/Classics/Battle of the Bulge (1965)/folder.jpg

Sep 2 14:03:40 Tower1 shfs: duplicate object: /mnt/disk9/Classics/Planet of the Apes (2001)/mymovies.xml

Sep 2 14:03:40 Tower1 shfs: duplicate object: /mnt/disk9/Classics/Planet of the Apes (2001)/folder.jpg

Sep 2 14:03:52 Tower1 shfs: duplicate object: /mnt/disk9/Classics/War of the Worlds (2005)/mymovies.xml

Sep 2 14:03:52 Tower1 shfs: duplicate object: /mnt/disk9/Classics/War of the Worlds (2005)/folder.jpg
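The ata_id lines in the log above name the failing device only by its major:minor number (65:96). A small Python sketch of how that number could be turned back into an sdX name follows. It assumes the standard Linux numbering for SCSI/SATA disks (16 minors per disk, majors 8, 65-71 and 128-135), and the helper name `sd_name` is made up for illustration.

```python
# Block majors the Linux kernel reserves for SCSI/SATA disks.
# Each major covers 16 disks, and each disk owns 16 minor numbers
# (minor % 16 == 0 is the whole disk, 1..15 are partitions).
SD_MAJORS = [8] + list(range(65, 72)) + list(range(128, 136))


def sd_name(major, minor):
    """Translate a major:minor pair into an sdX (or sdXN) device name."""
    disk_index = SD_MAJORS.index(major) * 16 + minor // 16
    partition = minor % 16

    # Disk 0 is "a", disk 25 is "z", disk 26 is "aa", and so on.
    letters = ""
    n = disk_index
    while True:
        letters = chr(ord("a") + n % 26) + letters
        n = n // 26 - 1
        if n < 0:
            break

    return "sd" + letters + (str(partition) if partition else "")


print(sd_name(65, 96))
```

Under that assumption 65:96 works out to sdw, which is not one of the sdb to sdv devices listed in the set_ncq lines of the same log.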

Oddwunn · September 2, 2011

Suspect you wanted 1000Mbps...

You suspect correctly.

Just a suggestion, try setting tower1 to use just straight DHCP.

Ok, I'll give it a try. I have run Tower1 using DHCP in the past...it ran the same as with a static IP, but maybe something is different now.

Maybe I should replace my motherboard???

jespeed · September 2, 2011

Hello pengrus and oddwunn,

I have the same board as you do, pengrus, and I only use a static IP. I too had to install an Intel NIC due to strange onboard LAN issues. Ever since installing the Intel NIC I've had no issues.

I believe I had to edit a config file to get the static IP address but cannot remember which one. I am away from my server for the next few weeks so I can't look. Sorry.

Just to let you know that getting a static IP to work with this motherboard is doable.

Take care and good luck,

Jim S.

mbryanr · September 3, 2011

Here is a capture from Putty of my currently running Tower1. Can you tell what is going on from this log? (Currently the machine is working, but with very slow transfers).

Tower1 login: root

Linux 2.6.32.9-unRAID.

root@Tower1:~# tail -f /var/log/syslog

Sep 2 10:22:11 Tower1 kernel: [<c108b69c>] ? block_ioctl+0x0/0x32

Sep 2 10:22:11 Tower1 kernel: [<c10769d5>] vfs_ioctl+0x22/0x67

Sep 2 10:22:11 Tower1 kernel: [<c1076f33>] do_vfs_ioctl+0x478/0x4ac

Sep 2 10:22:11 Tower1 kernel: [<c10282cd>] ? __do_softirq+0xf0/0xf8

Sep 2 10:22:11 Tower1 kernel: [<c1076f93>] sys_ioctl+0x2c/0x45

Sep 2 10:22:11 Tower1 kernel: [<c1002935>] syscall_call+0x7/0xb

Sep 2 10:22:11 Tower1 kernel: ---[ end trace 364a40f68d879b2c ]---

Sep 2 10:22:12 Tower1 ata_id[2538]: HDIO_GET_IDENTITY failed for '/dev/block/65:96'

Sep 2 10:22:18 Tower1 in.telnetd[2595]: connect from 192.168.1.107 (192.168.1.107)

Sep 2 10:22:21 Tower1 login[2596]: ROOT LOGIN on `pts/0' from `192.168.1.107'

Sep 2 10:23:34 Tower1 ata_id[2774]: HDIO_GET_IDENTITY failed for '/dev/block/65:96'

Sep 2 10:23:34 Tower1 ata_id[2791]: HDIO_GET_IDENTITY failed for '/dev/block/65:96'

Sep 2 10:35:34 Tower1 kernel: mdcmd (148): spindown 0

Sep 2 10:35:44 Tower1 kernel: mdcmd (150): spindown 1

Sep 2 10:35:47 Tower1 kernel: mdcmd (151): spindown 2

Sep 2 10:35:49 Tower1 kernel: mdcmd (152): spindown 3

Sep 2 10:35:50 Tower1 kernel: mdcmd (153): spindown 4

Sep 2 10:35:51 Tower1 kernel: mdcmd (154): spindown 5

Sep 2 10:35:51 Tower1 kernel: mdcmd (155): spindown 6

Sep 2 10:35:54 Tower1 kernel: mdcmd (156): spindown 7

Sep 2 10:35:56 Tower1 kernel: mdcmd (157): spindown 8

Sep 2 10:35:59 Tower1 kernel: mdcmd (158): spindown 9

http://lime-technology.com/forum/index.php?topic=8654.msg84041#msg84041

Oddwunn · September 3, 2011

Ever since installing the Intel NIC I've had no issues.

I ordered an Intel NIC from Newegg...should be here on Tuesday. I hope this cures my problem, but I have my doubts, especially since the Realtek addon NIC that I already tried (also in the list of approved chips that work with unRAID) didn't work at all.

If this doesn't work then I will have to buy a new motherboard I guess...I am just shooting in the dark at this point.

Johnm · September 5, 2011

I assume you changed out cables, ports on the switch, the switch itself if possible, and checked that you don't have anything weird going on, like your Windows box sending jumbo packets.

if you have a spare thumb drive, and a spare hard drive. maybe boot up a fresh trial unraid and assign it one disk. see how that works?

if you have a spare drive. try installing windows. see if it is a nic or unraid issue..

if you had 2 norco builds (or compatible hotswap units), I'd say swap the hard drives and flash drives and see if the problem follows the flash or stays at hardware level.

usually these boards are solid.. yeah. lemons do exist. also power surges can come through ethernet...

it is just not often a newish serverboard goes tits up in under 2 years after the first week or so..

Oddwunn · September 5, 2011

Thanks Johnm!

I assume you changed out cables, ports on the switch, the switch itself if possible, and checked that you don't have anything weird going on, like your Windows box sending jumbo packets.

I have 2 servers, Tower1 and Tower2, in very close proximity. Tower2 is working perfectly (reading and writing from several Windows machines) and Tower1 is giving me problems. I exchanged the ethernet cables between Tower1 and Tower2, effectively changing everything downstream from the boxes themselves...The problems stay with the box, so therefore I have concluded (perhaps incorrectly) that the problem is in the box itself and not elsewhere in my LAN.

if you have a spare thumb drive, and a spare hard drive. maybe boot up a fresh trial unraid and assign it one disk. see how that works?

I will give that a try! Today, before reading this post, I upgraded unRAID from version 4.5.6 to version 4.7, and so far things seem to be better, but it is too soon to tell for sure.

if you have a spare drive. try installing windows. see if it is a nic or unraid issue..

Yes, I will do that also, but I need to set up a bootable external DVD drive in order to load Windows onto a hard drive (or at least the way I like to load Windows).

if you had 2 norco builds (or compatible hotswap units), I'd say swap the hard drives and flash drives and see if the problem follows the flash or stays at hardware level.

I have 2 Norco boxes, but one of them is a 24 drive hot swap unit, while the other one is a low end 15 drive non hot swap box. Exchanging drives is not possible without some serious investment in time and redoing of the arrays.

usually these boards are solid.. yeah. lemons do exist. also power surges can come through ethernet...
it is just not often a newish serverboard goes tits up in under 2 years after the first week or so..

This mobo is over a year old...it has never been "rock solid" in my estimation...maybe I had a flaky board from the start?

I discovered another problem today, but I don't know if it happened before, after the hurricane, or if it has been there all along. Tower1 is in a 24 disk Norco case populated with 20 disks. The parity drive is not mounted in any of the hot swap bays and is controlled by one of the motherboard's onboard SATA controllers. The 20 disks are controlled by 3 Supermicro 8 port controllers like this:

Controller #1 - disks 1-8

Controller #2 - disks 9-16

Controller #3 - disks 17-20

Since I had only written data to disks 1-15 (until today), everything had been working fine. Today I decided to copy disk15 onto disk20 using Midnight Commander. The write crapped out after about 5 minutes and disk20 took on a "read only" status. Thinking that I had a bad disk, I changed it with a known good spare disk already precleared, allowed the array to rebuild the new disk, and then tried copying disk15 to disk20 again. It crapped out again after the same amount of time and took on a "read only" status just like the first disk. So then I tried swapping the 4 channel SAS cables from one connector to the other on the third controller (so that I would be using the other half of the same controller)...same result. Next I will try moving disk20 to a different hot swap bay (like bay 24 - so that it uses a different backplane) and see what happens there.

Wow, this is getting worse all the time...

Oddwunn · September 5, 2011

Another thought...doesn't parity build itself by reading information from all 20 disks? If this is the case, then disks 17 - 20 must be readable, or parity would not build correctly. And in the same vein, since I replaced disk20 and parity had to rebuild it, doesn't that mean that the system is writing to disk20 just fine? If those 2 premises are correct, then why can't I copy disk15 to disk20???

Johnm · September 5, 2011

egads....the horror of it all.

Another thought...doesn't parity build itself by reading information from all 20 disks? If this is the case, then disks 17 - 20 must be readable, or parity would not build correctly. And in the same vein, since I replaced disk20 and parity had to rebuild it, doesn't that mean that the system is writing to disk20 just fine? If those 2 premises are correct, then why can't I copy disk15 to disk20???

If you are building parity to any disk other than 20, it is reading 20. If you are rebuilding 20, you are writing to 20. Even if it is blank, it should write whatever junk was underneath the MBR.

(unless the drive was zeroed, there would be data on the drive that the MBR is unaware of [old and deleted files])

in theory, if you only have 20 data disks plus parity, you could yank out an MV8... put the leftover drives on the mobo.

I saw you were worried about that in the other thread.

Oddwunn · September 6, 2011

My last experiment gave me the same results, so I think I can safely conclude that the Norco backplanes are not the problem (unless I have 2 identically bad backplanes).

My next plan is to try swapping 8 port SATA cards to see if the problem moves with the slot or with the card. This should narrow things down to a problem motherboard or bad SATA card.

in theory, if you only have 20 data disks plus parity, you could yank out an MV8... put the leftover drives on the mobo.

Yup, that will be the final experiment. I will check the BIOS settings that you told me about in the other thread first. In order to try this suggestion I will have to buy a 4X1 breakout cable in order to connect the SAS connector on the backplane to the 4 individual ports on the motherboard. And of course this will be for troubleshooting purposes only, as I will never be able to use the full 24 disks in the array once unRAID supports them.

Oddwunn · September 6, 2011

Ok, now I am even more confused than before. It seems like *disk20* is the only problem disk I have. Disks 17, 18, and 19, all on the same controller, work just fine. I have now tried switching controller cards, hot swap bays (and therefore backplanes), unRAID versions, and cables, and through all of this, *disk20* remains the one disk that I can not write to, either over the LAN or from disk to disk using Midnight Commander. I have now used 3 different hard drives, all of which work perfectly when used in other servers (including this one) and have proven to be good, reliable drives, but they always crap out when put into the mysterious *disk20* position.

The only thing left to do is to physically pull out the drive from the hot swap bay, temporarily connect it to one of the motherboard's onboard SATA ports, and give it one more try. After that I no longer know what to do or where to go.

Johnm · September 6, 2011

odd..

Are you trying to write directly to the disk \\tower\disk20 ?

is it a permissions issue?

did you try anything besides MC to write with?

Oddwunn · September 7, 2011

Are you trying to write directly to the disk \\tower\disk20 ?

Yes, directly to disk20.

is it a permissions issue?

No, it seems to be a write failure. The disk is changed to "read only" status by unRAID when it fails. It fails after writing about 30 to 40 GB of data...quite consistently.

did you try anything besides MC to write with?

Just MC and over the LAN from my Windows boxes...both methods work fine on the other 19 disks.

I have now tried EVERYTHING. I tried removing the drive from the hot swap bays altogether and connecting it to one of the motherboard's onboard SATA ports (eliminating the Supermicro 8 port cards, the cables, and the Norco backplanes) - it still fails! So then in a desperate attempt to get it to work, I tried yet another drive (my 4th drive)...this time I changed brands from Samsung to Hitachi (even though I have about a dozen of these same Samsung drives working perfectly in the same box)...same problem! (Oh, and yes, the Samsungs have the updated firmware)

I have now tried what I consider to be everything...I have switched drives, cables, hot swap bays, backplanes, controllers, and even changed from unRAID version 4.5.6 to 4.7. Nineteen disks work perfectly and disk20 fails when I try to write to it every single time.

Anyone have any more ideas?

Johnm · September 7, 2011

Anyone have any more ideas?

Do you remember that scene in Office Space where they go out to the field with the baseball bat?

I honestly don't have a clue.

I have seen issues in WHS v.1 where people had issues with disk 32 (the last drive you are allowed to add).

It sounds like you covered every possible way to test disk20 hardware wise.

Am I missing something obvious that I am unaware of? Like an upper limit to the number of files in a user share?

I assume you made a new share on drive 20? try CP commands?

I assume you ran a parity check prior to this?

Maybe you have a bad spot on your parity drive and it can't create parity? (I don't think that would cause this issue though?)

Try SMART reports on disk20 and parity?

Are there any errors in your syslog when disk20 pukes?

Unfortunately, I am still learning unRAID by trial and error myself. You have gone beyond my knowledge at this time.

Oddwunn · September 8, 2011

In a desperate attempt to get disk20 to work correctly, I tried one more thing. I removed disk20 from the array completely and then rebuilt parity with the remaining 19 disk array. Then I added disk20 into the array as a new disk so that the array would not rebuild disk20 like it had been doing previously, but rather it treated disk20 as a new, additional disk to the array. The array then cleared the disk, formatted it, and successfully added it to the array. Then I went back and tried writing to disk20 once again and whaddaya know, it now works! Yahoo!!

The only thing I can figure is that the original disk20 must have been bad and that the array had built parity with this bad disk (badly formatted, perhaps?). So every time I replaced disk20 and allowed parity to rebuild the disk, it rebuilt the new disk as a perfect copy of the old bad one, leaving me with a disk that failed on writes just like the original. Does this make any sense??
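Here is a toy Python sketch of why that theory is at least plausible. It assumes single parity computed as a plain XOR across all data disks, which is how unRAID parity is usually described. The disk contents and the `xor_blocks` helper are made up for illustration, and the "damaged" bytes simply stand in for whatever was wrong on the original disk20.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical miniature array: two healthy data disks plus "disk20".
good_disks = [b"\x10\x20\x30\x40", b"\x0a\x0b\x0c\x0d"]
disk20_original = b"\xde\xad\xbe\xef"  # stands in for a damaged filesystem

# Building parity reads every data disk, including disk20.
parity = xor_blocks(good_disks + [disk20_original])

# Swapping in a new drive and rebuilding it writes back exactly what parity
# and the other disks imply, which is the old contents, damage and all.
disk20_rebuilt = xor_blocks(good_disks + [parity])

print(disk20_rebuilt == disk20_original)
```

Parity works on raw bytes and knows nothing about the filesystem, so a rebuild cannot repair a damaged one. Adding the drive as a brand new disk gets it cleared and freshly formatted, which would explain why that finally worked.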

Anyway, many thanks, Johnm, for hanging in there with me and trying your best to help me solve this dilemma. This was a learning experience and something I will keep in mind in the future.

Now today that Intel NIC I thought would be here on Tuesday finally arrived. I hope that the new NIC solves my original problem, that of the network dropping randomly.
