jdelport

Members
  • Posts

    24

  • Gender
    Male
  • Location
    South Africa

  1. I have replaced the cables as well with new ones. So with different cables and different drives in different ports, the only common denominator left is the specific port. Just a check - when you were moving the drives between ports, were you also moving the cable? Thought it was worth confirming that the SATA cable has been eliminated as a possible culprit.
  2. Thanks a lot for all your time. That is exactly what I'm going to do: buy a card like you suggested. Since the problem seems to follow one of the ports, I can then replace that port and hopefully the other ports keep on working.
  3. I hear you about the disk, but now I get the read error on disk5, so whichever disk I plug into that specific SATA port on the mobo gets the read errors. Which tells me the mobo/SATA ports are shot?
  4. Still the same error on disk 1, so I don't think it's a SATA cable problem, as the drives still get read errors with all the cables that I have swapped for new ones. I have now spread the power lines differently between the 7 disks. Not that I think that would be the problem, as I have the 850W single rail PSU and johnnie.black also reckons that is plenty. I have also swapped the SATA ports on the mobo between disk1 and disk5. If the SATA ports or the controller are to blame, then my swap of the ports should now show the read error on a different disk. That's my thinking anyway. If that happens, then as a last resort I will have to get a new mobo, CPU and RAM, as my CPU is still socket LGA1155 and I probably won't get a mobo with that socket anymore.
  5. I have replaced the SATA data cable for disk 4 and that seems to have sorted out disk 4. Then I got the same issues on disk 1, and I have now also swapped disk 1's SATA cable. Busy rebuilding and will post the results when done or when errors occur. I have also replaced the PSU with an 850W single rail unit, which should be fine for 7 disks?
  6. Hi Robj, Sorry for not supplying more info. I have another open post: http://lime-technology.com/forum/index.php?topic=47389.msg453596#msg453596 And was also just looking at and asking on posts that looked related. You will find more info in my post referred to above. Thanks for the reference to the wiki and the info on finding the device symbols. I'm looking into that and will try and understand how to go about doing that. Regards Johan
  7. Hi johnnie.black, thanks for taking the time to try and help me. Yes, I think that disk 4 was probably offline when I pulled the diags. I re-did the diags now while disk4 was online and they now contain a SMART report for that disk as well: https://www.dropbox.com/s/c99e651f7yesb8o/tower-diagnostics-20160311-1412.zip?dl=0 But from what I can see in the SMART report the disk itself seems fine? By the way, how did you figure out that ata1 as per the syslog is physical disk 4? That's one of the things I couldn't figure out. Now that I know that, I can also go and swap cables and/or ports on that disk and try to figure out whether it's a cable or the controller. Does that sort of analysis and troubleshooting sound about right?
  8. Hi thegurujim, Did you ever solve your problem? I have exactly the same issue with a disk rebuild and was hoping I could troubleshoot somehow and not have to replace everything. Thanks Johan
  9. How did you figure out which disk was the culprit from the logs? I have the same issue and I'm also getting "ata1: hard resetting" and would like to figure out how I can find out which disk is the problem so I can also try and swap out cables/ports.
  10. OK thanks. I have added the new diagnostics now also: https://www.dropbox.com/s/du3arypn3rejkmg/tower-diagnostics-20160311-1342.zip?dl=0 By the way, in the meantime disk 4 has also gone "missing" while the server was just standing there doing nothing (the array is even stopped). And then after a few minutes disk 4 came back online again. So I now have a suspicion that the PSU or the motherboard/controller has some problems? It's funny that it worked for more than 3 years, and now suddenly this? But I guess that's the way electronics go...
  11. I had a power failure, and a UPS shutdown through the web interface of my unRAID server seems to have caused a problem. Suddenly I had 1 disk show up with the red cross (disabled). I tried un-assigning and then re-assigning the disk in order to force a rebuild of the disk, but after a while the disk together with some other disks would stop responding. I finally decided to replace the initially failed disk with a new one, but the rebuild runs EXTREMELY slowly. It took many hours to do 40MB and then stated that the 2TB (now replaced with a new 4TB disk) would take 100+ days at less than 1MB/s! So I decided that something else was wrong. I checked the syslog and it had the following entries every few seconds:

      Mar 10 21:19:38 Tower kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
      Mar 10 21:19:38 Tower kernel: ata1.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
      Mar 10 21:19:38 Tower kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
      Mar 10 21:19:38 Tower kernel: ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
      Mar 10 21:19:38 Tower kernel: ata1.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
      Mar 10 21:19:38 Tower kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
      Mar 10 21:19:38 Tower kernel: ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
      Mar 10 21:19:38 Tower kernel: ata1.00: configured for UDMA/33
      Mar 10 21:19:38 Tower kernel: ata1: EH complete
      Mar 10 21:19:39 Tower kernel: ata1.00: exception Emask 0x50 SAct 0x0 SErr 0x4890800 action 0xe frozen
      Mar 10 21:19:39 Tower kernel: ata1.00: irq_stat 0x0c400040, interface fatal error, connection status changed
      Mar 10 21:19:39 Tower kernel: ata1: SError: { HostInt PHYRdyChg 10B8B LinkSeq DevExch }
      Mar 10 21:19:39 Tower kernel: ata1.00: failed command: READ DMA EXT
      Mar 10 21:19:39 Tower kernel: ata1.00: cmd 25/00:40:80:50:72/00:05:00:00:00/e0 tag 18 dma 688128 in
      Mar 10 21:19:39 Tower kernel: res 50/00:00:47:02:b6/00:00:00:00:00/e0 Emask 0x50 (ATA bus error)
      Mar 10 21:19:39 Tower kernel: ata1.00: status: { DRDY }
      Mar 10 21:19:39 Tower kernel: ata1: hard resetting link
      Mar 10 21:19:43 Tower kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
      Mar 10 21:19:43 Tower kernel: ata1.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
      Mar 10 21:19:43 Tower kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
      Mar 10 21:19:43 Tower kernel: ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
      Mar 10 21:19:43 Tower kernel: ata1.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
      Mar 10 21:19:43 Tower kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
      Mar 10 21:19:43 Tower kernel: ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
      Mar 10 21:19:43 Tower kernel: ata1.00: configured for UDMA/33
      Mar 10 21:19:43 Tower kernel: ata1: EH complete
      Mar 10 21:19:43 Tower kernel: ata1.00: exception Emask 0x50 SAct 0x0 SErr 0x4890800 action 0xe frozen
      Mar 10 21:19:43 Tower kernel: ata1.00: irq_stat 0x0c400040, interface fatal error, connection status changed
      Mar 10 21:19:43 Tower kernel: ata1: SError: { HostInt PHYRdyChg 10B8B LinkSeq DevExch }
      Mar 10 21:19:43 Tower kernel: ata1.00: failed command: READ DMA EXT
      Mar 10 21:19:43 Tower kernel: ata1.00: cmd 25/00:40:00:7f:72/00:05:00:00:00/e0 tag 9 dma 688128 in
      Mar 10 21:19:43 Tower kernel: res 50/00:00:c7:30:b6/00:00:00:00:00/e0 Emask 0x50 (ATA bus error)
      Mar 10 21:19:43 Tower kernel: ata1.00: status: { DRDY }
      Mar 10 21:19:43 Tower kernel: ata1: hard resetting link

      Complete syslog available at: https://www.dropbox.com/s/ouy3kool74r8be8/tower-syslog-20160310-2119.zip?dl=0

      So there seems to be a problem on the ata1 disk. First of all, how do I figure out which is the ata1 disk that's causing the problem? I don't believe it's an actual disk problem, as the SMART reports for all disks are fine, but rather a mobo or controller or PSU or cable problem. I will change the cable and have already bought a new PSU that I'm going to install tonight. But I'd like to be able to identify the problem "port" that is linked to ata1. Has anybody had similar issues before? How do you suggest I troubleshoot? I'm VERY worried about losing data, as I have already tried rebuilding disk6 (I have 7 disks including the parity), so disk 6 no longer has its original data on it and needs to be rebuilt. So I need this rebuild to work before I am protected again, and if I have any other failures now I'm out of data... Thanks for the help Johan
  12. Thanks again @RobJ. I think I have everything back up and running again correctly. I'll provide a short summary below in case it can help anyone in future. I have 3 last questions though:

      1) After the last restart (I'll summarize below what was done to fix the hardware issue) I managed to get the rebuild running, and even after 2+ hours and 600GB out of 2TB it was still going. That told me the progress was a lot farther than before and things were now working. However, after about 2 or so hours emhttp stopped working again. No webUI response. I could hear the disks churning, so I left it overnight, and this morning I stopped and restarted from the command line and the disk rebuild was done and the array is all green again. But why did emhttp stop responding? Attached is the syslog from after the rebuild and just before I restarted from the command line. I can't see anything in the syslog on that?
      2) @RobJ: I figured out which disks were attached to the failed SATA controller by seeing in the webUI which disks were missing after the failure. But I could not figure out from the syslog how to match the ataX number back to a device sdX number. Please, how did you get that?
      3) Should I still run the reiserfs check as per the wiki on all disks and let it fix whatever it wants to (using the parameters that reiserfsck suggests)? Is this safe and will it not damage my data even if reiserfsck suggests --rebuild-sb? I have read the wiki on this, but just want to make sure.

      Here is a summary of what happened and how it was resolved in case it can help anyone else in future:

      1) unRAID array stopped responding on old hardware. Reset server.
      2) UR came up, but one disk was disabled. The SMART reports for all disks were OK. I checked and the disabled disk was attached to a plug-in SATA card. I assumed the card to be faulty and didn't want to take any chances, so I bought a new mobo with enough on-board SATA ports.
      3) UR started on the new hardware and I decided to copy the contents of the failed disk first before starting the disk rebuild, using the unprotected array with the disabled disk. This copy process kept on failing with a failed SATA controller as per my posts above.
      4) Many replies suggested power issues, and I eventually identified the failed controller through the disks connected to it. I took one of the disks connected to this controller and connected it to an open port on another controller on the mobo, as the mobo has 7 internal SATA ports and I have only 6 disks. This seems to have solved the problem and I'm not getting any controller failures anymore (touch wood).
      5) I believe all the posts suggesting that my PSU may be borderline as far as its capabilities go are correct, and I am going to look at getting a proper single rail PSU as per the wiki this coming week. Moving a disk to another SATA port may have just "balanced" some of the power distribution to the extent that the PSU can now cope with the load. What do you think?

      Lastly, thanks a lot to the forum for the great help. This is one of the main factors that makes unRAID the product of choice in this space. Great support from the community, simple and stable product. Regards Johan syslog5.txt
  13. I had done some basic cable checks and inspection, but will do more thorough ones as suggested thanks.
  14. Thanks a lot guys for the responses @jonathanm and @frank1940. I've read the posts and wiki you referred to and it makes sense. I'm going to look into getting a local PSU with the necessary properties. I live in South Africa and it should be available from the bigger suppliers here. The theory makes sense and I want to stay safely within the limits of the PSU as described. It still worries me though that this machine has worked for more than a year on a similarly specced PSU and now suddenly all the funnies. Maybe a perfect storm? Please keep any other ideas coming while I try and source an adequate PSU. Regards Johan
  15. The PSU is a HuntKey Green Power: http://www.huntkeydiy.com/en/product/p-82-433.html Its specs state: 3.3V 24A, 5V 17A, 12V1 16A, 12V2 18A. The drives are Seagate Barracudas and they are 5V 0.72A, 12V 0.52A spec. The PSU has 2 power rails. One supplies 4 drives and the other supplies 2 drives and 2 small fans, each less than 0.2A. I don't think that's a problem? I restarted again after the last error and managed to copy around 120GB before the array stopped responding again. This time the error in the syslog looked different to me. Latest syslog: https://www.dropbox.com/s/2nek9hxbw5ua2xe/syslog3.txt I then stopped the array from the webUI and then started it again. But then it all froze up again and I had to stop from telnet and restart. Could this be a software issue, since all my hardware apart from the drives has been replaced now? And even if the one drive's failure was a real failure (which it is not, as the SMART report shows), then surely the chances of another drive failing now are very remote. The next time I start the machine the drives are not disabled either, except of course for the one drive in the initial failure. So that means the other drives are fine. This one is a real puzzle...
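
The rail figures quoted in post 15 can be checked with simple arithmetic. The following is only a rough sketch: the running currents (5V 0.72A, 12V 0.52A per drive), the fan draw (taken at its stated upper bound of 0.2A) and the rail limits come from the post, while `SPINUP_12V_A` is an assumed placeholder for the higher 12V draw while a drive spins up, not a figure from the post or a datasheet. The post does not say which 12V rail feeds which group, and the 12V rails also feed the motherboard and CPU, so the numbers cover the drives and fans only.

```python
# Rough per-rail load estimate for the drive layout described in post 15.
DRIVE_12V_A = 0.52    # running current per drive, from the post
DRIVE_5V_A = 0.72     # running current per drive, from the post
FAN_12V_A = 0.2       # post says "less than 0.2A" per fan; upper bound used
SPINUP_12V_A = 2.0    # ASSUMPTION: placeholder spin-up surge per drive

# Rail limits in amps as quoted in the post.
RAIL_LIMIT_A = {"12V1": 16.0, "12V2": 18.0, "5V": 17.0}


def rail_load(drives, fans=0, per_drive=DRIVE_12V_A):
    """Return the summed current in amps for a group of drives and fans."""
    return round(drives * per_drive + fans * FAN_12V_A, 2)


def headroom(limit, load):
    """Return how many amps remain below the rail limit."""
    return round(limit - load, 2)


if __name__ == "__main__":
    # Compare against the smaller 12V limit, since the rail assignment is unknown.
    worst_12v = min(RAIL_LIMIT_A["12V1"], RAIL_LIMIT_A["12V2"])
    groups = {"4 drives": (4, 0), "2 drives + 2 fans": (2, 2)}
    for name, (drives, fans) in groups.items():
        running = rail_load(drives, fans)
        spinup = rail_load(drives, fans, per_drive=SPINUP_12V_A)
        print(f"{name}: running {running}A, assumed spin-up {spinup}A, "
              f"headroom at spin-up {headroom(worst_12v, spinup)}A")
    five_volt = round(6 * DRIVE_5V_A, 2)
    print(f"5V, 6 drives: {five_volt}A of {RAIL_LIMIT_A['5V']}A")
```

A calculation like this only shows whether the label figures fit on paper. It says nothing about how well an ageing PSU actually holds its rated output.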
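
Posts 7, 9, 11 and 12 all ask how to tie an `ataN` name from the syslog to a device such as `sdX`. One possible approach, sketched below, relies on the fact that on many Linux kernels each `/sys/block/sdX` entry is a symlink whose resolved path contains the `ataN` port name. Whether unRAID's kernel exposes the path in this form is an assumption here, and the helper names are hypothetical, not part of unRAID or any library.

```python
import re
from pathlib import Path

# Matches the "ataN" path component in a resolved sysfs device path.
ATA_RE = re.compile(r"/ata(\d+)/")


def ata_port_from_sysfs_path(path):
    """Return 'ataN' if the sysfs path contains an ATA port component, else None."""
    match = ATA_RE.search(path)
    return f"ata{match.group(1)}" if match else None


def map_block_devices(sys_block="/sys/block"):
    """Map each sdX device name to its ataN port by resolving sysfs symlinks.

    Returns an empty dict when the directory does not exist (e.g. not on Linux).
    """
    mapping = {}
    root = Path(sys_block)
    if not root.is_dir():
        return mapping
    for entry in sorted(root.glob("sd*")):
        port = ata_port_from_sysfs_path(str(entry.resolve()))
        if port:  # USB and other non-ATA devices have no ataN component
            mapping[entry.name] = port
    return mapping


if __name__ == "__main__":
    for dev, port in map_block_devices().items():
        print(f"{port} -> /dev/{dev}")
```

Run on the server itself, this would list one line per disk. The `sdX` name can then be matched to a disk slot through the serial number the webUI shows for each device.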