CyberMew Posted July 11, 2016

I'm having this issue often as well. At one point I had a few corrupted file entries that (I think) caused it to hang and produce tons of errors on the console. I've fixed those, but I'm not sure whether there's anything else to be concerned about. I've attached my logs from before the powerdown and after a forced restart. Should I upgrade the filesystem as well in order to fix these issues?

tower-diagnostics-20160711-2338.zip
tower-diagnostics-20160712-0002.zip
RobJ Posted July 12, 2016

You're having a hardware issue with your system, causing 'machine check events', each preceded by momentary CPU overheating messages (all 4 cores). I would install mcelog from the NerdPack and see what it says the next time they are reported. And of course you may want to examine the CPU cooling.
CyberMew Posted July 13, 2016

I might not have applied enough thermal paste to the CPU, hence the higher temps, especially when a couple of users are trying to stream/transcode from Plex. I will try to replace the thermal paste this weekend!

I've installed NerdPack, but how would I go about using mcelog?

Would the high CPU temps really cause the unRAID web interface to lock up and render the disks inactive? When it happens there is no disk activity, and we can't access the webui or the shares via SMB. I can't do a powerdown either.

Just a side question, not sure if it's in the logs (or if I've mentioned this before), but my parity drive and disk4 often get errors. Are they OK or should I replace them?
CyberMew Posted July 13, 2016

Found the answer to my side question (finally!! the errors were making me uncomfortable).

Both drives were connected to https://www.amazon.com/gp/product/B00AZ9T3OU/, which I now realise is problematic after reading someone else's post, http://lime-technology.com/forum/index.php?topic=50332.0, which linked me to http://lime-technology.com/forum/index.php?topic=40683.45

Thank the heavens! I'm going to try the solutions in that thread before it's time to switch to another card!
RobJ Posted July 13, 2016

"I've installed NerdPack, but how would I go about using mcelog?"

mcelog is a tool that's triggered when a Machine Check Event occurs, and is able to gather and log a fair amount of info about the MCE. Without it, all we know from the syslog is that an MCE occurred, but not the source of it (CPU, RAM, etc.) or what the error was. If an MCE occurs again, there will be more info logged for us to use in figuring out what needs to be fixed or replaced.

"Would the high cpu temps really be causing unraid web to lockup and render the disks inactive? There are no disk activities nor can we access the webui/shares via smb when it happens. I can't do a powerdown when this happens as well."

I don't know. But it pays to deal with what we *can* deal with first, then see what else turns up, hoping that it will be clearer to us once the obvious issues are gone.
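For anyone who wants to check their own syslog for the kind of events described above, here is a minimal Python sketch. It is not part of mcelog and does not decode anything; it only lists lines that mention a machine check or a CPU thermal warning. The two search phrases are assumptions about how Linux kernels commonly word these messages, and the names `PATTERNS`, `find_hardware_events`, and `scan_file` are made up for this example.

```python
import re

# ASSUMPTION: these phrases reflect common Linux kernel wording for machine
# check and CPU thermal messages. The exact text in a given syslog may differ,
# so adjust the patterns to match what your log actually contains.
PATTERNS = {
    "mce": re.compile(r"machine check", re.IGNORECASE),
    "thermal": re.compile(r"temperature above threshold", re.IGNORECASE),
}


def find_hardware_events(lines):
    """Return (line_number, label, text) for every line matching a pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label, line.rstrip()))
                break  # one label per line is enough for a quick overview
    return hits


def scan_file(path):
    """Scan a syslog file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return find_hardware_events(handle)
```

A call such as `scan_file("/var/log/syslog")` would list the matching lines, assuming that is where the syslog lives on the system in question. Seeing a thermal line shortly before each machine check line would be consistent with the pattern described above, though it would not prove the cause.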
CyberMew Posted July 14, 2016

Oh that's great, I thought I had to do something in order to get it to appear in the logs. Glad that it's automatic.

I just got home, and it seems to be happening again. I'm unable to access Plex, there's no HDD light activity, etc. This time, however, I could access the webgui (a surprise!), so I tried to stop the array, but it got stuck at the very first step, Stopping Docker. No surprises there. I then tried a powerdown, which didn't work, as usual. I will edit this post to attach the logs once I'm able to get them out.

edit: attached logs. Also, I've restarted with the boot command line fix provided in the Marvell topic, and it looks like it isn't working.

Jul 15 00:19:36 Tower emhttp: shcmd (11): /usr/local/sbin/set_ncq sdh 1 &> /dev/null
Jul 15 00:53:37 Tower kernel: ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jul 15 00:53:37 Tower kernel: ata9.00: failed command: SMART
Jul 15 00:53:37 Tower kernel: ata9.00: cmd b0/d1:01:01:4f:c2/00:00:00:00:00/00 tag 28 pio 512 in
Jul 15 00:53:37 Tower kernel: ata9.00: status: { DRDY }
Jul 15 00:53:37 Tower kernel: ata9: hard resetting link
Jul 15 00:53:38 Tower kernel: ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 15 00:53:38 Tower kernel: ata9.00: configured for UDMA/133
Jul 15 00:53:38 Tower kernel: ata9: EH complete
Jul 15 00:54:22 Tower kernel: ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jul 15 00:54:22 Tower kernel: ata9.00: failed command: IDENTIFY DEVICE
Jul 15 00:54:22 Tower kernel: ata9.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 4 pio 512 in
Jul 15 00:54:22 Tower kernel: ata9.00: status: { DRDY }
Jul 15 00:54:22 Tower kernel: ata9: hard resetting link
Jul 15 00:54:23 Tower kernel: ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 15 00:54:23 Tower kernel: ata9.00: configured for UDMA/133
Jul 15 00:54:23 Tower kernel: ata9: EH complete

I'm not sure whether this is related to the main problem, but I'll also get another, non-Marvell SATA card if possible to try to eliminate possible problems.

tower-diagnostics-20160715-0007.zip
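To get a quick overview of ATA errors like the ones in the excerpt above, a small Python sketch along these lines could tally them per port. It assumes the syslog lines follow the same shape as the excerpt; the function name `summarize_ata_errors` and the two regular expressions are invented for this example and only recognise the "failed command" and "hard resetting link" lines.

```python
import re
from collections import Counter

# Patterns modelled on lines from the excerpt, for example:
#   "Jul 15 00:53:37 Tower kernel: ata9.00: failed command: SMART"
#   "Jul 15 00:53:37 Tower kernel: ata9: hard resetting link"
# ASSUMPTION: other kernels or drivers may word these lines differently.
FAILED_CMD = re.compile(r"kernel: (ata\d+)\.\d+: failed command: (.+)$")
HARD_RESET = re.compile(r"kernel: (ata\d+): hard resetting link")


def summarize_ata_errors(lines):
    """Count failed commands per (port, command) and link resets per port."""
    failed = Counter()
    resets = Counter()
    for line in lines:
        line = line.rstrip()
        match = FAILED_CMD.search(line)
        if match:
            failed[(match.group(1), match.group(2))] += 1
            continue
        match = HARD_RESET.search(line)
        if match:
            resets[match.group(1)] += 1
    return failed, resets


# Four lines copied from the excerpt, used as a small sample.
sample = [
    "Jul 15 00:53:37 Tower kernel: ata9.00: failed command: SMART",
    "Jul 15 00:53:37 Tower kernel: ata9: hard resetting link",
    "Jul 15 00:54:22 Tower kernel: ata9.00: failed command: IDENTIFY DEVICE",
    "Jul 15 00:54:22 Tower kernel: ata9: hard resetting link",
]
failed, resets = summarize_ata_errors(sample)
for (port, command), count in sorted(failed.items()):
    print(f"{port}: {count} failed {command} command(s)")
for port, count in sorted(resets.items()):
    print(f"{port}: {count} hard link reset(s)")
```

A tally like this only shows which port keeps resetting and how often; it does not say whether the drive, the cable, or the controller is at fault. Comparing the counts before and after swapping the SATA card is one way it might be used.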