Doozy of a pickle of a head-scratcher



I've got a client build that I've been struggling with for weeks now.  The ironic bit - it is all my design!  Finally last night I decided to pour myself a stiff drink, run a few more tests, then seek help here.  As far as I can tell, the issue is simply bad luck rather than a design flaw, but I still figured I would give the community a chance to weigh in on this one.

 

Here's the build:

 

LMLiL.jpg

 

bpXhC.jpg

 

Here's the specs:

 

15 Drive Budget Box

Mobo: ASUS M4A785-M

CPU: AMD Sempron 140

RAM: Kingston 2GB DDR2 800

PSU: CORSAIR Builder Series CMPSU-500CX 500W

Case: COOLER MASTER Centurion 590 RC-590-KKN1-GP Black SECC / ABS ATX Mid Tower Computer Case

SATA Expansion Cards:

    SUPERMICRO AOC-SASLP-MV8 PCI Express x4 Low Profile SAS SAS RAID Controller

    2 port SATA2 Serial ATA II PCI-Express RAID Controller Card (Silicon Image SIL3132)

Cables:

    3ware CBL-SFF8087OCF-05M 1 unit of 0.5m Multi-lane Internal (SFF-8087) Serial ATA breakout cable, forward x 2

    Molex power splitters x 2

Fans: GELID Solutions FN-PX09-20 92mm Case Fan x 3 (replacing stock fans in Supermicro cages)

Hot Swap Drive Bays: SUPERMICRO CSE-M35T-1B Black 5 Bay Hot-Swapable SATA HDD Enclosure x 3

Hard Drives: 2 TB WD EARS w/ jumpers x 2 plus a bunch of older test drives

 

Here's the issue in a nutshell:

 

The video out keeps failing, resulting in either a blank screen or this:

 

click to view - beware of eye cancer

 

The full story:

 

As part of my burn-in process, I run 6 instances of preclear on different drives simultaneously (directly from the system console) - I am using the newest version of preclear (love the new results output, Joe).  I've determined with other servers that 2 GB of RAM is enough to simultaneously preclear 6 x 2 TB drives.  I figure this is a good way to test that all the disk controllers, disk cages, etc. are working.  This server reliably fails this test every time the SASLP card is in play.  When using only the motherboard or SIL3132 card ports, the server passes 6 simultaneous preclears with no problem.  However, whenever the ports on the SASLP card are used in this test, the server fails.  In fact, whenever the SASLP card ports are used, this server seems incapable of reliably passing preclear on even a single drive.  If I start one preclear cycle and leave it, sometimes it will complete, and other times it will fail (blank or eye cancer screen) within minutes.  If I attempt to start multiple preclears, it will reliably fail within a minute (usually before I'm even done typing in all the commands - and I'm a fast typist).

 

Here's the weird thing - only the video out seems to be failing.  The rest of the server keeps on chugging away.  The drive lights blink (indicating that the preclear processes are still running), and if I type blind the server will still respond to keyboard commands - ctrl-c to cancel preclear (confirmed by the drive light turning off), powerdown to shut down, etc.  The web interface often still works as well, though once in a while it will freeze when the screen dies.

 

Normally I would conclude that the problem is a bad motherboard and just order an RMA.  However, here's the thing - this is the second motherboard in a row that is exhibiting these same (or at least very similar) symptoms.  The first one I ordered for this build I diagnosed with a bad PCIe x16 slot.  It would throw up errors whenever any one of several known good SASLP cards that I have was used in it.  It also failed the above preclear test, but I don't remember the same screen blanking behavior (though it was a while ago, so I may have just forgotten).  The board I currently have is a brand new replacement for the previous one.  Neither was open box, refurb, or anything like that.

 

Here's the tests I've done to point the blame squarely on the motherboard:

 

Tested RAM - passes memtest for 36+ hours in DIMM A1 (one 24+ hour session, one 12+ hour session)

Excluded PSU by replicating problem with two other PSUs - Corsair CX 650W (brand new), and Antec Neo Eco 400W (used, but known good)

Excluded cables by using these same cables in other builds to perform the same tests (successful)

Excluded SASLP card by testing it in another build (successful), and by testing other known-good SASLP cards in this build (problem replicates)

Excluded the flash drive by testing alternates (problem replicates), and by testing it in other builds (successful)

Excluded potential power-draw issues by running the server with just a single drive cage powered (problem replicates) and by trying a larger 650W PSU (problem replicates)

Reseated the RAM and CPU (problem replicates), tried all the RAM slots (problem replicates, also certain other DIMMs cause memtest to fail, so that may be another red flag)

Excluded both RAM and CPU by ensuring that the server will complete 6 simultaneous preclear passes when the SASLP card/PCIe x16 slot are not used (successful).

Removed the Sil3132 card so that the SASLP card was the only expansion card (problem replicates)

Tested the drive cages in other builds (successful)

Excluded the monitor and keyboard by trying alternates (problem replicates)
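
For anyone following along, the process of elimination in the list above can be written down as a small sketch. The part names are just shorthand labels and the evidence strings paraphrase the tests listed; none of this comes from any unRAID tool, and it assumes every listed test fully clears the part it names.

```python
# Process-of-elimination sketch. Part names are shorthand labels chosen for
# this example; the evidence strings paraphrase the test list in the post.
ALL_PARTS = [
    "motherboard", "CPU", "RAM", "PSU", "cables", "SASLP card",
    "SIL3132 card", "flash drive", "drive cages", "monitor", "keyboard",
]

# part -> evidence that cleared it
CLEARED = {
    "RAM": "passes memtest; 6 preclears pass without the SASLP",
    "CPU": "6 preclears pass without the SASLP",
    "PSU": "problem replicates with two other PSUs",
    "cables": "same cables pass the test in other builds",
    "SASLP card": "works in another build; other known-good cards also fail here",
    "SIL3132 card": "problem replicates with the card removed",
    "flash drive": "problem replicates with alternates; works in other builds",
    "drive cages": "pass in other builds",
    "monitor": "problem replicates with an alternate",
    "keyboard": "problem replicates with an alternate",
}


def remaining_suspects(all_parts, cleared):
    """Return the parts that have no clearing evidence, in original order."""
    return [part for part in all_parts if part not in cleared]


if __name__ == "__main__":
    print(remaining_suspects(ALL_PARTS, CLEARED))
```

Under those assumptions the only part left without clearing evidence is the motherboard, which matches the conclusion in the post. The sketch says nothing about interactions between parts, such as a board and card that each work fine but not together.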

 

I've used an IR thermometer to look for any signs of overheating, and have found none.  I've also updated the motherboard's BIOS, to no avail.  I wish I had a PCI video card lying around to test, but alas I do not.

 

Given all of the above, I'm inclined to say that this motherboard is incompatible with the SASLP card.  However, I've read of at least one other person successfully using both together in their server.  I've also searched the BIOS for any settings relating to PEG, which might be reserving the PCIe x16 slot for video-card only operation - I've found nothing of the sort.

 

Note that the motherboard itself is Level 1 certified, but the user who did that certification was not using it with a SASLP card.  Another user has anecdotally offered evidence for Level 2 certification, but provided no hard evidence (syslog) as of yet.

 

So it has to be the motherboard (again), right?  Am I missing something?  Common sense and probability tell me that two bad motherboards in a row is less likely than something else being the common root of both problems.  If that's the case, I can't find it.
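
To put a rough number on the "two bad boards in a row" intuition, here is a minimal sketch. The 3% defect rate is a made-up figure for illustration only (I have no real failure data for this board), and the calculation assumes the two boards fail independently of each other.

```python
# Hypothetical defect rate, for illustration only; it is not a measured
# figure for this or any other motherboard.
ASSUMED_DEFECT_RATE = 0.03


def prob_all_bad(p, n):
    """Probability that n independently chosen boards are all defective."""
    return p ** n


if __name__ == "__main__":
    two_in_a_row = prob_all_bad(ASSUMED_DEFECT_RATE, 2)
    print(f"Chance of two independent duds in a row: {two_in_a_row:.4%}")
```

With that assumed rate, two independent duds in a row would be well under a tenth of a percent. That is why a shared cause (a design quirk, a bad batch, or an incompatibility with the SASLP) looks more plausible than pure bad luck: a shared cause breaks the independence assumption entirely.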

 

Scouring the forums, the only consistent problems I've seen others have with this motherboard are problems booting from the unRAID flash drive.  Ironically, I've never had that problem with this board.  Still, I followed the advice to use Forced-FDD mode instead of Auto just because I figured it couldn't hurt.  Disabling INT13 on the SASLP card on this motherboard also worked as expected.

 

I'm off to try another test now - I'm going to skip the preclear test above and see if I can just get this server to boot into unRAID and complete a parity sync and check using drives on the SASLP card.  I'm not optimistic.  After that I'll probably try to contact Asus tech support and see if they have any input.  Again, I'm not optimistic as past experience with tech support in general indicates that I'll know more about the problem and potential solutions than they will.

 

Unless someone else recognizes something I've missed, I'll probably RMA this board (again), and search for a new one to recommend for my budget builds.  A real shame - this one had so much potential.

 

Attached is a syslog captured just after one of the blank screen crashes.  You'll see the preclear sessions start off, then the syslog will end abruptly at which point the screen had gone blank and I hard powered off the server.  This syslog was captured via the web interface.

stiff_drink_inducing_syslog.txt


Just thinking out loud here, but what seems strange to me is a) the board works fine without the SASLP, b) a proven SASLP exhibits the problem, and c) it also shows issues with video.  To me, this sounds like some sort of conflict between the video and the SASLP.  Have you disabled all unused items on the mobo (com/parallel ports, etc.)?  Do you happen to have a video card you can throw in there and disable the onboard video, just as a test?


Not much input from me, but normally when I hit these kinds of walls I try to think of what I have not tried. I know that's like saying I didn't try jumping out the window, which has nothing to do with this at all.

 

My question is: if it is failing and taking your video down with it, is your onboard video sharing RAM with the rest of the system? I'm guessing it might be, and you're not really doing much in the way of video, but with all the preclears running, and since we all know Linux likes to gobble up as much memory as possible, could there be a correlation?

 

Like I said probably not, but something else to throw out there.


I can empathise. A few days ago, I thought I had two drives fail, just after removing a different drive from the array and doing an "initconfig".

That was a panicky moment.

 

It turned out two SATA ports had died on the motherboard -- I was able to access both drives on another machine and found all the data was intact. I'm glad I didn't panic and do something silly like clear the drives again.


Raj -

 

Sorry you're pulling your hair out.  I am also having a frustrating problem, described below.  I really don't think the SASLP is the best-implemented board.  Although it does play nice in some systems, it does things that no other SATA controller I have used or seen referenced does:

 

1 - Gets the HDIO_GET_IDENTITY errors

2 - Fails smartctl report requests when drives are under heavy use (several examples in your syslog)

3 - Hangs servers with nasty errors, preventing smooth server shutdowns

 

I have one in my PCIe x16 slot, and an Adaptec 1430sa in my x4 slot.

 

I can run the thing for days or weeks using it for Samba shares and parity checks.  Works flawlessly.

 

But if I am using the interface (default or unmenu), it is like Russian roulette.  At one time or another the controller is going to lock up and force me to hard boot the server.  I think it has something to do with pulling the SMART data.  There is no rhyme or reason to the crash - and not even any indication that any particular drive has failed.  I've had it happen on the first launch of the GUI, and had it happen weeks after a boot.  The SASLP totally loses its ability to do I/O.  Several times I've thought that the problem was fixed, only to see it happen again.

 

Another person with my MB reported inability to run 2 SASLPs.  They ran one on their x4 slot, and tried to add one to their x16 slot.  I don't remember all the details, but they could not make it work.  I ran 2 1430sa's in my box for a long time.  Never a problem.  As I said, I don't think the SASLP board is well implemented.

 

Maybe SASLP boards don't like certain x16 slots?  I am thinking about swapping my 1430sa and SASLP boards, so the SASLP will be in the x4 slot.   Sounds easy but not - will take a while and require quite a bit of recabling.  The other option I'm considering is to switch to my C2SEE motherboard, but I don't have all the parts (need memory and a 2 port x1 controller).  Unfortunately I have no way to A-B my SASLP with another one to see if there is something wrong with it.  My assumption is that it is good, but if I continue to get the errors, even with another MB, the card will be the next suspect.

 

I am wondering how the SASLP would hold up if you attached already precleared disks to it, then defined an array and formatted them.  The SASLP clearly does not like unpartitioned / unformatted disks.  If you added drives you might find that it settles down and works fine.  Might be an interesting test, but it will destroy the preclear on those disks.

 

If anyone has any suggestions for me to diagnose my issue, please respond.  One of the most frustrating things is I have no definitive test to know when it is fixed. I just do something and have to wait to see if it crashes something.  PITA!


Raj, stupid question: did you account for video RAM in your preclear memory calculations? Perhaps lower the amount of memory reserved for the video display.

 

Re-flash the mainboard BIOS. After that, reset back to factory settings. As someone already mentioned, disable onboard video and install a cheap video card. Better yet, disable onboard video and operate the server through telnet, and see if you can crash or freeze it like you described before. In rare cases I've seen the onboard cache go bad on a mainboard and cause all sorts of odd problems that were never resolved.

 

 


Forgot to mention in the first post, I also set the onboard video RAM to the lowest setting (32 MB)...
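
For the video RAM question, the arithmetic is quick to sketch. The figures are the ones from this thread (2 GB of RAM, 32 MB reserved for onboard video, 6 simultaneous preclears). The even per-session split is only an illustration; it ignores whatever the OS itself uses and is not a statement of how much memory preclear actually needs.

```python
TOTAL_RAM_MB = 2 * 1024   # 2 GB of RAM, from the spec list
VIDEO_SHARE_MB = 32       # lowest onboard video RAM setting
PRECLEARS = 6             # simultaneous preclear sessions in the burn-in test


def ram_left_mb(total_mb, video_mb):
    """RAM remaining for the system after the onboard video reservation."""
    return total_mb - video_mb


left = ram_left_mb(TOTAL_RAM_MB, VIDEO_SHARE_MB)
per_session_mb = left / PRECLEARS          # naive even split, ignoring OS use
video_fraction = VIDEO_SHARE_MB / TOTAL_RAM_MB

print(f"{left} MB left, {per_session_mb:.0f} MB per preclear, "
      f"{video_fraction:.2%} of RAM reserved for video")
```

At the 32 MB setting the video reservation is a small slice of the total, so on these numbers alone it does not look like the thing starving the preclears.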

 

I think I'm just about ready to throw in the towel with this board.  I received my new dev flash drive in the mail today (I ordered it days ago, but coincidentally it came the same day that I broke my previous one).  I loaded it up with unRAID 4.6 and tried to boot this server from it.  Guess what - I ran into all the boot problems that others have reported with this board.  It simply won't boot from the USB drive no matter what settings I try (and the settings are sticking).  I booted other servers from the same USB drive without issue.

 

While I might try one or two more of your suggestions, I think I'm going to pursue the refund as opposed to RMA route with this board.  Hopefully two boards in a row with the same issues will be enough to convince Newegg.  Hopefully my client will also be understanding.  Looks like I'm also in the market for a new budget board (that will work with the rest of this hardware).

 

All said and done, I really miss the Biostar A760G-M2+.  I feel as though the ideal unRAID board for 15 drive or smaller servers has come and gone.  I'll have a drink in its honor.

 

I also had a nice date with a fun Russian girl...so the day wasn't a total loss.


Raj,

 

I feel for you, I think we've all been in your shoes at one time or another. I just looked and your favorite board (Biostar A760G M2+) is available new on ebay.

 

Good luck.

 

Boy howdy, you are right!  I just bought three of them...my next three builds should be problem-free.

The other option I'm considering is to switch to my C2SEE motherboard

 

I too am working on a new build. It is a C2SEE, 4GB RAM, 1 SASLP, 1 Sil3132, 15 drives.

First problem was that I could not get any Monoprice Sil3132 to work in any slot with any firmware.

 

Started up with another brand of Sil3132 and fired off 15 parallel preclears. One of the mobo-connected drives started throwing errors right out of the gate. Killed its preclear and later found the cause to be a bad contact between the drive and backplane. So 14 preclears went on happily, but one was really, really slow. It was the one connected to the Sil3132. During this time I ran several SMART reports on drives connected to the SASLP with no issue other than a few HDIO_GET_IDENTITY messages. I am confident these were related to things I was doing in the console.

 

The interesting, and still unresolved, thing involves the very badly performing drive, a Samsung HD154U, that had been working fine in another box just a couple of weeks ago. I swapped it with a drive connected to the SASLP. The drive I moved to the Sil3132 worked just fine, so there was no problem with the cabling or the Sil3132, but the Samsung connected to the SASLP was still really slow. It averaged 20G/s during the preread, but the speed jumped all over the place. As low as 9G/s and as high as 59G/s; it would change with every refresh. Any attempt to run a SMART report at the same time as the preclear failed. There was one very informative "at drivers/ata/libata-core.c:5186 ata_qc_issue+0x10b/0x308" warning in the log. Maybe the warning was related to SMART failing; I have seen that combination before.

 

So, of course it is a drive problem. Well, maybe, but I am not yet sure. What is clear is that the SASLP worked perfectly with all 8 drives connected to it running preclears, even when running a SMART report on one of them. With the problem Samsung connected to it, the SASLP didn't work so perfectly.

 

I am planning to move the Samsung to a mobo port later tonight to see what happens with it connected there.


Don't forget to post pics in the "Pimp My Russian Date" thread.

 

You'll have to wait for date #2 ;)

 

Asked Newegg for a refund, they gave it to me with no hassles.  Gotta love Newegg...you just throw some tech jargon at them and they do whatever you want.  "The motherboard failed SMART when I calibrated the encasement parameters, and the silicon reticulating module prevented CMOS transfer.  Oh, and BIOS caught fire.  Can I have my money back?"

 

In other good news, I contacted the eBay seller directly and he's promised me two more Biostar A760G M2+'s...he also gave me a promo code for a few bucks off.  Sounds like he can get some more in the future as well.  Hopefully I'll just get a direct line from him and I won't have to waste time on these other boards.  Still, I suppose I'll have to find a new budget board to recommend to others that is more readily available (meaning available on Newegg).

 

...and no, I'm not going to post pics of the Russian girl, sorry.  How about some hot hard drive action instead?  That's 4 TB of sexy, right there.

 

Tomorrow I'm off to SF to deliver a 22 Drive Beast.  Should be fun.
