3 M1015's in ESXi not showing all drives in unRAID



Sorry for the long post.  I tried to be as thorough as possible.

 

I am building an Atlas-like system and I'm having problems. I have three LSI M1015 HBAs that have all been flashed to IT mode (two to P10 firmware and one to P11). The two P10 cards have been running fine, fully populated, in two separate regular (non-ESXi) unRAID installs. The third (P11) I purchased recently, but I swapped it with one of the other two and it has been running fine the last few days, fully populated, in a regular unRAID install.

 

Now I'm trying to consolidate both machines (20 drives total) into one unRAID server under ESXi, and I can't get all the drives to show up in unRAID. All three M1015s show up in ESXi, I have them all set to passthrough, and I've assigned all three to the unRAID VM, but unRAID is not seeing all the drives connected to the three cards. It only sees 15 drives no matter what I do. I've tried all the troubleshooting procedures I could think of, including:

 

With each of my 8087 cables connected to a different backplane on the 4224, I tried connecting just two cables to the first card. That showed 8 drives in unRAID (as expected). I then swapped those two cables with two others, and again unRAID showed 8 drives. Finally, I removed those two and connected the last cable, and unRAID showed 4 drives. All as expected.

 

I again connected two cables to the first card and unRAID showed 8 drives.  I then removed those two cables and connected them to the second card and unRAID showed the same 8 drives.  Finally, I removed the two cables from the second card and connected them to the third card.  Again, unRAID showed the same 8 drives. 

 

This would seem to rule out the cards, cables, and backplanes as the problem. It seems that with all three cards installed, the first shows all 8 of its drives, the second shows 7, and the third shows none.

 

I even tried removing one cable from the first card and connecting it to the third card, so that the first card has 4 drives and the remaining two cards each have 8 drives connected. In that situation, unRAID shows 11 drives. Again, it shows all the drives connected to the first card (4), 7 from the second, and none from the third.

 

Does anyone have any idea what the problem could be? I tried to follow the Atlas thread as closely as possible. The only difference is that he says to disable the OPROM for slots 6 & 7, and I disabled it for 5, 6 & 7 since I populated all three. I really need to get my servers back up and running, so any help is greatly appreciated.

 

I have attached my syslog.

 

syslog.txt

Link to comment

If you see any references to INT13, disable it.

That usually limits you to 13 drives though, not 15.

 

I have the same board with 3x M1015s and one SASLP-MV8 all running. The only difference is I have 2 M1015s and 1 MV8 assigned to unRAID (the 3rd M1015 is for FreeNAS).

 

Try turning the OPROM back on.

Also try booting just unRAID without your ESXi stick in, and see if you still have an issue.

Link to comment

Leaving the BIOS as it was and just booting off the unRAID flash drive seems to have found more drives.  It's hard to tell though because I can't log onto the web GUI.  When I run ifconfig eth0 it doesn't report an IP address for some reason.  Where can I check the drives via the command line? I checked /dev/disk/by-uuid/ and it has 19 entries.
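
In the meantime, these are the commands I've been poking at from the console - I'm assuming these tools and paths are on the stock unRAID image, and the IP below is just an example for my LAN:

cat /proc/partitions                                  # every block device the kernel sees
ls -l /dev/disk/by-id/ | grep -v part                 # one symlink per whole drive, partition links filtered out
ifconfig eth0 192.168.1.50 netmask 255.255.255.0 up   # temporary static IP so I can reach the web GUI

(by-uuid only lists partitions that have a filesystem UUID, so by-id or /proc/partitions is probably a better count of physical drives.)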

 

A couple of differences I noticed between booting straight into unRAID and running it via ESXi:

 

1.  Straight boot is way faster than ESXi. I mean way faster! The dots that show when it's loading bzroot zip across the screen, whereas in ESXi they are much slower. As a rough estimate, I would guess around 45 seconds for unRAID to boot by itself versus 2.5 minutes through ESXi.

 

2.  Booting unRAID in ESXi gives me the following error every time:

 

Sep  2 02:18:43 Tower kernel: mpt2sas2: _base_wait_for_doorbell_int: failed due to timeout count(10000), int_status(0)!
Sep  2 02:18:43 Tower kernel: mpt2sas2: doorbell handshake int failed (line=3031)
Sep  2 02:18:43 Tower kernel: mpt2sas2: _base_get_ioc_facts: handshake failed (r=-14)
Sep  2 02:18:43 Tower kernel: mpt2sas2: fault_state(0x0000)!
Sep  2 02:18:43 Tower kernel: mpt2sas2: sending diag reset !!

 

But booting straight into unRAID did not give me that error.

 

 

I'll try the rest of your suggestions now John.

Link to comment

I managed to find one little problem, but it doesn't fix the main problem. It seems that one of my M1015s must have a bad pin or connection on port 0, as it only shows 3 drives even when all four are connected. According to the card's BIOS, slot 5 never shows as populated, and I've swapped cables and switched backplanes and it's always the case. So I am assuming something is wrong with port 0 on that card. In my setup I am only using 5 of the 6 ports on the three cards anyway, so I marked it and left it disconnected.

 

Now, if I boot straight into unRAID (bypassing ESXi), /dev/disk/by-uuid/ shows 20 entries instead of 19. Running unRAID through ESXi now yields 16 drives instead of 15. So it looks like that bad port was causing a drive to never show up in either situation.

 

So I think the remaining problem has something to do with that doorbell/handshake error I posted above. Booting straight into unRAID does not give me that error; booting unRAID via ESXi does, on mpt2sas2. You can also see mpt2sas0 and mpt2sas1 in the log, so those are obviously the three individual cards, and one must be getting ignored in ESXi/unRAID because of the error. I tried moving the cards around in the slots, and it seems any card(s) in the 4x PCI-E slots have a problem.
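
For what it's worth, here's how I've been trying to match the mpt2sasN instances in the log to the physical cards (assuming lspci is on the unRAID image; the host numbers are just whatever the driver assigned on my box):

lspci | grep -i LSI                        # should list all three SAS2008 controllers the VM was given
ls /sys/class/scsi_host/                   # one hostN entry per controller the driver actually attached
cat /sys/class/scsi_host/host0/proc_name   # reports "mpt2sas" if that host belongs to one of the M1015s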

 

 

Link to comment

While waiting for a reply to my problem above, I decided to go ahead and assign drives and build parity, booting unRAID without ESXi. Of course there's another problem: I'm getting 4 MB/s parity speed. I ran a parity check on both servers before I dismantled them, and they both averaged 45-50 MB/s. This whole project is quickly turning into an epic failure. I've attached a log grabbed during the parity sync. I'd appreciate any feedback.

 

Syslog: http://pastesite.com/43056
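
Next I'm going to time each drive individually to see if one disk is dragging the whole sync down - something like this should work, assuming hdparm is on the unRAID image (adjust the device range so you skip the flash drive):

for d in /dev/sd[b-z]; do echo $d; hdparm -t $d; done   # buffered read speed per drive; a healthy disk should be far above 4 MB/s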

Link to comment

Flash the cards with the newest P14 firmware. Since you are planning to use 3x controllers with a Supermicro board, either do not flash any BIOS on them at all, or, if you do flash a BIOS, boot with each card individually afterwards and disable its BIOS.

 

While flashing, familiarize yourself with the "sas2flash" user guide here - http://www.lsi.com/sep/Documents/oracle/files/SAS2_Flash_Utility_Software_Ref_Guide.pdf

and try running some commands on the suspected bad card.
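
For example (the controller number and firmware file name below are taken from the P14 package I used, so treat them as examples - under DOS the utility is called sas2flsh instead of sas2flash):

sas2flash -listall                   # enumerate every LSI controller with its firmware and BIOS versions
sas2flash -c 0 -list                 # full details for controller 0, including board name and SAS address
sas2flash -c 0 -o -f 2118it.bin      # flash the IT firmware only; omitting -b means no boot BIOS is written

If the suspect card errors out or hangs even on the -list command, that tells you something by itself.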

 

Link to comment

Once you download the complete Pxx package and extract all the files, there will be many folders, one of them named "sasbios_rel".

 

Inside this folder are only two files - one is the actual BIOS needed for the flashing, and the other, named "mptbios.txt", is the user manual for the BIOS.

 

Take a look inside for more details.
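
And if you later decide you do want the boot BIOS on one of the cards after all, flashing it is a one-liner - the file name below is what it's called in the packages I've downloaded, so check your own sasbios_rel folder:

sas2flash -c 0 -o -b mptsas2.rom     # add the boot BIOS to controller 0
sas2flash -c 0 -list                 # the BIOS version should no longer show as N/A afterwards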

Link to comment

I upgraded all three cards to P14 firmware and it didn't help at all.  I decided to run memtest from my unRAID flash drive and it doesn't do anything.  All I get is the screen below with no activity even after letting it sit for over 15 minutes.

 

[Photo of the memtest screen: 2012-09-03_20-08-38_71 by rockdawg2232, on Flickr]

 

I am running two sticks of Super Talent DDR3-1333 8GB ECC server memory (Micron chips), and I tried removing each stick and moving them around, but memtest acts the same every time. Has anybody seen anything like this before?

Link to comment

You will have to upgrade to the latest memtest, 4.20 - this file is frequently missed when people upgrade from one version to the next and only copy the two main files. See the sketch after the change-log below.

 

Memtest 4.20 change-log:

    New Features

        Added failsafe mode (press F1 at startup)

        Added support for Intel "Sandy Bridge" CPU

        Added support for AMD "fusion" CPU

        Added Coreboot "table forward" support

    Bug Fixes

        Corrected some memory brands not detected properly

        Various bug fixes
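
Assuming your unRAID flash is mounted at /boot as usual, replacing the old binary is just a copy - the source file name is whatever your extracted 4.20 download is called, so adjust accordingly:

cp /boot/memtest /boot/memtest.old                        # keep the old one around just in case
cp /path/to/extracted/memtest86+-4.20.bin /boot/memtest   # rename to plain "memtest" so the existing boot entry finds it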

 

Link to comment

I would still like some feedback on the issue with the M1015 that keeps throwing an error in unRAID in passthrough mode.

I've had three M1015s on my X9SCM-F v1.0 board, but only 2 of the three went to unRAID; the 3rd went to a Windows VM. So I'm not much help with your problem. I needed the PCIe slots for other cards like tuners, which is why I went to SAS expanders. Then I only needed one card to get 24 drives in unRAID. I currently have a temporary ESXi server install with an unRAID VM on an X7SBE that has 17 drives in the VM. That's going to be replaced with another Tyan S5512. My X9SCM-F is being moved to a Windows non-VM install.
Link to comment

Did you have either M1015 plugged into a 4x slot? I ask because if I have two of the three M1015s plugged into 4x slots, neither card's drives appear in unRAID. When I get home I'll try plugging a single card into a 4x slot and see what happens. From what I've seen, I don't think it will work. It seems that as soon as unRAID in ESXi sees one of the M1015s in a 4x slot, it throws the following error:

 

Sep  2 02:18:43 Tower kernel: mpt2sas2: _base_wait_for_doorbell_int: failed due to timeout count(10000), int_status(0)!
Sep  2 02:18:43 Tower kernel: mpt2sas2: doorbell handshake int failed (line=3031)
Sep  2 02:18:43 Tower kernel: mpt2sas2: _base_get_ioc_facts: handshake failed (r=-14)
Sep  2 02:18:43 Tower kernel: mpt2sas2: fault_state(0x0000)!
Sep  2 02:18:43 Tower kernel: mpt2sas2: sending diag reset !!

 

Except it will do it to mpt2sas1 if I have two cards in the 4x slots.

Link to comment

I have 3 M1015s passed through to unRaid.

 

Counting slots from the CPU to the outer side, I have them in 1, 2, and 4 (the 3rd slot is blank).

 

Ensure that you have all 3 passed through. Then, did you reboot ESXi?

Then add them to the unRaid VM, making sure you pick a different one for each addition (I made that bonehead mistake once).
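
If you want to double-check from the ESXi shell that three different devices really got added, you can grep the VM's .vmx file - the datastore path and PCI addresses below are just examples from my setup, yours will differ. Each pciPassthruN.id should be a different address:

grep pciPassthru /vmfs/volumes/datastore1/unRaid/unRaid.vmx
# expect three entries along the lines of:
# pciPassthru0.id = "01:00.0"
# pciPassthru1.id = "02:00.0"
# pciPassthru2.id = "04:00.0"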

 

I'm running unRaid 5b11

Link to comment

Thanks for replying graywolf.  It's good to know that it should be possible to get this working.

 

I'm pretty sure I tried that configuration, but I'll try again when I get home. I did reboot after adding them (ESXi said I had to). Not sure what you mean by "make sure you pick a different one for each addition". Can you elaborate?

Link to comment

I just tried putting the cards in slots 1, 2, and 4 like you mentioned, graywolf, and it still errors on me. In fact, I even tried removing two of the cards and leaving just one in one of the 4x slots, and I get the same error and no drives are visible in unRAID. I just don't understand what is going on. Again, if I boot straight into unRAID (bypassing ESXi), everything works.

Link to comment

I have noticed one difference in the syslogs between straight unRAID and unRAID through ESXi:

 

Straight unRAID

Sep  3 09:17:29 Tower kernel: mpt2sas2: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (16608336 kB)

 

unRAID through ESXi

Sep  2 02:18:43 Tower kernel: mpt2sas2: 32 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (2074648 kB)

 

Why wouldn't they both be 64 BIT?
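
For reference, these are the commands I used to pull that line out and to check how much memory each boot actually gets (the straight boot sees the full 16 GB, while the ESXi VM only sees about 2 GB):

grep "DMA ADDRESSING" /var/log/syslog     # shows which DMA mask the mpt2sas driver picked this boot
grep MemTotal /proc/meminfo               # total RAM the kernel (or the VM) actually sees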

Link to comment
