
Red Ball during parity upgrade [solved]



I was upgrading my parity drive today from 2TB to 3TB and encountered a red ball on a completely different drive while it was going.

 

Here's what I did:

• Shut down the server

• Removed a 640GB that was in the unRAID in my case, put it in an eSATA enclosure

• Installed the 3TB (left my original 2TB parity still installed)

• Also added a 60GB SSD that I'll later use as cache

• Booted back up

• Saw unRAID found all the drives I was expecting and started normally (remember I left the 2TB original parity still installed)

• I think I restarted but set the array to not auto-start

• Assigned the new 3TB as parity

• Started the parity rebuild on to the new parity drive

 

About an hour and a half later, my wife mentioned that some of the video shares weren't working, and I saw that Disk 5 was reporting a red ball.

 

Checked the syslog. It was huge - nearly all of it was filled with entries similar to this:

Feb 28 17:32:54 Tower kernel: md: disk5 read error

Feb 28 17:32:54 Tower kernel: handle_stripe read error: 121995568/5, count: 1

Feb 28 17:32:54 Tower kernel: md: disk6 read error

Feb 28 17:32:54 Tower kernel: handle_stripe read error: 121995568/6, count: 1
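With a syslog this large, a per-disk tally of the read errors is easier to read than the raw file. A minimal sketch, assuming the lines are shaped like the excerpt above; the function name count_read_errors is made up for illustration:

```shell
#!/bin/sh
# Count "md: diskN read error" lines per disk in a saved syslog.
# Assumes the line format shown in the excerpt above.
count_read_errors() {
    # $1: path to a syslog file
    grep -o 'md: disk[0-9]* read error' "$1" | sort | uniq -c
}

# Example (not run here):
#   count_read_errors syslog-20130228-175311.txt
```

Each output line is a count followed by the matched text, so a disk that dominates the log stands out right away.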

 

Attempted a clean powerdown. In the end it took the powerdown script, some kill commands, and eventually a hard shut-off.

 

I double checked and reseated all cabling/connections to Disk 5 and the motherboard SATA port it's on.

 

I'm a bit at a loss of what to do at this point. I still have the 2TB parity, as the 3TB parity didn't complete, but with Disk 5 redballing, I'm not sure how to proceed.

 

Thanks for any advice and help! I've linked to syslogs, since the one is decently large. The small one is the syslog from right before the issue in case that's helpful.

 

https://www.dropbox.com/s/0c50a0jhs7f1dag/syslog-20130228-163140.txt.zip

https://www.dropbox.com/s/sae98vllcwvr684/syslog-20130228-175311.txt.zip

Link to comment

One more thing that might be helpful is that I backed up my super.dat about a week ago. I tried swapping that in, and it found my original 2TB as parity and gave it a green ball, though it gave a red ball on a completely different 500GB disk of mine. I'm hesitant to start the array like this, as I have no idea if it's safe, considering the 2TB parity drive was valid before the 3TB parity was started, so any changes to the array at that point wouldn't be on the 2TB (though to be honest, I don't understand all the finer points of the parity rebuild process).

Link to comment

What would I do at this point?

 

I would

boot without starting unRAID.

capture a SMART log (replace ? with the drive letter):  smartctl -a /dev/sd? > /boot/sda.smart.1

do a short test:  smartctl -t short /dev/sd?

capture a SMART log:  smartctl -a /dev/sd? > /boot/sda.smart.short

compare the two.

if there are no errors, do a long test:  smartctl -t long /dev/sd?

wait.

capture a SMART log:  smartctl -a /dev/sd? > /boot/sda.smart.long

compare.

 

You will be looking for unreadable LBAs, pending sectors, reallocated sectors, etc.

See what that brings to light.
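The capture-and-compare routine above could be scripted roughly like this. It is only a sketch: the function names and /dev/sdb are placeholders, and the attribute names in smart_watch are the common ones, which some drives report under different names.

```shell
#!/bin/sh
# Sketch only: the device and the /boot/*.smart.* names are placeholders.

# Save a full SMART report for a device under a tag (1, short, long).
smart_capture() {
    # $1: device, e.g. /dev/sdb   $2: tag
    smartctl -a "$1" > "/boot/$(basename "$1").smart.$2"
}

# Print the name and raw value of the attributes worth watching
# from a saved report (raw value is the 10th column of the table).
smart_watch() {
    # $1: saved report file
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ { print $2, $10 }' "$1"
}

# Typical sequence (not run here):
#   smart_capture /dev/sdb 1
#   smartctl -t short /dev/sdb        # wait for the test to finish
#   smart_capture /dev/sdb short
#   diff /boot/sdb.smart.1 /boot/sdb.smart.short
#   smart_watch /boot/sdb.smart.short
```

Raw values that climb between the first capture and the later ones are the thing to watch for.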

 

I might try later on (and I'm no expert at this)

re-assigning the old 2TB parity and doing a "trust my parity" procedure to see what happens.

Frankly, I'm no expert with this and it did not work for me, so wait for advice from other members before doing this.

 

If you absolutely cannot get parity working, you can 'attempt' to forgo parity altogether and see if you can manually mount the failed drive.  If you can mount it, you can try to copy the files from that drive to another (perhaps a USB or spare) drive.

If all else fails, you can look back on the forums for what I had to do with ddrescue.

I literally had to use ddrescue to copy the failed drive to another disk, then recover the filesystem with reiserfsck.

In the end, I lost 1 sector out of 1TB of data.

 

 

 

 

 

Link to comment

Just an update:

• Smart short and long tests showed no issues

• I managed to swap in the super.dat file I had from about a week ago and got the array started in maintenance mode

• Ran reiserfsck --check on the 2TB drive that redballed. No issues reported.

• Formatted my 3TB drive that I intended to be parity with unmenu's reiser formatting, and mounted it as read/write. I determined since the parity rebuild on to this 3TB was not complete, it was essentially useless, and I was comfortable with the stability of that 3TB since it passed 3 preclear sessions with no issues.

• Installed Weebotech's ddrescue package, and started ddrescue, invoking just the -n flag, and outputting to the 3TB about 15 minutes ago. It's completed about 54GB so far with no errors.
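For anyone repeating the ddrescue step, here is a rough sketch of the invocation plus a size check for when it finishes. The function names and paths are placeholders, and the log file argument is an addition that was not part of the run described above. It lets an interrupted copy resume. Depending on the ddrescue version, -n skips the phase called splitting or scraping.

```shell
#!/bin/sh
# Sketch only: device and file paths are placeholders.

# Copy a whole disk to an image file. -n skips the slowest recovery phase;
# the third argument is ddrescue's log file, used to resume a stopped run.
rescue_to_image() {
    # $1: source device   $2: image file   $3: log file
    ddrescue -n "$1" "$2" "$3"
}

# Size in bytes of a block device or a regular file.
size_of() {
    if [ -b "$1" ]; then
        blockdev --getsize64 "$1"
    else
        stat -c %s "$1"
    fi
}

# Succeeds only if source and image are the same size.
same_size() {
    [ "$(size_of "$1")" = "$(size_of "$2")" ]
}

# Example (not run here):
#   rescue_to_image /dev/sdX /mnt/3tb/2tb.img /mnt/3tb/2tb.log
#   same_size /dev/sdX /mnt/3tb/2tb.img && echo "sizes match"
```

A matching size only shows that the copy ran to the end of the disk; ddrescue's own error count is still the thing to read.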

 

My plan is to let this complete, and if it reports no issues, I'll also copy the contents of Disk 12 using ddrescue, as with the older super.dat it's redballed (though I'm guessing it has no true issues).

 

Then I need to determine how to get the data *out* of the .img files I saved from ddrescue, and eventually just invalidate parity and force it to rebuild on to my 2TB, and if that's successful, do the same on the 3TB.

Link to comment

I don't make img files.

 

 

I ddrescue onto a full physical disk of the same size or model.

Usually then you can just mount the disk and copy the files.

 

 

Help me understand something here.

You did a reiserfsck on the physical disk that redballed once before?

If so then you have a physical drive?

or are you trying to access a drive that is presented virtually through parity with all the drives?

 

 

 

Link to comment

I didn't have another 2TB available to copy to – a 3TB or 1.5TBs were all that I had spare.

 

I believe I ran reiserfsck on the physical drive that had redballed before. Since I switched to my old super.dat, it has been reporting as green (for some reason Disk 12 is reporting as redballed instead, so I'm assuming that's the parity-emulated one).

Link to comment

Anything that is accessible physically or emulated is accessible by just copying the files.

I.E. rsync.

 

 

I used ddrescue because there was a bad section of the drive that was totally unreadable.

If your smart short/long test was successful, the drive should be readable.

 

 

You "might" be able to access the .img file via loopback, but frankly I don't know.

Link to comment

I figured I could probably just dd or rsync the files to another location, which I may end up doing. I went with ddrescue just to save a bit of time, as it seems to copy nearly as fast as dd, but in case it eventually encounters an unreadable block, I want to have that time and wear and tear on a possibly failing drive spent wisely. If it finishes ddrescue with 0 reported errors, at least I'll know that the drive itself is probably OK from a physical standpoint, and I'll just need to explore and see if there's any data missing from it.

 

Frankly I was kind of surprised that reiserfsck didn't report any issues, because unRAID reported the drive as unformatted when I attempted to include the drive in the array using my super.dat that had the 3TB as parity.

Link to comment

So here's where I'm at today:

• I copied my questionable 2TB to an image file using ddrescue on that 3TB that's out of the array

• Did the same with one of my 500GBs that redballed as well, when using a different super.dat

• Neither reported any errors in copying with ddrescue

• Explored both the physical drives contents and the images. No missing files that I could tell, and it seemed to be my best choice to just accept the disks as OK.

• Did an initconfig to invalidate parity (backed up my various super.dats though just to be careful), set back up all my data disks and parity disks carefully.

• Started the array, started building parity from scratch on all disks.

• Exploring the two disks that unRAID marked as redballed at various points. Nothing major missing that I can tell, but I don't have any logs of what precisely was on them, nor any checksums. I guess I'll just have to stumble across any problems, and replace the files from their originals as necessary.

 

 

I determined how to mount those img files from ddrescue in case it helps anyone in the future:

 

Make a mount point for your file:

mkdir /mnt/500gb

 

Use fdisk to determine the offset (I had to do this on the physical disk I was working on. unRAID doesn't have the "file" command to check the offset in the image file itself.):

fdisk -l /dev/sdj

 

Use losetup to set up the image on the loopback device using the offset you determined:

losetup --offset 32256 /dev/loop0 500gb.img

 

Mount the loopback to the mountpoint you created:

mount /dev/loop0 /mnt/500gb

Link to comment

Brilliant, so if we had the file command would it tell us the offset?

Do we need to ask Tom to install it?

Should it be posted online at google code for others?

 

 

What would the output of the file command look like for those using it?

Perhaps we could create a wiki page with this knowledge for future reference.

 

 

Link to comment

I think I figured out how to calculate the offset using fdisk on the image file.

 

fdisk 500gb.img

 

Response will be:

You must set cylinders.
You can do this from the extra functions menu.

WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
         switch off the mode (command 'c') and change display units to
         sectors (command 'u').

Command (m for help): 

 

Type "x" then enter for Expert mode, followed by "p" then enter for "print partition table." In the case of my 500GB drive this is the response:

Disk 500gb.img: 1 heads, 63 sectors, 0 cylinders

Nr AF  Hd Sec  Cyl  Hd Sec  Cyl     Start      Size ID
1 00   0   0    0   0   0    0         63  976773105 83
Partition 1 has different physical/logical beginnings (non-Linux?):
     phys=(0, 0, 0) logical=(1, 0, 1)
Partition 1 has different physical/logical endings:
     phys=(0, 0, 0) logical=(15504335, 0, 63)
Partition 1 does not end on cylinder boundary.
2 00   0   0    0   0   0    0          0          0 00
3 00   0   0    0   0   0    0          0          0 00
4 00   0   0    0   0   0    0          0          0 00

 

Type "q" then enter to quit without saving any changes.

 

The key is the "Nr" column to find partition 1 (the only partition on this image), then the "Start" column to find the sector it starts on (in this case 63).

 

Multiply 63 (sectors) by 512 (bytes per sector) to get the offset in bytes: 63 x 512 = 32256.
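The sector-to-byte conversion can be left to the shell. A minimal sketch; byte_offset is a made-up helper name, and the 512-byte default is the sector size used in this thread:

```shell
#!/bin/sh
# Convert a partition's start sector into a byte offset for losetup.
byte_offset() {
    # $1: start sector   $2: bytes per sector (optional, defaults to 512)
    echo $(( $1 * ${2:-512} ))
}

# byte_offset 63   -> 32256
```

The result can be passed straight to losetup, for example: losetup --offset "$(byte_offset 63)" /dev/loop0 500gb.img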

Link to comment

A few more things:

 

If you prefer, you can use the -r flag when you set up the image using losetup on the loopback device to set it as read only.

 

When you're all done, you can unmount the image using:

umount /mnt/500gb

 

and then clear the loopback device using:

losetup -d /dev/loop0
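Putting the read-only attach, mount, and cleanup together might look like the sketch below. The function names, image path, mount point, and loop device are placeholders, and the DRY_RUN switch is an addition so the commands can be reviewed before anything touches the image.

```shell
#!/bin/sh
# Sketch only: image path, mount point, and loop device are placeholders.
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ -n "$DRY_RUN" ]; then echo "$@"; else "$@"; fi
}

# Attach the image read-only at a byte offset, then mount it read-only.
mount_image_ro() {
    # $1: image   $2: byte offset   $3: mount point   $4: loop device
    run mkdir -p "$3" &&
    run losetup -r --offset "$2" "$4" "$1" &&
    run mount -o ro "$4" "$3"
}

# Unmount and release the loop device.
unmount_image() {
    # $1: mount point   $2: loop device
    run umount "$1" &&
    run losetup -d "$2"
}

# Example (not run here):
#   mount_image_ro 500gb.img 32256 /mnt/500gb /dev/loop0
#   unmount_image /mnt/500gb /dev/loop0
```

One caveat: if the filesystem in the image still needs a journal replay, a read-only attach may refuse to mount, in which case attaching without -r to a copy of the image is the safer route.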

 

The final thing I'm not sure of is how this works for 4K "Advanced Format" disks. I believe the partitions start on different sectors, and I know they sort of emulate 512 bytes even though they're using 4096 bytes per sector. Not sure how that plays out in the end for mounting images created from a 4K disk.

Link to comment

So final update about my situation:

I ended up successfully building parity on the 2TB by just trusting the data disks. Over the past few days I've not noticed any issues on my array and haven't found any missing/corrupt files, but given the steps I took, it's possible that I'll find some in the future.

 

I'm going to re-test the 3TB drive very carefully again, and assign it as parity at a later date once I'm back in my comfort zone with it.

 

For now, marking this as solved. I hope the ddrescue image mounting notes help someone else in the future :-)

Link to comment

I would suggest doing an md5deep or md5sum for each disk. Save them somewhere.

This way should you have any future issue, you can use these files to test for changed/corrupt/missing files.

 

It will take a long time, but it will be well worth it.

The files are also good in case you want to do a quick search on the server for all files matching a regular expression.

 

example:

egrep "FILENAME.ISO" *.md5sum
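A per-disk catalog along those lines could be built with plain md5sum. This is a sketch, not the md5deep setup mentioned above: the function names and paths are placeholders, and --quiet assumes GNU md5sum.

```shell
#!/bin/sh
# Sketch only: disk paths and catalog names are placeholders.
# Keep the catalog file outside the directory being scanned.

# Write an md5sum catalog of every file under a directory.
catalog_disk() {
    # $1: directory to scan, e.g. /mnt/disk1   $2: output catalog file
    find "$1" -type f -exec md5sum {} + > "$2"
}

# Re-check a catalog. Prints only files that changed or went missing,
# and returns non-zero if there are any.
verify_catalog() {
    # $1: catalog file
    md5sum -c --quiet "$1"
}

# Example (not run here):
#   catalog_disk /mnt/disk1 /boot/disk1.md5sum
#   verify_catalog /boot/disk1.md5sum
```

The catalog is plain text with one "hash  path" line per file, so the egrep search shown above works on it unchanged.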

 

 

Link to comment

I think I will do something like that for the future. I was checking out the stuff you've posted about regularly scheduling md5deep cataloging on a per-disk basis, and it's quite interesting.

 

Once I get my new laptop for development and build an unRAID server I'll have a go at re-writing it.

 

Link to comment

Archived

This topic is now archived and is closed to further replies.
