
How can I resolve my parity errors?


Blade


 

Ever since updating to 4.5 final I have been getting parity errors because of the HPA on my drive. Prior to 4.5 final it was not an issue at all; I just lived with the first data drive losing a bit of space to the HPA from the Gigabyte motherboard. I have since disabled the BIOS backup on the unRAID server in the BIOS, but when the parity check ran today I got errors:

 

Last checked on 2/1/2010 12:07:33 PM, finding 3102043 errors

 

How can I resolve this and get it back to 0 errors? I have 11 data drives and 1 parity drive.

 

Thx

 

Link to comment

 

 

Step 1.  Post a syslog.

Step 2.  Run a second parity check. The first should have already fixed the parity.  If the second finds no additional parity errors, you are done.

Step 3.  Stop the array, reboot, and make sure the HPA has not been added once more.

Step 4.  Post a syslog.

 

Joe L.

 

Link to comment

Here is a little info on my system

 

I have 1 parity and 11 data drives

6 SATA ports on the motherboard

I have a SATAII CARD ROSEWILL|RC-218 RETAIL - Retail ---- 4 drives on this one

I have a SYBA SD-SA2PEX-2IR PCI Express SATA II Controller Card - Retail ---- 2 drives on this one

I did not have a problem until I upgraded to 4.5 final, which I guess ignores the HPA or something. I had my Gigabyte motherboard set to use the BIOS backup on data disk1, which was no problem until I upgraded to 4.5 final. I had been using 4.5 beta 12 for the longest time without a single parity error. I have since upgraded the Gigabyte BIOS and set it to not back up the BIOS.

 

I am really lost on this one, but I really want good parity in case a drive fails. I would hate to lose all my hours of loading my Blu-rays.

 

Link to comment

I ran a full parity check again last night and it finished this morning with 1.5 million sync errors.

I tried running the check again this morning and it keeps giving me lots of errors.

I am wondering if switching back to 4.5 beta 12 will help me or not.

Should I just re-enable the bios backup in the bios?

All of the data is perfectly fine.

I just have no idea how to fix this.

 

Link to comment

It is a bit confusing to me too.  I'm trying to understand what is happening.

 

I would expect parity sync errors one time... but not again and again, unless the bios is continuing to add its data to the area it thinks it reserved.

 

The issue is that the size of the reiserfs partition on the disk, as stated in its definition in the partition table, is larger than the partition area left once the HPA is added.  What apparently happened is that the file-system was created before the HPA was added.  Then the BIOS added the HPA, cutting off the last megabyte of space from the disk, but NOT changing the partition size as defined in the partition table in the MBR.

 

So.... disk was originally manufactured with a size of 500107862016 bytes.  ( 976773168 sectors of 512 bytes)

 

Disk installed in unRAID array.  Partition 1 created of 976773105 sectors.  (It starts at sector 63; sectors 1 through 62 are unused, sector 0 = MBR)

 

We do not know what size the reiserfs is in the partition, but let's assume it was created when the disk was full size, therefore it expects to be able to use the entire set of 976773105 sectors.

 

Now, your BIOS adds an HPA, making the disk size be reported as 500106780160 bytes (1081856 bytes, or 2113 sectors, smaller).
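Those figures are easy to double-check with shell arithmetic (every number below is taken straight from the syslog excerpts in this thread):

```shell
# Double-check the sizes quoted above (all values come from the syslog).
native_bytes=500107862016
hpa_bytes=500106780160
hidden_bytes=$(( native_bytes - hpa_bytes ))
hidden_sectors=$(( hidden_bytes / 512 ))
echo "$hidden_bytes bytes ($hidden_sectors sectors) hidden by the HPA"
# partition start (63) + partition size should land exactly on the native size
echo "partition fit: $(( 63 + 976773105 )) of 976773168 native sectors"
```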

 

Apparently, the older Linux kernel did not look at the disparity between the partition size and the reported disk size; the newer kernel does.  It is trying to help by using the full size of the disk when it detects that the first partition extends beyond the artificially shortened physical disk but fits in the actual physical disk.  It does that here:

Jan  1 12:00:23 Tower kernel: hdc: Host Protected Area detected.

Jan  1 12:00:23 Tower kernel: ^Icurrent capacity is 976771055 sectors (500106 MB)

Jan  1 12:00:23 Tower kernel: ^Inative  capacity is 976773168 sectors (500107 MB)

Jan  1 12:00:23 Tower kernel: hdc: 976771055 sectors (500106 MB) w/16384KiB Cache, CHS=60801/255/63

Jan  1 12:00:23 Tower kernel: hdc: cache flushes supported

Jan  1 12:00:23 Tower kernel:  hdc: hdc1

Jan  1 12:00:23 Tower kernel: hdc: p1 size 976773105 exceeds device capacity, enabling native capacity

Jan  1 12:00:23 Tower kernel: hdc: detected capacity change from 500106780160 to 500107862016

 

The fix is either to get rid of the HPA and make sure the reiserfs is not corrupt in the larger space, or to keep the HPA, re-size the partition to fit the smaller space, and then check/fix the file-system to fit in the partition.

 

As far as the parity errors go.... I'm still not certain what is going on.  I'd expect them to be fixed the first time the check runs; over a million parity errors could occur once, but they should not occur a second time, as the first check should have corrected the parity.

 

You might send support@lime-technology.com an e-mail pointing Tom to this thread...  He may have ideas about the parity calcs.

In the interim, I'd revert back to the older version of unRAID you were using and see if the parity errors still occur.  Right now I'd make copies of any critical files on your server as I do not trust your ability to recover from a disk failure.

 

Basically, I think you need to:

1. disable the HPA creation going forward (BIOS config option/update)

2. permanently reset the HPA.  You can probably do that using the hdparm command.

3. ensure the partition is sized correctly (It probably is already, as it is being detected as the actual full size of the disk)

4. check the reiserfs file system to be certain it has no corruption.

5. check parity once more.

6. reboot, make sure BIOS does not put HPA on again, or on a different disk

7. re-check parity a final time.
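The middle steps above can be sketched as a script. This is only an outline: the device name, the native sector count, and the /dev/md1 mapping for disk1 are assumptions you must verify against your own syslog, and the DRY_RUN guard (my addition) just prints the commands instead of running them:

```shell
# Sketch of steps 2-4.  DRY_RUN=1 (the default) only prints each command.
# DEV, NATIVE, and the /dev/md1 mapping are assumptions -- verify them first.
DEV=${DEV:-/dev/hdc}
NATIVE=${NATIVE:-976773168}   # native capacity, from the syslog
DRY_RUN=${DRY_RUN:-1}

run() { if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi; }

run hdparm -N "p$NATIVE" "$DEV"     # step 2: reset the HPA permanently
run fdisk -l -u "$DEV"              # step 3: confirm the partition sizing
run reiserfsck --check /dev/md1     # step 4: file-system check (array stopped)
```

Only flip DRY_RUN to 0 once the printed commands match what the thread's diagnostics say your disks need.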

 

Joe L.

Link to comment

Thanks Joe.

I appreciate you looking at this. I am really lost on this one. I am very new to unRAID; I have only had it running for 6 months or so.

I sent an email to Tom and pointed him to this thread. I really hope I can salvage this and get my parity correct. I would hate to lose my data on 11 disks.

 

I have disabled the bios backup in the gigabyte bios settings. This definitely started with 4.5 final install here.

I really need a step by step procedure as I do not trust myself to go off and figure this one out.

God I hope Tom can help.

Link to comment

I would immediately go back to the version of unRAID that worked and ensure you once again have a proper parity build/check.

 

Then,

I would boot from a boot CD and remove the HPA. I used the Ultimate Boot Disk and one of its HD utilities the last time I had to do this.

Reboot unRAID and do a file system check on that drive.

Check the array again to ensure you're still getting a good parity check.

 

If the above works, then you are ready to upgrade again. There is a command to tell unRAID to do a parity check without doing any correcting. You can use this to test without hurting any data or the parity. So, if you did happen to lose a drive you could go back to the working version of unRAID and then replace it. This is why I think getting back to a system that has a good parity check is something you should do right away.
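For reference, on unRAID 4.x the md driver takes commands through /proc/mdcmd; my recollection is that a non-correcting check is started like this, but the exact keyword may differ by release, so treat it as a sketch to verify. The block below writes to a stand-in path so it can be tried safely off the server:

```shell
# Start a read-only (non-correcting) parity check on unRAID 4.x.
# MDCMD defaults to a stand-in file here; on the server it is /proc/mdcmd.
MDCMD=${MDCMD:-/tmp/mdcmd.demo}
echo "check NOCORRECT" > "$MDCMD"
# Progress can then be watched with: cat /proc/mdstat
```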

 

If you start to work through this and need help with any of the steps then let us know how you're making out. Someone can tell you the next thing to do.

 

One thing to note: if the motherboard is still writing the HPA on each boot, you'll get an error that it can only be changed once per power-on, or something like that, which means you will not be able to erase it.

 

Peter

 

Link to comment

Before you do anything, if you have any critical data that is irreplaceable on the server, make a backup copy elsewhere.

 

Then, a quick question... Did you add a boot parameter to config/syslinux.cfg in an attempt to work around an HPA on an earlier release of unRAID and forget it is there?  Perhaps the new 4.5 release is actually using it now.  I mention this because I see this line in the syslog:

Jan  1 12:00:23 Tower kernel: Kernel command line: initrd=bzroot rootdelay=10 libata.ignore_hpa=1 BOOT_IMAGE=bzimage

 

Apparently this parameter is being respected by the SATA driver for /dev/sdh, as seen here:

Jan  1 12:00:23 Tower kernel: ata7.00: HPA unlocked: 1953523055 -> 1953525168, native 1953525168

Jan  1 12:00:23 Tower kernel: ata7.00: ATA-8: WDC WD10EADS-00M2B0, 01.00A01, max UDMA/133

Jan  1 12:00:23 Tower kernel: ata7.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 31/32)

Jan  1 12:00:23 Tower kernel: ata7.00: configured for UDMA/133

Jan  1 12:00:23 Tower kernel: scsi 7:0:0:0: Direct-Access     ATA      WDC WD10EADS-00M 01.0 PQ: 0 ANSI: 5

Jan  1 12:00:23 Tower kernel: sd 7:0:0:0: [sdh] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)

Jan  1 12:00:23 Tower kernel: sd 7:0:0:0: [sdh] Write Protect is off

Jan  1 12:00:23 Tower kernel: sd 7:0:0:0: [sdh] Mode Sense: 00 3a 00 00

Jan  1 12:00:23 Tower kernel: sd 7:0:0:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

 

You also seem to have an HPA on one of your older PATA disks, /dev/hdc:

Jan  1 12:00:23 Tower kernel: hdc: Host Protected Area detected.

Jan  1 12:00:23 Tower kernel: ^Icurrent capacity is 976771055 sectors (500106 MB)

Jan  1 12:00:23 Tower kernel: ^Inative  capacity is 976773168 sectors (500107 MB)

Jan  1 12:00:23 Tower kernel: hdc: 976771055 sectors (500106 MB) w/16384KiB Cache, CHS=60801/255/63

Jan  1 12:00:23 Tower kernel: hdc: cache flushes supported

Jan  1 12:00:23 Tower kernel:  hdc: hdc1

Jan  1 12:00:23 Tower kernel: hdc: p1 size 976773105 exceeds device capacity, enabling native capacity

Jan  1 12:00:23 Tower kernel: hdc: detected capacity change from 500106780160 to 500107862016

Perhaps you can try without the added boot code in syslinux.cfg.  (you'll need to change it, then reboot)

 

Next, before you do anything to reset the HPA, it is best to know how the disk was partitioned. (Was it partitioned with the HPA in place, or without it?)  Does the partition end at the physical end of the disk, or at the artificial HPA end of the disk?

 

Here are several commands you can run to help us know how the disks are currently partitioned:

Type

sfdisk -g /dev/hdc

 

and

sfdisk -g /dev/sdh

 

and

blockdev --getsz /dev/hdc

 

and

blockdev --getsz /dev/sdh

 

and

fdisk -l -u /dev/hdc

 

and

fdisk -l -u /dev/sdh

 

and

od -x -A d /dev/hdc | head

and lastly

od -x -A d /dev/sdh | head

 

Those are all different ways of displaying the geometry of the drives and the current partitioning.  If the partitions are the correct size, then we can go forward and check the file-system for errors... otherwise, we need to fix the partitioning.  None of the above commands will modify the disk; all they do is read it.
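As an illustration of what to look for in that output, here is the comparison that matters: does partition 1 run past the HPA-clipped end of the device? The numbers below are the hdc values from this thread's syslog; on the server they would come from `fdisk -l -u` and `blockdev --getsz`:

```shell
# Does partition 1 run past the (HPA-clipped) end of the device?
start=63                 # partition start sector, from fdisk -l -u
size=976773105           # partition size in sectors, from fdisk -l -u
dev_sectors=976771055    # clipped device size, from blockdev --getsz
part_end=$(( start + size - 1 ))
if [ "$part_end" -ge "$dev_sectors" ]; then
  echo "partition ends at $part_end, past device end $(( dev_sectors - 1 ))"
fi
```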

 

As a double-check, you can download, unzip, and invoke a script I wrote for the other thread to examine a disk's partition table to see if it is correctly partitioned.

You can find the script attached to this post:  http://lime-technology.com/forum/index.php?topic=5072.msg47122#msg47122

you would run it on your disk as follows:

unraid_partition_disk.sh  /dev/sdh

and

unraid_partition_disk.sh  /dev/hdc

 

The unraid_partition_disk.sh script makes no change to the disk unless you request it to by using the "-p" option.  If you use the "-p" option it only re-creates the partition table and MBR if you respond to a "are you sure" prompt, so it is also safe to use on your existing disk as shown in the example above (without the -p option)    Do not make any changes to the disks until after you first report back on their current partitioning.  (in other words, forget there is a "-p" option for now  ;))

 

This post describes how to use the hdparm command to reset an HPA.  

http://lime-technology.com/forum/index.php?topic=5072.msg46903#msg46903

 

Obviously, you'll need to use the correct value for your drives, but you do not need to use another distribution or boot up a different OS.

For you to see the current native size and HPA you would type:

hdparm -N /dev/sdh

 

To set the disk to use its full native size you would use:

hdparm -N p1953525168 /dev/sdh

(the "1953525168" is the native size as reported in your syslog for that disk. Preceding it with a "p" is the syntax for the hdparm command to make the change permanent. )

 

followed by

hdparm -N /dev/sdh

to see if it worked.

 

For your smaller /dev/hdc disk, the hdparm command to reset the HPA would be:

hdparm -N p976773168 /dev/hdc

 

followed by

hdparm -N /dev/hdc

to see if it was effective.  You can read about the hdparm command here: http://lime-technology.com/forum/index.php?topic=4194   Please verify the numbers I've given against the output of an initial hdparm -N on each drive.
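When reading the hdparm -N output, what matters is whether the current sector count matches the native one. The output format varies between hdparm versions, so the sample line and the parsing below are illustrative only, assuming the common "max sectors = current/native" form:

```shell
# Parse a (hypothetical) `hdparm -N` output line.  The format differs between
# hdparm versions, so adjust the pattern to what your version actually prints.
line=" max sectors   = 976771055/976773168, HPA is enabled"
cur=${line#*= }; cur=${cur%%/*}    # current (possibly clipped) sector count
nat=${line#*/}; nat=${nat%%,*}     # native sector count
if [ "$cur" != "$nat" ]; then
  echo "HPA active: $cur of $nat sectors visible"
fi
```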

 

If the HPA was able to be removed/reset to full size of the drive, then a reboot might get you to where all the disks will mount.  (there still might be corruption of the file-system, so first priority is to get the disk to report its full size.)

 

So... there are some preliminary steps you can take to get rid of the HPA on /dev/sdh.   Before you do anything, if you have any critical data that is irreplaceable on the server, make a backup copy elsewhere.    

 

As already mentioned, if you can remove the ignore_hpa boot code and proceed on 4.5 without any issues, do that first... post another syslog.   If need be, revert back to your older release and get to where parity checks are clean (although you'll probably need to do them twice, first to set things correctly, second to verify)   We really want to get to where we do not suspect any hardware issues.

 

If it makes you feel any better, the million parity errors are probably all in the HPA area of the disk (that last million or so bytes it is reserving) and would not affect your recovery of a failed disk... but the fact that they do not go away sure doesn't make me feel comfortable... That still has me stumped.

 

Joe L.

Link to comment

This is my syslinux.cfg file:

 

default menu.c32

menu title Lime Technology LLC

prompt 0

timeout 50

label unRAID OS

  menu default

  kernel bzimage

  append initrd=bzroot rootdelay=10 libata.ignore_hpa=1

label Memtest86+

  kernel memtest

 

Link to comment

It was added a while ago when I saw some responses to the HPA issue with Gigabyte motherboards.

 

So my first step should be the following:

 

I should change my syslinux.cfg file to the following and reboot

 

default menu.c32

menu title Lime Technology LLC

prompt 0

timeout 50

label unRAID OS

 menu default

 kernel bzimage

 append initrd=bzroot rootdelay=10

label Memtest86+

 

I want to do one thing at a time and then report back. If this is the first thing I should do, let me know and I will do it and post a syslog upon a reboot.

Thx

Link to comment

I would FIRST run all the commands I gave to print the existing partitioning on those two drives.  I'd hate for an HPA to be recognized and mess you up in some other way (corrupted file-system?), since I'm certain your syslog is saying the partition goes to the physical end of the drive, not the artificially smaller end made by the presence of the HPA.

 

Joe L.

Link to comment

Then, I'd revert back to the earlier version of unRAID, leaving the syslinux.cfg as it is for now, just to be sure the newer 4.5 version is not causing your issues.

 

Do you remember adding the ignore_hpa boot code?  Was it on the older release?

 

Joe L.

Link to comment

Archived

This topic is now archived and is closed to further replies.

