Author Topic: Bit rot in unRAID  (Read 3505 times)

Offline abs0lut.zer0

  • Sr. Member
  • ****
  • Posts: 263
Bit rot in unRAID
« on: June 17, 2012, 01:51:58 AM »
http://lime-technology.com/forum/index.php?topic=20612.msg182945#msg182945

Bit rot is discussed from this message onwards in that thread...


I just want to make a post where we can bring it out and discuss it. It is something I have never fully understood, and the implications are quite severe.

I know that, over time, a lack of disk access can cause it, in addition to all the factors I have read about here:

http://en.wikipedia.org/wiki/Bit_rot

Some people use MD5 checksums and other exotic means to try to avoid it.
Is there anything that is easily done? The aforementioned solutions seem really impractical to me, as they take vast amounts of time.

This is not a comparison by any means, because unRAID is unique... but I would like to ask how ZFS claims that this problem does not occur.

Thanks to anyone who replies.
~Terok Nor~ • unRAID 5.0.5 • Supermicro X8SIL-F • i5 650 @3.2GHz • 8GB Apacer DDR3 1333  • 1xWD20EARX •1xWD20EARS • 1xST32000542AS • Parity ST3000DM001 • 650w

Offline tyrindor

  • Hero Member
  • *****
  • Posts: 573
Re: Bit rot in unRAID
« Reply #1 on: June 17, 2012, 02:35:51 AM »
I don't really think it's an issue. It happens on CDs/DVDs/etc. too, and I have CDs from the day they came out that still work like new. If I were running a server where 1 corrupted byte could get my company sued or bankrupted, then I'd worry about it, but in 99.99% of cases 1 corrupted byte won't even be noticeable. I've been using electronics and massive storage servers for about 20 years and I've never experienced bit rot.

« Last Edit: June 17, 2012, 02:39:48 AM by tyrindor »
CPU: Intel i3-2120 (3.3GHz)
Motherboard: Supermicro MBD-X9SCM-F-O
Memory: 8GB Kingston (DDR3 1333)
SAS Cards: 3x Supermicro AOC-SAS2LP-MV8
Power Supply: Corsair AX850
Case: Norco 4224

Offline UhClem

  • Full Member
  • ***
  • Posts: 180
Re: Bit rot in unRAID
« Reply #2 on: June 17, 2012, 11:38:09 AM »
Moderator:
Please consider merging the referenced (above, in OP) section of the other (very generic) thread into this (specific, and aptly titled) thread.

Thanks.

(This is a very worthwhile topic that should not be diluted, or worse, go unnoticed.)

Offline boof

  • Hero Member
  • *****
  • Posts: 746
Re: Bit rot in unRAID
« Reply #3 on: June 18, 2012, 01:06:58 AM »
This is not a comparison by any means, because unRAID is unique... but I would like to ask how ZFS claims that this problem does not occur.

ZFS does 'end to end' checksumming of the data and metadata blocks it writes to disk.

If, on read back, it finds that the checksum of the block it has just read disagrees with the previously stored checksum, it can attempt to fix it, either by reconstructing that block using a raidz parity rebuild for that individual block (presuming this problem is occurring in a raidz pool!) or by going to another copy of the block if you have replication enabled. There may also be something it can do based on its copy-on-write methodology - I don't know how long it keeps 'old' copies of data around once it has written a new version, or if it even tracks this internally.

I'm not convinced how infallible this protection is - it still needs to be able to reconstruct the block and I would presume in the (unlikely?) event where that particular block of data has problems on multiple disks it won't be able to reconstruct. And if you're only running zfs on a single disk or collections of single disks with no sort of replication enabled or parity based recovery possible - all it can do is warn you a checksum has failed.
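
As a rough illustration of the read path described above, here is a toy sketch in Python. It is not ZFS code: the `MirroredStore` class, the two-copy layout and the choice of SHA-256 are all assumptions made for the example. It only shows the idea of verifying a stored checksum on read, repairing from a good copy when one exists, and being limited to a warning when none does.

```python
import hashlib


class MirroredStore:
    """Toy two-copy block store that keeps one checksum per block (hypothetical)."""

    def __init__(self):
        self.copies = [{}, {}]   # two independent "disks": block id -> bytes
        self.checksums = {}      # block id -> SHA-256 hex digest, kept separately

    def write(self, block_id, data):
        self.checksums[block_id] = hashlib.sha256(data).hexdigest()
        for disk in self.copies:
            disk[block_id] = data

    def read(self, block_id):
        expected = self.checksums[block_id]
        for disk in self.copies:
            data = disk[block_id]
            if hashlib.sha256(data).hexdigest() == expected:
                # Good copy found: rewrite it over any copy that differs.
                for other in self.copies:
                    if other[block_id] != data:
                        other[block_id] = data
                return data
        # Every copy fails its checksum: the error is detected but not repairable.
        raise IOError(f"block {block_id!r}: checksum mismatch on all copies")


store = MirroredStore()
store.write("b0", b"movie data")
store.copies[0]["b0"] = b"movie dat\x00"   # simulate silent corruption on one disk
recovered = store.read("b0")               # served from the intact copy
```

With only one copy in `self.copies`, the same `read` could do nothing more than raise the error, which matches the single-disk case mentioned above.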

So 'having zfs' as a filesystem I don't think inherently protects you from this. You still need to be careful and appreciate there may be edge cases.

This is all just my understanding though, I could be very wrong.

I'd be more worried, personally, about bad hardware causing data corruption than about bit rot on the disks over time. ZFS may not protect you from this: if the data is corrupted before it's written to the filesystem, then the checksum will still be correct - just for the corrupted data.

In short I'd be kitting out with ECC ram and enterprise kit as a priority before I relied on ZFS to save me. Though I appreciate ZFS could (if you were planning on using it anyway) be a quick and easy 'rude not to' layer of protection. In my own experience I've never (noticeably) had any problems with this sort of corruption so I don't bother with any of it and just have a decent backup methodology in place including verification and versioning of data. But as drive densities increase and the overall amount of data I store increases I may change my approach but likely only as a future result of being bitten hard by the problem.

I'd be very interested in any case studies or papers where people have prodded at ZFS' recovery from bitrot.

Offline abs0lut.zer0

  • Sr. Member
  • ****
  • Posts: 263
Re: Bit rot in unRAID
« Reply #4 on: June 18, 2012, 03:14:29 AM »
Moderator:
Please consider merging the referenced (above, in OP) section of the other (very generic) thread into this (specific, and aptly titled) thread.

Thanks.

(This is a very worthwhile topic that should not be diluted, or worse, go unnoticed.)

+1

Offline bubbaQ

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 3204
Re: Bit rot in unRAID
« Reply #5 on: June 18, 2012, 01:32:23 PM »
I am in the business where a single bit error can be very destructive.... a single bit error can get million-dollar court judgments reversed, or even mean someone goes to jail... or not.  That is why MD5 and SHA-1 hashes are done on all evidence files.

I have processed hundreds of terabytes of data... some stored for a decade or more.... and never had a hash value change.  I have never seen, or heard of from a reliable source, any "bit rot" from a hard drive (optical media, yes).  I've heard of corruption of a file at the file system level due to known hazards (and a diff of the two files shows it was corruption, and not bit rot), but never "bit rot" of a random bit flipping, and when you consider modern drives' ECC, you realize that bit rot is not likely to manifest itself even if it happened.

Worrying about bit rot instead of MUCH more likely issues is like walking around with a hard hat in case some random piece of fascia falls off a building, but then crossing against the light on a busy street.

Offline Joe L.

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 18823
Re: Bit rot in unRAID
« Reply #6 on: June 18, 2012, 02:12:45 PM »
I am in the business where a single bit error can be very destructive.... a single bit error can get million-dollar court judgments reversed, or even mean someone goes to jail... or not.  That is why MD5 and SHA-1 hashes are done on all evidence files.

I have processed hundreds of terabytes of data... some stored for a decade or more.... and never had a hash value change.  I have never seen, or heard of from a reliable source, any "bit rot" from a hard drive (optical media, yes).  I've heard of corruption of a file at the file system level due to known hazards (and a diff of the two files shows it was corruption, and not bit rot), but never "bit rot" of a random bit flipping, and when you consider modern drives' ECC, you realize that bit rot is not likely to manifest itself even if it happened.

Worrying about bit rot instead of MUCH more likely issues is like walking around with a hard hat in case some random piece of fascia falls off a building, but then crossing against the light on a busy street.
With all due respect, I tend to agree... but... I can remember at least two or three instances of a specific disk drive returning no error, but a different checksum, when the same set of blocks was read over the years.  They caused intermittent, seemingly random parity errors.

Now, there are roughly 10,000 members of the lime-tech forum... If you figure three disks per member, that is representative of at least 30,000 disks.   To me, two or three instances among roughly 30,000 disks says that about one out of 10,000 disks might, over its lifetime, exhibit the behavior we are concerned about.    (Not every unRAID owner is a member, and most probably have more than 3 disks, but it is as good a starting point as any for guessing the population of disks we are talking about.)

I personally think that on rare occasions bits pass the ECC check when read from the disk, and pass the CRC checks when transmitted to the disk controller, but flip state while in the cache RAM of the disk drive.   That is not something that is easy to detect.  If you are just playing a movie, or music, you'll likely never notice a single bit error.

If we did not perform periodic parity checks where EVERY bit of EVERY disk is read, the odds of detecting these inconsistent reads from disk-cache RAM would be slim.   In fact, I would suspect power supply issues, or failing filter capacitors in the disk electronics, for most inconsistent bit errors when read from cache RAM on the disk electronics  (induced by noise on the power supply to the disks from other disks).  It is possible for a manufacturer to set a parity bit on the cache (use ECC RAM?) but I've not read of any who do on the drives themselves.  I suppose that if one did, it might be a small and selective market. If you then consider that most PCs only have one disk, and are less likely to experience noise from other disks simultaneously seeking, you'll understand that most disks would never exhibit symptoms, even if marginal electronics are present.

Nothing except MD5 and SHA-1 checksums (or something similar) will verify that the file read is identical to the file written.
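
For anyone who wants to try the checksum approach, the following is a minimal sketch in Python using the standard `hashlib` module. The function names (`file_digest`, `build_manifest`, `changed_files`) and the in-memory dictionary are assumptions for illustration, not an existing unRAID tool; a real setup would save the manifest somewhere other than the disk being checked.

```python
import hashlib
import os


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in chunks so large media files need not fit in RAM."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its digest."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_digest(full)
    return manifest


def changed_files(root, manifest):
    """Return relative paths whose current digest differs from the manifest.

    Files that have been deleted since the manifest was built are not
    handled here and would raise FileNotFoundError.
    """
    return sorted(
        rel for rel, digest in manifest.items()
        if file_digest(os.path.join(root, rel)) != digest
    )
```

Building the manifest once after writing the files and running `changed_files` later reports any file whose contents no longer match, whatever the cause of the change.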

Joe L.
« Last Edit: June 18, 2012, 02:16:56 PM by Joe L. »

Offline chickensoup

  • Sr. Member
  • ****
  • Posts: 430
Re: Bit rot in unRAID
« Reply #7 on: June 18, 2012, 08:01:04 PM »
Moderator:
Please consider merging the referenced (above, in OP) section of the other (very generic) thread into this (specific, and aptly titled) thread.

I apologise if my previous thread was not "aptly titled" although the conversation did go off topic :P Feel free to move the bit rot discussion to this thread, it's where it belongs.

I do agree that this is something that, however unlikely, should be addressed, but perhaps not directly as "bit rot"; more as "invalid bit/s" or "bad checksums." Regardless of what causes the data to change unexpectedly, unRAID should be able to detect this change one way or another. I know this is what the parity drive does to some extent, but there are plenty of cases on the forum where relying on parity can get you into trouble (not intentionally starting the corrective vs non-corrective p/check discussion, but..)

Has Tom ever mentioned a filesystem change down the track? Considering Reiser is unlikely to be developed or well supported in the long term, should we really be weighing up the pros & cons of a new FS?
Link to my build:
Core i3 540 | Gigabyte H57M-USB3 | 2GB DDR3 | Custom modified chassis | Corsair VX550W 80 PLUS PSU
1x 500GB Seagate 7200 (Cache) | 6x 2TB Seagate LP 5900 | 1x 1TB WD GP 5400 | 1x 1TB Seagate 7200
Total Storage: Raw 14.5TB, Redundant 12TB, Formatted 11.06TB

Offline BRiT

  • Hero Member
  • *****
  • Posts: 2934
    • WTF.com
Re: Bit rot in unRAID
« Reply #8 on: June 18, 2012, 08:40:58 PM »
There were talks of unRAID becoming filesystem agnostic, opening things up to allow users to select whatever filesystem they want to use. I don't know if that was strictly for the cache drive, but I imagine there are some practical restrictions for drives in the array. The likely restriction is that the filesystem must support growing with the disk, for instances where someone is rebuilding a failed disk onto a larger replacement disk.

Offline Joe L.

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 18823
Re: Bit rot in unRAID
« Reply #9 on: June 19, 2012, 05:18:02 AM »
Has Tom ever mentioned a filesystem change down the track? Considering Reiser is unlikely to be developed or well supported in the long term, should we really be weighing up the pros & cons of a new FS?
He talked about why he chose reiserfs... and stated reasons why others were not as easy to use.

The functionality required is:
ability to be re-sized in place.
stable


The desired functionality is:
no need to specify number of "inodes" up front (number of directory entries)
reasonable performance with large files.
not wasteful of space.
journaling, to prevent data loss if power loss, etc.


As far as reiserfs vs. another...    I've not had experience attempting to recover other file-systems when a user accidentally clobbered them, but it is amazing how much can be recovered by reiserfsck.  For that reason alone, I see no reason to replace it.    It just needs to store the files.  When unRAID was initially developed, there was no read-write NTFS driver.  The NTFS driver was read-only.  Today, I would guess it would have been an alternative choice, as there is an ntfs-3g driver that would probably work as well.

As far as bit rot...  if there is a read error on a disk, unRAID (in combination with SMART firmware on the disks) is somewhat self-repairing.  The un-readable sector is re-constructed from the other disks, then sent to the process reading it, and ALSO re-written to the disk where the read failed.  The SMART firmware can then re-allocate the sector if needed.   This does not solve the issue with bits flipping in RAM, but does handle the far more common mechanical failures of disk platters.
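
The reconstruction step can be sketched in a few lines of Python. This assumes simple single-parity XOR across the data disks and uses made-up three-byte "sectors"; `xor_blocks` is a hypothetical helper for the example, not part of the unRAID driver.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# The same sector position on three data disks, plus the parity for it.
data_disks = [b"\x10\x20\x30", b"\x0f\x0e\x0d", b"\xaa\xbb\xcc"]
parity = xor_blocks(data_disks)

# Disk 1 reports a read error for this sector, so its contents are rebuilt
# from the parity and the same sector on every other disk.
survivors = [d for i, d in enumerate(data_disks) if i != 1]
rebuilt = xor_blocks(survivors + [parity])
```

This only works because the failing disk announces itself with a read error; the rebuilt bytes can then be handed to the reader and written back to that disk.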

Offline lionelhutz

  • Hero Member
  • *****
  • Posts: 3211
Re: Bit rot in unRAID
« Reply #10 on: June 19, 2012, 10:11:41 AM »
I suppose bad sectors could be considered bit rot as it's defined.

I have seen hardware issues posted here that caused bad data, but that wasn't really bitrot, or bad data caused by the HDD platter deteriorating over time. The bad data was caused by bad electronics not processing or transferring the bits correctly.

All the parity error issues posted seem to have been caused by things like those in the previous paragraph, or by things like the upgrade bug or hard-powering the server off. People with well-built, stable servers don't seem to have any issues with an unexplained parity error or two just randomly popping up every so often.

The other linked thread is wrong about RAID5. RAID5 can reconstruct from a bit error, since the data is stored in stripes, not disk by disk, and the stripe can be reconstructed. unRAID with the disk by disk protection could not reconstruct the error.


Offline Joe L.

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 18823
Re: Bit rot in unRAID
« Reply #11 on: June 19, 2012, 01:08:25 PM »
unRAID has exactly the same capabilities as RAID 5.
RAID 5 has no ability to repair that unRAID does not have.

Offline UhClem

  • Full Member
  • ***
  • Posts: 180
Re: Bit rot in unRAID
« Reply #12 on: June 19, 2012, 01:14:56 PM »
When I wrote "This is a very worthwhile topic ..."
I didn't mean that it was a common problem, and everybody had a good chance of being bitten. (But, it is grossly misunderstood by the vast majority.)

... That is why MD5 and SHA-1 hashes are done on all evidence files.

I have processed hundreds of terabytes of data... some stored for a decade or more.... and never had a hash value change.  I have never seen, or heard of from a reliable source, any "bit rot" from a hard drive (optical media, yes).  I've heard of corruption of a file at the file system level due to known hazards (and a diff of the two files shows it was corruption, and not bit rot), but never "bit rot" of a random bit flipping, and when you consider modern drives' ECC, you realize that bit rot is not likely to manifest itself even if it happened.
You and I are in a very small minority. (I've been maintaining MD5s on all my media files [~12TB] for the last 5 years.) I haven't had any bit rot either. But, I have caught errors, following disk-to-disk, and network, copying of file(s). For the vast majority of users, without a means of verifying the integrity of the destination copy, those errors would typically go undetected; if/when such an error does eventually get detected, it gets "blamed" on the disk drive it was read from, and incorrectly categorized as "bit rot".
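
A copy can be checked at the time it is made with something like the sketch below. The helper names `sha1_of` and `copy_verified` are assumptions for the example. One caveat: a read issued immediately after a copy may be served from the operating system's cache rather than from the destination disk, so a later re-check is still worthwhile.

```python
import hashlib
import shutil


def sha1_of(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()


def copy_verified(src, dst):
    """Copy src to dst, then re-read both files and compare their digests."""
    shutil.copyfile(src, dst)
    if sha1_of(src) != sha1_of(dst):
        raise IOError(f"copy of {src} to {dst} does not match the source")
    return dst
```

Keeping the source digest alongside the file afterwards gives a reference for any later check of the destination copy.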

But ... every (current era) disk drive has bit rot! It is unavoidable, and it was planned for in the design of the drive and its firmware. In almost every occurrence, the firmware/ECC detects AND corrects the "bit rot" and it never manifests in the real world. In those rare, but (wince!) dreaded cases where firmware/ECC detects but CAN NOT correct the error (serious rot), the drive issues the UCE (UnCorrectableError). Before issuing the UCE, the drive will make several (10+) attempts to get a correctable read. And, the kernel driver will make a couple of re-tries on the UCE it does get.

The super-elusive (and, maybe, mythologically apocryphal) case is where the "bit rot" is such, that, when ECC is applied to it, it (appears to) correct it, but actually produces data different from the original.  [In NerdSpeak: a "Carl Sagan" "hash collision".] Probably as likely as three albinos, on separate continents, each winning their nation's lottery on the same day.

To [begin to] appreciate the complexities in modern disk drives (magnetic recording; not SSD), consider that there are  several hundred data tracks packed (onto each surface) within the width of a single human hair!!  (... my head hurts :) ...) That's about 200-300K tracks per surface on today's 3.5" drive. [Historical note: 40 years ago, there were about 400 tracks per surface on a 14" drive! [and only 10 sectors per track] (my first Unix driver :))]



Offline Joe L.

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 18823
Re: Bit rot in unRAID
« Reply #13 on: June 19, 2012, 02:16:12 PM »
[Historical note: 40 years ago, there were about 400 tracks per surface on a 14" drive! [and only 10 sectors per track] (my first Unix driver :))]
And you could sprinkle magnetic developer powder on the disk and read the tracks and bit patterns with a magnifying glass... 
http://en.wikipedia.org/wiki/Magnetic_developer

Those days are long gone.   Today you need far better tools.  (If your eyes are still good enough  :o)
« Last Edit: June 19, 2012, 02:34:20 PM by Joe L. »

Offline lionelhutz

  • Hero Member
  • *****
  • Posts: 3211
Re: Bit rot in unRAID
« Reply #14 on: June 19, 2012, 04:21:02 PM »
unRAID has exactly the same capabilities as RAID 5.
RAID 5 has no ability to repair that unRAID does not have.

Well, I should have read up on RAID5 again. RAID5 doesn't use the parity data when reading back a disk, so it wouldn't automatically recover from bit rot. However, if it did use the parity on reads it could recover from bit rot. I wonder if any of the better controllers could be set to check parity on reads?
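
One caveat on checking parity during reads, sketched in Python under the assumption of plain single-parity XOR (the two-byte "stripes" and the `xor_blocks` helper are made up for the example): when a bit flips without the drive reporting an error, parity shows that the stripe is inconsistent, but it does not say which disk holds the bad byte.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


disks = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]
parity = xor_blocks(disks)

# Flip one bit on disk 1 with no read error reported by the drive.
disks[1] = bytes([disks[1][0] ^ 0x01]) + disks[1][1:]

# XOR of all data plus parity is all zeros for a consistent stripe.
mismatch = xor_blocks(disks + [parity])
detected = any(mismatch)

# "Assume disk i is the bad one and rebuild it" gives a consistent stripe
# for every i, so parity alone cannot pick the right candidate.
candidates = []
for i in range(len(disks)):
    others = [d for j, d in enumerate(disks) if j != i]
    candidates.append(xor_blocks(others + [parity]))
```

Here only `candidates[1]` restores the original data; the other two would "repair" a healthy disk with wrong bytes. Locating the bad disk needs extra information such as a read error from the drive or a per-block or per-file checksum.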