Parity Check Shows 58 Sync Errors, What Now?



I haven't run a parity check in 60 days so I ran a NOCORRECT and it came back with 58 sync errors. Should I go ahead and run the check again to correct the errors or is this something to worry about? I don't have the log from the parity check since I normally turn the server off overnight.

 

unRAID 4.7

 

Pastebin Syslog

 

SMART Drive Logs

Run it again in NOCORRECT mode. If the SAME addresses show in the syslog, perform a correcting sync.

If different addresses show, DO NOT perform a correcting check. Instead, try to find the hardware that is returning random bits at times.  Start with an overnight memory test.  Then run tests on each disk, getting checksums of ranges of blocks.  Look for a disk that occasionally returns random results.
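
One way to do that address comparison without eyeballing two syslogs is sketched below. It assumes each mismatch is logged as a line containing "parity incorrect: <sector>" (check your own syslog and adjust the pattern if it differs) and that you saved a copy of the syslog after each NOCORRECT run. The function names and file paths are made up for illustration.

```shell
# Print the sorted, unique sector addresses reported in one saved syslog.
# ASSUMPTION: each mismatch is logged as "... parity incorrect: <sector>".
parity_addresses() {
    sed -n 's/.*parity incorrect: *\([0-9][0-9]*\).*/\1/p' "$1" | sort -u
}

# Compare the addresses from two saved syslogs.
# Prints "none" (no errors in either), "same" or "different".
compare_parity_runs() {
    run_a=$(parity_addresses "$1")
    run_b=$(parity_addresses "$2")
    if [ -z "$run_a" ] && [ -z "$run_b" ]; then
        echo "none"
    elif [ "$run_a" = "$run_b" ]; then
        echo "same"
    else
        echo "different"
    fi
}

# Example (paths are placeholders):
#   compare_parity_runs /boot/syslog-run1.txt /boot/syslog-run2.txt
```

"same" would point toward a correcting sync; "different" points toward the hardware hunt described above.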


Well I ran the parity check overnight and this morning it's reporting no (0) sync errors. In the past I've run NOCORRECT and then a regular check and it would correct a different number of errors as well. What would cause that?

 

I would run memtest for as long as possible. I usually just run it overnight for overkill. If there's a single error in memtest then the RAM is faulty.



I ran memtest for 48+ hours when I originally built the server with 0 errors.

 

What do you think Joe?

That was then... this is now...  you did not have errors then... you do now....

So I downloaded prime95 to try and force a memory error and got this error:

 

Feb  4 17:09:43 unRAID login[4944]: ROOT LOGIN  on `pts/0' from `Ken-Windows7' (Logins)
Feb  4 17:11:11 unRAID kernel: mprime invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0 (Minor Issues)
Feb  4 17:11:11 unRAID kernel: Pid: 4992, comm: mprime Not tainted 2.6.32.9-unRAID #8 (Errors)
Feb  4 17:11:11 unRAID kernel: Call Trace: (Errors)
Feb  4 17:11:11 unRAID kernel:  [<c104ab61>] oom_kill_process+0x59/0x1cd (Errors)
Feb  4 17:11:11 unRAID kernel:  [<c104afb9>] __out_of_memory+0xef/0x102 (Errors)
Feb  4 17:11:11 unRAID kernel:  [<c104b02a>] out_of_memory+0x5e/0x83 (Errors)
Feb  4 17:11:11 unRAID kernel:  [<c104cfe9>] __alloc_pages_nodemask+0x375/0x42f (Errors)
Feb  4 17:11:11 unRAID kernel:  [<c1059686>] handle_mm_fault+0x254/0x8f1 (Errors)
Feb  4 17:11:11 unRAID kernel:  [<c129f124>] ? schedule+0x691/0x72f (Errors)
Feb  4 17:11:11 unRAID kernel:  [<c1017050>] do_page_fault+0x17c/0x1e4 (Errors)
Feb  4 17:11:11 unRAID kernel:  [<c1016ed4>] ? do_page_fault+0x0/0x1e4 (Errors)
Feb  4 17:11:11 unRAID kernel:  [<c12a07ce>] error_code+0x66/0x6c (Errors)
Feb  4 17:11:11 unRAID kernel:  [<c1016ed4>] ? do_page_fault+0x0/0x1e4 (Errors)
Feb  4 17:11:11 unRAID kernel: Mem-Info:
Feb  4 17:11:11 unRAID kernel: DMA per-cpu:
Feb  4 17:11:11 unRAID kernel: CPU    0: hi:    0, btch:   1 usd:   0
Feb  4 17:11:11 unRAID kernel: Normal per-cpu:
Feb  4 17:11:11 unRAID kernel: CPU    0: hi:  186, btch:  31 usd: 168
Feb  4 17:11:11 unRAID kernel: HighMem per-cpu:
Feb  4 17:11:11 unRAID kernel: CPU    0: hi:  186, btch:  31 usd: 164
Feb  4 17:11:11 unRAID kernel: active_anon:354816 inactive_anon:74677 isolated_anon:0
Feb  4 17:11:11 unRAID kernel:  active_file:8 inactive_file:14 isolated_file:0
Feb  4 17:11:11 unRAID kernel:  unevictable:54391 dirty:0 writeback:0 unstable:0
Feb  4 17:11:11 unRAID kernel:  free:12293 slab_reclaimable:851 slab_unreclaimable:1810
Feb  4 17:11:11 unRAID kernel:  mapped:1521 shmem:24 pagetables:965 bounce:0
Feb  4 17:11:11 unRAID kernel: DMA free:7984kB min:64kB low:80kB high:96kB active_anon:5576kB inactive_anon:2304kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15768kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:8kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Feb  4 17:11:11 unRAID kernel: lowmem_reserve[]: 0 867 1982 1982
Feb  4 17:11:11 unRAID kernel: Normal free:40568kB min:3732kB low:4664kB high:5596kB active_anon:706968kB inactive_anon:60672kB active_file:24kB inactive_file:36kB unevictable:17460kB isolated(anon):0kB isolated(file):0kB present:887976kB mlocked:0kB dirty:0kB writeback:0kB mapped:188kB shmem:0kB slab_reclaimable:3404kB slab_unreclaimable:7240kB kernel_stack:616kB pagetables:1484kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:100 all_unreclaimable? no
Feb  4 17:11:11 unRAID kernel: lowmem_reserve[]: 0 0 8924 8924
Feb  4 17:11:11 unRAID kernel: HighMem free:620kB min:512kB low:1712kB high:2912kB active_anon:706720kB inactive_anon:235732kB active_file:8kB inactive_file:20kB unevictable:200104kB isolated(anon):0kB isolated(file):0kB present:1142312kB mlocked:0kB dirty:0kB writeback:0kB mapped:5896kB shmem:96kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:2368kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:133 all_unreclaimable? no
Feb  4 17:11:11 unRAID kernel: lowmem_reserve[]: 0 0 0 0
Feb  4 17:11:11 unRAID kernel: DMA: 0*4kB 2*8kB 2*16kB 2*32kB 1*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 7984kB
Feb  4 17:11:11 unRAID kernel: Normal: 104*4kB 109*8kB 73*16kB 29*32kB 23*64kB 7*128kB 12*256kB 16*512kB 5*1024kB 3*2048kB 3*4096kB = 40568kB
Feb  4 17:11:11 unRAID kernel: HighMem: 17*4kB 5*8kB 2*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 620kB
Feb  4 17:11:11 unRAID kernel: 54460 total pagecache pages
Feb  4 17:11:11 unRAID kernel: 0 pages in swap cache
Feb  4 17:11:11 unRAID kernel: Swap cache stats: add 0, delete 0, find 0/0
Feb  4 17:11:11 unRAID kernel: Free swap  = 0kB
Feb  4 17:11:11 unRAID kernel: Total swap = 0kB
Feb  4 17:11:11 unRAID kernel: 515649 pages RAM
Feb  4 17:11:11 unRAID kernel: 287827 pages HighMem
Feb  4 17:11:11 unRAID kernel: 5487 pages reserved
Feb  4 17:11:11 unRAID kernel: 4837 pages shared
Feb  4 17:11:11 unRAID kernel: 495608 pages non-shared
Feb  4 17:11:11 unRAID kernel: Out of memory: kill process 4990 (mprime) score 57106 or a child (Errors)
Feb  4 17:11:11 unRAID kernel: Killed process 4990 (mprime) (Errors)
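
Worth noting when reading that trace: it is the kernel's out-of-memory killer stopping mprime ("Out of memory: kill process 4990 (mprime)", with "Total swap = 0kB"), not a report of a bad memory cell. The page counts in the dump can be turned into megabytes with a bit of arithmetic. This is a small sketch that assumes 4 kB pages (which matches the "4kB" entries in the free lists above); the helper name is made up.

```shell
# Convert a page count from the Mem-Info dump to MiB.
# ASSUMPTION: 4 kB pages unless a second argument says otherwise.
pages_to_mib() {
    echo $(( $1 * ${2:-4} / 1024 ))
}

# "515649 pages RAM"               -> pages_to_mib 515649
# RAM minus "5487 pages reserved"  -> pages_to_mib $((515649 - 5487))
```

The first call gives 2014 and the second 1992, which is roughly the 2 GB installed and lines up with the total that free -m reports on this box.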

 

So I did what Nick suggested and this is the result of his commands:

 

ps -e -orss=,args= | sort -b -k1,1n | pr -TW$COLUMNS

    0 [aio/0]
    0 [async/mgr]
    0 [ata/0]
    0 [ata_aux]
    0 [bdi-default]
    0 [events/0]
    0 [kacpi_hotplug]
    0 [kacpi_notify]
    0 [kacpid]
    0 [kblockd/0]
    0 [khelper]
    0 [khubd]
    0 [kseriod]
    0 [kslowd000]
    0 [kslowd001]
    0 [ksoftirqd/0]
    0 [ksuspend_usbd]
    0 [kswapd0]
    0 [kthreadd]
    0 [mdrecoveryd]
    0 [migration/0]
    0 [nfsiod]
    0 [reiserfs/0]
    0 [rpciod/0]
    0 [scsi_eh_0]
    0 [scsi_eh_1]
    0 [scsi_eh_2]
    0 [scsi_eh_3]
    0 [scsi_eh_4]
    0 [scsi_eh_5]
    0 [scsi_eh_6]
    0 [spinupd]
    0 [spinupd]
    0 [spinupd]
    0 [spinupd]
    0 [spinupd]
    0 [spinupd]
    0 [sync_supers]
    0 [unraidd]
    0 [usb-storage]
    0 [usbhid_resumer]
  304 init
  384 /usr/sbin/atd -b 15 -l 1
  420 /usr/sbin/klogd -c 3 -x
  424 logger -tunmenu -plocal7.info -is
  468 /usr/sbin/ifplugd -i eth0 -fwI -u0 -d10
  476 /sbin/rpc.portmap
  488 /bin/bash /boot/unmenu/uu
  516 /sbin/agetty 38400 tty3 linux
  520 /sbin/agetty 38400 tty1 linux
  520 /sbin/agetty 38400 tty2 linux
  520 /sbin/agetty 38400 tty4 linux
  520 /sbin/agetty 38400 tty5 linux
  520 /sbin/agetty 38400 tty6 linux
  532 /usr/sbin/inetd
  540 /usr/sbin/acpid
  588 /usr/sbin/syslogd -m0
  624 pr -TW80
  636 sort -b -k1,1n
  656 /usr/sbin/crond -l10
  680 /sbin/apcupsd
  692 ps -e -orss=,args=
  700 /sbin/rpc.statd
  752 in.telnetd: Ken-Windows7
  804 /sbin/udevd --daemon
1312 -bash
1612 /usr/local/sbin/shfs /mnt/user -o noatime,big_writes,allow_other,default_p
1648 /usr/local/sbin/emhttp
1728 /usr/sbin/smbd -D
1924 /usr/sbin/nmbd -D
3368 /usr/sbin/smbd -D
3768 awk -W re-interval -f ./unmenu.awk
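
If you want a single number out of a listing like that, a rough sketch is to sum the first column (RSS in kB, as printed by ps -orss=). The helper name is made up, and since RSS counts shared pages once per process, the total is an over-estimate rather than an exact figure.

```shell
# Sum the RSS column (kB) of `ps -e -orss=,args=` output read from stdin.
sum_rss_kb() {
    awk '{ total += $1 } END { print total + 0 }'
}

# Example:
#   ps -e -orss=,args= | sum_rss_kb
```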

 

 

free -m

             total       used       free     shared    buffers     cached
Mem:          1992        276       1716          0          3        212
-/+ buffers/cache:         59       1933
Swap:            0          0          0

 

Plugins:

apcupsd

bwm-ng

mail and ssmtp

unRAID Status Alert

unRAID Power-Down on disk overtemp

Clean Powerdown

screen

myMain


Is there a file "results.txt"?

 

No. There's a stress.txt in the root dir but no results.txt.

 

root@unRAID:~# ls
initconfig@   mdcmd*                  powerdown@   samba@       whatsnew.txt*
license.txt*  mprime*                 prime.txt    stress.txt*
local.txt     p95v279.linux32.tar.gz  readme.txt*  undoc.txt*
root@unRAID:~#

 


I ran two more parity NOCORRECTs and they both had a different number of errors that occurred in different spots. If it's not RAM it could be anything?

Correct, it could be anything, but most likely one of the disks.

See here for a test script and how to use it:

http://www.lime-technology.com/wiki/index.php/FAQ#How_To_Troubleshoot_Recurring_Parity_Errors

 

Since I'm getting different parity errors in different blocks should I just run the script for the entire size of the drive?

 

Also, I'm running long SMART tests on all my drives, but it says the results will show up in the "Smart-Status-Report". Where can I find that report?



Since I'm getting different parity errors in different blocks should I just run the script for the entire size of the drive?

That would take forever...  I'd just run it on a range of blocks that encompasses several failures (hoping there are several closely spaced).
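
A rough sketch of that kind of range test (this is not the wiki script, just the same idea): read the same range of blocks several times and compare the md5 sums. The function name, the 1 KiB block size and the device name in the usage line are assumptions for illustration, and how a logged parity address maps to a dd offset depends on the units your syslog uses, so work that out before picking the range.

```shell
# checksum_range DEVICE SKIP_BLOCKS COUNT_BLOCKS [PASSES]
# Reads COUNT_BLOCKS 1 KiB blocks starting at SKIP_BLOCKS, PASSES times
# (default 5), and reports whether every pass produced the same md5 sum.
checksum_range() {
    cr_dev=$1
    cr_skip=$2
    cr_count=$3
    cr_passes=${4:-5}
    cr_first=""
    cr_i=1
    while [ "$cr_i" -le "$cr_passes" ]; do
        cr_sum=$(dd if="$cr_dev" bs=1024 skip="$cr_skip" count="$cr_count" 2>/dev/null \
                 | md5sum | cut -d' ' -f1)
        echo "pass $cr_i: $cr_sum"
        if [ -z "$cr_first" ]; then
            cr_first=$cr_sum
        elif [ "$cr_sum" != "$cr_first" ]; then
            echo "MISMATCH on $cr_dev"
            return 1
        fi
        cr_i=$((cr_i + 1))
    done
    echo "consistent"
}

# Example (device and numbers are placeholders):
#   checksum_range /dev/sdb 1000000 4000000 5
```

Keep in mind that repeated reads of a small range can be served from the Linux page cache instead of the disk, so use a range larger than the installed RAM, or flush the cache between passes. Run it against each disk in turn; a disk that prints MISMATCH is the suspect.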

Also, I'm running long SMART tests on all my drives, but it says the results will show up in the "Smart-Status-Report". Where can I find that report?

They show up in the normal SMART report.  You just have to wait the appropriate time to see them.  They are near the bottom of the report and look like this:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Aborted by host          90%        18286            -
# 2  Short offline       Completed without error  00%        3041             -
# 3  Extended offline    Completed without error  00%        2802             -
# 4  Short offline       Completed without error  00%        1368             -

Don't forget to disable spin-down when performing a long test.  Otherwise you'll get an "Aborted by host" message.

"extended offline" = "long" test.

 

The disk will probably say to check in 255 minutes.  That is nowhere near enough time for today's disks... figure 3 or 4 hours for a long test to complete.
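
If you would rather check from the command line than scroll through the full report, here is a minimal sketch. It assumes smartctl is available on the server and that the self-test log looks like the sample above; the helper name and the device name are placeholders.

```shell
# Read `smartctl -l selftest` output on stdin and print the Status column
# of the most recent "Extended offline" (long) test. Prints nothing if the
# log has no extended test entry.
latest_long_test_status() {
    awk '/^# *[0-9]+ +Extended offline/ {
        line = $0
        sub(/^# *[0-9]+ +Extended offline +/, "", line)  # drop Num and description
        sub(/ +[0-9]+% .*$/, "", line)                   # drop Remaining, hours, LBA
        print line
        exit
    }'
}

# Example (device name is a placeholder):
#   smartctl -t long /dev/sdb        # start the long test
#   smartctl -l selftest /dev/sdb | latest_long_test_status
```

With the sample log above this prints "Completed without error". An "Aborted by host" result usually means the drive was spun down or otherwise interrupted mid-test.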
