Defragmenting ReiserFS



My disk I/O for newly-written files on my cache drive has ground to a halt. I wonder if it's because my drive filled up and became fragmented?

 

It's noticeable on files written to by nzbget -- I'll download 45GB of data and each unpacked rar will take an age to be read internally by the unRAID server or be copied to my Windows PC.

This doesn't happen with all files, only newly written ones -- I'm seeing a slowdown of over 8x. And it's not every rar. Weird.

 

If I unpack those files to my Windows PC then copy them back to the unRAID server, the unpacked files are read and written at normal speeds, so I don't really know what's going on.

 

Should I run a reiserfsck, or is there a way to check for fragmentation? I searched around, but only found defrag tools from 2005 or 2006. Someone did report file fragmentation of 30%, but that may have been an older version of ReiserFS -- what's in unRAID (4.7)?

 

Cheers,

 

Neil.

Link to comment

My disk I/O for newly-written files on my cache drive has ground to a halt.

Perhaps you are just now using a portion of the drive that has bad / marginal sectors? I've seen several drives with slow spots, some developed bad sectors, some just had really bad performance whenever that specific area was used. I'd run a long smart test and inspect the results.
Link to comment

My disk I/O for newly-written files on my cache drive has ground to a halt.

Perhaps you are just now using a portion of the drive that has bad / marginal sectors? I've seen several drives with slow spots, some developed bad sectors, some just had really bad performance whenever that specific area was used. I'd run a long smart test and inspect the results.

Good idea. Actually, I just looked at the drive and it has:

udma_crc_error_count=259

 

Which may indicate a bad connection. The server was moved a few weeks ago, so I'll check the cables and connections again.

Link to comment

Use badblocks in read-only mode to test every sector.

Then schedule a smartctl -t long test.
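
Something along these lines (the device name is just an example -- point it at your cache drive):

smartctl -t long /dev/sdb        # kick off the extended self-test (it runs inside the drive)
smartctl -l selftest /dev/sdb    # check progress and the result later
smartctl -a /dev/sdb             # full attribute and self-test report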

 

If you have bad sectors, you may need to reformat the drive with preclear or badblocks in write mode.

 

badblocks in write mode uses four patterns to test the drive: 0x55, 0xaa, 0xff, 0x00. If you have no more bad sectors after the 4-pass write-mode test, chances are you will be OK. Then you can do a preclear without the actual preclearing, i.e. just put the signature and partition on the drive.
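
As a rough sketch, that write-mode test would be run something like this (example device name and log path; -w destroys everything on the disk, so only use it on an empty, unassigned drive):

badblocks -wsv -o /boot/badblocks-write.log /dev/sdb   # writes the four test patterns, then reads each back to verify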

 

Also, I've found reiserfs can be very slow creating new files when it starts to get full.

In fact so slow that creating a file via SMB can time out while the drive is searching the directory tree and superblocks for free blocks.

 

 

Link to comment

Use badblocks in read-only mode to test every sector.

Then schedule a smartctl -t long test.

 

Thanks for the info. What does that mean? Is "badblocks" a program? What command line should I be using?

 

I reseated the drive in the cage and I don't think it made any difference, so I don't think it was a connection problem. I've already started a long SMART test; should I cancel that and do the badblocks thing first?

 

Cheers.

Link to comment

Use badblocks in read-only mode to test every sector.

Then schedule a smartctl -t long test.

 

Thanks for the info. What does that mean? Is "badblocks" a program? What command line should I be using?

 

I reseated the drive in the cage and I don't think it made any difference, so I don't think it was a connection problem. I've already started a long SMART test; should I cancel that and do the badblocks thing first?

 

Cheers.

 

Let the long test go to completion. (You may need to tell unRAID not to power down the drive for 3-4 hours.)

 

badblocks is a program that comes with unRAID.

 

root@atlas /boot/bin # badblocks
Usage: badblocks [-b block_size] [-i input_file] [-o output_file] [-svwnf]
       [-c blocks_at_once] [-p num_passes] [-t test_pattern [-t test_pattern [...]]]
       device [last_block [start_block]]

 

use

badblocks -o /boot/badblocks.log -sv /dev/disk/by-id/(the id of the disk you want to test).

It will do a read only test. It should last anywhere from 4-24 hours depending on the drive size.

All bad blocks will be written to the badblocks.log file.

If there are any bad blocks in that file, then the drive will need to be re-written.
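
Putting that together, a sketch of the read-only run and the log check (the by-id path is a placeholder -- list /dev/disk/by-id to find the right one for your drive):

badblocks -o /boot/badblocks.log -sv /dev/disk/by-id/ata-WDC_WD20EARS-00MVWB0_WD-WMAZXXXXXXXX
wc -l /boot/badblocks.log   # 0 lines means no bad blocks were found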

There are ways of addressing specific blocks, but I do not have the expertise for that.

 

What I do is move the data out of the way and do a full 4-pass write test, which will either re-assign (remap) the bad sectors or refresh them with a newly written sector header.

 

I've found that sudden power offs or even scheduled graceful power offs can cause the drives to have issues with sectors.

 

http://linux.die.net/man/8/badblocks

 

Link to comment

You can take the array offline,

then run badblocks in read/write mode, which will read and re-write every sector.

See man page.
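
For reference, badblocks has a non-destructive read/write mode (-n) that reads each block, tests it with a few patterns, and then writes the original contents back. A sketch (example device name and log path; whether you use -n or the destructive -w depends on whether the data is still on the disk):

badblocks -nsv -o /boot/badblocks-rw.log /dev/sdb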

 

It takes a really long time.

If your data is valuable, you may want to get another drive as spare.

 

 

After you do that, if any sectors have been remapped, you may need to recreate or check parity.

Or run the parity check in correct/nocorrect mode. I'm not sure about this part though.

Someone with more experience needs to respond.

I'm not sure if parity gets updated from the drive, or the drive gets updated from the parity.

Link to comment

After 24 hours, the long test was stuck at 90% and would not complete. Or something aborted it. Dunno.

 

I think I'm going to empty the drive, pull it and check it with WD's own tools. There seems to be a lack of tools for ReiserFS. It could be fragged, who knows?

 

The developer of nzbget emailed me to say there could be an issue with fragmentation:

 

Hi Neil,

 

I've seen your thread "Defragmenting ReiserFS" (http://lime-technology.com/forum/index.php?topic=22906.0).

 

One of the reasons for high fragmentation could be the "DirectWrite" option. With this option NZBGet creates a sparse file which is then filled as the articles are downloaded. The idea behind this is to avoid writing data twice (once into temp article files, then into the resulting file). Due to the nature of sparse files, they initially don't have any space allocated on the disk; the space is allocated as the data is written. This can produce very fragmented files.

 

I've looked around and found that the EXT4 filesystem has a special feature developed to solve this exact problem. There is a new system function in the kernel called "fallocate". With this function a program can tell the system that it is going to write to the file soon, and the system can then allocate the needed disk space, avoiding fragmentation. The key feature of EXT4 in this respect is that it doesn't actually write zeros to the allocated area but marks the file area as "uninitialized" instead, making such an allocation very fast. This is not unique to EXT4, though. The docs say the function is supported by the btrfs, ext4, ocfs2, and xfs filesystems.

 

I'm going to add support for this feature but it will take some time - the web-interface has higher priority at the moment.

 

I suggest you try disabling the DirectWrite option and see how it performs. This should solve (at least partially) the fragmentation problem but may slow down the download, because the data has to be written twice (first into temp files, then into the resulting file). However, due to better post-processing speed you could get the resulting unpacked file faster. Please let me know how it works for you.

 

In the meantime you could think about switching to EXT4 to be ready for the new feature in NZBGet (it looks like ReiserFS doesn't support "fallocate").

 

I've disabled DirectWrite. I think there was something in the unRAID roadmap about supporting any filesystem type for the cache drive, but I don't know whether it was ever implemented?
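
For reference, a quick way to check whether a given filesystem honours preallocation seems to be the util-linux fallocate tool. A sketch, assuming the cache is mounted at /mnt/cache and the tool is present on the box (it may not be on stock 4.7):

fallocate -l 100M /mnt/cache/prealloc-test && echo "preallocation supported"
# on a filesystem without fallocate() support this should fail with an "Operation not supported" error
rm -f /mnt/cache/prealloc-test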

Link to comment

In direct answer to your question, there are NO tools that actually defragment ReiserFS.  You would have to copy the files off the disk, re-format, then copy the files back onto the disk to defragment the file system.  There is a rumour that a new repacker tool will be included in the next version of ReiserFS, called Reiser4.

 

--Sideband Samurai

Link to comment

If the SMART long test is aborted, perhaps you should post the SMART log for review.

 

 

Capture it.

 

Also do a badblocks read-only test; this will give you confidence about the underlying drive health.

Then capture another SMART log (smartctl -a) and diff -u the two logs.
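
A sketch of that capture-and-compare (example device name and paths):

smartctl -a /dev/sdb > /boot/smart-before.txt
badblocks -sv /dev/sdb            # read-only pass over every sector
smartctl -a /dev/sdb > /boot/smart-after.txt
diff -u /boot/smart-before.txt /boot/smart-after.txt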

 

 

Post all results here for review.

 

 

 

Link to comment

Here you go!

 

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZXXXXXXXX
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Oct  6 19:16:09 2012 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85)	Offline data collection activity
				was aborted by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (  25)	The self-test routine was aborted by
				the host.
Total time to complete Offline 
data collection: 		 (37200) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   198   161   021    Pre-fail  Always       -       5066
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1169
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       14312
10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       185
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       63
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3099
194 Temperature_Celsius     0x0022   120   111   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       259
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Aborted by host               90%     14287         -
# 2  Extended offline    Aborted by host               90%     14258         -
# 3  Extended offline    Aborted by host               90%     14258         -
# 4  Short offline       Aborted by host               10%     14258         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Link to comment

OK. "Aborted by host" means the controller sent some command that interrupted the SMART long test.

From what I remember it can take over 3 hours to run a smart long test, so you need to disable any sleep timers.

 

 

There are no pending, reallocated or uncorrectable sectors. That shows the low-level format looks good.

You may still want to do a badblocks read test to pass through every sector,

which is what the SMART long test does.

 

 

If you have a filesystem with tons of little files, that can also cause delays in creating or finding files.

From what I see so far, I do not see anything alarming other than the incomplete long test.

Link to comment

I emptied the drive and pulled it to run WD's own tests -- Data Lifeguard Diagnostics. This drive still has a few months of warranty left, so if there is a problem, I'd rather be able to give WD their own diag codes.

It's running an "extended test" in a USB enclosure -- 12 hours down, 8 to go.

 

I replaced that 2TB cache drive with a 250GB drive that's basically empty.

 

What I noticed was that directory reads are now instant, whereas on the 2TB drive, they would be very slow. I'm guessing that's the problem referred to earlier.

The drive did have bajillions of files on it.

 

What I'd like to know is what are the issues with having many files on a ReiserFS drive?

Is the drive slowing down because there are "too many" files on the drive?

How many is too many?

 

Or is it slowing down because the directory contents are getting fragged?

 

Or (having read Wikipedia) is it tail packing causing slowdowns on an already full, fragmented drive?

 

Or, is it really fragmentation? With no defrag tools for ReiserFS, emptying a drive and refilling it periodically will be a pain.

 

Wikipedia says "ReiserFS had a problem with very fast filesystem aging when compared to other filesystems – in several usage scenarios filesystem performance decreased dramatically with time." That's too vague, but it could be describing what I was seeing.

 

It'd be nice to know if other FS are going to be supported in 5.x for the cache drive and when.

Link to comment

I don't have details on the whys; however, I can add my own experience.

 

 

reiserfs filesystems with tons of small files are slow when adding files. I have this issue also.

 

 

I see massive drive activity and searching during the new file allocation.

This was one of my reasons to request ext3/ext4 support.

 

 

Reiserfs does not have pre-allocated inode tables, which can be good and bad.

I believe inodes are allocated on an as-needed basis, which can cause all sorts of inode table and/or directory fragmentation.

 

 

The other point to consider: Unix directories are accessed sequentially.

The larger a single directory, the slower adding a file is. If the directory has to be expanded, the directory file "." has to be expanded.

During that expansion I believe the directory is locked.

 

 

The benefit of ext3/ext4 would be that the inode tables, and the number of inodes, are pre-allocated.

The downside is you could run out of inodes with many small files.

The benefit of reiserfs is that inode tables are not pre-allocated and you do not have to concern yourself with the number of inodes.

The downside is that near-full filesystems are very slow to add files to, unless you've already added a file and all the relevant directories/inodes are already in RAM. Once the required tables/inodes are in RAM, it's pretty normal.

One of my reasons to keep reiserfs is the resiliency of its recovery tools. We've seen people overwrite their file systems with dd and still be able to recover a significant amount of data.

 

 

Organizing what goes where on the drive could help.

Also, re-organizing the drive periodically could help.

 

 

I've found that reorganizing even helps ext3.

I would just rsync the filesystem from one drive to another with --remove-sent-files.

Then rsync it back.

When it was put back I found the drive to operate faster because rsync seems to allocate all the directory entries first.
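
A sketch of that round trip (disk numbers and paths are examples; newer rsync releases call the option --remove-source-files):

rsync -a --remove-sent-files /mnt/disk1/ /mnt/disk2/relocated/   # push everything off, deleting each file as it is sent
rsync -a --remove-sent-files /mnt/disk2/relocated/ /mnt/disk1/   # pull it back; the directory tree is recreated compactly first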

 

 

A bajillion files might slow down any filesystem, especially if it is near capacity.

I believe XFS is used on my DVR so that it handles this sort of issue with better performance.

 

 

Just some thoughts on the subject. I do not have 100% factual evidence, only my personal experience.

 

 

Link to comment

Thanks for the info, Mr. Weebo of Weebotech Industries!

 

The extended test finished on the old 2TB cache drive and no errors were found.

 

I think it's quite likely that some fragmentation was killing performance, whether it was directory or file fragmentation or both. I had a lot of files on the drive, but it wasn't that huge a number, maybe 75,000 to 100,000 which I wouldn't consider a lot. Is that a lot? About 50,000 were very small files.

 

My current plan is to add a PCI-E HBA to the Microserver, giving it a total of 7 drives, with the old 2TB cache going in to the array and being replaced by a much smaller 250GB cache drive.

That would, by necessity, result in the cache drive having far fewer files and also making it much easier to empty should I want to "wipe clean" the ReiserFS cruft from it periodically.

 

Cheers!

 

Neil.

Link to comment

 

Here's my breakdown.

   146485 disk1.filelist
  4761906 disk2.filelist
     7141 disk3.filelist
  2704215 disk4.filelist
   169797 disk5.filelist
        0 disk6.filelist
      610 disk7.filelist
     2647 disk8.filelist
     1601 disk9.filelist
     4979 disk10.filelist
     4270 disk11.filelist
     3013 disk12.filelist
     2377 disk13.filelist
     2530 disk14.filelist
    63326 disk15.filelist
  7874897 total

 

It also depends on how the files are added over time, plus how full the filesystem is and which tables are already in memory.

If a directory gets fragmented, it can severely hamper performance.

 

 

For example, Disk3 takes a long time to create a new file if the filesystem data has been flushed out of RAM.

So directory fragmentation may be the issue, since that disk is the cache where I load movies to be viewed; after viewing they are moved to other disks. I suppose it comes down to how new directory entries are created and how deep they sit in the reiserfs trees. It seems like there's a lot of searching going on when I go to add a new movie.

 

 

There's a 10-30 second pause of searching before even the first block is written. After that, everything is fine until the tables are flushed again.

 

 

Link to comment

Is that 4 million files on disk2? If so, "too many files" is clearly not my problem...

 

It takes a long time to traverse that filesystem too.

That's a backup drive.

I dare not do a du -hs down the wrong directory!

It is too many files, but I still have to do backups.

The rsync backup is a mirror image of the other system's directory structure, with hard links to duplicate files. Since I do the backup on unRAID and PULL files via rsync from the other machines, I do not have a problem with performance.
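
Roughly this kind of pull, where --link-dest hard-links anything unchanged since the previous snapshot (a sketch only; the host name and paths here are made-up examples):

rsync -a --link-dest=/mnt/disk2/backup/previous otherbox:/home/ /mnt/disk2/backup/current/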

 

 

DISK3 is where I have problems. A lot of large files over a long period of time, thus fragmenting directories across the drive.

Once the file system has been traversed and the file blocks start writing there are no performance issues.

The issue I have is, if the first allocation takes too long, the Samba connection can time out.

What I usually do is spin up the drive, make a file, then remove it.
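
In shell terms the warm-up amounts to something like (example path):

touch /mnt/disk3/warmup.tmp && rm /mnt/disk3/warmup.tmp   # spins the disk up and pulls its directory tree into RAM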

Then do my ANYDVD rip to disk3.

I started having performance issues doing that, so now I write to the cache drive first, then rsync the files into place.

The model of the drive comes into play too. Especially when you start accessing the inner tracks.

 

 

While disk1 has 150,000 files or so on a 1TB drive, it's one of the samsung drives that have really good performance.

 

 

I bet there's a way to reorganize this stuff with rsync; I just haven't had the need to do it yet.

 

 

 

 

 

 

 

Link to comment
  • 2 weeks later...

Is there a way to do regular maintenance on a filesystem under unraid?  I was hoping this would not be necessary, but it seems that I, too, am suffering slow (dismally slow...) performance from certain folders containing hundreds of thousands of files. 

 

The whole concept of moving files from one drive to another (on another server, or a backup USB, whatever) and then putting them back again seems like a kludge to me, not to mention unsafe.  The whole reason I have my data on unraid in the first place is for the parity protection.

 

I suppose I'll set up a second unraid server to test new betas and keep a secondary copy of my most critical files.

 

Still, I was hoping unraid's filesystem wouldn't be plagued with the same problems I've faced with years and years of Microsoft's filesystems.

Link to comment

It's always been known that very large directories are inefficient in unix filesystems.

 

 

Consider why they created a termlib (or is it termcap) directory where the first level was the first character of the group of files.

It gets worse with larger filesystems and then fragmentation of directories.

 

 

Maybe ext4 or btrfs will handle it better. It could be a reiserfs issue.

I know my DVR uses XFS for speed.

In the meantime you could rename the directory, then rsync the directory tree back to the original name, thus possibly compressing and defragging the directory file (at the least).
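
A sketch of that rename-and-copy-back, assuming the directory lives on disk1 and there is enough free space for a second copy while it runs (the share name is an example):

mv /mnt/disk1/Movies /mnt/disk1/Movies.old
rsync -a /mnt/disk1/Movies.old/ /mnt/disk1/Movies/   # rebuilds the directory entries compactly
rm -rf /mnt/disk1/Movies.old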

 

 

Another choice is to use a multilevel directory tree for the thousands of files, thus making smaller directory files.

Link to comment

I am reading up on rsync as I type this, but so far syncing files back to themselves is confusing to me, and I have to admit I have zero understanding of what it means.

 

I have begun restructuring my directories, but I am still finding that I have performance issues.  For example, if I have a folder with 10,000 files in it, clicking on that directory in Windows' explorer.exe may take minutes to respond and display.  If I create 20 directories within it, and put 500 files in each, then clicking on each subdirectory works well.  However, after spending time browsing elsewhere, if I come back in explorer.exe and click the root folder (containing the 20 subdirectories), it still takes minutes to respond.  So the files are better organized, but the root directory is still very large, and thus slow.  That's what I'm trying to fix...

Link to comment

I am reading up on rsync as I type this, but so far syncing files back to themselves is confusing to me, and I have to admit I have zero understanding of what it means.

 

I have begun restructuring my directories, but I am still finding that I have performance issues.  For example, if I have a folder with 10,000 files in it, clicking on that directory in Windows' explorer.exe may take minutes to respond and display.  If I create 20 directories within it, and put 500 files in each, then clicking on each subdirectory works well.  However, after spending time browsing elsewhere, if I come back in explorer.exe and click the root folder (containing the 20 subdirectories), it still takes minutes to respond.  So the files are better organized, but the root directory is still very large, and thus slow.  That's what I'm trying to fix...

 

 

That particular root directory needs to be compressed and the only way to do that is rebuilding it from scratch.

mv the root directory to a new name.

mkdir a new directory using the original directory name.

Move each subdirectory into the newly created directory.

rmdir the renamed directory from step 1, which should now be empty.
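
In shell terms, with an example directory name:

mv /mnt/disk1/photos /mnt/disk1/photos.old        # 1. rename the bloated root directory
mkdir /mnt/disk1/photos                           # 2. recreate it under the original name
mv /mnt/disk1/photos.old/* /mnt/disk1/photos/     # 3. move the subdirectories across (same filesystem, so it is a quick rename)
rmdir /mnt/disk1/photos.old                       # 4. remove the now-empty renamed directory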

 

 

When directories grow very large and there are massive deletions, the previously used space is never recovered.

Each leftover entry is still read even though the file was deleted.

If that directory happens to have been very large and fragmented across the filesystem, you will have performance issues.

 

 

There might be other causes, but this is a known situation.

Link to comment
