RFC: MD5 checksum/Hash Software



Currently I prefer txt files over a database for the catalog, as it allows me to use the power of the shell.

 

While I'm familiar with locking and modifying text files in place, it's a bear.

I've been experimenting with a few formats so far.

 

Storing md5sum-type data in a .gdbm, which is very fast and whose records can be updated in place.

Storing the hash values in the extended attributes of the file. Again, fast and easy to update in place.

Storing the filesystem stat metadata in an SQLite table along with the hash values, plus additional fields such as update and verify times, and now a text field for labeling or grouping files by some user-supplied value.
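To make the SQLite option concrete, here is a rough sketch of the kind of table I have in mind (the catalog.db file name and the exact column names are only illustrative, not a final schema):

sqlite3 catalog.db <<'SQL'
CREATE TABLE IF NOT EXISTS files (
    path     TEXT PRIMARY KEY,   -- full path on the disk
    size     INTEGER,            -- bytes, from stat
    mtime    INTEGER,            -- modification time as epoch seconds
    md5      TEXT,               -- hex digest, NULL until calculated
    updated  INTEGER,            -- when this row was last refreshed
    verified INTEGER,            -- when the hash was last re-checked
    label    TEXT                -- free-form grouping/reference field
);
SQL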

 

Each has its advantages.

With each of these I can export the files into a basic md5sums type of file, or import a text md5sums file into the db or metadata.
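As an example of that round trip against the illustrative files table above (paths containing quote characters would need proper escaping in a real import):

# Export the catalog back to a classic md5sums file (md5sum -c compatible) and verify it.
sqlite3 -separator '  ' catalog.db \
    "SELECT md5, path FROM files WHERE md5 IS NOT NULL;" > disk1.md5
md5sum -c disk1.md5

# Import a text md5sums file into the table.
# Note: OR REPLACE resets the other columns on rows that already exist.
while read -r hash path; do
    sqlite3 catalog.db \
        "INSERT OR REPLACE INTO files (path, md5) VALUES ('$path', '$hash');"
done < disk1.md5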

 

With .gdbm files you have one file per disk. It's a simple key=value store, such as filepath=hash. Very fast to look up a key once the .gdbm is built.

Records can be updated in place and swept very fast. The overhead is about 90MB for 200,000 files vs 40MB for the md5sum file itself.

I've been experimenting with this format for the speed, and as another way of capturing the stat information in a single location.

Sweeping a disk with 200,000 files takes time. dir_cache speeds that up by keeping the directory data in cache through constant access.

I was designing my own dircache tool and capturing the stat information into the .gdbm as an experiment.

It worked well; it only added 2-3 seconds to a full sweep vs. find's 1 second. Records were updated in place, and if hash data was stored and the file had changed, the hash data was dropped so you knew it had to be recalculated by another program.
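The gist of that update-and-invalidate logic, sketched with a plain key=value text file standing in for the .gdbm (the DB path and mount point are placeholders, and keys containing '=' or newlines would need more care):

# key = path, value = "mtime:size:md5".  If mtime or size changed, the stored
# md5 is dropped so a later hashing pass knows the file needs recalculating.
DB=/tmp/disk1.kv
declare -A rec
[[ -f $DB ]] && while IFS='=' read -r k v; do rec["$k"]=$v; done < "$DB"

find /mnt/disk1 -type f -print0 |
while IFS= read -r -d '' f; do
    ms=$(stat -c '%Y:%s' "$f")          # epoch mtime : size
    old=${rec["$f"]}
    if [[ $old == "$ms:"* ]]; then
        printf '%s=%s\n' "$f" "$old"    # unchanged: keep the stored hash
    else
        printf '%s=%s:\n' "$f" "$ms"    # new or changed: hash field left empty
    fi
done > "$DB.new" && mv "$DB.new" "$DB"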

 

With file extended attributes, the metadata is in the filesystem. It moves with the file when you rsync the data with the -X switch.

Yet moving the data via Windows over SMB would lose the data.

Also, when you update the file (depending on how it is opened/recreated), you lose the data or it goes out of date.

The benefit is no database, and for write-once read-many type archives, it's simple storage.
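For reference, roughly what the extended attribute approach looks like from the shell; the path is just an example, and it assumes a filesystem with user xattrs enabled plus the attr tools (setfattr/getfattr):

f=/mnt/disk1/somefile.mkv                   # placeholder path
# Store the hash in a user xattr, then read it back.
setfattr -n user.md5 -v "$(md5sum "$f" | awk '{print $1}')" "$f"
getfattr -n user.md5 --only-values "$f"; echo
# -X carries the extended attributes along when copying with rsync.
rsync -aX "$f" /mnt/disk2/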

 

The SQLite table opens many possibilities, at the expense of space and speed for very large databases.

The current software allows you to have one database per whatever collection method you decide on.

The locate command allows you to do the locate and/or export function across more than one database, at the expense of some speed.

However, what I have so far uses regular expressions for filtering the output, along with a new export engine I'm writing to print out whatever metadata you need (from the stat information).

I use the locate function very often to find mp3 files among my millions of files.
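That locate boils down to queries like the following, again against the illustrative files table; the stock sqlite3 shell has no REGEXP operator built in, so GLOB/LIKE stand in for the regex filtering here:

# Find mp3s in one catalog.
sqlite3 disk1.db "SELECT path, size FROM files WHERE path LIKE '%.mp3';"

# Or ATTACH a second catalog to locate across more than one database at once.
sqlite3 disk1.db <<'SQL'
ATTACH 'disk2.db' AS d2;
SELECT path FROM files    WHERE path GLOB '*.mp3'
UNION ALL
SELECT path FROM d2.files WHERE path GLOB '*.mp3';
SQL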

 

So far I think the SQLite table is the most flexible, as it would provide standard language (SQL) access to the raw data.

I do have a tool that provides a bash loadable for SQL access to/from the individual DBs.


Very well summarised. I only used flat text files because it was easy, flexible and fast enough. I have no problems using a database at all and in many ways I prefer it.

 

Certainly I absolutely hate the idea of metadata, as there are just too many scenarios where you lose it by mistake.

 

I do have a question though. I can't think of a single project I have used where SQLite hasn't at some point proven to be a limiting factor. Apart from the increased dependencies, is there any reason not to use a full database?


Apart from the increased dependencies, is there any reason not to use a full database?

 

It's all about simplicity and dependency.

 

SQLite

pros

 

There are simple command-line tools to access SQLite.

It's relatively fast for read-mostly operations, which is what I expect once the data is imported.

SQLite, being a single flat-file database, lends itself well to moving to another location easily.

With Tom's buy-in on binding PHP with SQLite, it now becomes accessible from emhttp.

There are simple bash loadable bindings to access SQLite from within bash and define variables easily. More on that as I move forward (a plain-CLI version is sketched after this list).

With the right browser plugin in Firefox you can access the SQLite database.

 

cons

speed
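Even without the loadable, plain command substitution against the sqlite3 shell covers the define-variables case; the cost is one forked process per call, which is exactly the speed con noted above (catalog.db and the files table are the illustrative names from earlier):

total_files=$(sqlite3 catalog.db "SELECT COUNT(*) FROM files;")
unhashed=$(sqlite3 catalog.db "SELECT COUNT(*) FROM files WHERE md5 IS NULL;")
echo "cataloged: $total_files files, still to hash: $unhashed"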

 

Over the past week, as I've been consolidating boxes of 1TB HDs onto a larger HD, I thought about MySQL.

 

MySQL

pros

A single database for multiple servers over the network. This could provide a consolidated view of data and even archival hard drives.

Potential speed increase from a dedicated database.

 

Potential to use local flat files with the correct shared libraries.

Swapping shared libraries switches it from local flat files to a networked database.

 

A consolidated database of multiple servers and hash sums could provide a method for finding duplicate data (this is my main impetus to consider it); see the sketch below.
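As a sketch of what that duplicate finding could look like once every box reports into one MySQL database (the host, database, table, and column names here are made up for illustration):

mysql -h dbhost -u cataloguser -p catalog <<'SQL'
SELECT md5, COUNT(*) AS copies,
       GROUP_CONCAT(CONCAT(server, ':', path) SEPARATOR ', ') AS locations
FROM files
WHERE md5 IS NOT NULL
GROUP BY md5
HAVING COUNT(*) > 1;
SQL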

 

cons

Complexity in setup: user IDs, storage, daemon access, network access.

Dependency on another package.

Binding to PHP, and LimeTech having to support that, or a 3rd party needing to provide an updated MySQL-aware PHP.

Complexity in moving a database to another location. Per-disk databases are highly unlikely unless using local flat files.

 

unknowns

I'm not experienced with the whole business of swapping shared libraries for local flat-file access vs. network access. Someone would have to write a guide.

I do have a skeleton bash binding for mysql, but I've never really used it.

 

 

Thoughts

I'm sure there's a way to build a shared library that can be swapped to switch from sqlite to mysql.

I've seen some examples, but that's a longer learning curve for me. I still have to do a code review of an example I found.

I have to make some decisions soon: use what I know how to use for now and grow later, or delay while I learn.

Also, as I go through the cataloging and hashing process I'm seeing what I need.

I'm posting here looking for feedback, to see what others might need and how to find a common set.

 

With my 3 methods so far, I might just release a whole suite of tools in 3 sets and see which people find most useful.

 

md5deep works well, but it doesn't really handle the whole update-in-place workflow nicely.

So you need to store name, mtime, size, and hash. Had I been able to get the author to use the raw epoch time instead of his converted time, this would be simple scripting.
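A minimal sketch of capturing those four fields with plain GNU stat and md5sum, keeping the mtime as raw epoch seconds (the mount point and output file are placeholders, and paths containing tabs or newlines would need extra care):

find /mnt/disk1 -type f -print0 |
while IFS= read -r -d '' f; do
    hash=$(md5sum "$f" | awk '{print $1}')
    read -r size mtime < <(stat -c '%s %Y' "$f")   # %Y = mtime as epoch seconds
    printf '%s\t%s\t%s\t%s\n' "$hash" "$size" "$mtime" "$f"
done > disk1.catalog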

Storing this data in .gdbm files was pretty simple once I got the bash .gdbm binding to work.

Speed became the limiting factor in bash; all of the external program calls eat up CPU cycles. Still, it worked well.

 

SQLite takes this to the next level, with simpler command-line access and minimal setup for the end user.


So I use md5sum with the find command, and I run one command per disk, one per CPU core.

find "/mnt/${DISK}" -type f -exec md5sum {} \; > "${MD5DIR}/MD5_${DATESTAMP}_${DISK}.md5"
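Roughly, the per-disk parallel run looks like the sketch below; the MD5DIR location and the disk list are placeholders, not necessarily what the attached script uses:

MD5DIR=/mnt/cache/md5                      # placeholder output location
DATESTAMP=$(date +%Y%m%d)
mkdir -p "$MD5DIR"
for DISK in disk1 disk2 disk3 disk4; do    # placeholder disk list
    find "/mnt/${DISK}" -type f -exec md5sum {} \; \
        > "${MD5DIR}/MD5_${DATESTAMP}_${DISK}.md5" &
done
wait                                       # wait for all per-disk sweeps to finish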

I got most of the program from this forum, I think, and I'm not sure who to credit. Or I made it myself from bits and pieces, I'm not really sure.

 

I have attached the file.

 

Best Alex

Alex - thanks for posting your script.  I've been searching for a method of generating MD5 hashes per disk, and your script looks like it might be the solution.

 

I'm trying to run this script on my unRAID 6b10a server, but I don't know how to check the results, and I'm not able to interpret the script visually.  Is it supposed to create a folder called "hash" on each disk, and drop a single MD5 file there?  Also, am I correctly interpreting that each time the script generates hashes for a particular disk it will rename the file with a timestamp, so you can check for unexpected changes in MD5 hash values over time?

 

Thanks in advance for sharing.

 

-- stewartwb


Excellent summary, and I totally agree with everything you have said. This is starting to interest me more and more. If we can design the structure in such a way that a user can add extra fields of info without compromising the core speed and integrity of what you are doing here, it opens the way for other tools to add metadata beyond CRC info. Essentially it could be both a disk catalog and a CRC catalog.

 

It also creates a location for the "disk history from birth to death" feature that has already been accepted (to be implemented post 6.3?).

 

Also, dealing with offline disks on a shelf pushes it beyond the realms of traditional unRAID, and I expect a large percentage of the user base has a bunch of reiser disks sitting about with non-critical data on them and bespoke solutions for knowing what they contain.

 

Sign me up.


Excellent summary, and I totally agree with everything you have said. This is starting to interest me more and more. If we can design the structure in such a way that a user can add extra fields of info without compromising the core speed and integrity of what you are doing here, it opens the way for other tools to add metadata beyond CRC info. Essentially it could be both a disk catalog and a CRC catalog.

 

This is why I mentioned the label field.

Right now I'm considering the use of serial numbers on the drives to make it perfectly clear what is where.

At that point using the serial number data, I can refer to the smartctl logs I have, plus any notes I have.

 

It also creates a location for the "disk history from birth to death" feature that has already been accepted (to be implemented post 6.3?).

 

This is going beyond what I had planned. I think that should be in a different table/database.

They certainly can refer to one another via the drive model/serial number and the label field, i.e. like a foreign key (sketched below).
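As a rough sketch of that cross-reference, assuming the illustrative files table from earlier in the thread (the disks table and its columns are placeholders too):

sqlite3 catalog.db <<'SQL'
CREATE TABLE IF NOT EXISTS disks (
    serial  TEXT PRIMARY KEY,   -- e.g. as reported by smartctl -i
    model   TEXT,
    notes   TEXT
);
-- Join file rows to their disk via the label/reference field holding the serial.
SELECT f.path, f.md5, d.model, d.notes
FROM files AS f
JOIN disks AS d ON d.serial = f.label;
SQL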

 

Also, dealing with offline disks on a shelf pushes it beyond the realms of traditional unRAID, and I expect a large percentage of the user base has a bunch of reiser disks sitting about with non-critical data on them and bespoke solutions for knowing what they contain.

 

This is my impetus for the new label 'reference' field.

Now I'll have to consider how to clean the database of stagnant lingering entries on the local file system without removing the archival entries.

 

For now, adding these foreign external disks will be a by-hand, command-line situation.

I.e. until SNAP or some other mechanism comes along to mount these disks easily from the webGui.

