
Inventory - Manage an inventory of your files with SHA 256 hash (beta 1)


jbartlett


This script will allow you to track your files for changes to the file date & size, with an optional SHA-256 hash.

 

The attached zip file comes with sqlite3 and sha256deep included because the files are not included with stock UNRAID. You can leave them or remove them if you already have the binaries installed.

 

If you do not explicitly specify a database file to use, the script will create a file called "inventory.db" on your cache drive if one is found, otherwise on /mnt/disk1.

 

Be kind if reviewing the code, this is my first bash script and I learned by google.  :P

 

Options:

  -p, --path            Full path to scan, required

  -a, --add            Add new files to the Inventory database

  -v, --verify          Verify files in the Inventory database

  -u, --update          Verify files & update database with changes

  -m, --missing        Scan for missing inventoried files

  -r, --removemissing  Scan for and remove missing inventoried files from db

  -f, --filter          Only process files matching file mask

                        Default: *

  -s, --sha            Compute a SHA256 value for adding or verification

  -d, --db              Location of the Inventory sqlite database

  -l, --log            Log files added/changed/missing to syslog

  -b, --backup          Backup database before adding/removing files

  -i, --id              Specify an alphanumeric ID to use instead of a

                        pseudo-random number

 

If the path to scan or the location of the database file contains spaces, place the path inside double-quotes.

 

Examples:

Add new files on disk1 to a database named "disk1.db" stored on the cache drive and generate SHA hash

  inventory.sh -a -s -p /mnt/disk1 -d /mnt/cache/disk1.db

Check for missing files from the previous "disk1" database

  inventory.sh -m -p /mnt/disk1 -d /mnt/cache/disk1.db

Check for changed files from above example but only compare file date & size

  inventory.sh -v -p /mnt/disk1 -d /mnt/cache/disk1.db

Check for changed files from above example but check the SHA hash (will update file date if SHA matches but date does not)

  inventory.sh -v -s -p /mnt/disk1 -d /mnt/cache/disk1.db

Update changed files from above example

  inventory.sh -u -s -p /mnt/disk1 -d /mnt/cache/disk1.db

Remove missing files from the database from the above examples

  inventory.sh -r -p /mnt/disk1 -d /mnt/cache/disk1.db

 

Add new files in the share 'TV' and compute SHA256

  inventory.sh -a -s -p /mnt/user0/TV

Add all files stored on disk1 without SHA256

  inventory.sh -a -p /mnt/disk1

Add all files stored on disk1 without SHA256 in database stored on disk2

  inventory.sh -a -p /mnt/disk1 -d /mnt/disk2/inventory.db

Add only *.mov files in a user share that contains a space

  inventory.sh -a -p "/mnt/user0/My Videos" -f *.mov

Scan for missing files in a subdirectory on a user share

  inventory.sh -m -p /mnt/user0/Documents/John/Catalog

 

 

 

This script is multi-thread safe in that you can run multiple instances simultaneously. The script uses a "random" number in the names of the temporary files it generates for internal processing, so there won't be any file conflicts. Well, a conflict is very rare, but you can specify a custom ID instead of a random number using the -i | --id parameter.
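For example, two scans could run side by side with their own IDs (the paths and IDs below are just illustrative):

  inventory.sh -a -s -p /mnt/disk1 -d /mnt/cache/disk1.db -i disk1job &
  inventory.sh -a -s -p /mnt/disk2 -d /mnt/cache/disk2.db -i disk2job &
  wait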

 

TO DO:

1. Allow checking the SHA hash and updating if a non-SHA verification was done and the file date does not match

2. Add support to scan "lost+found" directories on all drives to attempt recovery of files with a matching SHA256 hash stored in a database

inventory.zip


Thanks for sharing the script!

 

Verify all files in the database with file date, time, and SHA256 hash

  inventory.sh -v -s

Verify all files in the database with file date, time

  inventory.sh -v

 

I'm having difficulty with these options, as it always insists I provide a path - unless I'm misunderstanding the syntax?  BTW, the db already exists, as I successfully ran a few add commands. Also I can verify if I explicitly provide the path.

 

# ./inventory.sh -v -s

Inventory, by John Bartlett, beta 1

Verify Files [Database: /mnt/apps/inventory/inventory.db]

Path was not specified. Please specify with -p or --path option

# ./inventory.sh -v 

Inventory, by John Bartlett, beta 1

Verify Files [Database: /mnt/apps/inventory/inventory.db]

Path was not specified. Please specify with -p or --path option


 

victor


Hi John,

 

I ran into an issue where files would get repeatedly added when running "-a" even though they were not changed. I tracked it down to having square brackets in some pathnames (directories, to be precise), which, as I understand it, grep treats as a regular expression. This caused grep to not find the file in $dbsha.txt, and so the code treated it as a new file. I temporarily fixed it by changing the grep to use the -F switch (thereby ignoring regex), but I don't know if that will have some other negative implications, and there is perhaps a better solution.

 

example (keeps re-adding):

/mnt/user/audio/hd/music_96_24/George_Benson--Breezin'_96_24_[FLAC]/folder.jpg

 

Original Code

if grep -q "$currfile|" "$dbsha.txt"; then

 

My temporary fix

if grep -F -q "$currfile|" "$dbsha.txt"; then
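A quick way to see the difference, using made-up paths in the same spirit as the example above:

# "[FLAC]" is a character class to grep's regex engine, so the literal line is never found
echo "/mnt/user/audio/test_[FLAC]/folder.jpg|abc123" > /tmp/dbsha-test.txt
currfile="/mnt/user/audio/test_[FLAC]/folder.jpg"
grep -q  "$currfile|" /tmp/dbsha-test.txt && echo "regex match"    # prints nothing
grep -Fq "$currfile|" /tmp/dbsha-test.txt && echo "literal match"  # prints "literal match"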

 

victor

 

Edit: to add example


nice stuff jbartlett, I've been working on a similar project, but in C.

 

Actually, if I could get the author of md5deep/sha256deep to store the mtime as epoch seconds, I would be able to do most of it in shell.

 

Although I asked, he directed me to tripwire instead.  Not where I wanted to go.

 

Anyway, I was looking at your well-written script.

You may want to break down the actions into shell functions.

 

It would be easier to follow.

 

I'm going to have some tools that you may be able to use to speed up the script.

 

I see a bunch of call-outs, which are fine until you are processing 7 million files like I do.

 

Call-outs like these can really drag a script down:

filedate=$(stat -c %Y "$currfile")

filesize=$(stat -c %s "$currfile")
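As an aside, even without loadables the two forks can be collapsed into one per file by asking stat for both values at once:

# one stat call returns both mtime and size (still a fork, but half as many)
read -r filedate filesize < <(stat -c '%Y %s' "$currfile")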

 

 

Anyway read up on bash loadables.

I've compiled them and I'm testing a few out.

 

finfo, strftime, lsql, gdbm and a few others.

 

What the loadables allow you to do is dynamically load C routines into the current bash executable.

This alleviates the expense of forking a subordinate process just to assign a variable.

 

For example, with the stat calls, we can use

 

finfo -v FILESIZE filename.dat, which will load the file size into the FILESIZE variable without forking. Same with mtime.

 

strftime allows reformatting of the epoch time (like date) and storing it in a variable.

 

While it's very easy and convenient to fork out and reassign, when you do it in a loop that occurs hundreds of thousands of times, it bogs down.

 

hence the reason I was working on loadables.

 

I spent the weekend working on an md5 loadable so I wasn't forking a million times for an md5sum.

 

Oh, also read up about coproc.

 

You can hand off your SQL updates to a coprocess; real neat stuff.
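A rough sketch of the idea, using the table and db names from the test scripts below, so treat it as illustrative:

# keep one sqlite3 process alive and stream INSERTs to it instead of forking per file
coproc SQL { sqlite3 /tmp/files.sqlite3; }
sq="'"
while IFS= read -r FILENAME
do
    ESCAPED=${FILENAME//$sq/$sq$sq}    # double any single quotes for SQL
    # mtime/size left as 0 here just to keep the sketch short
    echo "INSERT INTO locate (name,mtime,size,checksum) VALUES ('$ESCAPED',0,0,'');" >&"${SQL[1]}"
done < <(find /tmp -mount -type f -print)
echo ".quit" >&"${SQL[1]}"
wait "$SQL_PID"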

 

I also learned today about altering the PRAGMA settings of the sqlite db.

By adjusting the journal to be in memory and turning off the sync, the speed difference was staggering.
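As far as I can tell, those two settings apply per connection, so they have to be issued in the same sqlite3 session as the inserts, e.g.:

sqlite3 /tmp/files.sqlite3 <<'SQL'
PRAGMA synchronous = OFF;
PRAGMA journal_mode = MEMORY;
INSERT INTO locate (name,mtime,size,checksum) VALUES ('/tmp/example.dat',1373172452,942,'');
SQL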

 

At first I did an ftw64 walk that generated 60,000 records for insertion.

It took 18 minutes to insert the records.

I turned off the disk-based journaling, set it to memory, and the time dropped dramatically.

 

here are my test results.

SQLITE
time ./ftwinocache_sqlite $HOME
selects 0, deletes 6628, duplicates 6628, inserts 52251, errors 0 

real    18m1.68s
user    0m44.53s
sys     3m15.21s

no journaling 
synchronous = OFF
journal_mode = MEMORY
time ./ftwinocache_sqlite $HOME
real    0m58.21s
user    0m35.86s
sys     0m22.31s

journaling 
synchronous = ON
journal_mode = MEMORY
time ./ftwinocache_sqlite $HOME
selects 0, deletes 6629, duplicates 6629, inserts 52252, errors 0 

real    3m47.92s
user    0m37.73s
sys     1m58.72s

 

I would abandon sqlite in favor of something faster like GDBM or even mysql to flat files (with the right libraries it works like sqlite). However, with sqlite you can download the database files to a workstation, load a module into Firefox, and browse the database if needed.

 

Plus with the lsql loadable, you have direct access to the records. Cool stuff.

 

Anyway hit me up with a pm and we can collaborate. I can send you some of the loadables to play with.

 

 


Here's where some of the fun begins.

 

First test on a directory with 5,000 files.

 

find /tmp -type f -mount | wc -l

5079

 

#!/bin/bash 

DB="/tmp/files.sqlite3"

sqlite3 ${DB} "CREATE TABLE IF NOT EXISTS locate (name TEXT PRIMARY KEY, mtime INTEGER, size INTEGER, checksum TEXT)"

if [ -z "${1}" ] 
   then DIR=$PWD
   else DIR="${1}"
fi

find "${DIR}" -mount -type f -print | while read FILENAME
do
    MTIME=$(stat --printf "%Y" "${FILENAME}") 
    SIZE=$(stat --printf "%s" "${FILENAME}") 
    sqlite3 ${DB} "INSERT INTO locate (name,mtime,size,checksum) VALUES ('$FILENAME','$MTIME','$SIZE','');"
done

 

Programs run from Shell

 

root@slacky:/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables# time ./sqlite3_test.sh  /tmp

real    1m37.602s

user    0m8.672s

sys    0m33.185s

 

 

Second test with loadable C modules into the bash core.

#!/bin/bash

[ ${DEBUG:=0} -gt 0 ] && set -x -v

enable -f $PWD/finfo   finfo
enable -f $PWD/sqlite3 lsql

DB="/tmp/files.lsql"

lsql -d ${DB} "CREATE TABLE IF NOT EXISTS locate (name TEXT PRIMARY KEY, mtime INTEGER, size INTEGER, checksum TEXT)"

if [ -z "${1}" ] 
   then DIR=$PWD
   else DIR="${1}"
fi

find "${DIR}" -mount -type f -print | while read FILENAME
do
    finfo -v MTIME -m "${FILENAME}"
    finfo -v SIZE  -s "${FILENAME}"
    lsql -d ${DB} "INSERT INTO locate (name,mtime,size,checksum) VALUES ('$FILENAME','$MTIME','$SIZE','');"
done

 

Loadable Modules

 

root@slacky:/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables# time ./lsql_test.sh /tmp

 

real    1m4.671s

user    0m4.027s

sys    0m9.684s

 

While the improvement isn't all that much, it shows that an in-core C module without forking saves a decent amount of time. This becomes especially evident when your file count grows.

 

Here's where some of the real fun comes in.

Those greps could be replaced with:

 

lsql -a FIELDS -d /tmp/files.lsql "SELECT * from locate where NAME='/tmp/package-sqlite/install/slack-desc';"

 

 

set | grep FIELDS

FIELDS=([0]="/tmp/package-sqlite/install/slack-desc" [1]="1373172452" [2]="942" [3]="")

 

 

and something like this becomes possible.

 

lsql -a LIST -d /tmp/files.lsql "SELECT name from locate where NAME LIKE '%slack-desc%';"

 

set | grep LIST

LIST=([0]="/tmp/tgz/powerdown/usr/doc/powerdown-1.03/slack-desc" [1]="/tmp/tgz/powerdown/install/slack-desc" [2]="/tmp/unRAID/package-postfix/install/slack-desc" [3]="/tmp/package-sqlite/install/slack-desc")

 

Cool stuff.

 

I have not explored associative arrays yet.

There may be a way to do that as well.

 

Also another way to get rid of those greps is to use a GDBM file.

 

With the grep you are reading the same file over and over.

While it stays in memory once it has been read, the continual re-scanning could bog down the script.

 

With a GDBM file there is a key (filename) and data (anything else you want to store).

Lookups are very fast.

 

Originally I had planned to use GDBM files. I'm very familiar with them, but there aren't many basic tools for accessing them easily, except for the bash gdbm loadable, which must be compiled manually.

 

At least with sqlite, you can install sqlite3, move the file to another platform, and/or use Firefox with the SQLite extension.

 

Yet if the gdbm file is only used for a temporary sha256deep db conversion (for fast lookups), it would not matter.

 

root@slacky:/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables# bash
root@slacky:/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables# enable -f $PWD/gdbm gdbm
root@slacky:/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables# help gdbm
gdbm: gdbm [-euikvr] [-KVW array] file [key | key value ...]
    Interface to gdbm(3).  It simulates disk-based associative array.
        gdbm file           -- print all key/\t/value pairs, ie. dict.items()
        gdbm -k file        -- print all keys, ie. dict.keys()
        gdbm -v file        -- print all values, ie. dict.values()
        gdbm file key       -- print dbm[key], ie. ${dbm[key]}
    
        gdbm -r file        -- reorganize database
    
        gdbm -K array file      -- save all keys into array
        gdbm -V array file      -- save all values into array
        gdbm -W array file      -- save all key/value pairs into array sequentially
    
        gdbm file key value     -- store key/value, ie. dbm[key]=value
        gdbm -i file key value  -- store key/value, only if key is new
        gdbm -v file key name   -- store value in name variable, ie. name=${dbm[key]}
    
        gdbm -e file            -- test if file is GDBM database
        gdbm -e file key        -- test if key exists
        gdbm -e file key value  -- test if key exists and value is dbm[key]
    
        gdbm -u file key        -- delete key, ie. unset dbm[key]
        gdbm -u file key value  -- delete key, only if value is dbm[key]
    
    More than one key/value pair can be specified on command line, in which
    case, they would be processed in pairs.  It returns 1 immediately on any
    error or test failure.  If all arguments are processed without error, then
    returns success (0).  Each 'gdbm' command is complete, in that it opens and
    closes the database file.


and s'more food for thought with gdbm files.

#!/bin/bash 

DB="/tmp/files.gdbm"

enable -f $PWD/finfo   finfo
enable -f $PWD/gdbm    gdbm 

if [ -z "${1}" ] 
   then DIR=$PWD
   else DIR="${1}"
fi

find "${DIR}" -mount -type f -print | while read FILENAME
do
    finfo -v MTIME -m "${FILENAME}"
    finfo -v SIZE  -s "${FILENAME}"
    gdbm ${DB} "${FILENAME}" "${MTIME},${SIZE}"
done

 

loading 5200 files into GDBM using loadables.

root@slacky:/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables# time ./gdbm_test.sh /tmp

real    0m7.363s

user    0m1.847s

sys    0m0.940s

 

Testing it out.

root@slacky:/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables# enable -f $PWD/gdbm gdbm

root@slacky:/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables# gdbm /tmp/files.gdbm | wc -l

5218

 

snippet of data.

gdbm /tmp/files.gdbm  | grep loadables

/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables/md5  1373247716,20106

/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables/head  1373216169,11506

/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables/md5main.c    1373241102,5007

/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables/perl/bperl.sh 1373294220,128

/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables/pushd.c      1373170632,16909

/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables/md5lib.o      1373242957,9832

/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables/finfo.o      1373174409,20680

/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables/dirname.o    1373216055,6664

 

Accessing a single key after load is very very fast.

time gdbm /tmp/files.gdbm /unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables/mktime.o

1373289349,10008

 

real    0m0.001s

user    0m0.000s

sys    0m0.000s

 

load into variable

gdbm -v /tmp/files.gdbm /unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables/mktime.o VAR

set | grep VAR

VAR=1373289349,10008

 

There's all kinds of neat stuff that can be done with this.


Thanks for sharing the script!

 

Verify all files in the database with file date, time, and SHA256 hash

  inventory.sh -v -s

Verify all files in the database with file date, time

  inventory.sh -v

 

I'm having difficulty with these options, as it always insists I provide a path - unless I'm misunderstanding the syntax?  BTW, the db already exists, as I successfully ran a few add commands. Also I can verify if I explicitly provide the path.

 

# ./inventory.sh -v -s

Inventory, by John Bartlett, beta 1

Verify Files [Database: /mnt/apps/inventory/inventory.db]

Path was not specified. Please specify with -p or --path option

# ./inventory.sh -v 

Inventory, by John Bartlett, beta 1

Verify Files [Database: /mnt/apps/inventory/inventory.db]

Path was not specified. Please specify with -p or --path option


 

victor

 

Apologies - the original intent was to allow the script to test all files in the database, and while the reason currently escapes me (long day), I needed to implement a path option for it. All commands require a path to be specified. I'll update the original post to correct this.


Hi John,

 

I ran into an issue where files would get repeatedly added when running "-a" even though they were not changed. I tracked it down to having square brackets in some pathnames (directories, to be precise), which, as I understand it, grep treats as a regular expression. This caused grep to not find the file in $dbsha.txt, and so the code treated it as a new file. I temporarily fixed it by changing the grep to use the -F switch (thereby ignoring regex), but I don't know if that will have some other negative implications, and there is perhaps a better solution.

 

I ran into an issue too with brackets such as [1], and I believed I had fixed that by escaping the brackets in the regex. The grep is also checking for a valid regex using brackets, so a -F won't work if you're not storing SHA values. I'll double-check the code.

 

ETA: I see it now; I forgot to escape the brackets in one location.

 

Looks like your -F option is a good fix. That logic only checks the file names and not an optional SHA, so it works. I'll add that.


WeeboTech,

 

I'll take a look at those recommendations. Anything to help speed up the process will be a boon overall. I likewise have shares that contain many thousands of files.

 

As I'm thinking about this, you can use lsql and pick off each file during a verify/update.

In reality, if you are running the sha binary for every file anyway, an lsql SELECT instead of the grep is going to be better.

 

If you want speed, we can export the sql db into a gdbm file and then use GDBM to access the keys directly. It should be very fast.
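Something along these lines, perhaps (table/column names are the ones from the test scripts above, and it assumes no '|' in the file names):

enable -f $PWD/gdbm gdbm
sqlite3 -separator '|' /tmp/files.sqlite3 "SELECT name, mtime || ',' || size FROM locate;" |
while IFS='|' read -r KEY VALUE
do
    gdbm /tmp/files.gdbm "$KEY" "$VALUE"    # key = filename, data = "mtime,size"
done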

 

I just found a technique for splitting a string into an array.

So you can take apart the exported DB values.

string="1:2:3:4:5"
array=(${string//:/ })
set | grep array
for i in "${!array}"
do
    echo "$i=>${array}"
done

 

Frankly, I would just do it the lsql way.

 

lsql -a FIELDS -d /tmp/files.lsql "SELECT * from locate where NAME='/tmp/package-sqlite/install/slack-desc';"

set | grep FIELDS

FIELDS=([0]="/tmp/package-sqlite/install/slack-desc" [1]="1373172452" [2]="942" [3]="")

 

gdbm -v /tmp/files.gdbm "/tmp/package-sqlite/install/slack-desc" VALUE

 

set | grep VALUE

VALUE=1373172452,942

 

FIELDS=(${VALUE//,/ })

root@slacky:/tmp# set | grep FIELDS

FIELDS=([0]="1373172452" [1]="942")

 

Frankly, I would just do it the lsql way.

As a recommendation during the update: if the mtime or size has not changed, do you really need to recalculate the SHA checksum?

Tripwire would probably do that check. Are you trying to duplicate what Tripwire does, or just inventory your files for changes? It seems that if you have not altered the time or size of the file, there should be no reason to do the SHA and update the SHA file.

 

In verify mode, yes, you would want to calculate all values, but update should only do the checksum calc if you find that the file has been altered on the filesystem. I suppose you could use ctime instead of mtime; chances are these files are not going to change that often.
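In script terms the update path could look something like this ($dbdate/$dbsize standing in for whatever you pull out of the database row):

if [ "$filedate" = "$dbdate" ] && [ "$filesize" = "$dbsize" ]
then
    :    # stored mtime & size still match: skip the expensive hash
else
    newsha=$(sha256deep "$currfile" | awk '{print $1}')    # hash only files that look changed
fi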

 

So you may do an update daily and a verify monthly or weekly to save some time.

You could have another time field, so that every time a file is verified, it updates that time field. Then alter the logic so you only verify files that have not been verified in, let's say, 7 days or 30 days, etc.


I just found a technique for splitting a string into an array.

So you can take apart the exported DB values.

string="1:2:3:4:5"
array=(${string//:/ })
set | grep array
for i in "${!array}"
do
    echo "$i=>${array}"
done

 

I was looking for a way to do that! The regex comparison was such a PITA because it kept returning matches that shouldn't have matched - it was because I had tried to indicate a repeating value using {64} - the curly brackets didn't work and couldn't be escaped.

 

 

As a recommendation during the update: if the mtime or size has not changed, do you really need to recalculate the SHA checksum?

Tripwire would probably do that check. Are you trying to duplicate what Tripwire does, or just inventory your files for changes? It seems that if you have not altered the time or size of the file, there should be no reason to do the SHA and update the SHA file.

 

You'd specify the -s switch if you want to verify the integrity of the file regardless of whether the date & size remain unchanged - the date can be touched anyway, so it is unreliable for determining whether a file has been modified or not.

 

You could have another time field, so that every time a file is verified, it updates that time field. Then alter the logic so you only verify files that have not been verified in, let's say, 7 days or 30 days, etc.

 

Excellent idea. I'll add that to the To Do list.
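Roughly something like this (the table and column names here are just placeholders):

# add a "last verified" timestamp, then only pick rows that are overdue
sqlite3 "$dbfile" "ALTER TABLE inventory ADD COLUMN lastverified INTEGER DEFAULT 0;"
sqlite3 "$dbfile" "SELECT name FROM inventory WHERE lastverified < strftime('%s','now') - 7*86400;"
sqlite3 "$dbfile" "UPDATE inventory SET lastverified = strftime('%s','now') WHERE name = '$currfile';"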

