jbartlett Posted July 4, 2013

This script will allow you to track your files for changes to the file date & size, including an optional SHA-256 hash. The attached zip file comes with sqlite3 and sha256deep included because those binaries are not included with stock unRAID. You can leave them or remove them if you already have the binaries installed. If you do not explicitly specify a database file to use, the script will create a file called "inventory.db" located on your cache drive if found, otherwise on /mnt/disk1.

Be kind if reviewing the code; this is my first bash script and I learned by googling.

Options:
 -p, --path            Full path to scan, required
 -a, --add             Add new files to the inventory database
 -v, --verify          Verify files in the inventory database
 -u, --update          Verify files & update the database with changes
 -m, --missing         Scan for missing inventoried files
 -r, --removemissing   Scan for and remove missing inventoried files from the db
 -f, --filter          Only process files matching file mask (default: *)
 -s, --sha             Compute a SHA-256 value for adding or verification
 -d, --db              Location of the inventory sqlite database
 -l, --log             Log files added/changed/missing to syslog
 -b, --backup          Back up the database before adding/removing files
 -i, --id              Specify an alphanumeric ID to use instead of a pseudo-random number

If the path to scan or the location of the database file contains spaces, place the path inside double quotes.

Examples:

Add new files on disk1 to a database named "disk1.db" stored on the cache drive and generate SHA hashes:
 inventory.sh -a -s -p /mnt/disk1 -d /mnt/cache/disk1.db

Check for missing files using the previous "disk1" database:
 inventory.sh -m -p /mnt/disk1 -d /mnt/cache/disk1.db

Check for changed files from the above example, comparing only file date & size:
 inventory.sh -v -p /mnt/disk1 -d /mnt/cache/disk1.db

Check for changed files from the above example, checking the SHA hash (will update the file date if the SHA matches but the date does not):
 inventory.sh -v -s -p /mnt/disk1 -d /mnt/cache/disk1.db

Update changed files from the above example:
 inventory.sh -u -s -p /mnt/disk1 -d /mnt/cache/disk1.db

Remove missing files from the database from the above examples:
 inventory.sh -r -p /mnt/disk1 -d /mnt/cache/disk1.db

Add new files in the share 'TV' and compute SHA-256:
 inventory.sh -a -s -p /mnt/user0/TV

Add all files stored on disk1 without SHA-256:
 inventory.sh -a -p /mnt/disk1

Add all files stored on disk1 without SHA-256, with the database stored on disk2:
 inventory.sh -a -p /mnt/disk1 -d /mnt/disk2/inventory.db

Add only *.mov files in a user share whose name contains a space:
 inventory.sh -a -p "/mnt/user0/My Videos" -f *.mov

Scan for missing files in a subdirectory on a user share:
 inventory.sh -m -p /mnt/user0/Documents/John/Catalog

This script is multi-thread safe in that you can run multiple instances simultaneously: the script uses a pseudo-random number in the names of any temporary files it generates, so there won't be any file conflicts. Well, a conflict is very rare, but you can specify a custom ID instead of a random number using the -i | --id parameter.

TO DO:
1. Allow checking the SHA hash and updating if a non-SHA verification was done and the file date does not match
2. Add support to scan "lost+found" directories on all drives to attempt recovery of files with a matching SHA-256 hash stored in a database

inventory.zip
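[Editor's note: for readers wiring this into a schedule, here is a minimal sketch of a nightly wrapper built only from the options documented above; the script location and the disk/database paths are assumptions to adjust for your own setup.]

```bash
#!/bin/bash
# Hypothetical paths; adjust for your own server.
SCRIPT=/boot/scripts/inventory.sh
DB=/mnt/cache/disk1.db

# Add any new files with SHA-256 hashes, logging to syslog and
# backing up the database first (-a add, -s sha, -l log, -b backup).
"$SCRIPT" -a -s -l -b -p /mnt/disk1 -d "$DB"

# Quick verification pass comparing only file date & size (-v verify).
"$SCRIPT" -v -l -p /mnt/disk1 -d "$DB"

# Report any inventoried files that have gone missing (-m missing).
"$SCRIPT" -m -l -p /mnt/disk1 -d "$DB"
```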
vm Posted July 7, 2013

Thanks for sharing the script!

Verify all files in the database with file date, time, and SHA256 hash:
 inventory.sh -v -s

Verify all files in the database with file date & time:
 inventory.sh -v

I'm having difficulty with these options, as it always insists I provide a path, unless I'm misunderstanding the syntax? BTW, the db already exists, as I successfully ran a few add commands. Also, I can verify if I explicitly provide the path.

 # ./inventory.sh -v -s
 Inventory, by John Bartlett, beta 1
 Verify Files [Database: /mnt/apps/inventory/inventory.db]
 Path was not specified. Please specify with -p or --path option

 # ./inventory.sh -v
 Inventory, by John Bartlett, beta 1
 Verify Files [Database: /mnt/apps/inventory/inventory.db]
 Path was not specified. Please specify with -p or --path option

victor
vm Posted July 7, 2013

Hi John, I ran into an issue where files would get repeatedly added when running "-a" even though they were not changed. I tracked it down to having square brackets in some pathnames (directories, to be precise), which grep treats as part of a regular expression. This caused grep to not find the file in $dbsha.txt, so the code treated it as a new file. I temporarily fixed it by changing the grep to use the -F switch (thereby treating the pattern as a fixed string instead of a regex), but I don't know if that will have some other negative implications, and there is perhaps a better solution.

Example (keeps re-adding):
 /mnt/user/audio/hd/music_96_24/George_Benson--Breezin'_96_24_[FLAC]/folder.jpg

Original code:
 if grep -q "$currfile|" "$dbsha.txt"; then

My temporary fix:
 if grep -F -q "$currfile|" "$dbsha.txt"; then

victor

Edit: to add example
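[Editor's note: to make the failure mode concrete, here is a small self-contained demonstration with a made-up file name. Without -F, grep reads [FLAC] as a character class matching a single F, L, A, or C, so the stored path never matches itself.]

```bash
#!/bin/bash
# Hypothetical database line in the "path|sha" layout the script stores.
echo "/mnt/user/audio/Breezin'_[FLAC]/folder.jpg|0123abcd" > /tmp/dbsha.txt
currfile="/mnt/user/audio/Breezin'_[FLAC]/folder.jpg"

# Regex grep: [FLAC] matches one character, not the literal brackets, so this misses.
grep -q "$currfile|" /tmp/dbsha.txt && echo "regex grep: found" || echo "regex grep: missed"

# Fixed-string grep: the brackets are taken literally, so this finds the line.
grep -F -q "$currfile|" /tmp/dbsha.txt && echo "fixed-string grep: found"
```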
WeeboTech Posted July 8, 2013

Nice stuff, jbartlett. I've been working on a similar project, but in C. Actually, if I could get the author of md5deep/sha256deep to store the mtime as epoch seconds, I would be able to do most of it in shell. Although I asked, he directed me to Tripwire instead. Not where I wanted to go.

Anyway, I was looking at your well-written script. You may want to break down the actions into shell functions; it would be easier to follow.

I'm going to have some tools that you may be able to use to speed up the script. I see a bunch of call-outs, which are fine until you are processing 7 million files like I do. Call-outs like these can really drag a script down:

 filedate=$(stat -c %Y "$currfile")
 filesize=$(stat -c %s "$currfile")

Anyway, read up on bash loadables. I've compiled them and I'm testing a few out: finfo, strftime, lsql, gdbm, and a few others. What the loadables allow you to do is dynamically load C routines into the current bash executable. This alleviates the expense of forking a subordinate process just to assign a variable. For example, with the stat calls above, we can use

 finfo -v FILESIZE filename.dat

which loads the file size into the FILESIZE variable without forking. Same with mtime. strftime allows reformatting of the epoch time (like date) and storing it in a variable. While it's very easy and convenient to fork out and reassign, when you do it in a loop that runs hundreds of thousands of times, it bogs down; hence the reason I was working on loadables. I spent the weekend working on an md5 loadable so I wasn't forking a million times for an md5sum.

Oh, also read up on coproc. You can hand off your SQL updates to a coprocess. Real neat stuff.

I also learned today about altering the PRAGMA settings of the sqlite db. By keeping the journal in memory and turning off the sync, the speed difference was staggering. At first I did an ftw64 walk with 60,000 records for insertion; it took 18 minutes to insert the records. I turned off the disk-based journaling, set the journal to memory, and the time dropped dramatically. Here are my test results:

SQLITE (default journaling):
 time ./ftwinocache_sqlite $HOME
 selects 0, deletes 6628, duplicates 6628, inserts 52251, errors 0
 real 18m1.68s
 user 0m44.53s
 sys 3m15.21s

No journaling (synchronous = OFF, journal_mode = MEMORY):
 time ./ftwinocache_sqlite $HOME
 real 0m58.21s
 user 0m35.86s
 sys 0m22.31s

Journaling (synchronous = ON, journal_mode = MEMORY):
 time ./ftwinocache_sqlite $HOME
 selects 0, deletes 6629, duplicates 6629, inserts 52252, errors 0
 real 3m47.92s
 user 0m37.73s
 sys 1m58.72s

I would abandon sqlite in favor of something faster like GDBM, or even mysql to flat files (with the right libraries it works like sqlite). However, with sqlite you can download the databases to a workstation, load a module into Firefox, and browse them if needed. Plus, with the lsql loadable you have direct access to the records. Cool stuff.

Anyway, hit me up with a PM and we can collaborate. I can send you some of the loadables to play with.
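[Editor's note: for anyone wanting to reproduce the journaling experiment from the shell, a minimal sketch using the sqlite3 CLI follows; the table and file names are illustrative, not from the inventory script. Turning synchronous off trades crash safety for speed, and batching inserts inside one transaction helps for the same reason.]

```bash
#!/bin/bash
# Illustrative database and table; rebuildable data only, since a crash
# mid-run with these settings can corrupt or lose the database.
DB=/tmp/files.sqlite3

sqlite3 "$DB" <<'SQL'
PRAGMA synchronous = OFF;      -- skip the fsync after every write
PRAGMA journal_mode = MEMORY;  -- keep the rollback journal in RAM
CREATE TABLE IF NOT EXISTS locate (name TEXT PRIMARY KEY, mtime INTEGER, size INTEGER, checksum TEXT);
BEGIN;
INSERT OR REPLACE INTO locate VALUES ('/tmp/example.txt', 1373172452, 942, '');
COMMIT;
SQL
```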
WeeboTech Posted July 8, 2013

Here's where some of the fun begins.

First test, on a directory with roughly 5,000 files:

 find /tmp -type f -mount | wc -l
 5079

 #!/bin/bash
 DB="/tmp/files.sqlite3"
 sqlite3 ${DB} "CREATE TABLE IF NOT EXISTS locate (name TEXT PRIMARY KEY, mtime INTEGER, size INTEGER, checksum TEXT)"
 if [ -z "${1}" ]
 then DIR=$PWD
 else DIR="${1}"
 fi
 find "${DIR}" -mount -type f -print | while read FILENAME
 do
     MTIME=$(stat --printf "%Y" "${FILENAME}")
     SIZE=$(stat --printf "%s" "${FILENAME}")
     sqlite3 ${DB} "INSERT INTO locate (name,mtime,size,checksum) VALUES ('$FILENAME','$MTIME','$SIZE','');"
 done

Programs run from shell:
 root@slacky:/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables# time ./sqlite3_test.sh /tmp
 real 1m37.602s
 user 0m8.672s
 sys 0m33.185s

Second test, with loadable C modules in the bash core:

 #!/bin/bash
 [ ${DEBUG:=0} -gt 0 ] && set -x -v
 enable -f $PWD/finfo finfo
 enable -f $PWD/sqlite3 lsql
 DB="/tmp/files.lsql"
 lsql -d ${DB} "CREATE TABLE IF NOT EXISTS locate (name TEXT PRIMARY KEY, mtime INTEGER, size INTEGER, checksum TEXT)"
 if [ -z "${1}" ]
 then DIR=$PWD
 else DIR="${1}"
 fi
 find "${DIR}" -mount -type f -print | while read FILENAME
 do
     finfo -v MTIME -m "${FILENAME}"
     finfo -v SIZE -s "${FILENAME}"
     lsql -d ${DB} "INSERT INTO locate (name,mtime,size,checksum) VALUES ('$FILENAME','$MTIME','$SIZE','');"
 done

Loadable modules:
 root@slacky:/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables# time ./lsql_test.sh /tmp
 real 1m4.671s
 user 0m4.027s
 sys 0m9.684s

While the improvement isn't all that much, it shows that an in-core C module that avoids forking saves a decent amount of time. This becomes especially evident as your file count grows.

Here's where some of the real fun comes in. Those greps could be replaced with:

 lsql -a FIELDS -d /tmp/files.lsql "SELECT * from locate where NAME='/tmp/package-sqlite/install/slack-desc';"
 set | grep FIELDS
 FIELDS=([0]="/tmp/package-sqlite/install/slack-desc" [1]="1373172452" [2]="942" [3]="")

And something like this becomes possible:

 lsql -a LIST -d /tmp/files.lsql "SELECT name from locate where NAME LIKE '%slack-desc%';"
 set | grep LIST
 LIST=([0]="/tmp/tgz/powerdown/usr/doc/powerdown-1.03/slack-desc" [1]="/tmp/tgz/powerdown/install/slack-desc" [2]="/tmp/unRAID/package-postfix/install/slack-desc" [3]="/tmp/package-sqlite/install/slack-desc")

Cool stuff. I have not explored associative arrays yet; there may be a way to do that as well.

Also, another way to get rid of those greps is to use a GDBM file. With the grep you are reading the same file over and over and over; while once it is read it is in memory, the continual verification could bog down the script. With a GDBM file there is a key (the filename) and data (anything else you want to store), and lookups are very fast. Originally I had planned to use GDBM files. I'm very familiar with them, but there aren't many basic tools for accessing them easily, except for the bash gdbm loadable, which must be compiled manually. At least with sqlite you can install sqlite3, move the file to another platform, and/or use Firefox with the sqlite extension. Yet if the GDBM file is only used for a temporary sha256deep db conversion (for fast lookups), it would not matter.
 root@slacky:/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables# bash
 root@slacky:/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables# enable -f $PWD/gdbm gdbm
 root@slacky:/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables# help gdbm
 gdbm: gdbm [-euikvr] [-KVW array] file [key | key value ...]
     Interface to gdbm(3). It simulates disk-based associative array.
     gdbm file               -- print all key/\t/value pairs, ie. dict.items()
     gdbm -k file            -- print all keys, ie. dict.keys()
     gdbm -v file            -- print all values, ie. dict.values()
     gdbm file key           -- print dbm[key], ie. ${dbm[key]}
     gdbm -r file            -- reorganize database
     gdbm -K array file      -- save all keys into array
     gdbm -V array file      -- save all values into array
     gdbm -W array file      -- save all key/value pairs into array sequentially
     gdbm file key value     -- store key/value, ie. dbm[key]=value
     gdbm -i file key value  -- store key/value, only if key is new
     gdbm -v file key name   -- store value in name variable, ie. name=${dbm[key]}
     gdbm -e file            -- test if file is GDBM database
     gdbm -e file key        -- test if key exists
     gdbm -e file key value  -- test if key exists and value is dbm[key]
     gdbm -u file key        -- delete key, ie. unset dbm[key]
     gdbm -u file key value  -- delete key, only if value is dbm[key]

     More than one key/value pair can be specified on command line, in which
     case, they would be processed in pairs. It returns 1 immediately on any
     error or test failure. If all arguments are processed without error,
     then returns success (0). Each 'gdbm' command is complete, in that it
     opens and closes the database file.
WeeboTech Posted July 8, 2013

And s'more food for thought with GDBM files:

 #!/bin/bash
 DB="/tmp/files.gdbm"
 enable -f $PWD/finfo finfo
 enable -f $PWD/gdbm gdbm
 if [ -z "${1}" ]
 then DIR=$PWD
 else DIR="${1}"
 fi
 find "${DIR}" -mount -type f -print | while read FILENAME
 do
     finfo -v MTIME -m "${FILENAME}"
     finfo -v SIZE -s "${FILENAME}"
     gdbm ${DB} "${FILENAME}" "${MTIME},${SIZE}"
 done

Loading 5,200 files into GDBM using loadables:
 root@slacky:/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables# time ./gdbm_test.sh /tmp
 real 0m7.363s
 user 0m1.847s
 sys 0m0.940s

Testing it out:
 root@slacky:/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables# enable -f $PWD/gdbm gdbm
 root@slacky:/unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables# gdbm /tmp/files.gdbm | wc -l
 5218

Snippet of data:
 gdbm /tmp/files.gdbm | grep loadables
 /unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables/md5 1373247716,20106
 /unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables/head 1373216169,11506
 /unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables/md5main.c 1373241102,5007
 /unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables/perl/bperl.sh 1373294220,128
 /unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables/pushd.c 1373170632,16909
 /unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables/md5lib.o 1373242957,9832
 /unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables/finfo.o 1373174409,20680
 /unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables/dirname.o 1373216055,6664

Accessing a single key after load is very, very fast:
 time gdbm /tmp/files.gdbm /unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables/mktime.o
 1373289349,10008
 real 0m0.001s
 user 0m0.000s
 sys 0m0.000s

Load into a variable:
 gdbm -v /tmp/files.gdbm /unRAID/home/rcotrone/src.slacky/bash/bash-4.1/examples/loadables/mktime.o VAR
 set | grep VAR
 VAR=1373289349,10008

There's all kinds of neat stuff that can be done with this.
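[Editor's note: building on the help text and the loader script above, a hedged sketch of a change-detection pass against the same GDBM file might look like the following. It assumes the finfo and gdbm loadables are compiled in the current directory, as in the earlier examples, and uses only the gdbm options shown in the help output.]

```bash
#!/bin/bash
enable -f $PWD/finfo finfo
enable -f $PWD/gdbm gdbm
DB="/tmp/files.gdbm"

find "${1:-$PWD}" -mount -type f -print | while read -r FILENAME
do
    finfo -v MTIME -m "${FILENAME}"
    finfo -v SIZE -s "${FILENAME}"
    if gdbm -e "${DB}" "${FILENAME}"
    then
        # Compare the stored "mtime,size" pair against the live values.
        gdbm -v "${DB}" "${FILENAME}" STORED
        [ "${STORED}" != "${MTIME},${SIZE}" ] && echo "CHANGED: ${FILENAME}"
    else
        echo "NEW: ${FILENAME}"
    fi
done
```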
jbartlett Posted July 8, 2013 (Author)

Quote (vm):
 I'm having difficulty with these options, as it always insists I provide a path, unless I'm misunderstanding the syntax? BTW, the db already exists, as I successfully ran a few add commands. Also, I can verify if I explicitly provide the path.

Apologies. The original intent was to allow the script to test all files in the database, and while the reason currently escapes me (long day), I needed to implement a path option for it. All commands require a path to be specified. I'll update the original post to correct this.
jbartlett Posted July 8, 2013 (Author)

Quote (vm):
 I ran into an issue where files would get repeatedly added when running "-a" even though they were not changed. I tracked it down to having square brackets in some pathnames, which grep treats as part of a regular expression. I temporarily fixed it by changing the grep to use the -F switch, but I don't know if that will have some other negative implications, and there is perhaps a better solution.

I ran into an issue too with brackets such as [1], and I believed I had fixed that by escaping the brackets in the regex. The grep is also checking for a valid regex using brackets, so a -F won't work if you're not storing SHA values. I'll double-check the code.

ETA: I see it now; I forgot to escape brackets in one location. Your -F option looks like a good fix. That logic only checks the file names and not an optional SHA, so it works. I'll add that.
jbartlett Posted July 8, 2013 (Author)

WeeboTech, I'll take a look at those recommendations. Anything that helps speed up the process will be a boon overall. I likewise have shares that contain many thousands of files.
WeeboTech Posted July 8, 2013

Quote (jbartlett):
 WeeboTech, I'll take a look at those recommendations. Anything that helps speed up the process will be a boon overall. I likewise have shares that contain many thousands of files.

As I'm thinking about this, you can use lsql to pick off each file during a verify/update. In reality, if you are running the SHA binary for every file anyway, an lsql SELECT instead of the grep is going to be better. If you want speed, we can export the sql db into a GDBM file, then use gdbm to access the keys directly. It should be very fast.

I just found a technique for splitting a string into an array, so you can take apart the exported DB values:

 string="1:2:3:4:5"
 array=(${string//:/ })
 set | grep array
 for i in "${!array[@]}"
 do
     echo "$i=>${array[$i]}"
 done

 lsql -a FIELDS -d /tmp/files.lsql "SELECT * from locate where NAME='/tmp/package-sqlite/install/slack-desc';"
 set | grep FIELDS
 FIELDS=([0]="/tmp/package-sqlite/install/slack-desc" [1]="1373172452" [2]="942" [3]="")

 gdbm -v /tmp/files.gdbm "/tmp/package-sqlite/install/slack-desc" VALUE
 set | grep VALUE
 VALUE=1373172452,942
 FIELDS=(${VALUE//,/ })
 root@slacky:/tmp# set | grep FIELDS
 FIELDS=([0]="1373172452" [1]="942")

Frankly, I would just do it the lsql way.

As a recommendation: during the update, if the mtime or size has not changed, do you really need to recalculate the SHA checksum? Tripwire would probably do that check. Are you trying to duplicate what Tripwire does, or just inventory your files for changes? It seems that if you have not altered the time or size of the file, there should be no reason to compute the SHA and update the sha file. In verify mode, yes, you would want to calculate all values, but update should only do the checksum calculation if you find that the file has been altered on the filesystem. I suppose you could use ctime instead of mtime.

Chances are these files are not going to change that often, so you might do an update daily and a verify monthly or weekly to save some time. You could add another time field, so that every time a file is verified, it updates that field. Then alter the logic so you only verify files that have not been verified in, let's say, 7 days or 30 days, etc.
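[Editor's note: a hedged sketch of that last idea, using the sqlite3 CLI. The table and column names here are invented; the actual inventory schema may differ. It adds a last-verified column and selects only files whose last check is older than seven days.]

```bash
#!/bin/bash
DB=/mnt/cache/disk1.db           # hypothetical database location
NOW=$(date +%s)
CUTOFF=$((NOW - 7 * 24 * 3600))  # anything not verified in 7 days

# Add the column once; ignore the error if it already exists.
sqlite3 "$DB" "ALTER TABLE inventory ADD COLUMN verified INTEGER DEFAULT 0;" 2>/dev/null

sqlite3 "$DB" "SELECT name FROM inventory WHERE verified < $CUTOFF;" |
while read -r f
do
    # ... re-verify "$f" here, then stamp it as checked.
    esc=${f//\'/\'\'}   # double any single quotes for SQL
    sqlite3 "$DB" "UPDATE inventory SET verified = $NOW WHERE name = '$esc';"
done
```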
jbartlett Posted July 9, 2013 (Author)

Quote (WeeboTech):
 I just found a technique for splitting a string into an array, so you can take apart the exported DB values:
  string="1:2:3:4:5"
  array=(${string//:/ })
  for i in "${!array[@]}"
  do
      echo "$i=>${array[$i]}"
  done

I was looking for a way to do that! The regex comparison was such a PITA because it kept returning matches that shouldn't have matched. It was because I had tried to indicate a repeating value using {64}; the curly brackets didn't work and couldn't be escaped.

Quote (WeeboTech):
 As a recommendation: during the update, if the mtime or size has not changed, do you really need to recalculate the SHA checksum? It seems that if you have not altered the time or size of the file, there should be no reason to compute the SHA and update the sha file.

You'd specify the -s switch if you want to verify the integrity of the file regardless of whether the date & size remain unchanged. The date can be touched anyway, so it's unreliable for determining whether a file has been modified or not.

Quote (WeeboTech):
 You could add another time field, so that every time a file is verified, it updates that field. Then alter the logic so you only verify files that have not been verified in, let's say, 7 days or 30 days, etc.

Excellent idea. I'll add that to the To Do list.
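[Editor's note: for what it's worth, the {64} quantifier behaves differently depending on the tool; a small illustration follows, with a fabricated hash value. Plain grep's basic regex syntax needs the braces escaped, grep -E takes them unescaped, and bash's [[ =~ ]] honors them when the pattern is unquoted.]

```bash
#!/bin/bash
hash=$(printf 'a%.0s' {1..64})   # 64 characters, a stand-in for a SHA-256 value

# Basic regular expressions (plain grep): braces must be backslash-escaped.
echo "$hash" | grep -q '^[0-9a-f]\{64\}$' && echo "BRE: match"

# Extended regular expressions (grep -E): braces work unescaped.
echo "$hash" | grep -Eq '^[0-9a-f]{64}$' && echo "ERE: match"

# bash's own regex operator also honors the quantifier if the pattern is unquoted.
[[ $hash =~ ^[0-9a-f]{64}$ ]] && echo "bash =~: match"
```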