skank Posted June 9, 2012

My music is stored like this:

\Music\Albums
\Music\Singles
\Music\Top 100

Some mp3s are doubled or tripled. How can I easily find duplicate files (among 6000 mp3s) and delete them on unRAID?
Joe L. Posted June 9, 2012

"Any ideas?" Yes.

The attached script will scan your entire disk for duplicate files, regardless of their names or paths. Only EXACT duplicates are reported: files that differ in size or content are NOT considered duplicates.

The output file it creates will have content that looks like this:

/mnt/disk1/Pictures/Misc-Pictures/100OLYMP13/PB291591.JPG
/mnt/disk1/Pictures/Misc-Pictures/100OLYMP5/PB291591.JPG
/mnt/disk1/Pictures/Misc-Pictures2/100OLYMP13/PB291591.JPG
/mnt/disk1/Pictures/Misc-Pictures2/100OLYMP5/PB291591.JPG

/mnt/disk1/Pictures/Misc-Pictures/100OLYMP11/P2281705.JPG
/mnt/disk1/Pictures/Misc-Pictures/100OLYMP12/P2281705.JPG
/mnt/disk1/Pictures/Misc-Pictures2/100OLYMP11/P2281705.JPG
/mnt/disk1/Pictures/Misc-Pictures2/100OLYMP12/P2281705.JPG
/mnt/disk3/data/shared/100OLYMP/P2281705.JPG

/mnt/disk3/data/packages5-10/cpio-unmenu-package.conf
/mnt/disk6/boot/packages/cpio-unmenu-package.conf
/mnt/disk6/boot/packages5-10/cpio-unmenu-package.conf
/mnt/disk6/boot/unmenu/cpio-unmenu-package.conf

/mnt/disk3/Pictures/102SANDS/SANY2050.JPG
/mnt/disk6/Pictures/Misc-Pictures/102SANDS/SANY2050.JPG

/mnt/disk3/data/packages5-10/libX11-1.1.5-i486-1.tgz
/mnt/disk3/data/packagesSept2009/libX11-1.1.5-i486-1.tgz
/mnt/disk6/boot/packages/libX11-1.1.5-i486-1.tgz
/mnt/disk6/boot/packages5-10/libX11-1.1.5-i486-1.tgz

/mnt/disk4/Mp3/Stations/Country/181 FM Classic Hits Home of The 60 s and 70 s.url
/mnt/disk4/Mp3/Stations/Top 50/181 FM Classic Hits Home of The 60 s and 70 s.url

/mnt/disk1/Pictures/2007-FamilyReunion - Jackie's Pictures/DSCF0092.JPG
/mnt/disk1/Pictures/PictureFrame/DCIM/2007-FamilyReunion - Jackie's Pictures/DSCF0092.JPG

/mnt/disk1/Pictures/2009-Classic/4/VIDEO_TS/VTS_39_0.BUP
/mnt/disk1/Pictures/2009-Classic/4/VIDEO_TS/VTS_39_0.IFO

/mnt/disk1/Movies/SD_VIDEO/MOV0AF.avi.nfo
/mnt/disk4/Pictures/SD_VIDEO/MOV0AF.avi.nfo
echo "FINISHED: DUPES ARE IN FILE: /mnt/disk1/dupes_out.txt"
FINISHED: DUPES ARE IN FILE: /mnt/disk1/dupes_out.txt

The commands in the attached find_dupes.sh script are:

set -v
sysctl vm.vfs_cache_pressure=200
find /mnt/disk* ! -empty -maxdepth 5 -type f -links 1 -printf "%s " -exec ls -dQ {} \; | tee /mnt/disk1/dupes_tmp1
sort -n /mnt/disk1/dupes_tmp1 | awk '{ printf "%015d %s\n", $1, $0}' | cut -d" " -f1,3- | uniq -D -w 15 | cut -d" " -f2- | tee /mnt/disk1/dupes_tmp2
sed "s/'/'\\\''/g" < /mnt/disk1/dupes_tmp2 | xargs -n 1 -I FiLeNaMeX sh -c "dd if='FiLeNaMeX' count=1 ibs=4M 2>/dev/null | md5sum -| tr -d '\n'; echo 'FiLeNaMeX'" | tee /mnt/disk1/dupes_tmp3
sort /mnt/disk1/dupes_tmp3 | uniq -w32 --all-repeated| cut -c36- | sed -e 's/"/\\\"/g' -e 's/\(.*\)/"\1"/' | tee /mnt/disk1/dupes_tmp4
cat /mnt/disk1/dupes_tmp4 | xargs md5sum | tee /mnt/disk1/dupes_tmp5
sort /mnt/disk1/dupes_tmp5 | uniq -w32 -d --all-repeated=separate | cut -c35- | tee /mnt/disk1/dupes_out.txt
echo "FINISHED: DUPES ARE IN FILE: /mnt/disk1/dupes_out.txt"

Download the attachment and unzip it on your flash drive, then run find_dupes.sh.

It will create temp files on disk1. If this is not OK, change the script accordingly. It creates a number of intermediate temp files while running; you can delete them all once you have the final result. The final list of duplicate files is in /mnt/disk1/dupes_out.txt.

As you can see in the sample output above (from my newer server), the files can be in different directories, and can even have different names, but if they have the same MD5 checksum they are considered the same file.

The actual process runs in several steps:

The first line sets verbose mode so you can see the script running.
The second sets a kernel parameter to make it easier for other processes to grab memory if they need it.
The third, the "find" command, finds all the files and lists them preceded by their size in bytes.
The fourth keeps only the files that do not have a unique size. (If a file is unique in size, it cannot be a duplicate of another file.)
The fifth computes an MD5 checksum over the first 4 MB of each remaining file, and the sixth keeps only the files whose partial checksum is not unique. (If a file is unique in its first 4 MB, it is unique overall, so there is no need to read the rest of it.)
The seventh computes the MD5 checksum over the entire contents of the files that remain.
The eighth sorts those, drops the files with a unique MD5 checksum, and groups the rest in a readable way.

The output is put in /mnt/disk1/dupes_out.txt.

While the processing is running, the output is also sent to the terminal you are using. It is interesting to watch. It will take quite a few hours if you have a large number of files to scan.

Oh yes, I limited the depth of the scan (the find command shown above uses -maxdepth 5; adjust it as needed). I had some Windows backups that were far deeper and did not want to bother with them in the results.

It is up to you to delete all but one of the duplicates. The process does nothing to delete files; it only shows you where they are. Whatever you do, if it shows you have two copies of a file, do NOT delete both unless you want NO copies of the file to remain. It is expected you'll use the opportunity to organize the files as you desire, deleting all but ONE of the desired files.

Joe L.

Attachment: find_dupes.zip
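The staged idea described in this post (compare sizes first, then checksum only the files whose size is shared) can be sketched in a more compact form. This is only an illustration, not the attached find_dupes.sh: the function name find_dupes_in is made up, it assumes GNU find, xargs, md5sum and uniq, it leaves out the 4 MB partial-checksum pass as well as the -links 1 and -maxdepth options, and it will not cope with file names that contain newlines.

```shell
#!/bin/bash
# Sketch: report groups of byte-identical files under the given directories.
# find_dupes_in is a hypothetical helper name, not part of find_dupes.sh.
find_dupes_in() {
    local sizes
    sizes=$(mktemp) || return 1

    # Stage 1: list every non-empty regular file as "<size><TAB><path>".
    find "$@" -type f ! -empty -printf '%s\t%p\n' > "$sizes"

    # Stage 2: read the list twice; the first pass counts how often each size
    # occurs, the second prints only the paths whose size is shared.
    # Stage 3: checksum those files in full, sort by checksum, and print the
    # groups that share a checksum, separated by blank lines.
    awk -F'\t' '
        NR == FNR     { count[$1]++; next }
        count[$1] > 1 { print substr($0, length($1) + 2) }
    ' "$sizes" "$sizes" |
        tr '\n' '\0' |
        xargs -0 -r md5sum |
        LC_ALL=C sort |
        uniq -w32 --all-repeated=separate |
        cut -c35-

    rm -f "$sizes"
}
```

Something like find_dupes_in /mnt/disk1/Pictures > /mnt/disk1/dupes_out.txt would then write the groups to a file. Like the original script, it only reports duplicates and never deletes anything.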
skank (author) Posted June 9, 2012

Wow, this is great! Very handy! Thanks a lot, Joe! I will try it as soon as I can tomorrow. Thanks again.
skank (author) Posted June 10, 2012

Is there a way to make it search only in "\Alldata\music"? Under \Alldata\Movies I have a lot of duplicate files, but those must not be deleted, and this way the list is very long. If it's not possible, no problem; I already like this script!

My output also looks much more cluttered: in my screenshot, everything runs together.
bonienl Posted June 10, 2012

Quoting skank: Is there a way to make it search only in "\Alldata\music"? Under \Alldata\Movies I have a lot of duplicate files, but those must not be deleted, and this way the list is very long.

You can filter the final output file using the "grep" command:

grep "\/Alldata\/music" /mnt/disk1/dupes_out.txt > /mnt/disk1/dupes_out_filtered.txt
Joe L. Posted June 10, 2012

Quoting bonienl: You can filter the final output file using the "grep" command:
grep "\/Alldata\/music" /mnt/disk1/dupes_out.txt > /mnt/disk1/dupes_out_filtered.txt

If you do that, you'll lose the blank lines the script puts between the different groups of files.

Instead, just modify the very first find command like this:

from

find /mnt/disk* ! -empty -maxdepth 5 -type f -links 1 -printf "%s " -exec ls -dQ {} \; | tee /mnt/disk1/dupes_tmp1

to

find /mnt/disk*/AllData/music ! -empty -maxdepth 5 -type f -links 1 -printf "%s " -exec ls -dQ {} \; | tee /mnt/disk1/dupes_tmp1

Directory names are case-sensitive. If your "music" sub-directory is actually "Music", use a capitalized "Music" in the "find" command instead of "music"; otherwise the script will not match the directory name and nothing will print. (But it will run really fast, since no files will be found.)

Joe L.
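If the full scan has already been run, the finished report can also be filtered without losing the blank separator lines by treating each blank-line-separated group as one record. The following is a rough sketch only: filter_groups is a made-up name, and the /AllData/music/ fragment in the usage note is simply taken from the example path in this thread.

```shell
#!/bin/bash
# Sketch: keep only the duplicate groups that mention a given path fragment,
# preserving the blank line between groups.
# filter_groups is a hypothetical helper name.
filter_groups() {
    local fragment=$1 report=$2
    # RS= puts awk in paragraph mode: each blank-line-separated group becomes
    # one record. index() is a plain substring test, so slashes need no
    # escaping. The fragment is passed through the environment so that awk
    # does not interpret backslashes in it.
    FRAG="$fragment" awk -v RS= -v ORS='\n\n' 'index($0, ENVIRON["FRAG"])' "$report"
}
```

For example, filter_groups '/AllData/music/' /mnt/disk1/dupes_out.txt > /mnt/disk1/dupes_out_filtered.txt. A group is kept whole if any one of its files matches, so a music file that also has a copy elsewhere is still listed together with that copy.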
skank (author) Posted June 10, 2012

OK, I will try it, Joe. Thanks.
skank (author) Posted June 10, 2012

Hmm, strange: it works for everything except mp3s. Somehow it is skipping those files here.
althoralthor Posted June 10, 2012

Joe, awesome script! Thanks for sharing!
skank (author) Posted June 10, 2012

I don't get why the mp3s are skipped. I have tons of doubles (I checked), but they aren't shown by the script, and the directory is correct.
Joe L. Posted June 10, 2012

Quoting skank: I don't get why the mp3s are skipped. I have tons of doubles (I checked), but they aren't shown by the script, and the directory is correct.

You can check whether they have the same MD5 checksum. If not, they are not duplicates, even if you think they are.

To see the MD5 checksum, type:

md5sum /mnt/disk*/AllData/music/path/to/mp3_file.mp3
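To check one suspected pair directly, the two checksums can be compared in a small helper. This is a sketch under the assumption that md5sum is available; same_content is a made-up name and the paths are whatever real files you pass in.

```shell
#!/bin/bash
# Sketch: report whether two files are byte-for-byte identical by MD5 checksum.
# same_content is a hypothetical helper name.
same_content() {
    local a=$1 b=$2 sum_a sum_b
    # Reading through a redirect keeps the file name out of md5sum's output,
    # so the two results can be compared as plain strings. A missing file
    # makes the redirect fail ("No such file or directory") and returns 2.
    sum_a=$(md5sum < "$a") || return 2
    sum_b=$(md5sum < "$b") || return 2
    if [ "$sum_a" = "$sum_b" ]; then
        echo "identical: $a == $b"
    else
        echo "different: $a != $b"
        return 1
    fi
}
```

Call it with the full paths of the two files you believe are duplicates. If it prints "different", the duplicate finder is right to leave them out of its report.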
skank (author) Posted June 10, 2012

Quoting Joe L.: To see the MD5 checksum, type: md5sum /mnt/disk*/AllData/music/path/to/mp3_file.mp3

When I type that in, it says "no such file or directory".

Also, when I run the script I definitely see all the mp3s that are duplicates, but afterwards, when I open the text file, none of them are in it.
skank (author) Posted June 10, 2012

It's not working for me. It could be that some duplicate files are not exactly the same. For example:

Acdc - Thunderstruck, length 4:53
Acdc - Thunderstruck, length 4:52

Those files aren't exactly the same, but it is the same song, although they are in different folders too. So how do I filter those out? The script doesn't report them to me.
Joe L. Posted June 10, 2012

Quoting skank: Those files aren't exactly the same, but it is the same song, although they are in different folders too. So how do I filter those out?

That is a different request. Obviously, if they have a different duration they are different files, and I can guarantee that the file sizes (and the checksums) will be different.
skank (author) Posted June 10, 2012

Quoting Joe L.: That is a different request. Obviously, if they have a different duration they are different files, and I can guarantee that the file sizes (and the checksums) will be different.

Yes, sometimes the file sizes are different but the file name is the same; sometimes the size is the same but the name isn't 100% the same, for example "Acdc" and "Ac Dc". So how can I sort those out? They are different but not really, you get the picture.
JonathanM Posted June 10, 2012

Quoting skank: Yes, sometimes the file sizes are different but the file name is the same; sometimes the size is the same but the name isn't 100% the same, for example "Acdc" and "Ac Dc". So how can I sort those out?

There is no automated system that I can think of that will listen to the songs and make a choice for you. At some point you are going to have to manually delete the versions that you do not care for, as it's a matter of differing content, and differing opinions on which content version is right to keep. Perhaps you should cue up the duplicates in your favorite listening program and keep notes? As long as you can't hear a difference, delete the larger file, or if you think more data is better, delete the smaller one. If you don't care to take the time to listen to them, then just make an arbitrary decision, because the files must not mean that much to you.
skank (author) Posted June 11, 2012

Quoting JonathanM: There is no automated system that I can think of that will listen to the songs and make a choice for you. At some point you are going to have to manually delete the versions that you do not care for.

It doesn't have to listen to my songs to know they are the same, because the file names contain the same title. So there's no way to show them?

How come the script shows them while it's running (I see those double songs pass by) but doesn't export them to the text file? Any idea, Joe?
JonathanM Posted June 11, 2012

Quoting skank: It doesn't have to listen to my songs to know they are the same, because the file names contain the same title.

I can rename a file anything I want. Just because the name matches doesn't mean anything. Duplicate files have the same binary contents, which Joe's script finds just fine. If the files have different contents, they are different, and you will have to judge for yourself which one to keep. How you make the decision is up to you.
Joe L. Posted June 11, 2012

Quoting skank: When I type that in, it says "no such file or directory".

You must put in the correct path to YOUR files, not the text I gave. Replace /path/to/mp3_file.mp3 with the path and name of YOUR mp3 that you think should be found as a dupe.

Quoting skank: Also, when I run the script I definitely see all the mp3s that are duplicates, but afterwards, when I open the text file, none of them are in it.

You are seeing the very first pass of the process. It lists EVERY file, preceded by its size in bytes, regardless of its contents.
skank (author) Posted June 12, 2012

I see. Anyway, I've found a way to delete the duplicate mp3 files using Mp3tag: I could sort the files by name, and that way I could easily find the doubles. Thanks anyway.
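The sort-by-name approach can also be approximated on the command line. The sketch below lists mp3 files whose names become equal once case, spaces and punctuation are ignored, so "Acdc" and "Ac Dc" land in the same group. It only proposes candidates: as noted earlier in the thread, a matching name says nothing about the content. The function name similar_names and the normalization rule are assumptions made for illustration, GNU find is assumed, and file names containing tabs or newlines are not handled.

```shell
#!/bin/bash
# Sketch: list mp3 files whose names match after normalization (candidates only).
# similar_names is a hypothetical helper; the normalization rule is an assumption.
similar_names() {
    # Emit "<key><TAB><path>" where key is the file name without its extension,
    # lowercased, with everything except letters and digits removed.
    find "$@" -type f -iname '*.mp3' -printf '%f\t%p\n' |
        awk -F'\t' '{
            key = tolower($1)
            sub(/\.[^.]*$/, "", key)      # drop the extension
            gsub(/[^a-z0-9]/, "", key)    # drop spaces and punctuation
            print key "\t" substr($0, length($1) + 2)
        }' |
        LC_ALL=C sort |
        awk -F'\t' '
            { keys[NR] = $1; paths[NR] = substr($0, length($1) + 2); count[$1]++ }
            END {
                # Print only keys seen more than once, with a blank line
                # between groups.
                for (i = 1; i <= NR; i++) {
                    if (count[keys[i]] < 2) continue
                    if (printed && keys[i] != last) print ""
                    print paths[i]
                    last = keys[i]
                    printed = 1
                }
            }'
}
```

It takes one or more directories, for example the music folders on each disk, and prints groups separated by blank lines. Each group still has to be reviewed by hand before anything is deleted.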
Kertison Posted November 4, 2012

Quoting skank: Some mp3s are doubled or tripled. How can I easily find duplicate files (among 6000 mp3s) and delete them on unRAID?

I always prefer Duplicate Files Deleter for finding and deleting duplicate files. It is more hassle-free and user-friendly than the utilities I used before.
trurl Posted November 4, 2012

And there's also Duplicate Cleaner.
BLKMGK Posted November 4, 2012

Funny, just the other day I was googling for a dupe finder! There are tons of them out there, and I'll be scanning my stuff too. Personally I'm looking for exact dupes of the data, so I will be using something that scans and compares hashes. It will take forever, but I know there will be some to find, especially in my pictures, which are a mess. Different versions of the same song I'll keep, especially from different albums.
JonathanM Posted November 4, 2012

CloneSpy is a totally free Windows program that works well for me.