dedup
File duplicate detector
Overview
This project contains a simple script to help identify duplicate files based on file content rather than file name.
The tool take a single folder and determine duplicates as output.
Install and usage
There are two version of this tool.
The compiled version in Crystal is a bit faster than the Ruby version.
In the future there is an opportunity to adapt the Crystal version to a parallel version to make use of all CPU cores. But as of version 0.35, Crystal still only support concurrency and not true parallelism.
Run ruby version
Use ruby 2.3 or later
./dedup.rb <folder> [-debug]
Run crystal version
Build and run using crystal 30.1 or later
crystal run dedup.cr -- <folder> [-debug]
Build release and then run executable
crystal build dedup.cr --release --no-debug
./dedup <folder> [-debug]
Performance
The compiled Crystal version seems quite a lot faster compared with the Ruby version.
time ./dedup.rb /home/share/photos -debug
Total number of analyzed files: 70033
Number of files that share size with at least another file: 19065
Number of files that share size and file start with at least one other file: 16545
Number for files that share size and hash (SHA1) with at least one other file: 16536
Detected file extensions currently not covered:
, 51 files
xmp, 895 files
...
rb, 1 files
0,/home/share/photos/Foto 2009/FotoD80 2009/090107_hus_hackspett_mm/DSC_0021.NEF
0,/home/share/photos/Foto 2009/FotoD80 2009/090401_vinterfjäll/DSC_0021.NEF
0,/home/share/photos/Foto 2009/FotoD80 2009/Ny m090324_Husstatus_Alicia/DSC_0021.NEF
0,/home/share/photos/Foto 2008 och tidigare/Foto_D80_2008/081027_allmän_höst/DSC_0021.NEF
0,/home/share/photos/Lightroom_back_up/081027_allmän_höst/DSC_0021.NEF
...
7301,/home/share/photos/Foto_2020/2020_iPhone1_2/IMG_4180 2.JPG
7301,/home/share/photos/Foto_2020/2020_iPhone1_2/IMG_4180 3.JPG
7301,/home/share/photos/Foto_2020/2020_iPhone1_2/IMG_4180.JPG
36m09.42s real 15m38.62s user 7m48.83s system
Crystal version is roughly 70 times faster than the Ruby version.
time ./dedup /home/share/photos -debug
Total number of analyzed files: 70033
Number of files that share size with at least another file: 19065
Number of files that share size and file start with at least one other file: 16545
Number for files that share size and SHA256 hash with at least one other file: 16536
Detected file extensions currently not covered:
, 51 files
xmp, 895 files
...
rb, 1 files
0,/home/share/photos/Foto 2009/FotoD80 2009/090107_hus_hackspett_mm/DSC_0021.NEF
0,/home/share/photos/Foto 2009/FotoD80 2009/090401_vinterfjäll/DSC_0021.NEF
0,/home/share/photos/Foto 2009/FotoD80 2009/Ny m090324_Husstatus_Alicia/DSC_0021.NEF
0,/home/share/photos/Foto 2008 och tidigare/Foto_D80_2008/081027_allmän_höst/DSC_0021.NEF
0,/home/share/photos/Lightroom_back_up/081027_allmän_höst/DSC_0021.NEF
...
7301,/home/share/photos/Foto_2020/2020_iPhone1_2/IMG_4180 2.JPG
7301,/home/share/photos/Foto_2020/2020_iPhone1_2/IMG_4180 3.JPG
7301,/home/share/photos/Foto_2020/2020_iPhone1_2/IMG_4180.JPG
0m43.59s real 0m14.33s user 0m20.46s system
Both version produce the same result.
Runtime issues
If you have problem like the following in the compiled Crystal version.
Unhandled exception: Error opening file '...' with mode 'r': Too many open files (Errno)
Failed to raise an exception: END_OF_STACK
[0xce853fd8706] __crystal_sigfault_handler +39750
...
It actually is a stack limitation issue. You may need to raise the stack limitation of your OS.
On e.g. OpenBSD change stack limitation in login.conf.
If your user is associted with the staff
login class, change stack limitation in login.conf
as follows.
staff:\
:stacksize-cur=32M:
You can check your login class via userinfo <user>
.
$ userinfo peter
login peter
passwd *
uid 1000
groups peter wheel
change NEVER
class staff
...
- 1
- 0
- 0
- 0
- 7 days ago
- January 8, 2021
Thu, 21 Jan 2021 08:17:46 GMT