dedup

File duplicate detector

Overview

This project contains a simple script to help identify duplicate files based on file content rather than file name.

The tool takes a single folder as input and reports the duplicate files it finds as output.
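The debug output shown under Performance suggests a staged comparison: first by file size, then by the start of the file, then by a hash of the content. The following is a minimal Ruby sketch of that idea, not the code from dedup.rb. The names `HEAD_BYTES`, `shared`, and `duplicate_groups` are hypothetical, and the 4096-byte prefix length is an assumption.

```ruby
require "digest"
require "find"

# Hypothetical constant: how many leading bytes to compare in the second pass.
HEAD_BYTES = 4096

# Group files by the key the block returns and keep only groups
# that contain more than one file.
def shared(files)
  files.group_by { |f| yield f }.values.select { |group| group.size > 1 }
end

# Return an array of groups (arrays of paths) whose members agree on
# size, leading bytes, and SHA-256 digest.
def duplicate_groups(folder)
  files = Find.find(folder).select { |p| File.file?(p) && !File.symlink?(p) }

  shared(files) { |f| File.size(f) }.flat_map do |same_size|
    shared(same_size) { |f| File.binread(f, HEAD_BYTES) }.flat_map do |same_head|
      shared(same_head) { |f| Digest::SHA256.file(f).hexdigest }
    end
  end
end
```

Each stage only looks at files that survived the previous one, so the expensive full-file hash is computed for a small share of the files. Calling `duplicate_groups("<folder>")` returns one array of paths per set of identical files.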

Install and usage

There are three versions of this tool: Ruby, Crystal, and Go.

The compiled version in Crystal is quite a bit faster than the Ruby version.

The Go version is also fast.

In the future, the Crystal version could be adapted to run in parallel and make use of all CPU cores. As of version 1.7, however, Crystal still only supports concurrency and not true parallelism.

Run Ruby version

Use Ruby 2.3 or later.

./dedup.rb <folder> [-debug] [-dedup]

Run Crystal version

Build and run using Crystal 1.4 or later.

crystal run dedup.cr -- <folder> [-debug] [-dedup]

Build release and then run executable

crystal build dedup.cr --release --no-debug
./dedup <folder> [-debug] [-dedup]

Run Go version

Build and run using Go 1.17 or later.

go run dedup.go <folder>

Build release and then run executable

go build dedup.go
./dedup <folder>

Performance

The compiled Crystal version is considerably faster than the Ruby version. Timing of the Ruby version on a photo folder with about 70,000 files:

time ./dedup.rb /home/share/photos -debug 
Total number of analyzed files: 70033
Number of files that share size with at least another file: 19065
Number of files that share size and file start with at least one other file: 16545
Number for files that share size and hash (SHA1) with at least one other file: 16536
Detected file extensions currently not covered:
, 51 files
xmp, 895 files
...
rb, 1 files
0,/home/share/photos/Foto 2009/FotoD80 2009/090107_hus_hackspett_mm/DSC_0021.NEF
0,/home/share/photos/Foto 2009/FotoD80 2009/090401_vinterfjäll/DSC_0021.NEF
0,/home/share/photos/Foto 2009/FotoD80 2009/Ny m090324_Husstatus_Alicia/DSC_0021.NEF
0,/home/share/photos/Foto 2008 och tidigare/Foto_D80_2008/081027_allmän_höst/DSC_0021.NEF
0,/home/share/photos/Lightroom_back_up/081027_allmän_höst/DSC_0021.NEF
...
7301,/home/share/photos/Foto_2020/2020_iPhone1_2/IMG_4180 2.JPG
7301,/home/share/photos/Foto_2020/2020_iPhone1_2/IMG_4180 3.JPG
7301,/home/share/photos/Foto_2020/2020_iPhone1_2/IMG_4180.JPG
   36m09.42s real    15m38.62s user     7m48.83s system

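On the same folder, the Crystal version is roughly 50 times faster than the Ruby version in wall-clock time (43.59 s versus 36 min 9 s) and about 65 times faster in user CPU time.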

time ./dedup /home/share/photos -debug 
Total number of analyzed files: 70033
Number of files that share size with at least another file: 19065
Number of files that share size and file start with at least one other file: 16545
Number for files that share size and SHA256 hash with at least one other file: 16536
Detected file extensions currently not covered:
, 51 files
xmp, 895 files
...
rb, 1 files
0,/home/share/photos/Foto 2009/FotoD80 2009/090107_hus_hackspett_mm/DSC_0021.NEF
0,/home/share/photos/Foto 2009/FotoD80 2009/090401_vinterfjäll/DSC_0021.NEF
0,/home/share/photos/Foto 2009/FotoD80 2009/Ny m090324_Husstatus_Alicia/DSC_0021.NEF
0,/home/share/photos/Foto 2008 och tidigare/Foto_D80_2008/081027_allmän_höst/DSC_0021.NEF
0,/home/share/photos/Lightroom_back_up/081027_allmän_höst/DSC_0021.NEF
...
7301,/home/share/photos/Foto_2020/2020_iPhone1_2/IMG_4180 2.JPG
7301,/home/share/photos/Foto_2020/2020_iPhone1_2/IMG_4180 3.JPG
7301,/home/share/photos/Foto_2020/2020_iPhone1_2/IMG_4180.JPG
    0m43.59s real     0m14.33s user     0m20.46s system

Both versions produce the same result.

Deduplication

Add the -dedup flag to the command to de-duplicate identical copies of files.

This feature makes use of UNIX hard links and is only useful on UNIX file systems with hard link support.

File names that previously pointed to multiple copies of files with the same content will, after deduplication, point to a single file (with a single inode number). The required space is also reduced to that of a single file.
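As an illustration of the hard link mechanism described above, here is a minimal Ruby sketch. It is not the implementation behind -dedup; the method name `hardlink_group` and the `.dedup-tmp` suffix are hypothetical, and the sketch assumes that all paths are on the same file system and have already been verified to have identical content.

```ruby
# Replace every path after the first with a hard link to the first path.
# Assumption: the caller has already confirmed the contents are identical.
def hardlink_group(paths)
  keep, *rest = paths
  rest.each do |dup|
    next if File.identical?(keep, dup) # already the same inode, nothing to do
    tmp = "#{dup}.dedup-tmp"           # hypothetical temporary name
    File.link(keep, tmp)               # create the new link first
    File.rename(tmp, dup)              # then replace the duplicate in one step
  end
end
```

Creating the link under a temporary name and renaming it over the duplicate means the original name never disappears, even if the script is interrupted between the two calls.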

To check the result after deduplication, use ls and du.

ls -lRi <folder>
du -h <folder>

The -i flag prints the inode number of each file. The du command reports the resulting disk space used by the folder.

Run some tests on a copy of a folder BEFORE you use this on your complete photo catalog, to make sure the tool does what you expect.
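The same check can be scripted. The sketch below is one possible way to list which file names share an inode; the method name `names_by_inode` is hypothetical and not part of the tool.

```ruby
require "find"

# Map each [device, inode] pair to the file names that point at it.
def names_by_inode(folder)
  Find.find(folder)
      .select { |p| File.file?(p) && !File.symlink?(p) }
      .group_by { |p| s = File.stat(p); [s.dev, s.ino] }
end
```

After a successful deduplication, every set of former duplicates should appear as one entry with several names, for example when filtering with `names_by_inode("<folder>").select { |_, names| names.size > 1 }`.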

Process duplicates

You can write a simple script that processes the output.

For instance, this script (process.rb) filters out the first occurrence of each file group and prints the remaining duplicates.

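#!/usr/bin/env ruby

# Print every line except the first one of each duplicate group.
# Input lines look like "<group id>,<path>".
prev = -1
ARGF.each do |l|
	cur = l.split(",")[0].to_i
	if cur != prev
		# First file of a new group: remember the group id and skip the line.
		prev = cur
	else
		puts l
	end
end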
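Another small example in the same spirit counts how many files belong to each group. This is a sketch that assumes the `<group id>,<path>` line format shown under Performance; the method name `group_sizes` is hypothetical. It splits on the first comma only, so paths that contain commas stay intact, and it skips lines such as the -debug statistics that do not start with a numeric group id.

```ruby
# Count how many paths belong to each group id in "<group id>,<path>" lines.
def group_sizes(lines)
  sizes = Hash.new(0)
  lines.each do |line|
    id, path = line.chomp.split(",", 2)
    # Skip anything that is not "<digits>,<something>".
    next unless path && id =~ /\A\d+\z/
    sizes[id.to_i] += 1
  end
  sizes
end
```

Used as a filter, `group_sizes(ARGF).each { |id, n| puts "#{id}: #{n} files" }` prints one summary line per group.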

Runtime issues

If you run into a problem like the following with the compiled Crystal version:

Unhandled exception: Error opening file '...' with mode 'r': Too many open files (Errno)	
Failed to raise an exception: END_OF_STACK
[0xce853fd8706] __crystal_sigfault_handler +39750
...

This is actually a stack limit issue. You may need to raise the stack size limit of your OS.
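To see which limits currently apply, a short Ruby script can print them. Resource limits are inherited by child processes, so a script started from the same shell sees the same values the dedup executable would. This is only a diagnostic sketch; the method name `limit_report` is hypothetical.

```ruby
# Return the [soft, hard] limits inherited from the current shell.
def limit_report
  {
    "stack size (bytes)" => Process.getrlimit(Process::RLIMIT_STACK),
    "open files"         => Process.getrlimit(Process::RLIMIT_NOFILE),
  }
end

limit_report.each do |name, (soft, hard)|
  puts "#{name}: soft=#{soft} hard=#{hard}"
end
```

An unlimited value is reported as a very large integer (`Process::RLIM_INFINITY`).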

On OpenBSD, for example, the stack limit is set in login.conf.

If your user is associated with the staff login class, change the stack limit in login.conf as follows.

staff:\
	:stacksize-cur=32M:

You can check your login class via userinfo <user>.

$ userinfo peter
login	peter
passwd	*
uid	1000
groups	peter wheel
change	NEVER
class	staff
...