dedupe

Deduplicate files within a given list of directories by keeping one copy and making the rest links.

The replacement should happen atomically on Linux/Unix platforms but, due to limitations in the Python/Windows layers, may not be atomic on Windows. Symlinks may also cause issues on Windows.
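One common way to get an atomic swap on POSIX systems is to create the link under a temporary name next to the duplicate and then rename it over the duplicate. The sketch below illustrates that general technique; it is not necessarily how dedupe.py does it, and the function name `replace_with_link` and its parameters are hypothetical.

```python
import os
import tempfile


def replace_with_link(original, duplicate, symlink=False):
    """Replace `duplicate` with a link to `original` (hypothetical helper)."""
    if os.path.samefile(original, duplicate):
        return  # already the same file, nothing to do

    directory = os.path.dirname(os.path.abspath(duplicate))
    # Reserve an unused name in the duplicate's directory, then free it so the
    # link can be created there. (There is a small race between these steps.)
    fd, tmp = tempfile.mkstemp(dir=directory)
    os.close(fd)
    os.unlink(tmp)

    try:
        if symlink:
            os.symlink(os.path.abspath(original), tmp)
        else:
            os.link(original, tmp)
        # The rename is atomic on POSIX; on Windows the guarantees are weaker.
        os.replace(tmp, duplicate)
    except OSError:
        if os.path.lexists(tmp):
            os.unlink(tmp)
        raise
```

Because the link is renamed over the duplicate rather than the duplicate being deleted first, there is no moment at which the duplicate's path is missing.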

Usage: dedupe.py [options] dir1 [dir2...]

Options:
  -h, --help            show this help message and exit
  -n, --dry-run         Don't actually do the delete/link, just list what
                        would be linked
  -q, --quiet           Don't log which files were re-linked
  -r, --recurse         Recurse into subdirectories
  --min-size=MIN_SIZE   Minimum file-size to consider
  -s WHEN, --symlink=WHEN, --sym-link=WHEN
                        Should sym-links be used ([never], fallback, always)
  -a ALGORITHM, --algorithm=ALGORITHM
                        Choice of algorithm (one of DSA, DSA-SHA, MD4, MD5,
                        RIPEMD160, SHA, SHA1, SHA224, SHA256, SHA384, SHA512,
                        dsaEncryption, dsaWithSHA, ecdsa-with-SHA1, md4, md5,
                        ripemd160, sha, sha1, sha224, sha256, sha384, sha512,
                        whirlpool)
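The algorithm names listed above look like the ones Python's `hashlib` reports for the local OpenSSL build, so the set available on a given machine may differ. As a rough sketch of how a file could be hashed with a user-selected algorithm (the function name `file_digest`, the `sha256` default, and the chunk size are assumptions, not taken from dedupe.py):

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    # hashlib.new raises ValueError if the algorithm name is not supported.
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

`sorted(hashlib.algorithms_available)` shows which names the local Python build accepts.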

Originally I had tried out a similar dedupe tool by @jeek but it had a few issues:

  • it hashed every file, not just the ones suspected of being duplicates. On my low-end hosting service, jailshell would kill the process because it was taking too long (or maybe doing too much I/O in one command). With my 3.2GB of photos, @jeek's would time out after minutes, whereas this one ran in seconds.

  • it didn't have the option of doing a dry-run to see which files would be deduplicated, without actually performing the deduplication

  • it didn't detect whether the files to be deduplicated were on different devices/mount-points, so attempts to link() would fail in those cases

  • it didn't support doing sym-links

Thus, this utility was born. Basically, it should behave like the same utility, but faster and with a few more features.
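The speed-up described above comes from hashing only files that could be duplicates. A minimal sketch of that idea, assuming that file size is the cheap pre-filter and that a device comparison decides whether a hard link is possible (the names `find_duplicates` and `can_hard_link` are hypothetical, and SHA-256 is used here only for illustration):

```python
import hashlib
import os
import stat
from collections import defaultdict


def _sha256(path):
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicates(paths, min_size=1):
    """Return lists of paths with identical content, hashing only size matches."""
    by_size = defaultdict(list)
    for path in paths:
        info = os.lstat(path)
        # Only regular files at or above the minimum size are considered.
        if stat.S_ISREG(info.st_mode) and info.st_size >= min_size:
            by_size[info.st_size].append(path)

    groups = []
    for candidates in by_size.values():
        if len(candidates) < 2:
            continue  # a unique size cannot have a duplicate, so it is never read
        by_digest = defaultdict(list)
        for path in candidates:
            by_digest[_sha256(path)].append(path)
        groups.extend(g for g in by_digest.values() if len(g) > 1)
    return groups


def can_hard_link(a, b):
    """Hard links cannot cross devices, so compare the st_dev of both files."""
    return os.stat(a).st_dev == os.stat(b).st_dev
```

When `can_hard_link` returns false, a tool could skip the pair or fall back to a symlink, which is roughly what the `--symlink=fallback` option suggests.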