I’ve been using Photoprism to manage my large and growing photo library. We had simply outgrown using a single machine to manage the library, and Apple had burned us a couple of times by changing their native photo management system. I’m also not the type to trust someone else to keep and secure my photos, so I’m going to host it myself.
I have backups of those photo libraries which I’m working from, and unfortunately those backups contain duplicated photos. No problem, right? Photoprism has the ability to detect duplicates and reject them. Sweet. However, it relies on the duplicates being binary-identical files.
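Finding the binary-identical copies is the easy part; a checksum pass will group them regardless of file name. Something like this works as a sanity check (a sketch, assuming GNU coreutils; the path is a placeholder):

```bash
# Group binary-identical .jpg files by SHA-256.
# uniq -w64 compares only the 64-char hash; duplicates print in
# blank-line-separated groups.
find /path/to/backups -type f -name '*.jpg' -print0 \
  | xargs -0 sha256sum \
  | sort \
  | uniq -w64 --all-repeated=separate
```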
My problems start when I have a bunch of smaller photos which look OK, but are clearly not the originals. In this particular case the original is 2000×2000 and the alternate version is 256×256 (see the top of the post for an example of two such images). Great, just delete the small one. But with thousands of photos, how do I know which ones are resized duplicates of others?
There are other red flags here too: the smaller resized version is missing a proper EXIF date stamp. So sure, I can separate out the photos with valid EXIF data, leaving a bunch of others which don’t have it. But what if one of those undated photos isn’t a resized version? Maybe it’s a photo of something that I only have a small copy of?
Again, with thousands of photos to review, I’m not going to be able to reasonably figure out which ones are keepers. Good thing that doing dumb, repetitive stuff is what computers are good at. However, looking at two images and determining whether they show the same thing is not as easy as you might think.
The folks at ImageMagick have some good ideas on comparing images for differences. They even tackle this same issue of identifying duplicates, but they still leave you to build your own solution based on their advice.
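For a taste of their approach: ImageMagick’s `compare` can put a number on how alike two images are, once you’ve forced them to the same dimensions (a sketch, not the exact recipe from their docs; the file names are placeholders):

```bash
# Shrink both images to the same size ('!' forces exact geometry),
# then score them; an RMSE near zero means visually close.
convert big.jpg   -resize '64x64!' /tmp/a.png
convert small.jpg -resize '64x64!' /tmp/b.png
compare -metric RMSE /tmp/a.png /tmp/b.png null:   # metric prints to stderr
```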
Since I had this problem, I cooked up some scripting and an approach which I’ll share here. It’s messy, and I still rely on a human to make the final call, but for the most part I get a computer to do the brute force work to make the problem human-sized.
Let’s assume we have a few directories of files.

- `originals` – these are the EXIF-tagged photos
- `unknown` – the collection of untagged possible (smaller) duplicates
First we are going to create small thumbnails of both of them. I’m leaning heavily on `oiiotool`, part of the OpenImageIO tooling.
```bash
#! /usr/bin/env bash
#
# For every .jpg file in the path
# create a small thumbnail in ./thumbs
#
find /path/to/originals -type f -name '*.jpg' -print0 |
while IFS= read -r -d '' f; do
    dest=$(basename "$f")
    # oiiotool takes the input first, then applies operators in order
    oiiotool "$f" --resize 64x64 -o ./thumbs/"$dest"
done
```
After you’ve run this twice, once for `originals` and once for `unknown` (renaming the `thumbs` directory between runs to avoid a conflict), let’s call those two thumbnail directories `source` and `target` respectively.
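Concretely, the staging looked something like this (assuming the script above is saved as `make-thumbs.sh` and the `find` path is edited between runs; the script name is my own):

```bash
mkdir thumbs
./make-thumbs.sh          # find pointed at /path/to/originals
mv thumbs source
mkdir thumbs
./make-thumbs.sh          # find edited to point at /path/to/unknown
mv thumbs target
```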
Now we have

- `source` – 64×64 thumbnails of the originals
- `target` – 64×64 thumbnails of the unknowns

We’re going to create two more directories, `match` and `matched`. These will be used to store the results of the comparison script below.
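Creating them is nothing fancy, shown for completeness:

```bash
mkdir match matched
```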
```bash
#! /usr/bin/env bash
#
# Walk the target directory.
#
# For every file, walk the source directory - looking for a pass.
#
# If we pass - move the file(s).
#
find ./target -type f -print0 |
while IFS= read -r -d '' tf; do
    echo "thumb switch $tf"
    find ./source -type f -print0 |
    while IFS= read -r -d '' sf; do
        # Stricter thresholds I tried first:
        #idiff --fail 0.05 "$tf" "$sf" > /dev/null
        #idiff --fail 0.07 "$tf" "$sf" > /dev/null
        #idiff --fail 0.30 "$tf" "$sf" > /dev/null
        if idiff --fail 0.35 "$tf" "$sf" > /dev/null; then
            file=$(basename "$tf")
            new=$file$(basename "$sf")
            echo "$file" "$tf" "$sf" "$new"
            # Move the matching target thumbnail to the match directory
            mv "$tf" ./match
            # Copy the source thumbnail to the match directory,
            # prefixing its name with the target file name so the
            # pair sorts side by side
            cp "$sf" ./match/"$new"
            # Remove the source file (really we 'moved' it)
            rm "$sf"
            # Move the unknown file that matches the target
            # thumbnail to the matched directory
            mv ./unknown/"$file" ./matched
            break
        fi
    done
done
```
The `idiff` program is doing the heavy lifting here. Yes, this is an O(n^2) algorithm. Horrible, but it works, and computers will happily work all night. Because of how `idiff` compares images, it will fail two visually similar images unless you loosen its threshold, so I had to bump the `--fail` criterion to be more permissive. I started out with a very modest 0.05 value, which worked for a good number of my images. I then discovered it had not caught all of the visual matches, and kept increasing the value. At 0.40 I was getting a few bad matches, but at 0.30 I was still missing some of the photos; 0.35 may be the sweet spot, though it will likely vary based on your images. Even with bad matches, the next step acts as a fail-safe.
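If you’d rather calibrate before committing to an overnight run, you can probe a pair of thumbnails you already know are the same photo at different sizes (the file names here are hypothetical):

```bash
# Probe idiff at the thresholds I tried; watch where a known
# duplicate pair flips from fail to pass.
for t in 0.05 0.07 0.30 0.35 0.40; do
  if idiff --fail "$t" ./target/known-small.jpg ./source/known-orig.jpg > /dev/null; then
    echo "passes at --fail $t"
  else
    echo "fails at --fail $t"
  fi
done
```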
Once this script is done (the runtime will vary with the number of thumbnails being compared; for me it was many hours), you can open up the `match` directory in a file browser. Because of the file-renaming trick in the script, we should see nicely matched-up pairs of images side by side in a tile view of the directory. This allows us to run fast and loose with the `idiff` threshold and do a quick human scan of the result.
Assuming all is well, we can delete all of the files in the `match` directory. If something went wrong, we can just copy all (or just the bad matches) from the `matched` directory back to `unknown` and try something else. We can also delete all of the `matched` files if things are good, because each of them is the same image as one of the `originals`.
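In command form, the two paths look like this (a sketch; the bad-match file name is a placeholder):

```bash
# Happy path: every pair in ./match looked right.
rm ./match/*
rm ./matched/*                    # each one duplicates an original

# Recovery path: put a wrongly matched unknown back in play.
cp ./matched/IMG_1234.jpg ./unknown/
```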
Let’s recap the directories and steps.

- `originals` – our photo library of good photos
- `unknown` – badly EXIF-tagged images that may be duplicates
- `source` – thumbnails of `originals`
- `target` – thumbnails of `unknown`
- `match` – paired-up images that we believe are the same
- `matched` – images moved from `unknown` for all matches
Steps

1. Create thumbnails from `originals` and `unknown`.
2. Run the comparison script.
3. Review the images in the `match` directory; the file (re)naming should mean matching pairs sit side by side and are easy to see.
4. If all goes well, clean up the `match`/`matched` directories. The `unknown` directory now contains only unmatched images.
5. If things didn’t go well in step (3), we can recover the bad matches (or all of them) from the `matched` directory and copy them back to `unknown`.
In the worst case, we’re back to manually looking at `originals` vs. `unknown` to figure out which of the `unknown` images we want to keep.
I didn’t directly reference exiftool in this post, but without that excellent tool I would not have been able to sensibly sort the original set of photos into `originals` and `unknown`.
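For reference, that initial sort was along these lines (a sketch; keying on `DateTimeOriginal` is one plausible criterion, and the paths are placeholders):

```bash
# Split photos on whether exiftool can find a capture date.
find /path/to/photos -type f -name '*.jpg' -print0 |
while IFS= read -r -d '' f; do
    # -s3 prints the bare tag value, or nothing if the tag is absent
    if [[ -n $(exiftool -s3 -DateTimeOriginal "$f") ]]; then
        mv "$f" ./originals/
    else
        mv "$f" ./unknown/
    fi
done
```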