Comparing images to detect duplicates

I’ve been using Photoprism to manage my large and growing photo library. We had simply outgrown using a single machine to manage the library, and Apple had burned us a couple of times by changing their native photo management system. I’m also not the type to trust someone else to keep and secure my photos, so I’m going to host it myself.

I have backups of those photo libraries which I’m working from, and unfortunately those backups seem to contain duplicated photos. No problem, right? Photoprism has the ability to detect duplicates and reject them. Sweet. However, it does rely on the photos being exactly the same binary file.

My problems start when I have a bunch of smaller photos which look OK, but are clearly not the originals. In this particular case the original is 2000×2000 and the alternate version is 256×256 (see top of post for an example of two such images). Great, just delete the small one. But with thousands of photos, how do I know that one image is a resized duplicate of another?

There are other red flags here too: the smaller resized version is missing a proper EXIF date stamp. So sure, I can sort the photos into ones with valid EXIF data and ones without. But what if one of the photos without data isn’t a resized version? Maybe it’s a photo of something that I only have a small version of?

Again, with thousands of photos to review, I’m not going to be able to reasonably figure out which ones are keepers. Good thing that doing dumb stuff is what computers are good at. However, looking at two images and determining if they are the same thing is not as easy as you might think.

The folks at ImageMagick have some good ideas on comparing images for differences; they even tackle this same issue of identifying duplicates, but in the end you’re left to create your own solution based on their advice.

Since I had this problem, I did cook up some scripting and an approach which I’ll share here. It’s messy, and I still rely on a human to decide – but for the most part I get a computer to do some brute-force work to make the problem human-sized.

Let’s assume we have two directories of files.

originals – these are the EXIF tagged photos
unknown – the collection of untagged possible (smaller) duplicates

First we are going to create small thumbnails of both directories. I’m leaning heavily on oiiotool, part of the OpenImageIO tooling.
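
A minimal sketch of that thumbnailing, assuming JPEG and PNG sources and a thumbs output directory (adjust both to taste):

    #!/bin/bash
    # Create 64x64 thumbnails of every image in the current directory,
    # writing them into ./thumbs.
    mkdir -p thumbs
    for f in *.jpg *.JPG *.jpeg *.png; do
        [ -e "$f" ] || continue          # skip globs that matched nothing
        # --resize forces exactly 64x64, so idiff always gets
        # same-sized inputs in the comparison step later
        oiiotool "$f" --resize 64x64 -o "thumbs/$f"
    done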

After you’ve run this twice, once for originals and once for unknown (renaming the thumbs directory each time to avoid a conflict), let’s call those two thumbnail directories source and target respectively.
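
In other words, something along these lines, assuming the script above was saved as makethumbs.sh (a name I’m making up):

    cd originals && ../makethumbs.sh && mv thumbs ../source
    cd ../unknown && ../makethumbs.sh && mv thumbs ../target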

Now we have

source – 64×64 thumbnails of the originals
target – 64×64 thumbnails of the unknowns

We’re going to create two more directories, match and matched. These will be used to store the results of the following comparison script.
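
Here’s a sketch of that comparison script. It assumes the layout above, that the thumbnails kept the file names of the full-size images they came from, and a fail threshold of 0.35 (more on that value below):

    #!/bin/bash
    # Compare every unknown thumbnail against every original thumbnail.
    mkdir -p match matched
    n=0
    for t in target/*; do
        for s in source/*; do
            # idiff exits 0 when the two images agree within the -fail threshold
            if idiff -q -fail 0.35 "$s" "$t" >/dev/null 2>&1; then
                n=$((n+1))
                pair=$(printf '%04d' "$n")
                # the renaming trick: a shared numeric prefix keeps each
                # pair side by side when match/ is sorted by file name
                cp "$s" "match/${pair}-a-$(basename "$s")"
                cp "$t" "match/${pair}-b-$(basename "$t")"
                # pull the full-size duplicate out of unknown
                mv "unknown/$(basename "$t")" matched/
                break    # first match wins; move on to the next unknown
            fi
        done
    done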

The idiff program is doing the heavy lifting here. Yes, this is an O(n^2) algorithm: horrible, but it works, and computers will happily work all night. Due to how idiff compares images, it will FAIL two visually similar images that aren’t close enough numerically, so I had to bump the ‘fail’ criterion to be more permissive. I started out with a very modest 0.05 value, which worked for a good number of my images. I then discovered it had not caught all of the visual matches, and kept increasing the value. At 0.40 I was getting a few bad matches, but at 0.30 I was not matching some of the photos; 0.35 may be the sweet spot, but it will vary based on your images. Even with bad matches, the next step comes in as a fail-safe.
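
To find your own sweet spot, it’s easy to test a known pair by hand and adjust the -fail value; the file names here are made up:

    # exits 0 (prints MATCH) when the pair passes at the given threshold
    idiff -q -fail 0.35 source/IMG_1234.jpg target/IMG_1234.jpg \
      && echo MATCH || echo DIFFER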

Once this script is done (how long it takes will vary with the number of thumbnails being compared; for me it was many hours), you can open up the match directory in a file browser. Because of the file-renaming trick in the script, we should see nicely matched-up pairs of images side by side in a tile view of the directory. This allows us to run fast and loose with idiff, and do a quick human scan of the result.

Assuming all is well, we can delete all of the files in the match directory. If something went wrong, we can copy all of the files (or just the bad matches) from the matched directory back to unknown and try something else. If things are good, we can also delete all of the matched files, because each of them is the same image as one of the originals.

Let’s recap the directories and steps

originals – our photo library of good photos
unknown – badly EXIF tagged images that may be duplicates
source – thumbnails of originals
target – thumbnails of unknown
match – paired up images that we believe are the same
matched – full-size images moved out of unknown because they matched an original

Steps

  1. Create thumbnails from originals and unknown.
  2. Run the comparison script.
  3. Review the images in the match directory; the file (re)naming should mean each pair sits side by side and is easy to eyeball.
  4. If all went well, clean up the match and matched directories (commands sketched after this list). The unknown directory now contains only unmatched images.
  5. If things didn’t go well in step (3), we can recover the bad matches (or all of them) from the matched directory and copy them back to unknown.
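
In shell terms, steps 4 and 5 are roughly the following (directory names as above):

    # step 4: all good, throw away the review pairs and the duplicates
    rm match/* matched/*

    # step 5: something looked wrong, copy the matched files back
    # (or copy back just the bad ones) and try again
    cp matched/* unknown/
    rm match/* matched/*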

In the worst case, we’re back to manually comparing originals vs. unknown to figure out which of the unknown images we want to keep.

I didn’t directly reference exiftool in this post, but without that excellent tool I would not have been able to sensibly sort the original set of photos into originals and unknown.
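
For the record, exiftool can do that sort by writing its Directory pseudo-tag, with -if filtering on whether a date stamp exists. Something along these lines (a sketch, not my exact invocation):

    # photos with a proper EXIF date stamp go to originals,
    # the rest go to unknown
    exiftool -if '$DateTimeOriginal' '-Directory=originals' .
    exiftool -if 'not $DateTimeOriginal' '-Directory=unknown' .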
