Comparing images to detect duplicates

I’ve been using Photoprism to manage my large and growing photo library. We had simply outgrown using a single machine to manage the library, and Apple had burned us a couple of times by changing their native photo management system. I’m also not the type to trust someone else to keep and secure my photos, so I’m going to host it myself.

I have backups of those photo libraries which I’m working from, and unfortunately those backups contain duplicate copies of many photos. No problem, right? Photoprism has the ability to detect duplicates and reject them. Sweet. However, it relies on the duplicate files being byte-for-byte identical.
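To make "byte-for-byte identical" concrete, here is a minimal sketch of exact-duplicate detection by content hash. It is not Photoprism's implementation; the function names and the choice of SHA-256 are my own assumptions for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    # Only digests shared by two or more files are duplicates.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

A resized copy has different bytes and therefore a different digest, which is exactly why this kind of check misses the small versions described next.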

My problems start when I have a bunch of smaller photos, which look OK but are clearly not the original. In this particular case the original is 2000×2000, and the alternate version is 256×256 (see top of post for an example of two images). Great, just delete the small one. But with thousands of photos, how do I know that one is a resized duplicate of another?

There are other clues here too: the smaller resized version is missing a proper EXIF date stamp. So sure, I can separate the photos with valid EXIF data from the ones without it. But what if one of the photos without EXIF data isn’t a resized version? Maybe it’s a photo of something that I only have a small version of?

Again, with thousands of photos to review, I’m not going to be able to reasonably figure out which ones are keepers. Good thing that doing dumb stuff is what computers are good at. However, looking at two images and determining if they are the same thing is not as easy as you might think.

The folks at ImageMagick have some good ideas on comparing images for differences. They even tackle the same issue of identifying duplicates, but in the end they leave you to build your own solution from their advice.

Since I had this problem, I did cook up some scripting and an approach which I’ll share here. It’s messy, and I still rely on a human to decide – but for the most part I get a computer to do some brute force work to make the problem human sized.
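As a rough illustration of how a computer can shortlist resized duplicates, here is a sketch of an "average hash". To be clear, this is a swapped-in technique for illustration and not the scripting described in the full post. It assumes each image has already been loaded as a grid of grayscale values (a list of rows), which in practice needs an image library that is not shown here; all function names are hypothetical.

```python
def shrink(pixels, size=8):
    """Box-average a grayscale grid down to size x size.

    Assumes the grid is at least size pixels wide and tall.
    """
    height, width = len(pixels), len(pixels[0])
    out = []
    for r in range(size):
        r0, r1 = r * height // size, (r + 1) * height // size
        row = []
        for c in range(size):
            c0, c1 = c * width // size, (c + 1) * width // size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def average_hash(pixels, size=8):
    """Return an integer with one bit per cell: 1 if the cell is brighter than the mean."""
    flat = [v for row in shrink(pixels, size) for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | int(v > mean)
    return bits


def hamming_distance(a, b):
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")
```

Because both the original and the resized copy are reduced to the same tiny grid before hashing, their hashes should differ in few bits, while unrelated photos should differ in many. A small distance only flags a candidate pair; a human still makes the final call.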


Expanding a docker macvlan network

I’ve previously written about using macvlan networks with Docker. This has proved to be a great way to make containers more like lightweight VMs, as you can assign each one a unique IP on your network. Unfortunately, when I did this I only allocated 4 IPs to the network, and 1 of those is used to provide a communication path from the host to the macvlan network.

Here is how I’ve used up those 4 IPs:

  1. WireGuard – allows clients on WireGuard to see other Docker services on the host
  2. MQTT broker – used to bridge between my IoT network and the LAN without exposing all of my LAN to the IoT network
  3. nginx – a local-only web server, useful for fronting Home Assistant and other web-based apps I use
  4. shim – IP allocated to support routing from the host to the macvlan network

If I had known how useful giving a container a unique IP on the network was, I would have allocated more up front. Unfortunately you can’t easily grow a Docker network; you need to delete and recreate it.

As an overview, here is what we need to do:

  • Stop any docker container that is attached to the macvlan network
  • Undo the shim routing
  • Delete the docker network
  • Recreate the docker network (expanded)
  • Redo the shim routing
  • Recreate the existing containers

This ends up not being too hard, and the only slightly non-obvious step is undoing the shim routing, which is the reverse of the setup.
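The steps above can be sketched as a dry-run plan that only builds the commands and never executes them. The commands reflect a typical macvlan shim setup as I understand it, not necessarily the exact ones used in the full post; the function name, the shim interface name, and every address and container name passed in are hypothetical.

```python
def expansion_plan(network, parent, subnet, gateway, old_range, new_range,
                   shim_ip, containers, shim="macvlan-shim"):
    """Return the shell commands for the expansion, in order, without running them."""
    # 1. Stop any container attached to the macvlan network.
    plan = [f"docker stop {name}" for name in containers]
    # 2. Undo the shim routing: the reverse of the setup.
    plan += [
        f"ip route del {old_range} dev {shim}",
        f"ip link del {shim}",
    ]
    # 3. Delete the network, then 4. recreate it with the larger range.
    plan.append(f"docker network rm {network}")
    plan.append(
        "docker network create -d macvlan "
        f"-o parent={parent} --subnet {subnet} --gateway {gateway} "
        f"--ip-range {new_range} --aux-address host={shim_ip} {network}"
    )
    # 5. Redo the shim routing against the new range.
    plan += [
        f"ip link add {shim} link {parent} type macvlan mode bridge",
        f"ip addr add {shim_ip}/32 dev {shim}",
        f"ip link set {shim} up",
        f"ip route add {new_range} dev {shim}",
    ]
    # 6. Recreating the containers is left out: it depends on how each was created.
    return plan


if __name__ == "__main__":
    for command in expansion_plan(
        network="macnet", parent="eth0", subnet="192.168.1.0/24",
        gateway="192.168.1.1", old_range="192.168.1.64/30",
        new_range="192.168.1.64/29", shim_ip="192.168.1.67",
        containers=["wireguard", "mqtt", "nginx"],
    ):
        print(command)
```

Printing the plan rather than running it makes it easy to review the order, in particular that the shim teardown uses the old range and the setup uses the new one.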

The remainder of this post is a walkthrough of setting up a 4-IP network, then tearing it down and setting up a larger 8-IP network.
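To see what 4 versus 8 IPs means in CIDR terms, here is a small sketch using Python's standard `ipaddress` module. The 192.168.1.64 base address is a made-up example, not the range from the full post.

```python
import ipaddress


def addresses_in(cidr):
    """List every address covered by a CIDR range, as strings."""
    return [str(ip) for ip in ipaddress.ip_network(cidr)]


small = addresses_in("192.168.1.64/30")  # a /30 covers 4 addresses
large = addresses_in("192.168.1.64/29")  # a /29 covers 8 addresses

if __name__ == "__main__":
    print(len(small), small)
    print(len(large), large)
```

One thing this makes visible: a /29 that starts at the same base address contains the whole of the old /30, so in a layout like this the existing containers could keep their current IPs after the network is recreated.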


Running Selenium testing in a single Docker container

Selenium is a pretty neat bit of kit: it is a framework that makes it easy to create browser automation for testing and for web scraping. Unfortunately there seems to be a dependency mess just to get going, and when I hit these types of problems I turn to Docker to contain the mess.

While there are a number of “Selenium + Docker” posts out there, many have more complex multi-container setups. I wanted a very simple single container to have Chrome + Selenium + my code to go grab something off the web. This article is close, but doesn’t work out of the box due to various software updates. This blog post will cover the changes needed.

First up is the Dockerfile.

The changes needed from the original article are minor. Since Chrome 115 the chromedriver has changed locations, and the zip file layout is slightly different. I also updated it to pull the latest version of Selenium.
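To illustrate the zip layout change, here is a sketch that pulls the driver binary out of an archive whether it sits at the top level or inside a subdirectory. It assumes, as I understand the newer archives, that the binary is nested under a directory such as `chromedriver-linux64/`; the function name is hypothetical, and this is Python rather than the Dockerfile steps the post uses.

```python
import io
import zipfile


def read_chromedriver(zip_bytes: bytes) -> bytes:
    """Return the chromedriver binary from a zip, wherever it sits in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            # Match a top-level binary or one nested in a directory
            # (the directory name is an assumption, so it is not hard-coded).
            if name == "chromedriver" or name.endswith("/chromedriver"):
                return archive.read(name)
    raise FileNotFoundError("no chromedriver binary in archive")
```

Matching on the file name rather than a fixed path means the same code copes with either layout.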

ChromeDriver is a standalone server that implements the W3C WebDriver standard. This is what Selenium will use to control the Chrome browser.

The second part is the Python script tests.py

Again, only minor changes here to account for changes in the Selenium APIs. This script also applies the key ‘tricks’ that let Chrome run inside Docker, namely passing a few extra arguments to Chrome.
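For a sense of what such a script can look like, here is a minimal sketch using the Selenium 4 Python API. It is not the tests.py from this post: the flag list is my assumption of the arguments commonly needed for Chrome in a container, and `fetch_title` is a hypothetical helper. It also assumes Chrome and a matching chromedriver are already installed where Selenium can find them.

```python
# Flags commonly passed to Chrome when it runs inside a container (an assumption,
# not necessarily the exact set used in this post).
DOCKER_CHROME_ARGS = [
    "--headless=new",           # no display is available in the container
    "--no-sandbox",             # the Chrome sandbox usually cannot start as root in Docker
    "--disable-dev-shm-usage",  # /dev/shm is small by default in containers
]


def fetch_title(url: str) -> str:
    """Open a page in headless Chrome and return its title."""
    # Imported here so the module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in DOCKER_CHROME_ARGS:
        options.add_argument(arg)
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.title
    finally:
        # Always shut the browser down, even if the page load fails.
        driver.quit()


if __name__ == "__main__":
    print(fetch_title("https://www.python.org"))
```

The `try`/`finally` matters more in a container than on a desktop: a leaked Chrome process has nowhere to go and nobody to close it.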

This is a very basic ‘hello world’ style test case, but it’s a starting point for writing a more complicated web scraper.

Building is as simple as:

And then we run it and get output on stdout:

Armed with this simple Docker container and the Python Selenium documentation, you can now scrape complex web pages with relative ease.