Debugging your network

The other day there was something strange happening on my home network. While my Amazon Fire HD8 tablet was happily on the WiFi, my Pixel 4a was ‘bouncing’ between WiFi and mobile data. The animated image at the top of the post is a simulation of what I saw.

The first thing I did was reboot my phone. Maybe this was some weird snafu and a reboot would fix it. This is the classic turn it off and on again.

The problem persisted. At the time I thought this might be just my device, but I found out later that Jenn’s Pixel 6a was doing the same thing, as was an older Pixel 2. The 6a was happy enough delivering internet, but based on the mobile data usage for the day it was clear most of its traffic was going over mobile data.

Next was a look at the OpenWRT router(s) – and I was a bit surprised that all of them had been up for 70+ days. I was headed out the door and decided to put off further debugging, since it seemed that the computers (and the 6a) were working ok (or so I thought). I confirmed that my phone was fine as I worked from the office, with solid WiFi all day.

After dinner, it was clear that there was still a problem – but only the Pixel phones seemed to be having the problem. I tried removing the WiFi connection from my phone and re-adding it, maybe that would help? Nope. What if I switch to the guest network? Whaaat? Solid connectivity?!

Now the guest network doesn’t get the benefit of ad blocking from my Pi Hole. Maybe there is something going on there? I start poking around the logs on the Pi Hole, and reviewing logs on the OpenWRT router to see if there is any evidence. Nothing stands out. I try moving my phone to the IoT network – all three of the WiFi networks are hosted on the same hardware – so this helps eliminate the hardware and a good chunk of the software. The IoT network does make use of the Pi Hole, and to my surprise my phone was happy on the WiFi when using the IoT network.

At this point I’m about an hour and a half into debugging this, and I’m starting to run out of ideas. There have not been any configuration changes to my network recently. I don’t believe that any software updates have landed on my phone, and for 3 different Pixel devices to all have the same problem all of a sudden is really weird. I’ve rebooted both my networking devices and the mobile devices – still the problem persists.

I run with 3 access points (two dumb APs and a main gateway), but each of them advertises the same SSID on two WiFi channels (a total of 6 distinct WiFi channels, all with the same SSID). This is a great setup for me as my devices just seamlessly move from connection to connection based on what is best.

My next idea is to change the SSID of one of the channels from ‘lan wifi’ to ‘hack wifi’ – allowing me to specifically connect to a given radio on a given access point. Now I can connect my phone to this new ‘hack wifi’ and know that configuration changes to it will affect just the one device. Unsurprisingly the behaviour is the same: my phone just keeps connecting and reconnecting to this ‘hack wifi’.

I dive into some of the OpenWRT settings, looking for something that will make this WiFi connection more resilient. There are lots of options, but ultimately this is a dead end. I then wonder what would happen if I modify this WiFi connection to route to the ‘iot’ network instead of my ‘lan’ network. Now my connection to ‘hack wifi’ works great.. hmm
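
For reference, the change boils down to pointing the wifi-iface at a different network in /etc/config/wireless. Here is a sketch of the uci commands – the section name wifinet2 is an assumption about how my config is laid out:

```sh
# Point the 'hack wifi' SSID at the iot network instead of lan
# (wifinet2 is a placeholder for whichever wifi-iface section carries that SSID).
uci set wireless.wifinet2.ssid='hack wifi'
uci set wireless.wifinet2.network='iot'
uci commit wireless
wifi reload
```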

What does this tell me? There seems to be something weird about my ‘lan’ network itself, rather than something specific about the way that my ‘lan wifi’ is configured. This pivots me away from looking for differences in the WiFi configuration of my IoT, Guest, and Lan networks and toward looking more specifically at what’s connected to the lan network.

I grab the list of all devices on the lan network and eliminate all of the WiFi clients, because it’s probably not them (but I’m guessing). Let’s take a closer look at the wired things (of which I have a good number). I start unplugging things – no joy. I turn off my main wired switch and still nothing. Finally I try unplugging my old server.

Boom. That was it. Almost immediately my phone connects to the WiFi network and stays connected. A quick check, and all of the other Pixel phones are happy now too.

This reminds me of a problem I had years ago, when my network was smaller and a little less complicated. A network cable I had built turned out to be bad, but only after months of use. Suddenly one day my whole network was misbehaving and all devices were having problems. Powering off the machine had no effect; it was only once I removed the network cable that things came back to life. This situation was similar: the bad machine was connected to the main wired switch that I had powered off, but powering off the switch wasn’t enough to fix the issue. Removing the cable was the fix.

Throughout this problem, computers on the ‘lan wifi’ seemed fine. Video calls worked fine, with no strange drops or slowdowns. Still, the impact on the Pixel phones was extreme.

The old server is very old (built in 2009); all of the drives in it are starting to fail, and I should probably just power it off, wipe the drives, and dispose of the hardware. I just haven’t quite gotten to it yet – this is probably a sign that I should.

When Mirrors Break: RAID1 failure and recovery

A couple of years ago I added a second drive to my server in a RAID1 (mirror) configuration. Originally I was using the single drive for logs, but with a more durable mirror setup I moved more (important) data to it.

RAID is not a backup story; if you really care about the data you want to back it up. There are two hard lessons I learned with this recent failure (and my recovery). Two bits of data valuable to me that I’m storing on this mirrored volume are email and photoprism storage (but not the photos themselves). Stupidly I did not have regular backups of either of these – please learn from my mistake.

The two lessons I hope to learn from this are:

  1. Backup your data, even a bad backup is better than nothing
  2. Do not ignore any signs of problems, replace any suspicious hardware ASAP

If you read the comments on my previous post, you will see a history of minor failures that I clearly willfully ignored. I mean, hey – it’s a mirrored setup and mostly I had 2 drives working fine.. right? Stupid me.

The replacement 500GB SSD drive cost me $56.49 taxes in, and it even has a 5 year manufacturer warranty in comparison to the 3 year warranty on the failed ADATA drive. Sadly, checking the ADATA warranty shows me it made it just past the 3 year mark (not that a ‘free’ replacement drive would fix my problem).

While ADATA has been mostly reliable for me in the past, I’ll pick other brands for my important data. The ADATA products are often very cheap which is attractive, but at the current cost of SSDs it’s easy to pay for the premium brands.

Here is a brief replay of how the disaster unfolded. The previous day I had noticed that something was not quite right with email, but restarting things seemed to resolve the issue. The next morning email wasn’t flowing, so there was clearly something wrong.

Looking at the logs, I was seeing a lot of “structure needs cleaning” messages – an indicator that there is some sort of ext4 filesystem problem and a check needs to be run to clean things up. It also appeared that the ADATA half of the mirror had failed in some way. Rebooting the system seemed like a good idea, and everything seemed to have come back.
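
For reference, this is the kind of thing I was checking – a sketch rather than a transcript of my session:

```sh
# ext4 complaints like "structure needs cleaning" show up in the kernel log,
# and /proc/mdstat shows the state of each member of the mirror.
dmesg | grep -i ext4
cat /proc/mdstat
```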

Checking the logs for the mail system showed all was well, but then I checked email on my phone, and there were no messages? Stupidly I then opened up my mail client on my laptop, which then proceeded to synchronize with the mail server and delete all of the email stored on my laptop to mirror the empty mailbox on the server.

What was wrong? It took a while, but I figured out that my RAID1 array had completely failed to initialize: both volumes were marked as ‘spare’.
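
The ‘spare’ state is easiest to see with mdadm – a sketch, using the md0 device name that shows up later in this post:

```sh
# Both members listed as spares (instead of active/in-sync) is the failure mode here.
mdadm --detail /dev/md0
```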

Ugh, well that explains what happened. When the system rebooted, the mount failed – and my mail server just created new data directories on the mount point (which are on my root volume).

At this point I realize I’m in a bad place, having potentially flushed decades of email. Have I mentioned that running your own email is a bad idea?

Time to start capturing things for recovery. I made a copy of the two drives using dd:
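
Something along these lines – the destination path is illustrative, and sde/sdf are the Kingston and ADATA drives as noted below:

```sh
# Image both halves of the mirror before touching anything else; keep going
# past read errors rather than aborting.
dd if=/dev/sde of=/mnt/backup/sde.img bs=4M conv=noerror,sync status=progress
dd if=/dev/sdf of=/mnt/backup/sdf.img bs=4M conv=noerror,sync status=progress
```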

In the process of doing this, it became obvious that sdf (the ADATA drive) had hard read errors, whereas I was able to complete the image creation of sde (the Kingston drive).

Once I had some time to think about the situation, I was able to re-add the good drive to the array and make it active. This let me mount the volume and make a copy of the email for backup purposes. Once this was done I unmounted the volume and ran fsck -y /dev/md0 to fix all of the filesystem errors.
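
A rough sketch of that sequence – the partition names and paths are assumptions, and a forced assemble is one way to bring up an array whose members got marked as spares:

```sh
# Force the array up using only the good Kingston member, copy the mail off,
# then run the filesystem check on the (unmounted) array.
mdadm --stop /dev/md0
mdadm --assemble --run --force /dev/md0 /dev/sde1
mount /dev/md0 /mnt/recovery
rsync -a /mnt/recovery/mail/ /root/mail-backup/   # paths here are placeholders
umount /mnt/recovery
fsck -y /dev/md0
```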

I then stopped the currently running mail server, renamed the mount point directory to keep the email that had come into the system while I was doing repairs, and created a new (empty) mount point. Then a reboot.

Sigh of relief as all of my mail appeared back. Sure, I’m running with a degraded RAID1 array and the fsck clearly removed some corrupted files, but at least the bulk of my data is back.

Fixing the broken mirror was relatively straightforward. I bought a new drive. Then I captured the output of ls /dev/disk/by-id/ before powering down the system and physically swapping the bad drive for the new one. I could then repeat the ls /dev/disk/by-id/ and look at the diff; this allowed me to see the new drive appear and inspect which drive letter it mapped to.
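
Roughly:

```sh
ls -l /dev/disk/by-id/ > /root/by-id.before
# ... power down, swap the failed drive for the new one, boot back up ...
ls -l /dev/disk/by-id/ > /root/by-id.after
diff /root/by-id.before /root/by-id.after
```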

Nice, it appears to have slotted in just where the previous ADATA drive was, not important but comforting. I then dumped the fdisk information of the healthy Kingston drive.
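
That dump is just:

```sh
# Partition layout of the healthy Kingston drive (sde, per the diff above).
fdisk -l /dev/sde
```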

We want our new drive to be partitioned the same way; luckily the new SSD is even bigger. Mostly this is accepting the fdisk defaults, with the exception of typing in the last sector to match the Kingston drive.
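
I did this interactively in fdisk, but for reference sfdisk can clone the layout in one shot – double-check the device names first, since this overwrites the new drive’s partition table:

```sh
# Copy the partition table from the healthy drive (sde) to the replacement (sdf).
sfdisk -d /dev/sde | sfdisk /dev/sdf
```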

This is similar to the original RAID1 creation post, but we can now skip to step 8 and add the new volume.
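
The add itself is a single command (again, the partition name is an assumption), and then it’s just a matter of watching the rebuild:

```sh
# Add the new partition to the degraded array and watch the mirror re-sync.
mdadm /dev/md0 --add /dev/sdf1
watch cat /proc/mdstat
```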

And that’s it – now we just wait for the mirror to re-sync. It is interesting to note that while I can talk about the device ‘by-id’, mdstat uses the legacy drive letters.

And a short while later, it’s nearly done.

At this point my email appears to be working correctly. The ext4 filesystem corruption I blame on the failing ADATA drive in the mirror, but this is a guess. The corruption caused a few emails to be ‘lost’, but it had a bigger impact on the photoprism data, part of which was the mariadb storage. I also noticed that both my prometheus data and my mimir data were corrupted, though neither of these is critical.

Backups are good, they don’t have to be perfect – future you will be thankful.

OpenWRT 21.02 to 22.03 upgrade

Here are my notes on upgrading OpenWRT, they are based on my previous post on upgrading.

In this case I’m specifically upgrading a TP-Link Archer C7 v2 – the process will be similar for other OpenWRT devices, but it’s always worth reviewing the device page. I’ve also got some v5 versions, which means a slightly different firmware, but the exact same process.

For a major version upgrade it is worth starting by reading the release notes – nothing seems to be specific to my device that requires any special considerations, so I can just proceed.

An upgrade from OpenWrt 21.02 or 22.03 to OpenWrt 22.03.5 is supported in many cases with the help of the sysupgrade utility which will also attempt to preserve the configuration.

I personally prefer the cli based process, so we’ll be following that documentation.

Step 1. While I do nightly automated backups, I should also just do a web UI based backup – this is mostly for peace of mind.
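
For the record, the same backup can be produced from the cli – the filename is arbitrary, and the router IP is whatever yours is:

```sh
# On the router: create the config archive (same content as the web UI backup).
sysupgrade -b /tmp/backup-$(date +%F).tar.gz

# From another machine: copy it somewhere safe.
scp root@192.168.1.1:/tmp/backup-*.tar.gz .
```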

Step 2. Download the correct sysupgrade binary – the easy way to do this is by using the firmware selector tool. I recommend that you take the time to verify the sha256sum of your download; this is rarely an issue, but I have experienced bad downloads and it’s hard to debug after the fact.
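
The check itself is just sha256sum against the value shown by the firmware selector – the filename below follows the naming pattern for my Archer C7 v2 and will differ for other devices and versions:

```sh
# Compare the output against the checksum published on the firmware selector page.
sha256sum openwrt-22.03.5-ath79-generic-tplink_archer-c7-v2-squashfs-sysupgrade.bin
```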

It is recommended to check that you have enough RAM free – thankfully the Archer has a lot of RAM (which is used for the /tmp filesystem too), so I have lots of space.

Step 3. Get ready to flash – if you review the post install steps, you’ll see that while sysupgrade will preserve all of our configuration files, it won’t preserve any of the packages.

This script will print out all of the packages you’ve installed.
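
Something along these lines will do it – a sketch that compares each package’s Installed-Time against busybox’s (busybox ships with the firmware image, so its install time is the flash time):

```sh
#!/bin/sh
# Packages installed after the initial flash have a different Installed-Time
# than busybox, which came with the image.
FLASH_TIME=$(opkg status busybox | awk '/^Installed-Time:/ {print $2}')
opkg status | awk -v flash="$FLASH_TIME" '
  $1 == "Package:" { pkg = $2 }
  $1 == "Installed-Time:" && $2 != flash { print pkg }
'
```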

Save the list away so you can easily restore things post install. There is a flaw with this script as I’ll point out later, but in many cases it’ll work fine for you.

On my dumb access points the list of packages is short – mostly I have the prometheus exporter (for metrics) and rsync (for backups) installed. My main gateway has a few more packages (vnstat and sqm) but it’s similar.

Step 4. Time to flash. Place the firmware you downloaded onto the openwrt router in /tmp and run sysupgrade.
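
Roughly, the copy-and-flash looks like this – the filename and router IP are illustrative:

```sh
# From my desktop: copy the sysupgrade image to the router's /tmp.
scp openwrt-22.03.5-ath79-generic-tplink_archer-c7-v2-squashfs-sysupgrade.bin root@192.168.1.1:/tmp/

# On the router: flash it, preserving configuration (the default behaviour).
sysupgrade -v /tmp/openwrt-22.03.5-ath79-generic-tplink_archer-c7-v2-squashfs-sysupgrade.bin
```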

This is a bit scary – because you lose your ssh connection as part of the upgrade. It took about a minute and a half of radio silence before the device came back. However, I was then greeted with the new web UI – and over ssh I get the 22.03.5 version splash.

Step 5. Check for any package updates – usually I leave things well enough alone, but we just did a full upgrade so it’s worth making sure we are fully current. Note, this may mess with the script in step 3 since the install dates will change for other components.

If you get any packages listed, we can easily upgrade using opkg upgrade <pkg name>
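
Concretely, that’s along the lines of:

```sh
opkg update              # refresh the package lists
opkg list-upgradable     # anything listed has a newer version available
opkg upgrade <pkg name>  # upgrade each package you decide to take
```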

Step 6.  Install packages captured in step 3. Do this by creating a simple script to opkg install <pkg name> for each package.
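
A minimal sketch, assuming the step 3 list was saved to /tmp/packages.txt with one package name per line:

```sh
# Reinstall everything captured before the flash.
opkg update
while read -r pkg; do
    opkg install "$pkg"
done < /tmp/packages.txt
```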

Post install, take a careful look at the output of the installs, and look for any *-opkg files in /etc/config or /etc. These are config files which conflicted with local changes.

Sometimes you will want to keep your changes – other times you’ll want to replace your local copy with the new -opkg file version. Take your time working through this, as it will avoid tricky problems to debug later.
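
A quick way to spot them:

```sh
# Any file ending in -opkg under /etc is a config the upgrade could not merge.
find /etc -name '*-opkg'
```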

When I upgraded my main router, vnstat seemed to have been busted in some way. The data file was no longer readable (and neither was its backup) – I suspect that some code change made the format incompatible. I had to remove it and create a new one. Oh well.

Things mostly went smoothly; it took about 30 minutes per openwrt device, and I was going slowly and taking notes. There was one tiny glitch in the upgrade: the /root/.ssh directory was wiped out – I use this to maintain key based ssh/scp from each of my dumb APs to the main router.
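
One way to avoid that on future upgrades should be to add the directory to sysupgrade’s keep list in /etc/sysupgrade.conf:

```sh
# Tell sysupgrade to preserve /root/.ssh across future upgrades.
echo '/root/.ssh/' >> /etc/sysupgrade.conf
```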

Bonus. I found a new utility: Attended Sysupgrade. This is pretty slick as it makes it very easy to roll minor versions (so 22.03.02 -> 22.03.05 for example) but it will not do a major upgrade (21.02 -> 22.03). I’ve installed this on all of my openwrt devices and will use it to stay current. It takes care of all of the upgrade steps above.. but it does suffer the same ‘glitch’ in that /root/.ssh is wiped out. The other downside is that the custom firmware that is built breaks the script in step 3 – since the flash install date is the same for all of the components. I’ll need to go refactor that script for my next upgrade.