When Mirrors Break: RAID1 failure and recovery

A couple of years ago I added a second drive to my server in a RAID1 (mirror) configuration. Originally I was using the single drive for logs, but with a more durable mirror setup I moved more (important) data to it.

RAID is not a backup story; if you really care about the data, you want to back it up. There are two hard lessons I learned with this recent failure (and my recovery). Two bits of data that are valuable to me live on this mirrored volume: email, and photoprism storage (but not the photos themselves). Stupidly I did not have regular backups of either of these; please learn from my mistake.

The two lessons I hope to learn from this are:

  1. Back up your data, even a bad backup is better than nothing
  2. Do not ignore any signs of problems, replace any suspicious hardware ASAP

If you read the comments on my previous post, you will see a history of minor failures that I clearly willfully ignored. I mean, hey – it’s a mirrored setup and mostly I had 2 drives working fine.. right? Stupid me.

The replacement 500GB SSD drive cost me $56.49 taxes in, and it even has a 5 year manufacturer warranty, compared to the 3 year warranty on the failed ADATA drive. Sadly, checking the ADATA warranty shows me it made it just past the 3 year mark (not that a ‘free’ replacement drive would fix my problem).

While ADATA has been mostly reliable for me in the past, I’ll pick other brands for my important data. The ADATA products are often very cheap which is attractive, but at the current cost of SSDs it’s easy to pay for the premium brands.

Here is a brief replay of how the disaster rolled out. The previous day I had noticed that something was not quite right with email, but restarting things seemed to resolve the issue. The next morning email wasn’t flowing, so there was something wrong.

Looking at the logs, I was seeing a lot of “structure needs cleaning” messages – an indicator that there is some sort of ext4 filesystem problem and that it needs a check run to clean things up. It also appeared that the ADATA half of the mirror had failed in some way. Rebooting the system seemed like a good idea, and everything seemed to come back.

Checking the logs for the mail system showed all was well, but then I checked email on my phone, and there were no messages? Stupidly I then opened up my mail client on my laptop, which then proceeded to synchronize with the mail server and delete all of the email stored on my laptop to mirror the empty mailbox on the server.

What was wrong? It took a while, but I figured out that my RAID1 array had completely failed to initialize: both volumes were marked as ‘spare’.

Ugh, well that explains what happened. When the system rebooted the mount failed – and my mail server just created new data directories on the mount point (which is on my root volume).
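One way to avoid this particular trap is to have whatever starts the service check that the data directory really is a mounted filesystem first. This is only a sketch, not what my server runs: the function name, path and service command are made up for illustration, and it assumes the `mountpoint` tool from util-linux is installed.

```shell
# Hypothetical guard: succeed only when the given directory is a real mount.
# `mountpoint -q` exits 0 only when the path is a mount point.
require_mount() {
    if mountpoint -q "$1"; then
        return 0
    fi
    echo "refusing to start: $1 is not mounted" >&2
    return 1
}

# Example use in a service wrapper (path and command are placeholders):
#   require_mount /srv/mail && exec start-mail-server
```

With a guard like this, a failed mount stops the service instead of letting it quietly write fresh, empty data directories onto the root volume.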

At this point I realize I’m in a bad place, having potentially flushed decades of email. Have I mentioned that running your own email is a bad idea?

Time to start capturing things for recovery. I made a copy of each of the two drives using dd.

In the process of doing this, it became obvious that sdf (the ADATA drive) had hard read errors, where in contrast I was able to complete the image creation of sde (a Kingston drive).
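The exact dd invocation isn't reproduced here, so treat the following as a sketch of how such an image might be taken rather than the command I ran. The function name, block size and output paths are assumptions; `conv=noerror,sync` is a standard dd option pair that keeps going past read errors and pads the unreadable blocks with zeros.

```shell
# Image a block device (or a plain file) to an image file with dd.
# conv=noerror continues past read errors; sync pads short reads with
# zeros so offsets in the image stay aligned with the source.
image_drive() {
    src=$1
    dest=$2
    dd if="$src" of="$dest" bs=64K conv=noerror,sync
}

# Example (run as root; device names are from this post, output paths are placeholders):
#   image_drive /dev/sde /backup/sde.img
#   image_drive /dev/sdf /backup/sdf.img
```

Because of the zero padding, an image of a drive with bad sectors will be complete in size but will contain holes where the reads failed.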

Once I had some time to think about the situation, I was able to re-add the good drive to the array to make it become active. This let me mount the volume and make a copy of the email for backup purposes. Once this was done I unmounted the volume and ran fsck -y /dev/md0 to fix all of the filesystem errors.

I then stopped the currently running mail server, renamed the mount point directory to keep the email that had come into the system while I was doing repairs, and created a new (empty) mount point. Then a reboot.

Sigh of relief as all of my mail appeared back. Sure, I’m running with a degraded RAID1 array and the fsck clearly removed some corrupted files, but at least the bulk of my data is back.

Fixing the broken mirror was relatively straightforward. I bought a new drive. Then I captured the output of ls /dev/disk/by-id/ before powering down the system and physically swapping the bad drive for the new drive. I could then repeat the ls /dev/disk/by-id/ and look at the diffs. This allowed me to see the new drive appear, and inspect which drive letter it mapped to.
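The before/after comparison can be sketched roughly like this. The helper names and the snapshot file names are invented for illustration; the only thing taken from the steps above is listing /dev/disk/by-id/ twice and diffing the results.

```shell
# Save a sorted listing of a directory (default: /dev/disk/by-id).
snapshot_ids() {
    ls -1 "${1:-/dev/disk/by-id}" | sort
}

# Print entries present in the second snapshot file but not in the first.
new_ids() {
    comm -13 "$1" "$2"
}

# Example:
#   snapshot_ids > before.txt      # before powering down
#   snapshot_ids > after.txt       # after swapping the drive
#   new_ids before.txt after.txt   # shows the ids of the new drive
#   ls -l /dev/disk/by-id/         # symlink targets reveal the sdX letter
```

The entries under by-id are symlinks, so a long listing shows which sdX device each id currently points at.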

Nice, it appears to have slotted in just where the previous ADATA drive was, not important but comforting. I then dumped the fdisk information of the healthy Kingston drive.

We want our new drive to be partitioned the same way, luckily the new SSD is even bigger. Mostly this is accepting defaults with the exception of typing in the last sector to match the Kingston drive.

This is similar to my post on the original creation of the RAID1 array, but we can now skip to step 8 and add the new volume.

And that’s it, now we just wait for the mirror to re-sync. It is interesting to note that while I can talk about the device ‘by-id’, mdstat uses the legacy drive letters.
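For keeping an eye on the rebuild, a small helper that pulls just the percentage out of /proc/mdstat can be handy. This is a sketch: the function name is made up, and the partition name in the commented mdadm line is an assumption (the post only establishes /dev/md0 and the drive letters).

```shell
# Adding the new partition to the mirror looks something like this
# (the partition name is an assumption):
#   mdadm --manage /dev/md0 --add /dev/sdf1

# Print the rebuild percentage from an mdstat-style file
# (defaults to /proc/mdstat). Prints nothing when no rebuild is running.
resync_progress() {
    grep -Eo '(recovery|resync) *= *[0-9.]+%' "${1:-/proc/mdstat}"
}

# Example:
#   resync_progress
#   watch -n 5 cat /proc/mdstat    # or just watch the whole file
```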

And a short while later, it’s nearly done.

At this point my email appears to be working correctly.  The ext4 filesystem corruption I blame on the failing ADATA drive in the mirror, but this is a guess. The corruption caused a few emails to be ‘lost’, but had a bigger impact on the photoprism data which in part was the mariadb storage. I also noticed that both my prometheus data and mimir data were corrupted, neither of these are critical though.

Backups are good, they don’t have to be perfect – future you will be thankful.
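In the spirit of "a bad backup is better than nothing", even something as small as a dated tarball would have saved me a lot of stress. This is a minimal sketch, not my actual backup setup: the function name and the paths in the example are placeholders.

```shell
# Write a dated, compressed tarball of a directory and print its path.
backup_dir() {
    src=$1
    dest_dir=$2
    name=$(basename "$src")-$(date +%Y%m%d).tar.gz
    tar -czf "$dest_dir/$name" -C "$src" . && echo "$dest_dir/$name"
}

# Example (paths are placeholders; keep the destination on a different drive):
#   backup_dir /srv/mail /mnt/backup
```

Run from cron once a day, this gives you a crude but usable history. It makes no attempt at rotation, verification, or off-site copies.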

Array Game

It’s pretty normal today to have compelling games in just a web browser. There is a certain class of game that is casual, and while it may have achievements, it doesn’t require long periods of attention from you. I find these to be just the right thing to play in the background, a minute or two here and there over the day or week.

Array Game came to me via waxy. This is like an earworm, but it’ll eat up your attention. If you visit the page, you’ll get a screen that looks like:

Nothing is happening until you spend the 10 points you have on a generator. The game play is very simple. Earn points, spend them on upgrades, continue.

At one point you’ll be able to sell all of your progress to get some B points. This opens up another level, for which you can slowly earn more points and eventually start earning those points directly. In the course of the game, there are upgrades which will automate some of the button clicking.

There are levels A, B, C, D, E and F. There seem to be placeholders for G and H but as far as I can tell no way to earn those. The game does get updates from time to time so maybe one day there will be more, and if you dig around you can join a Discord server to discuss it but that’s beyond my level of interest.

It took about a month of casual play for me to ‘complete’ the current version of the game. I don’t know if it was my approach, or if the game has these built into it, but there were times when the only thing to do was just wait it out (hours) while you slowly earned enough points to move to the next level. I don’t see this as a bug, but a feature – it in a way forces you to make this a more casual game vs. a furious button clicking effort.

Now this is just a simple JS based game, and it’s running entirely in your browser. You can open up the web development tooling built into your browser and go mess with the code.

Spoiler alert – here are some simple console scripts you can run to automate the button clicking. This is incomplete, but I hope it inspires you to mess around in the console. Remember, this is a game, sure you can take some short-cuts but will that ruin the fun?

Shout out to Demonin and the other games and web things they’ve built, thanks for making a great time waster.

OpenWRT Travel Router

I recently posted about my purchase of the GL.iNet GL-AR300M16 which I of course immediately flashed with OpenWRT. As this was intended as a travel router it came along with us on a recent vacation. Above you can see the tiny little GL.iNet device plugged via the WAN port into one of the LAN ports of the internet router of the rental we had.

The GL.iNet isn’t a speedy device – with only a single 2.4GHz wireless connection it wasn’t able to saturate the internet connection (200Mbps symmetric) but I was still getting pretty reasonable speeds (~50Mbps).

I had set up the travel router to have a “travel” SSID, and could associate all of the devices we’d brought (6) to that. Sure this is a setup step, but for future trips I’ll only have to set up the travel router and all the devices will connect to the “travel” SSID.

As an aside, I’ll mention that I’ve started to bring my Roku when we travel, this way I have a familiar movie/show watching experience and I don’t have to remember to clear any passwords when I leave because I take the box with me.

Where it gets more interesting, is that I configured the travel router as a wireguard client. Pretty much following my post on OpenWRT as a wireguard client verbatim. I did set up the allowed_ips as a /24 CIDR block – effectively creating a split VPN – so that traffic targeted at my ‘home’ network would flow over wireguard, but other traffic would go directly to the internet. The benefit to this VPN setup is that if I’m streaming a movie on Netflix, that traffic will bypass the wireguard tunnel and when I want to reach a “local to my home network” service like homeassistant, it just works like at home.
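To make the split concrete, here is a small sketch of the decision that the allowed_ips setting effectively drives: destinations inside the /24 go through the tunnel, everything else goes straight out. The helper names are invented, and 192.168.1.0/24 is a stand-in subnet, not necessarily my home network.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed when address $1 falls inside CIDR block $2 (e.g. 192.168.1.0/24).
in_cidr() {
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Report which path a destination would take with a split tunnel.
route_for() {
    if in_cidr "$1" "$2"; then
        echo wireguard
    else
        echo direct
    fi
}

# Example (the subnet is a placeholder):
#   route_for 192.168.1.50 192.168.1.0/24
#   route_for 8.8.8.8 192.168.1.0/24
```

The real routing is done by the kernel using the routes wireguard installs for the allowed IPs; this only illustrates the matching rule.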

Then as icing on the cake, I fiddled with the DNS options so that any address handed out by the travel router gets my pi-hole as the DNS server. If you want to do something similar check out my pi-hole setup post that talks about this configuration in OpenWRT. This gives me ad-blocking and my personal block lists. This helps keep the internet a little bit more family friendly, plus no ads!

I did experience a few network hangs, 4 over the week long trip, but a quick power cycle of the router and we were back in business. I suspect that this may have been either high load, or heat, that triggered the problem. The limitation of only 2.4GHz networking didn’t seem to be a big deal, and I got reasonable WiFi coverage over 3 floors of a townhome.

This setup was pretty awesome. It gave me a ‘at home’ network experience, while I was away from home. What a great little box.

As a bonus, let’s dive into another travel configuration. I’m writing this post from a hotel room, connected to the travel router. Now the hotel doesn’t have a wired ethernet port, so I need to do something slightly different.

There is an OpenWRT package “travel-mate” that makes this more complicated setup easy. We want to operate in AP+STA (access point + station) mode, where the single wifi radio is doing both jobs. Many routers can do this, the GL.iNet is one of them.

The travel-mate documentation is a little sparse, but there is a long and fairly active forum thread that provides help. I was able to get it working with a little bit of stumbling around.

Installing two packages: travelmate and luci-app-travelmate will get you going. An OpenWRT menu “Services->Travelmate” will appear in the web UI, allowing you to access the configuration.

A newly installed travel-mate will have blank information, mine is a capture from a running version.

You’ll need to run the “Interface Wizard” once to get the interfaces set up. This should create some trm_ network interfaces. I did this once and have forgotten about the details; you can probably safely do the same.

When you are ready to connect to the upstream WiFi (say the hotel’s internet) you will want to visit the “Wireless Station” tab and scan for, and select a SSID to connect to.

There is some magic I don’t yet fully understand about configuring a login script to bypass the captive portal that your hotel is likely to have. In my case, my laptop that connected to the travel router was presented with the captive portal webpage and I was able to log in that way (the travel router basically was a proxy for the captive portal). Once logged in, the router was granted access by the hotel WiFi and all other devices connected to the travel router just worked. (yeah, magic)

I’ll just quickly cover the travel-mate General Settings.

The top red circle is the “Enabled” checkbox. This is handy as you don’t want travel-mate to be active if you’re using the travel router in a wired setup like I was in the top part of this post. Leaving it enabled while in a wired setup will possibly cause WiFi drop outs as it tries to scan for available networks to connect to.

The bottom red circle is checked on by default, but for my use I found that I had to disable it. Otherwise it was disabling my wireguard VPN. With the checkbox cleared, my split VPN is working fine and I’m enjoying the “at home networking, while I’m not at home” experience. It was also pretty nice that my phone and my tablet just connected to the “travel” WiFi once it was up and running.

Since we are using a single radio to both handle the clients (my devices) and talk to the host network (the hotel WiFi), you can expect the overall speed to be much lower. I know this is true as I’ve tested travel-mate in this AP+STA mode with my home network, and seen the difference (I was only able to get about 26Mbps when my home net connection is much faster). The good news here is that hotel WiFi, while adequate, isn’t very good, at least not in this hotel.

Here are two speed tests, one via the travel-router, and one directly to the hotel WiFi.

They are basically the same, especially given the variations you’ll see on the hotel WiFi. The key take away here is that using the travel router isn’t imposing any real overhead or limits, and if we had much better hotel WiFi I’d still get acceptable performance.

It is interesting to note that with travel-mate running in AP+STA mode, and only 3 devices and 1 user, it was very stable. I didn’t have any hangs or weird problems once it was set up. I’ll certainly bring it along for future trips.