Cloudflare Managed DNS

Consider this an update to a previous article where I talk about using rebel.ca to manage my DNS records. I still like them as a company, and the support is generally good, but I will no longer recommend them. Without getting into the details, I had a DNS management problem with them that was the last straw for me; it resulted in about 36 hours of downtime for this domain.

The good news is that today there are lots of free managed DNS providers. Really, this isn’t a huge technical challenge. You need at least two nameserver (NS) entries for your domain, managed through your registrar, pointing at the servers that will host your zone. Ideally those nameservers are hosted on machines on different networks and in different locations for good redundancy. As for managing the records themselves, offering a friendly web UI isn’t a big ask in 2024. These nameservers answer queries from the many resolvers out there asking for a name-to-IP mapping. Yes, there are real costs to operating one of these – but for an individual personal domain, the number of queries and the amount of data are pretty tiny.
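
For example, dig can show the delegation for a domain and then ask one of the delegated nameservers directly (example.com and ns1.example.net are stand-ins; substitute your own domain and whichever nameserver the first command returns):

# Which nameservers is the domain delegated to?
$ dig example.com NS +short

# Ask one of those nameservers directly for a name-to-IP mapping
$ dig @ns1.example.net example.com A +short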

I decided to go with Cloudflare; they offer a generous free plan and are pretty central to the operation of the internet as a whole. Hopefully I can trust them to manage my DNS record(s), but I do have some reluctance, because the internet is dominated by a few huge tech companies, which isn’t great. I believe the internet needs to be built on open standards, and we need lots of medium-sized companies providing services.

You can sign up for Cloudflare in minutes. It’s entirely self-service, and an email confirmation gives you full access.

Adding one of my domains to be managed by Cloudflare is easy. Click the ‘Website’ entry in the left navigation bar, then pick +Add a site and enter the domain you want them to manage DNS for.

Now we pick the ‘Free’ plan and move to the next step.

Cloudflare does a pretty slick job of sniffing out your existing DNS records (assuming you have some) and populating its configuration. Review these and edit as needed, then continue.

As I mentioned above, I really don’t want Cloudflare messing with things much – so I disabled the “Proxy” for all of my records and have them set up as DNS only. If I ever have a problem, I can go in and use some of their free DDoS protection stuff – but let’s start with just the basics.
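
A quick sanity check that “DNS only” is doing what I expect (the domain here is a stand-in): the A record should come back with my origin server’s address, not a Cloudflare proxy address.

# With the proxy disabled, this should return the origin server's IP
$ dig example.com A +short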

To make Cloudflare my DNS provider, I need to change the nameserver records with my domain registrar so that the domain points at the Cloudflare nameservers. Cloudflare monitors this and will make my domain ‘active’ in the dashboard; since they’ve already created the DNS records, things are good to go. It was a matter of minutes for my domain to become active once I’d updated the nameservers with my registrar.

I did this a few weeks ago. So far so good, but then again DNS is pretty boring when it’s not breaking the internet.

 

PSA: DNS servers have no priority order

It is a common misconception that the DNS servers your system uses are tried in a priority order. I had this misunderstanding for years, and I’ve seen many others with the same confusion.

The problem comes from router or OS setup screens where you can list a “Primary” and a “Secondary” DNS server. This certainly gives you the impression that one is ‘mostly used’ and the other is a ‘backup’ used only if the first is broken or too slow. This is false, but confusingly also sometimes true.

Consider this Stack Exchange question/answer, or this Server Fault question. If you go searching, there are many more questions on this topic.

Neither DNS resolver lists nor NS record sets are intrinsically ordered, so there is no “primary”. Clients are free to query whichever one they want in whichever order they want. For resolvers specifically, clients might default to using the servers in the same order as they were given to the client, but, as you’ve discovered, they also might not.

Let me also assure you from personal experience: there is no guarantee of order. Some systems will always try the “Primary” first, then fall back to the “Secondary”. Others will round-robin queries. Some will detect a single failure and re-order the two servers for all future queries. Some devices (Amazon Fire tablets) will magically use a hard-coded DNS server if the configured ones are not working.

Things get even more confusing because there is the behaviour of the individual clients (like your laptop or phone), and then the layers of DNS servers between you and the authoritative server. DNS is a core part of how the internet works, and there is lots of information on the different parts of DNS out there.

The names “Primary” and “Secondary” come from the server side of DNS. When you are hosting a zone and configuring the name-to-IP mappings, you set up your DNS records on the “Primary” server; the “Secondary” is usually an automated replica of that “Primary”. This really has nothing to do with what the client devices are going to do with those addresses.

Another pitfall people run into when they think there is an ordering is setting up a pi-hole for ad-blocking. They will use their new pi-hole installation as the “Primary” and a popular public DNS server (like 8.8.8.8) as the “Secondary”. This configuration sort of works: at least some of the time your client machine will hit the pi-hole and ad-blocking will work. Then, unpredictably, an ad will get through – because the client has used the “Secondary”.
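
As an illustration of the pitfall (addresses made up), a resolver configuration like this does not mean the pi-hole is always asked first:

# /etc/resolv.conf -- both servers may be used, in no guaranteed order
# pi-hole ("Primary")
nameserver 192.168.1.10
# public resolver ("Secondary") -- ads get through whenever this one is picked
nameserver 8.8.8.8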

Advice: treat all of your configured DNS servers as equivalent and make sure they return the same answers. There is no ordering.
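
Since any of the configured servers may be asked, it is worth checking that they actually agree; the server addresses below are placeholders for your own:

# Ask each configured server the same question and compare the answers
$ dig @192.168.1.10 example.com +short
$ dig @192.168.1.11 example.com +short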

I personally run two pi-hole installations. My “Primary” handles about 80% of the traffic, and the “Secondary” about 20%. This isn’t because 20% of the time my “Primary” is unavailable or too slow, but simply that about 20% of the client requests are deciding to use the “Secondary” for whatever reason (and that a large amount of my traffic comes from my Ubuntu server machine). Looking deeper at the two pi-hole dashboards, the mix of clients looks about the same, but the “Secondary” has fewer clients – it does seem fairly random.

If your ISP hands out IPv6 addresses, things get even more interesting: your clients will also be assigned an IPv6 DNS address, which adds yet another interface to the client device and another potential DNS server (or two) that may be used for name lookups.
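
On a Linux machine running systemd-resolved, you can see which DNS servers (IPv4 and IPv6) are attached to each interface:

# List the DNS servers configured per network interface
$ resolvectl dns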

Remember, it’s always DNS.

Replacing a ZFS degraded device

It was no surprise that a new RAIDZ array built out of decade-old drives was going to have problems; I didn’t expect the problems to happen quite so quickly, but I was not surprised. This drive had 4534 days of power-on time, basically 12.5 years. It was also manufactured in October 2009, making it 14.5 years old.

I had started to back up some data to this new ZFS volume, and on one of the first scrub operations ZFS flagged this drive as having problems.
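
For reference, a scrub can be kicked off and checked by hand; the pool name here is a placeholder for mine:

# Start a scrub, then check whether any pool is reporting problems
$ sudo zpool scrub tank
$ zpool status -x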

The degraded device maps to /dev/sdg – I determined this by looking at the /dev/disk/by-id/wwn-0x50014ee2ae38ab42 link.

On one of my other systems I’m using snapraid.it, which I quite like. It has a SMART check that calculates how likely each drive is to fail. I’ve often wondered how accurate this calculation is.

The nice thing is you don’t need to be using snapraid to get the SMART check data out; it’s a read-only activity based on the devices. In this case it has decided the failing drive has a 100% chance of failure, so that seems to check out.
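
Running the check is a single command (assuming snapraid is installed):

# Reads SMART attributes from the drives and prints an estimated
# failure probability for each one; nothing is written to the disks
$ sudo snapraid smart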

Well, as it happens I had a spare 1TB drive on my desk so it was a matter of swapping some hardware. I found a very useful blog post covering how to do it, and will replicate some of the content here.

As I mentioned above, you first need to figure out which device it is; in this case it is /dev/sdg. I also want to figure out the serial number.
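
Roughly, that looks like this (the wwn value is the one from my pool; smartctl comes from the smartmontools package):

# Which short /dev/sdX name does the by-id link point at?
$ ls -l /dev/disk/by-id/wwn-0x50014ee2ae38ab42

# Print the drive's model and serial number
$ sudo smartctl -i /dev/sdg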

Good, so we know the serial number (and the brand of drive), but when you’ve got 4 identical drives, which of the 4 has the right serial number? Of course, I ended up pulling all 4 drives before I found the matching serial number. The blog post gave some very good advice.

Before I configure an array, I like to make sure all drive bays are labelled with the corresponding drive’s serial number, that makes this process much easier!

Every install I make will now follow this advice, at least for ones with many drives. My system now looks like this, thanks to my label maker:

I’m certain future me will be thankful.

Because the ZFS array had marked this disk as being in a FAULTED state, we do not need to mark it ‘offline’ or do anything else before pulling the drive. If we were swapping an ‘online’ disk, we might need to do more before pulling it.
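
If the disk had still been ONLINE, that extra step would look roughly like this (pool and device names are placeholders for mine):

# Only needed when the disk is still ONLINE; mine was already FAULTED
$ sudo zpool offline tank /dev/sdg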

Now that we’ve done the physical swap, we need to get the new disk added to the pool.

The first, very scary thing we need to do is copy the partition table from an existing drive in the vdev. The new disk is the TARGET, and an existing disk is the SOURCE.

Once the partition table is copied over, we want to randomize the GUIDs, as I believe ZFS relies on unique GUIDs for devices.
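
One common way to do both steps is with sgdisk; a sketch, with TARGET and SOURCE as placeholders (note that the disk being overwritten is the one named in --replicate):

# Copy the partition table from an existing vdev member (SOURCE)
# onto the new disk (TARGET) -- double-check the order before running!
$ sudo sgdisk --replicate=/dev/TARGET /dev/SOURCE

# Give the copied partition table fresh, unique GUIDs
$ sudo sgdisk --randomize-guids /dev/TARGET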

This is where my steps deviate from the referenced blog post, but the changes make complete sense. When I created this ZFS RAIDZ array I used the short sdg name for the device. However, as you can see after a reboot the zpool command is showing me the /dev/disk/by-id/ name.
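
The replace command then looks something like this (the pool name and the new disk’s by-id name are placeholders; the old device is referenced by whatever name zpool status currently shows for it):

# zpool replace <pool> <old-device> <new-device>
$ sudo zpool replace tank wwn-0x50014ee2ae38ab42 /dev/disk/by-id/wwn-0xNEWDISK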

This worked fine. I actually had a few missteps trying to do this, and zpool gave me very friendly and helpful error messages. More reason to like ZFS as a filesystem.

Cool, we can see that ZFS is repairing things with the newly added drive. Interestingly it is shown as sdg currently.
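
Resilver progress can be checked at any time (pool name again a placeholder):

# Shows resilver/scrub progress; re-run it (or wrap it in "watch") to follow along
$ zpool status tank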

This machine is pretty loud (it has a lot of old fans), so I was pretty wild and powered it down while ZFS was trying to resilver things. When I rebooted it after relocating it to where it normally lives, where the noise won’t bug me, it seems that the device naming has sorted itself out.

The snapraid SMART report now looks a lot better too.

It took about 9 hours to finish the resilvering, but then things were happy.

Some folks think that you should not use RAIDZ, but instead create a pool from a collection of mirror vdevs.

About 2 weeks later, I had a second disk go bad on me. Again, no surprise since these are very old devices. Here is a graph of the errors.

The ZFS scrub ran on April 21st, and you can see the spike in errors – but clearly this drive was failing slowly all along as I was using it in this new build. This second failing drive was /dev/sdf, which, if you look back at the snapraid SMART report, was at a 97% failure percentage. It is worth noting that while ZFS and the snapraid SMART check have both decided these drives are bad, I was able to put both drives into a USB enclosure and still access them – I certainly don’t trust these old drives to store data, but ZFS stopped using the device before it became unusable.

I managed to grab a used 1TB drive for $10. It is quite old (from 2012) but has only 1.5 years of power-on time. Hopefully it’ll last, but at that price it’s hard to argue. Swapping that drive in was a matter of following the same steps. Having the drive bays labelled with serial numbers was very helpful.

Since then, I’ve picked up another $10 1TB drive, and this one is from 2017 with only 70 days of power-on time. Given I’ve still got two decade-old drives in this RAIDZ, I suspect I’ll be replacing one of them soon. The going used rate for 1TB drives is between $10 and $20 locally – amazing value if you have a redundant layout.