
Replacing a ZFS degraded device

It was no surprise that a new RAIDZ array built out of decade-old drives was going to have problems. I didn’t expect the problems to happen quite so quickly, but I was not surprised. This drive had 4534 days of power-on time, roughly 12.4 years. It was also manufactured in Oct 2009, making it 14.5 years old.

I had started to back up some data to this new ZFS volume, and during one of the first scrub operations ZFS flagged this drive as having problems.

$ zpool status 
  pool: backup
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 908K in 13:29:51 with 0 errors on Tue Apr  9 06:59:00 2024
config:

        NAME                                          STATE     READ WRITE CKSUM
        backup                                        DEGRADED     0     0     0
          raidz1-0                                    DEGRADED     0     0     0
            ata-WDC_WD10EARX-00N0YB0_WD-WMC0T0683946  ONLINE       0     0     0
            wwn-0x50014ee2b0706857                    ONLINE       0     0     0
            wwn-0x50014ee2adfa14f6                    ONLINE       0     0     0
            wwn-0x50014ee2ae38ab42                    FAULTED     36     0     0  too many errors

errors: No known data errors
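
With more than a handful of disks it can help to pull out just the rows that are not healthy. The following is a minimal sketch that assumes the table layout shown above (a NAME/STATE header row, one device per line, a blank line after the table). The function name unhealthy_devices is my own invention, not part of ZFS.

```shell
# List rows from `zpool status` output whose STATE column is not ONLINE.
# Reads the status text on stdin, so it also works on saved output.
unhealthy_devices() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }  # header row starts the table
        in_config && NF == 0          { in_config = 0 }        # blank line ends the table
        in_config && $2 != "ONLINE"   { print $1, $2 }         # name and state only
    '
}
```

It could be fed directly from the command, for example `zpool status backup | unhealthy_devices`.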

The faulted device maps to /dev/sdg – I determined this by looking at the /dev/disk/by-id/wwn-0x50014ee2ae38ab42 link.
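
Since the entries under /dev/disk/by-id are symlinks to the kernel device nodes, the lookup can be done in one step. This is a small sketch assuming GNU readlink (the usual one on Linux); the function name resolve_disk_id is hypothetical.

```shell
# Print the device node that a /dev/disk/by-id entry points at.
# by-id entries are symlinks, so resolving the link is all that is needed.
resolve_disk_id() {
    readlink -f -- "$1"
}
```

For the disk in this post that would be `resolve_disk_id /dev/disk/by-id/wwn-0x50014ee2ae38ab42`.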

On one of my other systems I’m using snapraid.it, which I quite like. It has a SMART check that does a calculation to indicate how likely the drive is to fail. I’ve often wondered how accurate this calculation is.

SnapRAID SMART report:

   Temp  Power   Error   FP Size
      C OnDays   Count        TB  Serial                Device    Disk
 -----------------------------------------------------------------------
     28   4534       0 100%  1.0  WD-WCAV53163713       /dev/sdg  -
     24   3803       0  84%  1.0  WD-WMC0T0683946       /dev/sdd  -
     23   4156       0  84%  1.0  WD-WCAZA6813339       /dev/sde  -
     27   4740       0  97%  1.0  WD-WCAV51778566       /dev/sdf  -

The FP column is the estimated probability (in percentage) that the disk
is going to fail in the next year.

Probability that at least one disk is going to fail in the next year is 100%.

The nice thing is you don’t need to be using snapraid to get the SMART check data out; it’s a read-only activity based on the devices. In this case it has decided the failing drive has a 100% chance of failure, so that seems to check out.
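
I don’t know exactly how SnapRAID arrives at the combined "at least one disk" figure, but if you assume the disks fail independently it is one minus the product of the per-disk survival chances. Here is a sketch of that arithmetic; the independence assumption and the function name any_failure_probability are mine, not SnapRAID’s.

```shell
# Combine per-disk failure percentages into the chance that at least one fails,
# assuming the disks fail independently of each other.
any_failure_probability() {
    printf '%s\n' "$@" | awk '
        BEGIN { survive = 1 }
        { survive *= (1 - $1 / 100) }                  # chance every disk so far survives
        END { printf "%.2f\n", (1 - survive) * 100 }   # complement, as a percentage
    '
}
```

Calling `any_failure_probability 100 84 84 97` with the FP values from the report gives 100.00, which is consistent with the summary line SnapRAID printed.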

Well, as it happens I had a spare 1TB drive on my desk so it was a matter of swapping some hardware. I found a very useful blog post covering how to do it, and will replicate some of the content here.

As I mentioned above, you first need to figure out which device it is; in this case it is /dev/sdg. I also want to figure out the serial number.

$ sudo smartctl -a /dev/sdg | grep Serial
Serial Number:    WD-WCAV53163713

Good, so we know the serial number (and the brand of drive), but when you’ve got 4 identical drives, which of the 4 has that serial number? Of course, I ended up pulling all 4 drives before I found the matching serial number. The blog post gave some very good advice.

Before I configure an array, I like to make sure all drive bays are labelled with the corresponding drive’s serial number, that makes this process much easier!

Every install I make will now follow this advice, at least for ones with many drives. My system now looks like this thanks to my label maker

I’m certain future me will be thankful.

Because ZFS had marked this disk as being in a FAULTED state, we do not need to mark it ‘offline’ or anything else before pulling the drive. If we were swapping an ‘online’ disk we might need to do more before pulling the drive.

Now that we’ve done the physical swap, we need to get the new disk added to the pool.

The first, very scary thing we need to do is copy the partition table from an existing drive in the vdev. The new disk is the TARGET, and an existing disk is the SOURCE.

# Check twice, you really don't want to mess this up
# sudo sgdisk --replicate /dev/TARGET /dev/SOURCE

$ sudo sgdisk --replicate /dev/sdg /dev/sdf

Once the partition table is copied over, we want to randomize the GUIDs, as I believe ZFS relies on unique GUIDs for devices.

# Again, taking care that the device is the TARGET (aka: new drive)

$ sudo sgdisk --randomize-guids /dev/sdg
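
Because sgdisk takes the TARGET before the SOURCE, it is easy to get the order backwards. One way to slow yourself down is a tiny helper that only prints the two commands for review. This is a sketch, not a safety net: the function name print_clone_commands is made up, and it deliberately takes SOURCE first and TARGET second, the reverse of sgdisk’s own order.

```shell
# Print (never run) the sgdisk commands for cloning a partition table.
# Usage: print_clone_commands SOURCE TARGET
print_clone_commands() {
    source_disk=$1
    target_disk=$2
    if [ -z "$source_disk" ] || [ -z "$target_disk" ]; then
        echo "usage: print_clone_commands SOURCE TARGET" >&2
        return 2
    fi
    if [ "$source_disk" = "$target_disk" ]; then
        echo "source and target must differ" >&2
        return 1
    fi
    # sgdisk expects the TARGET first, then the SOURCE to copy from
    echo "sgdisk --replicate $target_disk $source_disk"
    echo "sgdisk --randomize-guids $target_disk"
}
```

Running `print_clone_commands /dev/sdf /dev/sdg` prints the same two commands used above, which can then be checked against the serial numbers before being run with sudo.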

This is where my steps deviate from the referenced blog post, but the changes make complete sense. When I created this ZFS RAIDZ array I used the short sdg name for the device. However, as you can see, after a reboot the zpool command shows the /dev/disk/by-id/ name, so that is the name I pass as OLD.

# sudo zpool replace backup OLD NEW

$ sudo zpool replace backup /dev/disk/by-id/wwn-0x50014ee2ae38ab42 /dev/sdg

This worked fine. I actually had a few missteps trying to do this, and zpool gave me very friendly and helpful error messages. More reason to like ZFS as a filesystem.
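
I passed the short /dev/sdg name as NEW, but the stable by-id name could be looked up first and used instead. The sketch below goes the other way from the symlink lookup earlier: given a device node, it lists the by-id names that point at it. The function name disk_ids_for is hypothetical, and the directory is a parameter only so the function can be tried against a scratch directory.

```shell
# Print every entry in a by-id style directory that resolves to the given device.
# Usage: disk_ids_for DEVICE [DIRECTORY]   (DIRECTORY defaults to /dev/disk/by-id)
disk_ids_for() {
    device=$(readlink -f -- "$1")
    id_dir=${2:-/dev/disk/by-id}
    for link in "$id_dir"/*; do
        [ -e "$link" ] || continue          # skip dangling links and an empty directory
        if [ "$(readlink -f -- "$link")" = "$device" ]; then
            basename "$link"
        fi
    done
}
```

On a live system `disk_ids_for /dev/sdg` would typically print both an ata- and a wwn- name for the disk, plus entries for any partitions only if you pass a partition device.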

$ zpool status backup -v
  pool: backup
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Apr 11 09:12:10 2024
        18.6G / 2.15T scanned at 359M/s, 0B / 2.15T issued
        0B resilvered, 0.00% done, no estimated completion time
config:

        NAME                                          STATE     READ WRITE CKSUM
        backup                                        DEGRADED     0     0     0
          raidz1-0                                    DEGRADED     0     0     0
            ata-WDC_WD10EARX-00N0YB0_WD-WMC0T0683946  ONLINE       0     0     0
            wwn-0x50014ee2b0706857                    ONLINE       0     0     0
            wwn-0x50014ee2adfa14f6                    ONLINE       0     0     0
            replacing-3                               DEGRADED     0     0     0
              wwn-0x50014ee2ae38ab42                  OFFLINE      0     0     0
              sdg                                     ONLINE       0     0     0

Cool, we can see that ZFS is repairing things with the newly added drive. Interestingly it is shown as sdg currently.

This machine is pretty loud (it has a lot of old fans), so I was pretty wild and powered it down while ZFS was resilvering. When I rebooted it after relocating it to where it normally lives and the noise won’t bug me, the device naming seemed to have sorted itself out.

$ zpool status backup -v
  pool: backup
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Apr 11 09:12:10 2024
        22.9G / 2.15T scanned at 244M/s, 0B / 2.15T issued
        0B resilvered, 0.00% done, no estimated completion time
config:

        NAME                                          STATE     READ WRITE CKSUM
        backup                                        DEGRADED     0     0     0
          raidz1-0                                    DEGRADED     0     0     0
            ata-WDC_WD10EARX-00N0YB0_WD-WMC0T0683946  ONLINE       0     0     0
            wwn-0x50014ee2b0706857                    ONLINE       0     0     0
            wwn-0x50014ee2adfa14f6                    ONLINE       0     0     0
            replacing-3                               DEGRADED     0     0    20
              wwn-0x50014ee2ae38ab42                  OFFLINE      0     0     0
              wwn-0x5000cca3a8d3fcdb                  ONLINE       0     0     0

The snapraid SMART report now looks a lot better too:

$ sudo snapraid smart
[sudo] password for roo: 
SnapRAID SMART report:

   Temp  Power   Error   FP Size
      C OnDays   Count        TB  Serial                Device    Disk
 -----------------------------------------------------------------------
     26   1416       0   4%  1.0  JPW9K0N21DZ2AE        /dev/sdg  -
     23   4158       0  84%  1.0  WD-WCAZA6813339       /dev/sdd  -
     23   3805       0  84%  1.0  WD-WMC0T0683946       /dev/sde  -
     25   4742       0  97%  1.0  WD-WCAV51778566       /dev/sdf  -

It took about 9 hours to finish the resilvering, but then things were happy.

$ zpool status backup -v
  pool: backup
 state: ONLINE
  scan: resilvered 531G in 09:17:03 with 0 errors on Thu Apr 11 18:29:13 2024
config:

        NAME                                          STATE     READ WRITE CKSUM
        backup                                        ONLINE       0     0     0
          raidz1-0                                    ONLINE       0     0     0
            ata-WDC_WD10EARX-00N0YB0_WD-WMC0T0683946  ONLINE       0     0     0
            wwn-0x50014ee2b0706857                    ONLINE       0     0     0
            wwn-0x50014ee2adfa14f6                    ONLINE       0     0     0
            wwn-0x5000cca3a8d3fcdb                    ONLINE       0     0     0

errors: No known data errors
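
To put the resilver time in perspective, the scan line gives enough to work out an average write rate to the new disk. This is a rough sketch that assumes the "531G" figure is in GiB (binary units); the function names are my own.

```shell
# Convert an HH:MM:SS duration into seconds.
duration_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Average rate in MiB/s for a size given in GiB and a duration given as HH:MM:SS.
average_mib_per_second() {
    seconds=$(duration_to_seconds "$2")
    awk -v gib="$1" -v secs="$seconds" 'BEGIN { printf "%.1f\n", gib * 1024 / secs }'
}
```

`average_mib_per_second 531 09:17:03` works out to 16.3, so the resilver averaged roughly 16 MiB/s, which is slow for sequential disk writes and probably says more about the old drives it was reading from than about the new one.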

Some folks think that you should not use RAIDZ, but create a pool with a collection of vdevs which are mirrors.

About 2 weeks later, I had a second disk go bad on me. Again, no surprise since these are very old devices. Here is a graph of the errors.

The ZFS scrub ran on April 21st, and you can see the spike in errors – but clearly this drive was failing slowly all along as I was using it in this new build. This second failing drive was /dev/sdf, which, if you look back at the snapraid SMART report, was at a 97% failure probability. It is worth noting that while ZFS and the snapraid SMART check both decided these drives were bad, I was still able to put both drives into a USB enclosure and access them. I certainly don’t trust these old drives to store data, but ZFS stopped using each device before it became unusable.

I managed to grab a used 1TB drive for $10. It is quite old (from 2012) but only has about 1.5 years of power-on time. Hopefully it’ll last, but at that price it’s hard to argue. Swapping that drive in was a matter of following the same steps. Having the drive bays labelled with the serial numbers was very helpful.

Since then, I’ve picked up another $10 1TB drive, and this one is from 2017 with only 70 days of power-on time. Given I’ve still got two of the decade-old drives in this RAIDZ, I suspect I’ll be replacing one of them soon. The going used rate for 1TB drives is between $10 and $20 locally, amazing value if you have a redundant layout.

Getting started with ZFS

When ZFS first came out, it was tied to Sun’s Solaris, but it had some very interesting characteristics – at the time its ability to scale massively and protect your data seemed very cool. My interest in filesystems goes back to my C64 days editing floppy disks to create infinite directory listings and the like. Talking about filesystems reminds me of when I was a co-op student at QNX; they had ‘QFS’, and meeting the developer helped demystify filesystems for me.

For some reason ZFS is also linked in my memory with the ‘shouting in the datacenter’ video. As best I can tell this is likely because DTrace and ZFS both came out of Sun around the same time.

I finally decided to fully decommission my old server and the RAID5 array of 1TB drives. I’ve also recently been experimenting with NixOS, and I’ve really enjoyed that so far. I figured why not set up a dedicated backup server? This also presented a good chance to set up and play with ZFS, which now has reliable open source versions available.

First I spent some time learning what I would consider ZFS basics. This video was useful for me. Also, these two blog posts were good starting points.

Since I’m using NixOS as my base operating system, I’ll be following the doc on setting up ZFS on NixOS. While I’m not setting up my boot volume to be ZFS, it turns out you still need to do the same basic setup if you want ZFS capabilities in your NixOS.

You need to generate a unique ‘hostid’ – the doc suggests using

head -c4 /dev/urandom | od -A none -t x4
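
The od output from that command comes with leading whitespace, which is easy to paste into the config by accident. A small variation strips it so you get exactly eight hex digits, which is the form networking.hostId expects. This is just a sketch; the function name generate_host_id is my own.

```shell
# Print 8 lowercase hex digits built from 4 random bytes,
# with the padding that od adds around each byte removed.
generate_host_id() {
    head -c4 /dev/urandom | od -A n -t x1 | tr -d ' \n'
}
```

The value is random each time, so generate it once and keep it in the configuration rather than regenerating it on every rebuild.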

Now we need to modify the /etc/nixos/configuration.nix to include

boot.supportedFilesystems = [ "zfs" ];
boot.zfs.forceImportRoot = false;
networking.hostId = "yourHostId";

Rebuild and reboot, then you can query available zpools

$ zpool status
no pools available

Now we create a pool. In this step we are actually grouping a bunch of devices into a single raidz vdev, which is then wrapped in a pool. Using fdisk I’m able to identify the four 1TB drives, which are all partitioned and ready to roll: sdd, sde, sdf, and sdg.

$ sudo zpool create backup raidz sdd sde sdf sdg

This process took a short while to complete, but after it was done running sudo fdisk -l /dev/sdd gave me this:

Disk /dev/sdd: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: WDC WD10EARX-00N
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 2F1E0C4F-A95F-E948-99AC-18E8829496CD

Device          Start        End    Sectors   Size Type
/dev/sdd1        2048 1953507327 1953505280 931.5G Solaris /usr & Apple ZFS
/dev/sdd9  1953507328 1953523711      16384     8M Solaris reserved 1

It seems new partitions were created (and, I assume, initialized for ZFS), and I now have a zpool:

$ zpool status
  pool: backup
 state: ONLINE
config:

	NAME        STATE     READ WRITE CKSUM
	backup      ONLINE       0     0     0
	  raidz1-0  ONLINE       0     0     0
	    sdd     ONLINE       0     0     0
	    sde     ONLINE       0     0     0
	    sdf     ONLINE       0     0     0
	    sdg     ONLINE       0     0     0

I don’t believe you can reasonably expand or shrink a RAIDZ vdev, which means you need to plan ahead for your storage needs. It is also important to remember that the guidance is to keep ZFS pools below about 80% usage; beyond this level performance starts to suffer. Storage is cheap, and a pool can hold multiple vdevs, so while a single RAIDZ vdev has limitations I think ZFS offers some interesting flexibility.
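
The 80% guidance is easy to forget on a backup box, so a small check can be handy. The sketch below reads lines of the form name, tab, capacity, which is what `zpool list -H -o name,capacity` prints, and reports pools at or above a threshold. The function name pools_over_threshold and the default of 80 are my own choices.

```shell
# Read "name<TAB>capacity" lines and print the pools at or above a threshold.
# Usage: zpool list -H -o name,capacity | pools_over_threshold [PERCENT]
pools_over_threshold() {
    awk -F '\t' -v limit="${1:-80}" '
        { used = $2; sub(/%/, "", used) }        # drop the trailing % sign if present
        used + 0 >= limit + 0 { print $1, $2 }   # compare numerically
    '
}
```

No output means every pool is under the limit, which makes it simple to wire into a cron job or timer that only sends mail when something is printed.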

Unexpectedly, it seems that the newly created ZFS is also mounted and ready to roll

$ zfs list
NAME     USED  AVAIL  REFER  MOUNTPOINT
backup   523K  2.55T   140K  /backup

That’s not where I want to mount the volume, so let’s go figure out how to move it.

# First let us view the mountpoint

$ zfs get mountpoint backup
NAME    PROPERTY    VALUE       SOURCE
backup  mountpoint  /backup     default

# Now we can modify that value

$ sudo zfs set mountpoint=/data/raidz backup

# And check to see it changed

$ zfs get mountpoint backup
NAME    PROPERTY    VALUE        SOURCE
backup  mountpoint  /data/raidz  local

Cool. I’ve got a ZFS filesystem. One snag: it isn’t mounted automatically after a reboot. I can manually import the pool, which mounts it:

$ sudo zpool import -a

And digging into the NixOS doc, we find the configuration we need to add

boot.zfs.extraPools = [ "backup" ];

This fixed me up, and ZFS is auto mounted on reboots.

One last configuration tweak, let’s enable scrubbing of the ZFS pool in our NixOS configuration

services.zfs.autoScrub.enable = true;

Setting up ZFS on NixOS is very easy. Why would you want ZFS over another filesystem or storage management system? I’ve been using snapraid.it for a while on my main server, and I like the data integrity that it brings beyond just a RAID5 setup. The snapraid site has an interesting comparison matrix. I will say that setting up ZFS RAIDZ was a lot less scary than any of my adventures using mdadm to set up a software RAID5.

What do I see as the key strengths of ZFS?

Data integrity verification and automatic repair – all data is checksummed, and with RAIDZ redundancy we can recover from underlying data corruption.
Pooled Storage – something I need to explore more, but I think this will give me flexibility over adding more storage over time if needed.
Copy-on-write – this is about consistency of the filesystem, especially over power failure events.

Remember I started out with some old hardware I was repurposing? Those 1TB drives were all surprisingly in ‘good’ shape, but have between 10 and 13 years of power-on time (some of them have manufacture dates of 2009). In my next blog post we’ll cover how ZFS handles failures as we see these ancient drives start to fail.

Comparing images to detect duplicates

I’ve been using Photoprism to manage my large and growing photo library. We had simply outgrown using a single machine to manage the library, and Apple had burned us a couple of times by changing their native photo management system. I’m also not the type to trust someone else to keep and secure my photos, so I’m going to host it myself.

I have backups of those photo libraries which I’m working from, and unfortunately those backups seem to contain duplicated photos. No problem, right? Photoprism has the ability to detect duplicates and reject them. Sweet. However, it does rely on the photos being exactly the same binary.

My problems start when I have a bunch of smaller photos, which look ok – but are clearly not the original. In this particular case the original is 2000×2000, and the alternate version is 256×256 (see top of post for an example of two images). Great – just delete the small one, but with 1000’s of photos how do I know that one is a duplicate of another but resized?

There are other flags here too, the smaller resized version is missing a proper EXIF date stamp. So sure, I can just sort out the photos based on ones with valid EXIF data and then I have a bunch of others which don’t have data. But, what if one of those photos isn’t a resized version? Maybe it’s a photo of something that I only have a small version of?

Again, with 1000’s of photos to review, I’m not going to be able to reasonably figure out which ones are keepers or not. Good thing that doing dumb stuff is what computers are good at. However, looking at two images and determining if they are the same thing is not as easy as you might think.

The folks at ImageMagick have some good ideas on comparing for differences. They even tackle the same issue of identifying duplicates, but still end up relying on you creating your own solution based on some advice.

Since I had this problem, I did cook up some scripting and an approach which I’ll share here. It’s messy, and I still rely on a human to decide – but for the most part I get a computer to do some brute force work to make the problem human sized.
