Over the years I’ve had the expected number of hard drive failures. Some were more catastrophic than they needed to be because I didn’t have a good backup strategy in place; others felt avoidable if I’d paid attention to the warning signs.
My current setup for data duplication is based on Snapraid, a non-traditional RAID solution. It allows mixed drive sizes, and replication happens by regularly running the sync operation. Mine runs daily: files are synced across the drives, and a data validation pass is done from time to time as well. This means that while I might lose up to 24 hours of data if the primary drive fails, I put less wear on the parity drive and get assurance that file corruption hasn’t crept in.
Snapraid is a poor fit when you have many small files or frequently changing files; it is ideal for backing up media like photos or movies. To deal with the more rapidly changing data I’ve got an SSD for storage. I haven’t yet had an SSD fail on me, but that is sure to happen at some point, and Backblaze is already seeing some concerning failure rate numbers. Couple this with the fact that my storage SSD started throwing errors the other day and only a full power cycle of the machine brought it back – it’s fine now, but for how long? Time to set up a mirror.
For this storage I’m going back to traditional RAID. The SSD is a 480GB drive, and thankfully the price of them has dropped to easily under $70. This additional drive now fills all 6 of the SATA ports on my motherboard, so the next upgrade will need to be a SATA port expansion card. I’ve written about RAID a few times here.
I’ve moved away from specifying drives as /dev/sdbX because these values can change. Even this new SSD caused the drive that was at /dev/sdf to move to /dev/sdg, allowing the new drive to use /dev/sdf. My /etc/fstab is now set up using /dev/disk/by-id/xxx because these names are persistent. Most of the disk utilities understand this format just fine, as you can see with this example with fdisk.
$ sudo fdisk -l /dev/sdf
Disk /dev/sdf: 447.1 GiB, 480103981056 bytes, 937703088 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

$ sudo fdisk -l /dev/disk/by-id/ata-KINGSTON_SA400S37480G_50026841D62B77E8
Disk /dev/disk/by-id/ata-KINGSTON_SA400S37480G_50026841D62B77E8: 447.1 GiB, 480103981056 bytes, 937703088 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Granted, working with /dev/disk/by-id is a lot more verbose – but that id will not change if you re-organize the SATA cables.
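If you’re ever unsure which by-id name maps to which short /dev/sdX device, the by-id entries are just symlinks back to those short names, so a quick listing shows the mapping. Something like this (output abbreviated):

$ ls -l /dev/disk/by-id/ | grep KINGSTON
lrwxrwxrwx 1 root root 9 ... ata-KINGSTON_SA400S37480G_50026841D62B77E8 -> ../../sdf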
Let’s get going on setting up the new drive as a mirror for the existing one. Here’s the basic set of steps:
- Partition the new drive so it is identical to the existing one
- Create a RAID1 array in degraded state
- Format and mount the array
- Copy the data from the existing drive to the new array
- Un-mount both the array and the original drive
- Mount the array where the original drive was mounted
- Make sure things are good – the next step is destructive
- Add the original drive to the degraded RAID1 array making it whole
It may seem like a lot of steps, and some of them are scary – but on the other side we’ll have a software RAID protecting the data. The remainder of this post will walk through the details of those steps.
Step 1 – Partitioning the new drive
Any time you’re about to partition (or re-partition) a drive, it is important to be careful. We could very easily target the wrong one and then we’d have a big problem. Above you can see that I’ve done a sudo fdisk -l /dev/disk/by-id/ata-KINGSTON_SA400S37480G_50026841D62B77E8 and have been able to see that it is not yet partitioned. This is a good way to confirm we have the right device. I also want to look at the drive we want to mirror and figure out how it is partitioned.
$ sudo fdisk -l /dev/disk/by-id/ata-ADATA_SU650_2K1220083359
Disk /dev/disk/by-id/ata-ADATA_SU650_2K1220083359: 447.1 GiB, 480103981056 bytes, 937703088 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 0BAA6941-2577-4204-992E-CE9310B75D0C

Device                                               Start       End   Sectors   Size Type
/dev/disk/by-id/ata-ADATA_SU650_2K1220083359-part1   2048  937701375 937699328 447.1G Linux filesystem
We want the new drive to look like that once we are done.
$ sudo fdisk /dev/disk/by-id/ata-KINGSTON_SA400S37480G_50026841D62B77E8

Welcome to fdisk (util-linux 2.31.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Device does not contain a recognized partition table.
Created a new DOS disklabel with disk identifier 0x5167495b.

Command (m for help): g
Created a new GPT disklabel (GUID: 6C260AF2-796D-5E49-8CB0-6EA5C3995D00).

Command (m for help): n
Partition number (1-128, default 1):
First sector (2048-937703054, default 2048):
Last sector, +sectors or +size{K,M,G,T,P} (2048-937703054, default 937703054): 937701375

Created a new partition 1 of type 'Linux filesystem' and of size 447.1 GiB.

Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
Note that I selected a non-default last sector to match the existing (smaller) drive. If the situation had been reversed, I’d be re-partitioning the existing drive to match the smaller new one. Net, to keep things sane with the mirror we want the same layout for both. It’s not a bad idea now to compare the partition layouts for the two drives to make sure we got this right.
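One quick way to do that comparison is to dump both partition tables with sfdisk and eyeball (or diff) the output; sfdisk -d only prints the table and doesn’t change anything. A sketch using the two drives from above:

$ sudo sfdisk -d /dev/disk/by-id/ata-ADATA_SU650_2K1220083359
$ sudo sfdisk -d /dev/disk/by-id/ata-KINGSTON_SA400S37480G_50026841D62B77E8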
Step 2 – Create a RAID1 array
We are going to create a RAID1 array, but with a missing drive – thus it will be in a degraded state. For this we need to look in /dev/disk/by-id and select the partition name we just created in step 1. This will be -part1 at the end of the device name we used earlier. I can use /dev/md0 because this is the first RAID array on this system.
sudo mdadm --create --verbose /dev/md0 --level=mirror --raid-devices=2 /dev/disk/by-id/ata-KINGSTON_SA400S37480G_50026841D62B77E8-part1 missing
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: size set to 468717568K
mdadm: automatically enabling write-intent bitmap on large array
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
We can look at our array via /proc/mdstat — we should also expect to get an email from the system informing us that there is a degraded RAID array.
$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdf1[0]
      468717568 blocks super 1.2 [2/1] [U_]
      bitmap: 0/4 pages [0KB], 65536KB chunk
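If you want a more detailed view than /proc/mdstat, mdadm can report on the array directly; at this point it should describe the array as clean but degraded, with one slot listed as removed.

$ sudo mdadm --detail /dev/md0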
Step 3 – Format and mount
We can now treat the new /dev/md0 as a drive partition. This is standard Linux formatting and mounting. I’ll be using ext4 as the filesystem.
$ sudo mkfs -t ext4 /dev/md0
mke2fs 1.44.1 (24-Mar-2018)
Discarding device blocks: done
Creating filesystem with 117179392 4k blocks and 29302784 inodes
Filesystem UUID: 45ac1aed-396e-4bb8-82db-abb876d3bf87
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
	102400000

Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done
And now we mount it at /mnt as a temporary mount point.
$ sudo mount /dev/md0 /mnt
$ ls /mnt
lost+found
Step 4 – Copy the data
For this step, I’m going to use rsync and probably run it multiple times because right now I have a live workload changing some of those files. I’ll have to shut down all processes that are updating the original volume before doing the final rsync.
$ sudo rsync -avxHAX --progress /mounted/original/. /mnt/. |
This will run for some time, depending on how much you’ve got on the original disk. Once it is done, shut down anything that might be changing the original drive and run the same rsync command again. Then you can move on to the next step.
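If files may have been deleted from the original drive between passes, one option worth knowing about is a final pass with rsync’s --delete flag, which removes anything from the destination that no longer exists on the source:

$ sudo rsync -avxHAX --delete --progress /mounted/original/. /mnt/.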
Step 5 – Un-mount both the array and the original drive
Un-mounting /mnt was easy, because this was a new mount and my repeated rsync runs were the only thing targeting it.
In the previous step I’d already stopped the docker containers that were using the volume as storage, so I thought it’d be similarly trivial to unmount. I was wrong.
$ sudo umount /mounted/original
umount: /mounted/original: target is busy.
To track down what was preventing the unmount required digging through the verbose output of sudo lsof, which will show all open files. It turned out that I had forgotten about the logging agent I have running, which reads some log files that live on this storage. Once I’d stopped that process as well, I was good to go.
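Rather than wading through everything lsof reports, you can also point lsof (or fuser) at the busy mount point directly. A sketch using the mount point from this post:

$ sudo lsof +f -- /mounted/original
$ sudo fuser -vm /mounted/original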
Step 6 – Mount the array where the original drive was mounted
This should be as easy as modifying /etc/fstab to point to /dev/md0 where we used to point to the physical disk by-id.
# Mirrored 480GB SSDs for storage
/dev/md0	/mounted/original	ext4	defaults	0	2
Once /etc/fstab is fixed – we can just mount /mounted/original and restart everything we shut down.
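As a quick sanity check, mount it via the new fstab entry and confirm that it is indeed /dev/md0 that ends up mounted there:

$ sudo mount /mounted/original
$ df -h /mounted/original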
Step 7 – Make sure things are good
At this point we have a degraded RAID1 array and a full (but aging) copy of the data on a second physical drive. This isn’t a bad place to be, but we should make sure that everything is working as expected. Check that your workloads aren’t generating unusual logs, and do whatever other verification you can think of to confirm that the new copy of the data is good to go.
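A couple of quick checks that fit here (a sketch, not an exhaustive list): confirm the array itself still looks healthy, and scan for recent kernel warnings or errors.

$ cat /proc/mdstat
$ sudo dmesg --level=err,warn | tail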
Step 8 – Complete the RAID1 array
We are now going to add the original drive to the RAID1 array, changing it from degraded to whole. This is a little scary because we are about to destroy the original copy of the data, but the trade-off is that we’ll end up with a resilient mirrored drive backing the newly mounted /dev/md0 filesystem.
$ sudo mdadm /dev/md0 --add /dev/disk/by-id/ata-ADATA_SU650_2K1220083359-part1 |
Again, you will notice that I’m using the by-id specification of the original drive partition, which ends in -part1 as there is only one partition.
Once we’ve done this, we can monitor the progress of the two drives being mirrored (aka recovering to RAID1 status):
$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdg1[2] sdf1[0]
      468717568 blocks super 1.2 [2/1] [U_]
      [>....................]  recovery =  0.3% (1754496/468717568) finish=35.4min speed=219312K/sec
      bitmap: 4/4 pages [16KB], 65536KB chunk

unused devices: <none>
Once things have completed and the system is stable, a reboot isn’t a bad idea to ensure that everything will start up fine. This is generally a good thing to do whenever you are making changes to the system.
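One optional extra before that reboot (a general recommendation rather than part of the original steps): record the array in mdadm’s configuration so it is assembled with the same /dev/md0 name at boot. On Debian/Ubuntu style systems that looks roughly like this:

$ sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
$ sudo update-initramfs -u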
While this isn’t perfect protection from any sort of data loss, it should allow us to gracefully recover when one of the SSD drives stops working. Having a backup plan that you test regularly is a very good thing to add as another layer of data protection.
Nice write-up!
I’ve been using e2label and mounting by label, rather than mounting by drive ID. It’s a little less verbose, and lets me use automated scripts with multiple copies of a backup disk that I iterate through (mount --label weeklybackup /mnt).
Yikes – I got an email that indicated “A Fail event had been detected on md device /dev/md0.”
Inspecting the system, the failed drive simply did not appear in the /dev/sdX list anymore. A reboot did not recover the drive; however, a full power off / power on did.
Once I could see the device, it was still marked as ‘removed’ from the array. Simply re-adding it resulted in mdadm reporting that the volume was re-added – avoiding a long re-sync.
$ sudo mdadm /dev/md0 --add /dev/disk/by-id/ata-ADATA_SU650_2K1220024459-part1
mdadm: re-added /dev/disk/by-id/ata-ADATA_SU650_2K1220024459-part1
The array looks happy – but you can launch an array check manually
$ sudo /usr/share/mdadm/checkarray /dev/md0
And then monitor progress
$ cat /proc/mdstat
Still, it might be time to retire the drive – or get a spare.
Hmm. Again the same drive appears to have failed and I get emails indicating the drive array is degraded. Power off / power on of the server and the drive comes back – but the RAID array has removed the device.
Re-adding it as per above and it re-sync’d just fine.. scary. FWIW this is an ADATA drive – I probably will shy away from this brand for important storage needs in the future.
Boo.. 3rd time this has happened. Same ADATA drive getting stuck. It seems to need a hard power off / on to come back. Then I need to re-add the drive to the array
$ sudo mdadm /dev/md0 --re-add /dev/disk/by-id/ata-ADATA_SU650_2K1220024459-part1
I’m not sure if --add or --re-add is the right way.. I’ll have to read up on this. Note: two dashes precede each option.
Then kick the array to check itself and we’re good to go.
$ sudo /usr/share/mdadm/checkarray /dev/md0
Oh oh – that pesky ADATA drive failed again.
This time just a full power off / power on seems to have restored the array to a healthy state. I kicked off a check to make sure all is well.
$ sudo /usr/share/mdadm/checkarray /dev/md0
Again the ADATA drive took a holiday.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sde1[0] sdf1[2](F)
468717568 blocks super 1.2 [2/1] [U_]
[===================>.] check = 99.9% (468717568/468717568) finish=0.0min speed=15548K/sec
bitmap: 3/4 pages [12KB], 65536KB chunk
Again, power off and power on.. and we’re back.. ran a checkarray just in case..
Oops.. not so fast – I just got another degraded array error. Boo.
Hmm, but the checkarray is still working away.. maybe it’ll just take a while for the mirror to repair itself?
md0 : active raid1 sdf1[2] sde1[0]
468717568 blocks super 1.2 [2/2] [UU]
[>....................]  check =  4.4% (20911552/468717568) finish=91.6min speed=81399K/sec
bitmap: 2/4 pages [8KB], 65536KB chunk
yup.. seems the check completed just fine – and my array is all good to go.
Again – one of the drives failed and I got an email “Fail event on /dev/md0 … ”
My solution this time was to just ‘sudo poweroff’ then go hit the button to start things up once it had shut down. This is probably the simplest/quickest way to recover.. I may just need to replace that drive, SSDs are certainly cheap enough now.
After the boot – I got an email “DegradedArray event on /dev/md0..” with some details about the recovery
—
A DegradedArray event had been detected on md device /dev/md0.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdf1[2] sde1[0]
468717568 blocks super 1.2 [2/1] [U_]
[==>………………] recovery = 14.3% (67200192/468717568) finish=30.4min speed=219440K/sec
bitmap: 4/4 pages [16KB], 65536KB chunk
unused devices: <none>