{"id":2184,"date":"2023-08-10T13:33:15","date_gmt":"2023-08-10T17:33:15","guid":{"rendered":"https:\/\/lowtek.ca\/roo\/?p=2184"},"modified":"2023-08-10T13:33:15","modified_gmt":"2023-08-10T17:33:15","slug":"when-mirrors-break-raid1-failure-and-recovery","status":"publish","type":"post","link":"https:\/\/lowtek.ca\/roo\/2023\/when-mirrors-break-raid1-failure-and-recovery\/","title":{"rendered":"When Mirrors Break: RAID1 failure and recovery"},"content":{"rendered":"<p><a href=\"https:\/\/lowtek.ca\/roo\/wp-content\/uploads\/2023\/08\/wd500ssd.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-medium wp-image-2185\" src=\"https:\/\/lowtek.ca\/roo\/wp-content\/uploads\/2023\/08\/wd500ssd-500x282.jpg\" alt=\"\" width=\"500\" height=\"282\" srcset=\"https:\/\/lowtek.ca\/roo\/wp-content\/uploads\/2023\/08\/wd500ssd-500x282.jpg 500w, https:\/\/lowtek.ca\/roo\/wp-content\/uploads\/2023\/08\/wd500ssd-768x432.jpg 768w, https:\/\/lowtek.ca\/roo\/wp-content\/uploads\/2023\/08\/wd500ssd.jpg 1000w\" sizes=\"auto, (max-width: 500px) 85vw, 500px\" \/><\/a>A couple of years ago I <a href=\"https:\/\/lowtek.ca\/roo\/2021\/ubuntu-adding-a-2nd-data-drive-as-a-mirror-raid1\/\">added a second drive<\/a> to my server in a RAID1 (mirror) configuration. Originally I was using the single drive for logs, but with a more durable mirror setup I moved more (important) data to it.<\/p>\n<p>RAID is not a backup story; if you really care about the data, you want to back it up. There are two hard lessons I learned with this recent failure (and my recovery). Two bits of data that are valuable to me live on this mirrored volume: <a href=\"https:\/\/lowtek.ca\/roo\/2021\/installing-docker-mailserver\/\">email<\/a> and <a href=\"https:\/\/www.photoprism.app\/\">photoprism<\/a> storage (but not the photos themselves). 
Stupidly, I did not have regular backups of either of these; please learn from my mistake.<\/p>\n<p>The two lessons I hope to learn from this are:<\/p>\n<ol>\n<li>Back up your data; even a bad backup is better than nothing<\/li>\n<li>Do not ignore any signs of problems; replace any suspicious hardware ASAP<\/li>\n<\/ol>\n<p>If you read the <a href=\"https:\/\/lowtek.ca\/roo\/2021\/ubuntu-adding-a-2nd-data-drive-as-a-mirror-raid1\/#comments\">comments on my previous post<\/a>, you will see a history of minor failures that I clearly willfully ignored. I mean, hey &#8211; it&#8217;s a mirrored setup and mostly I had 2 drives working fine.. right? Stupid me.<\/p>\n<p>The replacement <a href=\"https:\/\/www.canadacomputers.com\/product_info.php?cPath=179_4230&amp;item_id=221048\">500GB SSD<\/a> drive cost me $56.49 taxes in, and it even has a 5 year manufacturer warranty, compared to the 3 year warranty on the failed ADATA drive. Sadly, <a href=\"https:\/\/www.adata.com\/en\/support\/warranty\/\">checking the ADATA warranty<\/a> shows me it made it just past the 3 year mark (not that a &#8216;free&#8217; replacement drive would fix my problem).<\/p>\n<p><a href=\"https:\/\/lowtek.ca\/roo\/wp-content\/uploads\/2023\/08\/adata-warranty.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-medium wp-image-2186\" src=\"https:\/\/lowtek.ca\/roo\/wp-content\/uploads\/2023\/08\/adata-warranty-500x95.jpg\" alt=\"\" width=\"500\" height=\"95\" srcset=\"https:\/\/lowtek.ca\/roo\/wp-content\/uploads\/2023\/08\/adata-warranty-500x95.jpg 500w, https:\/\/lowtek.ca\/roo\/wp-content\/uploads\/2023\/08\/adata-warranty.jpg 588w\" sizes=\"auto, (max-width: 500px) 85vw, 500px\" \/><\/a><\/p>\n<p>While ADATA has been mostly reliable for me in the past, I&#8217;ll pick other brands for my important data. 
The ADATA products are often very cheap, which is attractive, but at the current cost of SSDs it&#8217;s easy to pay for the premium brands.<\/p>\n<p>Here is a brief replay of how the disaster rolled out. The previous day I had noticed that something was not quite right with email, but restarting things seemed to resolve the issue. The next morning email wasn&#8217;t flowing, so there was something wrong.<\/p>\n<p>Looking at the logs, I was seeing a lot of messages &#8220;structure needs cleaning&#8221; &#8211; which is an indicator that there is some sort of ext4 filesystem problem and it needs to run a check to clean things up. It also appeared that the ADATA half of the mirror had failed in some way. Rebooting the system seemed like a good idea and everything seemed to come back.<\/p>\n<p>Checking the logs for the mail system showed all was well, but then I checked email on my phone, and there were no messages? Stupidly I then opened up my mail client on my laptop, which then proceeded to synchronize with the mail server and delete all of the email stored on my laptop to mirror the empty mailbox on the server.<\/p>\n<p>What was wrong? It took a while, but I figured out that my RAID1 array had completely failed to initialize: both volumes were marked as &#8216;spare&#8217;.<\/p>\n<pre class=\"lang:default decode:true \">$ cat \/proc\/mdstat \r\nPersonalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] \r\nmd0 : inactive sdf1[2](S) sde1[0](S)\r\n      937435136 blocks super 1.2\r\n       \r\nunused devices: &lt;none&gt;<\/pre>\n<p>Ugh, well that explains what happened. When the system rebooted, the mount failed &#8211; and my mail server just created new data directories on the mount point (which are on my root volume).<\/p>\n<p>At this point I realize I&#8217;m in a bad place, having potentially flushed decades of email. Have I mentioned that running your own email is a bad idea?<\/p>\n<p>Time to start capturing things for recovery. 
I made a copy of the two drives using dd:<\/p>\n<pre class=\"lang:default decode:true \">$ sudo dd if=\/dev\/sde1 of=\/other\/volume\/sde1-dd.img\r\n$ sudo dd if=\/dev\/sdf1 of=\/other\/volume\/sdf1-dd.img<\/pre>\n<p>In the process of doing this, it became obvious that <code>sdf<\/code> (the ADATA drive) had hard read errors, whereas I was able to complete the image creation of <code>sde<\/code> (a Kingston drive).<\/p>\n<p>Once I had some time to think about the situation, I was able to re-add the good drive to the array to make it become active. This let me mount the volume and make <a href=\"https:\/\/docker-mailserver.github.io\/docker-mailserver\/edge\/faq\/#bind-mounts-default\">a copy of the email for backup purposes<\/a>. Once this was done I unmounted and ran a <code>fsck -y \/dev\/md0<\/code> to fix all of the filesystem errors.<\/p>\n<p>I then stopped the currently running mail server, renamed the mount point directory to keep the email that had come into the system while I was doing repairs, and created a new (empty) mount point. Then a reboot.<\/p>\n<p>Sigh of relief as all of my mail appeared back. Sure, I&#8217;m running with a degraded RAID1 array and the <code>fsck<\/code> clearly removed some corrupted files but at least the bulk of my data is back.<\/p>\n<p>Fixing the broken mirror was relatively straightforward. I bought a new drive. Then I captured the output of <code>ls \/dev\/disk\/by-id\/<\/code> before powering down the system and physically swapping the bad drive for the new drive. 
I could then repeat the <code>ls \/dev\/disk\/by-id\/<\/code> and look at the diffs; this allowed me to see the new drive appear, and inspect which drive letter it mapped to.<\/p>\n<pre class=\"lang:default decode:true\">ls -l \/dev\/disk\/by-id\/ata-WD_Blue_SA510_2.5_500GB_224753806202 \r\nlrwxrwxrwx 1 root root 9 Aug  9 19:12 \/dev\/disk\/by-id\/ata-WD_Blue_SA510_2.5_500GB_224753806202 -&gt; ..\/..\/sdf<\/pre>\n<p>Nice, it appears to have slotted in just where the previous ADATA drive was, not important but comforting. I then dumped the fdisk information of the healthy Kingston drive.<\/p>\n<pre class=\"lang:default decode:true \">$ sudo fdisk -l \/dev\/disk\/by-id\/ata-KINGSTON_SA400S37480G_50026841D62B77E8\r\nDisk \/dev\/disk\/by-id\/ata-KINGSTON_SA400S37480G_50026841D62B77E8: 447.13 GiB, 480103981056 bytes, 937703088 sectors\r\nDisk model: KINGSTON SA400S3\r\nUnits: sectors of 1 * 512 = 512 bytes\r\nSector size (logical\/physical): 512 bytes \/ 512 bytes\r\nI\/O size (minimum\/optimal): 512 bytes \/ 512 bytes\r\nDisklabel type: gpt\r\nDisk identifier: 6C260AF2-796D-5E49-8CB0-6E95DA5C3900\r\n\r\nDevice                                                           Start       End   Sectors   Size Type\r\n\/dev\/disk\/by-id\/ata-KINGSTON_SA400S37480G_50026841D62B77E8-part1  2048 937701375 937699328 447.1G Linux filesystem<\/pre>\n<p>We want our new drive to be partitioned the same way; luckily, the new SSD is even bigger. 
Mostly this is accepting defaults with the exception of typing in the last sector to match the Kingston drive.<\/p>\n<pre class=\"lang:default decode:true \">$ sudo fdisk \/dev\/disk\/by-id\/ata-WD_Blue_SA510_2.5_500GB_224753806202\r\n\r\nWelcome to fdisk (util-linux 2.34).\r\nChanges will remain in memory only, until you decide to write them.\r\nBe careful before using the write command.\r\n\r\nDevice does not contain a recognized partition table.\r\nCreated a new DOS disklabel with disk identifier 0xad299882.\r\n\r\nCommand (m for help): p\r\nDisk \/dev\/disk\/by-id\/ata-WD_Blue_SA510_2.5_500GB_224753806202: 465.78 GiB, 500107862016 bytes, 976773168 sectors\r\nDisk model: WD Blue SA510 2.\r\nUnits: sectors of 1 * 512 = 512 bytes\r\nSector size (logical\/physical): 512 bytes \/ 512 bytes\r\nI\/O size (minimum\/optimal): 512 bytes \/ 512 bytes\r\nDisklabel type: dos\r\nDisk identifier: 0xad299882\r\n\r\nCommand (m for help): g\r\nCreated a new GPT disklabel (GUID: 300BCC0D-C0F3-A640-B717-DFBB3311378F).\r\n\r\nCommand (m for help): n\r\nPartition number (1-128, default 1): \r\nFirst sector (2048-976773134, default 2048): \r\nLast sector, +\/-sectors or +\/-size{K,M,G,T,P} (2048-976773134, default 976773134): 937701375\r\n\r\nCreated a new partition 1 of type 'Linux filesystem' and of size 447.1 GiB.\r\n\r\nCommand (m for help): w\r\nThe partition table has been altered.\r\nCalling ioctl() to re-read partition table.\r\nSyncing disks.<\/pre>\n<p>This is similar to the original <a href=\"https:\/\/lowtek.ca\/roo\/2021\/ubuntu-adding-a-2nd-data-drive-as-a-mirror-raid1\">creation of the RAID1 post<\/a>, but we can now skip to step 8 and add the new volume.<\/p>\n<pre class=\"lang:default decode:true \">sudo mdadm \/dev\/md0 --add \/dev\/disk\/by-id\/ata-WD_Blue_SA510_2.5_500GB_224753806202-part1<\/pre>\n<p>And that&#8217;s it, now we just wait for the mirror to re-sync. 
It is interesting to note that while I can talk about the device &#8216;by-id&#8217;, mdstat uses the legacy drive letters.<\/p>\n<pre class=\"lang:default decode:true \">$ cat \/proc\/mdstat \r\nPersonalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] \r\nmd0 : active raid1 sdf1[2] sde1[0]\r\n      468717568 blocks super 1.2 [2\/1] [U_]\r\n      [&gt;....................]  recovery =  0.1% (862656\/468717568) finish=36.1min speed=215664K\/sec\r\n      bitmap: 4\/4 pages [16KB], 65536KB chunk\r\n\r\nunused devices: &lt;none&gt;\r\n<\/pre>\n<p>And a short while later, it&#8217;s nearly done.<\/p>\n<pre class=\"lang:default decode:true \">$ cat \/proc\/mdstat \r\nPersonalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] \r\nmd0 : active raid1 sdf1[2] sde1[0]\r\n      468717568 blocks super 1.2 [2\/1] [U_]\r\n      [===================&gt;.]  recovery = 97.5% (457392384\/468717568) finish=3.7min speed=50854K\/sec\r\n      bitmap: 4\/4 pages [16KB], 65536KB chunk\r\n\r\nunused devices: &lt;none&gt;\r\n<\/pre>\n<p>At this point my email appears to be working correctly.\u00a0 The ext4 filesystem corruption I blame on the failing ADATA drive in the mirror, but this is a guess. The corruption caused a few emails to be &#8216;lost&#8217;, but had a bigger impact on the photoprism data which in part was the mariadb storage. I also noticed that both my prometheus data and mimir data were corrupted, neither of these are critical though.<\/p>\n<p>Backups are good, they don&#8217;t have to be perfect &#8211; future you will be thankful.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A couple of years ago I added a second drive to my server in a RAID1 (mirror) configuration. Originally I was using the single drive for logs, but with a more durable mirror setup I moved more (important) data to it. 
RAID is not a backup story, if you really care about the data you &hellip; <a href=\"https:\/\/lowtek.ca\/roo\/2023\/when-mirrors-break-raid1-failure-and-recovery\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;When Mirrors Break: RAID1 failure and recovery&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6,12],"tags":[],"class_list":["post-2184","post","type-post","status-publish","format-standard","hentry","category-computing","category-how-to"],"_links":{"self":[{"href":"https:\/\/lowtek.ca\/roo\/wp-json\/wp\/v2\/posts\/2184","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lowtek.ca\/roo\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lowtek.ca\/roo\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lowtek.ca\/roo\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/lowtek.ca\/roo\/wp-json\/wp\/v2\/comments?post=2184"}],"version-history":[{"count":2,"href":"https:\/\/lowtek.ca\/roo\/wp-json\/wp\/v2\/posts\/2184\/revisions"}],"predecessor-version":[{"id":2188,"href":"https:\/\/lowtek.ca\/roo\/wp-json\/wp\/v2\/posts\/2184\/revisions\/2188"}],"wp:attachment":[{"href":"https:\/\/lowtek.ca\/roo\/wp-json\/wp\/v2\/media?parent=2184"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lowtek.ca\/roo\/wp-json\/wp\/v2\/categories?post=2184"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lowtek.ca\/roo\/wp-json\/wp\/v2\/tags?post=2184"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}