== Verify the SMARTCTL Errors ==

Run the following command, where /dev/sdX is the suspected failed disk:

{{{
smartctl -a /dev/sdX | less
}}}

Example output where /dev/sda has failed:

{{{
0 ottorene:~# smartctl -a /dev/sda | less
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   104   095   006    Pre-fail  Always       -       163634138
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       11
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   040   016   030    Pre-fail  Always   In_the_past 110977913882577
  9 Power_On_Hours          0x0032   064   064   000    Old_age   Always       -       32403
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       11
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   076   076   000    Old_age   Always       -       24
188 Command_Timeout         0x0032   100   090   000    Old_age   Always       -       87
189 High_Fly_Writes         0x003a   089   089   000    Old_age   Always       -       11
}}}

Note the WHEN_FAILED column: any value other than "-" (here, In_the_past on Seek_Error_Rate) indicates a disk failure.

== Output the Hardware to a TXT File and Save a Local Copy ==

{{{
0 ottorene:~# lshw > ~/ottorene.lshw.txt
}}}

Copy the resulting file to your local machine before the reboot.

== Check the Encrypted Disk Passphrase ==

To verify that we have the correct encrypted disk passphrase for the reboot, find the passphrase in keyringer and check that it works with the following command:

{{{
mf-audit-crypt-disks
}}}

It first asks for the existing passphrase to unlock and remove the key slot, then replaces the removed key slot with a new one. Reusing the same passphrase for the replacement key slot is fine. The output should look something like the following:

{{{
0 ottorene:~# mf-audit-crypt-disks
md1_crypt /dev/md1 none luks
crypt-audit: Testing key slot availability for '/dev/md1'
crypt-audit: Testing passphrase for '/dev/md1'
Enter any passphrase:
Key slot 0 unlocked.
Enter new passphrase for key slot:
Verify passphrase:
Command successful.
crypt-audit: Passphrase correct, removing slot
Enter any remaining LUKS passphrase:
Key slot 0 unlocked.
Key slot 0 verified.
Command successful.
crypt-audit: Success
}}}

== Install GRUB on the Disk that Stays in the Machine ==

Since we're removing one of the disks, we want to make sure the remaining disk has GRUB installed when we reboot the machine after physically replacing the failed disk with the new one. In the following command, /dev/sdY is the disk that stays in the machine:

{{{
grub-install /dev/sdY
}}}

Example output using /dev/sdb, since /dev/sda failed in the example above:

{{{
0 ottorene:~# grub-install /dev/sdb
Installation finished. No error reported.
}}}

== Fail and Remove Partitions in the RAID that are to be Replaced ==

Since the disks are in a RAID, we need to find which arrays hold the failed disk's partitions, then fail and remove those partitions. To show the RAID status, run:

{{{
cat /proc/mdstat
}}}

In this failed-disk example, /dev/sda1 is in /dev/md0 and /dev/sda2 is in /dev/md1:

{{{
0 ottorene:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      976510912 blocks [2/2] [UU]
      [==============>......]
      check = 70.6% (689525376/976510912) finish=764.3min speed=6256K/sec

md0 : active raid1 sda1[0] sdb1[1]
      248896 blocks [2/2] [UU]

unused devices: <none>
}}}

The command to fail a partition in the RAID, where /dev/mdZ is the RAID array and /dev/sdX# is the failed disk's partition, is:

{{{
mdadm /dev/mdZ --fail /dev/sdX#
}}}

For example (note the (F) next to the /dev/sda1 partition in the cat /proc/mdstat output):

{{{
0 ottorene:~# mdadm /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0
0 ottorene:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      976510912 blocks [2/2] [UU]
      [==============>......]
      check = 70.6% (690020224/976510912) finish=745.1min speed=6407K/sec

md0 : active raid1 sda1[2](F) sdb1[1]
      248896 blocks [2/1] [_U]

unused devices: <none>
}}}

The command to remove the failed partition, where /dev/mdZ is the RAID array and /dev/sdX# is the failed partition, is:

{{{
mdadm /dev/mdZ --remove /dev/sdX#
}}}

For example (note that sda1 is no longer listed in md0):

{{{
0 ottorene:~# mdadm /dev/md0 --remove /dev/sda1
mdadm: hot removed /dev/sda1 from /dev/md0
0 ottorene:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      976510912 blocks [2/2] [UU]
      [==============>......]
      check = 70.6% (690087488/976510912) finish=699.1min speed=6827K/sec

md0 : active raid1 sdb1[1]
      248896 blocks [2/1] [_U]

unused devices: <none>
}}}

Fail and remove the second partition in the other RAID array following the same steps:

{{{
0 ottorene:~# mdadm /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1
0 ottorene:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[2](F) sdb2[1]
      976510912 blocks [2/1] [_U]

md0 : active raid1 sdb1[1]
      248896 blocks [2/1] [_U]

unused devices: <none>
0 ottorene:~# mdadm /dev/md1 --remove /dev/sda2
mdadm: hot removed /dev/sda2 from /dev/md1
0 ottorene:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[1]
      976510912 blocks [2/1] [_U]

md0 : active raid1 sdb1[1]
      248896 blocks [2/1] [_U]

unused devices: <none>
}}}

== Shutdown All the Guests and then Powerdown the Server ==

Shut down all the guests, then power down the server. Once the server is powered off, remove it from the rack and replace the failed disk with the replacement.
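For reference, the WHEN_FAILED check can be scripted. A minimal sketch, assuming POSIX sh and awk; the helper name `check_smart` and the inline sample table (copied from the failed-disk example above) are illustrative, and in practice you would pipe in `smartctl -a /dev/sdX`:

```shell
#!/bin/sh
# Sketch: print SMART attributes whose WHEN_FAILED column is not "-".
# check_smart reads smartctl attribute lines on stdin; the sample table
# below mirrors the /dev/sda example above.
check_smart() {
    # Fields: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    awk '$1 ~ /^[0-9]+$/ && $9 != "-" { print "FAILED:", $2, "(" $9 ")" }'
}

check_smart <<'EOF'
  1 Raw_Read_Error_Rate     0x000f   104   095   006    Pre-fail  Always       -       163634138
  7 Seek_Error_Rate         0x000f   040   016   030    Pre-fail  Always   In_the_past 110977913882577
  9 Power_On_Hours          0x0032   064   064   000    Old_age   Always       -       32403
EOF
# prints: FAILED: Seek_Error_Rate (In_the_past)
```

Any line it prints names an attribute that has failed, i.e. a disk to treat as failed.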
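The fail-and-remove steps above can also be sketched as a dry run that parses /proc/mdstat-format text and prints the mdadm commands for every partition of the failed disk. The helper name `plan_removal` is hypothetical and the sample input is trimmed from the example output above; on a live system you would feed it `cat /proc/mdstat` instead:

```shell
#!/bin/sh
# Sketch: dry run that prints the mdadm --fail / --remove commands needed
# to pull every partition of a failed disk out of its md array.
# plan_removal reads /proc/mdstat-format text on stdin; nothing is executed.
plan_removal() {    # $1 = failed disk, e.g. sda
    awk -v disk="$1" '
        /^md/ {
            md = $1
            for (i = 5; i <= NF; i++) {      # member devices start at field 5
                dev = $i
                sub(/\[.*/, "", dev)         # strip member index, e.g. [0] or [2](F)
                if (dev ~ "^" disk)
                    printf "mdadm /dev/%s --fail /dev/%s\nmdadm /dev/%s --remove /dev/%s\n", md, dev, md, dev
            }
        }'
}

# Sample trimmed from the example above; use `cat /proc/mdstat` for real.
plan_removal sda <<'EOF'
md1 : active raid1 sda2[0] sdb2[1]
md0 : active raid1 sda1[0] sdb1[1]
EOF
```

Run the printed commands by hand, failing each partition before removing it, exactly as in the walkthrough above.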
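Before powering the server down, it is worth double-checking that no partition of the failed disk is still listed in any array. A small sketch, assuming POSIX sh and grep; `disk_gone` is a hypothetical helper fed /proc/mdstat-format text:

```shell
#!/bin/sh
# Sketch: pre-shutdown sanity check that a removed disk's partitions no
# longer appear in any md array. disk_gone reads /proc/mdstat-format text.
disk_gone() {    # $1 = removed disk, e.g. sda
    if grep -Eq "^md[0-9]+ : .* $1[0-9]+\[" ; then
        echo "WARNING: $1 still present in an array"
    else
        echo "OK: $1 removed from all arrays"
    fi
}

# Sample matching the final mdstat state above, after both removals.
disk_gone sda <<'EOF'
md1 : active raid1 sdb2[1]
md0 : active raid1 sdb1[1]
EOF
# prints: OK: sda removed from all arrays
```

If it prints a warning, go back and fail/remove the remaining partition before shutting down.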