= replacing_failed_disk =

== Verify the SMARTCTL Errors ==
Run the following command, where /dev/sdX is the suspected failed disk:
{{{
smartctl -a /dev/sdX | less
}}}

An example output where /dev/sda failed:
{{{
0 ottorene:~# smartctl -a /dev/sda | less

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   104   095   006    Pre-fail  Always       -       163634138
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       11
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   040   016   030    Pre-fail  Always   In_the_past 110977913882577
  9 Power_On_Hours          0x0032   064   064   000    Old_age   Always       -       32403
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       11
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   076   076   000    Old_age   Always       -       24
188 Command_Timeout         0x0032   100   090   000    Old_age   Always       -       87
189 High_Fly_Writes         0x003a   089   089   000    Old_age   Always       -       11
}}}

Note that an entry in the WHEN_FAILED column (here, In_the_past for Seek_Error_Rate) indicates a failing disk.

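For a quick pass/fail summary before digging into the full attribute table, smartctl can also report the drive's overall health assessment (same placeholder device name as above):
{{{
smartctl -H /dev/sdX
}}}
On a failing drive this typically reports FAILED rather than PASSED.
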
== Output the Hardware to a TXT File and Save a Local Copy ==
{{{
0 ottorene:~# lshw > ~/ottorene.lshw.txt
}}}

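It can also help to keep a copy of the failed disk's partition table so the replacement can be partitioned the same way; one possible way to do that (the output file name is just an example) is:
{{{
sfdisk -d /dev/sdX > ~/ottorene.sdX.parts.txt
}}}
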
== Check the Encrypted Disk Passphrase ==

To make sure we have the correct encrypted disk passphrase for the reboot, we need to find the passphrase in keyringer and verify that it works with the following command.
{{{
mf-audit-crypt-disks
}}}

It will first ask for the existing passphrase to unlock and remove the key slot, and then replace the removed key slot with a new one.  Using the same passphrase for the replacement key slot is fine.  The output should look something like the following.
{{{
0 ottorene:~# mf-audit-crypt-disks
md1_crypt /dev/md1 none luks
crypt-audit: Testing key slot availability for '/dev/md1'
crypt-audit: Testing passphrase for '/dev/md1'
Enter any passphrase:
Key slot 0 unlocked.
Enter new passphrase for key slot:
Verify passphrase:
Command successful.
crypt-audit: Passphrase correct, removing slot
Enter any remaining LUKS passphrase:
Key slot 0 unlocked.
Key slot 0 verified.
Command successful.
crypt-audit: Success
}}}

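To retrieve the passphrase from keyringer in the first place, assuming keyringer's usual <keyring> <action> <secret> invocation, the command looks something like the following; the keyring name and secret path here are placeholders for wherever this host's passphrase is actually stored:
{{{
keyringer mykeyring decrypt hosts/ottorene/luks-passphrase
}}}
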
== Install GRUB on the Disk that Stays in the Machine ==

Since we're removing one of the disks, we want to make sure that the remaining disk has GRUB installed when we reboot the machine after physically replacing the failed disk with the new one.  In the following command, /dev/sdY is the remaining disk.
{{{
grub-install /dev/sdY
}}}

An example output using /dev/sdb, since in the example above /dev/sda failed:
{{{
0 ottorene:~# grub-install /dev/sdb
Installation finished. No error reported.
}}}

== Fail and Remove Partitions in the RAID that are to be Replaced ==
Since the disks are in a RAID, we need to check which RAID devices the failed disk's partitions belong to, and then fail and remove them.

To display the RAID status, run
{{{
cat /proc/mdstat
}}}

In this failed disk example, /dev/sda1 is in /dev/md0 and /dev/sda2 is in /dev/md1:
{{{
0 ottorene:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      976510912 blocks [2/2] [UU]
      [==============>......]  check = 70.6% (689525376/976510912) finish=764.3min speed=6256K/sec

md0 : active raid1 sda1[0] sdb1[1]
      248896 blocks [2/2] [UU]

unused devices: <none>
}}}

The command to fail a partition in the RAID is the following, where /dev/mdZ is the RAID device and /dev/sdX# is the failed disk's partition:
{{{
mdadm /dev/mdZ --fail /dev/sdX#
}}}

For example, note the (F) next to the /dev/sda1 partition in the cat /proc/mdstat output.
{{{
0 ottorene:~# mdadm /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0

0 ottorene:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      976510912 blocks [2/2] [UU]
      [==============>......]  check = 70.6% (690020224/976510912) finish=745.1min speed=6407K/sec

md0 : active raid1 sda1[2](F) sdb1[1]
      248896 blocks [2/1] [_U]

unused devices: <none>
}}}

The command to remove the failed partition is the following, where /dev/mdZ is the RAID device and /dev/sdX# is the failed partition:
{{{
mdadm /dev/mdZ --remove /dev/sdX#
}}}

For example, note that sda1 is no longer listed in md0.
{{{
0 ottorene:~# mdadm /dev/md0 --remove /dev/sda1
mdadm: hot removed /dev/sda1 from /dev/md0

0 ottorene:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      976510912 blocks [2/2] [UU]
      [==============>......]  check = 70.6% (690087488/976510912) finish=699.1min speed=6827K/sec

md0 : active raid1 sdb1[1]
      248896 blocks [2/1] [_U]
unused devices: <none>
}}}

Fail and remove the failed disk's second partition from the other RAID device, following the same steps.
{{{
0 ottorene:~# mdadm /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1

0 ottorene:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[2](F) sdb2[1]
      976510912 blocks [2/1] [_U]

md0 : active raid1 sdb1[1]
      248896 blocks [2/1] [_U]

unused devices: <none>

0 ottorene:~# mdadm /dev/md1 --remove /dev/sda2
mdadm: hot removed /dev/sda2 from /dev/md1

0 ottorene:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[1]
      976510912 blocks [2/1] [_U]

md0 : active raid1 sdb1[1]
      248896 blocks [2/1] [_U]
unused devices: <none>
}}}

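Optionally, for a fuller view of each degraded array than /proc/mdstat gives (member states, failed-device counts, UUIDs), mdadm can print the array details, where /dev/mdZ is the RAID device as above:
{{{
mdadm --detail /dev/mdZ
}}}
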
== Shut Down All the Guests and then Power Down the Server ==

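How the guests are shut down depends on the virtualization stack in use; assuming they are libvirt-managed KVM domains (an assumption, not stated above), something along these lines stops them cleanly before powering off the host:
{{{
# List running guests, then ask each one to shut down cleanly (libvirt assumed;
# GUESTNAME is a placeholder).
virsh list
virsh shutdown GUESTNAME

# Once 'virsh list' shows no running guests, power off the host.
poweroff
}}}
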
Once the server is powered off, remove it from the rack and swap the failed disk for the replacement.