= replacing_failed_disk =

== Verify the SMARTCTL Errors ==
Run the following command, where /dev/sdX is the suspected failed disk:
{{{
smartctl -a /dev/sdX | less
}}}

An example output where /dev/sda failed:
{{{
0 ottorene:~# smartctl -a /dev/sda | less

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   104   095   006    Pre-fail  Always       -       163634138
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       11
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   040   016   030    Pre-fail  Always   In_the_past 110977913882577
  9 Power_On_Hours          0x0032   064   064   000    Old_age   Always       -       32403
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       11
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   076   076   000    Old_age   Always       -       24
188 Command_Timeout         0x0032   100   090   000    Old_age   Always       -       87
189 High_Fly_Writes         0x003a   089   089   000    Old_age   Always       -       11
}}}

Note that an entry in the WHEN_FAILED column (here, In_the_past for Seek_Error_Rate) indicates a failing disk.

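For a quick pass/fail summary before digging into the full attribute table, smartctl can also report the drive's overall health assessment (same placeholder device name as above):
{{{
smartctl -H /dev/sdX
}}}
On a failing drive this typically reports FAILED rather than PASSED.
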
== Output the Hardware to a TXT File and Save a Local Copy ==
{{{
0 ottorene:~# lshw > ~/ottorene.lshw.txt
}}}

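It can also help to keep a copy of the failed disk's partition table so the replacement can be partitioned the same way; one possible way to do that (the output file name is just an example) is:
{{{
sfdisk -d /dev/sdX > ~/ottorene.sdX.parts.txt
}}}
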
== Check the Encrypted Disk Passphrase ==

To make sure we have the correct encrypted disk passphrase for the reboot, we need to find the passphrase in keyringer and verify that it works with the following command.
{{{
mf-audit-crypt-disks
}}}

It will first ask for the existing passphrase to unlock and remove the key slot, and then replace the removed key slot with a new one.  Using the same passphrase for the replacement key slot is fine.  The output should look something like the following.
{{{
0 ottorene:~# mf-audit-crypt-disks
md1_crypt /dev/md1 none luks
crypt-audit: Testing key slot availability for '/dev/md1'
crypt-audit: Testing passphrase for '/dev/md1'
Enter any passphrase:
Key slot 0 unlocked.
Enter new passphrase for key slot:
Verify passphrase:
Command successful.
crypt-audit: Passphrase correct, removing slot
Enter any remaining LUKS passphrase:
Key slot 0 unlocked.
Key slot 0 verified.
Command successful.
crypt-audit: Success
}}}

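To retrieve the passphrase from keyringer in the first place, assuming keyringer's usual <keyring> <action> <secret> invocation, the command looks something like the following; the keyring name and secret path here are placeholders for wherever this host's passphrase is actually stored:
{{{
keyringer mykeyring decrypt hosts/ottorene/luks-passphrase
}}}
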
== Install GRUB on the Disk that Stays in the Machine ==

Since we're removing one of the disks, we want to make sure that the remaining disk has GRUB installed when we reboot the machine after physically replacing the failed disk with the new one.  In the following command, /dev/sdY is the remaining disk.
{{{
grub-install /dev/sdY
}}}

An example output using /dev/sdb, since in the example above /dev/sda failed:
{{{
0 ottorene:~# grub-install /dev/sdb
Installation finished. No error reported.
}}}

== Fail and Remove Partitions in the RAID that are to be Replaced ==
Since the disks are in a RAID, we need to check which RAID devices the failed disk's partitions belong to, and then fail and remove them.

To display the RAID status, run
{{{
cat /proc/mdstat
}}}

In this failed disk example, /dev/sda1 is in /dev/md0 and /dev/sda2 is in /dev/md1:
{{{
0 ottorene:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      976510912 blocks [2/2] [UU]
      [==============>......]  check = 70.6% (689525376/976510912) finish=764.3min speed=6256K/sec

md0 : active raid1 sda1[0] sdb1[1]
      248896 blocks [2/2] [UU]

unused devices: <none>
}}}

The command to fail a partition in the RAID is the following, where /dev/mdZ is the RAID device and /dev/sdX# is the failed disk's partition:
{{{
mdadm /dev/mdZ --fail /dev/sdX#
}}}

For example, note the (F) next to the /dev/sda1 partition in the cat /proc/mdstat output.
{{{
0 ottorene:~# mdadm /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0

0 ottorene:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      976510912 blocks [2/2] [UU]
      [==============>......]  check = 70.6% (690020224/976510912) finish=745.1min speed=6407K/sec

md0 : active raid1 sda1[2](F) sdb1[1]
      248896 blocks [2/1] [_U]

unused devices: <none>
}}}

The command to remove the failed partition is the following, where /dev/mdZ is the RAID device and /dev/sdX# is the failed partition:
{{{
mdadm /dev/mdZ --remove /dev/sdX#
}}}

For example, note that sda1 is no longer listed in md0.
{{{
0 ottorene:~# mdadm /dev/md0 --remove /dev/sda1
mdadm: hot removed /dev/sda1 from /dev/md0

0 ottorene:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      976510912 blocks [2/2] [UU]
      [==============>......]  check = 70.6% (690087488/976510912) finish=699.1min speed=6827K/sec

md0 : active raid1 sdb1[1]
      248896 blocks [2/1] [_U]
unused devices: <none>
}}}

Fail and remove the failed disk's second partition from the other RAID device, following the same steps.
{{{
0 ottorene:~# mdadm /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1

0 ottorene:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[2](F) sdb2[1]
      976510912 blocks [2/1] [_U]

md0 : active raid1 sdb1[1]
      248896 blocks [2/1] [_U]

unused devices: <none>

0 ottorene:~# mdadm /dev/md1 --remove /dev/sda2
mdadm: hot removed /dev/sda2 from /dev/md1

0 ottorene:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[1]
      976510912 blocks [2/1] [_U]

md0 : active raid1 sdb1[1]
      248896 blocks [2/1] [_U]
unused devices: <none>
}}}

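Optionally, for a fuller view of each degraded array than /proc/mdstat gives (member states, failed-device counts, UUIDs), mdadm can print the array details, where /dev/mdZ is the RAID device as above:
{{{
mdadm --detail /dev/mdZ
}}}
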
== Shut Down All the Guests and then Power Down the Server ==

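How the guests are shut down depends on the virtualization stack in use; assuming they are libvirt-managed KVM domains (an assumption, not stated above), something along these lines stops them cleanly before powering off the host:
{{{
# List running guests, then ask each one to shut down cleanly (libvirt assumed;
# GUESTNAME is a placeholder).
virsh list
virsh shutdown GUESTNAME

# Once 'virsh list' shows no running guests, power off the host.
poweroff
}}}
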
Once the server is powered off, remove it from the rack and swap the failed disk for the replacement.