wiki:replacing_failed_disk

Version 1 (modified by Joseph, 7 years ago)

--

Verify the SMARTCTL Errors

Run the following command where /dev/sdX is the suspected failed disk:

smartctl -a /dev/sdX | less

An example output where /dev/sda failed:

0 ottorene:~# smartctl -a /dev/sda | less

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   104   095   006    Pre-fail  Always       -       163634138
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       11
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   040   016   030    Pre-fail  Always   In_the_past 110977913882577
  9 Power_On_Hours          0x0032   064   064   000    Old_age   Always       -       32403
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       11
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   076   076   000    Old_age   Always       -       24
188 Command_Timeout         0x0032   100   090   000    Old_age   Always       -       87
189 High_Fly_Writes         0x003a   089   089   000    Old_age   Always       -       11

Note the WHEN_FAILED column: any value other than "-" (such as In_the_past above, or FAILING_NOW) indicates a failed or failing attribute.
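The check above can be scripted across all disks at once. A minimal sketch; the /dev/sd? glob and the column positions are assumptions about this machine's smartctl output format:

```shell
# Scan every SCSI/SATA disk and print SMART attributes whose WHEN_FAILED
# column (field 9 of the attribute table) is anything other than "-".
for disk in /dev/sd?; do
    echo "== $disk =="
    smartctl -A "$disk" \
        | awk 'NF >= 10 && $1 ~ /^[0-9]+$/ && $9 != "-" {print "FAILED:", $2, $9}'
done
```

A disk with no output under its header has no failed attributes.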

Output the Hardware to a TXT File and Save a Local Copy

0 ottorene:~# lshw > ~/ottorene.lshw.txt
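The heading also calls for saving a local copy. One way, run from your workstation rather than the server; the hostname and paths are illustrative:

```shell
# Copy the hardware inventory off the server so it survives the disk swap.
# "ottorene" is the server from the examples above.
scp root@ottorene:ottorene.lshw.txt ~/ottorene.lshw.txt
```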

Check the Encrypted Disk Passphrase

To verify that we have the correct encrypted disk passphrase for the reboot, we need to find the passphrase in keyringer and verify that it works with the following command.

mf-audit-crypt-disks

It first asks for the existing passphrase in order to unlock and remove the key slot, then replaces the removed key slot with a new one. Reusing the same passphrase for the replacement key slot is fine. The output should look something like the following.

0 ottorene:~# mf-audit-crypt-disks
md1_crypt /dev/md1 none luks
crypt-audit: Testing key slot availability for '/dev/md1'
crypt-audit: Testing passphrase for '/dev/md1'
Enter any passphrase:
Key slot 0 unlocked.
Enter new passphrase for key slot:
Verify passphrase:
Command successful.
crypt-audit: Passphrase correct, removing slot
Enter any remaining LUKS passphrase:
Key slot 0 unlocked.
Key slot 0 verified.
Command successful.
crypt-audit: Success
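To double-check the key slots independently of the wrapper script, cryptsetup can list them directly (device name taken from the output above):

```shell
# Dump the LUKS header of the RAID device and show the key slot states;
# after the audit there should be exactly one ENABLED slot.
cryptsetup luksDump /dev/md1 | grep "Key Slot"
```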

Install GRUB on the Disk that Stays in the Machine

Since we're removing one of the disks, we want to make sure the remaining disk has GRUB installed before we reboot the machine after physically replacing the failed disk with the new one. In the following command, /dev/sdY is the surviving disk.

grub-install /dev/sdY

An example output using /dev/sdb, since /dev/sda failed in the example above:

0 ottorene:~# grub-install /dev/sdb
Installation finished. No error reported.
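As a sanity check that the MBR on the remaining disk really contains GRUB, you can look for GRUB's signature string in the first sector. This is a heuristic, not an official verification; GRUB's boot image happens to embed the printable text "GRUB" in the boot sector:

```shell
# Read the first 512-byte sector of the surviving disk and look for
# GRUB's signature string.
dd if=/dev/sdb bs=512 count=1 2>/dev/null | strings | grep -q GRUB \
    && echo "GRUB found in MBR of /dev/sdb" \
    || echo "WARNING: no GRUB signature in MBR of /dev/sdb"
```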

Fail and Remove Partitions in the RAID that are to be Replaced

Since the disks are in a RAID, we need to find which arrays hold the failed disk's partitions, then fail and remove them.

To output the RAID, run

cat /proc/mdstat

In this failed disk example, /dev/sda1 is in /dev/md0 and /dev/sda2 is in /dev/md1:

0 ottorene:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      976510912 blocks [2/2] [UU]
      [==============>......]  check = 70.6% (689525376/976510912) finish=764.3min speed=6256K/sec

md0 : active raid1 sda1[0] sdb1[1]
      248896 blocks [2/2] [UU]

unused devices: <none>
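Finding the failed disk's partitions can also be scripted; a sketch that greps them out of /proc/mdstat. The FAILED variable is an assumption, set it to your dead disk:

```shell
# Print each md array that contains a partition of the failed disk.
# Member partitions appear from field 5 onward on an "mdN : ..." line.
FAILED=sda
awk -v d="$FAILED" '$1 ~ /^md/ { for (i = 5; i <= NF; i++) if ($i ~ "^" d "[0-9]+\\[") print $1 ": " $i }' /proc/mdstat
```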

To fail a partition in the RAID, run the following, where /dev/mdZ is the RAID block device and /dev/sdX# is the failed disk's partition:

mdadm /dev/mdZ --fail /dev/sdX#

For example, note the (F) next to the /dev/sda1 partition in the cat /proc/mdstat output:

0 ottorene:~# mdadm /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0

0 ottorene:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      976510912 blocks [2/2] [UU]
      [==============>......]  check = 70.6% (690020224/976510912) finish=745.1min speed=6407K/sec

md0 : active raid1 sda1[2](F) sdb1[1]
      248896 blocks [2/1] [_U]

unused devices: <none>

To remove the failed partition, run the following, where /dev/mdZ is the RAID block device and /dev/sdX# is the failed partition:

mdadm /dev/mdZ --remove /dev/sdX#

For example, note that sda1 is no longer listed in md0:

0 ottorene:~# mdadm /dev/md0 --remove /dev/sda1
mdadm: hot removed /dev/sda1 from /dev/md0

0 ottorene:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      976510912 blocks [2/2] [UU]
      [==============>......]  check = 70.6% (690087488/976510912) finish=699.1min speed=6827K/sec

md0 : active raid1 sdb1[1]
      248896 blocks [2/1] [_U]
unused devices: <none>

Fail and remove the second partition in the other RAID block device, following the same steps:

0 ottorene:~# mdadm /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1

0 ottorene:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[2](F) sdb2[1]
      976510912 blocks [2/1] [_U]

md0 : active raid1 sdb1[1]
      248896 blocks [2/1] [_U]

unused devices: <none>

0 ottorene:~# mdadm /dev/md1 --remove /dev/sda2
mdadm: hot removed /dev/sda2 from /dev/md1

0 ottorene:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[1]
      976510912 blocks [2/1] [_U]

md0 : active raid1 sdb1[1]
      248896 blocks [2/1] [_U]
unused devices: <none>
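The four mdadm invocations above can also be collapsed, since mdadm accepts --fail and --remove together in a single call. A sketch for this example's layout:

```shell
# Fail and remove both of the dead disk's partitions, one call per array.
FAILED_DISK=sda
mdadm /dev/md0 --fail /dev/${FAILED_DISK}1 --remove /dev/${FAILED_DISK}1
mdadm /dev/md1 --fail /dev/${FAILED_DISK}2 --remove /dev/${FAILED_DISK}2
```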

Shut Down All the Guests and then Power Down the Server

Once the server is powered off, remove it from the rack and replace the failed disk with the replacement.
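If the guests are libvirt/KVM domains (an assumption; adjust for whatever hypervisor this host runs), the guest shutdown can be scripted:

```shell
# Ask every running libvirt guest to shut down cleanly.
for guest in $(virsh list --name); do
    echo "Shutting down $guest"
    virsh shutdown "$guest"
done
```

Once `virsh list --name` prints nothing, power the server off with `shutdown -h now`.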