Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#11367 closed Bug/Something is broken (fixed)

malaka disk replacement

Reported by: Owned by:
Priority: Medium Component: Tech
Keywords: Cc:
Sensitive: no


I was notified by email about a failed RAID event on malaka.

Change History (6)

comment:1 Changed 3 years ago by

The disk has been replaced and the small /dev/md0 RAID array has been synced. I haven't started /dev/md1 yet because syncing it typically consumes a lot of disk i/o, so we usually sync these arrays in the middle of the night after shutting down all guests.

The failed disk is a Hitachi dated Apr 2010.

Identified by lshw: product: Hitachi HUA72202
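For reference, a sketch of how the product string can be pulled out of lshw output. The `-class disk` invocation is how lshw is normally queried for drives; the sample line below is an assumption modeled on the output quoted above, not captured from malaka itself:

```shell
# On the live host (as root) you would run:
#   lshw -class disk
# and look for the "product:" lines. The extraction itself can be
# exercised on a captured sample line:
sample='       product: Hitachi HUA72202'
# Strip the "product:" label and leading whitespace to get the model string.
model=$(printf '%s\n' "$sample" | sed -e 's/^ *product: *//')
echo "$model"
```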

comment:2 Changed 3 years ago by

  • Owner set to
  • Status changed from new to assigned

We plan to sync the disk Monday night between 11:00 pm and 5:00 am America/Chicago (requiring all guests to be off).

comment:3 Changed 3 years ago by

Strangely... I remember failing /dev/sdd2 and /dev/sdd3 out of /dev/md0 and /dev/md1, partitioning the newly created /dev/sdd (which appeared when I inserted the new disk), and then adding the new /dev/sdd2 to /dev/md0 last Friday when I added the disk.

However, now I see that /dev/sdd2 is marked failed on /dev/md0, /dev/sdd no longer appears in the partition table, and /dev/sde is available and properly partitioned.

I think the new disk must initially have been recognized as /dev/sdd and somehow got switched to /dev/sde??

In any event, I just ran a short test on /dev/sde (I wish I had started a long test last Friday) and it doesn't report any errors:

0 malaka:~# smartctl -a /dev/sde | grep Short
Short self-test routine 
# 1  Short offline       Completed without error       00%        75         -
0 malaka:~# 

And I added it to /dev/md0 with:

mdadm /dev/md0 --add /dev/sde2

And lastly I removed the failed /dev/sdd2 (which was still listed in the output of /proc/mdstat but could not be removed using --remove because it is not in the partition table):

mdadm /dev/md0 -r detached
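For the record, the whole replace-and-re-add sequence on md0 can be sketched as a dry-run script. The device names follow this ticket; the `RUN=echo` indirection is my addition so the commands print instead of executing, since mdadm needs root and the real array:

```shell
#!/bin/sh
# Dry-run sketch of the md0 disk replacement described in this ticket.
# Set RUN= (empty) on the real host to actually execute the commands.
RUN=echo

# Mark the old partition failed so md stops using it.
$RUN mdadm /dev/md0 --fail /dev/sdd2
# When the device node has already vanished from the partition table,
# --remove by name fails; the "detached" keyword removes any member
# whose device is gone.
$RUN mdadm /dev/md0 -r detached
# Add the partition from the replacement disk; resync starts immediately.
$RUN mdadm /dev/md0 --add /dev/sde2
```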

Now, Jaime, you should be ready to go:

  • Shut down all guests:
    for guest in $(ls /etc/sv/kvm/); do sv stop $guest; done
  • Make sure they are all down:
    for guest in $(ls /etc/sv/kvm/); do sv status $guest; done
  • Add the new partition to the raid array:
    mdadm /dev/md1 --add /dev/sde3

Then go to bed. When I wake up at 6:00 am I will ensure the sync is done (I hope it is done...) and then I will start all guests.

comment:4 Changed 3 years ago by

Ok, I just shut down all guests on malaka and initiated the sync.

md1 : active raid10 sde3[5] sda3[4] sdc3[2] sdb3[1]
      3906010112 blocks super 1.2 512K chunks 2 near-copies [4/3] [UUU_]
      [>....................]  recovery =  1.9% (39055232/1953005056) finish=251.5min speed=126788K/sec
md0 : active raid1 sde2[6] sda2[4] sdc2[2] sdb2[1]
      499700 blocks super 1.2 [4/4] [UUUU]
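A small sketch of how the recovery progress can be scraped from that output. The sample line is copied from the /proc/mdstat dump above; on the live host you would read /proc/mdstat itself instead of a saved string:

```shell
# Sample recovery line from the output above; live equivalent:
#   grep recovery /proc/mdstat
line='      [>....................]  recovery =  1.9% (39055232/1953005056) finish=251.5min speed=126788K/sec'
# Pull out the completion percentage and the estimated finish time.
pct=$(printf '%s\n' "$line" | sed -n 's/.*recovery = *\([0-9.]*%\).*/\1/p')
eta=$(printf '%s\n' "$line" | sed -n 's/.*finish=\([^ ]*\).*/\1/p')
echo "recovery: $pct, finish in $eta"
```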

Although I could see through the serial console that chelsea, hammer, and kinoy had reached the shutdown target, their processes wouldn't stop on their own, so I had to kill those processes. They might give you network issues coming back up.

comment:5 Changed 3 years ago by

  • Resolution set to fixed
  • Status changed from assigned to closed

At 5:50 am America/New_York the sync was complete and I started all guests. There was a brief period when I lost network connectivity, which I think is an old bug that causes the host network to briefly go down when network guests start.

All guests, including chelsea, hammer and kinoy, started properly. Lots of disk i/o now, but it looks like we are back in shape.

comment:6 Changed 3 years ago by

Ack. I didn't realize until now (after seeing the support tickets) that paulo is on malaka. The server paulo is one of the two servers. We should have updated the DNS settings on to disable paulo during this period.

