Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#11796 closed Bug/Something is broken (fixed)

disk failure on barron

Reported by: https://id.mayfirst.org/jamie
Owned by: https://id.mayfirst.org/jamie
Priority: Medium
Component: Tech
Keywords: barron.mayfirst.org disk
Cc:
Sensitive: no

Description

sdb has failed:

1 barron:~# cat /proc/mdstat 
Personalities : [raid1] [raid10] 
md1 : active raid10 sda3[0] sdf3[12] sdl3[15] sdk3[9] sdj3[8] sdi3[7] sdh3[14] sdg3[5] sde3[16] sdd3[3] sdc3[13] sdb3[17](F)
      17577759744 blocks super 1.2 512K chunks 2 near-copies [12/11] [U_UUUUUUUUUU]
      
md0 : active raid1 sda2[0] sdl2[15] sdk2[10] sdj2[9] sdi2[8] sdh2[14] sdg2[6] sdf2[12] sde2[16] sdd2[3] sdc2[13] sdb2[17]
      499392 blocks super 1.2 [12/12] [UUUUUUUUUUUU]
      
unused devices: <none>
0 barron:~#
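
For future monitoring, the faulty member and the degraded count can be pulled out of /proc/mdstat mechanically. What follows is a minimal Python sketch, not something that runs on barron. The function name `degraded_arrays` and the abbreviated `SAMPLE_MDSTAT` fixture are illustrative assumptions, and the parser only understands the two line shapes shown above (the `mdN : ...` header and the `[expected/active]` counter).

```python
import re

# Abbreviated copy of the output above (member lists shortened for the example).
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid10]
md1 : active raid10 sda3[0] sdc3[13] sdb3[17](F)
      17577759744 blocks super 1.2 512K chunks 2 near-copies [12/11] [U_UUUUUUUUUU]

md0 : active raid1 sda2[0] sdc2[13] sdb2[17]
      499392 blocks super 1.2 [12/12] [UUUUUUUUUUUU]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Map each degraded md array to its faulty members and device counts."""
    arrays = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"(md\d+)\s*:\s*(.*)", line)
        if header:
            current = header.group(1)
            # Members the kernel has marked faulty carry an (F) suffix.
            failed = re.findall(r"(\w+)\[\d+\]\(F\)", header.group(2))
            arrays[current] = {"failed": failed, "expected": None, "active": None}
            continue
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if counts and current is not None:
            # [12/11] means 12 devices expected, 11 currently active.
            arrays[current]["expected"] = int(counts.group(1))
            arrays[current]["active"] = int(counts.group(2))
            current = None
    return {
        name: info
        for name, info in arrays.items()
        if info["failed"]
        or (info["expected"] is not None and info["active"] < info["expected"])
    }


if __name__ == "__main__":
    for name, info in degraded_arrays(SAMPLE_MDSTAT).items():
        print(name, info)
```

On the real host the input would be the contents of /proc/mdstat rather than the fixture. Only md1 is reported here, since md0 still shows all 12 members active.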

In addition, sdg is having some problems but has not yet failed:

0 barron:~# egrep "sdg" /var/log/syslog
May 25 06:33:09 barron smartd[1454]: Device: /dev/sdg [SAT], 8 Currently unreadable (pending) sectors
May 25 06:33:09 barron smartd[1454]: Device: /dev/sdg [SAT], 8 Offline uncorrectable sectors
May 25 06:33:09 barron smartd[1454]: Device: /dev/sdg [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 74 to 76
May 25 07:03:09 barron smartd[1454]: Device: /dev/sdg [SAT], 8 Currently unreadable (pending) sectors
May 25 07:03:09 barron smartd[1454]: Device: /dev/sdg [SAT], 8 Offline uncorrectable sectors
May 25 07:03:09 barron smartd[1454]: Device: /dev/sdg [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 76 to 78
May 25 07:33:09 barron smartd[1454]: Device: /dev/sdg [SAT], 8 Currently unreadable (pending) sectors
May 25 07:33:09 barron smartd[1454]: Device: /dev/sdg [SAT], 8 Offline uncorrectable sectors
May 25 07:33:09 barron smartd[1454]: Device: /dev/sdg [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 78 to 80
May 25 08:03:09 barron smartd[1454]: Device: /dev/sdg [SAT], 8 Currently unreadable (pending) sectors
May 25 08:03:09 barron smartd[1454]: Device: /dev/sdg [SAT], 8 Offline uncorrectable sectors
May 25 08:03:09 barron smartd[1454]: Device: /dev/sdg [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 80 to 81
May 25 08:33:09 barron smartd[1454]: Device: /dev/sdg [SAT], 8 Currently unreadable (pending) sectors
May 25 08:33:09 barron smartd[1454]: Device: /dev/sdg [SAT], 8 Offline uncorrectable sectors
May 25 08:33:09 barron smartd[1454]: Device: /dev/sdg [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 81 to 82
May 25 09:03:09 barron smartd[1454]: Device: /dev/sdg [SAT], 8 Currently unreadable (pending) sectors
May 25 09:03:09 barron smartd[1454]: Device: /dev/sdg [SAT], 8 Offline uncorrectable sectors
May 25 09:33:09 barron smartd[1454]: Device: /dev/sdg [SAT], 8 Currently unreadable (pending) sectors
May 25 09:33:09 barron smartd[1454]: Device: /dev/sdg [SAT], 8 Offline uncorrectable sectors
May 25 09:33:09 barron smartd[1454]: Device: /dev/sdg [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 83
May 25 09:33:09 barron smartd[1454]: Device: /dev/sdg [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 68 to 69
May 25 09:33:09 barron smartd[1454]: Device: /dev/sdg [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 32 to 31
0 barron:~# 
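
The repeated smartd lines above boil down to two facts per device: the latest bad-sector counts, and the first and last value of each attribute that changed. A rough Python sketch of that reduction is below. The names `summarize`, `SECTOR_COUNT`, `ATTR_CHANGE`, and `SAMPLE_LOG` are made up for this example, and the regular expressions are written only against the message shapes seen in this log. As far as I know, smartd reports normalized attribute values in these messages by default, so the sketch records the change without judging whether the direction is good or bad.

```python
import re

SECTOR_COUNT = re.compile(
    r"Device: (?P<dev>/dev/\w+) \[SAT\], (?P<count>\d+) "
    r"(?P<kind>Currently unreadable \(pending\)|Offline uncorrectable) sectors"
)
ATTR_CHANGE = re.compile(
    r"Device: (?P<dev>/dev/\w+) \[SAT\], SMART (?:Prefailure|Usage) Attribute: "
    r"\d+ (?P<name>\w+) changed from (?P<old>\d+) to (?P<new>\d+)"
)

# A few lines copied from the syslog excerpt above.
SAMPLE_LOG = [
    "May 25 06:33:09 barron smartd[1454]: Device: /dev/sdg [SAT], 8 Currently unreadable (pending) sectors",
    "May 25 06:33:09 barron smartd[1454]: Device: /dev/sdg [SAT], 8 Offline uncorrectable sectors",
    "May 25 06:33:09 barron smartd[1454]: Device: /dev/sdg [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 74 to 76",
    "May 25 09:33:09 barron smartd[1454]: Device: /dev/sdg [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 83",
    "May 25 09:33:09 barron smartd[1454]: Device: /dev/sdg [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 32 to 31",
]


def summarize(lines):
    """Collapse smartd messages into per-device sector counts and attribute ranges."""
    summary = {}
    for line in lines:
        match = SECTOR_COUNT.search(line)
        if match:
            dev = summary.setdefault(match["dev"], {"sectors": {}, "attrs": {}})
            # Keep the most recent count for each kind of bad sector.
            dev["sectors"][match["kind"]] = int(match["count"])
            continue
        match = ATTR_CHANGE.search(line)
        if match:
            dev = summary.setdefault(match["dev"], {"sectors": {}, "attrs": {}})
            # Remember the first "from" value and the latest "to" value.
            first_old, _ = dev["attrs"].get(match["name"], (int(match["old"]), None))
            dev["attrs"][match["name"]] = (first_old, int(match["new"]))
    return summary


if __name__ == "__main__":
    for device, info in summarize(SAMPLE_LOG).items():
        print(device, info)
```

Feeding it the full excerpt instead of the five sample lines would give the same sector counts, since they stay at 8 throughout.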

Change History (8)

comment:1 Changed 3 years ago by https://id.mayfirst.org/jamie

  • Owner set to https://id.mayfirst.org/jamie
  • Status changed from new to assigned

sdb is no longer recognized by barron, but sdg still is.

According to lshw:

              *-disk:7
                   description: ATA Disk
                   product: ST3000NC002-1DY1
                   vendor: Seagate
                   physical id: 0.6.0
                   bus info: scsi@0:0.6.0
                   logical name: /dev/sdg
                   version: CN02
                   serial: Z1F2PCJZ
                   size: 2794GiB (3TB)
                   capacity: 2794GiB (3TB)
                   capabilities: 15000rpm gpt-1.00 partitioned partitioned:gpt
                   configuration: ansiversion=6 guid=11bf3084-3e6d-4218-8b66-127448c96ebf logicalsectorsize=512 sectorsize=4096

comment:2 Changed 3 years ago by https://id.mayfirst.org/jamie

This sucks - sdb was just replaced six weeks ago (#11667). I see no signs in our ticket system that sdg has ever been replaced.

I'm not sure whether I have two 3TB disks in the colo, but we should replace both. The problem is that I don't know how to tell whether sdb and sdg are mirrors of each other (in which case, if I pull both of them, the system will fail).

I tried --query --examine:

1 barron:~# mdadm --query --examine /dev/sdg3
/dev/sdg3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 0bdb6852:016b0484:1d8de5dd:7312bea4
           Name : baron:1
  Creation Time : Mon Jul 29 23:39:31 2013
     Raid Level : raid10
   Raid Devices : 12

 Avail Dev Size : 5859254272 (2793.91 GiB 2999.94 GB)
     Array Size : 17577759744 (16763.46 GiB 17999.63 GB)
  Used Dev Size : 5859253248 (2793.91 GiB 2999.94 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262064 sectors, after=1024 sectors
          State : active
    Device UUID : 4446daa6:040e8e1d:424961d2:4f83c91b

    Update Time : Sun May 22 02:58:46 2016
       Checksum : 8835ece6 - correct
         Events : 2173949

         Layout : near=2
     Chunk Size : 512K

   Device Role : Active device 5
   Array State : A.AAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
0 barron:~#

But I don't see anything that indicates which disk it is mirroring.

comment:3 Changed 3 years ago by https://id.mayfirst.org/jaimev

Bad disk or bad hardware controller? Maybe we should devise a physical labeling system that will reduce guesswork when changing disks.

comment:4 Changed 3 years ago by https://id.mayfirst.org/dkg

I think the question is less about labeling the disks, and more about figuring out the internal device layout.

The Debian RAID FAQ notes:

  RAID10 gives you the choice between three ways of laying out the blocks on
  the disk. Assuming a simple 4 drive setup with 2 copies of each block, then
  if A,B,C are data blocks, a,b their parts, and 1,2 denote their copies, the
  following would be a classic RAID1+0 where 1,2 and 3,4 are RAID0 pairs
  combined into a RAID1:

  near=2 would be (this is the classic RAID1+0)

    hdd1  Aa1 Ba1 Ca1
    hdd2  Aa2 Ba2 Ca2
    hdd3  Ab1 Bb1 Cb1
    hdd4  Ab2 Bb2 Cb2

  offset=2 would be

    hdd1  Aa1 Bb2 Ca1 Db2
    hdd2  Ab1 Aa2 Cb1 Ca2
    hdd3  Ba1 Ab2 Da1 Cb2
    hdd4  Bb1 Ba2 Db1 Da2

  far=2 would be

    hdd1  Aa1 Ca1  .... Bb2 Db2
    hdd2  Ab1 Cb1  .... Aa2 Ca2
    hdd3  Ba1 Da1  .... Ab2 Cb2
    hdd4  Bb1 Db1  .... Ba2 Da2

  Where the second set start half-way through the drives.

  The advantage of far= is that you can easily spread a long sequential read
  across the drives.  The cost is more seeking for writes. offset= can
  possibly get similar benefits with large enough chunk size. Neither upstream
  nor the package maintainer have tried to understand all the implications of
  that layout. It was added simply because it is a supported layout in DDF and
  DDF support is a goal.

And we can see from the --examine output above that the layout is near=2.

That suggests that the copies are pairwise adjacent.

0 barron:~# mdadm --query --detail /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Mon Jul 29 23:39:31 2013
     Raid Level : raid10
     Array Size : 17577759744 (16763.46 GiB 17999.63 GB)
  Used Dev Size : 2929626624 (2793.91 GiB 2999.94 GB)
   Raid Devices : 12
  Total Devices : 12
    Persistence : Superblock is persistent

    Update Time : Sun May 22 02:58:46 2016
          State : active, degraded 
 Active Devices : 11
Working Devices : 11
 Failed Devices : 1
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : baron:1
           UUID : 0bdb6852:016b0484:1d8de5dd:7312bea4
         Events : 2173949

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync set-A   /dev/sda3
       2       0        0        2      removed
      13       8       35        2      active sync set-A   /dev/sdc3
       3       8       51        3      active sync set-B   /dev/sdd3
      16       8       67        4      active sync set-A   /dev/sde3
       5       8       99        5      active sync set-B   /dev/sdg3
      14       8      115        6      active sync set-A   /dev/sdh3
       7       8      131        7      active sync set-B   /dev/sdi3
       8       8      147        8      active sync set-A   /dev/sdj3
       9       8      163        9      active sync set-B   /dev/sdk3
      15       8      179       10      active sync set-A   /dev/sdl3
      12       8       83       11      active sync set-B   /dev/sdf3

      17       8       19        -      faulty
0 barron:~# 

It looks like both sdg3 and the removed disk were part of set-B, which I think implies that they were *not* mirrors of each other. I welcome someone else's interpretation of this, though.
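
To make the pairing explicit, here is a small Python sketch of the working assumption: with a near layout and a device count divisible by the number of copies, the copies of a chunk sit on adjacent RaidDevice roles, so roles (0,1), (2,3), (4,5) and so on are the mirror pairs. That matches the alternating set-A/set-B labels in the table above, but it is an assumption and should be checked against md(4) before relying on it. The `ROLES` table is transcribed by hand from the --detail output, the empty slot is taken to be role 1 based on `Array State : A.AAAAAAAAAA`, and the function names are invented for this example.

```python
# RaidDevice role -> member partition, transcribed from the --detail output.
# Role 1 is the empty slot (inferred from "Array State : A.AAAAAAAAAA").
ROLES = {
    0: "sda3", 1: None, 2: "sdc3", 3: "sdd3", 4: "sde3", 5: "sdg3",
    6: "sdh3", 7: "sdi3", 8: "sdj3", 9: "sdk3", 10: "sdl3", 11: "sdf3",
}


def mirror_roles(role, near_copies=2, raid_devices=12):
    """Return the other roles assumed to hold copies of the same chunks as `role`."""
    if raid_devices % near_copies != 0:
        # The simple adjacent-group rule below only covers the divisible case.
        raise ValueError("raid_devices must be a multiple of near_copies")
    if not 0 <= role < raid_devices:
        raise ValueError("role out of range")
    start = (role // near_copies) * near_copies
    return [r for r in range(start, start + near_copies) if r != role]


def mirror_members(role):
    """Translate the partner roles into member partitions using ROLES."""
    return [ROLES[r] for r in mirror_roles(role)]


def safe_to_pull_together(role_a, role_b):
    """True if the two roles are not in the same assumed mirror group."""
    return role_b not in mirror_roles(role_a)


if __name__ == "__main__":
    print("partner of sdg3 (role 5):", mirror_members(5))
    print("partner of the empty slot (role 1):", mirror_members(1))
    print("roles 1 and 5 independent:", safe_to_pull_together(1, 5))
```

Under this assumption sdg3 (role 5) is paired with sde3 (role 4), and the empty slot is paired with sda3 (role 0), which agrees with the conclusion that sdb and sdg were not mirrors of each other.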

comment:5 Changed 3 years ago by https://id.mayfirst.org/jamie

Thank you! This is really useful. Sadly, I only seem to have one 3TB disk, so I can only replace one disk on this trip. But I will be using this reference in the future.

comment:6 Changed 3 years ago by https://id.mayfirst.org/jamie

New disk is installed... heading to an appointment, then back to the office, and then I will set it up.

comment:7 Changed 3 years ago by https://id.mayfirst.org/jamie

  • Resolution set to fixed
  • Status changed from assigned to closed

I've just partitioned the new disk and started sync'ing it.

comment:8 Changed 3 years ago by https://id.mayfirst.org/jamie

I've also just ordered three new 3TB disks (we have two 2TB disks in the cabinet now).

Since the last one (which lasted all of 6 weeks) was a Seagate Constellation, this time I split the order between:

  • Toshiba MG03ACA300 3TB 7200RPM SATA3/SATA 6.0 GB/s 64MB Enterprise Hard
  • IBM RE 3TB Enterprise SAS HDD,WD3001FYY G, 7200 RPM,SATAIII,32
  • WD RE SAS 3 TB Enterprise Hard Drive: 3.5 Inch, 7200 RPM, SAS, 32 MB Cache
