Opened 7 years ago

Closed 7 years ago

#5525 closed Bug/Something is broken (fixed)

mdadm failure on malaka

Reported by: Jamie McClelland Owned by: Jamie McClelland
Priority: Urgent Component: Tech
Keywords: malaka mdadm Cc:
Sensitive: no

Description

sda3 seems to have been booted:

0 malaka:~# cat /proc/mdstat 
Personalities : [raid1] [raid10] 
md1 : active raid10 sda3[0](F) sdd3[3] sdc3[2] sdb3[1]
      3906010112 blocks super 1.2 512K chunks 2 near-copies [4/3] [_UUU]
      
md0 : active raid1 sda2[0] sdd2[3] sdc2[2] sdb2[1]
      499700 blocks super 1.2 [4/4] [UUUU]
      
unused devices: <none>
0 malaka:~#

Change History (7)

comment:1 Changed 7 years ago by Jamie McClelland

syslog reports:

0  malaka:~# grep sda /var/log/syslog
Apr  3 06:30:07 malaka kernel: [2089494.832606] sd 6:0:0:0: [sda] Unhandled sense code
Apr  3 06:30:07 malaka kernel: [2089494.832612] sd 6:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr  3 06:30:07 malaka kernel: [2089494.832616] sd 6:0:0:0: [sda] Sense Key : Medium Error [current] 
Apr  3 06:30:07 malaka kernel: [2089494.832620] sd 6:0:0:0: [sda] Add. Sense: Record not found
Apr  3 06:30:07 malaka kernel: [2089494.832624] sd 6:0:0:0: [sda] CDB: Write(10): 2a 00 1f a5 09 c8 00 00 08 00
Apr  3 06:30:07 malaka kernel: [2089494.832631] end_request: I/O error, dev sda, sector 530909640
Apr  3 06:30:07 malaka kernel: [2089494.836597] raid10: Disk failure on sda3, disabling device.
Apr  3 06:30:07 malaka kernel: [2089494.948193]  disk 0, wo:1, o:0, dev:sda3
Apr  3 06:30:08 malaka mdadm[1711]: Fail event detected on md device /dev/md/1, component device /dev/sda3
0 malaka:~# 

As enrique suggested by email, we should run smartctl. Running it reveals the following:

0 malaka:~# smartctl -a /dev/sda
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Ultrastar A7K2000
Device Model:     Hitachi HUA722020ALA330
Serial Number:    JK11A1YAJE0GHV
Firmware Version: JKAOA3EA
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Apr  3 17:27:46 2012 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 (22624) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   133   133   054    Pre-fail  Offline      -       101
  3 Spin_Up_Time            0x0007   133   133   024    Pre-fail  Always       -       623 (Average 467)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       14
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 2011
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   121   121   020    Pre-fail  Offline      -       35
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       3444
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       14
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       16
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       16
194 Temperature_Celsius     0x0002   206   206   000    Old_age   Always       -       29 (Lifetime Min/Max 19/38)
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       2045
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       4
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 3327 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3327 occurred at disk power-on lifetime: 3433 hours (143 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 08 c8 09 a5 0f

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 08 38 c8 5b 30 40 00  28d+00:27:59.773  WRITE FPDMA QUEUED
  61 08 30 c0 5b 30 40 00  28d+00:27:59.773  WRITE FPDMA QUEUED
  61 08 28 b8 5b 30 40 00  28d+00:27:59.772  WRITE FPDMA QUEUED
  61 08 20 b0 5b 30 40 00  28d+00:27:59.772  WRITE FPDMA QUEUED
  61 08 18 a8 5b 30 40 00  28d+00:27:59.772  WRITE FPDMA QUEUED

Error 3326 occurred at disk power-on lifetime: 3386 hours (141 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 21 df eb 02 0d

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 00 e9 02 40 00  16d+02:39:20.632  READ FPDMA QUEUED
  60 00 00 00 e8 02 40 00  16d+02:39:20.489  READ FPDMA QUEUED
  61 08 08 b8 a0 a8 40 00  16d+02:39:20.411  WRITE FPDMA QUEUED
  61 08 00 b0 a0 a8 40 00  16d+02:39:20.411  WRITE FPDMA QUEUED
  61 02 00 08 80 0f 40 00  16d+02:39:20.355  WRITE FPDMA QUEUED

Error 3325 occurred at disk power-on lifetime: 3258 hours (135 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 0b f5 8f c3 01

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 08 28 95 c3 40 00  33d+08:09:45.072  READ FPDMA QUEUED
  60 10 08 f0 94 c3 40 00  33d+08:09:45.069  READ FPDMA QUEUED
  60 08 08 78 94 c3 40 00  33d+08:09:45.068  READ FPDMA QUEUED
  60 08 08 40 94 c3 40 00  33d+08:09:45.065  READ FPDMA QUEUED
  60 08 08 c0 93 c3 40 00  33d+08:09:45.060  READ FPDMA QUEUED

Error 3324 occurred at disk power-on lifetime: 2710 hours (112 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 16 6a a7 69 0b

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 80 a4 69 40 00  43d+00:47:16.694  READ FPDMA QUEUED
  60 00 00 80 a0 69 40 00  43d+00:47:16.621  READ FPDMA QUEUED
  60 80 00 00 a0 69 40 00  43d+00:47:16.587  READ FPDMA QUEUED
  61 02 00 08 80 0f 40 00  43d+00:47:16.073  WRITE FPDMA QUEUED
  61 08 00 e8 98 33 40 00  43d+00:47:14.672  WRITE FPDMA QUEUED

Error 3323 occurred at disk power-on lifetime: 2709 hours (112 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 2d d3 9e 45 02

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 00 9b 45 40 00  42d+17:55:17.534  READ FPDMA QUEUED
  60 00 00 00 97 45 40 00  42d+17:55:17.462  READ FPDMA QUEUED
  60 00 00 00 93 45 40 00  42d+17:55:17.383  READ FPDMA QUEUED
  60 00 00 00 92 45 40 00  42d+17:55:17.311  READ FPDMA QUEUED
  61 02 00 08 80 0f 40 00  42d+17:55:16.210  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

88 malaka:~#

comment:2 Changed 7 years ago by Jamie McClelland

More context... here's the part of syslog including these lines preceded the lines in /var/log/syslog:

0 malaka:~# head -n 30  /var/log/syslog
Apr  3 06:25:01 malaka rsyslogd: [origin software="rsyslogd" swVersion="4.6.4" x-pid="1668" x-info="http://www.rsyslog.com"] rsyslogd was HUPed, type 'lightweight'.
Apr  3 06:30:07 malaka kernel: [2089494.831892] mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
Apr  3 06:30:07 malaka kernel: [2089494.831980] mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
Apr  3 06:30:07 malaka kernel: [2089494.832073] mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
Apr  3 06:30:07 malaka kernel: [2089494.832166] mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
Apr  3 06:30:07 malaka kernel: [2089494.832257] mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
Apr  3 06:30:07 malaka kernel: [2089494.832350] mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
Apr  3 06:30:07 malaka kernel: [2089494.832443] mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
Apr  3 06:30:07 malaka kernel: [2089494.832606] sd 6:0:0:0: [sda] Unhandled sense code
Apr  3 06:30:07 malaka kernel: [2089494.832612] sd 6:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr  3 06:30:07 malaka kernel: [2089494.832616] sd 6:0:0:0: [sda] Sense Key : Medium Error [current] 
Apr  3 06:30:07 malaka kernel: [2089494.832620] sd 6:0:0:0: [sda] Add. Sense: Record not found
Apr  3 06:30:07 malaka kernel: [2089494.832624] sd 6:0:0:0: [sda] CDB: Write(10): 2a 00 1f a5 09 c8 00 00 08 00
Apr  3 06:30:07 malaka kernel: [2089494.832631] end_request: I/O error, dev sda, sector 530909640
Apr  3 06:30:07 malaka kernel: [2089494.836597] raid10: Disk failure on sda3, disabling device.
Apr  3 06:30:07 malaka kernel: [2089494.836597] raid10: Operation continuing on 3 devices.
Apr  3 06:30:07 malaka kernel: [2089494.948186] RAID10 conf printout:
Apr  3 06:30:07 malaka kernel: [2089494.948190]  --- wd:3 rd:4
Apr  3 06:30:07 malaka kernel: [2089494.948193]  disk 0, wo:1, o:0, dev:sda3
Apr  3 06:30:07 malaka kernel: [2089494.948195]  disk 1, wo:0, o:1, dev:sdb3
Apr  3 06:30:07 malaka kernel: [2089494.948197]  disk 2, wo:0, o:1, dev:sdc3
Apr  3 06:30:07 malaka kernel: [2089494.948198]  disk 3, wo:0, o:1, dev:sdd3
Apr  3 06:30:08 malaka mdadm[1711]: Fail event detected on md device /dev/md/1, component device /dev/sda3
Apr  3 06:30:08 malaka kernel: [2089495.192144] RAID10 conf printout:
Apr  3 06:30:08 malaka kernel: [2089495.192148]  --- wd:3 rd:4
Apr  3 06:30:08 malaka kernel: [2089495.192152]  disk 1, wo:0, o:1, dev:sdb3
Apr  3 06:30:08 malaka kernel: [2089495.192154]  disk 2, wo:0, o:1, dev:sdc3
Apr  3 06:30:08 malaka kernel: [2089495.192156]  disk 3, wo:0, o:1, dev:sdd3
Apr  3 07:00:01 malaka /USR/SBIN/CRON[31666]: (root) CMD (  [ -x /usr/local/sbin/mf-monitor-df ] && /usr/local/sbin/mf-monitor-df)
Apr  3 07:00:01 malaka /USR/SBIN/CRON[31667]: (root) CMD (if [ -x /usr/local/sbin/freepuppet-origin-pull-and-run ]; then /usr/local/sbin/freepuppet-origin-pull-and-run > /dev/null; fi)
0 malaka:~#

comment:3 Changed 7 years ago by Jamie McClelland

In case we don't have a spare in telehouse, I've placed an order for one drive (next day delivery) and another order for two more drives (3 - 5 day delivery).

jamie

comment:4 Changed 7 years ago by Ross

Owner: set to Jamie McClelland
Status: newassigned

comment:5 Changed 7 years ago by Jamie McClelland

We do have a spare in Telehouse. I transported it to XO and installed it without incident (nice to have actually working hot swappable hard disks).

Joseph ran:

smartctl -t short /dev/sda

And no errors reported.

Then, he ran:

smartctl -t long /dev/sda

We'll check in tomorrow for errors.

Assuming there are no errors, I think we should schedule a re-sync for tomorrow night. I am not going to be available - so I'll need volunteers to baby site it (or we'll have to postpone til thursday night).

comment:6 Changed 7 years ago by Jamie McClelland

I've started the sync. It claims it will finish at 4:00 pm tomorrow. However, not sure how tonight's backup process or tomorrow traffic will affect the results.

If we start having performance problems on guests on malaka, we can reduce the rate of the RAID sync with:

echo 10 > /proc/sys/dev/raid/speed_limit_min

That will allow the sync to go down to a minimum of 10K/second (default is 1000K/second) which should hopefully have a negligible affect on disk i/o.

comment:7 Changed 7 years ago by Jamie McClelland

Resolution: fixed
Status: assignedclosed

Done... sync'ing is complete.

Please login to add comments to this ticket.

Note: See TracTickets for help on using tickets.