Opened 11 years ago

Closed 7 years ago

#129 closed Bug/Something is broken (fixed)

potential disk failures on viewsic

Reported by: https://id.mayfirst.org/dkg
Owned by: https://id.mayfirst.org/jamie
Priority: Medium
Component: Tech
Keywords:
Cc:
Sensitive: no

Description

Looking through the console logs for viewsic, I found this from 10 days ago:

0 sylvia:~# cereal follow --cat viewsic | grep 3w-xxx
2007-10-05_08:53:09.53574 3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #0.
3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #0.
3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #0.
0 sylvia:~# 

A little worrisome, no? Is viewsic running hardware RAID? Is it being monitored? Was there some other report about this error? What's the MFPL policy for responding to events like this?

Change History (8)

comment:1 Changed 11 years ago by https://id.mayfirst.org/dkg

If this is a 3ware hardware RAID controller, you might be interested in using debian-unofficial's 3ware packages for monitoring, or in downloading the monitoring tools directly from 3ware.

comment:2 Changed 11 years ago by https://id.mayfirst.org/jamie

Thanks, Daniel - this is definitely worrisome. I think I might go to Telehouse tonight instead of coming to the Lair for the upgrades. I will try to get those monitoring tools installed this afternoon to have a look. I'll be tied up at least until 4:30 pm today.

comment:3 Changed 11 years ago by https://id.mayfirst.org/dkg

I'm not sure what advantage a Telehouse visit offers if we don't have diagnostics showing which device needs replacement.

I've never used the 3ware tools myself. I'd be curious to see a writeup about them (though of course I'd prefer to learn more about free software RAID tools that are more widely useful).
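For reference, the standard free tool for Linux software RAID is mdadm, and its monitoring story is pretty simple. A minimal sketch (the array name and mail address here are illustrative, not from viewsic, which is on hardware RAID):

# check the health of a software RAID array
mdadm --detail /dev/md0

# run mdadm in monitor mode so it mails on degraded/failed events;
# on Debian this is normally enabled by setting MAILADDR in
# /etc/mdadm/mdadm.conf and letting the init script start the daemon
mdadm --monitor --scan --daemonise --mail root@mayfirst.org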

comment:4 Changed 11 years ago by https://id.mayfirst.org/jamie

I'm going to try an install of the tools you suggested now.

However, even if I'm not successful, I think it would be wise to go to Telehouse, reboot the machine, enter the RAID BIOS, and run its diagnostic tools (or maybe the RAID BIOS would report a hard drive failure).

We have several spare drives that could be put in.

If it is a failed drive, maybe this is an opportunity to switch to software RAID.

comment:5 Changed 11 years ago by https://id.mayfirst.org/dkg

I understand your wanting to get this cleared up, but running HW RAID BIOS-level tools is going to require serious downtime on the machine, if it needs to scan two 120GB disks.

I really think that we'd do better to try to diagnose without incurring downtime; if we find that downtime is warranted, we should schedule it, since this is not an immediate failure.

It's important that we stay on top of this, but I don't think it's reached a critical level yet. The goal in fixing it is to avoid downtime for MFPL members, right?
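One no-downtime option, if we want it: smartmontools can query SMART data on drives behind a 3ware controller. I haven't tried this on viewsic, so treat it as a sketch; with the 3w-xxxx driver (which the log prefix suggests) the drives are addressed like this, though the device node may be /dev/twe0 or the underlying SCSI device depending on the smartmontools version:

apt-get install smartmontools

# -d 3ware,N selects the physical drive on port N behind the card
smartctl -a -d 3ware,0 /dev/twe0
smartctl -a -d 3ware,1 /dev/twe0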

comment:6 Changed 11 years ago by https://id.mayfirst.org/jamie

OK, it seems as though there is no immediate emergency.

Following the directions here:

http://www.debian-unofficial.org/installation.html

I added the following to /etc/apt/sources.list:

deb http://ftp.debian-unofficial.org/debian sarge restricted

Then I ran:

apt-get update
apt-get install 3ware-cli-binary

Then, I did some reading of:

man tw_cli

Then, as root, I ran:

tw_cli

This put me in interactive mode.

I typed:

//viewsic> show

Ctl   Model        Ports   Drives   Units   NotOpt   RRate   VRate   BBU
------------------------------------------------------------------------
c0    8006-2LP     2       2        1       0        2       -       -      

//viewsic> focus c0

//viewsic/c0> show

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-1    OK             -      -       111.814   ON     -        -        

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     111.81 GB   234493056     S02AJ10Y309171      
p1     OK               u0     111.81 GB   234493056     S02AJ10Y309172      

//viewsic/c0>

This seems to indicate that both drives are operating fine, without any disk failures.
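It also turns out, at least according to the man page, that tw_cli accepts the same commands non-interactively as plain arguments, which is handier for scripting. Note that on the command line the controller is addressed as /c0, without the //viewsic host prefix that appears in the interactive prompt:

tw_cli /c0 show
tw_cli /c0 show drivestatus

The cron job below relies on this.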

The tw_cli man page is really, really long, with tons of features. I read just enough of it to write the following cron job, which will run hourly and alert us if a drive fails:

#!/bin/bash

# Make sure all drives on controller c0 are in OK status.
# Note: on the command line the controller is addressed as /c0;
# the //viewsic prefix only appears at the interactive prompt.
states=$(tw_cli /c0 show drivestatus | grep "^p" | awk '{print $2}')
for state in $states; do
  if [ "$state" != "OK" ]; then
    echo "A drive on viewsic reported state: $state" | mail -s "Viewsic raid failure" root@mayfirst.org
  fi
done
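To hook it into cron hourly, something like this should do (the filename is just what I'd pick, nothing standard):

# run-parts skips filenames containing dots, so keep the name plain
install -m 755 check-3ware-raid /etc/cron.hourly/check-3ware-raid

One caveat: cron's default PATH may not include the directory tw_cli was installed into, so the script may need to call it by full path.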

What other steps do you think we should take to test those drives and the raid?

comment:7 Changed 11 years ago by https://id.mayfirst.org/jamie

Following up on your last message...

I don't fully understand all the options for tw_cli (and tw_schedule), but it seems to allow us to schedule tests for the RAID. Can you make heads or tails of the tw_cli man page on those topics? In any event, I agree that a trip to Telehouse tonight is not needed (so I'm still planning on seeing you at the Lair).
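From my quick read, a one-off verify of the array can apparently be started by hand; I haven't tested this, so consider it a sketch based on the man page:

# start a background verify of unit u0 on controller c0
tw_cli /c0/u0 start verify

# watch progress in the %Cmpl column
tw_cli /c0 show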

comment:8 Changed 7 years ago by https://id.mayfirst.org/jamie

  • Resolution set to fixed
  • Status changed from new to closed

This hardware has been retired.

