Opened 12 years ago

Closed 7 years ago

#129 closed Bug/Something is broken (fixed)

potential disk failures on viewsic

Reported by: Daniel Kahn Gillmor
Owned by: Jamie McClelland
Priority: Medium
Component: Tech
Keywords:
Cc:
Sensitive: no


looking through the console logs for viewsic, i found this from 10 days ago:

0 sylvia:~# cereal follow --cat viewsic | grep 3w-xxx
2007-10-05_08:53:09.53574 3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #0.
3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #0.
3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #0.
0 sylvia:~# 

a little worrisome, no? is viewsic running hardware raid? is it being monitored? Was there some other report about this error? What's the MFPL policy for responding to events like this?
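
For a quick sense of how often this happens and on which port, the warnings can be tallied per port. This is only a rough sketch: `count_repairs` is a hypothetical helper name, and it assumes the log lines keep the exact "Sector repair occurred: Port #N." wording shown above.

```shell
# count_repairs (hypothetical helper): read log lines on stdin and print
# how many "Sector repair occurred" warnings were seen for each port.
count_repairs() {
  awk '/Sector repair occurred: Port #/ {
         port = $0
         sub(/.*Port #/, "", port)   # drop everything up to the port number
         sub(/[^0-9].*/, "", port)   # drop the trailing "." and anything after
         count[port]++
       }
       END { for (p in count) print "port " p ": " count[p] }'
}
```

It could be fed from the same command used above, e.g. `cereal follow --cat viewsic | count_repairs`. With more than one port in the log, the order of the output lines is not guaranteed.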

Change History (8)

comment:1 Changed 12 years ago by Daniel Kahn Gillmor

If this is a 3ware hardware RAID controller, you might be interested in using debian-unofficial's 3ware packages for monitoring, or in downloading the monitoring tools directly from 3ware.

comment:2 Changed 12 years ago by Jamie McClelland

Thanks Daniel - this is definitely worrisome. I think I might go to Telehouse tonight instead of coming to the lair for the upgrades. I will try to get those monitoring tools installed this afternoon to have a look see. I'll be tied up at least until 4:30 pm today.

comment:3 Changed 12 years ago by Daniel Kahn Gillmor

I'm not sure what advantage a telehouse visit will be if we don't have diagnostics on what device needs replacement.

I've never used the 3ware tools myself. i'd be curious to see a writeup about them (though of course i'd prefer to learn more about free SW raid tools that are more widely useful).

comment:4 Changed 12 years ago by Jamie McClelland

I'm going to try an install of the tools you suggested now.

However, even if I'm not successful, I think it would be wise to go to telehouse, reboot the machine, enter the raid bios, and run their diagnostic tools (or maybe the RAID bios would report a hard drive failure).

We have several spare drives that could be put in.

If it is a failed drive, maybe this is an opportunity to switch to software raid.

comment:5 Changed 12 years ago by Daniel Kahn Gillmor

I understand your wanting to get this cleared up, but running HW RAID BIOS-level tools is going to require serious downtime on the machine, if it needs to scan two 120GB disks.

I really think that we'd do better to try to diagnose without incurring downtime; if we find that downtime is warranted, we should schedule it, since this is not an immediate failure.

It's important that we stay on top of this, but i don't think it's reached a critical level yet. The goal in fixing it is to avoid downtime for MFPL members, right?

comment:6 Changed 12 years ago by Jamie McClelland

Ok, seems as though there is no immediate emergency.

Following the directions here:

I added the following to /etc/apt/sources.list:

deb sarge restricted

Then I ran:

apt-get update
apt-get install 3ware-cli-binary

Then, I did some reading of:

man tw_cli

Then, as root, I ran:


Which put me in interactive mode.

I typed:

//viewsic> show

Ctl   Model        Ports   Drives   Units   NotOpt   RRate   VRate   BBU
c0    8006-2LP     2       2        1       0        2       -       -      

//viewsic> focus c0

//viewsic/c0> show

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
u0    RAID-1    OK             -      -       111.814   ON     -        -        

Port   Status           Unit   Size        Blocks        Serial
p0     OK               u0     111.81 GB   234493056     S02AJ10Y309171      
p1     OK               u0     111.81 GB   234493056     S02AJ10Y309172      


This seems to indicate that both drives are operating fine, without any disk failures.

The tw_cli man page is really really long with tons of features. I just read it briefly enough to write the following cron job, which will run hourly to alert us if a drive fails:


#!/bin/sh
# make sure drives are in ok status
states=$(tw_cli //viewsic/c0 show drivestatus | grep "^p" | awk '{print $2}')
for state in $states; do
  if [ "$state" != "OK" ]; then
    echo "Viewsic raid failure" | mail -s "Viewsic raid failure"
  fi
done
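
The status check itself can be tried out without touching the controller by feeding it saved output. A minimal sketch, assuming the port table keeps the layout shown earlier in this comment (port name in column 1, status in column 2); `check_ports` is a hypothetical helper name, not part of tw_cli:

```shell
# check_ports (hypothetical helper): read a tw_cli-style port table on stdin,
# print every port whose status is not OK, and return non-zero if any was found.
check_ports() {
  # Port rows have p0, p1, ... in column 1; the "Port" header row is skipped
  # because it does not match the pattern.
  awk '$1 ~ /^p[0-9]+$/ && $2 != "OK" { print $1 ": " $2; bad = 1 }
       END { exit (bad ? 1 : 0) }'
}
```

In the cron job it would sit at the end of the pipeline, e.g. `tw_cli //viewsic/c0 show drivestatus | check_ports`, with the mail sent only when it returns non-zero. Naming the failing port in the output saves a lookup when the alert arrives.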

What other steps do you think we should take to test those drives and the raid?

comment:7 Changed 12 years ago by Jamie McClelland

Following up on your last message...

I don't fully understand all the options for tw_cli (and tw_schedule) - but it seems like it allows us to schedule tests for the RAID. Can you make heads or tails of the tw_cli man page on those topics? In any event, I agree, a trip to Telehouse tonight is not needed (so I'm still planning on seeing you at the Lair).

comment:8 Changed 7 years ago by Jamie McClelland

Resolution: fixed
Status: new → closed

This hardware has been retired.
