Opened 11 years ago

Closed 11 years ago

Last modified 11 years ago

#1038 closed Bug/Something is broken (fixed)

fred is unresponsive

Reported by: Daniel Kahn Gillmor Owned by: Jamie McClelland
Priority: Urgent Component: Tech
Keywords: fred.mayfirst.org xen Cc:
Sensitive: no

Description

I was working on #919, trying to get pontiac and geronimo working better, and it looks like fred has crashed.

On the console, i'm seeing:

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at drivers/xen/core/evtchn.c:481
invalid opcode: 0000 [1] SMP
CPU 1
Modules linked in: sg xt_tcpudp xt_physdev iptable_filter ip_tables x_tables bridge netloop ipv6 button ac battery loop evdev i2c_i801 psmouse i2c_core pcspkr serio_raw serial_core floppy ext3 jbd mbcache sha256 aes dm_crypt dm_mirror dm_snapshot dm_mod raid1 md_mod ide_generic sd_mod generic ata_piix libata scsi_mod piix ide_core ehci_hcd e1000 uhci_hcd fan
Pid: 21, comm: xenwatch Not tainted 2.6.18-6-xen-amd64 #1
RIP: e030:[<ffffffff8036106b>]  [<ffffffff8036106b>] retrigger+0x26/0x3e
RSP: e02b:ffff88000f0e9d88  EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000009700 RCX: ffffffffff578000
RDX: 0000000000000047 RSI: ffff88000f0e9d30 RDI: 000000000000012e
RBP: ffffffff804cdb80 R08: ffff88000f01eb70 R09: ffff88000cb99d00
R10: ffff88000cb99800 R11: ffffffff80361045 R12: 000000000000012e
R13: ffffffff804cdbbc R14: 0000000000000000 R15: 0000000000000008
FS:  00002b946ba416d0(0000) GS:ffffffff804c3080(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process xenwatch (pid: 21, threadinfo ffff88000f0e8000, task ffff88000f0d8080)
Stack:  ffffffff802a06c1  ffff88000cb99d00  ffff88000cb99d00  0000000000000000
 ffff88000f0e9de0  000000000000020b  ffffffff8036dbd6  0000000000000000
 ffffffff8036e04e  ffff88000f0e9ea4
Call Trace:
 [<ffffffff802a06c1>] enable_irq+0x9d/0xbc
 [<ffffffff8036dbd6>] __netif_up+0xc/0x15
 [<ffffffff8036e04e>] netif_map+0x2a6/0x2d8
 [<ffffffff8035c3af>] bus_for_each_dev+0x61/0x6e
 [<ffffffff80366858>] xenwatch_thread+0x0/0x145
 [<ffffffff80366858>] xenwatch_thread+0x0/0x145
 [<ffffffff80368398>] frontend_changed+0x2ba/0x4f9
 [<ffffffff80366858>] xenwatch_thread+0x0/0x145
 [<ffffffff8028f8ad>] keventd_create_kthread+0x0/0x61
 [<ffffffff80365c66>] xenwatch_handle_callback+0x15/0x48
 [<ffffffff80366985>] xenwatch_thread+0x12d/0x145
 [<ffffffff8028fa70>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8028f8ad>] keventd_create_kthread+0x0/0x61
 [<ffffffff80366858>] xenwatch_thread+0x0/0x145
 [<ffffffff8023352b>] kthread+0xd4/0x107
 [<ffffffff8025c86c>] child_rip+0xa/0x12
 [<ffffffff8028f8ad>] keventd_create_kthread+0x0/0x61
 [<ffffffff80233457>] kthread+0x0/0x107
 [<ffffffff8025c862>] child_rip+0x0/0x12
 
 
Code: 0f 0b 68 74 db 41 80 c2 e1 01 f0 0f ab 91 00 08 00 00 b8 01
RIP  [<ffffffff8036106b>] retrigger+0x26/0x3e
 RSP <ffff88000f0e9d88>

ugh. i'm not seeing any responsiveness on the serial console, even with the skinny_elephants_recovery.

I think i'm going to reset fred.

Change History (2)

comment:1 Changed 11 years ago by Daniel Kahn Gillmor

Keywords: xen added
Resolution: fixed
Status: new → closed

I've just hard-reset fred. ugh.

After the first hard reset, i brought up marcos, and then robeson, and immediately after bringing up robeson (with xm create robeson), i got the same error message on the console, and the following message on my ssh session:

Message from syslogd@fred at Tue May 27 23:49:45 2008 ...
fred kernel: invalid opcode: 0000 [1] SMP 

0 fred:~#

after this, the ssh session (and the rest of the machine) became unresponsive.

So i hard reset fred again, and restarted the domUs. I started them in this order:

  • marcos
  • viewsic
  • albizu
  • angela
  • robeson

And they all seemed to be created without a problem.
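For future restarts, a staggered bring-up along these lines might make it easier to see which guest was being created if the dom0 falls over again. This is only a sketch: the `start_domus` function, the `XM` override and the 30-second `DELAY` are illustrative assumptions, not something that was run on fred, and a pause between guests is not known to avoid the crash.

```shell
#!/bin/sh
# Sketch only: bring up domUs one at a time with a pause between them.
# XM defaults to the real xm tool; set XM="echo xm" for a dry run.
XM="${XM:-xm}"
# Seconds to wait between guests (arbitrary value, an assumption).
DELAY="${DELAY:-30}"

start_domus() {
    for domu in "$@"; do
        echo "starting $domu"
        # Stop at the first failure so the failing guest is obvious.
        $XM create "$domu" || { echo "failed on $domu" >&2; return 1; }
        sleep "$DELAY"
    done
}

# Example, using the order from this comment:
# start_domus marcos viewsic albizu angela robeson
```

Running it first with `XM="echo xm" DELAY=0` prints the commands without touching any guest.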

This stinks of an intermittent fault (perhaps a race condition somewhere?), since the failure isn't happening repeatedly.

I'm not re-enabling pontiac or geronimo at the moment, because the failures have been occurring when new xen instances are brought up, and i don't want to push it.

At least this immediate ticket is resolved now, though.

comment:2 Changed 11 years ago by Jamie McClelland

Thanks Daniel for the rescue!

Robeson is running asterisk:

0 robeson:~# ps -eFH | grep asterisk | grep -v grep
asterisk  1451     1  0 54128 10112   0 00:00 ?        00:00:01   /usr/sbin/asterisk -p -U asterisk
0 robeson:~#

The -p means it is running with realtime priority. From the man page:

--- -p If supported by the operating system (and executing as root), attempt to run with realtime priority for increased performance and responsiveness within the Asterisk process, at the expense of other programs running on the same machine. ---

I wonder if there's a problem between whatever -p is doing and xen?
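One way to check whether -p actually took effect would be to look at the scheduling class that ps reports for the asterisk process. The sketch below assumes the procps version of ps (with its `cls` and `rtprio` output columns); the helper names `describe_class` and `report_sched` are made up for illustration.

```shell
#!/bin/sh
# Sketch only: report the scheduling policy of a named process.

# Map the CLS column from procps ps(1) to a readable policy name.
describe_class() {
    case "$1" in
        TS) echo "SCHED_OTHER (normal time-sharing)" ;;
        FF) echo "SCHED_FIFO (realtime)" ;;
        RR) echo "SCHED_RR (realtime)" ;;
        *)  echo "other or unknown ($1)" ;;
    esac
}

# Print pid, policy and realtime priority for each process
# whose command name matches the first argument.
report_sched() {
    ps -C "$1" -o pid=,cls=,rtprio= | while read -r pid cls rtprio; do
        echo "pid $pid: $(describe_class "$cls"), rtprio=$rtprio"
    done
}

# Example, on robeson:
# report_sched asterisk
```

If the class comes back as FF or RR, asterisk really is running under a realtime policy inside the domU. That would at least confirm the premise of the question, though it would not by itself show any interaction with xen.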
