Opened 3 years ago

Closed 18 months ago

Last modified 14 months ago

#3772 closed Bug/Something is broken (fixed)

kvm host crash brings down the network

Reported by: https://id.mayfirst.org/jamie Owned by: https://id.mayfirst.org/jamie
Priority: Urgent Component: Tech
Keywords: kvm crash networking outage Cc:
Sensitive: no

Description

We now have two instances in which a kvm host crash has brought down the entire Telehouse network.

There are two questions: why did the server crash, and why did it bring down the network when it crashed? The second is the more pressing (although it may be related to the first).

Given the impact this has on our organization, I think we need to prioritize this ticket. So far it has happened twice, both times during off hours when I've been available to deal with it. It could happen at a far worse time.

Thoughts on how to proceed?

jamie

Change History (30)

comment:1 Changed 3 years ago by https://id.mayfirst.org/jamie

I suspect that the network problem making all the servers unreachable is happening at the ethernet level, because it seems to affect every server below our router, across many different IP subnets.

Our KVM scripts that control the virtual servers are supervised by runit, which will keep trying to bring a virtual server back up whenever it detects that the guest has crashed. I wonder if something is triggering that process in a way that floods the network with MAC address broadcasts. I'm not sure how that would continue if the host kernel has crashed, though.
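
To make the restart behavior concrete, here is a minimal sketch of what a runit run script for one of these guests might look like. The guest name, paths, and kvm options are hypothetical, not our actual configuration; the point is that runsv re-executes this script every time it exits:

#!/bin/sh
# Hypothetical runit run script for a KVM guest named "example".
# runsv re-runs this script whenever the kvm process exits, which
# is what keeps restarting a crashed guest.
exec 2>&1
exec kvm -name example \
    -m 1024 \
    -drive file=/home/example/vms/example/disk.img,if=virtio \
    -net nic,model=virtio -net tap \
    -nographic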

jamie

comment:2 Changed 3 years ago by https://id.mayfirst.org/jamie

I did a directory listing of the console logs on ken to see whether any of the virtual servers had console messages written just before the crash (which might indicate that a virtual server experiencing problems caused the host to crash). That does not seem to be the case. I'm excluding logs that started after 2011-01-04_09 (9:00 am today), because those were generated after the reboot:

0 ken:~# for foo in $(ls /home); do log=$(ls /home/$foo/vms/$foo/console.* | grep -v "console.2011-01-04_09"|tail -n 1); ls -l "$log"; done 
-rw-r--r-- 1 bataille bataille 2362 Dec  7 14:29 /home/bataille/vms/bataille/console.2010-12-07_14.30.07-0500
-rw-r--r-- 1 brown brown 32284 Sep 16 20:37 /home/brown/vms/brown/console.2010-09-16_20.37.59-0400
-rw-r--r-- 1 debs debs 741582 Nov  9 17:03 /home/debs/vms/debs/console.2010-11-09_17.03.06-0500
-rw-r--r-- 1 dorothy dorothy 86 Jun  4  2010 /home/dorothy/vms/dorothy/console.2010-06-04_13.05.03-0400
-rw-r--r-- 1 douglass douglass 19951 Jul 30 14:07 /home/douglass/vms/douglass/console.2010-07-30_17.11.19-0400
-rw-r--r-- 1 fuller fuller 140608 Dec  7 12:56 /home/fuller/vms/fuller/console.2010-12-07_12.56.43-0500
-rw-r--r-- 1 ignatz ignatz 71115 Jul 30 13:28 /home/ignatz/vms/ignatz/console.2010-07-30_13.28.01-0400
-rw-r--r-- 1 lucius lucius 47082 Dec 26 12:11 /home/lucius/vms/lucius/console.2010-12-26_12.11.18-0500
-rw-r--r-- 1 makhno makhno 19988 Dec 16 15:27 /home/makhno/vms/makhno/console.2010-12-16_15.27.12-0500
-rw-r--r-- 1 mandela mandela 26531 Nov 30 13:25 /home/mandela/vms/mandela/console.2010-11-30_13.25.36-0500
-rw-r--r-- 1 menchu menchu 33608 Aug 22 00:19 /home/menchu/vms/menchu/console.2010-08-22_00.19.21-0400
-rw-r--r-- 1 mirabal mirabal 5004 Sep  7 02:59 /home/mirabal/vms/mirabal/console.2010-09-07_02.59.47-0400
-rw-r--r-- 1 peltier peltier 345398 Nov 19 08:23 /home/peltier/vms/peltier/console.2010-11-19_08.33.10-0500
0 ken:~#
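
For readability, the same loop unwrapped (this is exactly the command above, just reformatted):

for foo in $(ls /home); do
    # newest console log for each guest, skipping any log created
    # after this morning's reboot (2011-01-04 09:00)
    log=$(ls /home/$foo/vms/$foo/console.* |
          grep -v "console.2011-01-04_09" |
          tail -n 1)
    ls -l "$log"
done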

comment:3 Changed 3 years ago by https://id.mayfirst.org/jamie

Investigating ken's syslog shows nothing unusual:

Jan  4 04:05:01 ken /USR/SBIN/CRON[18636]: (root) CMD (if [ -x /etc/munin/plugins/apt_all]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Jan  4 04:05:01 ken /USR/SBIN/CRON[18637]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jan  4 04:09:01 ken /USR/SBIN/CRON[18889]: (root) CMD (if [ -x /usr/sbin/backupninja ]; then /usr/sbin/backupninja; fi)
Jan  4 04:10:01 ken /USR/SBIN/CRON[18985]: (root) CMD (if [ -x /etc/munin/plugins/apt_all]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Jan  4 04:15:01 ken /USR/SBIN/CRON[19243]: (root) CMD (if [ -x /etc/munin/plugins/apt_all]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Jan  4 04:15:01 ken /USR/SBIN/CRON[19244]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jan  4 04:17:01 ken /USR/SBIN/CRON[19466]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jan  4 04:20:01 ken /USR/SBIN/CRON[19511]: (root) CMD (if [ -x /etc/munin/plugins/apt_all]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Jan  4 09:02:33 ken kernel: imklog 4.6.4, log source = /proc/kmsg started.

comment:4 Changed 3 years ago by https://id.mayfirst.org/jamie

Having trouble finding the console log for ken.

I'm looking in: robideau:/var/lib/cereal/sessions/ken/log/main. There's one file called current. It has the following (toward the end):

2010-12-07_17:54:48.98052 
2010-12-07_17:54:48.98054 cereal: user 'ken-console' on /dev/pts/0 detached from session.
2011-01-04_13:55:17.61105 
2011-01-04_13:55:17.63366 cereal: user 'ken-console' attaching to session from /dev/pts/2...
2011-01-04_15:00:54.66907 
2011-01-04_15:00:54.66909 cereal: user 'ken-console' on /dev/pts/2 detached from session.

I'm not sure where the log of what was happening on the console is located. When robideau finally came back up, I logged in and saw a screen full of messages, including the only one I copied into our IRC channel:

Kernel panic - not syncing: Attempted to kill the idle task!
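
In case it helps next time: assuming the panic output made it into the logged session data (and not only the attached terminal), the cereal session logs on robideau can be searched directly, e.g.:

# search all logged console sessions for panic messages
grep -r "Kernel panic" /var/lib/cereal/sessions/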

comment:5 Changed 3 years ago by https://id.mayfirst.org/dkg

  • Keywords kvm crash networking outage added

jamie and I will be discussing this in detail at 10am on Tuesday 2011-01-11 on the #mayfirst IRC channel

comment:6 Changed 3 years ago by https://id.mayfirst.org/dkg

We currently suspect that the bolivar crash stems from this upstream kernel bug. During the crash, the console log shows:

[ 5033.943024] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/stat
2010-11-07_22:19:26.69561 [ 5033.952652] CPU 2 
2010-11-07_22:19:26.69562 [ 5033.954669] Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables x_tables btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs exportfs reiserfs ext4 jbd2 crc16 ext2 pl2303 usbserial tun bridge stp kvm_intel kvm loop snd_pcsp snd_pcm snd_timer snd soundcore snd_page_alloc evdev dcdbas power_meter serio_raw psmouse processor ext3 jbd mbcache sha256_generic cryptd aes_x86_64 aes_generic cbc dm_crypt dm_mod raid1 md_mod sd_mod crc_t10dif ide_pci_generic ide_core ata_generic uhci_hcd ata_piix libata scsi_mod ehci_hcd usbcore nls_base button bnx2 thermal fan thermal_sys [last unloaded: tun]
2010-11-07_22:19:26.69563 [ 5034.012197] Pid: 0, comm: swapper Not tainted 2.6.32-3-amd64 #1 PowerEdge R410
2010-11-07_22:19:26.69564 [ 5034.019400] RIP: 0010:[<ffffffff81044e8e>]  [<ffffffff81044e8e>] find_busiest_group+0x412/0x875
2010-11-07_22:19:26.69565 [ 5034.028096] RSP: 0018:ffff88083e487cb8  EFLAGS: 00010056
2010-11-07_22:19:26.69567 [ 5034.033394] RAX: 0000000000000000 RBX: ffffffffffffffff RCX: ffffffff8103a701
2010-11-07_22:19:26.69568 [ 5034.040511] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000200
2010-11-07_22:19:26.69569 [ 5034.047628] RBP: ffff88044e42f9f0 R08: 0000000000000000 R09: ffff88044e42fb00
2010-11-07_22:19:26.69570 [ 5034.054744] R10: 0000000000000000 R11: ffffffff813b871e R12: 00000000000155c0
2010-11-07_22:19:26.69571 [ 5034.061860] R13: 0000000000000000 R14: 0000000000000001 R15: ffff88044e42fab0
2010-11-07_22:19:26.69572 [ 5034.068978] FS:  0000000000000000(0000) GS:ffff88044e420000(0000) knlGS:0000000000000000
2010-11-07_22:19:26.69572 [ 5034.077047] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
2010-11-07_22:19:26.69573 [ 5034.082779] CR2: 00007f28555e1000 CR3: 0000000001001000 CR4: 00000000000026e0
2010-11-07_22:19:26.69575 [ 5034.089894] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2010-11-07_22:19:26.69575 [ 5034.097010] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2010-11-07_22:19:26.69576 [ 5034.104126] Process swapper (pid: 0, threadinfo ffff88083e486000, task ffff88083e4654c0)
2010-11-07_22:19:26.69577 [ 5034.112195] Stack:
2010-11-07_22:19:26.69578 [ 5034.114199]  00000000000155c8 00000000000155c0 0000000000000008 00000000000155c0
2010-11-07_22:19:26.69579 [ 5034.121435] <0> 00000000000155c0 00000000000155c0 000000000003c047 ffffffff81394725
2010-11-07_22:19:26.69580 [ 5034.129166] <0> ffff88083ca9b1a8 ffffffff8103fea0 ffff88044e42f8a0 ffff88083e487eec
2010-11-07_22:19:26.69584 [ 5034.137044] Call Trace:
2010-11-07_22:19:26.69584 [ 5034.139486]  [<ffffffff8103fea0>] ? update_curr+0xa6/0x147
2010-11-07_22:19:26.69585 [ 5034.144962]  [<ffffffff812ed8d5>] ? schedule+0x2bd/0x7cb
2010-11-07_22:19:26.69586 [ 5034.150264]  [<ffffffff8106e8fb>] ? clockevents_notify+0x31/0x115
2010-11-07_22:19:26.69587 [ 5034.156348]  [<ffffffff8100fec6>] ? cpu_idle+0xd8/0xda
2010-11-07_22:19:26.69587 [ 5034.161471] Code: 74 10 48 8b 84 24 a0 01 00 00 c7 00 00 00 00 00 eb 5a 41 8b 77 08 48 8b 84 24 38 01 00 00 31 d2 49 c1 e5 0a 49 29 de 48 c1 e0 0a <48> f7 f6 31 d2 48 89 84 24 30 01 00 00 41 8b 77 08 4c 89 e8 48 
2010-11-07_22:19:26.69588 [ 5034.180997] RIP  [<ffffffff81044e8e>] find_busiest_group+0x412/0x875
2010-11-07_22:19:26.69589 [ 5034.187347]  RSP <ffff88083e487cb8>
2010-11-07_22:19:26.69590 [ 5034.191216] ---[ end trace 514a38ba9201cbba ]---
2010-11-07_22:19:26.69590 [ 5034.191221] divide error: 0000 [#2] SMP 
2010-11-07_22:19:26.69591 [ 5034.191225] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/stat
2010-11-07_22:19:26.69593 [ 5034.191227] CPU 0 
2010-11-07_22:19:26.69593 [ 5034.191229] Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables x_tables btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs exportfs reiserfs ext4 jbd2 crc16 ext2 pl2303 usbserial tun bridge stp kvm_intel kvm loop snd_pcsp snd_pcm snd_timer snd soundcore snd_page_alloc evdev dcdbas power_meter serio_raw psmouse processor ext3 jbd mbcache sha256_generic cryptd aes_x86_64 aes_generic cbc dm_crypt dm_mod raid1 md_mod sd_mod crc_t10dif ide_pci_generic ide_core ata_generic uhci_hcd ata_piix libata scsi_mod ehci_hcd usbcore nls_base button bnx2 thermal fan thermal_sys [last unloaded: tun]
2010-11-07_22:19:26.69595 [ 5034.191270] Pid: 21200, comm: kvm Tainted: G      D    2.6.32-3-amd64 #1 PowerEdge R410
2010-11-07_22:19:26.69595 [ 5034.191273] RIP: 0010:[<ffffffff81044e8e>]  [<ffffffff81044e8e>] find_busiest_group+0x412/0x875
2010-11-07_22:19:26.69596 [ 5034.191280] RSP: 0018:ffff8802abe4fa68  EFLAGS: 00010046
2010-11-07_22:19:26.69597 [ 5034.191283] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff8103a700
2010-11-07_22:19:26.69598 [ 5034.191285] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000200
2010-11-07_22:19:26.69599 [ 5034.191288] RBP: ffff88044e40f9f0 R08: 0000000000000000 R09: ffff88044e42fb00
2010-11-07_22:19:26.69600 [ 5034.191290] R10: 0000000000000000 R11: ffffffffa00dbf1d R12: 00000000000155c0
2010-11-07_22:19:26.69600 [ 5034.191293] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88044e42fab0
2010-11-07_22:19:26.69601 [ 5034.191296] FS:  00007ff688457910(0000) GS:ffff88044e400000(0000) knlGS:0000000000000000
2010-11-07_22:19:26.69602 [ 5034.191299] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
2010-11-07_22:19:26.69603 [ 5034.191301] CR2: 00000000b7743800 CR3: 00000003494af000 CR4: 00000000000026e0
2010-11-07_22:19:26.69603 [ 5034.191304] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2010-11-07_22:19:26.69604 [ 5034.191306] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2010-11-07_22:19:26.69605 [ 5034.191309] Process kvm (pid: 21200, threadinfo ffff8802abe4e000, task ffff880421b83170)
2010-11-07_22:19:26.69606 [ 5034.191311] Stack:
2010-11-07_22:19:26.69606 [ 5034.191312]  00000000000155c8 00000000000155c0 0000000000000008 00000000000155c0
2010-11-07_22:19:26.69607 [ 5034.191315] <0> 00000000000155c0 00000000000155c0 00400d4d86bd6c3c ffffffff81024d72
2010-11-07_22:19:26.69608 [ 5034.191319] <0> ffffffff813b871e ffffffff8106fa87 ffff88044e40f8a0 ffff8802abe4fc9c
2010-11-07_22:19:26.69609 [ 5034.191322] Call Trace:
2010-11-07_22:19:26.69610 [ 5034.191329]  [<ffffffff81024d72>] ? lapic_next_event+0x18/0x1d
2010-11-07_22:19:26.69611 [ 5034.191335]  [<ffffffff8106fa87>] ? tick_dev_program_event+0x2d/0x95
2010-11-07_22:19:26.69611 [ 5034.191340]  [<ffffffff812ed8d5>] ? schedule+0x2bd/0x7cb
2010-11-07_22:19:26.69612 [ 5034.191347]  [<ffffffffa00da103>] ? __vmx_load_host_state+0xb0/0x166 [kvm_intel]
2010-11-07_22:19:26.69614 [ 5034.191362]  [<ffffffffa0267821>] ? kvm_vcpu_block+0x94/0xb7 [kvm]
2010-11-07_22:19:26.69615 [ 5034.191366]  [<ffffffff81064a56>] ? autoremove_wake_function+0x0/0x2e
2010-11-07_22:19:26.69616 [ 5034.191380]  [<ffffffffa0271e8e>] ? kvm_arch_vcpu_ioctl_run+0x80b/0xa44 [kvm]
2010-11-07_22:19:26.69616 [ 5034.191386]  [<ffffffff8103a557>] ? activate_task+0x20/0x26
2010-11-07_22:19:26.69617 [ 5034.191390]  [<ffffffff81071520>] ? wake_futex+0x31/0x4e
2010-11-07_22:19:26.69618 [ 5034.191394]  [<ffffffff8103a17f>] ? sched_slice+0x74/0x92
2010-11-07_22:19:26.69619 [ 5034.191402]  [<ffffffffa02649d1>] ? kvm_vcpu_ioctl+0xf1/0x4e6 [kvm]
2010-11-07_22:19:26.69619 [ 5034.191406]  [<ffffffff8104b422>] ? wake_up_new_task+0xda/0xe4
2010-11-07_22:19:26.69620 [ 5034.191412]  [<ffffffff810f8fa2>] ? vfs_ioctl+0x21/0x6c
2010-11-07_22:19:26.69621 [ 5034.191415]  [<ffffffff810f94f0>] ? do_vfs_ioctl+0x48d/0x4cb
2010-11-07_22:19:26.69622 [ 5034.191419]  [<ffffffff810738e9>] ? sys_futex+0x113/0x131
2010-11-07_22:19:26.69623 [ 5034.191425]  [<ffffffff8110fe61>] ? block_llseek+0x75/0x81
2010-11-07_22:19:27.46695 [ 5034.191428]  [<ffffffff810f957f>] ? sys_ioctl+0x51/0x70
2010-11-07_22:19:27.46696 [ 5034.191432]  [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
2010-11-07_22:19:27.46697 [ 5034.191434] Code: 74 10 48 8b 84 24 a0 01 00 00 c7 00 00 00 00 00 eb 5a 41 8b 77 08 48 8b 84 24 38 01 00 00 31 d2 49 c1 e5 0a 49 29 de 48 c1 e0 0a <48> f7 f6 31 d2 48 89 84 24 30 01 00 00 41 8b 77 08 4c 89 e8 48 
2010-11-07_22:19:27.46698 [ 5034.191455] RIP  [<ffffffff81044e8e>] find_busiest_group+0x412/0x875
2010-11-07_22:19:27.46699 [ 5034.191459]  RSP <ffff8802abe4fa68>
2010-11-07_22:19:27.46700 [ 5034.191461] ---[ end trace 514a38ba9201cbbb ]---
2010-11-07_22:19:27.46700 [ 5034.534867] Kernel panic - not syncing: Attempted to kill the idle task!
2010-11-07_22:19:27.46701 [ 5034.541551] Pid: 0, comm: swapper Tainted: G      D    2.6.32-3-amd64 #1
2010-11-07_22:19:27.46702 [ 5034.548235] Call Trace:
2010-11-07_22:19:27.46703 [ 5034.550674]  [<ffffffff812ed349>] ? panic+0x86/0x141
2010-11-07_22:19:27.46704 [ 5034.555625]  [<ffffffff812ed452>] ? printk+0x4e/0x5c
2010-11-07_22:19:27.46704 [ 5034.560580]  [<ffffffff81050dff>] ? do_exit+0x72/0x6b5
2010-11-07_22:19:27.46705 [ 5034.565706]  [<ffffffff8104e219>] ? release_console_sem+0x17e/0x1af
2010-11-07_22:19:27.46706 [ 5034.571959]  [<ffffffff81014a82>] ? oops_end+0xaf/0xb4
2010-11-07_22:19:27.46708 [ 5034.577086]  [<ffffffff81012ab5>] ? do_divide_error+0x85/0x8f
2010-11-07_22:19:27.46709 [ 5034.582818]  [<ffffffff81044e8e>] ? find_busiest_group+0x412/0x875
2010-11-07_22:19:27.46710 [ 5034.588984]  [<ffffffff8106807c>] ? up+0xe/0x36
2010-11-07_22:19:27.46710 [ 5034.593504]  [<ffffffff8104e219>] ? release_console_sem+0x17e/0x1af
2010-11-07_22:19:27.46711 [ 5034.599756]  [<ffffffff810170b3>] ? sched_clock+0x5/0x8
2010-11-07_22:19:27.46712 [ 5034.604967]  [<ffffffff810118db>] ? divide_error+0x1b/0x20
2010-11-07_22:19:27.46713 [ 5034.610440]  [<ffffffff8103a701>] ? calc_global_load+0x14/0x95
2010-11-07_22:19:27.46713 [ 5034.616258]  [<ffffffff81044e8e>] ? find_busiest_group+0x412/0x875
2010-11-07_22:19:27.46714 [ 5034.622424]  [<ffffffff81044e2c>] ? find_busiest_group+0x3b0/0x875
2010-11-07_22:19:27.46715 [ 5034.628589]  [<ffffffff8103fea0>] ? update_curr+0xa6/0x147
2010-11-07_22:19:27.46716 [ 5034.634061]  [<ffffffff812ed8d5>] ? schedule+0x2bd/0x7cb
2010-11-07_22:19:27.46716 [ 5034.639360]  [<ffffffff8106e8fb>] ? clockevents_notify+0x31/0x115
2010-11-07_22:19:27.46717 [ 5034.645438]  [<ffffffff8100fec6>] ? cpu_idle+0xd8/0xda

comment:7 Changed 3 years ago by https://id.mayfirst.org/dkg

Debian bug 592497 describes other cases where crashing systems with Broadcom NetXtreme II NICs have taken out the switch they are connected to.

comment:8 Changed 3 years ago by https://id.mayfirst.org/jamie

We have sittingbull (our backup server) in our Sunset Park office.

The servers that crashed are running Broadcom Corporation NetXtreme II BCM5716 NICs. sittingbull is running a Broadcom Corporation NetXtreme II BCM5708. Slightly different, but close enough that we might be able to run tests on sittingbull.

We're considering forcing sittingbull to crash while we monitor its NIC output, to see whether any kernel crash will cause a Broadcom NIC to spew ethernet garbage that could shut down a switch.
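
For the record, one common way to force an immediate crash for a test like this is the kernel's magic SysRq trigger (assuming sysrq is enabled; this panics the machine instantly, so only on a box we're prepared to lose):

# enable the magic SysRq interface if it isn't already
echo 1 > /proc/sys/kernel/sysrq
# deliberately crash the kernel (the machine goes down immediately!)
echo c > /proc/sysrq-trigger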

jamie

comment:9 Changed 3 years ago by https://id.mayfirst.org/jamie

FYI: running updates on bolivar, ken, negri, and clr, which pull in a new version of firmware-bnx2 (0.28).
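
To check what each host actually ends up with after the upgrade, something like the following should work (eth0 is a placeholder for whichever interface uses bnx2):

# installed package version
dpkg -l firmware-bnx2
# driver and firmware version the NIC is actually running
ethtool -i eth0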

jamie

comment:10 Changed 3 years ago by https://id.mayfirst.org/jamie

As evidenced by today's bolivar crash, this is still happening.

Last April clr went down, but did not take down the entire bandcon cabinet with it.

I'm not sure this is a silver bullet, but I think we should replace the switch currently in place at Telehouse with the same model we just bought at bandcon.

jamie

comment:11 Changed 3 years ago by https://id.mayfirst.org/jamie

See #4343.

comment:12 Changed 3 years ago by https://id.mayfirst.org/jamie

I just purchased another HP ProCurve Switch 2824 (Gigabit Ethernet, J4903A).

In #4342, dkg suggests replacing the Broadcom NICs, which is another angle on solving this problem.

jamie

comment:13 Changed 3 years ago by https://id.mayfirst.org/jamie

See #4423.

comment:14 Changed 3 years ago by https://id.mayfirst.org/jamie

During the crash in #4343, I had the Telehouse technician move us to our backup, unmanaged switch.

The crash in #4423 did not take down the entire network (I couldn't discern a pattern, but console.mayfirst.org was still accessible, so I was able to recover without needing a Telehouse technician to reboot anything). So I think the theory that a better switch will fix this particular issue holds up.

Fortunately, our new switch has arrived, so I plan to go in early this afternoon to configure it.

jamie

comment:15 Changed 3 years ago by https://id.mayfirst.org/jamie

  • Resolution set to fixed
  • Status changed from new to closed

We've set up the new switch (#4427). I'm optimistically closing this ticket. I think this should be the end of one server bringing down the entire network (or part of it).

jamie

comment:16 Changed 3 years ago by https://id.mayfirst.org/dkg

Just to be clear here, the theory is that:

  • during some flavor of host crash (apparently due to a division-by-zero while under heavy load), certain network interfaces send garbage.
  • some switches, upon receiving that garbage, either fail completely, or forward that garbage to the other connected machinery.
  • replacing the switches in question with (hopefully better) devices will make us less likely to suffer cabinet-wide outages during these failures.

We have not been able to reproduce the specific failures intentionally, or to capture a copy of the emitted garbage.
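
If we get another chance at a live failure, one way to capture the garbage would be to mirror the suspect host's switch port to a capture machine and record raw frames there. A sketch (the interface name is a placeholder):

# capture full frames promiscuously; malformed frames may not carry
# the expected source MAC, so don't filter by host
tcpdump -i eth0 -s 0 -w bnx2-garbage.pcap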

Is this right?

comment:17 Changed 2 years ago by https://id.mayfirst.org/jamie

Yes, dkg, that's correct. Though since we haven't had a similar failure since replacing the switch, we can't be sure the theory is correct.

However, the basis for the theory is that:

  • the clr crash at xo was from a similar cause (purely theoretical, since we have no console log of that crash)
  • the clr crash did not take down the entire network
  • the xo network had a higher-quality switch installed

Very circumstantial... but until we have evidence to the contrary, I think it's the best theory we have.

jamie

comment:18 Changed 2 years ago by https://id.mayfirst.org/jamie

comment:19 Changed 2 years ago by https://id.mayfirst.org/dkg

  • Resolution fixed deleted
  • Status changed from closed to assigned

We're using DebianPackage:firmware-bnx2 version 0.28, which contains bnx2 proprietary firmware blobs up to version 6.0.17, but we're using the stock Debian drivers, not the Broadcom-issued drivers noted in the link above.

The link suggests that the problem is fixed with proprietary firmware 6.2.1 or 6.4.4. Firmware 6.2.1 (along with all earlier firmware) appears to be available in DebianPackage:firmware-bnx2 version 0.33. It's not clear to me whether upgrading to that package while keeping the squeeze kernel would result in the newer firmware actually being used; I suspect it would not, since I don't think that firmware existed at the time of the kernel's release.

Sven Ulland's post also suggests that we could avoid this sort of flood by disabling transmission of ethernet PAUSE frames, like so:

ethtool --pause eth0 tx off

I don't know what the specific consequences would be in daily operation, but after reading up on Ethernet flow control, it doesn't seem like a bad option.

I propose we try adding this command to the network initialization scripts on one machine that has bnx2 devices (maybe ken?) and monitor it closely for network weirdness. If it doesn't seem problematic after a few weeks, we could add it to all machines that are saddled with the bnx2.

If we decide to make it permanent for all bnx2 devices, we could do that with a script in /etc/network/if-up.d/no-pause-bnx2 like this:

#!/bin/sh
# Disable transmission of ethernet PAUSE frames, but only on
# interfaces driven by bnx2; on anything else this is a no-op.
driver="$(basename "$(readlink -f "/sys/class/net/$IFACE/device/driver")")"
if [ "bnx2" = "$driver" ]; then
  ethtool --pause "$IFACE" tx off
fi

comment:20 Changed 2 years ago by https://id.mayfirst.org/jamie

Thanks dkg for the research.

I think running the command to disable pause frames sounds like a good idea.

Just to be clear on the testing order of operations:

  • Execute the command as root on ken:
    ethtool --pause eth0 tx off
    
  • Wait a few weeks to see if we have any problems
  • Manually execute it on all servers with Broadcom NICs
  • Add your proposed script to /etc/network/if-up.d/no-pause-bnx2 on all servers with Broadcom NICs

I'm proposing that we test before adding it to the network initialization scripts, so that if it causes a server crash, the command won't be re-run automatically when we bring the server back up.

Sound reasonable?

jamie

comment:21 Changed 2 years ago by https://id.mayfirst.org/dkg

Yep, that sounds reasonable to me. I appreciate your caution, jamie :)

comment:22 Changed 2 years ago by https://id.mayfirst.org/dkg

FWIW, we could add that script to *all* servers, not just those with Broadcom NICs. It is designed to have an effect only on devices using the bnx2 driver, so it will be a no-op on a Broadcom-free server.
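
The detection logic can be checked by hand on any host; this prints the driver name behind an interface (eth0 as an example):

basename "$(readlink -f /sys/class/net/eth0/device/driver)"
# prints "bnx2" on the affected hosts, something else elsewhere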

comment:23 Changed 2 years ago by https://id.mayfirst.org/jamie

Ok. Adding it to all servers would at least simplify our puppet configuration. Perhaps all physical servers would be a good compromise between keeping the puppet scripts simple and avoiding scripts lying around that don't seem to make any sense (e.g. on a virtual server).
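
A sketch of how a deployment script might pick out physical machines, assuming facter is available (as it is on our puppet-managed hosts); "facter virtual" reports "physical" on bare metal:

#!/bin/sh
# hypothetical deploy snippet: only physical machines get the script
if [ "$(facter virtual)" = "physical" ]; then
    install -m 755 no-pause-bnx2 /etc/network/if-up.d/no-pause-bnx2
fi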

I'm going to be on vacation the 2nd week of November, so...

I'll plan to run the command this Saturday morning; that gives us two days over the weekend for something to go wrong on ken, plus 7 full days before I go on vacation.

Then I'd suggest we wait until mid-November to run it on all the machines and add it to the network scripts.

jamie

comment:24 Changed 2 years ago by https://id.mayfirst.org/dkg

Just wanted to note here that Leszek Urbanski's blog post (linked from Sven Ulland's mail to the linux-poweredge list) contains some thoughtful and detailed explanations for anyone interested in learning more about what appears to be going on.

comment:25 Changed 2 years ago by https://id.mayfirst.org/dkg

What's the status on this rollout?

comment:26 Changed 18 months ago by https://id.mayfirst.org/dkg

What's going on with this?

comment:27 Changed 18 months ago by https://id.mayfirst.org/jamie

  • Keywords f2f added

I don't remember whether I actually ran that command on ken. I suspect I did not. I'm adding the f2f tag. I'd like to run it manually on ken tomorrow morning and, if there are no immediately noticeable network problems, add it to puppet (for physical servers) so that when puppet gets pushed out it reaches all the servers, and the change takes effect the next time the network restarts. That way we should have plenty of time to find problems (and if we do, we can roll back the puppet changes).

comment:28 Changed 18 months ago by https://id.mayfirst.org/jamie

I'm fairly sure I never ran that command on ken, because ethtool wasn't even installed on ken.

Before making the change, I used ethtool's -a switch to query what was currently set:

0 ken:~# ethtool -a eth0
Pause parameters for eth0:
Autonegotiate:  on
RX:             off
TX:             off

0 ken:~# 

Hm. It seems to be off already. Just to see what would happen, I tried setting it anyway:

0 ken:~# ethtool --pause eth0 tx off
tx unmodified, ignoring
no pause parameters changed, aborting
78 ken:~#

I installed ethtool and ran ethtool -a eth0 on two other hosts that use bnx2 (malaka and bolivar), and got the same results.
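
For what it's worth, the "tx unmodified, ignoring" message just means ethtool declined to change a parameter that already has the requested value. Since pause autonegotiation is on, the off/off values shown are what was negotiated with the switch; if we ever wanted to force the setting regardless of negotiation, it would presumably look like this (untested on our hardware):

# force pause settings instead of autonegotiating them
ethtool -A eth0 autoneg off rx off tx off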

jamie

comment:29 Changed 18 months ago by https://id.mayfirst.org/jamie

  • Resolution set to fixed
  • Status changed from assigned to closed

dkg and I decided to close this ticket. The switch that crashed has been replaced. We can re-open if it happens again.

comment:30 Changed 14 months ago by https://id.mayfirst.org/ross

  • Keywords f2f removed
