Opened 6 years ago

Closed 5 years ago

#7058 closed Bug/Something is broken (fixed)

monitor to ensure that zimmermann (keys.mayfirst.org) is in the main sks pool

Reported by: https://id.mayfirst.org/dkg
Owned by: https://id.mayfirst.org/dkg
Priority: Medium
Component: Tech
Keywords: sks zimmermann.mayfirst.org nagios systemd
Cc:
Sensitive: no

Description

looking at zimmermann.mayfirst.org:/var/log/sks/recon.log, there are a lot of lines like:

09:04:51 Reconciliation attempt from <ADDR_INET [131.155.141.70]:42017> while gossip disabled. Ignoring.

and they go back for at least 5 days, so i don't know what caused the gossip to be disabled :(
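To see how long gossip has been off and which peers are being turned away, the ignored attempts can be counted per peer. This is a minimal Python sketch, assuming the message wording matches the single log line quoted above; the function name `count_gossip_disabled` is made up for illustration and is not part of sks.

```python
import re

# Matches the "gossip disabled" message quoted above. The wording is taken
# from that one log line; other sks versions may phrase it differently.
GOSSIP_DISABLED = re.compile(
    r"Reconciliation attempt from <ADDR_INET \[(?P<peer>[^\]]+)\]:\d+> "
    r"while gossip disabled"
)


def count_gossip_disabled(lines):
    """Return {peer_address: count} for ignored reconciliation attempts."""
    counts = {}
    for line in lines:
        match = GOSSIP_DISABLED.search(line)
        if match:
            peer = match.group("peer")
            counts[peer] = counts.get(peer, 0) + 1
    return counts
```

Passing it an open file handle for /var/log/sks/recon.log works, since the function only iterates over lines.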

https://sks-keyservers.net/status/ suggests that zimmermann is 4K keys behind everyone else.

i've just restarted sks on zimmermann to see if that can improve matters.

we should really monitor this better somehow, though.

Change History (17)

comment:1 Changed 6 years ago by https://id.mayfirst.org/dkg

  • Owner set to https://id.mayfirst.org/dkg
  • Status changed from new to assigned

comment:2 Changed 6 years ago by https://id.mayfirst.org/dkg

After the restart, zimmermann is now pulling in a bunch of keys from one of its peers. I think this might expose a bug in the recon logic, but i'm not sure. i'm writing it up for sks-devel now.

comment:3 Changed 6 years ago by https://id.mayfirst.org/dkg

  • Keywords nagios added
  • Summary changed from "zimmermann (keys.mayfirst.org) is not synchronizing with the rest of the SKS pool" to "monitor to ensure that zimmermann (keys.mayfirst.org) is in the main sks pool"

OK, i've written up my analysis and sent it to sks-devel.

I'd like to keep this ticket open for monitoring when zimmermann drops out of the pool. Probably a nagios alert is the way to go.
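One shape such an alert could take is a check that compares our key count with the best count seen among peers and maps the gap to a Nagios status. The sketch below is only that: the `warn` and `crit` thresholds are hypothetical values, not numbers from this ticket or from sks-keyservers.net, and how the key counts are fetched is deliberately left out. The exit codes are the standard Nagios plugin ones.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def pool_lag_status(local_keys, peer_keys, warn=1000, crit=4000):
    """Classify how far the local key count trails the best-known peer.

    warn and crit are hypothetical thresholds chosen for illustration.
    """
    if not peer_keys:
        return UNKNOWN, "no peer key counts available"
    lag = max(peer_keys) - local_keys
    if lag >= crit:
        return CRITICAL, f"{lag} keys behind the best peer"
    if lag >= warn:
        return WARNING, f"{lag} keys behind the best peer"
    # A server that is ahead of its peers is reported as 0 behind.
    return OK, f"{max(lag, 0)} keys behind the best peer"
```

A wrapper script would print the message and exit with the returned code so that Nagios picks up the state.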

comment:4 Changed 5 years ago by https://id.mayfirst.org/ross

It looks like zimmermann has dropped out of the pool again. Alex stated this on irc:

17:42 < alex> hey. i'm hitting more and more cases where keys are on pgp.mit.edu and others, but not on keys.mayfirst.org. did the synchronisation stop or something?
17:49 < alex> it's been 3 keys in about a week
17:50 < alex> twice it worked from pgp.mit.edu, once on keys.indymedia.org. both the first i tried after keys.mayfirst
Last edited 5 years ago by https://id.mayfirst.org/ross

comment:5 Changed 5 years ago by https://id.mayfirst.org/dkg

Thanks for the heads-up. It looks like recon died during a logrotate back on the 6th and failed to restart. I've restarted it and it's re-syncing now.

we really should not be using crappy old sysvinit and logrotate any more. too many problems are directly attributable to this janky infrastructure :(

maybe once i get zimmermann transitioned to wheezy i can look into using alternate startup mechanisms.

comment:6 Changed 5 years ago by https://id.mayfirst.org/dkg

We got the same problem again this morning, with the recon process dying during log rotation:

2013-07-08 06:25:18 Added 2 hash-updates. Caught up to 1373279103.002051
2013-07-08 06:25:18 Raising Sys.Break -- PTree may be corrupted: Sys_error("Bad file descriptor")
2013-07-08 06:25:43 DB closed

I restarted sks, again.

comment:7 Changed 5 years ago by https://id.mayfirst.org/dkg

Sigh. zimmermann was out of the pool when i checked in again today, but this time due to something different. the recon process was hung but still running -- outputting nothing to the logs.

And there were 8 db_archive processes from /etc/cron.daily/sks (from the last 8 days) which were all hung in a futex() syscall.

I had to terminate all 9 of these processes with SIGKILL (sks recon first, then the db_archive processes), and then shut down the sks db process with SIGINT. once all processes were down, i restarted the daemon, and it started syncing again. We'll see how long it takes to come back; its initial report upon recon restarting was:

2013-08-01 01:57:13 12706 hashes recovered from <ADDR_INET [204.13.164.120]:11371>
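A hung recon that writes nothing is invisible to a "is the process running" check, but it does show up as a stale log. Here is a minimal Python sketch of a freshness test, assuming every log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp as in the excerpts above. The one-hour `max_silence` default is an arbitrary illustrative value, and a healthy but idle recon could also be quiet for a while, so the threshold would need tuning.

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def last_log_time(lines):
    """Return the timestamp of the last parseable line, or None."""
    latest = None
    for line in lines:
        try:
            # The timestamp occupies the first 19 characters of the line.
            latest = datetime.strptime(line[:19], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # not a timestamped line; skip it
    return latest


def recon_is_stale(lines, now, max_silence=timedelta(hours=1)):
    """True if no timestamped line is newer than max_silence before now."""
    latest = last_log_time(lines)
    return latest is None or now - latest > max_silence
```

Passing `now` in explicitly keeps the function easy to test; a real check would pass `datetime.now()`.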

comment:8 Changed 5 years ago by https://id.mayfirst.org/dkg

I've updated zimmermann to sks version 1.1.4 today, and included my OpenPGP key fingerprint as the contact (this is a new feature in 1.1.4) -- i'm discussing with the sks-keyservers.net pool administrator whether we can get an automatic contact e-mail sent when it drops out of one of the pools.

comment:9 Changed 5 years ago by https://id.mayfirst.org/dkg

Argh. After finally getting the nagios updates working on jojobe, it looks like we need SNI for the check to complete properly.

So closing this ticket is blocked (again) by another ticket: #7770.
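For reference, this is roughly what an SNI-aware TLS probe looks like in Python. It is a sketch, not the nagios check we actually run: the function name `fetch_peer_cert` is invented here, and port 443 and the 10 second timeout are assumptions. The important part is passing `server_hostname`, which is what makes the client send the SNI extension so the server can pick the right certificate.

```python
import socket
import ssl


def fetch_peer_cert(host, port=443, timeout=10.0):
    """Connect with SNI and return the server certificate as a dict."""
    # The default context verifies the certificate chain and the hostname.
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # server_hostname puts the SNI extension in the ClientHello.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()
```

If the server presents a certificate for a different name, `wrap_socket` raises `ssl.SSLCertVerificationError`, which a check could map to a critical state.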

comment:10 Changed 5 years ago by https://id.mayfirst.org/dkg

  • Keywords systemd added

I've now modified zimmermann to run with systemd as pid 1. This will give us a chance to experiment with systemd as well as have a service monitored cleanly to be restarted if it dies.

At the moment, though, it is still running under the sysvinit compatibility layer. Once we move it to systemd, we can use this setup to try to resolve DebianBug:715360.

comment:11 Changed 5 years ago by https://id.mayfirst.org/dkg

I switched to systemd by installing systemd, dbus, libpam-systemd, and their dependencies. Then i modified /etc/default/grub so that:

GRUB_CMDLINE_LINUX="init=/bin/systemd console=ttyS0,115200n8"

and then rebooted.

I noticed that when booting to systemd like this, runit was not operating.

I set up runit by creating /etc/systemd/system/runit.service with the following contents:

[Unit]
Description=A process supervising daemon

[Service]
Type=simple
ExecStart=/usr/sbin/runsvdir-start

[Install]
WantedBy=multi-user.target

once that was in place, i ran:

systemctl enable runit.service
systemctl start runit.service

and it came back up as previously configured.

I've reported this to gerrit as DebianBug:722116

Last edited 5 years ago by https://id.mayfirst.org/dkg

comment:12 Changed 5 years ago by https://id.mayfirst.org/dkg

comment:13 Changed 5 years ago by https://id.mayfirst.org/dkg

  • Resolution set to fixed
  • Status changed from assigned to closed

This monitoring is now working.

comment:14 Changed 5 years ago by https://id.mayfirst.org/dkg

This is now working under systemd with:

==> /etc/systemd/system/sks-db.service <==
[Unit]
Description=SKS database service

[Service]
Type=simple
ExecStart=/usr/sbin/sks -stdoutlog db
User=debian-sks

[Install]
WantedBy=multi-user.target

==> /etc/systemd/system/sks.service <==
[Unit]
Description=SKS reconciliation service
BindTo=sks-db.service
After=sks-db.service

[Service]
Type=simple
ExecStart=/usr/sbin/sks -stdoutlog recon
User=debian-sks

[Install]
WantedBy=multi-user.target

we probably want to add in some sort of Restart= parameter to these configurations as well (see systemd.service) but i'm not sure what to choose.
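One possible choice, not something settled in this ticket, is to restart only on unclean exits and to wait a little before doing so. Both directives are documented in systemd.service; the 30 second delay is an arbitrary value for illustration.

```ini
[Service]
# ...existing Type=, ExecStart= and User= lines stay as they are...
# Restart when the process exits with an error, is killed by a signal, or times out.
Restart=on-failure
# Wait 30 seconds before restarting so a crash loop does not hammer the database.
RestartSec=30
```

Note that this only covers processes that actually exit. A recon that hangs while still running, as in comment:7, would not be restarted by this setting.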

I also commented out all of /etc/logrotate.d/sks, since the logs are now being dumped to systemctl's stdout (and presumably the journal, though i'm not sure whether that goes to disk or not).

comment:15 Changed 5 years ago by https://id.mayfirst.org/dkg

jojobe runs wheezy now so we should consider reverting e243d584b5eb83d047b06bd7a1994316f38bc3b3/puppet to make our check use https.

comment:16 Changed 5 years ago by https://id.mayfirst.org/dkg

  • Resolution fixed deleted
  • Status changed from closed to assigned

comment:17 Changed 5 years ago by https://id.mayfirst.org/dkg

  • Resolution set to fixed
  • Status changed from assigned to closed
