Opened 6 years ago

Closed 5 years ago

#7058 closed Bug/Something is broken (fixed)

monitor to ensure that zimmermann ( is in the main sks pool

Reported by:
Owned by:
Priority: Medium
Component: Tech
Keywords: sks nagios systemd
Cc:
Sensitive: no


looking at, there are a lot of lines like:

09:04:51 Reconciliation attempt from <ADDR_INET []:42017> while gossip disabled. Ignoring.

and they go back for at least 5 days, so i don't know what caused the gossip to be disabled :( suggests that zimmermann is 4K keys behind everyone else.

i've just restarted sks on zimmermann to see if that can improve matters.

we should really monitor this better somehow, though.
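One way to watch for this condition is to scan the recon log for the message quoted above. The following is only a minimal sketch: the pattern is copied from the log line in this ticket, while the function names and the warn/crit thresholds are assumptions made for illustration, not part of sks or nagios.

```python
import re

# Pattern copied from the recon log line quoted in this ticket.
GOSSIP_DISABLED = re.compile(r"Reconciliation attempt from .* while gossip disabled")


def count_gossip_disabled(lines):
    """Count log lines reporting a recon attempt ignored because gossip is disabled."""
    return sum(1 for line in lines if GOSSIP_DISABLED.search(line))


def nagios_state(count, warn=1, crit=10):
    """Map a match count to a Nagios-style (exit_code, label).

    The thresholds are hypothetical defaults, not values from this ticket.
    """
    if count >= crit:
        return 2, "CRITICAL"
    if count >= warn:
        return 1, "WARNING"
    return 0, "OK"
```

A wrapper script could read the recon log, call `count_gossip_disabled`, print the label, and exit with the returned code.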

Change History (17)

comment:1 Changed 6 years ago by

  • Owner set to
  • Status changed from new to assigned

comment:2 Changed 6 years ago by

After the restart, zimmermann is now pulling in a bunch of keys from one of its peers. I think this might expose a bug in the recon logic, but i'm not sure. i'm writing it up for sks-devel now.

comment:3 Changed 6 years ago by

  • Keywords nagios added
  • Summary changed from "zimmermann ( is not synchronizing with the rest of the SKS pool" to "monitor to ensure that zimmermann ( is in the main sks pool"

OK, i've written up my analysis and sent it to sks-devel.

I'd like to keep this ticket open for monitoring when zimmermann drops out of the pool. Probably a nagios alert is the way to go.
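A nagios check along these lines could compare the local key count against the counts reported by peers and alert when the gap grows too large. This is a rough sketch only: how the counts are fetched is left out, and the function names, example numbers, and thresholds are all assumptions (the 4000 default simply echoes the "4K keys behind" figure from the description).

```python
def key_lag(local_keys, peer_keys):
    """Return how many keys the local server is behind its best-populated peer."""
    if not peer_keys:
        raise ValueError("need at least one peer key count")
    # Never report a negative lag if the local server is ahead of its peers.
    return max(0, max(peer_keys) - local_keys)


def check_key_lag(local_keys, peer_keys, warn=1000, crit=4000):
    """Return a Nagios-style (exit_code, message); thresholds are hypothetical."""
    lag = key_lag(local_keys, peer_keys)
    if lag >= crit:
        return 2, f"CRITICAL: {lag} keys behind"
    if lag >= warn:
        return 1, f"WARNING: {lag} keys behind"
    return 0, f"OK: {lag} keys behind"
```

For example, `check_key_lag(3000000, [3004000, 3003500])` reports a critical state, since the best peer is 4000 keys ahead (the counts here are made up).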

comment:4 Changed 6 years ago by

It looks like zimmermann has dropped out of the pool again. Alex stated this on irc:

17:42 < alex> hey. i'm hitting more and more cases where keys are on and others, 
but not on did the synchronisation stop or something?
17:49 < alex> it's been 3 keys in about a week
17:50 < alex> twice it worked from, once on both the first i tried after keys.mayfirst

comment:5 Changed 6 years ago by

Thanks for the heads-up. it looks like recon died during a logrotate and failed to restart back on the 6th. I've restarted it and it's re-syncing now.

we really should not be using crappy old sysvinit and logrotate any more. too many problems are directly attributable to this janky infrastructure :(

maybe once i get zimmermann transitioned to wheezy i can look into using alternate startup mechanisms.

comment:6 Changed 6 years ago by

We got the same problem again this morning, with the recon process dying during log rotation:

2013-07-08 06:25:18 Added 2 hash-updates. Caught up to 1373279103.002051
2013-07-08 06:25:18 Raising Sys.Break -- PTree may be corrupted: Sys_error("Bad file descriptor")
2013-07-08 06:25:43 DB closed

I restarted sks, again.

comment:7 Changed 6 years ago by

Sigh. zimmermann was out of the pool when i checked in again today, but this time due to something different. the recon process was hung but still running -- outputting nothing to the logs.

And there were 8 db_archive processes from /etc/cron.daily/sks (from the last 8 days) which were all hung in a futex() syscall.

I had to terminate all 9 of these processes with SIGKILL (sks recon first, then the db_archive processes), and then shut down the sks db process with SIGINT. once all processes were down, i restarted the daemon, and it started syncing again. We'll see how long it takes to come back; its initial report upon recon restarting was:

2013-08-01 01:57:13 12706 hashes recovered from <ADDR_INET []:11371>
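Since the hung recon process was still running but writing nothing, a check on process existence alone would have missed it. A staleness check on the log could catch this case. The sketch below assumes log lines start with a `YYYY-MM-DD HH:MM:SS` timestamp as in the excerpts in this ticket; the function names and the one-hour silence limit are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def last_log_time(lines):
    """Return the timestamp of the last line starting with 'YYYY-MM-DD HH:MM:SS', or None."""
    for line in reversed(lines):
        try:
            return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            # Line does not begin with a parseable timestamp; keep looking.
            continue
    return None


def recon_is_stale(lines, now, max_silence=timedelta(hours=1)):
    """True if the log has no timestamped line, or the newest one is older than max_silence.

    max_silence is a hypothetical default; a quiet but healthy server may need more.
    """
    last = last_log_time(lines)
    return last is None or now - last > max_silence
```

Passing `now` explicitly keeps the function easy to test; a real check would pass `datetime.now()` and the tail of the recon log.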

comment:8 Changed 6 years ago by

I've updated zimmermann to sks version 1.1.4 today, and included my OpenPGP key fingerprint as the contact (this is a new feature in 1.1.4) -- i'm discussing with the pool administrator whether we can get an automatic contact e-mail sent when it drops out of one of the pools.

comment:9 Changed 5 years ago by

Argh. After finally getting the nagios updates working on jojobe, it looks like we need SNI for the check to complete properly.

So closing this ticket is blocked (again) by another ticket: #7770.

comment:10 Changed 5 years ago by

  • Keywords systemd added

I've now modified zimmermann to run with systemd as pid 1. This will give us a chance to experiment with systemd as well as have a service monitored cleanly to be restarted if it dies.

At the moment, though, it is still running under the sysvinit compatibility layer. once we move it to systemd, we can use this setup to try to resolve DebianBug:715360.

comment:11 Changed 5 years ago by

I switched to systemd by installing systemd, dbus, libpam-systemd, and their dependencies. Then i modified /etc/default/grub so that:

GRUB_CMDLINE_LINUX="init=/bin/systemd console=ttyS0,115200n8"

and then rebooted.

I noticed that when booting to systemd like this, runit was not operating.

I set up runit by adding the following file to /etc/systemd/system/runit.service:

Description=A process supervising daemon



once that was in place, i ran:

systemctl enable runit.service
systemctl start runit.service

and it came back up as previously configured.

I've reported this to gerrit as DebianBug:722116


comment:12 Changed 5 years ago by

comment:13 Changed 5 years ago by

  • Resolution set to fixed
  • Status changed from assigned to closed

This monitoring is now working.

comment:14 Changed 5 years ago by

This is now working under systemd with:

==> /etc/systemd/system/sks-db.service <==
[Unit]
Description=SKS database service

[Service]
ExecStart=/usr/sbin/sks -stdoutlog db


==> /etc/systemd/system/sks.service <==
[Unit]
Description=SKS reconciliation service

[Service]
ExecStart=/usr/sbin/sks -stdoutlog recon

we probably want to add in some sort of Restart= parameter to these configurations as well (see systemd.service) but i'm not sure what to choose.

I also commented out all of /etc/logrotate.d/sks, since the logs are now being dumped to stdout under systemd (and presumably the journal, though i'm not sure whether that goes to disk or not).

comment:15 Changed 5 years ago by

jojobe runs wheezy now so we should consider reverting e243d584b5eb83d047b06bd7a1994316f38bc3b3/puppet to make our check use https.

comment:16 Changed 5 years ago by

  • Resolution fixed deleted
  • Status changed from closed to assigned

comment:17 Changed 5 years ago by

  • Resolution set to fixed
  • Status changed from assigned to closed
