Opened 2 years ago

Closed 21 months ago

#12677 closed Bug/Something is broken (worksforme)

Backups to Iz are having issues

Reported by: https://id.mayfirst.org/jamila Owned by: https://id.mayfirst.org/jaimev
Priority: Medium Component: Tech
Keywords: iz, backups Cc: https://id.mayfirst.org/jamie
Sensitive: no

Description

Palante and our clients have several backups to Iz, each server runs one per night. I monitor them pretty closely, and I've seen a lot of failures lately. The accounts on iz that have definitely had issues are palantetech-sync, ajmuste-sync, and srlp-sync.

The errors I'm getting are usually of the sort:

Fatal: Can't connect to iz.mayfirst.org as palantetech-sync.

But that error only comes up sometimes. The ssh keys are definitely set up correctly, I can run the jobs at other times. Is this a problem of too many connections at the same time? Should I stagger when the backup jobs run more?

Change History (26)

comment:1 Changed 2 years ago by https://id.mayfirst.org/jaimev

  • Cc https://id.mayfirst.org/jamie added
  • Keywords iz backups added
  • Owner set to https://id.mayfirst.org/jaimev
  • Status changed from new to assigned

From the logs I can see the successful connections but am not able to get much information about the unsuccessful connections. I tried searching for any denied login attempts from the same ip numbers that established successful connections but didn't see any evidence of that. However, if it is a networking issue or if the server is saturated in some other way it is possible that the connection requests are never even reaching the ssh server. If you can give us exact date and timestamps that will help us track down what is going on at the sever at those times.

Also copying jamie here for more input.

comment:2 Changed 2 years ago by https://id.mayfirst.org/jamila

Here's one instance, trying to connect as palantetech-sync from bagel.palantetech.coop, IP 50.116.4.203

Apr 21 01:01:31 Info: >>>> starting action /etc/backup.d/91-mfpl.rdiff (because current time matches everyday at 01:00)

Apr 21 01:04:18 Fatal: Can't connect to iz.mayfirst.org as palantetech-sync.

Apr 21 01:04:18 Fatal: <<<< finished action /etc/backup.d/91-mfpl.rdiff: FAILED

When it tried again the next day, it worked.

Apr 22 01:01:55 Info: >>>> starting action /etc/backup.d/91-mfpl.rdiff (because current time matches everyday at 01:00)

Apr 22 01:05:23 Info: Removing backups older than 14D days succeeded.

Apr 22 01:13:03 Info: Successfully finished backing up source bagel

Apr 22 01:13:03 Info: <<<< finished action /etc/backup.d/91-mfpl.rdiff: SUCCESS

Last edited 2 years ago by https://id.mayfirst.org/jamila (previous) (diff)

comment:3 Changed 2 years ago by https://id.mayfirst.org/jaimev

Interesting, I don't see anything from that ip with that timestamp but I do see a series of short connections from palantetech-sync around that time from another ip.

Apr 21 01:03:01 iz sshd[30498]: Accepted publickey for palantetech-sync from 23.92.21.90 port 40015 ssh2: RSA de:52:8a:7a:66:14:69:19:60:f6:94:3b:8c:6f:2a:9e
Apr 21 01:03:01 iz sshd[30502]: Accepted publickey for palantetech-sync from 23.92.21.90 port 40016 ssh2: RSA de:52:8a:7a:66:14:69:19:60:f6:94:3b:8c:6f:2a:9e
Apr 21 01:03:02 iz sshd[30506]: Accepted publickey for palantetech-sync from 23.92.21.90 port 40017 ssh2: RSA de:52:8a:7a:66:14:69:19:60:f6:94:3b:8c:6f:2a:9e
Apr 21 01:04:01 iz sshd[30510]: Accepted publickey for palantetech-sync from 23.92.21.90 port 40038 ssh2: RSA de:52:8a:7a:66:14:69:19:60:f6:94:3b:8c:6f:2a:9e

comment:4 Changed 2 years ago by https://id.mayfirst.org/jamila

Yes, 23.92.21.90 is another one of our servers that runs backups to Iz.

comment:5 Changed 2 years ago by https://id.mayfirst.org/jamila

Here's one that failed last night. User palantetech-sync, from server curds.palantetech.coop.

Apr 27 02:38:05 Info: >>>> starting action /etc/backup.d/91-mfpl.rdiff (because current time matches everyday at 01:00)
Apr 27 02:38:16 Fatal: Can't connect to iz.mayfirst.org as palantetech-sync.
Apr 27 02:38:16 Fatal: <<<< finished action /etc/backup.d/91-mfpl.rdiff: FAILED

comment:6 Changed 2 years ago by https://id.mayfirst.org/jamie

Thanks Jamila, we also are starting to get nagios alerts.

I suspect it's the MaxStartup setting (which will start denying connections when there is more than 10 un-authenticated connections). I suspect it's some kind of dictionary attack robot that doesn't realize we don't allow password based auth.

We just increased the log level so we can confirm this is the problem and if so, we'll probably need to either change the setting or keep the log level high and install fail2ban.

Hopefully we will get this figured out in the next few days. Thanks for ticket!

comment:7 Changed 2 years ago by https://id.mayfirst.org/jamila

Thanks Jamie! I'll let you know if I see other similar failures on our end.

comment:8 Changed 23 months ago by https://id.mayfirst.org/jamila

Curds.palantetech.coop failed again.

May 01 02:30:38 Info: >>>> starting action /etc/backup.d/91-mfpl.rdiff (because current time matches everyday at 01:00)
May 01 02:30:55 Fatal: Can't connect to iz.mayfirst.org as palantetech-sync.
May 01 02:30:55 Fatal: <<<< finished action /etc/backup.d/91-mfpl.rdiff: FAILED

comment:9 Changed 23 months ago by https://id.mayfirst.org/jamila

Let me know if there is more information I can get you from the failures. They are happening in the middle of the night, and so far when I try to run it during the day it runs fine.

comment:10 Changed 23 months ago by https://id.mayfirst.org/jamie

Hi Jamila - thanks again and sorry for the delay. Frustratingly, our debug log doesn't shed any more light into what is going on.

We don't see a huge number of connections at that time and we see two successful connections from your IP that are immediately ended:

May  1 02:30:38 iz sshd[30489]: Connection from 199.58.81.151 port 37311 on 216.66.15.22 port 22
May  1 02:30:55 iz sshd[30491]: Connection from 199.58.81.151 port 37316 on 216.66.15.22 port 22
May  1 02:30:55 iz sshd[30491]: Postponed publickey for palantetech-sync from 199.58.81.151 port 37316 ssh2 [preauth]
May  1 02:30:55 iz sshd[30491]: Accepted publickey for palantetech-sync from 199.58.81.151 port 37316 ssh2: RSA a5:cc:d5:fa:a1:54:b3:be:25:2e:49:3c:c9:c8:4e:a0
May  1 02:30:55 iz sshd[30493]: Starting session: command for palantetech-sync from 199.58.81.151 port 37316
May  1 02:31:43 iz sshd[30493]: Received disconnect from 199.58.81.151: 11: disconnected by user
May  1 02:31:43 iz sshd[30497]: Connection from 199.58.81.151 port 37322 on 216.66.15.22 port 22
May  1 02:31:43 iz sshd[30497]: Postponed publickey for palantetech-sync from 199.58.81.151 port 37322 ssh2 [preauth]
May  1 02:31:43 iz sshd[30497]: Accepted publickey for palantetech-sync from 199.58.81.151 port 37322 ssh2: RSA a5:cc:d5:fa:a1:54:b3:be:25:2e:49:3c:c9:c8:4e:a0
May  1 02:31:43 iz sshd[30499]: Starting session: command for palantetech-sync from 199.58.81.151 port 37322
May  1 02:31:43 iz sshd[30499]: Received disconnect from 199.58.81.151: 11: disconnected by user

Clearly something is going wrong though, so I've simply increased the max startups from 10 to 30. I'm not convinced that will fix it, but I'm at a loss so figured we should try it and see what happens.

comment:11 Changed 23 months ago by https://id.mayfirst.org/jamila

Thanks! This happened early this morning from grits.palantetech.coop (199.58.81.200). I'm checking now and it's not true. rdiff-backup is the same version. Not sure if this is related.

May 05 01:19:59 Info: >>>> starting action /etc/backup.d/91-mfpl.rdiff (because current time matches everyday at 01:00)
May 05 01:20:13 Fatal: rdiff-backup does not have the same version at the source and at the destination.
May 05 01:20:13 Fatal: <<<< finished action /etc/backup.d/91-mfpl.rdiff: FAILED

comment:12 Changed 23 months ago by https://id.mayfirst.org/jamila

It happened again over the weekend. Let me know if you want any more information than this.

One happened from office.srlp.org (68.173.15.176) connecting as srlp-sync.

May 08 01:14:29 Info: >>>> starting action /etc/backup.d/91-mfpl.rdiff (because current time matches everyday at 01:00)
May 08 01:15:57 Fatal: Can't connect to iz.mayfirst.org as srlp-sync.
May 08 01:15:57 Fatal: <<<< finished action /etc/backup.d/91-mfpl.rdiff: FAILED

This also happened again on grits.palantetech.coop (199.58.81.200).

May 07 01:10:18 Info: >>>> starting action /etc/backup.d/91-mfpl.rdiff (because current time matches everyday at 01:00)
May 07 01:10:29 Fatal: Can't connect to iz.mayfirst.org as palantetech-sync.
May 07 01:10:29 Fatal: <<<< finished action /etc/backup.d/91-mfpl.rdiff: FAILED

comment:13 Changed 23 months ago by https://id.mayfirst.org/jaimev

This may mean that the problem we suspected may not the root cause of the disconnections. One thing I did notice when logging into iz during the evening is that it was very high disk io and wait times due to so many machines making overnight backups.

comment:14 Changed 23 months ago by https://id.mayfirst.org/jamie

Frustrating. I just reset our ssh configuration since that didn't work (and the debug level of logging doesn't help).

Jaime - can you setup fail2ban on iz to block on "Invalid user"?

According to sar our wait times never go over 50% so it seems strange that iz would reject ssh connections.

comment:15 Changed 23 months ago by https://id.mayfirst.org/jamila

Thank you! I'll let you know if that helps. Fwiw I'm also looking into switching from using rdiffbackup to using borg, which runs a *lot* quicker. Just on servers we control, there are over a dozen nightly backup jobs to iz, so I am hoping to eventually lower the effective load by using a more efficient program.

comment:16 Changed 23 months ago by https://id.mayfirst.org/jamila

I'm adding borg to our servers. gravy.palantetech.coop ran backups and had a connection issue when trying to run the borg backup at 1:19, but the rdiff worked fine between 1:37-2:22.

May 15 01:15:18 Info: >>>> starting action /etc/backup.d/71-mfpl.borg (because current time matches everyday at 01:00)
May 15 01:15:40 Info: Repository was already initialized
May 15 01:19:40 Error: Connection closed by remote host. Is borg working on the server?
May 15 01:19:40 Fatal: Failed backuping up source 
May 15 01:19:40 Fatal: <<<< finished action /etc/backup.d/71-mfpl.borg: FAILED
May 15 01:19:40 Info: >>>> starting action /etc/backup.d/76-borgbackupcheck.sh (because current time matches everyday at 01:00)
May 15 01:37:45 Info: <<<< finished action /etc/backup.d/76-borgbackupcheck.sh: SUCCESS
May 15 01:37:45 Info: >>>> starting action /etc/backup.d/91-mfpl.rdiff (because current time matches everyday at 01:00)
May 15 01:43:16 Info: Removing backups older than 14D days succeeded.
May 15 02:22:00 Info: Successfully finished backing up source hq
May 15 02:22:00 Info: <<<< finished action /etc/backup.d/91-mfpl.rdiff: SUCCESS

This says to me there might be a load issue just after 1am. If your tests confirm this, I will go through and stagger out the times of our backups.

comment:17 Changed 22 months ago by https://id.mayfirst.org/jamila

I had issues again last night. I have now staggered 7 of the palantetech-sync backups to start running 20 minutes apart, none of them starting at 1am, where previously they all started at 1am. Here's hoping that helps.

comment:18 Changed 22 months ago by https://id.mayfirst.org/jamie

Thanks jamila - it really does seem to be helping a lot.

Also - are many of these backups new backups? I'm wondering if this problem happened because you started several backups from scratch that were copying large volumes of data - and once the initial sync is complete, it will use a lot less disk i/o?

comment:19 Changed 22 months ago by https://id.mayfirst.org/jamila

Hi Jamie,

No, I'm afraid none of those 7 have been new backups for a while. I am running other new backups now, but not on those hosts. These 7 have also been switched completely to using borgbackup instead of rdiff-backup, which should also lighten the load a lot. Eventually I'm hoping to switch all of our client backups to Iz to use borgbackup, but we have to get our deployment system set up for that first. I am still seeing occasional backup issues to iz on the clients that are running rdiff-backups, only one of which was starting at 1am.

comment:20 Changed 22 months ago by https://id.mayfirst.org/jamila

I've switched several hosts to borgbackup, but there are still issues.

Iz dropped the connection from bagel.palantetech.coop (50.116.4.203) last night.

May 31 00:03:57 Info: >>>> starting action /etc/backup.d/76-borgbackupcheck.sh (because current time matches everyday at 00:10)
May 31 00:04:36 Warning: Connection closed by 216.66.15.22
May 31 00:04:36 Warning: <<<< finished action /etc/backup.d/76-borgbackupcheck.sh: WARNING

comment:21 Changed 22 months ago by https://id.mayfirst.org/jamie

Thanks Jamila.

This is an untenable situation - I think the only long term solution will be to setup more backup servers. Probably we will be setting them up out of the main colo center - so the main colo center will be dedicated to onsite backups for MF/PL servers. This change will probably take at least a month to get off the ground at which point we can begin moving the palante backups one by one to a server better able to handle the load.

In the meantime we can try:

  • I noticed that one disk in the 12 disk array is repeatedly reporting errors. I can replace this disk tomorrow night (that will result in a lot more disk i/o over the weekend while it syncs, but maybe it will improve performance of the RAID once it completes).
  • Start switching MF/PL backups to use borgbackup (https://github.com/mxroo/borg_backupninja)

I think we should do both

comment:22 Changed 22 months ago by https://id.mayfirst.org/jamila

Thank you Jamie,

Here's hoping replacing that disk helps! Let me know when you're about to start migrating to borgbackup, I am doing a lot of QA on my borgbackup config right now and I will make sure my github is up to date for you. My coworker Jessie has also created some ansible scripts for rolling out the borg backups, if that would be helpful.

comment:23 Changed 22 months ago by https://id.mayfirst.org/jaimev

"Start switching MF/PL backups to use borgbackup" \o/

comment:24 Changed 22 months ago by https://id.mayfirst.org/jamila

Issues seem to be strongest at 1am. I'm able to run this job without issue right now, but at 1am this problem happened again for user ajmuste-sync from 67.245.112.172.

Jun 07 01:01:03 Info: >>>> starting action /etc/backup.d/71-mfpl.borg (because current time matches everyday at 01:00)
Jun 07 01:03:11 Fatal: Can't connect to iz.mayfirst.org as ajmuste-sync.
Jun 07 01:03:11 Fatal: <<<< finished action /etc/backup.d/71-mfpl.borg: FAILED

comment:25 Changed 22 months ago by https://id.mayfirst.org/jamie

I saw the error last night also via nagios. It was the first error since I removed the bad drive on friday - I was thinking that drive was the culprit.

Prepare for possibly a spike in errors as we sync the new drive tomorrow.

I'll be working in earnest next week on converting our backups to borg to try to continue reducing the load.

comment:26 Changed 21 months ago by https://id.mayfirst.org/jamila

  • Resolution set to worksforme
  • Status changed from assigned to closed

I haven't seen any of these sorts of issues in a while, I think switching the drives and reducing the load has helped.

Please login to add comments to this ticket.

Note: See TracTickets for help on using tickets.