Opened 3 months ago

Last modified 3 months ago

#15606 assigned Bug/Something is broken

change backup strategy to offsite only

Reported by: Jamie McClelland Owned by: Jamie McClelland
Priority: Medium Component: Tech
Keywords: Cc:
Sensitive: no

Description

Per the coordination team meeting on 2020-04-20, we are transitioning to a backup strategy that only backs up to an offsite backup location - we are removing the redundant onsite backup.

We are implementing the offsite backup directly from each server to the remote backup server via borgbackup.

Change History (8)

comment:1 Changed 3 months ago by Jamie McClelland

Sensitive: unset

comment:2 Changed 3 months ago by Jamie McClelland

I've been testing the rollout of borgbackup directly to our offsite backup servers using malcolm - our biggest mosh.

malcolm had previously been backing up to iz (onsite) via rdiff-backup and then from iz to minnie (offsite) via borgbackup.

The last offsite backup was in late December.

First, I created a new user account on minnie (borg-malcolm) and the home directory /home/borg-malcolm. Then, I created the directory /home/borg-malcolm/servers and finally I moved /home/borg-iz/servers/malcolm to /home/borg-malcolm/servers/malcolm and fixed the permissions.
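For the record, the setup on minnie amounted to roughly the following (a sketch: the exact commands and options used aren't recorded in this ticket, and the adduser flags are an assumption):

# on minnie: create the dedicated backup user, then move the existing repo into place
adduser --disabled-password borg-malcolm
mkdir /home/borg-malcolm/servers
mv /home/borg-iz/servers/malcolm /home/borg-malcolm/servers/malcolm
chown -R borg-malcolm:borg-malcolm /home/borg-malcolm/servers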

Then, on malcolm, I generated a list of the available archives (borgbackup list borg-malcolm@minnie.mayfirst.org:servers/malcolm) and tried listing the contents of an archive.

Then, I ran cd / && borgbackup create borg-malcolm@minnie.mayfirst.org:servers/malcolm::20200417 etc to back up /etc.
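Spelled out, the client-side sequence was roughly the following (a sketch: the binary is normally installed as borg, the 20200417 archive name is just the date I picked, and <archive> is a placeholder for one of the names returned by the list command):

# list the existing archives, then peek inside one of them
borg list borg-malcolm@minnie.mayfirst.org:servers/malcolm
borg list borg-malcolm@minnie.mayfirst.org:servers/malcolm::<archive>
# back up /etc first, with paths given relative to / as in the ticket
cd / && borg create borg-malcolm@minnie.mayfirst.org:servers/malcolm::20200417 etc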

This process triggered borgbackup on malcolm to create (and download) an index of all the available files in the backup on minnie. This took several hours and about 7GB of disk space, stored in /root/.cache/borg (I had to create a dedicated partition for it).
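If a dedicated partition is awkward on other hosts, the cache location can also be moved with the BORG_CACHE_DIR environment variable that borg documents (the path below is just an example, not what we used):

# keep the chunk/file cache on a roomier filesystem instead of /root/.cache/borg
export BORG_CACHE_DIR=/srv/borg-cache
borg create borg-malcolm@minnie.mayfirst.org:servers/malcolm::<archive> etc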

However, the actual backup of /etc did not take so long.

Then, I ran the same command, but backed up the far bigger /home/ directory.

This took 54 hours. I'm still not sure if borgbackup recognized existing files that had not changed since late December and did not re-back those up, or whether it backed everything up. In any event, 54 hours is far less time than it takes to recover from an rdiff-backup in which the archive has been corrupted.

When I re-ran the command the next day it took less than 2 hours to complete - which is quite good.

Lastly, I moved things around on minnie to better match the current directory setup (on minnie we back up to /home/members/mayfirst/backups/malcolm/borg-backup and we back up as the user malcolm-sync).

After moving everything around on minnie I re-ran the backup on malcolm and received the helpful message:

Warning: The repository at location ssh://malcolm-sync@minnie.mayfirst.org/home/members/mayfirst/backups/malcolm/borg-backup was previously located at ssh://borg-malcolm@minnie.mayfirst.org/home/borg-malcolm/servers/malcolm

I was given the option to approve this change - and responded yes. This is a great step - it means we didn't have to re-download and re-generate all the indexes.

I've now configured malcolm to use borgbackup via backupninja (and made the necessary puppet adjustments). Tonight it will run in an automated fashion.
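For reference, the backupninja action ends up looking something like this in /etc/backup.d/ (a rough sketch from memory: the borg handler's key names can differ between backupninja versions, and the include/exclude values here are illustrative rather than what puppet actually deploys):

## /etc/backup.d/71_backup.borg (illustrative)
[general]
when = everyday at 00:00

[source]
include = /etc
include = /home

[dest]
user = malcolm-sync
host = minnie.mayfirst.org
directory = /home/members/mayfirst/backups/malcolm/borg-backup
encryption = none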


comment:3 Changed 3 months ago by Jamie McClelland

Some of my take-aways from this:

  1. borgbackup stores all metadata on the source server - which should make it extremely light on the target server. I suspect all the comparisons of what needs to be transferred happen entirely on the source server. This means we should be able to really load up backups on a single target server without worrying too much about resource contention or disk i/o problems on the target.
  2. It can handle moving things around on the target server. I'm still not 100% sure whether borgbackup is purely using file hashes to determine if a file has already been backed up or not. In other words, does it take 54 hours to make a complete backup of the data from malcolm to minnie? Or does it take 54 hours to back up only the files that have changed on malcolm in the last 4 months?
  3. It does seem to monopolize disk i/o for a single cpu (based on munin charts). I'm not sure whether this would have sunk the server if we weren't employing lvm caching on the ssd, but I suspect it would have. For large servers like malcolm, we probably will want to switch them to using ssd-backed lvm caches before the initial sync (see the sketch after this list).
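For reference, attaching an ssd-backed cache to an existing home LV looks roughly like this (a sketch: the volume group, device, and LV names are made up, and the sizes depend on the ssd):

# carve cache data + metadata LVs out of the ssd PV (names/sizes illustrative)
lvcreate -L 100G -n home_cache vg_malcolm /dev/sdb1
lvcreate -L 1G -n home_cache_meta vg_malcolm /dev/sdb1
# turn them into a cache pool and attach it to the existing home LV
lvconvert --type cache-pool --poolmetadata vg_malcolm/home_cache_meta vg_malcolm/home_cache
lvconvert --type cache --cachepool vg_malcolm/home_cache vg_malcolm/home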

comment:4 Changed 3 months ago by JaimeV

I can work on identifying the moshes with large disks and making sure they are backed by lvmcache. I don't think the cache would do anything to speed up borgbackup but might help keep apache and dovecot stable by allowing them to read frequently accessed files from the ssd.

It does sound like having at least 2 CPUs on the client would be ideal. The documentation also mentions that borg can take up a considerable amount of tmp space on the destination server, but there is an environment variable you can set to change where temporary files are created for each backup: https://borgbackup.readthedocs.io/en/stable/usage/general.html#resource-usage
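One way to wire that in (an assumption about how we would do it, not something tested here) is to set TMPDIR inside the forced command for the backup user on the destination, since the repository side runs under borg serve:

# ~/.ssh/authorized_keys for malcolm-sync on minnie (paths illustrative)
command="TMPDIR=/home/members/mayfirst/backups/tmp borg serve --restrict-to-path /home/members/mayfirst/backups/malcolm",restrict ssh-ed25519 AAAA...example... root@malcolm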

comment:5 Changed 3 months ago by Jamie McClelland

Success on malcolm:

Apr 23 00:00:01 Info: >>>> starting action /etc/backup.d/10_mysql.sh (because current time matches everyday at 00:00)
Apr 23 00:14:27 Info: <<<< finished action /etc/backup.d/10_mysql.sh: SUCCESS
Apr 23 00:14:27 Info: >>>> starting action /etc/backup.d/71_backup.borg (because current time matches everyday at 00:00)
Apr 23 00:14:34 Info: Repository was already initialized
Apr 23 02:09:03 Info: Successfully finished backing up source 
Apr 23 02:09:41 Info: Removing old backups succeeded.
Apr 23 02:09:41 Info: <<<< finished action /etc/backup.d/71_backup.borg: SUCCESS
Apr 23 02:09:41 Info: FINISHED: 2 actions run. 0 fatal. 0 error. 0 warning.

The last successful onsite rdiff backup took about 6 hours.

comment:6 Changed 3 months ago by Jamie McClelland

We now have amilcar, daza, claudette, malcolm done and ossie in progress.

comment:7 Changed 3 months ago by Jamie McClelland

In terms of working on the biggest partitions first, the logical next candidates would include:

  • chavez
  • viewsic
  • lucius (yipes)

I'm not sure if we should get these partitions cached on ssd before we attempt it. I'm inclined to say: let's pick one and test it without the ssd cache. With borgbackup, we can always kill it mid-way through and still be able to pick it up later.
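Picking it up later works because borg writes periodic checkpoint archives during a long run (named with a .checkpoint suffix), so chunks already uploaded aren't re-sent on the next attempt. They are visible with borg list against the repository; the repo path and user below are assumptions following malcolm's layout, not an existing setup:

borg list viewsic-sync@minnie.mayfirst.org:/home/members/mayfirst/backups/viewsic/borg-backup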

comment:8 Changed 3 months ago by JaimeV

chavez seems to be doing ok. Looking at viewsic's graphs, I think we should actually move that one to another host where we can set it up with an lvmcache.
