Opened 5 weeks ago

Last modified 7 days ago

#13884 new Task/To do item

alter backup strategy to reduce disk i/o

Reported by: https://id.mayfirst.org/jamie Owned by:
Priority: Medium Component: Tech
Keywords: backup borgbackup rdiff-backup rsync Cc:
Sensitive: no

Description

Currently, for all production machines, we run exactly two backups every night. One backup uses rdiff-backup to make a 10 day incremental backup to our onsite backup server. After that completes, we run a second rsync backup to our offsite backup server.

Often, these backup jobs are still running on our production servers as late as 9, 10, or even 11 am America/New_York time, which causes cascading disk i/o slowdowns on every server sharing the disks being backed up.

The new plan is:

  • Run a single rsync backup to our onsite backup server each night.
  • Every day, run an incremental 10 day borg backup from our onsite backup server to our offsite backup servers

Change History (21)

comment:1 Changed 5 weeks ago by https://id.mayfirst.org/jamie

I'm experimenting first with octavia by taking these steps:

  • Edit the server's .pp file. Change the rdiff-backup-to-iz line to an rsync backup to iz and remove the existing rsync backup to the offsite backup server.
  • On octavia, delete /etc/backup.d/50_backup.rdiff (puppet won't delete it for you).
  • On iz, as the octavia-sync user, move the rdiff-backup-data directory to the home directory (mv rdiff-backup/rdiff-backup-data ~/), then rename rdiff-backup to rsync (mv rdiff-backup rsync). This change should allow the next rsync run from octavia to iz to happen without re-copying all the files.
  • Prepare borg backup on minnie (which is where octavia is currently backing up to)
    • Create borg-iz username on minnie
    • Grant access to root@iz to borg-iz on minnie
    • As borg-iz user on minnie, initiate the backup:
      mkdir -p /home/borg-iz/servers
      borg init --encryption none /home/borg-iz/servers/octavia
      
  • As root, chown the octavia backup so the borg-iz user can access it:
    chown -R borg-iz /home/members/mayfirst/backups/octavia/rsync/
    
  • Import backup data. These steps are not yet started - I'm waiting for minnie's disks to finish re-syncing the RAID
    • As the borg-iz user on minnie, run the initial backup locally (so we don't have to move all this data from our colo to our offsite backup site):
      cd /home/members/mayfirst/backups/octavia/rsync/
      borg create /home/borg-iz/servers/octavia::$(date +%Y%m%d) .
      
    • When complete, rm -rf /home/members/mayfirst/backups/octavia && deluser octavia-sync
  • Schedule regular borg backups from iz. Once the data is imported, I will write a script, run from a cron job, that cds into /home/members/mayfirst/backups/octavia/rsync and runs borg create borg-iz@minnie.mayfirst.org:/home/borg-iz/servers/octavia::$(date +%Y%m%d) . followed by borg prune --keep-daily 10 borg-iz@minnie.mayfirst.org:/home/borg-iz/servers/octavia (see the sketch below).
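
A minimal sketch of what that cron job could look like, assuming the paths and repository named above (locking, logging and error handling are omitted and would need to be added):

  #!/bin/sh
  # Hypothetical nightly borg job for octavia, run from cron on iz.
  set -e
  repo="borg-iz@minnie.mayfirst.org:/home/borg-iz/servers/octavia"
  cd /home/members/mayfirst/backups/octavia/rsync
  # Create a new archive named after today's date...
  borg create "${repo}::$(date +%Y%m%d)" .
  # ...then keep only the last 10 daily archives.
  borg prune --keep-daily 10 "$repo"
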
Last edited 5 weeks ago by https://id.mayfirst.org/jamie

comment:2 Changed 5 weeks ago by https://id.mayfirst.org/jamie

Both octavia (on minnie) and juana (on banks) are now importing their backups into borg.

comment:3 Changed 5 weeks ago by https://id.mayfirst.org/jamie

Octavia completed. It took minnie about 3 hours to import 37GB of rsync data into 31GB of borg backup data.

Now, running borg backup from iz to minnie.

comment:4 Changed 4 weeks ago by https://id.mayfirst.org/jamie

Both octavia and juana are now fully configured to use borg backup and our borg backup scripts are handled by puppet.

On iz, our borgbackup scripts are triggered by the presence of a file called borg-${target} (e.g. borg-minnie or borg-banks). Every morning at 8:00 am eastern, the script looks for home directories with those files and if found, it initiates a borg backup to the specified target.

You can specify which target to backup to in each individual server's .pp file (in the iz stanza at the bottom).
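
A rough sketch of the scanning loop described above - the trigger-file location and what exactly gets run per server are assumptions, not the actual puppet-managed script:

  #!/bin/sh
  # Hypothetical: find borg-<target> trigger files in backup users' home
  # directories and start a borg backup toward each named target.
  for trigger in /home/*/borg-*; do
      [ -e "$trigger" ] || continue
      target=${trigger##*/borg-}     # e.g. "minnie" or "banks"
      home=$(dirname "$trigger")     # e.g. /home/octavia-sync
      echo "starting borg backup of $home to $target"
      # ...here the real script presumably invokes borg create/prune
      # against the repository on the target backup server.
  done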

comment:5 Changed 4 weeks ago by https://id.mayfirst.org/jamie

Now that things are set up, here's a revised version of the steps to convert a server to use borg backup:

  • Prepare the server:
    • Make the following changes to the server's pp file:
      • Change the rdiff line backing up to iz to be an rsync line
      • In the server's .pp file, delete the backup_rsync_target line that specifies either minnie or banks
      • Delete the backup server from the if clause at the bottom of the .pp file
    • Here's a sample summary:
         class { "mayfirst::m_minimal":
           location => "telehouse",
      -    backup_rdiff_target => "iz.mayfirst.org",
      +    backup_rsync_target => "iz.mayfirst.org",
      -    backup_rsync_target => "minnie.mayfirst.org",
           caching_dns_ips => [ "216.66.22.34", "209.51.171.179" ],
           backup_start_hour => 0
         }
      @@ -43,7 +42,7 @@ if ( $::fqdn in $::mfpl_nagios_servers ) {
         }
       }
       
      -if ( $::fqdn == "iz.mayfirst.org" ) or ( $::fqdn == "minnie.mayfirst.org" ) {
      +if ( $::fqdn == "iz.mayfirst.org" ) {
         mayfirst::m_backupninja::server::configure_node { "proudhon": }
       }
      
    • Commit the changes, then push to origin and to the server you are changing
    • Prepare the server: ssh to the server and run mf-borg-init - this script will warn you of any errors. Keep running it until it tells you that you are ready.
  • Initialize the borg backup on the backup server (either banks or minnie, depending on where this server's offsite backups were going). This is a two-stage process (each stage takes a while) so be sure to run it in a screen session: mf-borg-init-offsite $server. Run this command twice (it will tell you what it is doing and when you are done).
  • Edit the .pp file again, adding a line that will trigger the borg backup from iz:
     
     if ( $::fqdn == "iz.mayfirst.org" ) {
       mayfirst::m_backupninja::server::configure_node { "proudhon": }
    +  mayfirst::m_borgbackup::source { "proudhon": target => "minnie" }
     }
    
    • Commit and push to iz
Last edited 4 weeks ago by https://id.mayfirst.org/jamie

comment:6 Changed 4 weeks ago by https://id.mayfirst.org/jaimev

wow. Great work jamie.

comment:7 Changed 4 weeks ago by https://id.mayfirst.org/jamie

Thnx :) - now, both gaspar and proudhon are in process. Given all the disk i/o I'm only adding two servers per day. After we have a few more under our belt we may try to do more.

comment:8 Changed 3 weeks ago by https://id.mayfirst.org/jamie

I'm continuing to make progress. I'm focusing on moving one large backup per day per offsite backup server (because switching to the new system is disk i/o intensive on the offsite backup servers).

It will take a few weeks to complete.

Also, I'm seeing that the rsync to iz takes a long time (for example, proudhon takes 5 hours to rsync 100GB of data). That might simply be a weakness of rsync (it has to read every file to determine whether it has been modified). However, based on this set of proposed optimizations, I think we should add --inplace and --whole-file to our rsync script. --inplace could be added now (it will help with both the onsite and offsite backups), but --whole-file should only be added for backups going to iz.
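
For reference, here is roughly how those options might be added to the rsync invocation - the exact command line is built by our backupninja/puppet configuration, so the user, paths and other flags shown here are illustrative only:

  # --inplace: update destination files in place rather than writing a
  #   temporary copy and renaming it (less churn on the backup server).
  # --whole-file: skip rsync's delta-transfer algorithm; only worth it on
  #   a fast local link such as the one to iz.
  rsync -a --delete --inplace --whole-file /home/ octavia-sync@iz.mayfirst.org:rsync/home/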

comment:9 Changed 3 weeks ago by https://id.mayfirst.org/jamie

I've just made the change in puppet and pushed to proudhon (all rsync servers now use --inplace and if you are backing up to iz, you also get --whole-file).

I'll see tomorrow if the rsync backup goes faster than 5 hours.

comment:10 Changed 3 weeks ago by https://id.mayfirst.org/jamie

Unfortunately, the rsync options did not make any discernible difference in the time it took proudhon to complete the rsync. I suspect it is because rsync has to compare every single file on proudhon to every single file on iz, which is a huge amount of reading.

I think our options are:

  • Live with it
  • Consider a different tool to backup onsite files (I suspect borg backup and others are more efficient because they record their state after each run locally so the next run goes faster).
  • Consider a modified rsync approach, for example:
    • record the date/time rsync completes
    • on the next run, use find to find files that have been changed since that date and rsync only those files
    • Once a week do a traditional recursive rsync to sweep up any files we may have missed

This approach (sketched below) would reduce the number of file reads on the remote side, and the find-based search would probably also be more efficient on the local side.
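
A minimal sketch of that modified approach, assuming a timestamp file is touched after each successful run (the file names, paths and destination are made up for illustration):

  #!/bin/sh
  # Hypothetical "changed files only" rsync pass. The stamp file must
  # exist before the first run (e.g. created by a full rsync).
  stamp=/var/lib/backup/last-rsync
  dest=octavia-sync@iz.mayfirst.org:rsync/
  # Send only files modified since the last successful run; --files-from
  # implies --relative, so the directory structure is preserved.
  find /home -newer "$stamp" -type f -print0 |
      rsync -a --from0 --files-from=- / "$dest"
  # Record the completion time for the next run.
  touch "$stamp"
  # A weekly full recursive rsync would still be needed to sweep up
  # anything this pass misses.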

comment:11 Changed 3 weeks ago by https://id.mayfirst.org/jamie

Other options based on more research that would require more significant modifications to our setup:

  • Using LVM to create a snapshot of changed files and only backup those files
  • Use a block-level backup like DRBD

At this point, I'm leaning toward "Live with it".

comment:12 Changed 8 days ago by https://id.mayfirst.org/jamie

I just discovered the reason rsync takes so much longer than rdiff-backup... it's because our rsync script runs via ionice.

I'm going to remove ionice from our rsync scripts to see if that changes things (it might cause a significant disk i/o increase while the backups are running in the middle of the night, but I suspect this won't be a big deal - the difference is between running rdiff-backup without ionice followed by rsync with ionice vs. just running rsync without ionice).
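
For context, the change is roughly the difference between invocations like these (the flags and paths are illustrative; the real command is assembled by our backup scripts):

  # current: rsync throttled to the idle i/o scheduling class
  ionice -c3 rsync -a --delete /home/ octavia-sync@iz.mayfirst.org:rsync/home/
  # proposed: plain rsync, competing normally for disk i/o
  rsync -a --delete /home/ octavia-sync@iz.mayfirst.org:rsync/home/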

comment:13 Changed 8 days ago by https://id.mayfirst.org/jaimev

I am curious why we aren't using borg backup in the first phase to take advantage of that speed increase where the extra time reading backups impacts us most.

If we went the direction of block level backups, my understanding of how lvm snapshots work is that they only accumulate copies of the old files that have since changed on the active lv - that is how they are space efficient. Reading from the snapshot reads both these old files and the files that have not changed on the active lv, so it is still like reading the complete lv as it was at the point in time the snapshot was created.

Still, snapshots might be a useful part of a backup strategy using a simple block level backup tool like bdsync or DRBD (I don't completely understand DRBD yet): you create the snapshot and then make the block level backup from the snapshot, knowing that you have a faithful remote reproduction of the disk at snapshot time (rough sketch below). However, there are a lot of warnings about how keeping too many lvm snapshots active at once can create more disk i/o performance issues. So I think this would have to be coordinated to ensure that the creation of lvm snapshots and the remote block level backups of the guest vms happen sequentially, tightly controlling how many lvm snapshots are active, instead of initiating at random times like we currently do.
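
A sketch of that sequence - the volume group names, snapshot size and the bdsync invocation are assumptions for illustration, not a tested recipe:

  # 1. Freeze a point-in-time view of the guest's logical volume.
  lvcreate --snapshot --size 10G --name guest1-snap /dev/vg0/guest1
  # 2. Make the block level copy from the snapshot rather than the live
  #    LV, e.g. with bdsync (only changed blocks cross the network).
  bdsync "ssh backup@offsite bdsync --server" /dev/vg0/guest1-snap /dev/backupvg/guest1 \
      | ssh backup@offsite "bdsync --patch=/dev/backupvg/guest1"
  # 3. Drop the snapshot as soon as the copy finishes, so only one
  #    snapshot is active at a time.
  lvremove -f /dev/vg0/guest1-snap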

Something I've experimented with elsewhere is coordinating this from the physical host itself, making the snapshots and backups of the logical volumes created on the physical host. However, in that case several logical volumes were created on the physical host for each guest, each used as a single partition within the guest for root, home, var, etc. But as we currently create at most two physical host LVs for each guest (one that gets further subdivided into logical volumes within the guest, and maybe another for the database partition), the above strategy wouldn't allow us to selectively back up only parts of each guest.

comment:14 follow-up: Changed 8 days ago by https://id.mayfirst.org/jamie

I agree about moving to block level backups. I think that should be our ultimate goal.

As for the short term, I expected rsync to be the overall fastest backup method, which is why I chose it as the initial backup to iz. It also has the benefit of creating a complete copy of the backup via the filesystem which can easily be searched to find just the file or directory you want (without having to run any special borg backup restore commands).

Now that I have removed the ionice restriction, I'll be curious to see which is faster: an unhindered rsync backup to iz or the borgbackup to minnie/banks. If the minnie/banks backup is network constrained, it's not really an even comparison, but if it's disk i/o constrained it should be useful.

If borgbackup is significantly faster than a straight rsync, then we may consider switching to a borgbackup to iz and then an rsync to minnie/banks (although we may have problems - what if rsync runs while borgbackup is mid-backup? We might end up with corrupt borgbackups on minnie/banks, since borgbackup doesn't store individual files but instead creates its own archives).

As for longer term block level backup... I think switching to drbd is the way to go.

Ideally, I would envision something like this:

After we complete the step of changing the file system on moshes (red-mosh-reorganization), I would like to see us mount all data partitions under /media/ (e.g. /media/albizu, /media/chavez) and then symlink data on those partitions to the permanent locations in /home/users and /home/sites. The control panel would have to keep track of which data partition each home directory and web site belongs to.

That way, everyone would always know their permanent home directory is /home/users/jamie and their permanent web site directory is /home/sites/1234, but in reality those would be symlinks to the actual location of the data, which could be on any one of many partitions mounted in /media/.
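
A sketch of that layout in shell terms - the volume and member names are made-up examples:

  # Mount the data partitions under /media/ ...
  mount /dev/vg0/albizu-data /media/albizu
  mount /dev/vg0/chavez-data /media/chavez
  # ... and give each account a stable path that is just a symlink into
  # whichever partition actually holds its data.
  ln -s /media/albizu/users/jamie /home/users/jamie
  ln -s /media/chavez/sites/1234 /home/sites/1234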

Then, we can limit partition sizes to 500GB (so fscks run faster) and we could also allocate data-only logical volumes from the host. If a mosh runs out of disk space, instead of extending an already huge partition, we simply create a new partition.

This also gives us clean, data-only partitions that are block devices allocated from the host. So, we can do a complete data backup by syncing each data block device at the block level from the host.

And lastly... this setup opens the way to mounting remote volumes into each guest as well - the system is all based on symlinks to data in /media/, so it doesn't know or care where the data comes from.

comment:15 in reply to: ↑ 14 Changed 8 days ago by https://id.mayfirst.org/jaimev

I also like the simple accessibility of a first phase rsync copy from guest to iz.

I think in either case a second stage backup to minnie/banks would benefit from copying from an lvm snapshot on iz, to ensure the copy isn't corrupted by data still arriving from rsync or borg.

comment:16 Changed 8 days ago by https://id.mayfirst.org/jaimev

Also I like the proposal for dividing up guest data among logical volumes or DRBD block devices. One issue that comes up for me is that member data is not static. We can limit a logical volume size, but the web folders and mailboxes within it will continue to grow. Do we then have to move those that no longer fit into new logical volumes? We can implement disk quotas per user at a filesystem level as well but inevitably we will have to expand these under certain conditions. Under our current setup we allow unrestricted member data to fill lvs, and keep expanding logical volumes until they don't fit in the physical host and then move the logical volume to a new physical host. So data has to be moved at some point I guess.

I would love a way of thinking about and provisioning space for member data at a more atomic level. Thinking about member data stores as units of data that have to be moved together makes more sense to me than larger blocks of member data grouped, by virtue of order, into logical volumes assigned to guests. If a particular site or mailbox needs more space, only that data should have to be expanded/moved. That seems more fair than creating downtime for all members by moving large logical volumes between physical hosts. Of course, there should be little downtime if member data stores that are reaching their assigned space constraint are pre-synced to their new destination. What happens if block level devices, lvm or DRBD, are created for every member data store? Is there a limit? Performance impacts? Storage inefficiency? Is it any easier to move things around this way? Is there another way to think about this?

I understand the proposal of mounting data from /media/guest1/{/home/member/user} and /media/guest2/{/home/member/site/} to /home/users/user1 or /home/sites/site1 within the physical host, but why is it necessary for member data stores to be grouped into larger logical volumes at all? Or is this solely because that is where the data actually resides and reorganizing it would be too disk intensive?

I might be thinking about this the wrong way, but I would like any plans we make to include a strategy for the event where extra resources must be added, both technically and at an organizational level. It would also be nice to work out a way for this to happen automatically.

comment:17 Changed 8 days ago by https://id.mayfirst.org/jamie

I think we should add user quotas (along the lines of Nextcloud, with a low default setting and a willingness to increase on request). It's hard to figure out the best way to do that until we complete some of these phases - but I suspect once we move to ldap for email accounts it will be easy to set quotas via IMAP. In any event... the lack of quotas has got to go!

And even with quotas, disk partitions will fill up - but resolving that issue simply means picking the biggest user on a partition that is filling up, rsync'ing their data to a different partition, re-creating the symlinks and removing the old data.
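
Roughly, that move could look like this (the paths are illustrative and assume the /media/ plus symlink layout from comment 14):

  # Copy the account's data to a partition with free space...
  rsync -a /media/albizu/users/biguser/ /media/chavez/users/biguser/
  # ...repoint the permanent symlink at the new location...
  ln -sfn /media/chavez/users/biguser /home/users/biguser
  # ...and remove the old copy once the new one is verified.
  rm -rf /media/albizu/users/biguser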

I also share your goal of more atomic moves.

I envision us ending up with a system in which any single user account can independently be moved either to another partition or to an entirely different server (although not without rsync'ing the data - at least until we get a network file system).

I've been doing a lot of ldap research - particularly how to configure with both postfix and dovecot. The standard model seems to be to define an entry with a series of full email addresses (joe@joe.org, joe@bob.org) followed by a single mail drop (joe@chavez.mayfirst.org) and a single IMAP host (chavez.mayfirst.org). If all our postfix and dovecot servers use the same ldap server, then joe@joe.org can be delivered to chavez and bob@joe.org can be delivered to another server. And changing servers simply means: rsync the data and update ldap. In other words, we don't have to keep all email addresses from the same domain together.
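
For illustration only, an entry along those lines might look something like this - the attribute names are placeholders, since the real ones depend on whichever ldap schema we end up adopting:

  dn: uid=joe,ou=mail,dc=mayfirst,dc=org
  # one entry per person, listing every address they receive mail at...
  mailAddress: joe@joe.org
  mailAddress: joe@bob.org
  # ...plus a single mail drop and a single IMAP host
  mailDrop: joe@chavez.mayfirst.org
  mailHost: chavez.mayfirst.org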

With this setup, we can move any single email account (or web site) independently of any criteria except how to best distribute the resources.

I think we could create logical volumes or partitions for each member or user, but I suspect that would end up being more work to administer than it's worth, especially if we can get quotas working properly.

comment:18 Changed 7 days ago by https://id.mayfirst.org/jamie

Sadly, the removal of ionice for the rsync backup has had no visible impact. It's 9:27 am and the malcolm, leslie, chavez, june and rodolpho rsyncs are still running on iz.

I suspect it is because disk i/o on iz is the culprit. Since rsync is waiting for iz, slowing down the disk i/o on the source doesn't really matter. So, I'm putting ionice back in.

However, checking the borg backup logs... it appears that the borg backup of our big servers (the numbers below are for chavez) is a little faster, but not by a lot:

  • mf-borg-backup completion time ranges between 5 hr 22 min and 7 hr 26 min
  • rsync completion time ranges between 6 hr 33 min and 8 hr 48 min
Last edited 7 days ago by https://id.mayfirst.org/jamie

comment:19 Changed 7 days ago by https://id.mayfirst.org/jaimev

Were those times for both rsync and borg backups that had already run once before?

comment:20 Changed 7 days ago by https://id.mayfirst.org/jamie

Yes - all times are for second or later runs.

comment:21 Changed 7 days ago by https://id.mayfirst.org/jaimev

Not sure if we have the luxury of the extra disk i/o, but it might be nice to try copying to both regular and ssd-based logical volumes on medgar, in place of iz, just to get some reference points for best-case sync times with our current hardware and setup.

Last edited 7 days ago by https://id.mayfirst.org/jaimev
