Opened 3 months ago

Last modified 2 months ago

#14261 assigned Bug/Something is broken

manage disk i/o problems

Reported by: https://id.mayfirst.org/jamie Owned by: https://id.mayfirst.org/jamie
Priority: Medium Component: Tech
Keywords: Cc:
Sensitive: no

Description (last modified by https://id.mayfirst.org/jamie)

Over the last few weeks we have been plagued with disk i/o problems - specifically on peery (which now seems ok) and more recently on chavez (which was still pretty bad as recently as yesterday).

I'm working on a few ideas.

First - chavez. Last night I made three changes:

  1. Doubled the CPUs from 2 to 4
  2. Increased RAM from 8GB to 12 GB
  3. Changed dovecot to fsync less often by adding /etc/dovecot/conf.d/99-mfpl-fsync.conf with:
mail_fsync = never

protocol lda {
  # Enable fsyncing for LDA
  mail_fsync = optimized
}
protocol lmtp {
  # Enable fsyncing for LMTP
  mail_fsync = optimized
}
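
A quick way to confirm the override actually took effect (a sketch, assuming a standard dovecot 2.x setup where doveconf is available):

doveconf mail_fsync        # should print: mail_fsync = never
doveconf -n | grep fsync   # confirms the 99-mfpl-fsync.conf override is being read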

Attachments (4)

cpu-day-chavez.png (37.0 KB) - added by https://id.mayfirst.org/jamie 3 months ago.
gaspar-cpu.png (55.3 KB) - added by https://id.mayfirst.org/jamie 3 months ago.
cpu-day-chavez.2.png (37.0 KB) - added by https://id.mayfirst.org/jamie 3 months ago.
viewsic-cpu.png (40.1 KB) - added by https://id.mayfirst.org/jamie 3 months ago.


Change History (30)

comment:1 Changed 3 months ago by https://id.mayfirst.org/jamie

  • Owner set to https://id.mayfirst.org/jamie
  • Status changed from new to assigned

Changed 3 months ago by https://id.mayfirst.org/jamie

comment:2 Changed 3 months ago by https://id.mayfirst.org/jamie

I'm not sure which change made the biggest difference, but so far so good:

comment:3 Changed 3 months ago by https://id.mayfirst.org/jamie

I'm also experimenting on octavia with putting the index files in a tmpfs. However, it doesn't seem to be creating index files there.

comment:4 Changed 3 months ago by https://id.mayfirst.org/jamie

Now, indices on the tmpfs seem to be working.

I calculated the total size of all existing indices with:

find /home/members/ -path '/home/members/*/sites/*/users/*/Maildir/*' -name 'dovecot.index*' -print0 | du --files0-from=- -ch

On octavia, it came to 299MB, so I made a 600MB tmpfs by adding this to fstab:

tmpfs /var/lib/dovecot-indices tmpfs size=600M,mode=1777 0 0

Then,

mount /var/lib/dovecot-indices

Then I added /etc/dovecot/conf.d/99z-mfpl-indices.conf with the content:

mail_location = maildir:~/Maildir:INDEX=/var/lib/dovecot-indices/%u

And I restarted dovecot.
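
To verify the indices actually land in the tmpfs (a quick check, nothing octavia-specific about it):

doveconf mail_location            # should show the INDEX=/var/lib/dovecot-indices/%u part
ls /var/lib/dovecot-indices/      # per-user directories should appear once clients log in
df -h /var/lib/dovecot-indices    # watch that usage stays comfortably under the 600M limit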

comment:5 Changed 3 months ago by https://id.mayfirst.org/jamie

  • Description modified (diff)

comment:6 Changed 3 months ago by https://id.mayfirst.org/jamie

Other candidates for testing these options based on munin graphs of disk i/o are: gaspar, malcolm, june, marx, mumia, proudhon, rose, viewsic.

On chavez - last night - I allocated 15GB of SSD space (so that could be an option for indices if we need more space than we can realistically provide with a tmpfs).

I'm currently calculating the total size of indices on both chavez and gaspar to get a sense of how much space is in use on a highly used shared mosh.

I've implemented just the fsync changes on viewsic.

comment:7 Changed 3 months ago by https://id.mayfirst.org/jamie

On gaspar, the total indices come to 458MB, so I just created a 900MB tmpfs and switched dovecot to use it.

Now we have viewsic using the fsync approach and gaspar using the indices approach. After a day, we can compare each server's disk i/o today against yesterday's and see which change made the bigger impact.
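
While we wait for a full day of munin data, a quick way to eyeball disk i/o on either box (standard sysstat/procps tools, nothing specific to our setup; column names vary a little between versions):

iostat -dxm 5     # the %util and await columns show how busy/backed-up the disks are
vmstat 5          # the "wa" column is the share of CPU time spent waiting on i/o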

comment:8 Changed 3 months ago by https://id.mayfirst.org/jamie

Chavez's total indices amount to 3.9GB - too much for a tmpfs I think, but the SSD is an option.
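
If we go the SSD route on chavez, the setup would look much like the tmpfs one, just backed by the SSD volume. A rough sketch, assuming the 15GB SSD allocation shows up in the guest as /dev/vdb (the device name here is hypothetical):

mkfs.ext4 /dev/vdb
echo '/dev/vdb /var/lib/dovecot-indices ext4 defaults,noatime 0 2' >> /etc/fstab
mount /var/lib/dovecot-indices
# then reuse the same 99z-mfpl-indices.conf mail_location override and restart dovecot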

comment:9 Changed 3 months ago by https://id.mayfirst.org/jamie

So far, chavez is operating extremely well today after the changes, but neither viewsic nor gaspar shows any difference in the amount of CPU time spent in disk i/o, which is a disappointment and suggests that tweaking dovecot settings may not be a big factor after all.

I'm not sure whether the RAM or the CPU made the bigger difference on chavez, so our next experiment may be to boost the CPUs on viewsic and the RAM on gaspar and see which helps more.

comment:10 Changed 3 months ago by https://id.mayfirst.org/jamie

viewsic is now operating with 4 CPUs (twice what it had before) and gaspar is operating with 9GB of RAM (50% more than before).

Changed 3 months ago by https://id.mayfirst.org/jamie

Changed 3 months ago by https://id.mayfirst.org/jamie

Changed 3 months ago by https://id.mayfirst.org/jamie

comment:11 Changed 3 months ago by https://id.mayfirst.org/jamie

Adding 2 more CPUs makes for much more idle time on viewsic:

But, adding the RAM actually reduces disk i/o on gaspar:

Despite the significant re-allocation of RAM from a few months ago... adding yet more RAM seems to work the best in terms of reducing disk i/o.

comment:12 Changed 3 months ago by https://id.mayfirst.org/jamie

I've just added the fsync dovecot configuration to puppet, so it should go out to all servers on the next signed tag. I have manually pushed it to ossie, which is a good candidate for testing because so much of its i/o is imap related (according to resourcehog).

It has 1.6GB of indices, so we could move them to a tmpfs, but we would need to allocate more RAM first.

comment:13 Changed 3 months ago by https://id.mayfirst.org/jamie

I'm not seeing any noticeable difference from the fsync settings on ossie.

I just created a 3GB tmpfs on ossie. It seems excessive, but ossie was already allocated 12GB of RAM, so the memory is there.

Let's see how that affects things.

comment:14 Changed 3 months ago by https://id.mayfirst.org/jaimev

I've increased proudhon's RAM from 8G to 10G to see if a minimal amount of additional RAM is enough to make a difference in disk i/o there. We don't have that many resources left on wiwa.

comment:15 Changed 2 months ago by https://id.mayfirst.org/jaimev

I'm really happy about the new munin graphs.

Based on what I'm seeing there, I would like to add another CPU to ella; it currently only has one.

Also, looking at the graphs I can see that all VMs on wiwa get hit by some gnarly spikes in CPU and disk i/o at the same time, but wiwa as a whole isn't hit so badly, which I think means some more resources can be added to the VMs. I just need to figure out who the best candidate is.

I'm seeing a similar pattern on parsi, although there it does seem that parsi as a whole is closer to its resource limits.

comment:16 Changed 2 months ago by https://id.mayfirst.org/jamie

I agree with your analysis - over the last few weeks I have been applying similar logic to add CPUs and memory to certain servers.

comment:17 Changed 2 months ago by https://id.mayfirst.org/jaimev

On wiwa, peery and malcolm are the only ones that stick out as far as disk reads and memory usage. Within each of those VMs I am not really seeing clear indications of a specific user or process being responsible.

CPUs on wiwa are already overcommitted I think, but there is about 18G of RAM left. I am going to try dedicating another 4G to peery and 2G to malcolm to see what happens.

comment:18 Changed 2 months ago by https://id.mayfirst.org/jaimev

On parsi, CPU spikes are the worst on ossie, julia and lewis. The spikes coincide, and it is not easy to tell from the VMs themselves whether any particular user or process is responsible. ossie and julia have the highest disk i/o, so my best guess is to add more RAM to each of these. ossie has already received an increase in RAM, but it is one of our biggest hosts. The VM randolph on parsi is not actually in use anymore. I am going to shut it down until we can decide if it can be officially decommissioned.

comment:19 Changed 2 months ago by https://id.mayfirst.org/jamie

Nice work Jaime! And yes, it does look like randolph can be properly decommissioned.

I am still trying to work out how to tell whether a host is maxed out on CPUs, and judging from the munin chart I'm not sure wiwa is. However, I think adding more RAM is a better option.

I'm also a bit confused by ossie, but I am certain that it's imap related (resourcehog --resource read --include-commands consistently points to imap). Since ossie has the dovecot indices on a tmpfs, I'm not sure what else to do.

As we work on our new infrastructure plan.... two things that would be helpful are:

  • Being able to easily move a single mailbox to a different host
  • Being able to move a single mailbox to an ssd

We'll be able to do both once we complete our transition...

comment:20 Changed 2 months ago by https://id.mayfirst.org/jaimev

peery is looking a lot better after receiving 4G more RAM. ossie, julia, and malcolm each received 2G; brief spikes are still visible there, but they look better than before.

Fixing a WordPress site on gaspar in #14314 made a huge improvement there.

It still seems too soon to tell if the extra CPU has made a significant change on ella.

I mentioned overcommitting CPUs on wiwa because I noticed there are 24 CPU cores on wiwa but we have committed a total of 38 CPUs across the VMs there.
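
For the record, a quick way to re-check that tally on any host (a sketch that assumes the guests run under libvirt, so virsh is available; adjust if we count differently):

total=0
for d in $(virsh list --name); do
  n=$(virsh dominfo "$d" | awk '/^CPU\(s\)/ {print $2}')
  total=$((total + n))
done
echo "committed vCPUs: $total / physical cores: $(nproc)"
# this only counts running guests; use "virsh list --all --name" to include stopped VMs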

comment:21 Changed 2 months ago by https://id.mayfirst.org/jamie

Awesome! Glad to see things start to smooth out.

As for allocating CPUs - over-allocating like this might be just fine and a good strategy for us to take. Here is my understanding (which may not be 100% accurate):

Given 24 CPUs...

If we had four guests with 6 (or more) CPUs each, each guest could try to fully use all the CPUs allocated to it. If they all do so at once, the host will "steal" CPU cycles, and that will show up in vmstat as "stolen" CPU cycles on the guest. That means the guest didn't get CPU cycles it was promised (munin marks this in red). If we carefully ensure that we never allocate more CPUs than we have available, then we can be 100% certain that all guests will always get the CPU time they are promised.

This approach sounds great but... it is really inefficient. If one guest occasionally spikes and needs more than 6 CPUs, it will be constrained because it only has 6 CPUs. This is true even if the other three guests aren't using their CPUs at all.

The nice part about over-allocating is that we make better use of the CPUs. The downside is that we can't guarantee that the guests with allocated CPUs will always have full access to them.

And probably the most dangerous part is that if the host itself becomes CPU-bound, it will be harder for us to manage a CPU crisis from the host. I've never had this happen, so I'm not entirely sure what it would be like.
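
A quick way to spot this from inside a guest (standard procps vmstat, nothing custom to our setup):

vmstat 5 3    # the last column, "st", is the percentage of time stolen by the hypervisor;
              # a consistently non-zero "st" means the guest isn't getting the CPU it was promised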

comment:22 Changed 2 months ago by https://id.mayfirst.org/jaimev

It seems we don't have any servers in crisis mode at the moment, but now that we are beginning to recognize the trends a little, I think there is still some more tweaking to be done to make sure that everyone has enough breathing room.

florence seems very tight on CPU and memory. Going through its VM list I see:

  • stoney and voltairine I was supposed to decommission a long time ago, so that's on me.
  • wolf doesn't seem to have any active sites or Maildirs. Could it also be decommissioned?
  • baldwin, which we don't have access to. It's running Debian 6! That doesn't sound good.
  • What's going on with jones and uws?
  • after decommissioning some of the above we can give some extra RAM to kahlo and viewsic

mumia could use another CPU

erica could use a little more RAM

hashmi could use a 2nd CPU and a little extra RAM

I need to look at the site activity on dorothy again

I would like to give even more RAM, and maybe an extra CPU, to both malcolm and ossie

The same goes for proudhon

There are more worth looking at but that is enough for today.

comment:23 Changed 2 months ago by https://id.mayfirst.org/jamie

Excellent work Jaime! This is really important progress. I've started a conversation with the members responsible for wolf. I agree, it looks like we will be able to decommission it, but I'm waiting to hear back.

The baldwin situation has gone on too long (see #12412). I've just shut it down. I will decommission it fully tomorrow if no planes fall out of the sky.

The uws situation is really messy because their resources are scattered all over the place. I know it will be a while... but I'm waiting for the new infrastructure plan to complete - at which point we will be able to much more easily and effectively consolidate their resources. If we try to do it now, it will be much more work for us. So, I'm hanging on with jones just a little longer...

I agree that continuing to tweak our resource allocations is a great idea and will only improve the situation.

comment:24 Changed 2 months ago by https://id.mayfirst.org/jamie

And... one last but important piece of good news now that we have gotten most things under control with our current hardware: the LC approved the purchase of two new physical servers, which are now in the works and should be delivered in early 2019! I've been working on pricing and have a quote based on a recent purchase I made for PTP (256GB RAM, two 1TB SSDs, two 6TB spinning disks, 32 cores).

The only caveat is this: we are at 75% of our electricity allotment. We can't go over about 80% without alarms going off at Telehouse (even though we are allowed to use 4kVA of electricity, we have to stay far enough below our max on a regular basis to accommodate normal spikes in operations).

In addition, we are nearly out of electric outlets in our PDU and ports in our switch.

So... we may need to retire one or possibly two existing servers to make space for the two new ones. This should still result in a significant boost in capacity.

Right now I'm eyeing clr because it is the only remaining physical host without hot-swappable disks. After clr, pietri seems like a reasonable target.

comment:25 Changed 2 months ago by https://id.mayfirst.org/jamie

For the record, I think we could remove a CPU from jones, and 2 CPUs and 4GB of RAM from bety.

comment:26 Changed 2 months ago by https://id.mayfirst.org/jaimev

So, just an update: last week on parsi I gave a whopping 16G of RAM and 6 CPU cores to ossie. That seems like a lot, but ossie is doing much better now, so I think it was worth it. julia also got 9G.

On wiwa, peery got 14G of RAM and proudhon got 12G and 4 CPU cores.

On linda I still haven't figured out where that big morning spike is coming from, but increasing resources across several VMs has helped reduce its impact: claudette 8G RAM / 4 CPUs, colin 6G RAM / 3 CPUs, erica 14G RAM / 4 CPUs, ella 8G RAM / 3 CPUs, hashmi 8G RAM / 3 CPUs, juanita 6G RAM / 2 CPUs, rivera 6G RAM / 3 CPUs, smith 6G RAM.

All of the above VMs seem to be doing much better now.

I am hoping that we will eventually get to a point where these adjustments aren't needed so frequently, but all of this also reminds me that [http://www.ganeti.org/ Ganeti], which has been highly recommended to us before, can make automatic recommendations for this kind of provisioning.

