Opened 2 weeks ago

Last modified 41 hours ago

#14261 assigned Bug/Something is broken

manage disk i/o problems

Reported by: Owned by:
Priority: Medium Component: Tech
Keywords: Cc:
Sensitive: no

Description

Over the last few weeks we have been plagued with disk i/o problems - specifically on peery (which now seems ok) and more recently on chavez (which was pretty bad as recently as yesterday).

I'm working on a few ideas.

First - chavez. Last night I made three changes:

  1. Doubled the CPUs from 2 to 4
  2. Increased RAM from 8GB to 12 GB
  3. Changed dovecot to fsync less often by adding /etc/dovecot/conf.d/99-mfpl-fsync.conf with:

mail_fsync = never

protocol lda {
  # Enable fsyncing for LDA
  mail_fsync = optimized
}

protocol lmtp {
  # Enable fsyncing for LMTP
  mail_fsync = optimized
}

Attachments (4)

cpu-day-chavez.png (37.0 KB) - added by 2 weeks ago.
gaspar-cpu.png (55.3 KB) - added by 2 weeks ago.
cpu-day-chavez.2.png (37.0 KB) - added by 2 weeks ago.
viewsic-cpu.png (40.1 KB) - added by 2 weeks ago.


Change History (29)

comment:1 Changed 2 weeks ago by

  • Owner set to
  • Status changed from new to assigned

comment:2 Changed 2 weeks ago by

I'm not sure which change made the biggest difference, but so far so good.

comment:3 Changed 2 weeks ago by

I'm also experimenting on octavia with putting the index files in a tmpfs. However, it doesn't seem to be creating index files there.

comment:4 Changed 2 weeks ago by

Now, indices on the tmpfs seem to be working.

I calculated total size of all existing indices with:

find /home/members/ -path '/home/members/*/sites/*/users/*/Maildir/*' -name 'dovecot.index*' -print0 | du --files0-from=- -ch

On octavia, it was 299MB, so I made a tmpfs that is 600MB in size by adding this to fstab:

tmpfs /var/lib/dovecot-indices tmpfs size=600M,mode=1777 0 0

and then mounting it:

mount /var/lib/dovecot-indices

Then, I added /etc/dovecot/conf.d/99z-mfpl-indices.conf with the content:

mail_location = maildir:~/Maildir:INDEX=/var/lib/dovecot-indices/%u

And I restarted dovecot.
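The sizing above (a 600MB tmpfs for 299MB of measured indices) amounts to roughly doubling the measured total for headroom. A minimal sketch of that rule - the round-up-to-100MB step is my assumption, not something stated in the ticket:

```shell
# Pick a tmpfs size with ~100% headroom over the measured index total,
# rounded up to the next 100MB (the rounding granularity is an assumption).
index_mb=299   # total from the find|du pipeline above
tmpfs_mb=$(( ((index_mb * 2 + 99) / 100) * 100 ))
echo "size=${tmpfs_mb}M"   # → size=600M
```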

comment:5 Changed 2 weeks ago by

  • Description modified (diff)

comment:6 Changed 2 weeks ago by

Other candidates for testing these options based on munin graphs of disk i/o are: gaspar, malcolm, june, marx, mumia, proudhon, rose, viewsic.

On chavez - last night - I allocated 15GB of SSD space (so that could be an option for indices if we need more space than we can realistically provide with a tmpfs).

I'm currently calculating the total size of indices on both chavez and gaspar to get a sense of how much space is in use on a highly used shared mosh.

I've implemented just the fsync changes on viewsic.

comment:7 Changed 2 weeks ago by

On gaspar, the total indices come to 458MB, so I just created a 900MB tmpfs and switched dovecot to use it.

Now we have viewsic using the fsync approach and gaspar using the indices approach. After a day, we can compare each server's disk i/o today with its disk i/o yesterday and see which change made the bigger impact.
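As a sketch of that comparison, with made-up iowait percentages for the two days (the real numbers would come from munin or sar, not from these literals):

```shell
# Day-over-day change in time spent in iowait.
# Both values are fabricated for illustration.
yesterday=18.4   # % CPU in iowait, day before the change
today=17.9       # % CPU in iowait, day after the change
awk -v a="$yesterday" -v b="$today" 'BEGIN{printf "change=%+.1f%%\n", (b-a)/a*100}'
# → change=-2.7%
```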

comment:8 Changed 2 weeks ago by

Chavez total indexes amount to 3.9GB - too much for a tmpfs, I think, but the SSD is an option.

comment:9 Changed 2 weeks ago by

So far, chavez is operating extremely well today after the changes, but neither viewsic nor gaspar seems to show any difference in the amount of CPU time spent in disk i/o, which is a disappointment and suggests that tweaking dovecot settings may not be a big factor after all.

I'm not sure whether the RAM or the CPU has made the bigger difference with chavez, so our next experiment may be to boost the CPU on viewsic and the ram on gaspar and see which makes a bigger difference.

comment:10 Changed 2 weeks ago by

viewsic is now operating with 4 CPUs (twice what it had before) and gaspar is operating with 9GB of RAM (50% more than before).


comment:11 Changed 2 weeks ago by

Adding two more CPUs makes for much more idle time on viewsic.

But adding the RAM actually reduces disk i/o on gaspar.

Despite the significant re-allocation of RAM a few months ago... adding yet more RAM seems to work the best in terms of reducing disk i/o.

comment:12 Changed 2 weeks ago by

I've just added the fsync dovecot configuration to puppet, so it should go out to all servers on the next signed tag. I have manually pushed it to ossie, which is a good candidate for testing because so much of its i/o is imap-related (according to resourcehog).

It has 1.6GB of indices, so we could move those to a tmpfs, but we would need to allocate more RAM first.

comment:13 Changed 2 weeks ago by

I'm not seeing any noticeable differences with the fsync settings on ossie.

I just created a 3GB tmpfs on ossie. It seems excessive, but ossie was already allocated 12GB of RAM, so the memory is there.

Let's see how that affects things.

comment:14 Changed 2 weeks ago by

I've increased proudhon's RAM from 8GB to 10GB to see if a minimal amount of additional RAM is enough to make a difference in disk i/o there. We don't have that many resources left on wiwa.

comment:15 Changed 4 days ago by

I'm really happy about the new munin graphs.

Based on what I'm seeing there, I would like to add another CPU to ella; it currently has only one.

Also, looking at the graphs I can see that all VMs on wiwa get hit by some gnarly spikes in CPU and disk i/o coming together, but wiwa as a whole doesn't get it so bad, which I think means some more resources can be added to the VMs. I just need to figure out which is the best candidate.

I'm also seeing a similar pattern on parsi, although there it does seem that parsi as a whole is closer to its resource limits.

comment:16 Changed 4 days ago by

I agree with your analysis - over the last few weeks I have been applying similar logic to add cpus and memory to certain servers.

comment:17 Changed 4 days ago by

On wiwa, peery and malcolm are the only ones that stick out as far as disk reads and memory usage go. Within each of those VMs I am not really seeing clear indications of a specific user or process being responsible.

CPUs on wiwa are already overcommitted, I think, but there is about 18GB of RAM left. I am going to try dedicating another 4GB to peery and 2GB to malcolm to see what happens.

comment:18 Changed 4 days ago by

On parsi, CPU spikes are the worst on ossie, julia, and lewis. The spikes coincide, and it's not easy to tell from the VMs themselves if any particular user or process is responsible. ossie and julia have the highest disk i/o, so my best guess is to add more RAM to each of them. ossie has already received an increase in RAM, but it is one of our biggest hosts. The VM randolph on parsi is not actually in use anymore. I am going to shut it down until we can decide if it can be officially decommissioned.

comment:19 Changed 3 days ago by

Nice work Jaime! And yes, it does look like randolph can be properly decommissioned.

I am still trying to work out how to tell if a host is maxed out on CPUs, and I'm not sure wiwa is maxed out judging from the munin chart. However, I think adding more RAM is a better option.

I'm also a bit confused by ossie, but am certain that it's imap related (resourcehog --resource read --include-commands consistently points to imap). Since ossie has the dovecot indices on a tempfs I'm not sure what else to do.

As we work on our new infrastructure plan.... two things that would be helpful are:

  • Being able to easily move a single mailbox to a different host
  • Being able to move a single mailbox to an ssd

We'll be able to do both once we complete our transition...

comment:20 Changed 3 days ago by

peery is looking a lot better after receiving 4GB more RAM. ossie, julia, and malcolm each received 2GB; brief spikes are still visible there, but they look better than before.

Fixing a wordpress site on gaspar in #14314 made a huge improvement there.

It still seems too soon to tell if the extra CPU has made a significant change on ella.

I mentioned overcommitting CPUs on wiwa because I noticed there are 24 CPU cores on wiwa but we have committed a total of 38 vCPUs across the VMs there.
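For illustration, here is what that overcommit check looks like as arithmetic. The per-guest vCPU counts below are made up (only their total, 38, comes from this ticket); in practice the counts would come from the hypervisor, e.g. per-domain output of virsh dominfo:

```shell
# Hypothetical per-guest vCPU counts; only the total (38) and the core
# count (24) are taken from the ticket.
cores=24
alloc=$(printf '6 6 4 4 4 4 2 2 2 2 2\n' | awk '{for(i=1;i<=NF;i++)s+=$i} END{print s}')
awk -v a="$alloc" -v c="$cores" 'BEGIN{printf "allocated=%d cores=%d ratio=%.2f\n", a, c, a/c}'
# → allocated=38 cores=24 ratio=1.58
```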

comment:21 Changed 3 days ago by

Awesome! Glad to see things start to smooth out.

As for allocating CPUs - over-allocating like this might be just fine and a good strategy for us to take. Here is my understanding (which may not be 100% accurate):

Given 24 CPUs...

If we had four guests with 6 (or more) CPUs each, every guest could fully use all the CPUs allocated to it. If that happens, the host will "steal" CPU cycles, and that shows up in vmstat as "stolen" CPU cycles on the guest. That means the guest didn't get CPU cycles it was promised (munin marks this in red). If we carefully ensure that we never allocate more CPUs than we have available, then we can be 100% certain that all guests will always get the CPU usage they are promised.
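To spot this on a guest, the "st" (steal) column is the last field of a vmstat data line. A small sketch, using a fabricated sample line rather than live output:

```shell
# The last field of a vmstat data line is "st" (steal time). This sample
# line is fabricated; on a real guest you would pipe `vmstat 5` through
# the same awk. Nonzero steal means the hypervisor withheld promised cycles.
sample="1  0      0 812340  94720 3012876    0    0     5    22   60  110  3  1 92  0  4"
echo "$sample" | awk '{print "steal=" $NF "%"}'
# → steal=4%
```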

This approach sounds great but... it is really inefficient. If one guest occasionally spikes and needs more than 6 CPUs, it will get constrained because it only has 6 CPUs. This is true even if the other three guests aren't using their CPUs at all.

The nice part about over-allocating is that we make better use of the CPUs. The downside is that we can't guarantee that the guests with allocated CPUs will always have full access to them.

And, probably the most dangerous part is that if the host becomes CPU-bound, it will be harder for us to manage a CPU crisis via the host. I've never had this happen, so I'm not entirely sure what it would be like.

comment:22 Changed 3 days ago by

It seems we don't have any servers in crisis mode at the moment, but now that we are beginning to recognize the trends a little, I think there is still some more tweaking to be done to make sure that everyone has enough breathing room.

florence seems very tight on CPU and memory. Going through its VM list I see:

  • stoney and voltairine I was supposed to decommission a long time ago, so that's on me.
  • wolf doesn't seem to have any active sites or Maildirs. Could it also be decommissioned?
  • baldwin we don't have access to. It's running Debian 6! That doesn't sound good.
  • What's going on with jones and uws?
  • After decommissioning some of the above, we can give some extra RAM to kahlo and viewsic.

mumia could use another CPU.

erica could use a little more RAM.

hashmi could use a second CPU and a little extra RAM.

I need to look at the site activity on dorothy again.

I would like to give even more RAM, and maybe an extra CPU, to both malcolm and ossie.

The same goes for proudhon.

There are more worth looking at but that is enough for today.

comment:23 Changed 2 days ago by

Excellent work Jaime! This is really important progress. I've started a conversation with the members responsible for wolf. I agree, it looks like we will be able to decommission it, but I'm waiting to hear back.

The baldwin situation has gone on too long (see #12412). I've just shut it down. I will decommission it fully tomorrow if no planes fall out of the sky.

The uws situation is really messy because their resources are scattered all over the place. I know it will be a while... but I'm waiting for the new infrastructure plan to be complete, at which point we will be able to consolidate their resources much more easily and effectively. If we try to do it now, it will be much more work for us. So, I'm hanging on with jones just a little longer...

I agree that continuing to tweak our resource allocations is a great idea and will only improve the situation.

comment:24 Changed 2 days ago by

And... one last but important piece of good news now that we have gotten most things under control with our current hardware: the LC approved the purchase of two new physical servers, which are now in the works and should be delivered in early 2019! I've been working on pricing and have a quote based on a recent purchase I made for PTP (256GB RAM, two 1TB SSDs, two 6TB spinning disks, 32 cores).

The only caveat is this: we are at 75% of our electricity allowance. We can't go over about 80% without alarms going off at Telehouse (even though we are allowed to use 4KVA of electricity, we have to stay far enough below our max on a regular basis to accommodate normal spikes in operations).
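Spelled out, the headroom implied by those figures (75% current draw, 80% alarm threshold, 4KVA limit) is quite thin:

```shell
# Simple headroom arithmetic from the figures in the comment above.
awk 'BEGIN{
  limit=4.0                     # KVA allowance at Telehouse
  cur=0.75*limit                # current draw
  alarm=0.80*limit              # alarm threshold
  printf "current=%.1fKVA alarm=%.1fKVA headroom=%.1fKVA\n", cur, alarm, alarm-cur
}'
# → current=3.0KVA alarm=3.2KVA headroom=0.2KVA
```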

In addition, we are nearly out of electric outlets in our PDU and ports in our switch.

So... we may need to retire either one or possibly two existing servers to make space for the two new servers. This should still result in a significant boost in capacity.

Right now I'm eyeing clr because it is the only remaining physical host without hot-swappable disks. After clr, pietri seems like a reasonable target.

comment:25 Changed 41 hours ago by

For the record, I think we could remove a CPU from jones, and two CPUs and 4GB of RAM from bety.
