Opened 11 years ago

Closed 11 years ago

#309 closed Bug/Something is broken (fixed)

assata ran out of memory

Reported by: Jamie McClelland
Owned by: Jamie McClelland
Priority: Urgent
Component: Tech
Keywords: assata.mayfirst.org mailman RAM
Cc:
Sensitive: no

Description

assata, the bulk email server, ran out of memory (and swap).

I discovered this thanks to nagios. After being notified of service failures, I logged into the console (assata is hosted on the physical server sontag) and saw a lot of out-of-memory errors scrolling on the screen.

I logged in and ran ps, vmstat, and top (redirecting the output to files in the root directory).

Since this server's only critical services are postfix, apache, and mailman, I suspect that a few messages were sent to large email lists at the same time.

The problem seems to have subsided. I'm restarting apache, postfix, and mailman to ensure that they are running cleanly.
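(For reference, the restarts amount to something like this, assuming the stock Debian init scripts of the era; exact service names may differ:)

# restart the three critical services (Debian-style init scripts)
/etc/init.d/postfix restart
/etc/init.d/apache2 restart
/etc/init.d/mailman restart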

I'm leaving this open to investigate further in the future.

Change History (11)

comment:1 Changed 11 years ago by Daniel Kahn Gillmor

We've got no munin-style monitoring of assata. Do you want that?

comment:2 Changed 11 years ago by Jamie McClelland

Yes - could you add that?

comment:3 Changed 11 years ago by Daniel Kahn Gillmor

Owner: changed from Jamie McClelland to Daniel Kahn Gillmor
Status: new → assigned

sure. working on it now.

comment:4 Changed 11 years ago by Daniel Kahn Gillmor

Owner: changed from Daniel Kahn Gillmor to Jamie McClelland
Status: assigned → new

I've added munin monitoring to assata. It should be visible from the usual place. Reassigning to jamie, since this ticket appears to be about much more than munin monitoring.
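For the record, this is roughly the standard Debian munin setup (a sketch; the addresses below are placeholders, not our actual configuration):

# on assata: install the munin node and allow the munin master to poll it
apt-get install munin-node
echo 'allow ^192\.0\.2\.10$' >> /etc/munin/munin-node.conf   # placeholder master address
/etc/init.d/munin-node restart

# on the munin master: add a host section to /etc/munin/munin.conf
#   [assata.mayfirst.org]
#       address 192.0.2.20    # placeholder node address
#       use_node_name yes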

comment:5 Changed 11 years ago by Daniel Kahn Gillmor

Keywords: assata.mayfirst.org mailman RAM added
Priority: Medium → Urgent

assata remains heavily overcommitted on RAM, seems to have significant amounts of swap space allocated, and is actively swapping as well. The top memory consumers, sorted by ps's RSS column, are (very long-running!) mailman processes:

0 assata:~# ps -eFH | head -n1 && ps -eFH | sort -n -k 6 | tail
UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
postfix  32588  1279  0  1241  2184   0 01:14 ?        00:00:00     smtp -t unix -u -c
postfix  31900  1279  0  1298  2300   0 01:04 ?        00:00:00     smtpd -n smtp -t inet -u -c
root     32556  1304  0  1928  2372   0 01:13 ?        00:00:00     sshd: root@ttyp1 
postfix  32542  1279  1  1299  2388   0 01:11 ?        00:00:06     smtpd -n smtp -t inet -u -c
list      1253  1203  0  4269 11216   0 Jan06 ?        00:00:10     /usr/bin/python /var/lib/mailman/bin/qrunner --runner=ArchRunner:0:1 -s
list      1254  1203  0 40066 92396   0 Jan06 ?        00:50:47     /usr/bin/python /var/lib/mailman/bin/qrunner --runner=BounceRunner:0:1 -s
list      1256  1203  1 45314 100348  0 Jan06 ?        06:46:18     /usr/bin/python /var/lib/mailman/bin/qrunner --runner=IncomingRunner:0:1 -s
list      1255  1203  0 34885 111228  0 Jan06 ?        00:47:11     /usr/bin/python /var/lib/mailman/bin/qrunner --runner=CommandRunner:0:1 -s
list      1258  1203  0 44617 125864  0 Jan06 ?        02:33:52     /usr/bin/python /var/lib/mailman/bin/qrunner --runner=OutgoingRunner:0:1 -s
list      1259  1203  1 41258 129096  0 Jan06 ?        06:14:29     /usr/bin/python /var/lib/mailman/bin/qrunner --runner=VirginRunner:0:1 -s
0 assata:~# 

I moved the snapshotted system states that you stored back in December to assata:/root/ticket-309/2007-12-11, and created a new directory to store similar snapshots from today (assata:/root/ticket-309/2008-01-30):

0 assata:~/ticket-309/2008-01-30# ps -eFH > ps.out
0 assata:~/ticket-309/2008-01-30# vmstat 1 5 > vmstat.out
0 assata:~/ticket-309/2008-01-30# top -n 1 > top.out
1 assata:~/ticket-309/2008-01-30#

Comparing the ps -eFH output in particular, it looks like the system was also suffering from memory exhaustion back then, driven by some very long-running mailman processes. Maybe mailman is leaking RAM? The long-term RAM graphs themselves look pretty scary.

I'm raising the priority because I think this needs to be investigated in the near future, though I'm not sure what next steps to take.

I suppose we could restart mailman and see if the memory consumption goes down and stays down. But looking at leslie.mayfirst.org (also running mailman), it is using similar amounts of RAM for its mailman queue runners (but not hitting trouble, mainly because it actually has over twice as much RAM as assata).
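If we try that experiment, something along these lines would show whether the qrunners' resident sizes creep back up afterwards (a sketch; the ps invocation is just one way to watch it):

/etc/init.d/mailman restart
# then periodically check the qrunner processes' resident set sizes
ps -C python -o pid,rss,etime,args --sort=rss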

comment:6 Changed 11 years ago by Jamie McClelland

When comparing ps -eFH output, I see what you mean about memory use (I had to look up what the RSS column is: resident set size, the amount of non-swapped physical memory a process is using).

Do you think leakage is the problem? The munin charts seem to show consistent memory usage since the last restart in early January - I would expect to see a steady upward slope rather than the mostly flat graphs we have. Based on the munin graphs, I would expect restarting mailman to simply bring memory usage back to its current level.

(I did find an old discussion about bounce runner memory leakage. It may not be directly relevant to this situation, but it does demonstrate the relationship between RAM usage and the size of the files the processes are reading.)

I looked through /var/lib/mailman/Mailman/Defaults.py and noticed that we can control the number of runners that are started, but the default is already one of each, so we can't exactly reduce them.
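For anyone following along, the setting in question is the QRUNNERS list; here is a sketch of how to inspect it (the mm_cfg.py override mentioned in the comments is hypothetical):

# show the runner list; each entry is a ('RunnerName', process-count) pair
grep -n -A 10 '^QRUNNERS' /var/lib/mailman/Mailman/Defaults.py
# site-local overrides belong in /var/lib/mailman/Mailman/mm_cfg.py,
# e.g. a hypothetical QRUNNERS = [('OutgoingRunner', 2), ...] to run two outgoing slices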

comment:7 Changed 11 years ago by Daniel Kahn Gillmor

Yeah, the charts of memory usage do show consistent RAM usage, not a slow leak. But those graphs were started after the last reboot, so they wouldn't show a fast leak, either. Why is assata so overcommitted on RAM (and consistently using more than 40% of its swap)?

Perhaps having more than one runner started would actually reduce the memory requirements for each runner? I don't know how multiple runners communicate or interoperate, so I don't know how to be sure we're getting the lowest resource consumption from them.

But the fact remains that we're running assata with significantly overcommitted memory, and not even much of a swap margin, should a new spike come in. We need to either boost the RAM available to the system, or reduce the size of the working set.
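The quick way to gauge that pressure is with the standard tools (a sketch; nothing assata-specific here):

free -m       # overall RAM and swap usage, in MB
vmstat 5 5    # watch the si/so columns for active swap-in/swap-out
swapon -s     # per-device swap usage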

comment:8 Changed 11 years ago by Jamie McClelland

I think we should boost the RAM - it's relatively easy and will hopefully solve the immediate problem.

Sontag (the Xen server assata is running on) only has 149 MB of RAM available:

0 sontag:~# xm info | grep mem
total_memory           : 4095
free_memory            : 149
0 sontag:~#

We probably should not try to use up every last MB.

However, I think we have over-allotted memory to harry (1000 MB). Assata has been allocated 768 MB.

0 sontag:~# xm list
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0      128     2 r-----  19285.0
allende                                    6      128     1 -b----   2675.7
assata                                     4      768     1 -b----  56885.4
harry                                      2     1000     1 -b----   5359.9
katanko                                    5      128     1 -b----    915.1
moses                                      3     1000     2 -b----  16468.2
stallman                                   1      750     1 -b----  59107.4
0 sontag:~# 

I would propose that we shrink harry to 768 MB and boost assata to 1000 MB. How does that sound? I could do that as part of the next kernel update/reboot.
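The change itself would look roughly like this (a sketch; the /etc/xen/ config file names are assumptions about our setup):

# persistent change: edit the memory= line in each domain's Xen config,
# taking effect the next time the domains are restarted (file names assumed)
sed -i 's/^memory.*/memory = 768/'  /etc/xen/harry.cfg
sed -i 's/^memory.*/memory = 1000/' /etc/xen/assata.cfg

# optional live (non-persistent) adjustment of the running domains;
# note that mem-set cannot go above a domain's configured maxmem,
# so the boost for assata may have to wait for the restart anyway
xm mem-set harry 768
xm mem-set assata 1000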

comment:9 Changed 11 years ago by Daniel Kahn Gillmor

Looking at the munin graphs, assata seems to be consistently using over 1 GB of RAM, so I'm not sure that your proposed change would actually let assata avoid swap.

I agree that harry has way more RAM than it needs, though. You could drop harry to 0.5 GB and I don't think it would have any adverse effect.

comment:10 Changed 11 years ago by Jamie McClelland

Ok - sounds good to me. Let's do this during the upgrade.

comment:11 Changed 11 years ago by Jamie McClelland

Resolution: fixed
Status: new → closed

Whoops - we missed this during the upgrade. And we just got an email from the War Times saying they are planning on sending to their 500,000-person list tonight (which is hosted on assata).

So - I just made the changes. I lowered both moses (running this ticket tracking system) and harry (running the members.mayfirst.org web site and openid) from 1,000 MB of RAM to 512 MB. I then boosted assata's RAM from 768 MB to 1536 MB (we have a couple hundred MB to spare). I also increased assata's swap from 512 MB to 1 GB.

Sontag currently has allocated:

0 sontag:~# xm list
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0      128     2 r-----  34497.5
allende                                    1      128     1 -b----   6991.5
assata                                    10     1536     1 -b----     20.1
harry                                      7      512     1 -b----      8.9
katanko                                    4      128     1 -b----   1452.4
moses                                      8      512     2 -b----     31.1
stallman                                   6      750     1 -b---- 110895.0
0 sontag:~#

I did the following to boost the swap.

On Sontag:

0 sontag:~# lvresize --size 1G vg_sontag0/assata-swap
  Extending logical volume assata-swap to 1.00 GB
  Logical volume assata-swap successfully resized
0 sontag:~#

On Assata:

0 assata:~# cat /proc/swaps 
Filename				Type		Size	Used	Priority
/dev/sda2                               partition	524280	0	-1
0 assata:~# swapoff -a
0 assata:~# cat /proc/swaps 
Filename				Type		Size	Used	Priority
0 assata:~# mkswap /dev/sda2
Setting up swapspace version 1, size = 1073737 kB
no label, UUID=fd087189-119e-4d4e-b599-59d659303439
0 assata:~# swapon -a
0 assata:~# cat /proc/swaps 
Filename				Type		Size	Used	Priority
/dev/sda2                               partition	1048568	0	-2
0 assata:~#
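A couple of sanity checks afterwards (a sketch; since mkswap generated a new UUID, this just confirms the resize took and that fstab refers to the device rather than a UUID):

swapon -s                # should list /dev/sda2 at roughly 1 GB
grep -i swap /etc/fstab  # confirm the entry references /dev/sda2, not a stale UUID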
