Opened 5 years ago

Closed 3 years ago

#9915 closed Bug/Something is broken (fixed)

alp.org site and email are down

Reported by: Jamila Khan
Owned by: Jamie McClelland
Priority: Medium
Component: Tech
Keywords: viewsic.mayfirst.org
Cc: cpage, jeremyb, Ross
Sensitive: no

Description

The ED notified me via text, and monitor.mayfirst.org reported that http was down on albizu. Nagios now says that http is back, but neither the site nor email is working.

Change History (9)

comment:1 Changed 5 years ago by jeremyb

Cc: jeremyb, Ross added

comment:2 Changed 5 years ago by Joseph

Owner: set to Jamie McClelland
Priority: High → Medium
Status: new → assigned

I learned what flapping was in Nagios parlance, which is what several services on viewsic were doing.

Aug 19 16:21:28 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:21:28 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
...
Aug 19 16:23:13 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:23:15 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:23:15 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:24:25 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:24:25 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
...
Aug 19 16:25:54 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
...
Aug 19 16:26:57 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:26:57 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:26:57 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:26:57 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:27:10 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:27:10 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:27:10 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:27:10 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:27:11 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:27:11 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:27:15 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:27:15 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:27:15 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:27:15 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:28:15 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:28:15 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:28:15 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:28:15 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
Aug 19 16:29:15 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100
...

This went on for about an hour and a half until I banned the IP outright. The start of the service flapping correlates roughly with when the member reported the problem.
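To gauge how concentrated these rejections are, the log lines can be tallied per source address and per minute. The following is a minimal sketch, not the method used on this ticket: the function name `count_limit_hits` and the regular expression are assumptions based only on the log format shown above.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpt above, e.g.
# "Aug 19 16:21:28 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100"
LIMIT_RE = re.compile(
    r"^(?P<minute>\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+\S+\s+pop3d-ssl: "
    r"Maximum connection limit reached for (?P<addr>\S+)"
)


def count_limit_hits(lines):
    """Return (hits per address, hits per (minute, address)) for matching lines."""
    per_addr = Counter()
    per_minute = Counter()
    for line in lines:
        m = LIMIT_RE.match(line)
        if m is None:
            continue  # skip unrelated log lines and "..." elisions
        addr = m.group("addr")
        # Drop the IPv4-mapped IPv6 prefix so the address matches iptables output.
        if addr.startswith("::ffff:"):
            addr = addr[len("::ffff:"):]
        per_addr[addr] += 1
        per_minute[(m.group("minute"), addr)] += 1
    return per_addr, per_minute


# Two lines copied from the excerpt above, used as sample input.
sample = [
    "Aug 19 16:21:28 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100",
    "Aug 19 16:23:13 viewsic pop3d-ssl: Maximum connection limit reached for ::ffff:71.3.233.100",
]
per_addr, per_minute = count_limit_hits(sample)
print(per_addr.most_common(5))
```

In practice the input would be the lines of the mail log rather than a hard-coded list. A single address dominating the tally only shows where the rejected connections come from; it does not by itself show that the address is abusive.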

0 viewsic:~# iptables -A INPUT -s 71.3.233.100 -j DROP
0 viewsic:~# iptables -L INPUT -v -n
Chain INPUT (policy ACCEPT 722 packets, 39686 bytes)
 pkts bytes target     prot opt in     out     source               destination
  32M 2760M fail2ban-web-loose  tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            multiport dports 80,443
  20M 1558M fail2ban-courierlogin  tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            multiport dports 110,143,993,995
74549   86M fail2ban-sasl  tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            multiport dports 587
 820K   79M fail2ban-ssh  tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            multiport dports 22
    0     0 DROP       all  --  *      *       71.3.233.100         0.0.0.0/0
0 viewsic:~#

After I banned the IP address, the service interruptions appear to have stopped.

There's still a fair amount of what appears to be spam in the mail queue, sent from a couple of accounts in particular. Most of it is being held because the receiving server rejected it. What's the policy for dealing with potentially compromised accounts? I'm assigning this to Jamie for this latter question, as I think he's the keeper of the keys for everything mail infrastructure these days.

comment:3 Changed 5 years ago by Jamie McClelland

At the moment, viewsic is experiencing a lot of disk I/O - and judging from this report, I suspect that was the fundamental problem yesterday.

I'm guessing it's the disk I/O that is making web page loads very slow. And, because of the disk I/O, POP requests are taking a very long time to complete - so I suspect there are a lot of email clients that just keep sending more POP requests until we hit the maximum number of allowed connections.
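The reasoning above can be made concrete with Little's law: the average number of simultaneously open connections equals the arrival rate multiplied by how long each connection stays open. The sketch below uses made-up numbers (6 POP checks per minute from one address, a per-address limit of 4) purely for illustration; none of these values were measured on viewsic.

```python
def average_open_connections(checks_per_minute, seconds_per_check):
    """Little's law: average concurrency = arrival rate * time each request stays open."""
    return checks_per_minute * seconds_per_check / 60.0


def exceeds_limit(checks_per_minute, seconds_per_check, per_address_limit):
    """True if the average number of open connections is above the limit."""
    return average_open_connections(checks_per_minute, seconds_per_check) > per_address_limit


# Hypothetical values, chosen only to illustrate the effect.
CHECKS_PER_MINUTE = 6   # POP checks arriving from one shared address
PER_ADDRESS_LIMIT = 4   # assumed per-address connection cap

for seconds in (2, 20, 60):  # e.g. idle disk, busy disk, saturated disk
    avg = average_open_connections(CHECKS_PER_MINUTE, seconds)
    over = exceeds_limit(CHECKS_PER_MINUTE, seconds, PER_ADDRESS_LIMIT)
    print(f"{seconds:>3}s per check: {avg:.1f} open on average, over limit: {over}")
```

With the arrival rate held constant, slower requests alone are enough to push an address past its cap. Clients that retry after a timeout raise the arrival rate as well, which makes the effect worse.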

The IP address you banned seems to be used by legitimate users - so I've just re-enabled it.

I'm trying to figure out if there's a way to reduce the disk I/O on viewsic...

comment:4 Changed 5 years ago by Jamie McClelland

I've shut down Apache to reduce the memory problems. I've also emailed the user with the most reads from yesterday and today and asked if we could reduce their inbox size (over 13,000 messages).
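Large inboxes matter here because a POP session typically has to list every message in the mailbox. Below is a minimal sketch for finding the largest inboxes, assuming mail is stored in Maildir format (the layout Courier serves) and that the caller supplies the Maildir paths. The function names are illustrative, and this is not necessarily how the user above was identified, since that was done by read activity rather than message count.

```python
import os


def count_maildir_messages(maildir):
    """Count delivered messages in a Maildir (files under cur/ and new/)."""
    total = 0
    for sub in ("cur", "new"):  # tmp/ holds in-progress deliveries, so skip it
        path = os.path.join(maildir, sub)
        if not os.path.isdir(path):
            continue
        with os.scandir(path) as entries:
            total += sum(1 for entry in entries if entry.is_file())
    return total


def largest_inboxes(maildirs, top=5):
    """Return up to `top` (message_count, maildir) pairs, largest first."""
    counts = [(count_maildir_messages(m), m) for m in maildirs]
    return sorted(counts, reverse=True)[:top]
```

Calling `largest_inboxes(paths, top=3)` on a list of Maildir paths returns the three biggest inboxes. Where the Maildirs live on a given server is site-specific and is deliberately left to the caller.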

comment:5 Changed 5 years ago by Jamie McClelland

Now that the system has returned to normal, I've restarted Apache and will keep monitoring.

comment:6 Changed 5 years ago by Jamie McClelland

We're still getting fluctuations in performance. I've reduced the inbox of one user and have requested the same of two more users.

comment:7 Changed 5 years ago by Jamila Khan

Thank you Jamie!

If the users in question are at alp.org, I am on site now and it's my job to help them with their email. Let me know if there are things I should do.

comment:8 Changed 5 years ago by jeremyb

Keywords: viewsic.mayfirst.org added

comment:9 Changed 3 years ago by Jamila Khan

Resolution: fixed
Status: assigned → closed

This is old, and I opened it. Closing.
