Opened 3 years ago

Closed 3 years ago

#11607 closed Bug/Something is broken (fixed)

Intercontinental Cry offline again

Reported by: Ahni
Owned by: JaimeV
Priority: Urgent
Component: Tech
Keywords:
Cc:
Sensitive: no



I'm not too sure what's up, but the site is totally offline again. This is the second time in the last few days (that I know of).

Any chance someone can check this out?

Change History (24)

comment:1 Changed 3 years ago by Ahni

Well, the site's suddenly back! So I'm going to downgrade this, but if someone could take a sec to see what's up, that'd be awesome! thanks.

comment:2 Changed 3 years ago by Ahni

Priority: Urgent → Medium

comment:3 Changed 3 years ago by JaimeV

Owner: set to JaimeV
Status: new → assigned

It looks like Apache crashed on dorothy and needed to be restarted. Sorry about the downtime. I am still investigating.

comment:4 Changed 3 years ago by Ahni

no worries, Jaimev. I just want to make sure there's nothing wacky going on. The other incident happened early in the morning as well.

comment:5 Changed 3 years ago by JaimeV

Yes, there appears to be an issue with Apache crashing after log rotation; see #11535. We are still working on it.

comment:6 Changed 3 years ago by Ahni

Hey. IC's having problems again. Not sure if it's related to this, but for a couple days now, I'm periodically getting tonnes of blank pages on the site. One sec the site works, then it's blank.

comment:7 Changed 3 years ago by Jamie McClelland

Resolution: fixed
Status: assigned → feedback

I think I see what may be causing the problem. The server you are on has not been doing DNS lookups properly since last Saturday.

I just fixed it now. Can you keep us posted to let us know if you continue having the problem?



comment:8 Changed 3 years ago by Ahni

Resolution: fixed
Status: feedback → assigned

Will do!! Thanks Jamie

comment:9 Changed 3 years ago by Ahni

OK, the site's working much better now! I'm not getting any blank screens, but there are still some periodic delays (pages are taking 15-30 seconds to load; they usually load in less than 2).

comment:10 Changed 3 years ago by Jamie McClelland

I see that your site and another one on the same server had hit the limit of 12 PHP processes, which is probably why it feels slow. I also noticed that fail2ban (which bans IP addresses that might be abusing your site) was not running properly.

I've just started fail2ban and restarted all PHP processes.

comment:11 Changed 3 years ago by Ahni

The site's still running very slow. What does it mean that it's hit the limit of 12 PHP processes? Is this something I can address?

comment:12 Changed 3 years ago by Jamie McClelland

To prevent a single site from taking down the whole server, we restrict each site to just 12 parallel processes. Most sites only use 2 or 3. But yours is pegged to 12 constantly. I see a lot of search engine traffic, but am honestly stumped as to why it is so high.

comment:13 Changed 3 years ago by Ahni

Hmm, I'm not sure either (actually, I'm not even sure what "PHP processes" are). I don't think we're doing anything that would cause a burden to the server. We were running WooCommerce (which demands a lot of resources) until two weeks ago, but we finally got rid of it in our continued efforts to lighten the load as much as possible.

What might we do to figure this out? We're in the process of obtaining funds for a major overhaul and expansion of IC (which includes a complete redesign of the site and the development of a wordpress plugin for news orgs). We are also pursuing several major media projects that are going to bring a mountain of traffic to us.

It'd be great to get this situation pinned down before we start snowballing ;)

Maybe we can pick some time out in the next week or two and I can take the site down for maintenance to try to narrow things down?

comment:14 Changed 3 years ago by Benjamin Melançon

From the hiding-the-problem-more-than-fixing-it department: the vast majority of traffic will be anonymous, so there's no reason not to have the site behind a static cache. I'm sorry I missed the meeting that covered Deflect, but from my understanding it isn't for day-to-day caching. Varnish would do that, or Cloudflare for a free (in price) service.

Also worth checking analytics to see if there are strange URLs being requested, even when implementing the above, to at least check if there are repeated attempts that might get around caching by changing query strings.
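
As a rough illustration of that analytics check, here is a minimal Python sketch that groups requested URLs by path and counts how many different query strings each path received. The function name and the `min_variants` threshold are made up for this example, and it assumes you can export a plain list of requested URLs from your analytics or access log.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def paths_with_many_query_variants(urls, min_variants=20):
    """Return (path, count) pairs for paths requested with many distinct query strings.

    urls: iterable of requested URLs or request paths, e.g. "/?s=foo&x=1".
    min_variants: hypothetical threshold; tune it to the site's normal traffic.
    """
    variants = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        if parts.query:
            # Same path with a different query string usually bypasses a page cache.
            variants[parts.path].add(parts.query)
    flagged = {
        path: len(queries)
        for path, queries in variants.items()
        if len(queries) >= min_variants
    }
    # Highest number of distinct query strings first.
    return sorted(flagged.items(), key=lambda item: item[1], reverse=True)
```

A path that shows up near the top of this list with hundreds of throwaway query strings would be a candidate for the cache-busting traffic described above.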

comment:15 Changed 3 years ago by Jamie McClelland

Resolution: fixed
Status: assigned → closed

I have some more info on this one which I think may explain the problem.

First, though, in answer to your question about PHP processes.

Every time someone goes to your web site, a program is started on our server to fetch the page that was requested. Usually this takes less than a second. If more than one person hits your web site at the same time (or, more typically, one person makes multiple requests to your site at the same time) then our server is capable of launching more programs to handle these requests. However, we have a limit. Once you have 12 running processes (aka programs) to handle requests, our server stops creating new ones. New requests have to wait for an existing request to be completed before their page loads. That is what causes the slowness.
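
To make the effect of that limit concrete, here is a small Python simulation. It is only a sketch under simplifying assumptions (every request takes the same amount of time, and waiting requests are served in arrival order); it is not how the server itself is implemented, and the function name is made up for this example.

```python
import heapq


def simulate_wait_times(arrivals, service_time, max_workers=12):
    """Return how long each request waits before a process picks it up.

    arrivals: request arrival times in seconds, sorted ascending.
    service_time: seconds each request occupies a process (assumed constant).
    max_workers: cap on parallel processes (12 in this ticket).
    """
    busy_until = []  # min-heap of times at which busy processes become free
    waits = []
    for arrival in arrivals:
        # Drop processes that finished before this request arrived.
        while busy_until and busy_until[0] <= arrival:
            heapq.heappop(busy_until)
        if len(busy_until) < max_workers:
            start = arrival  # a process is free, so there is no waiting
        else:
            # All processes are busy: wait for the earliest one to finish.
            start = heapq.heappop(busy_until)
        heapq.heappush(busy_until, start + service_time)
        waits.append(start - arrival)
    return waits
```

With 13 requests arriving at the same moment and each taking 2 seconds, the first 12 start immediately and the 13th waits the full 2 seconds. If slow requests (such as login attempts) hold all 12 processes, every normal visitor behind them waits in the same way.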

When I first took a look at this problem, I noticed that our fail2ban program was not running. That program watches your logs and if it sees an IP address doing something suspicious (like trying to login to your site more than 4 or 5 times in a minute) it will ban the IP address.

This is important - not only to protect your site from being broken into, but also because requests to login take a longer time to process than normal requests. So - if an attacker is tying up 12 processes with their attempts to break into your site, then it slows everything down.

Once fail2ban was launched (and it usually takes a while for it to get up to speed) it banned two IP addresses that were attempting logins to your site. Now that those are banned, everything seems back to normal.
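
For the curious, the log-watching idea described above can be sketched in a few lines of Python. This is not fail2ban's actual code or configuration, just an illustration of the approach. It assumes the standard Apache access log layout and the WordPress login path `/wp-login.php`; the function name and the default thresholds (4 attempts in under 60 seconds) are choices made for this example.

```python
import re
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Assumed: Apache common/combined log format; only the fields needed here are parsed.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)'
)


def find_ips_to_ban(lines, login_path="/wp-login.php", max_attempts=4, window_seconds=60):
    """Return the set of IPs that POSTed to the login page too often.

    The log does not say whether a login succeeded, so every POST counts
    as an attempt.
    """
    window = timedelta(seconds=window_seconds)
    recent = defaultdict(deque)  # ip -> timestamps of recent login POSTs
    banned = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or match["method"] != "POST":
            continue
        if match["path"].split("?", 1)[0] != login_path:
            continue
        ts = datetime.strptime(match["ts"], "%d/%b/%Y:%H:%M:%S %z")
        attempts = recent[match["ip"]]
        attempts.append(ts)
        # Forget attempts that fall outside the sliding window.
        while ts - attempts[0] >= window:
            attempts.popleft()
        if len(attempts) >= max_attempts:
            banned.add(match["ip"])
    return banned
```

The real program does the equivalent continuously as new log lines arrive and then blocks the address at the firewall, which is why it needs a little time after starting before the bans take effect.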

I'm going to close for now... but please re-open the minute you find any other problems.


comment:16 Changed 3 years ago by Ahni

Hey Ben. We tried Varnish on IC last year but we ended up having to turn it off for some reason. Can't recall why now. I do usually have a caching plugin enabled but I disabled it this morning (along with a bunch of other non-essential plugins) while trying to figure out why things were so slow. It's back on now. Un-cached pages are still quite slow (before all these hang-ups started, IC would load in 2-3 seconds without the cache; under a second with caching and minify), but the site's moving waaaaay better now!

Now that fail2ban's on (thank you Jamie) I'm hoping this solves the problem, and another issue I've been having for several months now where, while in the admin, I get a "connection lost" message and can't load any pages.

I'll take a look at analytics and see if anything jumps out. If it does, I'll dropkick it (as in, come back here to report that something might be up).

Thanks guys.

comment:17 Changed 3 years ago by Ahni

So the site was ground to a near-halt again last night. I reinstalled WordFence and it identified 2 particular IP addresses that were hammering away at the login. I set it to lock out IPs that fail to login after 7 tries in 5 minutes. According to the logs this morning, both IPs were locked out over 20 times. Needless to say, they are now permanently banned.

Jamie, would fail2ban have caught these? And how long does it take to scan logs? (ie, should I keep WordFence?)

comment:18 Changed 3 years ago by Jamie McClelland

Thanks Ahni for this work - I think you should definitely keep WordFence (Drupal has a similar functionality that is part of core and enabled by default).

WordPress will always be able to do a better job than fail2ban in this area because WordPress can keep track of failed logins, whereas fail2ban only has access to attempted logins. In other words, fail2ban follows your web log - where it can see if an IP address posted a username and password to the login page, but the apache log doesn't record whether or not it worked.

So - with fail2ban we are very conservative and say if you make four attempts in less than 60 seconds we ban you (you might login as one user, then logout, or you might have multiple staff people using the same IP address). However, WordPress could say: if you fail more than 4 times in a full ten minute span you get banned since that is unlikely.
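
Here is a small Python sketch of the difference between the two approaches. The thresholds come from the numbers above (four attempts in under 60 seconds for the log-based rule, more than four failures within ten minutes for the application-based rule); the function names and the event format are invented for this example and do not correspond to fail2ban or WordFence settings.

```python
def exceeds_limit(timestamps, limit, window_seconds):
    """True if `limit` or more events fall within a span shorter than `window_seconds`."""
    times = sorted(timestamps)
    for i in range(len(times) - limit + 1):
        if times[i + limit - 1] - times[i] < window_seconds:
            return True
    return False


def log_based_ban(events):
    """Rule for a log watcher: it sees every login POST but not the outcome.

    events: list of (seconds, succeeded) tuples for one IP address.
    """
    return exceeds_limit([t for t, _ok in events], limit=4, window_seconds=60)


def app_based_ban(events):
    """Rule for the application: it knows which attempts failed.

    "More than 4" failures means 5 or more, counted over ten minutes.
    """
    failures = [t for t, ok in events if not ok]
    return exceeds_limit(failures, limit=5, window_seconds=600)
```

An attacker who paces themselves at one failed login every 100 seconds never trips the log-based rule, but the application-based rule catches them on the fifth failure. Going the other way, four staff members logging in successfully from one office within a minute would trip the log-based rule but not the application-based one.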

Keep us posted on the performance!

comment:19 Changed 3 years ago by Ahni

Will do, Jamie!

WordPress really needs to step up with its built-in security features. The Jetpack plugin comes with some good security add-ons but that plugin is such a memory pig. It's just not worth it.

We're also using "Block Bad Queries" on the site. I'm not sure if it does anything as it doesn't keep logs, but I might spring for the new premium version of the plugin which I believe keeps track of what it blocks.

NB: I also turned off public commenting on the site to deal with spam. The amount of fake comments coming at us just kept climbing (according to akismet, we've been hit 1.7m times). Enough's enough.

Next, for some future proofing, I'm going to figure out a light way to block bad bots. The guy that made "Block Bad Queries" just released a new plugin for this "Blackhole for Bad Bots", but it doesn't play nice with caching yet.

I might hide the login and admin pages too at some point (you can't attack what you can't find).

comment:20 Changed 3 years ago by Ahni

Priority: Medium → Urgent
Resolution: fixed
Status: closed → assigned


IC just utterly and completely died on me. I upgraded to the latest edition of WordPress and everything was ok. And then, "unable to connect".

comment:21 Changed 3 years ago by Jamie McClelland

Seems that Apache on your server stopped responding at about 6:30 this morning. I just started it and now I see "This site has been archived or suspended.", which seems to be coming from WordPress.

comment:22 Changed 3 years ago by Ahni

Thanks man. I'm seeing that message now too. Not sure what happened, but it looks like I have to do this

comment:23 Changed 3 years ago by Ahni

Ok, that did the trick. One of our sites is broken now, but at least the main site is ok. Upgrades, lol.

comment:24 Changed 3 years ago by Ahni

Resolution: fixed
Status: assigned → closed
