Opened 11 years ago

Closed 11 years ago

#716 closed Bug/Something is broken (fixed)

~1 hour outage of Viewsic due to apache overload from a single IP address

Reported by: alfredo
Owned by: Jamie McClelland
Priority: High
Component: Tech
Keywords: viewsic.mayfirst.org apache load
Cc:
Sensitive: no

Description

Viewsic has been going in and out of critical state since about 11:30 pm -- the last half hour. The server itself appears to be okay in terms of load, etc., but Apache is really getting hit! Its processes completely cover top; nothing else even shows! It looks a lot like a DoS, although that's pure reaction on my part.

I have tried to pin the problem down but can't because, at this point, the command-line response from the box is so laggy that I can't communicate with it intelligently at all. Commands sometimes don't even show up for about a minute!

So I'm not even able to restart apache from here.

Change History (15)

comment:1 Changed 11 years ago by alfredo

At 12:40 am, the box returned to normal. The problem appears to be a Drupal site, since the processes that were abundant belonged to www-data. It wasn't any of the email or anti-spam processes, although those may have been competing; but, as I said, I didn't see anything that looked like a problem box-wide. Swap looked okay, CPU was at 50 percent or so -- nothing major like that. This was Apache and probably some Drupal script going crazy, or a Drupal site getting hit. But I'm not able to do a log search across all the sites right now given the rest of my workload, so I'll just continue monitoring.

Unless there's something we were doing that I'm not aware of, I'd say we need to look at this closely to figure out what happened here.

comment:2 Changed 11 years ago by Daniel Kahn Gillmor

Keywords: viewsic.mayfirst.org apache load added

Thanks for staying on top of this, Alfredo! I know how frustrating it can be when a box is unresponsive and you're trying to debug it.

My first thought was to check the console output to see if there was any indication of what was happening, but i don't see anything there out of the ordinary from this month (see #129 about the AEN: WARNINGs; times are UTC):

[0 dkg@squeak ~]$ ssh  viewsic-console@console.mayfirst.org 'cereal follow --cat viewsic | grep ^2008-03'
2008-03-01_10:01:41.44072 3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #0.
2008-03-02_11:41:35.27497 3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #0.
2008-03-02_11:46:45.29834 3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #0.
2008-03-02_11:46:51.29871 3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #0.
2008-03-03_16:04:44.75337 3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #0.
2008-03-05_10:18:00.40854 3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #0.
2008-03-05_11:21:15.64971 3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #0.
2008-03-06_10:07:25.83651 3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #0.
2008-03-06_11:01:39.03985 3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #0.
2008-03-08_09:57:49.65983 3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #0.
2008-03-11_10:14:19.16474 3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #0.
2008-03-13_09:29:06.95051 3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #0.
2008-03-13_10:32:44.19711 3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #0.
[0 dkg@squeak ~]$ 

When you say CPU was 50% and swap was OK, i start to get worried -- what was causing the unresponsiveness in that case? In future situations like this, it would be helpful to paste the full transcript of what tests you ran, and their output. For example:

  • a copy/paste of a screenful of top
  • the output of vmstat 1 10

For things that might be a little more security-conscious (like the output of lsof -i -n), you could just store the output in a file for later review.
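
Something as simple as the following would capture a snapshot of all of that in one go (just a rough sketch -- the output path is arbitrary), and redirecting to a file means you don't have to fight a sluggish terminal to read it:

# rough sketch: dump one snapshot of system state to a file for later review
OUT=/root/viewsic-diag-$(date +%Y%m%d-%H%M%S).txt
{
  date
  uptime
  top -b -n 1    # one batch-mode pass of top
  vmstat 1 10    # ten one-second samples
  lsof -i -n     # open network connections -- keep this file private
} > "$OUT" 2>&1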

Looking at the munin graphs for viewsic, i can see the gaps during the times you're talking about. The number of mysql queries soared just before the outage, and the load spiked to ~70. The CPU also appears to have been pegged at 100% just before the outage -- maybe it dropped while you were looking at it?

comment:3 Changed 11 years ago by alfredo

It could have dropped, and I'm sure the load was rising and dropping violently during this period. It was impossible to tell most of the time because I couldn't even get top to run from the command line. But I think we can be pretty sure about what this was: a Drupal site going crazy. The apache2 processes owned by www-data and the stress on mysql both point to that. Very often we have a member put in a script or something that goes bonkers or, in some cases, somebody's serving data that becomes suddenly popular.

Question is: how does one run a server-wide look at everyone's logs? I mean a grep across everybody's logs to see who was doing lots of serving at that point. Once the thing went down, everybody's logs are going to show a problem, but right before it maybe one of the members was showing some kind of unusual activity?

That's about the only way I can figure it out. I know it takes Jamie sometimes 2 or 3 days to pin this stuff down but I think we really need to get a handle on this.

comment:4 Changed 11 years ago by Daniel Kahn Gillmor

I'm not sure why the load would have been rising and dropping -- usually once the load goes up, it stays pretty heavily loaded. Depending on the reason for the load, the CPU might get pegged at 100% or not.

You can see which sites have the 10 largest logfiles like this:

sudo sh -c 'ls -laS /home/members/*/sites/*/logs/web.log' | head

If you just want to count the number of accesses between 23:00 yesterday and 01:00 today, you can do:

sudo sh -c 'egrep -c "13/Mar/2008:23|14/Mar/2008:00" /home/members/*/sites/*/logs/web.log' | sort -t: -n -k 2

This last command shows one site that's far and away more active than any other during that time period -- 18K hits as opposed to 2.3K hits for the next-most-active site. Maybe that's the site to look into?

Looking at those particular hits, it appears that the vast majority of them came from a single IP address:

0 dkg@viewsic$ sudo egrep "13/Mar/2008:23|14/Mar/2008:00" $BUSIEST_SITE_LOG_FILE | cut -f 1 -d\  | sort | uniq -c | sort -n | tail -n1
17347 67.202.29.26
0 dkg@viewsic$  
Looking this IP address up in whois (and using reverse DNS) points to Amazon Web Services (http://www.amazonaws.com/), in particular their "Amazon Development Centre South Africa", which weirdly has a Seattle, WA address in the whois entry:
[0 dkg@squeak ~]$ whois 67.202.29.26

OrgName:    Amazon.com, Inc. 
OrgID:      AMAZO-4
Address:    Amazon Development Centre South Africa
Address:    1200 12th Avenue South
City:       Seattle
StateProv:  WA
PostalCode: 98144
Country:    US

NetRange:   67.202.0.0 - 67.202.63.255 
CIDR:       67.202.0.0/18 
NetName:    AMAZON-EC2-3
NetHandle:  NET-67-202-0-0-1
Parent:     NET-67-0-0-0-0
NetType:    Direct Assignment
NameServer: PDNS1.ULTRADNS.NET
NameServer: PDNS2.ULTRADNS.NET
NameServer: PDNS3.ULTRADNS.ORG
Comment:    http://www.amazonaws.com/
RegDate:    2007-08-02
Updated:    2007-08-02

OrgTechHandle: ANO24-ARIN
OrgTechName:   Amazon EC2 Network Operations 
OrgTechPhone:  +1-206-266-2187
OrgTechEmail:  aes-noc@amazon.com

# ARIN WHOIS database, last updated 2008-03-13 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.
[0 dkg@squeak ~]$ 
Perhaps it'd be worth contacting these folks to see if they noticed a spike in outbound traffic to the relevant IP address at that time?  Maybe they could help nail down what happened.

comment:5 in reply to:  4 Changed 11 years ago by Daniel Kahn Gillmor

Sorry for the bad formatting.

That last bit should read:

0 dkg@viewsic$ sudo egrep "13/Mar/2008:23|14/Mar/2008:00" $BUSIEST_SITE_LOG_FILE | cut -f 1 -d\  | sort | uniq -c | sort -n | tail -n1
17347 67.202.29.26
0 dkg@viewsic$  

Looking this IP address up in whois (and using reverse DNS) points to Amazon Web Services, in particular their "Amazon Development Centre South Africa", which weirdly has a Seattle, WA address in the whois entry:

[0 dkg@squeak ~]$ whois 67.202.29.26

OrgName:    Amazon.com, Inc. 
OrgID:      AMAZO-4
Address:    Amazon Development Centre South Africa
Address:    1200 12th Avenue South
City:       Seattle
StateProv:  WA
PostalCode: 98144
Country:    US

NetRange:   67.202.0.0 - 67.202.63.255 
CIDR:       67.202.0.0/18 
NetName:    AMAZON-EC2-3
NetHandle:  NET-67-202-0-0-1
Parent:     NET-67-0-0-0-0
NetType:    Direct Assignment
NameServer: PDNS1.ULTRADNS.NET
NameServer: PDNS2.ULTRADNS.NET
NameServer: PDNS3.ULTRADNS.ORG
Comment:    http://www.amazonaws.com/
RegDate:    2007-08-02
Updated:    2007-08-02

OrgTechHandle: ANO24-ARIN
OrgTechName:   Amazon EC2 Network Operations 
OrgTechPhone:  +1-206-266-2187
OrgTechEmail:  aes-noc@amazon.com

# ARIN WHOIS database, last updated 2008-03-13 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.
[0 dkg@squeak ~]$ 

Perhaps it'd be worth contacting these folks to see if they noticed a spike in outbound traffic to the relevant IP address at that time? Maybe they could help nail down what happened.

comment:6 Changed 11 years ago by alfredo

Priority: Urgent → High

Thanks. If that's not a denial of service, someone has gotten fanatically interested in save access issues and has an incredibly fast computer. The activity range is so "normal", it's strange. There is one other possibility and this happens sometimes. A researcher or activist overseas gets interested in a very specific issue and decides to download, capture or do something else very rapidly with a site and sets up some kind of script to capture and download stuff from it.

But given that this is from the same IP and that it's just really a bunch of random page requests happening every second or so, I'd say that it's something we should think about and watch for. If it recurs, we should talk with the site managers about what to do.

Let's remember to pull Jamie in and have a three-way consultation on this.

Thanks for the help. Very useful as always. :-)

Let's keep it open but scale it down to high so we can keep discussing it.

comment:7 Changed 11 years ago by Jamie McClelland

Thanks for the great research. And yes - I'd like to participate in the three-way :).

I looked through the log in question. I didn't see a pattern in what was being requested (often in cases like this I would expect to see repeated requests to the register-new-user or post-content pages - indications of a spam bot trying to post spam). Instead, it seemed more like a spider, or perhaps more likely a program downloading the entire site.

We could email the OrgTechEmail - but given that the log does not seem to show anything other than someone downloading the site in an obnoxious way - I'm not sure it will help us. I just opened ticket #726 to start a conversation about how we can generally protect ourselves from something like this.

It seems as though anyone could take down our server by running a simple wget script against a drupal site :(.

comment:8 Changed 11 years ago by Daniel Kahn Gillmor

If it's a spider, it's a very poorly-written spider. During that two-hour span, that same IP address requested the home page 918 times, and nearly every other page around 285 times, as determined by:

sudo egrep "13/Mar/2008:23|14/Mar/2008:00" $BUSIEST_SITE_LOG_FILE | \
 grep '^67\.202\.29\.26' | cut -f7 -d\  | sort | uniq -c | sort -n

I suspect that someone is misusing their AWS account, though i'm not convinced that it must necessarily be a DoS attack.

For example, i could imagine a website author making a call to an AWS service as part of a page render -- if that AWS service was set up to call a page on the same site to do its work (i don't know if this is possible with AWS, but i don't know why it wouldn't be), the result could be a traffic loop that drives the machine's network interface to saturation.

Has anyone contacted the relevant member about this to see if they are using AWS in the site itself?

comment:9 Changed 11 years ago by alfredo

They *are* a high traffic site, or have been at points in the past. The pattern I saw was basically the kind of visit we see from people who haven't been to this site much: they open a node page and then click on the category term at the bottom to get the rest of the relevant nodes; that's what it means when you see a hit to a node page followed by a numbered "taxonomy/term" page.

They have some "across the Internet" modules (like "ping") activated but I didn't see anything with Amazon associations. We should ask. Yeah. Maybe somebody should email them?

comment:10 Changed 11 years ago by Jamie McClelland

I just emailed the site maintainer and will post back a reply.

comment:11 Changed 11 years ago by alfredo

Above, Jamie wrote: "It seems as though anyone could take down our server by running a simple wget script against a drupal site :(."

Just for general consumption (if others are looking at it), because I think us three know a lot of this but it never hurts to go over it.

Any Drupal site with any level of complexity (like some of those we develop here) could have a huge problem with persistent wget calls because of the CMS display logic: every Drupal page uses several modules for display out of the box (node, menu, block, and taxonomy, to name a few), and then we have the third-party modules, which are sometimes not as efficiently coded as the core modules. This is particularly true of CCK, which is now really a module system rather than just a module; CCK has its own display system, with its own CSS and display functions that use field names and calls different from the node module's.

This means that any page production does a lot of work, much more than an html page would require.

There's a particular "vulnerability" (for want of a better word) in the taxonomy system, because taxonomy hits at least three tables out of the box, and two more if they're used (synonyms and related terms). Every page display goes through all of those tables, and a visitor can easily hit every term displayed on a node, pull up the list of nodes for that term, and then iterate through the nodes in that list.

Essentially, one script can go through an entire site by looping and, because of the way terms are used, it would end up hitting the same nodes a bunch of times as it peruses the site.
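
Just to illustrate (example.org here is a stand-in, not the actual member site), even a totally naive mirroring command would do exactly that -- follow every taxonomy/term link on every node, get pointed back at the same nodes over and over, and never pause between requests:

# naive recursive mirror: follows every link it finds, with no politeness delay
wget --recursive --level=inf --no-parent http://www.example.org/
# a better-behaved crawl would at least add --wait=1 --random-wait

And every one of those requests is a full Drupal page build on our end.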

The other huge thing to remember is that, with every page call, Drupal goes through several directories (including modules) making a kind of list of modules that are available. This is to allow it to call those modules for any page without specific module includes. In short, the thing visits all the modules you have in your directory and, as I know from personal experience, *tests* each line of those modules for fatal errors and actually breaks your site if it finds one. I mean the entire site is broken if you have even one line of fatally flawed code in a module! Then it goes through its tables to see which ones you have "activated" and then includes those. Every one activated -- included in every single node and block! Not necessarily used at all -- but definitely included.

We're talking about some intense stuff here, guys!

The real issue is that Drupal is *not* gonna get simpler, because there is an unquestioned and increasing dependency on CCK and other third-party modules for general site development in the community.

People sometimes ask how the Drupal.org site survives all that activity without crashing every day. Well, sometimes it slows down, but the bottom line is that it doesn't use many third party modules. :-) We just can't follow that model!

What I think we need to aim for is some way of preventing "automated" visits. Maybe what we're talking about with DoS protection might help.
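
Just as a sketch of the kind of thing I mean (not something I've tested on our boxes, and the threshold is a guess), the kernel can cap how many simultaneous connections any single IP gets to the web server:

# untested sketch: reject any source IP holding more than 20 simultaneous
# connections to port 80 (the "20" is a made-up threshold)
iptables -A INPUT -p tcp --syn --dport 80 -m connlimit --connlimit-above 20 -j REJECT --reject-with tcp-reset

Though a one-request-at-a-time downloader like this one would slip right under a connection cap, so limiting the request rate at the Apache layer (something like mod_evasive, maybe) is probably closer to what we need -- which seems like a good thing to hash out in #726.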

But this problem *is* going to confront us so maybe we need to give it some thought. I'm a fanatical Drupalist, no question, but we can't be blind to the potential problems as we use it more and more.

comment:12 Changed 11 years ago by alfredo

Ooops!!! One mistake in the description of Drupal's drupalizing.

The initial traverse of the modules directories (usually two of them) takes place so Drupal can build the admin modules screen (so you can activate and de-activate them). No code reading or testing is done at that time, afaik.

Then Drupal reads its table to see which modules are active, and then reads those for code problems and breaks the site if there are any (like a semicolon left off a line). That's because only active modules are included -- I should have caught that when I wrote it.

Problems in the theme's template pages are caught only when a page uses the specific template in that page, btw.

Still a lot of work for each displayed page!

Sorry about that confusion.

comment:13 Changed 11 years ago by Jamie McClelland

Thanks for the explanation Alfredo. I would also add that Drupal provides caching - which greatly reduces the load on the server. But it only works for anonymous users. The real value of Drupal is to have users logged in - so I think we may also be experiencing a decrease in the usefulness of caching as sites attract more and more people who want to be logged in.

All of this makes me more and more interested in programs like ikiwiki which generate static html pages. They are just as dynamic as Drupal - but every change to a site triggers a call which produces static html pages. So - the server load happens when edits are made to the site, as opposed to happening when pages are viewed. That approach makes a lot more sense for high traffic sites.
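
Roughly, the workflow looks like this (a sketch with made-up paths - not literally how my homepage is set up):

# compile the wiki source into static html once, at edit time
ikiwiki --verbose ~/wiki-src /var/www/example.org
# from then on apache just serves flat files; a page view costs no php or mysql work

In practice ikiwiki hooks into the revision control system, so the rebuild runs automatically on each commit - but the point is the same: the expensive work happens once per edit, not once per view.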

For a working example of ikiwiki - I just set up my own homepage using the program.

comment:14 Changed 11 years ago by Daniel Kahn Gillmor

Summary: Viewsic critical -- apparent Apache issue → ~1 hour outage of Viewsic due to apache overload from a single IP address

What are the next steps on this ticket? What needs to be done to close it?

comment:15 Changed 11 years ago by alfredo

Resolution: fixed
Status: new → closed

Jamie was saying that the only thing remaining would be to send Amazon Web Services an email, and I've done so now, copying and/or forwarding Jamie and Daniel. So I think, after that, we're set and I'll close the ticket. I'll reopen it when we hear from Amazon.
