wiki:syn-flood-defense-narrative

Version 2 (modified by Daniel Kahn Gillmor, 6 years ago)

--

Beating a Syn Flood Attack - Narrative

On Wed. Aug 8th, 2013, Sahara Reporters was hit by a massive DDoS attack. This page is a narrative account of, and how-to for, dealing with such an attack. I will attempt to be as generic as possible to help others dealing with such a problem, but some things will also be May First/People Link specific. Also, having never had to deal with an attack of this sort, I cannot confirm that the practices described here are best practices. However, after nearly four days, many attempts, a bunch of Start Page searches, and three highly competent sysadmins, we finally worked out a solution to this multi-pronged attack.

Prong One: POST past Varnish

Sahara Reporters uses varnish as its primary mode of proxy caching. A well-tested caching proxy, varnish has been extremely effective at serving a fairly high-traffic site for a number of years. We have faced a few attacks in the past and weathered them fairly well thanks to varnish's flexibility and ability to stand up to increased traffic.

The first problem encountered during this attack rendered the primary server (not the caching servers) virtually inoperable. Apache serves the back-end content, and using an ingenious, if obvious, route around varnish, the attackers were able to force a dramatic increase in the number of requests made to the apache server. Their simple method was to make POST requests to the site home page.

Looking at top on the apache server, we were able to determine the significance of the issue. Normally this server has some 10-15 apache2 processes running, but top showed a list beyond counting. ps gave better metrics. With the command:

ps -eFH | grep apache2 

we were able to determine the total number of processes running: well over 100, which caused the server to become unresponsive to almost any activity.
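For a quick count, the same ps output can be piped through grep -c. A small sketch; the [a] bracket is just a trick to keep grep from counting its own process:

```shell
# Count running apache2 processes; the [a] bracket keeps grep's own
# command line from matching the pattern.
count=$(ps -eFH | grep -c '[a]pache2')
echo "apache2 processes: $count"
```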

Under this particular load, the varnish servers showed no negative consequences. In fact, other than the significant load on the apache server everything else seemed quite normal.

The Diagnostic Revelation

The initial confusion arose from the fact that it seemed as if varnish wasn't doing its job, because too many requests were being passed through to the apache server. I searched for caching failures on each of the varnish servers, but nothing seemed obvious. Hit rates were normal, around 85-95%, and there seemed to be no load problems on the varnish servers at all. The only real symptoms were the struggling apache server and, eventually, a clear spike in traffic. We monitor our bandwidth with cacti; support team members can get access to this through keyringer. You will be able to see massive traffic spikes in XO on both pianeta and avocet during this period.

Since our varnish servers are distributed, these spikes reflected traffic going directly to the apache server. After talking with Sahara Reporters and confirming that they did not have any reason to expect increased traffic, we determined this to be a legitimate problem with the relationship between our varnish servers and the apache server. Making sure to do the easiest and most obvious thing first, I restarted apache and varnish on all of the servers.

service apache2 restart 
service varnish restart 

No improvement at all. Frustrating but probably to be expected. Next I checked the apache logs to investigate any unwanted traffic.

tail -f /var/log/apache.log | grep -v -E 'VARNISH_IP_1|VARNISH_IP_2' 

VARNISH_IP_1, etc., should be replaced with the actual IP addresses if you need to run this command.

This offered an overview of traffic patterns not passing through the varnish servers. Unfortunately, it did not show any meaningful traffic, certainly nothing that would have caused an overloaded server. If you're using varnish as a proxy, you would not expect apache to receive connections from anything but your varnish servers. In a way this was good news, because it demonstrated that the problem had to do with the varnish servers passing too many requests to the apache server. But why?

Well, just to ease traffic to apache, the first step we took was to remove a number of varnish servers. Since the problem was coming through varnish, taking away any given server would reduce the number of calls to apache. So we took out a third of our varnish servers, to no avail. The number of requests was simply too high to handle.

The next step we took was to try to determine whether any IP addresses might be overloading the varnish servers. For this we used varnishtop:

varnishtop -i TxHeader -I '^X-Forwarded-For:' 

which gave output something like this:

list length 19                                        bouazizi

    39.91 TxHeader       X-Forwarded-For: 109.205.248.192
    27.94 TxHeader       X-Forwarded-For: 37.53.252.79
    22.93 TxHeader       X-Forwarded-For: 130.255.251.114
     9.97 TxHeader       X-Forwarded-For: 87.109.30.45
     4.98 TxHeader       X-Forwarded-For: 196.46.246.50, 217.212.230.234
     3.99 TxHeader       X-Forwarded-For: 151.245.10.171
     2.99 TxHeader       X-Forwarded-For: 49.231.103.138
     2.00 TxHeader       X-Forwarded-For: 93.186.23.81
     2.00 TxHeader       X-Forwarded-For: 192.168.88.6, 41.41.244.13
     2.00 TxHeader       X-Forwarded-For: 66.249.73.136
     1.99 TxHeader       X-Forwarded-For: unknown, 93.186.22.240
     1.99 TxHeader       X-Forwarded-For: unknown, 93.186.22.241
     1.00 TxHeader       X-Forwarded-For: 41.190.5.47
     1.00 TxHeader       X-Forwarded-For: 66.249.73.224
     1.00 TxHeader       X-Forwarded-For: 192.168.102.96, 46.65.52.130
     1.00 TxHeader       X-Forwarded-For: 93.186.22.241, 80.239.243.129
     1.00 TxHeader       X-Forwarded-For: 151.96.3.241
     1.00 TxHeader       X-Forwarded-For: 93.186.31.81
     1.00 TxHeader       X-Forwarded-For: 66.249.73.240

Here you see three IP addresses with a significantly higher hit rate than any others. During the actual attack, the list of rapidly requesting IP addresses was much longer. At the very least, such out-of-proportion numbers offer a clue that the problem is likely an attack. The next discovery was an "Aha!" moment: using varnishncsa led us to the root of the problem.

This command:

varnishncsa 

resulted in output something like this:

139.194.226.35 - - [18/Aug/2013:03:00:09 -0400] "POST http://saharareporters.com/ HTTP/1.0" 200 837 "http://saharareporters.com/" "Mozilla/5.0 (Windows NT 5.1; rv:16.0) Gecko/20100101 Firefox/16.0"
139.194.226.35 - - [18/Aug/2013:03:00:09 -0400] "POST http://saharareporters.com/ HTTP/1.0" 200 837 "http://saharareporters.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.420014; .NET CLR 3.5.420014; .NET CLR 3.0.420014"
139.194.226.35 - - [18/Aug/2013:03:00:10 -0400] "POST http://saharareporters.com/ HTTP/1.0" 200 837 "http://saharareporters.com/" "Opera/9.80 (Windows NT 5.1; WOW64; U; Edition Grenada Local; ru) Presto/2.10.289 Version/12.07"
139.194.226.35 - - [18/Aug/2013:03:00:10 -0400] "POST http://saharareporters.com/ HTTP/1.0" 200 837 "http://saharareporters.com/" "Opera/9.80 (Windows NT 5.1; U; Edition India Local; ru) Presto/2.10.289 Version/9.08"
139.194.226.35 - - [18/Aug/2013:03:00:10 -0400] "POST http://saharareporters.com/ HTTP/1.0" 200 837 "http://saharareporters.com/" "Opera/9.80 (Windows NT 5.1; U; Edition Germany Local; ru) Presto/2.10.289 Version/5.00"
139.194.226.35 - - [18/Aug/2013:03:00:10 -0400] "POST http://saharareporters.com/ HTTP/1.0" 200 837 "http://saharareporters.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; SLCC2; .NET CLR 2.0.045312; .NET CLR 3.5.045312; .NET CLR 3.0.045312"
139.194.226.35 - - [18/Aug/2013:03:00:10 -0400] "POST http://saharareporters.com/ HTTP/1.0" 200 837 "http://saharareporters.com/" "Mozilla/5.0 (Windows NT 5.1; rv:9.0) Gecko/20100101 Firefox/9.0"
139.194.226.35 - - [18/Aug/2013:03:00:10 -0400] "POST http://saharareporters.com/ HTTP/1.0" 200 837 "http://saharareporters.com/" "Opera/9.80 (Windows NT 6.1; WOW64; U; Edition Russia Local; ru) Presto/2.10.289 Version/6.04"
139.194.226.35 - - [18/Aug/2013:03:00:10 -0400] "POST http://saharareporters.com/ HTTP/1.0" 200 837 "http://saharareporters.com/" "Mozilla/5.0 (Windows NT 5.1; WOW64; rv:10.0) Gecko/20100101 Firefox/10.0"
139.194.226.35 - - [18/Aug/2013:03:00:11 -0400] "POST http://saharareporters.com/ HTTP/1.0" 200 837 "http://saharareporters.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; SLCC2; .NET CLR 2.0.702355; .NET CLR 3.5.702355; .NET CLR 3.0.702355"
139.194.226.35 - - [18/Aug/2013:03:00:11 -0400] "POST http://saharareporters.com/ HTTP/1.0" 200 837 "http://saharareporters.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.335818; .NET CLR 3.5.335818; .NET CLR 3.0.335818"
139.194.226.35 - - [18/Aug/2013:03:00:11 -0400] "POST http://saharareporters.com/ HTTP/1.0" 200 837 "http://saharareporters.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.743546; .NET CLR 3.5.743546; .NET CLR 3.0.743546"
139.194.226.35 - - [18/Aug/2013:03:00:11 -0400] "POST http://saharareporters.com/ HTTP/1.0" 200 837 "http://saharareporters.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.342248; .NET CLR 3.5.342248; .NET CLR 3.0.342248"
139.194.226.35 - - [18/Aug/2013:03:00:11 -0400] "POST http://saharareporters.com/ HTTP/1.0" 200 837 "http://saharareporters.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; SLCC2; .NET CLR 2.0.863776; .NET CLR 3.5.863776; .NET CLR 3.0.863776"
139.194.226.35 - - [18/Aug/2013:03:00:11 -0400] "POST http://saharareporters.com/ HTTP/1.0" 200 837 "http://saharareporters.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.811412; .NET CLR 3.5.811412; .NET CLR 3.0.811412"
139.194.226.35 - - [18/Aug/2013:03:00:12 -0400] "POST http://saharareporters.com/ HTTP/1.0" 200 837 "http://saharareporters.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; SLCC2; .NET CLR 2.0.045312; .NET CLR 3.5.045312; .NET CLR 3.0.045312"
139.194.226.35 - - [18/Aug/2013:03:00:12 -0400] "POST http://saharareporters.com/ HTTP/1.0" 200 837 "http://saharareporters.com/" "Opera/9.80 (Windows NT 6.1; WOW64; U; Edition Russia Local; ru) Presto/2.10.289 Version/6.04"
139.194.226.35 - - [18/Aug/2013:03:00:12 -0400] "POST http://saharareporters.com/ HTTP/1.0" 200 837 "http://saharareporters.com/" "Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20100101 Firefox/12.0"

The above output has been filtered to remove GET requests; POST is the telling part of the equation. To determine whether you're experiencing this problem, the appropriate command would be:

varnishncsa | grep POST 
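To see which client IPs are doing the POSTing, the varnishncsa output can be tallied with awk. A sketch, fed two fabricated log lines so it runs standalone; on a live server you would pipe varnishncsa straight into the awk stage:

```shell
# Tally POST requests per client IP from combined-log-format lines,
# as produced by varnishncsa. Field 6 is the quoted method ("POST).
printf '%s\n' \
  '1.2.3.4 - - [18/Aug/2013:03:00:09 -0400] "POST http://example.com/ HTTP/1.0" 200 837 "-" "UA"' \
  '1.2.3.4 - - [18/Aug/2013:03:00:10 -0400] "POST http://example.com/ HTTP/1.0" 200 837 "-" "UA"' |
awk '$6 == "\"POST" {print $1}' | sort | uniq -c | sort -rn
```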

Examining the output above, you can see that the originating IP address tried to POST to the homepage of the site. Given the nature of this site, no normal user would need to issue a POST request to the homepage. Still, when varnish saw the POST request, it said, "Oh, it's a POST. I don't deal with those!" and asked apache to take over.

So for every POST request, apache got a request as well. Technically, varnish was doing its job. The solution? Change varnish's job. The first step, just to get the site running again, was to block all POST requests to the site. Borrowing from an example we found, we decided to simply block every POST request first, adding this directive to sub vcl_recv in our varnish configuration:

if ( req.request == "POST" ) {
      error 403 ": Requested Method is not supported by this server.";
}

Et Voila!!! Once all the varnish servers had this directive up, the site once again started loading, and apache calmed down to normal levels. Whew, one problem solved. Next, we added the acceptable POST paths, so site functionality could continue as normal. Here's the final directive we used:

if ( req.request == "POST" ) {
  if ( req.url ~ "/user"
    || req.url ~ "/node/add"
    || req.url ~ "edit"
    || req.url ~ "comment"
    || req.url ~ "delete" ) {
       return (pass);
  } else {
      error 403 ": Requested Method is not supported by this server.";
  }
}

Now varnish would not pass any POST requests to the homepage and clog up the works of apache. Apache was happy to go back to its old job, varnish was happy to have a new job, and I was happy to have done my job. The only people who weren't happy were the attackers!!!

The Diagnostic Duh!

All seemed well for the better part of the day after taking these steps. Unfortunately, by the end of the evening, the attackers had made quick, though obvious, adjustments. Rather than targeting the home page, they revamped their methodology and began running POST requests against pages that did not throw 403 errors at them. The apache server once again bogged down, and we had to begin approaching the problem in a more targeted manner.

varnishncsa output looked more like this:

39.52.217.81 - - [18/Aug/2013:03:51:40 -0400] "POST http://saharareporters.com/user/login HTTP/1.1" 500 837 "http://saharareporters.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.667160; .NET CLR 3.5.667160; .NET CLR 3.0.667160"
39.52.217.81 - - [18/Aug/2013:03:51:40 -0400] "POST http://saharareporters.com/user/login HTTP/1.1" 500 837 "http://saharareporters.com/" "Opera/9.80 (Windows NT 6.1; WOW64; U; Edition United Kingdom Local; ru) Presto/2.10.289 Version/10.05"
39.52.217.81 - - [18/Aug/2013:03:51:41 -0400] "POST http://saharareporters.com/user/login HTTP/1.1" 500 837 "http://saharareporters.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.612601; .NET CLR 3.5.612601; .NET CLR 3.0.612601"
42.118.204.24 - - [18/Aug/2013:03:51:41 -0400] "POST http://saharareporters.com/user/login HTTP/1.1" 500 837 "http://saharareporters.com/" "Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20100101 Firefox/12.0"
173.245.221.81 - - [18/Aug/2013:03:51:41 -0400] "POST http://saharareporters.com/user/login HTTP/1.1" 500 837 "http://saharareporters.com/" "Opera/9.80 (Windows NT 5.1; U; Edition Grenada Local; ru) Presto/2.10.289 Version/5.03"
39.52.217.81 - - [18/Aug/2013:03:51:41 -0400] "POST http://saharareporters.com/user/login HTTP/1.1" 500 837 "http://saharareporters.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.305161; .NET CLR 3.5.305161; .NET CLR 3.0.305161"
42.118.204.24 - - [18/Aug/2013:03:51:41 -0400] "POST http://saharareporters.com/user/login HTTP/1.1" 500 837 "http://saharareporters.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; SLCC2; .NET CLR 2.0.998117; .NET CLR 3.5.998117; .NET CLR 3.0.998117"
39.52.217.81 - - [18/Aug/2013:03:51:41 -0400] "POST http://saharareporters.com/user/login HTTP/1.1" 500 837 "http://saharareporters.com/" "Opera/9.80 (Windows NT 6.1; U; Edition Mongolia Local; ru) Presto/2.10.289 Version/6.04"
42.118.204.24 - - [18/Aug/2013:03:51:41 -0400] "POST http://saharareporters.com/user/login HTTP/1.1" 500 837 "http://saharareporters.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.808502; .NET CLR 3.5.808502; .NET CLR 3.0.808502"
42.118.204.24 - - [18/Aug/2013:03:51:41 -0400] "POST http://saharareporters.com/user/login HTTP/1.1" 500 837 "http://saharareporters.com/" "Opera/9.80 (Windows NT 5.1; U; Edition Russia Local; ru) Presto/2.10.289 Version/12.08"
42.118.204.24 - - [18/Aug/2013:03:51:41 -0400] "POST http://saharareporters.com/user/login HTTP/1.1" 500 837 "http://saharareporters.com/" "Opera/9.80 (Windows NT 6.1; WOW64; U; Edition Bangladesh Local; ru) Presto/2.10.289 Version/10.07"
39.52.217.81 - - [18/Aug/2013:03:51:41 -0400] "POST http://saharareporters.com/user/login HTTP/1.1" 500 837 "http://saharareporters.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.247061; .NET CLR 3.5.247061; .NET CLR 3.0.247061"
42.118.204.24 - - [18/Aug/2013:03:51:41 -0400] "POST http://saharareporters.com/user/login HTTP/1.1" 500 837 "http://saharareporters.com/" "Mozilla/5.0 (Windows NT 5.1; rv:10.0) Gecko/20100101 Firefox/10.0"
39.52.217.81 - - [18/Aug/2013:03:51:41 -0400] "POST http://saharareporters.com/user/login HTTP/1.1" 500 837 "http://saharareporters.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.452771; .NET CLR 3.5.452771; .NET CLR 3.0.452771"
39.52.217.81 - - [18/Aug/2013:03:51:41 -0400] "POST http://saharareporters.com/user/login HTTP/1.1" 500 837 "http://saharareporters.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.667160; .NET CLR 3.5.667160; .NET CLR 3.0.667160"
42.118.204.24 - - [18/Aug/2013:03:51:42 -0400] "POST http://saharareporters.com/user/login HTTP/1.1" 500 837 "http://saharareporters.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.269189; .NET CLR 3.5.269189; .NET CLR 3.0.269189"
173.245.221.81 - - [18/Aug/2013:03:51:42 -0400] "POST http://saharareporters.com/user/login HTTP/1.1" 500 837 "http://saharareporters.com/" "Opera/9.80 (Windows NT 6.1; WOW64; U; Edition Iran Local; ru) Presto/2.10.289 Version/7.02"
39.52.217.81 - - [18/Aug/2013:03:51:42 -0400] "POST http://saharareporters.com/user/login HTTP/1.1" 500 837 "http://saharareporters.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.456529; .NET CLR 3.5.456529; .NET CLR 3.0.456529"

Since this is a drupal site, attacking user/login makes sense, because users must send a POST request in order to log in. The choice becomes: either don't let users log in, or allow attackers to make these POST requests.

Our first approach was to block POST requests to the "/user/login" path only, by changing the varnish directive from if ( req.url ~ "/user" to if ( req.url == "/user" ), mistakenly believing that this would solve the problem. It only took a few hours to discover that such a solution would ultimately end with all POST paths being blocked, as the attackers continued to seek alternative paths.

Blocking IP addresses

Realizing that varnish might not be able to offer a complete solution to the problem, we began looking for other alternatives and reluctantly decided to begin blocking IP addresses. This was not an easy decision, since blocking IP addresses means potentially keeping legitimate traffic from reaching the site, which is essentially what a ddos attack aims to do. Not what we wanted, but given the circumstances such a step seemed imperative.

Rather than being indiscriminate, we chose to block only those IP addresses sending POST requests that came from countries most likely to be the source of a botnet and least likely to speak the primary language of the site, along with those making the largest number of requests. This resulted in a list of countries:

Russia, Taiwan, China, Vietnam, Hungary, Iran, Romania, the Czech Republic, Belarus

These countries seemed to be the largest offenders. We targeted both POST and GET requests from these countries by running the following two scripts:

#!/bin/bash

while :;
do a=$(varnishncsa | grep "POST http://saharareporters.com/user/login/ HTTP/1.1" -m 1 |
        grep -o -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'); echo "$a";
    b=$(whois "$a" | grep -i -m 1 country | grep -E 'UA|TW|HU|RU|VN|vn|IR');
    # b=$(whois "$a" | grep -i -m 1 country);
    if [ -n "$b" ];
    then
        echo "$a";
        mf-ip-ban-address  "$a";
    fi
done

The above script blocks POST requests coming from the selected countries. Ultimately, this seemed to produce fewer results than needed, so we switched to blocking GET requests from those countries as well.

#!/bin/bash

while :;
do a=$(varnishncsa | grep "GET http://saharareporters.com/ HTTP/1.0" -m 1 |
        grep -o -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'); 
    c=$(whois "$a" | grep -i -m 1 country);
    b=$(whois "$a" | grep -i -m 1 country | grep -E 'UA|TW|HU|RU|VN|vn|IR|RO|CZ|BY');
    # b=$(whois "$a" | grep -i -m 1 country);
    echo "$a -- $c";
    if [ -n "$b" ];
    then
        echo "Banned -- $a";
        /usr/local/sbin/mf-ip-ban-address "$a";
    fi
done

This script finds the IP address, checks to make sure it's from one of the designated countries, and then calls mf-ip-ban-address to ban it. mf-ip-ban-address looks like this (we'll change this script later):

#!/bin/bash
if [ ! $# == 1 ]; then
        echo    "You did not specify an IP address to ban 
USAGE: $0 ip_address_to_ban"
        exit
fi
IP=$1

IPTABLES=/sbin/iptables
$IPTABLES -A INPUT -s $IP -j LOG --log-ip-options --log-tcp-options --log-level debug --log-prefix=Banned: 
$IPTABLES -A INPUT -s $IP -j DROP

Using this method, we managed to keep the offending IP addresses at bay and began reducing the number of requests passing through varnish to the apache server. Whew!!!
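One caveat about the country filter in the scripts above: a bare grep -E 'UA|TW|...' matches substrings anywhere in the line, so a short code like IR can hit unrelated text. A sketch of a more anchored match; the IRELAND line is fabricated bait for illustration, not real whois output:

```shell
# Substring matching vs anchored matching of whois country lines.
bad='country:        IRELAND'   # fabricated false-positive bait
good='country:        IR'

printf '%s\n' "$bad" | grep -cE 'IR'                                              # prints 1: false positive
printf '%s\n' "$bad" | grep -ciE '[[:space:]](RU|TW|HU|VN|IR|RO|CZ|BY)$' || true  # prints 0
printf '%s\n' "$good" | grep -ciE '[[:space:]](RU|TW|HU|VN|IR|RO|CZ|BY)$'         # prints 1
```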

Not So Fast SYN-ner

Banning by country calmed things down for the better part of a day, but by early morning on Friday, all bets were off. The attackers had switched their approach yet again. All of what we had done remained in place, and the apache server, in fact all the servers, seemed to be chugging along just fine. Meanwhile the site would not load at all :-( .

netstat became our tool of choice. First we relied on

netstat -net | wc -l

just to find out how many connections were active. Whoa!!! It looked like tens of thousands, up to 200,000 at certain times. That's a lot of concurrent connections, and certainly more than we could handle or explain. varnish continued doing its job, blocking POST requests. iptables continued to block well over a thousand IP addresses. For all intents and purposes, everything seemed to be just fine, but perusing netstat -net showed a viciously high number of connections in SYN_RECV (thanks to jamie for noticing this).

It appears that these determined attackers had decided to switch tactics and use a SYN flood attack. Never having dealt with this particular type of attack, we found the effects rather perplexing, maybe especially behind varnish. Everything looked like it was functioning correctly, but the site simply wouldn't load. Restarting varnish made it possible to load some pages for a short period of time, and then just an infinite stall.

At MF/PL we have a super sweet script written by dkg to check for open syns, mf-ip-list-open-syns. Apparently, the script was written back in 2003 to watch for potential attacks to the counter-convention website during The Republican National Convention in 2004.

Running mf-ip-list-open-syns showed pages and pages of IP addresses: a good indication that a SYN flood was indeed the attack we faced.

In our search for answers, we discovered a nifty one-line bash command that gave us a pretty clear sense of what we faced:

netstat -ant | grep 80 | awk '{print $6}' | sort | uniq -c | sort -n 

It produced output something like this (numbers modified for example):

      1 LISTEN
      2 CLOSING
     30 FIN_WAIT2
     39 FIN_WAIT1
     42 LAST_ACK
    166 SYN_RECV
    226 ESTABLISHED
    634 TIME_WAIT
  34030 CLOSE_WAIT
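One thing to watch with the one-liner above: grep 80 matches "80" anywhere on the line, including inside IP addresses. Keying on the local-address column is tighter. A self-contained sketch, fed fabricated netstat-style lines:

```shell
# Tally connection states only for sockets whose local port is 80.
# Column 4 of netstat -ant output is the local address.
printf '%s\n' \
  'tcp 0 0 10.0.0.1:80 1.2.3.4:1111 SYN_RECV' \
  'tcp 0 0 10.0.0.1:80 1.2.3.4:2222 SYN_RECV' \
  'tcp 0 0 10.0.0.1:80 5.6.7.8:3333 CLOSE_WAIT' |
awk '$4 ~ /:80$/ {print $6}' | sort | uniq -c | sort -n
```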

We saw a huge number of CLOSE_WAIT connections. The CLOSE_WAIT state, we would learn, means that the remote end has closed the connection but the local process handling it has not yet closed its side. We used the following command to determine which process was holding the connections open:

~#  netstat -antp | grep CLOSE_WAIT | head -10
tcp        0  13140 216.66.23.43:80         109.160.88.5:3912       CLOSE_WAIT  29613/varnishd  
tcp        0  12708 216.66.23.43:80         46.225.41.180:61010     CLOSE_WAIT  29613/varnishd  
tcp        0  13140 216.66.23.43:80         109.160.88.5:3902       CLOSE_WAIT  29613/varnishd  
tcp        0  13140 216.66.23.43:80         109.160.88.5:3883       CLOSE_WAIT  29613/varnishd  
tcp        0  13140 216.66.23.43:80         109.160.88.5:3996       CLOSE_WAIT  29613/varnishd  
tcp        1  12708 216.66.23.43:80         46.225.41.180:61131     CLOSE_WAIT  29613/varnishd  
tcp        0  12240 216.66.23.43:80         171.4.214.125:23126     CLOSE_WAIT  29613/varnishd  
tcp        0  12708 216.66.23.43:80         46.225.41.180:61129     CLOSE_WAIT  29613/varnishd  
tcp        0  12780 216.66.23.43:80         41.43.168.155:56923     CLOSE_WAIT  29613/varnishd  
tcp        0  13140 216.66.23.43:80         109.160.88.5:3908       CLOSE_WAIT  29613/varnishd  

As might be expected, varnish was responsible for all of the CLOSE_WAIT connections. This seemed like progress: all we needed to do was figure out how to end all of the CLOSE_WAIT connections. Easy, right? I wish...

Perhaps I'm jumping ahead a little bit, because as soon as we discovered this was a SYN flood attack, we began researching how to resist it. In almost every case we found reference to two things:

  1. Turning on SYN cookies.
  2. A set of iptables rules to mitigate SYN flood attacks.

Turning on SYN cookies

This is a fairly standard practice and can be done on a live system by modifying /proc/sys/net/ipv4/tcp_syncookies. Check the current status with:

cat /proc/sys/net/ipv4/tcp_syncookies 

If the result is "0", you can turn on syncookies with:

echo 1 > /proc/sys/net/ipv4/tcp_syncookies 

We did not have tcp_syncookies enabled, so this seemed like a great and easy solution. However, enabling it produced no noticeable improvement. Even after a full reboot, this setting did not seem to reduce the SYN flood. This is not to say it isn't important, and we will leave it enabled.
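Note that the echo into /proc only lasts until reboot. To persist the setting, the conventional place on a Debian-style system is /etc/sysctl.conf, applied with sysctl -p:

```
# /etc/sysctl.conf (excerpt); apply without rebooting via: sysctl -p
net.ipv4.tcp_syncookies = 1
```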

iptables resistance

From a number of different sources, we found a similar set of iptables rules to implement as general resistance to SYN flood attacks. Below is one set, though we would end up implementing others as well. This set creates a chain, syn-flood, that rate-limits incoming SYN packets, logging and dropping those that exceed the limit. Note the final rule, which sends new SYNs through the chain; without it the chain is never consulted:

iptables -N syn-flood
iptables -A syn-flood -m limit --limit 10/second --limit-burst 50 -j RETURN
iptables -A syn-flood -j LOG --log-prefix "SYN flood: "
iptables -A syn-flood -j DROP
iptables -A INPUT -p tcp --syn -j syn-flood

Again, this method did not seem to produce any noticeable results. Even after shutting down varnish, making sure all connections had terminated, and then restarting varnish, the CLOSE_WAIT connections piled up almost instantly. Quite frustrating, to say the least.

Meanwhile, we continued to ban IP addresses at an alarming rate, unable to tell with certainty whether or not these addresses were spoofed. After hours of this approach, we could only periodically get varnish to serve content, and then only for moments.

Throughout this process we tried numerous additional firewall mechanisms, with limited results. These iptables rules seemed hopeful:

iptables -A INPUT -p tcp --syn --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --syn -m limit --limit 1/s --limit-burst 4 -j ACCEPT
iptables -A INPUT -p tcp --syn -j DROP

Similar to the chain above, these rules rate-limit SYN packets: after the first rule accepts SYNs to port 80, remaining SYNs are accepted at no more than 1 per second (with a burst of 4) and dropped beyond that. Note that the limit module counts packets globally, not per source IP. These restrictions are more severe than the earlier syn-flood chain. Still, little improvement.

Next we dove into netfilter, which has numerous configuration options in /etc/sysctl.conf. We configured these settings first:

net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1

In reality, we had little knowledge of defending against these types of attacks and tried whatever we could to mitigate the problem. Upon reflection, it seems that some of the sysctl settings only make an impact on routers, not stand-alone servers.

One recommended setting is net.netfilter.nf_conntrack_tcp_timeout_syn_recv=30, which reduces the amount of time a SYN request can remain open. When we tried to set this value, we discovered that the servers were without conntrack-tools. After installing it (apt-get install conntrack), we needed to enable three modules to use its capabilities:

modprobe nf_conntrack
modprobe nf_conntrack_ipv4
modprobe nf_conntrack_netlink

With the above modules enabled, we can now run conntrack -L to see the current flow state of all connections, e.g.

~# conntrack -L | head -10
tcp      6 345409 ESTABLISHED src=199.87.167.202 dst=187.14.214.174 sport=80 dport=12703 packets=3 bytes=1893 [UNREPLIED] src=187.14.214.174 dst=199.87.167.202 sport=12703 dport=80 packets=0 bytes=0 mark=0 secmark=0 use=2
tcp      6 320054 ESTABLISHED src=199.87.167.202 dst=54.236.252.74 sport=80 dport=37853 packets=1 bytes=52 [UNREPLIED] src=54.236.252.74 dst=199.87.167.202 sport=37853 dport=80 packets=0 bytes=0 mark=0 secmark=0 use=2
tcp      6 401819 ESTABLISHED src=199.87.167.202 dst=37.8.76.60 sport=80 dport=30865 packets=1 bytes=1492 [UNREPLIED] src=37.8.76.60 dst=199.87.167.202 sport=30865 dport=80 packets=0 bytes=0 mark=0 secmark=0 use=2
tcp      6 289081 ESTABLISHED src=31.207.246.124 dst=199.87.167.202 sport=3007 dport=80 packets=4 bytes=184 src=199.87.167.202 dst=31.207.246.124 sport=80 dport=3007 packets=1 bytes=44 [ASSURED] mark=0 secmark=0 use=2
tcp      6 261068 ESTABLISHED src=175.176.150.152 dst=199.87.167.202 sport=7286 dport=80 packets=2 bytes=88 src=199.87.167.202 dst=175.176.150.152 sport=80 dport=7286 packets=1 bytes=44 [ASSURED] mark=0 secmark=0 use=2
tcp      6 321533 ESTABLISHED src=199.87.167.202 dst=54.236.254.18 sport=80 dport=34831 packets=1 bytes=52 [UNREPLIED] src=54.236.254.18 dst=199.87.167.202 sport=34831 dport=80 packets=0 bytes=0 mark=0 secmark=0 use=2
tcp      6 397643 ESTABLISHED src=199.87.167.202 dst=54.236.254.116 sport=80 dport=54920 packets=1 bytes=52 [UNREPLIED] src=54.236.254.116 dst=199.87.167.202 sport=54920 dport=80 packets=0 bytes=0 mark=0 secmark=0 use=2
tcp      6 345481 ESTABLISHED src=199.87.167.202 dst=189.100.29.153 sport=80 dport=12262 packets=3 bytes=1815 [UNREPLIED] src=189.100.29.153 dst=199.87.167.202 sport=12262 dport=80 packets=0 bytes=0 mark=0 secmark=0 use=2
tcp      6 345350 ESTABLISHED src=199.87.167.202 dst=197.160.90.202 sport=80 dport=28290 packets=1 bytes=604 [UNREPLIED] src=197.160.90.202 dst=199.87.167.202 sport=28290 dport=80 packets=0 bytes=0 mark=0 secmark=0 use=2
tcp      6 260548 ESTABLISHED src=175.176.150.152 dst=199.87.167.202 sport=65299 dport=80 packets=2 bytes=88 src=199.87.167.202 dst=175.176.150.152 sport=80 dport=65299 packets=1 bytes=44 [ASSURED] mark=0 secmark=0 use=2

Having conntrack installed proved a great boon for helping track what was happening on the server. One of conntrack's advantages is supposed to be more effective management of per-IP flow control. For better or worse, I never found a specific mechanism to use conntrack in this way. However, we were able to use it as a reference point for examining different types of connections.

This would come in handy later on, but first we found another mechanism by which to thwart the attack. It turned out that many of the Referer headers were bogus, looking like "stahoustoa.com" with no scheme. Steve had the fabulous idea of using varnish to throw a 500 error on malformed Referer headers.

We then added this line to our varnish configuration:

if (req.http.referer && req.http.referer !~ "^http") {
    error 500 ": Internal Server Error";
}

To our surprise, this allowed varnish to begin serving content again, and when we went to sleep the site was again live. By morning our hopes again turned to horror.

In the end, it would be conntrack and iptables that did the heavy firewall lifting, as should probably be expected, since varnish had become incapable of closing its connections. The big aha moment came with the idea of blocking outbound packets to the offending IP addresses. Since dropping traffic on INPUT alone does not stop our side of the connection from trying to respond, all of our IP blocking up to this point had had little effect.

We'd mistakenly believed that dropping an IP address in INPUT meant fully blocking it, which is true for incoming traffic. The caveat is that an INPUT rule does nothing to stop the server's own outbound packets to that address.

Using conntrack, we wrote a script that finds IP addresses with 20 or more CLOSE_WAIT connections and then blocks the outgoing response. The magic single line turned out to be rather simple:

iptables -A OUTPUT -d $IP -j DROP 

The script for using conntrack for this purpose looks like this:

#!/bin/bash

# This script finds ip addresses that are 
# holding open multiple connections and
# calls mf-ip-delete-and-ban to block access
# from and to the ip address.
# It parses conntrack output, so conntrack
# is a dependency.

type conntrack >/dev/null 2>&1 || { echo >&2 "This script depends on conntrack but it's not installed.  Aborting."; exit 1; }

while : 
do 
    for i in $(conntrack -L | 
        grep CLOSE_WAIT | awk '{print $5}' | 
        cut -f2 -d'=' | sort | uniq -c | 
        sort -n | awk '{if($1>=20)print $2;}') 
    do  /root/mf-ip-delete-and-ban "$i"
    done 
    sleep 10
done

And we modified mf-ip-ban-address to be mf-ip-delete-and-ban, which looks like this:

#!/bin/bash

# This script adds OUTPUT blocking and an iptables
# delete to the standard mf-ip-ban-address script.
# iptables -D (delete) will remove any duplicate
# rule in iptables before creating a new one.
# OUTPUT blocking tells iptables to disallow outgoing
# packets to the ip address.  This is useful
# for dealing with syn flood attacks.

if [ $# -ne 1 ]; then
        echo    "You did not specify an IP address to ban 
USAGE: $0 ip_address_to_ban"
        exit 1
fi
IP=$1

IPTABLES=/sbin/iptables
$IPTABLES -D OUTPUT -d "$IP" -j DROP
$IPTABLES -A OUTPUT -d "$IP" -j DROP
printf 'banned output from -- %s\n' "$IP"
$IPTABLES -D INPUT -s "$IP" -j LOG --log-ip-options --log-tcp-options --log-level debug --log-prefix=Banned:
$IPTABLES -A INPUT -s "$IP" -j LOG --log-ip-options --log-tcp-options --log-level debug --log-prefix=Banned:
$IPTABLES -D INPUT -s "$IP" -j DROP
$IPTABLES -A INPUT -s "$IP" -j DROP

Notice a couple of changes from mf-ip-ban-address. The first and most important was adding OUTPUT dropping. As soon as we began using this method, varnish could relinquish its open connections to the offending IP addresses, since it no longer needed to wait for the final ACK.

The second change was adding a delete line for each creation line. When banning a single offender by hand, this delete line may not be necessary. But since we scripted the IP blocking, one major side effect turned out to be a huge list of duplicate iptables rules. Adding the -D switch means deleting any duplicate entry before adding the current one.
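As an aside, iptables (since 1.4.11) also offers a -C switch that checks whether a rule already exists, giving an alternative to the delete-then-add idiom. The sketch below stubs out iptables with a shell function so the control flow can be exercised without root; the stub, the ban_output helper, and the temp file are all illustrative, not part of our actual scripts (swap the stub for /sbin/iptables in practice):

```shell
#!/bin/bash
# Idempotent variant of the ban, using `iptables -C` (rule-exists check)
# instead of delete-then-add.  A stub records appended rules in a temp
# file so the logic is testable without root.
RULES=/tmp/fake-rules.$$
: > "$RULES"
iptables() {
    if [ "$1" = -C ]; then
        shift
        grep -qxF -- "$*" "$RULES"        # succeed if the rule is recorded
    elif [ "$1" = -A ]; then
        shift
        printf '%s\n' "$*" >> "$RULES"    # record the appended rule
    fi
}
ban_output() {
    # Append the OUTPUT drop only if it is not already in place.
    iptables -C OUTPUT -d "$1" -j DROP || iptables -A OUTPUT -d "$1" -j DROP
}
ban_output 203.0.113.7
ban_output 203.0.113.7   # no-op the second time: no duplicate rule
count=$(grep -c 203.0.113.7 "$RULES")
echo "$count"
rm -f "$RULES"
```

The check-before-append form avoids the brief window in delete-then-add where the rule is absent entirely, at the cost of requiring a newer iptables.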

And that's the story. So far the site seems to be happily chugging along.

Other Gotchas Encountered

Mistakes and oversights occurred on a few occasions during this process. The first thing I, ross, the author of this narrative, learned was:

Never do service networking restart on a machine for which you don't have console access.

The importance of this lesson continues to grow, as the provider http://wgwilkins.com has apparently stopped responding to support requests. The server at that provider continues to linger in a non-networked state.

Watch your logs

iptables generates excessive logging traffic to /var/log/kern.log, /var/log/syslog, and /var/log/debug. When banning thousands of IP addresses and logging those bans, you may want to either dramatically increase rsyslog's rotation frequency or turn off logging to those files. In a number of instances our /var partition filled up, adding unnecessary confusion about server behavior. In high-intensity situations especially, these additional concerns do not make life pleasant.
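One way to shed that logging load is a rsyslog filter that discards the "Banned:" messages before they hit disk. The fragment below is illustrative (the filename is arbitrary, and we did not deploy this at the time); on older rsyslog versions the discard action is ~ rather than stop:

```
# /etc/rsyslog.d/00-drop-banned.conf  (illustrative path)
# Discard kernel messages carrying our "Banned:" log prefix before
# they are written to kern.log, syslog, and debug.
:msg, contains, "Banned:" stop
```

Restart rsyslog after adding the file for the filter to take effect.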

iptables rules do not persist

It's easy to forget in the middle of debugging something like this that iptables rules do not, by default, persist across a server reboot. To retain your rules, you'll need to run:

/sbin/iptables-save > /etc/iptables.up.rules 

before rebooting. And then:

/sbin/iptables-restore < /etc/iptables.up.rules 

after rebooting. There are ways to automate this; see the Debian Admin guide.
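Following the approach in the Debian Admin guide, one common way to automate the restore is a small script in /etc/network/if-pre-up.d/ (the filename here is illustrative), which ifupdown runs before bringing any interface up:

```
#!/bin/sh
# /etc/network/if-pre-up.d/iptables (illustrative path):
# restore the saved rules before any interface comes up.
/sbin/iptables-restore < /etc/iptables.up.rules
```

Remember to make the script executable (chmod +x), or ifupdown will silently skip it.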

De-duping iptables rules

In case you end up with a bunch of duplicate IP address entries in your iptables rules, here's one approach for de-duping them.

First create a duplicate IP list

This long one-liner builds a file of the IP addresses that appear in iptables more than once.

iptables -L -n | grep DROP | grep -o -E 'all -- [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | grep -o -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | grep -v '0.0.0.0' | sort | uniq -c | sort -n | awk '$1 > 1 {print $2}' > ~/duplicate-ip-table-entries.txt 

Remove all duplicate entries and re-add them

This one-liner removes all rules for those IP addresses and re-adds them.

for i in $(cat duplicate-ip-table-entries.txt); do for ip in $(iptables -L -n | grep DROP | grep -o -E "$i"); do echo "$ip"; iptables -D INPUT -s "$ip" -j LOG --log-ip-options --log-tcp-options --log-level debug --log-prefix=Banned:; iptables -D INPUT -s "$ip" -j DROP; iptables -D OUTPUT -d "$ip" -j DROP; done; done;

for i in $(cat duplicate-ip-table-entries.txt); do ./mf-ip-delete-and-ban "$i"; done

You'd need to modify the iptables rules to match the specific way you added them. The above lines delete rules as specified in the mf-ip-delete-and-ban script listed above.
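As a sanity check afterwards, uniq -d on its own will list any source IP that still appears in more than one DROP rule. A sketch against canned iptables -L -n lines (column spacing in real output can vary, which is why the awk field reference is more forgiving than a fixed-width grep):

```shell
# List source IPs that still appear in more than one DROP rule.
# The here-variable stands in for live `iptables -L -n` output.
sample='DROP       all  --  203.0.113.7          0.0.0.0/0
DROP       all  --  203.0.113.7          0.0.0.0/0
DROP       all  --  198.51.100.4         0.0.0.0/0'
printf '%s\n' "$sample" | awk '/^DROP/ {print $4}' | sort | uniq -d
```

With the sample above, only 203.0.113.7 is printed; an empty result means the de-dup pass left no duplicates behind.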