Opened 10 years ago

Closed 7 years ago

#1598 closed Bug/Something is broken (fixed)

mail problems on evo

Reported by: Jamie McClelland Owned by: Jamie McClelland
Priority: Medium Component: Tech
Keywords: Cc:
Sensitive: no

Description

Between 5:00 am and 6:03 am this morning, I recieved 1,059 message like this:

Date: Fri, 12 Dec 2008 06:03:00 -0500 (EST)
From: Mail Delivery System <MAILER-DAEMON@evo.mayfirst.org>
To: Postmaster <postmaster@evo.mayfirst.org>
Subject: Postfix SMTP server: errors from localhost.localdomain[127.0.0.1]

Transcript of session follows.

 Out: 220 evo.mayfirst.org ESMTP Postfix (Debian/GNU)
 In:  EHLO evo
 Out: 250-evo.mayfirst.org
 Out: 250-PIPELINING
 Out: 250-SIZE 10240000
 Out: 250-VRFY
 Out: 250-ETRN
 Out: 250-ENHANCEDSTATUSCODES
 Out: 250-8BITMIME
 Out: 250 DSN
 In:  MAIL FROM:<>
 Out: 452 4.3.1 Insufficient system storage

Session aborted, reason: lost connection

At the moment (9:32 am), the /var partition is not filled up:

0 evo:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0              2.8G  786M  1.9G  30% /
tmpfs                 507M     0  507M   0% /lib/init/rw
udev                   10M   72K   10M   1% /dev
tmpfs                 507M     0  507M   0% /dev/shm
/dev/md3              139G  2.2G  130G   2% /srv
/dev/md1              4.6G  3.6G  827M  82% /var
0 evo:~#

Still investigating what happened.

evo is a dedicated email list server.

Change History (4)

comment:1 Changed 10 years ago by Jamie McClelland

I've started by clearing out more room in /var (the partition that filled up).

We had /var/log/listserv, which contained old listserv logs. Now listserv is logging in /srv/listserv. I've moved /var/log/listserv to a new directory: /srv/listserv-logs.

I also removed old apache web logs and changed log rotate to only keep 10 weeks worth of logs rather than 52 weeks.

And finally, I ran aptitude clean to purge old versions of software packages.

Now, the /var partition has a lot more room:

0 evo:~# df -h /var 
Filesystem            Size  Used Avail Use% Mounted on
/dev/md1              4.6G  1.7G  2.7G  39% /var
0 evo:~#

To diagnose the problem, I've made copies of the past 7 days of mail logs. I first noticed that the log files have been getting larger and larger every day:

0 evo:/srv/ticket1598# ls -lh
total 1.5G
-rw-r----- 1 root root 353M Dec 12 09:58 mail.log.0
-rw-r----- 1 root root 338M Dec 12 09:58 mail.log.1
-rw-r----- 1 root root 278M Dec 12 09:58 mail.log.2
-rw-r----- 1 root root 225M Dec 12 09:58 mail.log.3
-rw-r----- 1 root root 164M Dec 12 09:58 mail.log.4
-rw-r----- 1 root root 111M Dec 12 09:59 mail.log.5
-rw-r----- 1 root root  64M Dec 12 09:59 mail.log.6
0 evo:/srv/ticket1598#

I've downloaded all the logs to my laptop and installed a postfix log analyzer to try get an overview of what is going on with the mail logs.

comment:2 Changed 10 years ago by Jamie McClelland

Despite the difference in log file sizes, the log analyzer summaries don't seem to indicate many differences in the overall activity of the logs:

==> mail.log.0.summary <==

Grand Totals
------------
messages

   3177   received
  34563   delivered
    859   forwarded
   1732   deferred  (853k deferrals)
    261   bounced
   1588   rejected (4%)
      0   reject warnings
      0   held
      0   discarded (0%)

  21060k  bytes received
 261149k  bytes delivered
   1091   senders
    265   sending hosts/domains
   7873   recipients
   1697   recipient hosts/domains

==> mail.log.1.summary <==

Grand Totals
------------
messages

   2649   received
  32243   delivered
      0   forwarded
   1550   deferred  (832k deferrals)
    367   bounced
    326   rejected (1%)
      0   reject warnings
      0   held
      0   discarded (0%)

  20335k  bytes received
 362799k  bytes delivered
    484   senders
    325   sending hosts/domains
   8227   recipients
   1827   recipient hosts/domains

==> mail.log.2.summary <==

Grand Totals
------------
messages

   3292   received
  32693   delivered
      0   forwarded
   1579   deferred  (682k deferrals)
    292   bounced
    240   rejected (0%)
      0   reject warnings
      0   held
      0   discarded (0%)

  28037k  bytes received
 332654k  bytes delivered
   1176   senders
    298   sending hosts/domains
   7228   recipients
   1587   recipient hosts/domains

==> mail.log.3.summary <==

Grand Totals
------------
messages

   3285   received
  32078   delivered
      0   forwarded
   1422   deferred  (561k deferrals)
    319   bounced
    583   rejected (1%)
      0   reject warnings
      0   held
      0   discarded (0%)

  25313k  bytes received
 304953k  bytes delivered
   1238   senders
    284   sending hosts/domains
   7907   recipients
   1814   recipient hosts/domains

==> mail.log.4.summary <==

Grand Totals
------------
messages

   3013   received
  29907   delivered
      0   forwarded
   1277   deferred  (404159  deferrals)
    270   bounced
    299   rejected (0%)
      0   reject warnings
      0   held
      0   discarded (0%)

  36367k  bytes received
 426745k  bytes delivered
   1126   senders
    288   sending hosts/domains
   7407   recipients
   1693   recipient hosts/domains

==> mail.log.5.summary <==

Grand Totals
------------
messages

   2277   received
  32542   delivered
      0   forwarded
   1224   deferred  (261907  deferrals)
    290   bounced
    405   rejected (1%)
      0   reject warnings
      0   held
      0   discarded (0%)

  15415k  bytes received
 284241k  bytes delivered
    370   senders
    277   sending hosts/domains
   7800   recipients
   1777   recipient hosts/domains

==> mail.log.6.summary <==

Grand Totals
------------
messages

   2956   received
  30849   delivered
      0   forwarded
   1232   deferred  (137035  deferrals)
    291   bounced
    173   rejected (0%)
      0   reject warnings
      0   held
      0   discarded (0%)

  19814k  bytes received
 338475k  bytes delivered
   1187   senders
    293   sending hosts/domains
   7887   recipients
   1756   recipient hosts/domains

comment:3 Changed 10 years ago by Jamie McClelland

I think I've figured out the problem.

It seems as though Yahoo has changed their mail acceptance process.

Yahoo always defers our email when they detect a single server sending to a lot of Yahoo recipients. Typically they defer the first few delivery attempts and then slowly start accepting deliveries.

Now, it seems that they are deferring a lot more delivery attempts. The summary statistics are the same because the same number of email messages are getting deferred, they're simply getting deferred more times before successfully getting delivered.

A recent log is showing that messages to Yahoo were deferred a total of 683,699 times:

0 jamie@liberace:ticket1598$ grep "YAHOO.*temporarily deferred due to user complaints" mail.log.1 | wc -l
683699
0 jamie@liberace:ticket1598$

Compared with a week ago, that number was only 76,003:

0 jamie@liberace:ticket1598$ grep "YAHOO.*temporarily deferred due to user complaints" mail.log.6 | wc -l
76003
0 jamie@liberace:ticket1598$

To compound the problem, Yahoo has also been timing out. This could be due to infrastructure problems with Yahoo or because they are taking a policy of temporarily preventing our IP address from connecting.

A recent log shows 683,699 connection timeouts:

0 jamie@liberace:ticket1598$ grep "YAHOO.*temporarily deferred due to" mail.log.1 | wc -l
683699
0 jamie@liberace:ticket1598$

Compared with a week ago that was only 34:

 grep "YAHOO.*timed out while receiving the initial server greeting" mail.log.6 | wc -l
34
0 jamie@liberace:ticket1598$

In the end, I'm not seeing a downward trend in Yahoo messages getting delivered:

0 jamie@liberace:ticket1598$ for log in $(ls mail.log.{1,2,3,4,5,6}); do echo -n "$log: "; grep "YAHOO.*status=sent" "$log" | wc -l; done
mail.log.1: 4787
mail.log.2: 5124
mail.log.3: 4003
mail.log.4: 3773
mail.log.5: 6467
mail.log.6: 3448
0 jamie@liberace:ticket1598$

The mail log rotates at 6:30 am - so the varying number of deliveries per day I would attribute to the time the original post to portside is made. When it's made on the early side, more deliveries are made prior to 6:30 am, when it's on the late side, less deliveries are made.

However, since there is no downward trend, I don't think we're seeing less ultimate deliveries.

comment:4 Changed 7 years ago by Ross

Resolution: fixed
Status: newclosed

Please login to add comments to this ticket.

Note: See TracTickets for help on using tickets.