Changes between Initial Version and Version 1 of projects/membership-meeting/2011/tech-report


Timestamp: Sep 16, 2011, 2:07:43 PM
Author: Jamie McClelland

= Report on Infrastructure for the Membership Meeting 2011 =

It's been 16 years since Alfredo Lopez founded People Link and six years
since People Link and the May First Technology Collective merged to form
May First/People Link.

While our growth and development have been fairly steady over time, this
past year our infrastructure has made a qualitative leap unlike any
previous year.

When our organization merged six years ago, we built an infrastructure
composed of four servers in two co-location facilities to serve barely
100 members. Web applications were becoming common but were still
relatively simple and only caused us headaches when they really blew up.
And most of the tech work could be managed by one individual, albeit
with copious help from others.

During the summer of 2010, after surpassing the 500-member mark, we
re-evaluated our situation: 90 physical and virtual servers in 5
different locations, a new breed of web applications that eat servers
for breakfast, and dramatically expanding email loads. We needed some
serious changes.

== Support Team ==

Our first step was to organize. Drawing on the success of the US Social
Forum, we asked our fellow techies for help and they responded.
Resoundingly. In September 2010 we had our first face-to-face meeting
of the newly constituted all-volunteer MFPL Support Team. We're now
meeting in person on a near-monthly basis (usually for one- or two-day
weekend sprints), and we strategize and collaborate via Internet chat
most days to handle support requests and deal with any other issues that
arise.

Thanks to external funding sources, we've been able to develop our
support team politically via our participation in organizing events
around the world. In October, Mallory supported the Education Social
Forum in Palestine. In December we sent Nat and Carlos to Mexico to
support COP16 activism. In January, Ross, Mallory, Joseph and I traveled
to Dakar to support both the World Social Forum and the Indymedia Africa
organizing project. In March, Nat, Mallory and Melissa supported the
Cochabamba+1 conference in Montreal.

These trips have provided the critical political glue to strengthen the
volunteer technology team and to support our goal of integrating
technology with politics and the global movement. Without these
opportunities for techie organizing, our support team would not be
possible.

== Accomplishments ==

One of the first challenges we undertook was to prevent a single web
site that was either compromised or simply getting a lot of traffic from
pulling down an entire server. With newly complex and powerful web
applications like Drupal and CiviCRM available, it had become much
harder to prevent one member from interfering with other members on the
same server.

We started with a change in the way we ran PHP, the scripting language
that powers many popular web applications. This change went live in
October 2010. That was the first step, allowing us more flexibility and
control. However, the step wasn't completed until the upgrades of early
June, which included the painful switch from PHP 5.2 to PHP 5.3. Now, at
long last, we can easily limit the resources used by one site.

If your site explodes, it may stop accepting new visitors, but it won't
hurt other sites on the server. This way we can focus our immediate
attention on giving your site the resources it needs rather than
struggling to keep the server under control.

The early June upgrades also completed another important and fundamental
infrastructure improvement: the transition to Puppet. Previously, all of
our servers were configured individually, through a motley collection of
scripts. The process was tedious, error-prone, and left our servers in
various and divergent states. In many ways our biggest barrier to
expansion was the time it took to properly set up a new server.
Additionally, the divergent states of configuration were one of the
causes of our inability to restore the databases to julia back in April.
The database backup script was not properly set up on julia and we had
no system in place for ensuring it was there.

Starting early this year, we began transitioning our collection of
scripts to a widely used free and open source system of server
management called Puppet. With the completion of our upgrade in early
June, we finished the bulk of the transition. The result is that we can
now deploy a new virtual server in about 10 to 20 minutes with most of
the tedious tasks eliminated and a more solid guarantee that all the
configuration tasks are complete. Furthermore, this setup is designed to
scale and should support us for years to come even as we continue to
expand.

== Bumps along the road ==

The past year hasn't been all roses.

We experienced an unprecedented number of server crashes (5 between
November 2010 and June 2011). The resulting downtime was very painful
for the affected members and frustrating to the support team as we
struggled to understand the cause of the problem. After extensive
sleuthing, we were finally able to track the problem to a bug in the
Linux kernel. Support team members Daniel and Greg developed a patch to
fix the problem that is now available for every Linux user on the
planet.

In addition, our upgrades in early June included a significant upgrade to
PHP (the scripting language used for many web sites) that broke many of
our members' web sites. We failed to properly warn members in advance of
this transition, which left many members in the difficult position of
scrambling to get their web sites functional again. From now on, all
upgrades of this nature will be announced well in advance and tested
more thoroughly.

We also experienced two instances in which a lost encryption passphrase
forced us to rebuild a server from backup. Every May First/People Link
server's hard disk is encrypted with a passphrase to protect member
privacy in the unlikely event a server is seized or stolen. Prior to our
transition to the support team, all passphrases were stored on a single
computer without a proper backup. When the passphrase store was
accidentally lost, we were able to rebuild (and re-create) the lost
passphrases and immediately switched to a passphrase storage system that
securely distributes multiple copies of passphrases to all members of
the support team. We didn't realize that our re-created file had
incorrect passphrases for two servers until we had to reboot those
servers. We've now instituted a process for auditing all encrypted disks
in our network so we can ensure this problem will never happen again.

Lastly, our most critical bump was our inability to restore databases on
julia last April. This incident was May First/People Link's first
experience of data loss on a significant scale and had a tremendous
impact on members who were affected. The cause was a misconfigured
backup script. We've addressed the problem through the use of Puppet,
our system for ensuring consistent server configuration across all of
our servers. In addition, thanks to the help of our ad hoc backup
committee, we've made a number of improvements for members, including
the new ability to easily download your database backups.

== What's coming ==

Compared to a year ago, our infrastructure is dramatically improved. In
particular, thanks to the resolution of the Linux kernel bug and the
improved handling of individual web sites, we expect better uptime and
stability across all servers.

But there is still more work to be done. We're currently in the testing
phases of a new mail configuration that will allow all members to use
the same mail settings, with no need to keep track of your primary
server when setting up your email accounts. This change will also allow
us to more easily move members between primary hosts, which will greatly
help us redistribute members based on the amount of resources they are
consuming.

And, starting in October, Ross will be joining May First/People Link
as a paid part-timer dedicated to answering support tickets. Welcome
Ross!

This past year, our work on the behind-the-scenes infrastructure has
been largely invisible to the membership. With these improvements under
our belt, we look forward to engaging members over the upcoming year to
prioritize new features and services.
     161