
Report on Infrastructure for the Membership Meeting 2011

It's been 16 years since Alfredo Lopez founded People Link and six years since People Link and the May First Technology Collective merged to form May First/People Link.

While our growth and development have been fairly steady over time, this past year our infrastructure has made a qualitative leap unlike any previous year.

When our organization merged six years ago, we built an infrastructure composed of four servers in two co-location facilities to serve barely 100 members. Web applications were becoming common but were still relatively simple and only caused us headaches when they really blew up. Most of the tech work could be managed by one individual, albeit with copious help from others.

During the Summer of 2010, after surpassing the 500 member mark, we re-evaluated our situation: 90 physical and virtual servers in 5 different locations, a new breed of web applications that eat servers for breakfast, and dramatically expanding email loads. We needed some serious changes.

Support Team

Our first step was to organize. Drawing on the success of the US Social Forum, we asked our fellow techies for help and they responded. Resoundingly. In September 2010 we had our first face-to-face meeting of the newly constituted all-volunteer MFPL Support Team. We now meet in person on a near-monthly basis (usually for one- or two-day weekend sprints) and strategize and collaborate via Internet chat most days to handle support requests and deal with any other issues that arise.

Thanks to external funding sources, we've been able to develop our support team politically via our participation in organizing events around the world. In October, Mallory supported the Education Social Forum in Palestine. In December, we sent Nat and Carlos to Mexico to support COP16 activism. In January, Ross, Mallory, Joseph and I traveled to Dakar to support both the World Social Forum and the Indymedia Africa organizing project. In March, Nat, Mallory and Melissa supported the Cochabamba+1 conference in Montreal.

These trips have provided the critical political glue to strengthen the volunteer technology team and to support our goal of integrating technology with politics and the global movement. Without these opportunities for techie organizing, our support team would not be possible.

Accomplishments

One of the first challenges we undertook was to prevent one web site, whether compromised or simply getting a lot of traffic, from pulling down an entire server. With newly complex and powerful web applications like Drupal and CiviCRM available, it had become much harder to prevent one member from interfering with other members on the same server.

We started with a change in the way we ran PHP, the scripting language that powers many popular web applications. This change went live in October 2010. That was the first step, allowing us more flexibility and control. However, the work wasn't completed until the upgrades of early June, which included the painful switch from PHP 5.2 to PHP 5.3. Now, at long last, we can easily limit the resources used by one site.

If your site explodes, it may stop accepting new visitors, but it won't hurt other sites on the server. This way we can focus our immediate attention on giving your site the resources it needs rather than struggling to keep the server under control.

The early June upgrades also completed another important and fundamental infrastructure improvement: the transition to Puppet. Previously, all of our servers were configured individually, through a motley collection of scripts. The process was tedious, error-prone, and left our servers in various and divergent states. In many ways our biggest barrier to expansion was the time it took to properly set up a new server. Additionally, the divergent states of configuration were one of the causes of our inability to restore the databases to julia back in April. The database backup script was not properly set up on julia and we had no system in place for ensuring it was there.

Starting early this year, we began transitioning our collection of scripts to a widely used free and open source system of server management called Puppet. With the completion of our upgrade in early June, we finished the bulk of the transition. The result is that we can now deploy a new virtual server in about 10 to 20 minutes with most of the tedious tasks eliminated and a more solid guarantee that all the configuration tasks are complete. Furthermore, this setup is designed to scale and should support us for years to come even as we continue to expand.
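To illustrate the idea behind this kind of configuration management, here is a minimal sketch in Python. It is not Puppet code and not our actual configuration: the server names, the list of required items, and the `find_drift` helper are all hypothetical, chosen only to show how comparing each server against one shared baseline can surface a missing piece, such as a backup script, before it causes trouble.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """A server and the configuration items known to be in place on it."""
    name: str
    installed: set[str] = field(default_factory=set)


# Hypothetical baseline: items every server is expected to have.
REQUIRED = {"database-backup-script", "disk-encryption", "monitoring-agent"}


def find_drift(servers, required=REQUIRED):
    """Return {server name: sorted missing items} for servers that diverge."""
    drift = {}
    for server in servers:
        missing = required - server.installed
        if missing:
            drift[server.name] = sorted(missing)
    return drift


# Illustrative data only; these are not real server names.
servers = [
    Server("server-a", {"database-backup-script", "disk-encryption", "monitoring-agent"}),
    Server("server-b", {"disk-encryption", "monitoring-agent"}),
]

for name, missing in find_drift(servers).items():
    print(f"{name}: missing {', '.join(missing)}")
```

A real configuration management system goes further than reporting: it also applies the missing pieces automatically. The sketch only shows the comparison against a single declared baseline, which is the part that was absent when servers were configured one by one.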

Bumps along the road

The past year hasn't been all roses.

We experienced an unprecedented number of server crashes (5 between November 2010 and June 2011). The resulting downtime was very painful for the affected members and frustrating to the support team as we struggled to understand the cause of the problem. After extensive sleuthing, we were finally able to track the problem to a bug in the Linux kernel. Support team members Daniel and Greg developed a patch to fix the problem that is now available for every Linux user on the planet.

In addition, our upgrades in early June included a significant upgrade to PHP (the scripting language used for many web sites) that broke many of our members' web sites. We failed to properly warn members in advance of this transition, which left many members scrambling to get their web sites functional again. All future upgrades of this nature will be announced far in advance, with more significant testing.

We also experienced two instances in which a lost encryption passphrase forced us to re-build a server from backup. Every May First/People Link server's hard disk is encrypted with a passphrase to protect member privacy in the unlikely event a server is seized or stolen. Prior to our transition to the support team, all passphrases were stored on a single computer without a proper backup. When the passphrase store was accidentally lost, we were able to re-build (and re-create) the lost passphrases and immediately switched to a passphrase storage system that securely distributes multiple copies of passphrases to all members of the support team. We didn't realize that our re-created file had incorrect passphrases for two servers until we had to reboot those servers. We've now instituted a process for auditing all encrypted disks in our network so we can ensure this problem will never happen again.

Lastly, our most critical bump was our inability to restore databases on julia last April. This incident was May First/People Link's first experience of data loss on a significant scale and had a tremendous impact on the members who were affected. The cause was a mis-configured backup script. We've addressed the problem through the use of Puppet, our system for ensuring consistent configuration across all of our servers. In addition, thanks to the help of our ad hoc backup committee, we've made a number of improvements for members, including the new ability to easily download your database backups.

What's coming

Compared to a year ago, our infrastructure is dramatically improved. In particular, thanks to the resolution of the Linux kernel bug and the improved handling of individual web sites, we expect better uptime and stability across all servers.

But there is still more work to be done. We're currently in the testing phases of a new mail configuration that will allow all members to use the same mail settings, with no need to keep track of your primary server when setting up your email accounts. This change will also allow us to more easily move members between primary hosts, which will greatly help us re-distribute members based on the amount of resources they are consuming.

And, starting in October, Ross will be joining May First/People Link as a paid part-timer dedicated to answering support tickets. Welcome Ross!

This past year, our work on the behind-the-scenes infrastructure has been largely invisible to the membership. With these improvements under our belt, we look forward to engaging members over the upcoming year to prioritize new features and services.

Last modified on Sep 16, 2011, 2:07:43 PM