= Report on Infrastructure for the Membership Meeting 2011 =

It's been 16 years since Alfredo Lopez founded People Link and six years
since People Link and the May First Technology Collective merged to form
May First/People Link.

While our growth and development have been fairly steady over time, this
past year our infrastructure has made a qualitative leap unlike any
previous year.

When our organizations merged six years ago, we built an infrastructure
composed of four servers in two co-location facilities to serve barely
100 members. Web applications were becoming common but were still
relatively simple and only caused us headaches when they really blew up.
And most of the tech work could be managed by one individual, albeit
with copious help from others.

During the summer of 2010, after surpassing the 500-member mark, we
re-evaluated our situation: 90 physical and virtual servers in five
different locations, a new breed of web applications that eat servers
for breakfast, and dramatically expanding email loads. We needed some
serious changes.

== Support Team ==

Our first step was to organize. Drawing on the success of the US Social
Forum, we asked our fellow techies for help and they responded.
Resoundingly. In September 2010 we had the first face-to-face meeting of
the newly constituted, all-volunteer MFPL Support Team. We now meet in
person on a near-monthly basis (usually for one- or two-day weekend
sprints) and strategize and collaborate via Internet chat most days to
handle support requests and deal with any other issues that arise.

Thanks to external funding sources, we've been able to develop our
support team politically via our participation in organizing events
around the world. In October, Mallory supported the Education Social
Forum in Palestine. In December we sent Nat and Carlos to Mexico to
support COP16 activism. In January, Ross, Mallory, Joseph and I traveled
to Dakar to support both the World Social Forum and the Indymedia Africa
organizing project. In March, Nat, Mallory and Melissa supported the
Cochabamba+1 conference in Montreal.

These trips have provided the critical political glue to strengthen the
volunteer technology team and to support our goal of integrating
technology with politics and the global movement. Without these
opportunities for techie organizing, our support team would not be
possible.

== Accomplishments ==

One of the first challenges we undertook was to prevent a single web
site, whether compromised or simply receiving a lot of traffic, from
pulling down an entire server. With newly complex and powerful web
applications like Drupal and CiviCRM available, it had become much
harder to prevent one member from interfering with other members on the
same server.

We started with a change in the way we ran PHP, the scripting language
that powers many popular web applications. This change went live in
October 2010. That was the first step, allowing us more flexibility and
control. However, the step wasn't completed until the upgrades of early
June, which included the painful switch from PHP 5.2 to PHP 5.3. Now, at
long last, we can easily limit the resources used by any one site.

If your site explodes, it may stop accepting new visitors, but it won't
hurt other sites on the server. This way we can focus our immediate
attention on giving your site the resources it needs rather than
struggling to keep the server under control.

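To give a flavor of the general technique (the details of our production setup aren't covered here), a FastCGI process manager like PHP-FPM lets each site run under its own "pool" with its own caps. Everything below, including the pool name, paths, and limits, is an illustrative sketch, not our actual configuration:

```ini
; Hypothetical per-site PHP-FPM pool -- name and limits are illustrative.
[examplemember]
user = examplemember
group = examplemember
listen = /var/run/php-fpm/examplemember.sock

; Cap how many PHP processes this one site may run. If the site
; "explodes", requests beyond this limit queue or fail for this site
; only; other pools on the same server are unaffected.
pm = ondemand
pm.max_children = 5
pm.process_idle_timeout = 10s

; Per-request ceilings, enforced per pool rather than server-wide.
php_admin_value[memory_limit] = 64M
php_admin_value[max_execution_time] = 30
```

Because each pool has its own process limit, one runaway site exhausts only its own workers instead of the whole server's memory and CPU.
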
The early June upgrades also completed another important and fundamental
infrastructure improvement: the transition to puppet. Previously, all of
our servers were configured individually, through a motley collection of
scripts. The process was tedious, error-prone, and left our servers in
various and divergent states. In many ways our biggest barrier to
expansion was the time it took to properly set up a new server.
Additionally, this divergent configuration was one of the causes of our
inability to restore the databases to julia back in April: the database
backup script was not properly set up on julia, and we had no system in
place for ensuring it was there.

Starting early this year, we began transitioning our collection of
scripts to puppet, a widely used free and open source system for server
management. With the completion of our upgrade in early June, we
finished the bulk of the transition. The result is that we can now
deploy a new virtual server in about 10 to 20 minutes, with most of the
tedious tasks eliminated and a more solid guarantee that all the
configuration tasks are complete. Furthermore, this setup is designed to
scale and should support us for years to come even as we continue to
expand.

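To illustrate how puppet prevents the kind of drift that bit us on julia, here is a minimal sketch of the sort of resource it enforces. The class name, file paths, and cron schedule are hypothetical, not taken from our actual manifests:

```puppet
# Hypothetical sketch -- paths and schedule are assumptions, not our
# real manifests. Puppet re-applies this on every run, so a server can
# no longer silently drift into a state where the backup script is
# missing (the julia problem).
class mfpl::backup {
  file { '/usr/local/sbin/backup-databases':
    ensure => file,
    owner  => 'root',
    mode   => '0755',
    source => 'puppet:///modules/mfpl/backup-databases',
  }

  cron { 'nightly-database-backup':
    ensure  => present,
    command => '/usr/local/sbin/backup-databases',
    user    => 'root',
    hour    => 3,
    minute  => 15,
    require => File['/usr/local/sbin/backup-databases'],
  }
}
```

Every server that includes the class gets the same script and the same cron job, and puppet restores them automatically if they are ever removed or changed by hand.
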
== Bumps along the road ==

The past year hasn't been all roses.

We experienced an unprecedented number of server crashes (five between
November 2010 and June 2011). The resulting downtime was very painful
for the affected members and frustrating to the support team as we
struggled to understand the cause of the problem. After extensive
sleuthing, we were finally able to track the problem to a bug in the
Linux kernel. Support team members Daniel and Greg developed a patch to
fix the problem that is now available to every Linux user on the
planet.

In addition, our upgrades in early June included a significant upgrade
to PHP (the scripting language used by many web sites) that broke many
of our members' web sites. We failed to properly warn members in advance
of this transition, leaving many of them scrambling to get their web
sites functional again. All future upgrades of this nature will be
announced far in advance and tested much more thoroughly.

We also experienced two instances in which a lost encryption passphrase
forced us to re-build a server from backup. Every May First/People Link
server's hard disk is encrypted with a passphrase to protect member
privacy in the unlikely event a server is seized or stolen. Prior to our
transition to the support team, all passphrases were stored on a single
computer without a proper backup. When the passphrase store was
accidentally lost, we were able to re-create the lost passphrases and
immediately switched to a passphrase storage system that securely
distributes multiple copies of passphrases to all members of the support
team. However, we didn't realize that our re-created file had incorrect
passphrases for two servers until we had to reboot those servers. We've
now instituted a process for auditing all encrypted disks in our network
so we can ensure this problem never happens again.

Lastly, our most critical bump was our inability to restore databases on
julia last April. This incident was May First/People Link's first
experience of significant data loss and had a tremendous impact on the
members who were affected. The cause was a mis-configured backup script.
We've addressed the problem through the use of puppet, our system for
ensuring consistent server configuration across all of our servers. In
addition, thanks to the help of our ad hoc backup committee, we've made
a number of improvements for members, including the new ability to
easily download your database backups.

== What's coming ==

Compared to a year ago, our infrastructure is dramatically improved. In
particular, thanks to the resolution of the Linux kernel bug and the
improved handling of individual web sites, we expect better uptime and
stability across all servers.

But there is still more work to be done. We're currently in the testing
phase of a new mail configuration that will allow all members to use the
same mail settings, with no need to keep track of your primary server
when setting up your email accounts. This change will also allow us to
more easily move members between primary hosts, which will greatly help
us redistribute members based on the amount of resources they are
consuming.

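As a rough sketch of what "one set of settings for everyone" could look like once testing ends, every member would enter the same two entries regardless of their primary server. The hostname and ports below are placeholders for illustration, not an announcement; watch for the official details:

```
# Hypothetical unified client settings -- hostname and ports are a
# sketch only, not the announced configuration.
Incoming (IMAP): mail.example.org, port 993, SSL/TLS
Outgoing (SMTP): mail.example.org, port 587, STARTTLS
Username: your full email address; authentication required for both
```
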
And, starting in October, Ross will be joining May First/People Link as
a paid part-timer dedicated to answering support tickets. Welcome Ross!

This past year, our work on the behind-the-scenes infrastructure has
been largely invisible to the membership. With these improvements under
our belt, we look forward to engaging members over the upcoming year to
prioritize new features and services.