Changes between Initial Version and Version 1 of julia-recovery/notes-2011-04-13


Timestamp:
Apr 15, 2011, 1:02:17 PM (13 years ago)
Author:
Jamie McClelland
Comment:

--

  • julia-recovery/notes-2011-04-13

    v1 v1  
     1Notes from Meeting on Julia Data Loss
     2
     3Date: 2011-04-13
     4
     5Present: Approximately 25 members participated in the conference call
     6
     7== Opening ==
     8
     9Alfredo began the meeting with introductions and then proceeded to frame the discussion. He reported that it was the worst data loss in the history of our organization. Although only affecting a small percentage of our membership (about 35 out of 800 - 900 sites and about 20 members out of 500 were affected), for the members affected it has been nothing short of catastrophic. He also related his own personal experience, having lost a site himself that is critical to his work.
     10
     11== Report ==
     12
     13=== What happened ===
     14
     15Next, Jamie gave a detailed report on exactly what happened.
     16
     17While installing a server on the previous Thursday, he accidentally brushed the power cable used by the server pietri, causing it to shut down. Upon restarting, he was prompted for the disk passphrase. (All MFPL servers have their hard disks encrypted. In the event they are seized or stolen, encryption provides the organization with more control over the privacy of member data.)
     18
     19He then discovered that the passphrase on file was not the proper passphrase for the server and we did not have a copy of the proper disk encryption passphrase. This was the first critical mistake in the chain of events.
     20
     21Jamie then went on to explain the reason for this incorrect passphrase.
     22
     23Prior to September 2010, all MFPL server passphrases were kept in a single encrypted file stored on Jamie's laptop and backed up daily. Then, in September 2010, this file was accidentally deleted and, Jamie discovered, the backup had not been running properly since February 2010.
     24
     25This was not a crisis. All passphrases can be recovered from running machines. The tech team performed an audit, searching for missing passphrases and recovering each one.
     26
     27In addition, we fundamentally changed our passphrase distribution system. Starting in September 2010, we began storing our passphrases using a system called keyringer. Each time a passphrase is added, changed, or removed, the file is updated, encrypted to each of the 8 members of the MFPL support team, and copied to a central server. Each MFPL support team member periodically refreshes their copy to ensure multiple copies are always available.
     28
     29In late February, during a reboot of the server kiyoshi, we discovered that the passphrase we had on file for kiyoshi was not the correct passphrase (all data was restored on kiyoshi from backup). The reason is that, during routine maintenance in May 2010, the passphrase had been changed. During our September 2010 audit, we only looked for missing passphrases, so we did not realize the passphrase had changed.
     30
     31At this point we made another critical mistake. We should have run a new audit to ensure that all passphrases on file were the correct passphrases for each server. Our failure to take this step was an inexcusable and reckless mistake that directly resulted in the lost data on julia.
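
The difference between the September audit (which only looked for missing passphrases) and the audit that should have followed (which checks that each passphrase on file is correct) can be sketched as below. This is a minimal illustration, not the tech team's actual tooling: the function `audit_passphrases`, the `verify` callback, the server names and the passphrases are all hypothetical. In practice `verify` would have to test the stored passphrase against the server's encrypted disk without disrupting the running machine.

```python
from typing import Callable, Dict, List, Optional


def audit_passphrases(
    on_file: Dict[str, Optional[str]],
    verify: Callable[[str, str], bool],
) -> Dict[str, List[str]]:
    """Classify each server's stored passphrase as missing, incorrect, or ok.

    on_file maps a server name to the passphrase on file (None if absent).
    verify(server, passphrase) is a hypothetical check that returns True
    only when the passphrase actually unlocks that server's disk.
    """
    report: Dict[str, List[str]] = {"missing": [], "incorrect": [], "ok": []}
    for server in sorted(on_file):
        passphrase = on_file[server]
        if not passphrase:
            # This is all that a "missing passphrase" audit can detect.
            report["missing"].append(server)
        elif not verify(server, passphrase):
            # A stale passphrase is on file; only a correctness audit finds it.
            report["incorrect"].append(server)
        else:
            report["ok"].append(server)
    return report


# Illustrative data only: hypothetical servers and passphrases.
actual = {"server-a": "new-value", "server-b": "rotated", "server-c": "same"}
on_file = {
    "server-a": "old-value",   # changed during maintenance, never updated
    "server-b": "also-stale",
    "server-c": "same",
    "server-d": None,          # nothing on file at all
}

report = audit_passphrases(on_file, lambda s, p: actual.get(s) == p)
print(report)
```

An audit that stops at the `missing` list would flag only `server-d` here and would report the two servers with stale passphrases as healthy.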
     32
     33Once we discovered that we would not be able to decrypt the disk on pietri, we released a service advisory warning our members to expect about 12 hours of downtime as we re-built pietri and all its virtual servers from backup.
     34
     35During the early morning hours on Friday, we discovered that, while the backups of eagle and kramer were complete, the backup of julia was missing the databases. The cause was a missing configuration file. This failure was another critical mistake contributing to the data loss. It was a failure not only of the initial setup of julia; it also points to a lack of robust auditing systems to alert us that a critical component of the backup was missing.
     36
     37=== Immediate Steps ===
     38
     39Immediately after discovering the incorrect disk passphrase, Jamie asked a member of the support team to run an audit to ensure that we were in possession of the correct disk encryption passphrases for all running servers (the test came back positive).
     40
     41On Friday morning, we compiled as accurate a list as was possible of the affected sites, members and their email addresses. We drafted and sent individual emails to the contacts for each affected member explaining what happened.
     42
     43Support team members worked all day on Friday, having phone calls and extended email communications with affected members, helping individual members restore their sites from their own backups (when available), running warrick (a program that recovers public files from Internet caches) for each site, and answering support tickets.
     44
     45On Saturday we published a wiki page outlining each affected site and the status of its recovery (https://support.mayfirst.org/wiki/julia-recovery), and continued the support effort through the weekend and into the following week.
     46
     47On Monday, we ran an initial human audit to ensure that all backups were running properly.
     48
     49=== Proposal for changes ===
     50
     51Jamie presented several proposals to ensure that this problem will never happen again. These have been expanded upon on a separate wiki page that is currently under revision (https://support.mayfirst.org/wiki/proposals/2011/new-data-protection-procedures).
     52
     53 * Regular human-run passphrase audits to ensure we always have all passphrases for all servers.
     54
     55 * Regular automated backup audits. We currently receive an email alert when a backup returns an error. We need a second automatic audit, independent of the backup process, that confirms the backups have run.
     56
     57 * Regular human-run backup audits. In addition to automatic audits, we need periodic human review of backups to ensure that what needs to be backed up is in fact being backed up.
     58
     59 * Better tools and instructions for non-technical members empowering them to ensure they have a backup of their own data.
     60
     61 * Discussion and recommendations on best practices for developers on how to incorporate full backups into the site development cycle.
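
The automated backup audit proposed above could take roughly the following shape. This is a sketch under stated assumptions rather than a description of any existing MFPL tool: the function name `audit_backup`, the directory layout, the list of required paths and the 36-hour freshness threshold are all hypothetical. The point it illustrates is that the check runs separately from the backup job and inspects the backup itself, so a component that was never configured (such as julia's databases) shows up as missing instead of passing silently.

```python
import os
import time
from typing import Dict, List, Optional


def audit_backup(
    backup_root: str,
    required: List[str],
    max_age_hours: float = 36.0,
    now: Optional[float] = None,
) -> Dict[str, str]:
    """Return {required path: problem} for every check that fails.

    backup_root is the directory holding one server's backup.
    required lists paths (relative to backup_root) that must be present.
    A path is "missing" if absent and "stale" if not modified recently.
    """
    now = time.time() if now is None else now
    problems: Dict[str, str] = {}
    for rel_path in required:
        full_path = os.path.join(backup_root, rel_path)
        if not os.path.exists(full_path):
            problems[rel_path] = "missing"
            continue
        age_hours = (now - os.path.getmtime(full_path)) / 3600.0
        if age_hours > max_age_hours:
            problems[rel_path] = "stale"
    return problems
```

Called with a required list such as `["home", "databases"]` (hypothetical names), an empty result means every expected component is present and recent; anything else would be sent to the support team as an alert. A check like this only confirms presence and freshness, which is why the human-run audits in the list remain necessary.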
     62
     63=== Backup system ===
     64
     65Jamie concluded by describing May First/People Link's general backup strategy.
     66
     67MFPL has a three-part backup system. Each server has at least two hard disks configured in what's called a RAID 1 array. That means that all data is always written to two disks rather than just one disk. If one disk fails, then our servers continue running on the remaining disk.  RAID 1 is our first line of defense against hard drive failure. 
     68
     69Second, all servers back up nightly to an on-site server (a dedicated physical server for backups). Five days of incremental backups are kept on the on-site backup server.
     70
     71Third, all servers back up nightly to an off-site server. Only the most recent copy of the data is stored on the off-site backup server.
     72
     73== Feedback and Discussion ==
     74
     75Hutch reported difficulty understanding the current instructions on the wiki page and requested that any documentation developed be designed for absolute tech beginners.
     76
     77Ana and Jack gave more detail about warrick (http://warrick.cs.odu.edu/), the program used to recover cached versions of data from the web sites affected.
     78
     79Daniel S asked about what procedures we were using for each affected site during the recovery process.
     80
     81Ana and Jamie answered that we are running warrick for each site; however, each site's approach is being handled on a case-by-case basis and led by the member involved. Some members are restoring from their own backups, others have switched their site to the cached version, and other members are re-building from scratch.
     82
     83Ken made a statement of support and appreciation for the honesty and transparency with which MFPL has responded. Ken also asked if MFPL considered it the member's responsibility to make backups.
     84
     85Jamie emphatically stated that MFPL considers it a critical organizational priority and responsibility to ensure the safety of all member data. Despite this present failure, MFPL continues to consider it a crucial organizational responsibility. All proposals for helping members run their own backups are intended as redundant procedures to build member empowerment and control over their own technology, not as an indication of a change in our fundamental approach to data.
     86
     87Andrew asked if we tried to recover the encrypted disk via a brute-force attack or any other method. Jamie responded that we did not make an effort because our systems are designed to be impossible to access without the passphrase (all passphrases are randomly generated 15-character strings).
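
A rough calculation shows why a brute-force attempt was not considered worthwhile. Only the length of 15 characters comes from the notes; the alphabet of 62 characters (upper case, lower case and digits) and the attacker speed of one trillion guesses per second are assumptions chosen for illustration, and the estimate ignores any extra cost per guess imposed by the disk encryption itself.

```python
import math

LENGTH = 15                 # from the notes: 15-character passphrases
ALPHABET_SIZE = 62          # assumption: a-z, A-Z and 0-9
GUESSES_PER_SECOND = 1e12   # assumption: a very generous attacker

SECONDS_PER_YEAR = 365 * 24 * 3600

# Number of possible passphrases and the equivalent entropy in bits.
combinations = ALPHABET_SIZE ** LENGTH
entropy_bits = LENGTH * math.log2(ALPHABET_SIZE)

# On average an attacker finds the passphrase after trying half the keyspace.
average_years = combinations / 2 / GUESSES_PER_SECOND / SECONDS_PER_YEAR

print(f"{combinations:.2e} possible passphrases ({entropy_bits:.1f} bits)")
print(f"about {average_years:.1e} years on average at the assumed rate")
```

Under these assumptions the search space is close to 90 bits, and the expected search time runs to millions of years, which is consistent with Jamie's answer that the disk is effectively inaccessible without the passphrase.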
     88
     89Jon asked why the passphrases changed. Jamie explained it was routine maintenance (a new replacement disk was introduced and we took the opportunity to change the partition scheme, which required moving all data from one encrypted disk to a newly created encrypted disk).
     90
     91And asked how are we sure it won't happen again? Jamie responded by describing the immediate audit steps taken and referencing the proposals on the table.
     92
     93Alfredo raised the point that we have not adequately explained these processes to the non-technical members of the MFPL leadership. Could the leadership have changed these procedures? Too many members shrug their shoulders and say "take care of it", perhaps due to the gap between what our techies know and what our other members know.
     94
     95Amy noted that a number of the proposals require participation and that she would be interested in having a discussion on building that participation.
     96
     97Jack pointed out that one affected member (a client of Jack's) did not get notified - and is not subscribed to the service advisory emails.  Jack suggested that at least one individual from each membership should be subscribed to the service-advisory emails. 
     98
     99Jack also mentioned that two members she works with (PTP and Latina Institute) are looking for advice on whether it's safe to stick with MFPL and want assurances that this won't happen again.
     100
     101Jamie agreed with the need for more robust communication and reported that notes from this meeting would be sent to everyone. He also expressed an interest in continuing the discussion on who to subscribe to the service-advisories list and how to subscribe all members.
     102
     103Jack also asked what MFPL's guarantee is in terms of hosting.
     104
     105Jamie responded that outside of the Statement of Unity and Membership Agreement page on the web site, we have no official or contractual guarantees.
     106
     107Jack also suggested that the weight of the organization is heavy for an all-volunteer organization, and that it seems like a big load for a few people.
     108
     109Mark also asked about paid staff and expressed a concern that international work is detracting from the core operations of the
     110organization.
     111
     112Jamie acknowledged that all support work, including the work of the co-directors, is volunteer. He also said that this week we are hiring our first part-time worker, who will be focusing on financial tasks and following up on members behind in their dues, and that we specifically chose this task for our first paid staff person because it would contribute to the financial health of the organization. Jamie also pointed out that, in the last year, we have effectively organized a support team which meets regularly, has provided significantly more labor to the organization than was available just a year ago, and provides significantly more resources than even a single paid staff person could provide. Jamie also acknowledged that providing financial compensation to the support team was a priority and that discussions were under way as to how we might effectively begin that process over the next year.
     113
     114Alfredo emphasized that MFPL shared infrastructure is the absolute priority and if international work affects this work, we must address it on a case by case basis.
     115
     116He also re-iterated that we have no contractual guarantee but we do promise data protection, which is a promise we failed to keep with julia.
     117
     118Jack said we should update the membership agreement to say we guarantee data protection.
     119
     120Jack also added that it's important to re-examine whether the resources for the infrastructure are being made available and to make sure that international work is not detracting from that.
     121
     122Ana referenced Jack's point that non-technical people were not aware of the problem. She related it to Alfredo's point about the chasm between tech and non-tech folks. Is there a way for the membership to have a non-technical conversation? That would be one way to organize to enable more membership voices: an ability for the membership to quickly give feedback of both a technical and a non-technical nature.
     123
     124Ana also said our commitment to principles is deep - we all see different parts of it. We will not turn off web sites for lack of ability to pay. In practice it needs a lot more finesse. We need to be able to distinguish between abandoned sites and non-abandoned sites, for example.
     125
     126We should have a policy to require people to respond: if they can't pay, they have to provide something. If people do not respond, the sites should be disabled.
     127
     128Mark added that we should make it more obvious how to pay bills.
     129
     130Suggestion from a Sanctuary Movement: have a tutorial event for using the system - more than just a web page, but a training event.
     131
     132And a question: is there a way for members to help each other?
     133
     134Alfredo responded that https://support.mayfirst.org/ is the way for members to support each other.
     135
     136He also explained that we don't have an introductory video - we have a pitch video, but we don't have something to point new members to - that may explain the basics of MFPL - like how to post support tickets. 
     137
     138Ken said that goes back to what Jack brought up. He never expected a host like MFPL to do backups for him - he imagined the constraints MFPL operates under. Clarifying that would be helpful and good for the mission. It would be good to have that documentation at a primer level.
     139
     140Ken also suggested that we stay away from service level agreement
     141language that paints you into the corner because someone holds you
     142liable for an outage.
     143
     144Amy said thanks to everyone and spoke briefly in support of the idea of a tutorial, in video or real-space, noting that a couple of items on Jamie's list could dovetail with that. A tutorial as a follow-up on backups could be put out to seize the moment. Also, performing human audits could be something we train people to do: a support team member could partner with a new person to learn a new skill.
     145
     146Ross: one thing we don't have is a sense of what members' responsibilities are. It should be more than just a piece of paper - we should really incorporate that in our conversations and activities.
     147
     148Daniel S reported that he never quite feels he is a member and doesn't know how to build that community. It has something to do with what happens when you join MFPL: there is very little. Maybe a new member orientation, so people really start to feel like members? MFPL has the possibility to become a community.
     149
     150Ivan: to piggyback off of that, another thing that would help a sense of community would be some kind of in-person events or gatherings. MFPL members in different cities might be ambassadors to meet people in other cities if they have that capacity.
     151
     152Jon said, without taking away from what's been said, we need a means for members to interact that is not the support system. The ticket tracking system is not friendly or welcoming for things that have anything to do with something other than tech support.
     153
     154Ivan: maybe a member-wide discussion mailing list?
     155
     156Jack: supporting folks' Drupal web sites, supporting each other in building web sites, is lots of work and makes me feel bad for MFPL and worried. It's a whole other thing to offer support for a range of Drupal sites.
     157
     158Ross: if we had a sense of community - then that obligation is not MFPL, it's the membership - 500 or 1000 people who might know how to tweak a theme.
     159
     160Alfredo responded that the membership agreement says members are responsible for their own sites, but we "sin" in helping our members. Ross hit the nail on the head - there is not a high level of consciousness among our members about their membership. Most see MFPL as the ability to get resources - "you host our web site" - as if we were a commercial provider.
     161
     162Next steps: put notes on wiki and proposals, Jamie heads up ad hoc committee. Also he will open a ticket for public comment.