julia-recovery/notes-2011-04-13 – Support

Notes from Meeting on Julia Data Loss

Date: 2011-04-13

Present: Approximately 25 members participated in the conference call

Please see:

overview of sites affected
proposals for new procedures
#4119 - the ticket for discussion on this topic

Opening

Alfredo began the meeting with introductions and then proceeded to frame the discussion. He reported that it was the worst data loss in the history of our organization. Although it affected only a small percentage of our membership (about 35 out of 800-900 sites and about 20 out of 500 members), for the members affected it has been nothing short of catastrophic. He also related his own personal experience, having himself lost a site that is critical to his work.

Report

What happened

Next, Jamie gave a detailed report on exactly what happened.

While installing a server on the previous Thursday, he accidentally brushed the power cable used by the server pietri, causing it to shut down. Upon restarting, he was prompted for the disk passphrase. (All MFPL servers have their hard disks encrypted. In the event they are seized or stolen, encryption provides the organization with more control over the privacy of member data.)

He then discovered that the passphrase on file was not the proper passphrase for the server and we did not have a copy of the proper disk encryption passphrase. This was the first critical mistake in the chain of events.

Jamie then went on to explain the reason for this incorrect passphrase.

Prior to September 2010, all MFPL server passphrases were kept in a single encrypted file stored on Jamie's laptop and backed up daily. Then, in September 2010, this file was accidentally deleted and, Jamie discovered, the backup had not been running properly since February 2010.

This was not a crisis. All passphrases can be recovered from running machines. The tech team performed an audit, searching for missing passphrases and recovering each one.

In addition, we fundamentally changed our passphrase distribution system. Starting in September 2010, we began storing our passphrases using a system called keyringer. Each time a passphrase is added, changed, or removed, the file is updated, encrypted to each of the 8 members of the MFPL support team, and copied to a central server. Each MFPL support team member periodically refreshes their copy to ensure multiple copies are always available.

In late February, during a reboot of the server kiyoshi, we discovered that the passphrase we had on file for kiyoshi was not the correct passphrase (all data was restored on kiyoshi from backup). The reason is that, during routine maintenance in May 2010, the passphrase had been changed. During our September 2010 audit, we only looked for missing passphrases, so we did not realize the passphrase had changed.

At this point we made another critical mistake. We should have run a new audit to ensure that all passphrases on file were the correct passphrases for each server. Our failure to take this step was an inexcusable and reckless mistake that directly resulted in the lost data on julia.
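The difference between the September 2010 audit (which looked only for missing passphrases) and the audit that should have followed (which also checks correctness) can be sketched as below. This is a minimal illustration, not MFPL's actual tooling: the function names, the server names, and the `verify` callback are all hypothetical, and a real `verify` would need to be a non-destructive check against the encrypted disk itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AuditResult:
    missing: List[str]    # servers with no passphrase on file
    incorrect: List[str]  # servers whose passphrase on file is rejected
    verified: List[str]   # servers whose passphrase on file is accepted


def audit_passphrases(
    servers: List[str],
    on_file: Dict[str, str],
    verify: Callable[[str, str], bool],
) -> AuditResult:
    """Check every server for a missing AND for an incorrect passphrase."""
    result = AuditResult(missing=[], incorrect=[], verified=[])
    for server in servers:
        passphrase = on_file.get(server)
        if not passphrase:
            # An audit that stops here would report a changed passphrase as fine.
            result.missing.append(server)
        elif not verify(server, passphrase):
            result.incorrect.append(server)
        else:
            result.verified.append(server)
    return result


if __name__ == "__main__":
    # Made-up data for illustration only.
    actual = {"alpha": "new-secret", "beta": "b-secret", "gamma": "g-secret"}
    on_file = {"alpha": "old-secret", "beta": "b-secret"}

    def fake_verify(server: str, passphrase: str) -> bool:
        # Stand-in for a real check against the server's encrypted disk.
        return actual[server] == passphrase

    print(audit_passphrases(list(actual), on_file, fake_verify))
```

In this sketch a server whose passphrase was changed after being recorded lands in `incorrect`, which is the case a missing-only audit cannot see.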

Once we discovered that we would not be able to decrypt the disk on pietri, we released a service advisory warning our members to expect about 12 hours of downtime as we rebuilt pietri and all its virtual servers from backup.

During the early morning hours on Friday, we discovered that, while the backups of eagle and kramer were complete, the backup of julia was missing the databases. The cause was a missing configuration file. This failure was another critical mistake contributing to the data loss. It was a failure not only of the initial setup of julia; it also points to a lack of robust auditing systems to alert us that a critical component of the backup was missing.
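One way to catch a backup that runs without error but omits a critical component is to compare what each backup contains against a list of what it is expected to contain. The sketch below is hypothetical: the function name, the server names, and the path prefixes are assumptions for illustration, and how the list of backed-up paths is obtained depends on the backup software in use.

```python
from typing import Dict, Iterable, List


def missing_components(
    expected: Dict[str, List[str]],
    backed_up: Dict[str, Iterable[str]],
) -> Dict[str, List[str]]:
    """Return, per server, the expected path prefixes absent from its backup."""
    problems: Dict[str, List[str]] = {}
    for server, prefixes in expected.items():
        # A server with no backup listing at all is missing everything.
        paths = list(backed_up.get(server, []))
        absent = [
            prefix
            for prefix in prefixes
            if not any(path.startswith(prefix) for path in paths)
        ]
        if absent:
            problems[server] = absent
    return problems


if __name__ == "__main__":
    # Made-up expectations and backup listings for illustration only.
    expected = {
        "server-a": ["/home/", "/var/backups/databases/"],
        "server-b": ["/home/", "/var/backups/databases/"],
    }
    backed_up = {
        "server-a": ["/home/site1/index.html", "/var/backups/databases/site1.sql"],
        "server-b": ["/home/site2/index.html"],  # database dumps are absent
    }
    print(missing_components(expected, backed_up))
```

The check is deliberately independent of whether the backup job itself reported success, since a job with a missing configuration file can complete cleanly.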

Immediate Steps

Immediately after discovering the incorrect passphrase for the encrypted disk, Jamie asked a member of the support team to run an audit to ensure that we were in possession of the disk encryption passphrases for all running servers (the test came back positive).

On Friday morning, we compiled as accurate a list as possible of the affected sites, members and their email addresses. We drafted and sent individual emails to the contacts for each affected member explaining what happened.

Support team members worked all day on Friday, having phone calls and extended email communications with affected members, helping individual members restore their sites from their own backups (when available), running warrick (a program that recovers public files from Internet caches) for each site, and answering support tickets.

On Saturday we published a wiki page outlining each affected site and the status of its recovery (https://support.mayfirst.org/wiki/julia-recovery), and continued the support effort through the weekend and into the following week.

On Monday, we ran an initial human audit to ensure that all backups were running properly.

Proposal for changes

Jamie presented several proposals to ensure that this problem will never happen again. These have been expanded upon on a separate wiki page that is currently in revision (https://support.mayfirst.org/wiki/proposals/2011/new-data-protection-procedures).

Regular human-run passphrase audits to ensure we always have all passphrases for all servers.

Regular automated backup audits. We currently receive an email alert when a backup returns an error. We need a second automatic audit, independent of the backup process, that confirms the backups have actually run.

Regular human-run backup audits. In addition to automatic audits, we need periodic human review of backups to ensure that what needs to be backed up is in fact being backed up.

Better tools and instructions for non-technical members empowering them to ensure they have a backup of their own data.

Discussion and recommendations on best practices for developers on how to incorporate full backups into the site development cycle.
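The second, independent automated audit proposed above could work by checking how old the newest backup for each server is, rather than waiting for the backup process to report an error. The following is a sketch under stated assumptions: the function name, the server names, and the 36-hour threshold (nightly backups plus some slack) are all hypothetical, and how the completion time of the latest backup is collected is left open.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def stale_backups(
    last_backup: Dict[str, Optional[datetime]],
    now: datetime,
    max_age: timedelta = timedelta(hours=36),  # assumed threshold
) -> List[str]:
    """List servers whose newest backup is missing or older than max_age."""
    stale = []
    for server, finished in sorted(last_backup.items()):
        # None means no backup was found at all, which is the silent failure
        # that an error-only email alert never reports.
        if finished is None or now - finished > max_age:
            stale.append(server)
    return stale


if __name__ == "__main__":
    # Made-up timestamps for illustration only.
    now = datetime(2011, 4, 18, 12, 0)
    last_backup = {
        "server-a": now - timedelta(hours=10),
        "server-b": now - timedelta(days=3),
        "server-c": None,
    }
    print(stale_backups(last_backup, now))
```

Because the audit only looks at evidence left on the backup server, it still raises an alarm when the backup job never starts and therefore never sends an error.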

Backup system

Jamie concluded by describing May First/People Link's general backup strategy.

MFPL has a three-part backup system. Each server has at least two hard disks configured in what's called a RAID 1 array. That means that all data is always written to two disks rather than just one disk. If one disk fails, then our servers continue running on the remaining disk. RAID 1 is our first line of defense against hard drive failure.

Second, all servers back up nightly to an on-site server (a dedicated physical server for backups). Five days of incremental backups are kept on the on-site backup server.

Third, all servers back up nightly to an off-site server. Only the most recent copy of the data is stored on the off-site backup server.

Feedback and Discussion

Hutch reported difficulty understanding the current instructions on the wiki page and requested that any documentation developed be designed for absolute tech beginners.

Ana and Jack gave more detail about warrick (http://warrick.cs.odu.edu/), the program used to recover cached versions of data from the web sites affected.

Daniel S asked about what procedures we were using for each affected site during the recovery process.

Ana and Jamie answered that we are running warrick for each site; however, each site's approach was being handled on a case-by-case basis and led by the member involved. Some members were restoring from their own backups, others have switched their site to the cached version, and other members are rebuilding from scratch.

Ken made a statement of support and appreciation for the honesty and transparency with which MFPL has responded. Ken also asked if MFPL considered it the member's responsibility to make backups.

Jamie emphatically stated that MFPL considers it a critical organizational priority and responsibility to ensure the safety of all member data. Despite this present failure, MFPL continues to consider it a crucial organizational responsibility. All proposals for helping members run their own backups are intended as redundant procedures to build member empowerment and control over their own technology, not as an indication of a change in our fundamental approach to data.

Andrew asked if we tried to recover the encrypted disk via a brute-force attack or any other method. Jamie responded that we did not make an effort because our systems are designed to be impossible to access without the passphrase (all passphrases are randomly generated 15-character strings).
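A rough calculation shows why a brute-force attempt on a random 15-character passphrase was not worth making. The notes give only the length; the alphabet (62 alphanumeric characters) and the guessing speed (one billion guesses per second) are assumptions chosen for illustration, and disk encryption typically makes each guess far slower than that.

```python
import math

ALPHABET_SIZE = 62          # assumption: a-z, A-Z, 0-9
LENGTH = 15                 # passphrase length stated in the notes
GUESSES_PER_SECOND = 1e9    # assumption: a generous attacker speed
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def keyspace(alphabet_size: int, length: int) -> int:
    """Number of distinct passphrases of the given length."""
    return alphabet_size ** length


def entropy_bits(alphabet_size: int, length: int) -> float:
    """Entropy of a uniformly random passphrase, in bits."""
    return length * math.log2(alphabet_size)


def expected_years(alphabet_size: int, length: int, guesses_per_second: float) -> float:
    """Average time to find the passphrase, trying half the keyspace."""
    seconds = keyspace(alphabet_size, length) / 2 / guesses_per_second
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"keyspace: {keyspace(ALPHABET_SIZE, LENGTH):.3e}")
    print(f"entropy:  {entropy_bits(ALPHABET_SIZE, LENGTH):.1f} bits")
    print(f"expected: {expected_years(ALPHABET_SIZE, LENGTH, GUESSES_PER_SECOND):.3e} years")
```

Under these assumptions the expected search time is on the order of billions of years, so restoring from backup is the only practical recovery path.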

Jon asked why the passphrases changed. Jamie explained it was routine maintenance (a new replacement disk was introduced and we took the opportunity to change the partition scheme, which required moving all data from one encrypted disk to a newly created encrypted disk).

And asked how are we sure it won't happen again? Jamie responded by describing the immediate audit steps taken and referencing the proposals on the table.

Alfredo raised that we have been inadequate in explaining to the lay person MFPL leadership what these processes are. Could the leadership have changed these procedures? Too many members shrug their shoulders and say "take care of it" perhaps due to the gap between what our techies know and what our other members know.

Amy noted that a number of proposals require participation and that she would be interested in having the discussion on building that participation.

Jack pointed out that one affected member (a client of Jack's) did not get notified - and is not subscribed to the service advisory emails. Jack suggested that at least one individual from each membership should be subscribed to the service-advisory emails.

Jack also mentioned that a member she works with is looking for advice on whether it's safe to stick with MFPL and wants assurances that this won't happen again.

Jamie agreed with the need for more robust communication and reported that notes from this meeting would be sent to everyone. He also expressed an interest in continuing the discussion on who to subscribe to the service-advisories list and how to subscribe all members.

Jack also asked what is MFPL's guarantee in terms of hosting.

Jamie responded that outside of the Statement of Unity and Membership Agreement page on the web site, we have no official or contractual guarantees.

Jack also suggested that the weight of the organization is heavy for an all-volunteer organization, and that it seems like a big load for a few people.

Mark also asked about paid staff and expressed a concern that international work is detracting from the core operations of the organization.

Jamie acknowledged that all support work, including the work of the co-directors, is volunteer. He also said that this week we are hiring our first part-time worker, who will focus on financial tasks and on following up with members behind in their dues, and that we specifically chose this task for our first paid staff person because it would contribute to the financial health of the organization. Jamie also pointed out that, in the last year, we have effectively organized a support team which meets regularly, has provided significantly more labor to the organization than was available just a year ago, and provides significantly more resources than even a single paid staff person could provide. Jamie also acknowledged that providing financial compensation to the support team was a priority and that discussions were under way as to how we might effectively begin that process over the next year.

Alfredo emphasized that MFPL shared infrastructure is the absolute priority and if international work affects this work, we must address it on a case by case basis.

He also re-iterated that we have no contractual guarantee but we do promise data protection, which is a promise we failed with julia.

Jack said we should update the membership agreement to say we guarantee data protection.

Jack also added that it's important to re-examine this, to make sure that the resources for the infrastructure are being made available and that international work is not detracting from that.

Ana referenced Jack's point that non-technical people were not aware of the problem. She related it to Alfredo's point about the chasm between tech and non-tech folks. Is there a way for the membership to have a non-technical conversation? That would be one way to organize to enable more membership voices: an ability for the membership to quickly give feedback of both a technical and a non-technical nature.

Ana also said our commitment to principles is deep - we all see different parts of it. We will not turn off web sites for lack of ability to pay. In practice it needs a lot more finesse. We need to be able to distinguish between abandoned sites and non-abandoned sites, for example.

We should have a policy to require people to respond: if they can't pay, they have to provide something. If people do not respond, the sites should be disabled.

Mark added that we should make more obvious how to pay bills.

Suggestion from a Sanctuary Movement: have a tutorial event for using the system - more than just a web page, but a training event.

And a question: is there a way for members to help each other?

Alfredo responded that https://support.mayfirst.org/ is the way for members to support each other.

He also explained that we don't have an introductory video - we have a pitch video, but we don't have something to point new members to - that may explain the basics of MFPL - like how to post support tickets.

Ken said that goes back to what Jack brought up. He never expected a host like MFPL to do backups for him - he imagined the constraints MFPL operates under. Clarifying that would be helpful. Good for mission. It would be good to have that documentation at a primer level.

Ken also suggested that we stay away from service level agreement language that paints you into the corner because someone holds you liable for an outage.

Amy thanked everyone and quickly voiced support for the idea of a tutorial, in video or real space; a couple of items on Jamie's list could dovetail with that. A tutorial as a follow-up on backups: put that out to seize the moment. Also, perhaps performing human audits could be something we train people to do: a support team member could partner with a new person to learn a new skill.

Ross: one thing we don't have is a sense of what members' responsibilities are. More than just a piece of paper - we should really incorporate that in our conversations and activities.

Daniel S reported that he never quite feels he is a member and doesn't know how to build that community. It has something to do with what happens when you join MFPL. There is very little. Maybe a new member orientation? So people really start to feel like a member. MFPL has the possibility to become a community.

Ivan: piggybacking off of that - another thing that would help a sense of community would involve some kind of in-person events or gatherings. MFPL members in different cities might be ambassadors to meet people in other cities if they have that capacity.

Jon said, without taking away from what's been said, we need a means for members to interact that is not the support system. The ticket tracking system is not friendly or welcoming for things that have anything to do with something other than tech support.

Ivan: maybe a member-wide discussion mailing list?

Jack: supporting folks' Drupal web sites, supporting each other in building web sites is lots of work and makes me feel bad for MFPL and worried. It's a whole other thing to offer support for a range of Drupal sites.

Ross: if we had a sense of community - then that obligation is not MFPL, it's the membership - 500 or 1000 people who might know how to tweak a theme.

Alfredo responded that the membership agreement says members are responsible for their own sites, but we "sin" in helping our members. Ross hit the nail on the head - there is not a high level of consciousness among our members about their membership. Most see MFPL as the ability to get resources - "you host our web site" - as if we were a commercial provider.

Next steps: put notes on wiki and proposals, Jamie heads up ad hoc committee. Also he will open a ticket for public comment.

Last modified on Apr 18, 2011, 7:35:55 PM
