Changes between Initial Version and Version 1 of infrastructure-2018

Jul 9, 2018, 9:34:48 AM (16 months ago)
Jamie McClelland



     1= Infrastructure 2018 Plans =
     3Everything is changing. In the US, Trump's presidency and his moves toward neo-fascism are forcing a decisive shift in our political struggle. In Mexico, the election of Obrador marks a dramatic break from decades of leadership by the centrist and right-wing parties.
     5Just about the only thing that has stayed the same over the last 15 years is the underlying architecture of May First/People Link's hosting technology. And now, that is going to change as well to address the changing political landscape.
     7This document provides a less technical overview of the big picture. A [wiki:red-mosh-reorganization companion piece provides more technical details about immediate steps we are planning] to reach our goals.
     9== The Problem ==
     11**The primary problem with our current infrastructure is inflexibility:** when a site suddenly becomes popular, it is painfully slow and tedious to re-allocate hardware to support it. When a web site is attacked, it is hard to move it behind a better firewall to block the attacks. And when a site is compromised, it is hard to isolate it so it does not negatively impact other members. As our membership grows and increasingly experiences sudden and dramatic changes in technology needs, we need to be able to handle those changes in a matter of minutes rather than hours or days.
     13== The Goals ==
     15The new technologies we will explore are all based on free, open source software; however, its development is driven by capitalism. So, it's important that we keep our goals clear and understand where they differ from the goals of the technology we are implementing.
     17In particular, the new technologies often assume the use of leased hardware from corporations like Amazon, whereas our politics require us to fully own and control our own hardware. This important distinction has an impact on how we approach the technology.
     19**The primary goal of our project is to allow us to more flexibly allocate our limited computer hardware to meet the needs of our members.**
     21There are also a number of secondary goals which we hope to achieve but not at the expense of our first goal:
     23 * ''Ability to scale from a few thousand users to millions of users:'' this goal is the primary goal of many new container-based technologies; however, it only marginally applies to us. Yes, we want to be able to handle a web site that becomes explosively popular overnight. However, our primary need is to handle thousands of relatively low traffic web sites rather than a single high traffic web site. This goal is still an important secondary goal so we have the ability to support the few members that are focused on growing their Internet resources into the millions.
     25 * ''Ability to instantly recover from hardware failure:'' this goal is also a primary goal of most new technologies, but does not apply well to us. It largely depends on hardware capacity that is more than double the capacity you need to run your servers. When you have access to leased hardware via Amazon, this is quite simple and affordable. When you own all of your hardware it is prohibitively expensive. This goal still remains as an important secondary goal - and the ability to manually recover from hardware failure in a matter of minutes will still be a critical requirement. However, auto fail-over will most likely not be feasible for all member services.
     27== First Steps ==
     29We don't have all the solutions right now. However, we are looking carefully at what are called "container" based technologies. Containers allow us to provide individual services (like a single web site) in a mostly isolated way, so they can easily be moved between servers and use hardware more efficiently. This approach depends on a tightly integrated collection of servers that work together.
     31In contrast, our current infrastructure is a collection of autonomous servers designed so that if one breaks down, it has no impact on any of the others. Toward that goal, most of our services are organized into individual servers called MOSHes - which provide web, email, database, and ssh/sftp services on a single virtual server that is shared by about 50 members. Each MOSH is mostly independent of all other servers - it will keep on working even if every other server goes down.
     33Currently we have about 75 MOSH servers.
     35Unfortunately, container technology is still quite new and is changing rapidly, with some solutions dying off and others developing quickly. Therefore, it's still too soon to commit to any one container-based technology.
     37However, we have identified three crucial steps we need to take to prepare for a future shift to containers.
     39Each step is designed to move us toward a more integrated environment that will both help us in our primary goal to more effectively move resources around and will allow us to more easily and quickly shift to new container technologies.
     41=== Routing ===
     43Our current infrastructure mostly uses the Domain Name System (DNS) to determine which member web sites, email, etc. should be routed to which of our 75 MOSHes.
     45To prepare for the container-based approach, we will need to change this, so that DNS routes all members to one or several public facing servers, and these servers in turn route each request to the appropriate place in our network.
     47We are currently using this approach for email - all members configure their email programs to send and receive via - which in turn routes the request to the appropriate server.
     49We are also starting to provide that service for web sites - we have one web server that can provide caching services to protect a site from DDoS attacks and high traffic, and which in turn routes that web traffic to the appropriate server. This approach will need to be further developed to provide a generic form of the service for all web sites.
     51Lastly, we have not started implementing this approach for incoming email (MX services) or SSH/SFTP, which are still routed via DNS, or for MySQL servers (which are all served via localhost).
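
The routing change described in this section can be illustrated with a minimal sketch. It is not our actual implementation: the host names, the internal server names, and the functions `resolve_backend` and `move_site` are all hypothetical. It only shows the idea that a public facing server keeps a table of where each site lives, so moving a site means updating one table entry rather than changing DNS.

```python
from typing import Dict, Optional

# Hypothetical routing table kept on a public facing server:
# public host name -> internal server that currently hosts the site.
ROUTES: Dict[str, str] = {
    "example.org": "mosh-a.internal",
    "example.net": "mosh-b.internal",
}


def resolve_backend(host_header: str, routes: Dict[str, str] = ROUTES) -> Optional[str]:
    """Return the internal server for a requested host, or None if unknown.

    Assumes a plain host name with an optional ":port" suffix.
    """
    # Drop the port, surrounding whitespace, case differences and a trailing dot.
    host = host_header.split(":", 1)[0].strip().lower().rstrip(".")
    return routes.get(host)


def move_site(host: str, new_backend: str, routes: Dict[str, str] = ROUTES) -> None:
    """Point a site at a different internal server without touching DNS."""
    routes[host.lower()] = new_backend
```

With a table like this, re-allocating a suddenly popular site is a matter of calling `move_site("example.org", "mosh-c.internal")` once its data is available on the new server. Members' DNS records keep pointing at the public facing servers throughout.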
     53=== Authentication ===
     55Our current authentication system is a mish-mash of MySQL provided by our control panel (the final authority), a [wiki:login-service login service api] that is backed by the database, an OpenID system (also backed by the database) that is due to be retired, and a process of keeping traditional /etc/shadow files in sync with the control panel MySQL database.
     57These will need to be replaced by a single, distributed system - most likely LDAP, [ FreeIPA], or an improved SQL-based solution.
     59With a single system, we can manage user authentication as well as common user and group ids to help ensure file system permissions are preserved.
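
To make the idea of a single source of truth concrete, here is a small sketch. It is an illustration only and does not represent LDAP, FreeIPA, or our control panel: the `Directory` and `Account` classes, the starting id of 10000, and the PBKDF2 parameters are all assumptions made for the example. The point is that every server asks the same directory, so a user gets the same user and group ids everywhere and file ownership survives a move between servers.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass
from typing import Dict, Optional


def _hash_password(password: str, salt: bytes) -> bytes:
    # PBKDF2 from the standard library; the iteration count is illustrative.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


@dataclass(frozen=True)
class Account:
    username: str
    uid: int
    gid: int
    salt: bytes
    password_hash: bytes


class Directory:
    """Hypothetical central directory that every server would consult."""

    def __init__(self, first_id: int = 10000) -> None:
        self._accounts: Dict[str, Account] = {}
        self._next_id = first_id

    def add_user(self, username: str, password: str) -> Account:
        if username in self._accounts:
            raise ValueError(f"user already exists: {username}")
        salt = os.urandom(16)
        # The ids are allocated once, here, and never per server.
        account = Account(username, self._next_id, self._next_id,
                          salt, _hash_password(password, salt))
        self._accounts[username] = account
        self._next_id += 1
        return account

    def authenticate(self, username: str, password: str) -> Optional[Account]:
        """Return the account if the password matches, otherwise None."""
        account = self._accounts.get(username)
        if account is None:
            return None
        candidate = _hash_password(password, account.salt)
        if hmac.compare_digest(candidate, account.password_hash):
            return account
        return None
```

In this sketch a web server and an SSH server would both call `authenticate` on the same directory and receive the same `uid` and `gid` for a given member, which is what keeps file system permissions consistent.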
     61=== Network storage ===
     63Network storage means that a hard disk that is mounted on one physical server can be quickly unmounted on that server and re-mounted on a different server. It is a critical component of a container-based infrastructure in general and of meeting our primary goal in particular.
     65Currently, all hard disks in our network are provided by the physical servers hosting the services, which means moving data is a slow and resource-intensive process.
     67We will need to invest in a dedicated server to provide file systems to our network and begin experimenting with moving our data to this new server, probably running NFS plus [ DRBD] or [ Ceph].