Starting from scratch

This post is part of a series how we as SysEleven manage a large number of nodes. We deploy OpenStack on those nodes, but this could be basically everything.

For sure, this is not our first attempt to deploy a managable OpenStack platform. We not only deployed this platform, we also deployed a platform based on Virtuozzo, which is still in heavy use for our managed customers. We have a whole bunch of learnings from deploying and managing the mentioned platforms, which leds us to the decision to start from scratch.

In the beginning of this year, there is basically nothing, but a rough plan. Not even a single line of code, neither software, nor some kind of automation. With this project, we were able to break with everything we run in the company. We were (and still are) allowed to question each software and hardware decission. We started on a greenfield.

There may be much things you will put into question. Most of what we did is inpired by what others did. Nothing is new, or kind of rocket science. Sometimes we needed to take the shortest path, because we had a deadline to mid of the year to deploy a stable high available OpenStack platform.

So, where to start over?

First of all, we needed to drill down the issues we encountered with the previous deployments. We went from vague ideas to concret problems with soft- and hardware solutions we used. But the larges problem was a lack of formal definition of what we want to build, and how we want to do that.

So the very first step was to change this. We wrote a whole bunch of blueprints, capturing a high level view of the cluster. Some of them were very precise, like the decision for Ansible as our primary configuration management system, although the former cluster was build with Puppet. Or the decission for network equipment, how we plug them together and how we operate it. Some other very specific blueprints described that we use a single mono-repo, how we manage the review process, everything we script has to be done in Python of Go, styleguides for those language and Ansible, how we handle dependencies in Ansible, that we are going to solve our current problem, and not every thinkable future problem, and so on and so forth. There were some vague blueprints about: We need a way to get an overview of the current cluster state. Not just what we expect. We need a system to initially configure and track our nodes.

All blueprints are meant to change. Not in a whole. But each time, someone diggs into the next topic, the blueprint is extended with the current knowledge and needs.

So we had a rough overview of what we need. We split the team into two groups. As we knew that we had to deploy a working OpenStack, one group started to deploy OpenStack, with all the components needed via Ansible. The primary goal still is to provide a high available OpenStack. The group I belong to works on the individual software we need to provide a basic configuration for our nodes, keep track of them and keep them up to date.

I joined the team after 8 month off, to take care of my boy. At that point, the basic blueprints were written and the team was consolidated. And to be honest, I am not sad to miss this process, and espacially not the path leading to this.

Where do we get from there?

We dedicated three nodes to be our bootstrap nodes, so called bootinfra. They are mostly installed and configured by hand.

Within half of the year, we were able to plug a new node into a rack. This shows up in our MachineDB where we configure the location of the node. After this, a very basic configuration happens automagically. The node configures its designated fixed ip addresses, hostname and fqdn, some bgp configuration, setup of repos, setup of ssh configuration. Just very basic stuff to reach the node. The next step is still very manual. We assign the node to a group in our Ansible Inventory File, and run Ansible by hand.

From the Ansible point of view, there is no change since then, even there was much process in detail. But we were able to deploy two other cluster in the same manner. One of them is our lab, which shares the production bootinfra. That tends to be easier, but it threw up a whole bunch of problems. For the other cluster, we needed to automate much of our (so far handcrafted) bootinfra. Which now pays out, since we are about to replace the initial bootinfra nodes by automated ones.

Next time, I will write about our bootinfra. Not too much about the configuration, but which services we run, and for what reasons.