This post is part of a series about how we at SysEleven manage a large
number of nodes. We deploy OpenStack on those nodes, but it could be
basically anything.
For sure, this is not our first attempt to deploy a manageable OpenStack
platform. We not only deployed that platform, we also deployed a
platform based on Virtuozzo, which is still in heavy use for our managed
customers. We learned a lot from deploying and managing these platforms,
which led us to the decision to start from scratch.
At the beginning of this year, there was basically nothing but a rough
plan. Not a single line of code, neither software nor any kind of
automation. With this project, we were able to break with everything we
run in the company. We were (and still are) allowed to question every
software and hardware decision. We started on a greenfield.
There may be many things you will question. Most of what we did is
inspired by what others did. Nothing is new or any kind of rocket
science. Sometimes we needed to take the shortest path, because we had a
deadline in the middle of the year to deploy a stable, highly available
OpenStack platform.
So, where to start over?
First of all, we needed to drill down into the issues we encountered
with the previous deployments. We went from vague ideas to concrete
problems with the software and hardware solutions we used. But the
largest problem was the lack of a formal definition of what we wanted to
build and how we wanted to build it.
So the very first step was to change this. We wrote a whole bunch of
blueprints, capturing a high-level view of the cluster. Some of them
were very precise, like the decision for Ansible as our primary
configuration management system, although the former cluster was built
with Puppet. Or the decision for network equipment: how we plug it
together and how we operate it. Other very specific blueprints described
that we use a single mono-repo, how we manage the review process, that
everything we script has to be written in Python or Go, style guides for
those languages and for Ansible, how we handle dependencies in Ansible,
that we are going to solve our current problems and not every
conceivable future problem, and so on and so forth. There were also some
vague blueprints along the lines of: we need a way to get an overview of
the current cluster state, not just of what we expect. We need a system
to initially configure and track our nodes.
All blueprints are meant to change. Not as a whole, but each time
someone digs into the next topic, the blueprint is extended with the
current knowledge and needs.
So we had a rough overview of what we needed. We split the team into two
groups. As we knew that we had to deploy a working OpenStack, one group
started to deploy OpenStack, with all the components needed, via
Ansible. The primary goal still is to provide a highly available
OpenStack. The group I belong to works on the custom software we need to
provide a basic configuration for our nodes, keep track of them and keep
them up to date.
I joined the team after 8 months off to take care of my boy. At that
point, the basic blueprints were written and the team was consolidated.
And to be honest, I am not sad to have missed this process, and
especially not the path leading to it.
Where do we get from there?
We dedicated three nodes to be our bootstrap nodes, the so-called
bootinfra. They are mostly installed and configured by hand.
Within half a year, we reached the point where we can plug a new node
into a rack and it shows up in our MachineDB, where we configure the
location of the node. After this, a very basic configuration happens
automagically. The node configures its designated fixed IP addresses,
hostname and FQDN, some BGP configuration, its package repos and its SSH
configuration. Just very basic stuff to reach the node. The next step is
still very manual. We assign the node to a group in our Ansible
inventory file and run Ansible by hand.
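To make the manual step above a bit more concrete, here is a minimal sketch of what "assigning a node to a group" means for an INI-style Ansible inventory file. It is not our actual tooling: the function name `assign_to_group`, the group names and the host names are all made up for illustration, and the sketch only handles plain `[group]` sections.

```python
def assign_to_group(inventory: str, group: str, host: str) -> str:
    """Return INI-style inventory text with `host` listed under `[group]`.

    Hypothetical helper for illustration only. It only edits plain
    `[group]` sections and leaves everything else untouched.
    """
    lines = inventory.splitlines()
    header = f"[{group}]"

    if header not in (line.strip() for line in lines):
        # Unknown group: append a new section at the end of the file.
        if lines and lines[-1].strip():
            lines.append("")
        lines += [header, host]
        return "\n".join(lines) + "\n"

    start = next(i for i, line in enumerate(lines) if line.strip() == header)
    # The section ends at the next section header or at the end of the file.
    end = next(
        (i for i in range(start + 1, len(lines))
         if lines[i].strip().startswith("[")),
        len(lines),
    )

    # Host lines may carry variables ("host key=value"), so compare only
    # the first token. Blank lines and comments are skipped.
    members = [
        line.split()[0]
        for line in lines[start + 1:end]
        if line.strip() and not line.strip().startswith(("#", ";"))
    ]
    if host not in members:
        # Insert after the last non-blank line of the section.
        insert_at = end
        while insert_at > start + 1 and not lines[insert_at - 1].strip():
            insert_at -= 1
        lines.insert(insert_at, host)

    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up inventory with made-up group and host names.
    inventory = "[compute]\nnode01\n\n[network]\nnet01\n"
    print(assign_to_group(inventory, "compute", "node02"))
```

Calling the function twice with the same arguments leaves the inventory unchanged, which is the property you want before automating this step.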
From the Ansible point of view, nothing has changed since then, even
though there was much progress in the details. But we were able to
deploy two other clusters in the same manner. One of them is our lab,
which shares the production bootinfra. That tends to be easier, but it
still threw up a whole bunch of problems. For the other cluster, we
needed to automate much of our (so far handcrafted) bootinfra. This now
pays off, since we are about to replace the initial bootinfra nodes with
automated ones.
Next time, I will write about our bootinfra. Not too much about the
configuration, but which services we run, and for what reasons.