Starting from scratch

This post is part of a series how we as SysEleven manage a large number of nodes. We deploy OpenStack on those nodes, but this could be basically everything.

For sure, this is not our first attempt to deploy a managable OpenStack platform. We not only deployed this platform, we also deployed a platform based on Virtuozzo, which is still in heavy use for our managed customers. We have a whole bunch of learnings from deploying and managing the mentioned platforms, which leds us to the decision to start from scratch.

In the beginning of this year, there is basically nothing, but a rough plan. Not even a single line of code, neither software, nor some kind of automation. With this project, we were able to break with everything we run in the company. We were (and still are) allowed to question each software and hardware decission. We started on a greenfield.

There may be much things you will put into question. Most of what we did is inpired by what others did. Nothing is new, or kind of rocket science. Sometimes we needed to take the shortest path, because we had a deadline to mid of the year to deploy a stable high available OpenStack platform.

So, where to start over?

First of all, we needed to drill down the issues we encountered with the previous deployments. We went from vague ideas to concret problems with soft- and hardware solutions we used. But the larges problem was a lack of formal definition of what we want to build, and how we want to do that.

So the very first step was to change this. We wrote a whole bunch of blueprints, capturing a high level view of the cluster. Some of them were very precise, like the decision for Ansible as our primary configuration management system, although the former cluster was build with Puppet. Or the decission for network equipment, how we plug them together and how we operate it. Some other very specific blueprints described that we use a single mono-repo, how we manage the review process, everything we script has to be done in Python of Go, styleguides for those language and Ansible, how we handle dependencies in Ansible, that we are going to solve our current problem, and not every thinkable future problem, and so on and so forth. There were some vague blueprints about: We need a way to get an overview of the current cluster state. Not just what we expect. We need a system to initially configure and track our nodes.

All blueprints are meant to change. Not in a whole. But each time, someone diggs into the next topic, the blueprint is extended with the current knowledge and needs.

So we had a rough overview of what we need. We split the team into two groups. As we knew that we had to deploy a working OpenStack, one group started to deploy OpenStack, with all the components needed via Ansible. The primary goal still is to provide a high available OpenStack. The group I belong to works on the individual software we need to provide a basic configuration for our nodes, keep track of them and keep them up to date.

I joined the team after 8 month off, to take care of my boy. At that point, the basic blueprints were written and the team was consolidated. And to be honest, I am not sad to miss this process, and espacially not the path leading to this.

Where do we get from there?

We dedicated three nodes to be our bootstrap nodes, so called bootinfra. They are mostly installed and configured by hand.

Within half of the year, we were able to plug a new node into a rack. This shows up in our MachineDB where we configure the location of the node. After this, a very basic configuration happens automagically. The node configures its designated fixed ip addresses, hostname and fqdn, some bgp configuration, setup of repos, setup of ssh configuration. Just very basic stuff to reach the node. The next step is still very manual. We assign the node to a group in our Ansible Inventory File, and run Ansible by hand.

From the Ansible point of view, there is no change since then, even there was much process in detail. But we were able to deploy two other cluster in the same manner. One of them is our lab, which shares the production bootinfra. That tends to be easier, but it threw up a whole bunch of problems. For the other cluster, we needed to automate much of our (so far handcrafted) bootinfra. Which now pays out, since we are about to replace the initial bootinfra nodes by automated ones.

Next time, I will write about our bootinfra. Not too much about the configuration, but which services we run, and for what reasons.

Flattr this!

Managing a large number of nodes

Imagine following setup. As surly mentions somewhen in this blog, we are running a cloud based on OpenStack. The team, I am working in, is responsible to provision new nodes to the cluster, and manage the lifecycle of the existing ones. We also deploy OpenStack and related services to our fleet. One hugh task is running OpenStack high available. This seems not too difficult, but also means, we have to make each component OpenStack depends on, HA as well. So we use Galera as Database, Quobyte as a distributed storage, clustered RabbitMQ, clustered Cassandra, Zookeeper, and things I may have forgotten.

I will write about some aspects of our setup, how we deploy our nodes, and how we keep them up to date.

Flattr this!

Dropping weekly format

Turns out. Like expected I am not able to write one post per week, and I didn't manage to pick up after two weeks of vacation. When reading the weekly posts again, they seem kind of weird. Which is not bad, but nut worse the time. So I call the my first attempt to write a series to be failed.

Anyway. Natually, not each week is equally interresting. I will try to pick up interresting topic here, and try to go a bit into detail.

Flattr this!

Weekly 2016/37

No such meeting Monday. This week, our all-hands-meeting was moved to Tuesday. Because most people in my team were on vacation, we did a short daily instead of a 1 hour weekly. More time to do real work :) I returned to the cluster wide unattended update. I worked at least half of the day on this topic. Another look into the networking setup of our lab and some more documenation did the rest of it.

Back in office on Wednesday after a day of sickness. From Tuesday on, the production cloud was kind of slow. In terms of: Creating a bare network took 6 seconds in average, instead of 2 seconds in the lab. Digging into the problem showed a problem in filling up one rabbitmq queue. Usually this should be consumed by a telemetry solution, which is not deployed correctly so far. I removed all the messages in the queue and everything went back to normal again.

On Thursday, I had not too much time, to digg into unattended updates again. Just an hour in the morning. This week is strange, with many little disturbances, and I literally forgot all the small stuff. After lunch we did a complete wipe and reinstall of our lab. I left after 2 hours of installation for 10 hardware nodes. Does not seem to be pretty fast, but in fact this is ok. At least we had to do the process on our first 3 nodes (those form our consul cluster, now) twice, because of an expired pgp key. My colleagues Continues with running Ansible on the nodes, without huge problems.

Although, the installation on the day before was pretty ok overall, I cleaned up some stuff on Friday. With some minor changes, the installation is smoother than Thursday. I did not run an entire installation. But shut off the consul cluster to simulate the "nothing is there"-situation. This worked pretty well. I did a second reinstallation, but with consul turned on. This is the usual scenario if we grow our existing cluster by new machines. I finalized the installation by applying Ansible. Because the node, I selected, was a gateway of our SDN, there was a little problem with a new UUID for this node, but easily to resolve by hand. This sceanrio is not usual, so I just documented this issue and did not automate it away. Furthermore, I extended the existing documentation by things, I saw yesterday, but not yet in the documentation. Somewhen in between, we had a call with our storage vendor about the next steps to happen. The afternoon I prepared a update of consul 0.6.4 to 0.7. Its going to be bad:

"Consul's default Raft timing is now set to work more reliably on lower-performance servers, which allows small clusters to use lower cost compute at the expense of reduced performance for failed leader detection and leader elections. You will need to configure Consul to get the same performance as before. See the new Server Performance guide for more details. [GH-2303]"

In other words: It did not work well on raspberry pi, we fixed this, but destroyed it for everyone else. And for sure, 0.6er consul refuse to start because of an unkown key, if I already would place the new paremeter to all config files :(

Oh. Almost forgotten. Last week I wrote about my plans of cleaning up my vim modules. And with this, a switch to neovim. Turned out. Cleaning up stuff went well. The neovim experiment not. It felt ok in most points, but not better then good old vim. And one thing let me scream several times, and go back vim. In some/most cases a hit on ESC took a while to complete. That long, that I typed some more characters before leaving insert mode. An I'm escaping a lot!

Flattr this!

Weekly 2016/36

Starting with meeting Monday. The first half of the day was gone for this. After lunch, I had to add another authentication mechanism to our MachineDB. In fact this is a Flask-app which uses ldap as an authentication backend. We have to put this part of the software into a place without any access to our ldap. Since we use postgres as the DB backend, I created another authentication backend, using postgres. In the afternoon I helped a fellow with some scripts to setup floating ip networking to our lab cluster.

On Tuesday, I continued my work on the authentication backend and got it merge ready. More debugging was done at the floating ip creation stuff. Because the script is more complex than I hoped, I created the networks by hand and was able to configure our integration test to run against the lab.

The next day was a short day for me. I figured out some problems detected by the tempest run. Some strange errors happen due to the ordering of openstack endpoints. The config parameter v3_endpoint_type has been ignored, and adminURL was used by accident. By accident, since it really depended on the sorting of endpoints. I fixed this by updating the id of the endpoint in the keystone database. Beside this, I did some daywork.

Thursday, was kind of boring. First of all, it was a short day again. So primarily I did some daywork, and prepared a backup-presentation for the Berlin Ops Summit on Friday. Turned out, we didn't need it.

Friday was Conference day. I liked the Talk of Jeff Sussna. Most other talks were ok, but not new to me. The most entertaining talk was UI engineering by Dennis Reimann and Jan Persiel. The closing party was great :)

Btw: I am on my way to re-create my vimrc in terms of throwing away modules and settings, I don't use any more. I take the chance to switch to neovim in this step. Lets see how far I come with this.

Flattr this!

Linux Kernel Features…

… mit dessen Hilfe Container-Runtimes gebaut werden. Keine Garantie für die Vollständigkeit dieser Liste.

chroot

z.B. Alpine Linux in chroot

Um den gestarteten Prozessen ein anderes / vorzugeben. Noch nicht mal ein Linux-Feature und wurde bereits 1979 in Unix implementiert.

# Ich habe da mal was vorbereitet.
export mirror=http://dl-3.alpinelinux.org/alpine/
export chroot_dir=/root/alpine
export version=2.6.7-r0
wget ${mirror}/latest-stable/main/x86_64/apk-tools-static-${version}.apk
tar -xzf apk-tools-static-*.apk
./sbin/apk.static -X ${mirror}/latest-stable/main -U --allow-untrusted --root ${chroot_dir} --initdb add alpine-base
mknod -m 666 ${chroot_dir}/dev/full c 1 7
mknod -m 666 ${chroot_dir}/dev/ptmx c 5 2
mknod -m 644 ${chroot_dir}/dev/random c 1 8
mknod -m 644 ${chroot_dir}/dev/urandom c 1 9
mknod -m 666 ${chroot_dir}/dev/zero c 1 5
mknod -m 666 ${chroot_dir}/dev/tty c 5 0
cp /etc/resolv.conf ${chroot_dir}/etc/
mkdir -p ${chroot_dir}/root
mkdir -p ${chroot_dir}/etc/apk
echo "${mirror}/${branch}/main" > ${chroot_dir}/etc/apk/repositories

mount -t proc none ${chroot_dir}/proc
mount -o bind /sys ${chroot_dir}/sys

chroot ${chroot_dir} /bin/sh -l

Namespaces

Definiert was Prozesse sehen und machen können.

Mount

unshare -m /bin/bash
mount -t tmpfs tmpfs /tmp

touch /tmp/blah
ls -l /tmp ### nur blah


Anderes Terminal
ls -l /tmp ### orig-stuff aber kein blah

PID

unshare --pid /bin/bash
ps -ef ## segmentation fault, da fehlt wohl noch was

unshare --pid --mount-proc -f /bin/bash
ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 08:26 pts/1    00:00:00 /bin/bash
root         4     1  0 08:26 pts/1    00:00:00 ps -ef

User

Wichtig, als != root starten

id -u ## 500
unshare --user --map-root-user /bin/bash
id -u ## 0

Trotzdem hat man keine root-Berechtigung!

mount -o tmpfs tmpfs /tmp  ## permission denied

Network

man ip-netns

ip netns list
ip netns add testing
ip netns exec testing ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1

ip link add veth0 type veth peer name veth1
ip link set veth1 netns testing
ip a ## kein veth1

ip addr add 192.168.0.1/32 dev veth0
ip route add 192.168.0.0/24 dev veth0
ip netns exec testing ip link set veth1 up
ip netns exec testing ip addr add 192.168.0.2/32 dev veth1
ip netns exec testing ip route add 192.168.0.0/24 dev veth1

ping 192.168.0.2 ## works

ip netns exec testing ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1
7: veth1@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
   inet 192.168.0.2/32 scope global veth1

Capabilities

Gezielt Fähigkeiten von Prozessen beschränken. Auch wenn der Prozess als effective user 0 (root) läuft.

man capabilities

mount -t tmpfs tmpfs /mnt
umount /mnt
capsh --drop=CAP_SYS_ADMIN --
id -u  ## 0
mount -t tmpfs tmpfs /mnt  ## permission denied

cgroups

Resourcen von Prozessen einschränken. Beispiel Memory.

cd /sys/fs/cgroup/memory
mkdir testing
cd testing
ls -l  # zeigt u.a. tasks

# 2. Console auf und dort die PID ermitteln
ls -l  # funktioniert
echo $$

# Zurück in Console 1 und bash der Console 2 in die cgroup testing aufnehmen
echo PID > tasks
cat tasks
PID

echo 1 > memory.limit_in_bytes

# Console 2
ls -l
Killed

Wie man sieht hängen alle Child Prozesse auch in dieser cgroup!

Weitere Resourenlimits:

  • CPU
  • blkio
  • devices
  • pids (Process numbers)

Siehe die Kernel Dokumentation zu cgroups.

seccomp

secure computing mode. Gezieltes filtern von system calls. Leider habe ich hier kein einfaches Beispiel.

AppArmor / SELinux / SMACK

Mandatory Access Control im Linux Kernel. Es gibt verschiedene Implementierungen. Ubuntu verwendet z.B. AppArmor.

Auf Ubuntu

vi /etc/apparmor.d/usr.sbin.rsyslogd
...
/usr/sbin/rsyslogd {
  # rsyslog configuration
  /etc/rsyslog.conf r,
  /etc/rsyslog.d/ r,
  /etc/rsyslog.d/** r,
  /{,var/}run/rsyslogd.pid rwk,
  /var/spool/rsyslog/ r,
  /var/spool/rsyslog/** rwk
 ...
}

AppArmor wird unter Ubuntu auch dazu verwendet den Zugriff von VMs-Prozessen einzuschränken. Damit wird der Schaden potentielle Schaden bei einem Ausbruch aus der VM verringert.

Zum Schluss

Alle der genannten Features existieren schon einige Jahre im Linux-Kernel. Vorläufer davon existieren in Form von OpenVZ / Parallels Virtuozzo bereits seit ca. 2000. Allerdings außerhalb des Mainline Kernels gepflegt. Die Features können nicht nur dafür benutzt werden um Container Runtimes wie Docker oder CoreOS rkt zu erstellen, man kann auch "regulär im System" laufende Anwendungen damit einpferchen. Auch systemd bietet Flags, die einige der angesprochenen Techniken verweden um Prozesse besser voneinander zu trennen. Siehe dazu Security Features in systemd.

Flattr this!

OpenStack-Tempest No nw_info cache associated with instance

When using Tempest to test an OpenStack-Cloud, it is easy to run into this error

tempest.lib.exceptions.BadRequest: Bad request
Details: {u'message': u'No nw_info cache associated with instance', u'code': 400}

In fact, networks are not configured proper. Tempest needs a full configured network, connected to the external network, to get some things working. This must be available to all tempest projects/tenante, hence it is enough to create it for each, but use the same name. I created a heat stack for this, and called this network needed-for-tempest. .. code:

heat_template_version: 2014-10-16

description: >
  Setting up a private Network with an snat-router

parameters:
  public_net_id:
    type: string
    description: ID of public network for which floating IP addresses will be allocated

resources:

  ### Creates a Neutron network and subnet
  network:
    type: OS::Neutron::Net
    properties:
      name: needed-for-tempest

  subnet:
    type: OS::Neutron::Subnet
    properties:
      name: needed-for-tempest1
      dns_nameservers:
        - 37.123.105.116
        - 37.123.105.117
      network_id: { get_resource: network }
      ip_version: 4
      gateway_ip: 10.0.0.1
      cidr: 10.0.0.0/24
      allocation_pools:
        - { start: 10.0.0.230, end: 10.0.0.240 }

  router:
    type: OS::Neutron::Router
    properties:
      external_gateway_info: { network: { get_param: public_net_id } }

  router_subnet_connect:
    type: OS::Neutron::RouterInterface
    depends_on: [ subnet, router ]
    properties:
      router_id: { get_resource: router }
      subnet: { get_resource: subnet }

Of cource, tempest.conf has to be edited as well:

[network]
fixed_network_name = needed-for-tempest
Flattr this!

Weekly 2016/35

This is the third approach to write a weekly, and the first one, I totally failed. I didn't make any notes of the week :/

But let's start anyway. This week was covered by a re-deployment of our lab. As expected several things went wrong, and we had to fix it iteratively. Nothing special so far. This couldn't be sticked to one special day. Like describe in the last weeks, I digged around in our network setup. This week, I took my time and documented it down into our IPAM nipap. That includes all the transfer nets and all public IPs assigned to us. In case of service IPs, I documented the "function" as well. We are now able to collect new IPs directly from this pools, without having to grep config files, and hope you catched everything.

On Wednesday, one day with fixing lab-stuff, I got the request to benchmark our S3. Which I started on my short Thursday. I decided to write a script in Go to parallelize the work. This would be perfectly doable in python as well, but I wanted to use something new. First results were collected on Friday morning. Over and over again in this week, I helped a fellow to configure the forward proxy corretly.

Beside of the S3-work, I was in our change commission, to discuss what things we can do on our own, and what should be discussed again in this commission. Lucky us, we can do most of the things, I wanted to.

Friday, started with collecting S3 benchmarks. Then an outage of our cinder-volume occured, which I helped to fix. For sure, since one of my changes broke the configuration :( Fortunately, it resulted in a loss of control and running instances were not affected. I wrote the post-mortem about this outage, and filed some tickets to fix missing bits in our monitoring. After this, I fixed an issue with our StorageManager. It creates new logical volumes on cluster bootstrap, but in our configurations they are very likely to be at the same position like before. So xfs restored the filesystem instead of creating a new one. I now wipefs the new created lv.

Currently I am graphing the S3-results with jupyter notebook, which is really cool. Give it a try. I installed mathplotlib and numpy as well. The performance in average is pretty ok, both up and download. But the long tail does really hurt. Some of the requests took around 8 seconds, while the average is around 0.08 seconds.

What did I test? I uploaded around 12500 files from a pool of 25 unique images. Using 5 uploads in parallel, and parallel to this, 10 downloads of random images. Both over and over again. I could even do more in parallel. Anyway. I we got some tickets about slow S3-requests. But like all tickets without any steps to recreate. The result we see here, is exactly what I wanted to get out of this tests. In short-term we are going to update our S3-Layer to the latest version. When the results are still the same, we can file a bug against our vender. Of course with steps to recreate ;) Maybe the latest version will be in lab somewhen next week.

Flattr this!

Weekly 2016/34

Starting this week with minor devel tasks. Two weeks before, I implemented a feature in our machine database, to be able to talk to different consul datacenters. There were some review comments, I had to fix. This enables us to talk to reach the consul cluster of our lab, even though the machine database is located in a different cluster.

My colleagues started to roll out our ansible roles to the lab cluster. Since we changed lots of stuff to support multiple clusters, we expect the rollout to be painful. And it did not last long to run into the first error. The storage manager crashes, due to a wrong usage of pathlib.PosixPath. I fixed those issues, and built debian packages as well. Whereas building debs was the most painful part of this task.

At some point in the rollout, we were unable to reach all of the lab hosts, but only from two hosts, including one ouf our cluster workstations. The rest was working fine. I discovered "strange" routes on the ToR of lab, and was able to track down the issue with the help of our NetOps. I learned to ask bird for learned and exported routes. This quickly showed, two hosts in the lab exported the same ip address like the production cluster. With this information, a fellow reconfigured the canonical ip addresses in his host-vars and re-rolled ansible. After bringing the wrong (duplicate) address down, everything worked fine.

Tuesday, we continued our work on the lab plattform. After some minor changes in our ansible roles, we discovered some inconsistency in ASN numbering. I reverse-engineered the remaining parts of the bgp configurations of both clusters and introduced a new ASN policy. This included a change to all the ASN in our new lab. Which means a temporarily loss of control of all the nodes in the lab. Fortunately the configuration was successful and all nodes re-announced their IPs. The configuration change was done with our NetOps again, because we need some changes on devices, we don't have access to. After all the changes the connection between production and lab were established again, and the loadbalancers in the lab were able to annonce public ips. We have an accessable lab now :)

At some time on Tuesday, I tried to get metrics from Consul. Turned out, Consul is only able to stream changes to something like statsd.

Bring all the stuff together. One of my colleagues mainly changed many of our ansible roles, to make the lab possible. In fact, the rollout of the last days was done by these changes. He started a while ago, and so master diverged a lot, resulting in a hugh, unmaintainable merge request. Since the cluster rollout worked fine, we split the merge request into smaller ones, that we merge into master, without affecting production. Just one large refactoring of the loadbalancer role remaining. After getting all the v4 bgp-sessions alive during the last days, today I fixed issues with the v6 configuration an brought up all the sessions as well. This was the last bit for a fully working underlay-network :) The afternoon I documented things for baremetal deployment and the network configuration.

On Thursday we did some minor fixes to the loadbalancer role and finally merged it into master. Another feature of yet another colleagues became merge ready, and we brought this to master as well. Since it is not yet in any playbook, this will not affect any running services. After those merges, we applied the entire new master on our lab. The next step would be to re-deploy the lab from master. Since another team would like to participate, we scheduled this into next week. I tried to check some network settings with our SDN Midonet, as a prerequisite for running the integration test-suite, and discovered missing configuration of external IPs and BGP sessions. It was barely automated with a shell script, and it was available in our repo, but never called automatically on bootstrap. However, this script does not perform any checks whether the objects it tries to create, are already available. It also works only for our production cloud, even if it should never ever be called against it, due to the lack of checks. Anyway, it is a nice task for our new team member, to rewrite most of the script as an ansible role, that could run regulary. Since we need this script updated for the lab deployment anyway, we just wait for this rewrite.

Friday, and this week one of my short days, and the lab works good enough to test some things. First I helped out Mathias to setup the Loadbalancers with the config of production again. Then we migrated to the new loadbalancer role, without any problems. Yeahie. The rest of the day, I tried a fix to the configuration of all OpenStack-APIs, I did a few weeks ago. Due to a lack of a "not production" environment, I was not able to test it so far. Today I did, and the first shot worked well for all services except the both relevant ones nova and neutron. The latest version of python-openstacksdk does not work against our neutron. We use the traditional OpenStack service ports, but using https. The services http-links in api-json-results. If the client follows those links, it talk http to the https endpoint resulting in a bad request (400). The solution is to configure oslo_middleware.HTTPProxyToWSGI for each service. With this, all the returned json-links are https.

Before:

> curl https://api.cbk.cloud.syseleven.net:9696/ | jq '.'
{
  "versions": [
    {
      "status": "CURRENT",
      "id": "v2.0",
      "links": [
        {
          "href": "http://api.cbk.cloud.syseleven.net:9696/v2.0",  <<===
          "rel": "self"
        }
      ]
    }
  ]
}

After:

> curl https://api.zbk.cloud.syseleven.net:9696/ | jq '.'
{
  "versions": [
    {
      "status": "CURRENT",
      "id": "v2.0",
      "links": [
        {
          "href": "https://api.zbk.cloud.syseleven.net:9696/v2.0", <<==
          "rel": "self"
        }
      ]
    }
  ]
}

The first shot did not work for nova and neutron, because the versions-endpoints had a different pipeline configuration, than all the other stuff. I just needed to add HTTPProxyToWSGI to those endpoints and everything worked fine. Because some oslo_middleware developers decided to disable HTTPProxyToWSGI by default for a strange reason. This flag is not yet available in OpenStack Mitaka, but the default will break our deployment on an update to Newton. So I set enable_proxy_headers_parsing = True in all our OpenStack Services, and let it be ignored in the current deployment.

Flattr this!

Weekly 2016/33

This is the first attempt to write about my week in tech. More about the things I did, and less about things I read about. First of all, this is for me. So times, I tend to be a bit frustrated about being caught in a treadmill. So I really don't know where this is going.

Let's step into this week. It was going to be a short week. I'm still on part time, and working 30h a week. This week from Monday to Thursday. The main task, and target as well, was bootstrapping the hardware nodes for our second OpenStack Cloud, which is going to be a lab. Means, this is the second time to bootstrap with our automation, and the first time doing it from zero to working in without steps in between.

For sure, the hardware is not exactly the same like the production cloud, but only differs some nuances. The most important difference for the entire cluster is the fact, that we dual use the boot infrastructure (like pxe boot) of the production cloud. However, the lab cloud must get its own network, but also reach our boot infrastructure and must be reachable from the cluster workstations of the production cloud. For sure, the connections were set up by our netops.

Slight insight into the networking. The ToR-switches of the production cloud are connected to a qfabric. The ToR of lab as well. Between those two, a mpls l3vpn is configured. Even the racks stand side by side, we decided to use this configuration. Our plan is to spread several clusters over the data center. So this is a quite nice test case.

The nodes were started end of the week before, but did not show up in the machine database. I even didn't expect this working the first time. Unfortunately, due to the design, I was not able to reach any bmc of the nodes. The nodes should be configured in a way to boot from network into a live system. This does a quick inventory of the machine, and submits it into the database. We as the admins, have to configure the rack position. The live system picks this configuration periodically and does some system configuration like setting a fixed bmc ip address and password. We are also able, to set the system disk in our machine database. After this two configuration bits, rack position and system disk, the node is ready to boot into the ubuntu installer and configures itself with a preseed. Again, this was some decision made somewhen. I would prefer Images at this point. But however.

However. None of the 10 machines showed up, and bmc did not work. We send out one of our colleages to the datacenter. While he was on his way, I connected to the ToR of the lab, because this was reachable, and did some basic tests. Because we use anycast for dns, ntp and some more stuff, the first guess was no working anycast. Turns out anycast worked. But the throughput was at least one problem. I only got 1.4MByte/s from ToR to ToR. Which is really bad, they are connected with 40GBit/s. My DC-fellow check network connections, and correct patching. Everything looked good so far. Since we started late, due to the meeting-monday, we called it a day.

On Tuesday, i tried to get our netops, to get the problem solved. Most of them were busy as well, but with little chatting, and further debugging, I detected a drop of the mtu from 9000 to 1500 somewhere in the qfabric. With the help of the netops, we checked the links. All were configured properly. Great!? Since both ToRs are configured with a mtu of 9000, and the bad throughput, I expect pmtu discovery not to work. Which is even worse if 10 nodes should do netboot/install stuff almost at the same time. Since we still lacked bmc, my colleage again visited the DC. He shut off all the nodes and brought back the first one. This barely worked, and it was able to download the live image.

I was stucked at this point. I had to wait for the netops who initially configured the l3vpn to help out. So in the meantime, I helped to submit a new version of the internal storage manager (that configures disks to be system or lvm or cluster fs), and get it out to the production cluster. This removed a blocker of a aborted rollout of the week before. At my end of the day, we had 4 of 10 nodes running. Before I left, I could grab the netops and he promised to have a look at this issue. During my spare time, my fellow in the DC discovered, that six nodes didn't even boot from the correct network device. For sure he was not able to fix this because we need a firmware update for the cards to get it working.

Wednesday. Like my netops promised, the mtu issue was fixed. Wohooo. Turned out, there was an overall-setting that overrides the concrete port configuration m( Because of the mtu struggles, I didn't debug the ToR of the lab enough. To recap, we still lack bmc. In the morning I discovered a missing link on ToR to bmc-switch. Lukily for me, my fellow had to visit the DC again, to update the firmwares. So he armed himself with some optics.

In this field, there was nothing to do for me. I waited for the updated nodes to submit their inventory to our database, and a working bmc. So I continued the work on a solution to run a cluster wide unattended update, but stop if something strange happen, or a human hit the red button. In the evening, my fellow replaced the fiber and got another 3 nodes running. The remaining nodes ended in an endless loop in the installer. But I thought, this is pretty good. We could define those nodes to be compute nodes, and deploy OpenStack with only one compute.

My Thursday starts with a totally different topic. We have a Thursday time slot of 15 minutes to present some interresting stuff. I showed my kubernetes stack running onto of our OpenStack. 15 minutes are way to less. I could talk hours about this topic. But anyway. It is a great way to show other teams topics to think about. Even I would like to see more faces in my presentation. I did a live demo of booting the stack, so there are no slides, but the stack is available on github.

After this short trip, I went back to the cluster bootstrap an noticed, that exactly those broken nodes have to be the controllers, because of the ssd provisioning. So we had to get the nodes out of the door. However, we have bmc, so at least no one had to visit the DC. For sure, no errors were shown on the serial console. Nothing but the over and over starting progress bar. Thankfully my DC-fellow was able to connect to ilo and get some error messages. Somewhere in the bootup-time, the routes to the lab disappeared. There was a experimental rollout on the lab-ToR, that broke the config of bird. Trivial fix and we were up again. Some of my tests of the installed nodes showed missing bits in the consul configuration, as well as wrong configuration of the fqdn. Trivial fixes for another fellow.

Back to the insallation process. Turned out. The very last step failed. Because ubuntu changes the boot device to install-disk, there is some late command to reset the boot order netboot. We always use netboot, to be able to change grub config even if we are not able to connect to the nodes at some time. The netboot grub fetches the next desired boot target for the node. A tiny fix in a regexp to detect the network device did the trick. The nodes installed correctly and came up like expected.

This was at 1430. I called it a week and left for weekend. With major help of "the DC-fellow", I was able to get done the ground work for a running lab (installed with the same automation like production). The next steps are up to my colleages anyway.

All in all a very productive week, in face of "only" 30h :)

Flattr this!