About Load Balancing at GitHub

I just want to point to an interesting resource on load balancing. GitHub recently released GLB: GitHub's open source load balancer (the GitHub Load Balancer Director and supporting tooling repo), the actual implementation of what was described back in September 2016 in Introducing the GitHub Load Balancer. It guides the reader through the challenges GitHub faces in the space of load balancing, how they arrived at their solution, and what didn't work for them.

The load balancer at GitHub needs to be incredibly reliable. We all use it when cloning or pushing to a repo, both of which can take a while. Cutting connections for maintenance would affect many people, especially those working with a bad internet connection.

Even if you are not GitHub, this might be interesting. For instance, if you run OpenStack and depend on the Image API (Glance) for VM snapshots, a restart of the load balancer results in a corrupt snapshot, which might go unnoticed until you urgently need that snapshot. Sure, one should verify that a snapshot works, but not everyone does. And even if one does, it is quite annoying to wait, say, an hour for the upload, just to discover the time has been wasted.

Use HyperFIDO HYPERSECU on Linux

To use a HyperFIDO HYPERSECU token on Linux, you must have access to the device it creates, so you need to add a udev rule to set up the permissions. There are two examples on the support page, but for the systemd uaccess method, one piece of information is missing.

You can use the HyperFIDO-k5-1.rules approach:

ACTION!="add|change", GOTO="u2f_end"

KERNEL=="hidraw*", SUBSYSTEM=="hidraw", ATTRS{idVendor}=="096e", ATTRS{idProduct}=="0880", TAG+="uaccess"

LABEL="u2f_end"

But you need to rename it to something that sorts before 70- in /etc/udev/rules.d/, because the uaccess tag is processed by 70-uaccess.rules.

E.g. /etc/udev/rules.d/69-hyperfido.rules.
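
After placing the file there, reload the udev rules and retrigger the device (or simply replug the key), e.g.:

sudo udevadm control --reload-rules
sudo udevadm trigger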

Starting from scratch

This post is part of a series about how we at SysEleven manage a large number of nodes. We deploy OpenStack on those nodes, but it could be basically anything.

Of course, this is not our first attempt at deploying a manageable OpenStack platform. Besides that platform, we also deployed a platform based on Virtuozzo, which is still in heavy use for our managed customers. We took a whole bunch of lessons from deploying and managing those platforms, which led us to the decision to start from scratch.

At the beginning of this year, there was basically nothing but a rough plan. Not a single line of code, no software, no automation of any kind. With this project, we were able to break with everything we run in the company. We were (and still are) allowed to question every software and hardware decision. We started on a greenfield.

There may be many things you would question. Most of what we did is inspired by what others did; nothing is new or rocket science. Sometimes we had to take the shortest path, because we had a deadline of mid-year to deploy a stable, highly available OpenStack platform.

So, where to start over?

First of all, we needed to drill down into the issues we had encountered with the previous deployments. We went from vague ideas to concrete problems with the software and hardware solutions we used. But the largest problem was the lack of a formal definition of what we wanted to build, and how we wanted to do it.

So the very first step was to change this. We wrote a whole bunch of blueprints capturing a high-level view of the cluster. Some of them were very precise, like the decision for Ansible as our primary configuration management system (although the former cluster was built with Puppet), or the decision for network equipment, how we plug it together and how we operate it. Other very specific blueprints defined that we use a single mono-repo, how we manage the review process, that everything we script has to be written in Python or Go, style guides for those languages and for Ansible, how we handle dependencies in Ansible, that we are going to solve our current problems and not every conceivable future problem, and so on and so forth. There were also some vague blueprints along the lines of: we need a way to get an overview of the actual cluster state, not just what we expect it to be; we need a system to initially configure and track our nodes.

All blueprints are meant to change. Not wholesale, but each time someone digs into the next topic, the blueprint is extended with the current knowledge and needs.

So we had a rough overview of what we needed, and we split the team into two groups. Since we knew we had to deliver a working OpenStack, one group started to deploy OpenStack with all the components needed, via Ansible. The primary goal still is to provide a highly available OpenStack. The group I belong to works on the custom software we need to provide a basic configuration for our nodes, keep track of them and keep them up to date.

I joined the team after eight months off taking care of my boy. At that point, the basic blueprints were written and the team was consolidated. To be honest, I am not sad to have missed this process, and especially not the path leading up to it.

Where did we go from there?

We dedicated three nodes to be our bootstrap nodes, the so-called bootinfra. They are mostly installed and configured by hand.

Within half a year, we got to the point where we can plug a new node into a rack and it shows up in our MachineDB, where we configure its location. After that, a very basic configuration happens automagically: the node configures its designated fixed IP addresses, hostname and FQDN, some BGP configuration, repositories and SSH configuration. Just enough to reach the node. The next step is still very manual: we assign the node to a group in our Ansible inventory file and run Ansible by hand.
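
The manual step itself is nothing fancy; roughly like this, where the group, host and playbook names are just made up for illustration:

# inventory/hosts
[compute]
node042.example.net

[gateways]
node023.example.net

# run the playbook, limited to the new node
ansible-playbook -i inventory/hosts site.yml --limit node042.example.net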

From the Ansible point of view, nothing has changed since then, even though there was a lot of progress in the details. But we were able to deploy two other clusters in the same manner. One of them is our lab, which shares the production bootinfra. That sounds easier, but it threw up a whole bunch of problems. For the other cluster, we had to automate much of our (so far handcrafted) bootinfra, which is paying off now, since we are about to replace the initial bootinfra nodes with automated ones.

Next time, I will write about our bootinfra: not so much about the configuration, but which services we run and for what reasons.

Managing a large number of nodes

Imagine the following setup. As surely mentioned somewhere in this blog, we are running a cloud based on OpenStack. The team I am working in is responsible for provisioning new nodes to the cluster and managing the lifecycle of the existing ones. We also deploy OpenStack and related services to our fleet. One huge task is running OpenStack highly available. That does not sound too difficult, but it also means we have to make every component OpenStack depends on HA as well. So we use Galera as the database, Quobyte as distributed storage, clustered RabbitMQ, clustered Cassandra, ZooKeeper, and things I may have forgotten.

I will write about some aspects of our setup, how we deploy our nodes, and how we keep them up to date.

Dropping weekly format

Turns out, as expected, I am not able to write one post per week, and I didn't manage to pick it up again after two weeks of vacation. Reading the weekly posts again, they seem kind of weird. Which is not bad, but not worth the time. So I declare my first attempt at writing a series a failure.

Anyway, naturally not every week is equally interesting. I will try to pick up interesting topics here and go into a bit more detail.

Weekly 2016/37

No meeting Monday this time; this week, our all-hands meeting was moved to Tuesday. Because most people in my team were on vacation, we did a short daily instead of a one-hour weekly. More time to do real work :) I returned to the cluster-wide unattended updates and worked at least half of the day on this topic. Another look into the networking setup of our lab and some more documentation filled the rest of it.

Back in the office on Wednesday after a day of sickness. From Tuesday on, the production cloud was kind of slow, as in: creating a bare network took 6 seconds on average, instead of 2 seconds in the lab. Digging into it revealed one RabbitMQ queue filling up. It is usually consumed by a telemetry solution, which is not deployed correctly yet. I removed all the messages in the queue and everything went back to normal.
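
For the record, spotting and draining such a queue boils down to something like this (the queue name is just an example, and purge_queue needs a reasonably recent rabbitmqctl):

# list queues sorted by the number of messages
rabbitmqctl list_queues name messages | sort -n -k2 | tail
# drop all messages from the offending queue
rabbitmqctl purge_queue notifications.info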

On Thursday, I did not have much time to dig into unattended updates again, just an hour in the morning. This week was strange, with many little disturbances, and I literally forgot all the small stuff. After lunch we did a complete wipe and reinstall of our lab. I left after 2 hours of installation for 10 hardware nodes. That does not sound very fast, but it is actually OK, not least because we had to run the process on our first 3 nodes (those now form our Consul cluster) twice, due to an expired PGP key. My colleagues continued with running Ansible on the nodes, without major problems.

Although the installation the day before was pretty OK overall, I cleaned up some stuff on Friday. With some minor changes, the installation now runs smoother than on Thursday. I did not run an entire installation, but shut off the Consul cluster to simulate the "nothing is there" situation. This worked pretty well. I did a second reinstallation with Consul turned on, which is the usual scenario when we grow the existing cluster with new machines. I finalized the installation by applying Ansible. Because the node I selected was a gateway of our SDN, there was a little problem with a new UUID for this node, but it was easy to resolve by hand. This scenario is not common, so I just documented the issue and did not automate it away. Furthermore, I extended the existing documentation with things I had seen the day before that were not yet written down. Somewhere in between, we had a call with our storage vendor about the next steps. In the afternoon I prepared an update of Consul 0.6.4 to 0.7. It's going to be bad:

"Consul's default Raft timing is now set to work more reliably on lower-performance servers, which allows small clusters to use lower cost compute at the expense of reduced performance for failed leader detection and leader elections. You will need to configure Consul to get the same performance as before. See the new Server Performance guide for more details. [GH-2303]"

In other words: it did not work well on a Raspberry Pi, we fixed that, but made it worse for everyone else. And of course, a 0.6 Consul refuses to start because of an unknown key if I place the new parameter in all config files in advance :(
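
If I read the upgrade notes correctly, getting the old timing back boils down to a snippet like this in the Consul config, which of course can only be rolled out together with the 0.7 binary:

{
  "performance": {
    "raft_multiplier": 1
  }
}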

Oh, I almost forgot. Last week I wrote about my plans to clean up my vim modules and, along with that, to switch to neovim. Turns out the cleanup went well; the neovim experiment did not. It felt OK in most regards, but not better than good old vim. And one thing made me scream several times and go back to vim: in some (most) cases, hitting ESC took a while to complete. So long that I typed a few more characters before leaving insert mode. And I'm escaping a lot!

Weekly 2016/36

Starting with meeting Monday: the first half of the day went into that. After lunch, I had to add another authentication mechanism to our MachineDB. It is essentially a Flask app that uses LDAP as its authentication backend, but we have to run part of the software in a place without any access to our LDAP. Since we use Postgres as the database backend, I created a second authentication backend that uses Postgres. In the afternoon I helped a colleague with some scripts to set up floating IP networking in our lab cluster.

On Tuesday, I continued my work on the authentication backend and got it merge-ready. More debugging went into the floating IP creation. Because the script turned out more complex than I had hoped, I created the networks by hand and was then able to configure our integration tests to run against the lab.

The next day was a short day for me. I looked into some problems detected by the Tempest run. Some strange errors happened due to the ordering of OpenStack endpoints: the config parameter v3_endpoint_type was ignored and adminURL was used by accident. By accident, because it really depended on the sorting of the endpoints. I fixed this by updating the ID of the endpoint in the Keystone database. Besides that, I did some day-to-day work.

Thursday was kind of boring. It was a short day again, so I primarily did some day-to-day work and prepared a backup presentation for the Berlin Ops Summit on Friday. Turned out we didn't need it.

Friday was conference day. I liked the talk by Jeff Sussna. Most other talks were OK, but nothing new to me. The most entertaining talk was on UI engineering by Dennis Reimann and Jan Persiel. The closing party was great :)

Btw: I am in the process of recreating my vimrc, throwing away modules and settings I don't use any more. I am taking the chance to switch to neovim in the same step. Let's see how far I get with this.

Linux Kernel Features…

… that are used to build container runtimes. No guarantee that this list is complete.

chroot

e.g. Alpine Linux in a chroot

Gives the started processes a different /. Not even a Linux feature; it was implemented in Unix as early as 1979.

# Here is something I prepared earlier.
export mirror=http://dl-3.alpinelinux.org/alpine/
export chroot_dir=/root/alpine
export version=2.6.7-r0
wget ${mirror}/latest-stable/main/x86_64/apk-tools-static-${version}.apk
tar -xzf apk-tools-static-*.apk
./sbin/apk.static -X ${mirror}/latest-stable/main -U --allow-untrusted --root ${chroot_dir} --initdb add alpine-base
mknod -m 666 ${chroot_dir}/dev/full c 1 7
mknod -m 666 ${chroot_dir}/dev/ptmx c 5 2
mknod -m 644 ${chroot_dir}/dev/random c 1 8
mknod -m 644 ${chroot_dir}/dev/urandom c 1 9
mknod -m 666 ${chroot_dir}/dev/zero c 1 5
mknod -m 666 ${chroot_dir}/dev/tty c 5 0
cp /etc/resolv.conf ${chroot_dir}/etc/
mkdir -p ${chroot_dir}/root
mkdir -p ${chroot_dir}/etc/apk
echo "${mirror}/${branch}/main" > ${chroot_dir}/etc/apk/repositories

mount -t proc none ${chroot_dir}/proc
mount -o bind /sys ${chroot_dir}/sys

chroot ${chroot_dir} /bin/sh -l

Namespaces

Defines what processes can see and do.

Mount

unshare -m /bin/bash
mount -t tmpfs tmpfs /tmp

touch /tmp/blah
ls -l /tmp ### only blah


In another terminal:
ls -l /tmp ### the original content, but no blah

PID

unshare --pid /bin/bash
ps -ef ## segmentation fault, something is apparently still missing

unshare --pid --mount-proc -f /bin/bash
ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 08:26 pts/1    00:00:00 /bin/bash
root         4     1  0 08:26 pts/1    00:00:00 ps -ef

User

Important: start as a user != root

id -u ## 500
unshare --user --map-root-user /bin/bash
id -u ## 0

You still do not have real root privileges!

mount -t tmpfs tmpfs /tmp  ## permission denied

Network

man ip-netns

ip netns list
ip netns add testing
ip netns exec testing ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1

ip link add veth0 type veth peer name veth1
ip link set veth1 netns testing
ip link set veth0 up
ip a ## no veth1

ip addr add 192.168.0.1/32 dev veth0
ip route add 192.168.0.0/24 dev veth0
ip netns exec testing ip link set veth1 up
ip netns exec testing ip addr add 192.168.0.2/32 dev veth1
ip netns exec testing ip route add 192.168.0.0/24 dev veth1

ping 192.168.0.2 ## works

ip netns exec testing ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1
7: veth1@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
   inet 192.168.0.2/32 scope global veth1

Capabilities

Selectively restrict the capabilities of processes, even if the process runs as effective user 0 (root).

man capabilities

mount -t tmpfs tmpfs /mnt
umount /mnt
capsh --drop=CAP_SYS_ADMIN --
id -u  ## 0
mount -t tmpfs tmpfs /mnt  ## permission denied

cgroups

Restrict the resources available to processes. Example: memory.

cd /sys/fs/cgroup/memory
mkdir testing
cd testing
ls -l  # shows, among other files, tasks

# open a second console and determine the PID of its shell
ls -l  # works
echo $$

# back in console 1: add the bash from console 2 to the cgroup testing
echo PID > tasks
cat tasks
PID

echo 1 > memory.limit_in_bytes

# Console 2
ls -l
Killed

As you can see, all child processes end up in this cgroup as well!

Further resource limits:

  • CPU
  • blkio
  • devices
  • pids (Process numbers)

See the kernel documentation on cgroups; below is a quick sketch for the CPU controller.
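
A sketch assuming the cgroup v1 hierarchy is mounted under /sys/fs/cgroup, as in the memory example:

cd /sys/fs/cgroup/cpu
mkdir testing
# allow at most 50ms of CPU time per 100ms period, i.e. half a core
echo 100000 > testing/cpu.cfs_period_us
echo 50000 > testing/cpu.cfs_quota_us
# add a shell to the cgroup, as in the memory example
echo PID > testing/tasks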

seccomp

Secure computing mode: targeted filtering of system calls. Unfortunately, I do not have a simple example for this.

AppArmor / SELinux / SMACK

Mandatory access control in the Linux kernel. There are several implementations; Ubuntu, for example, uses AppArmor.

On Ubuntu

vi /etc/apparmor.d/usr.sbin.rsyslogd
...
/usr/sbin/rsyslogd {
  # rsyslog configuration
  /etc/rsyslog.conf r,
  /etc/rsyslog.d/ r,
  /etc/rsyslog.d/** r,
  /{,var/}run/rsyslogd.pid rwk,
  /var/spool/rsyslog/ r,
  /var/spool/rsyslog/** rwk
 ...
}

On Ubuntu, AppArmor is also used to restrict the access of VM processes. This limits the potential damage of a breakout from a VM.

In closing

All of the features mentioned have existed in the Linux kernel for several years. Predecessors have existed in the form of OpenVZ / Parallels Virtuozzo since around 2000, although maintained outside the mainline kernel. The features cannot only be used to build container runtimes like Docker or CoreOS rkt; you can also use them to confine applications running "regularly on the system". systemd, too, offers flags that use some of the techniques mentioned above to better isolate processes from each other. See Security Features in systemd.
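
For example, a service can already be confined quite a bit with a few unit options (a minimal sketch; which options are available depends on the systemd version):

[Service]
# private /tmp, read-only /usr and /etc, no access to /home
PrivateTmp=true
ProtectSystem=full
ProtectHome=true
# the service and its children cannot gain new privileges
NoNewPrivileges=true
# drop everything except the capability to bind privileged ports
CapabilityBoundingSet=CAP_NET_BIND_SERVICE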

OpenStack Tempest: No nw_info cache associated with instance

When using Tempest to test an OpenStack cloud, it is easy to run into this error:

tempest.lib.exceptions.BadRequest: Bad request
Details: {u'message': u'No nw_info cache associated with instance', u'code': 400}

The actual cause is that the networks are not configured properly. Tempest needs a fully configured network, connected to the external network, to get some things working. It must be available to all Tempest projects/tenants, so it is enough to create one per project, as long as they all use the same name. I created a Heat stack for this and called the network needed-for-tempest:

heat_template_version: 2014-10-16

description: >
  Setting up a private Network with an snat-router

parameters:
  public_net_id:
    type: string
    description: ID of public network for which floating IP addresses will be allocated

resources:

  ### Creates a Neutron network and subnet
  network:
    type: OS::Neutron::Net
    properties:
      name: needed-for-tempest

  subnet:
    type: OS::Neutron::Subnet
    properties:
      name: needed-for-tempest1
      dns_nameservers:
        - 37.123.105.116
        - 37.123.105.117
      network_id: { get_resource: network }
      ip_version: 4
      gateway_ip: 10.0.0.1
      cidr: 10.0.0.0/24
      allocation_pools:
        - { start: 10.0.0.230, end: 10.0.0.240 }

  router:
    type: OS::Neutron::Router
    properties:
      external_gateway_info: { network: { get_param: public_net_id } }

  router_subnet_connect:
    type: OS::Neutron::RouterInterface
    depends_on: [ subnet, router ]
    properties:
      router_id: { get_resource: router }
      subnet: { get_resource: subnet }
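
Creating the stack in a project then looks roughly like this, assuming the template is saved as needed-for-tempest.yaml and the openstack CLI has the Heat plugin (python-heatclient) installed:

openstack stack create -t needed-for-tempest.yaml \
  --parameter public_net_id=<UUID of the external network> \
  needed-for-tempest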

Of course, tempest.conf has to be edited as well:

[network]
fixed_network_name = needed-for-tempest

Weekly 2016/35

This is the third attempt at writing a weekly, and the first one I totally failed at: I didn't take any notes during the week :/

But let's start anyway. This week was dominated by a redeployment of our lab. As expected, several things went wrong and we had to fix them iteratively. Nothing special so far, and it cannot be pinned to one specific day. As described in the last weeks, I dug around in our network setup. This week, I took my time and documented it in our IPAM, NIPAP. That includes all the transfer nets and all public IPs assigned to us. For service IPs, I documented their "function" as well. We are now able to allocate new IPs directly from these pools, without having to grep config files and hope we caught everything.

On Wednesday, another day of fixing lab stuff, I got the request to benchmark our S3, which I started on my short Thursday. I decided to write a script in Go to parallelize the work. This would have been perfectly doable in Python as well, but I wanted to use something new. The first results were collected on Friday morning. Throughout the week, I also repeatedly helped a colleague configure the forward proxy correctly.

Besides the S3 work, I was in our change commission to discuss which things we can do on our own and which should be brought back to the commission. Lucky us: we can do most of the things I wanted to.

Friday started with collecting S3 benchmarks. Then an outage of our cinder-volume service occurred, which I helped to fix; of course it was because one of my changes broke the configuration :( Fortunately, it only resulted in a loss of control, and running instances were not affected. I wrote the post-mortem for this outage and filed some tickets to fix missing bits in our monitoring. After this, I fixed an issue with our StorageManager. It creates new logical volumes on cluster bootstrap, but in our configuration they are very likely to end up at the same position as before, so the old XFS filesystem showed up again instead of a new one being created. I now run wipefs on the newly created LV.
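
In essence it now does the equivalent of this (device and volume group names are made up):

lvcreate -L 100G -n volumes vg0
# wipe any filesystem signature left over from a previous LV at the same blocks
wipefs -a /dev/vg0/volumes
mkfs.xfs /dev/vg0/volumes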

Currently I am graphing the S3 results with a Jupyter notebook, which is really cool; give it a try. I installed matplotlib and numpy as well. The average performance is pretty OK, both upload and download, but the long tail really hurts: some requests took around 8 seconds, while the average is around 0.08 seconds.

What did I test? I uploaded around 12,500 files from a pool of 25 unique images, using 5 uploads in parallel and, in parallel to that, 10 downloads of random images, both over and over again. I could even do more in parallel. Anyway, we got some tickets about slow S3 requests, but as usual without any steps to reproduce. The result we see here is exactly what I wanted to get out of these tests. In the short term, we are going to update our S3 layer to the latest version. If the results are still the same, we can file a bug against our vendor, of course with steps to reproduce ;) Maybe the latest version will be in the lab sometime next week.