Weekly 2016/34

I started this week with minor development tasks. Two weeks ago, I implemented a feature in our machine database that lets it talk to different Consul datacenters. There were some review comments I had to fix. This enables us to reach the Consul cluster of our lab, even though the machine database is located in a different cluster.

My colleagues started to roll out our Ansible roles to the lab cluster. Since we changed lots of stuff to support multiple clusters, we expected the rollout to be painful. And it did not take long to run into the first error: the storage manager crashed due to a wrong usage of pathlib.PosixPath. I fixed those issues and built Debian packages as well. Building the debs was the most painful part of this task.

At some point in the rollout, the lab hosts became unreachable, but only from two hosts, including one of our cluster workstations. Everything else was working fine. I discovered "strange" routes on the ToR of the lab, and was able to track down the issue with the help of our NetOps. I learned how to ask bird for learned and exported routes. This quickly showed that two hosts in the lab exported the same IP addresses as the production cluster. With this information, a colleague reconfigured the canonical IP addresses in his host vars and re-ran Ansible. After bringing the wrong (duplicate) addresses down, everything worked fine.
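The duplicate-address check can be sketched in a few lines of Python. This is only an illustration: the host names and the addresses (taken from the documentation range 192.0.2.0/24) are made up, and the input is assumed to be a plain mapping of host to exported prefixes, not real bird output.

```python
from collections import defaultdict


def find_duplicate_addresses(announcements):
    """Return {address: sorted list of hosts} for every address that is
    exported by more than one host."""
    hosts_by_address = defaultdict(set)
    for host, addresses in announcements.items():
        for address in addresses:
            hosts_by_address[address].add(host)
    return {
        address: sorted(hosts)
        for address, hosts in hosts_by_address.items()
        if len(hosts) > 1
    }


# Hypothetical data: one lab node exports an address that production uses too.
announcements = {
    "prod-node-1": ["192.0.2.1/32"],
    "lab-node-1": ["192.0.2.1/32", "192.0.2.11/32"],
    "lab-node-2": ["192.0.2.12/32"],
}

for address, hosts in find_duplicate_addresses(announcements).items():
    print(address, "is exported by", ", ".join(hosts))
```

Run against the exported routes of both clusters, a check like this would have flagged the clash before the rollout.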

On Tuesday, we continued our work on the lab platform. After some minor changes in our Ansible roles, we discovered some inconsistencies in the ASN numbering. I reverse-engineered the remaining parts of the BGP configurations of both clusters and introduced a new ASN policy. This included a change to all the ASNs in our new lab, which meant a temporary loss of control over all the nodes in the lab. Fortunately, the reconfiguration was successful and all nodes re-announced their IPs. The configuration change was again done together with our NetOps, because we needed some changes on devices we don't have access to. After all the changes, the connection between production and lab was established again, and the load balancers in the lab were able to announce public IPs. We have an accessible lab now :)

At some point on Tuesday, I tried to get metrics from Consul. It turned out that Consul is only able to stream them to something like statsd.

Bringing all the stuff together. One of my colleagues did most of the changes to our Ansible roles that make the lab possible. In fact, the rollout of the last days was done from these changes. He started a while ago, so his work and master diverged a lot, resulting in a huge, unmaintainable merge request. Since the cluster rollout worked fine, we split the merge request into smaller ones that we can merge into master without affecting production. Just one large refactoring of the loadbalancer role remains. After getting all the v4 BGP sessions alive during the last days, today I fixed issues with the v6 configuration and brought up all of those sessions as well. This was the last bit for a fully working underlay network :) In the afternoon, I documented things for bare-metal deployment and the network configuration.

On Thursday we did some minor fixes to the loadbalancer role and finally merged it into master. Another feature by yet another colleague became merge-ready, and we brought it to master as well. Since it is not yet used in any playbook, this will not affect any running services. After those merges, we applied the entire new master to our lab. The next step would be to re-deploy the lab from master. Since another team would like to participate, we scheduled this for next week. I tried to check some network settings with our SDN, MidoNet, as a prerequisite for running the integration test suite, and discovered that the configuration of external IPs and BGP sessions was missing. This was only barely automated: a shell script was available in our repo, but it was never called automatically on bootstrap. Moreover, this script does not check whether the objects it tries to create already exist. It also works only for our production cloud, even though it should never ever be called against it, due to the lack of checks. Anyway, it is a nice task for our new team member to rewrite most of the script as an Ansible role that could run regularly. Since we need this script updated for the lab deployment anyway, we will just wait for this rewrite.
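What the script is missing is a "create only if absent" check. A minimal sketch of that pattern in Python follows. The InMemoryApi class is a hypothetical stand-in used only to make the example runnable; it is not the MidoNet client, and the object names and fields are invented.

```python
class InMemoryApi:
    """Hypothetical stand-in for an SDN API, not the MidoNet client."""

    def __init__(self):
        self.objects = {}
        self.create_calls = 0

    def get(self, name):
        return self.objects.get(name)

    def create(self, name, spec):
        self.create_calls += 1
        self.objects[name] = dict(spec)
        return self.objects[name]


def ensure(api, name, spec):
    """Create the object only if it is missing.

    Returns (object, changed), so a caller can report whether anything
    was actually done, similar to how an Ansible task reports "changed".
    """
    current = api.get(name)
    if current is not None:
        return current, False
    return api.create(name, spec), True


api = InMemoryApi()
spec = {"peer_asn": 64512}  # made-up value from the private ASN range
print(ensure(api, "bgp-session-1", spec))  # first run creates the object
print(ensure(api, "bgp-session-1", spec))  # second run changes nothing
```

With a check like this in place, running the configuration repeatedly is harmless, which is the precondition for running it regularly.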

Friday is one of my short days this week, and the lab works well enough to test some things. First I helped Mathias set up the load balancers with the production config again. Then we migrated to the new loadbalancer role without any problems. Yay. For the rest of the day, I tested a fix to the configuration of all OpenStack APIs that I wrote a few weeks ago. Due to the lack of a "not production" environment, I had not been able to test it so far. Today I did, and the first shot worked well for all services except the two relevant ones, nova and neutron. The latest version of python-openstacksdk does not work against our neutron. We use the traditional OpenStack service ports, but with https. The services return http links in their JSON API results. If the client follows those links, it talks http to the https endpoint, resulting in a bad request (400). The solution is to configure oslo_middleware.HTTPProxyToWSGI for each service. With this, all the returned JSON links are https.

Before:

> curl https://api.cbk.cloud.syseleven.net:9696/ | jq '.'
{
  "versions": [
    {
      "status": "CURRENT",
      "id": "v2.0",
      "links": [
        {
          "href": "http://api.cbk.cloud.syseleven.net:9696/v2.0",  <<===
          "rel": "self"
        }
      ]
    }
  ]
}

After:

> curl https://api.zbk.cloud.syseleven.net:9696/ | jq '.'
{
  "versions": [
    {
      "status": "CURRENT",
      "id": "v2.0",
      "links": [
        {
          "href": "https://api.zbk.cloud.syseleven.net:9696/v2.0", <<==
          "rel": "self"
        }
      ]
    }
  ]
}

The first shot did not work for nova and neutron because their versions endpoints had a different pipeline configuration than all the other stuff. I just needed to add HTTPProxyToWSGI to those endpoints and everything worked fine. One more catch: some oslo_middleware developers decided, for a strange reason, to disable HTTPProxyToWSGI by default. The flag to enable it is not yet available in OpenStack Mitaka, but the new default will break our deployment on an update to Newton. So I set enable_proxy_headers_parsing = True in all our OpenStack services, and let it be ignored in the current deployment.
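To illustrate the idea behind such a middleware, here is a small, self-contained WSGI sketch in Python. It is not the oslo_middleware implementation. It assumes that TLS is terminated in front of the service and that the proxy sets an X-Forwarded-Proto header; the host name and the toy versions application are made up for the example.

```python
import json


def proxy_headers_middleware(app):
    """Trust X-Forwarded-Proto and expose it as wsgi.url_scheme."""

    def wrapper(environ, start_response):
        proto = environ.get("HTTP_X_FORWARDED_PROTO")
        if proto in ("http", "https"):
            environ["wsgi.url_scheme"] = proto
        return app(environ, start_response)

    return wrapper


def versions_app(environ, start_response):
    """Toy versions endpoint that builds its self link from the WSGI environ."""
    href = "{}://{}/v2.0".format(environ["wsgi.url_scheme"], environ["HTTP_HOST"])
    body = json.dumps(
        {"versions": [{"id": "v2.0", "links": [{"href": href, "rel": "self"}]}]}
    ).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]


def self_link(app, forwarded_proto=None):
    """Call the app directly with a hand-built environ and return the href."""
    environ = {
        "REQUEST_METHOD": "GET",
        "PATH_INFO": "/",
        "HTTP_HOST": "api.example.test:9696",  # hypothetical host
        "wsgi.url_scheme": "http",  # what the service sees behind the proxy
    }
    if forwarded_proto is not None:
        environ["HTTP_X_FORWARDED_PROTO"] = forwarded_proto
    body = b"".join(app(environ, lambda status, headers: None))
    return json.loads(body.decode("utf-8"))["versions"][0]["links"][0]["href"]


print(self_link(versions_app, "https"))
print(self_link(proxy_headers_middleware(versions_app), "https"))
```

The first call ignores the header and produces an http link; the wrapped application produces an https link. That is the difference between the "Before" and "After" output above.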

Weekly 2016/33

This is the first attempt to write about my week in tech. More about the things I did, and less about the things I read about. First of all, this is for me. Sometimes, I tend to be a bit frustrated about being caught in a treadmill. So I really don't know where this is going.

Let's step into this week. It was going to be a short week. I'm still working part time, 30h a week, this week from Monday to Thursday. The main task, and target as well, was bootstrapping the hardware nodes for our second OpenStack cloud, which is going to be a lab. This means it is the second time we bootstrap with our automation, and the first time doing it from zero to working without intermediate steps.

Of course, the hardware is not exactly the same as in the production cloud, but it differs only in some nuances. The most important difference for the entire cluster is the fact that we share the boot infrastructure (like PXE boot) of the production cloud. However, the lab cloud must get its own network, but also reach our boot infrastructure, and it must be reachable from the cluster workstations of the production cloud. Of course, the connections were set up by our netops.

A little insight into the networking. The ToR switches of the production cloud are connected to a qfabric. The ToR of the lab is as well. Between those two, an MPLS l3vpn is configured. Even though the racks stand side by side, we decided to use this configuration. Our plan is to spread several clusters over the data center, so this is quite a nice test case.

The nodes had been started at the end of the week before, but did not show up in the machine database. I didn't even expect this to work the first time. Unfortunately, due to the design, I was not able to reach the BMC of any node. The nodes should be configured to boot from the network into a live system. This does a quick inventory of the machine and submits it to the database. We, as the admins, have to configure the rack position. The live system picks up this configuration periodically and does some system configuration, like setting a fixed BMC IP address and password. We are also able to set the system disk in our machine database. Once these two configuration bits, rack position and system disk, are set, the node is ready to boot into the Ubuntu installer and configures itself with a preseed. Again, this was a decision made at some point. I would prefer images here. But anyway.

Anyway. None of the 10 machines showed up, and BMC did not work. We sent one of our colleagues to the data center. While he was on his way, I connected to the ToR of the lab, because it was reachable, and did some basic tests. Because we use anycast for DNS, NTP and some more stuff, the first guess was that anycast was not working. It turned out anycast worked. But the throughput was at least one problem: I only got 1.4 MByte/s from ToR to ToR, which is really bad, since they are connected with 40 GBit/s. My DC fellow checked the network connections and the patching. Everything looked good so far. Since we had started late, due to meeting Monday, we called it a day.

On Tuesday, I tried to get our netops to solve the problem. Most of them were busy as well, but with a little chatting and further debugging, I detected a drop of the MTU from 9000 to 1500 somewhere in the qfabric. With the help of the netops, we checked the links. All were configured properly. Great!? Since both ToRs are configured with an MTU of 9000, and given the bad throughput, I suspected that path MTU discovery did not work. That is even worse if 10 nodes should do netboot and install stuff almost at the same time. Since we still lacked BMC, my colleague visited the DC again. He shut off all the nodes and brought back the first one. This barely worked, but it was able to download the live image.

I was stuck at this point. I had to wait for the netops colleague who initially configured the l3vpn to help out. So in the meantime, I helped to submit a new version of the internal storage manager (which configures disks to be system, LVM or cluster FS) and get it out to the production cluster. This removed the blocker behind an aborted rollout from the week before. At the end of my day, we had 4 of 10 nodes running. Before I left, I could grab the netops colleague, and he promised to have a look at the issue. During my spare time, my fellow in the DC discovered that six nodes didn't even boot from the correct network device. Of course he was not able to fix this on the spot, because we need a firmware update for the cards to get it working.

Wednesday. As my netops colleague had promised, the MTU issue was fixed. Wohooo. It turned out there was a global setting that overrides the concrete port configuration m( Because of the MTU struggles, I hadn't debugged the ToR of the lab enough. To recap, we still lacked BMC. In the morning I discovered a missing link from the ToR to the BMC switch. Luckily for me, my fellow had to visit the DC again anyway to update the firmware. So he armed himself with some optics.

In this field, there was nothing for me to do. I waited for the updated nodes to submit their inventory to our database, and for a working BMC. So I continued the work on a solution to run a cluster-wide unattended update that stops if something strange happens or a human hits the red button. In the evening, my fellow replaced the fiber and got another 3 nodes running. The remaining nodes ended up in an endless loop in the installer. But I thought this was pretty good: we could define those nodes to be compute nodes, and deploy OpenStack with only one compute.
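The control flow of such an unattended update can be sketched as a simple loop. This is not our actual tool, just a minimal Python illustration. The update step, the health check and the "red button" are passed in as functions, and all node names are hypothetical.

```python
def rolling_update(nodes, update, is_healthy, stop_requested):
    """Update nodes one at a time.

    Stops before the next node if a human requested a stop, and stops
    right after a node that is not healthy once it has been updated.
    Returns (list of successfully updated nodes, reason for stopping).
    """
    updated = []
    for node in nodes:
        if stop_requested():
            return updated, "stopped by operator"
        update(node)
        if not is_healthy(node):
            return updated, "unhealthy after update: {}".format(node)
        updated.append(node)
    return updated, "completed"


# Hypothetical run: node-2 does not come back healthy, so node-3 is never touched.
nodes = ["node-1", "node-2", "node-3"]
broken = {"node-2"}
updated, reason = rolling_update(
    nodes,
    update=lambda node: None,  # placeholder for the real update step
    is_healthy=lambda node: node not in broken,
    stop_requested=lambda: False,  # e.g. could check for a flag file
)
print(updated, reason)
```

Updating one node at a time keeps the blast radius small: at most one node is in an unknown state when the loop stops.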

My Thursday started with a totally different topic. We have a Thursday time slot of 15 minutes to present some interesting stuff. I showed my Kubernetes stack running on top of our OpenStack. 15 minutes are way too little. I could talk for hours about this topic. But anyway, it is a great way to show other teams topics to think about, even though I would have liked to see more faces at my presentation. I did a live demo of booting the stack, so there are no slides, but the stack is available on GitHub.

After this short trip, I went back to the cluster bootstrap and noticed that exactly those broken nodes have to be the controllers, because of the SSD provisioning. So we had to get the nodes out of the door. At least we had BMC by now, so no one had to visit the DC. Of course, no errors were shown on the serial console. Nothing but the progress bar starting over and over. Thankfully my DC fellow was able to connect to iLO and get some error messages. Somewhere during boot-up, the routes to the lab disappeared. There had been an experimental rollout on the lab ToR that broke the config of bird. A trivial fix and we were up again. Some of my tests on the installed nodes showed missing bits in the Consul configuration, as well as a wrong configuration of the FQDN. Trivial fixes for another fellow.

Back to the installation process. It turned out the very last step failed. Because Ubuntu changes the boot device to the install disk, there is a late command that resets the boot order to netboot. We always use netboot, to be able to change the grub config even if we are not able to connect to the nodes at some time. The netboot grub fetches the next desired boot target for the node. A tiny fix in a regexp that detects the network device did the trick. The nodes installed correctly and came up as expected.

This was at 14:30. I called it a week and left for the weekend. With major help from "the DC fellow", I was able to get the groundwork done for a running lab (installed with the same automation as production). The next steps are up to my colleagues anyway.

All in all a very productive week, considering "only" 30h :)

Today I was at the Amtsgericht Pankow/Weißensee

/galleries/amtsgericht-pankow-2016/IMG_20160211_122422.thumbnail.jpg

The reason was quite simple. My girlfriend received a notice about an inheritance from a deceased great-aunt. This great-aunt had no children of her own, so her nieces and nephews were entitled to the inheritance. These (7) disclaimed the inheritance, and so the next tier was contacted. Why was I there at all? If, as recommended, all of my girlfriend's cousins disclaim the inheritance as well, our son will be contacted. Since we both have custody, we took care of that for our son right away. A power of attorney is not sufficient; I had to appear in person.

Since I find the story around the inheritance so interesting, I will write about it later. This post, however, I would like to dedicate to the building in Parkstraße where the Amtsgericht Pankow/Weißensee (the district court) resides.

"Dedicate" may be a bit of an exaggeration. I simply find the building interesting, and its outward appearance was not deceiving. Apart from the offices, which were furnished exactly as I would have expected at a district court, we found high, beautifully decorated ceilings with great lamps, and interesting staircases.

… all pictures

/galleries/amtsgericht-pankow-2016/IMG_20160211_123726.thumbnail.jpg

What I did not know: apparently, as a customer of the district court, you count as "Publikum" (the public).

I also learned something about inheritance law. Disclaiming an inheritance costs a minimum fee of 30 euros; the fee actually due depends on the value of the estate. Up to an estate of 5000 euros, however, it stays at the mentioned 30 euros.

So inheriting is a small lottery. If you accept it and it consists of debts, you are in a mess. If you do not accept it, but it turns out the deceased had more than 5000 euros to bequeath, it may get expensive as well. Unfortunately, I forgot to ask what "the fee depends on the assets" really means. A percentage of the assets? Positive or negative assets? Who knows. I am too lazy to research that.

GNOME 3 color profile and screen brightness

Lesson learned: Next time I face a problem with some desktop software, the first thing I will do is run dbus-monitor.

I decided to switch my desktop environment from Cinnamon to GNOME 3. As a by-product, I created a new color profile with my spyder3. The old one was from 2014, and it was definitely time to create a new one. For this, GNOME ships gnome-color-manager. So I set my display to my "photo-editing" brightness and started measuring.

So far, everything worked fine. I only had to touch gnome-tweak-tool once, to set my favorite fonts.

I did everything in one session, without logging out or rebooting. During this time, I also created my shortcuts to switch the screen brightness between "working" and "photo-editing" mode. On the first try, they worked. The only issue: the brightness slider did not move to the correct position.

At this point I was done! I just wanted to compare the console fonts with those set in Cinnamon. So I logged out, switched the session to Cinnamon, took a screenshot, and switched back to GNOME.

On login it happened: GNOME set the screen brightness to 100%. I set the brightness manually using the brightness slider. Still the same problem. Searching the web did not really help. I was told to place a startup desktop file in ~/.config/autostart/, but this did not work very reliably. One thing I did try was killing gnome-settings-daemon, since in my experience it could be the cause. It turned out gnome-settings-daemon was indeed responsible. Next, I tried to find a matching gsettings key in dconf, without any success.

So I got the source code of gnome-settings-daemon and found the responsible lines of code within a few minutes. Walking up the possible call tree, I did not find an explicit call, but I discovered the D-Bus API of the power plugin. So I started dbus-monitor and pressed the brightness buttons to get an idea of how the API is used, mainly to be able to put a filter on dbus-monitor.

$ dbus-monitor --session "path=/org/gnome/SettingsDaemon/Power"

This only shows calls to the power thing. I filtered on path instead of interface, since this also shows property changes.

Killing gnome-settings-daemon showed a property change of brightness to 100%.

signal time=1453584221.013391 sender=:1.579 -> destination=(null destination) serial=214 path=/org/gnome/SettingsDaemon/Power; interface=org.freedesktop.DBus.Properties; member=PropertiesChanged
   string "org.gnome.SettingsDaemon.Power.Screen"
   array [
      dict entry(
         string "Brightness"
         variant             int32 100
      )
   ]

From this point on, I knew the call, and grepped the source code of gnome-settings-daemon again. And that was the missing piece. I found a D-Bus call in plugins/color/gsd-color-state.c … the color manager part of gnome-settings-daemon. There I discovered the following lines:

// if output is a laptop screen and the profile has a
// calibration brightness then set this new brightness
brightness_profile = cd_profile_get_metadata_item (profile,
                                                   CD_PROFILE_METADATA_SCREEN_BRIGHTNESS);
if (gnome_rr_output_is_builtin_display (output) &&
    brightness_profile != NULL) {
        // the percentage is stored in the profile metadata as
        // a string, not ideal, but it's all we have...
        brightness_percentage = atoi (brightness_profile);
        gcm_session_set_output_percentage (brightness_percentage);
}

The comment pointed me directly to the solution of my problem. I took a look into the icc profile, and voila, metadata said "Screen Brightness: 100".

I only had to remove this metadata from my profile (located in ~/.local/share/icc).

$ cd-fix-profile PROFILE.icc md-remove SCREEN_brightness

\o/

However. I am pretty sure, the screen was on the correct brightness level (39%) when I started to create the profile.

And again, I faced a small but annoying problem and was able to find the cause. And this just because GNOME is open source!

Above, I wrote about the issue of the "not moving brightness slider". As a by-product of the walk through gnome-settings-daemon, I was able to change my brightness switch command to use the D-Bus calls.

Setting the brightness to 9% (which is my working mode, most of the time):

$ dbus-send --session --type=method_call \
  --dest="org.gnome.SettingsDaemon.Power" \
  /org/gnome/SettingsDaemon/Power \
  org.freedesktop.DBus.Properties.Set \
  string:"org.gnome.SettingsDaemon.Power.Screen" \
  string:"Brightness" \
  variant:int32:9

Setting the brightness to 39% (for photo-editing):

$ dbus-send --session --type=method_call \
  --dest="org.gnome.SettingsDaemon.Power" \
  /org/gnome/SettingsDaemon/Power \
  org.freedesktop.DBus.Properties.Set \
  string:"org.gnome.SettingsDaemon.Power.Screen" \
  string:"Brightness" \
  variant:int32:39

Wohooo, and yet it moves :D
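The two dbus-send calls above differ only in the brightness value, so a shortcut script could build them from a single function. A small Python sketch, assuming nothing beyond the command line shown above; the function name is my own invention, and the example only prints the commands instead of executing them.

```python
import shlex


def brightness_command(percent):
    """Build the dbus-send argument list that sets the screen brightness."""
    if not 0 <= percent <= 100:
        raise ValueError("brightness must be between 0 and 100")
    return [
        "dbus-send",
        "--session",
        "--type=method_call",
        "--dest=org.gnome.SettingsDaemon.Power",
        "/org/gnome/SettingsDaemon/Power",
        "org.freedesktop.DBus.Properties.Set",
        "string:org.gnome.SettingsDaemon.Power.Screen",
        "string:Brightness",
        "variant:int32:{}".format(percent),
    ]


# Print the commands for both modes; pass the list to subprocess.run() to execute one.
for mode, percent in (("working", 9), ("photo-editing", 39)):
    print(mode, "->", " ".join(shlex.quote(arg) for arg in brightness_command(percent)))
```

Passing the arguments as a list avoids shell quoting issues, and the range check catches typos before they reach the settings daemon.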