Zum Hauptinhalt springen
Chris’ wirre Gedankenwelt

Dropping weekly format

Turns out. Like expected I am not able to write one post per week, and I didn't manage to pick up after two weeks of vacation. When reading the weekly posts again, they seem kind of weird. Which is not bad, but nut worse the time. So I call the my first attempt to write a series to be failed.

Anyway. Natually, not each week is equally interresting. I will try to pick up interresting topic here, and try to go a bit into detail.

Autor
Chris Glaubitz
Configuring, coding, debugging computers for a living. Riding bikes for fun.

Weekly 2016/37

No such meeting Monday. This week, our all-hands-meeting was moved to Tuesday. Because most people in my team were on vacation, we did a short daily instead of a 1 hour weekly. More time to do real work :) I returned to the cluster wide unattended update. I worked at least half of the day on this topic. Another look into the networking setup of our lab and some more documenation did the rest of it.

Back in office on Wednesday after a day of sickness. From Tuesday on, the production cloud was kind of slow. In terms of: Creating a bare network took 6 seconds in average, instead of 2 seconds in the lab. Digging into the problem showed a problem in filling up one rabbitmq queue. Usually this should be consumed by a telemetry solution, which is not deployed correctly so far. I removed all the messages in the queue and everything went back to normal again.

On Thursday, I had not too much time, to digg into unattended updates again. Just an hour in the morning. This week is strange, with many little disturbances, and I literally forgot all the small stuff. After lunch we did a complete wipe and reinstall of our lab. I left after 2 hours of installation for 10 hardware nodes. Does not seem to be pretty fast, but in fact this is ok. At least we had to do the process on our first 3 nodes (those form our consul cluster, now) twice, because of an expired pgp key. My colleagues Continues with running Ansible on the nodes, without huge problems.

Although, the installation on the day before was pretty ok overall, I cleaned up some stuff on Friday. With some minor changes, the installation is smoother than Thursday. I did not run an entire installation. But shut off the consul cluster to simulate the "nothing is there"-situation. This worked pretty well. I did a second reinstallation, but with consul turned on. This is the usual scenario if we grow our existing cluster by new machines. I finalized the installation by applying Ansible. Because the node, I selected, was a gateway of our SDN, there was a little problem with a new UUID for this node, but easily to resolve by hand. This sceanrio is not usual, so I just documented this issue and did not automate it away. Furthermore, I extended the existing documentation by things, I saw yesterday, but not yet in the documentation. Somewhen in between, we had a call with our storage vendor about the next steps to happen. The afternoon I prepared a update of consul 0.6.4 to 0.7. Its going to be bad:

"Consul's default Raft timing is now set to work more reliably on lower-performance servers, which allows small clusters to use lower cost compute at the expense of reduced performance for failed leader detection and leader elections. You will need to configure Consul to get the same performance as before. See the new Server Performance guide for more details. [GH-2303]"

In other words: It did not work well on raspberry pi, we fixed this, but destroyed it for everyone else. And for sure, 0.6er consul refuse to start because of an unkown key, if I already would place the new paremeter to all config files :(

Oh. Almost forgotten. Last week I wrote about my plans of cleaning up my vim modules. And with this, a switch to neovim. Turned out. Cleaning up stuff went well. The neovim experiment not. It felt ok in most points, but not better then good old vim. And one thing let me scream several times, and go back vim. In some/most cases a hit on ESC took a while to complete. That long, that I typed some more characters before leaving insert mode. An I'm escaping a lot!

Autor
Chris Glaubitz
Configuring, coding, debugging computers for a living. Riding bikes for fun.

Weekly 2016/36

Starting with meeting Monday. The first half of the day was gone for this. After lunch, I had to add another authentication mechanism to our MachineDB. In fact this is a Flask-app which uses ldap as an authentication backend. We have to put this part of the software into a place without any access to our ldap. Since we use postgres as the DB backend, I created another authentication backend, using postgres. In the afternoon I helped a fellow with some scripts to setup floating ip networking to our lab cluster.

On Tuesday, I continued my work on the authentication backend and got it merge ready. More debugging was done at the floating ip creation stuff. Because the script is more complex than I hoped, I created the networks by hand and was able to configure our integration test to run against the lab.

The next day was a short day for me. I figured out some problems detected by the tempest run. Some strange errors happen due to the ordering of openstack endpoints. The config parameter v3_endpoint_type has been ignored, and adminURL was used by accident. By accident, since it really depended on the sorting of endpoints. I fixed this by updating the id of the endpoint in the keystone database. Beside this, I did some daywork.

Thursday, was kind of boring. First of all, it was a short day again. So primarily I did some daywork, and prepared a backup-presentation for the Berlin Ops Summit on Friday. Turned out, we didn't need it.

Friday was Conference day. I liked the Talk of Jeff Sussna. Most other talks were ok, but not new to me. The most entertaining talk was UI engineering by Dennis Reimann and Jan Persiel. The closing party was great :)

Btw: I am on my way to re-create my vimrc in terms of throwing away modules and settings, I don't use any more. I take the chance to switch to neovim in this step. Lets see how far I come with this.

Autor
Chris Glaubitz
Configuring, coding, debugging computers for a living. Riding bikes for fun.

Linux Kernel Features…

… mit dessen Hilfe Container-Runtimes gebaut werden. Keine Garantie für die Vollständigkeit dieser Liste.

chroot #

z.B. Alpine Linux in chroot

Um den gestarteten Prozessen ein anderes / vorzugeben. Noch nicht mal ein Linux-Feature und wurde bereits 1979 in Unix implementiert.

# Ich habe da mal was vorbereitet.
export mirror=http://dl-3.alpinelinux.org/alpine/
export chroot_dir=/root/alpine
export version=2.6.7-r0
wget ${mirror}/latest-stable/main/x86_64/apk-tools-static-${version}.apk
tar -xzf apk-tools-static-*.apk
./sbin/apk.static -X ${mirror}/latest-stable/main -U --allow-untrusted --root ${chroot_dir} --initdb add alpine-base
mknod -m 666 ${chroot_dir}/dev/full c 1 7
mknod -m 666 ${chroot_dir}/dev/ptmx c 5 2
mknod -m 644 ${chroot_dir}/dev/random c 1 8
mknod -m 644 ${chroot_dir}/dev/urandom c 1 9
mknod -m 666 ${chroot_dir}/dev/zero c 1 5
mknod -m 666 ${chroot_dir}/dev/tty c 5 0
cp /etc/resolv.conf ${chroot_dir}/etc/
mkdir -p ${chroot_dir}/root
mkdir -p ${chroot_dir}/etc/apk
echo "${mirror}/${branch}/main" > ${chroot_dir}/etc/apk/repositories

mount -t proc none ${chroot_dir}/proc
mount -o bind /sys ${chroot_dir}/sys

chroot ${chroot_dir} /bin/sh -l

Namespaces #

Definiert was Prozesse sehen und machen können.

Mount #

unshare -m /bin/bash
mount -t tmpfs tmpfs /tmp

touch /tmp/blah
ls -l /tmp ### nur blah


Anderes Terminal
ls -l /tmp ### orig-stuff aber kein blah

PID #

unshare --pid /bin/bash
ps -ef ## segmentation fault, da fehlt wohl noch was

unshare --pid --mount-proc -f /bin/bash
ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 08:26 pts/1    00:00:00 /bin/bash
root         4     1  0 08:26 pts/1    00:00:00 ps -ef

User #

Wichtig, als != root starten

id -u ## 500
unshare --user --map-root-user /bin/bash
id -u ## 0

Trotzdem hat man keine root-Berechtigung!

mount -o tmpfs tmpfs /tmp  ## permission denied

Network #

man ip-netns

ip netns list
ip netns add testing
ip netns exec testing ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1

ip link add veth0 type veth peer name veth1
ip link set veth1 netns testing
ip a ## kein veth1

ip addr add 192.168.0.1/32 dev veth0
ip route add 192.168.0.0/24 dev veth0
ip netns exec testing ip link set veth1 up
ip netns exec testing ip addr add 192.168.0.2/32 dev veth1
ip netns exec testing ip route add 192.168.0.0/24 dev veth1

ping 192.168.0.2 ## works

ip netns exec testing ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1
7: veth1@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
   inet 192.168.0.2/32 scope global veth1

Capabilities #

Gezielt Fähigkeiten von Prozessen beschränken. Auch wenn der Prozess als effective user 0 (root) läuft.

man capabilities

mount -t tmpfs tmpfs /mnt
umount /mnt
capsh --drop=CAP_SYS_ADMIN --
id -u  ## 0
mount -t tmpfs tmpfs /mnt  ## permission denied

cgroups #

Resourcen von Prozessen einschränken. Beispiel Memory.

cd /sys/fs/cgroup/memory
mkdir testing
cd testing
ls -l  # zeigt u.a. tasks

# 2. Console auf und dort die PID ermitteln
ls -l  # funktioniert
echo $$

# Zurück in Console 1 und bash der Console 2 in die cgroup testing aufnehmen
echo PID > tasks
cat tasks
PID

echo 1 > memory.limit_in_bytes

# Console 2
ls -l
Killed

Wie man sieht hängen alle Child Prozesse auch in dieser cgroup!

Weitere Resourenlimits:

  • CPU
  • blkio
  • devices
  • pids (Process numbers)

Siehe die Kernel Dokumentation zu cgroups.

seccomp #

secure computing mode. Gezieltes filtern von system calls. Leider habe ich hier kein einfaches Beispiel.

AppArmor / SELinux / SMACK #

Mandatory Access Control im Linux Kernel. Es gibt verschiedene Implementierungen. Ubuntu verwendet z.B. AppArmor.

Auf Ubuntu

vi /etc/apparmor.d/usr.sbin.rsyslogd
...
/usr/sbin/rsyslogd {
  # rsyslog configuration
  /etc/rsyslog.conf r,
  /etc/rsyslog.d/ r,
  /etc/rsyslog.d/** r,
  /{,var/}run/rsyslogd.pid rwk,
  /var/spool/rsyslog/ r,
  /var/spool/rsyslog/** rwk
 ...
}

AppArmor wird unter Ubuntu auch dazu verwendet den Zugriff von VMs-Prozessen einzuschränken. Damit wird der Schaden potentielle Schaden bei einem Ausbruch aus der VM verringert.

Zum Schluss #

Alle der genannten Features existieren schon einige Jahre im Linux-Kernel. Vorläufer davon existieren in Form von OpenVZ / Parallels Virtuozzo bereits seit ca. 2000. Allerdings außerhalb des Mainline Kernels gepflegt. Die Features können nicht nur dafür benutzt werden um Container Runtimes wie Docker oder CoreOS rkt zu erstellen, man kann auch "regulär im System" laufende Anwendungen damit einpferchen. Auch systemd bietet Flags, die einige der angesprochenen Techniken verweden um Prozesse besser voneinander zu trennen. Siehe dazu Security Features in systemd.

Autor
Chris Glaubitz
Configuring, coding, debugging computers for a living. Riding bikes for fun.

OpenStack-Tempest No nw_info cache associated with instance

When using Tempest to test an OpenStack-Cloud, it is easy to run into this error

tempest.lib.exceptions.BadRequest: Bad request
Details: {u'message': u'No nw_info cache associated with instance', u'code': 400}

In fact, networks are not configured proper. Tempest needs a full configured network, connected to the external network, to get some things working. This must be available to all tempest projects/tenante, hence it is enough to create it for each, but use the same name. I created a heat stack for this, and called this network needed-for-tempest. .. code:

heat_template_version: 2014-10-16

description: >
  Setting up a private Network with an snat-router

parameters:
  public_net_id:
    type: string
    description: ID of public network for which floating IP addresses will be allocated

resources:

  ### Creates a Neutron network and subnet
  network:
    type: OS::Neutron::Net
    properties:
      name: needed-for-tempest

  subnet:
    type: OS::Neutron::Subnet
    properties:
      name: needed-for-tempest1
      dns_nameservers:
        - 37.123.105.116
        - 37.123.105.117
      network_id: { get_resource: network }
      ip_version: 4
      gateway_ip: 10.0.0.1
      cidr: 10.0.0.0/24
      allocation_pools:
        - { start: 10.0.0.230, end: 10.0.0.240 }

  router:
    type: OS::Neutron::Router
    properties:
      external_gateway_info: { network: { get_param: public_net_id } }

  router_subnet_connect:
    type: OS::Neutron::RouterInterface
    depends_on: [ subnet, router ]
    properties: 
      router_id: { get_resource: router }
      subnet: { get_resource: subnet }

Of cource, tempest.conf has to be edited as well:

[network]
fixed_network_name = needed-for-tempest
Autor
Chris Glaubitz
Configuring, coding, debugging computers for a living. Riding bikes for fun.

Weekly 2016/35

This is the third approach to write a weekly, and the first one, I totally failed. I didn't make any notes of the week :/

But let's start anyway. This week was covered by a re-deployment of our lab. As expected several things went wrong, and we had to fix it iteratively. Nothing special so far. This couldn't be sticked to one special day. Like describe in the last weeks, I digged around in our network setup. This week, I took my time and documented it down into our IPAM nipap. That includes all the transfer nets and all public IPs assigned to us. In case of service IPs, I documented the "function" as well. We are now able to collect new IPs directly from this pools, without having to grep config files, and hope you catched everything.

On Wednesday, one day with fixing lab-stuff, I got the request to benchmark our S3. Which I started on my short Thursday. I decided to write a script in Go to parallelize the work. This would be perfectly doable in python as well, but I wanted to use something new. First results were collected on Friday morning. Over and over again in this week, I helped a fellow to configure the forward proxy corretly.

Beside of the S3-work, I was in our change commission, to discuss what things we can do on our own, and what should be discussed again in this commission. Lucky us, we can do most of the things, I wanted to.

Friday, started with collecting S3 benchmarks. Then an outage of our cinder-volume occured, which I helped to fix. For sure, since one of my changes broke the configuration :( Fortunately, it resulted in a loss of control and running instances were not affected. I wrote the post-mortem about this outage, and filed some tickets to fix missing bits in our monitoring. After this, I fixed an issue with our StorageManager. It creates new logical volumes on cluster bootstrap, but in our configurations they are very likely to be at the same position like before. So xfs restored the filesystem instead of creating a new one. I now wipefs the new created lv.

Currently I am graphing the S3-results with jupyter notebook, which is really cool. Give it a try. I installed mathplotlib and numpy as well. The performance in average is pretty ok, both up and download. But the long tail does really hurt. Some of the requests took around 8 seconds, while the average is around 0.08 seconds.

What did I test? I uploaded around 12500 files from a pool of 25 unique images. Using 5 uploads in parallel, and parallel to this, 10 downloads of random images. Both over and over again. I could even do more in parallel. Anyway. I we got some tickets about slow S3-requests. But like all tickets without any steps to recreate. The result we see here, is exactly what I wanted to get out of this tests. In short-term we are going to update our S3-Layer to the latest version. When the results are still the same, we can file a bug against our vender. Of course with steps to recreate ;) Maybe the latest version will be in lab somewhen next week.

Autor
Chris Glaubitz
Configuring, coding, debugging computers for a living. Riding bikes for fun.

Weekly 2016/34

Starting this week with minor devel tasks. Two weeks before, I implemented a feature in our machine database, to be able to talk to different consul datacenters. There were some review comments, I had to fix. This enables us to talk to reach the consul cluster of our lab, even though the machine database is located in a different cluster.

My colleagues started to roll out our ansible roles to the lab cluster. Since we changed lots of stuff to support multiple clusters, we expect the rollout to be painful. And it did not last long to run into the first error. The storage manager crashes, due to a wrong usage of [pathlib.PosixPath]{.title-ref}. I fixed those issues, and built debian packages as well. Whereas building debs was the most painful part of this task.

At some point in the rollout, we were unable to reach all of the lab hosts, but only from two hosts, including one ouf our cluster workstations. The rest was working fine. I discovered "strange" routes on the ToR of lab, and was able to track down the issue with the help of our NetOps. I learned to ask bird for learned and exported routes. This quickly showed, two hosts in the lab exported the same ip address like the production cluster. With this information, a fellow reconfigured the canonical ip addresses in his host-vars and re-rolled ansible. After bringing the wrong (duplicate) address down, everything worked fine.

Tuesday, we continued our work on the lab plattform. After some minor changes in our ansible roles, we discovered some inconsistency in ASN numbering. I reverse-engineered the remaining parts of the bgp configurations of both clusters and introduced a new ASN policy. This included a change to all the ASN in our new lab. Which means a temporarily loss of control of all the nodes in the lab. Fortunately the configuration was successful and all nodes re-announced their IPs. The configuration change was done with our NetOps again, because we need some changes on devices, we don't have access to. After all the changes the connection between production and lab were established again, and the loadbalancers in the lab were able to annonce public ips. We have an accessable lab now :)

At some time on Tuesday, I tried to get metrics from Consul. Turned out, Consul is only able to stream changes to something like statsd.

Bring all the stuff together. One of my colleagues mainly changed many of our ansible roles, to make the lab possible. In fact, the rollout of the last days was done by these changes. He started a while ago, and so master diverged a lot, resulting in a hugh, unmaintainable merge request. Since the cluster rollout worked fine, we split the merge request into smaller ones, that we merge into master, without affecting production. Just one large refactoring of the loadbalancer role remaining. After getting all the v4 bgp-sessions alive during the last days, today I fixed issues with the v6 configuration an brought up all the sessions as well. This was the last bit for a fully working underlay-network :) The afternoon I documented things for baremetal deployment and the network configuration.

On Thursday we did some minor fixes to the loadbalancer role and finally merged it into master. Another feature of yet another colleagues became merge ready, and we brought this to master as well. Since it is not yet in any playbook, this will not affect any running services. After those merges, we applied the entire new master on our lab. The next step would be to re-deploy the lab from master. Since another team would like to participate, we scheduled this into next week. I tried to check some network settings with our SDN Midonet, as a prerequisite for running the integration test-suite, and discovered missing configuration of external IPs and BGP sessions. It was barely automated with a shell script, and it was available in our repo, but never called automatically on bootstrap. However, this script does not perform any checks whether the objects it tries to create, are already available. It also works only for our production cloud, even if it should never ever be called against it, due to the lack of checks. Anyway, it is a nice task for our new team member, to rewrite most of the script as an ansible role, that could run regulary. Since we need this script updated for the lab deployment anyway, we just wait for this rewrite.

Friday, and this week one of my short days, and the lab works good enough to test some things. First I helped out Mathias to setup the Loadbalancers with the config of production again. Then we migrated to the new loadbalancer role, without any problems. Yeahie. The rest of the day, I tried a fix to the configuration of all OpenStack-APIs, I did a few weeks ago. Due to a lack of a "not production" environment, I was not able to test it so far. Today I did, and the first shot worked well for all services except the both relevant ones nova and neutron. The latest version of python-openstacksdk does not work against our neutron. We use the traditional OpenStack service ports, but using https. The services http-links in api-json-results. If the client follows those links, it talk http to the https endpoint resulting in a bad request (400). The solution is to configure oslo_middleware.HTTPProxyToWSGI for each service. With this, all the returned json-links are https.

Before:

> curl https://api.cbk.cloud.syseleven.net:9696/ | jq '.'
{
  "versions": [
    {
      "status": "CURRENT",
      "id": "v2.0",
      "links": [
        {
          "href": "http://api.cbk.cloud.syseleven.net:9696/v2.0",  <<===
          "rel": "self"
        }
      ]
    }
  ]
}

After:

> curl https://api.zbk.cloud.syseleven.net:9696/ | jq '.'
{
  "versions": [
    {
      "status": "CURRENT",
      "id": "v2.0",
      "links": [
        {
          "href": "https://api.zbk.cloud.syseleven.net:9696/v2.0", <<==
          "rel": "self"
        }
      ]
    }
  ]
}

The first shot did not work for nova and neutron, because the versions-endpoints had a different pipeline configuration, than all the other stuff. I just needed to add HTTPProxyToWSGI to those endpoints and everything worked fine. Because some oslo_middleware developers decided to disable HTTPProxyToWSGI by default for a strange reason. This flag is not yet available in OpenStack Mitaka, but the default will break our deployment on an update to Newton. So I set enable_proxy_headers_parsing = True in all our OpenStack Services, and let it be ignored in the current deployment.

Autor
Chris Glaubitz
Configuring, coding, debugging computers for a living. Riding bikes for fun.

Weekly 2016/33

This is the first attempt to write about my week in tech. More about the things I did, and less about things I read about. First of all, this is for me. So times, I tend to be a bit frustrated about being caught in a treadmill. So I really don't know where this is going.

Let's step into this week. It was going to be a short week. I'm still on part time, and working 30h a week. This week from Monday to Thursday. The main task, and target as well, was bootstrapping the hardware nodes for our second OpenStack Cloud, which is going to be a lab. Means, this is the second time to bootstrap with our automation, and the first time doing it from zero to working in without steps in between.

For sure, the hardware is not exactly the same like the production cloud, but only differs some nuances. The most important difference for the entire cluster is the fact, that we dual use the boot infrastructure (like pxe boot) of the production cloud. However, the lab cloud must get its own network, but also reach our boot infrastructure and must be reachable from the cluster workstations of the production cloud. For sure, the connections were set up by our netops.

Slight insight into the networking. The ToR-switches of the production cloud are connected to a qfabric. The ToR of lab as well. Between those two, a mpls l3vpn is configured. Even the racks stand side by side, we decided to use this configuration. Our plan is to spread several clusters over the data center. So this is a quite nice test case.

The nodes were started end of the week before, but did not show up in the machine database. I even didn't expect this working the first time. Unfortunately, due to the design, I was not able to reach any bmc of the nodes. The nodes should be configured in a way to boot from network into a live system. This does a quick inventory of the machine, and submits it into the database. We as the admins, have to configure the rack position. The live system picks this configuration periodically and does some system configuration like setting a fixed bmc ip address and password. We are also able, to set the system disk in our machine database. After this two configuration bits, rack position and system disk, the node is ready to boot into the ubuntu installer and configures itself with a preseed. Again, this was some decision made somewhen. I would prefer Images at this point. But however.

However. None of the 10 machines showed up, and bmc did not work. We send out one of our colleages to the datacenter. While he was on his way, I connected to the ToR of the lab, because this was reachable, and did some basic tests. Because we use anycast for dns, ntp and some more stuff, the first guess was no working anycast. Turns out anycast worked. But the throughput was at least one problem. I only got 1.4MByte/s from ToR to ToR. Which is really bad, they are connected with 40GBit/s. My DC-fellow check network connections, and correct patching. Everything looked good so far. Since we started late, due to the meeting-monday, we called it a day.

On Tuesday, i tried to get our netops, to get the problem solved. Most of them were busy as well, but with little chatting, and further debugging, I detected a drop of the mtu from 9000 to 1500 somewhere in the qfabric. With the help of the netops, we checked the links. All were configured properly. Great!? Since both ToRs are configured with a mtu of 9000, and the bad throughput, I expect pmtu discovery not to work. Which is even worse if 10 nodes should do netboot/install stuff almost at the same time. Since we still lacked bmc, my colleage again visited the DC. He shut off all the nodes and brought back the first one. This barely worked, and it was able to download the live image.

I was stucked at this point. I had to wait for the netops who initially configured the l3vpn to help out. So in the meantime, I helped to submit a new version of the internal storage manager (that configures disks to be system or lvm or cluster fs), and get it out to the production cluster. This removed a blocker of a aborted rollout of the week before. At my end of the day, we had 4 of 10 nodes running. Before I left, I could grab the netops and he promised to have a look at this issue. During my spare time, my fellow in the DC discovered, that six nodes didn't even boot from the correct network device. For sure he was not able to fix this because we need a firmware update for the cards to get it working.

Wednesday. Like my netops promised, the mtu issue was fixed. Wohooo. Turned out, there was an overall-setting that overrides the concrete port configuration m( Because of the mtu struggles, I didn't debug the ToR of the lab enough. To recap, we still lack bmc. In the morning I discovered a missing link on ToR to bmc-switch. Lukily for me, my fellow had to visit the DC again, to update the firmwares. So he armed himself with some optics.

In this field, there was nothing to do for me. I waited for the updated nodes to submit their inventory to our database, and a working bmc. So I continued the work on a solution to run a cluster wide unattended update, but stop if something strange happen, or a human hit the red button. In the evening, my fellow replaced the fiber and got another 3 nodes running. The remaining nodes ended in an endless loop in the installer. But I thought, this is pretty good. We could define those nodes to be compute nodes, and deploy OpenStack with only one compute.

My Thursday starts with a totally different topic. We have a Thursday time slot of 15 minutes to present some interresting stuff. I showed my kubernetes stack running onto of our OpenStack. 15 minutes are way to less. I could talk hours about this topic. But anyway. It is a great way to show other teams topics to think about. Even I would like to see more faces in my presentation. I did a live demo of booting the stack, so there are no slides, but the stack is available on github.

After this short trip, I went back to the cluster bootstrap an noticed, that exactly those broken nodes have to be the controllers, because of the ssd provisioning. So we had to get the nodes out of the door. However, we have bmc, so at least no one had to visit the DC. For sure, no errors were shown on the serial console. Nothing but the over and over starting progress bar. Thankfully my DC-fellow was able to connect to ilo and get some error messages. Somewhere in the bootup-time, the routes to the lab disappeared. There was a experimental rollout on the lab-ToR, that broke the config of bird. Trivial fix and we were up again. Some of my tests of the installed nodes showed missing bits in the consul configuration, as well as wrong configuration of the fqdn. Trivial fixes for another fellow.

Back to the insallation process. Turned out. The very last step failed. Because ubuntu changes the boot device to install-disk, there is some late command to reset the boot order netboot. We always use netboot, to be able to change grub config even if we are not able to connect to the nodes at some time. The netboot grub fetches the next desired boot target for the node. A tiny fix in a regexp to detect the network device did the trick. The nodes installed correctly and came up like expected.

This was at 1430. I called it a week and left for weekend. With major help of "the DC-fellow", I was able to get done the ground work for a running lab (installed with the same automation like production). The next steps are up to my colleages anyway.

All in all a very productive week, in face of "only" 30h :)

Autor
Chris Glaubitz
Configuring, coding, debugging computers for a living. Riding bikes for fun.

Heute war ich im Amtsgericht Pankow/Weißensee

Der Grund war ein ganz einfacher. Meine Freundin hat einen Bescheid über die Erbschaft von einer verstorbenen Großtante bekommen. Diese Großtante hatte keine eigenen Kinder, daher hatten ihre Nichten und Neffen Anspruch auf das Erbe. Diese (7) haben das Erbe ausgeschlagen und daher ist die nächste Riege angeschrieben worden. Warum war ich eigentlich dabei? Wenn, der Empfehlung nach, alle Cousins und Cousinen meiner Freundin das Erbe ebenfalls ausschlagen, wird unser Sohn angeschrieben. Da wir beide sorgeberechtigt sind, haben wir das für unseren Sohn gleich mit erledigt. Eine Vollmacht genügt nicht, ich musste persönlich vorstellig werden.

Da ich die Geschichte um das Erbe so interessant finde, werde ich darüber noch schreiben. Dies hier möchte ich aber dem Gebäude widmen, in dem das Amtsgericht Pankow/Weißensee in der Parkstraße residiert.

"Widmen" ist vielleicht etwas übertrieben. Ich finde das Gebäude einfach interessant und der Schein von außen hat nicht getrügt. Mal abgesehen von den Büros, die waren genau so eingerichtet wie ich es bei einem Amtsgericht erwartet hätte, fanden wir hohe schön verzierte Decken mit tollen Lampen und interessante Treppenhäuser vor.

… alle Bilder

Was ich nicht wusste: Anscheinend gilt man als Kunde des Amtsgerichtes als "Publikum".

Außerdem habe ich etwas über das Erbrecht gelernt. Das Ausschlagen eines Erbes kostet eine Mindestgebühr von 30 Euro, die wirklich zu bezahlende Gebühr richtet sich nach dem Vermögen des Erbes. Bis zu einem Erbe von 5000 Euro bleibt es aber bei den genannten 30 Euro.

Nun ist das mit dem Erben also eine kleine Lotterie. Nimmt man es an, und es besteht aus Schulden, hat man den Salat. Nimmt man es nicht an, es stellt sich aber heraus, dass die verstorbene Person mehr als 5000 Euro zu vererben hätte, wird es unter Umständen auch teuer. Ich habe leider vergessen zu Fragen was "Die Gebühr richtet sich nach dem Vermögen" in echt bedeutet. Prozent des Vermögens? Positives oder negatives Vermögen? Wer weiß. Ich bin zu faul das zu recherchieren.

<div style="clear: both" />

Autor
Chris Glaubitz
Configuring, coding, debugging computers for a living. Riding bikes for fun.

GNOME 3 color profile and screen brightness

Lesson learned: Next time I when face a problem with some desktop software, first I will do is dbus-monitor.

I decided to switch my current desktop environment from Cinnamon to GNOME 3. As a by product, I created a new color profile with my spyder3. The old one was from 2014, and it was definitly the time to create a new one. Therefor, GNOME shipps gnome-color-manager. So I set my display to my "photo-editing"-brightness and started measuring.

So far, everything worked fine. I even only had to touch gnome-tweak-tool once to set my favorite fonts.

I did everything in one session, without logging out or rebooting. During this time, I also created my shortcuts to switch the screen brightness between "working"- and "photo-editing"-mode. In the first shot, they worked. The only issue: The brightness slider did not move to the correct position.

At this point I was done! Just wanted to compare the console fonts with those, set in Cinnamon. So I logged out, switched the session to Cinnamon, took a screen shot, and switched back to GNOME.

On login it happened. GNOME set the screen brightness to 100%. I I set the brightness manually using the brigness slider. Still the same problem. Searching the web, did not really help. I was told to place a startup desktop-file in ~/.config/autostart/. But this did not work very reliable. One thing I did. Killing the gnome-settings-daemon, since in my excperience, this could be the cause. Turns out. gnome-settings-daemon was responsible. Next, I tried to find some gsettings in dconf, without any success.

So I got the source code of gnome-settings-daemon, found the responsible lines of code within a few minutes. During walking up the possible call tree, I did not find an explicit call. But I discovered the dbus api of the power-plugin. However. I started dbus-monitor, and pressed the brightness buttons, to get an idea who the api is used. Mainly to be able to put a filter on dbus-monitor.

$ dbus-monitor --session "path=/org/gnome/SettingsDaemon/Power"

This only shows calls to the power-thing. I used path in favor of interface, since this also shows property changes.

Killing gnome-settings-daemon showed a property change of brightness to 100%.

signal time=1453584221.013391 sender=:1.579 -> destination=(null destination) serial=214 path=/org/gnome/SettingsDaemon/Power; interface=org.freedesktop.DBus.Properties; member=PropertiesChanged
   string "org.gnome.SettingsDaemon.Power.Screen"
   array [
      dict entry(
         string "Brightness"
         variant             int32 100
      )
   ]

Form this on, I knew the call, and grepped again the source code of gnome-settings-daemon. And that was the missing piece. I found a dbus call in plugins/color/gsd-color-state.c … the color manager part of gnome-settings-daemon. However. I discovered the following rows:

// if output is a laptop screen and the profile has a
// calibration brightness then set this new brightness
brightness_profile = cd_profile_get_metadata_item (profile,
                                                   CD_PROFILE_METADATA_SCREEN_BRIGHTNESS);
if (gnome_rr_output_is_builtin_display (output) &&
    brightness_profile != NULL) {
        // the percentage is stored in the profile metadata as
        // a string, not ideal, but it's all we have...
        brightness_percentage = atoi (brightness_profile);
        gcm_session_set_output_percentage (brightness_percentage);
}

The comment pointed me directly to the solution of my problem. I took a look into the icc profile, and voila, metadata said "Screen Brightness: 100".

I only had to remove this metadata of my profile (located in ~/.local/share/icc).

$ cd-fix-profile PROFILE.icc md-remove SCREEN_brightness

\o/

However. I am pretty sure, the screen was on the correct brightness level (39%) when I started to create the profile.

And again. I faced a little, but annoying, problem, and was able find the cause. And this, just because GNOME is open source!

Above, I wrote about the issue of the "not moving brightness slider". As a by-product of the walk through gnome-settings-daemon, I was able to change my brightness-switch-command, to use the dbus calls.

Setting the brigness to 9% (which is my working-mode, most of the time):

$ dbus-send --session --type=method_call \
  --dest="org.gnome.SettingsDaemon.Power" \
  /org/gnome/SettingsDaemon/Power \
  org.freedesktop.DBus.Properties.Set \
  string:"org.gnome.SettingsDaemon.Power.Screen" \
  string:"Brightness" \
  variant:int32:9

Setting the brightness to 39% (for photo-editing):

$ dbus-send --session --type=method_call \
  --dest="org.gnome.SettingsDaemon.Power" \
  /org/gnome/SettingsDaemon/Power \
  org.freedesktop.DBus.Properties.Set \
  string:"org.gnome.SettingsDaemon.Power.Screen" \
  string:"Brightness" \
  variant:int32:39

Wohooo, and yet it moves :D

Autor
Chris Glaubitz
Configuring, coding, debugging computers for a living. Riding bikes for fun.