Weekly 2016/34

Starting this week with minor devel tasks. Two weeks before, I implemented a feature in our machine database, to be able to talk to different consul datacenters. There were some review comments, I had to fix. This enables us to talk to reach the consul cluster of our lab, even though the machine database is located in a different cluster.

My colleagues started to roll out our ansible roles to the lab cluster. Since we changed lots of stuff to support multiple clusters, we expect the rollout to be painful. And it did not last long to run into the first error. The storage manager crashes, due to a wrong usage of pathlib.PosixPath. I fixed those issues, and built debian packages as well. Whereas building debs was the most painful part of this task.

At some point in the rollout, we were unable to reach all of the lab hosts, but only from two hosts, including one ouf our cluster workstations. The rest was working fine. I discovered "strange" routes on the ToR of lab, and was able to track down the issue with the help of our NetOps. I learned to ask bird for learned and exported routes. This quickly showed, two hosts in the lab exported the same ip address like the production cluster. With this information, a fellow reconfigured the canonical ip addresses in his host-vars and re-rolled ansible. After bringing the wrong (duplicate) address down, everything worked fine.

Tuesday, we continued our work on the lab plattform. After some minor changes in our ansible roles, we discovered some inconsistency in ASN numbering. I reverse-engineered the remaining parts of the bgp configurations of both clusters and introduced a new ASN policy. This included a change to all the ASN in our new lab. Which means a temporarily loss of control of all the nodes in the lab. Fortunately the configuration was successful and all nodes re-announced their IPs. The configuration change was done with our NetOps again, because we need some changes on devices, we don't have access to. After all the changes the connection between production and lab were established again, and the loadbalancers in the lab were able to annonce public ips. We have an accessable lab now :)

At some time on Tuesday, I tried to get metrics from Consul. Turned out, Consul is only able to stream changes to something like statsd.

Bring all the stuff together. One of my colleagues mainly changed many of our ansible roles, to make the lab possible. In fact, the rollout of the last days was done by these changes. He started a while ago, and so master diverged a lot, resulting in a hugh, unmaintainable merge request. Since the cluster rollout worked fine, we split the merge request into smaller ones, that we merge into master, without affecting production. Just one large refactoring of the loadbalancer role remaining. After getting all the v4 bgp-sessions alive during the last days, today I fixed issues with the v6 configuration an brought up all the sessions as well. This was the last bit for a fully working underlay-network :) The afternoon I documented things for baremetal deployment and the network configuration.

On Thursday we did some minor fixes to the loadbalancer role and finally merged it into master. Another feature of yet another colleagues became merge ready, and we brought this to master as well. Since it is not yet in any playbook, this will not affect any running services. After those merges, we applied the entire new master on our lab. The next step would be to re-deploy the lab from master. Since another team would like to participate, we scheduled this into next week. I tried to check some network settings with our SDN Midonet, as a prerequisite for running the integration test-suite, and discovered missing configuration of external IPs and BGP sessions. It was barely automated with a shell script, and it was available in our repo, but never called automatically on bootstrap. However, this script does not perform any checks whether the objects it tries to create, are already available. It also works only for our production cloud, even if it should never ever be called against it, due to the lack of checks. Anyway, it is a nice task for our new team member, to rewrite most of the script as an ansible role, that could run regulary. Since we need this script updated for the lab deployment anyway, we just wait for this rewrite.

Friday, and this week one of my short days, and the lab works good enough to test some things. First I helped out Mathias to setup the Loadbalancers with the config of production again. Then we migrated to the new loadbalancer role, without any problems. Yeahie. The rest of the day, I tried a fix to the configuration of all OpenStack-APIs, I did a few weeks ago. Due to a lack of a "not production" environment, I was not able to test it so far. Today I did, and the first shot worked well for all services except the both relevant ones nova and neutron. The latest version of python-openstacksdk does not work against our neutron. We use the traditional OpenStack service ports, but using https. The services http-links in api-json-results. If the client follows those links, it talk http to the https endpoint resulting in a bad request (400). The solution is to configure oslo_middleware.HTTPProxyToWSGI for each service. With this, all the returned json-links are https.

Before:

> curl https://api.cbk.cloud.syseleven.net:9696/ | jq '.'
{
  "versions": [
    {
      "status": "CURRENT",
      "id": "v2.0",
      "links": [
        {
          "href": "http://api.cbk.cloud.syseleven.net:9696/v2.0",  <<===
          "rel": "self"
        }
      ]
    }
  ]
}

After:

> curl https://api.zbk.cloud.syseleven.net:9696/ | jq '.'
{
  "versions": [
    {
      "status": "CURRENT",
      "id": "v2.0",
      "links": [
        {
          "href": "https://api.zbk.cloud.syseleven.net:9696/v2.0", <<==
          "rel": "self"
        }
      ]
    }
  ]
}

The first shot did not work for nova and neutron, because the versions-endpoints had a different pipeline configuration, than all the other stuff. I just needed to add HTTPProxyToWSGI to those endpoints and everything worked fine. Because some oslo_middleware developers decided to disable HTTPProxyToWSGI by default for a strange reason. This flag is not yet available in OpenStack Mitaka, but the default will break our deployment on an update to Newton. So I set enable_proxy_headers_parsing = True in all our OpenStack Services, and let it be ignored in the current deployment.