I started this week with minor development tasks. Two weeks ago, I
implemented a feature in our machine database that lets it talk to
different Consul datacenters; there were some review comments I had to
address. This enables us to reach the Consul cluster of our lab, even
though the machine database is located in a different cluster.
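The post doesn't name the client library the machine database uses; as a
hedged illustration of the same idea, python-consul lets a caller pick the
target datacenter via its `dc` parameter (hostname and datacenter name
below are made up):

```python
import consul

# Illustration only: host and datacenter names are hypothetical.
# The `dc` parameter makes the client query the lab datacenter's catalog,
# even though the agent we talk to lives in a different datacenter.
lab = consul.Consul(host="consul.service.example", dc="lab")

index, services = lab.catalog.services()
print(sorted(services))
```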
My colleagues started to roll out our Ansible roles to the lab cluster.
Since we changed a lot of things to support multiple clusters, we expected
the rollout to be painful, and it did not take long to run into the
first error: the storage manager crashed due to a wrong usage of
[pathlib.PosixPath]{.title-ref}. I fixed those issues and built Debian
packages as well, with building the debs being the most painful part of
this task.
At some point in the rollout, we were unable to reach all of the lab
hosts, but only from two hosts, including one of our cluster
workstations; the rest worked fine. I discovered "strange" routes
on the ToR of the lab and was able to track down the issue with the help of
our NetOps. I learned how to ask bird for learned and exported routes
(see the commands sketched below). This quickly showed that two hosts in
the lab exported the same IP address as the production cluster. With this
information, a colleague reconfigured the canonical IP addresses in his
host-vars and re-ran Ansible. After bringing the wrong (duplicate) address
down, everything worked fine.
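For the record, asking bird boils down to a few `birdc` commands; the
protocol name and the prefix below are placeholders, not our real setup:

```sh
# Routes BIRD has learned from a given BGP session (protocol name is a placeholder)
birdc show route protocol tor_lab

# Routes BIRD exports towards that session
birdc show route export tor_lab

# Who announces a specific prefix, with full details
birdc show route for 192.0.2.10 all
```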
On Tuesday, we continued our work on the lab platform. After some minor
changes to our Ansible roles, we discovered some inconsistencies in the ASN
numbering. I reverse-engineered the remaining parts of the BGP
configurations of both clusters and introduced a new ASN policy. This
included changing all the ASNs in our new lab, which meant temporarily
losing control of all the nodes in the lab. Fortunately the
configuration was successful and all nodes re-announced their IPs. The
configuration change was again done together with our NetOps, because we
needed some changes on devices we don't have access to. After all the
changes, the connection between production and lab was established again,
and the loadbalancers in the lab were able to announce public IPs. We have
an accessible lab now :)
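For context, the per-node BGP setup is roughly what the following BIRD
snippet sketches; the ASNs, the neighbor address, and the filter name are
invented for illustration, since the actual policy is internal:

```
# Illustrative only: real ASNs, addresses, and filters differ.
protocol bgp tor_uplink {
    local as 4200000101;                   # node ASN from a (hypothetical) lab range
    neighbor 198.51.100.1 as 4200000001;   # ToR switch
    import all;                            # learn routes from the ToR
    export where proto = "canonical_ips";  # announce only our canonical addresses
}
```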
At some point on Tuesday, I tried to get metrics from Consul. It turned
out that Consul can only stream them to something like statsd.
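For reference, pointing the agent's telemetry at a statsd sink is a small
agent-config change; the address below is an assumption (a local statsd on
its default port):

```json
{
  "telemetry": {
    "statsd_address": "127.0.0.1:8125"
  }
}
```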
Bringing it all together: one of my colleagues changed many of our Ansible
roles to make the lab possible; in fact, the rollout of the last days was
based on these changes. He started a while ago, so master diverged a lot,
resulting in a huge, unmaintainable merge request. Since the cluster
rollout worked fine, we split the merge request into smaller ones that we
merge into master without affecting production. Just one large refactoring
of the loadbalancer role remains. After getting all the v4 BGP sessions
alive during the last days, today I fixed issues with the v6 configuration
and brought up all those sessions as well. This was the last bit for a
fully working underlay network :) In the afternoon I documented things for
the baremetal deployment and the network configuration.
On Thursday we did some minor fixes to the loadbalancer role and finally
merged it into master. A feature of yet another colleague became ready to
merge, and we brought that to master as well; since it is not yet used in
any playbook, it will not affect any running services. After those
merges, we applied the entire new master to our lab. The next step would
be to re-deploy the lab from master, and since another team would like to
participate, we scheduled this for next week. I tried to check some
network settings of our SDN MidoNet, as a prerequisite for running the
integration test suite, and discovered missing configuration of external
IPs and BGP sessions. That part is barely automated: there is a shell
script in our repo, but it is never called automatically on bootstrap.
The script also does not check whether the objects it tries to create
already exist, and it only works against our production cloud, even
though it should never be run against it, precisely because of that lack
of checks. Anyway, it is a nice task for our new team member to rewrite
most of the script as an Ansible role that could run regularly. Since we
need this script updated for the lab deployment anyway, we'll just wait
for the rewrite.
Friday, one of my short days this week, and the lab works well enough to
test some things. First I helped Mathias to set up the loadbalancers with
the production config again. Then we migrated to the new loadbalancer
role without any problems. Yay. For the rest of the day, I tried out a fix
to the configuration of all OpenStack APIs that I did a few weeks ago. Due
to the lack of a "not production" environment, I had not been able to test
it so far. Today I did, and the first shot worked well for all services
except the two relevant ones, nova and neutron. The latest version of
python-openstacksdk does not work against our neutron. We use the
traditional OpenStack service ports, but with https. The services return
http links in their API JSON results, and if the client follows those
links, it talks http to the https endpoint, resulting in a bad request
(400). The solution is to configure oslo_middleware.HTTPProxyToWSGI for
each service; with this, all the returned JSON links are https.
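Concretely, that means adding the filter to each service's paste pipeline.
A minimal sketch, assuming a neutron-style api-paste.ini; the pipeline name
and contents are illustrative, not our exact configuration:

```ini
[filter:http_proxy_to_wsgi]
# Makes the service honour X-Forwarded-Proto & friends when building the
# links it returns, so they come back as https behind a TLS frontend.
paste.filter_factory = oslo_middleware.http_proxy_to_wsgi:HTTPProxyToWSGI.factory

[pipeline:neutronapi_v2_0]
# Illustrative pipeline; the important part is http_proxy_to_wsgi near the front.
pipeline = http_proxy_to_wsgi request_id catch_errors authtoken keystonecontext extensions neutronapiapp_v2_0
```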
Before:

```
> curl https://api.cbk.cloud.syseleven.net:9696/ | jq '.'
{
  "versions": [
    {
      "status": "CURRENT",
      "id": "v2.0",
      "links": [
        {
          "href": "http://api.cbk.cloud.syseleven.net:9696/v2.0",   <<===
          "rel": "self"
        }
      ]
    }
  ]
}
```
After:

```
> curl https://api.zbk.cloud.syseleven.net:9696/ | jq '.'
{
  "versions": [
    {
      "status": "CURRENT",
      "id": "v2.0",
      "links": [
        {
          "href": "https://api.zbk.cloud.syseleven.net:9696/v2.0",   <<==
          "rel": "self"
        }
      ]
    }
  ]
}
```
The first shot did not work for nova and neutron because their versions
endpoints have a different pipeline configuration than everything else; I
just needed to add HTTPProxyToWSGI to those endpoints and everything
worked fine. Unfortunately, some oslo_middleware developers decided to
disable HTTPProxyToWSGI by default, for a reason that seems strange to me.
The corresponding flag is not yet available in OpenStack Mitaka, but the
new default would break our deployment on an update to Newton. So I set
enable_proxy_headers_parsing = True in all our OpenStack services now and
let it be ignored by the current deployment.
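A minimal sketch of that option as I understand it: it lives in the
[oslo_middleware] section of each service's main config (e.g.
neutron.conf or nova.conf) and is simply unknown to, and thus ignored by,
Mitaka:

```ini
[oslo_middleware]
# Newton's oslo.middleware only parses the proxy headers when this is set;
# Mitaka does not know the option yet and ignores it.
enable_proxy_headers_parsing = True
```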