This is my first attempt to write about my week in tech: more about the things I did, and less about the things I read about. First of all, this is for me. Sometimes I get a bit frustrated about feeling caught on a treadmill. So I really don't know where this is going.

Let's step into this week. It was going to be a short one. I'm still working part time, 30 hours a week, this time from Monday to Thursday. The main task, and the target as well, was bootstrapping the hardware nodes for our second OpenStack cloud, which is going to be a lab. That means this was the second time bootstrapping with our automation, and the first time doing it from zero to working without manual steps in between.

Of course, the hardware is not exactly the same as in the production cloud, but it only differs in some nuances. The most important difference for the entire cluster is the fact that we dual-use the boot infrastructure (like PXE boot) of the production cloud. However, the lab cloud has to get its own network, but it also needs to reach our boot infrastructure and must be reachable from the cluster workstations of the production cloud. Of course, the connections were set up by our netops.

A slight insight into the networking: the ToR switches of the production cloud are connected to a QFabric, and so is the ToR of the lab. Between those two, an MPLS L3VPN is configured. Even though the racks stand side by side, we decided to use this configuration, because our plan is to spread several clusters over the data center. So this is quite a nice test case.

The nodes had been powered on at the end of the previous week, but did not show up in the machine database. I didn't even expect this to work on the first try. Unfortunately, due to the design, I was not able to reach any BMC of the nodes. The nodes are supposed to boot from the network into a live system, which does a quick inventory of the machine and submits it to the database. We as the admins then have to configure the rack position. The live system picks up this configuration periodically and does some system configuration, like setting a fixed BMC IP address and password. We are also able to set the system disk in our machine database. After these two configuration bits, rack position and system disk, the node is ready to boot into the Ubuntu installer and configures itself with a preseed. Again, this was a decision made at some point; I would prefer images here. But anyway.

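To make the flow a bit more concrete, here is a minimal sketch of what the live system does, assuming a hypothetical HTTP API for our machine database (the URL, endpoints and field names are made up for illustration):

```python
# Minimal sketch of the live system's provisioning loop.
# The machine-database API (MDB_URL, endpoints, fields) is hypothetical.
import time
import requests

MDB_URL = "http://machinedb.example.internal"

def submit_inventory(serial, inventory):
    # Register the node with whatever the quick hardware inventory found.
    requests.post(f"{MDB_URL}/nodes/{serial}", json=inventory, timeout=10)

def wait_for_admin_config(serial):
    # Poll until an admin has set rack position and system disk.
    while True:
        node = requests.get(f"{MDB_URL}/nodes/{serial}", timeout=10).json()
        if node.get("rack_position") and node.get("system_disk"):
            return node
        time.sleep(60)

def apply_config(node):
    # Placeholder for the real work: set a fixed BMC IP address and
    # password, mark the system disk, then reboot into the installer.
    print("configuring", node["rack_position"], node["system_disk"])
```
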
However, none of the 10 machines showed up, and the BMCs did not work. We sent one of our colleagues out to the data center. While he was on his way, I connected to the ToR of the lab, because that one was reachable, and did some basic tests. Because we use anycast for DNS, NTP and some more stuff, the first guess was broken anycast. Turns out anycast worked. But the throughput was at least one problem: I only got 1.4 MByte/s from ToR to ToR, which is really bad given that they are connected with 40 GBit/s. My DC fellow checked the network connections and the patching. Everything looked good so far. Since we had started late, due to the meeting-Monday, we called it a day.

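The throughput test itself was nothing fancy, conceptually something like the following sketch (not the exact tool I used; in practice iperf does the same job):

```python
# Rough sketch of a quick TCP throughput check between two hosts.
# Run "server" on one ToR-connected host and "client <ip>" on the other.
import socket, sys, time

PORT = 5001
CHUNK = b"\x00" * 65536
SECONDS = 10

def server():
    with socket.create_server(("", PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            total, start = 0, time.time()
            while (data := conn.recv(65536)):
                total += len(data)
            print(f"{total / (time.time() - start) / 1e6:.1f} MByte/s")

def client(host):
    with socket.create_connection((host, PORT)) as conn:
        end = time.time() + SECONDS
        while time.time() < end:
            conn.sendall(CHUNK)

if __name__ == "__main__":
    client(sys.argv[2]) if sys.argv[1] == "client" else server()
```
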
On Tuesday, I tried to get our netops to solve the problem. Most of them were busy as well, but with a little chatting and some further debugging, I detected a drop of the MTU from 9000 to 1500 somewhere in the QFabric. With the help of the netops, we checked the links. All were configured properly. Great!? Since both ToRs are configured with an MTU of 9000, and given the bad throughput, I suspected PMTU discovery was not working, which is even worse if 10 nodes are supposed to do netboot/install stuff at almost the same time. Since we still lacked BMC access, my colleague visited the DC again. He shut off all the nodes and brought the first one back up. That barely worked, but it was able to download the live image.

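The way I narrowed down the MTU drop was essentially a ping sweep with the don't-fragment bit set (`ping -M do -s <size>`). The same idea as a Python sketch, assuming Linux (the socket options are Linux-specific):

```python
# Sketch of a path MTU probe: send datagrams with the DF bit set and
# binary-search the largest payload that gets through.
import socket

def probe_path_mtu(host, port=33434, low=1200, high=8972):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Ask the kernel to set DF and fail with EMSGSIZE instead of
    # fragmenting when a datagram exceeds the (cached) path MTU.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MTU_DISCOVER,
                    socket.IP_PMTUDISC_DO)
    sock.connect((host, port))
    best = 0
    while low <= high:
        size = (low + high) // 2
        try:
            sock.send(b"\x00" * size)
            best, low = size, size + 1
        except OSError:        # EMSGSIZE
            high = size - 1
    # Caveat: for drops beyond the first hop the kernel only learns the
    # path MTU from returned ICMP "frag needed" messages, so the probe
    # may need a second pass once those have arrived.
    return best + 28           # + IP (20) and UDP (8) headers

print(probe_path_mtu("192.0.2.1"))   # target address is just a placeholder
```
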
I was stuck at this point. I had to wait for the netops who had initially configured the L3VPN to help out. So in the meantime, I helped to submit a new version of the internal storage manager (which configures disks to be system, LVM or cluster FS) and to get it out to the production cluster. This removed a blocker from an aborted rollout of the week before. At my end of the day, we had 4 of 10 nodes running. Before I left, I could grab the netops guy, and he promised to have a look at the issue. During my time off, my fellow in the DC discovered that six nodes didn't even boot from the correct network device. Of course he was not able to fix this, because we need a firmware update for the cards to get that working.

Wednesday. As my netops guy had promised, the MTU issue was fixed. Wohooo. Turned out there was a global setting that overrides the per-port configuration m( Because of the MTU struggles, I hadn't debugged the ToR of the lab thoroughly enough. To recap: we still lacked BMC access. In the morning I discovered a missing link from the ToR to the BMC switch. Luckily for me, my fellow had to visit the DC again anyway, to update the firmwares. So he armed himself with some optics.

In this area, there was nothing left for me to do. I was waiting for the updated nodes to submit their inventory to our database, and for a working BMC. So I continued the work on a solution to run a cluster-wide unattended update that stops if something strange happens, or if a human hits the red button (see the sketch below). In the evening, my fellow replaced the fiber and got another 3 nodes running. The remaining nodes ended up in an endless loop in the installer. But I thought this was pretty good: we could define those nodes to be compute nodes and deploy OpenStack with only one compute node.

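The red-button idea, sketched very roughly: walk the nodes one by one and check a kill switch before each step, for example a key in Consul's KV store. The node list, the update command and the key name below are made up; only the KV endpoint is Consul's standard HTTP API:

```python
# Very rough sketch of a cluster-wide unattended update with a kill
# switch. Nodes, update command and the "red button" key are made up.
import subprocess
import requests

CONSUL = "http://localhost:8500"
RED_BUTTON_KEY = "cluster/update/stop"   # hypothetical key

def red_button_pressed():
    # If the key exists at all, somebody wants us to stop.
    resp = requests.get(f"{CONSUL}/v1/kv/{RED_BUTTON_KEY}", timeout=5)
    return resp.status_code == 200

def update_node(node):
    # Placeholder for the real update; False means something looked odd.
    result = subprocess.run(["ssh", node, "apt-get", "-y", "dist-upgrade"])
    return result.returncode == 0

def rolling_update(nodes):
    for node in nodes:
        if red_button_pressed():
            print("red button pressed, stopping")
            return
        if not update_node(node):
            print(f"something strange happened on {node}, stopping")
            return

rolling_update(["node01", "node02", "node03"])
```
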
My Thursday started with a totally different topic. We have a Thursday time slot of 15 minutes to present some interesting stuff. I showed my Kubernetes stack running on top of our OpenStack. 15 minutes is way too little; I could talk for hours about this topic. But anyway, it is a great way to show other teams topics to think about, even though I would like to see more faces at my presentation. I did a live demo of booting the stack, so there are no slides, but the stack is available on GitHub.

After this short excursion, I went back to the cluster bootstrap and noticed that exactly those broken nodes have to be the controllers, because of the SSD provisioning. So we had to get these nodes out of the door. However, we now had BMC access, so at least no one had to visit the DC. Of course, no errors were shown on the serial console. Nothing but the progress bar starting over and over again. Thankfully my DC fellow was able to connect to iLO and get some error messages. Somewhere during the boot attempts, the routes to the lab disappeared. There had been an experimental rollout on the lab ToR that broke the config of bird. A trivial fix, and we were up again. Some of my tests of the installed nodes showed missing bits in the Consul configuration, as well as a wrong FQDN configuration. Trivial fixes for another fellow.

Back to the installation process. Turned out the very last step failed. Because Ubuntu changes the boot device to the install disk, there is a late command that resets the boot order back to netboot. We always use netboot, so that we can change the grub config even if we are not able to connect to the nodes at some point. The netboot grub fetches the next desired boot target for the node. A tiny fix in a regexp to detect the network device did the trick. The nodes installed correctly and came up as expected.

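As an illustration of what such a fix typically looks like (a hypothetical reconstruction, not the actual change): regexps for network devices often only match the old ethX names and miss the predictable eno1/enpXsY style ones.

```python
# Hypothetical reconstruction of the regexp fix: match both the classic
# ethX names and predictable eno1/enp2s0f1-style names when picking the
# network device for the boot-order reset.
import re

OLD_NIC_RE = re.compile(r"^eth\d+$")                      # before: ethX only
NIC_RE = re.compile(r"^(eth\d+|en(o|p\d+s)\d+(f\d+)?)$")  # after

for name in ["eth0", "eno1", "enp2s0f1", "lo", "docker0"]:
    print(name, bool(NIC_RE.match(name)))
```
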
This was at 14:30. I called it a week and left for the weekend. With major help from "the DC fellow", I was able to get the groundwork done for a running lab (installed with the same automation as production). The next steps are up to my colleagues anyway.

All in all a very productive week, in the face of "only" 30 hours :)