Weekly 2016/33
This is the first attempt to write about my week in tech. More about the things I did, and less about things I read about. First of all, this is for me. So times, I tend to be a bit frustrated about being caught in a treadmill. So I really don't know where this is going.
Let's step into this week. It was going to be a short week. I'm still on part time, and working 30h a week. This week from Monday to Thursday. The main task, and target as well, was bootstrapping the hardware nodes for our second OpenStack Cloud, which is going to be a lab. Means, this is the second time to bootstrap with our automation, and the first time doing it from zero to working in without steps in between.
For sure, the hardware is not exactly the same like the production cloud, but only differs some nuances. The most important difference for the entire cluster is the fact, that we dual use the boot infrastructure (like pxe boot) of the production cloud. However, the lab cloud must get its own network, but also reach our boot infrastructure and must be reachable from the cluster workstations of the production cloud. For sure, the connections were set up by our netops.
Slight insight into the networking. The ToR-switches of the production cloud are connected to a qfabric. The ToR of lab as well. Between those two, a mpls l3vpn is configured. Even the racks stand side by side, we decided to use this configuration. Our plan is to spread several clusters over the data center. So this is a quite nice test case.
The nodes were started end of the week before, but did not show up in the machine database. I even didn't expect this working the first time. Unfortunately, due to the design, I was not able to reach any bmc of the nodes. The nodes should be configured in a way to boot from network into a live system. This does a quick inventory of the machine, and submits it into the database. We as the admins, have to configure the rack position. The live system picks this configuration periodically and does some system configuration like setting a fixed bmc ip address and password. We are also able, to set the system disk in our machine database. After this two configuration bits, rack position and system disk, the node is ready to boot into the ubuntu installer and configures itself with a preseed. Again, this was some decision made somewhen. I would prefer Images at this point. But however.
However. None of the 10 machines showed up, and bmc did not work. We send out one of our colleages to the datacenter. While he was on his way, I connected to the ToR of the lab, because this was reachable, and did some basic tests. Because we use anycast for dns, ntp and some more stuff, the first guess was no working anycast. Turns out anycast worked. But the throughput was at least one problem. I only got 1.4MByte/s from ToR to ToR. Which is really bad, they are connected with 40GBit/s. My DC-fellow check network connections, and correct patching. Everything looked good so far. Since we started late, due to the meeting-monday, we called it a day.
On Tuesday, i tried to get our netops, to get the problem solved. Most of them were busy as well, but with little chatting, and further debugging, I detected a drop of the mtu from 9000 to 1500 somewhere in the qfabric. With the help of the netops, we checked the links. All were configured properly. Great!? Since both ToRs are configured with a mtu of 9000, and the bad throughput, I expect pmtu discovery not to work. Which is even worse if 10 nodes should do netboot/install stuff almost at the same time. Since we still lacked bmc, my colleage again visited the DC. He shut off all the nodes and brought back the first one. This barely worked, and it was able to download the live image.
I was stucked at this point. I had to wait for the netops who initially configured the l3vpn to help out. So in the meantime, I helped to submit a new version of the internal storage manager (that configures disks to be system or lvm or cluster fs), and get it out to the production cluster. This removed a blocker of a aborted rollout of the week before. At my end of the day, we had 4 of 10 nodes running. Before I left, I could grab the netops and he promised to have a look at this issue. During my spare time, my fellow in the DC discovered, that six nodes didn't even boot from the correct network device. For sure he was not able to fix this because we need a firmware update for the cards to get it working.
Wednesday. Like my netops promised, the mtu issue was fixed. Wohooo. Turned out, there was an overall-setting that overrides the concrete port configuration m( Because of the mtu struggles, I didn't debug the ToR of the lab enough. To recap, we still lack bmc. In the morning I discovered a missing link on ToR to bmc-switch. Lukily for me, my fellow had to visit the DC again, to update the firmwares. So he armed himself with some optics.
In this field, there was nothing to do for me. I waited for the updated nodes to submit their inventory to our database, and a working bmc. So I continued the work on a solution to run a cluster wide unattended update, but stop if something strange happen, or a human hit the red button. In the evening, my fellow replaced the fiber and got another 3 nodes running. The remaining nodes ended in an endless loop in the installer. But I thought, this is pretty good. We could define those nodes to be compute nodes, and deploy OpenStack with only one compute.
My Thursday starts with a totally different topic. We have a Thursday time slot of 15 minutes to present some interresting stuff. I showed my kubernetes stack running onto of our OpenStack. 15 minutes are way to less. I could talk hours about this topic. But anyway. It is a great way to show other teams topics to think about. Even I would like to see more faces in my presentation. I did a live demo of booting the stack, so there are no slides, but the stack is available on github.
After this short trip, I went back to the cluster bootstrap an noticed, that exactly those broken nodes have to be the controllers, because of the ssd provisioning. So we had to get the nodes out of the door. However, we have bmc, so at least no one had to visit the DC. For sure, no errors were shown on the serial console. Nothing but the over and over starting progress bar. Thankfully my DC-fellow was able to connect to ilo and get some error messages. Somewhere in the bootup-time, the routes to the lab disappeared. There was a experimental rollout on the lab-ToR, that broke the config of bird. Trivial fix and we were up again. Some of my tests of the installed nodes showed missing bits in the consul configuration, as well as wrong configuration of the fqdn. Trivial fixes for another fellow.
Back to the insallation process. Turned out. The very last step failed. Because ubuntu changes the boot device to install-disk, there is some late command to reset the boot order netboot. We always use netboot, to be able to change grub config even if we are not able to connect to the nodes at some time. The netboot grub fetches the next desired boot target for the node. A tiny fix in a regexp to detect the network device did the trick. The nodes installed correctly and came up like expected.
This was at 1430. I called it a week and left for weekend. With major help of "the DC-fellow", I was able to get done the ground work for a running lab (installed with the same automation like production). The next steps are up to my colleages anyway.
All in all a very productive week, in face of "only" 30h :)