Weekly 2016/37
No meetings on Monday: this week, our all-hands meeting was moved to Tuesday. Because most people in my team were on vacation, we did a short daily instead of a one-hour weekly. More time to do real work :) I returned to the cluster-wide unattended updates and spent at least half of the day on this topic. Another look into the networking setup of our lab and some more documentation filled the rest of the day.
Back in the office on Wednesday after a sick day. From Tuesday on, the production cloud had been kind of slow: creating a bare network took 6 seconds on average, instead of the 2 seconds we see in the lab. Digging into the problem showed one RabbitMQ queue filling up. It is supposed to be consumed by a telemetry solution, which is not deployed correctly yet. I purged all the messages from the queue and everything went back to normal.
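For the record, purging such a queue does not need more than a few lines against the broker. A rough sketch with the pika client; the host and queue name are made up here, the real one is whatever queue the missing telemetry consumer should read from:

    import pika

    # Hypothetical connection details and queue name -- adjust to the real
    # broker and the telemetry queue that piled up.
    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host="rabbitmq.example.com"))
    channel = connection.channel()

    # Queue.PurgeOk tells us how many messages were dropped.
    result = channel.queue_purge(queue="metering.sample")
    print("purged %d messages" % result.method.message_count)

    connection.close()

Newer rabbitmqctl versions can do the same from the shell with purge_queue.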
On Thursday I did not have much time to dig into unattended updates again, just an hour in the morning. This week is strange, with many little disturbances, and I literally forgot all the small stuff. After lunch we did a complete wipe and reinstall of our lab. I left after 2 hours of installation for 10 hardware nodes. That does not sound very fast, but it is actually OK, especially since we had to run the process on our first 3 nodes (those now form our Consul cluster) twice, because of an expired PGP key. My colleagues continued with running Ansible on the nodes, without major problems.
Although the installation the day before was pretty OK overall, I cleaned up some things on Friday. With some minor changes, the installation now runs smoother than on Thursday. I did not run an entire installation, but shut off the Consul cluster to simulate the "nothing is there" situation. This worked pretty well. I did a second reinstallation with Consul turned on; this is the usual scenario when we grow the existing cluster by new machines. I finished the installation by applying Ansible. Because the node I selected was a gateway of our SDN, there was a small problem with the node getting a new UUID, but it was easy to resolve by hand. This scenario is not common, so I just documented the issue and did not automate it away. Furthermore, I extended the existing documentation with things I saw the day before that were not written down yet. Somewhere in between, we had a call with our storage vendor about the next steps. In the afternoon I prepared the update of Consul from 0.6.4 to 0.7. It's going to be bad:
"Consul's default Raft timing is now set to work more reliably on lower-performance servers, which allows small clusters to use lower cost compute at the expense of reduced performance for failed leader detection and leader elections. You will need to configure Consul to get the same performance as before. See the new Server Performance guide for more details. [GH-2303]"
In other words: it did not work well on a Raspberry Pi, we fixed that, but broke it for everyone else. And of course, Consul 0.6 refuses to start because of an unknown key if I place the new parameter in all config files ahead of time :(
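For reference, the parameter in question is the new performance stanza in the agent config; as far as I read the guide, setting raft_multiplier to 1 gives you the pre-0.7 timing back:

    {
      "performance": {
        "raft_multiplier": 1
      }
    }

So this snippet can only be rolled out together with the 0.7 binaries, not ahead of them.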
Oh, almost forgot. Last week I wrote about my plans to clean up my vim modules and, along with that, to switch to neovim. Turns out: cleaning up went well, the neovim experiment did not. It felt OK in most regards, but not better than good old vim. And one thing made me scream several times and go back to vim: in some/most cases, a hit on ESC took a while to complete. Long enough that I typed a few more characters before leaving insert mode. And I'm escaping a lot!
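If I give neovim another try, the first thing to check is whether the lag comes from waiting for mapped keys or terminal key codes. The usual knobs for that are timeoutlen and ttimeoutlen; something like this is what I would try in the config first (just a guess, it may not have been the culprit in my setup at all):

    " keep the normal mapping timeout, but stop waiting long for
    " terminal key codes like ESC
    set timeout timeoutlen=1000
    set ttimeout ttimeoutlen=10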