Weekly 2016/35

This is the third approach to write a weekly, and the first one, I totally failed. I didn't make any notes of the week :/

But let's start anyway. This week was covered by a re-deployment of our lab. As expected several things went wrong, and we had to fix it iteratively. Nothing special so far. This couldn't be sticked to one special day. Like describe in the last weeks, I digged around in our network setup. This week, I took my time and documented it down into our IPAM nipap. That includes all the transfer nets and all public IPs assigned to us. In case of service IPs, I documented the "function" as well. We are now able to collect new IPs directly from this pools, without having to grep config files, and hope you catched everything.

On Wednesday, one day with fixing lab-stuff, I got the request to benchmark our S3. Which I started on my short Thursday. I decided to write a script in Go to parallelize the work. This would be perfectly doable in python as well, but I wanted to use something new. First results were collected on Friday morning. Over and over again in this week, I helped a fellow to configure the forward proxy corretly.

Beside of the S3-work, I was in our change commission, to discuss what things we can do on our own, and what should be discussed again in this commission. Lucky us, we can do most of the things, I wanted to.

Friday, started with collecting S3 benchmarks. Then an outage of our cinder-volume occured, which I helped to fix. For sure, since one of my changes broke the configuration :( Fortunately, it resulted in a loss of control and running instances were not affected. I wrote the post-mortem about this outage, and filed some tickets to fix missing bits in our monitoring. After this, I fixed an issue with our StorageManager. It creates new logical volumes on cluster bootstrap, but in our configurations they are very likely to be at the same position like before. So xfs restored the filesystem instead of creating a new one. I now wipefs the new created lv.

Currently I am graphing the S3-results with jupyter notebook, which is really cool. Give it a try. I installed mathplotlib and numpy as well. The performance in average is pretty ok, both up and download. But the long tail does really hurt. Some of the requests took around 8 seconds, while the average is around 0.08 seconds.

What did I test? I uploaded around 12500 files from a pool of 25 unique images. Using 5 uploads in parallel, and parallel to this, 10 downloads of random images. Both over and over again. I could even do more in parallel. Anyway. I we got some tickets about slow S3-requests. But like all tickets without any steps to recreate. The result we see here, is exactly what I wanted to get out of this tests. In short-term we are going to update our S3-Layer to the latest version. When the results are still the same, we can file a bug against our vender. Of course with steps to recreate ;) Maybe the latest version will be in lab somewhen next week.