A Tale On Kubernetes Controllers
Note: This is a pretty incoherent write-up of whatever happened to be in my head on two different days. Maybe it is useful nevertheless 🤷
From time to time, I write operators (or controllers) for Kubernetes, primarily using Go and thus controller-runtime, either through kubebuilder or operator-sdk. The idea behind an operator is reconciling something (let's say an external resource) to the desired state. See the Operator Pattern and the Controllers concept.
In short: an operator observes the desired state in kube-apiserver and tries to "move" the resource to match that desired state. With controller-runtime, the logic is implemented in the Reconcile method.
It is often desirable to run only a single instance of the operator. When dealing with idempotent APIs, one can run multiple instances, but in general it is recommended to have only a single "active" instance in the cluster.
When creating the boilerplate with kubebuilder or operator-sdk, leader election is set up automatically. Thus, a failed instance will be taken over by another one after a while.
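As a rough sketch of what that boilerplate boils down to (not the exact generated code; the election ID is a made-up example), leader election is enabled on the controller-runtime manager:

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Enable leader election on the manager, so that only one instance actively
	// reconciles; the other replicas wait until the lease becomes free.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "my-operator.example.org", // hypothetical lock name
	})
	if err != nil {
		panic(err)
	}

	// ... register controllers with mgr here ...

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```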
Why is that important?
Imagine the operator can talk to an API to create a loadbalancer (we can only use that API, we can't change any of it). The only field the API accepts is a name for the loadbalancer. The create request returns a UUID, by which the loadbalancer can be identified. In an ideal world, the operator would create a loadbalancer, get back the UUID, and store it somewhere for further processing.
uuid = db.get("lbUUID")
lb = api.get_lb(uuid)
if not lb:
    uuid = api.create_lb(name)
    db.store("lbUUID", uuid)
    lb = api.get_lb(uuid)
This looks reasonable, but has some shortcomings. Because there are no transactions spanning the various applications and APIs, the code could fail at any point in time. Fail not in the sense that the application itself errors out, but because of external circumstances (OOM, node failure, etc.).
For example, during or right after calling create_lb: there was just no time to retrieve and store the UUID. The next reconciliation loop would create another loadbalancer, and we might end up with two or more instances we need to pay for.
Sometimes the code could end up with an error, for example because a port is already used by another loadbalancer instance. What would be the correct action? Adopting the loadbalancer? Erroring out and retrying? Probably the latter, because who knows what created the loadbalancer in the first place, or how.
The idea behind an operator is that the program observes the world, tries to move the world in the direction of the desired state, and stores the observed state in the status field.
It is not necessarily a good idea to store observed state somewhere else for identification purposes, even though there are examples that do kind of that, such as storing a providerID. However, in the world of the Kubernetes cloud-provider, the provider is deliberately very limited in what it can do. For instance, the reconciler code for the loadbalancer has no access to Kubernetes. So the best a reconciler sometimes gets is a name, and it has to deal with the challenges of the external API.
lbs = api.list_lbs(filter = {"name": name})
if lbs:
    lb = lbs[0]
else:
    lb = api.create_lb(name)
Here be dragons again. The API does allow multiple loadbalancers with the same name. Every other method of identifying the object might be flawed in some way, or will suffer from the same problems. For example, if the API lets us set tags on resources but doesn't allow them in the create call, we are back at square one.
Even though this is still pretty unsatisfying from a purist's perspective, running a single instance of the operator is probably the best solution, especially since the alternative would be to introduce distributed consensus on what to create when and where.
There are always trade-offs. Depending on the use case, it might be sufficient to just pick result[0]; sometimes you might want to be more conservative and return an error if you get more than one result. You have to make those decisions; there is no way around them.
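To make the more conservative variant concrete, here is a minimal sketch in Go (the LoadBalancer type and the LBClient interface are hypothetical, not a real SDK): adopt a single match, create one if none exists, and return an error when the name is ambiguous, so the reconciliation is retried with backoff instead of silently guessing.

```go
package lb

import (
	"context"
	"fmt"
)

// LoadBalancer and LBClient are made up for illustration only.
type LoadBalancer struct {
	UUID string
	Name string
}

type LBClient interface {
	ListByName(ctx context.Context, name string) ([]LoadBalancer, error)
	Create(ctx context.Context, name string) (LoadBalancer, error)
}

// ensureLB adopts the single match, creates one if none exists, and refuses to
// guess when the name is ambiguous.
func ensureLB(ctx context.Context, c LBClient, name string) (LoadBalancer, error) {
	lbs, err := c.ListByName(ctx, name)
	if err != nil {
		return LoadBalancer{}, err
	}
	switch len(lbs) {
	case 0:
		return c.Create(ctx, name)
	case 1:
		return lbs[0], nil
	default:
		return LoadBalancer{}, fmt.Errorf("found %d loadbalancers named %q, refusing to pick one", len(lbs), name)
	}
}
```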
So far, we took a look at operators and some of their challenges, especially when it comes to concurrently running processes. But do we suffer from the same problem within a single process? In some cases, a single reconciliation at a time is enough, namely if there are infrequent updates/changes to the source and/or one run of the loop is fast. But think back to the loadbalancer example: imagine each reconciliation takes 30 seconds. If you create 10 loadbalancers at the same time, it would take 300 seconds to create all of them, even though, in theory, the API would be able to create 10 in parallel without problems.
Let's take a look at the interface of a Kubernetes reconciler:
type Reconciler = TypedReconciler[Request]

type TypedReconciler[request comparable] interface {
	// Reconcile performs a full reconciliation for the object referred to by the Request.
	//
	// If the returned error is non-nil, the Result is ignored and the request will be
	// requeued using exponential backoff. The only exception is if the error is a
	// TerminalError in which case no requeuing happens.
	//
	// If the error is nil and the returned Result has a non-zero result.RequeueAfter, the request
	// will be requeued after the specified duration.
	//
	// If the error is nil and result.RequeueAfter is zero and result.Requeue is true, the request
	// will be requeued using exponential backoff.
	Reconcile(context.Context, request) (Result, error)
}
For each event, no matter whether it is a create, update, delete, or some generic event, the Reconcile method will be called. You will only see the request object: not what happened, not what state was there before, nor the change.
Remember? The idea is to grab the desired state, drive the resource towards it, and record what you observed along the way.
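A minimal sketch of what an implementation of that interface typically looks like (DeploymentReconciler is a hypothetical stand-in; usually you would reconcile your own CRD):

```go
package controllers

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// DeploymentReconciler is only a stand-in for your own reconciler type.
type DeploymentReconciler struct {
	client.Client
}

func (r *DeploymentReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// The request only carries a namespace/name; fetch the current desired state.
	var dep appsv1.Deployment
	if err := r.Get(ctx, req.NamespacedName, &dep); err != nil {
		// The object may already be gone; in that case there is nothing to do.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// ... drive the external world towards dep.Spec and record what was observed in dep.Status ...

	// A non-nil error requeues with exponential backoff, a RequeueAfter requeues
	// after a fixed delay, and an empty Result simply ends this reconciliation.
	return ctrl.Result{}, nil
}
```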
But does that mean we are really called for each and every event? Like… a create of the Kubernetes resource, which leads to creating the external resource, and during that time someone edits the labels on the Kubernetes resource. Would that result in the same difficulties described above?
When running only a single Reconcile at any time, certainly not. The second event would be "batched" and fed to the reconciler as soon as the previous run of Reconcile finished.
Is it a good idea to crank up MaxConcurrentReconciles to, let's say, 10? See Controller Options. Or do we need to implement something to coordinate the execution of reconciles somehow? That is certainly easier to do within the same process than distributed. But is it necessary?
The short answer is no, you don't have to implement it. Not because the problem does not exist, but because of the workqueue used internally, provided by client-go. See Learning Concurrent Reconciling (from 2019) for more details.
If you need more than one Reconcile running at a time, because of long reconcile durations or the number of objects in Kubernetes, you can tune MaxConcurrentReconciles. controller-runtime, with the help of the workqueue, will take care that only one Reconcile per object is running at any given time.
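Tuning that knob happens on the controller builder. A rough sketch (reusing the hypothetical DeploymentReconciler from above; the value 10 is just an example):

```go
import (
	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

func setupWithManager(mgr ctrl.Manager, r *DeploymentReconciler) error {
	// Up to 10 objects are reconciled in parallel, but the workqueue still makes
	// sure a single object is never handled by two goroutines at once.
	return ctrl.NewControllerManagedBy(mgr).
		For(&appsv1.Deployment{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: 10}).
		Complete(r)
}
```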
Let's do a quick detour back to the events #
I mentioned that the reconciliation function is called for events happening.
There are ways to filter events, and you can decide on which events you would like to be called, depending on your use case.
Sometimes it makes sense to store a condition with a field observedGeneration, recording the metadata.generation of the last successful and complete reconciliation. You would then filter out all events where metadata.generation == observedGeneration, so you don't reach out to external APIs and hit some rate limit in the process.
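Such a filter can be expressed as a predicate. A sketch (appsv1.Deployment is only used here because its status already carries an observedGeneration; with your own CRD you would read it from your status field or condition instead):

```go
import (
	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// skipAlreadyReconciled drops update events for generations we already handled.
var skipAlreadyReconciled = predicate.Funcs{
	UpdateFunc: func(e event.UpdateEvent) bool {
		dep, ok := e.ObjectNew.(*appsv1.Deployment)
		if !ok {
			return true
		}
		// Only wake up the reconciler if the spec generation moved past what
		// was last observed.
		return dep.Generation != dep.Status.ObservedGeneration
	},
}
```

You would plug this in via builder.WithPredicates(...) on the watch or WithEventFilter(...) on the whole controller; controller-runtime also ships predicate.GenerationChangedPredicate, which filters out updates where the generation did not change at all.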
Sometimes, if your code for instance sets routes on the system, you might always return with RequeueAfter. Checking and setting routes is pretty fast, and there are no real rate limits in place. So it might be better to make sure a route is in place, let's say every 30 seconds, than to accidentally not have the route at all, either because of a bug in the code or because someone/something removed it. This kind of reconciliation would also run with MaxConcurrentReconciles = 1, so there are never two route-altering processes at a time.
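In code, that pattern is just the RequeueAfter return value. A sketch (RouteReconciler is hypothetical):

```go
import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// RouteReconciler is made up; imagine it keeps host routes in sync.
type RouteReconciler struct{ /* netlink handle, node config, ... */ }

func (r *RouteReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... check the route and (re)create it if it is missing ...

	// Always come back after 30 seconds, so an accidentally removed route never
	// stays missing for long.
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
```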
Sure, you could watch netlink events in a separate goroutine and create a GenericEvent. However, this would need additional bookkeeping between the route and your object. The effort might not be worth it. But again, everything depends on your use case.
Reaching out to check and maybe create/update an S3 bucket, on the other hand, might not be worth doing over and over again. Here you would likely use the observedGeneration pattern.
You could even watch other objects and reconcile your real object. When there are ownerReferences, kubebuilder can do this pretty much automatically by using .Owns(...).
But even if you watch some object without a real (or directly visible) relation to your object, .Watches(...), in combination with the correct enqueue function, can do exactly this. Your code "just" needs to be capable of drawing the line between the observed object and your actual ones.
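A rough sketch of both, assuming a reasonably recent controller-runtime (the Watches signature has changed between versions); the Deployment, ConfigMap, Secret, and the mapping are made up:

```go
import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func setupWatches(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&appsv1.Deployment{}). // stand-in for your "real" object
		// Owned objects: a change to a ConfigMap with an ownerReference enqueues its owner.
		Owns(&corev1.ConfigMap{}).
		// Unrelated objects: the map function is where your code draws the line
		// between the observed Secret and the object(s) you want to reconcile.
		Watches(&corev1.Secret{}, handler.EnqueueRequestsFromMapFunc(
			func(ctx context.Context, obj client.Object) []reconcile.Request {
				return []reconcile.Request{{
					NamespacedName: types.NamespacedName{
						Namespace: obj.GetNamespace(),
						Name:      "my-object", // hypothetical mapping
					},
				}}
			})).
		Complete(r)
}
```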
Resources #
I strongly encourage you to have a look at the kubebuilder book for inspiration. Also check the Kubernetes API conventions. If you would like to see a real-world example with several objects involved, take a look at cluster-api.
Updates #
In a recent edition of golangweekly, I came across So you wanna write Kubernetes controllers by ahmetb. It is much more thought out than what I am capable of.