*(January 2024)*
Notes on the [Kubernetes Controller Runtime](https://github.com/kubernetes-sigs/controller-runtime), focusing on how it works internally rather than how to use it.
The [Kubernetes Controller Runtime](https://github.com/kubernetes-sigs/controller-runtime) contains libraries for building controllers. It is used by [Kubebuilder](https://book.kubebuilder.io/) and the [Operator SDK](https://github.com/operator-framework/operator-sdk) for building [Kubernetes operators](https://github.com/cncf/tag-app-delivery/blob/163962c4b1cd70d085107fc579e3e04c2e14d59c/operator-wg/whitepaper/Operator-WhitePaper_v1-0.md).
## Manager
Each process has a manager. The main roles of the manager are to handle leader election and run the registered controllers. It also manages shared dependencies like clients and caches, provides metrics and health checks, and handles signals.
Controllers are registered with the manager, and each registration configures whether the controller requires leader election. If a controller requires leader election it will only be run while the manager is the leader, otherwise it is always run. If the manager loses its leadership it is stopped (including its controllers) and returns an error, which typically causes the process to exit and be restarted.
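Here is a minimal sketch of this wiring, assuming a hypothetical `ReplicaSetReconciler` type (sketched later in the Reconciler section):

```go
package main

import (
	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Create a manager using the kubeconfig or in-cluster config.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}

	// Register a controller with the manager. ReplicaSetReconciler is a
	// hypothetical reconcile.Reconciler implementation.
	err = ctrl.NewControllerManagedBy(mgr).
		For(&appsv1.ReplicaSet{}).
		Complete(&ReplicaSetReconciler{Client: mgr.GetClient()})
	if err != nil {
		panic(err)
	}

	// Start blocks, running leader election (if enabled) and the
	// registered controllers until the context is cancelled.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```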
### Leader Election
When you run multiple managers for high availability, usually only one controller of each type should be running at once. The controller runtime therefore uses leader election to elect one manager to run the controllers that require it. If the leader fails or shuts down, another manager is quickly elected leader.
By default the manager uses Kubernetes [leases](https://kubernetes.io/docs/concepts/architecture/leases/) for election. A Kubernetes [lease](https://kubernetes.io/docs/reference/kubernetes-api/cluster-resources/lease-v1/) is an object managed by the API server that includes the following fields:
- Name: Identifies the lease
- Acquire time: When the current lease was acquired
- Holder identity: Who currently holds the lease
- Renew time: When the current holder last renewed their claim to the lease
- Lease duration seconds: How long candidates need to wait after the last renew before they can acquire the lease
When the manager starts, it will keep trying to acquire the lease at the configured retry rate (with jitter). It may keep retrying forever if it never gets the lease. Once it acquires the lease it will start the controllers that depend on being leader, then keep renewing the lease at the configured retry rate (again with jitter). If it loses the lease it will return an error and stop the controllers.
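The election behaviour is configured through the manager's options. A minimal sketch using lease-based election (the lease name and namespace are placeholders, and the durations shown are illustrative):

```go
package sketch

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func newManager() (ctrl.Manager, error) {
	leaseDuration := 15 * time.Second
	renewDeadline := 10 * time.Second
	retryPeriod := 2 * time.Second

	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection: true,
		// Name and namespace of the Lease object ("my-operator" is a
		// placeholder).
		LeaderElectionID:        "my-operator",
		LeaderElectionNamespace: "default",
		// How long non-leaders wait after the leader's last renewal
		// before trying to acquire the lease.
		LeaseDuration: &leaseDuration,
		// How long the leader keeps retrying renewal before giving up
		// its leadership.
		RenewDeadline: &renewDeadline,
		// How often acquire/renew attempts are retried (with jitter).
		RetryPeriod: &retryPeriod,
	})
}
```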
Note this simple implementation of leader election doesn't guarantee that only one client is acting as the leader. Since there is no fencing, a client could continue performing operations without realising it is no longer the leader (for example, due to a slow network request).
## Controller
A controller watches for events in the system that require reconciliation. For example, a `ReplicaSet` controller will watch for changes to a `ReplicaSet`, or to any pods the replica set owns.
Each controller has a reconciler and a work queue. It will watch for events from sources and enqueue reconcile requests to the queue. Each reconcile request contains the name and namespace of the object to be reconciled. It doesn't include the event that triggered the reconcile.
The controller then runs 'workers', which watch the work queue for reconcile requests and call the reconciler. Each worker is just a goroutine that blocks waiting for reconcile requests. The number of workers is configured with `MaxConcurrentReconciles`.
After the reconciler runs it returns a result, which determines what happens to the request (see the sketch after this list):
- If reconcile returned an error, the request is queued again with rate limiting
- If reconcile returned 'requeue-after', the request is added to the queue after the given time
- If reconcile returned 'requeue', the request is queued again with rate limiting
- Otherwise if reconcile returned no error the processed request is removed from the queue
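A simplified sketch of this dispatch logic, loosely following what controller-runtime's internal controller does (`handleResult` is illustrative, not a controller-runtime API):

```go
package sketch

import (
	"k8s.io/client-go/util/workqueue"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// handleResult sketches how a worker dispatches on the reconciler's
// result; the real logic lives in controller-runtime's internal
// controller implementation.
func handleResult(queue workqueue.RateLimitingInterface, req reconcile.Request, result reconcile.Result, err error) {
	switch {
	case err != nil:
		// Errors are retried with a backoff from the rate limiter.
		queue.AddRateLimited(req)
	case result.RequeueAfter > 0:
		// Forget resets this request's backoff, then the request is
		// re-added once the delay elapses.
		queue.Forget(req)
		queue.AddAfter(req, result.RequeueAfter)
	case result.Requeue:
		queue.AddRateLimited(req)
	default:
		// Success: reset the backoff and drop the request.
		queue.Forget(req)
	}
}
```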
### Watch
Controllers watch for events from a source. Each source provides a stream of events for Kubernetes objects using the watch API. The events can be filtered using predicates, which are simply boolean checks on each event. Events that pass the predicates are then added to the work queue.
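A sketch of registering sources and a predicate using the builder API (the `ReplicaSetReconciler` is hypothetical; `GenerationChangedPredicate` is a built-in predicate that drops events where the object's `metadata.generation` is unchanged, such as status-only updates):

```go
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

func addToManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		// Primary source: ReplicaSet events.
		For(&appsv1.ReplicaSet{}).
		// Secondary source: pod events, mapped to a reconcile request
		// for the owning ReplicaSet.
		Owns(&corev1.Pod{}).
		// Predicate: ignore events where the object's generation is
		// unchanged (e.g. status-only updates).
		WithEventFilter(predicate.GenerationChangedPredicate{}).
		Complete(&ReplicaSetReconciler{Client: mgr.GetClient()})
}
```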
### Work Queue
The work queue contains reconcile requests for workers to process.
Requests are processed in order, and queued requests are deduplicated: if the same reconcile request is added multiple times before being processed, it is only processed once. Since a reconcile request contains only the name and namespace of the object, multiple events that trigger a reconcile of the same object before it is processed result in a single request.
The queue uses condition variables so workers can block waiting for the next request (or a shutdown signal).
The queue also supports adding requests after a delay, which is used when reconciles return 'requeue-after'. There is also a rate-limited add, which enqueues a request only after a backoff delay chosen by the rate limiter.
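The underlying implementation is `client-go`'s workqueue package. A small sketch of the behaviours described above (deduplication, blocking gets and delayed adds):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// A rate-limited work queue: FIFO with deduplication, plus
	// delayed and rate-limited adds.
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	// Adding the same item twice before it is processed leaves a
	// single queued item.
	queue.Add("default/my-replicaset")
	queue.Add("default/my-replicaset")

	// Used for 'requeue-after': the item appears after the delay.
	queue.AddAfter("default/my-replicaset", 5*time.Second)

	// Get blocks until an item is available or the queue shuts down.
	item, shutdown := queue.Get()
	if !shutdown {
		fmt.Println(item) // default/my-replicaset
		// Done marks the item processed; re-adds made while it was
		// being processed are only queued after Done is called.
		queue.Done(item)
	}
	queue.ShutDown()
}
```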
## Reconciler
A reconciler is a function containing the 'business logic' of a controller. It compares the desired state with the actual state, then takes any actions required to move the actual state closer to the desired state. For example, if the reconciler is invoked for a `ReplicaSet` object that specifies 5 pods (the desired state) but only 3 pods exist in the system (the actual state), it will create two more pods.
The reconciler is invoked with the name and namespace of the object to reconcile. It is stateless and doesn't care which event triggered the reconcile: it looks up the current state and takes whatever actions that state requires.
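A minimal sketch of a reconciler following this pattern (the `ReplicaSetReconciler` is hypothetical, and the actual pod creation is elided):

```go
package sketch

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ReplicaSetReconciler is a hypothetical reconciler showing the
// stateless lookup-then-act pattern.
type ReplicaSetReconciler struct {
	client.Client
}

func (r *ReplicaSetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Look up the current state: the request only carries the name
	// and namespace, not the triggering event.
	var rs appsv1.ReplicaSet
	if err := r.Get(ctx, req.NamespacedName, &rs); err != nil {
		// The object may have been deleted since the event was
		// queued; nothing to do in that case.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Compare the desired state (spec) with the actual state (status)
	// and act on the difference. The pod creation itself is elided.
	if rs.Spec.Replicas != nil && rs.Status.Replicas < *rs.Spec.Replicas {
		// ... create the missing pods ...
	}

	return ctrl.Result{}, nil
}
```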