Two things to know about Kubernetes Mutating Webhooks
This is a quick story about a bug in SimKube where I knew what the bug was from the beginning, and it still took me several hours to troubleshoot. In the end, the problem was… exactly what I suspected it was at the beginning, but it manifested in such a weird way that I couldn’t piece it together.
This is a story in two parts[1] about mutating webhooks in Kubernetes.
What’s a webhook, precious?
OK, so, Kubernetes lets you extend the system in two ways[2]: the first is through Custom Resources, which let you define your own “Kubernetes objects” that look, feel, and behave just like all the built-in Kubernetes objects (Deployments, DaemonSets, etc). The second way is by injecting random bits of code into the pipeline to inspect or modify these resources at runtime. These random bits of code are called webhooks.
There are two types of webhooks[3]: validating admission webhooks and mutating admission webhooks. More-or-less, these do exactly what they sound like: they either check that Kubernetes resources or actions meet some policy, or they modify Kubernetes resources before they’re created. The reason they’re called webhooks is that you interact with them over the web (or rather, over the cluster’s network, which I dearly hope is not something that you’ve exposed to the open Internet, but, you know, you do you).
There are two methods to create a webhook[4]: YAML and JSON. For example, here is the webhook specification that SimKube uses (with some fields omitted for brevity):
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: sk-testing-mutatepods
webhooks:
  - name: create-update.mutatepods.simkube.io
    clientConfig:
      service:
        name: sk-testing-driver-svc
        namespace: simkube
        port: 8888
    rules:
      - apiGroups:
          - ""
        apiVersions:
          - v1
        operations:
          - CREATE
          - UPDATE
        resources:
          - pods
          - pods/status
    failurePolicy: Fail
    matchConditions:
      - expression: object.metadata.namespace.startsWith('virtual-')
        name: virtual-namespaces-only
    matchPolicy: Equivalent

There are two key things to note[5]: the first is the clientConfig block; this contains a service reference, which tells Kubernetes what the endpoint of the webhook network request should be. In this case, it’s the SimKube driver, which is actually running our simulations. The second key thing to note is the rules block: this tells Kubernetes under which circumstances the webhook should be invoked. For SimKube, we want our webhook to be invoked whenever a pod object is created or updated. The third key thing to note is the failure policy, which instructs Kubernetes that the pod creation or update operation should be rejected if the webhook doesn’t work for whatever reason (previously, the failure policy was set to fail open, which Kubernetes spells Ignore, and that was one of the inciting factors in ACRL’s first public postmortem). The last key thing to note is the match conditions; these are CEL expressions which allow you to narrow the targets of the webhook even further; here we’re instructing Kubernetes to only call the webhook for pods that are created in a namespace with the virtual- prefix.
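To make the other end of that clientConfig a bit more concrete, here is a minimal sketch of what a mutating webhook sends back when the API server calls it. This is not SimKube's driver code (which isn't shown in this post); the function name, the example uid, the namespace, and the label in the patch are all made up for illustration. The parts I'm confident about are the general shape: the API server sends an AdmissionReview, and a mutating webhook replies with an AdmissionReview whose response echoes the request uid and carries a base64-encoded JSON Patch.

```python
import base64
import json


def build_mutation_response(review: dict, patch_ops: list) -> dict:
    """Build an AdmissionReview response that allows the request and applies a JSON Patch.

    `build_mutation_response` is a hypothetical helper name, not a SimKube function.
    """
    request = review["request"]
    patch_bytes = json.dumps(patch_ops).encode("utf-8")
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            # The response must echo the uid from the request.
            "uid": request["uid"],
            "allowed": True,
            "patchType": "JSONPatch",
            # The patch is a base64-encoded JSON Patch document.
            "patch": base64.b64encode(patch_bytes).decode("ascii"),
        },
    }


# Hypothetical incoming request for a pod created in a virtual- namespace.
example_review = {
    "apiVersion": "admission.k8s.io/v1",
    "kind": "AdmissionReview",
    "request": {
        "uid": "example-uid-123",
        "operation": "CREATE",
        "object": {"metadata": {"name": "demo", "namespace": "virtual-default"}},
    },
}

# Hypothetical mutation: the label key and value are invented for this sketch.
example_patch = [
    {"op": "add", "path": "/metadata/labels", "value": {"example.com/simulated": "true"}}
]

example_response = build_mutation_response(example_review, example_patch)
```

A real webhook would wrap something like this in an HTTPS server, since the API server only talks to webhooks over TLS; that part is left out here to keep the sketch short.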
What happens when a webhook breaks?
As I mentioned a few weeks ago, I’ve been extending SimKube to work better with bare pods (that is, pods that aren’t owned by some other Kubernetes resource). There are two weird things about bare pods[6]: if the node they are running on disappears, the pod itself is deleted by the Kubernetes Garbage Collector, and because they don’t have an owner, there is nothing to reschedule them somewhere else. This can lead to inaccurate simulation results if you’re operating with an autoscaler like Karpenter, so I wanted to add a feature to SimKube to reschedule bare pods that were interrupted before they could complete. Fortunately, Kubernetes makes this easy; the two steps[7] I took were to a) register a DELETE operation for the webhook, and b) update the driver code to check if the deleted pod had finished running, and recreate it if not.
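Step b) might look roughly like the sketch below. To be clear, this is a guess at the shape of the check, not the actual SimKube driver logic: the function name is hypothetical, and using the pod's status.phase as the "has it finished?" signal is my assumption for the example. The one thing that is standard Kubernetes is that Succeeded and Failed are the terminal pod phases.

```python
# Terminal pod phases in Kubernetes; anything else means the pod had not finished.
FINISHED_PHASES = {"Succeeded", "Failed"}


def should_recreate(request: dict) -> bool:
    """Decide whether a deleted pod should be recreated.

    `request` is the `request` portion of an AdmissionReview. This is a
    hypothetical helper; the real driver may use a different criterion.
    """
    if request.get("operation") != "DELETE":
        return False

    # For DELETE, the pod being removed arrives in oldObject, not object.
    old_pod = request.get("oldObject") or {}

    # Pods with an owner get replaced by that owner, so only bare pods qualify.
    owners = old_pod.get("metadata", {}).get("ownerReferences") or []
    if owners:
        return False

    phase = old_pod.get("status", {}).get("phase")
    return phase not in FINISHED_PHASES
```

With this check, a bare pod deleted while still Running would be recreated, while one that had already reached Succeeded would be left alone.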
Once I had written the code to do this, I loaded it up into my cluster to test it out; before the change, if I deleted a node with running pods on it, the node (and the pods) disappeared for good. After the change… deleting the node hung forever, and the pod never went away.
“Aha!” I said to myself. “I have seen this before; the mutating webhook is failing, and the logs for the failure are going to be in a weird place, because that’s how Kubernetes Mutating Webhooks work!” I then spent several hours debugging before confirming that this was, indeed, what was happening. I’ll spare you the details, but at the end of the journey, I found myself looking at this log line from KWOK (as an aside, I recently set up VictoriaLogs on my Kubernetes cluster, and it made this entire process so much easier: 10 out of 10, would read logs in the VM UI again).
failed to apply stage:
    failed to delete pod bare-pod-1781282509: pods "bare-pod-1781282509" is forbidden: expression
    'object.metadata.namespace.startsWith('virtual-')' resulted in error: no such key: metadata

There are two confusing bits here[8]: first, the CEL expression above is clearly coming from my webhook, and yet my webhook was never actually even getting called; and second, the error was actually coming from KWOK, which was the last place that I was expecting to see an error. So even though I predicted exactly this scenario at the beginning, I ended up being so confused that it took me a while to understand what was happening.
The short version is: if the configuration for a webhook is broken in some way, the error gets reported to whatever component caused the webhook to get invoked. In this case, Karpenter was calling the Eviction API, which in turn was causing KWOK to make a deletion request to the Kubernetes API server, which meant that KWOK was the component that saw the error. (As I mentioned above, I was expecting the delete request to come from the Kubernetes Garbage Collector, which I naively thought was in controller-manager, so that’s where I spent most of my time looking).
What was the error? Well, for DELETE actions, the body of the webhook request populates an oldObject field instead of an object field to indicate that the object used to be there and now it’s not. Which meant that there was no metadata field on object. The solution was to duplicate the webhook for DELETE requests, but change the match condition to the following:
- expression: oldObject.metadata.namespace.startsWith('virtual-')

As soon as I made that change, everything started working. So anyways, all this is just to say, if you’re writing a webhook and things are behaving in weird ways, there are two things you need to do[9]: search your logs for error messages that seem like they are coming from really confusing places, and then think really hard about how you broke your webhook configuration so as to cause that particular log line to show up in that particular place.
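If the object versus oldObject distinction still feels abstract, here is a small plain-Python stand-in for the two match conditions, run against a DELETE-shaped request. This is not CEL and doesn't try to reproduce CEL's exact semantics; the function name is hypothetical and the namespace is invented. It only mimics the failure from the log line: on DELETE there is nothing in object, so reaching for its metadata blows up, while the same lookup on oldObject works.

```python
def matches_virtual_namespace(request: dict, field: str) -> bool:
    """Rough stand-in for `<field>.metadata.namespace.startsWith('virtual-')`.

    `field` is either "object" or "oldObject". Hypothetical helper, not CEL.
    """
    obj = request.get(field)
    if obj is None:
        # Mirrors the "no such key: metadata" error from the log line above.
        raise KeyError("metadata")
    return obj["metadata"]["namespace"].startswith("virtual-")


# DELETE-shaped request: object is empty, oldObject holds the pod being deleted.
# The pod name comes from the log line; the namespace is made up for this sketch.
delete_request = {
    "operation": "DELETE",
    "object": None,
    "oldObject": {
        "metadata": {"name": "bare-pod-1781282509", "namespace": "virtual-default"}
    },
}

# The oldObject version of the condition evaluates cleanly.
delete_matches = matches_virtual_namespace(delete_request, "oldObject")

# The object version fails before the webhook would ever be called.
try:
    matches_virtual_namespace(delete_request, "object")
    object_lookup_failed = False
except KeyError:
    object_lookup_failed = True
```

Since the webhook's failure policy is Fail, an erroring match condition means the request is rejected outright, which is why the delete hung instead of quietly skipping the webhook.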
As always, thanks for reading!
~drmorr
[1] Actually there are more than two parts, but these are the key ones.
[2] Actually there are more than two ways, but these are the key ones.
[3] Actually there are more than two types, but these are the key ones.
[4] Actually there are more than two methods, but these are the key ones.
[5] Actually there are more than two key things, but these are the key ones.
[6] Actually there are more than two weird things, but these are the key ones.
[7] Actually there are more than two steps, but these are the key ones.
[8] Actually there are more than two confusing bits, but these are the key ones.
[9] Actually there are more than two things you need to do, but these are the key ones.



