Disruption, Eviction, Evacuation, Preemption, oh my!
Let's talk about pod disruption in Kubernetes[1]. Suppose you're running a bunch of happy little pods in your cluster, and then you discover that one of your nodes needs to be replaced because of a hardware issue. What do you do? Well, if you're a Kubernetes cluster, you just terminate the heck out of that sucker, running workloads be damned. State? Nobody uses state around here, what are you even talking about.
Introducing Pod Disruption Budgets
Ok, let's back up for a minute because I probably just confused half of you and made the other half of you angry. So what problem are we trying to solve here? It goes like this: you have a large application composed of a bunch of different microservices running on a big cluster. In this environment, things are constantly changing: traffic patterns change, so you need to spin more pods up over here, and tear some pods down over there. Nodes fail, so you gotta move all the stuff running on that node somewhere else. Applications get upgraded, so you have to deploy the new version and tear down the old version. And, ideally, you'd like all this to happen without your users noticing.
So what can you do? Well, if your application pods are truly interruptible, meaning that it doesn't matter where they run, they don't have to do any cleanup work, and new pods start up approximately instantaneously, all you really have to do is make sure that there are enough of them to handle demand at any given point in time. Enter the Pod Disruption Budget (PDB). This is a Kubernetes resource that allows you to specify[2] how many pods must be running at any given point in time. A PDB spec looks something like this:
    maxUnavailable: 1
    selector:
      matchLabels:
        app: my-disruptable-app
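To make the budget concrete, here is a minimal Python sketch of the arithmetic a PDB implies. It is a simplified model, not the disruption controller's actual code: the function name `disruptions_allowed` and its parameters are made up for illustration, it handles only integer values (no percentages), and it assumes the expected pod count is already known.

```python
from typing import Optional


def disruptions_allowed(
    expected_pods: int,
    healthy_pods: int,
    max_unavailable: Optional[int] = None,
    min_available: Optional[int] = None,
) -> int:
    """Toy model: how many more pods may be voluntarily disrupted right now."""
    if (max_unavailable is None) == (min_available is None):
        raise ValueError("set exactly one of max_unavailable or min_available")
    if max_unavailable is not None:
        # The budget is relative to the expected pod count.
        desired_healthy = expected_pods - max_unavailable
    else:
        # The budget is an absolute floor, independent of the expected count.
        desired_healthy = min_available
    # Never report a negative budget, even if too few pods are healthy.
    return max(0, healthy_pods - desired_healthy)


if __name__ == "__main__":
    print(disruptions_allowed(10, 10, max_unavailable=1))
    print(disruptions_allowed(10, 9, max_unavailable=1))
    print(disruptions_allowed(10, 10, min_available=8))
```

With 10 expected pods and `max_unavailable=1`, one disruption is allowed while all 10 are healthy, and none once a pod is already down.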
Now, to terminate a pod, you make an eviction request; basically, you create another (sub)resource, this time on the pod that you want to terminate, and the eviction only goes through if the PDB allows it, so pods get terminated gradually instead of all at once. This sounds great! This will solve our problem! Right? … right, guys?
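For concreteness, here is a sketch of what an eviction request carries. The path and body follow the `policy/v1` Eviction subresource as I understand it; the helper `build_eviction_request` and the pod name are hypothetical, and the snippet only builds the request rather than sending it to a cluster.

```python
import json


def build_eviction_request(pod_name: str, namespace: str = "default"):
    """Return the (path, body) for a POST to a pod's eviction subresource."""
    path = f"/api/v1/namespaces/{namespace}/pods/{pod_name}/eviction"
    body = {
        "apiVersion": "policy/v1",
        "kind": "Eviction",
        "metadata": {"name": pod_name, "namespace": namespace},
    }
    return path, body


if __name__ == "__main__":
    # The pod name below is a made-up example.
    path, body = build_eviction_request("my-disruptable-app-abc12")
    print("POST", path)
    print(json.dumps(body, indent=2))
```

If the PDB has no budget left, the API server rejects the request with a 429 Too Many Requests, and the caller is expected to retry later.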
Apropos of nothing at all, let's take a look at the strategy field in the Deployment spec:
    strategy:
      rollingUpdate:
        maxUnavailable: 1
        maxSurge: 5
What this says is that when a deployment needs to be updated, it will run at most 5 pods above the desired replica count, and let at most 1 pod be unavailable, at any point during the rollout. The maxUnavailable field seems familiar, right? You might think that this is somehow connected to or coordinated with the field by the same name in the PDB, but you'd be wrong. They're completely unrelated! What this means is twofold: a) users can set the values of PDB.maxUnavailable and Deployment.maxUnavailable to different things[3], and b) because the deployment controller is blissfully unaware of the existence (or not) of a PDB, we've now opened up the possibility for race conditions that violate your application's availability requirements! I was pretty annoyed when I discovered this.
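Here is a toy Python model of that race. It is only a sketch: the three functions are hypothetical stand-ins for the deployment controller and an eviction-based node drain, and it assumes both decisions are made from the same stale observation of the cluster. Real controllers are far more involved; the point is only that each actor checks its own limit, so the two budgets can stack.

```python
def deployment_may_kill(observed_healthy: int, replicas: int,
                        deploy_max_unavailable: int) -> bool:
    """Rollout-side check: looks only at the Deployment's own limit."""
    return replicas - observed_healthy < deploy_max_unavailable


def pdb_allows_eviction(observed_healthy: int, replicas: int,
                        pdb_max_unavailable: int) -> bool:
    """Eviction-side check: looks only at the PDB's limit."""
    return observed_healthy - (replicas - pdb_max_unavailable) > 0


def simulate_race(replicas: int = 10, deploy_max_unavailable: int = 1,
                  pdb_max_unavailable: int = 1) -> int:
    """Return how many pods end up unavailable after one racy step."""
    healthy = replicas
    snapshot = healthy  # both actors observe the cluster at the same moment
    if deployment_may_kill(snapshot, replicas, deploy_max_unavailable):
        healthy -= 1  # the rollout terminates an old pod
    if pdb_allows_eviction(snapshot, replicas, pdb_max_unavailable):
        healthy -= 1  # the drain evicts another pod
    return replicas - healthy


if __name__ == "__main__":
    print("unavailable pods:", simulate_race())
```

In this model both limits are set to 1, yet two pods end up unavailable, because neither check knows about the other.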
But, OK, this problem is solvable. We just need to migrate all of the existing Kubernetes controllers to use the eviction API (and, ideally, make it impossible for stuff to just straight-up delete pods except in extraordinary circumstances). It would be a huge amount of effort at this point, but it is doable[4]. But even if we fix all of the consistency issues with PDBs, there are still more problems.
The Evacuation API KEP
I was (re)introduced to this whole topic by a fun argument happening on the sig-architecture mailing list[5] about whether or not Kubernetes can and/or should become more than a container orchestration platform, and start caring about the underlying infrastructure/hardware/VMs it's running on. This question is intimately connected to the "node lifecycle" problem[6], which is itself intimately connected to the pod disruption problem. Someone on the mailing list mentioned a new proposal called the "Evacuation API", which I hadn't heard of before, so I went to dig it up.
But let's back up for a second again: what's the problem we're trying to solve here? See, not every pod fits the "easily-disruptable" model that we described above. Applications (generally) have state, even if that state is just an in-memory cache. If those pods get disrupted, it takes time for the replacements to rebuild their cache and be "ready to run". Some pods (like datastores) have external state that needs to be managed somehow—at the very least, you don't want to terminate a pod in the middle of a supposedly-atomic transaction which might leave your database in a corrupted state. And none of these use cases can be handled by the existing PDB mechanism. In fact, because you can theoretically run any application you can imagine on top of Kubernetes, you can theoretically have an infinite number of ways in which disruption could fail. And really the only way to solve this problem is to open a communication channel between the platform and the application.
It looks something like this: the platform says, "Hey, buddy, we need to move you somewhere else, is that cool?" and then the application says, "Yeah, go ahead," or "Give me 30 seconds to finish up what I'm doing," or "Nah, I'm good, thanks though." And then the platform can decide whether to honor the application's response or be a total jerk and just do it anyway. This is, in a nutshell, what the proposed Evacuation API is trying to implement.
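The conversation above could be sketched like this in Python. This is not the KEP's API: the `Answer` enum, the `Reply` type, the `decide` function, and the `force` flag are all invented here purely to illustrate the shape of the protocol.

```python
from dataclasses import dataclass
from enum import Enum


class Answer(Enum):
    GO_AHEAD = "go ahead"
    NEED_TIME = "give me a moment"
    NO_THANKS = "nah, I'm good"


@dataclass
class Reply:
    answer: Answer
    delay_seconds: int = 0  # only meaningful together with NEED_TIME


def decide(reply: Reply, force: bool = False) -> str:
    """Platform side: honor the application's reply, or override it."""
    if force:
        return "terminate now"  # the "total jerk" path
    if reply.answer is Answer.GO_AHEAD:
        return "terminate now"
    if reply.answer is Answer.NEED_TIME:
        return f"terminate in {reply.delay_seconds}s"
    return "leave it alone"


if __name__ == "__main__":
    print(decide(Reply(Answer.GO_AHEAD)))
    print(decide(Reply(Answer.NEED_TIME, delay_seconds=30)))
    print(decide(Reply(Answer.NO_THANKS)))
    print(decide(Reply(Answer.NO_THANKS), force=True))
```

The interesting design question is the `force` path: some caller always needs the power to override the application, and the protocol only helps if everything else goes through `decide` first.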
This is great, actually! People have been implementing versions of this procedure in their internal platforms for years, and there's a whole host of problems with doing it that way. The most important is that, even more so than with PDBs, it's impossible for Kubernetes to know about your custom artisanal pod disruption controller, which means that there are dozens or maybe even hundreds of ways for controllers to ignore the whole communication protocol I described above and just terminate the pod. You'd have to patch or maintain a fork of all the things to really make that solution robust. Having something that solves this problem in a Kubernetes-native way would be ideal.
And yet, there's a big part of me that wonders whether this ship has already sailed[7], and we're just beyond the point where we can sanely implement any sort of general-purpose disruption mechanism in the platform. If you read the KEP, the first thing you'll notice is that this introduces another controller and a new API resource[8]. We (or at least I) am deeply familiar with the problems that can crop up when you have dozens of non-communicating controllers all trying to control pod lifecycles, and adding yet another one seems like it's gonna be rough.
Also, if you read the comments on the KEP, there's a bunch of concerns about naming. This KEP is proposing a new concept it calls "Evacuation", which can fall back to the "Eviction API", which uses "Disruption Budgets", none of which plays nicely with "Pre-emption"… And yeah, there's definitely a naming concern here, but I think the naming debate is masking a deeper problem, namely the XKCD standards problem: you know, where the guy says "I'm going to build a new standard to unify the 14 competing standards" and the end result is 15 competing standards. Right now, there are somewhere between 2 and 4 competing standards for "deleting pods in Kubernetes", and the Evacuation API will add another one. You have to support all of them for a long time, because even if the Evacuation API is perfect and solves all the other problems, customers aren't going to use it, because they have 27 other migration projects in flight and PDBs work fine enough until they don't and cause a major incident.
And then, even if you get all the other stuff perfect, you still have the problem that Kubernetes itself wasn't designed with anything like the Evacuation API in mind. Are we going to update the deployment controller so that it understands evacuation, when we haven't even updated it to understand PDBs? Sure, we could, but again—it's a huge amount of effort at this point and somebody's gotta step up and do the work.
Disrupting the Disruption Industry
So where can we realistically go from here? Maybe this is a disappointing end to the blog post, but I don't actually have a good answer. The problems we need to solve to do disruption "right" in Kubernetes aren't insurmountable, but they are large and legion. I really give the author of the Evacuation API mad props for trying to tackle it. The current state of affairs with pod disruption has been a pet peeve of mine for a while, and I've dreamed of trying to fix it, but the right opportunity hasn't come along. It's such a hard problem because it really requires re-thinking some incredibly fundamental bits of Kubernetes, things that are so entrenched in people's mental models of how the system works that changing them requires not only a tremendous technical effort, but also a tremendous organizational effort.
So, with apologies for just leaving you all hanging, I'm going to close here—but maybe I've at least given you all something to think about.
Thanks for reading,
~drmorr
[1] Yeah, this is gonna be a rant, buckle up.
[2] In a variety of different, conflicting and incompatible ways: the semantics of maxUnavailable and minAvailable are subtly different.
[3] Why you would want to do this I'm not sure, but it's possible.
[4] And actually, folks are making some efforts in this direction. See, for example, this KEP, which tries to improve kube-scheduler so that it always respects PDBs during pod preemption.
[5] Well, actually, it started on the sig-architecture mailing list, and then all the other sig lists got cc'ed, and now I get two or three copies of every message sent. But seriously, it's a good read, go check it out.
[6] See here for an example of how these are connected: https://github.com/kubernetes/kubernetes/issues/125618
[7] This is a Kubernetes joke actually, geddit? Geddit?
[8] Insert an obligatory "this is getting out of hand, now there are two of them" joke here.