Thoughts on Scheduling and Autoscaling
The last couple of posts I’ve made have been a lot more technical, so in this post I wanted to take a step back and put down some high-level thoughts on what I’m trying to do at Applied Computing, and why. This post might be a bit more stream-of-consciousness and open-ended than some of my other content, because I have a lot of questions and not a lot of answers right now.
What are we even doing here, anyways?
In our intro post, I laid out 3 rough areas of research that I want to start looking into: 1) simulation in distributed systems, 2) scheduling and autoscaling for containerized applications, and 3) the future of cloud computing platforms and technologies. That third topic is vague and ill-defined, so I’m going to leave it alone for now. The “simulation” topic is pretty well-scoped (in my head, at least), and I’ll have some more posts coming up on that soon. So in this post, I really want to dig into that second point: scheduling and autoscaling in containerized applications[1].
This area is where I’ve spent the last 6-7 years, so I have a pretty good handle on how scheduling works in Kubernetes. I’ve done even more with autoscaling, particularly cluster-level autoscaling. I actually wrote an autoscaler at Yelp, and while I was at Airbnb I made several contributions to (and became an approver for) the Kubernetes Cluster Autoscaler. I know where the current pain points are for scheduling and autoscaling, so the main question I’m grappling with right now is “How can we improve things?”
Let’s start with autoscaling
Ok, so what do we actually mean by autoscaling? There are really three levels of “autoscaling” that people talk about, and while they’re fairly closely related, it can get confusing if you don’t clarify which level you’re talking about. So let’s break it down: the first level is cluster autoscaling, which deals with the number of physical[2] nodes in your compute cluster; the second level is horizontal autoscaling, which deals with the number of running replicas (or instances, or copies) of your application; and the third level is vertical autoscaling, which deals with the amount of resources (CPUs, memory, etc.) that your application requests.
Hopefully it’s fairly clear how these three levels interact: if you need to run more copies of your application, you’re going to need more physical nodes to run those copies on. If your application needs to request more resources, then fewer of them can fit on a box, and so you’ll need more nodes to run the same number of copies. However, you can already see where it gets complicated, because if your application uses more CPU cores (for example), then maybe you don’t need to run as many copies because each replica can do more.
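To make that interplay concrete, here’s a tiny back-of-the-envelope sketch (all of the numbers are made up for illustration): it computes how many nodes you’d need for a given replica count and per-replica CPU request, and shows how changing the size of your replicas can change the node count even when the total amount of requested CPU stays the same.

```python
import math

def nodes_needed(replicas: int, cpu_per_replica: float, cpu_per_node: float) -> int:
    """A rough lower bound on node count: how many replicas fit on one node,
    ignoring daemonsets, system overhead, and fragmentation."""
    replicas_per_node = math.floor(cpu_per_node / cpu_per_replica)
    return math.ceil(replicas / replicas_per_node)

# Scenario A: 20 replicas requesting 2 CPUs each, on 16-CPU nodes -> 3 nodes
print(nodes_needed(20, 2, 16))

# Scenario B: the same 40 CPUs as 10 replicas requesting 4 CPUs each -> still 3 nodes
print(nodes_needed(10, 4, 16))

# Scenario C: scale vertically past a packing boundary (10 x 6 CPUs) -> now 5 nodes
print(nodes_needed(10, 6, 16))
```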
Where it starts to get weird is at the cluster level. Kubernetes (and Mesos, and Spark, and basically every other orchestrator in existence) has an API for horizontal and vertical autoscaling. Kubernetes also (through the Cluster Autoscaler) provides an API for cluster-level scaling. But what does that actually mean? If you’re running your own datacenter, the concept is kind of meaningless. It’s not like there’s an API you can call to go out and rack new hardware, and if you scale down, those nodes are still sitting there; they’re just not necessarily part of your cluster anymore.
And if you’re running in the cloud, the same principle applies. After all, there is no cloud; it’s just someone else’s computer. It’s not like AWS has a magic API that will make them rack new hardware, either[3]. So really, cluster-level autoscaling is “just” a cost control measure: how many of these (already-existing, already-racked) servers are you willing to pay for at this point in time?
I think this notion has met with some success in the industry, and lots of companies that have fluctuating traffic patterns have been able to save a lot of money using autoscaling, but my prediction is that the era of easy savings via autoscaling is coming to an end. Why?
Think about this from the cloud provider’s perspective. Maintaining hardware is expensive, and hardware that isn’t being used is costing you a lot of money and is not making you any money. So cloud providers are incentivized to maintain the minimal amount of hardware necessary to support their customers’ demand. And given that a lot of tech companies are based in the US and have a lot of US traffic, usage patterns across the industry are (at a first level of approximation) going to be the same: lots of traffic in the day, no traffic at night.
In essence, tech companies have successfully managed to avoid provisioning at peak capacity all the time, but cloud providers still have to. And, I have no inside information here, but my guess is that they don’t like that state of affairs very much. I think they’re going to be working very hard to shift the balance back towards their customers again, which means that autoscaling at a cluster level is no longer going to be an effective strategy: because if you don’t tell the cloud providers what you want at peak (and commit to paying for it), you’re not going to physically be able to get the hardware you need at peak[4].
What does this mean for us? Well, for one thing it probably means that investing a ton of time into making even more advanced autoscaling engines is likely a waste of effort. The future of running Kubernetes at scale in the cloud is going to be about eking as much performance as you can out of a relatively static number of machines. Which brings us to the next topic: scheduling.
Yea, OK, scheduling. What about it?
Let’s start by clarifying our definition of scheduling. A lot of times when we discuss scheduling in Kubernetes, we’re specifically talking about kube-scheduler: the thing that watches for unscheduled pods and tries to find a place to put them. For the purposes of this blog post, however, I want to expand our definition. Scheduling in a distributed system like Kubernetes is the process of deciding 1) how many replicas we should run, 2) how many resources each replica requires, and 3) what servers should run them on.
In Kubernetes today, this definition of scheduling is accomplished by three different components. The Horizontal Pod Autoscaler (HPA) determines how many replicas of a pod we should run, the Vertical Pod Autoscaler (VPA) determines how many resources each replica should use, and kube-scheduler is responsible for actually placing those pods. The problem that I think a lot of companies are running into here is that there are too many layers to actually do scheduling efficiently. And if the future of running Kubernetes cost-efficiently is “doing scheduling better”, then we’re gonna have to get rid of some of the layers.
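As a point of reference for how simple each individual layer is, the HPA’s core decision boils down to a single documented formula: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). Here’s a minimal sketch of that calculation, leaving out the real controller’s tolerances, stabilization windows, and min/max bounds:

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_utilization: float,
                         target_utilization: float) -> int:
    """Core HPA scaling rule: scale the replica count by the ratio of
    observed utilization to target utilization, rounding up."""
    return math.ceil(current_replicas * current_utilization / target_utilization)

# 5 replicas averaging 90% CPU utilization against a 50% target -> 9 replicas
print(hpa_desired_replicas(5, 0.90, 0.50))
```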
What do I mean by that? Well, here’s an example: in their default configurations, the HPA and VPA actually cannot be used together. The HPA is trying to maintain a constant CPU utilization across all pods, while the VPA is trying to reduce the amount of “wasted” or “unused” CPU in each pod, so the two end up in a vicious cycle until suddenly your application crashes because each individual pod no longer has enough CPU to run[5]. Wouldn’t it be great if there were a single component that understood enough about the pod’s environment to make a decision about whether vertical or horizontal scaling is going to be more cost-effective? Unfortunately, no such component exists.
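To see why the two fight in their default configurations, here’s a deliberately naive toy model (this is not how either controller is actually implemented; the update rules and numbers are made up purely to illustrate the feedback loop): the “VPA” shrinks each pod’s request down to its observed usage, the “HPA” then sees utilization at 100% against a 50% target and scales out, per-pod usage drops, and the cycle repeats.

```python
import math

TOTAL_CPU_LOAD = 10.0      # total cores of work the service has to do
TARGET_UTILIZATION = 0.5   # the HPA's target ratio of usage to request

replicas, request = 5, 4.0  # start from a reasonable-looking configuration

for step in range(5):
    usage_per_pod = TOTAL_CPU_LOAD / replicas
    request = usage_per_pod                # naive "VPA": request = observed usage
    utilization = usage_per_pod / request  # ...which pushes utilization to 100%
    replicas = math.ceil(replicas * utilization / TARGET_UTILIZATION)  # naive "HPA"
    print(f"step {step}: replicas={replicas}, request={request:.2f} cores")
```

Each iteration the replica count doubles while the per-pod request halves, until eventually the request is too small for the application to actually run.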
The question is, why not? It’s certainly not for lack of trying. There are dozens (if not hundreds or thousands) of published, peer-reviewed research papers out there trying to “solve Kubernetes scheduling”. There are also a few prominent open-source projects that have actually made significant headway on writing a better scheduler for Kubernetes[6].
I think there are probably two main reasons why none of these (alternative) schedulers have taken off: 1) a scheduling problem of this complexity is really freaking hard, and 2) for most companies right now, doing something basic is good enough[7]. But, my other prediction in this post is that (2) is going to change. Companies are going to want to become a lot leaner in the next few years, and given that cloud computing is one of the largest line-items on a company’s budget, this is a natural target for reducing cost.
Given that, the question is, “What do we need to be building in order to facilitate better scheduling?” Let’s first review our requirements. We need a scheduler that:
- does better than the existing VPA/HPA/kube-scheduler combo
- is easy to run and operate for companies that aren’t Google
- is small, responsive, and light-weight enough that it doesn’t negate all the cost savings it supposedly is producing
Note that the second bullet point, I think, eliminates two common scheduling approaches: a) using machine learning, and b) using a mixed-integer programming (MIP) solver. ML-based scheduling models dominate the literature right now, because it’s easy to imagine that tossing a dash of ML on everything will be the magic sauce that solves all your problems, but I think there is plenty of empirical evidence that that is not the case. MIP solvers, on the other hand, are slow and extremely expensive to run. And, to top it all off, neither ML models nor MIP solvers are easy for non-specialists to understand.
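For a sense of why the MIP route gets expensive, here’s roughly what the simplest version of just the placement piece looks like once you write it down (a plain single-resource bin-packing formulation; any real scheduler would need far more constraints than this):

```latex
\begin{aligned}
\min \quad & \sum_{j \in \text{nodes}} y_j
  && \text{(number of nodes turned on)} \\
\text{s.t.} \quad & \sum_{j} x_{ij} = 1
  && \forall i \in \text{pods} \quad \text{(every pod is placed exactly once)} \\
& \sum_{i} r_i \, x_{ij} \le C_j \, y_j
  && \forall j \in \text{nodes} \quad \text{(requests fit within node capacity)} \\
& x_{ij}, y_j \in \{0, 1\}
\end{aligned}
```

Even this stripped-down version is NP-hard, and it says nothing yet about replica counts, affinities, zones, or churn; every additional requirement means more integer variables and constraints for the solver to branch on.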
…
Yeesh. What have we gotten ourselves into???
Well, they don’t call it “research” for nothing. It’s certainly not an easy problem, but I think there are a few potential “active” directions to explore, which is what I plan to be doing over the next several months (or years, if I’m lucky).
Firstly, I think there’s some low-hanging fruit around reducing some of the abstraction layers between the different Kubernetes components. I have a hypothesis that “just” making kube-scheduler, HPA, and VPA aware of each other and able to communicate with each other will be a fairly big improvement over the status quo.
Secondly, I think there are some other solution methodologies we could explore aside from just ML and MIP that may prove more fruitful. There are some folks I know who’ve done some work on a more “economics/game-theoretic” approach to this problem which I’d like to explore further. I’d also like to look more into network-flow-style approaches and see if there’s anything interesting there (see the sketch below). And maybe there’s some work in constraint programming (a related, but not identical, field to MIP solving) that could help.
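To make the network-flow idea slightly more concrete, here’s a toy sketch in the spirit of the Quincy/Firmament line of work (heavily simplified, with made-up costs and capacities): each pod supplies one unit of flow, edges to machines carry placement costs, machine capacities bound how many pods land on each one, and a min-cost flow gives you an assignment.

```python
import networkx as nx

pods = ["pod-a", "pod-b", "pod-c"]
machines = {"node-1": 2, "node-2": 2}  # machine -> free pod slots (made up)

# Made-up placement costs (in practice these might encode spreading,
# image locality, fragmentation, etc.)
cost = {
    ("pod-a", "node-1"): 1, ("pod-a", "node-2"): 3,
    ("pod-b", "node-1"): 2, ("pod-b", "node-2"): 1,
    ("pod-c", "node-1"): 5, ("pod-c", "node-2"): 1,
}

G = nx.DiGraph()
for pod in pods:
    G.add_node(pod, demand=-1)            # each pod supplies one unit of flow
G.add_node("sink", demand=len(pods))      # all pods must be placed
for machine, slots in machines.items():
    G.add_edge(machine, "sink", capacity=slots, weight=0)
for (pod, machine), c in cost.items():
    G.add_edge(pod, machine, capacity=1, weight=c)

flow = nx.min_cost_flow(G)
for pod in pods:
    placement = [m for m, units in flow[pod].items() if units > 0]
    print(f"{pod} -> {placement[0]}")
```

Part of the appeal of this framing is that min-cost flow solvers are fast and can be run incrementally, which is a big part of what made Firmament interesting; the hard part is encoding all of Kubernetes’ scheduling constraints as edge costs and capacities.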
And lastly, maybe the answer is “stop doing Kubernetes.” I don’t have the faintest clue what we would replace it with, but I do wonder if we’ve (collectively, as an industry) painted ourselves into a corner with our current operating models. Maybe there’s a better way to do, well, everything.
Anyways, that’s all I have for now. Lots of questions, not a lot of answers, but hopefully you’re enjoying the ride!
Thanks for reading,
~drmorr
[1] You know what, we all know I’m talking about Kubernetes, so I’ll stop pretending now.
[2] Yes, yes, in a cloud environment you’re not actually getting a physical node in most cases, you’re getting a virtualized environment that looks and acts like a physical node. I’m going to ignore that distinction in this post.
[3] Well, ok, that’s not 100% accurate. The API is money. If you promise to give them enough money, they will rack new hardware for you.
[4] There are some indications that this shift is already happening: see, for example, a recent article about how AWS Spot Market prices have increased dramatically this year, or the work that AWS has been doing around Capacity Reservations. Both hint that AWS is probably not racking as much hardware as it used to.
[5] The main effort to fix this is the Multidimensional Pod Autoscaler, which lets you do different types of autoscaling on different pod resources, but this is going about the problem all wrong. That’s a subject for another post, though.
[6] Poseidon/Firmament and Volcano are two relatively well-known such projects, but Poseidon has since been retired, and Volcano is specifically targeting batch/ML workloads, not general-purpose scheduling.
[7] Actually, there’s a third reason: “tying changes to your scheduler to your bottom line” is also a harder problem than people think it is.