Live-blogging "Solving the Batch Stochastic Bin Packing Problem in Cloud: A Chance-constrained Optimization Approach"
I didn’t get the next post in my SimKube series written up in time to publish this week, so instead I’m gonna phone it in by copy-pasting some more content from Mastodon. Earlier today I live-blogged another paper, this time by some folks from Microsoft Research1, called Solving the Batch Stochastic Bin Packing Problem in Cloud: A Chance-constrained Optimization Approach. Here are my thoughts:
So we're scheduling #containers to machines. It's a Stochastic Bin Packing Problem (SBPP). But this problem isn't well-defined when some of the machines are already in use (SBPP assumes all the machines are empty at the start).
They define a new objective function called "Used Capacity at Confidence" (UCaC) which is stable across empty and non-empty machines. UCaC is straightforward to compute when resource utilization is approximately Gaussian.
We're looking at the case when machines can be oversubscribed, and we want to keep the probability of resource violations low.
"Number of used machines" is not a great metric when some machines are already running things.
Or... so the paper claims? I don't know that I totally buy this argument yet, but I'm still on the first page, let's keep reading.
They consider CPU utilization as the objective function, and other resources (memory, storage) as constraints, since cloud providers charge by CPU.
In the next sentence, they say they're focusing down on just CPU and removing the other resource constraints, but state that they can easily be added back in.
Hmmm. Not sure about this. I get that you have to make simplifying assumptions for these problems or they become too hard to solve, but are these actually good simplifying assumptions?
The randomness comes in for a pod's resource utilization, which is modeled with a random variable. They assume that all pods for an application ("Service") have identical distributions for their resource utilization, which I think is a fair assumption to make.
We want the probability that the total resource usage of the pods on a machine stays below the machine's capacity to be greater than some parameter alpha. Here alpha is the business SLA.
Attention everyone, we have a MIP2! Repeat, we have a MIP! 😂
We simplify the MIP by transforming the chance constraints into deterministic ones, and reduce the number of decision variables by treating all pods of an application alike.
They assume that resource utilization for these services is approximately Gaussian (technically Sub-Gaussian, though I don't fully know what that means), and analyze real-world production data to verify this assumption. They also argue from a theoretical perspective that these resource utilization numbers ought to be approximately Gaussian.
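Putting those last couple of toots together: in my own notation (which almost certainly doesn't match the paper's exactly), the chance constraint and its deterministic Gaussian equivalent look roughly like this:

```latex
% n_{s,m} = number of pods of service s placed on machine m
% X_{s,j} = (random) CPU utilization of the j-th pod of service s
% Chance constraint: machine m's total usage stays under its capacity C_m
% with probability at least alpha
\Pr\Big( \sum_{s} \sum_{j=1}^{n_{s,m}} X_{s,j} \le C_m \Big) \ge \alpha

% If the X_{s,j} are independent and approximately N(\mu_s, \sigma_s^2),
% this is equivalent to the deterministic constraint
\sum_{s} \mu_s\, n_{s,m} + \Phi^{-1}(\alpha) \sqrt{\sum_{s} \sigma_s^2\, n_{s,m}} \le C_m
```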
Ok, now we're getting into the meat of it. The Used Capacity at Confidence metric is actually exactly what it sounds like: how much of a given machine's capacity is used up, with some probability greater than alpha. As long as everything is approximately Gaussian, this is easy to calculate.
I assume where we're going from here is, subtract that UCaC value from the total capacity of a machine, and that's how much you have left over to pack new things into.
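Back-of-the-envelope, that calculation might look something like this in Python. To be clear, this is my own sketch (the function names and numbers are made up), not code from the paper; it just evaluates the left-hand side of that deterministic constraint for one machine and subtracts it from the machine's capacity:

```python
from math import sqrt
from statistics import NormalDist

def ucac(pod_means, pod_variances, alpha=0.95):
    """Used Capacity at Confidence: the alpha-quantile of a machine's total
    usage, assuming each pod's utilization is an independent Gaussian."""
    total_mean = sum(pod_means)
    total_std = sqrt(sum(pod_variances))
    return total_mean + NormalDist().inv_cdf(alpha) * total_std

def headroom(capacity, pod_means, pod_variances, alpha=0.95):
    """Capacity left over to pack new pods into, at confidence alpha."""
    return capacity - ucac(pod_means, pod_variances, alpha)

# e.g., three pods already running on a 32-core machine
print(headroom(32.0, [4.0, 6.0, 2.5], [1.0, 2.25, 0.5]))
```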
They plug UCaC into their objective function and show that if you relax the problem to be real-valued, the UCaC objective and the "number of machines" objective are the same.
Ok, this makes sense: if the decision variables can be fractional, every used machine except at most one can be packed completely full, and the result follows from there.
This doesn't hold for integer-valued variables, of course, but they show that the UCaC metric is within 8/3 of the optimal "number of machines" metric.
Or, to put it another way, almost 3x. I don't know about you, but I don't think very many finance folks would be happy paying 3x more than they need to.
(To be clear, I'm not knocking the paper authors or their results here; these types of approximation bounds tend to be not "great" from a real-world perspective.)
Time for some alg'rithms, y'all! We've got three in this paper:
1) online best-fit heuristic
2) offline heuristic
3) exact cutting stock solution

The online one is bad, as expected. The offline one is better, also as expected. The exact one is, well, exact.
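If you're curious what the online flavor might look like, here's my mental model of a UCaC-aware best-fit heuristic. This is just a sketch based on my reading, not the paper's actual algorithm, and all the names and defaults (like the 32-core machines) are mine:

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist

@dataclass
class Machine:
    capacity: float
    mean: float = 0.0      # sum of means of pods already placed here
    variance: float = 0.0  # sum of variances of pods already placed here

def ucac(mean, variance, alpha):
    return mean + NormalDist().inv_cdf(alpha) * sqrt(variance)

def best_fit(machines, pod_mean, pod_var, alpha=0.95):
    """Place one pod on the machine that leaves the least UCaC headroom,
    opening a new machine only if no existing one can take it safely."""
    best, best_room = None, None
    for m in machines:
        room = m.capacity - ucac(m.mean + pod_mean, m.variance + pod_var, alpha)
        if room < 0:
            continue  # placing the pod here would blow the SLA
        if best_room is None or room < best_room:
            best, best_room = m, room
    if best is None:
        best = Machine(capacity=32.0)
        machines.append(best)
    best.mean += pod_mean
    best.variance += pod_var
    return best

# e.g., pack ten identical pods onto a three-node cluster
cluster = [Machine(capacity=32.0) for _ in range(3)]
for _ in range(10):
    best_fit(cluster, pod_mean=4.0, pod_var=1.0)
```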
Oh! They test out their algorithms on decently-sized problems!
That's a nice change from some papers I've read.
They're running a simulated cluster with 4000 nodes in it. Each node has ~32 cores.
But... they only consider number of services between 5 and 20. I'm reasonably certain most production clusters with 4000 nodes are going to have thousands of services, not 20. "Most" services aren't going to use 200 32-core nodes.
Comparing the three algorithms in this paper to a baseline, their algorithms perform better from a "number of nodes" and "UCaC" perspective, but they do worse from a reliability perspective -- they have more violations, but they stay within SLO.
Again, this makes complete sense. If you reduce cost, you (in most cases) reduce reliability. Once you've cleared out your low-hanging fruit, you have to start making tradeoffs. What level of reliability do you _actually_ need, and how much are you willing to pay for it? Any conversation in this space that tries to decouple cost from reliability is missing the mark.
And that's pretty much it for this paper! The theoretical results seem to be sound and interesting, but I'm skeptical about being able to apply this type of approach in a real Kubernetes environment.
Their cutting stock solver takes ~1m in "normal" cases, and can balloon to be significantly longer (and remember, they're just considering CPU here! When you throw in memory, network bandwidth, storage, GPUs, etc etc etc.... good luck!)
To be honest, I think the most useful bit of this paper is some validation that resource utilization of services is approximately Gaussian. That's going to be important, say, if you were trying to write a simulator that operated on generated data.
But I don't know anybody trying to do that.
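(If such a person existed, the data-generation part might be as simple as fitting a mean and standard deviation per service from real traces and then sampling from the resulting Gaussians. Everything below, service names and numbers included, is made up by me for illustration:)

```python
import random

# Hypothetical per-service utilization stats, as if fit from real traces
SERVICE_STATS = {
    "frontend": (2.0, 0.5),   # (mean cores, stddev cores)
    "search":   (6.5, 1.5),
    "cache":    (1.2, 0.3),
}

def sample_pod_usage(service):
    """Draw one pod's CPU usage from its service's fitted Gaussian."""
    mean, std = SERVICE_STATS[service]
    return max(0.0, random.gauss(mean, std))  # usage can't go negative

trace = [(svc, round(sample_pod_usage(svc), 2))
         for svc in SERVICE_STATS for _ in range(3)]
print(trace)
```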
Interested in more of this type of totally-phoned-in long-form content? Subscribe below! I’ll be back next week with more SimKube.
Jie Yan, Yunlei Lu, Liting Chen, Si Qin, Yixin Fang, Qingwei Lin, Thomas Moscibroda, Saravan Rajmohan, and Dongmei Zhang
Mixed Integer Program, the standard way to define and talk about optimization problems in research papers.