Conway's Law and Kubernetes
I’ve been spending a lot of time in the last couple of weeks doing non-technical work[1], so in this post I’m again going to go for a less-technical topic and explore some thoughts I’ve been having around the Kubernetes project as a whole, and how it’s organized. It should be noted that I do occasionally contribute to the Kubernetes project, as well as review some PRs from time to time, but I don’t have any inside knowledge into how CNCF (the organization managing Kubernetes) works or how/why we got here. For the purposes of this blog post, I’m an interested outsider :)
Who the heck is Conway?
We should probably start with some background. The Kubernetes project is run by the Cloud Native Computing Foundation (CNCF), and (because Kubernetes has an enormous amount of code and related artifacts to maintain) different aspects of the project are owned by different groups of people, organized into Special Interest Groups (SIGs). At the time of writing, there are 23 different SIGs, ranging from sig-docs (focused on writing and improving the documentation) all the way down to sig-node (focused on some of the lowest-level details of how Kubernetes interacts with the underlying hardware)[2]. You can see a list of the current SIGs and learn more about the Kubernetes governance structure in the community repo on GitHub.
Each SIG has one or two chairs who are roughly responsible for keeping the stuff the SIG owns running. They maintain a roadmap for the SIG’s plans, communicate it out to other interested parties, present an annual report to the Kubernetes Steering Committee, and so on and so forth. Then there are many people who do the day-to-day work on the components the SIG owns: writing code, reviewing pull requests, triaging bug reports, submitting feature requests, etc. Relatively speaking, there are very few people whose full-time job is to participate in Kubernetes. Most of the people in the SIGs (including the chairs) are employed by companies that use or otherwise have a vested interest in the Kubernetes ecosystem.
So as you might expect, there are a lot of competing interests and priorities at play in the Kubernetes project. In some ways this mimics the sort of politics and organizational challenges that you might find in a large company, but even more so. There are so many conflicting goals and incentives here that it’s amazing to me that anything gets done.
Anyways, I was a passive observer in a fascinating SIG meeting a couple of weeks ago that got me thinking about organizational and business structures, and how they interact with one of my favorite topics on this blog thus far: getting rid of layers[3]. More on the meeting in a minute; first, let’s talk a bit more about Conway’s Law.
Who the heck is Conway, anyways, and why was he cool enough to get a law named after him? Melvin Conway is a computer scientist credited with inventing coroutines[4], but he also coined the following adage:
Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.[5]
Or, to put it another way, if you have a bunch of teams working to build a common piece of software, the API boundaries of the software will tend to mirror the organizational boundaries of the teams building it. To some degree this seems fairly obvious now, but it has led to some interesting organizational strategies. For example, the “Inverse Conway Maneuver” attempts, broadly speaking, to restructure your organization to eliminate unnecessary boundaries or, shall we say, layers[6], in your software.
So with that background, let’s go back to the meeting I was in.
“Should we get someone from cluster API involved in this conversation?”
At a high level, the meeting I was participating in was between sig-autoscaling and some folks at AWS who have built an alternative Kubernetes autoscaler called karpenter. The folks at AWS would like karpenter to succeed as an open-source project, and to that end they are trying to hand off ownership of karpenter to sig-autoscaling. (And just to be super clear here: I think all of this is awesome. AWS built a cool thing and wants other people to use it, so they released it to the community and are trying to relinquish control of it. This is how open-source stuff is supposed to work! Speaking from experience, it’s incredibly hard to get a company to release software as open-source, much less to hand it off to someone else to run, so kudos to AWS for doing this with karpenter.)
In order to facilitate this handover, a karpenter working group within the Kubernetes organization has been formed, and there are semi-regular working group meetings with interested parties (mostly sig-autoscaling members) to discuss the transition. And it was in one of these meetings that one of the participants made the comment, “Should we get someone from cluster API involved in this conversation?”
See, what happened is that karpenter is primarily an autoscaler, but it actually does other things as well that straddle the (somewhat artificial) API boundaries that exist in Kubernetes. In the Kubernetes organization, “provisioning a new cluster” is a separate task from “scaling an existing cluster up and down”. The former set of functionality is owned by sig-cluster-lifecycle, and the latter is owned by sig-autoscaling. But karpenter ignores this distinction, and is able to (in theory) do some stuff more efficiently as a result[7].
The trouble is, the existing Kubernetes organization isn’t configured to handle this very well. If someone comes along and builds a new component that smashes some layers together[8] in a way that spans pre-existing organizational boundaries, what do you do? Who owns it? Which GitHub repo should it belong to? Who decides which features to prioritize? What we’re witnessing right now in real time with karpenter is some of the very real impacts of Conway’s Law on Kubernetes, and it’s fascinating to watch the organization navigate this.
So what happened in the meeting? Well, everybody agreed that somebody from the cluster API side of things should probably be involved, and we all moved on. But (at least from my perspective) there was no clear agreement on who would be responsible for getting the right people from cluster API into the room, and without that agreement it seems unlikely that the right people will get into the room in a timely fashion. Conway’s Law at work.
What can we learn from all this?
The reason I’m interested in this at all (aside from just finding organizational dynamics really interesting) is that at Applied Computing, I’m really looking for opportunities to bridge or reduce some of these barriers[9] in the software that the industry is running. And, in theory, I can do this right now without getting buy-in from anybody else. I’m running a small business with one employee; I can build whatever the hell I want, and I don’t have to listen to or convince anybody else that it’s a good idea.
BUT, the moment I want to convince someone else to use or adopt anything that I build, I’m going to have to start navigating these organizational boundaries. Let’s say that I build some cool new thing for Kubernetes that smashes together scheduling, horizontal autoscaling, and vertical autoscaling. Let’s say that it’s actually good—a big assumption at this point. Let’s say that I want to hand it off to Kubernetes. Who’s going to take ownership? sig-scheduling? sig-autoscaling? How do I, a relative outsider, even get the right people into the room to start having that conversation?
As with many things I’ll probably end up posting on this blog, I don’t have answers to these questions. I’m willing to bet at least 60% of the answers boil down to “know the right people”, which is a little bit frustrating. Shouldn’t the work I do be able to stand on its own, so that if it’s good, people can start using it and organize themselves around it? Sadly, things don’t work that way. This is at least one of the reasons why I think attending and participating in conferences (or, more generally, “networking”) is so important. It may feel overwhelming to go to a big technical conference, but at least there you can be relatively confident that all the right people are in the room. So your problem shifts from “how do you get the right people into a room together?” to “what room are the right people in?” And that feels ever so slightly more tractable.
Thanks for reading,
[1] Translation: trying to convince people to give me money.
[2] For reference, I’m most active/involved in sig-autoscaling, which maintains the Cluster Autoscaler, Horizontal Pod Autoscaler, and Vertical Pod Autoscaler.
[3] I’m beginning to think we need to start a drinking game here.
[4] Sidenote: coroutines are super cool; I talked about them very briefly in my post about Kubernetes Scheduling in Rust.
[7] I’m eliding a bunch of details here because a) I don’t know them, and b) they’re not really relevant.
[9] See? I didn’t say the L-word, aren’t you proud of me??