Stop Reading Changelogs: Safer Kubernetes Upgrades with Simulation
SRECon 2026 Talk Transcript
Quick note for my subscribers: In an effort to get a few more paid subscriptions to this blog, I’m running an experiment where I keep the “paid subscribers only” period for a week, instead of just “over the weekend”. I will still publish all of my posts publicly, but if you want to read them earlier, you’ll need to be a paid subscriber. As a reminder: paid subscriptions to the blog don’t pay my bills, but they help get me and Ian to conferences, which is where we meet the people who do pay our bills. So I very much appreciate everyone who is already supporting us, and look forward to getting even more!
Last week I had the privilege of giving a talk at SRECon Americas in Seattle; the recording of the talk will be published in a few weeks, but I thought for this post I’d do something a bit different and write an (approximate) transcript of the talk for those of you who don’t want to wait and/or prefer reading things to watching videos. The talk is, of course, about SimKube: I hope you enjoy! I’ll update this with a link to the talk once it’s posted.
Thanks all for coming; the title of my talk today is “Stop Reading Changelogs: Safer Kubernetes Upgrades with Simulation”.
On Pi Day three years ago, March 14, 2023, Reddit suffered a 100-pi-minute outage: 314 minutes. The extremely detailed and well-written postmortem revealed that one of the primary causes of this outage was a Kubernetes version upgrade from 1.23 to 1.24. Specifically, in this upgrade, Kubernetes changed one of the labels that was applied to its control plane nodes, and Calico, the container networking component used by Reddit at the time, was using a label selector that pointed to the old label for the node instead of the new one. This caused Reddit to go hard down for around 5 hours.
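For readers who prefer code to prose, here is a minimal sketch of that failure mode. It is a toy model, not Calico's or Kubernetes' actual matching logic: `select_nodes` is a hypothetical helper, the node names are made up, and the two label keys are my assumption about which labels were involved, so treat them as illustrative.

```python
# Toy model of a label selector that keeps pointing at a removed label.
# The label keys are assumed for illustration; node names are invented.
OLD_LABEL = "node-role.kubernetes.io/master"
NEW_LABEL = "node-role.kubernetes.io/control-plane"


def select_nodes(nodes, required_label):
    """Return the names of nodes that carry required_label (existence match)."""
    return [name for name, labels in nodes.items() if required_label in labels]


# Before the upgrade: the control plane node carries both labels.
nodes_before = {
    "cp-1": {OLD_LABEL: "", NEW_LABEL: ""},
    "worker-1": {},
}

# After the upgrade: the old label is gone, only the new one remains.
nodes_after = {
    "cp-1": {NEW_LABEL: ""},
    "worker-1": {},
}

matched_before = select_nodes(nodes_before, OLD_LABEL)
matched_after = select_nodes(nodes_after, OLD_LABEL)
matched_fixed = select_nodes(nodes_after, NEW_LABEL)

print("old selector, before upgrade:", matched_before)
print("old selector, after upgrade: ", matched_after)
print("new selector, after upgrade: ", matched_fixed)
```

The point of the sketch is that nothing errors out: the stale selector simply matches zero nodes after the upgrade, which is the kind of silent change that is easy to miss when skimming a changelog.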




