Stop Reading Changelogs: Safer Kubernetes Upgrades with Simulation
SRECon 2026 Talk Transcript
Last week I had the privilege of giving a talk at SRECon Americas in Seattle. The recording will be available on the USENIX YouTube channel, but for this post I thought I’d do something a bit different and write an (approximate) transcript of the talk for those of you who don’t want to wait and/or prefer reading to watching videos. The talk is, of course, about SimKube; I hope you enjoy! I’ll update this post with a link to the recording once it’s posted.
Thanks all for coming; the title of my talk today is “Stop Reading Changelogs: Safer Kubernetes Upgrades with Simulation”.
On Pi Day three years ago, March 14, 2023, Reddit suffered a 100-pi-minute outage: 314 minutes. The extremely detailed and well-written postmortem revealed that one of the primary causes was a Kubernetes version upgrade from 1.23 to 1.24. Specifically, in this upgrade, Kubernetes changed one of the labels applied to its control plane nodes, and Calico, the container networking component used by Reddit at the time, was using a label selector that pointed to the old label instead of the new one. Once the old label was gone, the selector no longer matched those nodes, and Reddit went hard down for around 5 hours.
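To make that failure mode concrete, here is a minimal sketch of how a label selector can silently stop matching after a label rename. It is a toy model and not Calico's or Reddit's actual configuration: the node names and the `select_nodes` helper are hypothetical, and the two label keys are my recollection of the 1.24 change (the `master` node-role label giving way to `control-plane`), so treat them as assumptions.

```python
# Toy model of an "exists" label selector; not real Calico or Kubernetes code.
# Label keys are assumptions based on my recollection of the 1.24 change.
OLD_LABEL = "node-role.kubernetes.io/master"
NEW_LABEL = "node-role.kubernetes.io/control-plane"


def select_nodes(nodes, required_label):
    """Return the names of nodes that carry the given label key."""
    return [name for name, labels in nodes.items() if required_label in labels]


# Before the upgrade: the control plane node carries both labels.
nodes_before = {
    "cp-1": {OLD_LABEL: "", NEW_LABEL: ""},
    "worker-1": {},
}

# After the upgrade: only the new label remains on the control plane node.
nodes_after = {
    "cp-1": {NEW_LABEL: ""},
    "worker-1": {},
}

# A selector still pinned to the old label matches before, but not after.
matched_before = select_nodes(nodes_before, OLD_LABEL)
matched_after = select_nodes(nodes_after, OLD_LABEL)

print("before upgrade:", matched_before)
print("after upgrade:", matched_after)
```

Nothing in this sketch raises an error when the selector stops matching; it just returns an empty list. That is what makes this class of breakage easy to miss when reading a changelog.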




