Applied Computing Research Labs

Stop Reading Changelogs: Safer Kubernetes upgrades with simulation

SRECon 2026 Talk Transcript

drmorr
Mar 27, 2026

Last week I had the privilege of giving a talk at SRECon Americas in Seattle. The recording will be available on the USENIX YouTube channel, but for this post I thought I’d do something a bit different and write an (approximate) transcript for those of you who don’t want to wait or who prefer reading to watching videos. The talk is, of course, about SimKube; I hope you enjoy it! I’ll update this post with a link to the recording once it’s posted.


Thanks all for coming; the title of my talk today is “Stop Reading Changelogs: Safer Kubernetes Upgrades with Simulation”.

On Pi Day three years ago, March 14, 2023, Reddit suffered a 100-pi-minute outage: 314 minutes. The extremely detailed and well-written postmortem revealed that one of the primary causes was a Kubernetes upgrade from 1.23 to 1.24. Specifically, in this upgrade Kubernetes changed one of the labels applied to its control plane nodes, and Calico, the container networking component Reddit used at the time, had a label selector that still pointed to the old label instead of the new one. Once the old label was gone, that selector no longer matched any nodes, and Reddit went hard down for around 5 hours.
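To make the failure mode concrete, here is a minimal sketch of how an equality-based label selector can silently stop matching after an upgrade. It is a toy model in plain Python, not Calico's or Kubernetes' actual matching code. The node names and the `select_nodes` helper are hypothetical, and the two label keys are my assumption based on the upstream control plane label rename rather than something quoted from the postmortem.

```python
# Toy model of label selection; not real Kubernetes or Calico code.
# Label keys are assumed from the upstream control plane label rename.
OLD_LABEL = "node-role.kubernetes.io/master"
NEW_LABEL = "node-role.kubernetes.io/control-plane"


def select_nodes(nodes, selector):
    """Return the names of nodes whose labels include every selector pair."""
    return [
        name
        for name, labels in nodes.items()
        if all(labels.get(key) == value for key, value in selector.items())
    ]


# Hypothetical cluster before the upgrade: the control plane node
# carries both the old and the new label.
nodes_before = {
    "cp-1": {OLD_LABEL: "", NEW_LABEL: ""},
    "worker-1": {},
}

# Hypothetical cluster after the upgrade: the old label is gone.
nodes_after = {
    "cp-1": {NEW_LABEL: ""},
    "worker-1": {},
}

stale_selector = {OLD_LABEL: ""}  # what a not-yet-updated config selects on
fixed_selector = {NEW_LABEL: ""}  # what it needs to select on afterwards

print(select_nodes(nodes_before, stale_selector))  # ['cp-1']
print(select_nodes(nodes_after, stale_selector))   # []
print(select_nodes(nodes_after, fixed_selector))   # ['cp-1']
```

Nothing in this sketch raises an error when the stale selector matches zero nodes, which is what makes this kind of change so easy to miss: the configuration is still valid, it just no longer selects anything.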

© 2026 David Morrison