<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Applied Computing Research Labs]]></title><description><![CDATA[Open-source research and development in distributed systems]]></description><link>https://blog.appliedcomputing.io</link><image><url>https://substackcdn.com/image/fetch/$s_!M8wv!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9abdbcc8-dc09-4e12-8fc3-348cb9c2691e_518x518.png</url><title>Applied Computing Research Labs</title><link>https://blog.appliedcomputing.io</link></image><generator>Substack</generator><lastBuildDate>Thu, 16 Apr 2026 07:41:45 GMT</lastBuildDate><atom:link href="https://blog.appliedcomputing.io/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[David Morrison]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[drmorr@appliedcomputing.io]]></webMaster><itunes:owner><itunes:email><![CDATA[drmorr@appliedcomputing.io]]></itunes:email><itunes:name><![CDATA[drmorr]]></itunes:name></itunes:owner><itunes:author><![CDATA[drmorr]]></itunes:author><googleplay:owner><![CDATA[drmorr@appliedcomputing.io]]></googleplay:owner><googleplay:email><![CDATA[drmorr@appliedcomputing.io]]></googleplay:email><googleplay:author><![CDATA[drmorr]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Boy AMI glad to see you]]></title><description><![CDATA[We&#8217;re pleased to announce that two AMIs (Amazon Machine Images) have joined the SimKube family!]]></description><link>https://blog.appliedcomputing.io/p/boy-ami-glad-to-see-you</link><guid 
isPermaLink="false">https://blog.appliedcomputing.io/p/boy-ami-glad-to-see-you</guid><dc:creator><![CDATA[Ian O’Gorman]]></dc:creator><pubDate>Fri, 03 Apr 2026 23:05:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!QTEp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757ed00-3b2b-4de0-9e0d-07892ebd7d69_1127x646.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We&#8217;re pleased to announce that two AMIs (<a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html">Amazon Machine Images</a>) have joined the <a href="https://simkube.dev/">SimKube</a> family! Yes, twins: <a href="https://aws.amazon.com/marketplace/pp/prodview-m7imofdta3tla?sr=0-1&amp;ref_=beagle&amp;applicationId=AWSMPContessa">simkube-x86-64</a> and <a href="https://aws.amazon.com/marketplace/pp/prodview-jea6uc3po665a?sr=0-2&amp;ref_=beagle&amp;applicationId=AWSMPContessa">simkube-github-runner-x86-64</a> are now available in the AWS Marketplace. Each came in at a healthy 17 GiB snapshot weight<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. They arrived about two months apart due to the famously transparent AWS Marketplace approval process.</p><p>I&#8217;ll explain what each of these AMIs is and how we build them in due course, but first off, let&#8217;s address an important question:</p><h2>Two AMIs, in THIS economy?</h2><p>We know it&#8217;s crazy: who even has the action minutes to raise AMIs these days? We sure don&#8217;t. But we had a problem, or maybe an opportunity. SimKube just keeps getting better and better, but configuring it can be, frankly, difficult. 
Building high-fidelity simulation environments requires installing and configuring a long list of tools: <a href="https://kind.sigs.k8s.io/">kind</a>, <a href="https://kwok.sigs.k8s.io/">KWOK</a>, <a href="https://kubernetes.io/docs/reference/kubectl/">kubectl</a>, <a href="https://docs.docker.com/">Docker</a>, <a href="https://prometheus.io/docs/introduction/overview/">Prometheus</a>, and SimKube, to name a few. So spinning up a ready-to-go SimKube environment takes some doing.</p><p>Internally, we have a configuration management repository called <a href="https://open.substack.com/pub/appliedcomputing/p/what-to-expect-when-youre-expecting?r=6zc8zh&amp;utm_campaign=post&amp;utm_medium=web">isengard</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. It is ~48k lines of pure <a href="https://docs.ansible.com/">Ansible</a> bliss. We use it to automate the deployment of repeatable simulation environments. It occurred to us that users of SimKube probably don&#8217;t want step one of using it to be &#8220;here&#8217;s ~48k lines of Ansible, good luck!&#8221; It turns out there is a better way: a custom SimKube AMI.</p><h2>Why not a Docker image like a normal person?</h2><p>That&#8217;s a fair question. We did evaluate using a Docker image because one of our primary goals is a fast, one-click startup.</p><p>The challenge is that SimKube relies on <code>kind</code>, which spins up Kubernetes nodes as Docker containers. Initializing and configuring the kind cluster requires access to a live Docker daemon. During <code>docker build</code>, there is no Docker daemon available inside the build environment, which means we can&#8217;t just &#8220;run all the setup steps in our Dockerfile&#8221; and ship the result.</p><p>We also looked at snapshotting a running Docker environment, but that&#8217;s complicated for a different set of reasons. 
So after a long side quest that included Vagrant and QEMU, we realized that what we actually needed wasn&#8217;t a container image but a prebuilt machine image that preserves the state of our configured simulation cluster. Since we primarily work in AWS, an AMI fits naturally.</p><h2>Baking the AMIs</h2><p>Fortunately, baking AMIs is a fairly straightforward task that our ancestors have been doing for thousands of years. We can reuse a lot of what we have already built in our configuration management system (which I will remind you is lots and lots of Ansible). We use <a href="https://developer.hashicorp.com/packer/docs">Packer</a> to bake our AMIs, so the first step is selecting a base image on which to build our custom AMI. We chose Ubuntu 24.04 LTS for its stability, compatibility with our tooling, and long-term security patching.</p><p>Using Packer, we can initiate an automated build via a GitHub Action. For configuration, Packer includes a range of provisioners, <a href="https://developer.hashicorp.com/packer/integrations/hashicorp/ansible/latest/components/provisioner/ansible">including one for Ansible</a>, so we are able to leverage our existing configuration library in <code>isengard</code>. The GitHub Action itself is fairly simple: it clones the repo and runs Packer. This helps keep our Packer configuration sparse and maintainable. 
We only need to configure a handful of things: the Ansible playbook to run, the region of our builder, our base AMI, regions to copy the finished AMI to, and any cleanup activities or supplemental provisioners.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BUDz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e77b3a5-92ea-4c91-8cc6-c8822b3c8368_1096x1274.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BUDz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e77b3a5-92ea-4c91-8cc6-c8822b3c8368_1096x1274.png 424w, https://substackcdn.com/image/fetch/$s_!BUDz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e77b3a5-92ea-4c91-8cc6-c8822b3c8368_1096x1274.png 848w, https://substackcdn.com/image/fetch/$s_!BUDz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e77b3a5-92ea-4c91-8cc6-c8822b3c8368_1096x1274.png 1272w, https://substackcdn.com/image/fetch/$s_!BUDz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e77b3a5-92ea-4c91-8cc6-c8822b3c8368_1096x1274.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BUDz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e77b3a5-92ea-4c91-8cc6-c8822b3c8368_1096x1274.png" width="1096" height="1274" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0e77b3a5-92ea-4c91-8cc6-c8822b3c8368_1096x1274.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1274,&quot;width&quot;:1096,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:125032,&quot;alt&quot;:&quot;A screenshot of an AMI pipeline showing the relationships between GitHub Actions, Packer and Ansible&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/188559350?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e77b3a5-92ea-4c91-8cc6-c8822b3c8368_1096x1274.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A screenshot of an AMI pipeline showing the relationships between GitHub Actions, Packer and Ansible" title="A screenshot of an AMI pipeline showing the relationships between GitHub Actions, Packer and Ansible" srcset="https://substackcdn.com/image/fetch/$s_!BUDz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e77b3a5-92ea-4c91-8cc6-c8822b3c8368_1096x1274.png 424w, https://substackcdn.com/image/fetch/$s_!BUDz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e77b3a5-92ea-4c91-8cc6-c8822b3c8368_1096x1274.png 848w, https://substackcdn.com/image/fetch/$s_!BUDz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e77b3a5-92ea-4c91-8cc6-c8822b3c8368_1096x1274.png 1272w, 
https://substackcdn.com/image/fetch/$s_!BUDz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e77b3a5-92ea-4c91-8cc6-c8822b3c8368_1096x1274.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">An AMI pipeline we can live with.</figcaption></figure></div><p></p><p>Our AMI pipeline is triggered by a weekly cron trigger, or by a manual build via a dispatch trigger<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. 
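</p><p>As a rough illustration of those two triggers, a GitHub Actions workflow along the following lines would cover both. This is a minimal sketch under stated assumptions, not our actual workflow: the workflow name, the cron expression, the <code>AMI_BUILDER_ROLE_ARN</code> secret, the region, and the <code>packer/</code> directory are all hypothetical names chosen for the example.</p><pre><code class="language-yaml"># Hypothetical workflow file, e.g. .github/workflows/build-ami.yml
name: build-simkube-ami          # assumed name

on:
  schedule:
    - cron: "0 6 * * 1"          # weekly, Mondays at 06:00 UTC (assumed schedule)
  workflow_dispatch:             # manual builds from the Actions tab or the API

permissions:
  id-token: write                # lets the job request an OIDC token for AWS
  contents: read

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # Clone the configuration management repo
      - uses: actions/checkout@v4

      # Obtain short-lived AWS credentials (role ARN is a hypothetical secret)
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AMI_BUILDER_ROLE_ARN }}
          aws-region: us-east-1  # assumed builder region

      # Install Packer, fetch its plugins, then run the build
      - uses: hashicorp/setup-packer@main
      - run: packer init packer/   # assumed template directory
      - run: packer build packer/
</code></pre><p>The <code>workflow_dispatch</code> entry is what allows an on-demand build between scheduled runs; everything else in the job is the same for both triggers.</p><p>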
After some bake time, we end up with a custom AMI backed by the snapshot created during the build. That AMI is then listed on the Marketplace.</p><h2>AMI patching</h2><p>Shipping a public AMI means we own patching it. At a minimum, Ubuntu is going to ship security patches (we really want those), and there will be patches for other software in our stack. Instead of patching in place, we treat each AMI build as an immutable artifact tied to the git hash of the configuration management repository (isengard) used to produce it. Every build is traceable and reproducible.</p><p>Every time our pipeline runs, it starts fresh with a clean Ubuntu image and pulls down the latest patches so each AMI is fully up to date at build time. The result is a simple, deterministic build process that is easy to maintain. AMIs are short-lived, they won&#8217;t drift over time, and there is no ambiguity about which code produced which image.</p><p>The tradeoff is that older AMIs are never patched. If you launch an older version, you get exactly what existed at build time. For our use case, this ends up being a feature since we value reproducibility. This comes in handy for debugging thorny issues in our AMIs, especially those that manage to bypass our validation tests. We can fire up an AWS EC2 instance and watch one of our services get clobbered in real time by some bad code that I definitely didn&#8217;t write.</p><p>For the most part, our AMI pipeline quietly churns out new images. To keep our account from piling up with tons of old AMIs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, we use a <a href="https://developer.hashicorp.com/packer/docs/templates/hcl_templates/blocks/build/post-processor">Packer post-processor block</a> to deprecate old AMIs and clean up their snapshots automatically.</p><h2>You said TWO AMIs</h2><p>I did say that! We have two versions of our AMI. 
The first is the <a href="https://aws.amazon.com/marketplace/pp/prodview-m7imofdta3tla?sr=0-1&amp;ref_=beagle&amp;applicationId=AWSMPContessa">free SimKube AMI</a>, with everything needed to run SimKube, including a running kind cluster and management tools. This is our free-to-use simulation environment. All you need to do is launch it in AWS EC2 and get right to running simulations, though you will need a trace from the cluster you are simulating.</p><p>The second AMI is our <a href="https://aws.amazon.com/marketplace/pp/prodview-jea6uc3po665a?sr=0-2&amp;ref_=beagle&amp;applicationId=AWSMPContessa">SimKube GitHub Action Runner</a>. It includes everything in the SimKube AMI but also has some extra configuration applied. We use an iterative build process, so this version is literally built on top of the base SimKube AMI<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QTEp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757ed00-3b2b-4de0-9e0d-07892ebd7d69_1127x646.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QTEp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757ed00-3b2b-4de0-9e0d-07892ebd7d69_1127x646.png 424w, https://substackcdn.com/image/fetch/$s_!QTEp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757ed00-3b2b-4de0-9e0d-07892ebd7d69_1127x646.png 848w, 
https://substackcdn.com/image/fetch/$s_!QTEp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757ed00-3b2b-4de0-9e0d-07892ebd7d69_1127x646.png 1272w, https://substackcdn.com/image/fetch/$s_!QTEp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757ed00-3b2b-4de0-9e0d-07892ebd7d69_1127x646.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QTEp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757ed00-3b2b-4de0-9e0d-07892ebd7d69_1127x646.png" width="1127" height="646" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0757ed00-3b2b-4de0-9e0d-07892ebd7d69_1127x646.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:646,&quot;width&quot;:1127,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:89704,&quot;alt&quot;:&quot;A screenshot of an AMI lineage diagram showing the inheritance of AMIs from Ubuntu 24.04 LTS, down to simkube-x86-64, and finally to simkube-github-runner-x86-64&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/188559350?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757ed00-3b2b-4de0-9e0d-07892ebd7d69_1127x646.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A screenshot of an AMI lineage diagram showing the inheritance of AMIs from Ubuntu 24.04 LTS, down to simkube-x86-64, and finally to simkube-github-runner-x86-64" title="A screenshot of an AMI lineage diagram showing the inheritance of AMIs from Ubuntu 24.04 LTS, down to simkube-x86-64, and finally to simkube-github-runner-x86-64" srcset="https://substackcdn.com/image/fetch/$s_!QTEp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757ed00-3b2b-4de0-9e0d-07892ebd7d69_1127x646.png 424w, https://substackcdn.com/image/fetch/$s_!QTEp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757ed00-3b2b-4de0-9e0d-07892ebd7d69_1127x646.png 848w, https://substackcdn.com/image/fetch/$s_!QTEp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757ed00-3b2b-4de0-9e0d-07892ebd7d69_1127x646.png 1272w, https://substackcdn.com/image/fetch/$s_!QTEp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757ed00-3b2b-4de0-9e0d-07892ebd7d69_1127x646.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Built on the shoulders of giants.</figcaption></figure></div><p></p><p>This cuts down on build time and allows us to patch the GitHub Action Runner software independently of the base AMI if we wish. The additional configuration in this version is the GitHub runner software and a systemd wrapper that manages it. We use this version to run SimKube in CI pipelines (via <a href="https://docs.github.com/en/actions/get-started/understand-github-actions">GitHub Actions</a>). Effectively, this AMI is primed to register itself with a GitHub repo as a custom action runner when it receives the information contained in our User Data script.</p><h2>A world of opportunities</h2><p>Our SimKube AMI is a step forward in making SimKube approachable and easy to use. Instead of spending a few hours setting up a simulation environment, you can grab the SimKube AMI off the AWS Marketplace and have SimKube up and running in a couple of minutes. You will need to <a href="https://simkube.dev/simkube/docs/intro/running/#step-1-collect-a-trace">grab a trace</a> from your production cluster, but the environment for running those simulations is available at the click of a button or at the end of an AWS CLI command.</p><p>We want to continue to extend Kubernetes simulation into CI pipelines using our GitHub runner AMI. The vision is that an engineer, maybe you, checks in some change to your cluster. 
Then, SimKube CI simulates it based on your production cluster and sends you back metrics you can use to evaluate your change before it hits production.</p><p>Today, ACRL is already running small simulations in CI in the <a href="https://github.com/acrlabs/simkube/blob/main/.github/workflows/simkube_e2e.yml">SimKube repo</a>. We have developed custom GitHub Actions, available in our <a href="https://github.com/acrlabs/simkube-ci-action">simkube-ci-action</a> repo, to make launching runners backed by SimKube AMIs as easy as adding a few lines in your GitHub Actions workflow.</p><p>So maybe you find SimKube interesting, but setting it up has been too much of a hassle. Or perhaps you are already running SimKube locally but want to run a dozen simultaneous simulations in AWS. The AMIs are there for you, and the SimKube AMI is free to use, though you still have to pay AWS for the compute (sorry).</p><p>If you want to learn more, we&#8217;ve added a new <a href="https://simkube.dev/simkube/docs/infra/overview/">SimKube in the Cloud</a> section to the documentation that walks through how the AMIs work and how to get started.</p><p>So get out there and simulate some trouble&#8230; before it makes it to prod!</p><p>Cheers,</p><p>Ian</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.appliedcomputing.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Applied Computing Research Labs is a reader-supported publication. 
All our footnotes are ethically sourced.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Is snapshot weight part of the APGAR score?</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Pronounced: nerds</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Builds are expensive from an action minutes perspective</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Ask me how I know that EBS volume storage costs extra</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Andddd now the twins metaphor has completely broken down</p></div></div>]]></content:encoded></item><item><title><![CDATA[Stop Reading Changelogs: Safer Kubernetes upgrades with simulation]]></title><description><![CDATA[SRECon 2026 Talk Transcript]]></description><link>https://blog.appliedcomputing.io/p/stop-reading-changelogs-safer-kubernetes</link><guid 
isPermaLink="false">https://blog.appliedcomputing.io/p/stop-reading-changelogs-safer-kubernetes</guid><dc:creator><![CDATA[drmorr]]></dc:creator><pubDate>Fri, 27 Mar 2026 20:01:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!YQ3N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf09fd51-ca0b-497c-9370-9de2bc70b309_1960x1104.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YQ3N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf09fd51-ca0b-497c-9370-9de2bc70b309_1960x1104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YQ3N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf09fd51-ca0b-497c-9370-9de2bc70b309_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!YQ3N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf09fd51-ca0b-497c-9370-9de2bc70b309_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!YQ3N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf09fd51-ca0b-497c-9370-9de2bc70b309_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!YQ3N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf09fd51-ca0b-497c-9370-9de2bc70b309_1960x1104.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!YQ3N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf09fd51-ca0b-497c-9370-9de2bc70b309_1960x1104.png" width="1456" height="820" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df09fd51-ca0b-497c-9370-9de2bc70b309_1960x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:179927,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/192347873?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf09fd51-ca0b-497c-9370-9de2bc70b309_1960x1104.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YQ3N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf09fd51-ca0b-497c-9370-9de2bc70b309_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!YQ3N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf09fd51-ca0b-497c-9370-9de2bc70b309_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!YQ3N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf09fd51-ca0b-497c-9370-9de2bc70b309_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!YQ3N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf09fd51-ca0b-497c-9370-9de2bc70b309_1960x1104.png 
1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Quick note for my subscribers: In an effort to get a few more paid subscriptions to this blog, I&#8217;m running an experiment where I keep the &#8220;paid subscribers only&#8221; period for a week, instead of just &#8220;over the weekend&#8221;. I will still publish all of my posts publicly, but if you want to read them earlier, you&#8217;ll need to be a paid subscriber. As a reminder: paid subscriptions to the blog don&#8217;t pay my bills, but they help get me and Ian to conferences, which is where we meet the people who <em>do</em> pay our bills. 
So I very much appreciate everyone who is already supporting us, and look forward to getting even more!</p><div><hr></div><p>Last week I had the privilege of giving a talk at <a href="https://www.usenix.org/conference/srecon26americas">SRECon Americas</a> in Seattle; the recording of the talk will be published in a few weeks, but I thought for this post I&#8217;d do something a bit different and write an (approximate) transcript of the talk for those of you who don&#8217;t want to wait and/or prefer reading things to watching videos. The talk is, of course, about SimKube: I hope you enjoy! 
I&#8217;ll update this with a link to the talk once it&#8217;s posted.</p><div><hr></div><p>Thanks all for coming; the title of my talk today is &#8220;Stop Reading Changelogs: Safer Kubernetes Upgrades with Simulation&#8221;.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0N2S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20cb6f70-f43e-43dc-8d50-9587c9745e23_1960x1104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0N2S!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20cb6f70-f43e-43dc-8d50-9587c9745e23_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!0N2S!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20cb6f70-f43e-43dc-8d50-9587c9745e23_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!0N2S!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20cb6f70-f43e-43dc-8d50-9587c9745e23_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!0N2S!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20cb6f70-f43e-43dc-8d50-9587c9745e23_1960x1104.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0N2S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20cb6f70-f43e-43dc-8d50-9587c9745e23_1960x1104.png" width="1456" height="820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/20cb6f70-f43e-43dc-8d50-9587c9745e23_1960x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:432618,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/192347873?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20cb6f70-f43e-43dc-8d50-9587c9745e23_1960x1104.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0N2S!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20cb6f70-f43e-43dc-8d50-9587c9745e23_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!0N2S!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20cb6f70-f43e-43dc-8d50-9587c9745e23_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!0N2S!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20cb6f70-f43e-43dc-8d50-9587c9745e23_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!0N2S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20cb6f70-f43e-43dc-8d50-9587c9745e23_1960x1104.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>On Pi Day three years ago, March 14, 2023, Reddit suffered a 100-pi-minute outage (that&#8217;s 314 minutes). In the extremely detailed and <a href="https://www.reddit.com/r/RedditEng/comments/11xx5o0/you_broke_reddit_the_piday_outage/">well-written postmortem</a>, it was revealed that one of the primary causes of this outage was a Kubernetes version upgrade from Kubernetes 1.23 to 1.24. Specifically, in this upgrade, Kubernetes changed one of the labels that was applied to its control plane nodes, and Calico, the container networking (CNI) component used by Reddit at the time, was using a label selector that pointed to the <em>old</em> label for the node instead of the new label. 
This caused Reddit to go hard down for around 5 hours.</p><div class="paywall-jump" data-component-name="PaywallToDOM"></div><p>Now: I was not a Reddit engineer at the time of this outage and I am not a Reddit engineer now. In fact, I have <em>never</em> been a Reddit engineer, and I also know for certain that there are Reddit engineers in the audience today who were employed at Reddit at the time this incident happened. So let&#8217;s take a detour to answer the question, &#8220;Who the heck am I and what am I doing here?&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Nuse!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F276d0abc-5e88-4520-a2db-5b569222d699_1960x1104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Nuse!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F276d0abc-5e88-4520-a2db-5b569222d699_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!Nuse!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F276d0abc-5e88-4520-a2db-5b569222d699_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!Nuse!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F276d0abc-5e88-4520-a2db-5b569222d699_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!Nuse!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F276d0abc-5e88-4520-a2db-5b569222d699_1960x1104.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Nuse!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F276d0abc-5e88-4520-a2db-5b569222d699_1960x1104.png" width="1456" height="820" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/276d0abc-5e88-4520-a2db-5b569222d699_1960x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:315125,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/192347873?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F276d0abc-5e88-4520-a2db-5b569222d699_1960x1104.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Nuse!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F276d0abc-5e88-4520-a2db-5b569222d699_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!Nuse!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F276d0abc-5e88-4520-a2db-5b569222d699_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!Nuse!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F276d0abc-5e88-4520-a2db-5b569222d699_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!Nuse!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F276d0abc-5e88-4520-a2db-5b569222d699_1960x1104.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Hi! My name&#8217;s David; I go by drmorr on the internet, and I use this cute little grumpy red robot icon basically everywhere. I received my PhD in Computer Science from the University of Illinois in 2014, where I was focused on scheduling and optimization problems. From there, I got my first introduction to distributed systems at Yelp, which at the time was unhappy with the amount of money they were paying to AWS, and asked me to use my optimization skills to help them pay less money to AWS. 
Of particular note for this talk, while I was at Yelp I wrote my <em>first</em> distributed systems simulator, a simulation engine for Apache Mesos that we used to try to understand some of the cost-reduction changes we wanted to make. From Yelp I went to Airbnb, doing more of the same: scheduling, autoscaling, and cost optimization, before finally starting ACRL&#8212;a small business focused on open-source research and development in distributed systems. One of ACRL&#8217;s primary projects is a simulation environment for Kubernetes called <a href="https://simkube.dev/">SimKube</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Rv9o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2fa65c-1be2-4bef-81a7-9b6f8076cd76_1960x1104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Rv9o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2fa65c-1be2-4bef-81a7-9b6f8076cd76_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!Rv9o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2fa65c-1be2-4bef-81a7-9b6f8076cd76_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!Rv9o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2fa65c-1be2-4bef-81a7-9b6f8076cd76_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!Rv9o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2fa65c-1be2-4bef-81a7-9b6f8076cd76_1960x1104.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Rv9o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2fa65c-1be2-4bef-81a7-9b6f8076cd76_1960x1104.png" width="1456" height="820" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed2fa65c-1be2-4bef-81a7-9b6f8076cd76_1960x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:352378,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/192347873?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2fa65c-1be2-4bef-81a7-9b6f8076cd76_1960x1104.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Rv9o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2fa65c-1be2-4bef-81a7-9b6f8076cd76_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!Rv9o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2fa65c-1be2-4bef-81a7-9b6f8076cd76_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!Rv9o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2fa65c-1be2-4bef-81a7-9b6f8076cd76_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!Rv9o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2fa65c-1be2-4bef-81a7-9b6f8076cd76_1960x1104.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>With that out of the way, let&#8217;s go back to the topic at hand: why are Kubernetes upgrades so challenging? In the screenshot on the right, you can see the Kubernetes release schedule, and the main takeaway from this is that a new Kubernetes version is released every fourteen weeks&#8212;that&#8217;s three new versions a year. This is an extremely aggressive release cycle, and one that is difficult to keep up with. 
It&#8217;s easy to find many links on the Internet like this one, entitled <a href="https://www.reddit.com/r/kubernetes/comments/1ggcufk/managed_k8s_cluster_upgrades_are_a_total_nightmare/">&#8220;Managed Kubernetes Cluster Upgrades are a Total Nightmare&#8221;</a>. What makes them so hard? Well, the primary reason is that upgrading Kubernetes requires performing upgrades to <code>n</code> independent control loops in a correct-but-unknown order. Threads like the above are filled with debates about whether you should do this process &#8220;in-place&#8221; or via a &#8220;lift-and-shift&#8221; approach where you spin up a cluster on the new version and then migrate your workloads over: and to be honest, both approaches have their tradeoffs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gvam!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54a693-d33e-474d-aa5f-e976c0e36c83_1960x1104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gvam!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54a693-d33e-474d-aa5f-e976c0e36c83_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!gvam!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54a693-d33e-474d-aa5f-e976c0e36c83_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!gvam!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54a693-d33e-474d-aa5f-e976c0e36c83_1960x1104.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gvam!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54a693-d33e-474d-aa5f-e976c0e36c83_1960x1104.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gvam!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54a693-d33e-474d-aa5f-e976c0e36c83_1960x1104.png" width="1456" height="820" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9b54a693-d33e-474d-aa5f-e976c0e36c83_1960x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:479540,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/192347873?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54a693-d33e-474d-aa5f-e976c0e36c83_1960x1104.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gvam!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54a693-d33e-474d-aa5f-e976c0e36c83_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!gvam!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54a693-d33e-474d-aa5f-e976c0e36c83_1960x1104.png 848w, 
https://substackcdn.com/image/fetch/$s_!gvam!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54a693-d33e-474d-aa5f-e976c0e36c83_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!gvam!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54a693-d33e-474d-aa5f-e976c0e36c83_1960x1104.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>I want to also look at what the typical upgrade process looks like at many organizations. 
Step 0 in this process is &#8220;someone tells you to upgrade&#8221;. Now, that person <em>might</em> be your boss, but more likely it&#8217;s one of the managed cloud providers. If you run on EKS, AKS, or GKE, you are going to be forcibly upgraded on a regular cadence; in some cases it <em>might</em> be possible to pay the cloud providers <em>more</em> money to stay on an older version, but that&#8217;s a stopgap solution that only delays the inevitable. At some point, you <em>must</em> upgrade.</p><p>So once you&#8217;ve committed to the upgrade, you might decide that the reasonable next step is to read the changelog. On the right of this slide is an example from the Kubernetes 1.34 changelog. The text is probably too small for you to read in the back, but buried in the middle of this 2000-line-long markdown file are two bullets: the first reads &#8220;Urgent Upgrade Notes&#8221;, and the sub-bullet beneath it says, in parentheses, &#8220;(No, really, you MUST read this before you upgrade)&#8221;. Again, this is in the middle of a multiple-thousand-line markdown file that users are expected to read.</p><p>But OK: you&#8217;ve read the changelog. The next step is likely to spin up a test environment (or use an existing one), where you can deploy something that looks approximately like your production workload, and play whack-a-mole with the issues that crop up until you&#8217;re reasonably confident you got them all. Then you&#8217;re ready to upgrade in prod! You might start with your lowest-risk clusters, and gradually move to more critical clusters as you gain confidence in the rollout. Whoops! You missed one. I hope you have PagerDuty configured correctly.</p><p>And now you have a problem, because there is <em>NO</em> supported rollback plan for Kubernetes. Your choices are either to restore from a backup (you do have a backup, right? Does it work? Are you sure?) or to roll forward and try to identify and fix your broken cluster. 
This is exactly the choice that Reddit was faced with in their Pi Day outage.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tDdl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc97c31c8-d65a-4a1e-af28-f171d310cf7b_1960x1104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tDdl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc97c31c8-d65a-4a1e-af28-f171d310cf7b_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!tDdl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc97c31c8-d65a-4a1e-af28-f171d310cf7b_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!tDdl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc97c31c8-d65a-4a1e-af28-f171d310cf7b_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!tDdl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc97c31c8-d65a-4a1e-af28-f171d310cf7b_1960x1104.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tDdl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc97c31c8-d65a-4a1e-af28-f171d310cf7b_1960x1104.png" width="1456" height="820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c97c31c8-d65a-4a1e-af28-f171d310cf7b_1960x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:163139,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/192347873?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc97c31c8-d65a-4a1e-af28-f171d310cf7b_1960x1104.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tDdl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc97c31c8-d65a-4a1e-af28-f171d310cf7b_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!tDdl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc97c31c8-d65a-4a1e-af28-f171d310cf7b_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!tDdl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc97c31c8-d65a-4a1e-af28-f171d310cf7b_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!tDdl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc97c31c8-d65a-4a1e-af28-f171d310cf7b_1960x1104.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>However, there&#8217;s one point in this upgrade process that seems promising, and that&#8217;s this test cluster idea. Test clusters seem great! Why don&#8217;t we do more of those? In my experience, there are three reasons why. The first is that provisioning a new cluster is hard. I&#8217;ve been at some organizations where you can push a button and get a new cluster in about 30 minutes. I&#8217;ve been at other organizations where you have to make 10 changes in 20 different repos, and if you&#8217;re lucky you might have a running cluster 3 weeks later. I&#8217;m sure many organizations fall somewhere on that spectrum. 
But in any case, even waiting 30 minutes for a new cluster introduces a lot of friction when you have to repeatedly spin up and tear down an environment while you test things out.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F1kG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F998a04a1-53fa-415c-8cf7-e354f5647d00_1960x1104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F1kG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F998a04a1-53fa-415c-8cf7-e354f5647d00_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!F1kG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F998a04a1-53fa-415c-8cf7-e354f5647d00_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!F1kG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F998a04a1-53fa-415c-8cf7-e354f5647d00_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!F1kG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F998a04a1-53fa-415c-8cf7-e354f5647d00_1960x1104.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F1kG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F998a04a1-53fa-415c-8cf7-e354f5647d00_1960x1104.png" width="1456" height="820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/998a04a1-53fa-415c-8cf7-e354f5647d00_1960x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:169003,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/192347873?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F998a04a1-53fa-415c-8cf7-e354f5647d00_1960x1104.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!F1kG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F998a04a1-53fa-415c-8cf7-e354f5647d00_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!F1kG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F998a04a1-53fa-415c-8cf7-e354f5647d00_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!F1kG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F998a04a1-53fa-415c-8cf7-e354f5647d00_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!F1kG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F998a04a1-53fa-415c-8cf7-e354f5647d00_1960x1104.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The second reason why test clusters aren&#8217;t as effective as they could be is that they&#8217;re expensive. Particularly if you&#8217;re doing any sort of scale or load testing (and you want to do scale testing, because this is where the most gnarly and hard-to-detect issues live), it&#8217;s going to cost a lot of money. 
If you go to your boss and say, &#8220;Sure, I&#8217;ll happily do this upgrade, I just need a million dollars to run a thousand-node test cluster for a month,&#8221; they&#8217;re going to come back with, &#8220;No, try again.&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zv14!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10be993b-7290-4495-a22f-52f72dfe385f_1960x1104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zv14!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10be993b-7290-4495-a22f-52f72dfe385f_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!Zv14!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10be993b-7290-4495-a22f-52f72dfe385f_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!Zv14!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10be993b-7290-4495-a22f-52f72dfe385f_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!Zv14!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10be993b-7290-4495-a22f-52f72dfe385f_1960x1104.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zv14!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10be993b-7290-4495-a22f-52f72dfe385f_1960x1104.png" width="1456" height="820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/10be993b-7290-4495-a22f-52f72dfe385f_1960x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:178012,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/192347873?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10be993b-7290-4495-a22f-52f72dfe385f_1960x1104.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Zv14!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10be993b-7290-4495-a22f-52f72dfe385f_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!Zv14!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10be993b-7290-4495-a22f-52f72dfe385f_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!Zv14!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10be993b-7290-4495-a22f-52f72dfe385f_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!Zv14!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10be993b-7290-4495-a22f-52f72dfe385f_1960x1104.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>And the last reason why test clusters are hard is that replicating production is impossible. See, production has a bunch of annoying things like &#8220;users&#8221; and &#8220;network traffic&#8221; that you just don&#8217;t have even in a full-fledged staging or test environment. And you&#8217;re never going to be able to replicate and test all of that stuff.</p><p>Surely there&#8217;s a better way to do this? I think so! 
So in this talk I&#8217;d like to introduce SimKube.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!u_1z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0f85880-6fae-4294-aa1c-6e0a1a6b7069_1960x1104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!u_1z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0f85880-6fae-4294-aa1c-6e0a1a6b7069_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!u_1z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0f85880-6fae-4294-aa1c-6e0a1a6b7069_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!u_1z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0f85880-6fae-4294-aa1c-6e0a1a6b7069_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!u_1z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0f85880-6fae-4294-aa1c-6e0a1a6b7069_1960x1104.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!u_1z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0f85880-6fae-4294-aa1c-6e0a1a6b7069_1960x1104.png" width="1456" height="820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d0f85880-6fae-4294-aa1c-6e0a1a6b7069_1960x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2785195,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/192347873?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0f85880-6fae-4294-aa1c-6e0a1a6b7069_1960x1104.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!u_1z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0f85880-6fae-4294-aa1c-6e0a1a6b7069_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!u_1z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0f85880-6fae-4294-aa1c-6e0a1a6b7069_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!u_1z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0f85880-6fae-4294-aa1c-6e0a1a6b7069_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!u_1z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0f85880-6fae-4294-aa1c-6e0a1a6b7069_1960x1104.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now, on this slide is a picture of a bridge. This is a picture I took of Tower Bridge, a local Sacramento landmark; I&#8217;m from Sacramento, and&#8212;as an aside&#8212;it was 90 degrees there last week, so I&#8217;m very happy to be in Seattle where it&#8217;s in the fifties. But anyways, back on topic: why is there a bridge? Well, in many other engineering disciplines, simulation is a common tool. Civil engineers would never build a bridge without simulating it first. Aerospace engineers would never build an airplane without simulating it first. And yet, in my experience, in our discipline, simulation isn&#8217;t a tool that&#8217;s reached for very often. 
Why not?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CKNC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefcc232-b4c5-47e1-8b5f-692fa55f8887_1960x1104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CKNC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefcc232-b4c5-47e1-8b5f-692fa55f8887_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!CKNC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefcc232-b4c5-47e1-8b5f-692fa55f8887_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!CKNC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefcc232-b4c5-47e1-8b5f-692fa55f8887_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!CKNC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefcc232-b4c5-47e1-8b5f-692fa55f8887_1960x1104.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CKNC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefcc232-b4c5-47e1-8b5f-692fa55f8887_1960x1104.png" width="1456" height="820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/defcc232-b4c5-47e1-8b5f-692fa55f8887_1960x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:576965,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/192347873?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefcc232-b4c5-47e1-8b5f-692fa55f8887_1960x1104.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CKNC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefcc232-b4c5-47e1-8b5f-692fa55f8887_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!CKNC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefcc232-b4c5-47e1-8b5f-692fa55f8887_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!CKNC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefcc232-b4c5-47e1-8b5f-692fa55f8887_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!CKNC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefcc232-b4c5-47e1-8b5f-692fa55f8887_1960x1104.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I think there are a number of reasons, but a significant one is: distributed systems are already hard. And simulating them is even harder. However, in the case of Kubernetes, I think there are two things going for us: the first is the declarative nature of the system. It&#8217;s a well-known design goal of Kubernetes that you write down the desired state of the system, and the Kubernetes control loops will try to move the observed, actual state of the world to that desired state. In other words: YAML. I know we all love to hate YAML, but for the purposes of simulation, I think it&#8217;s really beneficial. We can track, in a simple, easy-for-humans-and-computers-to-read format, the changes in the desired state of the system over time. 
And then we can record all of those changes into a trace file, which we can then replay in our simulated environment as many times as we want! Nifty.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d8Sd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7a82dd-71e4-41a6-83fc-0530f6bbcfc5_1960x1104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d8Sd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7a82dd-71e4-41a6-83fc-0530f6bbcfc5_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!d8Sd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7a82dd-71e4-41a6-83fc-0530f6bbcfc5_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!d8Sd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7a82dd-71e4-41a6-83fc-0530f6bbcfc5_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!d8Sd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7a82dd-71e4-41a6-83fc-0530f6bbcfc5_1960x1104.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d8Sd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7a82dd-71e4-41a6-83fc-0530f6bbcfc5_1960x1104.png" width="1456" height="820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa7a82dd-71e4-41a6-83fc-0530f6bbcfc5_1960x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:255407,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/192347873?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7a82dd-71e4-41a6-83fc-0530f6bbcfc5_1960x1104.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!d8Sd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7a82dd-71e4-41a6-83fc-0530f6bbcfc5_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!d8Sd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7a82dd-71e4-41a6-83fc-0530f6bbcfc5_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!d8Sd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7a82dd-71e4-41a6-83fc-0530f6bbcfc5_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!d8Sd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7a82dd-71e4-41a6-83fc-0530f6bbcfc5_1960x1104.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The second feature of Kubernetes that helps with our simulation goal is the extensible API and custom controllers that Kubernetes supports&#8212;no, you know what, I&#8217;m kidding, this one&#8217;s YAML again. See, from the perspective of Kubernetes, a &#8220;node&#8221; is just a YAML blob that&#8217;s been written to etcd. It doesn&#8217;t care if there&#8217;s any real hardware backing that node. And a pod is just a YAML blob that&#8217;s been written to etcd. Kubernetes doesn&#8217;t care if there&#8217;s a real binary behind that YAML! So, if you believe <em>really really hard</em> that the node object that you wrote to etcd is real, then Kubernetes is going to believe it too; and likewise, if you believe, <em>really really hard</em> that your pod is real, Kubernetes will believe it too. 
And that&#8217;s exactly what <a href="https://kwok.sigs.k8s.io/">KWOK</a>, the Kubernetes WithOut Kubelet project, which underlies SimKube, does: it is a custom controller that watches for Node and Pod resources that have been created in etcd, and it walks those resources through their lifecycles.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_seI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e76e77e-3729-4e48-b004-b88bd5394b2b_1960x1104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_seI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e76e77e-3729-4e48-b004-b88bd5394b2b_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!_seI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e76e77e-3729-4e48-b004-b88bd5394b2b_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!_seI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e76e77e-3729-4e48-b004-b88bd5394b2b_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!_seI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e76e77e-3729-4e48-b004-b88bd5394b2b_1960x1104.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_seI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e76e77e-3729-4e48-b004-b88bd5394b2b_1960x1104.png" width="1456" height="820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e76e77e-3729-4e48-b004-b88bd5394b2b_1960x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:415817,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/192347873?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e76e77e-3729-4e48-b004-b88bd5394b2b_1960x1104.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_seI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e76e77e-3729-4e48-b004-b88bd5394b2b_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!_seI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e76e77e-3729-4e48-b004-b88bd5394b2b_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!_seI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e76e77e-3729-4e48-b004-b88bd5394b2b_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!_seI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e76e77e-3729-4e48-b004-b88bd5394b2b_1960x1104.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So with that, we can finally talk about how SimKube works. This next slide is a high-level architecture diagram of SimKube: I don&#8217;t care if you understand the details, all I want you to take away from this is that on the left, we have our real, production Kubernetes cluster. 
Inside that cluster, we have a small agent running that&#8217;s tracking all of the changes made to the YAML files describing that cluster; at any point in time, a user can export those changes into a trace file, which gets written to some persistent storage and can then be replayed in the simulation environment on the right.</p><p>Let&#8217;s actually watch a demo so you can get a better idea of what I&#8217;m talking about.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;189357e1-380b-46b9-b6e6-fe864648fd7a&quot;,&quot;duration&quot;:null}"></div><p>In this demo, the first thing we&#8217;re going to look at is a trace file. I&#8217;m using <code>skctl</code>, the SimKube CLI tool, to inspect this trace file: as you can see, it&#8217;s just a timestamped series of events, and if we drill down into these events, you can see that they&#8217;re just Kubernetes manifests. The application here is a toy &#8220;social network&#8221;, where you have microservices for browsing your timeline, writing posts, or analyzing the social graph. That&#8217;s not super important here; what&#8217;s important is that as the Kubernetes manifests for these services change over time, those changes get stored in the trace file.</p><p>So, let&#8217;s go ahead and run this trace file in our simulator. Once I do <code>skctl run</code>, it spits out a bunch of metadata for our simulation, and then it launches a driver pod in the <code>simkube</code> namespace. The driver is responsible for downloading the trace file and replaying all of the YAML inside it. We can look at the nodes on our cluster, and you see that I&#8217;ve got two real nodes (this is using <a href="https://kind.sigs.k8s.io/">kind</a> on my laptop), and a whole bunch of fake nodes that have been provisioned by Karpenter and are backed by KWOK. 
We can also go into the namespace where the simulated pods are being created to see that they&#8217;re all running; we have a little over 1000 pods right now, but this simulation will actually scale up to about 5000 pods across a hundred or so nodes.</p><p>One thing I do want to call out here: if we look in this namespace, we see that there&#8217;s a CronJob running, which is responsible for doing some cleanup actions in the user database. Now the thing about CronJobs is that the pods they create typically have a start and an end time. You might be wondering how that works in the simulation, since there&#8217;s no application code running to complete, but if you watch you&#8217;ll notice that these pods actually do complete! This is KWOK at work again: SimKube records information about the pod lifecycles and injects that into KWOK, and KWOK takes care of walking each pod through its entire lifecycle.</p><p>Great! Now, you might have noticed that this cluster is running Kubernetes 1.24, which is pretty old, so let&#8217;s go ahead and try to upgrade. I&#8217;m going to switch over to another cluster on my laptop that&#8217;s running Kubernetes 1.25, and re-run that same simulation. I&#8217;d like to acknowledge that what you&#8217;re about to see is <em>not</em> the best user experience&#8212;we&#8217;re working on it. But, we start the simulation, the driver pod comes up, and then a few seconds later&#8212;whoops! It crashed. Let&#8217;s go ahead and look at the logs.</p><p>Inside the logs, we see a traceback, and the message at the top says that the requested resource could not be found, with a 404 status. If we scroll up a bit further, we find the culprit: the CronJob that I highlighted earlier. Now if you know your Kubernetes history, you probably already know the punchline here. 
Kubernetes uses versioned APIs, which is how the project is able to safely evolve its user-facing APIs over time, and between Kubernetes 1.24 and 1.25 the <code>batch/v1beta1</code> version of the CronJob API was removed. Our trace file was still referencing the <code>batch/v1beta1</code> API, and when the driver tried to apply that to the new cluster, it crashed.</p><p>This is cool! We&#8217;ve used simulation to identify a problem with our cluster. But we can take this one step further. SimKube includes a bespoke DSL called SKEL (the SimKube Expression Language) which you can use to make targeted modifications to your trace file. The types of modifications you can make can be quite complex, but for this example we&#8217;re going to do one simple transformation. Here we&#8217;re selecting all resources that have <code>kind</code> equal to <code>CronJob</code>, and we&#8217;re replacing their <code>apiVersion</code> with <code>batch/v1</code>. Then we just use <code>skctl</code> to apply this transformation to our trace file, and we can try running our simulation again. And this time, it works! We can watch the driver pod, which crashed after about 10 seconds last time, and this time it keeps running. 
We can also go into the simulation namespace and confirm that the simulated pods are all present, including the CronJob pods.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rvp3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5a52ad4-190d-4821-8256-833630acb9a2_1960x1104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rvp3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5a52ad4-190d-4821-8256-833630acb9a2_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!rvp3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5a52ad4-190d-4821-8256-833630acb9a2_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!rvp3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5a52ad4-190d-4821-8256-833630acb9a2_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!rvp3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5a52ad4-190d-4821-8256-833630acb9a2_1960x1104.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rvp3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5a52ad4-190d-4821-8256-833630acb9a2_1960x1104.png" width="1456" height="820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f5a52ad4-190d-4821-8256-833630acb9a2_1960x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:413097,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/192347873?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5a52ad4-190d-4821-8256-833630acb9a2_1960x1104.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rvp3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5a52ad4-190d-4821-8256-833630acb9a2_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!rvp3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5a52ad4-190d-4821-8256-833630acb9a2_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!rvp3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5a52ad4-190d-4821-8256-833630acb9a2_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!rvp3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5a52ad4-190d-4821-8256-833630acb9a2_1960x1104.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So that&#8217;s SimKube, testing your Kubernetes upgrades! We can go even further, though: we&#8217;re all SREs, we like to automate things, so what if we just automate the entire process? Imagine if you regularly collect some traces from your production infra, store them in S3, and then periodically&#8212;maybe every time you need to upgrade, or maybe even just once a month to establish a baseline&#8212;you run those traces through a simulator, maybe as a CI job or something else? Well, it turns out you can do that too. We&#8217;ve released a free AMI that you can use to play around with SimKube, and soon we&#8217;ll have a GitHub runner that you can plug into your CI pipeline as well. 
In fact, we do exactly this on the SimKube repo: on the right you can see a screenshot of our GitHub actions, and on every merge to main we run an end-to-end test of SimKube on a few sample traces to ensure that the system is behaving properly.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E78d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e9a9713-05f7-4ac2-b618-32becf934f97_1960x1104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E78d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e9a9713-05f7-4ac2-b618-32becf934f97_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!E78d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e9a9713-05f7-4ac2-b618-32becf934f97_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!E78d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e9a9713-05f7-4ac2-b618-32becf934f97_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!E78d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e9a9713-05f7-4ac2-b618-32becf934f97_1960x1104.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E78d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e9a9713-05f7-4ac2-b618-32becf934f97_1960x1104.png" width="1456" height="820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e9a9713-05f7-4ac2-b618-32becf934f97_1960x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:558900,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/192347873?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e9a9713-05f7-4ac2-b618-32becf934f97_1960x1104.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!E78d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e9a9713-05f7-4ac2-b618-32becf934f97_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!E78d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e9a9713-05f7-4ac2-b618-32becf934f97_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!E78d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e9a9713-05f7-4ac2-b618-32becf934f97_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!E78d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e9a9713-05f7-4ac2-b618-32becf934f97_1960x1104.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To wrap up this talk, let&#8217;s revisit the things that made test clusters hard from the beginning: first, using simulation makes provisioning clusters easy. 
I can spin up a thousand-node cluster on my laptop in a couple of minutes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lVBm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4351cb58-fbd1-465d-b852-8d0d8a43e476_1960x1104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lVBm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4351cb58-fbd1-465d-b852-8d0d8a43e476_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!lVBm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4351cb58-fbd1-465d-b852-8d0d8a43e476_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!lVBm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4351cb58-fbd1-465d-b852-8d0d8a43e476_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!lVBm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4351cb58-fbd1-465d-b852-8d0d8a43e476_1960x1104.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lVBm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4351cb58-fbd1-465d-b852-8d0d8a43e476_1960x1104.png" width="1456" height="820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4351cb58-fbd1-465d-b852-8d0d8a43e476_1960x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:603955,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/192347873?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4351cb58-fbd1-465d-b852-8d0d8a43e476_1960x1104.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lVBm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4351cb58-fbd1-465d-b852-8d0d8a43e476_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!lVBm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4351cb58-fbd1-465d-b852-8d0d8a43e476_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!lVBm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4351cb58-fbd1-465d-b852-8d0d8a43e476_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!lVBm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4351cb58-fbd1-465d-b852-8d0d8a43e476_1960x1104.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Secondly, using simulation makes scale testing free. I mean, not <em>free</em>-free, you still have to have <em>some</em> hardware to run it on, but compared to the cost of a full production environment, we&#8217;re talking pennies. 
And, lastly, replicating your production environment is impossible!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a9Sk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09b3c6ef-5a0b-4903-b74c-e3b426b62fe7_1960x1104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!a9Sk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09b3c6ef-5a0b-4903-b74c-e3b426b62fe7_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!a9Sk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09b3c6ef-5a0b-4903-b74c-e3b426b62fe7_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!a9Sk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09b3c6ef-5a0b-4903-b74c-e3b426b62fe7_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!a9Sk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09b3c6ef-5a0b-4903-b74c-e3b426b62fe7_1960x1104.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!a9Sk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09b3c6ef-5a0b-4903-b74c-e3b426b62fe7_1960x1104.png" width="1456" height="820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/09b3c6ef-5a0b-4903-b74c-e3b426b62fe7_1960x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:621953,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/192347873?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09b3c6ef-5a0b-4903-b74c-e3b426b62fe7_1960x1104.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!a9Sk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09b3c6ef-5a0b-4903-b74c-e3b426b62fe7_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!a9Sk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09b3c6ef-5a0b-4903-b74c-e3b426b62fe7_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!a9Sk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09b3c6ef-5a0b-4903-b74c-e3b426b62fe7_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!a9Sk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09b3c6ef-5a0b-4903-b74c-e3b426b62fe7_1960x1104.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Oh. 
Hmmm.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wzj-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5f4352-aa38-44c1-a5e8-f8e16bf23683_1960x1104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wzj-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5f4352-aa38-44c1-a5e8-f8e16bf23683_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!wzj-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5f4352-aa38-44c1-a5e8-f8e16bf23683_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!wzj-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5f4352-aa38-44c1-a5e8-f8e16bf23683_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!wzj-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5f4352-aa38-44c1-a5e8-f8e16bf23683_1960x1104.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wzj-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5f4352-aa38-44c1-a5e8-f8e16bf23683_1960x1104.png" width="1456" height="820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f5f4352-aa38-44c1-a5e8-f8e16bf23683_1960x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:680823,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/192347873?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5f4352-aa38-44c1-a5e8-f8e16bf23683_1960x1104.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wzj-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5f4352-aa38-44c1-a5e8-f8e16bf23683_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!wzj-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5f4352-aa38-44c1-a5e8-f8e16bf23683_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!wzj-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5f4352-aa38-44c1-a5e8-f8e16bf23683_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!wzj-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f5f4352-aa38-44c1-a5e8-f8e16bf23683_1960x1104.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Ok, look. I don&#8217;t want you to walk out of this talk thinking that simulation is a silver bullet. The whole entire point of simulation is that you&#8217;re throwing out some part of your system in the hopes that it makes the analysis of the other part easier. The trick is knowing what part of the system you want to throw out while still maintaining the right level of fidelity. So here are three examples of places where SimKube struggles.</p><p>First, anything involving networking: since there&#8217;s no application code running, there&#8217;s nothing to respond to your network requests, which means testing load or network patterns won&#8217;t work. 
Secondly, I lied slightly earlier in the talk when I said that Kubernetes is 100% YAML &#8211; there are some components, like the Horizontal Pod Autoscaler (HPA), which make decisions based on metrics like CPU utilization or other real-time data. SimKube can&#8217;t handle that <em>yet</em>&#8212;but unlike the networking thing, there&#8217;s nothing technically stopping us from providing fake Prometheus metrics data to the HPA, and in fact, this is something that I&#8217;m hoping to prioritize in the next year. And, lastly, any sort of integration with, well, I wrote cloud providers here, but this could really be any third-party service, gets tricky. See, the cloud providers don&#8217;t even provide you <em>access</em> to the control planes of your cluster, so getting the right data out can be challenging. And other third-party systems may be similar.</p><p>So those are some areas where SimKube specifically might struggle, but what I really hope you take away from this talk is that SimKube is just one tool in your toolbox. There are lots of other tools out there that can help to cover some of these areas. To close this talk out, I want to revisit the Reddit outage from the beginning. 
Here&#8217;s the question: &#8220;Would SimKube have prevented that outage?&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Khx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0342fbc-b674-4bdc-a4aa-435892c3334a_1960x1104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Khx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0342fbc-b674-4bdc-a4aa-435892c3334a_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!4Khx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0342fbc-b674-4bdc-a4aa-435892c3334a_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!4Khx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0342fbc-b674-4bdc-a4aa-435892c3334a_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!4Khx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0342fbc-b674-4bdc-a4aa-435892c3334a_1960x1104.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Khx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0342fbc-b674-4bdc-a4aa-435892c3334a_1960x1104.png" width="1456" height="820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b0342fbc-b674-4bdc-a4aa-435892c3334a_1960x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:436850,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/192347873?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0342fbc-b674-4bdc-a4aa-435892c3334a_1960x1104.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4Khx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0342fbc-b674-4bdc-a4aa-435892c3334a_1960x1104.png 424w, https://substackcdn.com/image/fetch/$s_!4Khx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0342fbc-b674-4bdc-a4aa-435892c3334a_1960x1104.png 848w, https://substackcdn.com/image/fetch/$s_!4Khx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0342fbc-b674-4bdc-a4aa-435892c3334a_1960x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!4Khx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0342fbc-b674-4bdc-a4aa-435892c3334a_1960x1104.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>On the one hand, maybe yes? Reddit engineers could certainly have loaded up some production data into SimKube and tried to do their upgrade, and they could have seen that the control plane node labels changed and (potentially) Calico stopped registering routes. So, maybe if SimKube had existed back then, it would have prevented the outage. On the other hand, no, SimKube couldn&#8217;t have prevented that outage. The Pi-Day postmortem highlighted a host of factors contributing to the incident, only some of which were technical. And SimKube can&#8217;t do anything about those: again, it&#8217;s just a tool, and in order for a tool to be effective, you need people who know how to use the tool, when to use the tool, how to interpret the results from the tool, how to communicate those results, and so on and so forth. 
So SimKube isn&#8217;t the whole picture here, but I <em>do</em> think (and I hope I&#8217;ve convinced you) that it can be a really powerful <em>part</em> of the picture. It&#8217;s certainly one that I have a lot of fun working on.</p><p>So that&#8217;s all I have for you today! I&#8217;ll close with these three links: the first takes you to <a href="https://simkube.dev/">simkube.dev</a>, which is the documentation site for the project; the second takes you to my blog where you can read about how SimKube has been used; and the third is a <a href="https://cal.com/drmorr">calendar link</a>. I love getting to chat with new folks, so if you have further questions or you just want to say hi, feel free to grab some time!</p><p>Thanks for your time, and I&#8217;ll be happy to take any questions that you have.</p><div><hr></div><p>So that was the talk! I hope you enjoyed getting to read it, even if you weren&#8217;t able to see the talk in person. I got a lot of great questions and engagement, and I&#8217;m pretty excited to see where we go next! 
As always, thanks for reading.</p><p>~drmorr</p>]]></content:encoded></item><item><title><![CDATA[Everything you need to run a single Kubernetes pod]]></title><description><![CDATA[OK, as promised last week, this post is a follow-on to my previous write-up about running a single-node Kubernetes cluster at ACRL.]]></description><link>https://blog.appliedcomputing.io/p/everything-you-need-to-run-a-single</link><guid isPermaLink="false">https://blog.appliedcomputing.io/p/everything-you-need-to-run-a-single</guid><dc:creator><![CDATA[drmorr]]></dc:creator><pubDate>Sat, 14 Mar 2026 20:00:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!heC4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c4e84e-afad-4257-9ba3-bb312977fba0_2568x1825.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!heC4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c4e84e-afad-4257-9ba3-bb312977fba0_2568x1825.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!heC4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c4e84e-afad-4257-9ba3-bb312977fba0_2568x1825.png 424w, https://substackcdn.com/image/fetch/$s_!heC4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c4e84e-afad-4257-9ba3-bb312977fba0_2568x1825.png 848w, https://substackcdn.com/image/fetch/$s_!heC4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c4e84e-afad-4257-9ba3-bb312977fba0_2568x1825.png 1272w, 
https://substackcdn.com/image/fetch/$s_!heC4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c4e84e-afad-4257-9ba3-bb312977fba0_2568x1825.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!heC4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c4e84e-afad-4257-9ba3-bb312977fba0_2568x1825.png" width="1456" height="1035" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8c4e84e-afad-4257-9ba3-bb312977fba0_2568x1825.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1035,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:760338,&quot;alt&quot;:&quot;Architecture diagram showing Moria, Isengard, and Mirkwood all interacting with AWS to run a Kubernetes cluster&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/190960222?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c4e84e-afad-4257-9ba3-bb312977fba0_2568x1825.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Architecture diagram showing Moria, Isengard, and Mirkwood all interacting with AWS to run a Kubernetes cluster" title="Architecture diagram showing Moria, Isengard, and Mirkwood all interacting with AWS to run a Kubernetes cluster" srcset="https://substackcdn.com/image/fetch/$s_!heC4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c4e84e-afad-4257-9ba3-bb312977fba0_2568x1825.png 424w, 
https://substackcdn.com/image/fetch/$s_!heC4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c4e84e-afad-4257-9ba3-bb312977fba0_2568x1825.png 848w, https://substackcdn.com/image/fetch/$s_!heC4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c4e84e-afad-4257-9ba3-bb312977fba0_2568x1825.png 1272w, https://substackcdn.com/image/fetch/$s_!heC4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c4e84e-afad-4257-9ba3-bb312977fba0_2568x1825.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">I built all this stuff so that you don&#8217;t have to.</figcaption></figure></div><p>OK, as promised last week, this post is a follow-on to my <a href="https://blog.appliedcomputing.io/p/what-to-expect-when-youre-expecting">previous write-up</a> about running a single-node Kubernetes cluster at ACRL. After calling that post &#8220;hot garbage&#8221;, the redditor who commented on the post went on to say that he was hoping for more details and less high-level pish-tosh<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. So, random redditor, this 4000-word manuscript is for you! I hope you brought your coffee.</p><h2>Project Goals: Quick Recap</h2><p>Before we dive into all the arrows in the architecture diagram, I just wanted to quickly review what the goals of this project are:</p><ol><li><p>Run a Kubernetes cluster in the cloud,</p></li><li><p>as cheaply as possible,</p></li><li><p>and schedule a workload on it. Lastly,</p></li><li><p>it should be one command to completely destroy and recreate our entire stack if we need to.</p></li></ol><p>To recap what I discussed in the previous post, we&#8217;re running <a href="https://k3s.io/">k3s</a> on a single AWS EC2 spot instance. We&#8217;re not using <a href="https://docs.aws.amazon.com/eks/latest/userguide/what-is-eks.html">EKS</a>, a) because it&#8217;s expensive, and b) because we&#8217;re supposed to be Kubernetes experts over here and not just some random hack jobs, so we should probably have some expertise running Kubernetes. We&#8217;re running on a spot instance a) because it&#8217;s cheap, and b) because I wanted to know if it could be done.  
If you&#8217;re following along at home and you&#8217;re not running on a spot instance, you can <em>probably</em> do away with a lot of this complexity; but on the flip side, running on spot serves as a forcing function to build the automation and tooling to easily recreate your entire state from scratch, which is in my experience an extremely underrated superpower.</p><p>In my previous post, we got to the point of &#8220;having a running k3s node&#8221;, but we didn&#8217;t actually get to the point of &#8220;running a workload on it&#8221;, partly because I ran out of time, and partly because I thought that once I had the node up and running, running the pods would be easy. It&#8217;s just YAML, right<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>???</p><p>Anyways, in the last post we also didn&#8217;t define what workload we&#8217;re going to be running on our Kubernetes cluster.  I have quite a few things I eventually want to run, but my first goal was just to run a simple XMPP server. I&#8217;ve wanted a good internal &#8220;chat&#8221; solution for a while, I <em>really</em> don&#8217;t want to use Slack or Discord, and it turns out that XMPP has been living a healthy thriving life for the past 15 years, even if most people don&#8217;t realize it. So after a bit of research, I decided that I was going to get <a href="https://prosody.im/">Prosody</a> running inside my Kubernetes cluster. It&#8217;s small and lightweight enough that &#8220;configuring Prosody&#8221; wouldn&#8217;t take too long, but it would also exercise a bunch of features that I&#8217;m going to want later: specifically, ingress, certificates, and persistent data.</p><p>I <em>also</em> also didn&#8217;t explicitly discuss the fourth goal in my last post, but this is really important to me. 
I&#8217;m a very big proponent of the &#8220;GitOps&#8221; pattern<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. If you can check your desired infrastructure state into version control, and it&#8217;s easy to recreate and/or get back to a &#8220;known good configuration&#8221;, that makes everything so much easier down the line. So for me at least, it&#8217;s worth it to spend some more time up front doing things &#8220;right&#8221; to (hopefully) make my life a little easier in the future<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>.</p><p>So with that said, let&#8217;s go into the nitty-gritty details.</p><h2>Revisiting Middle-earth</h2><p>In my last blog post I explained that I have two repos named after famous locations from Lord of the Rings: Moria handles my &#8220;infrastructure as code&#8221; and Isengard is my server configuration management repo. Since that post, I&#8217;ve added a third git repo, Mirkwood, which contains all of my Kubernetes manifests<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. You can see in the above diagram how each of these repos interacts with components in AWS, shown by the colored solid arrows (green is moria, blue is isengard, orange is mirkwood, and the dashed lines indicate physical hardware connections or network connections); feel free to refer back to that diagram as we continue.</p><h3>Moria</h3><p>My last post did a pretty good job of covering the basic AWS infrastructure setup, so I&#8217;ll just quickly recap it here. 
In Moria, we use <a href="https://www.pulumi.com/">Pulumi</a> to create the following AWS resources:</p><ol><li><p>A single bare EC2 node as an SSH proxy/bastion host/NAT gateway into my VPC.</p></li><li><p>A single-node AutoScaling Group (ASG) which contains the k3s server node, registered as a spot instance.</p></li><li><p>A persistent EBS volume which is auto-attached to the k3s server node on first boot (via a systemd service), which serves as the data volume for k3s.</p></li><li><p>An internal zone in Route53 which I can use to store a DNS A record pointing to the k3s server.</p></li><li><p>An AWS Lambda function, run as an ASG Lifecycle Hook which updates the A record whenever the k3s server is terminated and recreated.</p></li></ol><p>Since the first post, I&#8217;ve had to add a number of other AWS resources as well into Moria to support various tooling further down the stack. XMPP requires TLS these days, which means I need a certificate; I considered whether to use <a href="https://blog.appliedcomputing.io/p/acrl-is-a-ca-now">the CA</a> that I set up earlier in the year for this purpose, but those certificates are for a <em>very</em> different purpose than what I want now. I also considered whether to create a second CA for &#8220;internal services&#8221; but that also didn&#8217;t feel like a good approach. Fortunately, <a href="https://letsencrypt.org/">Let&#8217;s Encrypt</a> is a thing; I&#8217;ve been using it for a long time for my personal website, and it has good integrations with Kubernetes, so this felt like the right approach.</p><p>Certificates need to be signed for a particular domain, however, and Let&#8217;s Encrypt needs to know that you own the domain that it&#8217;s signing certificates for. But my Kubernetes cluster is inside a private VPC that Let&#8217;s Encrypt knows nothing about! 
So this means I need to use Moria to add</p><ol start="6"><li><p>Two more hosted zones<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> into my infrastructure: a public zone that Let&#8217;s Encrypt can use to verify ownership, and a private zone that will host the A records for my XMPP server. Note that we don&#8217;t actually configure those A records with Moria, however, because the IP address of Prosody can change.</p></li></ol><p>I also didn&#8217;t talk about this at all in the last post<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>, but Moria also manages</p><ol start="7"><li><p>All of the AWS IAM (Identity and Access Management) permissions for this configuration.</p></li><li><p>An encryption key in AWS KMS (Key Management Service) to help with secure secret storage.</p></li></ol><p>IAM permissions are some of the most arcane and convoluted things I have ever had to work with, but essentially every component above requires a corresponding IAM policy that enables it to perform its task. The proper, most secure way of managing these permissions involves things like short-lived temporary credentials and OIDC and JWT and a whole bunch of other ugly acronyms, but frankly that stuff gives me heartburn every time I look at it, so for right now I just have a bunch of &#8220;service account&#8221; IAM users with static credentials configured in Moria. This is marginally less secure, but I make it easy in Moria to rotate the credentials, and try to keep &#8220;what they can access&#8221; scoped as narrowly as possible. 
Someday in the future when I can afford to hire an infrastructure security engineer, I&#8217;ll make their first task to rip out all the service account users and replace them with the OIDC thing<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>.</p><p>I&#8217;ll lastly mention one note about the single-node spot instance: previously I had configured my ASG to launch a <code>t3a.medium</code>, which uses an AMD processor, has 2 CPUs, and 4GB of RAM. My expectation was that, since this is a (relatively) small instance type, it would have pretty high availability, since AWS can probably bin-pack that more-or-less anywhere. That expectation was extremely incorrect; the k3s server node was getting disrupted between 5 and 10 times a day. Since then, I&#8217;ve expanded to allowing any of <code>t3a.medium</code>, <code>t3.medium</code>, <code>t3a.large</code>, or <code>t3.large</code> spot instance types<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>, and using the &#8220;price capacity optimized&#8221; allocation strategy: this essentially tells AWS &#8220;pick an instance type that is relatively cheap, but also is relatively unlikely to get interrupted&#8221;. Since I made that change, I&#8217;ve been running a <code>t3.medium</code> spot instance for the last 3 days with no interruptions!</p><h3>Isengard</h3><p>In the last post, I gave an <em>extremely</em> high-level overview of my Isengard (<a href="https://ansible.com/">Ansible</a>) configuration management setup. Just as Pulumi or Terraform let you define AWS resources as code, Ansible lets you define <em>software resources</em> as code. In Ansible, you write all of your software configuration using a combination of YAML files and <a href="https://jinja.palletsprojects.com/en/stable/templates/">Jinja</a> templates. 
These configuration steps are bundled into &#8220;roles&#8221; which are then rolled up into &#8220;playbooks&#8221;. The idea is that a server can have multiple roles, each of which has its own software and configuration installed, and the playbook is what tells Ansible which roles should be applied to which hosts.</p><p>In Isengard, we have three playbooks:</p><ol><li><p>The SimKube playbook: this is used to quickly and repeatably create SimKube environments. It&#8217;s also what&#8217;s powering our SimKube AMI and GitHub Actions runners, which I&#8217;ve been teasing in several posts, but will be getting their own blog post very soon now. So I won&#8217;t say anything further about this.</p></li><li><p>The bastion playbook: this playbook manages the &#8220;entry point&#8221; into our private VPC.</p></li><li><p>The k3s server node playbook (also used to generate the AMI that Packer builds for the k3s root volume).</p></li></ol><p>In the previous post, I mentioned that I was using SSH tunneling through this bastion host, but that has since changed. I configured <a href="https://tailscale.com/">Tailscale</a>, and I have no other words for it beyond &#8220;freaking magical.&#8221; I understand (somewhat, at a high level) how Tailscale works, but the experience has been so unbelievably good that I wish I&#8217;d done this years ago. Everything Just Works; I now have my bastion host configured both to forward DNS lookups into my private VPC and to expose the internal routing tables from my VPC to any other client in the tailnet. It really is magical, there&#8217;s just no other way to describe it.</p><p>I am also running the <a href="https://github.com/AndrewGuenther/fck-nat">fck_nat</a> NAT gateway on my bastion node, to allow everything <em>inside</em> the VPC to talk to The Internet :tm:. AWS <em>does</em> actually provide a native/built-in solution for this, but it&#8217;s frankly hard to describe it as anything other than price gouging. 
A single <a href="https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html">NAT gateway</a> instance costs $30/month <em>just to exist</em>, and then they charge you on top of that for all of the traffic that goes through the NAT! For comparison, I&#8217;m running <code>fck_nat</code> on a <code>t4g.nano</code>, which works just fine and costs me $3/month.</p><p>On the k3s server node, I am obviously using Ansible to install k3s, kubectl, and other supporting tools. I also configure the k3s systemd service here. This is what my k3s config looks like:</p><pre><code><code>---
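# Skip the bundled Helm controller
disable-helm-controller: true
# Skip the embedded network policy controller
disable-network-policy: true
# Extra hostname added to the API server certificate as a SAN
tls-san: "k3s-server-0.uswest2.acrl.dev"
# Encrypt Kubernetes Secret objects at rest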
secrets-encryption: true</code></code></pre><p>I disable helm because, ew gross. I disable the network policy controller because I don&#8217;t care about network policies, and that controller expects a stable IP address which we obviously don&#8217;t have. The <code>tls-san</code> block gives a stable domain name to issue the Kubernetes certificates for<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>, and the secrets encryption line enables AES at-rest encryption for Kubernetes Secret objects.</p><p>For a brief period in time I also specified a stable Kubernetes node name in this config, but that ended up causing more problems than it solved:  whenever the node was interrupted/restarted, it would remember all of the state of the node from before it was interrupted, which led to some extremely weird &#8220;stale state&#8221; issues.  More on this later.</p><p>For what are probably &#8220;premature optimization&#8221; reasons, this config file is <em>not</em> baked in the k3s AMI, but instead is generated as an <code>ExecStartPre</code> script in systemd; the script looks up the <code>tls-san</code> name from a tag on the EC2 instance, which allows me to dynamically detect the hostname and potentially run more of these clusters in the future.</p><p>There is also an <code>ExecStartPost</code> k3s systemd hook which is configured by Isengard that does the following:</p><ol><li><p>Applies the EBS CSI Driver &#8220;not-ready&#8221; taint to the node (if this doesn&#8217;t mean anything to you yet, don&#8217;t worry, just keep reading).</p></li><li><p>Cleans up any Node resources that are left over from when the spot instance was restarted.</p></li></ol><p>The latter point is necessary because Kubernetes doesn&#8217;t actually <em>ever</em> delete nodes that are stored in its database, it just lists them as <code>NotReady</code> in perpetuity. 
This isn&#8217;t <em>really</em> a problem, except that it looks ugly and takes up space. If we were not using k3s, we would configure the <a href="https://kubernetes.io/docs/concepts/architecture/cloud-controller/">cloud-controller-manager</a> to monitor the state of our AWS nodes and delete the ones from the Kubernetes database that don&#8217;t exist anymore; but k3s ships with its own, slimmed-down cloud controller manager that has some other nice features, and I didn&#8217;t want to disable that. So instead I just stuffed the cleanup into this post-run hook.</p><h3>Mirkwood</h3><p>Whew! We&#8217;re almost done here! We&#8217;ve got one last git repo to cover before we can have a working messaging app! Mirkwood contains all of my Kubernetes manifests, and it uses <a href="https://kustomize.io/">kustomize</a> to provide the configuration management tooling<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>.</p><p>There are (currently) five applications running on my cluster:</p><ol><li><p>The <a href="https://github.com/kubernetes-sigs/aws-ebs-csi-driver/tree/master">AWS EBS CSI driver</a></p></li><li><p>The <a href="https://github.com/aws/aws-node-termination-handler">AWS node termination handler</a></p></li><li><p><a href="https://cert-manager.io/">cert-manager</a></p></li><li><p>The <a href="https://github.com/kubernetes-sigs/external-dns">external-dns</a> operator</p></li><li><p>And, lastly, Prosody itself, aka, the thing we&#8217;ve been trying to run this whole damn time.</p></li></ol><p>The EBS CSI driver watches for Kubernetes <a href="https://kubernetes.io/docs/concepts/storage/persistent-volumes/">Persistent Volume Claims</a> (PVCs) and handles the creation, deletion, mounting, and unmounting of those block devices into Kubernetes pods. This is necessary so that my XMPP/Prosody pod can have persistent data across pod restarts.  
The EBS CSI driver <em>must</em> be running before any pods that request PVCs, so the CSI driver pod tolerates the taint that we created as part of the k3s configuration, and once it&#8217;s running and healthy it removes the taint so that other pods can schedule.</p><p>The node termination handler watches for spot interruption warnings (which AWS provides with a 2-minute time window) and triggers pod and node cleanup ahead of the interruption. This is necessary so that the EBS CSI driver can actually cleanly unmount the EBS volume before all the hardware is rudely yanked out from underneath it<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a>.</p><p>Cert-manager, as we&#8217;ve discussed, is responsible for talking to Let&#8217;s Encrypt and getting certificates for my XMPP server. It&#8217;s operating using the DNS-01 challenge mode, wherein the Let&#8217;s Encrypt issuer creates a challenge, and then cert-manager creates a TXT record in my public <code>acrl.dev</code> hosted zone to prove that yes, in fact, I do own the zone. Cert-manager then stores the certificate as a secret, which gets injected into the Prosody pod. It also automagically handles getting new certificates ahead of the expiry date, which is pretty slick<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a>! I&#8217;m a big fan of cert-manager.</p><p>Lastly, external-dns handles setting the DNS entry for my Prosody pod in my private <code>acrl.dev</code> hosted zone. The details here are a little subtle: we create a Kubernetes LoadBalancer Service<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a>, which <em>normally</em> would be handled by the cloud controller manager to create a network load balancer resource in AWS. But that stuff&#8217;s expensive! 
We don&#8217;t wanna run no network load balancer here. So instead, we use the builtin <a href="https://docs.k3s.io/networking/networking-services#service-load-balancer">ServiceLB controller</a> in k3s, which more-or-less just exposes the right ports on the host where the pod is running via iptables rules. The external-dns controller just looks up the IP address of the Prosody Service and writes that as an A record into Route53.</p><p>And now we (finally!) have everything we need in order to run our single Kubernetes pod. The Prosody pod requests a 5GB persistent volume to store all of its data, grabs a certificate from cert-manager, uses a ConfigMap to inject all the Prosody configuration, and runs as a single-pod Deployment. And, it all works!</p><h2>Putting all the pieces together</h2><p>There&#8217;s two bits that I didn&#8217;t cover in the above descriptions, which are a) automation and tooling, and b) secrets management. While the details are slightly different for each of my three repos, the high-level bits are the same. For automation and tooling, we use <a href="https://just.systems/">just</a>, which I have increasingly fallen in love with over the last few years. I have justfiles in (almost) every repo, and any time there&#8217;s a complicated command that I need to remember, I drop it into the justfile. So if I needed to recreate this entire setup tomorrow from scratch it would be three commands<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a>:</p><pre><code><code>~moria &gt; just apply
~isengard &gt; just apply
~mirkwood &gt; just apply</code></code></pre><p>Not quite the one command I started off with in my initial goals, but honestly it&#8217;s pretty good imo. I also have GitHub Actions configured in each of these repos<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-17" href="#footnote-17" target="_self">17</a> to automatically run the above <code>just</code> commands whenever code is changed (or, in the case of moria and isengard, on a periodic schedule to rebuild the AMIs and update the running hosts).</p><p>On the secrets management front, each of the four systems (moria, isengard, mirkwood, and k3s) has a slightly different method for dealing with secrets. Any secret data is stored encrypted at rest, and the user interface is more-or-less the same for each; it&#8217;s just the backend that changes. Moria uses Pulumi secrets, which are stored encrypted (using local credentials) in the Pulumi state file in S3. Isengard uses Ansible secrets, which are stored encrypted (using local credentials) in the Git repo itself. Mirkwood uses <a href="https://getsops.io/">SOPS</a> to store sensitive data in the Git repo itself, this time encrypted with the AWS KMS key that we set up with Moria<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-18" href="#footnote-18" target="_self">18</a>. To make each of these three systems easier to work with, I have <code>just</code> targets configured to read or write encrypted data.</p><p>So that&#8217;s it! After several months of nights-and-weekends work, I now have a usable Kubernetes cluster that is running a single XMPP server in a somewhat reliable fashion, all for about $50/month of AWS bills. Is it cheaper than Slack? No. Is it a better user experience than Slack? Also no. Did I have fun? Lmao, nope, this was annoying as all hell to get configured. Was all the time I spent on this worthwhile, when I could have been using it to make SimKube better? 
Well&#8230; I think you know where this is going.</p><p>Anyways, hopefully this 4000-word guide (of sorts) might be useful to someone else down the line who is seeking to embark on a similar endeavour. Next time we&#8217;ll be returning to our regularly-scheduled SimKube content!</p><p>As always, thanks for reading.</p><p>~drmorr</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>My words, not his.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>How foolish we are in our youth.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Although I hate the name.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Critics might ask the question, &#8220;right for who?&#8221; and point out that some of the patterns I&#8217;m following are used at giant corporations with entire teams to support them and make sure things don&#8217;t go off the rails, and that I probably ought to be spending less time copying Google and more time hacking on SimKube. My response to that is, I&#8217;m not copying Google, we don&#8217;t do monorepos or Bazel here. 
Also, if you don&#8217;t like how I do things you&#8217;re free to go start your own company and do things your way instead.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>I <em>strongly</em> debated whether to call this repo &#8220;Mordor&#8221;, but that just didn&#8217;t feel right to me. YAML is kinda gross and there&#8217;s a lot of it, sure, but it&#8217;s not, like, the root of everything evil that is systematically trying to destroy every last bit of beauty and goodness in the world. It&#8217;s more like, you know, just a dark forest with lots of giant spiders.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>As an aside, I spent quite a while trying to figure out what domain name to use for these hosted zones; I wanted something that would be easy to distinguish from my &#8220;public&#8221; domains, namely <code>appliedcomputing.io</code> and <code>simkube.dev</code>. I could use a subdomain and do something like <code>internal.appliedcomputing.io</code> but that is already getting extremely long, and <code>xmpp.internal.appliedcomputing.io</code> is even worse. So I decided it was time to buy a new domain name; I&#8217;m now the proud owner of <code>acrl.dev</code>, which is solely used for internal ACRL services. 
If you ever see an <code>acrl.dev</code> address in the wild, it means something has gone horribly wrong.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>I thought long and hard about whether to buy <code>acrl.wtf</code> instead of <code>acrl.dev</code> for my internal domain name, but it was $50/year for what is essentially a throw-away domain and that felt like too much.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Astute readers will notice that I did <em>briefly</em> mention IAM in the footnotes of the previous post.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Please don&#8217;t hack me.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>The <code>t3.medium</code> and <code>t3.large</code> variants use Intel processors instead of AMD, but are otherwise identical to the <code>t3a</code> variants.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Always with the certificates, geez.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>I mentioned this 
on Mastodon last week, but the more I learn about Kustomize, the more I don&#8217;t understand why Helm is so popular. Kustomize does everything that Helm does, except better. And it&#8217;s built right in to <code>kubectl</code>! The docs are significantly worse, though, which is probably a big part of the reason why it&#8217;s not used more widely.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>I mentioned earlier in the post that setting a stable node name for the k3s node caused problems down the line, well, this is one of those places: when the node termination handler activates, it cordons the node, preventing any new pods from scheduling on it. Then the old node goes away and the new node comes up, and k3s looks in its database to see that &#8220;oh huh this node is cordoned, better not let any pods schedule on it!&#8221; So this is how we ended up in the current state. As I&#8217;m writing this, it occurs to me that instead of having my k3s post-start systemd hook clean up the old nodes, I could instead have it just un-cordon the current node; but there were a number of other similar &#8220;stale state&#8221; issues that I was working around with the stable node name that I <em>think</em> the current solution is slightly better. 
We actually could still end up in a similar situation because AWS uses the node IP address as the hostname, and sometimes it will re-use the same IP address for two subsequently-launched nodes; I&#8217;ve seen this happen for large-scale clusters, but I think in my single-node cluster this circumstance seems unlikely.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>There is a slight issue that I haven&#8217;t figured out how to resolve yet, which is that the Prosody process will need to restart/reload its configuration once it gets the new certificates, but that&#8217;s a problem for 90 days from now, at which point I will have forgotten all of the details about all of this and will therefore spend a day and a half trying to understand why my chat server isn&#8217;t working anymore.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>&#8220;Service&#8221; is probably the least-well-thought-out name in the entire Kubernetes ecosystem. Literally everyone uses &#8220;service&#8221; to refer to an application that is running out there somewhere in the ether. 
Kubernetes, however, uses &#8220;service&#8221; to mean very specifically &#8220;the networking configuration that allows other things to talk to the application.&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p>I am <em>slightly</em> lying here; I haven&#8217;t fully hooked up all the tooling in each of these repos to apply everything in a single command, so for isengard I would need to apply the bastion configuration and the k3s configuration separately. In mirkwood, I need to <code>just apply</code> each application separately. But don&#8217;t worry! I&#8217;m getting there.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-17" href="#footnote-anchor-17" class="footnote-number" contenteditable="false" target="_self">17</a><div class="footnote-content"><p>Also slightly lying here, I haven&#8217;t set up GitHub Actions for mirkwood yet.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-18" href="#footnote-anchor-18" class="footnote-number" contenteditable="false" target="_self">18</a><div class="footnote-content"><p>Eventually I&#8217;d like to get all three of these repos using KMS instead of relying on local credentials, but, well, there&#8217;s always more work to do, amirite?</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Three subtle Kubernetes issues I've seen in the last week]]></title><description><![CDATA[Well I was going to use this post to provide a follow-up to the hot garbage from a few weeks ago, wherein I complain about my efforts to stand up an internal ACRL Kubernetes cluster and run something useful on it, but that post is going to have to wait because I&#8217;ve spent an inordinate amount of time over the last week debugging three separate, extremely subtle, Kubernetes 
issues.]]></description><link>https://blog.appliedcomputing.io/p/three-subtle-kubernetes-issues-ive</link><guid isPermaLink="false">https://blog.appliedcomputing.io/p/three-subtle-kubernetes-issues-ive</guid><dc:creator><![CDATA[drmorr]]></dc:creator><pubDate>Sat, 07 Mar 2026 00:30:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!M8wv!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9abdbcc8-dc09-4e12-8fc3-348cb9c2691e_518x518.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Well I was <em>going</em> to use this post to provide a follow-up to the <a href="https://blog.appliedcomputing.io/p/what-to-expect-when-youre-expecting">hot garbage</a> from a few weeks ago, wherein I complain about my efforts to stand up an internal ACRL Kubernetes cluster and run something useful on it, but that post is going to have to wait because I&#8217;ve spent an inordinate amount of time over the last week debugging three separate, extremely subtle, Kubernetes issues. So I&#8217;m going to use this post to complain about that instead! When it rains, it storms, I suppose.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.appliedcomputing.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Applied Computing Research Labs is trying to <a href="https://blog.appliedcomputing.io/p/quick-update-about-the-blog">increase our paid subscriber numbers</a>.  
If you like what you&#8217;re reading here, would you consider subscribing below?</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Issue #1: OOMs on First?</h2><p>This first issue actually sparked a <a href="https://hachyderm.io/@drmorr/116134479810826760">long rant</a> on Mastodon, which I started off by saying,</p><blockquote><p>Sometimes problems are hard to solve because of something inherent in the problem domain that makes it hard. Other times, the problems are hard because we make it hard all on our own. This tale is one of the latter ones.</p></blockquote><p>And it&#8217;s true, I stand by every word of that post. And since not all of you read Mastodon, and I&#8217;m still kindof upset, I figured I&#8217;d recap the issue here.</p><p>The problem that we&#8217;re trying to solve here is, how frequently is your pod or application running out of memory (commonly called an OOM)? This seems like a useful thing that you might want to know, because it might, say, indicate a memory leak, an increase in data or requests, or even just misconfigured resource requests. Maybe you&#8217;d want to know if this is happening several times a week, or day, or hour. Anyways, Kubernetes exposes a whole host of metrics from basically every single part of the system, and if you install <a href="https://github.com/kubernetes/kube-state-metrics">kube-state-metrics</a> (KSM) you can get even more! Surely among those hundreds of thousands of metrics, there&#8217;s one that tells you how often your applications run out of memory?</p><p>Wrong! 
There&#8217;s not one, there&#8217;s actually <em>four<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></em>. They are:</p><ol><li><p><code>kube_pod_container_status_terminated_reason</code>: this one comes from KSM and is a binary value that is one if the container is terminated because it was out of memory, and zero otherwise.</p></li><li><p><code>kube_pod_container_status_last_terminated_reason</code>: this one also comes from KSM and is a binary value that indicates why the given container terminated last, including if it OOMed.</p></li><li><p><code>container_oom_events_total</code>: this metric comes from the Linux kernel, and reports the number of times that a container has run out of memory.</p></li><li><p>The Kubernetes event stream: this is sortof cheating, because it&#8217;s not a metric exactly, but Kubernetes emits an event any time a container runs out of memory, and that event stream can be queried and stuffed into some other database somewhere else down the line if you want.</p></li></ol><p>This all seems great and stuff, maybe there are a few too many metrics here, but we can just pick one and move on with our lives, right? Wrong<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>!</p><p>It turns out that none of those four metrics give you the information that you want, and thus there is actually no way&#8212;in Kubernetes&#8212;to tell how frequently your pods are OOMing. &#128534;&#128534;&#128534;</p><p>I go into details in the Mastodon thread, but let&#8217;s break it down real quick here, too. Metrics (1) and (2) are both gauges, not counters, which means they can go up and down. The first metric goes back to zero whenever the container state changes<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. 
The second goes back to zero whenever the container terminates for a second time for some different reason. That means you can&#8217;t just slap a <code>delta</code> function on it and call it a day, because sometimes that delta is negative! The third metric actually doesn&#8217;t work at all. It just always emits zero. There&#8217;s a <a href="https://github.com/google/cadvisor/issues/3015">GitHub issue</a> about this where the answer is &#8220;welp, tough luck&#8221;. And to round it out, the last &#8220;metric&#8221; (event) doesn&#8217;t tell you which pod or container is affected, it just tells you &#8220;hey something on this node OOMed, oopsies!&#8221;</p><p>So there you have it. Or rather, there you don&#8217;t have it, because it&#8217;s literally impossible<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> for no good reason.</p><h2>Issue #2: When is closing time again?</h2><p>This one&#8217;s a quickie that I learned about yesterday. Pods in Kubernetes have a <code>deletionTimestamp</code> field which indicates (as you might expect) when the pod was requested to be deleted<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. Except! That&#8217;s not actually what it indicates.</p><p>The deletion timestamp shows the time when the pod was requested to be deleted <em>plus</em> the pod&#8217;s termination grace period delay<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. Fine, OK, whatever, just subtract the termination grace period delay from the deletion timestamp and you get the time when the pod was requested to be deleted. A little bit annoying, but not a big deal. But wait! 
It gets better!</p><p>If the pod finishes all of its work and shuts down before the termination grace period expires, the deletion timestamp is updated to the <em>actual time the pod went away</em>. Hope you like time travelling, baby, cuz we&#8217;re going B2k to the F4e!</p><h2>Issue #3: When is a map not a map? When it&#8217;s been merged!</h2><p>I spent an inordinate amount of time troubleshooting this issue yesterday and today. The short version is: you&#8217;ve got some <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/">custom resource</a> in Kubernetes that creates some pods<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. Inside your custom resource, you have a pod template spec, which defines how and where that pod should be created. Inside the pod template spec, you have a node selector, which specifies which node(s) the pod is eligible to run on. The node selector is a map; it looks something like this:</p><pre><code><code>nodeSelector:
  simkube.dev/type: virtual</code></code></pre><p>This node selector is saying that the pod can only run on nodes that have the <code>simkube.dev/type=virtual</code> node label.</p><p>Now, suppose you perform the following sequence of actions:</p><ol><li><p>Change the node selector to <code>appliedcomputing.io/simkube-type: real</code> (Why are you making this change? Who knows, who cares, it&#8217;s not important right now).</p></li><li><p>Re-apply the custom resource, using&#8230;</p></li><li><p><code>kubectl apply --server-side</code><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p></li></ol><p>What is the resulting node selector on your newly-applied custom resource? Is it:</p><pre><code><code>nodeSelector:
  simkube.dev/type: virtual</code></code></pre><p>or is it</p><pre><code><code>nodeSelector:
  appliedcomputing.io/simkube-type: real</code></code></pre><p>or, lastly, is it</p><pre><code><code>nodeSelector:
  simkube.dev/type: virtual
  appliedcomputing.io/simkube-type: real</code></code></pre><p>If you guessed the last one, you&#8217;ve been doing this software thing too long, you really ought to go do something else with your life so you don&#8217;t lose what little bit of sanity you have left. But also, if you guessed the last one, you were 100% correct, thanks to the slightly counter-intuitive <a href="https://kubernetes.io/docs/reference/using-api/server-side-apply/#merge-strategy">merge semantics of server-side-apply in Kubernetes</a>. This might be a problem for you if, for example, it&#8217;s impossible for something to be virtual and real at the same time.</p><p>Fortunately, this last one is fairly easy to fix, you can just add a field to the custom resource specification that tells Kubernetes &#8220;hey don&#8217;t do this&#8221;. But if you&#8217;re not expecting it, I hope you enjoy spending a couple days of your life pondering increasingly-unlikely scenarios, like gremlins and gamma rays.</p><p>Anyways, that&#8217;s all I&#8217;ve got for this week! 
Come back next week to read more about how hard it is to actually run applications on Kubernetes.</p><p>Thanks for reading!</p><p>~drmorr</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I bet you didn&#8217;t see that coming.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>I bet you <em>did</em> see that one coming.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p><a href="https://hachyderm.io/@drmorr/116134527390208807">Except for when it doesn&#8217;t</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Obviously this statement is hyperbole. 
It&#8217;s not actually impossible, it&#8217;s just that nobody has made this extremely basic thing that probably everybody wants to do possible out-of-the-box.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Yet another thing that you as a cluster operator and/or application owner might wish to know.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>The termination grace period is used to give pods some time to cleanly shut down, handle any last requests, etc. etc. etc. before they die.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Look I&#8217;m playing fast-and-loose here, I know the custom resource is just the spec and doesn&#8217;t create any pods, it&#8217;s the controller that reads the spec and takes actions that creates pods, OK, captain pedantic?</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Some of you already know where this is going.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Frustrations with Version Control]]></title><description><![CDATA[Alright, it&#8217;s been a long time since I&#8217;ve posted a good rant on here, so buckle up buttercup.]]></description><link>https://blog.appliedcomputing.io/p/frustrations-with-version-control</link><guid 
isPermaLink="false">https://blog.appliedcomputing.io/p/frustrations-with-version-control</guid><dc:creator><![CDATA[drmorr]]></dc:creator><pubDate>Fri, 20 Feb 2026 21:01:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8kTA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb3546d-9154-4550-a402-3746b6395991_339x337.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8kTA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb3546d-9154-4550-a402-3746b6395991_339x337.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8kTA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb3546d-9154-4550-a402-3746b6395991_339x337.png 424w, https://substackcdn.com/image/fetch/$s_!8kTA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb3546d-9154-4550-a402-3746b6395991_339x337.png 848w, https://substackcdn.com/image/fetch/$s_!8kTA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb3546d-9154-4550-a402-3746b6395991_339x337.png 1272w, https://substackcdn.com/image/fetch/$s_!8kTA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb3546d-9154-4550-a402-3746b6395991_339x337.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!8kTA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb3546d-9154-4550-a402-3746b6395991_339x337.png" width="339" height="337" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3eb3546d-9154-4550-a402-3746b6395991_339x337.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:337,&quot;width&quot;:339,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:29806,&quot;alt&quot;:&quot;A pair of blue birds, with heads facing opposite directions.  The feather pattern on their bodies is reminiscent of ocean waves. &quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/188643298?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb3546d-9154-4550-a402-3746b6395991_339x337.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A pair of blue birds, with heads facing opposite directions.  The feather pattern on their bodies is reminiscent of ocean waves. " title="A pair of blue birds, with heads facing opposite directions.  The feather pattern on their bodies is reminiscent of ocean waves. 
" srcset="https://substackcdn.com/image/fetch/$s_!8kTA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb3546d-9154-4550-a402-3746b6395991_339x337.png 424w, https://substackcdn.com/image/fetch/$s_!8kTA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb3546d-9154-4550-a402-3746b6395991_339x337.png 848w, https://substackcdn.com/image/fetch/$s_!8kTA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb3546d-9154-4550-a402-3746b6395991_339x337.png 1272w, https://substackcdn.com/image/fetch/$s_!8kTA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb3546d-9154-4550-a402-3746b6395991_339x337.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">The Jujutsu (jj) logo</figcaption></figure></div><p>Alright, it&#8217;s been a long time since I&#8217;ve posted a good rant on here, so buckle up buttercup. This one&#8217;s about version control.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.appliedcomputing.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Just as a reminder, if you&#8217;d like to leave angry comments on my rants, you can always become a paid subscriber!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>VCS? CVS? WTF?</h2><p>Ok, a quick primer if you don&#8217;t do software on the regular: version control is how we keep track of dozens or hundreds of changes being made to a codebase by dozens or hundreds of engineers. It&#8217;s also, thanks to GitHub, how most open-source projects publish their work. It&#8217;s <em>also</em> also, mostly accidentally, how most software is backed up. 
The whole idea is, &#8220;we&#8217;re tired of mailing around <code>SimKube.zip.v2.final.2.final.v0.4.no_really_final_this_time<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></code>; hey, computers are good at tracking things, what if we made the computer track our computer changes?&#8221;</p><p>The first &#8220;version control system&#8221; (aka VCS) I ever interacted with was called, confusingly, CVS<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. It was (apparently) revolutionary for its time, but I only ever had bad experiences with it. See, one of the main things that version control is supposed to solve is &#8220;what happens if two (or more) people change the same file at the same time?&#8221; One could argue that dealing with conflicts like this is the sole job of version control, because if there are never any conflicts it doesn&#8217;t matter. Needless to say, CVS was not very good at dealing with conflicts.</p><p>I moved fairly quickly from CVS to Subversion (aka SVN); I still didn&#8217;t really know what I was doing, and I honestly didn&#8217;t spend much time with SVN, but I do remember that it made merge conflicts ever so slightly less painful than CVS. However, this was around the time that Git was becoming extremely popular, and once I tried Git I never looked back. The major improvement that Git made was the idea of &#8220;branches&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>: you could have multiple&#8212;dozens, hundreds, even!&#8212;of different work streams going on the same codebase all at the same time, and none of them would ever conflict with each other, because they all lived on separate branches! It was truly revolutionary in my experience. 
Branches were cheap, lightweight, and easy to navigate back and forth, and you (the user) got to <em>choose</em> when you wanted to deal with the conflicts<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. It was still a pain to deal with the conflicts, but at least you had some control over it.</p><p>Then GitHub came along, and became synonymous with Git, and that&#8217;s basically been the state of the world for the last two decades. Nevertheless, it hasn&#8217;t been all roses and daisies over here in Git-land: Git is a <em>distributed</em> version control system, which introduces certain complexities, because (who knew) distributed systems are hard<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. Also, especially in the early days, the Git user experience was&#8230; challenging. If you didn&#8217;t really understand the internals of the system, it was very easy to get your codebase into a bad state that was very difficult to recover from. There were limited affordances for new users to learn the system, and there&#8217;s honestly been endless ink spilled on that topic which I don&#8217;t need to revisit here. Suffice it to say, there&#8217;s been a <a href="https://jvns.ca/blog/2026/01/08/a-data-model-for-git/">lot of effort</a> in recent years put into making Git more approachable for newcomers, which is honestly great to see. And thanks to GitHub, basically everyone in software now interacts with Git on a daily basis, and it&#8217;s difficult to see that changing anytime soon.</p><h2>So we&#8217;re done then, right? Version control is solved. Blog post over.</h2><p>Wellllllll..... not quite.</p><p>See, Git itself is very unopinionated about how you use it. 
There are lots of ways to accomplish the same task, and teams adopting Git were left on their own to figure out what workflow worked best for them. Specifically: when you write code you usually want someone else to review that code before it &#8220;goes live&#8221; or &#8220;gets deployed&#8221; or &#8220;gets put in front of users&#8221;. Ostensibly, at least, having a second pair of eyes on your code is a good way to catch bugs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. When you have dozens or hundreds of branches floating around (because they&#8217;re free), figuring out how to review the changes on a specific branch is a little challenging. But then GitHub came along and said &#8220;The One True Way to interact with your code is through forks, merges, and pull requests!&#8221;. And ever since then, the &#8220;pull request&#8221; model is the only thing that most software engineers have ever encountered.</p><p>Now, pull requests in-and-of-themselves aren&#8217;t awful. Basically it shows a diff of changes between your branch and whatever is &#8220;in production&#8221;, and then you can leave comments on the changes and have a discussion about them. The problem is&#8230; How to put this nicely&#8230; the GitHub UI sucks. Badly. It always has, and it&#8217;s only ever getting worse<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. So while PRs themselves are fine, the experience of working with them on GitHub is unpleasant. The main issue is: someone requests that you change all your variable names arbitrarily, so then you sigh and go make the change because you&#8217;re effing tired of fighting about trivial nonsense all day every day. Then you make a new commit and give it a description like &#8220;addressing review comments&#8221;. 
Meanwhile, during the time you&#8217;ve been having this back-and-forth, 10 other people have made changes, which means you need to merge their changes back into your branch before you can get it reviewed again. And then, once you&#8217;re ready for someone else to look at your code again, nobody can tell if you actually addressed any of the comments on the PR or not, and also your commit history is filled with things like &#8220;fixed things&#8221; and &#8220;merged so-and-so&#8217;s changes back in&#8221; and &#8220;fixed more things&#8221; and &#8220;wrote a test&#8221;, etc.</p><p>The thing that is frustrating to me is that it doesn&#8217;t have to be this way! All we need is a) a way to make targeted changes to a commit without having to create a new commit, and b) a way to see what changes were made since the last time we reviewed the code. But, because of the way that GitHub PRs work, (a) is incompatible with (b). And because GitHub is ubiquitous, it&#8217;s very difficult to do anything about it.</p><h2>Enter jj stage right</h2><p>It might surprise you to learn that a number of large, prominent companies have looked at this model and said &#8220;this sucks, we&#8217;re going to do it differently.&#8221; Google, Facebook, and a number of other companies have stopped using the PR model, and instead use a concept of <a href="https://newsletter.pragmaticengineer.com/p/stacked-diffs">&#8220;stacked diffs&#8221;</a>. The short version is instead of reviewing an entire branch at a time, you can instead review each commit in isolation. When you&#8217;re working on a new code feature, you create a &#8220;stack&#8221; of changes, and each change gets reviewed independently. But&#8212;and here&#8217;s the crucial part&#8212;while your heartless engineers are ripping apart your current changes, you can keep developing new code on top of your previous changes. 
And then, when you go back to address their totally unreasonable comments, you just &#8220;drop down&#8221; to a lower level of the stack, address their changes, and then the rest of your change stack automagically adjusts itself to incorporate those changes.</p><p>It&#8217;s hard to overstate how much of a game changer this model is. People who&#8217;ve worked with stacked diffs will do almost anything to keep working with them in the future. Just like Git made branches cheap and easy, tools that support a stacked diff workflow make &#8220;editing code anywhere in your history anytime&#8221; cheap and easy. The problem is, it is extremely difficult to make this model work with GitHub.</p><p>Not for lack of trying, though: a number of tools have emerged over the years that try to blend the &#8220;stacked diffs&#8221; mindset with Git/GitHub. <a href="https://sapling-scm.com/">Sapling</a> was (one of) the first Git-compatible VCS that tried this approach; a more modern attempt is <a href="https://graphite.com/">Graphite</a>, which basically writes their own entire UI on top of the GitHub UI. I&#8217;ve tried both (along with several others) and bounced off of them pretty quickly; they&#8217;re just clunky and don&#8217;t <em>actually</em> make the experience of managing code any nicer.</p><p>BUT. In recent years, a new project called <a href="https://www.jj-vcs.dev/latest/">Jujutsu</a> (aka JJ) has come out of Google which is changing all of that. It is a completely new model for doing version control that is, nonetheless, totally Git-compatible, and it&#8217;s really gaining a lot of traction<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. JJ does two things extremely well: first, it completely rethinks the Git UI/UX to streamline common operations. 
Things that take two, three, or more commands in Git are a single, well-documented command in JJ&#8212;specifically, it makes it extremely simple to navigate to any point in your change history, make changes, and then navigate somewhere else, and have everything else auto-update. Secondly, JJ makes a &#8220;conflict&#8221; a first-class concept in the system: when you make (or merge in) changes that conflict, JJ understands what the conflict is and asks you to resolve it. However, you can defer &#8220;resolving the conflict&#8221; as long as you want; this is in contrast to Git, where if there is a conflict in your code, you must immediately drop everything and resolve the conflict right then and there before you can do anything else. Honestly, given that the sole goal of a VCS is to help manage conflicts, it&#8217;s a bit mind-blowing to me that it&#8217;s taken us this long to start treating &#8220;conflicts&#8221; as first-class citizens.</p><h2>So that&#8217;s it, right? You&#8217;re using JJ for everything now, and I can finally stop reading this dumb post?</h2><p>Well, again, no.</p><p>I&#8217;ve tried to completely switch over to JJ three separate times now, and I keep running into issues which make me switch back to Git. This isn&#8217;t a knock against the JJ team, because it&#8217;s a young, very ambitious project, and they&#8217;re very aware of all these issues, so I have a lot of optimism that these things will get fixed at some point in the future. But for right now, I&#8217;m unable to make the switch, which is doubly-frustrating, because now any time I&#8217;m doing something extra complicated with Git, I&#8217;m like &#8220;but this would be <em>so easy</em> with Jujutsu!&#8221;</p><p>For posterity, here are the three big blockers I keep encountering:</p><ol><li><p>Pre-commit hooks: I use <a href="https://pre-commit.com/">pre-commit</a> <em>extensively</em> to do code linting, formatting, and other static analysis checks. 
I also run the same pre-commit checks during CI to prevent &#8220;bad code&#8221; from accidentally getting merged. Unfortunately pre-commit hooks don&#8217;t make a lot of sense in JJ world; there are some efforts to support &#8220;pre-push&#8221; hooks instead of &#8220;pre-commit&#8221; hooks, but none of those have actually landed yet, and every time I push my code up to GitHub and then realize that I forgot to run my checks locally first and they&#8217;re all failing, I get real sad.</p></li><li><p>Branch management: while JJ itself doesn&#8217;t care about branches (called &#8220;bookmarks&#8221; in JJ-land), GitHub still cares about branches <em>A LOT</em>. But because JJ doesn&#8217;t care about branches, the tooling to keep branches in sync with GitHub is lacking. The <a href="https://shaddy.dev/notes/jj-tug/">tug alias</a> that some users have come up with does help <em>some</em>, but it doesn&#8217;t always work, and keeping your bookmarks correctly pointed at the right code is a lot of manual juggling that I find to be really disruptive. Again, I think the JJ team is aware of this and is working on it, but it&#8217;s not there yet.</p></li><li><p>Merging code: This is related to the previous point, and is the reason I bounced off JJ most recently. I normally use a &#8220;rebase&#8221; workflow on GitHub, to avoid a bunch of pointless and unhelpful &#8220;merge commits&#8221;. 
But when you rebase and merge in an external system like GitHub, JJ is not currently able to track the ways in which your code has changed, and you end up with a whole bunch of dangling references/branches that (again) you have to clean up manually<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>, which again is a bunch of frustrating busywork.</p></li></ol><p>There are some other issues that I have with JJ, mostly around the learning curve for its revset language and other configuration aliases/options<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>. But honestly, a steep learning curve doesn&#8217;t bother me that much, I can get over that. It&#8217;s the more fundamental &#8220;workflow issues&#8221; that are the bigger problem right now.</p><p>All that said, I&#8217;m <em>very</em> excited for the future of Jujutsu and I will probably continue watching its changelog and trying it out every few months until some of these issues get resolved. It really does feel like the first genuine, foundational improvement in how we do version control in twenty years.</p><p>As always, thanks for reading!</p><p>~drmorr</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Yes, I&#8217;ve done this before. It&#8217;s as miserable and awful as it sounds. 
And yes, even in the year of our lord 2026 there are still people who do this.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Confusingly, this stands for &#8220;Concurrent Versions System&#8221;, not &#8220;Control Version System&#8221; or &#8220;Consumer Value Stores&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Yes, I know SVN has branches, but they are clunky and hard to use.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Fun random aside: when I was in grad school, I would keep all my research papers in Git repositories. The &#8220;master&#8221; branch would hold the pre-print version of the paper, and then when I submitted to a journal, I would create a separate branch specific to that journal. That way I could keep all the &#8220;journal-specific&#8221; formatting requirements isolated from the &#8220;content&#8221;. Any time we got comments back on the paper, I would apply changes to the &#8220;source of truth&#8221;, aka the &#8220;master&#8221; branch, and then merge the changes into the journal-specific branch. 
It wasn&#8217;t a <em>perfect</em> system, but it worked pretty well, and was a heck of a lot easier to manage than having forty different versions of the same document for each journal we submitted to.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>TIL</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>&#8220;Ostensibly&#8221; is doing a lot of heavy lifting in that sentence. In practice, it turns out that &#8220;finding bugs by reading someone else&#8217;s crappy code&#8221; is really hard to do, so most code review sessions devolve into &#8220;your lines are 82 characters long, please keep them under 76 characters to enhance legibility&#8221; or &#8220;I hate your variable names, please change them all post haste&#8221;. Needless to say, these critiques rarely find any bugs.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>I know this is going to offend many people, but it&#8217;s just because you&#8217;ve been Stockholm-syndromed into liking it. Seriously: why are there three separate tab bars? Why is information duplicated five times in different places? Why is it so damn difficult to compare two different commits? 
For that matter, why is &#8220;looking at the commit history&#8221; the least obvious link on the entire website?</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>At least in the weird niche corners of the internet that I occupy.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>I found several blog posts/discussions (for example, <a href="https://github.com/jj-vcs/jj/discussions/7848">this one</a>) that seem to indicate this is sortof a solved problem, but it definitely wasn&#8217;t working for me last week.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>There&#8217;s a whole cottage industry of &#8220;look at my extremely complicated JJ config&#8221;-style blog posts that have cropped up in the last year or so.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Quick Update about the Blog]]></title><description><![CDATA[Hi friends!]]></description><link>https://blog.appliedcomputing.io/p/quick-update-about-the-blog</link><guid isPermaLink="false">https://blog.appliedcomputing.io/p/quick-update-about-the-blog</guid><dc:creator><![CDATA[drmorr]]></dc:creator><pubDate>Tue, 17 Feb 2026 17:32:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!M8wv!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9abdbcc8-dc09-4e12-8fc3-348cb9c2691e_518x518.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi friends!  
Hope you all had a great three-day weekend, for those of you in an area that got a three-day weekend!  This is just a quick update instead of a full-length post.</p><p>Let&#8217;s talk about subscribers!  As of today, there are 225 subscribers to this blog<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, and 14 of those are paid subscribers.  This is pretty great!  I&#8217;m excited that all of you are out there reading this&#8230; whatever it is, and even more excited that 14 of you are willing to financially support the work we&#8217;re doing at ACRL.</p><p>As I&#8217;ve said before, this publication doesn&#8217;t pay our salaries, but it does help cover our conference and travel budget, which gets us new clients which <em>do</em> pay our salaries.  However, managing the accounting for these paid subscriptions does take up a non-trivial amount of time, especially around tax time.  Given that we don&#8217;t have a <em>huge</em> number of paid subscribers, it has been an ever-present question in the back of my mind if it&#8217;s worth the effort<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>.</p><p>So, for the remainder of this year, I&#8217;m trying an experiment: I&#8217;d like to increase the readership of this blog <em>in general</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, and specifically, I&#8217;d like to increase the number of paid subscribers.  I feel like if we could get between 50 and 100 paid subscribers to the blog (which would bring in around $2500-$5000/year), that would be enough to make the accounting effort during tax season worth it.  
But, if we can&#8217;t get there, I don&#8217;t think it&#8217;s worth my time to manage the paid subscriptions.</p><p>So that&#8217;s my goal: for the rest of the year, I&#8217;m going to be trying a bunch of experiments with the blog to try to increase readership in general, and paid subscriptions specifically.  All the content here is still going to be released to the public on Monday afternoons, but I&#8217;m going to be experimenting with some other &#8220;paid benefits&#8221; for subscribers, and we&#8217;ll see what happens!</p><p>Also, if you&#8217;re currently one of my free subscribers, and you feel like you&#8217;ve gotten some value out of what I&#8217;ve written, would you consider upgrading to a paid subscription<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>?  You&#8217;ll get a warm fuzzy feeling in your heart, the ability to comment on my hot garbage, and maybe some other benefits in the near future.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.appliedcomputing.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.appliedcomputing.io/subscribe?"><span>Subscribe now</span></a></p><p>Thus concludes my meta-blog-intermission!  As always, thanks for reading.</p><p>~drmorr</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Substack really wants me to call it a newsletter, but eff that.  
It&#8217;s a blog.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>I also don&#8217;t really <em>love</em> the Substack platform, and would happily have an excuse to migrate off of it, but the paid subscribers feature has (thus far) been the thing that&#8217;s kept me on it.  I know <a href="https://ghost.org">Ghost</a> promises a seamless transition from Substack but&#8230; I haven&#8217;t had time yet.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>I mean, who <em>doesn&#8217;t</em> want to read about SimKube all day every day???</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>As an aside, many tech companies offer &#8220;professional development&#8221; or &#8220;educational&#8221; benefits that will reimburse you for subscriptions to publications like this!  
You could subscribe and not even have to pay for it!</p></div></div>]]></content:encoded></item><item><title><![CDATA[ACRL is a CA now]]></title><description><![CDATA[&#8220;CA&#8221; is one of the most overused acronyms in tech.]]></description><link>https://blog.appliedcomputing.io/p/acrl-is-a-ca-now</link><guid isPermaLink="false">https://blog.appliedcomputing.io/p/acrl-is-a-ca-now</guid><dc:creator><![CDATA[drmorr]]></dc:creator><pubDate>Sat, 07 Feb 2026 19:00:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!M8wv!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9abdbcc8-dc09-4e12-8fc3-348cb9c2691e_518x518.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>&#8220;CA&#8221; is one of the most overused acronyms in tech. It means, among other things, &#8220;Cluster Autoscaler&#8221; (i.e., the thing that scales your Kubernetes cluster up and down), &#8220;Corrective Action&#8221; (that is, a post-incident task you do to ensure that the incident or outage doesn&#8217;t happen again), and &#8220;Certificate Authority&#8221; (the process or system that cryptographically signs all of your TLS certificates so that all your REST traffic can be encrypted). 
That&#8217;s not to mention the more pedestrian meanings of &#8220;California&#8221; and &#8220;Canada&#8221;.</p><p>Put all this together, and things get really confusing really fast, especially when your CA engineer needs to work with his CA counterpart to write a CA because the CA couldn&#8217;t get new credentials from the CA, which is a thing I have definitely never had happen to me.</p><p>All this is to say that ACRL has been a CA for a while: we&#8217;re a California-based company that has definitely autoscaled some clusters, and given the circumstances of my departure from my previous gig, you could argue that ACRL is also a corrective action. 
But there&#8217;s one CA that we haven&#8217;t been before, which is a Certificate Authority; however, late last year that changed! In this post I&#8217;m going to talk about the why and how of it all.</p><h2>Why on earth do you need to issue certificates?</h2><p>OK, so let&#8217;s take a step back: what is a certificate<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>? I alluded to this briefly in the introduction, but a certificate is a cryptographic primitive for performing encryption and authentication. It has two parts, a &#8220;public key&#8221; and a &#8220;private key&#8221;; the public key, as you might expect, can be shared publicly, but the private key needs to be kept secret. There&#8217;s a lot of math involved which I won&#8217;t go into, but the basic idea is that you give your public key to someone else, and then you can prove to them that you are the owner of the private key by decrypting a bit of data that they encrypted with the public key. This is one of the underpinnings of the modern internet; most websites these days use public key cryptography to provide a secure connection to their site, and in fact many browsers will flash scary warnings at you if you visit a site that <em>doesn&#8217;t</em> use this security measure.</p><p>It turns out that this is just the beginning; for a long list of reasons, we&#8217;ve established a chain of trust with these certificates, so that very often you&#8217;re no longer verifying that you have the private key, you&#8217;re verifying that you have the private key <em>and</em> that private key is trusted by some third party. The third party has their own certificate, which is trusted by <em>another</em> third party, and so on and so forth. 
These third parties are called &#8220;Certificate Authorities&#8221;, and one of the reasons why they exist is to make certificate revocation easier.</p><p>Imagine, for example, that you accidentally shared your private key on GitHub. Whoops! Now anybody who happened to look at your GitHub repo while it was there has your private key, and they can pretend to be you! They can send encrypted messages as you or pretend to be you when visiting websites. You can take the private key down, but it would be <em>really great</em> if there was some way to signal to the <em>entire world</em> that if anybody ever uses that private key again, they are a bad person and should feel bad. That&#8217;s (one of) the functions of a Certificate Authority: they maintain revocation lists where you can look up and see whether a certificate is still &#8220;trusted&#8221;.</p><p>Again, all very cool technology, based on a lot of interesting math, but why is ACRL a Certificate Authority now? Well, the answer is simple: SimKube.</p><h2>Oh come on. You&#8217;re telling me your open-source Kubernetes simulator needs to be able to issue certificates?</h2><p>Well, not exactly. See, SimKube itself isn&#8217;t very useful on its own; if you wanted to, for example, <a href="https://blog.appliedcomputing.io/p/using-simkube-10-comparing-kubernetes">compare Cluster Autoscaler and Karpenter</a>, there are a lot of extra components that you might want to install to make that comparison easier, and some of those components might be hosted in a private container registry on AWS. So you need some way to give folks access to that container registry. 
Now, AWS has a way to manage permissions and authentication already: it&#8217;s called IAM (Identity and Access Management), and it&#8217;s some of the most arcane nonsense you will ever have to deal with<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>.</p><p>Fortunately for us, AWS provides a way to make it even more arcane: <a href="https://aws.amazon.com/iam/roles-anywhere/">IAM Roles Anywhere</a>.  The sales pitch for IAM Roles Anywhere is, essentially, &#8220;What if we took an incredibly complex permissions management system and hot-glued an incredibly complex cryptographic identity scheme on top?&#8221;</p><p>The nice<em> (????)</em> thing about this, and the reason why ACRL is now a CA, is that it gives us a one-step process to grant access to parts of our private AWS account. If a client needs some internal tool or component to make simulation easier, all I have to do is send them a certificate. They don&#8217;t even need to have their own AWS account! They just install the cert and they&#8217;re off and running. And then, at some later point when they no longer need access to ACRL&#8217;s AWS account, I just revoke the certificate and AWS won&#8217;t let them in anymore. Cool beans!</p><h2>I&#8217;m still not convinced that any of this is necessary, but OK, at least tell me how you did it.</h2><p>The process of setting up a certificate authority that works with AWS IAM is non-trivial; there&#8217;s quite a lot you need to think about, especially if you want to do it securely. Fortunately for us, someone else already did all the hard work! 
A German company called Q-Solution has published an <a href="https://serverlessca.com/">open-source Terraform module</a> for creating a certificate authority and easily using that authority to generate public/private keys.</p><p>The steps for setting it up were relatively straightforward<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>:</p><ol><li><p>First, I created a second AWS account for the CA; these certificates aren&#8217;t guarding anything particularly sensitive <em>right now</em>, but they definitely could in the future, and if an attacker somehow gains access to my primary AWS account, I don&#8217;t want them to be able to muck around with certificates. This is probably overkill, honestly, but as one of my trusted colleagues has repeatedly told me, this whole CA scheme is insane and ridiculous overkill, so why not go all in on it?</p></li><li><p>Once I had the second AWS account, I really wanted to make sure that I got notified whenever anybody <em>used</em> the account for anything. For this purpose, I created a CloudTrail log (you get one free!), and then set up EventBridge to send notifications from CloudTrail to the Simple Notification Service (SNS), which emails me. Note that CloudTrail is configured in the main account (aka the management account), but it actually aggregates events from all the different accounts.</p></li></ol><p>This was incredibly annoying to get working: ACRL is using AWS Single Sign-On (SSO) to access both the main account and the CA account; when you sign in to an account with AWS SSO, it logs a <code>Federate</code> event to CloudTrail. 
But also, users can access the CA account from the CLI, which doesn&#8217;t go through SSO, but instead uses a <code>GetRoleCredentials</code> call; so I needed to monitor two separate types of events. Confusingly, neither of them is an <code>AWS API Call</code>; both are instead an <code>AWS Service Event via CloudTrail</code><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. And lastly, it turns out that <code>GetRoleCredentials</code> is a read-only event, which isn&#8217;t tracked in EventBridge by default; instead, you have to turn on tracking of read-only events in EventBridge, and the process to do this is a semi-secret flag that you can only set from the command line, and not through the AWS console<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>.</p><p>Anyways, once I got all that set up, I decided I hadn&#8217;t gone deep enough down the rabbit hole, and also made this janky &#8220;login monitoring system&#8221; alert me if anyone logs in using the AWS root user. Because why not. But finally, we can move on to</p><ol start="3"><li><p>Point the certificate authority Terraform module at my brand-spanking-new AWS account, run it, and immediately get a million emails saying &#8220;Someone just logged into your AWS account!!!111!1!11one&#8221;</p></li></ol><p>The certificate authority module is pretty nice actually; when you first set it up, it runs a bunch of AWS Lambda functions to generate your root certificate and signing certificates, and then anytime I need to generate a &#8220;client&#8221; certificate, I can just use the provided Python script to do so. 
Managing the revocation list is also not too bad: any time I need to revoke a certificate, I just add its SHA to the revocation list and trigger the &#8220;revocation Lambda&#8221;, and voila, nobody can use that certificate to access my account anymore. Pretty neat!</p><h2>Huh. I guess that is kinda cool.</h2><p>Thank you. I&#8217;m glad you finally agree. Anyways, the best part of all this is that now any time anybody wants to know if ACRL is a CA, I can confidently answer &#8220;Yes&#8221;, regardless of what definition of CA they are referring to<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>.</p><p>As always, thanks for reading.</p><p>~drmorr</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Feel free to skip this section if you&#8217;re a computer security guru, because it&#8217;s full of handwaving and statements that maybe aren&#8217;t outright lies, but they&#8217;re definitely not true either.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>On the other hand, if you aren&#8217;t a security expert and want to know more about all this stuff, the <a href="https://en.wikipedia.org/wiki/Public-key_cryptography">Wikipedia article on PKI</a> isn&#8217;t a bad place to start.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Pray that you never have to.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" 
contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Even Claude and ChatGPT are bad at IAM policies, which is saying something. I&#8217;m not sure what it&#8217;s saying, but it&#8217;s something.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>I&#8217;m using &#8220;relatively straightforward&#8221; to mean the same thing as &#8220;the proof of this theorem is trivial&#8221; in your college math textbook.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>You must use this string exactly when you set it up, and if you use the wrong string or make a typo, nobody will tell you, but none of your events will get delivered.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Using the cleverly-named EventBridge parameter, <code>ENABLED_WITH_ALL_CLOUDTRAIL_MANAGEMENT_EVENTS</code>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Except for Canada, I guess. 
I don&#8217;t think ACRL will ever be Canada.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Making new versions is free, actually.]]></title><description><![CDATA[Ok, before I start off I need to acknowledge the reddit user who complained that my post last week was &#8220;hot garbage.&#8221; Thank you for your kind words!]]></description><link>https://blog.appliedcomputing.io/p/making-new-versions-is-free-actually</link><guid isPermaLink="false">https://blog.appliedcomputing.io/p/making-new-versions-is-free-actually</guid><dc:creator><![CDATA[drmorr]]></dc:creator><pubDate>Fri, 30 Jan 2026 21:00:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!M8wv!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9abdbcc8-dc09-4e12-8fc3-348cb9c2691e_518x518.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ge2O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94d379bf-f5e8-48ae-ab35-fa6adc2a0b25_421x129.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ge2O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94d379bf-f5e8-48ae-ab35-fa6adc2a0b25_421x129.png 424w, https://substackcdn.com/image/fetch/$s_!ge2O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94d379bf-f5e8-48ae-ab35-fa6adc2a0b25_421x129.png 848w, 
https://substackcdn.com/image/fetch/$s_!ge2O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94d379bf-f5e8-48ae-ab35-fa6adc2a0b25_421x129.png 1272w, https://substackcdn.com/image/fetch/$s_!ge2O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94d379bf-f5e8-48ae-ab35-fa6adc2a0b25_421x129.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ge2O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94d379bf-f5e8-48ae-ab35-fa6adc2a0b25_421x129.png" width="421" height="129" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94d379bf-f5e8-48ae-ab35-fa6adc2a0b25_421x129.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:129,&quot;width&quot;:421,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:10423,&quot;alt&quot;:&quot;A screenshot of a reddit post saying \&quot;This is hot garbage.  It's not too late to delete this.\&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/186329771?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94d379bf-f5e8-48ae-ab35-fa6adc2a0b25_421x129.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A screenshot of a reddit post saying &quot;This is hot garbage.  It's not too late to delete this.&quot;" title="A screenshot of a reddit post saying &quot;This is hot garbage.  
It's not too late to delete this.&quot;" srcset="https://substackcdn.com/image/fetch/$s_!ge2O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94d379bf-f5e8-48ae-ab35-fa6adc2a0b25_421x129.png 424w, https://substackcdn.com/image/fetch/$s_!ge2O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94d379bf-f5e8-48ae-ab35-fa6adc2a0b25_421x129.png 848w, https://substackcdn.com/image/fetch/$s_!ge2O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94d379bf-f5e8-48ae-ab35-fa6adc2a0b25_421x129.png 1272w, https://substackcdn.com/image/fetch/$s_!ge2O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94d379bf-f5e8-48ae-ab35-fa6adc2a0b25_421x129.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">This is the nicest thing anyone has ever said about my blog!</figcaption></figure></div><p>Ok, before I start off I need to acknowledge the reddit user who complained that my <a href="https://blog.appliedcomputing.io/p/what-to-expect-when-youre-expecting">post last week</a> was &#8220;hot garbage.&#8221; Thank you for your kind words! I have never aspired to produce anything else, but I do want to point out that everything I write here is 100% hand-crafted, human-written hot garbage. I put up AI artwork sometimes but the words are all real human words. 
Just in case there was any confusion about that.</p><p>Anyways, in this week&#8217;s episode of hot garbage, I want to talk about software versions, but maybe not in the way you might be expecting. There&#8217;s a well-known class of &#8220;Software Blog Post&#8221; arguing about versioning schemes: &#8220;<a href="https://semver.org">SemVer</a> sucks! Use <a href="https://calver.org">CalVer</a>!&#8221; &#8220;No, CalVer is terrible, just use WTFVer!&#8221; And etc, ad nauseam, vim vs emacs style. That&#8217;s not what this post is about. Instead I want to talk about the psychological effects of software versioning<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>.</p><h2>Releasing new versions is scary&#8230;</h2><p>There&#8217;s a running joke/meme in the Rust community that none of the crates or libraries in the Rust ecosystem will ever reach &#8220;version 1&#8221;, and will be in a weird &#8220;alpha&#8221; state in perpetuity. 
It&#8217;s kind of funny that many foundational libraries are on version 0.1884.23, but I think (based on my own observations) that&#8217;s actually a symptom of a deeper psychological issue: namely, &#8220;releasing a new software version is scary.&#8221;</p><p>I tend to believe that as a general rule, scientists and engineers want to produce <em>good</em> things that are high quality and that will make people&#8217;s lives better. And we&#8217;ve created this weird association between &#8220;putting a version tag on something&#8221; and &#8220;this thing is <em>ready to go</em>&#8221;. And I can kinda understand how we got here: back when new software was released on a physical disk that you had to go into a store and purchase<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, there was a lot of pressure to make sure that the bits contained on that physical disk were perfect, because if they weren&#8217;t it was extremely difficult to fix them after the fact<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. And even now, when we put a new version on something, we&#8217;re kind of implicitly saying &#8220;This set of bits is ready for consumption, all those new features I was working on are complete and ready for someone to use.&#8221;</p><p>And that&#8217;s kinda scary! You&#8217;re putting yourself out there in a way that feels a bit uncomfortable. What if it&#8217;s broken and busted, or there&#8217;s some bug or corner case that you didn&#8217;t think about, or, or, or&#8230;?</p><h2>&#8230;but it doesn&#8217;t have to be.</h2><p>But here&#8217;s the thing: whatever versioning scheme you&#8217;re using, the numbers are free. It&#8217;s not like we&#8217;re going to run out of numbers. If you find a bug or an issue with version X.Y.Z, you can just fix it and then release version X.Y.Z+1. 
We no longer have a software distribution problem like we used to<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. It doesn&#8217;t <em>have</em> to be scary to release a new version.</p><p>I&#8217;ve been thinking about this a lot with one of the internal tools I&#8217;m developing for use with our clients; there&#8217;s been an aggressive pace of feature development and bugfixes, and this has resulted in an aggressive set of new versions so that my clients can get access to those new features/fixes. I&#8217;ve often released several new versions in a single day, which feels&#8230; a little excessive. But at the same time, like, who cares? Number goes up. Install the new version. Life goes on.</p><p>I also ran into this with <a href="https://simkube.dev/">SimKube</a> yesterday: I released a new patch version of SimKube, version 2.4.3. It was a small <a href="https://github.com/acrlabs/simkube/commit/78127ed39109332971653023ab850b0ccf0b3444">bugfix</a> to ensure that simulated DaemonSets get scheduled on the right subset of nodes. I wrote some tests, they all passed, I released the new version, I installed it&#8230; and it was broken. UGH. I went through this whole narrative in my mind: &#8220;What kind of hack job am I? I can&#8217;t even release a new software version correctly. I know, nobody else is using this, I can just cover up how incompetent I am, I&#8217;ll just yank the version, force-push a new change that fixes the bug, and re-release 2.4.3. Nobody will know!&#8221;</p><p>But then I paused. Actually, really, who cares? Nobody is staring at my release notes going &#8220;Wwwwoowowowoww drmorr really screwed that one up!&#8221; Nobody cares. I fixed the bug, I fixed the test that should have caught the bug but didn&#8217;t because I wrote it wrong, and I released SimKube 2.4.4 an hour later. 
Problem solved, life moves on.</p><p>Anyways, the whole point of this post is that we should normalize this: version numbers are free, and we&#8217;re not going to run out. Don&#8217;t feel bad if your last version has a bug. All software has bugs. Fix the bug, make a new version, and move on with your life.</p><h2>Long list of caveats</h2><p>Because I <em>just know</em> that someone is lurking in the background trying to decide whether to sign up for a paid subscription so that they can call this post hot garbage in the comments, I do want to acknowledge that, while version numbers are free, sometimes &#8220;releasing things&#8221; <em>isn&#8217;t</em>. There are a whole bunch of settings where it&#8217;s still somewhat challenging to get your new version into the hands of your users: mobile development (or really anything that has an &#8220;app store&#8221; or &#8220;marketplace&#8221;) typically has a long lead time for releasing new things, because it has to go through some kind of third-party audit, whether that&#8217;s just &#8220;automated tests&#8221; or &#8220;human review&#8221;. And even after you&#8217;ve released something, many people can&#8217;t or won&#8217;t upgrade to the new version, so you&#8217;re kind of stuck maintaining the old thing for a long time anyways. See also: air-gapped systems, embedded firmware, stuff that is literally going to the moon, etc.</p><p>The point of this post is not to say we should just throw shitty code out there willy-nilly because you can fix it later<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. 
The point is to say, &#8220;do the best work you can, and when you think it&#8217;s ready, don&#8217;t have so much anxiety about releasing it.&#8221; I&#8217;d way rather have people using my slightly-buggy code, and have to release a new version to make it slightly less buggy, than to have nobody ever use it because I&#8217;m too scared to make the new version.</p><p>So anyways, to sum up: version numbers are free, we should do more of them. As always, thanks for reading, and tune in next week for more hot garbage from ACRL, Inc!</p><p>~drmorr</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Psychology is another subject I am 100% equipped and trained to produce endless hot garbage about.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>P.S. The Wikipedia entry on <a href="https://en.wikipedia.org/wiki/Software_versioning">software versioning</a> is a fascinating read if you care about this stuff.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>We definitely never acquired software any other way back then, nosiree.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>It&#8217;s a little-known fact that back in the 90s, most tech support people were given a pair of tweezers and a tiny magnifying glass so that they could pick up the physical bits on a floppy disk and move them into the correct place. 
This was obviously a time-consuming and expensive process, so some engineers would instead try to speed up the process by smashing the bits into place with a tiny hammer. This is where the term &#8220;<a href="https://en.wikipedia.org/wiki/Bit_banging">bit-banging</a>&#8221; comes from.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Long list of caveats applies, see the section titled &#8220;Long list of Caveats&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>This sentence is about AI.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[What to expect when you're expecting (a Kubernete)]]></title><description><![CDATA[Hello!]]></description><link>https://blog.appliedcomputing.io/p/what-to-expect-when-youre-expecting</link><guid isPermaLink="false">https://blog.appliedcomputing.io/p/what-to-expect-when-youre-expecting</guid><dc:creator><![CDATA[drmorr]]></dc:creator><pubDate>Fri, 23 Jan 2026 21:00:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!3E5h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd82ef87e-7e75-491c-aa16-830f0044e0d2_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3E5h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd82ef87e-7e75-491c-aa16-830f0044e0d2_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!3E5h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd82ef87e-7e75-491c-aa16-830f0044e0d2_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!3E5h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd82ef87e-7e75-491c-aa16-830f0044e0d2_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!3E5h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd82ef87e-7e75-491c-aa16-830f0044e0d2_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!3E5h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd82ef87e-7e75-491c-aa16-830f0044e0d2_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3E5h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd82ef87e-7e75-491c-aa16-830f0044e0d2_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d82ef87e-7e75-491c-aa16-830f0044e0d2_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2956887,&quot;alt&quot;:&quot;Dwarves mining in a deep cave. the dwarves all have manic, crazy expressions on their faces. They are mining gems in the shape of the kubernetes logo. 
In the deep background of the cave is a shadowy, demonic figure surrounded by fire.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/185563985?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd82ef87e-7e75-491c-aa16-830f0044e0d2_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Dwarves mining in a deep cave. the dwarves all have manic, crazy expressions on their faces. They are mining gems in the shape of the kubernetes logo. In the deep background of the cave is a shadowy, demonic figure surrounded by fire." title="Dwarves mining in a deep cave. the dwarves all have manic, crazy expressions on their faces. They are mining gems in the shape of the kubernetes logo. In the deep background of the cave is a shadowy, demonic figure surrounded by fire." 
srcset="https://substackcdn.com/image/fetch/$s_!3E5h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd82ef87e-7e75-491c-aa16-830f0044e0d2_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!3E5h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd82ef87e-7e75-491c-aa16-830f0044e0d2_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!3E5h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd82ef87e-7e75-491c-aa16-830f0044e0d2_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!3E5h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd82ef87e-7e75-491c-aa16-830f0044e0d2_1536x1024.png 1456w" sizes="100vw" 
fetchpriority="high"></picture></div></a><figcaption class="image-caption">Have we delved too greedily and too deep? No, it's the children who are wrong. Generated by ChatGPT.</figcaption></figure></div><p>Hello! Hope you&#8217;re all having a great start to your new year. As I hinted in my <a href="https://blog.appliedcomputing.io/p/acrl-warpped-3">recap post</a>, there are a lot of exciting things happening over here. In this post, I want to talk about one of them: a fun project I worked on over the holidays<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> and in the early part of the year to create ACRL&#8217;s very first permanently-running Kubernetes cluster<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>!</p><p>I&#8217;m expecting you all are having two distinct reactions to the above statement: a) &#8220;You&#8217;ve been in business for 2.5 years as Kubernetes experts and you don&#8217;t even have a Kubernetes cluster? What kind of hackjobs are you?&#8221;, or b) &#8220;Don&#8217;t you barely even do any real work? 
Why on earth would you need a Kubernetes cluster?&#8221;</p><p>My answers to those questions are &#8220;The best kind,&#8221; and &#8220;We were getting bored over here.&#8221; Anyways, in the remainder of this post, I&#8217;m going to pull back the curtain on some of our internal infrastructure and talk about how we got here.</p><h2>No, but really&#8211;why do you need a Kubernetes cluster?</h2><p>There&#8217;s an oft-repeated adage of sorts in the industry that &#8220;most companies&#8221; don&#8217;t actually need Kubernetes, and that trying to run a k8s cluster just adds a bunch of complexity that will distract from your core business offering when you&#8217;re small. I actually don&#8217;t totally disagree with this viewpoint, and have repeated it myself from time to time, but there are two specific reasons why I decided to ignore this advice for ACRL:</p><ol><li><p>We actually <em>are</em> supposed to be Kubernetes experts, and it is a good use of our time to make sure we stay up-to-date with how to run and operate the technology we are supposedly expert at.</p></li><li><p>As we&#8217;ve grown, I&#8217;ve built up a lot of &#8220;bespoke&#8221; services, and there are a bunch more that I <em>want</em> to run: for example, my website exists as a janky mess of Docker Compose files, and it&#8217;s always a little nerve-wracking to make changes to it. I also want to start hosting things like &#8220;an internal message/chat app&#8221; or &#8220;<a href="https://docs.victoriametrics.com/victorialogs/">VictoriaLogs</a>, because we already generate more logs than is humanly-feasible to comprehend&#8221;. And what I&#8217;ve realized is that I can have <em>N</em> sources of complexity to manage each of these <em>N</em> services/applications in slightly different bespoke ways, or I can have 1 source of complexity that manages <em>N</em> services for me in a uniform way. 
It&#8217;s the classic &#8220;<a href="https://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/">pets vs cattle</a>&#8221; problem; and while ACRL is admittedly still quite small, it turns out that it doesn&#8217;t take a very large number of pets before you have too many pets. Kubernetes provides a well-supported platform for turning all of your dogs, cats, fish, and turtles into cows<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p></li></ol><p>So, given those two things, I&#8217;ve wanted to get this cluster up and running for a while, and finally decided it was time to take the plunge.</p><h2>So what&#8217;s your architecture, bro?</h2><p>Now, I knew going into this that running a Kubernetes is not for the faint of heart. It <em>does</em> have a huge amount of complexity, and it&#8217;s almost guaranteed to cost you an arm and a leg<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. So my goal when setting the cluster up was to do it as cheaply and with as little human intervention needed as possible. In this section we&#8217;ll talk about my architectural choices so that everybody on the Orange Site can mock and ridicule me for doing things the wrongest possible way.</p><h3>What Kubernetes distribution?</h3><p>There are a <em>LOT</em> of different options for &#8220;how to run Kubernetes&#8221; ranging from &#8220;just do the steps in <a href="https://github.com/kelseyhightower/kubernetes-the-hard-way">Kelsey&#8217;s book</a>, don&#8217;t worry, you&#8217;ll have a working cluster in 99 years&#8221; all the way to &#8220;just pay out the nose to run on EKS&#8221;. 
I landed somewhere in the middle: I don&#8217;t want to pay out the nose to AWS, and I want/need access to the control plane (which you don&#8217;t get with EKS), but I also want something that does &#8220;most&#8221; of the work for you. For this purpose I elected to use <a href="https://k3s.io/">k3s</a>, which is a lightweight Kubernetes distribution that bundles all of the components (API server, controller-manager, scheduler, kubelet) into a single statically-linked binary that you can &#8220;just run&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.</p><p>Using k3s makes setting up the cluster itself extremely easy: ship a binary somewhere and run it, done. The default configuration is even pretty sane out of the box<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>! The bigger question is how to set it up in a persistent, reliable, automated, and inexpensive way. And it turns out that <em>this</em> is where all the complexity lies.</p><h3>A descent into Moria</h3><p>Those of you who know me know that I&#8217;m a <em>huge</em> Lord of the Rings geek. So when I initially set up all my internal infrastructure tooling at ACRL, I didn&#8217;t have to reach far to come up with names for my repositories. We have two internal repos for doing &#8220;infrastructure as code&#8221; and &#8220;configuration management&#8221;. The first repo is named &#8220;moria&#8221;, and handles all of our IaC needs using <a href="https://www.pulumi.com/">Pulumi</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. Moria is where all of our AWS config is managed: S3 buckets, EC2 instances, etc. 
The second repo, named &#8220;isengard&#8221;, uses <a href="https://ansible.com/">Ansible</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> for configuration management<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>: in other words, what software, tools, and configuration files should be installed on my hosts.</p><p>So, when I started getting k3s set up, the path seemed<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> pretty straightforward: launch an EC2 instance in moria, and install k3s on it via isengard. The first obstacle appeared when I started looking at costs: running a single EC2 instance with 2 CPUs and 8GiB of RAM would cost me ballpark $60/month (and that doesn&#8217;t include storage). And I want (eventually) a whole cluster of these things! So I made the design decision to run <em>the entire cluster</em> (including the control plane) on spot instances<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>. Doing this lets me run a single node for ~$10/month (depending on spot price fluctuations). That&#8217;s much better!</p><p>The only problem is that spot instances can be taken away literally at any time, and I want my cluster to be (somewhat) resilient to disruption. Obviously the &#8220;best&#8221; way to do this would be to run etcd, a distributed datastore that is designed for resilience, but that would add a whole bunch of complexity and expense that I&#8217;m not ready for yet, so I took a middle ground: the k3s data volume would be a persistent EBS volume that gets automatically re-attached any time the instance restarts. So add another line of code to Pulumi and off we go!</p><p>Oops! 
Turns out that storage is expensive. A 100GB EBS volume (probably the minimum of what would be &#8220;acceptable&#8221; for this use case, once I get things actually running on the cluster) costs $8/month, and a 30GB root volume adds another $2.40. Welp, my costs just doubled. <em>ALSO</em>, I just made life harder for myself, because there&#8217;s no built-in way to re-attach EBS volumes to hosts in the event of disruption.</p><p>No big deal though, we can just write some more isengard code to handle this. The way it works is that each k3s host is tagged with the ID of the EBS volume that &#8220;belongs&#8221; to it; a couple of systemd scripts run on instance startup to look up the tag, attach the volume, format it if necessary, mount it in the right place, and then start k3s. Easy peasy<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>!</p><p>The next problem we need to deal with is that the API servers need to have a stable network address if we&#8217;re planning to access them from anywhere &#8220;external&#8221;. My first thought was &#8220;DNS is for lusers, and also <a href="https://imgur.com/eAwdKEC">it&#8217;s always DNS</a>&#8221;, so I tried to just assign a static private IP address to the control plane instance. However, because we&#8217;re running the control plane node as a spot instance we have to stick it inside an ASG (Auto Scaling group) if we want the instance to automatically re-create itself on disruption, which means that&#8212;even though the ASG has a max size of 1&#8212;we can&#8217;t assign a static private IP address to it. So, back to DNS it is.</p><p>But now we have a problem: how do we update the A record for the instance? Eh, no big deal, we&#8217;ll just throw more money at AWS and use an internal hosted zone in Route53, and create an ASG lifecycle hook and a Lambda function to update the A record on instance launch. 
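</p><p>A minimal sketch of what such a Lambda function might look like, assuming the lifecycle hook reaches the function as an EventBridge event. This is an illustration and not the code ACRL runs: the hosted zone ID, the record name, and the 60-second TTL are all placeholders.</p><pre><code>
```python
import os

# Placeholder values: the real zone ID and record name are not given in this post.
HOSTED_ZONE_ID = os.environ.get("HOSTED_ZONE_ID", "Z0000000EXAMPLE")
RECORD_NAME = os.environ.get("RECORD_NAME", "k3s.internal.example.")


def build_change_batch(record_name, private_ip, ttl=60):
    """Build the Route53 ChangeBatch that points record_name at private_ip."""
    return {
        "Comment": "k3s control plane instance launched",
        "Changes": [
            {
                # UPSERT creates the record if it is missing and overwrites it otherwise.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": private_ip}],
                },
            }
        ],
    }


def handler(event, context):
    """Lambda entry point for an instance-launching lifecycle hook event."""
    import boto3  # imported lazily so build_change_batch can be tested offline

    detail = event["detail"]
    instance_id = detail["EC2InstanceId"]

    # Look up the private IP of the instance that just launched.
    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    private_ip = reservations[0]["Instances"][0]["PrivateIpAddress"]

    # Point the A record in the internal hosted zone at the new instance.
    boto3.client("route53").change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch=build_change_batch(RECORD_NAME, private_ip),
    )

    # Tell the ASG it can finish launching the instance.
    boto3.client("autoscaling").complete_lifecycle_action(
        LifecycleHookName=detail["LifecycleHookName"],
        AutoScalingGroupName=detail["AutoScalingGroupName"],
        LifecycleActionToken=detail["LifecycleActionToken"],
        LifecycleActionResult="CONTINUE",
    )
```
</code></pre><p>Keeping the change-batch construction in its own function means that part can be checked without touching AWS at all.</p><p>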
Net additional cost: a few cents/month.</p><h3>AMI dreaming?</h3><p>The last step in the process is answering the question, &#8220;how do we get k3s re-installed on the nodes when they&#8217;re disrupted?&#8221; The na&#239;ve solution is &#8220;just re-run the Ansible playbook on boot&#8221;, but this is obnoxious because a) it uses up a bunch of CPU credits for my burstable EC2 instance, and b) it makes the control plane take even longer to become available after an interruption. So the solution here is to use <a href="https://developer.hashicorp.com/packer">Packer</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a> to bake an AMI<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a> with all the required software packages installed. This part actually <em>was</em> pretty easy, thanks to all the hard work Ian has been doing to <a href="https://blog.appliedcomputing.io/p/postmortem-intermittent-failure-in">create a SimKube AMI for GitHub Actions</a>. It took a half-hour or so to modify the AMI baking pipeline to make it more generic, and now we&#8217;re baking an AMI for k3s as well! We do this weekly, so we can pick up security updates and other package updates to the underlying OS.</p><p>Of course, we don&#8217;t have anything to actually <em>clean up</em> stale/old AMIs, and it turns out that these snapshots are <em>also</em> kind of expensive, so currently I log into our AWS console every couple of weeks and delete all the old ones, a process which is completely sustainable from now until the end of eternity. Net additional cost: another $5/month or so, probably.</p><p>We <em>do</em> want the k3s node(s) to reference the most recent AMI when new ones are launched or old ones are disrupted, and we also don&#8217;t want to have to manually update the ASG Launch Template for k3s to point to the new AMI. 
Fortunately, AWS provides a key-value store (SSM Parameter Store) that you can reference inside a launch template so that it always points to the latest AMI. All we have to do is update the value stored there whenever a new AMI bake is complete, and problem solved. Fortunately, because AWS is so benevolent, this service is free.</p><h2>Cool Kubernetes cluster, bro; what&#8217;s it do?</h2><p>So there you have it! ACRL is running a (one-node) Kubernetes cluster for around $35/month. Not too bad if I say so myself.</p><p>What? What&#8217;s that you say? Is it actually running any services or applications? Lmao, of course not. That stuff costs money! Also I got nerd-sniped this week into solving another totally different problem which will absolutely become its own blog post sometime in the future. So, yes, we are in fact spending $35/month for nothing.</p><p>Also, there&#8217;s the little, small, tiny&#8212;minuscule, really&#8212;problem that the cluster is living inside our private VPC, which means it&#8217;s not actually accessible to anybody from the outside. My current solution to this is to use SSH forwarding and a manual entry in <code>/etc/resolv.conf</code> to point to the internal VPC DNS resolver. This is&#8230; not actually sustainable, and isn&#8217;t even worthy of being called an &#8220;interesting choice&#8221;; it&#8217;s just dumb.</p><p>I <em>think</em> the solution here is to set up <a href="https://tailscale.com/">Tailscale</a>, but that&#8217;s another $6/month and a whole bunch more configuration that I don&#8217;t have any time for right now. So instead, we&#8217;re just gonna keep running a pointless Kubernetes cluster that does nothing for the foreseeable future. 
But maybe at least you can use this article to inspire some &#8220;interesting&#8221; architectural decisions of your own.</p><p>As always, thanks for reading!</p><p>~drmorr</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>In the previous two years, I spent a tremendous amount of time over the holidays preparing grant proposals, because that&#8217;s when they were due. I decided this year, based on the general state of *<em>gestures despairingly at everything</em>* that applying for grants was a waste of my time, which meant I got to do fun things like &#8220;making my monthly AWS bill go up&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Well, &#8220;permanently&#8221; running is maybe a stretch. 
My SLA for the cluster is about one &#8220;5&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>OK this analogy got weird.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>There&#8217;s a <a href="https://www.devzero.io/blog/kubernetes-is-an-economic-system-not-a-technical-one">nice post</a> from the DevZero folks talking about why this is the case.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Another interesting fact about k3s is that it doesn&#8217;t&#8212;by default&#8212;use etcd, it instead runs an embedded SQLite database. 
You can configure it to use etcd or Postgres if you want, but I do find it really fascinating that the single foundational building block of Kubernetes that supposedly enables all its fancy features (watches and updates and blah blah blah)&#8230; isn&#8217;t actually necessary.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>The only change I made was to disable Helm, because, well, fuck Helm.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>I can already hear some of my former coworkers saying &#8220;That&#8217;s an interesting choice.&#8221; My main reasons for using Pulumi were a) I&#8217;ve done a lot with Terraform and wanted to try something different, and b) I like Python better than HCL.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>&#8220;That&#8217;s an interesting choice.&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Having now used all three of the big &#8220;configuration management&#8221; tools&#8212;Puppet, Chef, and Ansible&#8212;I can now say definitively that &#8220;they all suck equally, just in different ways&#8221;. 
I picked Ansible because I like Python and Jinja templates.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Foreshadowing, anybody?</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Say it with me now: &#8220;That&#8217;s an <em>interesting</em> choice.&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>It was not, in fact, &#8220;easy peasy&#8221; &#8211; the sticking point, naturally, was AWS IAM permissions; I asked both Claude and ChatGPT to help me write the IAM policies, because I&#8217;ve done enough of that by hand for one lifetime, and I expected that to be one of the tasks that the chatbots would actually be good at, and, well, it turns out that they are not, in fact, good at it.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>Looks like I wasn&#8217;t able to escape HCL after all.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>Pronounced Ayy Emm Eye, never &#8220;ahh-mee&#8221;.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[ACRL, Warpped 3]]></title><description><![CDATA[Whale hello there, happy new 
year!]]></description><link>https://blog.appliedcomputing.io/p/acrl-warpped-3</link><guid isPermaLink="false">https://blog.appliedcomputing.io/p/acrl-warpped-3</guid><dc:creator><![CDATA[drmorr]]></dc:creator><pubDate>Fri, 16 Jan 2026 21:00:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!3tMD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e989c27-20e6-43e2-b3a3-9d80d63c9fca_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3tMD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e989c27-20e6-43e2-b3a3-9d80d63c9fca_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3tMD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e989c27-20e6-43e2-b3a3-9d80d63c9fca_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!3tMD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e989c27-20e6-43e2-b3a3-9d80d63c9fca_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!3tMD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e989c27-20e6-43e2-b3a3-9d80d63c9fca_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!3tMD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e989c27-20e6-43e2-b3a3-9d80d63c9fca_1024x1536.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!3tMD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e989c27-20e6-43e2-b3a3-9d80d63c9fca_1024x1536.png" width="1024" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6e989c27-20e6-43e2-b3a3-9d80d63c9fca_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2380126,&quot;alt&quot;:&quot;The New York City LOVE sculpture but instead it says ACRL with a crooked C&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/184796656?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e989c27-20e6-43e2-b3a3-9d80d63c9fca_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The New York City LOVE sculpture but instead it says ACRL with a crooked C" title="The New York City LOVE sculpture but instead it says ACRL with a crooked C" srcset="https://substackcdn.com/image/fetch/$s_!3tMD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e989c27-20e6-43e2-b3a3-9d80d63c9fca_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!3tMD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e989c27-20e6-43e2-b3a3-9d80d63c9fca_1024x1536.png 848w, 
https://substackcdn.com/image/fetch/$s_!3tMD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e989c27-20e6-43e2-b3a3-9d80d63c9fca_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!3tMD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e989c27-20e6-43e2-b3a3-9d80d63c9fca_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Generated with ChatGPT with the prompt &#8220;please generate an image of the letters ACRL in the style of the 
LOVE new york city sculpture&#8221;</figcaption></figure></div><p>Whale hello there, happy new year! I hope you all had a good holiday season! It&#8217;s been a minute since I&#8217;ve written anything here&#8212;I&#8217;ve been meaning to write this post ever since, well, before Christmas, but I had a bunch of travel and illness at the end of last year, and it&#8217;s taken me&#8230; several weeks, lol, to get back into the swing of things in the new year. In any event, we&#8217;re back now! As is tradition, I want to take this post to do a recap of last year&#8217;s significant accomplishments and events, and then briefly discuss my goals and plans for the upcoming year.</p><p>Before jumping in, I do want to briefly acknowledge something: the world is a fucking shitshow right now. I don&#8217;t talk about it <em>much</em> on my blog, but things are hard out there, and I do really hope you are all taking care of each other. And, while I am personally optimistic about ACRL, I am&#8230; less optimistic about other things. It&#8217;s a difficult balancing act to hold both of those things at the same time, but I hope that the work that we do at ACRL is, in some small tiny way, an act of resistance against all of the bad things out there.</p><p>Anyways I don&#8217;t know how to segue from that, so I&#8217;m just going to do the world&#8217;s most awkward transition: anyways!</p><h2>2025 by the numbers</h2><p>As always, we&#8217;ll start off by looking at a bunch of statistics and numbers that are, ultimately, meaningless.</p><ul><li><p>The big news of 2025 was that the company <a href="https://blog.appliedcomputing.io/p/now-there-are-two-of-them?r=2knfnl">doubled in size</a>! Ian has been doing a fantastic job, and I&#8217;m very happy to have him on the team. Looking forward to lots more good stuff this year!</p></li><li><p>I only had 26 meetings in 2025 (where I remembered to take notes). 
That&#8217;s a 17% decline in meetings from last year, which I feel like is a net improvement in things that could have been emails (or naps).</p></li><li><p>I completed 296 issues on my task list! That&#8217;s, like, 5 tasks a week, or just about one per day! If you want to work with somebody who is really good at creating tasks and then crossing them off, I&#8217;m ur guy! In a (slightly) more serious vein, in <a href="https://blog.appliedcomputing.io/p/2024-warpped?r=2knfnl">last year&#8217;s recap</a> I mentioned needing a better routine for task tracking. Early in the year, I switched to <a href="https://linear.app/">Linear</a> for task tracking, and I&#8217;ve been very happy with it. It&#8217;s pretty lightweight, easy to use, has some nice GitHub integrations, and is fairly inexpensive. It also has a good set of keyboard shortcuts, which I&#8217;m still learning but which make your life easier once you get the hang of them.</p></li><li><p>In terms of code written, I only made 483 contributions on GitHub. This is slightly lower than in 2024 (which was, in turn, slightly lower than 2023). If we extrapolate this out, this means I will stop writing code sometime in 2030. Presumably this means I will have struck it rich and can retire. Thanks, linear algebra!</p></li><li><p>We can also look at &#8220;lines of code written&#8221;, which we all know is the best metric for software engineering, and is totally un-gameable<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. Looking at my major repos, we have a total &#8220;source lines of code&#8221; count of 107,630. This is 90k more lines of code written in 2025! According to my estimates, this cost about <strong>$3 million</strong> to develop, took <strong>75 months</strong>, and required <strong>30 people</strong>. And this doesn&#8217;t even count work that I did for my clients! 
Not bad for a couple o&#8217; hacks<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. If you want to hire somebody who knows how to write lines of code, <a href="https://appliedcomputing.io/contact">get in touch!</a></p></li><li><p>The above two statistics don&#8217;t include any of the private work I did for clients in 2025; I had two clients that I was working with, and that kept me <em>very</em> busy throughout the year. Honestly it&#8217;s a miracle I managed to get any other code written at all, lol. I did have the opportunity to do some <a href="https://blog.appliedcomputing.io/p/astronomer-saving-megabux-with-sql?r=2knfnl">very successful work</a> for Astronomer, a small company you may have heard of.</p></li><li><p>By the way, just in case you were wondering, ACRL is now a Certificate Authority!</p></li><li><p>My blog was a little less active in 2025 than it was in 2024; I only wrote 20 posts last year, but I also produced a whole heck of a lot of video content<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, which is a new area for me, and is a lot harder and more time consuming than written content. Even so, I have been slowly but steadily adding new subscribers: I have 220 total subscribers, up 30 from last year! And 11 of you are paid subscribers: as always, thank you so much for your support. I don&#8217;t finance my business off of Substack, but I <em>do</em> finance my coffee addiction, and that&#8217;s not nothing. We also had two (2) posts make it big on the Orange Site: <a href="https://blog.appliedcomputing.io/p/how-to-log-in-to-ecr-from-kubernetes?r=2knfnl">How to log in to ECR from Kubernetes the right way</a> in April, and <a href="https://blog.appliedcomputing.io/p/make-the-easy-change-hard?r=2knfnl">Make the Easy Change Hard</a> in August. 
So if you want to view content that the Orange Site thinks is valuable, uh, check those two out I guess.</p></li><li><p>I&#8217;m still keeping up with the conference attendance: I was at the <a href="https://blog.appliedcomputing.io/p/what-ive-been-up-to-april-2025-edition?r=2knfnl">SoCal Linux Expo</a> in the spring, and (of course) <a href="https://blog.appliedcomputing.io/p/kubecon-recap-can-u-fit-in-the-kube?r=2knfnl">KubeCon</a> in the fall! I also made it to the <a href="https://blog.appliedcomputing.io/p/kubernetes-community-day-sf-bay-area?r=2knfnl">Kubernetes Community Days</a> event in San Francisco, which was a delightful smaller/more focused event than KubeCon is. My talks at KubeCon didn&#8217;t get accepted, but I was able to give <a href="https://youtu.be/661wqxu6DlE?si=6WQc7Hb49urI8xfd">a talk</a> as well as an <a href="https://youtu.be/yLxbfl3CAxE?si=P179PxrEcODelHpr">impromptu lightning talk</a> at <a href="https://cfp.cloud-native.rejekts.io/cloud-native-rejekts-atlanta-na-atlanta-2025/schedule/">Cloud Native Rejekts</a> before the main KubeCon event. Also at KubeCon, we sponsored the first-ever <a href="https://youtu.be/ZDlQzhAl8zI?list=PLOgtqKaB5McAOIyl18Gwh7Ks9CUWYxWwQ">&#8220;Can U Fit In The Kube?&#8221;</a> challenge, which was ridiculous and delightful and a ton of fun.</p></li><li><p>My first <a href="https://blog.appliedcomputing.io/p/contraction-hierarchies-hmc-clinic?r=2knfnl">clinic project</a> with <a href="https://hmc.edu/">Harvey Mudd College</a> wrapped up in spring of last year, and was a great experience! 
I&#8217;ve been sponsoring another project this year, which is also going very well; I have a great team that I&#8217;m working with, and I&#8217;m very excited about the project that they&#8217;re working on.</p></li><li><p>Towards the end of the year, ACRL published its first <a href="https://blog.appliedcomputing.io/p/postmortem-intermittent-failure-in?r=2knfnl">public postmortem</a> for a product that is (still) not publicly available<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. We just want to re-iterate that we&#8217;re deeply sorry for the impact that this issue caused, and to emphasize that we are making changes to prevent this from happening again.</p></li><li><p>Lastly, I just want to shout out to the ACRL BOD<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>&#8212;I&#8217;ve had an informal BOD ever since the company started, and I would not have gotten anywhere near as far along in this endeavour without them. But, last year, we took a step towards making the informal BOD (slightly) more formal: I&#8217;m now paying them for their time! I also added one additional person to the BOD, who I am extremely excited and grateful to have. Thank you so much for all your help so far.</p></li></ul><p>There were a <em>lot</em> of other things that happened in 2025, too many to write about here, but just know that no matter what pointless metrics you choose to measure things by, it was a very successful year for the business!</p><h2>Plans for 2026</h2><p>Let&#8217;s also take a brief look forward as to what&#8217;s coming in 2026! 
I&#8217;m pretty excited about the upcoming year; I think there are some great things happening, and I can&#8217;t wait to share them with you:</p><ul><li><p>SimKube development continues: if you&#8217;ve been following along, you might have noticed that &#8220;core&#8221; SimKube development has slowed down quite a bit, which honestly is a good thing. We have a core piece of software that works pretty well and has had a lot of bugs and kinks ironed out over the last year. There are a bunch of planned new features for SimKube, but the main focus for the last 6 months has been &#8220;how do we make this thing easier to use?&#8221;, and I expect this to continue into 2026. ACRL is (still) probably the only team in the world that can <em>actually</em> use SimKube effectively, and I&#8217;d love to make it so that this is not the case. To this end, we have a number of projects in flight:</p><ul><li><p>A GitHub Actions runner that you can plug into your CI pipeline so that you can run simulations of your project on production data</p></li><li><p>A scraping tool similar to <a href="https://datastrophic.io/declarative-kubernetes-cluster-emulation-with-kemu/">kemu</a> that allows you to more easily replicate or clone a production environment into a simulation environment</p></li><li><p>A variety of other quality-of-life improvements to make the system easier to work with</p></li><li><p>[STRETCH GOAL] Maybe some kind of UI? I am increasingly coming to the conclusion that to <em>really</em> make SimKube take off, it needs a way to interact that isn&#8217;t just a CLI. We need ways to inspect, modify, and extend <a href="https://blog.appliedcomputing.io/p/anatomy-of-a-trace?r=2knfnl">trace files</a>, we need controls to start, stop, and re-run simulations, and we need dashboards to see the results of simulations. All of this is <em>possible</em> right now, but having a unified interface will be a huge improvement. 
The only slightly concerning challenge with this goal is that I am not, have no interest in becoming, and will never be a front-end person, so to make this a reality I&#8217;m gonna need to find someone to do it for me.</p></li></ul></li><li><p>Client work will also be continuing in 2026: at the encouragement of my BOD, I am exploring ways to transition away from consulting work and move more towards &#8220;having a product I can sell&#8221;; this has always been the long-term goal, but I also do need ways of paying the bills. So I&#8217;m happy to say that I have one client that I&#8217;m continuing to work with in 2026, and several other potential clients in the sales pipeline! I&#8217;m hoping to have some more blog posts to share based on client work in 2026 as well.</p></li><li><p>We<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> are also spending a bunch of time ~writing a lot of YAML~ shoring up some of ACRL&#8217;s core infrastructure. It&#8217;s <em>probably</em> overkill, but I&#8217;m trying to start incorporating as many &#8220;best practices&#8221; for infrastructure management as I can; we&#8217;re using <a href="https://pulumi.com/">Pulumi</a> for our infrastructure-as-code tooling and <a href="https://docs.ansible.com/projects/ansible/latest/index.html">Ansible</a> for configuration management. On this front, ACRL is actively working towards running its very own permanent Kubernetes cluster, which I will have a very exciting blog post about in the coming weeks.</p></li><li><p>Conference attendance will continue in 2026 until morale improves. 
I&#8217;m going to be presenting at <a href="https://www.usenix.org/conference/srecon26americas">SRECon</a> in Seattle in a couple months; I&#8217;m very excited to go back to Salt Lake City for KubeCon in November, where we will be presenting &#8220;Can U Fit In The Kube 2&#8221;; and I am debating attending RustConf in Montreal for the first time! I&#8217;m also hoping that KCD SF will happen again, because I had a great time there. In any event, there will be lots of opportunities to meet up with us in 2026.</p></li><li><p>Blogging will continue in 2026 until morale improves. I am very grateful to all of you for continuing to read this publication, and I really enjoy getting to share the work that I&#8217;m doing with you all here. I&#8217;m also hoping to get Ian regularly contributing to this space in 2026, so you can get to know him a bit better as well.</p></li></ul><p>I&#8217;m sure 2026 is going to have its own twists and turns as well, but at least to start off with, that&#8217;s where we&#8217;ll be going! Follow along this year to learn how the plans worked out :)</p><h2>Warpping Up</h2><p>These wrap-up posts are always one of my favorite things to write on the blog: when you&#8217;re in the weeds as a business owner/small business person, there are a lot of hats to wear and a lot of things to do, and sometimes it can be frustrating and demoralizing that &#8220;we&#8217;re not moving faster&#8221;. So this is always a great way for me to look back at the year and go &#8220;Wow, we really did accomplish a lot!&#8221; It&#8217;s very validating to see: I&#8217;m two and a half years into this journey, and I&#8217;m so excited to see how successful it&#8217;s been so far. 
Of course anything can happen, who knows if this thing will still be around next year, blah blah blah, but I&#8217;m optimistic and excited!</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I have been using <a href="https://github.com/boyter/scc">scc</a> to count lines of code in my repos, and I&#8217;m leaving a link to it here because I use it once a year and never remember what it was called when it comes time to write the next warpped post.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>If you really want to dig into the numbers, you might ask &#8220;how many of those lines are YAML?&#8221; And the answer to that is&#8230; <a href="https://pbs.twimg.com/media/Dfwl3oSW4AING2Z.jpg">35k lines</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>By which I mean, I spent a whole heck of a lot of time producing&#8230; 6 videos. Many video. 
Much produce.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Thanks AWS.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Bored of Directors, naturally.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>And by &#8220;we&#8221; I mean &#8220;mostly Ian&#8221;.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Postmortem: Intermittent Failure in SimKube CI Runners]]></title><description><![CDATA[On Wednesday, November 26, 2025, while testing changes to ACRL&#8217;s SimKube CI Runner, an ACRL employee discovered an intermittent failure in the runner.]]></description><link>https://blog.appliedcomputing.io/p/postmortem-intermittent-failure-in</link><guid isPermaLink="false">https://blog.appliedcomputing.io/p/postmortem-intermittent-failure-in</guid><dc:creator><![CDATA[drmorr]]></dc:creator><pubDate>Fri, 05 Dec 2025 21:00:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!UvVI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff8caa4-5581-4e2b-904c-00b753c67d7a_347x1624.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UvVI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff8caa4-5581-4e2b-904c-00b753c67d7a_347x1624.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UvVI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff8caa4-5581-4e2b-904c-00b753c67d7a_347x1624.png 424w, https://substackcdn.com/image/fetch/$s_!UvVI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff8caa4-5581-4e2b-904c-00b753c67d7a_347x1624.png 848w, https://substackcdn.com/image/fetch/$s_!UvVI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff8caa4-5581-4e2b-904c-00b753c67d7a_347x1624.png 1272w, https://substackcdn.com/image/fetch/$s_!UvVI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff8caa4-5581-4e2b-904c-00b753c67d7a_347x1624.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UvVI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff8caa4-5581-4e2b-904c-00b753c67d7a_347x1624.png" width="347" height="1624" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eff8caa4-5581-4e2b-904c-00b753c67d7a_347x1624.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1624,&quot;width&quot;:347,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:221029,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/180812090?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff8caa4-5581-4e2b-904c-00b753c67d7a_347x1624.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UvVI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff8caa4-5581-4e2b-904c-00b753c67d7a_347x1624.png 424w, https://substackcdn.com/image/fetch/$s_!UvVI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff8caa4-5581-4e2b-904c-00b753c67d7a_347x1624.png 848w, https://substackcdn.com/image/fetch/$s_!UvVI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff8caa4-5581-4e2b-904c-00b753c67d7a_347x1624.png 1272w, https://substackcdn.com/image/fetch/$s_!UvVI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff8caa4-5581-4e2b-904c-00b753c67d7a_347x1624.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">We&#8217;re very sorry if your SimKube CI pipeline looked like this at some point in the last week or so.  Really, honest.</figcaption></figure></div><p>On Wednesday, November 26, 2025, while testing changes to ACRL&#8217;s SimKube CI Runner<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, an <a href="https://blog.appliedcomputing.io/p/now-there-are-two-of-them">ACRL employee</a> discovered an intermittent failure in the runner. This failure caused approximately 50% of the simulations scheduled on the runner to fail, resulting in failed actions in users&#8217; CI pipelines, which prevented new deploys of mission-critical code. 
We at ACRL take our responsibility as the world&#8217;s leading provider of Kubernetes simulation analysis very seriously, and we understand the severe impact this incident had on users of our CI runner. We deeply apologize for this incident, and are committed to taking whatever actions necessary to restore trust with our customers. In the remainder of this post we will outline the timeline of this incident, a detailed analysis of the underlying causes, and the remediation steps we have taken to prevent a recurrence of this incident.</p><h2>Timeline of events</h2><p>The aforementioned ACRL employee discovered the issue late Wednesday afternoon on the 26th. However, because the following day was Thanksgiving, the investigation was postponed until the following week under the hypothesis that it was likely a transient error, it&#8217;d probably go away if we didn&#8217;t look at it too hard, and we had a lot of Thanksgiving food to eat.</p><p>On the following Monday (December 1st), during our regularly-scheduled company all-hands, we re-triggered the CI pipeline once and it succeeded, whereupon we decided the problem had fixed itself. 
It wasn&#8217;t until Thursday, December 4th, when the incident re-occurred, that we decided to bother spending some time investigating. We then spent most of the afternoon troubleshooting until we found the inciting factors<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> and identified a series of remediations. Those fixes were published at some point later on, when we got around to it.</p><h2>Background and terminology</h2><p><a href="http://simkube.dev">SimKube</a> is ACRL&#8217;s <a href="https://blog.appliedcomputing.io/p/simkube-part-1-why-do-we-need-a-simulator">simulation environment for Kubernetes</a>. It is designed to allow organizations to study changes in their production Kubernetes clusters in a safe and isolated environment. One way of using SimKube is as a dedicated step in a CI pipeline; this enables users to check for regressions or bugs in their Kubernetes code before it is deployed.</p><p>The SimKube CI runner is published<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> as an Amazon Machine Image (AMI)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, which contains a complete SimKube environment. The runner can replay <a href="https://blog.appliedcomputing.io/p/anatomy-of-a-trace">trace files</a> contained in the codebase, and will check the outcome of the simulation to see if it&#8217;s <code>Succeeded</code> or <code>Failed</code>. The symptoms of this incident were that periodically, a simulation would report as &#8220;failed&#8221; after completing its entire run. 
The SimKube driver pod (the component responsible for running the events in the trace file) would report the following error, along with a stack trace and a panic:</p><pre><code><code>timed out deleting simulation root sk-test-sim-driver-sn295-root
</code></code></pre><p>The &#8220;simulation root&#8221; is a Kubernetes <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/">custom resource</a> which acts as a &#8220;hook&#8221; to hang all the other simulation objects off of. The simulation root exists to make for a one-step clean-up procedure: because of Kubernetes <a href="https://kubernetes.io/docs/concepts/architecture/garbage-collection/">garbage collection</a>, when the root is deleted, all objects owned by the simulation root will also be deleted.</p><h2>Detailed analysis</h2><p>The first step we took in our investigation was to study the trace file running in the simulation. This trace file (also available as an <a href="https://github.com/acrlabs/simkube/blob/main/examples/traces/cronjob.sktrace">example trace</a> in the SimKube repo) creates a single <code>CronJob</code>, lets it run for three minutes, and then deletes the <code>CronJob</code>. The <code>CronJob</code> is configured to create a new pod every minute, and the pod sleeps for 30 seconds before terminating. This trace file is used to test the pod lifecycle management features of SimKube.</p><p>We investigated the log files from all the relevant controllers, including the SimKube driver pod, the Kubernetes controller manager, and the Kubernetes API server. The results were, to use the technical terminology, extremely f*$&amp;ing weird. The SimKube driver pod had dozens of log lines which looked like the following:</p><pre><code><code>INFO mutate_pod: mutating pod (hash=10855072724872030168, seq=66) pod.namespaced_name=&quot;virtual-default/hello-simkube-29414550-tcr49&quot;
INFO mutate_pod: first time seeing pod, adding tracking annotations pod.namespaced_name=&quot;virtual-default/hello-simkube-29414550-tcr49&quot;
</code></code></pre><p>What do these lines mean? Well, the SimKube driver registers itself as a <a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#mutatingadmissionwebhook">mutating webhook</a> so that it can redirect simulated pods to the fake nodes and apply other labels and annotations to them. The <code>hello-simkube</code> pod is the one that&#8217;s owned by the simulated CronJob. What&#8217;s curious about these log lines is that they repeat over, and over, and over again, even after the CronJob object itself has been deleted! At first we thought this meant that the CronJob hadn&#8217;t actually been deleted, but after some further study we realized that the pod name was the same for every single one of these log entries: in other words, the SimKube mutating webhook was trying to mutate the same pod for 10 minutes, well after the simulation was over and everything (supposedly) had been deleted.</p><p>The next clue came from the Kubernetes controller manager logs:</p><pre><code><code> &quot;syncing orphan pod failed&quot; err=&lt;
        Pod &quot;hello-simkube-29414550-tcr49&quot; is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds`, `spec.tolerations` (only additions to existing tolerations), `spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)
        @@ -140,7 +140,9 @@
          &quot;TerminationGracePeriodSeconds&quot;: 30,
          &quot;ActiveDeadlineSeconds&quot;: null,
          &quot;DNSPolicy&quot;: &quot;ClusterFirst&quot;,
        - &quot;NodeSelector&quot;: null,
        + &quot;NodeSelector&quot;: {
        +  &quot;type&quot;: &quot;virtual&quot;
        + },
          &quot;ServiceAccountName&quot;: &quot;default&quot;,
          &quot;AutomountServiceAccountToken&quot;: null,
          &quot;NodeName&quot;: &quot;cluster-worker&quot;,
 &gt; logger=&quot;job-controller&quot; pod=&quot;virtual-default/hello-simkube-29414550-tcr49&quot;
</code></code></pre><p>This is a standard error that gets returned when something (a user, a controller, etc) tries to update a read-only field. In this case, it&#8217;s showing that something is trying to update the pod&#8217;s node selector after the pod has already been created, which is not allowed. There are two curious things to note in this log entry: first, the timestamp is after SimKube has deleted the CronJob, and it states that the pod has been orphaned, which means it&#8217;s not owned by anything. In other words, the CronJob really was deleted! Secondly, we got lucky in that some of the additional context shows that the pod has been scheduled to a node, that is, <code>cluster-worker</code>. This is not one of our simulated nodes! This is a real node! That shouldn&#8217;t happen.</p><p>The last clue came from the API server logs, where we discovered that the SimKube driver mutating webhook had been configured to fail open<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. This means that, if the webhook fails (for whatever reason), the pod object will be allowed through anyways. Specifically, we saw that the webhook was failing because of a certificate error.</p><p>The certificate error immediately cast suspicion on <a href="https://cert-manager.io/">cert-manager</a>, which is the component that manages all of the TLS certificates for SimKube. Cert-manager is quite a complex bit of machinery, but is nevertheless required because mutating webhooks <em>must</em> communicate over TLS, which means they need certificates. In SimKube, we create a self-signed certificate issuer for this purpose. Cert-manager is actually a very robust tool, and has the really nice feature that it can auto-inject certificates into your webhook configuration if you apply the <code>cert-manager.io/inject-ca-from</code> annotation, which we do in SimKube. 
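</p><p>For readers who haven&#8217;t seen it before, here is a rough sketch of what a webhook configuration using that annotation can look like, in the fail-open state described above. This is not SimKube&#8217;s actual manifest: the object names, namespace, certificate name, and path are all made up for illustration. Only the annotation key, <code>failurePolicy: Ignore</code> (the Kubernetes spelling of &#8220;fail open&#8221;), and the general shape of a <code>MutatingWebhookConfiguration</code> come from the upstream APIs.</p><pre><code><code># Illustrative sketch only; every name below is hypothetical.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: example-sim-driver-webhook            # hypothetical
  annotations:
    # The cainjector copies the CA from this Certificate (namespace/name)
    # into webhooks[].clientConfig.caBundle once it is running.
    cert-manager.io/inject-ca-from: example-namespace/example-driver-cert  # hypothetical
webhooks:
  - name: mutatepod.example.dev               # hypothetical
    admissionReviewVersions: ["v1"]
    sideEffects: None
    # Ignore = fail open: if calling the webhook errors (for example,
    # on a TLS failure), the API server admits the object unmodified.
    failurePolicy: Ignore
    clientConfig:
      service:
        name: example-sim-driver              # hypothetical
        namespace: example-namespace          # hypothetical
        path: /mutate                         # hypothetical
      # caBundle intentionally omitted; the cainjector fills it in
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]      # the incident-time subscription
        resources: ["pods"]
</code></code></pre><p>With a setup along these lines, the API server has no CA bundle to verify the webhook against until the cainjector has done its job, and in the meantime it admits pods unmodified whenever the webhook call fails.</p><p>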
Investigating the cert-manager logs, everything seemed like it was working as designed at first, until we inspected the timestamps more closely. Then these two lines stood out:</p><pre><code><code>I1204 18:29:07.814009 attempting to acquire leader lease kube-system/cert-manager-cainjector-leader-election...
I1204 18:30:11.466829 successfully acquired lease kube-system/cert-manager-cainjector-leader-election
</code></code></pre><p>By default, cert-manager, like many other components in Kubernetes, operates in a semi-<a href="https://en.wikipedia.org/wiki/High_availability">HA</a> fashion. There is one &#8220;leader&#8221; pod and a number of hot standby pods. That way, if the leader pod crashes or gets evicted, one of the standby pods can immediately take over. Kubernetes provides a <a href="https://kubernetes.io/docs/concepts/architecture/leases/#leader-election">distributed locking</a> mechanism to ensure that only one pod can be the leader at a time. Until the lease is acquired, the cert-manager pod can&#8217;t do any work. What&#8217;s interesting to note here is that it took almost a minute to acquire the lease; and moreover, the simulation start time on the runner was 18:29:41, which means that the first CronJob pod, created at 18:30:00, was created <em>before</em> the cert-manager injector could provide the SimKube mutating webhook with its certificate.</p><p>So that&#8217;s one mystery answered: if the webhook didn&#8217;t have a certificate, it can&#8217;t apply the proper node selector, and because it fails open, the pod gets scheduled onto a real Kubernetes node instead of the intended fake node. But why and how does this pod become orphaned and stick around in the cluster until the SimKube driver times out?</p><p>Now that we knew the mechanism for the failure, it was easy to develop a local reproduction: delete the cert-manager injector pod from the cluster, start a simulation, and then after the first CronJob pod was created, recreate the cert-manager injector pod. This simulates<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> the effect of the injector waiting for the lease. 
In fact, the first time we did this, we didn&#8217;t recreate the injector pod until <em>after</em> the simulated-cronjob-sleep-pod-that-got-scheduled-on-a-real-node-by-mistake<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> had finished, and in <em>this</em> case it was correctly cleaned up and the simulation finished as normal.</p><p>Repeating the test locally, we observed that the critical failure <em>only occurs</em> if the cert-manager injector pod comes up <em>while the CronJob pod is running</em>. Since we had a reliable way to reproduce the error, we decided to take a quick peek at the kubelet logs and saw this log line repeated over and over again:</p><pre><code><code>&quot;Failed to update status for pod&quot; err=&quot;failed to patch status
...
&lt;long status update message&gt;
...
for pod \&quot;virtual-default\&quot;/\&quot;hello-simkube-29414879-r22m5\&quot;:
pods \&quot;hello-simkube-29414879-r22m5\&quot; is forbidden: node \&quot;karpenter-worker\&quot; cannot update labels through pod status&quot;
</code></code></pre><p>Aha! This is the last piece of the puzzle: kubelet is <em>trying</em> to update the status of the pod to say that it&#8217;s finished running, but it can&#8217;t. The error message is slightly weird, it&#8217;s saying that kubelet is sending a modification to the pod <em>labels</em> to the pod <em>status endpoint</em>, which is forbidden because pod labels aren&#8217;t part of the pod status. What&#8217;s strange about this is, if you look at the actual update kubelet is sending, there are no label updates.</p><p>I suspect those of you who&#8217;ve written admission webhooks are nodding along by now. The flow of data looks like this:</p><pre><code><code>kubelet status update -&gt; API server -&gt; SimKube mutating webhook -&gt; 
API server -&gt; kubelet
</code></code></pre><p>In other words: because the SimKube mutating webhook was subscribed to both <code>CREATE</code> and <code>UPDATE</code> events<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>, it intercepted the kubelet&#8217;s status update, said &#8220;hey, this pod doesn&#8217;t have any of the right simulation labels or the proper node-selector on it, lemme add those!&#8221; The Kubernetes API server received the modification and said (in the logs) &#8220;Hey, you can&#8217;t add a node selector on an UPDATE!&#8221;, and said (to kubelet) &#8220;Hey, you can&#8217;t add a label from the <code>/status</code> endpoint!&#8221;, and said (to the mutating webhook) nothing<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>. Kubelet continued to retry the status update for the pod every 10 seconds until the simulation driver terminated.</p><p>Wait, but why did everything clean up after the simulation crashed? 
Well, once the simulation driver pod terminated, there was no longer a mutating webhook in place to add labels to the pods based on a status update, so the update went through, Kubernetes realized the pod had completed, and it deleted it to finish its cleanup.</p><h2>Remediation steps</h2><p>After conducting this detailed analysis, ACRL engineers identified the following remediation steps:</p><ol><li><p>Stop running cert-manager in HA mode, because our one-replica cert-manager injector pod definitely doesn&#8217;t need to be spending up to one (1) minute trying to claim a lock that nobody else is holding.</p></li><li><p>Configure the SimKube driver mutating webhook to fail closed: we basically never want a pod that is designated for a simulated node to get scheduled on a real node, because that could cause all kinds of issues.</p></li><li><p>Configure the SimKube driver mutating webhook to only listen to pod <code>CREATE</code> events, not <code>UPDATE</code> events. Once the simulated pod is running, the driver never makes any further changes, so there&#8217;s no reason to listen for updates.</p></li><li><p>Modify the SimKube simulation controller to wait for the driver pod to receive its certificate before continuing with simulation setup.</p></li><li><p>Improve our logging and metrics monitoring infrastructure so that it&#8217;s easier to identify and troubleshoot these issues in the future.</p></li></ol><p>As is common with incidents of this nature and scale, there was no single point of failure that caused the issue; had any one of these remediations been in place, the incident would not have occurred. To prevent future recurrence of this issue, and to enable defense in depth, we will prioritize getting these fixes in place at some point in the future when we feel like getting around to it.</p><h2>Conclusion</h2><p>ACRL cares strongly about the experience of the zero customers who are using this SimKube CI Runner action. 
We deeply apologize for the impact that our failure had on your CI pipelines and deploy process, and will be issuing refunds to all zero customers who tried to use our runner image during the period of this outage. Please feel free to <a href="https://appliedcomputing.io/contact/">contact our support team</a> if you have any further questions or concerns about this outage, and rest assured we will strive to do better next time.</p><p>~drmorr</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Currently available to zero customers because AWS hasn&#8217;t approved it yet.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Less-enlightened organizations than ACRL might call this a &#8220;root cause&#8221; but as we all know around here, the root cause language is actively harmful to organizations&#8217; treatment and understanding of outages, so we don&#8217;t use that terminology on this blog.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Err, it will be published. 
Sometime.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Definitely pronounced Ayyy-Emm-Eyye, not &#8220;ah-me&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>We practice blameless postmortems around here, so you&#8217;ll notice that we don&#8217;t actually include the name of the engineer who made this mind-bogglingly dumb decision two years ago.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Hehehehehheeuuuhehe</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Are you confused yet?</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Another, shall we say, <em>extremely baffling</em> decision made by the only ACRL engineer who existed at the company two years ago, but who shall remain nameless, because again, Blameless Postmortems &#8482;&#65039;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>By the way, this is one of the surprising behaviors of mutating webhooks: if the API server rejects the update for whatever reason, the webhook <em>is not 
notified</em>. It has no idea that the mutation that it tried to make just failed. The other surprising bit here is that the error message returned to the initiator of the request <em>includes</em> the mutations made by the webhook, which can result in some extremely weird and hard-to-troubleshoot spooky-action-at-a-distance issues, like in this case.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[KubeCon Recap: Can U Fit in the Kube?]]></title><description><![CDATA[KubeCon was last week!]]></description><link>https://blog.appliedcomputing.io/p/kubecon-recap-can-u-fit-in-the-kube</link><guid isPermaLink="false">https://blog.appliedcomputing.io/p/kubecon-recap-can-u-fit-in-the-kube</guid><dc:creator><![CDATA[drmorr]]></dc:creator><pubDate>Fri, 21 Nov 2025 21:00:37 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/179579459/ddadd4c07ae1560b8e6fa0866cdfb35d.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>KubeCon was last week! In past years, I&#8217;ve tried to do a big &#8220;recap&#8221; series, but that ends up being hard because it always runs into the holidays and the series tends to peter out, so this year I&#8217;m going to try to stuff everything into one post. Let&#8217;s see how it goes! Before we get started though, you should all go watch the video I posted at the top: I&#8217;ll talk more about it in a minute, but it genuinely was one of the highlights of the conference for me.</p><h2>The vibes</h2><p>It&#8217;s unfortunate that KubeCon was in Atlanta the same weekend that the US government announced disruptions to flights at major airports around the country. I was lucky in that my flight was &#8220;only&#8221; delayed 2 hours<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, but my HMC clinic students this year had their flight canceled, which was a major bummer. 
A few of them were able to make it out anyways, but it was definitely disappointing. I was having trouble telling how much the FAA disruptions impacted the conference as a whole, though&#8212;it still seemed like the event was well-attended, so aside from a lot of worrying on social media, it was still a big, loud, chaotic event just like normal.</p><p>While we&#8217;re talking about vibes, of course AI was <em>everywhere</em>. All the vendors were advertising their agentic thingmajiggywhatsit, and (at least based on the abstracts) more than half the talks were about AI. I also heard from a bunch of people that the conference felt way more &#8220;vendor-focused&#8221; than in the past. I&#8217;m not sure if I agree with this or not&#8212;KubeCon has always as far as I can remember had a veneer of vendors pretending to be open-source so they can sell you something<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>&#8212;but I did see more people <em>complaining</em> about it. The interesting technical content and hallway track conversations are still present, but I think folks are realizing that you have to work a lot harder to find it. And to be clear, <em>I&#8217;m</em> not complaining about this<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, I think this is just the natural progression for events that are as large as KubeCon and have as much money swirling around.</p><p>In terms of my personal experience, just like last year I went to very few talks, so I have a big backlog of things that I want to catch up on when the videos are posted to YouTube. Unlike last year, where I spent a lot of time on the vendor floor, this year I did the pre-game thing where I set up a ton of meetings ahead of time with folks that I wanted to talk to. 
I think this was a much more effective strategy, and I had a lot of really energizing and productive conversations with folks. It was also really fun to attend with my <a href="https://blog.appliedcomputing.io/p/now-there-are-two-of-them">new coworker</a>, we had a really good time.</p><h2>The talks</h2><p>But let&#8217;s go over the talks I <em>did</em> manage to attend: I only went to five, I think, but they were all &#128293;. I&#8217;ll give a quick overview here, in order of appearance:</p><ul><li><p>Slurm Bridge: Slurm Scheduling Superpowers in Kubernetes by Alan Mutschelknaus &amp; Tim Wickberg, SchedMD: this talk was a deeply technical talk about how to incorporate the OG distributed scheduler (aka Slurm) into Kubernetes. If you&#8217;re not familiar, Slurm is a fairly old tool used in many different high-performance computing (HPC) contexts. It is, I&#8217;d say, harder to use than Kubernetes and has an &#8220;Old Unix Sysadmin&#8221; appeal to it, but it&#8217;s also extremely good at what it does<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. To this end, there have been a lot of efforts to integrate Slurm and Kubernetes, and this talk described one such effort called SlurmBridge. The high-level framework for SlurmBridge is that workloads (Pods) are scheduled by Slurm but launched by kubelet. I don&#8217;t have a <em>ton</em> of personal experience with Slurm, but this seems like a really interesting approach and I&#8217;d love to play around with it some more!</p></li><li><p>Zero Downtime Migration of Monolith To K8s Using Sidecar and Container Lifecycle Hooks by Deepak Kosaraju &amp; James Dabbs, Procore: the next talk I went to was by a former coworker who&#8217;s now at ProCore Technologies, which is a construction management software company. 
This talk was a very hands-on, practical introduction to a problem that many companies face, namely, &#8220;We&#8217;ve got a giant monolithic code base and we want to figure out how to run it on Kubernetes&#8221;. The talk included a lot of practical advice about how to do autoscaling and configure timeouts and application behaviour to minimize downtime when Kubernetes, you know, does its Kubernetes thing. There&#8217;s also a <a href="https://github.com/deepak-kosaraju/kubecon25-zero-downtime">GitHub repo</a> where you can follow along and try out a bunch of these techniques on your own!</p></li><li><p>I&#8217;ve Got 99 Problems and They&#8217;re All Controllers by Tim Goodwin, UC Santa Cruz: you might recognize the name of this presenter, I gave a joint presentation with Tim on our <a href="https://www.youtube.com/watch?v=QcYsGytNBe8">Kompile project</a> at KubeCon last year. This year Tim was back talking about (wait for it) Kubernetes simulation! Tim is approaching this problem from a slightly different perspective than I am: instead of trying to simulate the <em>whole</em> cluster for the purposes of analyzing the control plane, Tim wants to simulate the <em>control plane</em> for the purposes of testing a single Kubernetes controller. To accomplish this, Tim built a tool called <a href="https://github.com/tgoodwin/kamera">Kamera</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> which provides a fake Kubernetes API server that you can wire into your controller(s). The fake API server can automatically produce events for your controller in any arbitrary order, and check the outputs from your controller to see if it does the right thing. In essence, it&#8217;s a big Kubernetes controller fuzzer, and it&#8217;s <em>incredibly</em> cool. 
This was one of my favorite talks from the conference and I definitely recommend that you watch it.</p></li><li><p>From Panic To Peace: Making K8s Controller Observability Suck Less by Cat Morris &amp; Derik Evangelista, Syntasso: this was another talk about building and testing Kubernetes controllers, which reviewed a lot of very practical &#8220;best practices&#8221; for how to write controllers and make them robust, observable, and debuggable. It also included a fun Shrek story throughline throughout the talk, which was a fun touch. Definitely recommend this one as well if you&#8217;re in the Kubernetes controller/operator space!</p></li><li><p>Evicted! All the Ways Kubernetes Kills Your Pods (and How To Avoid Them) by Ahmet Alp Balkan<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>, LinkedIn: this was far and away the best talk I saw at KubeCon this year. The whole premise of the talk was &#8220;Kubernetes has a lot of ways to kill your pods, and generally people don&#8217;t like it when their pods get killed, so let&#8217;s enumerate all the ways this can happen so you can be prepared.&#8221; He discussed 9 different pod eviction pathways, at least 8 of which I&#8217;ve personally experienced. He then demonstrated that none of these eviction pathways are aware of each other or play nicely together, and also that existing Kubernetes primitives for managing pod eviction are <em>WOEFULLY</em> inadequate, one of my personal pet peeves<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. If you watch one talk from this conference, make it this one. It&#8217;s so good.</p></li><li><p>Evolving Kubernetes Scheduling by Eric Tune &amp; Wojciech Tyczy&#324;ski at Google: this talk was in (almost) the last slot on Thursday afternoon, but it was still really well-attended! 
The presenters in this talk discussed the future of Kubernetes scheduling to handle the needs of batch and ML workloads running on Kubernetes. This is, as previously discussed, a challenging problem, because Kubernetes was never designed to handle these types of workloads, and is completely unaware of things like hardware/network topologies that are critically important to HPC/ML workloads. The scheduling framework is also missing many primitives that are necessary to operate in this environment. The quote from the talk that resonated the most with me is that, &#8220;The Kubernetes scheduler is designed to schedule one pod on one node at a time, and we now need it to schedule groups of pods on groups of nodes at a time.&#8221; They discussed a number of proposals and extension points that they are hoping to add to the Kubernetes scheduler to handle this use case. Personally, I am somewhat unconvinced that kube-scheduler is up to the challenge, but I&#8217;m still very interested to see where this ends up!</p></li></ul><h2>The vendor floor</h2><p>So that was it for the talks I attended! I also promised to discuss the video that I included at the beginning of this post. As you may have noticed, if you&#8217;ve been following along, I did a huge<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> <a href="https://blog.appliedcomputing.io/p/oh-we-doin-video-now-huh">video marketing campaign</a> leading up to KubeCon this year, and the <a href="https://youtube.com/shorts/ngsHBkfEoho">second-to-last video</a> was of my good friend and colleague Liz trying to climb into a tiny cardboard box to see if she could &#8220;fit in the kube&#8221;. 
This was, objectively, hilarious, and I had the objectively genius idea to see how many other people at KubeCon could &#8220;fit in the kube&#8221;.</p><p>So $20 and one extremely tiny FedEx box later, we launched the first-ever &#8220;ACRL Can U Fit In The Kube&#8221; challenge. I paid five people $50 each if they could &#8220;fit in the kube&#8221;, defined somewhat subjectively as &#8220;being able to more-or-less close the lid of the cardboard box without it ripping&#8221;. We had a <em>lot</em> of people stop by our guerilla marketing &#8220;booth&#8221;, and I handed out a bunch of SimKube business cards, and got to meet and laugh with a bunch of really cool people. This was probably the highlight of the conference for me: it was so fun to see people&#8217;s reactions and double-takes when they walked by our &#8220;booth&#8221;, and I think maybe brightened some people&#8217;s days as well in what can be a long, gruelling, exhausting week. So that&#8217;s the story of our first-ever KubeCon promotional challenge, and it will absolutely be back next year with &#8220;Can U Fit In The Kube 2&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>!!!</p><h2>Aside: ACRL video campaigns</h2><p>Also speaking of videos, I thought it would be interesting to briefly cover some stats about the marketing videos that I posted in the lead-up to KubeCon. It was definitely eye-opening for me, and a window into a whole world of advertising that I&#8217;ve heretofore never experienced. For a quick summary, we posted five videos, one per week, starting at the beginning of October. Each video I posted here, on LinkedIn, on YouTube, and on my social media sites (Hachyderm and Bluesky)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>. 
I spent $100 on LinkedIn to &#8220;boost&#8221; them for a week, and I also spent $100 on YouTube for the first one, just to compare what sort of reach and viewership we got on the two platforms. The results were somewhat surprising! Here&#8217;s a few stats:</p><ul><li><p><a href="https://youtube.com/shorts/jUE2YdUpAYQ">Video 1: &#8220;What&#8217;s SimKube, Precious???&#8221;</a> - Lord of the Rings is near and dear to my heart, so this was a very fun video to make. On LinkedIn, we got 16k &#8220;impressions&#8221;, 5000 views, 14 reactions, and 1 comment on the post. On YouTube, we got 25k views, and 43 &#8220;likes&#8221;. The big difference between YouTube and LinkedIn is the degree of targeting that you can do. LinkedIn has very fine-grained controls over your audience, in terms of who the video gets shown to, whereas YouTube was basically &#8220;are they in the US&#8221; and &#8220;what gender&#8221;. So even though YouTube got five times the views, I kindof suspect that most of the people who watched the video probably didn&#8217;t even know what Kubernetes is.</p></li><li><p><a href="https://youtu.be/CGkrE0DbH1Y">Video 2: How to pronounce skctl</a> - this was definitely my least favorite of the video series; it took a <em>long</em> time to make, and the result wasn&#8217;t quite what I&#8217;d envisioned. Nevertheless, I got a lot of good feedback from the video, and other people seemed to genuinely enjoy it. We got 17k &#8220;impressions&#8221;, 4800 views, 18 reactions, and 7 comments on the post on LinkedIn.</p></li><li><p><a href="https://youtube.com/shorts/KEjFjpHKi2M">Video 3: The Bluth family does Kubernetes</a> - This was my favorite video in the series. It was short and easy to film, and resonates with a cultural touchpoint that I think is familiar to a <em>lot</em> of people in my &#8220;target market&#8221;. Also I just love Arrested Development. 
We only got 10k &#8220;impressions&#8221; and 4000 views (possibly because I was experimenting with a smaller target audience for this post), but 22 reactions and 3 comments. The folks who did watch this video seemed to love it!</p></li><li><p><a href="https://youtube.com/shorts/BW-iYyO_QG8">Video 4: Mission Impossible XII: Upgrade Kubernetes</a> - I was <em>expecting</em> this post to be a lot more popular than it was, particularly given how painful and near-to-home the Kubernetes upgrade problem is for folks, but I actually got very little engagement with it. We had 12k impressions, but only 2500 views, 5 reactions, and no comments. I don&#8217;t know if people were just getting tired of SimKube content, or if there was some other factor outside of my control that was impacting viewership.</p></li><li><p><a href="https://youtube.com/shorts/ngsHBkfEoho">Video 5: Can you fit in the cube?</a> - Of course, the video that spawned our &#8220;challenge&#8221; at KubeCon. My friend Liz filmed it on a total whim, and (I found out later) wasn&#8217;t even intending for me to post it<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>. Nevertheless, the video was (imo) brilliant, and outperformed every other post we made, with 29k impressions, almost 9000 views, and 16 reactions&#8212;still no comments, though.</p></li><li><p><a href="https://youtube.com/shorts/ZDlQzhAl8zI">Video 6: Can U Fit In The Kube challenge recap</a> - This is the same video at the top of this post, and the &#8220;boost&#8221; campaign is still ongoing. So far, we&#8217;ve gotten 9k impressions and 1700 views, along with 5 reactions and 2 comments. It &#8220;feels like&#8221; this video hasn&#8217;t gotten as much engagement as I was expecting, but I don&#8217;t know why: maybe folks are just not on LinkedIn as much now that KubeCon is over? 
No idea.</p></li></ul><p>So anyways, that&#8217;s the summary of my KubeCon experience! It was a good conference, despite a few rough patches here and there, and I&#8217;m definitely looking forward to returning to Salt Lake next year!</p><p>As always, thanks for reading. We&#8217;ve got some banger content coming up in the next few weeks to close out the year, so follow along if you want!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.appliedcomputing.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Subscribe below for banger content.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>~drmorr</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Humorously (???), it had nothing to do with the ATC shortages, but instead was a &#8220;genuinely minor fuel leak&#8221; that they resolved by &#8220;turning the plane on and off again&#8221;. 
You cannot make this stuff up.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Present company&#8212;of course, of <em>course</em>&#8212;excluded.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>I know, shocking, right? I complain about <em>everything</em>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>I have detected some bitterness in the Slurm community that Kubernetes is so popular, and that so many people are trying to stuff HPC/ML workloads onto it, despite it being nowhere near as suitable for ML/HPC workloads as Slurm is.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Pronounced like &#8220;camera&#8221; but with a &#8216;K&#8217; for obvious reasons.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Ahmet, by the way, is the writer and maintainer of the <a href="https://github.com/ahmetb/kubectx">kubectx/kubens</a> tools, which, if you&#8217;re not using them already, you absolutely should. 
They will save you at least as much time and frustration as whatever agentic thingamajiggywhatsit you&#8217;re using today.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>I feel like I&#8217;m beating a dead horse at this point, but did you know that the core Kubernetes controllers&#8212;e.g., the Deployment controller&#8212;do not even know that PDBs exist, much less how to query or interact with them??? I understand how we got into this situation, but I also find it completely appalling that this is the situation we&#8217;re in. UGH.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Well, huge for <em>me</em>, not huge in the grand scheme of things.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>I wasn&#8217;t initially planning to go to KubeCon Amsterdam, but I kindof want to go just so we can repeat the challenge there! We&#8217;ll see if that happens or not&#8230;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>I also posted the first few videos on TikTok, partly because I was curious to see if I got any engagement there, and also because I thought it would be funny to say that ACRL has a TikTok. But after each of the first three videos got a grand total of zero (0) views, I gave up on the TikTok thing. 
Clearly the TikTok algorithm can detect elder millennials trying to pose as the cool kids.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Whoops.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Can you fit in the cube?]]></title><description><![CDATA[KubeCon is next week!]]></description><link>https://blog.appliedcomputing.io/p/can-you-fit-in-the-cube</link><guid isPermaLink="false">https://blog.appliedcomputing.io/p/can-you-fit-in-the-cube</guid><dc:creator><![CDATA[drmorr]]></dc:creator><pubDate>Thu, 06 Nov 2025 18:02:28 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/178157046/beb797ffa5f0e7016d6084f83db77a09.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>KubeCon is next week! Come find me, ask me questions about SimKube, and get a Mr. Squidler sticker!!!</p><p>I have a calendar set up here if you want to book some time: https://cal.com/drmorr/kubecon</p><p>Filmed by my Developer Relations Consultant Elizabeth Ponce and her partner!!</p>]]></content:encoded></item><item><title><![CDATA[Mission Impossible XII: Upgrade Kubernetes]]></title><description><![CDATA[Spoiler warning: using SimKube makes it easier!]]></description><link>https://blog.appliedcomputing.io/p/mission-impossible-xii-upgrade-kubernetes</link><guid isPermaLink="false">https://blog.appliedcomputing.io/p/mission-impossible-xii-upgrade-kubernetes</guid><dc:creator><![CDATA[drmorr]]></dc:creator><pubDate>Wed, 29 Oct 2025 17:46:05 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/177493754/3f3396eba17f5c15ebbff1e32d4b2cfd.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Spoiler warning: using SimKube makes it easier!  Want to know more?  
Come chat with us at KubeCon:</p><p>https://cal.com/drmorr/kubecon</p>]]></content:encoded></item><item><title><![CDATA[The Bluth family does Kubernetes]]></title><description><![CDATA[The Bluth family does Kubernetes.]]></description><link>https://blog.appliedcomputing.io/p/the-bluth-family-does-kubernetes</link><guid isPermaLink="false">https://blog.appliedcomputing.io/p/the-bluth-family-does-kubernetes</guid><dc:creator><![CDATA[drmorr]]></dc:creator><pubDate>Wed, 22 Oct 2025 17:33:06 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/176850800/58e8236f45f3c432e8f0bd4adab5158c.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>The Bluth family does Kubernetes. Filmed along with my good friend and amazing engineer, Elizabeth Ponce!  Wish your company was less like the Bluth company? Come chat with us at KubeCon to find out how!</p><p><a href="https://cal.com/drmorr/kubecon">https://cal.com/drmorr/kubecon</a></p><p>(Btw, if you&#8217;re wondering why there are so many videos here recently, you can read my <a href="https://blog.appliedcomputing.io/p/oh-we-doin-video-now-huh">recent post</a> about it, which I don&#8217;t think got an email announcement because of the AWS outage)</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.appliedcomputing.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Wanna see more dumb videos?  
Subscribe below!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Oh, we doin' video now, huh?]]></title><description><![CDATA[You may have noticed a couple things about this blog over the past couple years: a) I exclusively write long-form posts that are probably a touch too rambly, with wayyyyy too many footnotes, and b) my last two posts have both been video posts that are thinly-disguised advertisements for SimKube.]]></description><link>https://blog.appliedcomputing.io/p/oh-we-doin-video-now-huh</link><guid isPermaLink="false">https://blog.appliedcomputing.io/p/oh-we-doin-video-now-huh</guid><dc:creator><![CDATA[drmorr]]></dc:creator><pubDate>Sat, 18 Oct 2025 18:00:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!wMyv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eccd8db-991e-4545-a918-a17013b4d8b8_1024x1189.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wMyv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eccd8db-991e-4545-a918-a17013b4d8b8_1024x1189.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wMyv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eccd8db-991e-4545-a918-a17013b4d8b8_1024x1189.png 424w, 
https://substackcdn.com/image/fetch/$s_!wMyv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eccd8db-991e-4545-a918-a17013b4d8b8_1024x1189.png 848w, https://substackcdn.com/image/fetch/$s_!wMyv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eccd8db-991e-4545-a918-a17013b4d8b8_1024x1189.png 1272w, https://substackcdn.com/image/fetch/$s_!wMyv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eccd8db-991e-4545-a918-a17013b4d8b8_1024x1189.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wMyv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eccd8db-991e-4545-a918-a17013b4d8b8_1024x1189.png" width="1024" height="1189" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6eccd8db-991e-4545-a918-a17013b4d8b8_1024x1189.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1189,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2779377,&quot;alt&quot;:&quot;a vaguely-Back-to-the-Future-ish Delorean car driving into a flaming Kubernetes logo&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/176505013?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cdc8ade-fdbe-4fb6-b25e-91c12dde2c81_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="a vaguely-Back-to-the-Future-ish Delorean car driving into a flaming Kubernetes logo" title="a 
vaguely-Back-to-the-Future-ish Delorean car driving into a flaming Kubernetes logo" srcset="https://substackcdn.com/image/fetch/$s_!wMyv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eccd8db-991e-4545-a918-a17013b4d8b8_1024x1189.png 424w, https://substackcdn.com/image/fetch/$s_!wMyv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eccd8db-991e-4545-a918-a17013b4d8b8_1024x1189.png 848w, https://substackcdn.com/image/fetch/$s_!wMyv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eccd8db-991e-4545-a918-a17013b4d8b8_1024x1189.png 1272w, https://substackcdn.com/image/fetch/$s_!wMyv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eccd8db-991e-4545-a918-a17013b4d8b8_1024x1189.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" 
width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Where we&#8217;re going, we don&#8217;t need this cheap-ass AI generated image of a Delorean driving into a flaming Kubernetes logo either, but, well, here we are.  What are we gonna do?</figcaption></figure></div><p>You may have noticed a couple things about this blog over the past couple years: a) I <em>exclusively</em> write long-form posts that are probably a touch too rambly, with <em>wayyyyy</em> too many footnotes<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, and b) my last two posts have both been video posts that are thinly-disguised advertisements for SimKube.</p><p>Have no fear, gentle readers, my rambly, footnote-laden, erratically-posted long-form text content isn&#8217;t going anywhere! 
These video &#8220;things&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> are just part of my brilliant-but-subversive strategy to get many people to give me money so that I can coast comfortably for the rest of my life<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.appliedcomputing.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Want to watch more dumb videos, and/or read more rambly, footnote-laden, erratically-posted long-form text content, and/or help me coast comfortably for the rest of my life?  Subscribe below!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>In seriousness, though, despite my love of long-form text blog posts that go into way too much detail, it&#8217;s relatively well-known that a lot of the world prefers videos, and I would like to be able to reach those people. Also, KubeCon is coming up in just under a month<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> and I am trying to get as many people as possible to be aware of SimKube before KubeCon so that when I&#8217;m there I can have a solid list of folks to meet with. 
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cal.com/drmorr/kubecon&quot;,&quot;text&quot;:&quot;Schedule a meeting with me at KubeCon!&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cal.com/drmorr/kubecon"><span>Schedule a meeting with me at KubeCon!</span></a></p><p>In this post, I thought it might be interesting to briefly talk about my process with these videos.</p><h2>Videos are hard, maaaaaan</h2><p>I&#8217;ll tell you a secret: I don&#8217;t like making video content. Maybe that&#8217;s obvious. I&#8217;ve also done a decent amount of it over the years, so I know <em>JUST ENOUGH</em> to know how hard it is, and also how to make really bad videos. So for this SimKube series, I really decided I wanted to lean into the &#8220;cheap, goofy, barely-TikTok-worthy&#8221; vibe, partly because it&#8217;s easy<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>, and partly because it leans into the slightly-off-the-beaten-path model of ACRL in general. My goal has been to make videos that take no more than an hour to film, and under 2 hours to edit.</p><p>I have accumulated a fair bit of equipment over the years to help with this goal. I have a nice camera and tripod because I still occasionally do photography as a side hobby (though much less than I used to). I have a couple different microphones: one is the built-in mic for my Bluetooth headset, which does hardware noise cancellation on the mic itself<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>, and another directional mic that I picked up several years ago the last time I had to make some videos. 
And lastly, I have downloaded <a href="https://www.blackmagicdesign.com/products/davinciresolve">DaVinci Resolve</a>, a professional-grade video editor that comes with an extremely capable free/community edition that is easy to learn and use. Given that I am explicitly not trying to make professional videos here, this is perfect for my use case.</p><p>And that&#8217;s pretty much it for my equipment and setup! Now let&#8217;s talk about the process.</p><h2>Script? What script? Where we&#8217;re going, we don&#8217;t need scripts</h2><p>All of the videos that I&#8217;ve filmed thus far have basically come down to an idea in my head, a few simple notes in Notion, and then I just sit down and improv the rest. This helps me keep to the low-budget, low-time-investment process. But I do spend <em>some</em> amount of time up front thinking about who my audience is and what I want them to take away from it. Aside from just &#8220;brand awareness&#8221;, my main goals with the videos are a) to make somebody laugh, and b) to highlight some aspect of SimKube that they might not be aware of or have thought about. My <a href="https://blog.appliedcomputing.io/p/whats-simkube">first video</a> tried to answer a really basic question that someone brand new might ask: &#8220;What&#8217;s SimKube?&#8221; This led me straight to the <a href="https://www.youtube.com/watch?v=ihMMw0rnKz4">Mashed Taters meme</a> from Lord of the Rings, and since I already have a vast collection of LotR Legos, the rest of the video fell out extremely naturally.</p><p>The <a href="https://blog.appliedcomputing.io/p/how-to-pronounce-skctl">second video</a> tried, in a roundabout way, to identify one of the main problem domains that SimKube is designed to help with: scalability and capacity planning. It then quickly devolved into a wacky left-field bit about how to pronounce various CLI tools and why we&#8217;ve (collectively) decided that we should end all our CLI tools with <code>ctl</code>. 
The inspiration for this video is <a href="https://www.youtube.com/watch?v=9kaIXkImCAM">ffmpeg guy</a>, which I continue to maintain is one of the finest pieces of art and satire on the Internet. This video, unfortunately, took way too long to both film and edit, and I think it&#8217;s too long to be an effective advertisement as well<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>.</p><p>The third video (coming this Wednesday) definitely hit the sweet spot for time and complexity. A friend and I filmed it in about 45 minutes, it took just about an hour and a half to cut and edit together, and it is, objectively, hilarious. Aren&#8217;t you excited to watch it?</p><h2>Being hypocritical is fun and easy</h2><p>Once the videos are ready, I post them on a variety of social media platforms; here (obviously), YouTube, LinkedIn, and <a href="https://www.tiktok.com/@appliedcomputing0">TikTok</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. They go out on a weekly schedule; I&#8217;ve been posting them on Wednesdays at 10am. Once they&#8217;re posted, I spend $100 to boost the video on LinkedIn.</p><blockquote><p>Everybody: But David, don&#8217;t you hate advertising with a fiery burning passion?</p><p>Me: Sure do.</p><p>Everybody: Soooooo you&#8217;re paying for LinkedIn advertising because&#8230;?</p><p>Me: Because I&#8217;m a hypocrite, obviously.</p></blockquote><p>Moral/ethical qualms about advertising aside, it has been <em>fascinating</em> to peek into the world of targeted advertising. I can definitely see why people and companies dump so much money into it. 
For my first video, I <em>also</em> put $100 into a YouTube promotion, because I was curious to see how it compared to LinkedIn: the main difference is that on LinkedIn, you can do some very specific targeting for who you want to see your videos: what industry, what level of experience, geographic location, etc. etc. etc. On YouTube, I got to select &#8220;geographic location&#8221;, &#8220;gender&#8221;, and &#8220;age range&#8221;. So while I got about twice as many &#8220;impressions&#8221; on YouTube as I did on LinkedIn, I got more &#8220;watches&#8221; on LinkedIn than YouTube, and I can just about guarantee that nobody who saw the video on YouTube is going to care (or even know what the heck I&#8217;m talking about).</p><h2>The future of video at ACRL</h2><p>So anyways, that&#8217;s why you&#8217;re seeing videos show up here more regularly. The rough strategy here is to produce a bunch of silly, fun shorts in the ramp-up to KubeCon. If I get one person at KubeCon to tell me &#8220;I saw one of your videos on LinkedIn, it was hilarious&#8221;, I will consider that a win. Post-KubeCon, I am hoping to transition to some more informational video content. As much as I love my long-form blog posts, there are a lot of people out there who are going to bounce off them hard&#8212;and that&#8217;s totally fine! Different people have different preferences and learning styles, but that means that if I stick to just long-form rambling posts, I&#8217;m neglecting a potentially large audience. So while the frequency of videos will probably lessen after KubeCon, I&#8217;m still hoping to have ~1 video/month with some short informational content about SimKube and/or the work that I&#8217;m doing around here. Along with the occasional quirky Lego-filled SimKube advert, because those have been heckin&#8217; fun to make.</p><p>As always, thanks for reading (and watching)! 
Until next time,</p><p>~drmorr</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>My editor insisted that I put that in; I maintain that there is no such thing as too many footnotes.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>My editor wanted me to call them abominations, but I refused to make that change.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>My editor informs me that I&#8217;m saying the quiet part out loud and I shouldn&#8217;t do that. On the other hand, my editor is a dumb poopy-pants and I&#8217;m not listening to him anymore.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Eeeeep!!!!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Well, easier</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>This is honestly amazing: I can be sitting in an incredibly noisy coffee shop and the person on the other end won&#8217;t be able to tell at all.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" 
class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>I may also be biased against it because I strongly dislike watching videos of myself.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Yes, ACRL has a TikTok. I mostly started it because I thought it would be funny, but literally nobody&#8212;zero people!&#8212;has watched the videos on there. Don&#8217;t you feel sad? You should feel sad. You should all go watch the videos to make me feel better.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[How To Pronounce Skctl]]></title><description><![CDATA[Ever wondered how you pronounce skctl, or why we have so many ctls in the first place?]]></description><link>https://blog.appliedcomputing.io/p/how-to-pronounce-skctl</link><guid isPermaLink="false">https://blog.appliedcomputing.io/p/how-to-pronounce-skctl</guid><dc:creator><![CDATA[drmorr]]></dc:creator><pubDate>Wed, 15 Oct 2025 16:59:54 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/176251416/4cc63fe9f90b55cba436a0af71977739.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Ever wondered how you pronounce skctl, or why we have so many ctls in the first place? This video will tell you why. </p><p>Come chat with us at KubeCon in Atlanta! 
</p><p>https://cal.com/drmorr/kubecon</p>]]></content:encoded></item><item><title><![CDATA[whats simkube]]></title><description><![CDATA[Want to see more dumb videos like this?]]></description><link>https://blog.appliedcomputing.io/p/whats-simkube</link><guid isPermaLink="false">https://blog.appliedcomputing.io/p/whats-simkube</guid><dc:creator><![CDATA[drmorr]]></dc:creator><pubDate>Wed, 08 Oct 2025 17:02:11 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/175593340/faa28615dce76f4da2d6d6ecbf155814.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.appliedcomputing.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Want to see more dumb videos like this?  Subscribe below!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[If you have a Google account your calendar isn't private]]></title><description><![CDATA[I&#8217;m taking a break from SimKube to talk about my other favorite subject, scheduling.]]></description><link>https://blog.appliedcomputing.io/p/if-you-have-a-google-account-your</link><guid isPermaLink="false">https://blog.appliedcomputing.io/p/if-you-have-a-google-account-your</guid><dc:creator><![CDATA[drmorr]]></dc:creator><pubDate>Sat, 04 Oct 2025 17:01:41 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!M8wv!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9abdbcc8-dc09-4e12-8fc3-348cb9c2691e_518x518.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;m taking a break from <a href="https://simkube.dev/">SimKube</a> to talk about my other favorite subject, scheduling. No, not Kubernetes scheduling. Less exciting than that<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. I mean, like, literal real just scheduling events on calendars with other humans. And actually this post isn&#8217;t even <em>really</em> about that, but instead it&#8217;s about the dumb rabbit hole I went down today<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. You&#8217;re welcome.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.appliedcomputing.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Want to read about more dumb rabbit holes that have nothing to do with my day job?  
Subscribe below!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Part I: ProtonMail</h2><p>So, as part of my attempts to avoid Google products as much as possible, Applied Computing uses <a href="https://proton.me/">ProtonMail</a> for its email provider; I&#8217;ve also used ProtonMail for my personal mail for a long time, and I&#8217;ve been pretty happy with the product, so it made sense to set it up for ACRL as well. And it&#8217;s worked great as an email provider: but what I find somewhat lacking is the calendar software.</p><p>The core problem is that it is exceedingly difficult to share your Proton calendar(s) with other people so that they can see your availability. The only way you can share a calendar with a non-Proton user is via a read-only <a href="https://en.wikipedia.org/wiki/ICalendar">ICS</a> link. The problem is that usually these days you <em>really</em> want your calendar to be (partially) externally-writeable, either via scheduling software like <a href="https://calendly.com/">Calendly</a> (to let other people book time on your calendar without having to play the &#8220;Oh I&#8217;m free these times when are you free?&#8221; game) or with other calendars (because, say, maybe you are working with a few different clients who each have their own calendars and you need to sync events between all your clients as well as your non-work calendars so that everybody doesn&#8217;t all schedule everything all over each other all willy-nilly&#8212;but I don&#8217;t know anybody in that position).</p><p>(It&#8217;s me. I&#8217;m in that position. 
I actually wrote a <a href="https://github.com/drmorr0/gcal-sync">simple Google Apps script</a> that allows me to sync events across all my different calendars, because I couldn&#8217;t keep up with manually syncing everything. It gets around the ProtonMail read-only limitation by just creating duplicate events and inviting the ProtonMail account user to those events. It&#8217;s not ideal, but it works.)</p><p>But anyways, that&#8217;s not the point of this story.</p><h2>Part II: Google</h2><p>As much as I would like to be completely de-Googled, it&#8217;s actually kind of tough to run a business and not have a Google account, because <em>most</em> of the world runs on Google Docs and Google Slides. So a while back, I created a Google account associated with my ACRL email address, which I mostly just use when I need to collaborate with someone else or share a document with them. It&#8217;s fine, it&#8217;s not my preference, but I got other things to worry about, whatever, life goes on.</p><p>However, today I was redoing my business cards because I need to print a bunch more for KubeCon and the last batch that I made was missing a few important things<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. 
And what I really would like to put on the business cards is a QR code with a link to Calendly (or something like it) so that people at the top of my sales funnel can just click two buttons and immediately get some time booked with me, instead of having to take the trouble to write an email, because nobody uses email anymore, or at least that&#8217;s what all the kids these days tell me.</p><p>So I was revisiting the &#8220;nothing is compatible with ProtonMail&#8221; rabbit hole when I happened to click through to the Google account that was associated with my ProtonMail email address and discovered that&#8230; all of my ProtonMail calendar events were <em>also</em> on my Google calendar (reminder: the one I never use).</p><h2>Part III: WTF?</h2><p>Again, maybe it&#8217;s obvious to you reading this, but it sure wasn&#8217;t obvious to me in the moment. How were all of my calendar events getting synced with Google? I double-checked at least three times, and I wasn&#8217;t actually sharing my magic ProtonMail ICS link with the Google account. Moreover, all of the semi-private information (event titles, attendees, etc.) was appearing in the Google account, and I <em>definitely</em> wasn&#8217;t syncing that. What was going on???</p><p>After staring at it for a little while, I realized something important: not <em>all</em> of the events were getting synced with my Google account. In particular, if <em>I</em> created an event, it wasn&#8217;t on my Google calendar, but if somebody else had created the event and invited me to it, then it would appear on my Google calendar.</p><p>How was this happening? I don&#8217;t have a Gmail address associated with ACRL (a Google account is not the same thing as a Gmail address). Did I at one point set up sharing somehow and then forgot, and then maybe a setting changed or got removed and now I can&#8217;t undo it? What was going on? 
The most frustrating, confusing, and annoying thing about this is that I couldn&#8217;t just <em>delete</em> my Google calendar, because that would delete all the events on the calendar, and then send cancellation messages to all the other attendees. I definitely don&#8217;t want that to happen! But this is even more confusing now, because it means that the events aren&#8217;t just getting <em>synced</em> to my Google calendar, but Google also somehow thinks it&#8217;s the <em>owner</em> of the events.</p><p>I went through a lot of crazy (in hindsight) hypotheses: did I have some weird DNS record<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> set up that was also sending calendar events to Google? Did I have an email forwarding rule set up from ProtonMail to Google<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>? I went so far as to create a <em>brand new</em> Google account that was associated with an unused ACRL email address, and then I created an event from my personal email and invited that user. Annnnndddd&#8230; the event didn&#8217;t show up!</p><p>Ok, now I was really confused, before it finally hit me: Google was man-in-the-middling me! See, it&#8217;s not <em>every</em> event that someone else created that showed up on the Google account calendar, it&#8217;s just events that are created by <em>someone else using a Google account</em>. If both users are on Google, it just assumes that you&#8217;re using their product and will auto-create the events on your Google calendar, even though the calendar invite email gets sent somewhere else entirely. 
This is also why all of the calendar event details appear on the Google calendar, and why Google will notify users if you try to delete the events off of your Google calendar.</p><p>It&#8217;s obvious in hindsight, but it was a real head-scratcher for me for a bit this afternoon. And it was a good reminder that if one party in any sort of online interaction is using Google, then <em>all</em> parties are using Google, whether they want to be or not<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>.  And not that it really matters, but I can also tell which of my contacts are using Google products or not by checking to see whether their calendar events show up on my Google account.</p><p>Anyways, now that that mystery is solved, I can get back to the problem of &#8220;figuring out how to make Calendly work with ProtonMail&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> so that I can print my dang business cards. Isn&#8217;t this fun???</p><p>Tune in next time for another rant about my problem of the week! Until then, thanks for reading!</p><p>~drmorr</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>No, not Mesos scheduling either. 
RIP.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>It&#8217;s entirely probable that the thrilling conclusion to this blog post is obvious to you from the very beginning, particularly because I gave it away in the title of the post, but it took me many minutes of my life to understand what was going on so now you get to spend many minutes of your life reading about it. Or you could just click the &#8220;close tab&#8221; button and move on, but where&#8217;s the fun in that?</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Pro tip: if you are running a business and your business has a website and you would like people you hand your business card to to visit that website, it might be useful to include the URL of the website on the business card. 
Just saying.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p><a href="https://imgur.com/eAwdKEC">It&#8217;s not DNS, there&#8217;s no way it&#8217;s DNS, etc.</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>These hypotheses really make no sense if you think about it, because I don&#8217;t even have Gmail set up!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Really it&#8217;s the same with email: if you send an email to somebody with a Gmail account, Google is going to read the contents of that email and use it for advertising or training or whatever other privacy-invading things they do with it, even if the sender isn&#8217;t using a Google product to send the email. It was just extra weird to me because I wasn&#8217;t expecting to see my calendar events pop up someplace that I hadn&#8217;t specifically invited or granted permission to.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>In what is an extremely ironic twist, the only solution I&#8217;ve come up with that seems realistic at all is to sync my ProtonMail calendar to my Google account calendar, and then connect Calendly to the Google account. 
Life sure has a sense of humor.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>There is an <a href="https://protonmail.uservoice.com/forums/284483-proton-mail-calendar/suggestions/41340193-show-and-book-available-appointments-like-calendl?page=1&amp;per_page=20">open feature request</a> on the Proton user forums asking for the ability to connect to something like Calendly, or for Proton to develop their own solution, but it has hundreds of comments and votes over the years and seems to have been completely ignored by the Proton dev team, so I&#8217;m not holding my breath.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[HOWTO: Use SimKube for Cost Forecasting]]></title><description><![CDATA[Recently, I&#8217;ve had a number of folks ask for some more details about how SimKube can be used to predict or forecast your Kubernetes expenditures, and I realized that I&#8217;ve said you can do this several times, but I&#8217;ve never actually gone through the details!]]></description><link>https://blog.appliedcomputing.io/p/howto-use-simkube-for-cost-forecasting</link><guid isPermaLink="false">https://blog.appliedcomputing.io/p/howto-use-simkube-for-cost-forecasting</guid><dc:creator><![CDATA[drmorr]]></dc:creator><pubDate>Sat, 27 Sep 2025 21:00:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!MCC6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0fce1d-935e-4a53-978f-6fadc8f289c8_1044x660.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Recently, I&#8217;ve had a number of folks ask for some more details about how <a href="https://simkube.dev/">SimKube</a> can be used to predict or forecast your Kubernetes expenditures, and I realized that I&#8217;ve <em>said</em> you can do 
this several times, but I&#8217;ve never actually gone through the details! So this post will show you how.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.appliedcomputing.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Applied Computing Research Labs is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Problem statement: how much will we save by doing X?</h2><p>Anyone who&#8217;s worked on a platform team has probably at some point had their boss ask them &#8220;our platform is expensive, how can we make it less expensive?&#8221; When you&#8217;re dealing with Kubernetes and the cloud, there are a <em>lot</em> of variables at play, and it can be really tough to know in advance which knobs and levers are worth tuning. That&#8217;s where simulation comes in! If you can spend a couple hours running a simulation to see some projected savings, using real production data, that could potentially save you a lot of time and wasted effort. 
We&#8217;ll see why in just a minute.</p><p>First, some background on the specific scenario I&#8217;m demonstrating today: in recent years, the big cloud providers have started offering <a href="https://en.wikipedia.org/wiki/ARM_architecture_family">ARM</a>-based compute resources with similar-or-better performance characteristics to their equivalent <a href="https://en.wikipedia.org/wiki/X86-64">x86</a> machines, for a slightly-reduced rate<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. The details of the different architectures aren&#8217;t <em>super</em> important for this post: all you need to know is that they aren&#8217;t compatible, so it may involve recompiling a lot of code as well as figuring out how to serve <a href="https://docs.docker.com/build/building/multi-platform/">multi-platform Docker images</a>. All of this is doable, and in fact quite a few of the big tech companies have done it, but the key point is that it&#8217;s a non-trivial amount of engineering effort to enable. 
So the natural question you might ask is, &#8220;Is the savings worth the effort, and over what time horizon?&#8221; And that&#8217;s the question we&#8217;re going to tackle today in this post.</p><p>The problem setup is straightforward: I&#8217;m going to use the same <a href="https://github.com/delimitrou/DeathStarBench">DeathStarBench</a> trace data that I used in my <a href="https://blog.appliedcomputing.io/p/using-simkube-10-comparing-kubernetes">comparison of KCA to Karpenter</a>; I&#8217;m using the latest version of <a href="https://karpenter.sh/">Karpenter</a> for autoscaling, and I&#8217;ve configured SimKube to &#8220;look like&#8221; AWS by offering the full suite of AWS EC2 instances for scaling.</p><p>We&#8217;re going to run three different simulations: the first is using <em>only</em> x86 machines (as in my KCA vs Karpenter comparison, we&#8217;re using m, c, and r class nodes of 6th and 7th generation). In the second simulation, I have configured a few of the deployments in the trace to be &#8220;ARM-compatible&#8221;, and added in the equivalent Graviton nodes into the mix; and in the last simulation, I <em>only</em> run on Graviton nodes.</p><p>Why did I set things up this way? Well, in any large Kubernetes deployment, you probably have <em>some</em> workloads that don&#8217;t care about the underlying architecture: Python code, for example, is generally architecture agnostic<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. It&#8217;s a high-enough level language that the architecture details are abstracted away and it should &#8220;just work&#8221;. Golang is a step up on the difficulty ladder: you will need to recompile your code, but Go offers very good cross-compilation support, so it&#8217;s probably not &#8220;too hard&#8221;. Java code is probably similar to Golang in this regard. 
Highest on the complexity ladder are things like C/C++/Rust: especially if they have a lot of low-level system requirements, it&#8217;s probably going to be a lot of work to port from x86 to ARM.</p><p>So, this simulation is answering the question, &#8220;How much will we save if we just port the easy stuff to ARM?&#8221; compared to &#8220;How much could we hypothetically save if we ported <em>everything</em> to ARM?&#8221; Since this is all made-up for illustrative purposes, I just picked a few DeathStarBench deployments at random to be the &#8220;easy stuff&#8221;, but you can hopefully see how this mirrors the types of conversations that occur in many organizations.</p><p>Here&#8217;s the cool thing: because all my workloads are fake, they <em>don&#8217;t actually care about the underlying architecture</em>. In fact, the nodes are fake too, so they can report <em>whatever architecture I want</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. If you were to do this experiment using real nodes and real hardware, it would probably take you a few weeks to even just get things to the point where you could run a test; and I did the same work in an afternoon.</p><h2>The results: is the work worth it?</h2><p>Now that we understand the problem statement, we&#8217;ll look at the results. 
First, we need to establish our baseline: how much does it cost if nothing uses Graviton?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MCC6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0fce1d-935e-4a53-978f-6fadc8f289c8_1044x660.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MCC6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0fce1d-935e-4a53-978f-6fadc8f289c8_1044x660.png 424w, https://substackcdn.com/image/fetch/$s_!MCC6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0fce1d-935e-4a53-978f-6fadc8f289c8_1044x660.png 848w, https://substackcdn.com/image/fetch/$s_!MCC6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0fce1d-935e-4a53-978f-6fadc8f289c8_1044x660.png 1272w, https://substackcdn.com/image/fetch/$s_!MCC6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0fce1d-935e-4a53-978f-6fadc8f289c8_1044x660.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MCC6!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0fce1d-935e-4a53-978f-6fadc8f289c8_1044x660.png" width="1200" height="758.6206896551724" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea0fce1d-935e-4a53-978f-6fadc8f289c8_1044x660.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:660,&quot;width&quot;:1044,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:88881,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/174712422?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0fce1d-935e-4a53-978f-6fadc8f289c8_1044x660.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MCC6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0fce1d-935e-4a53-978f-6fadc8f289c8_1044x660.png 424w, https://substackcdn.com/image/fetch/$s_!MCC6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0fce1d-935e-4a53-978f-6fadc8f289c8_1044x660.png 848w, https://substackcdn.com/image/fetch/$s_!MCC6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0fce1d-935e-4a53-978f-6fadc8f289c8_1044x660.png 1272w, https://substackcdn.com/image/fetch/$s_!MCC6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea0fce1d-935e-4a53-978f-6fadc8f289c8_1044x660.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Figure 1 shows both the total node count and the node composition in this experiment. You can see that we are predominantly using <code>c6a.48xlarge</code> instances, which Karpenter has presumably chosen because they&#8217;re one of the cheapest instance types that support the large number of pods that we&#8217;re running. We can multiply the node counts for each instance type by the publicly available on-demand price for that instance type to find out how much everything costs. Doing so, we end up with a total price of $149.22 for this twenty-minute slice of time<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>.</p><p>So what happens if we port the easy stuff over to ARM? 
I looked at the five largest (in terms of pod count) deployments in the simulated trace file and allowed them to run on Graviton nodes<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>; everything else was restricted to x86. Re-running the experiment yields the following:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vO0Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2f381df-1b22-4172-a3c5-5de292def225_1027x660.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vO0Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2f381df-1b22-4172-a3c5-5de292def225_1027x660.png 424w, https://substackcdn.com/image/fetch/$s_!vO0Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2f381df-1b22-4172-a3c5-5de292def225_1027x660.png 848w, https://substackcdn.com/image/fetch/$s_!vO0Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2f381df-1b22-4172-a3c5-5de292def225_1027x660.png 1272w, https://substackcdn.com/image/fetch/$s_!vO0Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2f381df-1b22-4172-a3c5-5de292def225_1027x660.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vO0Q!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2f381df-1b22-4172-a3c5-5de292def225_1027x660.png" width="1200" height="771.1781888997078" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2f381df-1b22-4172-a3c5-5de292def225_1027x660.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:660,&quot;width&quot;:1027,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:106984,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/174712422?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2f381df-1b22-4172-a3c5-5de292def225_1027x660.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vO0Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2f381df-1b22-4172-a3c5-5de292def225_1027x660.png 424w, https://substackcdn.com/image/fetch/$s_!vO0Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2f381df-1b22-4172-a3c5-5de292def225_1027x660.png 848w, https://substackcdn.com/image/fetch/$s_!vO0Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2f381df-1b22-4172-a3c5-5de292def225_1027x660.png 1272w, https://substackcdn.com/image/fetch/$s_!vO0Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2f381df-1b22-4172-a3c5-5de292def225_1027x660.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>You can see in Figure 2 that we are definitely using Graviton nodes now<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>! We&#8217;re running a lot more (but smaller) nodes, and the most common instance type is the <code>c6g.16xlarge</code>. That seems promising, so how much did it cost? We run the numbers and find out that it costs&#8230; <strong>$150.72</strong>. Basically the same as before<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>! 
Sure glad we didn&#8217;t put in all that engineering effort to save $0<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>.</p><p>So OK, doing just the easy stuff doesn&#8217;t help in this case, and we clearly don&#8217;t have the time to migrate everything to Graviton right now<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>, but just hypothetically speaking, what happens if we were able to wave a magic wand and move <em>everything</em> over to Graviton? Figure 3 shows the answer:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H576!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd4d823-7df3-4f99-b1a4-55c485fc3709_1027x660.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H576!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd4d823-7df3-4f99-b1a4-55c485fc3709_1027x660.png 424w, https://substackcdn.com/image/fetch/$s_!H576!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd4d823-7df3-4f99-b1a4-55c485fc3709_1027x660.png 848w, https://substackcdn.com/image/fetch/$s_!H576!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd4d823-7df3-4f99-b1a4-55c485fc3709_1027x660.png 1272w, https://substackcdn.com/image/fetch/$s_!H576!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd4d823-7df3-4f99-b1a4-55c485fc3709_1027x660.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!H576!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd4d823-7df3-4f99-b1a4-55c485fc3709_1027x660.png" width="1200" height="771.1781888997078" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ffd4d823-7df3-4f99-b1a4-55c485fc3709_1027x660.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:660,&quot;width&quot;:1027,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:96090,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/174712422?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd4d823-7df3-4f99-b1a4-55c485fc3709_1027x660.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H576!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd4d823-7df3-4f99-b1a4-55c485fc3709_1027x660.png 424w, https://substackcdn.com/image/fetch/$s_!H576!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd4d823-7df3-4f99-b1a4-55c485fc3709_1027x660.png 848w, https://substackcdn.com/image/fetch/$s_!H576!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd4d823-7df3-4f99-b1a4-55c485fc3709_1027x660.png 1272w, 
https://substackcdn.com/image/fetch/$s_!H576!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd4d823-7df3-4f99-b1a4-55c485fc3709_1027x660.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Now everything is running on Graviton, and our total EC2 spend is predicted to be <strong>$138.24</strong> for this 20-minute time slice. Now we&#8217;re talking! There&#8217;s that 10% discount showing up finally<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>. 
Is that really worth the 6-12 months of concerted engineering time it would take to port all these workloads over to support ARM? I mean, maybe, maybe not; we&#8217;ve now moved from the realm of engineering into policy and priority decisions. But the point is, you can make <em>better</em> decisions now because you spent a few hours running some simulations and collecting data first.</p><h2>A cool epilogue: SimKube plus KubeCost</h2><p>Even though this was kind of a &#8220;toy&#8221; example, hopefully it shows how you might use SimKube to make decisions around all the different cost levers that are available in Kubernetes and the cloud. I wanted to close with one cool observation: if you were going to repeat this for your own workloads, you <em>could</em> do what I did: collect all the data, and then multiply it out by hand. It&#8217;s not hard to do. BUT, there are also existing tools out there that help you track your Kubernetes spend and make this a lot easier! One of the better-known ones is <a href="https://github.com/kubecost">KubeCost</a>, which was recently acquired by IBM. KubeCost integrates with the different cloud providers, can pull pricing data from their public APIs, and can take into account EBS volume usage, any negotiated volume discounts, and other cost-impacting factors. And you can just, you know, install KubeCost into your simulated environment and get some costs out! 
Here&#8217;s a screenshot of me doing exactly that:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6HG_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea07c847-12a5-4055-833a-509c147c2034_1145x677.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6HG_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea07c847-12a5-4055-833a-509c147c2034_1145x677.png 424w, https://substackcdn.com/image/fetch/$s_!6HG_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea07c847-12a5-4055-833a-509c147c2034_1145x677.png 848w, https://substackcdn.com/image/fetch/$s_!6HG_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea07c847-12a5-4055-833a-509c147c2034_1145x677.png 1272w, https://substackcdn.com/image/fetch/$s_!6HG_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea07c847-12a5-4055-833a-509c147c2034_1145x677.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6HG_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea07c847-12a5-4055-833a-509c147c2034_1145x677.png" width="1145" height="677" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea07c847-12a5-4055-833a-509c147c2034_1145x677.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:677,&quot;width&quot;:1145,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:88863,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.appliedcomputing.io/i/174712422?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea07c847-12a5-4055-833a-509c147c2034_1145x677.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6HG_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea07c847-12a5-4055-833a-509c147c2034_1145x677.png 424w, https://substackcdn.com/image/fetch/$s_!6HG_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea07c847-12a5-4055-833a-509c147c2034_1145x677.png 848w, https://substackcdn.com/image/fetch/$s_!6HG_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea07c847-12a5-4055-833a-509c147c2034_1145x677.png 1272w, https://substackcdn.com/image/fetch/$s_!6HG_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea07c847-12a5-4055-833a-509c147c2034_1145x677.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>I dunno, I think this is pretty nifty, and it really shows how you can start to use all of your existing tools and components in your simulation environment as well, to take your simulation analysis and data collection to the next level.</p><p>Anyways, I hope this was an interesting read for you all, and that it gives you some ideas for how you can use SimKube in the future! 
As always, thanks for reading!</p><p>~drmorr</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>AWS was the first (I believe) to start manufacturing their own ARM chips, called Graviton, for use in their public cloud offerings; although, famously, Apple abandoned Intel architectures several years ago and now exclusively sells ARM-based computers in their M* lineup.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>The exception being, of course, if you rely on any compiled Python libraries: pandas, numpy, scipy, for example.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>They could even report some magical, mystical new architecture that hasn&#8217;t even been invented yet if I wanted. You could call it, I dunno, just spitballing, ACRL64.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Note that for this simulation, I&#8217;m ignoring <em>everything</em> but the EC2 instance cost. There are a whole lot of other things that go into an AWS bill: EBS volumes, network costs, ECR costs, and more. If we <em>wanted to</em>, we could factor those into our simulation as well, but it would be more effort. 
Still, it would probably be less effort than porting everything over to Graviton and trying it out for real.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>I did this by creating a separate Karpenter node pool with only Graviton instance types and adding a <a href="https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/">taint</a> to it, so that only pods that tolerate the taint (that is, the ones allowed to run on ARM) could be scheduled there.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>The big inverse spike at the 15-minute mark is just a metrics artifact: Prometheus got overloaded and dropped a bunch of metrics on the floor. It happens, life goes on.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>If you were doing this simulation for real, you&#8217;d probably want to repeat the experiment several times to smooth out any noise or non-determinism in the results, but for the purposes of this post you get the idea.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>You might also ask why we got this result; shouldn&#8217;t Karpenter show <em>some</em> savings, even if it&#8217;s not very much? 
We&#8217;d have to really dig into the results here to understand what&#8217;s going on, but my initial hypothesis is that we&#8217;ve essentially hamstrung Karpenter&#8217;s ability to consolidate nodes. Some stuff <em>really</em> wants to run on Graviton because it&#8217;s cheap, but not everything can, so I&#8217;m guessing Karpenter&#8217;s consolidation routine is getting stuck. Combine this with the fact that running more (but smaller) instances leads to worse bin-packing in general, and that would explain why we don&#8217;t actually see any savings here.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>After all, we&#8217;ve got <a href="https://blog.appliedcomputing.io/p/okrs-are-bullshit">OKRs to half-ass</a>!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>OK, OK, it&#8217;s only 8%. Sue me.</p></div></div>]]></content:encoded></item></channel></rss>