<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~files/atom-premium.xsl"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:feedpress="https://feed.press/xmlns" xmlns:media="http://search.yahoo.com/mrss/" xmlns:podcast="https://podcastindex.org/namespace/1.0" xml:base="https://superorbital.io/">
  <feedpress:locale>en</feedpress:locale>
  <link rel="hub" href="https://feedpress.superfeedr.com/"/>
  <id>https://superorbital.io/</id>
  <title>SuperOrbital Blog</title>
  <updated>2025-04-16T00:00:00Z</updated>
  <link rel="alternate" href="https://superorbital.io/" type="text/html"/>
  <link rel="self" href="https://feed.superorbital.io/" type="application/atom+xml"/>
  <author>
    <name>Tammer Saleh</name>
    <uri>https://superorbital.io</uri>
  </author>
  <icon>https://superorbital.io/img/logo.png</icon>
  <logo>https://superorbital.io/img/logo.png</logo>
  <entry>
    <id>tag:superorbital.io,2025-04-16:/blog/cluster-api-part-3-cluster-homogenized-workloads/</id>
    <title type="html">Managing Homogenized Workloads Across a Fleet of Cluster API Kubernetes Clusters</title>
    <published>2025-04-16T00:00:00Z</published>
    <updated>2025-04-16T00:00:00Z</updated>
    <link rel="alternate" href="https://superorbital.io/blog/cluster-api-part-3-cluster-homogenized-workloads/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+SuperOrbital" type="text/html"/>
    <content type="html"><![CDATA[
<p><em>This is the third part of our series on Cluster API and how it can be a solution for managing large numbers of clusters at scale. For the previous parts on this series, see <a href="https://superorbital.io/blog/cluster-api-part-1-overview/">part 1</a> and <a href="https://superorbital.io/blog/cluster-api-part-2-capa-bootstrap/">part 2</a>.</em></p>

<details>
  <summary><em>Table of Contents</em></summary>

  <ul>
    <li><a href="https://superorbital.io/blog/cluster-api-part-3-cluster-homogenized-workloads/#installing-workloads-with-clusterresourceset">Installing workloads with ClusterResourceSet</a></li>
    <li><a href="https://superorbital.io/blog/cluster-api-part-3-cluster-homogenized-workloads/#declarative-configurations-with-argocd">Declarative Configurations with ArgoCD</a></li>
    <li><a href="https://superorbital.io/blog/cluster-api-part-3-cluster-homogenized-workloads/#alternative-1-argocd-running-on-the-management-cluster">Alternative 1: ArgoCD running on the management cluster</a></li>
    <li><a href="https://superorbital.io/blog/cluster-api-part-3-cluster-homogenized-workloads/#alternative-2-argocd-running-on-all-managed-clusters">Alternative 2: ArgoCD running on all managed clusters</a></li>
    <li><a href="https://superorbital.io/blog/cluster-api-part-3-cluster-homogenized-workloads/#whats-next">What’s next?</a></li>
  </ul>

</details>

<p>In our previous posts, we looked at how Cluster API (CAPI) works and how we can use an infrastructure provider (CAPA) to bootstrap our own management cluster and its managed clusters. Great, we have a working cluster… but a cluster with no workloads is not very useful. We desperately want to deploy to our clusters, but managing workloads across multiple Kubernetes clusters presents several challenges:</p>

<ol>
  <li>
<strong>Consistency</strong> - Ensuring all clusters run identical workloads</li>
  <li>
<strong>Drift prevention</strong> - Maintaining synchronized configurations over time</li>
  <li>
<strong>Scalability</strong> - Adding new clusters without increasing operational overhead</li>
  <li>
<strong>Maintenance</strong> - Performing updates across the fleet efficiently</li>
</ol>

<p>In this part, we’ll look at how to get workloads into these clusters, and how to solve these problems when managing multiple clusters that all run the same workloads.</p>

<h3 id="installing-workloads-with-clusterresourceset">Installing workloads with ClusterResourceSet</h3>

<p><a href="https://cluster-api.sigs.k8s.io/tasks/experimental-features/cluster-resource-set">ClusterResourceSet</a> (CRS) is a CAPI extension that allows you to define Kubernetes resources as YAML within ConfigMaps and Secrets and automatically apply them to matching clusters. It’s a simple and straightforward way to deploy the same resources to multiple clusters. Here’s an example of how one would use a ClusterResourceSet to deploy a basic monitoring stack:</p>

<pre><code class="language-yaml">apiVersion: addons.cluster.x-k8s.io/v1beta1
kind: ClusterResourceSet
metadata:
  name: monitoring-stack
  namespace: capi-system
spec:
  strategy: Reconcile
  clusterSelector:
    matchLabels:
      environment: production
      cloud: openstack
  resources:
  - kind: ConfigMap
    name: prometheus-yaml
  - kind: ConfigMap
    name: grafana-yaml
  - kind: Secret
    name: monitoring-certs
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-yaml
  namespace: capi-system
data:
  prometheus.yaml: |
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: prometheus
      namespace: monitoring
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: prometheus
      template:
        metadata:
          labels:
            app: prometheus
        spec:
          containers:
          - name: prometheus
            image: prom/prometheus:v2.36.0
            ports:
            - containerPort: 9090
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-yaml
  namespace: capi-system
data:
  grafana.yaml: |
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: grafana
      namespace: monitoring
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: grafana
      template:
        metadata:
          labels:
            app: grafana
        spec:
          containers:
          - name: grafana
            image: grafana/grafana:9.0.0
            ports:
            - containerPort: 3000
</code></pre>

<p>One thing to note: if a Secret holds configuration that’s to be installed, it <em>must</em> have the type <code>addons.cluster.x-k8s.io/resource-set</code> for CAPI to reconcile the YAML; otherwise the controller will ignore it.</p>
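<p>For example, for the <code>monitoring-certs</code> Secret referenced above to be applied, it would need to look something like this (the embedded manifest content is illustrative):</p>

<pre><code class="language-yaml">apiVersion: v1
kind: Secret
metadata:
  name: monitoring-certs
  namespace: capi-system
type: addons.cluster.x-k8s.io/resource-set  # required, or the CRS controller ignores this Secret
stringData:
  monitoring-certs.yaml: |
    # Manifests to apply on matching clusters, e.g. a TLS Secret
    # consumed by the monitoring stack (contents are illustrative)
    apiVersion: v1
    kind: Secret
    metadata:
      name: monitoring-tls
      namespace: monitoring
    type: kubernetes.io/tls
    stringData:
      tls.crt: REDACTED-CERT-PEM
      tls.key: REDACTED-KEY-PEM
</code></pre>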

<p>A ClusterResourceSet is mostly used to deploy resources on clusters for bootstrapping purposes. As an example, when the managed cluster is created, depending on the cluster type, it might be missing crucial cluster components such as networking via a CNI. For an EKS cluster, this is not an issue as they are pre-configured with the <a href="https://github.com/aws/amazon-vpc-cni-k8s">Amazon VPC CNI</a>, but for CAPI-managed Kubeadm control planes, a CNI must be present before any workload runs in the cluster. For this purpose, a CRS can be defined to allow for bootstrapping necessary resources on a cluster.</p>
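<p>A sketch of that bootstrapping pattern is shown below. The ConfigMap would typically be generated from the CNI’s install manifest, e.g. with <code>kubectl create configmap calico-cni --from-file=calico.yaml -n capi-system</code>; the names and labels here are illustrative:</p>

<pre><code class="language-yaml">apiVersion: addons.cluster.x-k8s.io/v1beta1
kind: ClusterResourceSet
metadata:
  name: cni-bootstrap
  namespace: capi-system
spec:
  strategy: ApplyOnce        # apply once at cluster creation; no ongoing reconciliation
  clusterSelector:
    matchLabels:
      cni: calico            # only clusters opting into this CNI
  resources:
  - kind: ConfigMap
    name: calico-cni         # holds the full CNI install manifest
</code></pre>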

<p>By default, CAPI applies each resource only once and does not delete resources when the ClusterResourceSet is removed. It does have limited support for reconciliation via the <code>spec.strategy</code> field, so it can re-apply a resource whenever its manifest is modified. There is, however, no drift detection, and managing complex applications this way is infeasible, so this is not a recommended approach for installing and managing general-purpose workloads.</p>

<h3 id="declarative-configurations-with-argocd">Declarative Configurations with ArgoCD</h3>

<p>If a CRS is not the solution for deploying to our fleet of clusters, then what is? This is where a good <a href="https://en.wikipedia.org/wiki/Continuous_delivery">continuous delivery</a> (CD) pipeline comes into play. Once the cluster is bootstrapped with its CNI and is ready to accept workloads, a CD automation tool such as <a href="https://fluxcd.io/">Flux</a> or <a href="https://argo-cd.readthedocs.io/en/stable/">ArgoCD</a> can perform the rest of the work of installing all the applications that must run on the managed clusters. One benefit of using a GitOps-y tool like these is that the desired cluster state can be expressed as a series of manifests in a Git repository, which provides an auditable, version-controlled deployment flow for all the clusters at once. For this particular example, we’ll be using ArgoCD and the <a href="https://argo-cd.readthedocs.io/en/latest/operator-manual/cluster-bootstrapping/#app-of-apps-pattern">“app of apps”</a> pattern to install applications declaratively: a single ArgoCD Application is installed on each cluster “manually” (i.e. with some method outside of ArgoCD), and that Application describes the other Applications containing the real workloads to be installed.</p>

<h4 id="alternative-1-argocd-running-on-the-management-cluster">Alternative 1: ArgoCD running on the management cluster</h4>

<p>This is a C&amp;C (command and control) setup, where you have a single source of control for all the applications that get installed in the cluster. ArgoCD is installed on the management cluster, and every time a new managed cluster is created, the details for that cluster (API server endpoint, authentication) are added to ArgoCD so that it can then install all the applications on that cluster.</p>

<blockquote>
  <p>Note: the examples here are demonstrative, since the Application/Project layout will differ depending on the nature and organization of your workloads.</p>
</blockquote>

<ol>
  <li>
<a href="https://argo-cd.readthedocs.io/en/stable/getting_started/">Install ArgoCD</a> on the management cluster.</li>
  <li>
<a href="https://argo-cd.readthedocs.io/en/stable/operator-manual/declarative-setup/#clusters">Register your clusters</a> in ArgoCD. Alternatively, use a tool like SuperOrbital’s <a href="https://github.com/superorbital/capargo">capargo</a> to manage adding new clusters in ArgoCD for you.</li>
  <li>Create an Application manifest for your workload:</li>
</ol>

<pre><code class="language-yaml">apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: standard-workload
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/standard-workloads
    targetRevision: main
    path: base
  destination:
    server: https://kubernetes.default.svc
    namespace: workloads
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
</code></pre>
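<p>For reference, the declarative cluster registration from step 2 boils down to creating a Secret like the following on the management cluster (the cluster name, endpoint, and credentials are placeholders):</p>

<pre><code class="language-yaml">apiVersion: v1
kind: Secret
metadata:
  name: prod-cluster-1
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster  # marks this Secret as a cluster registration
    environment: production                  # usable in cluster generator selectors
type: Opaque
stringData:
  name: prod-cluster-1
  server: https://prod-cluster-1.example.com:6443
  config: |
    {
      "bearerToken": "REDACTED-SERVICE-ACCOUNT-TOKEN",
      "tlsClientConfig": {
        "caData": "BASE64-ENCODED-CA-CERT"
      }
    }
</code></pre>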

<p>Once this is done, you can use <a href="https://argo-cd.readthedocs.io/en/stable/user-guide/application-set/">an ApplicationSet</a> to deploy that workload to multiple clusters, using <a href="https://argo-cd.readthedocs.io/en/stable/operator-manual/applicationset/Generators-Cluster/">the cluster generator</a> to automatically generate the cluster parameters from the clusters that have already been registered with ArgoCD. As an example:</p>

<pre><code class="language-yaml">apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: fleet-workload
  namespace: argocd
spec:
  generators:
  - clusters:
      selector:
        matchLabels:
          environment: production
  template:
    metadata:
      name: '{{name}}-standard-workload'
    spec:
      project: default
      source:
        repoURL: https://github.com/your-org/standard-workloads
        targetRevision: main
        path: overlays/{{metadata.labels.region}}
      destination:
        server: '{{server}}'
        namespace: workloads
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
</code></pre>

<p>Of course, with every approach there are a few pros and cons:</p>

<p>Pros:</p>

<ul>
  <li>Single-pane of glass overview of the state of all the applications on all clusters.</li>
  <li>ArgoCD is physically segregated from each cluster that it manages, which improves the security standing.</li>
  <li>The <a href="https://argo-cd.readthedocs.io/en/stable/operator-manual/applicationset/Generators-Cluster/">Cluster Generator</a> feature becomes available for all registered clusters, which allows targeting specific clusters for a given Application via a label selector.</li>
</ul>

<p>Cons:</p>

<ul>
  <li>Only one ArgoCD exists, in the management cluster. This all but ensures that ArgoCD must be installed <a href="https://argo-cd.readthedocs.io/en/stable/operator-manual/high_availability/">in HA mode</a> to minimize downtime during management cluster outages or maintenance.</li>
  <li>Without the use of capargo, adding a new cluster to ArgoCD requires manual intervention to create a special secret with the API server endpoint and the kubeconfig credentials for ArgoCD to authenticate and access the workloads on the cluster.</li>
  <li>Depending on the number of clusters being managed, it could require a lot of resources to operate ArgoCD on the management cluster.</li>
</ul>

<p>Fortunately, this is not the only approach we can take. What if instead we ran ArgoCD <em>everywhere?</em></p>

<h4 id="alternative-2-argocd-running-on-all-managed-clusters">Alternative 2: ArgoCD running on all managed clusters</h4>

<p>This is a distributed setup where ArgoCD is installed on each managed cluster and preconfigured with the appropriate bootstrap application to install all the workloads needed on each cluster. The central ArgoCD on the management cluster is only tasked with ensuring that all the other ArgoCDs are kept up-to-date and have the “app-of-apps” configuration updated. In this situation, we can leverage CAPI’s ClusterResourceSet to bootstrap ArgoCD on every cluster with the “app of apps” configuration.</p>

<blockquote>
  <p>Note: the YAML for the base ArgoCD install <a href="https://github.com/argoproj/argo-cd/blob/master/manifests/install.yaml">is quite large</a>, so for the sake of brevity it is omitted from the example below.</p>
</blockquote>

<pre><code class="language-yaml">apiVersion: addons.cluster.x-k8s.io/v1beta1
kind: ClusterResourceSet
metadata:
  name: argocd-bootstrap
  namespace: capi-system
spec:
  clusterSelector:
    matchLabels:
      argocd: enabled
  resources:
  - kind: ConfigMap
    name: argocd-installer
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-installer
  namespace: capi-system
data:
  argocd.yaml: |
    apiVersion: v1
    kind: Namespace
    metadata:
      name: argocd
    ---
    # ArgoCD YAML would be here...
    ---
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: cluster-workloads
      namespace: argocd
      finalizers:
      - resources-finalizer.argocd.argoproj.io
    spec:
      project: default
      source:
        repoURL: https://github.com/your-org/standard-workloads
        targetRevision: main
        path: base  # CRS applies this ConfigMap verbatim, so no per-cluster templating is available here
      destination:
        server: https://kubernetes.default.svc
        namespace: workloads
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
</code></pre>

<p>With this setup, you can configure a “meta” ArgoCD instance to manage the ArgoCDs (but not their workloads) on the management cluster:</p>

<pre><code class="language-yaml">apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: argocd-fleet-manager
  namespace: argocd
spec:
  generators:
  - clusters:
      selector:
        matchLabels:
          argocd: enabled
  template:
    metadata:
      name: '{{name}}-argocd-config'
    spec:
      project: default
      source:
        repoURL: https://github.com/your-org/argocd-fleet-config
        targetRevision: main
        path: clusters/{{name}}
      destination:
        server: '{{server}}'
        namespace: argocd
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
</code></pre>

<p>Alas, this is not a perfect solution either, so choose the approach that best fits your specific use-case. Let’s review the pros and cons of running ArgoCD everywhere:</p>

<p>Pros:</p>

<ul>
  <li>A simpler configuration on the management cluster, as the single ArgoCD on the management cluster does not need to maintain the workloads for the entire cluster fleet.</li>
  <li>Easier installation and maintenance process since each cluster installs its own ArgoCD, and it can come pre-baked with a directive to install workloads on itself. This also means that each ArgoCD itself consumes fewer resources as it only cares about a single cluster.</li>
  <li>As new versions and configurations of ArgoCD are developed and deployed, it’s much easier to roll out these changes to a few clusters at a time and ensure that no bugs are introduced.</li>
</ul>

<p>Cons:</p>

<ul>
  <li>ArgoCD’s Redis server caches the manifests deployed in the cluster, which can become a security issue if access to the ArgoCD namespace on each cluster is not properly locked down.</li>
  <li>It is harder to coordinate the synchronization of workloads as they are deployed across different clusters.</li>
  <li>If each per-cluster instance needs to be exposed via an Ingress, every additional load balancer potentially incurs more cost.</li>
  <li>No “single pane of glass” capability for a quick overview of the state of the world.</li>
</ul>

<h3 id="whats-next">What’s next?</h3>

<p>CAPI is a great cloud-agnostic tool to build a platform where you can go from zero infrastructure to a fleet of homogeneous, production-ready clusters with a bit of initial configuration and some manifests.</p>

<p>However, there are times when a company has different clusters for different use cases. Maybe some clusters only host CI/CD pipeline tools and their runners; others run specialized ML workloads. When the number of deployed cluster types grows into the dozens, you’ll soon find that specifying the resources for each cluster becomes an exercise in writing a lot of boilerplate.</p>

<p>In a future article, we’ll get into <a href="https://cluster-api.sigs.k8s.io/tasks/experimental-features/cluster-class/">ClusterClass</a> objects and how they can help you simplify the management of hundreds of different kinds of clusters and help you lose the fear of the single unique snowflake clusters that cannot be recreated!</p>

<p><a href="https://feed.superorbital.io/">Subscribe (yes, we still ❤️ RSS)</a> or join our mailing list below to stay updated!</p>
]]></content>
    <summary type="html">Combining CAPI's cluster lifecycle management with ArgoCD's GitOps workflow for a sane cluster fleet workload rollout!</summary>
  </entry>
  <entry>
    <id>tag:superorbital.io,2025-03-03:/blog/in-place-vertical-pod-scaling/</id>
    <title type="html">In-Place Vertical Pod Scaling: The Future of Resource Management</title>
    <published>2025-03-03T00:00:00Z</published>
    <updated>2025-03-03T00:00:00Z</updated>
    <link rel="alternate" href="https://superorbital.io/blog/in-place-vertical-pod-scaling/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+SuperOrbital" type="text/html"/>
    <content type="html"><![CDATA[<details>
  <summary><em>Table of Contents</em></summary>

  <ul>
    <li><a href="https://superorbital.io/blog/in-place-vertical-pod-scaling/#what-is-in-place-vertical-pod-scaling">What is “In-Place Vertical Pod Scaling”?</a></li>
    <li><a href="https://superorbital.io/blog/in-place-vertical-pod-scaling/#how-to-enable-in-place-vertical-pod-scaling">How to enable In-Place Vertical Pod Scaling</a></li>
    <li><a href="https://superorbital.io/blog/in-place-vertical-pod-scaling/#how-does-it-look-like-in-the-pod">What does it look like in the Pod?</a></li>
    <li><a href="https://superorbital.io/blog/in-place-vertical-pod-scaling/#potential-impact-on-the-vertical-pod-autoscaler">Potential impact on the Vertical Pod Autoscaler</a></li>
    <li><a href="https://superorbital.io/blog/in-place-vertical-pod-scaling/#final-thoughts">Final thoughts</a></li>
  </ul>

</details>

<p>Kubernetes has long been able to scale the number of workload replicas easily using Horizontal Pod Autoscaling (HPA), but one challenge has always been adjusting CPU and memory resources for Deployments, StatefulSets, and DaemonSets without restarting them. That changed with the introduction of In-Place Vertical Pod Scaling, added in Kubernetes 1.27.</p>

<p>This new feature allows you to adjust CPU and memory resources (both requests and limits) in running pods without having to recreate the Pod, which provides a smoother, less disruptive way to dynamically adjust resources. In this post, we’ll go through how to enable this feature, demonstrate changes to the Pod spec, explain the new entries in Pod status, and discuss how this will affect open-source projects like the Vertical Pod Autoscaler (VPA).</p>

<h3 id="what-is-in-place-vertical-pod-scaling">What is “In-Place Vertical Pod Scaling”?</h3>

<p>In earlier versions of Kubernetes, if you needed to adjust the resource allocation of a Pod, it had to be terminated and recreated. This is because, even though Docker and other container runtimes <a href="https://docs.docker.com/reference/cli/docker/container/update/">allow for dynamically updating the resource configuration</a> at runtime, the Pod spec marked the <code>resources</code> field as <a href="https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#resources">immutable</a>: once set, it could not be modified. This meant that the only way to “modify” the resources of a Pod was to delete the old one and create a new one with the updated resource values.</p>

<p>In-Place Vertical Pod Scaling promises to change this, by adding new fields that allow for the resources to be modified at runtime, and providing new status fields that indicate the progress in performing the change. As this is an alpha-level feature, with changes still being made, a feature gate needs to be enabled to try it out.</p>

<blockquote>
  <p>The concept of “in-place” scaling for Pods goes against the “treat workloads as cattle, not pets” ethos that Kubernetes was built upon. However, there are some special use-cases, such as when using any kind of stateful or long-running workload, where exceptions exist and justify the existence of features like this.</p>
</blockquote>

<h3 id="how-to-enable-in-place-vertical-pod-scaling">How to enable In-Place Vertical Pod Scaling</h3>

<p>Starting in Kubernetes 1.27, and as of this writing, the In-Place Vertical Pod Scaling feature sits behind a feature gate, which must be set to <code>true</code> before the feature can be used in the cluster. To do this you’ll need to:</p>

<h4 id="update-api-server-and-controller-manager-flags">1. <strong>Update API Server and Controller Manager Flags</strong>
</h4>

<p>In your Kubernetes cluster configuration, you’ll need to enable the <code>InPlacePodVerticalScaling</code> feature gate. You can do this by modifying the <code>kube-apiserver</code>, <code>kube-scheduler</code>, and <code>kube-controller-manager</code> flags.</p>

<p>In each of these components, add the flag:</p>

<pre><code>--feature-gates=InPlacePodVerticalScaling=true
</code></pre>
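<p>If you want to try this locally, a tool like <a href="https://kind.sigs.k8s.io/">kind</a> can enable the gate on every component at cluster creation via its config file (a sketch; the node layout is illustrative):</p>

<pre><code class="language-yaml"># kind-config.yaml: create the cluster with "kind create cluster --config kind-config.yaml"
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  # Propagated to the API server, controller manager, scheduler, and kubelets
  InPlacePodVerticalScaling: true
nodes:
- role: control-plane
- role: worker
</code></pre>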

<h4 id="update-the-kubernetes-version-on-all-nodes">2. <strong>Update the Kubernetes version on all nodes</strong>
</h4>

<p>Ensure that your nodes are running a Kubernetes version that supports this feature (1.27 or higher) and that the Kubelet’s <a href="https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/#overview">feature gates</a> are updated to enable In-Place Vertical Pod Scaling.</p>

<p>Certain cloud providers, such as GKE, can also provide a place for testing this feature by creating a cluster with <a href="https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--enable-kubernetes-alpha">all alpha features</a> turned on. Keep in mind that this may affect the stability of the cluster and therefore caution should be taken before making this change.</p>

<h3 id="how-does-it-look-like-in-the-pod">What does it look like in the Pod?</h3>

<p>Once the feature gate is enabled, modifying resource requests or limits on a running pod becomes possible. A container’s resource <code>requests</code> and <code>limits</code> will be <em>mutable</em> for CPU and memory resources. With the feature, these fields represent the <em>desired</em> CPU and memory resource requests and limits for the container. There is also now a <code>resizePolicy</code> array with two required fields:</p>

<ol>
  <li>
<code>resourceName</code>: Specifies which resource this policy applies to. Currently only <code>"cpu"</code> and <code>"memory"</code> are supported.</li>
  <li>
<code>restartPolicy</code>: Defines whether a container restart is required when this resource is modified. This field can have two possible values: <code>NotRequired</code>, meaning the container does not need to be restarted when this resource is changed, and <code>RestartContainer</code>, meaning the container must be restarted to apply changes to this resource.</li>
</ol>

<p>By default, if no <code>resizePolicy</code> is specified for a resource, Kubernetes treats it as if <code>restartPolicy: RestartContainer</code> is set. An example of all these fields in the Pod spec is shown below:</p>

<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        memory: "100Mi"
        cpu: "100m"
      limits:
        memory: "200Mi"
        cpu: "200m"
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired
    - resourceName: memory
      restartPolicy: RestartContainer
</code></pre>
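<p>With the feature gate enabled, a resize can be triggered by patching the running Pod’s <code>resources</code> directly. The snippet below uses the <code>nginx</code> Pod above; note that newer Kubernetes releases move this to a dedicated <code>resize</code> subresource:</p>

<pre><code># Raise the CPU request and limit on the running Pod without recreating it
kubectl patch pod nginx --patch '{
  "spec": {
    "containers": [{
      "name": "nginx",
      "resources": {
        "requests": {"cpu": "150m"},
        "limits": {"cpu": "300m"}
      }
    }]
  }
}'

# Then watch the progress via the resize status field
kubectl get pod nginx -o jsonpath='{.status.resize}'
</code></pre>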

<p>The status field of a Pod now has more information as well. As of 1.27, the <code>resize</code> field tracks the progress of resize operations. It can have the following values:</p>

<ul>
  <li>
<code>Proposed</code>: The <code>resources</code> field was modified to update the desired resources, but the Kubelet has not yet started the process of resizing.</li>
  <li>
<code>InProgress</code>: The Kubelet has accepted the resize request and is in the process of applying it to the pod’s containers.</li>
  <li>
<code>Deferred</code>: The requested resize cannot be completed at this moment. The Kubelet will continue to retry the resize, and it may be granted when other pods are removed and node resources are freed up.</li>
  <li>
<code>Infeasible</code>: The requested resize cannot be performed on the container, such as when the resize exceeds the maximum resources in a node.</li>
  <li>
<code>""</code>: An empty or unset value indicates that the last resize operation was completed.</li>
</ul>

<p>However, this will change in 1.33: <a href="https://github.com/kubernetes/enhancements/pull/5089">this change</a>, merged only three weeks ago as of writing, replaces the <code>resize</code> status with two new Pod conditions, <code>PodResizePending</code> and <code>PodResizing</code>. For now, we’ll continue showing the pre-1.33 schema.</p>

<p>The <code>allocatedResources</code> field in the <code>containerStatuses</code> of the Pod’s status reflects the resources currently allocated to the pod’s containers, as reported by the container runtime, if the container is running. For a non-running container, these are the resources allocated for the container when it starts:</p>

<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  # Pod spec as above
status:
  resize: Proposed
  containerStatuses:
  - name: nginx
    allocatedResources:
      cpu: "200m"
      memory: "100Mi"
    # other status fields

</code></pre>

<p>Finally, a new Pod condition type <code>Resizing</code> was also introduced, indicating that a Pod’s resources are being modified in-place. (Note that two more conditions will be present when 1.33 is released, as explained earlier in this post.) This condition appears in the Pod’s status like so:</p>

<pre><code class="language-yaml">status:
  conditions:
  - type: Resizing
    status: "True"
    lastTransitionTime: "2023-04-01T12:00:00Z"
    reason: ResizeStarted
    message: "Pod resources are being modified"
</code></pre>

<h3 id="potential-impact-on-the-vertical-pod-autoscaler">Potential Impact on the Vertical Pod Autoscaler</h3>

<p>The <a href="https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler"><strong>Vertical Pod Autoscaler</strong></a> is an important scaling tool that automatically adjusts resource requests and limits for pods based on usage patterns. It observes current resource usage (leveraging information from the <a href="https://github.com/kubernetes-sigs/metrics-server">Kubernetes Metrics Server</a>), suggests a resource value with some buffer room for unexpected spikes, and applies the value to the Pod. However, the way it does this is by modifying the resource values in the Pod’s controlling object (think Deployment or StatefulSet) and then evicting the existing Pods so that new Pods with the updated resource values can replace them. For stateful workloads, such as a single database or a long-running workload with in-memory cached state, this has always been a particular source of frustration because you could get locked into failure scenarios that you could not scale your way out of. Imagine these situations:</p>

<ol>
  <li>
    <p>A database that runs on a Pod has a high startup CPU spike that is dependent on the amount of rows it has to process. However, after startup the process uses very little CPU. The VPA sees this and readjusts the CPU request/limits to be lower, and evicts the Pod to apply these values. The new Pod comes up with lower resource values, and due to its startup sequence, it doesn’t have enough resources to start. Oops. This DB is also stateful so you can’t have more than one Pod running at a time, and now you have downtime. Double oops.</p>
  </li>
  <li>
    <p>A long-running query that’s being executed on a Pod has been pegging its CPU usage for a few minutes, and the VPA sees this and increases the CPU limit. Now your job can use more CPU to finish faster, but the Pod needs to be recreated to see this new value, which kills this query. Triple oops.</p>
  </li>
</ol>

<p>These are just a few situations where scaling becomes a problem for stability, and the temporary solutions are suboptimal given the limitations of the VPA. For that first example, the Pod now needs to have a minimum amount of CPU that never gets used outside of its startup procedure, which contributes to inefficient workload resource allocation, and for the second example autoscaling is just completely disabled.</p>

<p>However, with In-Place Vertical Pod Scaling, we can do these kinds of adjustments on-the-fly without restarting the Pod, which avoids the situations previously mentioned. And it seems that there are <a href="https://github.com/kubernetes/autoscaler/pull/7673">PRs</a> on the way to add this feature to the VPA. This promises to make the situations previously described a thing of the past!</p>

<h3 id="final-thoughts">Final thoughts</h3>

<p>Kubernetes 1.27’s <strong>In-Place Vertical Pod Scaling</strong> feature is a welcome improvement to resource management, offering the ability to scale pods without having to recreate them. This is particularly useful for workloads with changing resource demands, allowing more flexibility and less downtime. With the potential integration into the Vertical Pod Autoscaler, Kubernetes is becoming even more powerful for managing dynamic workloads with minimal disruption.</p>

<p><a href="https://feed.superorbital.io/">Subscribe (yes, we still ❤️ RSS)</a> or join our mailing list below to see more blog posts like this!</p>
]]></content>
    <summary type="html">A new way to adjust the resource allocation of running pods with dynamic resource needs without having to recreate them!</summary>
  </entry>
  <entry>
    <id>tag:superorbital.io,2025-02-26:/blog/the-skys-the-limit/</id>
    <title type="html">The Sky's the Limit: Why Sky Computing is the Cloud’s Future</title>
    <published>2025-02-26T00:00:00Z</published>
    <updated>2025-02-26T00:00:00Z</updated>
    <link rel="alternate" href="https://superorbital.io/blog/the-skys-the-limit/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+SuperOrbital" type="text/html"/>
    <content type="html"><![CDATA[
<h2 id="the-forecast-is-cloudy">The Forecast is Cloudy</h2>

<p>Cloud computing revolutionized the IT industry. It delivered capabilities that traditional infrastructure could never match: on-demand scalability, pay-as-you-go pricing models, and near-instant global reach. Businesses no longer needed to invest heavily in physical servers or complex maintenance – cloud providers took care of everything. The cloud made it possible to experiment and innovate faster, and it allowed startups and enterprises alike to focus on their <em>products</em> instead of on their data centers. For a while, it seemed like cloud computing was the ultimate solution.</p>

<p>But now, cracks are starting to show. Vendor lock-in ties companies to proprietary tools and ecosystems, making migration increasingly complex. The overwhelming cost and difficulty of moving data between clouds, sometimes called Data Gravity, has created virtual moats around each cloud. And of course, no list of drawbacks to cloud environments would be complete without mention of egress fees. AWS charges <a href="https://www.digitalocean.com/resources/articles/aws-egress-costs">$0.09 per GB</a> simply to move your data out, which can cost enterprises hundreds of thousands of dollars annually. So, while the cloud liberated us from our hardware constraints, it has clapped us in new, very expensive handcuffs.</p>

<h3 id="cloud-repatriation">Cloud Repatriation</h3>

<p>So then, what’s the path forward? Companies like <a href="https://arstechnica.com/information-technology/2024/10/basecamp-maker-37signals-says-its-cloud-exit-will-save-it-10m-over-5-years/">37Signals have shifted back to on-premises infrastructure</a>, and many <a href="https://www.cio.com/article/2104613/private-cloud-makes-its-comeback-thanks-to-ai.html">others</a> are in the process of doing the same. <a href="https://www.citrix.com/news/announcements/feb-2024/research-finds-it-leaders-are-choosing-hybrid-cloud-strategies-due-to-flexibility-costeffectiveness-and-security.html">A report by Citrix</a> last year says that:</p>

<p><em>“<strong>42% of organizations</strong> surveyed in the United States are considering or already have moved at least half of their cloud-based workloads back to on-premises infrastructures, a phenomenon known as cloud repatriation.”</em></p>

<p>While I applaud these companies’ pragmatic view of their infrastructure costs, it raises a question: did these companies meticulously plan their cloud usage from the start with cost optimization in mind, or are they simply reacting to unexpectedly high AWS bills? Most evidence points to the latter. Companies that are now “repatriating” their infrastructure often failed to implement basic cloud cost controls from day one—such as automatically shutting down dev environments during off-hours, using spot instances for batch workloads, or implementing resource quotas. Rather than fixing these foundational issues, they’re choosing to abandon the cloud completely.</p>

<p>Taking a step back to on-prem might not truly be a step forward in the long run. But if we remain in the cloud, how do we address the fundamental challenges of vendor lock-in, data gravity, and rising costs in a transformative way, rather than just applying incremental fixes?</p>

<h3 id="sky-computing-the-next-evolution-of-cloud-computing">Sky Computing: The Next Evolution of Cloud Computing</h3>

<p><strong>Sky Computing</strong> is a common-sense path forward for our fractured cloud ecosystem. Imagine a “cloud of clouds”, where workloads flow seamlessly between providers, free from lock-in and inefficiency, without you having to lift a finger. It’s not just another insufferable buzzword (or buzzphrase, for you pedants); it’s the next <em>logical</em> evolution in cloud infrastructure. Removing provider-specific complexity through its abstraction layer (explained below) enables businesses to prioritize performance, cost, and compliance over loyalty to a single vendor. It’s the freedom the cloud always promised but never delivered.</p>

<h3 id="wait-doesnt-multi-cloud-do-this-already">“Wait, doesn’t Multi-Cloud do this already?”</h3>

<p><strong>Not exactly.</strong> Multi-cloud strategies involve leveraging multiple cloud service providers to distribute workloads. While this approach offers benefits like redundancy and access to diverse services, it often results in fragmented operations. Each cloud platform operates in its own silo, requiring distinct management tools and expertise. This fragmentation leads to increased complexity and inefficiencies, negating some of the advantages of a multi-cloud setup.</p>

<p><strong>Sky Computing</strong>, on the other hand, overcomes the limitations of multi-cloud by unifying disparate cloud environments into a cohesive, interoperable ecosystem. Instead of treating each cloud as an isolated entity, Sky Computing orchestrates them to function as a single, harmonious infrastructure. This integration eliminates silos, enabling seamless interaction and workload mobility across all participating clouds.</p>

<h2 id="why-sky-computing">Why Sky Computing?</h2>

<h3 id="a-seamless-cloud-experience">A Seamless Cloud Experience</h3>

<p>As an end user of a Sky Computing product, you’ll no longer care which cloud provider your applications run on. You interact solely with the Sky Computing broker you’ve chosen. Its abstraction layer decides which cloud platform is best for your workload (or honors your preferences, should you have any). The broker’s decision may change over time as provider costs rise and fall, as abstraction algorithms improve, or as your own requirements change (data locality, for example).</p>

<p>The more cloud platforms the abstraction layer supports, the greater the flexibility it has in choosing the best cloud platform for your business and applications, and the greater cost savings it can pass on to you.</p>

<p><em>[Figure: a Sky Computing broker sitting between the user and multiple cloud providers, routing workloads to whichever platform fits best]</em></p>

<h3 id="regulations-and-resilience">Regulations and Resilience</h3>

<p>Moreover, rising regulations like <a href="https://www.cloudflare.com/en-ca/learning/privacy/what-is-data-sovereignty/">GDPR and CCPA</a> are forcing companies to comply with strict data sovereignty rules, demanding infrastructure that adapts to regional requirements. And when major outages occur—like the <a href="https://www.crn.com/news/cloud/2024/aws-outage-hits-amazon-services-ring-whole-foods-alexa">ones that</a> <a href="https://www.bleepingcomputer.com/news/microsoft/microsoft-azure-outage-takes-down-services-across-north-america/">have left</a> <a href="https://www.forbes.com/sites/emilsayegh/2024/07/31/microsoft-and-aws-outages-a-wake-up-call-for-cloud-dependency/">entire businesses</a> <a href="https://www.datacenterknowledge.com/outages/a-history-of-google-cloud-and-data-center-outages">offline for hours</a>—it further highlights the dangers of single-cloud dependency. Sky Computing isn’t just an opportunity; it’s becoming a critical next step for businesses that need to stay agile, resilient, and competitive in this rapidly changing world.</p>

<h2 id="the-3-pillars-of-sky-computing">The 3 Pillars of Sky Computing</h2>

<h3 id="pillar-1-abstraction">Pillar 1: Abstraction</h3>

<p>At the heart of Sky Computing lies <strong>abstraction</strong>, which serves as the glue that unifies disparate cloud platforms. Through a compatibility layer leveraging tools like Kubernetes, Ray, and standardized APIs, Sky Computing hides the complexities of individual clouds. This means you won’t need to worry about whether your data sits in AWS S3 or Azure Blob Storage—the abstraction layer handles those details, deciding on the optimal storage based on cost and performance.</p>

<p>By providing a “write once, run anywhere” experience, abstraction eliminates vendor lock-in and simplifies application deployment. As more cloud platforms become supported, the abstraction layer offers greater flexibility, enabling smoother transitions and substantial cost savings without requiring you to change how you develop or manage your workloads.</p>
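<p>To make the idea concrete, here is a toy sketch of such an abstraction layer for object storage. Every name in it is hypothetical—real Sky Computing stacks build on tools like Kubernetes, Ray, and the S3 API rather than this code—but it shows the shape of the layer: callers use one interface, and the layer picks a concrete backend on cost and constraints.</p>

```python
# Toy storage abstraction: callers never name a provider; the layer
# chooses the backend. All class and backend names are hypothetical.
from dataclasses import dataclass

@dataclass
class Backend:
    name: str            # e.g. "aws-s3", "azure-blob"
    cost_per_gb: float   # storage price, $/GB-month (illustrative numbers)
    region: str

class SkyStorage:
    """'Write once, run anywhere': one put/get API over many clouds."""

    def __init__(self, backends):
        self._backends = backends
        self._objects = {}   # key -> (backend, data); in-memory stand-in

    def put(self, key, data, region=None):
        # Choose the cheapest backend that satisfies the region constraint.
        candidates = [b for b in self._backends
                      if region is None or b.region == region]
        backend = min(candidates, key=lambda b: b.cost_per_gb)
        self._objects[key] = (backend, data)
        return backend.name

    def get(self, key):
        return self._objects[key][1]

s3 = Backend("aws-s3", cost_per_gb=0.023, region="us-east-1")
blob = Backend("azure-blob", cost_per_gb=0.018, region="eu-west-1")
store = SkyStorage([s3, blob])

print(store.put("report.csv", b"a,b\n1,2\n"))             # → azure-blob (cheapest)
print(store.put("us-data.csv", b"", region="us-east-1"))  # → aws-s3 (region pinned)
```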

<p>Even beyond the Hypercloud providers of today like AWS, GCP, and Azure, you’ll see support for <a href="https://semianalysis.com/2024/10/03/ai-neocloud-playbook-and-anatomy/">Neocloud providers</a> like <a href="https://lambdalabs.com/">Lambda Labs</a>, <a href="https://www.coreweave.com/">Coreweave</a>, or <a href="https://nebius.com/">Nebius</a>, providing even greater operational flexibility.</p>

<h3 id="pillar-2-automation">Pillar 2: Automation</h3>

<p>Building on the foundation of abstraction, <strong>automation</strong> is the next pillar driving Sky Computing forward. Intercloud brokers embody this pillar by serving as intelligent decision-makers that manage workload placement, cost optimization, and compliance across multiple clouds.</p>

<p>These brokers continuously analyze factors like pricing, resource availability, and regulatory requirements using AI and real-time data. They automatically route or adjust workloads based on current conditions, ensuring your applications always run in the most cost-effective and efficient environment. This removes the manual overhead of juggling multiple cloud providers, reduces the chance of human error, and lets you focus on higher-level tasks while the system optimizes operations behind the scenes.</p>
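<p>The core of that placement loop can be sketched in a few lines. The provider data and scoring rule below are invented for illustration; a real broker would pull live pricing and capacity from each cloud’s APIs rather than a static table.</p>

```python
# Toy intercloud-broker placement: pick the cheapest provider that
# satisfies the workload's capacity and region constraints.

def place(workload, providers):
    eligible = [
        p for p in providers
        if p["gpus_free"] >= workload["gpus"]
        and workload["region"] in p["regions"]
    ]
    if not eligible:
        raise RuntimeError("no provider satisfies the constraints")
    return min(eligible, key=lambda p: p["price_per_gpu_hr"])["name"]

providers = [
    {"name": "cloud-a", "price_per_gpu_hr": 2.10, "gpus_free": 64, "regions": {"us", "eu"}},
    {"name": "cloud-b", "price_per_gpu_hr": 1.45, "gpus_free": 8,  "regions": {"us"}},
    {"name": "cloud-c", "price_per_gpu_hr": 1.80, "gpus_free": 32, "regions": {"eu"}},
]

# 4 GPUs in the US: cloud-b is cheapest and has capacity
print(place({"gpus": 4, "region": "us"}, providers))   # → cloud-b
# 16 GPUs in the EU: cloud-b is out on both counts; cloud-c wins on price
print(place({"gpus": 16, "region": "eu"}, providers))  # → cloud-c
```

Re-running this loop as prices and capacity shift is what lets the broker move workloads without human intervention.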

<h3 id="pillar-3-agility">Pillar 3: Agility</h3>

<p>The final pillar, <strong>agility</strong>, is about creating a responsive and flexible cloud ecosystem where data and workloads move freely. Reciprocal peering agreements are key to this agility. These agreements are collaborations between cloud providers that allow for free or low-cost data transfers, breaking down barriers such as egress fees and data gravity.</p>

<p>As these agreements take shape—often organically driven by hyperscale providers keen to support popular brokers—workloads can move seamlessly between clouds. This dynamic environment empowers businesses to adapt quickly to shifting costs, regulatory changes, or performance requirements without being locked into a single provider.</p>

<p>Crucially, this level of agility opens up the cloud ecosystem in ways never seen before. By tearing down silos and encouraging collaboration between providers, Sky Computing creates an interconnected landscape where innovation thrives. Smaller and more specialized neocloud providers gain a seat at the table, fostering competition and driving breakthroughs in service offerings. Enterprises can mix and match services from various providers without fear, leveraging the best features from each platform to suit their needs.</p>

<p>The result is an agile infrastructure that can pivot on demand, offering both resilience and the flexibility to innovate and disrupt rather than just iterate. This unprecedented openness not only breaks the barriers of vendor lock-in but also sparks a whole new era of creativity and efficiency across the entire cloud industry, fundamentally changing how businesses harness cloud technology.</p>

<h2 id="real-world-examples">Real-World Examples</h2>

<p>Sky Computing is not just a theoretical framework—it has practical applications that transform how businesses run complex workloads. Here’s how we’re already seeing it play out in the real world.</p>

<h3 id="aiml-workloads">AI/ML Workloads</h3>

<p>In the world of AI and machine learning, different stages of a pipeline may benefit from different cloud providers’ specialties. For example, a company could split its ML pipeline: run model training on Google Cloud, which offers TPU-optimized instances for deep learning; perform inference on AWS, utilizing their Inferentia chips for lower latency; and handle data preprocessing on Azure, benefiting from its robust data services. By strategically placing each stage where it performs best, organizations gain speed, cost savings, and the ability to comply with regional data regulations. The concepts of a unified platform and intelligent routing introduced earlier come into play here, as Sky Computing brokers manage this orchestration—dynamically routing workloads to the optimal environment for each task without manual intervention.</p>
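<p>One way to express such a split is declaratively: each stage states a preference, and the broker honors it when it can. The stage names and fields below are hypothetical, but they mirror the pipeline just described—training where the accelerators are best, inference where latency is lowest, preprocessing where the data services live.</p>

```python
# Hypothetical per-stage placement preferences for an ML pipeline.
pipeline = [
    {"stage": "preprocess", "prefer": "azure", "needs": {"cpu": 32}},
    {"stage": "train",      "prefer": "gcp",   "needs": {"tpu": 8}},
    {"stage": "infer",      "prefer": "aws",   "needs": {"inferentia": 2}},
]

def plan(pipeline, available):
    """Honor each stage's preferred provider when it is available;
    otherwise hand the decision back to the broker."""
    return {s["stage"]: (s["prefer"] if s["prefer"] in available else "broker-choice")
            for s in pipeline}

print(plan(pipeline, available={"aws", "gcp", "azure"}))
print(plan(pipeline, available={"aws", "azure"}))  # gcp down: training falls back
```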

<h3 id="global-data-compliance">Global Data Compliance</h3>

<p>Regulatory requirements like GDPR in Europe or CCPA in California mandate strict handling of data based on location. Sky Computing can automatically route workloads and data to the correct geographical region to meet these legal requirements. This ensures compliance without sacrificing performance, as the system intelligently selects the cloud environments that best balance regulatory needs with operational efficiency.</p>
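<p>In broker terms, residency rules become just another placement constraint. The rules table below is purely illustrative—it is not a statement of what GDPR or CCPA actually require—but it shows how a broker can filter candidate regions before any cost optimization happens.</p>

```python
# Minimal sketch of data-residency filtering in a broker.
RESIDENCY = {
    "eu-customer-data": {"eu-west-1", "eu-central-1"},  # GDPR-style rule
    "ca-customer-data": {"us-west-1"},                  # CCPA-style rule
}

def allowed_regions(dataset, candidate_regions):
    """Intersect the broker's candidate regions with the dataset's rule;
    datasets with no rule are unrestricted."""
    rule = RESIDENCY.get(dataset)
    if rule is None:
        return set(candidate_regions)
    return set(candidate_regions) & rule

print(allowed_regions("eu-customer-data", ["us-east-1", "eu-west-1"]))  # {'eu-west-1'}
print(allowed_regions("public-logs", ["us-east-1", "eu-west-1"]))       # unrestricted
```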

<h3 id="enterprise-batch-jobs">Enterprise Batch Jobs</h3>

<p>Batch processing tasks, such as large-scale data analysis or report generation, often require significant computational resources and time. Sky Computing’s cost-aware brokers analyze available resources across multiple clouds and choose the most cost-effective option to run these jobs. By doing so, enterprises can save millions on large-scale batch workloads, as the brokers not only find the cheapest compute options but also optimize job scheduling to take advantage of low-cost, high-performance opportunities.</p>

<h2 id="challenges-to-sky-computing-adoption">Challenges to Sky Computing Adoption</h2>

<p>Adopting Sky Computing won’t be all rainbows and unicorns. It comes with its own set of challenges that need to be navigated carefully.</p>

<h3 id="standardization">Standardization</h3>

<p>Achieving universal standards across all cloud platforms is unlikely due to competitive interests and proprietary technologies. However, progress can still be made by leveraging partial compatibility sets—common ground in widely adopted tools like Kubernetes, Ray, and S3 APIs. These standards don’t cover every scenario but provide a practical bridge, allowing Sky Computing to move forward without waiting for complete industry-wide uniformity.</p>

<h3 id="economic-resistance">Economic Resistance</h3>

<p>Large cloud providers may resist reciprocal peering agreements, as sharing data freely between platforms can conflict with their business models. While this resistance exists, smaller cloud providers and innovative startups have strong incentives to embrace Sky Computing principles. Their agility and desire to compete with larger players drive them to support the ecosystem, gradually encouraging wider adoption and putting pressure on the bigger providers to reconsider their stance.</p>

<h3 id="infrastructure-inertia">Infrastructure Inertia</h3>

<p>Organizations have significant investments in their existing cloud infrastructure - not just in terms of cost, but also in expertise, tooling, and operational processes. Many firms are understandably hesitant to make dramatic changes to their infrastructure stack, especially when it comes to adopting new paradigms like Sky Computing that don’t yet have widespread adoption. This resistance to change is compounded by the fact that existing cloud deployments often work “well enough,” even if they’re not optimal in terms of cost or performance.</p>

<p>The overhead of retraining staff, updating deployment pipelines, and potentially refactoring applications to work with Sky Computing’s abstraction layer can seem daunting to many organizations. Additionally, there are perceived risks around reliability and support when moving away from established cloud providers’ native services. These factors create significant inertia that must be overcome for widespread Sky Computing adoption.</p>

<h3 id="the-challenge-of-legitimacy">The Challenge of Legitimacy</h3>

<p>The concept of Sky Computing faces some uphill battles in establishing legitimacy, particularly in light of recent events. <a href="https://en.wikipedia.org/wiki/Sky_computing">A visit to Wikipedia’s Sky Computing entry</a> reveals a troubling warning banner questioning the reliability of sources and noting a lack of academic citations. This stems from an incident where a commercial entity attempted to shape the narrative around Sky Computing through Wikipedia editing, leading to their eventual ban from the platform.</p>

<p>This highlights a broader challenge: as emerging technologies gain traction, there’s often a rush by commercial entities to stake their claim as thought leaders or pioneers, sometimes through questionable means. This can inadvertently damage the credibility of legitimate technological advances. Sky Computing, as an architectural evolution of cloud computing backed by academic research and technical merit, deserves to be evaluated on its technical foundations rather than through marketing efforts.</p>

<p>The incident serves as a reminder that transformative technologies often face skepticism when commercial interests precede widespread technical validation. However, the fundamental value proposition of Sky Computing—providing a unified interface across cloud providers while optimizing for cost, performance, and compliance—stands independent of any single company’s implementation.</p>

<h2 id="final-thoughts">Final Thoughts</h2>

<p>I genuinely believe that we’re at a turning point in cloud infrastructure. I don’t think Sky Computing is just another buzzword—it’s a practical fix for real problems. It brings together different cloud services into one smooth system, making life easier for businesses and SREs who need reliable, flexible, and efficient operations. At the end of the day, this makes a sizeable dent in balance sheets, and that’s what matters.</p>

<p>As more Sky Computing solutions emerge, tech leaders worldwide will continue to notice. They’ll see the benefits and quickly move their workloads to these smarter, more open, and more cost-effective cloud setups.</p>

<p>The future of the cloud is here, knocking at our door. It’s an exciting moment to rethink how we build and manage systems that stand up to real-world demands—more resilient, more adaptable, and ready for what’s next.</p>

]]></content>
    <summary type="html">Discover why Sky Computing is the logical evolution of cloud infrastructure, delivering the freedom traditional cloud providers promised but never delivered.</summary>
  </entry>
  <entry>
    <id>tag:superorbital.io,2025-02-15:/blog/gpud-in-node-problem-detector/</id>
    <title type="html">Integrating GPUd with Node Problem Detector</title>
    <published>2025-02-15T00:00:00Z</published>
    <updated>2025-02-15T00:00:00Z</updated>
    <link rel="alternate" href="https://superorbital.io/blog/gpud-in-node-problem-detector/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+SuperOrbital" type="text/html"/>
    <content type="html"><![CDATA[<p>This post continues on from our previous article on <a href="https://superorbital.io/blog/node-problem-detector-custom-plugins-primer/">building custom plugins for Node Problem Detector</a>.</p>

<p>Managing GPU-enabled Kubernetes clusters presents unique challenges that require closely monitoring GPU health and responding to hardware issues. While Kubernetes excels at container orchestration, it needs to be extended to monitor specialized hardware like GPUs. The combination of Node Problem Detector (NPD) and GPUd could create a solution for automated GPU health monitoring through Kubernetes’ native health reporting mechanisms.</p>

<h2 id="introduction">Introduction</h2>

<p><a href="https://github.com/kubernetes/node-problem-detector">Node Problem Detector (NPD)</a> is a Kubernetes monitoring agent that detects system-level issues and reports them as node conditions and events. The conditions and events are exposed through the Kubernetes API, and visible with <code>kubectl describe node</code>. NPD comes with built-in problem monitors, and <a href="https://superorbital.io/blog/node-problem-detector-custom-plugins-primer/">supports custom plugins</a> for extending its capabilities.</p>

<p><a href="https://github.com/leptonai/gpud">GPUd</a> is a system monitoring daemon specializing in GPU metrics. It has a component-based architecture that allows it to monitor GPU-specific metrics and related system components affecting GPU clusters. Its output includes states and events that indicate the health of each component.</p>

<p>The monitoring models between NPD and GPUd appear to be compatible—states mapping to node conditions and events aligning with Kubernetes events—and should enable effective GPU health monitoring in Kubernetes.</p>

<h5 id="in-collaboration-with-sailplane">In Collaboration with Sailplane</h5>

<p>This blog post was produced in collaboration with <a href="https://sailplane.ai">Sailplane</a>. I paired with their AI agent to develop the proof-of-concept plugin, create and manage a test environment, deploy GPUd and NPD within it, and author this post.</p>

<h2 id="the-gpud-architecture-and-api">The GPUd Architecture and API</h2>

<p>GPUd’s NVIDIA-specific GPU monitoring components include: GPU status and performance metrics, temperature monitoring, driver and CUDA toolkit health, GPU memory usage, and ECC errors.</p>

<p>GPUd also has components for monitoring non-GPU general system health, like systemd services, memory and CPU usage, kernel module status, and kernel dmesg logs.</p>

<p>GPUd caches monitoring data in a SQLite database and exposes it through a RESTful API. The API endpoints include:</p>

<ul>
  <li>
<code>/v1/components</code> lists the components;</li>
  <li>
<code>/v1/states</code> shows the instantaneous current health of a component;</li>
  <li>
<code>/v1/events</code> shows a timestamped series of notable events within a component;</li>
  <li>
<code>/v1/metrics</code> gathers measurements from a component, similar to Prometheus metrics; and</li>
  <li>
<code>/v1/info</code> gathers all component information in one response.</li>
</ul>

<p>Each accepts a <code>components</code> query parameter to filter the results to one or a set of components, and the events and metrics endpoints accept <code>startTime</code> and <code>endTime</code> to query a time range.</p>

<p>Here’s an example of a state response from the systemd component (<code>/v1/states?components=systemd</code>):</p>

<pre><code class="language-json">[{
  "component": "systemd",
  "states": [{
    "name": "unit",
    "healthy": true,
    "reason": "name: kubelet active: true uptime: 1 day ago",
    "extra_info": {
      "active": "true",
      "name": "kubelet",
      "uptime_humanized": "1 day ago",
      "uptime_seconds": "90344"
    }
  }]
}]
</code></pre>

<p>The <code>/v1/events</code> API produces similar output, but requires the <code>startTime</code> parameter to produce any output at all. <code>endTime</code> is also an available parameter, and both default to the current time (i.e., the default time range has zero duration, so no events would be selected).</p>

<p>Here’s an example of an event response from the memory component (<code>/v1/events?components=memory&amp;startTime=[...]</code>):</p>

<pre><code class="language-json">[{
  "component": "memory",
  "startTime": "2025-02-11T20:59:30Z",
  "endTime": "2025-02-12T00:59:30.450503144Z",
  "events": [{
    "time": "2025-02-11T21:09:19Z",
    "name": "memory_oom_cgroup",
    "type": "Warning",
    "message": "oom cgroup detected",
    "extra_info": {
      "log_line": "Memory cgroup out of memory: Killed process 339038 (python) total-vm:92920kB, anon-rss:64672kB, file-rss:4608kB, shmem-rss:0kB, UID:0 pgtables:184kB oom_score_adj:992"
    }
  }]
}]
</code></pre>

<h2 id="npd-custom-plugin-implementation">NPD Custom Plugin Implementation</h2>

<p>We can use NPD’s <a href="https://superorbital.io/blog/node-problem-detector-custom-plugins-primer/">custom plugin system</a> to bridge GPUd’s monitoring capabilities to Kubernetes’ node health model. NPD interprets the exit status of a plugin script as the detection of a problem, and if a problem is detected, uses any message on stdout as the condition or event message. The plugin script can query GPUd’s API and process the response with <code>jq</code> to filter for the relevant state or events. It can then print the state’s reason or the event’s message, and set the process exit status accordingly.</p>
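
<p>In miniature (using the systemd state response shown earlier, inlined as a string, and assuming <code>jq</code> is available), the state check looks like this:</p>

<pre><code class="language-shell"># Sketch: detect a problem from a GPUd state response.
# The response is inlined here; the real plugin fetches it with curl.
response='[{"component":"systemd","states":[{"name":"unit","healthy":true,"reason":"name: kubelet active: true uptime: 1 day ago"}]}]'

# Select unhealthy states and take the first reason, if any.
unhealthy=$(echo "$response" | jq -r '[.[].states[] | select(.healthy == false) | .reason][0]')

if [ "$unhealthy" = "null" ]; then
  echo "no problem detected"   # the plugin would exit 0 here
else
  echo "$unhealthy"            # the plugin would print this and exit 1
fi
</code></pre>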

<p>See the <a href="https://superorbital.io#appendix">appendix</a> below for a proof-of-concept implementation of such a script.</p>

<p>For state monitoring, the plugin can detect a problem when a state is reported with <code>"healthy": false</code>.
For events, the plugin can detect a problem whenever a matching event is emitted.
Here’s an example configuration of rules for monitoring the state of the kubelet service, and event monitoring for OOM kills:</p>

<pre><code class="language-json">{
  ...
  "rules": [
    {
      "type": "permanent",
      "condition": "GPUdKubeletHealthy",
      "reason": "KubeletRunning",
      "path": "/usr/local/bin/gpud-npd-plugin.sh",
      "args": [
        "--mode", "states",
        "--component", "systemd",
        "--state-name", "unit",
        "--match-extra-info", ".name == \"kubelet\""
      ]
    },
    {
      "type": "temporary",
      "reason": "OOMKilling",
      "path": "/usr/local/bin/gpud-npd-plugin.sh",
      "args": [
        "--mode", "events",
        "--component", "memory",
        "--event-name", "memory_oom_cgroup"
      ]
    },
    ...
  ]
}
</code></pre>

<h2 id="limitations">Limitations</h2>

<p>Some limitations arise from the way NPD queries plugins for problem detection.</p>

<h3 id="event-handling">Event Handling</h3>

<p>The plugin can only emit one event per polling interval. This will miss sequences of events that occur faster than the polling interval. While shrinking the polling interval may help, that approach cannot guarantee events will not be missed. Instead, the script should output an indicator that multiple events occurred. Additionally, the event rules should be split up with finely sliced (more specific) queries to match the smallest number of events for a <code>"reason"</code>.</p>

<p>However, splitting these into finely sliced queries explodes the configuration above and will compound the performance overhead, as explained next.</p>

<h3 id="performance-overhead">Performance Overhead</h3>

<p>The NPD custom plugin architecture requires polling GPUd’s API separately for each configured component. Each poll of each rule must fork/exec a process, and each script execution will launch several other processes. The most expensive step will be contacting the GPUd API, with the connection overhead (TLS) that entails. Depending on the component, GPUd will read in-memory caches or run a SQLite query to collect the requested information. With many components, this can add up to significant overhead.</p>

<p>A potential mitigation (not implemented for this proof-of-concept) is to run another per-node process (e.g., another daemonset, or added to the NPD daemonset) that periodically polls GPUd for the information from all components in one request, and then splits it out into individual files that individual rules can read cheaply.</p>
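
<p>A sketch of that splitting step (not part of this proof-of-concept): it assumes <code>jq</code>, and assumes each element of the response carries a <code>component</code> field, as the states and events responses above do. A real poller would fetch the JSON from the API with curl rather than inlining it.</p>

<pre><code class="language-shell"># Illustrative: split one combined GPUd response into per-component files.
CACHE_DIR=$(mktemp -d)

# Stand-in for the output of one poll of all components.
info='[{"component":"systemd","states":[]},{"component":"memory","events":[]}]'

echo "$info" | jq -c '.[]' | while read -r item; do
  name=$(echo "$item" | jq -r '.component')
  # tee writes the per-component cache file (and echoes it for visibility)
  echo "$item" | tee "$CACHE_DIR/$name.json"
done

ls "$CACHE_DIR"   # one file per component
</code></pre>

<p>Each NPD rule can then read its component’s file locally instead of hitting the API.</p>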

<h3 id="gpud-code-quality">GPUd Code Quality</h3>

<p>GPUd is a <a href="https://blog.lepton.ai/introducing-gpud-the-missing-gpu-management-for-ai-0f0d026337e3">young project</a> published by a fast-moving startup. As such, it shows signs of immaturity that we can hope will improve over time.</p>

<p>There is not a lot of <a href="https://www.gpud.ai/docs">documentation</a> for the project. The <a href="https://www.gpud.ai/docs/components">list of components</a> has one-sentence descriptions that do little more than restate the name, followed by links to the GoDocs for each component, which add no further information.
The <a href="https://www.gpud.ai/api/v1/docs">API documentation</a> lists the API endpoints, but it shows <code>component</code> as a query parameter instead of the correct parameter, <code>components</code>.
Meanwhile, <code>startTime</code> and <code>endTime</code> are not documented, yet they are critical to getting any information from the <code>/v1/events</code> endpoint, as noted earlier. These shortcomings leave you to either probe the API directly to figure out what information is available, or read the code.</p>

<p>Some components appear to have tunable thresholds. For example, the <code>fd</code> component <a href="https://github.com/leptonai/gpud/blob/804a546833f103a884f6dfc191ba14d8492cd5ba/components/fd/config.go#L15">has a Config struct</a> with a field <code>threshold_allocated_file_handles</code>, which also appears in the component’s output.
If you’re looking to change this threshold, you are out of luck.
You might, as I did, look at the code to see how to set this configuration and have some hope. There is a global configuration object (which includes the <code>fd</code> component’s <code>Config</code> struct), and there is a function that <a href="https://github.com/leptonai/gpud/blob/804a546833f103a884f6dfc191ba14d8492cd5ba/pkg/config/default.go#L353">reads from a YAML file in a set of fixed locations</a>.
But, at the time of writing, that function is dead code, never referenced anywhere else. GPUd always launches with its built-in, automatic configuration, modifiable only in limited ways via command-line arguments.</p>

<h2 id="conclusion">Conclusion</h2>

<p>The integration of GPUd with Node Problem Detector demonstrates how Kubernetes’ node health monitoring can be extended to cover specialized hardware like GPUs. By mapping GPUd’s monitoring capabilities to Kubernetes’ native health reporting mechanisms through NPD’s plugin system, clusters gain visibility into GPU health and can potentially automate responses to GPU-related issues. While the plugin architecture has limitations around event handling and performance, it provides a starting point for exploring automated GPU health monitoring in Kubernetes environments and should work fine for a limited number of extracted event and condition types.</p>

<h2 id="appendix">Appendix</h2>

<p>This is the proof-of-concept plugin script written for this blog post. <code>curl</code> and <code>jq</code> must be installed in the node-problem-detector image.</p>

<pre><code class="language-shell">#!/bin/sh

# Exit code definitions
EXIT_SUCCESS=0           # No problem detected
EXIT_PROBLEM_DETECTED=1  # Problem detected in GPUd events/states
EXIT_SYSTEM_ERROR=2      # System errors like API failures or invalid arguments

die() {
  echo "$1"
  exit $EXIT_SYSTEM_ERROR
}

MODE=""
COMPONENT=""
EVENT_NAME=""
STATE_NAME=""
MATCH_EXTRA_INFO=""

while [ $# -gt 0 ]; do
  case "$1" in
    --mode)             MODE="$2";             shift 2 ;;
    --component)        COMPONENT="$2";        shift 2 ;;
    --event-name)       EVENT_NAME="$2";       shift 2 ;;
    --state-name)       STATE_NAME="$2";       shift 2 ;;
    --match-extra-info) MATCH_EXTRA_INFO="$2"; shift 2 ;;
    *) die "Unknown argument: $1" ;;
  esac
done

query_events() {
  # Query GPUd events for specific component and filter by event name and message pattern
  # startTime is needed to get any events. endTime defaults to now. 30sec matches the polling interval.
  response=$(curl -sk "https://$NODE_NAME:15132/v1/events?components=${COMPONENT}&amp;startTime=$(date -d "30sec ago" +%s)")
  if [ $? -ne 0 ]; then
    die "Failed to query GPUd events API"
  fi

  event_count=$(echo "$response" | jq --arg name "$EVENT_NAME" \
    '[.[].events[] | select(.name == $name)] | length')
  event_msg=$(echo "$response" | jq -r --arg name "$EVENT_NAME" \
    '[.[].events[] | select(.name == $name) | .message][0]')

  if [ "$event_count" -gt 0 ]; then
    # printf instead of echo -n: the -n flag is not portable under /bin/sh
    printf '%s' "$event_msg"
    if [ "$event_count" -gt 1 ]; then
      printf ' (%s events missed)' "$((event_count - 1))"
    fi
    exit $EXIT_PROBLEM_DETECTED
  fi

  return $EXIT_SUCCESS
}

query_states() {
  # Query GPUd states for specific component and filter by state name and extra info
  response=$(curl -sk "https://$NODE_NAME:15132/v1/states?components=${COMPONENT}")
  if [ $? -ne 0 ]; then
    die "Failed to query GPUd states API"
  fi

  state_reason=$(echo "$response" | jq -r --arg name "$STATE_NAME" \
    "[.[].states[] | select(.name == \$name and (.extra_info | $MATCH_EXTRA_INFO) and .healthy == false) | .reason][0]")

  if [ -n "$state_reason" ] &amp;&amp; [ "$state_reason" != "null" ]; then
    printf '%s' "$state_reason"
    exit $EXIT_PROBLEM_DETECTED
  fi

  return $EXIT_SUCCESS
}

case "$MODE" in
  "events") query_events ;;
  "states") query_states ;;
  *) die "Invalid mode: $MODE" ;;
esac

exit $EXIT_SUCCESS
</code></pre>

]]></content>
    <summary type="html">Extend Kubernetes' native health monitoring to GPUs by integrating GPUd with Node Problem Detector. </summary>
  </entry>
  <entry>
    <id>tag:superorbital.io,2025-01-24:/blog/node-problem-detector-custom-plugins-primer/</id>
    <title type="html">Node Problem Detector Custom Plugins</title>
    <published>2025-01-24T00:00:00Z</published>
    <updated>2025-01-24T00:00:00Z</updated>
    <link rel="alternate" href="https://superorbital.io/blog/node-problem-detector-custom-plugins-primer/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+SuperOrbital" type="text/html"/>
    <content type="html"><![CDATA[
<p>I was recently asked to prototype a custom plugin for node-problem-detector. I found that the documentation for the plugin interface is pretty inscrutable. The <a href="https://kubernetes.io/docs/tasks/debug/debug-cluster/monitor-node-health/#adding-custom-plugin-monitors">official documentation</a> links to the <a href="https://docs.google.com/document/d/1jK_5YloSYtboj-DtfjmYKxfNnUxCAvohLnsH5aGCAYQ/edit?tab=t.0#heading=h.tplmhav3arf5">plugin interface proposal document</a> on Google Docs which doesn’t make a good guide to writing your own plugin. So here’s a primer I developed based on my brief experience.</p>

<h2 id="quick-introduction-to-node-problem-detector">Quick Introduction to Node Problem Detector</h2>

<p>Node Problem Detector (NPD) runs as a DaemonSet, and sets <a href="https://superorbital.io/blog/status-and-conditions/">conditions</a> in the status field of the node that it is running on. It can also output an Event when a problem occurs. The conditions and events are visible when running <code>kubectl describe node &lt;node-name&gt;</code>.</p>

<p>NPD has several built-in problems it will detect. But it also has a way to take advantage of its infrastructure to add your own conditions and events.</p>

<h2 id="writing-custom-plugins">Writing Custom Plugins</h2>

<p>The “script interface” is based on exit status and a message on stdout. An exit status of <code>0</code> means no problem was detected; <code>1</code> means a problem was detected; any other non-zero status means the result could not be determined.</p>

<p>The path to this script is added to a JSON configuration file, and the path to that configuration file is passed to the node-problem-detector binary via a command-line argument. The <a href="https://github.com/deliveryhero/helm-charts/tree/master/stable/node-problem-detector">helm chart</a> abstracts some of this plumbing away so that you only need to author the script, the configuration, and probably modify the node-problem-detector image so that it contains the tools necessary for your script.</p>

<p>You might want to start by adding a configuration like this to your helm values:</p>

<pre><code class="language-json">      {
        "plugin": "custom",
        "pluginConfig": {
          "invoke_interval": "30s",
          "timeout": "5s",
          "max_output_length": 80,
          "concurrency": 3
        },
        "source": "my-custom-plugin-monitor",
        "metricsReporting": true,

        "conditions": [
          {
            "type": "MyProblemCondition",
            "reason": "NoProblem",
            "message": "Everything is normal"
          }
        ],
        "rules": [
          {
            "type": "permanent",
            "condition": "MyProblemCondition",
            "reason": "ProblemCause",
            "path": "./custom-config/plugin-my_problem.sh"
          }
        ]
      }
</code></pre>

<p>But what do these configuration keys even mean? What are the “conditions” and “rules”?</p>

<p>The best explanation I’ve seen for custom plugin configuration is in the source for node-problem-detector at <a href="https://github.com/kubernetes/node-problem-detector/blob/master/docs/custom_plugin_monitor.md">docs/custom_plugin_monitor.md</a>. The struct definitions for <a href="https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/pkg/types/types.go#L56">Condition</a> and <a href="https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/pkg/custompluginmonitor/types/types.go#L40">CustomRule</a> serve as additional references.</p>

<p>It’s not the most obvious configuration surface. Instead of starting from configuration, it’s simpler to think about this from the bottom-up, starting from the script, how it gets executed, and how the result is turned into a status condition or an event.</p>

<h2 id="how-a-plugin-script-invocation-is-turned-into-a-condition-or-event">How a plugin script invocation is turned into a Condition or Event</h2>

<p>I write a script that can return an exit status of <code>0</code> or <code>1</code>.  For example, below is an abridged version of <a href="https://github.com/kubernetes/node-problem-detector/blob/master/config/plugin/check_ntp.sh">a sample script from the NPD repository</a> that checks if a systemd service (NTP) is running.</p>

<pre><code class="language-shell"># Return success if service active (i.e. running)
if systemctl -q is-active ntp.service; then
  echo "NTP is running"
  exit 0
else
  echo "NTP is not running"
  exit 1
fi
</code></pre>

<p>I put this script as the <code>"path"</code> of a <code>"rules"</code> entry with <code>"type": "temporary"</code>.</p>

<ul>
  <li>Since this is a <code>"temporary"</code> rule, NPD may output an Event.</li>
  <li>If the script returns <code>0</code>, then do nothing.</li>
  <li>If the script returns <code>1</code>, then output an Event with <code>"reason"</code> from this <code>"rule"</code>, and <code>"message"</code> from stdout.</li>
</ul>

<p>Alternatively, I could add this script to the configuration as the <code>"path"</code> of a <code>"rules"</code> entry with <code>"type": "permanent"</code>.</p>

<ul>
  <li>Since this is a <code>"permanent"</code> rule, NPD will update an entry in <code>status.conditions</code> of the node. Which Condition entry? The one with <code>"type"</code> equal to this rule’s <code>"condition"</code> field. (Or if none exists, then it will create it, of course.)</li>
  <li>If the script returns <code>0</code>, then NPD will set the Condition to its default state. The default state comes from the entry in the <code>"conditions"</code> section whose <code>"type"</code> matches this rule’s <code>"condition"</code>. NPD will update the Condition using both the <code>"reason"</code> and <code>"message"</code> from the <code>"conditions"</code> entry.</li>
  <li>If it returns <code>1</code>, then NPD will set the Condition to the <code>"reason"</code> from this rule and the <code>"message"</code> to the stdout produced by the script.</li>
</ul>

<h2 id="recipes">Recipes</h2>

<p>We can distill this further into three recipes.</p>

<h3 id="i-want-an-event-to-be-emitted-when-my-script-returns-non-zero-status">I want an Event to be emitted when my script returns non-zero status.</h3>

<ul>
  <li>Don’t add anything to <code>"conditions"</code>.</li>
  <li>Add an entry in <code>"rules"</code> with <code>"type": "temporary"</code>, and <code>"path"</code> with the path to your script.</li>
  <li>Set the rule’s <code>"reason"</code> to what you want emitted in the event.</li>
</ul>

<h3 id="i-want-a-condition-to-be-set-according-to-how-my-script-returns">I want a Condition to be set according to how my script returns</h3>

<ul>
  <li>Add an entry in <code>"conditions"</code> with <code>"type"</code>, and an entry in <code>"rules"</code> with <code>"condition"</code>, set to the same value: The name of the Condition you want to output.</li>
  <li>The rule should have <code>"type": "permanent"</code>, and <code>"path"</code> with the path to your script.</li>
  <li>The condition should have a <code>"reason"</code> and <code>"message"</code> for the passing case.</li>
  <li>The rule should have a <code>"reason"</code> for the failing case. The detailed <code>"message"</code> will come from the script’s stdout.</li>
</ul>

<h3 id="i-want-a-condition-and-an-event-when-my-script-returns-non-zero">I want a Condition and an Event when my script returns non-zero.</h3>

<ul>
  <li>Combine the above…</li>
  <li>A <code>"conditions"</code> entry for the default state of the Condition.</li>
  <li>A <code>"permanent"</code> rule for the erroring state of the Condition.</li>
  <li>A <code>"temporary"</code> rule to emit an event.</li>
  <li>That is, one condition and two rules with the same <code>"path"</code>.</li>
</ul>
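
<p>Putting the third recipe together, using the names from the earlier configuration (the <code>"plugin"</code>, <code>"pluginConfig"</code>, and <code>"source"</code> keys are unchanged and elided here):</p>

<pre><code class="language-json">{
  ...
  "conditions": [
    {
      "type": "MyProblemCondition",
      "reason": "NoProblem",
      "message": "Everything is normal"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "MyProblemCondition",
      "reason": "ProblemCause",
      "path": "./custom-config/plugin-my_problem.sh"
    },
    {
      "type": "temporary",
      "reason": "ProblemCause",
      "path": "./custom-config/plugin-my_problem.sh"
    }
  ]
}
</code></pre>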

<h2 id="deployment">Deployment</h2>

<p>You now have a script and custom plugin configuration to tell node-problem-detector to run the script and update a Condition or output an Event. A little more work is needed to bundle it all together to deploy. Using the <a href="https://github.com/deliveryhero/helm-charts/tree/master/stable/node-problem-detector">helm chart</a>, the following values should be set to enable the custom plugin:</p>

<ul>
  <li>
<code>image</code>: If a custom image was needed to extend the node-problem-detector image with additional binaries, then specify it here.</li>
  <li>
<code>settings.custom_plugin_monitors</code>: This is a list of file paths within the container to the JSON configuration file for custom plugins. If using the chart’s <code>custom_monitor_definitions</code> to populate a ConfigMap, then these paths should start with <code>/custom-config/</code>.</li>
  <li>
<code>settings.custom_monitor_definitions</code>: This defines the contents of a ConfigMap mounted at <code>/custom-config/</code>. Add a key under it with a filename like <code>"my-custom-monitor.json"</code>. Your script can also go in a key here.</li>
</ul>
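
<p>As a sketch (the image name and file names are illustrative; the JSON and script contents come from the earlier sections), the values might look like:</p>

<pre><code class="language-yaml">image:
  repository: registry.example.com/node-problem-detector   # only if a custom image is needed
  tag: custom

settings:
  custom_plugin_monitors:
    - /custom-config/my-custom-monitor.json
  custom_monitor_definitions:
    my-custom-monitor.json: |
      { "plugin": "custom", ... }
    plugin-my_problem.sh: |
      #!/bin/sh
      ...
</code></pre>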

<p>That’s it! Deploy the helm chart with these values, and your custom conditions and events should begin appearing on the node objects when you run <code>kubectl get nodes -o yaml</code> or <code>kubectl describe</code> a node.</p>

<h2 id="demo">Demo</h2>

<p>As part of my exploration, I made a <a href="https://github.com/superorbital/node-problem-detector-custom-plugin-demo">demo repository</a>. The exploration was to see if NPD could be used to detect that the node had a network connection issue. The conclusion of that exploration was that NPD was not a good candidate for detecting network connectivity problems, since it would be unable to write the status back to the api-server over the very network it was detecting a problem within. Nevertheless, the repository shows a complete custom plugin deployment that can be deployed in a <code>kind</code> cluster.</p>
]]></content>
    <summary type="html">A primer on configuring custom plugins for Node Problem Detector</summary>
  </entry>
  <entry>
    <id>tag:superorbital.io,2025-01-08:/blog/make-k8s-apps-istio-aware/</id>
    <title type="html">Make Your K8s Apps Istio-Retry Aware!</title>
    <published>2025-01-08T00:00:00Z</published>
    <updated>2025-01-08T00:00:00Z</updated>
    <link rel="alternate" href="https://superorbital.io/blog/make-k8s-apps-istio-aware/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+SuperOrbital" type="text/html"/>
    <content type="html"><![CDATA[<details>
  <summary><em>Table of Contents</em></summary>

  <p><a href="https://superorbital.io#overview">Overview</a></p>

  <p><a href="https://superorbital.io#istio-changes-communication">Istio Changes Communication</a></p>

  <p><a href="https://superorbital.io#mitigate-the-additional-retries">Mitigate The Additional Retries</a></p>

  <p><a href="https://superorbital.io#get-insight-into-istios-retries">Get Insight Into Istio’s Retries</a></p>

  <p><a href="https://superorbital.io#summary">Summary</a></p>

</details>
<p><br></p>

<h2 id="overview">Overview</h2>

<p>The addition of Istio to Kubernetes (K8s) can remove a lot of the burden of retry and timeout logic from our microservice code. BUT, if our code isn’t aware of what Istio is doing, it can undo these benefits or worse, it can introduce new problems. In this article, we are going to explore how Istio changes API requests, and how to code our requests to respect those changes and maximize their benefits.</p>

<h2 id="istio-changes-communication">Istio Changes Communication</h2>

<p>When microservices on K8s communicate with each other, they must be tolerant of transient issues like pod terminations, network glitches, and request overloads. This means that their retry logic needs to be robust.</p>

<p>On the flip side, when an upstream microservice is floundering (perhaps it’s overloaded with requests, or maybe one of its upstream services is in trouble), we don’t want our retries to be too aggressive, as that could further overload the upstream microservice and actually prevent its recovery.</p>

<blockquote>
  <p><strong>NOTE</strong>: In network applications, “upstream” refers to the direction data flows when a request is made from a client to a server. When your microservice makes a request to another service, that service is said to be “upstream.”</p>
</blockquote>

<p>So, let’s say you’re part of an application development (app dev) team that has achieved this delicate balance for all its microservices by implementing some sophisticated retry logic in its request code. To help other app dev teams quickly achieve similar functionality, the K8s platform team has decided to install Istio in all the K8s clusters, utilizing Istio’s VirtualServices to provide similar retry logic for all microservices.</p>

<p>What does that mean for your team? The addition of Istio-level retry logic presents two challenges. First, the additional retries could push your microservices out of their balanced retry stance and into an overly aggressive stance. Secondly, since Istio is adding retries outside of your code, your code will need some way to get insight into what’s happening with those retries so that it can react appropriately. Let’s start with the first challenge…</p>

<h2 id="mitigate-the-additional-retries">Mitigate The Additional Retries</h2>

<p>To add retries to microservices, a K8s platform team would typically attach an Istio <code>VirtualService</code> resource to each one, similar to this:</p>

<pre><code class="language-yaml">apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: example-api
spec:
  hosts:
    ...
  http:
  - route:
    - destination:
        ...
    retries:
      retryOn: 5xx
      attempts: 3
      perTryTimeout: 1s
</code></pre>

<p>And here’s the really interesting part–when your microservice requests an upstream microservice, that will trigger the retry logic in the <code>VirtualService</code> attached to the <strong><em>upstream microservice</em></strong> (not yours). So, if you don’t have the correct permissions to see their <code>VirtualService</code> in K8s, you’ll want to reach out to the app dev team or K8s platform team that owns it and ask to see a copy. Then, you’ll want to check out the retries section–you can read more about each field in this section of the <a href="https://istio.io/latest/docs/reference/config/networking/virtual-service/#HTTPRetry">Istio Docs</a>.</p>

<blockquote>
  <p><strong>NOTE:</strong> In the absence of VirtualServices, Istio still adds a <a href="https://istio.io/latest/docs/reference/config/istio.mesh.v1alpha1/#MeshConfig-default_http_retry_policy">default retry policy</a> of 2 retries on errors “connect-failure, refused-stream, unavailable, cancelled, or retriable-status-codes.” Retriable-status-codes includes 503 errors.</p>
</blockquote>

<p>Your first order of business should be to figure out the <strong>total time</strong> a request could take and then make sure the timeout in your request code is larger. For example, if you’re requesting a microservice with the above <code>VirtualService</code>, one request could result in 3 retries at 1s intervals, thus taking 4 * 1s = 4s. So, the request timeout in your code should be <em>at least</em> 4s (plus a bit of buffer). Otherwise, your timeout could cut off the request while Istio is still working on it.</p>

<p>Next, you’ll want to calculate the <strong>maximum number of requests</strong> that could result–too many too quickly could overwhelm the upstream microservice. If you’re making a request to the microservice with the above <code>VirtualService</code>, and your request code also has 3 retries, the amplification is multiplicative: your 4 attempts * the upstream <code>VirtualService</code>’s 4 attempts = 16 requests.</p>
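<p>The same back-of-the-envelope math, as a sketch:</p>

```python
# Retry amplification: every client attempt kicks off a full Istio
# retry cycle against the upstream microservice.
CLIENT_RETRIES = 3  # retries in our request code
ISTIO_RETRIES = 3   # retries.attempts in the upstream VirtualService

client_attempts = 1 + CLIENT_RETRIES  # our original request + our retries
istio_attempts = 1 + ISTIO_RETRIES    # Istio's original try + its retries

worst_case_upstream_requests = client_attempts * istio_attempts
print(worst_case_upstream_requests)  # 16
```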

<p>I’ll illustrate this scenario by making a request with 3 retries to a microservice in my K8s cluster whose <code>VirtualService</code> is set to 3 retry attempts.</p>

<p>Here is the log from my request:</p>

<pre><code>DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): test-api.test-api.svc.cluster.local:80
send: b'GET / HTTP/1.1\r\nHost: test-api.test-api.svc.cluster.local\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 503 Service Unavailable\r\n'
header: server: istio-envoy
header: date: Wed, 08 Jan 2025 18:54:24 GMT
header: content-type: text/html; charset=utf-8
header: access-control-allow-origin: *
header: access-control-allow-credentials: true
header: content-length: 0
header: x-envoy-upstream-service-time: 138
DEBUG:urllib3.connectionpool:http://test-api.test-api.svc.cluster.local:80 "GET / HTTP/1.1" 503 0

DEBUG:urllib3.util.retry:Incremented Retry for (url='/'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Retry: /
send: b'GET / HTTP/1.1\r\nHost: test-api.test-api.svc.cluster.local\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 503 Service Unavailable\r\n'
header: server: istio-envoy
header: date: Wed, 08 Jan 2025 18:54:25 GMT
header: content-type: text/html; charset=utf-8
header: access-control-allow-origin: *
header: access-control-allow-credentials: true
header: content-length: 0
header: x-envoy-upstream-service-time: 109
DEBUG:urllib3.connectionpool:http://test-api.test-api.svc.cluster.local:80 "GET / HTTP/1.1" 503 0

DEBUG:urllib3.util.retry:Incremented Retry for (url='/'): Retry(total=1, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Retry: /
send: b'GET / HTTP/1.1\r\nHost: test-api.test-api.svc.cluster.local\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 503 Service Unavailable\r\n'
header: server: istio-envoy
header: date: Wed, 08 Jan 2025 18:54:25 GMT
header: content-type: text/html; charset=utf-8
header: access-control-allow-origin: *
header: access-control-allow-credentials: true
header: content-length: 0
header: x-envoy-upstream-service-time: 120
DEBUG:urllib3.connectionpool:http://test-api.test-api.svc.cluster.local:80 "GET / HTTP/1.1" 503 0

DEBUG:urllib3.util.retry:Incremented Retry for (url='/'): Retry(total=0, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Retry: /
send: b'GET / HTTP/1.1\r\nHost: test-api.test-api.svc.cluster.local\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 503 Service Unavailable\r\n'
header: server: istio-envoy
header: date: Wed, 08 Jan 2025 18:54:25 GMT
header: content-type: text/html; charset=utf-8
header: access-control-allow-origin: *
header: access-control-allow-credentials: true
header: content-length: 0
header: x-envoy-upstream-service-time: 90
DEBUG:urllib3.connectionpool:http://test-api.test-api.svc.cluster.local:80 "GET / HTTP/1.1" 503 0
Request failed: 503 Server Error: Service Unavailable for url: http://test-api.test-api.svc.cluster.local/
</code></pre>

<p>The upstream microservice is responding (via Istio) with 503 errors, so we see 4 responses as expected. Now let’s look at the Istio logs and see <em>what else</em> is going on…</p>

<pre><code>test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:24 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:24 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:24 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:24 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:24 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
</code></pre>

<p>Whoa, that’s <strong>16</strong> total requests to the upstream microservice, even though our request code only had <strong>3</strong> retries! We’d better reduce our code’s retries or find a way to make more intelligent retry decisions. This takes us to our second challenge…</p>

<h1 id="get-insight-into-istios-retries">Get Insight Into Istio’s Retries</h1>

<p>When you request an upstream microservice that is struggling, Istio will iterate through the retry logic specified by the <code>VirtualService</code>, trying to get you a good response. But what happens if it exhausts its retry logic? Or if it runs into other problems? Will you be stuck with some generic 503 or 504 error? The answer is…</p>

<p>It depends! You had to see that coming–I <em>am</em> a consultant, after all.</p>

<p>Fortunately, there is a very powerful, yet <em>little-known</em> feature of Istio that can help us with this: <code>VirtualServices</code> allow us to dynamically <a href="https://istio.io/latest/docs/reference/config/networking/virtual-service/#Headers">inject or remove headers</a> in requests and responses. In our case, we want to see Istio’s RESPONSE_FLAGS (more on that in a moment) by adding a <code>headers</code> section to the upstream <code>VirtualService</code> like this:</p>

<pre><code class="language-yaml">apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: example-api
spec:
  hosts:
    ...
  http:
  - route:
    - destination:
        ...
    retries:
      retryOn: 5xx
      attempts: 3
      perTryTimeout: 1s
    headers:
      response:
        set:
          x-envoy-response-flags: "%RESPONSE_FLAGS%"
</code></pre>

<p>Dynamically injecting headers with Istio is a very underdocumented feature that even seasoned Istio users might not be aware of, so it’s likely that you will have to ask your K8s platform team to add this <code>headers</code> section to the upstream <code>VirtualServices</code>.</p>

<p>In this example, I’ve called the header <code>x-envoy-response-flags</code>, but it could be called anything. I know the <code>x-</code> prefix is somewhat out of vogue now, but I chose <code>x-envoy-</code> to match the other headers that Istio adds to the response, like <code>x-envoy-upstream-service-time</code>.</p>

<p>Anyway, Istio can add a lot of RESPONSE_FLAGS to this header. For the complete list, go to <a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/access_log/usage">this doc page</a> and search for “RESPONSE_FLAGS.”</p>

<blockquote>
  <p>Don’t be weirded out by the fact that this is an Envoy doc page–Istio uses Envoy to manage its inter-service network traffic.</p>
</blockquote>

<p>Let’s say your upstream microservice+<code>VirtualService</code> does include this <code>x-envoy-response-flags</code> header. When Istio exhausts the retry logic in the upstream <code>VirtualService</code>, it will send back a <strong>URX</strong> flag (UpstreamRetryLimitExceeded) in the <code>x-envoy-response-flags</code> header.</p>

<p>When added to our previous example about retries, we can see this additional header in the log:</p>

<pre><code>
DEBUG:urllib3.util.retry:Incremented Retry for (url='/'): Retry(total=0, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Retry: /
send: b'GET / HTTP/1.1\r\nHost: test-api.test-api.svc.cluster.local\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 503 Service Unavailable\r\n'
header: server: istio-envoy
header: date: Wed, 08 Jan 2025 20:48:03 GMT
header: content-type: text/html; charset=utf-8
header: access-control-allow-origin: *
header: access-control-allow-credentials: true
header: content-length: 0
header: x-envoy-upstream-service-time: 89
<strong>header: x-envoy-response-flags: URX</strong>
DEBUG:urllib3.connectionpool:http://test-api.test-api.svc.cluster.local:80 "GET / HTTP/1.1" 503 0
Request failed: 503 Server Error: Service Unavailable for url: http://test-api.test-api.svc.cluster.local/
</code></pre>

<p>Now we need our code’s request logic to check for that URX flag. If it sees it, then it should stop retrying because it knows that Istio has already sufficiently retried.</p>

<p>Here is some example code in Python that will make a request and then look for this particular header in the response. Python folks typically use a <code>requests.Session</code> with a <code>urllib3.util.Retry</code> object to manage their retries. However, I haven’t found a good way to inject a check function into the <code>urllib3.util.Retry</code> object, so I’m using my own retry loop. This allows me to include my <code>should_retry()</code> function, which can look for the <code>x-envoy-response-flags</code> header along with all my other checks.</p>

<pre><code class="language-python">#!/usr/bin/env python3

import requests
import time

def should_retry(response, max_retries=0, retries_completed=0):
    if retries_completed &gt;= max_retries:
        print("Finished all ({}) retries".format(max_retries))
        return False
    # Check for any response codes that you want to be retried
    if response.status_code not in [500, 502, 503, 504]:
        print("{} is not a retry-able status code".format(response.status_code))
        return False
    # Check for any Istio Response Flag headers that should stop retries
    if response.headers.get('x-envoy-response-flags') == 'URX':
        print('UpstreamRetryLimitExceeded (URX) response flag received from Istio, stopping retries')
        return False
    return True

def simple_get_with_retry(url, retries=0, data_dict=None, headers=None, timeout_secs=5):
    retries_completed = 0
    response = requests.get(url.strip(), headers=headers, json=data_dict, timeout=timeout_secs)
    while should_retry(response, retries, retries_completed):
        time.sleep(timeout_secs)
        response = requests.get(url.strip(), headers=headers, json=data_dict, timeout=timeout_secs)
        retries_completed += 1
    return response

def main():
    response = simple_get_with_retry("http://my-microservice.local", 3)
    print("Response code: " + str(response.status_code))
    print("Response Headers: " + str(response.headers))

if __name__ == "__main__":
    main()
</code></pre>

<p>Feel free to run this example code in a debug pod or from wherever you can reach an upstream microservice–just replace <code>http://my-microservice.local</code> with the upstream URL. The point here is that you’ll need some sort of check function in your retry loop, where you can react to the <code>x-envoy-response-flags</code> header.</p>

<p>Checking for URX is a great starting point. Over time, as your microservices evolve, be sure to check <a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/access_log/usage">the doc page</a> for additional RESPONSE_FLAGS that could help make your check function more intelligent.</p>
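<p>As a sketch of what a more flag-aware check could look like, here’s a hypothetical lookup table covering a few flags from that list–URX (retry limit exceeded), UO (upstream overflow, i.e. circuit breaking), and RL (rate limited). Which flags you honor, and how, is a design decision for you and your platform team:</p>

```python
# Hypothetical mapping of a few Envoy response flags (documented on the
# Envoy access log page) to "stop retrying" decisions.
NON_RETRIABLE_FLAGS = {
    "URX": "Istio already exhausted its retries",
    "UO": "upstream circuit breaker tripped; retrying only adds load",
    "RL": "request was rate-limited; retrying immediately won't help",
}

def flag_allows_retry(response_headers):
    """Return (retry_ok, reason) based on the x-envoy-response-flags header.
    Envoy may report multiple flags, comma-separated."""
    flags = response_headers.get("x-envoy-response-flags", "")
    for flag in flags.split(","):
        flag = flag.strip()
        if flag in NON_RETRIABLE_FLAGS:
            return False, NON_RETRIABLE_FLAGS[flag]
    return True, ""

print(flag_allows_retry({"x-envoy-response-flags": "URX"}))  # retry not ok
print(flag_allows_retry({}))                                 # retry ok
```

<p>A check like this can be dropped into the <code>should_retry()</code> function from the earlier example.</p>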

<p>And while we’re talking about the evolution of our microservices, let’s also consider the evolution of our K8s platform. Once our K8s platform team has defined appropriate <code>retries</code> within <code>VirtualServices</code> for <strong><em>every</em></strong> microservice, we can shift our thinking and consider retries to be the platform’s responsibility. Ideally, we’ll stop managing retry logic within each microservice and allow the platform to own and manage all retries centrally within its <code>VirtualServices</code>. In that case, our microservice request code would still want to check the <code>x-envoy-response-flags</code> response header to use in its error handling.</p>
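<p>Under that platform-owned model, the client code could shrink to something like this sketch (hypothetical helper names; there’s no retry loop at all–the mesh owns retries, and we only surface Istio’s flags in our error handling):</p>

```python
import requests

def describe_failure(status_code, headers):
    """Build an error message that includes Istio's response flags."""
    flags = headers.get("x-envoy-response-flags", "none")
    return "Request failed with {} (Istio response flags: {})".format(
        status_code, flags)

def get_without_retries(url, timeout_secs=5):
    """Make a single request and let the mesh handle all retries."""
    response = requests.get(url, timeout=timeout_secs)
    if not response.ok:
        # Surface the mesh's view of the failure for logging/alerting
        raise RuntimeError(describe_failure(response.status_code,
                                            response.headers))
    return response
```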

<h1 id="summary">Summary</h1>

<p>When requesting an upstream microservice, you should calculate the total possible <code>retries</code> time defined in its Istio <code>VirtualService</code>, and use that to size the timeout of your request code. You should also reduce the number of retries in your request code since the <code>VirtualService</code> will already be instructing Istio to do them.</p>

<p>Additionally, your request code should check for Istio’s RESPONSE_FLAGS within response headers so that it can react intelligently to them in its retry and error-handling logic.</p>

<p>And finally, we should move to get rid of all retries from our code and rely on the K8s+Istio platform to own them centrally and do what is best. Coming up with appropriate retry and timeout logic for every upstream microservice that our code requests is tedious. When we multiply that effort by all the other teams whose code is also making requests to the same upstream microservices, it wastes a lot of time. Instead, let’s spend that time developing the features that our users really care about, and leave the network details like retries and timeouts to the platform!</p>

<p>If you found this guide helpful, you might also enjoy our live <a href="https://superorbital.io/training/istio/">Istio training workshops</a>. We spend &gt; 50% of our workshop time doing hands-on lab work, which is a really fun way to learn. Sometimes I help out as a lab coach–maybe I’ll see you there!</p>
]]></content>
    <summary type="html">Learn how to effectively code API requests to microservices in Istio-enabled Kubernetes.</summary>
  </entry>
  <entry>
    <id>tag:superorbital.io,2024-12-11:/blog/status-and-conditions/</id>
    <title type="html">Status and Conditions: Explained!</title>
    <published>2024-12-11T00:00:00Z</published>
    <updated>2024-12-11T00:00:00Z</updated>
    <link rel="alternate" href="https://superorbital.io/blog/status-and-conditions/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+SuperOrbital" type="text/html"/>
    <content type="html"><![CDATA[<details>
  <summary><em>Table of Contents</em></summary>

  <ul>
    <li><a href="https://superorbital.io#what-is-a-status-field">What is a status field?</a></li>
    <li>
<a href="https://superorbital.io#conditions">Conditions</a>
      <ul>
        <li><a href="https://superorbital.io#a-small-history-lesson">A small history lesson…</a></li>
        <li><a href="https://superorbital.io#conditions-in-kubernetes">Conditions in Kubernetes</a></li>
      </ul>
    </li>
    <li><a href="https://superorbital.io#best-practices-for-conditions-in-custom-resources">Best Practices for Conditions in Custom Resources</a></li>
    <li><a href="https://superorbital.io#takeaways">Takeaways</a></li>
  </ul>

</details>

<p>If you’ve ever used Kubernetes, you may have already introspected the many fields inside a Kubernetes object, usually by getting the object via <code>kubectl get</code> and having the <code>-o yaml</code> flag set. If so, you would have noticed the top-level <code>status</code> field, which is never set by a user. You may have also seen the command <code>kubectl wait --for=condition=Ready=true</code> <a href="https://kubernetes.io/docs/reference/kubectl/generated/kubectl_wait/#examples">being used in commands</a> to wait for resources to be in an expected state. What’s the purpose of the <code>status</code> field? Why aren’t any changes in this field persisted when we try to modify them with <code>kubectl edit</code>? And what are <code>conditions</code>? You’re definitely not the only one with these questions, so let’s find out a bit more about the status field and conditions!</p>

<h3 id="what-is-a-status-field">What is a status field?</h3>

<p>In Kubernetes, all resources are bound to an API, which has fields for the specification of the desired state of the object (the top-level field usually named <code>spec</code>), identifying information of the object (the top-level field named <code>metadata</code>), and the current state of the object (the top-level field named <code>status</code>). The <code>metadata</code> field contains data such as the name, namespace and UID that helps to uniquely identify the object (but is not relevant to the topic of this article). We can see an example of these fields below with our fake object <code>MyObject</code>:</p>

<pre><code class="language-yaml">apiVersion: resources.superorbital.com/v1
kind: MyObject
metadata:
  name: my-object
  namespace: testing
  uid: 314e7f00-2694-4f9a-bc08-73aa3104fa8b
spec:
  widgets:
  - "foo"
  - "bar"
status:
  observedWidgets:
  - "baz"
</code></pre>

<p>The <code>spec</code> field contains a resource-specific description of the configuration for that object. The information in this field is used by controllers in the cluster to perform operations such as creating and scaling containers, and any other actions required to ensure the object can be put into the desired state.</p>

<p>The <code>status</code> field provides a space for controllers to summarize the current state of the object in the system. This may include (but is not limited to) the current progress of an ongoing action, the success or failure of said action taken by the controller, whether the object is in an expected state or not, the progress towards the expected state, or any other observations made by the controller that may be relevant for the consumer of the object to know about.</p>

<p>One thing to note is that the <code>status</code> field is usually not directly editable by a user via <code>kubectl edit</code>. This is because the <code>status</code> field is only modifiable from a different subresource (<code>/status</code>) from the main object to prevent a situation where an object modification can overwrite the status field unintentionally. This also means that RBAC permissions for the status subresources are provided separately from the permissions for the main object. As an example with the <code>resources.superorbital.com/myobject</code> resource from before:</p>

<pre><code class="language-yaml">apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: role
rules:
# Permissions for the object
- apiGroups:
  - resources.superorbital.com
  resources:
  - myobjects
  verbs:
  - get
  - list
  - patch
  - update
  - watch
# Separate permissions for status fields of the object
- apiGroups:
  - resources.superorbital.com
  resources:
  - myobjects/status
  verbs:
  - get
  - patch
  - update
</code></pre>

<p>Given that the structure of the <code>status</code> field can differ between different resources, there are no required subfields within the <code>status</code> field. However, it is expected that the status field will <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#when-to-use-a-status-field">hold the values</a> of the observed current state. For many workload-type resources (such as Pods), useful information in the <code>status</code> includes the current container status, the IP address assigned to the container, and when the container was started. Additionally, some resources (such as Pods again) contain a <code>conditions</code> subfield that holds similar information to the ones in the other status subfields. But what exactly is this field?</p>

<h3 id="conditions">Conditions</h3>

<p>The <code>conditions</code> subfield is a list of Condition elements, each of which provides a standard format to store information about the state of a resource. This information is meant to complement the existing information in the <code>status</code> field and allows consumers of the object to read information about the observed state without having to know the resource-specific <code>status</code> subfields. As an example, a Pod’s <code>conditions</code> subfield contains a <code>type: Ready</code> Condition that is only set to <code>status: "True"</code> when all of the Pod’s containers are running and ready to accept traffic – but the Pod’s <code>status</code> fields also contain a <code>containerStatuses</code> field with a lot more detail about the state of each container.</p>

<p>Conditions will typically follow <a href="https://github.com/kubernetes/apimachinery/blob/release-1.23/pkg/apis/meta/v1/types.go#L1433-L1493">the standard schema</a> with the following fields:</p>

<ul>
  <li>
<code>type</code>: The type of condition (in CamelCase)</li>
  <li>
<code>status</code>: The status of the condition – either <code>"True"</code>, <code>"False"</code>, or <code>"Unknown"</code>.</li>
  <li>
<code>observedGeneration</code>: The <code>.metadata.generation</code> value of the object when this condition was observed. This field is optional.</li>
  <li>
<code>lastTransitionTime</code>: The time when a Condition transitioned from one status to another.</li>
  <li>
<code>reason</code>: An identifier (in CamelCase) that provides the reason for the last transition. This value is used for an API to consume.</li>
  <li>
<code>message</code>: A human-readable message with details about the transition.</li>
</ul>

<p>However, this is not always the case. Some Conditions, such as the <a href="https://github.com/kubernetes/kubernetes/blob/810e9e212ec5372d16b655f57b9231d8654a2179/staging/src/k8s.io/api/core/v1/types.go#L3307-L3327">PodCondition</a>, will include a <code>lastProbeTime</code>, and others will have a <code>severity</code> field, such as <a href="https://github.com/kubernetes-sigs/cluster-api/blob/85783d75851bb8ec21bd3e65f9391ee66b51fa08/api/v1beta1/condition_types.go#L55-L85">the Cluster API Condition</a>. In general, the fields mentioned are always present, and extra optional ones may be added depending on the needs of the object.</p>

<p>One misconception is that the <code>conditions</code> field represents a chronological record of updates on the object by the controller; however, that is incorrect. Even though <code>conditions</code> is a list of Conditions, it’s actually treated as if it were a map with <code>type</code> being used as the key. More Conditions get appended to this array when their <code>type</code> value differs from all the other ones. This allows a single object to report multiple Conditions at once. As an example:</p>

<pre><code class="language-yaml">status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-10-23T18:41:59Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2024-10-23T18:41:58Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-10-23T18:41:59Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-10-23T18:41:59Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-10-23T18:41:58Z"
    status: "True"
    type: PodScheduled
</code></pre>

<p>The Conditions in this Pod aren’t in any particular order but do indicate information about the Pod being scheduled, the containers starting, and eventually the containers running. If we were to watch the <code>conditions</code> field being modified in real-time, each element would start with a <code>status</code> value of <code>"False"</code> or <code>"Unknown"</code>; as time passes and the containers start to run, we would see the appropriate Condition’s <code>status</code> set to <code>"True"</code> and its <code>lastTransitionTime</code> set to the timestamp when the Condition transitioned to <code>"True"</code>.</p>
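<p>Controllers implement that “map keyed by <code>type</code>” behavior with an upsert. Here’s an illustrative Python sketch (the real helpers live in Kubernetes’ apimachinery, e.g. <code>meta.SetStatusCondition</code> in Go) that only bumps <code>lastTransitionTime</code> when the status actually changes:</p>

```python
from datetime import datetime, timezone

def set_condition(conditions, new_cond):
    """Upsert a Condition into the list, keyed by its 'type'.
    lastTransitionTime only changes when the status value changes."""
    now = datetime.now(timezone.utc).isoformat()
    for i, cond in enumerate(conditions):
        if cond["type"] == new_cond["type"]:
            if cond["status"] != new_cond["status"]:
                new_cond["lastTransitionTime"] = now
            else:  # status unchanged: keep the original transition time
                new_cond["lastTransitionTime"] = cond["lastTransitionTime"]
            conditions[i] = new_cond
            return conditions
    new_cond["lastTransitionTime"] = now
    conditions.append(new_cond)  # first time we've seen this type
    return conditions

conds = []
set_condition(conds, {"type": "Ready", "status": "False"})
set_condition(conds, {"type": "PodScheduled", "status": "True"})
set_condition(conds, {"type": "Ready", "status": "True"})  # replaces, not appends
print([c["type"] for c in conds])  # ['Ready', 'PodScheduled']
```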

<p>This is the exact logic that <code>kubectl wait</code> uses! When you run <code>kubectl wait</code> in the CLI, it retrieves the target resource by its name and reviews the <code>conditions</code> array. It looks for a condition matching the name you have queried and watches for updates to that resource until that status returns <code>"True"</code> or the timeout occurs.</p>
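<p>The core of that check can be sketched in a few lines (a hypothetical helper–kubectl’s real implementation also handles watches, JSONPath expressions, and timeouts):</p>

```python
def condition_matches(obj, cond_type, expected_status="True"):
    """Mimic the heart of `kubectl wait --for=condition=<type>=<status>`:
    find the Condition with the requested type and compare its status."""
    for cond in obj.get("status", {}).get("conditions", []):
        if cond["type"] == cond_type:
            return cond["status"] == expected_status
    return False  # condition not present (yet)

pod = {"status": {"conditions": [
    {"type": "PodScheduled", "status": "True"},
    {"type": "Ready", "status": "False"},
]}}
print(condition_matches(pod, "Ready"))         # False
print(condition_matches(pod, "PodScheduled"))  # True
```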

<h4 id="a-small-history-lesson">A small history lesson…</h4>

<p>Once upon a time, in the ancient times of the year 2015, Pods had a <code>status.phase</code> field where the state of the Pod was reflected as an enum. This made a lot of people very angry and was widely regarded as a bad move since any change to the values of the field would necessitate reinterpreting an existing enum or adding a new one, which is not backwards compatible. An issue (<a href="https://github.com/kubernetes/kubernetes/issues/7856">#7856</a>) was created where users discussed the alternatives to <code>phase</code>, and came to the conclusion that <code>conditions</code> would be the successor to <code>phase</code>. However, <code>phase</code> was never actually phased out, as breaking an existing API that was being used proved to be far too difficult. This is why today we still see a <code>phase</code> field in Pods, even though it’s officially been deprecated in favor of Conditions.</p>

<h4 id="conditions-in-kubernetes">Conditions in Kubernetes</h4>

<p>Nowadays, even though Conditions were <a href="https://github.com/kubernetes/kubernetes/issues/50798">almost deprecated</a> back in 2017, it looks like the field is here to stay. <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#spec-and-status">Official documentation</a> recommends that all new resources contain a <code>conditions</code> field in their status and provides guidance on how to implement it. In addition to Pods, Conditions are present in Namespaces, Nodes, PersistentVolumes, PersistentVolumeClaims, and Services for the core APIs. All resources in the batch API (e.g. Jobs) implement Conditions, as do the autoscaler resources such as HorizontalPodAutoscaler objects. Unfortunately, not all core resources in Kubernetes use Conditions. All apps API resources (Deployment, StatefulSet, DaemonSet…) define Condition types, but only <a href="https://github.com/kubernetes/kubernetes/blob/810e9e212ec5372d16b655f57b9231d8654a2179/pkg/apis/apps/types.go#L892-L918">the ReplicaSet</a> and <a href="https://github.com/kubernetes/kubernetes/blob/810e9e212ec5372d16b655f57b9231d8654a2179/pkg/apis/apps/types.go#L542-L574">the Deployment</a> resources actually populate them; StatefulSet and DaemonSet make no attempt to fill in the <code>conditions</code> field. This is evident when comparing <a href="https://github.com/kubernetes/kubectl/blob/5f5894cd61c609d7b55aa0f9bc99967155c69a9f/pkg/polymorphichelpers/rollout_status.go">the logic</a> for <code>kubectl rollout status</code>, which is a command that waits for Deployments, StatefulSets and DaemonSets to be in a Ready state after a rollout: the logic for checking whether a Deployment is rolled out includes a substep that checks the Condition on the object, but nothing similar exists for StatefulSets or DaemonSets.</p>

<h3 id="best-practices-for-conditions-in-custom-resources">Best Practices for Conditions in Custom Resources</h3>

<p>If you’re building a custom resource, you will invariably need to implement a <code>status</code> field. This means that a <code>conditions</code> field will almost surely follow. If so:</p>

<ol>
  <li>Implement a Condition of type <code>Ready</code> for long-running execution objects (think of Pods and Services), and a type <code>Succeeded</code> for bounded-execution objects (e.g. Jobs). Strive to always have an all-encompassing summary Condition for quickly assessing if the object is in a good state.</li>
  <li>When a Condition’s <code>"True"</code> status represents normal operations, it is referred to as a “positive-polarity” condition, whereas Conditions where <code>"False"</code> represents this state are “negative-polarity”. Standardize all your Conditions to use the same polarity to represent normal operations. This will help avoid confusion when scanning the <code>conditions</code> list and seeing a mix of <code>"True"</code> and <code>"False"</code> status values during normal conditions.</li>
  <li>Condition type names should always describe the current state of the observed object, never a transition phase. Think of <code>ScaledOut</code> as opposed to <code>Scaling</code> – the former can be set to <code>"True"</code> when successful, <code>"False"</code> when failed, and <code>"Unknown"</code> when the process is still ongoing.</li>
</ol>
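<p>Putting points 1 and 2 together, a controller might compute its summary Condition from the others. Here’s a hypothetical sketch, assuming uniformly positive polarity (the type and reason names are made up for illustration):</p>

```python
def summarize_ready(conditions):
    """Roll individual (positive-polarity) Conditions up into one
    all-encompassing Ready condition, per the best practices above."""
    sub = [c for c in conditions if c["type"] != "Ready"]
    if any(c["status"] == "False" for c in sub):
        status, reason = "False", "ComponentNotReady"
    elif any(c["status"] == "Unknown" for c in sub):
        status, reason = "Unknown", "ComponentStateUnknown"
    else:
        status, reason = "True", "AllComponentsReady"
    return {"type": "Ready", "status": status, "reason": reason}

conds = [
    {"type": "ScaledOut", "status": "True"},
    {"type": "CertificateProvisioned", "status": "Unknown"},
]
print(summarize_ready(conds))  # status is "Unknown": one component still pending
```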

<h3 id="takeaways">Takeaways</h3>

<p>The ubiquitous presence of Conditions in Kubernetes resources has an interesting history and represents the desire of the Kubernetes architects to provide a declarative, level-based, and observation-driven design. Even though the API still contains artifacts from past decisions (I’m looking at you, <code>phase</code>), Conditions signify an important step forward in standardizing the visualization of the observed state.</p>

<p><a href="https://feed.superorbital.io/">Subscribe (yes, we still ❤️ RSS)</a> or join our mailing list below to read more blog posts like this one!</p>
]]></content>
    <summary type="html">How to interpret the status and conditions fields in Kubernetes resources</summary>
  </entry>
  <entry>
    <id>tag:superorbital.io,2024-12-09:/blog/chaos-mesh/</id>
    <title type="html">Bring on the Chaos!</title>
    <published>2024-12-09T00:00:00Z</published>
    <updated>2024-12-09T00:00:00Z</updated>
    <link rel="alternate" href="https://superorbital.io/blog/chaos-mesh/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+SuperOrbital" type="text/html"/>
    <content type="html"><![CDATA[<!-- markdownlint-disable MD029 MD033 -->
<!-- This style tag will cause the individual lines in the code blocks to word wrap. -->
<!-- Unfortunately they will not be indented, but this is still better in many cases. -->
<style type="text/css" media="screen">
code[class*="language-"], pre[class*="language-"] {
    white-space: pre-wrap;
}
</style>

<p class="image"><img src="https://superorbital.io/blog/chaos-mesh/cover.jpg" alt="chaos - fractal flames"></p>

<details>
  <summary><em>Table of Contents</em></summary>

  <ul>
    <li><a href="https://superorbital.io#overview">Overview</a></li>
    <li>
<a href="https://superorbital.io#chaos-engineering">Chaos Engineering</a>
      <ul>
        <li><a href="https://superorbital.io#game-days">Game Days</a></li>
        <li><a href="https://superorbital.io#practice-makesbetter">Practice Makes…Better</a></li>
      </ul>
    </li>
    <li>
<a href="https://superorbital.io#kubernetes">Kubernetes</a>
      <ul>
        <li>
<a href="https://superorbital.io#chaos-mesh">Chaos Mesh</a>
          <ul>
            <li><a href="https://superorbital.io#installation">Installation</a></li>
            <li>
<a href="https://superorbital.io#chaos-experiments">Chaos Experiments</a>
              <ul>
                <li><a href="https://superorbital.io#resource-stress">Resource Stress</a></li>
                <li><a href="https://superorbital.io#pod-stability">Pod Stability</a></li>
                <li><a href="https://superorbital.io#network-latency">Network Latency</a></li>
              </ul>
            </li>
            <li><a href="https://superorbital.io#clean-up">Clean Up</a></li>
          </ul>
        </li>
        <li><a href="https://superorbital.io#conclusion">Conclusion</a></li>
      </ul>
    </li>
    <li><a href="https://superorbital.io#further-exploration">Further Exploration</a></li>
    <li><a href="https://superorbital.io#acknowledgments">Acknowledgments</a></li>
    <li><a href="https://superorbital.io#footnotes">Footnotes</a></li>
  </ul>

</details>

<hr>

<h2 id="overview">Overview</h2>

<p>In this article, we are going to explore the idea of Chaos Engineering and one tool, Chaos Mesh, that can help you simulate some common types of Kubernetes cluster disruptions. By testing how your applications respond to those events, you can use that knowledge to improve their resilience against similar planned or unplanned events in the future.</p>

<blockquote>
  <p><strong>NOTE</strong>: All of the <em>custom</em> files used in this post can be downloaded from the accompanying <code>git</code> repository at <a href="https://github.com/superorbital/chaos-mesh-playground">github.com/superorbital/chaos-mesh-playground</a>.</p>
</blockquote>

<h2 id="chaos-engineering">Chaos Engineering</h2>

<p>Like the weather, the internet and distributed systems are unreliable. In general, they do what we expect them to do, but they inevitably do something unexpected at the least convenient moment. In the case of the weather, we prepare for this guaranteed eventuality by buying a coat and umbrella, learning how to use them properly, keeping them in good shape, and having them close at hand when we head out for the day. With distributed systems, we need to build, install, test, and practice procedures that ensure our systems handle unplanned failures with grace and aplomb.</p>

<p><a href="https://en.wikipedia.org/wiki/Chaos_engineering">Chaos Engineering</a> is the art of intentionally injecting various forms of chaos, or failure scenarios, into a system, observing what happens, and then documenting, evaluating, and improving the system to better handle those events in the future. Although this type of testing has existed in one form or another for a very long time, the term Chaos Engineering is primarily attributed to engineers at <a href="https://www.netflix.com/">Netflix</a>, who, in 2011, released a tool called <a href="https://github.com/Netflix/chaosmonkey">Chaos Monkey</a>, which randomly terminated virtual machine (VM<sup id="fnref:vm" role="doc-noteref"><a href="https://superorbital.io#fn:vm" class="footnote" rel="footnote">1</a></sup>) instances and containers inside their <strong>production</strong> environment. The goal, as they undertook a massive <a href="https://netflixtechblog.com/5-lessons-weve-learned-using-aws-1f2a28588e4c">migration</a> into Amazon Web Services (AWS<sup id="fnref:aws" role="doc-noteref"><a href="https://superorbital.io#fn:aws" class="footnote" rel="footnote">2</a></sup>), was to directly expose developers to their applications’ failure cases and incentivize them to build resilient services. Chaos Monkey was so successful that it eventually spawned a whole series of tools that became known as the <a href="https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116">Netflix Simian Army</a>.</p>

<h3 id="game-days">Game Days</h3>

<p>Another idea that has been around for a long time but was actively popularized in the technology field by AWS is a Game Day. To directly quote the <a href="https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-23/wat.concept.gameday.en.html">AWS well-architected manual</a>, “A game day simulates a failure or event to test systems, processes, and team responses. The purpose is to actually perform the actions the team would perform as if an exceptional event happened. These should be conducted regularly so that your team builds ‘muscle memory’ on how to respond. Your game days should cover the areas of operations, security, reliability, performance, and cost.”</p>

<p>If you want to be prepared for a human health emergency, you might take a class to learn CPR<sup id="fnref:cpr" role="doc-noteref"><a href="https://superorbital.io#fn:cpr" class="footnote" rel="footnote">3</a></sup>, but unless you practice it on a regular basis, it is very likely that you will have forgotten how to do it properly when a real emergency arrives. You will either freeze or potentially even cause more damage by performing the procedure incorrectly.</p>

<p>Organizations and teams that really want to be prepared to handle emergencies as smoothly and effectively as possible must practice frequently. And effective practice requires a tight feedback loop that, at a minimum, includes most of the following steps: plan, test, observe, document, fix, test, and repeat.</p>

<h3 id="practice-makesbetter">Practice Makes…Better</h3>

<p>No process is ever perfect, but practice and follow-through can help move you in the right direction.</p>

<p>To get started, most organizations will want to have at least two environments: development and production. A development, integration, or staging environment often gives an organization enough redundancy to feel safe starting to experiment with chaos engineering and game days.</p>

<p>In these environments, it is recommended that you pick a scenario, plan it out, and then schedule a time to trigger the incident, allowing teams to observe and respond to what occurs. Some things will be expected, while others will be a complete surprise. This exercise gives teams a chance to discover many things, like previously unknown risks, unexpected edge cases, poor documentation, poor training, software bugs, issues in the incident management process, and much more.</p>

<p>This is a good start, but follow-up is <strong>critical</strong>! The teams that were involved must be given the space to do a thorough retrospective regarding the event, where they can discuss and document what happened and how it might be avoided or improved. When the retrospective ends, each team should have a list of action items that will be immediately converted into tickets for follow-up, design, and implementation.</p>

<p>As teams get more experienced with this exercise, the game days can evolve to mirror real life more accurately. Eventually, organizers can plan the event but leave the teams involved in the dark about what situation is going to be triggered. This ensures the teams can rely on nothing but their existing preparation, precisely as they would during an actual incident.</p>

<p>This not only lets you test the product and the teams that maintain it, but also lets you thoroughly test the incident management process.</p>

<ul>
  <li>How are communications handled?</li>
  <li>Did the right teams get notified at the right time?</li>
  <li>Were we able to quickly engage the right on-call people?</li>
  <li>Was anyone confused or uninformed about the status of the incident at any point?</li>
  <li>Did we properly simulate communication with customers, leadership, etc?</li>
</ul>

<p>Organizations and teams will improve as they practice and, just as importantly, follow up on their findings.</p>

<h2 id="kubernetes">Kubernetes</h2>

<p>So, how can this sort of testing be done within a Kubernetes cluster? There are many potential approaches, but one tool that can help mimic some of the potential failure cases that can occur within Kubernetes is <a href="https://chaos-mesh.org/">Chaos Mesh</a>, which we will discuss throughout the rest of the article.</p>

<h3 id="chaos-mesh">Chaos Mesh</h3>

<p>Chaos Mesh is an <a href="https://www.cncf.io/projects/chaosmesh/">incubating open-source project</a> in the Cloud Native Computing Foundation (CNCF<sup id="fnref:cncf" role="doc-noteref"><a href="https://superorbital.io#fn:cncf" class="footnote" rel="footnote">4</a></sup>) ecosystem. The project’s source code can be found on GitHub at <a href="https://github.com/chaos-mesh/chaos-mesh">chaos-mesh/chaos-mesh</a>, and it utilizes the <a href="https://cloud-native.slack.com/archives/C0193VAV272">CNCF Slack workspace</a> for community discussions.</p>

<p>This tool primarily consists of four in-cluster components, described below, and one optional CLI<sup id="fnref:cli" role="doc-noteref"><a href="https://superorbital.io#fn:cli" class="footnote" rel="footnote">5</a></sup> tool called <a href="https://chaos-mesh.org/docs/chaosctl-tool/">chaosctl</a>.</p>

<ul>
  <li>
<strong><code>chaos-controller-manager</code> <a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/">Deployment</a></strong> - The core component for orchestrating chaos experiments.</li>
  <li>
<strong><code>chaos-daemon</code> <a href="https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/">DaemonSet</a></strong> - The component on each node that injects and manages chaos targeting that system and its pods.</li>
  <li>
<strong><code>chaos-dashboard</code> Deployment</strong> - The GUI<sup id="fnref:gui" role="doc-noteref"><a href="https://superorbital.io#fn:gui" class="footnote" rel="footnote">6</a></sup> for managing, designing, and monitoring chaos experiments.</li>
  <li>
<strong><code>chaos-dns-server</code> Deployment</strong> - A special DNS<sup id="fnref:dns" role="doc-noteref"><a href="https://superorbital.io#fn:dns" class="footnote" rel="footnote">7</a></sup> service that is used to simulate DNS faults.</li>
  <li>
<strong><code>chaosctl</code> CLI</strong> - An optional tool to assist in debugging Chaos Mesh.</li>
</ul>

<h4 id="installation">Installation</h4>

<p>To install Chaos Mesh, you will need a Kubernetes cluster. In this article, we are going to utilize <a href="https://kind.sigs.k8s.io/docs/user/quick-start/">kind</a> along with <a href="https://www.docker.com/">Docker</a> to manage a local Kubernetes cluster, so if you want to follow along exactly, you will need these two tools installed. However, with a bit of adjustment to the commands, most of this should work in any Kubernetes cluster.</p>

<p>After taking a look at the <a href="https://mirrors.chaos-mesh.org/v2.6.3/install.sh">install script</a> to ensure that it is safe to run, you can instruct it to spin up a cluster with a single worker node via <code>kind</code> v0.24.0 and then install Chaos Mesh v2.6.3 into the cluster using the following command.</p>

<blockquote>
<p><strong>NOTE</strong>: Some of these examples assume that there is only a single worker node in the cluster. If you are using a different setup, you may need to tweak the YAML manifests and commands to ensure you are targeting the correct pods/nodes and observing the correct output.</p>
</blockquote>

<pre><code class="language-console">$ curl -sSL https://mirrors.chaos-mesh.org/v2.6.3/install.sh | \
    bash -s -- --local kind --kind-version v0.24.0 --node-num 1 \
    --k8s-version v1.31.0 --name chaos

Install kubectl client
kubectl Version 1.31.0 has been installed
Install Kind tool
Kind Version 0.24.0 has been installed
Install local Kubernetes chaos
No kind clusters found.
Clean data dir: ~/kind/chaos/data
start to create kubernetes cluster chaosCreating cluster "chaos" ...
DEBUG: docker/images.go:58] Image: kindest/node:v1.31.0 present locally
 ✓ Ensuring node image (kindest/node:v1.31.0) 🖼
 ✓ Preparing nodes 📦 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
 ✓ Joining worker nodes 🚜
Set kubectl context to "kind-chaos"
You can now use your cluster with:

kubectl cluster-info --context kind-chaos

Thanks for using kind! 😊
Install Chaos Mesh chaos-mesh
crd.apiextensions.k8s.io/awschaos.chaos-mesh.org created
…
Waiting for pod running
chaos-controller-manager-7fb5d7b648-… 0/1 ContainerCreating 0 10s
chaos-controller-manager-7fb5d7b648-… 0/1 ContainerCreating 0 10s
chaos-controller-manager-7fb5d7b648-… 0/1 ContainerCreating 0 10s
Waiting for pod running
Chaos Mesh chaos-mesh is installed successfully
</code></pre>

<blockquote>
  <p><strong>Note</strong>: Chaos Mesh can easily be installed into any cluster that your <code>kubectl</code> current context points at by simply running <code>curl -sSL https://mirrors.chaos-mesh.org/v2.6.3/install.sh | bash</code>.</p>
</blockquote>
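<p>If you prefer Helm, the project also publishes an official chart. A sketch of that route is shown below; the <code>chaosDaemon.runtime</code> and <code>chaosDaemon.socketPath</code> values are only needed on containerd-based clusters such as <code>kind</code>, so check the Chaos Mesh installation docs for the values appropriate to your environment.</p>

<pre><code class="language-console">$ helm repo add chaos-mesh https://charts.chaos-mesh.org
$ helm repo update
$ helm install chaos-mesh chaos-mesh/chaos-mesh \
    --namespace chaos-mesh --create-namespace --version 2.6.3 \
    --set chaosDaemon.runtime=containerd \
    --set chaosDaemon.socketPath=/run/containerd/containerd.sock
</code></pre>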

<p>If you utilized the installer that leverages <code>kind</code>, then you should be able to find the cluster config and related data volume storage in <em>${HOME}/kind/chaos</em>.</p>

<p>If you are curious, you can investigate the main components that were installed by running <code>kubectl get all -n chaos-mesh</code>.</p>
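<p>For example, listing just the Deployments and DaemonSet should produce output roughly like the following on the single-worker cluster (names, counts, and ages will vary with your setup):</p>

<pre><code class="language-console">$ kubectl get deployments,daemonsets -n chaos-mesh

NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/chaos-controller-manager   3/3     3            3           2m
deployment.apps/chaos-dashboard            1/1     1            1           2m
deployment.apps/chaos-dns-server           1/1     1            1           2m

NAME                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/chaos-daemon   1         1         1       1            1           &lt;none&gt;          2m
</code></pre>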

<p>Once Chaos Mesh is installed, you can verify that you have access to the GUI by opening up another terminal window and running:</p>

<pre><code class="language-console">$ kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333

Forwarding from 127.0.0.1:2333 -&gt; 2333
Forwarding from [::1]:2333 -&gt; 2333
</code></pre>

<p>Then, open up a web browser and point it to <a href="http://127.0.0.1:2333/#/dashboard">http://127.0.0.1:2333/#/dashboard</a>.</p>

<p>If all is well, then you should see this:</p>

<p class="image"><img src="https://superorbital.io/blog/chaos-mesh/chaos-mesh-gui.png" alt="Chaos Mesh GUI"></p>

<p>Because we will want to examine resource utilization throughout this article, we are also going to install <a href="https://github.com/google/cadvisor">Google’s cadvisor</a> to provide some simple resource monitoring. The Kubernetes YAML<sup id="fnref:yaml" role="doc-noteref"><a href="https://superorbital.io#fn:yaml" class="footnote" rel="footnote">8</a></sup> manifest below creates the <strong>cadvisor</strong> <a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/">Namespace</a>, <a href="https://kubernetes.io/docs/concepts/security/service-accounts/">ServiceAccount</a>, and DaemonSet. Copy it into a file called <em>cadvisor.yaml</em> and then run <code>kubectl apply -f ./cadvisor.yaml</code>.</p>

<details>
  <summary>cadvisor Kubernetes YAML Manifest</summary>

  <pre><code class="language-yaml">apiVersion: v1
kind: Namespace
metadata:
  labels:
    app: cadvisor
  name: cadvisor
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: cadvisor
  name: cadvisor
  namespace: cadvisor
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: docker/default
  labels:
    app: cadvisor
  name: cadvisor
  namespace: cadvisor
spec:
  selector:
    matchLabels:
      app: cadvisor
      name: cadvisor
  template:
    metadata:
      labels:
        app: cadvisor
        name: cadvisor
    spec:
      automountServiceAccountToken: false
      containers:
      - image: gcr.io/cadvisor/cadvisor:v0.49.1
        name: cadvisor
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        resources:
          limits:
            cpu: 4000m
            memory: 4000Mi
          requests:
            cpu: 1000m
            memory: 100Mi
        volumeMounts:
        - mountPath: /rootfs
          name: rootfs
          readOnly: true
        - mountPath: /var/run
          name: var-run
          readOnly: true
        - mountPath: /sys
          name: sys
          readOnly: true
        - mountPath: /var/lib/docker
          name: docker
          readOnly: true
        - mountPath: /dev/disk
          name: disk
          readOnly: true
      serviceAccountName: cadvisor
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /
        name: rootfs
      - hostPath:
          path: /var/run
        name: var-run
      - hostPath:
          path: /sys
        name: sys
      - hostPath:
          path: /var/lib/docker
        name: docker
      - hostPath:
          path: /dev/disk
        name: disk
</code></pre>

</details>

<p>You can verify that the <strong>cadvisor</strong> DaemonSet is in a good state by running <code>kubectl get daemonset -n cadvisor</code>, and ensuring that there is one pod per worker, which is both <strong>READY</strong> and <strong>AVAILABLE</strong>. Once everything is running, you can access the <strong>cadvisor</strong> dashboard on one of the nodes by opening up a new terminal and running:</p>

<pre><code class="language-console">$ kubectl port-forward -n cadvisor pods/$(kubectl get pods -o jsonpath="{.items[0].metadata.name}" -n cadvisor) 8080

Forwarding from 127.0.0.1:8080 -&gt; 8080
Forwarding from [::1]:8080 -&gt; 8080
</code></pre>

<p>Then, open up a web browser and point it to <a href="http://127.0.0.1:8080/containers/">http://127.0.0.1:8080/containers/</a>.</p>

<p>If everything has gone to plan up to this point, you should see something like this:</p>

<p class="image"><img src="https://superorbital.io/blog/chaos-mesh/cadvisor-gui.png" alt="cadvisor GUI"></p>

<h4 id="chaos-experiments">Chaos Experiments</h4>

<p>Chaos Mesh has three primary concepts that form the core of the tool and its capabilities. These include:</p>

<ul>
  <li>
<strong><a href="https://chaos-mesh.org/docs/run-a-chaos-experiment/">Experiments</a> (<a href="http://127.0.0.1:2333/#/experiments/new">local UI</a>)</strong> - which are used to define the parameters of a single chaos test that the user wants to run. This will include the type of chaos to inject into the system and specifically how that chaos will be shaped and what it will target.</li>
  <li>
<strong><a href="https://chaos-mesh.org/docs/create-chaos-mesh-workflow/">Workflows</a> (<a href="http://127.0.0.1:2333/#/workflows/new/next">local UI</a>)</strong> - this allows you to define a complex series of tests that should run in an environment to more closely simulate complex real-world outages.</li>
  <li>
<strong><a href="https://chaos-mesh.org/docs/define-scheduling-rules/">Schedules</a> (<a href="http://127.0.0.1:2333/#/schedules/new">local UI</a>)</strong> - expands upon Experiments by making them run on a defined schedule.</li>
</ul>
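<p>To make the Workflow concept concrete, here is a minimal sketch of a Workflow manifest that runs two experiments from this article back to back. The template names and deadlines are illustrative; the field layout follows the examples in the Chaos Mesh Workflow documentation, so verify it against the version you have installed before relying on it.</p>

<pre><code class="language-yaml">apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: example-outage
spec:
  entry: the-entry
  templates:
    # A Serial template runs its children one after another.
    - name: the-entry
      templateType: Serial
      deadline: 120s
      children:
        - cpu-stress
        - network-delay
    - name: cpu-stress
      templateType: StressChaos
      deadline: 30s
      stressChaos:
        mode: all
        selector:
          namespaces:
            - cadvisor
          labelSelectors:
            app: cadvisor
        stressors:
          cpu:
            load: 100
            workers: 20
    - name: network-delay
      templateType: NetworkChaos
      deadline: 30s
      networkChaos:
        action: netem
        mode: all
        selector:
          namespaces:
            - default
          labelSelectors:
            app: web-show
        delay:
          latency: 500ms
          jitter: 100ms
</code></pre>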

<p>In this article, we will primarily use Kubernetes manifests to demonstrate the functionality of Chaos Mesh, but many things can be done in the UI<sup id="fnref:ui" role="doc-noteref"><a href="https://superorbital.io#fn:ui" class="footnote" rel="footnote">9</a></sup>, and the workflows UI can be particularly helpful in building complex visual workflows.</p>

<p class="image"><img src="https://superorbital.io/blog/chaos-mesh/chaos-mesh-workflows.png" alt="Chaos Mesh Workflows visual editor"></p>

<h5 id="resource-stress">Resource Stress</h5>

<p>So, at this point, let’s go ahead and run a simple experiment by applying some CPU<sup id="fnref:cpu" role="doc-noteref"><a href="https://superorbital.io#fn:cpu" class="footnote" rel="footnote">10</a></sup> stress to our <strong>cadvisor</strong> pod.</p>

<p>We’ll start by getting a snapshot of the node’s resource utilization. Since the node is actually a container in Docker, we can check it like this:</p>

<pre><code class="language-console">$ docker stats chaos-worker --no-stream

CONTAINER ID   NAME           CPU %     MEM USAGE / LIMIT    MEM %     NET I/O          BLOCK I/O     PIDS
4a3385d7c565   chaos-worker   7.30%     581.4MiB / 15.6GiB   3.64%     369MB / 19.5GB   0B / 1.49GB   294
</code></pre>

<p>Next, let’s create and apply the following <a href="https://chaos-mesh.org/docs/simulate-heavy-stress-on-kubernetes/">StressChaos</a> Schedule with the command <code>kubectl apply -f ./resource-stress.yaml</code>. It will create a significant CPU load within the <strong>cadvisor</strong> pod for 10 seconds every 15 seconds.</p>

<pre><code class="language-yaml">apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: resource-stress-example
spec:
  schedule: '@every 15s'
  type: StressChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  stressChaos:
    mode: all
    duration: 10s
    selector:
      namespaces:
        - cadvisor
      labelSelectors:
        'app': 'cadvisor'
    stressors:
      cpu:
        load: 100
        workers: 20
</code></pre>

<p>If you then wait just over 15 seconds and take another snapshot of the resource utilization, you should see something like this:</p>

<pre><code class="language-console">$ sleep 15 &amp;&amp; docker stats chaos-worker --no-stream

CONTAINER ID   NAME           CPU %     MEM USAGE / LIMIT    MEM %     NET I/O          BLOCK I/O     PIDS
4a3385d7c565   chaos-worker   400.03%   617.6MiB / 15.6GiB   3.87%     371MB / 19.6GB   0B / 1.49GB   317
</code></pre>

<p>The <a href="http://127.0.0.1:8080/containers/"><strong>cadvisor</strong> UI</a> should also be giving you a very clear indication of this fluctuating CPU load.</p>

<p class="image"><img src="https://superorbital.io/blog/chaos-mesh/cadvisor-cpu-load.png" alt="Chaos Mesh CPU Stress cadvisor chart"></p>

<blockquote>
<p>It is worth noting that you can <strong>pause</strong> a scheduled experiment by annotating the <strong>Schedule</strong> like so:
<code>kubectl annotate schedules.chaos-mesh.org resource-stress-example experiment.chaos-mesh.org/pause=true</code>,
and then <strong>unpause</strong> it by running
<code>kubectl annotate schedules.chaos-mesh.org resource-stress-example experiment.chaos-mesh.org/pause-</code>.
If you check cadvisor while the experiment is paused, you will see that everything has dropped back down to a mostly steady baseline value.</p>
</blockquote>

<p>Now, let’s remove this schedule by running <code>kubectl delete -f ./resource-stress.yaml</code> so it doesn’t continue to utilize our precious CPU resources.</p>

<h5 id="pod-stability">Pod Stability</h5>

<p>For the next set of tests, let’s deploy three replicas of a small web application to our cluster by creating and applying the following manifest with <code>kubectl apply -f ./web-show.yaml</code>.</p>

<blockquote>
  <p><strong>NOTE</strong>: As written, this web app will attempt to continuously ping the Google DNS server(s) at 8.8.8.8; if you are unable to ping this IP address, you can replace the IP address in this manifest with something else in your network that will respond to a ping.</p>
</blockquote>

<pre><code class="language-yaml">apiVersion: v1
kind: Service
metadata:
  name: web-show
  labels:
    app: web-show
spec:
  selector:
    app: web-show
  ports:
    - protocol: TCP
      port: 8081
      targetPort: 8081
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-show
  labels:
    app: web-show
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-show
  template:
    metadata:
      labels:
        app: web-show
    spec:
      containers:
        - name: web-show
          image: ghcr.io/chaos-mesh/web-show
          imagePullPolicy: Always
          command:
            - /usr/local/bin/web-show
            - --target-ip=8.8.8.8
          env:
            - name: TARGET_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          ports:
            - name: web-port
              containerPort: 8081
          resources:
            requests:
              memory: "10Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "1000m"
</code></pre>

<p>Once applied, let’s open another terminal and monitor the pods that we just deployed.</p>

<pre><code class="language-console">$ kubectl get pods --watch

NAME                        READY   STATUS    RESTARTS   AGE
web-show-76b9dd8f44-5ks6j   1/1     Running   0          36s
web-show-76b9dd8f44-g9hrj   1/1     Running   0          35s
web-show-76b9dd8f44-mxx6z   1/1     Running   0          38s
</code></pre>

<p>In the original terminal, we can now apply the following Chaos Schedule, which will cause a <code>web-show</code> pod to fail every 10 seconds.</p>

<pre><code class="language-yaml">apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: web-show-pod-failure
spec:
  schedule: '@every 10s'
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-failure
    mode: one
    selector:
      namespaces:
      - default
      labelSelectors:
        app: web-show
</code></pre>

<p>After you create and apply this with <code>kubectl apply -f ./pod-failure.yaml</code>, you can observe what happens to the pods you are watching on the other terminal. The output should look something like the one shown below.</p>

<pre><code class="language-console">NAME                      READY  STATUS            RESTARTS    AGE
web-show-76b9dd8f44-5ks6j 1/1   Running           0           36s
web-show-76b9dd8f44-g9hrj 1/1   Running           0           35s
web-show-76b9dd8f44-mxx6z 1/1   Running           0           38s
web-show-76b9dd8f44-mxx6z 1/1   Running           0           5m36s
web-show-76b9dd8f44-mxx6z 0/1   RunContainerError 1 (0s ago)  5m37s
web-show-76b9dd8f44-mxx6z 0/1   RunContainerError 2 (0s ago)  5m38s
web-show-76b9dd8f44-mxx6z 0/1   CrashLoopBackOff  2 (1s ago)  5m39s
web-show-76b9dd8f44-mxx6z 0/1   RunContainerError 3 (1s ago)  5m52s
web-show-76b9dd8f44-mxx6z 0/1   RunContainerError 3 (15s ago) 6m6s
web-show-76b9dd8f44-mxx6z 0/1   CrashLoopBackOff  3 (15s ago) 6m6s
web-show-76b9dd8f44-g9hrj 1/1   Running           0           6m3s
web-show-76b9dd8f44-mxx6z 1/1   Running           4 (15s ago) 6m6s
web-show-76b9dd8f44-g9hrj 0/1   RunContainerError 1 (1s ago)  6m4s
web-show-76b9dd8f44-g9hrj 0/1   RunContainerError 2 (0s ago)  6m5s
web-show-76b9dd8f44-g9hrj 0/1   CrashLoopBackOff  2 (1s ago)  6m6s
web-show-76b9dd8f44-g9hrj 0/1   RunContainerError 3 (1s ago)  6m22s
web-show-76b9dd8f44-g9hrj 0/1   RunContainerError 3 (12s ago) 6m33s
web-show-76b9dd8f44-g9hrj 0/1   CrashLoopBackOff  3 (12s ago) 6m33s
web-show-76b9dd8f44-mxx6z 1/1   Running           4 (45s ago) 6m36s
web-show-76b9dd8f44-g9hrj 1/1   Running           4 (12s ago) 6m33s
</code></pre>

<p>Most types of Chaos have a few modes or actions that can be taken. Let’s remove this experiment using <code>kubectl delete -f ./pod-failure.yaml</code>.</p>

<p>Then, we can add a very similar experiment that will kill a pod instead of causing it to fail by applying the following YAML with <code>kubectl apply -f ./pod-kill.yaml</code>.</p>

<pre><code class="language-yaml">apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: web-show-pod-kill
spec:
  schedule: '@every 10s'
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one
    duration: 30s
    selector:
      namespaces:
      - default
      labelSelectors:
        app: web-show
</code></pre>

<p>Once the experiment has been applied, the output from <code>kubectl get pods --watch</code> should now display something like this:</p>

<pre><code class="language-text">NAME                      READY STATUS            RESTARTS AGE
web-show-76b9dd8f44-5clbk 1/1   Running           0        4s
web-show-76b9dd8f44-hzwn8 1/1   Running           0        5m18s
web-show-76b9dd8f44-rfxrw 1/1   Running           0        34s
web-show-76b9dd8f44-hzwn8 1/1   Terminating       0        5m24s
web-show-76b9dd8f44-hzwn8 1/1   Terminating       0        5m24s
web-show-76b9dd8f44-zcwbq 0/1   Pending           0        0s
web-show-76b9dd8f44-zcwbq 0/1   Pending           0        0s
web-show-76b9dd8f44-zcwbq 0/1   ContainerCreating 0        0s
web-show-76b9dd8f44-zcwbq 1/1   Running           0        1s
web-show-76b9dd8f44-zcwbq 1/1   Terminating       0        10s
web-show-76b9dd8f44-zcwbq 1/1   Terminating       0        10s
web-show-76b9dd8f44-xvnh2 0/1   Pending           0        0s
web-show-76b9dd8f44-xvnh2 0/1   Pending           0        0s
web-show-76b9dd8f44-xvnh2 0/1   ContainerCreating 0        0s
web-show-76b9dd8f44-xvnh2 1/1   Running           0        1s
</code></pre>

<p>If you compare the earlier pod behavior with this, you will notice that in the original Pod failure experiment, we see messages like <strong>RunContainerError</strong> and <strong>CrashLoopBackOff</strong>, while in this Pod kill experiment, we see messages like <strong>Terminating</strong>, <strong>Pending</strong>, and <strong>ContainerCreating</strong>. This is because the first experiment replicates an application crashing inside the existing pods, while the second experiment kills the pod outright, so the Deployment schedules a replacement with a new name.</p>

<p>We can regain our pod stability by removing the scheduled experiment from the cluster with <code>kubectl delete -f ./pod-kill.yaml</code>.</p>

<h5 id="network-latency">Network Latency</h5>

<p>Next, we will generate network latency for a set of our pods by defining a scheduled <a href="https://chaos-mesh.org/docs/simulate-network-chaos-on-kubernetes/">NetworkChaos</a> experiment. But first, let’s examine the web UI that the <code>web-show</code> application generates.</p>

<p>In another terminal window, run the following command to forward a host port to the <code>web-show</code> service.</p>

<pre><code class="language-console">$ kubectl port-forward service/web-show 8081

Forwarding from 127.0.0.1:8081 -&gt; 8081
Forwarding from [::1]:8081 -&gt; 8081
</code></pre>

<p>Now, you should be able to point your web browser at <a href="http://127.0.0.1:8081/">http://127.0.0.1:8081/</a> and see <code>web-show</code>’s simple latency chart. This chart is currently configured to show the latency between our pods and the Google DNS servers at 8.8.8.8 (<em>or whatever IP address you used in the web-show manifest</em>).</p>

<p class="image"><img src="https://superorbital.io/blog/chaos-mesh/web-show-start.png" alt="web-show UI - Showing standard latency"></p>

<p>Let’s leave the web-show UI running and then apply the following YAML file to the cluster, using <code>kubectl apply -f ./network-delay.yaml</code>.</p>

<pre><code class="language-yaml">apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: web-show-network-delay
spec:
  concurrencyPolicy: Forbid
  historyLimit: 5
  networkChaos:
    action: netem
    mode: all
    selector:
      namespaces:
        - default
      labelSelectors:
        'app': 'web-show'
    delay:
      latency: '500ms'
      correlation: '100'
      jitter: '100ms'
    duration: 10s
  schedule: '@every 20s'
  type: NetworkChaos
</code></pre>

<p>This YAML is using the network emulation action to introduce 500 milliseconds of delay with a 100-millisecond jitter (fluctuation) to the <code>web-show</code> pods’ network packets.</p>
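<p>If you would like a second, chart-free confirmation, you can measure the delay from inside one of the pods while an experiment window is active. This assumes the <code>web-show</code> image includes a <code>ping</code> binary and that your target IP answers ICMP:</p>

<pre><code class="language-console">$ kubectl exec deploy/web-show -- ping -c 3 8.8.8.8
</code></pre>

<p>During the 10-second chaos window, the reported round-trip times should hover around 500 milliseconds, plus or minus the configured jitter; between windows, they should return to your baseline.</p>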

<p>After it has run for a minute or two, the chart should look something like this:</p>

<p class="image"><img src="https://superorbital.io/blog/chaos-mesh/web-show-delayed.png" alt="webshow UI - Showing spiky latency"></p>

<p>As usual, to remove the experiment, we can simply run <code>kubectl delete -f ./network-delay.yaml</code> and then run <code>kubectl delete -f ./web-show.yaml</code> to remove the web-show Deployment and Service.</p>

<h4 id="clean-up">Clean Up</h4>

<p>At this point, you can go ahead and stop any <code>kubectl port-forward …</code> or <code>kubectl … --watch</code> commands that you still have running by switching to that terminal and pressing [Control-C]. Then you can use <code>kubectl delete …</code> to remove anything else that might still be lingering around.</p>

<p>If you are using a temporary cluster, you can de-provision it to ensure that everything is cleaned up. If you are using the <code>kind</code> cluster created by the installation script, then this should be as easy as running <code>kind delete cluster --name chaos</code>.</p>

<blockquote>
  <p><a href="https://chaos-mesh.org/docs/uninstallation/">Detailed instructions on uninstalling Chaos Mesh</a> from a cluster can be found in the documentation.</p>
</blockquote>

<h3 id="conclusion">Conclusion</h3>

<p>Chaos Mesh is an interesting tool for exploring some common failure modes that can impact applications running inside Kubernetes environments. It can complement other testing tools in the ecosystem, like <a href="https://testkube.io/">testkube</a>.</p>

<p>There are several open-source, cloud-native, and commercial tools that specialize in robust, Kubernetes-focused chaos engineering. For those who are just getting started with chaos engineering, however, Chaos Mesh provides a simple and approachable open-source tool that can help adopters understand some of the more significant resiliency risks in their stack, document those issues, and prioritize fixes.</p>

<p>So, what are you waiting for? There is no better time than right now to start practicing and improving your platform’s resiliency. You can take it slow, but creating a healthy habit takes practice and repetition.</p>

<h2 id="further-exploration">Further Exploration</h2>

<ul>
  <li>
<a href="https://chaos-mesh.org/">Chaos Mesh</a>
    <ul>
      <li><a href="https://github.com/chaos-mesh">github.com/chaos-mesh</a></li>
    </ul>
  </li>
  <li><a href="https://litmuschaos.io/">LitmusChaos</a></li>
  <li><a href="https://aws.amazon.com/fis/">AWS Fault Injection Simulator</a></li>
  <li><a href="https://azure.microsoft.com/en-us/products/chaos-studio">Azure Chaos Studio</a></li>
  <li><a href="https://www.gremlin.com/kubernetes-chaos-engineering">Gremlin</a></li>
  <li>
<a href="https://www.conf42.com/">Conf42</a> <a href="https://www.conf42.com/ce2025">Chaos Engineering 2025</a>
</li>
  <li><a href="https://www.amazon.com/_/dp/1492043869">Chaos Engineering: System Resiliency in Practice</a></li>
</ul>

<h2 id="acknowledgments">Acknowledgments</h2>

<blockquote>
  <ul>
    <li>Cover image by <a href="https://pixabay.com/users/darksouls1-2189876/">darksouls1</a> from <a href="https://pixabay.com/illustrations/fractal-light-light-fractal-fire-1764914/">Pixabay</a>
</li>
  </ul>
</blockquote>

<hr>

<h2 id="footnotes">Footnotes</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:vm" role="doc-endnote">
      <p><a href="https://en.wikipedia.org/wiki/Virtual_machine">Virtual Machine</a> <a href="https://superorbital.io#fnref:vm" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
    <li id="fn:aws" role="doc-endnote">
      <p><a href="https://en.wikipedia.org/wiki/Amazon_Web_Services">Amazon Web Services</a> <a href="https://superorbital.io#fnref:aws" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
    <li id="fn:cpr" role="doc-endnote">
      <p><a href="https://cpr.heart.org/en/resources/what-is-cpr">Cardiopulmonary Resuscitation</a> <a href="https://superorbital.io#fnref:cpr" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
    <li id="fn:cncf" role="doc-endnote">
      <p><a href="https://www.cncf.io/">Cloud Native Computing Foundation</a> <a href="https://superorbital.io#fnref:cncf" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
    <li id="fn:cli" role="doc-endnote">
      <p><a href="https://en.wikipedia.org/wiki/Command-line_interface">command line interface</a> <a href="https://superorbital.io#fnref:cli" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
    <li id="fn:gui" role="doc-endnote">
      <p><a href="https://en.wikipedia.org/wiki/Graphical_user_interface">graphical user interface</a> <a href="https://superorbital.io#fnref:gui" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
    <li id="fn:dns" role="doc-endnote">
      <p><a href="https://www.cloudflare.com/learning/dns/what-is-dns/">Domain Name Services</a> <a href="https://superorbital.io#fnref:dns" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
    <li id="fn:yaml" role="doc-endnote">
      <p><a href="https://yaml.org/">YAML Ain’t Markup Language</a> <a href="https://superorbital.io#fnref:yaml" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
    <li id="fn:ui" role="doc-endnote">
      <p><a href="https://en.wikipedia.org/wiki/User_interface">user interface</a> <a href="https://superorbital.io#fnref:ui" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
    <li id="fn:cpu" role="doc-endnote">
      <p><a href="https://aws.amazon.com/what-is/cpu/">Central Processing Unit</a> <a href="https://superorbital.io#fnref:cpu" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
  </ol>
</div>
]]></content>
    <summary type="html">Exploring Chaos Mesh and how it can be used to improve Kubernetes cluster resilience.</summary>
  </entry>
  <entry>
    <id>tag:superorbital.io,2024-11-06:/blog/from-helm-operator-to-go-controller/</id>
    <title type="html">From Helm Operator to Go Controller</title>
    <published>2024-11-06T00:00:00Z</published>
    <updated>2024-11-06T00:00:00Z</updated>
    <link rel="alternate" href="https://superorbital.io/blog/from-helm-operator-to-go-controller/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+SuperOrbital" type="text/html"/>
<content type="html"><![CDATA[<p>A recent client engagement asked us to replace a
<a href="https://sdk.operatorframework.io/docs/building-operators/helm/tutorial/">helm-operator</a>-based
operator with a Go controller in the <a href="http://kubebuilder.io">kubebuilder</a> framework, in
order to enable more sophisticated use cases than simple helm
charts and templating can provide. The operator enables application authors to self-manage
deploying a webapp, and it integrates with other operators to enable ingress,
secrets management, and autoscaling.</p>

<h3 id="what-is-helm-operator">What is Helm Operator?</h3>

<p><a href="https://sdk.operatorframework.io/docs/building-operators/helm/tutorial/">helm-operator</a>
is a project from Operator SDK that simplifies operator development by putting
most of the business logic of producing downstream resources into the templating
of an embedded helm chart. It’s convenient for very simple operators that only
need to apply resources in response to the upstream resource. The upstream
resource is made available as a chart value so that the CRD values can affect
the output.</p>
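<p>As a sketch of how that looks (the resource and value names here are illustrative, not from the client’s chart), a template in the embedded chart can read fields of the custom resource directly from <code>.Values</code>:</p>

```yaml
# Hypothetical chart template: helm-operator exposes the CR's spec as chart
# values, so spec.replicas on the custom resource arrives here as .Values.replicas.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}
spec:
  replicas: {{ .Values.replicas }}
```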

<h3 id="sounds-great-whats-the-problem">Sounds Great! What’s the Problem?</h3>

<p>Well, this simplicity comes at the cost of not allowing anything more
sophisticated. For example, it could not react to the status of the downstream
resources. And it would be unable to reach “Level 5” of the
<a href="https://sdk.operatorframework.io/docs/overview/operator-capabilities/">operator capability levels</a>
defined by Operator SDK.</p>

<p>The webapp operator used the
<a href="https://github.com/operator-framework/helm-operator-plugins/blob/main/docs/tutorial.md">hybrid helm operator</a>
approach, which runs the helm reconciler within a
<a href="https://github.com/kubernetes-sigs/controller-runtime">controller-runtime</a>
controller manager. With a hybrid operator, the helm reconciler can also be
configured with a translator to transform the resource into a more suitable
form, or to apply complex transformations that would be difficult to express in
helm templates. The hybrid approach is intended for either mixing with
controller-runtime reconcilers, or as a transition path to replacement of the
helm-operator.</p>

<p>This post is about how we achieved that replacement, and some of the challenges
along the way.</p>

<h2 id="the-game-plan">The Game Plan</h2>

<p>Since helm-operator was already being invoked as a controller added to a
controller-runtime controller Manager, it made sense to
<a href="https://book.kubebuilder.io/reference/markers/scaffold#to-set-up-a-controller">mount a second controller</a>
for the same resource, and gradually move resources from being managed by helm
to being managed by the Go controller. Transitioning one downstream resource at
a time would allow us to deploy to development and non-production clusters first
to shake out issues with the process and reveal unanticipated problems with the
migration. We carefully planned the order of resources to minimize risk and
impact to running services, starting with ConfigMaps, and ending with the
Deployment.</p>

<h2 id="challenges">Challenges</h2>

<p>The simplest part of the transition was translating each templated resource
in the chart into the corresponding Go struct, filled out and submitted to the
api-server via the controller-runtime client. Beyond that mostly mechanical
transformation, there were a few other details to work out.</p>

<h3 id="migrating-object-ownership">Migrating object ownership</h3>

<p>The helm reconciler uses helm as a library, and as such behaves very similarly
to invoking helm from the command line. When updating a helm release, if the new
version of the chart removes resources compared to the installed manifest, then
helm will delete those resources. This creates a challenge for migrating the
resources from helm over to the new controller. While some resources may be
more-or-less harmless to delete and recreate, others like Deployments would
cause workloads to restart across the whole cluster. Recreating an Ingress would
disrupt service as the cloud load balancer is recreated, which takes minutes.</p>

<div class="image-right" style="width: 250px">
  <p><img src="https://superorbital.io/blog/from-helm-operator-to-go-controller/helm.svg" alt="Helm logo"></p>
</div>

<p><strong>Controlling helm</strong>: Fortunately, helm honors
<a href="https://helm.sh/docs/howto/charts_tips_and_tricks/#tell-helm-not-to-uninstall-a-resource">an annotation</a>,
<code>helm.sh/resource-policy: keep</code>, which, when applied to a chart resource,
prevents helm from deleting that resource if it is removed from the release, or
if the release is deleted. Helm-operator also normally applies <code>ownerReferences</code>
to chart resources to control
<a href="https://kubernetes.io/docs/concepts/architecture/garbage-collection/">garbage collection</a>: when
the upstream resource is deleted, the owner references on the downstream
resources ensure they are cleaned up. Applying the annotation also
disables adding the owner references.</p>
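<p>In chart terms, phase one of the migration is just a metadata change to every templated resource (the ConfigMap here is an illustrative stand-in for the chart’s actual resources):</p>

```yaml
# Phase 1: tell helm to keep this resource when it later disappears from the chart.
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-config
  annotations:
    "helm.sh/resource-policy": keep
```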

<p>This requires two phases to roll out. An initial release must update the helm
chart to apply the annotation. Then a second rollout will migrate the resource
by removing it from the chart, and allowing the Go controller to adopt the
resource. The new Go controller will also apply owner references using
<code>controllerutil.SetControllerReference()</code>. But note that between the rollouts,
the downstream resource may not have any <code>ownerReferences</code> applied. In practice,
we found that when <code>resource-policy: keep</code> is added, helm-operator does not
remove the owner references that it added when the annotation was not present.
However, any new webapps created between the rollout phases would not have owner
references.</p>

<p><strong>Finalizer for deletion</strong>: Because <code>resource-policy: keep</code> suppresses
owner references, another problem arises: after the first phase of the rollout (where the
annotation is applied, but helm is still managing resources), the downstream
resources will not be deleted when the webapp is deleted. To address this, we
had the new Go controller apply a finalizer to the webapp resource. When a
webapp is deleted, the controller code for the finalizer deletes the
downstream resources in place of garbage collection.</p>
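<p>The finalizer flow can be sketched as a stdlib-only model (the type, function, and finalizer name below are illustrative; the real controller operates on Kubernetes objects through controller-runtime):</p>

```go
package main

const webappFinalizer = "example.com/webapp-cleanup" // hypothetical finalizer name

type webApp struct {
	deleted    bool     // stands in for a non-nil metadata.deletionTimestamp
	finalizers []string // stands in for metadata.finalizers
}

// reconcileFinalizer ensures the finalizer is present while the webapp lives;
// once the webapp is deleted, it deletes the downstream resources and then
// removes the finalizer so that deletion can complete.
func reconcileFinalizer(app *webApp, deleteDownstream func()) {
	if !app.deleted {
		for _, f := range app.finalizers {
			if f == webappFinalizer {
				return // already present
			}
		}
		app.finalizers = append(app.finalizers, webappFinalizer)
		return
	}
	// Deletion requested: clean up in place of garbage collection,
	// then release the finalizer.
	deleteDownstream()
	kept := app.finalizers[:0]
	for _, f := range app.finalizers {
		if f != webappFinalizer {
			kept = append(kept, f)
		}
	}
	app.finalizers = kept
}
```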

<h3 id="contingencies-for-rollback">Contingencies for rollback</h3>

<p>Any large migration like this comes with risk of unforeseen problems and
mistakes, so rollbacks must be considered within the plan.</p>

<p><strong>Feature flags</strong>: We decided to add a feature flag for each downstream resource
that directs it to be managed by the Go controller. When off, helm-operator
continues to include the resource in the chart, and the Go controller will not
manage creating or updating the resource. However, because we have prevented
helm from deleting anything, the Go controller always deletes resources that are
no longer needed due to a configuration change in the webapp, or—via the
finalizer—due to deleting the webapp. That is, the create-and-update code is
behind the feature flags, but the delete and finalizer code is not.</p>
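<p>A stdlib-only sketch of that gating (the names are ours, not the client’s code): create-and-update runs only when the resource’s flag is on, while deletion of obsolete resources always runs:</p>

```go
package main

type actions struct{ created, deleted bool }

// reconcileResource models one downstream resource per reconcile pass.
// goControllerOwns is the resource's feature flag; needed says whether the
// webapp's configuration still calls for the resource; exists says whether
// it is currently in the cluster.
func reconcileResource(goControllerOwns, needed, exists bool, a *actions) {
	if needed {
		if goControllerOwns {
			a.created = true // create or update the downstream resource
		}
		// Flag off: helm-operator still renders it from the chart.
		return
	}
	if exists {
		a.deleted = true // delete obsolete resources regardless of the flag
	}
}
```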

<p><strong>Equivalent controllers</strong>: The Go controller should, initially, precisely match
the output of the helm controller. New behaviors can come in subsequent
releases. Ensuring the controllers have equivalent output was taken on by the
unit tests. Existing unit tests in the project, using EnvTest, had pretty good
coverage of the expected output for a given input manifest. I adapted the tests
to be run twice, with each controller enabled.</p>

<p><strong>Testing roll-forward and -back</strong>: Because accidentally deleting resources
would cause a major disruption, I added tests of roll-forward and roll-back of
each feature flag that ensure each object is not deleted by mistake in the
process. A couple of techniques facilitate these tests:</p>

<ul>
  <li>Recording and checking the UID of an object ensures that it is indeed the same
object, and not an object that has been deleted and recreated with the same
object key (name, namespace, apiVersion, and kind).</li>
  <li>Adding a label, “reconciled-by”, to the downstream resources that is set
uniquely by the two controllers allows detecting when each controller has
reconciled after changing feature flags and restarting the controller-manager.</li>
</ul>
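<p>These two checks can be sketched with plain Go types (all names here are illustrative; the real tests run under EnvTest against Kubernetes objects):</p>

```go
package main

type objectKey struct{ apiVersion, kind, namespace, name string }

type trackedObject struct {
	key    objectKey
	uid    string // Kubernetes assigns a fresh UID to a recreated object
	labels map[string]string
}

// survivedFlagFlip reports whether "after" is the very same object as
// "before" (same UID, not a delete-and-recreate with the same key) and
// whether the expected controller has reconciled it since the flag change.
func survivedFlagFlip(before, after trackedObject, wantController string) bool {
	return before.key == after.key &&
		before.uid == after.uid &&
		after.labels["reconciled-by"] == wantController
}
```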

<div class="image-right" style="width: 250px">
  <p><img src="https://superorbital.io/blog/from-helm-operator-to-go-controller/cuckoo_nest.webp" alt="Cuckoo Nest">
<br>Much like a cuckoo, we tricked helm into adopting our children.
<br><a href="https://avianres.biomedcentral.com/articles/10.1186/s40657-020-00220-x/figures/1">Image source</a></p>
</div>

<p><strong>Tricking Helm</strong>: After a rollback, helm needs to resume control of the
downstream resources. Helm normally does not want to trample resources that it
did not create. We can trick Helm into adopting resources created by the Go
controller by applying the labels and annotations that it adds to chart
resources to mark them as helm managed. Adoption requires a label
(<code>"app.kubernetes.io/managed-by": "Helm"</code>) and two annotations
(<code>"meta.helm.sh/release-name"</code> and <code>"meta.helm.sh/release-namespace"</code>).
Helm-operator names the helm release after the upstream resource’s
<code>metadata.name</code>, so the controller assigns these annotations with the webapp’s
name and namespace.</p>
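<p>A small sketch of that adoption metadata (the helper name is ours; the label and annotation keys are the ones helm checks):</p>

```go
package main

// helmAdoptionMeta collects the one label and two annotations that helm
// checks before adopting an existing resource into a release. Helm-operator
// names the release after the upstream resource's metadata.name, so the
// webapp's name and namespace are passed in.
func helmAdoptionMeta(releaseName, releaseNamespace string) (labels, annotations map[string]string) {
	labels = map[string]string{
		"app.kubernetes.io/managed-by": "Helm",
	}
	annotations = map[string]string{
		"meta.helm.sh/release-name":      releaseName,
		"meta.helm.sh/release-namespace": releaseNamespace,
	}
	return labels, annotations
}
```

<p>Applying these to each downstream resource ahead of a potential rollback lets helm treat the objects as members of the release.</p>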

<h3 id="argo-sync-policy">Argo sync-policy</h3>

<p>In the customer’s environment, every WebApp resource is deployed by an Argo
Application. Argo is configured to
<a href="https://argo-cd.readthedocs.io/en/latest/user-guide/auto_sync/#automatic-pruning">automatically prune</a>
resources that are no longer part of the Application manifest. Argo labels the
resources it creates with <code>argocd.argoproj.io/instance</code>. If a resource has this
label, but is not part of the current manifest, then Argo will prune the
resource. Pruned resources are by default deleted with foreground propagation,
so downstream resources should be deleted as well. For example, if a Deployment is
pruned, then its ReplicaSets and Pods will be cleaned up by garbage collection.
Additionally, if a resource is owned (has an <code>ownerReferences</code> entry), then it
will not be pruned.</p>

<p>The webapp operator copies labels from the webapp resource to most downstream
resources. Since the webapp is labeled with <code>argocd.argoproj.io/instance</code>, but
its downstream resources are not part of the manifest, those downstream
resources are subject to pruning, unless they have <code>ownerReferences</code>.</p>

<p>Normally, a controller should own its downstream resources. But when we added
<code>helm.sh/resource-policy: keep</code> to resources, that also caused helm-operator to
stop adding <code>ownerReferences</code> to those resources<sup id="fnref:1" role="doc-noteref"><a href="https://superorbital.io#fn:1" class="footnote" rel="footnote">1</a></sup>. In combination with the
propagated <code>instance</code> labels, that makes them vulnerable to pruning.</p>

<p>During the window between rolling out the release where the helm chart is changed
to add <code>resource-policy: keep</code> and enabling the new Go controller (which
would re-add <code>ownerReferences</code>), there’s a risk that Argo may come along and
prune the very resources we are trying so hard to keep from being deleted.</p>

<p>Fortunately, Argo has an annotation that can be applied to resources that
disables pruning of that resource:
<code>"argocd.argoproj.io/sync-options": "Prune=false"</code>. So we added that to the list
of annotations added to downstream resources by both the helm chart and the Go
controller.</p>

<p>In a future update to the webapp operator, we can avoid this problem more simply
by being more selective about what labels are copied from parent resources to
child resources. Lesson learned: It’s not a good practice to blindly copy all
labels to child resources.</p>

<h2 id="cleaning-up">Cleaning up</h2>

<p>After successfully rolling out the feature flags to the various non-prod and
production clusters, it was time to remove the now-redundant helm chart and
helm-operator from the operator codebase. Those roll-forward-and-back tests had
served their purpose and were retired. All those labels and annotations? The
operator can stop adding those too. And the duty of the finalizer code to delete
the downstream resources can be re-assumed by the kubernetes garbage collector
now that <code>ownerReferences</code> are back in place.</p>

<p>Looking at the clusters, though, there is now an (empty) helm release for every
webapp. The last step was to write a script to <code>helm uninstall</code> each of those releases
after verifying it is indeed empty with <code>helm get manifest</code>. While we’re at it,
we don’t need those labels and annotations anymore. As a one-time change, it’s
simpler to remove them, and the <code>metadata.finalizers</code> entry, in the cleanup
script than to modify the controller to remove them.</p>

<h2 id="conclusion-and-summary">Conclusion and Summary</h2>

<p>In total, the transition from helm-operator to an equivalent Go controller was
aided by 6 labels and annotations on downstream resources, and a finalizer on
the upstream webapp resource, summarized in the listing below. Most of those can
subsequently be removed along with the feature flags and helm charts after the
transition is complete and stable. The rollback plan unfortunately needed to be
enacted, but fortunately it was successful in mitigating the impact of bugs in
the new Go controller.</p>

<pre><code class="language-yaml">---
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: default
  name: my-webapp-config
  labels:
    app.kubernetes.io/managed-by: Helm   # Trick helm into adopting this in case of roll-back
    reconciled-by: webapp-go-controller  # Keep track of which controller last reconciled
  annotations:
    helm.sh/resource-policy: keep        # Tell helm not to delete this when removed from the chart.
    meta.helm.sh/release-name: my-webapp # Inform helm which release this should be adopted into.
    meta.helm.sh/release-namespace: default
    argocd.argoproj.io/sync-options: Prune=false  # Don't let Argo prune this!
  ownerReferences:
  - apiVersion: example.com/v1alpha1     # Retain ownerReferences
    kind: WebApp
    blockOwnerDeletion: true
    controller: true
    name: my-webapp
...
</code></pre>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">

      <p>While helm-operator won’t <em>add</em> the owner reference anymore, neither will it
delete the existing entries. In practice, this means this problem only
affected webapp resources that were newly created (or recreated) during the
rollout window. <a href="https://superorbital.io#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
  </ol>
</div>
]]></content>
    <summary type="html">Helm-operator is easy, but what do you do when you need more from your controller? It's time to migrate to Kubebuilder and write your controller in Go. </summary>
  </entry>
  <entry>
    <id>tag:superorbital.io,2024-11-06:/blog/managing-slurm-at-scale/</id>
    <title type="html">Managing Slurm at Scale</title>
    <published>2024-11-06T00:00:00Z</published>
    <updated>2024-11-06T00:00:00Z</updated>
    <link rel="alternate" href="https://superorbital.io/blog/managing-slurm-at-scale/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+SuperOrbital" type="text/html"/>
    <content type="html"><![CDATA[<details>
  <summary><em>Table of Contents</em></summary>
  <ul>
    <li>
<a href="https://superorbital.io#building-your-cluster-building-your-cluster">Building your cluster</a>
      <ul>
        <li><a href="https://superorbital.io#setup-munge-setup-munge">Setup Munge</a></li>
        <li><a href="https://superorbital.io#setup-slurm-setup-slurm">Setup Slurm</a></li>
      </ul>
    </li>
    <li>
<a href="https://superorbital.io#configuration-deep-dive-configuration-deep-dive">Configuration Deep Dive</a>
      <ul>
        <li><a href="https://superorbital.io#queue-and-workload-management-queue-and-workload-management">Queue and Workload Management</a></li>
        <li><a href="https://superorbital.io#handling-node-failures-handling-node-failures">Handling Node Failures</a></li>
        <li><a href="https://superorbital.io#useful-plugins-useful-plugins">Useful Plugins</a></li>
      </ul>
    </li>
    <li>
<a href="https://superorbital.io#essential-tools-for-managing-slurm-essential-tools-for-managing-slurm">Essential Tools for Managing Slurm</a>
      <ul>
        <li><a href="https://superorbital.io#sinfo-sinfo">sinfo</a></li>
        <li><a href="https://superorbital.io#scontrol-scontrol">scontrol</a></li>
        <li><a href="https://superorbital.io#srunsbatch-srun-sbatch">srun/sbatch</a></li>
        <li><a href="https://superorbital.io#submitit-submitit">Submitit</a></li>
        <li><a href="https://superorbital.io#slurm-exporter-slurm-exporter">slurm-exporter</a></li>
      </ul>
    </li>
    <li><a href="https://superorbital.io#in-conclusion">Conclusion</a></li>
    <li><a href="https://superorbital.io#further-reading-and-resources-further-reading-and-resources">Further Reading and Resources</a></li>
  </ul>
</details>

<p>In our <a href="https://superorbital.io/blog/slurm-an-hpc-scheduler-for-batch-workloads/">previous article</a>, Sean introduced <a href="https://slurm.schedmd.com/documentation.html">Slurm</a> as a powerful HPC scheduler for batch workloads. That post serves as an excellent jumping-off point for those new to Slurm, and I highly recommend reading it before this one if you’re unfamiliar with the basics. Today, we’re going to take a more in-depth look at Slurm configuration, provisioning, and management so that you can build and manage your own clusters. Slurm has gained significant traction in AI workloads recently, but we’ll be focusing on CPU-based workloads in this article to keep things manageable. We’ll explore using Slurm for GPU training with PyTorch, as well as other AI applications, in a future post.</p>

<h2 id="building-your-cluster">Building your cluster</h2>

<p>Before we dive into advanced configurations and management techniques, let’s start with setting up a basic Slurm cluster. A great resource for this is the <a href="https://github.com/SergioMEV/slurm-for-dummies">Slurm for Dummies</a> GitHub repository, which I’ve found useful when working with providers that don’t offer managed Slurm solutions. I’ll summarize the basics here. We’re going to assume you have a controller node plus some worker nodes that have Ubuntu installed and can communicate with each other over SSH. The controller node is just a Linux VM (let’s assume Ubuntu); it schedules jobs rather than running workloads, so it can typically be smaller than the worker nodes. Worker nodes are also just Linux VMs, but they have the resources necessary to run the workload, which may mean more CPUs and memory, or specialized accelerators like GPUs.</p>

<h3 id="setup-munge">Setup Munge</h3>

<p>Munge is used for authentication between nodes. Start with the controller node, and install the packages with:</p>

<pre><code class="language-console">sudo apt-get install munge libmunge2 libmunge-dev
</code></pre>

<p>You should now see a key installed at <code>/etc/munge/munge.key</code>; if not, run the following command to create one:</p>

<pre><code class="language-console">sudo /usr/sbin/mungekey
</code></pre>

<p>At this point, munge should have created a user, and all that remains is to give that user the correct file permissions, which can be done by running:</p>

<pre><code class="language-console">sudo chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
sudo chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
sudo chmod 0755 /run/munge/
sudo chmod 0700 /etc/munge/munge.key
sudo chown -R munge: /etc/munge/munge.key
</code></pre>

<p>Then to configure the munge service to run at startup:</p>

<pre><code class="language-console">systemctl enable munge
systemctl restart munge
</code></pre>

<p>Now, for the worker nodes, follow the same procedure, <strong>except</strong> copy the munge key at <code>/etc/munge/munge.key</code> from the controller instead of using the generated one. Make sure to do this before running the file permission commands, and your workers should also be good to go. You can test this by running:</p>

<pre><code class="language-console">munge -n | ssh &lt;CONTROLLER_NODE_HOSTNAME&gt; unmunge 
</code></pre>

<p>from the worker nodes.</p>

<h3 id="setup-slurm">Setup Slurm</h3>

<p>To start, on all nodes run:</p>

<pre><code class="language-console">sudo apt-get update
sudo apt-get install -y slurm-wlm
</code></pre>

<p>Next, you can use Slurm’s handy configuration file generator, which is located at <code>/usr/share/doc/slurmctld/slurm-wlm-configurator.html</code> (open the file with your browser) to create your configuration file. You can learn all about the configuration options <a href="https://slurm.schedmd.com/slurm.conf.html">here</a>, but you only need to configure the following to get started:</p>

<ul>
<li>ClusterName - whatever name you’d like for your cluster; it must be lowercase and 40 characters or less</li>
<li>SlurmctldHost - The hostname of the machine where the Slurm control daemon is executed (you can find this by running <code>hostname -s</code> on the machine). This hostname is optionally followed by either the IP address or another name by which the address can be identified, enclosed in parentheses, e.g.:</li>
</ul>

<pre><code class="language-conf">SlurmctldHost=slurmctl-primary(12.34.56.78)
</code></pre>

<ul>
  <li>NodeName - the output of <code>hostname -s</code> again, but for the worker nodes. Ideally, they are numbered, and you can refer to them like <code>&lt;hostname-prefix&gt;[1-4]</code>; otherwise, you can have multiple entries, one per worker node.</li>
  <li>The values for CPUs, Sockets, CoresPerSocket, and ThreadsPerCore are based on the results from running <code>lscpu</code> on a worker node.</li>
  <li>ProctrackType - LinuxProc, unless you’ve installed <strong><a href="https://slurm.schedmd.com/cgroups.html">proctrack/cgroup</a></strong>, in which case it will be used by default if you don’t set this option.</li>
</ul>

<p>Save the tool’s text output to <code>/etc/slurm/slurm.conf</code> and copy it to the same path on every worker node.</p>
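<p>Put together, a minimal <code>slurm.conf</code> might look something like this (the hostnames and hardware counts are examples only; use your own <code>hostname -s</code> and <code>lscpu</code> output):</p>

```conf
ClusterName=mycluster
SlurmctldHost=slurmctl-primary(12.34.56.78)
ProctrackType=proctrack/linuxproc
# Hardware values must match lscpu on the workers (1 socket x 4 cores x 2 threads = 8 CPUs).
NodeName=worker[1-4] CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
PartitionName=main Nodes=worker[1-4] Default=YES State=UP
```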

<p>Finally, you can enable the Slurm controller service to run on startup with the following (on the worker nodes, enable and restart <code>slurmd</code> instead of <code>slurmctld</code>):</p>

<pre><code class="language-console">systemctl enable slurmctld
systemctl restart slurmctld
</code></pre>

<p>At this point, you can check if the cluster is set up correctly by running:</p>

<pre><code class="language-console">srun hostname
</code></pre>

<h2 id="configuration-deep-dive">Configuration Deep Dive</h2>

<p>Now that we have a basic Slurm cluster up and running, let’s explore some more advanced configuration options and common use cases. For full reference docs, refer to <a href="https://slurm.schedmd.com/slurm.conf.html">this page</a>.</p>

<h3 id="queue-and-workload-management">Queue and Workload Management</h3>

<p>Slurm’s queuing system, known as partitions, allows you to organize and prioritize jobs efficiently. Here’s an example of defining partitions in your <code>slurm.conf</code>:</p>

<pre><code class="language-conf">PartitionName=debug Nodes=node[1-4] Default=YES MaxTime=01:00:00 State=UP

PartitionName=batch Nodes=node[5-20] MaxTime=08:00:00 State=UP
</code></pre>

<p>This configuration creates two partitions:</p>

<ol>
  <li>A “debug” partition for short-running jobs (max 1 hour) on nodes 1-4.</li>
  <li>A “batch” partition for longer-running jobs (max 8 hours) on nodes 5-20.</li>
</ol>

<p>You can further customize these partitions with options like:</p>

<ul>
  <li>
<code>PriorityTier</code>: Set priority levels for partitions.</li>
  <li>
<code>PreemptMode</code>: Configure how jobs can be preempted.</li>
  <li>
<code>OverSubscribe</code>: Allow multiple jobs to run on a single node simultaneously.</li>
</ul>
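<p>For example (the values below are illustrative), a higher-priority partition that preempts by requeueing and allows up to two jobs per node might be declared as:</p>

```conf
# Hypothetical partition combining the three options above.
PartitionName=urgent Nodes=node[1-4] MaxTime=04:00:00 State=UP PriorityTier=10 PreemptMode=REQUEUE OverSubscribe=FORCE:2
```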

<h3 id="handling-node-failures">Handling Node Failures</h3>

<p>Slurm provides robust tools for managing node states and handling failures:</p>

<ol>
  <li>
<strong>Draining Nodes:</strong> When you need to perform maintenance on a node, you can drain it:</li>
</ol>

<pre><code class="language-console">scontrol update NodeName=node5 State=DRAIN Reason="Scheduled maintenance"
</code></pre>

<p>This prevents new jobs from being scheduled on the node while allowing current jobs to complete.</p>

<ol start="2">
  <li>
<strong>Automatic Node Failure Detection:</strong> Configure the <code>SlurmdTimeout</code> option in <code>slurm.conf</code> to automatically mark nodes as down if they don’t respond:</li>
</ol>

<pre><code class="language-conf">SlurmdTimeout=300
</code></pre>

<ol start="3">
  <li>
<strong>ResumeProgram and SuspendProgram:</strong> These scripts can automatically handle node power management:</li>
</ol>

<pre><code class="language-conf">ResumeProgram=/usr/local/bin/slurm_resume.sh

SuspendProgram=/usr/local/bin/slurm_suspend.sh
</code></pre>

<h3 id="useful-plugins">Useful Plugins</h3>

<p>Slurm’s plugin architecture allows for extensive customization. Here are a few particularly useful plugins:</p>

<ol>
  <li>
<strong><a href="https://slurm.schedmd.com/job_submit_plugins.html">job_submit/lua</a>:</strong> Allows you to write custom job submission filters and modifications in Lua.</li>
  <li>
<strong><a href="https://slurm.schedmd.com/cgroups.html">proctrack/cgroup</a>:</strong> Provides better process tracking and resource management using Linux cgroups.</li>
  <li>
<strong><a href="https://slurm.schedmd.com/cons_tres.html">select/cons_tres</a>:</strong> Enables Trackable RESources (TRES) for more granular resource allocation.</li>
</ol>

<p>To enable a plugin, add it to the appropriate line in your <code>slurm.conf</code>, for example:</p>

<pre><code class="language-conf">JobSubmitPlugins=lua
ProctrackType=proctrack/cgroup
SelectType=select/cons_tres
</code></pre>

<h2 id="essential-tools-for-managing-slurm">Essential Tools for Managing Slurm</h2>

<h3 id="sinfo">sinfo</h3>

<p><code>sinfo</code> is your go-to command for getting an overview of your cluster’s state. Some useful options include:</p>

<ul>
  <li>
<code>sinfo -Nel</code>: Provides a detailed node-oriented view.</li>
  <li>
<code>sinfo -t idle,mix,alloc</code>: Shows nodes in specific states.</li>
  <li>
<code>sinfo -o "%n %c %m %t"</code>: Customizes output to show node name, CPUs, memory, and state.</li>
</ul>

<h3 id="scontrol">scontrol</h3>

<p><code>scontrol</code> is a powerful tool for viewing and modifying Slurm’s configuration. Some common uses:</p>

<ul>
  <li>
<code>scontrol show job &lt;job_id&gt;</code>: Displays detailed information about a specific job.</li>
  <li>
<code>scontrol update JobId=&lt;job_id&gt; TimeLimit=02:00:00</code>: Modifies a running job’s time limit.</li>
  <li>
<code>scontrol reconfigure</code>: Reloads the Slurm configuration without restarting services.</li>
</ul>

<h3 id="srun-sbatch">srun/sbatch</h3>

<p>These commands are the primary ways to submit jobs to your Slurm cluster. While <code>srun</code> is used for interactive jobs, <code>sbatch</code> handles batch job submissions.</p>

<p>Here are some examples of running interactive jobs with <code>srun</code>:</p>

<pre><code class="language-console"># Basic interactive job
srun --pty bash
# Request specific resources
srun --cpus-per-task=4 --mem=8G --time=2:00:00 --pty bash
# Run a specific command across multiple nodes
srun --nodes=2 hostname
</code></pre>

<p>Here’s an example of running a batch job with <code>sbatch</code>. Note that you can use the special <code>#SBATCH</code> comments to set command-line arguments, or pass them to the <code>sbatch</code> command at runtime, depending on your use case. I’ve also included some echo statements to print useful metadata:</p>

<pre><code class="language-console">#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

echo "Date start                = $(date)"
echo "Initiating Host           = $(hostname)"
echo "Working Directory         = $(pwd)"
echo ""
echo "Number of Nodes Allocated = ${SLURM_JOB_NUM_NODES}"
echo "Number of Tasks Allocated = ${SLURM_NTASKS}"
echo ""

python my_script.py

RETURN=${?}

echo ""
echo "Exit code                 = ${RETURN}"
echo "Date end                  = $(date)"
echo ""
</code></pre>

<p>Check out the mpi-ping-pong.py script from our <a href="https://superorbital.io/blog/slurm-an-hpc-scheduler-for-batch-workloads/">previous article</a> for a more realistic example of a task to play around with.</p>

<p>Another cool feature that you can take advantage of is job arrays. Job arrays are perfect for parameter sweeps or for processing multiple datasets. Here’s an example with <code>sbatch</code>:</p>

<pre><code class="language-console">#!/bin/bash
#SBATCH --array=0-15
#SBATCH --output=array_%A_%a.out
# $SLURM_ARRAY_TASK_ID contains the array index
python process.py --input-file=dataset_${SLURM_ARRAY_TASK_ID}.txt
</code></pre>

<p>You can also create workflows by introducing dependencies between jobs, for example:</p>

<pre><code class="language-console"># Wait for job completion
sbatch --dependency=afterok:12345 script.sh

# Wait for job start
sbatch --dependency=after:12345 script.sh

# Wait for multiple jobs
sbatch --dependency=afterany:12345:12346:12347 script.sh
</code></pre>
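<p>In practice, you’ll usually capture the job ID from the first submission rather than hard-coding it. <code>sbatch --parsable</code> prints just the job ID, which makes chaining straightforward (the script names here are hypothetical):</p>

<pre><code class="language-console"># Submit the first job and capture its ID
JOB_ID=$(sbatch --parsable preprocess.sh)

# Submit the second job to run only if the first succeeds
sbatch --dependency=afterok:${JOB_ID} train.sh
</code></pre>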

<p>You can read all about <code>sbatch</code> <a href="https://slurm.schedmd.com/sbatch.html">here</a> and <code>srun</code> <a href="https://slurm.schedmd.com/srun.html">here</a>.</p>

<h3 id="submitit">Submitit</h3>

<p><a href="https://github.com/facebookincubator/submitit">Submitit</a> is a Python package that provides a user-friendly interface for submitting and managing Slurm jobs. It’s particularly useful for data scientists and researchers who prefer working in Python environments.</p>

<p>Here’s a simple example of using submitit:</p>

<pre><code class="language-python">import submitit

def train_model(learning_rate, batch_size):
    # Your training code here; return the resulting metric
    accuracy = 0.0
    return accuracy

executor = submitit.SlurmExecutor(folder="log_test")
executor.update_parameters(time=60, mem_gb=8, cpus_per_task=4)

jobs = executor.map_array(train_model,
                          [0.01, 0.001, 0.0001],  # learning rates
                          [32, 64, 128])          # batch sizes

results = [job.result() for job in jobs]
</code></pre>

<p>This script submits three jobs, one per (learning rate, batch size) pair, since <code>map_array</code> pairs its argument lists element-wise like Python’s built-in <code>map</code>. Each job gets 4 CPUs, 8GB of memory, and a 60-minute time limit.</p>
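<p>To sweep the full 3x3 grid of nine hyperparameter combinations instead, you can expand the combinations before mapping, for example with <code>itertools.product</code>. Here’s a sketch that reuses the <code>train_model</code> and <code>executor</code> from the snippet above:</p>

<pre><code class="language-python">from itertools import product

learning_rates = [0.01, 0.001, 0.0001]
batch_sizes = [32, 64, 128]

# Expand to all 9 (learning_rate, batch_size) combinations
combos = list(product(learning_rates, batch_sizes))

# Unzip into the parallel lists that map_array expects
lrs, bss = zip(*combos)

# jobs = executor.map_array(train_model, lrs, bss)  # one job per combination
</code></pre>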

<h3 id="slurm-exporter">slurm-exporter</h3>

<p>The <a href="https://github.com/vpenso/prometheus-slurm-exporter">slurm-exporter</a> allows you to export Slurm metrics to Prometheus, enabling advanced monitoring and alerting capabilities.</p>

<p>To set it up:</p>

<ol>
  <li>Install and configure Prometheus.</li>
  <li>Install the slurm-exporter:</li>
</ol>

<pre><code class="language-console">go install github.com/vpenso/prometheus-slurm-exporter@latest
</code></pre>

<ol>
  <li>Run the exporter:</li>
</ol>

<pre><code class="language-console">prometheus-slurm-exporter
</code></pre>

<ol>
  <li>Add the following to your Prometheus scrape configuration:</li>
</ol>

<pre><code class="language-yaml">scrape_configs:
  - job_name: 'slurm'
    static_configs:
      - targets: ['localhost:8080']
</code></pre>

<p>With this setup, you can create detailed dashboards in Grafana to visualize your cluster’s performance and utilization.</p>

<h2 id="in-conclusion">In Conclusion</h2>

<p>In this article, we’ve covered how to set up your own simple Slurm cluster, walked through some useful configurations to make things more robust, and finally talked about the tools you’ll need to actually manage the cluster. Now you’re ready to start running jobs on your shiny new cluster! In future articles, we’ll explore topics like using Slurm for distributed PyTorch training, optimizing GPU utilization, and integrating Slurm with Docker. For now though, happy Slurming!</p>

<h2 id="further-reading-and-resources">Further Reading and Resources</h2>

<ul>
  <li><a href="https://superorbital.io/blog/slurm-an-hpc-scheduler-for-batch-workloads/">Slurm: An HPC Scheduler for Batch Workloads</a></li>
  <li><a href="https://slurm.schedmd.com/overview.html">Official Slurm Documentation</a></li>
  <li><a href="https://github.com/SergioMEV/slurm-for-dummies">Slurm for Dummies GitHub Repository</a></li>
  <li><a href="https://github.com/facebookincubator/submitit">Submitit GitHub Repository</a></li>
  <li><a href="https://github.com/vpenso/prometheus-slurm-exporter">Prometheus Slurm Exporter</a></li>
</ul>
]]></content>
    <summary type="html">Taking an in-depth look at Slurm configuration, provisioning, and management so that you can build and manage your own clusters</summary>
  </entry>
</feed>
