<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~files/atom-premium.xsl"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:feedpress="https://feed.press/xmlns" xmlns:media="http://search.yahoo.com/mrss/" xmlns:podcast="https://podcastindex.org/namespace/1.0" xml:base="https://superorbital.io/">
  <feedpress:locale>en</feedpress:locale>
  <link rel="hub" href="https://feedpress.superfeedr.com/"/>
  <id>https://superorbital.io/</id>
  <title>SuperOrbital Blog</title>
  <updated>2025-04-16T00:00:00Z</updated>
  <link rel="alternate" href="https://superorbital.io/" type="text/html"/>
  <link rel="self" href="https://feed.superorbital.io/" type="application/atom+xml"/>
  <author>
    <name>Tammer Saleh</name>
    <uri>https://superorbital.io</uri>
  </author>
  <icon>https://superorbital.io/img/logo.png</icon>
  <logo>https://superorbital.io/img/logo.png</logo>
  <entry>
    <id>tag:superorbital.io,2025-04-16:/blog/cluster-api-part-3-cluster-homogenized-workloads/</id>
    <title type="html">Managing Homogenized Workloads Across a Fleet of Cluster API Kubernetes Clusters</title>
    <published>2025-04-16T00:00:00Z</published>
    <updated>2025-04-16T00:00:00Z</updated>
    <link rel="alternate" href="https://superorbital.io/blog/cluster-api-part-3-cluster-homogenized-workloads/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+SuperOrbital" type="text/html"/>
    <content type="html"><![CDATA[
<p><em>This is the third part of our series on Cluster API and how it can be a solution for managing large numbers of clusters at scale. For the previous parts on this series, see <a href="https://superorbital.io/blog/cluster-api-part-1-overview/">part 1</a> and <a href="https://superorbital.io/blog/cluster-api-part-2-capa-bootstrap/">part 2</a>.</em></p>

<details>
  <summary><em>Table of Contents</em></summary>

  <ul>
    <li><a href="https://superorbital.io/blog/cluster-api-part-3-cluster-homogenized-workloads/#installing-workloads-with-clusterresourceset">Installing workloads with ClusterResourceSet</a></li>
    <li><a href="https://superorbital.io/blog/cluster-api-part-3-cluster-homogenized-workloads/#declarative-configurations-with-argocd">Declarative Configurations with ArgoCD</a></li>
    <li><a href="https://superorbital.io/blog/cluster-api-part-3-cluster-homogenized-workloads/#alternative-1-argocd-running-on-the-management-cluster">Alternative 1: ArgoCD running on the management cluster</a></li>
    <li><a href="https://superorbital.io/blog/cluster-api-part-3-cluster-homogenized-workloads/#alternative-2-argocd-running-on-all-managed-clusters">Alternative 2: ArgoCD running on all managed clusters</a></li>
    <li><a href="https://superorbital.io/blog/cluster-api-part-3-cluster-homogenized-workloads/#whats-next">What’s next?</a></li>
  </ul>

</details>

<p>In our previous posts, we looked at how Cluster API (CAPI) works and how we can use an infrastructure provider (CAPA) to bootstrap our own management cluster and its managed clusters. Great, we have a working cluster… but a cluster with no workloads is not very useful. We desperately want to deploy to our clusters, but managing workloads across multiple Kubernetes clusters presents several challenges:</p>

<ol>
  <li>
<strong>Consistency</strong> - Ensuring all clusters run identical workloads</li>
  <li>
<strong>Drift prevention</strong> - Maintaining synchronized configurations over time</li>
  <li>
<strong>Scalability</strong> - Adding new clusters without increasing operational overhead</li>
  <li>
<strong>Maintenance</strong> - Performing updates across the fleet efficiently</li>
</ol>

<p>In this part, we’ll look at how to get workloads into these clusters, and how to solve these problems when managing multiple clusters that all run the same workloads.</p>

<h3 id="installing-workloads-with-clusterresourceset">Installing workloads with ClusterResourceSet</h3>

<p><a href="https://cluster-api.sigs.k8s.io/tasks/experimental-features/cluster-resource-set">ClusterResourceSet</a> (CRS) is a CAPI extension that allows you to define Kubernetes resources as YAML within ConfigMaps and Secrets and automatically apply them to matching clusters. It’s a simple and straightforward way to deploy the same resources to multiple clusters. Here’s an example of how one would use a ClusterResourceSet to deploy a basic monitoring stack:</p>

<pre><code class="language-yaml">apiVersion: addons.cluster.x-k8s.io/v1beta1
kind: ClusterResourceSet
metadata:
  name: monitoring-stack
  namespace: capi-system
spec:
  strategy: Reconcile
  clusterSelector:
    matchLabels:
      environment: production
      cloud: openstack
  resources:
  - kind: ConfigMap
    name: prometheus-yaml
  - kind: ConfigMap
    name: grafana-yaml
  - kind: Secret
    name: monitoring-certs
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-yaml
  namespace: capi-system
data:
  prometheus.yaml: |
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: prometheus
      namespace: monitoring
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: prometheus
      template:
        metadata:
          labels:
            app: prometheus
        spec:
          containers:
          - name: prometheus
            image: prom/prometheus:v2.36.0
            ports:
            - containerPort: 9090
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-yaml
  namespace: capi-system
data:
  grafana.yaml: |
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: grafana
      namespace: monitoring
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: grafana
      template:
        metadata:
          labels:
            app: grafana
        spec:
          containers:
          - name: grafana
            image: grafana/grafana:9.0.0
            ports:
            - containerPort: 3000
</code></pre>

<p>One thing to note: if a Secret holds configuration that’s to be installed, it <em>must</em> have the type <code>addons.cluster.x-k8s.io/resource-set</code> for CAPI to reconcile the YAML; otherwise the controller will ignore it.</p>
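<p>For example, for the <code>monitoring-certs</code> Secret referenced above to be applied, it would need to look something like this (the embedded manifest content is illustrative):</p>

<pre><code class="language-yaml">apiVersion: v1
kind: Secret
metadata:
  name: monitoring-certs
  namespace: capi-system
type: addons.cluster.x-k8s.io/resource-set  # required, or the CRS controller ignores this Secret
stringData:
  monitoring-certs.yaml: |
    # Manifests to apply on matching clusters, e.g. a TLS Secret
    # consumed by the monitoring stack (contents are illustrative)
    apiVersion: v1
    kind: Secret
    metadata:
      name: monitoring-tls
      namespace: monitoring
    type: kubernetes.io/tls
    stringData:
      tls.crt: REDACTED-CERT-PEM
      tls.key: REDACTED-KEY-PEM
</code></pre>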

<p>A ClusterResourceSet is mostly used to deploy resources on clusters for bootstrapping purposes. As an example, when the managed cluster is created, depending on the cluster type, it might be missing crucial cluster components such as networking via a CNI. For an EKS cluster, this is not an issue as they are pre-configured with the <a href="https://github.com/aws/amazon-vpc-cni-k8s">Amazon VPC CNI</a>, but for CAPI-managed Kubeadm control planes, a CNI must be present before any workload runs in the cluster. For this purpose, a CRS can be defined to allow for bootstrapping necessary resources on a cluster.</p>
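<p>A sketch of that bootstrapping pattern is shown below. The ConfigMap would typically be generated from the CNI’s install manifest, e.g. with <code>kubectl create configmap calico-cni --from-file=calico.yaml -n capi-system</code>; the names and labels here are illustrative:</p>

<pre><code class="language-yaml">apiVersion: addons.cluster.x-k8s.io/v1beta1
kind: ClusterResourceSet
metadata:
  name: cni-bootstrap
  namespace: capi-system
spec:
  strategy: ApplyOnce        # apply once at cluster creation; no ongoing reconciliation
  clusterSelector:
    matchLabels:
      cni: calico            # only clusters opting into this CNI
  resources:
  - kind: ConfigMap
    name: calico-cni         # holds the full CNI install manifest
</code></pre>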

<p>By default, CAPI applies each resource only once and does not delete resources when the ClusterResourceSet is removed. It does have limited support for reconciliation via the <code>spec.strategy</code> field, so it can re-apply a resource whenever its manifest is modified. There is, however, no drift detection, and managing complex applications this way is infeasible, so this is not a recommended approach for installing and managing general-purpose workloads.</p>

<h3 id="declarative-configurations-with-argocd">Declarative Configurations with ArgoCD</h3>

<p>If a CRS is not the solution for deploying to our fleet of clusters, then what is? This is where a good <a href="https://en.wikipedia.org/wiki/Continuous_delivery">continuous delivery</a> (CD) pipeline comes into play. Once the cluster is bootstrapped with its CNI and is ready to accept workloads, a CD automation tool such as <a href="https://fluxcd.io/">Flux</a> or <a href="https://argo-cd.readthedocs.io/en/stable/">ArgoCD</a> can perform the rest of the work of installing all the applications that must run on the managed clusters. One benefit of using a GitOps-y tool like these is that the desired cluster state can be expressed as a series of manifests in a Git repository, which provides an auditable, version-controlled deployment flow for all the clusters at once. For this particular example, we’ll be using ArgoCD and the <a href="https://argo-cd.readthedocs.io/en/latest/operator-manual/cluster-bootstrapping/#app-of-apps-pattern">“app of apps”</a> pattern to install applications declaratively: a single ArgoCD Application is installed on each cluster “manually” (i.e. with some method outside of ArgoCD), and that Application describes the other Applications containing the real workloads to be installed.</p>

<h4 id="alternative-1-argocd-running-on-the-management-cluster">Alternative 1: ArgoCD running on the management cluster</h4>

<p>This is a C&amp;C (command and control) setup, where you have a single source of control for all the applications that get installed in the cluster. ArgoCD is installed on the management cluster, and every time a new managed cluster is created, the details for that cluster (API server endpoint, authentication) are added to ArgoCD so that it can then install all the applications on that cluster.</p>

<blockquote>
  <p>Note: the examples here are demonstrative, since the Application/Project layout will differ depending on the nature and organization of your workloads.</p>
</blockquote>

<ol>
  <li>
<a href="https://argo-cd.readthedocs.io/en/stable/getting_started/">Install ArgoCD</a> on the management cluster.</li>
  <li>
<a href="https://argo-cd.readthedocs.io/en/stable/operator-manual/declarative-setup/#clusters">Register your clusters</a> in ArgoCD. Alternatively, use a tool like SuperOrbital’s <a href="https://github.com/superorbital/capargo">capargo</a> to manage adding new clusters in ArgoCD for you.</li>
  <li>Create an Application manifest for your workload:</li>
</ol>

<pre><code class="language-yaml">apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: standard-workload
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/standard-workloads
    targetRevision: main
    path: base
  destination:
    server: https://kubernetes.default.svc
    namespace: workloads
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
</code></pre>
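<p>For reference, the declarative cluster registration from step 2 boils down to creating a Secret like the following on the management cluster (the cluster name, endpoint, and credentials are placeholders):</p>

<pre><code class="language-yaml">apiVersion: v1
kind: Secret
metadata:
  name: prod-cluster-1
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster  # marks this Secret as a cluster registration
    environment: production                  # usable in cluster generator selectors
type: Opaque
stringData:
  name: prod-cluster-1
  server: https://prod-cluster-1.example.com:6443
  config: |
    {
      "bearerToken": "REDACTED-SERVICE-ACCOUNT-TOKEN",
      "tlsClientConfig": {
        "caData": "BASE64-ENCODED-CA-CERT"
      }
    }
</code></pre>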

<p>Once this is done, you can use <a href="https://argo-cd.readthedocs.io/en/stable/user-guide/application-set/">an ApplicationSet</a> to deploy that workload to multiple clusters, using <a href="https://argo-cd.readthedocs.io/en/stable/operator-manual/applicationset/Generators-Cluster/">the cluster generator</a> to automatically generate the cluster parameters from the clusters that have already been registered with ArgoCD. As an example:</p>

<pre><code class="language-yaml">apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: fleet-workload
  namespace: argocd
spec:
  generators:
  - clusters:
      selector:
        matchLabels:
          environment: production
  template:
    metadata:
      name: '{{name}}-standard-workload'
    spec:
      project: default
      source:
        repoURL: https://github.com/your-org/standard-workloads
        targetRevision: main
        path: overlays/{{metadata.labels.region}}
      destination:
        server: '{{server}}'
        namespace: workloads
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
</code></pre>

<p>Of course, with every approach there are a few pros and cons:</p>

<p>Pros:</p>

<ul>
  <li>Single-pane of glass overview of the state of all the applications on all clusters.</li>
  <li>ArgoCD is physically segregated from each cluster that it manages, which improves the security standing.</li>
  <li>The <a href="https://argo-cd.readthedocs.io/en/stable/operator-manual/applicationset/Generators-Cluster/">Cluster Generator</a> feature becomes available for all registered clusters, which allows targeting specific clusters for a given Application via a label selector.</li>
</ul>

<p>Cons:</p>

<ul>
  <li>Only one ArgoCD exists, in the management cluster. This all but ensures that ArgoCD must be installed <a href="https://argo-cd.readthedocs.io/en/stable/operator-manual/high_availability/">in HA mode</a> to minimize downtime during management cluster outages or maintenance.</li>
  <li>Without the use of capargo, adding a new cluster to ArgoCD requires manual intervention to create a special secret with the API server endpoint and the kubeconfig credentials for ArgoCD to authenticate and access the workloads on the cluster.</li>
  <li>Depending on the number of clusters being managed, it could require a lot of resources to operate ArgoCD on the management cluster.</li>
</ul>

<p>Fortunately, this is not the only approach we can take. What if instead we ran ArgoCD <em>everywhere?</em></p>

<h4 id="alternative-2-argocd-running-on-all-managed-clusters">Alternative 2: ArgoCD running on all managed clusters</h4>

<p>This is a distributed setup where ArgoCD is installed on each managed cluster and preconfigured with the appropriate bootstrap application to install all the workloads needed on each cluster. The central ArgoCD on the management cluster is only tasked with ensuring that all the other ArgoCDs are kept up-to-date and have the “app-of-apps” configuration updated. In this situation, we can leverage CAPI’s ClusterResourceSet to bootstrap ArgoCD on every cluster with the “app of apps” configuration.</p>

<blockquote>
  <p>Note: the YAML for the base ArgoCD install <a href="https://github.com/argoproj/argo-cd/blob/master/manifests/install.yaml">is quite large</a>, so for the sake of brevity it is omitted from the example below.</p>
</blockquote>

<pre><code class="language-yaml">apiVersion: addons.cluster.x-k8s.io/v1beta1
kind: ClusterResourceSet
metadata:
  name: argocd-bootstrap
  namespace: capi-system
spec:
  clusterSelector:
    matchLabels:
      argocd: enabled
  resources:
  - kind: ConfigMap
    name: argocd-installer
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-installer
  namespace: capi-system
data:
  argocd.yaml: |
    apiVersion: v1
    kind: Namespace
    metadata:
      name: argocd
    ---
    # ArgoCD YAML would be here...
    ---
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: cluster-workloads
      namespace: argocd
      finalizers:
      - resources-finalizer.argocd.argoproj.io
    spec:
      project: default
      source:
        repoURL: https://github.com/your-org/standard-workloads
        targetRevision: main
        path: base  # CRS applies this ConfigMap verbatim, so no per-cluster templating is available here
      destination:
        server: https://kubernetes.default.svc
        namespace: workloads
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
</code></pre>

<p>With this setup, you can configure a “meta” ArgoCD instance to manage the ArgoCDs (but not their workloads) on the management cluster:</p>

<pre><code class="language-yaml">apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: argocd-fleet-manager
  namespace: argocd
spec:
  generators:
  - clusters:
      selector:
        matchLabels:
          argocd: enabled
  template:
    metadata:
      name: '{{name}}-argocd-config'
    spec:
      project: default
      source:
        repoURL: https://github.com/your-org/argocd-fleet-config
        targetRevision: main
        path: clusters/{{name}}
      destination:
        server: '{{server}}'
        namespace: argocd
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
</code></pre>

<p>Alas, this is not a perfect solution either, so choose the approach that best fits your specific use-case. Let’s review the pros and cons of running ArgoCD everywhere:</p>

<p>Pros:</p>

<ul>
  <li>A simpler configuration on the management cluster, as the single ArgoCD on the management cluster does not need to maintain the workloads for the entire cluster fleet.</li>
  <li>Easier installation and maintenance process since each cluster installs its own ArgoCD, and it can come pre-baked with a directive to install workloads on itself. This also means that each ArgoCD itself consumes fewer resources as it only cares about a single cluster.</li>
  <li>As new versions and configurations of ArgoCD are developed and deployed, it’s much easier to roll out these changes to a few clusters at a time and ensure that no bugs are introduced.</li>
</ul>

<p>Cons:</p>

<ul>
  <li>ArgoCD’s Redis server caches the manifests deployed in the cluster, which can become a security issue if access to the ArgoCD namespace on each cluster is not properly locked down.</li>
  <li>It is harder to coordinate the synchronization of workloads as they are deployed across different clusters.</li>
  <li>If each per-cluster instance needs to be exposed via an Ingress, every additional load balancer potentially incurs more cost.</li>
  <li>No “single pane of glass” capability for a quick overview of the state of the world.</li>
</ul>

<h3 id="whats-next">What’s next?</h3>

<p>CAPI is a great cloud-agnostic tool to build a platform where you can go from zero infrastructure to a fleet of homogeneous, production-ready clusters with a bit of initial configuration and some manifests.</p>

<p>However, there are times when a company has different clusters for different use cases. Maybe some clusters only host CI/CD pipeline tools and their runners; others run specialized ML workloads. When the number of deployed cluster types grows into the dozens, you’ll soon find that specifying the resources for each cluster becomes an exercise in writing a lot of boilerplate.</p>

<p>In a future article, we’ll get into <a href="https://cluster-api.sigs.k8s.io/tasks/experimental-features/cluster-class/">ClusterClass</a> objects and how they can help you simplify the management of hundreds of different kinds of clusters and help you lose the fear of the single unique snowflake clusters that cannot be recreated!</p>

<p><a href="https://feed.superorbital.io/">Subscribe (yes, we still ❤️ RSS)</a> or join our mailing list below to stay updated!</p>
]]></content>
    <summary type="html">Combining CAPI's cluster lifecycle management with ArgoCD's GitOps workflow for a sane cluster fleet workload rollout!</summary>
  </entry>
  <entry>
    <id>tag:superorbital.io,2025-03-03:/blog/in-place-vertical-pod-scaling/</id>
    <title type="html">In-Place Vertical Pod Scaling: The Future of Resource Management</title>
    <published>2025-03-03T00:00:00Z</published>
    <updated>2025-03-03T00:00:00Z</updated>
    <link rel="alternate" href="https://superorbital.io/blog/in-place-vertical-pod-scaling/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+SuperOrbital" type="text/html"/>
    <content type="html"><![CDATA[<details>
  <summary><em>Table of Contents</em></summary>

  <ul>
    <li><a href="https://superorbital.io/blog/in-place-vertical-pod-scaling/#what-is-in-place-vertical-pod-scaling">What is “In-Place Vertical Pod Scaling”?</a></li>
    <li><a href="https://superorbital.io/blog/in-place-vertical-pod-scaling/#how-to-enable-in-place-vertical-pod-scaling">How to enable In-Place Vertical Pod Scaling</a></li>
    <li><a href="https://superorbital.io/blog/in-place-vertical-pod-scaling/#how-does-it-look-like-in-the-pod">What does it look like in the Pod?</a></li>
    <li><a href="https://superorbital.io/blog/in-place-vertical-pod-scaling/#potential-impact-on-the-vertical-pod-autoscaler">Potential impact on the Vertical Pod Autoscaler</a></li>
    <li><a href="https://superorbital.io/blog/in-place-vertical-pod-scaling/#final-thoughts">Final thoughts</a></li>
  </ul>

</details>

<p>Kubernetes has long been able to scale the number of workload replicas easily using Horizontal Pod Autoscaling (HPA), but one challenge has always been adjusting CPU and memory resources for Deployments, StatefulSets, and DaemonSets without restarting them. That changed with the introduction of In-Place Vertical Pod Scaling, added in Kubernetes 1.27.</p>

<p>This new feature allows you to adjust CPU and memory resources (both requests and limits) in running pods without having to recreate the Pod, which provides a smoother, less disruptive way to dynamically adjust resources. In this post, we’ll go through how to enable this feature, demonstrate changes to the Pod spec, explain the new entries in Pod status, and discuss how this will affect open-source projects like the Vertical Pod Autoscaler (VPA).</p>

<h3 id="what-is-in-place-vertical-pod-scaling">What is “In-Place Vertical Pod Scaling”?</h3>

<p>In earlier versions of Kubernetes, if you needed to adjust the resource allocation of a Pod, it had to be terminated and recreated. This is because, even though Docker and other container runtimes <a href="https://docs.docker.com/reference/cli/docker/container/update/">allow for dynamically updating the resource configuration</a> at runtime, the Pod spec marked the <code>resources</code> field as <a href="https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#resources">immutable</a>: once set, it could not be modified. This meant that the only way to “modify” the resources of a Pod was to delete the old one and create a new one with the updated resource values.</p>

<p>In-Place Vertical Pod Scaling promises to change this, by adding new fields that allow for the resources to be modified at runtime, and providing new status fields that indicate the progress in performing the change. As this is an alpha-level feature, with changes still being made, a feature gate needs to be enabled to try it out.</p>

<blockquote>
  <p>The concept of “in-place” scaling for Pods goes against the “treat workloads as cattle, not pets” ethos that Kubernetes was built upon. However, there are some special use-cases, such as when using any kind of stateful or long-running workload, where exceptions exist and justify the existence of features like this.</p>
</blockquote>

<h3 id="how-to-enable-in-place-vertical-pod-scaling">How to enable In-Place Vertical Pod Scaling</h3>

<p>Starting in Kubernetes 1.27, and as of this writing, the In-Place Vertical Pod Scaling feature sits behind a feature gate, which must be set to <code>true</code> before the feature can be used in the cluster. To do this you’ll need to:</p>

<h4 id="update-api-server-and-controller-manager-flags">1. <strong>Update API Server and Controller Manager Flags</strong>
</h4>

<p>In your Kubernetes cluster configuration, you’ll need to enable the <code>InPlacePodVerticalScaling</code> feature gate. You can do this by modifying the <code>kube-apiserver</code>, <code>kube-scheduler</code>, and <code>kube-controller-manager</code> flags.</p>

<p>In each of these components, add the flag:</p>

<pre><code>--feature-gates=InPlacePodVerticalScaling=true
</code></pre>
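<p>If you want to try this locally, a tool like <a href="https://kind.sigs.k8s.io/">kind</a> can enable the gate on every component at cluster creation via its config file (a sketch; the node layout is illustrative):</p>

<pre><code class="language-yaml"># kind-config.yaml: create the cluster with "kind create cluster --config kind-config.yaml"
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  # Propagated to the API server, controller manager, scheduler, and kubelets
  InPlacePodVerticalScaling: true
nodes:
- role: control-plane
- role: worker
</code></pre>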

<h4 id="update-the-kubernetes-version-on-all-nodes">2. <strong>Update the Kubernetes version on all nodes</strong>
</h4>

<p>Ensure that your nodes are running a Kubernetes version that supports this feature (1.27 or higher) and that the Kubelet’s <a href="https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/#overview">feature gates</a> are updated to enable In-Place Vertical Pod Scaling.</p>

<p>Certain cloud providers, such as GKE, can also provide a place for testing this feature by creating a cluster with <a href="https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--enable-kubernetes-alpha">all alpha features</a> turned on. Keep in mind that this may affect the stability of the cluster and therefore caution should be taken before making this change.</p>

<h3 id="how-does-it-look-like-in-the-pod">What does it look like in the Pod?</h3>

<p>Once the feature gate is enabled, modifying resource requests or limits on a running pod becomes possible. A container’s resource <code>requests</code> and <code>limits</code> will be <em>mutable</em> for CPU and memory resources. With the feature, these fields represent the <em>desired</em> CPU and memory resource requests and limits for the container. There is also now a <code>resizePolicy</code> array with two required fields:</p>

<ol>
  <li>
<code>resourceName</code>: Specifies which resource this policy applies to. Currently only <code>"cpu"</code> and <code>"memory"</code> are supported.</li>
  <li>
<code>restartPolicy</code>: Defines whether a container restart is required when this resource is modified. This field can have two possible values: <code>NotRequired</code>, meaning the container does not need to be restarted when this resource is changed, and <code>RestartContainer</code>, meaning the container must be restarted to apply changes to this resource.</li>
</ol>

<p>By default, if no <code>resizePolicy</code> is specified for a resource, Kubernetes treats it as if <code>restartPolicy: RestartContainer</code> is set. An example of all these fields in the Pod spec is shown below:</p>

<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        memory: "100Mi"
        cpu: "100m"
      limits:
        memory: "200Mi"
        cpu: "200m"
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired
    - resourceName: memory
      restartPolicy: RestartContainer
</code></pre>
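<p>With the feature gate enabled, a resize can be triggered by patching the running Pod’s <code>resources</code> directly. The snippet below uses the <code>nginx</code> Pod above; note that newer Kubernetes releases move this to a dedicated <code>resize</code> subresource:</p>

<pre><code># Raise the CPU request and limit on the running Pod without recreating it
kubectl patch pod nginx --patch '{
  "spec": {
    "containers": [{
      "name": "nginx",
      "resources": {
        "requests": {"cpu": "150m"},
        "limits": {"cpu": "300m"}
      }
    }]
  }
}'

# Then watch the progress via the resize status field
kubectl get pod nginx -o jsonpath='{.status.resize}'
</code></pre>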

<p>The status field of a Pod now has more information as well. As of 1.27, the <code>resize</code> field tracks the progress of resize operations. It can have the following values:</p>

<ul>
  <li>
<code>Proposed</code>: The <code>resources</code> field was modified to update the desired resources, but the Kubelet has not yet started the process of resizing.</li>
  <li>
<code>InProgress</code>: The Kubelet has accepted the resize request and is in the process of applying it to the pod’s containers.</li>
  <li>
<code>Deferred</code>: The requested resize cannot be completed at this moment. The Kubelet will continue to retry the resize, and it may be granted when other pods are removed and node resources are freed up.</li>
  <li>
<code>Infeasible</code>: The requested resize cannot be performed on the container, such as when the resize exceeds the maximum resources in a node.</li>
  <li>
<code>""</code>: An empty or unset value indicates that the last resize operation was completed.</li>
</ul>

<p>However, this will change in 1.33: <a href="https://github.com/kubernetes/enhancements/pull/5089">this change</a>, merged only three weeks ago as of writing, replaces the <code>resize</code> status with two new Pod conditions, <code>PodResizePending</code> and <code>PodResizing</code>. For now, we’ll continue showing the pre-1.33 schema.</p>

<p>The <code>allocatedResources</code> field in the <code>containerStatuses</code> of the Pod’s status reflects the resources currently allocated to the pod’s containers, as reported by the container runtime, if the container is running. For a non-running container, these are the resources allocated for the container when it starts:</p>

<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  # Pod spec as above
status:
  resize: Proposed
  containerStatuses:
  - name: nginx
    allocatedResources:
      cpu: "200m"
      memory: "100Mi"
    # other status fields

</code></pre>

<p>Finally, a new Pod condition type <code>Resizing</code> was also introduced, indicating that a Pod’s resources are being modified in-place. (Note that two more conditions will be present when 1.33 is released, as explained earlier in this post.) This condition appears in the Pod’s status like so:</p>

<pre><code class="language-yaml">status:
  conditions:
  - type: Resizing
    status: "True"
    lastTransitionTime: "2023-04-01T12:00:00Z"
    reason: ResizeStarted
    message: "Pod resources are being modified"
</code></pre>

<h3 id="potential-impact-on-the-vertical-pod-autoscaler">Potential Impact on the Vertical Pod Autoscaler</h3>

<p>The <a href="https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler"><strong>Vertical Pod Autoscaler</strong></a> is an important scaling tool that automatically adjusts resource requests and limits for pods based on usage patterns. It observes current resource usage (leveraging information from the <a href="https://github.com/kubernetes-sigs/metrics-server">Kubernetes Metrics Server</a>), suggests a resource value with some buffer room for unexpected spikes, and applies the value to the Pod. However, the way it does this is by modifying the resource values in the Pod’s controlling object (think Deployment or StatefulSet) and then evicting the existing Pods so that new Pods with the updated resource values can replace them. For stateful workloads, such as a single database or a long-running workload with in-memory cached state, this has always been a particular source of frustration because you could get locked into failure scenarios that you could not scale your way out of. Imagine these situations:</p>

<ol>
  <li>
    <p>A database that runs on a Pod has a high startup CPU spike that is dependent on the amount of rows it has to process. However, after startup the process uses very little CPU. The VPA sees this and readjusts the CPU request/limits to be lower, and evicts the Pod to apply these values. The new Pod comes up with lower resource values, and due to its startup sequence, it doesn’t have enough resources to start. Oops. This DB is also stateful so you can’t have more than one Pod running at a time, and now you have downtime. Double oops.</p>
  </li>
  <li>
    <p>A long-running query that’s being executed on a Pod has been pegging its CPU usage for a few minutes, and the VPA sees this and increases the CPU limit. Now your job can use more CPU to finish faster, but the Pod needs to be recreated to see this new value, which kills this query. Triple oops.</p>
  </li>
</ol>

<p>These are just a few situations where scaling becomes a problem for stability, and the temporary solutions are suboptimal given the limitations of the VPA. For that first example, the Pod now needs to have a minimum amount of CPU that never gets used outside of its startup procedure, which contributes to inefficient workload resource allocation, and for the second example autoscaling is just completely disabled.</p>

<p>However, with In-Place Vertical Pod Scaling, we can do these kinds of adjustments on-the-fly without restarting the Pod, which avoids the situations previously mentioned. And it seems that there are <a href="https://github.com/kubernetes/autoscaler/pull/7673">PRs</a> on the way to add this feature to the VPA. This promises to make the situations previously described a thing of the past!</p>

<h3 id="final-thoughts">Final thoughts</h3>

<p>Kubernetes 1.27’s <strong>In-Place Vertical Pod Scaling</strong> feature is a welcome improvement to resource management, offering the ability to scale pods without having to recreate them. This is particularly useful for workloads with changing resource demands, allowing more flexibility and less downtime. With the potential integration into the Vertical Pod Autoscaler, Kubernetes is becoming even more powerful for managing dynamic workloads with minimal disruption.</p>

<p><a href="https://feed.superorbital.io/">Subscribe (yes, we still ❤️ RSS)</a> or join our mailing list below to see more blog posts like this!</p>
]]></content>
    <summary type="html">A new way to adjust the resource allocation of running pods with dynamic resource needs without having to recreate them!</summary>
  </entry>
  <entry>
    <id>tag:superorbital.io,2025-02-26:/blog/the-skys-the-limit/</id>
    <title type="html">The Sky's the Limit: Why Sky Computing is the Cloud’s Future</title>
    <published>2025-02-26T00:00:00Z</published>
    <updated>2025-02-26T00:00:00Z</updated>
    <link rel="alternate" href="https://superorbital.io/blog/the-skys-the-limit/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+SuperOrbital" type="text/html"/>
    <content type="html"><![CDATA[
<h2 id="the-forecast-is-cloudy">The Forecast is Cloudy</h2>

<p>Cloud computing revolutionized the IT industry. It delivered capabilities that traditional infrastructure could never match: on-demand scalability, pay-as-you-go pricing models, and near-instant global reach. Businesses no longer needed to invest heavily in physical servers or complex maintenance – cloud providers took care of everything. The cloud made it possible to experiment and innovate faster, and it allowed startups and enterprises alike to focus on their <em>products</em> instead of on their data centers. For a while, it seemed like cloud computing was the ultimate solution.</p>

<p>But now, cracks are starting to show. Vendor lock-in ties companies to proprietary tools and ecosystems, making migration increasingly complex. The overwhelming cost and difficulty of moving data between clouds, sometimes called Data Gravity, has created virtual moats around each cloud. And of course, no list of drawbacks to cloud environments would be complete without mention of egress fees. AWS charges <a href="https://www.digitalocean.com/resources/articles/aws-egress-costs">$0.09 per GB</a> simply to move your data out, which can cost enterprises hundreds of thousands of dollars annually. So, while the cloud liberated us from our hardware constraints, it has clapped us in new, very expensive handcuffs.</p>

<h3 id="cloud-repatriation">Cloud Repatriation</h3>

<p>So then, what’s the path forward? Companies like <a href="https://arstechnica.com/information-technology/2024/10/basecamp-maker-37signals-says-its-cloud-exit-will-save-it-10m-over-5-years/">37Signals have shifted back to on-premises infrastructure</a>, and many <a href="https://www.cio.com/article/2104613/private-cloud-makes-its-comeback-thanks-to-ai.html">others</a> are in the process of doing the same. <a href="https://www.citrix.com/news/announcements/feb-2024/research-finds-it-leaders-are-choosing-hybrid-cloud-strategies-due-to-flexibility-costeffectiveness-and-security.html">A report by Citrix</a> last year says that:</p>

<p><em>“<strong>42% of organizations</strong> surveyed in the United States are considering or already have moved at least half of their cloud-based workloads back to on-premises infrastructures, a phenomenon known as cloud repatriation.”</em></p>

<p>While I applaud these companies’ pragmatic view of their infrastructure costs, it raises a question: did these companies meticulously plan their cloud usage from the start with cost optimization in mind, or are they simply reacting to unexpectedly high AWS bills? Most evidence points to the latter. Companies that are now “repatriating” their infrastructure often failed to implement basic cloud cost controls from day one—such as automatically shutting down dev environments during off-hours, using spot instances for batch workloads, or implementing resource quotas. Rather than fixing these foundational issues, they’re choosing to abandon the cloud completely.</p>

<p>Taking a step back to on-prem might not truly be a step forward in the long run. But if we remain in the cloud, how do we address the fundamental challenges of vendor lock-in, data gravity, and rising costs in a transformative way, rather than just applying incremental fixes?</p>

<h3 id="sky-computing-the-next-evolution-of-cloud-computing">Sky Computing: The Next Evolution of Cloud Computing</h3>

<p><strong>Sky Computing</strong> is a common-sense path forward for our fractured cloud ecosystem. Imagine a “cloud of clouds”, where workloads flow seamlessly between providers, free from lock-in and inefficiency, without you having to lift a finger. It’s not just another insufferable buzzword (or buzzphrase, for you pedants); it’s the next <em>logical</em> evolution in cloud infrastructure. Removing provider-specific complexity through its abstraction layer (explained below) enables businesses to prioritize performance, cost, and compliance over loyalty to a single vendor. It’s the freedom the cloud always promised but never delivered.</p>

<h3 id="wait-doesnt-multi-cloud-do-this-already">“Wait, doesn’t Multi-Cloud do this already?”</h3>

<p><strong>Not exactly.</strong> Multi-cloud strategies involve leveraging multiple cloud service providers to distribute workloads. While this approach offers benefits like redundancy and access to diverse services, it often results in fragmented operations. Each cloud platform operates in its own silo, requiring distinct management tools and expertise. This fragmentation leads to increased complexity and inefficiencies, negating some of the advantages of a multi-cloud setup.</p>

<p><strong>Sky Computing</strong>, on the other hand, overcomes the limitations of multi-cloud by unifying disparate cloud environments into a cohesive, interoperable ecosystem. Instead of treating each cloud as an isolated entity, Sky Computing orchestrates them to function as a single, harmonious infrastructure. This integration eliminates silos, enabling seamless interaction and workload mobility across all participating clouds.</p>

<h2 id="why-sky-computing">Why Sky Computing?</h2>

<h3 id="a-seamless-cloud-experience">A Seamless Cloud Experience</h3>

<p>As an end user of a Sky Computing product, you’ll no longer care which cloud provider your applications run on. You interact solely with the Sky Computing broker you’ve chosen. Its abstraction layer decides which cloud platform is best for your workload (or honors your preferences, should you have any). The broker’s decision may change over time as provider costs rise and fall, as abstraction algorithms improve, or as your own requirements change (data locality, for example).</p>

<p>The more cloud platforms the abstraction layer supports, the greater the flexibility it has in choosing the best cloud platform for your business and applications, and the greater cost savings it can pass on to you.</p>

<p><em>[Figure: a Sky Computing broker sitting between the user and multiple cloud providers, routing workloads to whichever platform fits best]</em></p>

<h3 id="regulations-and-resilience">Regulations and Resilience</h3>

<p>Moreover, rising regulations like <a href="https://www.cloudflare.com/en-ca/learning/privacy/what-is-data-sovereignty/">GDPR and CCPA</a> are forcing companies to comply with strict data sovereignty rules, demanding infrastructure that adapts to regional requirements. And when major outages occur—like the <a href="https://www.crn.com/news/cloud/2024/aws-outage-hits-amazon-services-ring-whole-foods-alexa">ones that</a> <a href="https://www.bleepingcomputer.com/news/microsoft/microsoft-azure-outage-takes-down-services-across-north-america/">have left</a> <a href="https://www.forbes.com/sites/emilsayegh/2024/07/31/microsoft-and-aws-outages-a-wake-up-call-for-cloud-dependency/">entire businesses</a> <a href="https://www.datacenterknowledge.com/outages/a-history-of-google-cloud-and-data-center-outages">offline for hours</a>—it further highlights the dangers of single-cloud dependency. Sky Computing isn’t just an opportunity; it’s becoming a critical next step for businesses that need to stay agile, resilient, and competitive in this rapidly changing world.</p>

<h2 id="the-3-pillars-of-sky-computing">The 3 Pillars of Sky Computing</h2>

<h3 id="pillar-1-abstraction">Pillar 1: Abstraction</h3>

<p>At the heart of Sky Computing lies <strong>abstraction</strong>, which serves as the glue that unifies disparate cloud platforms. Through a compatibility layer leveraging tools like Kubernetes, Ray, and standardized APIs, Sky Computing hides the complexities of individual clouds. This means you won’t need to worry about whether your data sits in AWS S3 or Azure Blob Storage—the abstraction layer handles those details, deciding on the optimal storage based on cost and performance.</p>

<p>By providing a “write once, run anywhere” experience, abstraction eliminates vendor lock-in and simplifies application deployment. As more cloud platforms become supported, the abstraction layer offers greater flexibility, enabling smoother transitions and substantial cost savings without requiring you to change how you develop or manage your workloads.</p>
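<p>To make the idea concrete, here is a toy sketch of such an abstraction layer for object storage. Every name in it is hypothetical—real Sky Computing stacks build on tools like Kubernetes, Ray, and the S3 API rather than this code—but it shows the shape of the layer: callers use one interface, and the layer picks a concrete backend on cost and constraints.</p>

```python
# Toy storage abstraction: callers never name a provider; the layer
# chooses the backend. All class and backend names are hypothetical.
from dataclasses import dataclass

@dataclass
class Backend:
    name: str            # e.g. "aws-s3", "azure-blob"
    cost_per_gb: float   # storage price, $/GB-month (illustrative numbers)
    region: str

class SkyStorage:
    """'Write once, run anywhere': one put/get API over many clouds."""

    def __init__(self, backends):
        self._backends = backends
        self._objects = {}   # key -> (backend, data); in-memory stand-in

    def put(self, key, data, region=None):
        # Choose the cheapest backend that satisfies the region constraint.
        candidates = [b for b in self._backends
                      if region is None or b.region == region]
        backend = min(candidates, key=lambda b: b.cost_per_gb)
        self._objects[key] = (backend, data)
        return backend.name

    def get(self, key):
        return self._objects[key][1]

s3 = Backend("aws-s3", cost_per_gb=0.023, region="us-east-1")
blob = Backend("azure-blob", cost_per_gb=0.018, region="eu-west-1")
store = SkyStorage([s3, blob])

print(store.put("report.csv", b"a,b\n1,2\n"))             # → azure-blob (cheapest)
print(store.put("us-data.csv", b"", region="us-east-1"))  # → aws-s3 (region pinned)
```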

<p>Even beyond the Hypercloud providers of today like AWS, GCP, and Azure, you’ll see support for <a href="https://semianalysis.com/2024/10/03/ai-neocloud-playbook-and-anatomy/">Neocloud providers</a> like <a href="https://lambdalabs.com/">Lambda Labs</a>, <a href="https://www.coreweave.com/">Coreweave</a>, or <a href="https://nebius.com/">Nebius</a>, providing even greater operational flexibility.</p>

<h3 id="pillar-2-automation">Pillar 2: Automation</h3>

<p>Building on the foundation of abstraction, <strong>automation</strong> is the next pillar driving Sky Computing forward. Intercloud brokers embody this pillar by serving as intelligent decision-makers that manage workload placement, cost optimization, and compliance across multiple clouds.</p>

<p>These brokers continuously analyze factors like pricing, resource availability, and regulatory requirements using AI and real-time data. They automatically route or adjust workloads based on current conditions, ensuring your applications always run in the most cost-effective and efficient environment. This removes the manual overhead of juggling multiple cloud providers, reduces the chance of human error, and lets you focus on higher-level tasks while the system optimizes operations behind the scenes.</p>
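<p>The core of that placement loop can be sketched in a few lines. The provider data and scoring rule below are invented for illustration; a real broker would pull live pricing and capacity from each cloud’s APIs rather than a static table.</p>

```python
# Toy intercloud-broker placement: pick the cheapest provider that
# satisfies the workload's capacity and region constraints.

def place(workload, providers):
    eligible = [
        p for p in providers
        if p["gpus_free"] >= workload["gpus"]
        and workload["region"] in p["regions"]
    ]
    if not eligible:
        raise RuntimeError("no provider satisfies the constraints")
    return min(eligible, key=lambda p: p["price_per_gpu_hr"])["name"]

providers = [
    {"name": "cloud-a", "price_per_gpu_hr": 2.10, "gpus_free": 64, "regions": {"us", "eu"}},
    {"name": "cloud-b", "price_per_gpu_hr": 1.45, "gpus_free": 8,  "regions": {"us"}},
    {"name": "cloud-c", "price_per_gpu_hr": 1.80, "gpus_free": 32, "regions": {"eu"}},
]

# 4 GPUs in the US: cloud-b is cheapest and has capacity
print(place({"gpus": 4, "region": "us"}, providers))   # → cloud-b
# 16 GPUs in the EU: cloud-b is out on both counts; cloud-c wins on price
print(place({"gpus": 16, "region": "eu"}, providers))  # → cloud-c
```

Re-running this loop as prices and capacity shift is what lets the broker move workloads without human intervention.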

<h3 id="pillar-3-agility">Pillar 3: Agility</h3>

<p>The final pillar, <strong>agility</strong>, is about creating a responsive and flexible cloud ecosystem where data and workloads move freely. Reciprocal peering agreements are key to this agility. These agreements are collaborations between cloud providers that allow for free or low-cost data transfers, breaking down barriers such as egress fees and data gravity.</p>

<p>As these agreements take shape—often organically driven by hyperscale providers keen to support popular brokers—workloads can move seamlessly between clouds. This dynamic environment empowers businesses to adapt quickly to shifting costs, regulatory changes, or performance requirements without being locked into a single provider.</p>

<p>Crucially, this level of agility opens up the cloud ecosystem in ways never seen before. By tearing down silos and encouraging collaboration between providers, Sky Computing creates an interconnected landscape where innovation thrives. Smaller and more specialized neocloud providers gain a seat at the table, fostering competition and driving breakthroughs in service offerings. Enterprises can mix and match services from various providers without fear, leveraging the best features from each platform to suit their needs.</p>

<p>The result is an agile infrastructure that can pivot on demand, offering both resilience and the flexibility to innovate and disrupt rather than just iterate. This unprecedented openness not only breaks the barriers of vendor lock-in but also sparks a whole new era of creativity and efficiency across the entire cloud industry, fundamentally changing how businesses harness cloud technology.</p>

<h2 id="real-world-examples">Real-World Examples</h2>

<p>Sky Computing is not just a theoretical framework—it has practical applications that transform how businesses run complex workloads. Here’s how we’re already seeing it play out in the real world.</p>

<h3 id="aiml-workloads">AI/ML Workloads</h3>

<p>In the world of AI and machine learning, different stages of a pipeline may benefit from different cloud providers’ specialties. For example, a company could split its ML pipeline: run model training on Google Cloud, which offers TPU-optimized instances for deep learning; perform inference on AWS, utilizing their Inferentia chips for lower latency; and handle data preprocessing on Azure, benefiting from its robust data services. By strategically placing each stage where it performs best, organizations gain speed, cost savings, and the ability to comply with regional data regulations. The concepts of a unified platform and intelligent routing introduced earlier come into play here, as Sky Computing brokers manage this orchestration—dynamically routing workloads to the optimal environment for each task without manual intervention.</p>
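<p>One way to express such a split is declaratively: each stage states a preference, and the broker honors it when it can. The stage names and fields below are hypothetical, but they mirror the pipeline just described—training where the accelerators are best, inference where latency is lowest, preprocessing where the data services live.</p>

```python
# Hypothetical per-stage placement preferences for an ML pipeline.
pipeline = [
    {"stage": "preprocess", "prefer": "azure", "needs": {"cpu": 32}},
    {"stage": "train",      "prefer": "gcp",   "needs": {"tpu": 8}},
    {"stage": "infer",      "prefer": "aws",   "needs": {"inferentia": 2}},
]

def plan(pipeline, available):
    """Honor each stage's preferred provider when it is available;
    otherwise hand the decision back to the broker."""
    return {s["stage"]: (s["prefer"] if s["prefer"] in available else "broker-choice")
            for s in pipeline}

print(plan(pipeline, available={"aws", "gcp", "azure"}))
print(plan(pipeline, available={"aws", "azure"}))  # gcp down: training falls back
```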

<h3 id="global-data-compliance">Global Data Compliance</h3>

<p>Regulatory requirements like GDPR in Europe or CCPA in California mandate strict handling of data based on location. Sky Computing can automatically route workloads and data to the correct geographical region to meet these legal requirements. This ensures compliance without sacrificing performance, as the system intelligently selects the cloud environments that best balance regulatory needs with operational efficiency.</p>
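<p>In broker terms, residency rules become just another placement constraint. The rules table below is purely illustrative—it is not a statement of what GDPR or CCPA actually require—but it shows how a broker can filter candidate regions before any cost optimization happens.</p>

```python
# Minimal sketch of data-residency filtering in a broker.
RESIDENCY = {
    "eu-customer-data": {"eu-west-1", "eu-central-1"},  # GDPR-style rule
    "ca-customer-data": {"us-west-1"},                  # CCPA-style rule
}

def allowed_regions(dataset, candidate_regions):
    """Intersect the broker's candidate regions with the dataset's rule;
    datasets with no rule are unrestricted."""
    rule = RESIDENCY.get(dataset)
    if rule is None:
        return set(candidate_regions)
    return set(candidate_regions) & rule

print(allowed_regions("eu-customer-data", ["us-east-1", "eu-west-1"]))  # {'eu-west-1'}
print(allowed_regions("public-logs", ["us-east-1", "eu-west-1"]))       # unrestricted
```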

<h3 id="enterprise-batch-jobs">Enterprise Batch Jobs</h3>

<p>Batch processing tasks, such as large-scale data analysis or report generation, often require significant computational resources and time. Sky Computing’s cost-aware brokers analyze available resources across multiple clouds and choose the most cost-effective option to run these jobs. By doing so, enterprises can save millions on large-scale batch workloads, as the brokers not only find the cheapest compute options but also optimize job scheduling to take advantage of low-cost, high-performance opportunities.</p>

<h2 id="challenges-to-sky-computing-adoption">Challenges to Sky Computing Adoption</h2>

<p>Adopting Sky Computing won’t be all rainbows and unicorns. It comes with its own set of challenges that need to be navigated carefully.</p>

<h3 id="standardization">Standardization</h3>

<p>Achieving universal standards across all cloud platforms is unlikely due to competitive interests and proprietary technologies. However, progress can still be made by leveraging partial compatibility sets—common ground in widely adopted tools like Kubernetes, Ray, and S3 APIs. These standards don’t cover every scenario but provide a practical bridge, allowing Sky Computing to move forward without waiting for complete industry-wide uniformity.</p>

<h3 id="economic-resistance">Economic Resistance</h3>

<p>Large cloud providers may resist reciprocal peering agreements, as sharing data freely between platforms can conflict with their business models. While this resistance exists, smaller cloud providers and innovative startups have strong incentives to embrace Sky Computing principles. Their agility and desire to compete with larger players drive them to support the ecosystem, gradually encouraging wider adoption and putting pressure on the bigger providers to reconsider their stance.</p>

<h3 id="infrastructure-inertia">Infrastructure Inertia</h3>

<p>Organizations have significant investments in their existing cloud infrastructure - not just in terms of cost, but also in expertise, tooling, and operational processes. Many firms are understandably hesitant to make dramatic changes to their infrastructure stack, especially when it comes to adopting new paradigms like Sky Computing that don’t yet have widespread adoption. This resistance to change is compounded by the fact that existing cloud deployments often work “well enough,” even if they’re not optimal in terms of cost or performance.</p>

<p>The overhead of retraining staff, updating deployment pipelines, and potentially refactoring applications to work with Sky Computing’s abstraction layer can seem daunting to many organizations. Additionally, there are perceived risks around reliability and support when moving away from established cloud providers’ native services. These factors create significant inertia that must be overcome for widespread Sky Computing adoption.</p>

<h3 id="the-challenge-of-legitimacy">The Challenge of Legitimacy</h3>

<p>The concept of Sky Computing faces some uphill battles in establishing legitimacy, particularly in light of recent events. <a href="https://en.wikipedia.org/wiki/Sky_computing">A visit to Wikipedia’s Sky Computing entry</a> reveals a troubling warning banner questioning the reliability of sources and noting a lack of academic citations. This stems from an incident where a commercial entity attempted to shape the narrative around Sky Computing through Wikipedia editing, leading to their eventual ban from the platform.</p>

<p>This highlights a broader challenge: as emerging technologies gain traction, there’s often a rush by commercial entities to stake their claim as thought leaders or pioneers, sometimes through questionable means. This can inadvertently damage the credibility of legitimate technological advances. Sky Computing, as an architectural evolution of cloud computing backed by academic research and technical merit, deserves to be evaluated on its technical foundations rather than through marketing efforts.</p>

<p>The incident serves as a reminder that transformative technologies often face skepticism when commercial interests precede widespread technical validation. However, the fundamental value proposition of Sky Computing—providing a unified interface across cloud providers while optimizing for cost, performance, and compliance—stands independent of any single company’s implementation.</p>

<h2 id="final-thoughts">Final Thoughts</h2>

<p>I genuinely believe that we’re at a turning point in cloud infrastructure. I don’t think Sky Computing is just another buzzword—it’s a practical fix for real problems. It brings together different cloud services into one smooth system, making life easier for businesses and SREs who need reliable, flexible, and efficient operations. At the end of the day, this makes a sizeable dent in balance sheets, and that’s what matters.</p>

<p>As more Sky Computing solutions emerge, tech leaders worldwide will continue to notice. They’ll see the benefits and quickly move their workloads to these smarter, more open, and more cost-effective cloud setups.</p>

<p>The future of the cloud is here, knocking at our door. It’s an exciting moment to rethink how we build and manage systems that stand up to real-world demands—more resilient, more adaptable, and ready for what’s next.</p>

]]></content>
    <summary type="html">Discover why Sky Computing is the logical evolution of cloud infrastructure, delivering the freedom traditional cloud providers promised but never delivered.</summary>
  </entry>
  <entry>
    <id>tag:superorbital.io,2025-02-15:/blog/gpud-in-node-problem-detector/</id>
    <title type="html">Integrating GPUd with Node Problem Detector</title>
    <published>2025-02-15T00:00:00Z</published>
    <updated>2025-02-15T00:00:00Z</updated>
    <link rel="alternate" href="https://superorbital.io/blog/gpud-in-node-problem-detector/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+SuperOrbital" type="text/html"/>
    <content type="html"><![CDATA[<p>This post continues on from our previous article on <a href="https://superorbital.io/blog/node-problem-detector-custom-plugins-primer/">building custom plugins for Node Problem Detector</a>.</p>

<p>Managing GPU-enabled Kubernetes clusters presents unique challenges that require closely monitoring GPU health and responding to hardware issues. While Kubernetes excels at container orchestration, it needs to be extended to monitor specialized hardware like GPUs. The combination of Node Problem Detector (NPD) and GPUd could create a solution for automated GPU health monitoring through Kubernetes’ native health reporting mechanisms.</p>

<h2 id="introduction">Introduction</h2>

<p><a href="https://github.com/kubernetes/node-problem-detector">Node Problem Detector (NPD)</a> is a Kubernetes monitoring agent that detects system-level issues and reports them as node conditions and events. The conditions and events are exposed through the Kubernetes API, and visible with <code>kubectl describe node</code>. NPD comes with built-in problem monitors, and <a href="https://superorbital.io/blog/node-problem-detector-custom-plugins-primer/">supports custom plugins</a> for extending its capabilities.</p>

<p><a href="https://github.com/leptonai/gpud">GPUd</a> is a system monitoring daemon specializing in GPU metrics. It has a component-based architecture that allows it to monitor GPU-specific metrics and related system components affecting GPU clusters. Its output includes states and events that indicate the health of each component.</p>

<p>The monitoring models between NPD and GPUd appear to be compatible—states mapping to node conditions and events aligning with Kubernetes events—and should enable effective GPU health monitoring in Kubernetes.</p>

<h5 id="in-collaboration-with-sailplane">In Collaboration with Sailplane</h5>

<p>This blog post was produced in collaboration with <a href="https://sailplane.ai">Sailplane</a>. I paired with their AI agent to develop the proof-of-concept plugin, create and manage a test environment, deploy GPUd and NPD within it, and author this post.</p>

<h2 id="the-gpud-architecture-and-api">The GPUd Architecture and API</h2>

<p>GPUd’s NVIDIA-specific GPU monitoring components include: GPU status and performance metrics, temperature monitoring, driver and CUDA toolkit health, GPU memory usage, and ECC errors.</p>

<p>GPUd also has components for monitoring non-GPU general system health, like systemd services, memory and CPU usage, kernel module status, and kernel dmesg logs.</p>

<p>GPUd caches monitoring data in a SQLite database and exposes it through a RESTful API. The API endpoints include:</p>

<ul>
  <li>
<code>/v1/components</code> lists the components;</li>
  <li>
<code>/v1/states</code> shows the instantaneous current health of a component;</li>
  <li>
<code>/v1/events</code> shows a timestamped series of notable events within a component;</li>
  <li>
<code>/v1/metrics</code> gathers measurements from a component, similar to Prometheus metrics; and</li>
  <li>
<code>/v1/info</code> gathers all component information in one response.</li>
</ul>

<p>Each accepts a <code>components</code> query parameter to filter the results to one or a set of components, and the events and metrics endpoints accept <code>startTime</code> and <code>endTime</code> to query a time range.</p>

<p>Here’s an example of a state response from the systemd component (<code>/v1/states?components=systemd</code>):</p>

<pre><code class="language-json">[{
  "component": "systemd",
  "states": [{
    "name": "unit",
    "healthy": true,
    "reason": "name: kubelet active: true uptime: 1 day ago",
    "extra_info": {
      "active": "true",
      "name": "kubelet",
      "uptime_humanized": "1 day ago",
      "uptime_seconds": "90344"
    }
  }]
}]
</code></pre>

<p>The <code>/v1/events</code> API produces similar output, but requires the <code>startTime</code> parameter to produce any output at all. <code>endTime</code> is also an available parameter, and both default to the current time (i.e., the default time range has zero duration, so no events would be selected).</p>

<p>Here’s an example of an event response from the memory component (<code>/v1/events?components=memory&amp;startTime=[...]</code>):</p>

<pre><code class="language-json">[{
  "component": "memory",
  "startTime": "2025-02-11T20:59:30Z",
  "endTime": "2025-02-12T00:59:30.450503144Z",
  "events": [{
    "time": "2025-02-11T21:09:19Z",
    "name": "memory_oom_cgroup",
    "type": "Warning",
    "message": "oom cgroup detected",
    "extra_info": {
      "log_line": "Memory cgroup out of memory: Killed process 339038 (python) total-vm:92920kB, anon-rss:64672kB, file-rss:4608kB, shmem-rss:0kB, UID:0 pgtables:184kB oom_score_adj:992"
    }
  }]
}]
</code></pre>

<h2 id="npd-custom-plugin-implementation">NPD Custom Plugin Implementation</h2>

<p>We can use NPD’s <a href="https://superorbital.io/blog/node-problem-detector-custom-plugins-primer/">custom plugin system</a> to bridge GPUd’s monitoring capabilities to Kubernetes’ node health model. NPD interprets the exit status of a plugin script as the detection of a problem, and if a problem is detected, uses any message on stdout as the condition or event message. The plugin script can query GPUd’s API and process the response with <code>jq</code> to filter for the relevant state or events. It can then print the state’s reason or the event’s message, and set the process exit status accordingly.</p>
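
<p>In miniature (using the systemd state response shown earlier, inlined as a string, and assuming <code>jq</code> is available), the state check looks like this:</p>

<pre><code class="language-shell"># Sketch: detect a problem from a GPUd state response.
# The response is inlined here; the real plugin fetches it with curl.
response='[{"component":"systemd","states":[{"name":"unit","healthy":true,"reason":"name: kubelet active: true uptime: 1 day ago"}]}]'

# Select unhealthy states and take the first reason, if any.
unhealthy=$(echo "$response" | jq -r '[.[].states[] | select(.healthy == false) | .reason][0]')

if [ "$unhealthy" = "null" ]; then
  echo "no problem detected"   # the plugin would exit 0 here
else
  echo "$unhealthy"            # the plugin would print this and exit 1
fi
</code></pre>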

<p>See the <a href="https://superorbital.io#appendix">appendix</a> below for a proof-of-concept implementation of such a script.</p>

<p>For state monitoring, the plugin can detect a problem when a state is reported with <code>"healthy": false</code>.
For events, the plugin can detect a problem whenever a matching event is emitted.
Here’s an example configuration of rules for monitoring the state of the kubelet service, and event monitoring for OOM kills:</p>

<pre><code class="language-json">{
  ...
  "rules": [
    {
      "type": "permanent",
      "condition": "GPUdKubeletHealthy",
      "reason": "KubeletRunning",
      "path": "/usr/local/bin/gpud-npd-plugin.sh",
      "args": [
        "--mode", "states",
        "--component", "systemd",
        "--state-name", "unit",
        "--match-extra-info", ".name == \"kubelet\""
      ]
    },
    {
      "type": "temporary",
      "reason": "OOMKilling",
      "path": "/usr/local/bin/gpud-npd-plugin.sh",
      "args": [
        "--mode", "events",
        "--component", "memory",
        "--event-name", "memory_oom_cgroup"
      ]
    },
    ...
  ]
}
</code></pre>

<h2 id="limitations">Limitations</h2>

<p>Some limitations arise from the way NPD queries plugins for problem detection.</p>

<h3 id="event-handling">Event Handling</h3>

<p>The plugin can only emit one event per polling interval. This will miss sequences of events that occur faster than the polling interval. While shrinking the polling interval may help, that approach cannot guarantee events will not be missed. Instead, the script should output an indicator that multiple events occurred. Additionally, the event rules should be split up with finely sliced (more specific) queries to match the smallest number of events for a <code>"reason"</code>.</p>

<p>However, splitting these into finely sliced queries explodes the configuration above and will compound the performance overhead, as explained next.</p>

<h3 id="performance-overhead">Performance Overhead</h3>

<p>The NPD custom plugin architecture requires polling GPUd’s API separately for each configured component. Each poll of each rule must fork/exec a process, and each script execution will launch several other processes. The most expensive step will be contacting the GPUd API, with the connection overhead (TLS) that entails. Depending on the component, GPUd will read in-memory caches or run a SQLite query to collect the requested information. With many components, this can add up to significant overhead.</p>

<p>A potential mitigation (not implemented for this proof-of-concept) is to run another per-node process (e.g., another daemonset, or added to the NPD daemonset) that periodically polls GPUd for the information from all components in one request, and then splits it out into individual files that individual rules can read cheaply.</p>
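
<p>A sketch of that splitting step (not part of this proof-of-concept): it assumes <code>jq</code>, and assumes each element of the response carries a <code>component</code> field, as the states and events responses above do. A real poller would fetch the JSON from the API with curl rather than inlining it.</p>

<pre><code class="language-shell"># Illustrative: split one combined GPUd response into per-component files.
CACHE_DIR=$(mktemp -d)

# Stand-in for the output of one poll of all components.
info='[{"component":"systemd","states":[]},{"component":"memory","events":[]}]'

echo "$info" | jq -c '.[]' | while read -r item; do
  name=$(echo "$item" | jq -r '.component')
  # tee writes the per-component cache file (and echoes it for visibility)
  echo "$item" | tee "$CACHE_DIR/$name.json"
done

ls "$CACHE_DIR"   # one file per component
</code></pre>

<p>Each NPD rule can then read its component’s file locally instead of hitting the API.</p>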

<h3 id="gpud-code-quality">GPUd Code Quality</h3>

<p>GPUd is a <a href="https://blog.lepton.ai/introducing-gpud-the-missing-gpu-management-for-ai-0f0d026337e3">young project</a> published by a fast-moving startup. As such, it shows signs of immaturity that we can hope will improve over time.</p>

<p>There is not a lot of <a href="https://www.gpud.ai/docs">documentation</a> for the project. The <a href="https://www.gpud.ai/docs/components">list of components</a> has one-sentence descriptions that do little more than restate the name, followed by links to the GoDocs for each component, which add no further information.
The <a href="https://www.gpud.ai/api/v1/docs">API documentation</a> lists the API endpoints, but it shows <code>component</code> as a query parameter instead of the correct parameter, <code>components</code>.
Meanwhile, <code>startTime</code> and <code>endTime</code> are not documented, yet they are critical to getting any information from the <code>/v1/events</code> endpoint, as noted earlier. These shortcomings leave you to either probe the API directly to figure out what information is available, or read the code.</p>

<p>Some components appear to have tunable thresholds. For example, the <code>fd</code> component <a href="https://github.com/leptonai/gpud/blob/804a546833f103a884f6dfc191ba14d8492cd5ba/components/fd/config.go#L15">has a Config struct</a> with a field <code>threshold_allocated_file_handles</code>, which also appears in the component’s output.
If you’re looking to change this threshold, you are out of luck.
You might, as I did, look at the code to see how to set this configuration and have some hope. There is a global configuration object (which includes the <code>fd</code> component’s <code>Config</code> struct), and there is a function that <a href="https://github.com/leptonai/gpud/blob/804a546833f103a884f6dfc191ba14d8492cd5ba/pkg/config/default.go#L353">reads from a YAML file in a set of fixed locations</a>.
But, at the time of writing, that function is dead code, never referenced anywhere else. GPUd always launches with its built-in, automatic configuration, modifiable only in limited ways via command-line arguments.</p>

<h2 id="conclusion">Conclusion</h2>

<p>The integration of GPUd with Node Problem Detector demonstrates how Kubernetes’ node health monitoring can be extended to cover specialized hardware like GPUs. By mapping GPUd’s monitoring capabilities to Kubernetes’ native health reporting mechanisms through NPD’s plugin system, clusters gain visibility into GPU health and can potentially automate responses to GPU-related issues. While the plugin architecture has limitations around event handling and performance, it provides a starting point for exploring automated GPU health monitoring in Kubernetes environments and should work fine for a limited number of extracted event and condition types.</p>

<h2 id="appendix">Appendix</h2>

<p>This is the proof-of-concept plugin script written for this blog post. <code>curl</code> and <code>jq</code> must be installed in the node-problem-detector image.</p>

<pre><code class="language-shell">#!/bin/sh

# Exit code definitions
EXIT_SUCCESS=0           # No problem detected
EXIT_PROBLEM_DETECTED=1  # Problem detected in GPUd events/states
EXIT_SYSTEM_ERROR=2      # System errors like API failures or invalid arguments

die() {
  echo "$1"
  exit $EXIT_SYSTEM_ERROR
}

MODE=""
COMPONENT=""
EVENT_NAME=""
STATE_NAME=""
MATCH_EXTRA_INFO=""

while [ $# -gt 0 ]; do
  case "$1" in
    --mode)             MODE="$2";             shift 2 ;;
    --component)        COMPONENT="$2";        shift 2 ;;
    --event-name)       EVENT_NAME="$2";       shift 2 ;;
    --state-name)       STATE_NAME="$2";       shift 2 ;;
    --match-extra-info) MATCH_EXTRA_INFO="$2"; shift 2 ;;
    *) die "Unknown argument: $1" ;;
  esac
done

query_events() {
  # Query GPUd events for specific component and filter by event name and message pattern
  # startTime is needed to get any events. endTime defaults to now. 30sec matches the polling interval.
  response=$(curl -sk "https://$NODE_NAME:15132/v1/events?components=${COMPONENT}&amp;startTime=$(date -d "30sec ago" +%s)")
  if [ $? -ne 0 ]; then
    die "Failed to query GPUd events API"
  fi

  event_count=$(echo "$response" | jq --arg name "$EVENT_NAME" \
    '[.[].events[] | select(.name == $name)] | length')
  event_msg=$(echo "$response" | jq -r --arg name "$EVENT_NAME" \
    '[.[].events[] | select(.name == $name) | .message][0]')

  if [ "$event_count" -gt 0 ]; then
    # printf instead of echo -n: the -n flag is not portable under /bin/sh
    printf '%s' "$event_msg"
    if [ "$event_count" -gt 1 ]; then
      printf ' (%s events missed)' "$((event_count - 1))"
    fi
    exit $EXIT_PROBLEM_DETECTED
  fi

  return $EXIT_SUCCESS
}

query_states() {
  # Query GPUd states for specific component and filter by state name and extra info
  response=$(curl -sk "https://$NODE_NAME:15132/v1/states?components=${COMPONENT}")
  if [ $? -ne 0 ]; then
    die "Failed to query GPUd states API"
  fi

  state_reason=$(echo "$response" | jq -r --arg name "$STATE_NAME" \
    "[.[].states[] | select(.name == \$name and (.extra_info | $MATCH_EXTRA_INFO) and .healthy == false) | .reason][0]")

  if [ -n "$state_reason" ] &amp;&amp; [ "$state_reason" != "null" ]; then
    printf '%s' "$state_reason"
    exit $EXIT_PROBLEM_DETECTED
  fi

  return $EXIT_SUCCESS
}

case "$MODE" in
  "events") query_events ;;
  "states") query_states ;;
  *) die "Invalid mode: $MODE" ;;
esac

exit $EXIT_SUCCESS
</code></pre>

]]></content>
    <summary type="html">Extend Kubernetes' native health monitoring to GPUs by integrating GPUd with Node Problem Detector. </summary>
  </entry>
  <entry>
    <id>tag:superorbital.io,2025-01-24:/blog/node-problem-detector-custom-plugins-primer/</id>
    <title type="html">Node Problem Detector Custom Plugins</title>
    <published>2025-01-24T00:00:00Z</published>
    <updated>2025-01-24T00:00:00Z</updated>
    <link rel="alternate" href="https://superorbital.io/blog/node-problem-detector-custom-plugins-primer/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+SuperOrbital" type="text/html"/>
    <content type="html"><![CDATA[
<p>I was recently asked to prototype a custom plugin for node-problem-detector. I found that the documentation for the plugin interface is pretty inscrutable. The <a href="https://kubernetes.io/docs/tasks/debug/debug-cluster/monitor-node-health/#adding-custom-plugin-monitors">official documentation</a> links to the <a href="https://docs.google.com/document/d/1jK_5YloSYtboj-DtfjmYKxfNnUxCAvohLnsH5aGCAYQ/edit?tab=t.0#heading=h.tplmhav3arf5">plugin interface proposal document</a> on Google Docs which doesn’t make a good guide to writing your own plugin. So here’s a primer I developed based on my brief experience.</p>

<h2 id="quick-introduction-to-node-problem-detector">Quick Introduction to Node Problem Detector</h2>

<p>Node Problem Detector (NPD) runs as a DaemonSet, and sets <a href="https://superorbital.io/blog/status-and-conditions/">conditions</a> in the status field of the node that it is running on. It can also output an Event when a problem occurs. The conditions and events are visible when running <code>kubectl describe node &lt;node-name&gt;</code>.</p>

<p>NPD has several built-in problems it will detect. But it also has a way to take advantage of its infrastructure to add your own conditions and events.</p>

<h2 id="writing-custom-plugins">Writing Custom Plugins</h2>

<p>The “script interface” is based on exit status and a message on stdout. An exit status of <code>0</code> means no problem was detected; <code>1</code> means a problem was detected; any other non-zero status means the result could not be determined.</p>

<p>The path to this script is added to a JSON configuration file, and the path to that configuration file is passed to the node-problem-detector binary via a command-line argument. The <a href="https://github.com/deliveryhero/helm-charts/tree/master/stable/node-problem-detector">helm chart</a> abstracts some of this plumbing away so that you only need to author the script, the configuration, and probably modify the node-problem-detector image so that it contains the tools necessary for your script.</p>

<p>You might want to start by adding a configuration like this to your helm values:</p>

<pre><code class="language-json">      {
        "plugin": "custom",
        "pluginConfig": {
          "invoke_interval": "30s",
          "timeout": "5s",
          "max_output_length": 80,
          "concurrency": 3
        },
        "source": "my-custom-plugin-monitor",
        "metricsReporting": true,

        "conditions": [
          {
            "type": "MyProblemCondition",
            "reason": "NoProblem",
            "message": "Everything is normal"
          }
        ],
        "rules": [
          {
            "type": "permanent",
            "condition": "MyProblemCondition",
            "reason": "ProblemCause",
            "path": "./custom-config/plugin-my_problem.sh"
          }
        ]
      }
</code></pre>

<p>But what do these configuration keys even mean? What are the “conditions” and “rules”?</p>

<p>The best explanation I’ve seen for custom plugin configuration is in the source for node-problem-detector at <a href="https://github.com/kubernetes/node-problem-detector/blob/master/docs/custom_plugin_monitor.md">docs/custom_plugin_monitor.md</a>. The struct definitions for <a href="https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/pkg/types/types.go#L56">Condition</a> and <a href="https://github.com/kubernetes/node-problem-detector/blob/v0.8.12/pkg/custompluginmonitor/types/types.go#L40">CustomRule</a> serve as additional references.</p>

<p>It’s not the most obvious configuration surface. Instead of starting from configuration, it’s simpler to think about this from the bottom-up, starting from the script, how it gets executed, and how the result is turned into a status condition or an event.</p>

<h2 id="how-a-plugin-script-invocation-is-turned-into-a-condition-or-event">How a plugin script invocation is turned into a Condition or Event</h2>

<p>I write a script that can return an exit status of <code>0</code> or <code>1</code>.  For example, below is an abridged version of <a href="https://github.com/kubernetes/node-problem-detector/blob/master/config/plugin/check_ntp.sh">a sample script from the NPD repository</a> that checks if a systemd service (NTP) is running.</p>

<pre><code class="language-shell"># Return success if service active (i.e. running)
if systemctl -q is-active ntp.service; then
  echo "NTP is running"
  exit 0
else
  echo "NTP is not running"
  exit 1
fi
</code></pre>

<p>I put this script as the <code>"path"</code> of a <code>"rules"</code> entry with <code>"type": "temporary"</code>.</p>

<ul>
  <li>Since this is a <code>"temporary"</code> rule, NPD may output an Event.</li>
  <li>If the script returns <code>0</code>, then do nothing.</li>
  <li>If the script returns <code>1</code>, then output an Event with <code>"reason"</code> from this <code>"rule"</code>, and <code>"message"</code> from stdout.</li>
</ul>

<p>Alternatively, I could add this script to the configuration as the <code>"path"</code> of a <code>"rules"</code> entry with <code>"type": "permanent"</code>.</p>

<ul>
  <li>Since this is a <code>"permanent"</code> rule, NPD will update an entry in <code>status.conditions</code> of the node. Which Condition entry? The one with <code>"type"</code> equal to this rule’s <code>"condition"</code> field. (Or if none exists, then it will create it, of course.)</li>
  <li>If the script returns <code>0</code>, then NPD will set the Condition to its default state. The default state comes from the entry in the <code>"conditions"</code> section whose <code>"type"</code> matches this rule’s <code>"condition"</code>. NPD will update the Condition using both the <code>"reason"</code> and <code>"message"</code> from the <code>"conditions"</code> entry.</li>
  <li>If it returns <code>1</code>, then NPD will set the Condition to the <code>"reason"</code> from this rule and the <code>"message"</code> to the stdout produced by the script.</li>
</ul>

<h2 id="recipes">Recipes</h2>

<p>We can distill this further into three recipes.</p>

<h3 id="i-want-an-event-to-be-emitted-when-my-script-returns-non-zero-status">I want an Event to be emitted when my script returns non-zero status.</h3>

<ul>
  <li>Don’t add anything to <code>"conditions"</code>.</li>
  <li>Add an entry in <code>"rules"</code> with <code>"type": "temporary"</code>, and <code>"path"</code> with the path to your script.</li>
  <li>Set the rule’s <code>"reason"</code> to what you want emitted in the event.</li>
</ul>

<h3 id="i-want-a-condition-to-be-set-according-to-how-my-script-returns">I want a Condition to be set according to how my script returns</h3>

<ul>
  <li>Add an entry in <code>"conditions"</code> with <code>"type"</code>, and an entry in <code>"rules"</code> with <code>"condition"</code>, set to the same value: The name of the Condition you want to output.</li>
  <li>The rule should have <code>"type": "permanent"</code>, and <code>"path"</code> with the path to your script.</li>
  <li>The condition should have a <code>"reason"</code> and <code>"message"</code> for the passing case.</li>
  <li>The rule should have a <code>"reason"</code> for the failing case. The detailed <code>"message"</code> will come from the script’s stdout.</li>
</ul>

<h3 id="i-want-a-condition-and-an-event-when-my-script-returns-non-zero">I want a Condition and an Event when my script returns non-zero.</h3>

<ul>
  <li>Combine the above…</li>
  <li>A <code>"conditions"</code> entry for the default state of the Condition.</li>
  <li>A <code>"permanent"</code> rule for the erroring state of the Condition.</li>
  <li>A <code>"temporary"</code> rule to emit an event.</li>
  <li>That is, one condition and two rules with the same <code>"path"</code>.</li>
</ul>
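
<p>Putting the third recipe together, using the names from the earlier configuration (the <code>"plugin"</code>, <code>"pluginConfig"</code>, and <code>"source"</code> keys are unchanged and elided here):</p>

<pre><code class="language-json">{
  ...
  "conditions": [
    {
      "type": "MyProblemCondition",
      "reason": "NoProblem",
      "message": "Everything is normal"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "MyProblemCondition",
      "reason": "ProblemCause",
      "path": "./custom-config/plugin-my_problem.sh"
    },
    {
      "type": "temporary",
      "reason": "ProblemCause",
      "path": "./custom-config/plugin-my_problem.sh"
    }
  ]
}
</code></pre>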

<h2 id="deployment">Deployment</h2>

<p>You now have a script and custom plugin configuration to tell node-problem-detector to run the script and update a Condition or output an Event. A little more work is needed to bundle it all together to deploy. Using the <a href="https://github.com/deliveryhero/helm-charts/tree/master/stable/node-problem-detector">helm chart</a>, the following values should be set to enable the custom plugin:</p>

<ul>
  <li>
<code>image</code>: If a custom image was needed to extend the node-problem-detector image with additional binaries, then specify it here.</li>
  <li>
<code>settings.custom_plugin_monitors</code>: This is a list of file paths within the container to the JSON configuration file for custom plugins. If using the chart’s <code>custom_monitor_definitions</code> to populate a ConfigMap, then these paths should start with <code>/custom-config/</code>.</li>
  <li>
<code>settings.custom_monitor_definitions</code>: This defines the contents of a ConfigMap mounted at <code>/custom-config/</code>. Add a key under it with a filename like <code>"my-custom-monitor.json"</code>. Your script can also go in a key here.</li>
</ul>
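
<p>As a sketch (the image name and file names are illustrative; the JSON and script contents come from the earlier sections), the values might look like:</p>

<pre><code class="language-yaml">image:
  repository: registry.example.com/node-problem-detector   # only if a custom image is needed
  tag: custom

settings:
  custom_plugin_monitors:
    - /custom-config/my-custom-monitor.json
  custom_monitor_definitions:
    my-custom-monitor.json: |
      { "plugin": "custom", ... }
    plugin-my_problem.sh: |
      #!/bin/sh
      ...
</code></pre>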

<p>That’s it! Deploy the helm chart with these values, and your custom conditions and events should begin appearing on the node objects when you run <code>kubectl get nodes -o yaml</code> or <code>kubectl describe</code> a node.</p>

<h2 id="demo">Demo</h2>

<p>As part of my exploration, I made a <a href="https://github.com/superorbital/node-problem-detector-custom-plugin-demo">demo repository</a>. The exploration was to see if NPD could be used to detect that the node had a network connection issue. The conclusion of that exploration was that NPD was not a good candidate for detecting network connectivity problems, since it would be unable to write the status back to the api-server over the very network it was detecting a problem within. Nevertheless, the repository shows a complete custom plugin deployment that can be deployed in a <code>kind</code> cluster.</p>
]]></content>
    <summary type="html">A primer on configuring custom plugins for Node Problem Detector</summary>
  </entry>
  <entry>
    <id>tag:superorbital.io,2025-01-08:/blog/make-k8s-apps-istio-aware/</id>
    <title type="html">Make Your K8s Apps Istio-Retry Aware!</title>
    <published>2025-01-08T00:00:00Z</published>
    <updated>2025-01-08T00:00:00Z</updated>
    <link rel="alternate" href="https://superorbital.io/blog/make-k8s-apps-istio-aware/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+SuperOrbital" type="text/html"/>
    <content type="html"><![CDATA[<details>
  <summary><em>Table of Contents</em></summary>

  <p><a href="https://superorbital.io#overview">Overview</a></p>

  <p><a href="https://superorbital.io#istio-changes-communication">Istio Changes Communication</a></p>

  <p><a href="https://superorbital.io#mitigate-the-additional-retries">Mitigate The Additional Retries</a></p>

  <p><a href="https://superorbital.io#get-insight-into-istios-retries">Get Insight Into Istio’s Retries</a></p>

  <p><a href="https://superorbital.io#summary">Summary</a></p>

</details>
<p><br></p>

<h2 id="overview">Overview</h2>

<p>The addition of Istio to Kubernetes (K8s) can remove a lot of the burden of retry and timeout logic from our microservice code. BUT, if our code isn’t aware of what Istio is doing, it can undo these benefits or worse, it can introduce new problems. In this article, we are going to explore how Istio changes API requests, and how to code our requests to respect those changes and maximize their benefits.</p>

<h2 id="istio-changes-communication">Istio Changes Communication</h2>

<p>When microservices on K8s communicate with each other, they must be tolerant of transient issues like pod terminations, network glitches, and request overloads. This means that their retry logic needs to be robust.</p>

<p>On the flip side, when an upstream microservice is floundering (perhaps it’s overloaded with requests, or maybe one of its upstream services is in trouble), we don’t want our retries to be too aggressive, as that could further overload the upstream microservice and actually prevent its recovery.</p>

<blockquote>
  <p><strong>NOTE</strong>: In network applications, “upstream” refers to the direction data flows when a request is made from a client to a server. When your microservice makes a request to another service, that service is said to be “upstream.”</p>
</blockquote>

<p>So, let’s say you’re part of an application development (app dev) team that has achieved this delicate balance for all its microservices by implementing some sophisticated retry logic in its request code. To help other app dev teams quickly achieve similar functionality, the K8s platform team has decided to install Istio in all the K8s clusters, utilizing Istio’s VirtualServices to provide similar retry logic for all microservices.</p>

<p>What does that mean for your team? The addition of Istio-level retry logic presents two challenges. First, the additional retries could push your microservices out of their balanced retry stance and into an overly aggressive stance. Secondly, since Istio is adding retries outside of your code, your code will need some way to get insight into what’s happening with those retries so that it can react appropriately. Let’s start with the first challenge…</p>

<h2 id="mitigate-the-additional-retries">Mitigate The Additional Retries</h2>

<p>To add retries to microservices, a K8s platform team would typically attach an Istio <code>VirtualService</code> resource to each one, similar to this:</p>

<pre><code class="language-yaml">apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: example-api
spec:
  hosts:
    ...
  http:
  - route:
    - destination:
        ...
    retries:
      retryOn: 5xx
      attempts: 3
      perTryTimeout: 1s
</code></pre>

<p>And here’s the really interesting part–when your microservice requests an upstream microservice, that will trigger the retry logic in the <code>VirtualService</code> attached to the <strong><em>upstream microservice</em></strong> (not yours). So, if you don’t have the correct permissions to see their <code>VirtualService</code> in K8s, you’ll want to reach out to the app dev team or K8s platform team that owns it and ask to see a copy. Then, you’ll want to check out the retries section–you can read more about each field in this section of the <a href="https://istio.io/latest/docs/reference/config/networking/virtual-service/#HTTPRetry">Istio Docs</a>.</p>

<blockquote>
  <p><strong>NOTE:</strong> In the absence of VirtualServices, Istio still adds a <a href="https://istio.io/latest/docs/reference/config/istio.mesh.v1alpha1/#MeshConfig-default_http_retry_policy">default retry policy</a> of 2 retries on errors “connect-failure, refused-stream, unavailable, cancelled, or retriable-status-codes.” Retriable-status-codes includes 503 errors.</p>
</blockquote>

<p>Your first order of business should be to figure out the <strong>total time</strong> a request could take and then make sure the timeout in your request code is larger. For example, if you’re requesting a microservice with the above <code>VirtualService</code>, one request could result in 3 retries at 1s intervals, thus taking 4 * 1s = 4s. So, the request timeout in your code should be <em>at least</em> 4s (plus a bit of buffer). Otherwise, your timeout could cut off the request while Istio is still working on it.</p>

<p>Next, you’ll want to calculate the <strong>maximum number of requests</strong> that could result–too many too quickly could overwhelm the upstream microservice. If you’re making a request to the microservice with the above <code>VirtualService</code>, and your request code also has 3 retries, the amplification is multiplicative: your 4 attempts * the upstream <code>VirtualService</code>’s 4 attempts = 16 requests.</p>
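<p>The same back-of-the-envelope math, as a sketch:</p>

```python
# Retry amplification: every client attempt kicks off a full Istio
# retry cycle against the upstream microservice.
CLIENT_RETRIES = 3  # retries in our request code
ISTIO_RETRIES = 3   # retries.attempts in the upstream VirtualService

client_attempts = 1 + CLIENT_RETRIES  # our original request + our retries
istio_attempts = 1 + ISTIO_RETRIES    # Istio's original try + its retries

worst_case_upstream_requests = client_attempts * istio_attempts
print(worst_case_upstream_requests)  # 16
```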

<p>I’ll illustrate this scenario by making a request with 3 retries to a microservice in my K8s cluster whose <code>VirtualService</code> is set to 3 retry attempts.</p>

<p>Here is the log from my request:</p>

<pre><code>DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): test-api.test-api.svc.cluster.local:80
send: b'GET / HTTP/1.1\r\nHost: test-api.test-api.svc.cluster.local\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 503 Service Unavailable\r\n'
header: server: istio-envoy
header: date: Wed, 08 Jan 2025 18:54:24 GMT
header: content-type: text/html; charset=utf-8
header: access-control-allow-origin: *
header: access-control-allow-credentials: true
header: content-length: 0
header: x-envoy-upstream-service-time: 138
DEBUG:urllib3.connectionpool:http://test-api.test-api.svc.cluster.local:80 "GET / HTTP/1.1" 503 0

DEBUG:urllib3.util.retry:Incremented Retry for (url='/'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Retry: /
send: b'GET / HTTP/1.1\r\nHost: test-api.test-api.svc.cluster.local\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 503 Service Unavailable\r\n'
header: server: istio-envoy
header: date: Wed, 08 Jan 2025 18:54:25 GMT
header: content-type: text/html; charset=utf-8
header: access-control-allow-origin: *
header: access-control-allow-credentials: true
header: content-length: 0
header: x-envoy-upstream-service-time: 109
DEBUG:urllib3.connectionpool:http://test-api.test-api.svc.cluster.local:80 "GET / HTTP/1.1" 503 0

DEBUG:urllib3.util.retry:Incremented Retry for (url='/'): Retry(total=1, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Retry: /
send: b'GET / HTTP/1.1\r\nHost: test-api.test-api.svc.cluster.local\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 503 Service Unavailable\r\n'
header: server: istio-envoy
header: date: Wed, 08 Jan 2025 18:54:25 GMT
header: content-type: text/html; charset=utf-8
header: access-control-allow-origin: *
header: access-control-allow-credentials: true
header: content-length: 0
header: x-envoy-upstream-service-time: 120
DEBUG:urllib3.connectionpool:http://test-api.test-api.svc.cluster.local:80 "GET / HTTP/1.1" 503 0

DEBUG:urllib3.util.retry:Incremented Retry for (url='/'): Retry(total=0, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Retry: /
send: b'GET / HTTP/1.1\r\nHost: test-api.test-api.svc.cluster.local\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 503 Service Unavailable\r\n'
header: server: istio-envoy
header: date: Wed, 08 Jan 2025 18:54:25 GMT
header: content-type: text/html; charset=utf-8
header: access-control-allow-origin: *
header: access-control-allow-credentials: true
header: content-length: 0
header: x-envoy-upstream-service-time: 90
DEBUG:urllib3.connectionpool:http://test-api.test-api.svc.cluster.local:80 "GET / HTTP/1.1" 503 0
Request failed: 503 Server Error: Service Unavailable for url: http://test-api.test-api.svc.cluster.local/
</code></pre>

<p>The upstream microservice is responding (via Istio) with 503 errors, so we see 4 responses as expected. Now let’s look at the Istio logs and see <em>what else</em> is going on…</p>

<pre><code>test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:24 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:24 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:24 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:24 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:24 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
test-api-cd4bf8b77-rnw47 test 127.0.0.6 [08/Jan/2025:18:54:25 +0000] GET / HTTP/1.1 503 Host: test-api.test-api.svc.cluster.local}
</code></pre>

<p>Whoa, that’s <strong>16</strong> total requests to the upstream microservice, even though our request code only had <strong>3</strong> retries! We’d better reduce our code’s retries or find a way to make more intelligent retry decisions. This takes us to our second challenge…</p>

<h1 id="get-insight-into-istios-retries">Get Insight Into Istio’s Retries</h1>

<p>When you request an upstream microservice that is struggling, Istio will iterate through the retry logic specified by the <code>VirtualService</code>, trying to get you a good response. But what happens if it exhausts its retry logic? Or if it runs into other problems? Will you be stuck with some generic 503 or 504 error? The answer is…</p>

<p>It depends! You had to see that coming–I <em>am</em> a consultant, after all.</p>

<p>Fortunately, there is a very powerful, yet <em>little-known</em> feature of Istio that can help us with this: <code>VirtualServices</code> allow us to dynamically <a href="https://istio.io/latest/docs/reference/config/networking/virtual-service/#Headers">inject or remove headers</a> in requests and responses. In our case, we want to see Istio’s RESPONSE_FLAGS (more on that in a moment) by adding a <code>headers</code> section to the upstream <code>VirtualService</code> like this:</p>

<pre><code class="language-yaml">apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: example-api
spec:
  hosts:
    ...
  http:
  - route:
    - destination:
        ...
    retries:
      retryOn: 5xx
      attempts: 3
      perTryTimeout: 1s
    headers:
      response:
        set:
          x-envoy-response-flags: "%RESPONSE_FLAGS%"
</code></pre>

<p>Dynamically injecting headers with Istio is a very underdocumented feature that even seasoned Istio users might not be aware of, so it’s likely that you will have to ask your K8s platform team to add this <code>headers</code> section to the upstream <code>VirtualServices</code>.</p>

<p>In this example, I’ve called the header <code>x-envoy-response-flags</code>, but it could be called anything. I know the <code>x-</code> prefix is somewhat out of vogue now, but I chose <code>x-envoy-</code> to match the other headers that Istio adds to the response, like <code>x-envoy-upstream-service-time</code>.</p>

<p>Anyway, Istio can add a lot of RESPONSE_FLAGS to this header. For the complete list, go to <a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/access_log/usage">this doc page</a> and search for “RESPONSE_FLAGS.”</p>

<blockquote>
  <p>Don’t be weirded out by the fact that this is an Envoy doc page–Istio uses Envoy to manage its inter-service network traffic.</p>
</blockquote>

<p>Let’s say your upstream microservice+<code>VirtualService</code> does include this <code>x-envoy-response-flags</code> header. When Istio exhausts the retry logic in the upstream <code>VirtualService</code>, it will send back a <strong>URX</strong> flag (UpstreamRetryLimitExceeded) in the <code>x-envoy-response-flags</code> header.</p>

<p>When added to our previous example about retries, we can see this additional header in the log:</p>

<pre><code>
DEBUG:urllib3.util.retry:Incremented Retry for (url='/'): Retry(total=0, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Retry: /
send: b'GET / HTTP/1.1\r\nHost: test-api.test-api.svc.cluster.local\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 503 Service Unavailable\r\n'
header: server: istio-envoy
header: date: Wed, 08 Jan 2025 20:48:03 GMT
header: content-type: text/html; charset=utf-8
header: access-control-allow-origin: *
header: access-control-allow-credentials: true
header: content-length: 0
header: x-envoy-upstream-service-time: 89
<strong>header: x-envoy-response-flags: URX</strong>
DEBUG:urllib3.connectionpool:http://test-api.test-api.svc.cluster.local:80 "GET / HTTP/1.1" 503 0
Request failed: 503 Server Error: Service Unavailable for url: http://test-api.test-api.svc.cluster.local/
</code></pre>

<p>Now we need our code’s request logic to check for that URX flag. If it sees it, then it should stop retrying because it knows that Istio has already sufficiently retried.</p>

<p>Here is some example code in Python that will make a request and then look for this particular header in the response. Python folks typically use a <code>requests.Session</code> with a <code>urllib3.util.Retry</code> object to manage their retries. However, I haven’t found a good way to inject a check function into the <code>urllib3.util.Retry</code> object, so I’m using my own retry loop. This allows me to include my <code>should_retry()</code> function, which can look for the <code>x-envoy-response-flags</code> header along with all my other checks.</p>

<pre><code class="language-python">#!/usr/bin/env python3

import requests
import time

def should_retry(response, max_retries=0, retries_completed=0):
    if retries_completed &gt;= max_retries:
        print("Finished all ({}) retries".format(max_retries))
        return False
    # Check for any response codes that you want to be retried
    if response.status_code not in [500, 502, 503, 504]:
        print("{} is not a retry-able status code".format(response.status_code))
        return False
    # Check for any Istio Response Flag headers that should stop retries
    if response.headers.get('x-envoy-response-flags') == 'URX':
        print('UpstreamRetryLimitExceeded (URX) response flag received from Istio, stopping retries')
        return False
    return True

def simple_get_with_retry(url, retries=0, data_dict=None, headers=None, timeout_secs=5):
    retries_completed = 0
    response = requests.get(url.strip(), headers=headers, json=data_dict, timeout=timeout_secs)
    while should_retry(response, retries, retries_completed):
        time.sleep(timeout_secs)
        response = requests.get(url.strip(), headers=headers, json=data_dict, timeout=timeout_secs)
        retries_completed += 1
    return response

def main():
    response = simple_get_with_retry("http://my-microservice.local", 3)
    print("Response code: " + str(response.status_code))
    print("Response Headers: " + str(response.headers))

if __name__ == "__main__":
    main()
</code></pre>

<p>Feel free to run this example code in a debug pod or from wherever you can reach an upstream microservice–just replace <code>http://my-microservice.local</code> with the upstream URL. The point here is that you’ll need some sort of check function in your retry loop, where you can react to the <code>x-envoy-response-flags</code> header.</p>

<p>Checking for URX is a great starting point. Over time, as your microservices evolve, be sure to check <a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/access_log/usage">the doc page</a> for additional RESPONSE_FLAGS that could help make your check function more intelligent.</p>
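<p>As a sketch of what a more flag-aware check could look like, here’s a hypothetical lookup table covering a few flags from that list–URX (retry limit exceeded), UO (upstream overflow, i.e. circuit breaking), and RL (rate limited). Which flags you honor, and how, is a design decision for you and your platform team:</p>

```python
# Hypothetical mapping of a few Envoy response flags (documented on the
# Envoy access log page) to "stop retrying" decisions.
NON_RETRIABLE_FLAGS = {
    "URX": "Istio already exhausted its retries",
    "UO": "upstream circuit breaker tripped; retrying only adds load",
    "RL": "request was rate-limited; retrying immediately won't help",
}

def flag_allows_retry(response_headers):
    """Return (retry_ok, reason) based on the x-envoy-response-flags header.
    Envoy may report multiple flags, comma-separated."""
    flags = response_headers.get("x-envoy-response-flags", "")
    for flag in flags.split(","):
        flag = flag.strip()
        if flag in NON_RETRIABLE_FLAGS:
            return False, NON_RETRIABLE_FLAGS[flag]
    return True, ""

print(flag_allows_retry({"x-envoy-response-flags": "URX"}))  # retry not ok
print(flag_allows_retry({}))                                 # retry ok
```

<p>A check like this can be dropped into the <code>should_retry()</code> function from the earlier example.</p>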

<p>And while we’re talking about the evolution of our microservices, let’s also consider the evolution of our K8s platform. Once our K8s platform team has defined appropriate <code>retries</code> within <code>VirtualServices</code> for <strong><em>every</em></strong> microservice, we can shift our thinking and consider retries to be the platform’s responsibility. Ideally, we’ll stop managing retry logic within each microservice and allow the platform to own and manage all retries centrally within its <code>VirtualServices</code>. In that case, our microservice request code would still want to check the <code>x-envoy-response-flags</code> response header to use in its error handling.</p>
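<p>Under that platform-owned model, the client code could shrink to something like this sketch (hypothetical helper names; there’s no retry loop at all–the mesh owns retries, and we only surface Istio’s flags in our error handling):</p>

```python
import requests

def describe_failure(status_code, headers):
    """Build an error message that includes Istio's response flags."""
    flags = headers.get("x-envoy-response-flags", "none")
    return "Request failed with {} (Istio response flags: {})".format(
        status_code, flags)

def get_without_retries(url, timeout_secs=5):
    """Make a single request and let the mesh handle all retries."""
    response = requests.get(url, timeout=timeout_secs)
    if not response.ok:
        # Surface the mesh's view of the failure for logging/alerting
        raise RuntimeError(describe_failure(response.status_code,
                                            response.headers))
    return response
```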

<h1 id="summary">Summary</h1>

<p>When requesting an upstream microservice, you should calculate the total possible <code>retries</code> time defined in its Istio <code>VirtualService</code>, and use that to size the timeout of your request code. You should also reduce the number of retries in your request code since the <code>VirtualService</code> will already be instructing Istio to do them.</p>

<p>Additionally, your request code should check for Istio’s RESPONSE_FLAGS within response headers so that it can react intelligently to them in its retry and error-handling logic.</p>

<p>And finally, we should move to get rid of all retries from our code and rely on the K8s+Istio platform to own them centrally and do what is best. Coming up with appropriate retry and timeout logic for every upstream microservice that our code requests is tedious. When we multiply that effort by all the other teams whose code is also making requests to the same upstream microservices, it wastes a lot of time. Instead, let’s spend that time developing the features that our users really care about, and leave the network details like retries and timeouts to the platform!</p>

<p>If you found this guide helpful, you might also enjoy our live <a href="https://superorbital.io/training/istio/">Istio training workshops</a>. We spend &gt; 50% of our workshop time doing hands-on lab work, which is a really fun way to learn. Sometimes I help out as a lab coach–maybe I’ll see you there!</p>
]]></content>
    <summary type="html">Learn how to effectively code API requests to microservices in Istio-enabled Kubernetes.</summary>
  </entry>
  <entry>
    <id>tag:superorbital.io,2024-12-11:/blog/status-and-conditions/</id>
    <title type="html">Status and Conditions: Explained!</title>
    <published>2024-12-11T00:00:00Z</published>
    <updated>2024-12-11T00:00:00Z</updated>
    <link rel="alternate" href="https://superorbital.io/blog/status-and-conditions/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+SuperOrbital" type="text/html"/>
    <content type="html"><![CDATA[<details>
  <summary><em>Table of Contents</em></summary>

  <ul>
    <li><a href="https://superorbital.io#what-is-a-status-field">What is a status field?</a></li>
    <li>
<a href="https://superorbital.io#conditions">Conditions</a>
      <ul>
        <li><a href="https://superorbital.io#a-small-history-lesson">A small history lesson…</a></li>
        <li><a href="https://superorbital.io#conditions-in-kubernetes">Conditions in Kubernetes</a></li>
      </ul>
    </li>
    <li><a href="https://superorbital.io#best-practices-for-conditions-in-custom-resources">Best Practices for Conditions in Custom Resources</a></li>
    <li><a href="https://superorbital.io#takeaways">Takeaways</a></li>
  </ul>

</details>

<p>If you’ve ever used Kubernetes, you may have already introspected the many fields inside a Kubernetes object, usually by getting the object via <code>kubectl get</code> and having the <code>-o yaml</code> flag set. If so, you would have noticed the top-level <code>status</code> field, which is never set by a user. You may have also seen the command <code>kubectl wait --for=condition=Ready=true</code> <a href="https://kubernetes.io/docs/reference/kubectl/generated/kubectl_wait/#examples">being used in commands</a> to wait for resources to be in an expected state. What’s the purpose of the <code>status</code> field? Why aren’t any changes in this field persisted when we try to modify them with <code>kubectl edit</code>? And what are <code>conditions</code>? You’re definitely not the only one with these questions, so let’s find out a bit more about the status field and conditions!</p>

<h3 id="what-is-a-status-field">What is a status field?</h3>

<p>In Kubernetes, all resources are bound to an API, which has fields for the specification of the desired state of the object (the top-level field usually named <code>spec</code>), identifying information of the object (the top-level field named <code>metadata</code>), and the current state of the object (the top-level field named <code>status</code>). The <code>metadata</code> field contains data such as the name, namespace and UID that helps to uniquely identify the object (but is not relevant to the topic of this article). We can see an example of these fields below with our fake object <code>MyObject</code>:</p>

<pre><code class="language-yaml">apiVersion: resources.superorbital.com/v1
kind: MyObject
metadata:
  name: my-object
  namespace: testing
  uid: 314e7f00-2694-4f9a-bc08-73aa3104fa8b
spec:
  widgets:
  - "foo"
  - "bar"
status:
  observedWidgets:
  - "baz"
</code></pre>

<p>The <code>spec</code> field contains a resource-specific description of the configuration for that object. The information in this field is used by controllers in the cluster to perform operations such as creating and scaling containers, and any other actions required to ensure the object can be put into the desired state.</p>

<p>The <code>status</code> field provides a space for controllers to summarize the current state of the object in the system. This may include (but is not limited to) the current progress of an ongoing action, the success or failure of said action taken by the controller, whether the object is in an expected state or not, the progress towards the expected state, or any other observations made by the controller that may be relevant for the consumer of the object to know about.</p>

<p>One thing to note is that the <code>status</code> field is usually not directly editable by a user via <code>kubectl edit</code>. This is because the <code>status</code> field is only modifiable from a different subresource (<code>/status</code>) from the main object to prevent a situation where an object modification can overwrite the status field unintentionally. This also means that RBAC permissions for the status subresources are provided separately from the permissions for the main object. As an example with the <code>resources.superorbital.com/myobject</code> resource from before:</p>

<pre><code class="language-yaml">apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: role
rules:
# Permissions for the object
- apiGroups:
  - resources.superorbital.com
  resources:
  - myobjects
  verbs:
  - get
  - list
  - patch
  - update
  - watch
# Separate permissions for status fields of the object
- apiGroups:
  - resources.superorbital.com
  resources:
  - myobjects/status
  verbs:
  - get
  - patch
  - update
</code></pre>

<p>Given that the structure of the <code>status</code> field can differ between different resources, there are no required subfields within the <code>status</code> field. However, it is expected that the status field will <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#when-to-use-a-status-field">hold the values</a> of the observed current state. For many workload-type resources (such as Pods), useful information in the <code>status</code> includes the current container status, the IP address assigned to the container, and when the container was started. Additionally, some resources (such as Pods again) contain a <code>conditions</code> subfield that holds similar information to the ones in the other status subfields. But what exactly is this field?</p>

<h3 id="conditions">Conditions</h3>

<p>The <code>conditions</code> subfield is a list of Condition elements, each of which provides a standard format to store information about the state of a resource. This information is meant to complement the existing information in the <code>status</code> field and allows consumers of the object to read information about the observed state without having to know the resource-specific <code>status</code> subfields. As an example, a Pod’s <code>conditions</code> subfield contains a <code>type: Ready</code> Condition that is only set to <code>status: "True"</code> when all of the Pod’s containers are running and ready to accept traffic – but the Pod’s <code>status</code> fields also contain a <code>containerStatuses</code> field with a lot more detail about the state of each container.</p>

<p>Conditions will typically follow <a href="https://github.com/kubernetes/apimachinery/blob/release-1.23/pkg/apis/meta/v1/types.go#L1433-L1493">the standard schema</a> with the following fields:</p>

<ul>
  <li>
<code>type</code>: The type of condition (in CamelCase)</li>
  <li>
<code>status</code>: The status of the condition – either <code>"True"</code>, <code>"False"</code>, or <code>"Unknown"</code>.</li>
  <li>
<code>observedGeneration</code>: The <code>.metadata.generation</code> value of the object when this condition was observed. This field is optional.</li>
  <li>
<code>lastTransitionTime</code>: The time when a Condition transitioned from one status to another.</li>
  <li>
<code>reason</code>: An identifier (in CamelCase) that provides the reason for the last transition. This value is used for an API to consume.</li>
  <li>
<code>message</code>: A human-readable message with details about the transition.</li>
</ul>

<p>However, this is not always the case. Some Conditions, such as the <a href="https://github.com/kubernetes/kubernetes/blob/810e9e212ec5372d16b655f57b9231d8654a2179/staging/src/k8s.io/api/core/v1/types.go#L3307-L3327">PodCondition</a>, will include a <code>lastProbeTime</code>, and others will have a <code>severity</code> field, such as <a href="https://github.com/kubernetes-sigs/cluster-api/blob/85783d75851bb8ec21bd3e65f9391ee66b51fa08/api/v1beta1/condition_types.go#L55-L85">the Cluster API Condition</a>. In general, the fields mentioned are always present, and extra optional ones may be added depending on the needs of the object.</p>

<p>One misconception is that the <code>conditions</code> field represents a chronological record of updates on the object by the controller; however, that is incorrect. Even though <code>conditions</code> is a list of Conditions, it’s actually treated as if it were a map with <code>type</code> being used as the key. More Conditions get appended to this array when their <code>type</code> value differs from all the other ones. This allows a single object to report multiple Conditions at once. As an example:</p>

<pre><code class="language-yaml">status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-10-23T18:41:59Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2024-10-23T18:41:58Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-10-23T18:41:59Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-10-23T18:41:59Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-10-23T18:41:58Z"
    status: "True"
    type: PodScheduled
</code></pre>

<p>The Conditions in this Pod aren’t in any particular order but do indicate information about the Pod being scheduled, the containers starting, and eventually the containers running. If we were to watch the <code>conditions</code> field being modified in real-time, each element would start with a <code>status</code> value of <code>"False"</code> or <code>"Unknown"</code>; as time passes and the containers start to run, we would see the appropriate Condition’s <code>status</code> set to <code>"True"</code> and its <code>lastTransitionTime</code> set to the timestamp when the Condition transitioned to <code>"True"</code>.</p>
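<p>Controllers implement that “map keyed by <code>type</code>” behavior with an upsert. Here’s an illustrative Python sketch (the real helpers live in Kubernetes’ apimachinery, e.g. <code>meta.SetStatusCondition</code> in Go) that only bumps <code>lastTransitionTime</code> when the status actually changes:</p>

```python
from datetime import datetime, timezone

def set_condition(conditions, new_cond):
    """Upsert a Condition into the list, keyed by its 'type'.
    lastTransitionTime only changes when the status value changes."""
    now = datetime.now(timezone.utc).isoformat()
    for i, cond in enumerate(conditions):
        if cond["type"] == new_cond["type"]:
            if cond["status"] != new_cond["status"]:
                new_cond["lastTransitionTime"] = now
            else:  # status unchanged: keep the original transition time
                new_cond["lastTransitionTime"] = cond["lastTransitionTime"]
            conditions[i] = new_cond
            return conditions
    new_cond["lastTransitionTime"] = now
    conditions.append(new_cond)  # first time we've seen this type
    return conditions

conds = []
set_condition(conds, {"type": "Ready", "status": "False"})
set_condition(conds, {"type": "PodScheduled", "status": "True"})
set_condition(conds, {"type": "Ready", "status": "True"})  # replaces, not appends
print([c["type"] for c in conds])  # ['Ready', 'PodScheduled']
```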

<p>This is the exact logic that <code>kubectl wait</code> uses! When you run <code>kubectl wait</code> in the CLI, it retrieves the target resource by its name and reviews the <code>conditions</code> array. It looks for a condition matching the name you have queried and watches for updates to that resource until that status returns <code>"True"</code> or the timeout occurs.</p>
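<p>The core of that check can be sketched in a few lines (a hypothetical helper–kubectl’s real implementation also handles watches, JSONPath expressions, and timeouts):</p>

```python
def condition_matches(obj, cond_type, expected_status="True"):
    """Mimic the heart of `kubectl wait --for=condition=<type>=<status>`:
    find the Condition with the requested type and compare its status."""
    for cond in obj.get("status", {}).get("conditions", []):
        if cond["type"] == cond_type:
            return cond["status"] == expected_status
    return False  # condition not present (yet)

pod = {"status": {"conditions": [
    {"type": "PodScheduled", "status": "True"},
    {"type": "Ready", "status": "False"},
]}}
print(condition_matches(pod, "Ready"))         # False
print(condition_matches(pod, "PodScheduled"))  # True
```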

<h4 id="a-small-history-lesson">A small history lesson…</h4>

<p>Once upon a time, in the ancient times of the year 2015, Pods had a <code>status.phase</code> field where the state of the Pod was reflected as an enum. This made a lot of people very angry and was widely regarded as a bad move since any change to the values of the field would necessitate reinterpreting an existing enum or adding a new one, which is not backwards compatible. An issue (<a href="https://github.com/kubernetes/kubernetes/issues/7856">#7856</a>) was created where users discussed the alternatives to <code>phase</code>, and came to the conclusion that <code>conditions</code> would be the successor to <code>phase</code>. However, <code>phase</code> was never actually phased out, as breaking an existing API that was being used proved to be far too difficult. This is why today we still see a <code>phase</code> field in Pods, even though it’s officially been deprecated in favor of Conditions.</p>

<h4 id="conditions-in-kubernetes">Conditions in Kubernetes</h4>

<p>Nowadays, even though Conditions were <a href="https://github.com/kubernetes/kubernetes/issues/50798">almost deprecated</a> back in 2017, it looks like the field is here to stay. <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#spec-and-status">Official documentation</a> recommends that all new resources contain a <code>conditions</code> field in their status and provides guidance on how to implement it. In addition to Pods, Conditions are present in Namespaces, Nodes, PersistentVolumes, PersistentVolumeClaims, and Services for the core APIs. All resources in the batch API (e.g. Jobs) implement Conditions, as do the autoscaler resources such as HorizontalPodAutoscaler objects. Unfortunately, not all core resources in Kubernetes use Conditions. All apps API resources (Deployment, StatefulSet, DaemonSet…) define Condition types, but only <a href="https://github.com/kubernetes/kubernetes/blob/810e9e212ec5372d16b655f57b9231d8654a2179/pkg/apis/apps/types.go#L892-L918">the ReplicaSet</a> and <a href="https://github.com/kubernetes/kubernetes/blob/810e9e212ec5372d16b655f57b9231d8654a2179/pkg/apis/apps/types.go#L542-L574">the Deployment</a> resources actually populate them; StatefulSet and DaemonSet make no attempt to fill in the <code>conditions</code> field. This is evident when comparing <a href="https://github.com/kubernetes/kubectl/blob/5f5894cd61c609d7b55aa0f9bc99967155c69a9f/pkg/polymorphichelpers/rollout_status.go">the logic</a> for <code>kubectl rollout status</code>, which is a command that waits for Deployments, StatefulSets and DaemonSets to be in a Ready state after a rollout: the logic for checking whether a Deployment is rolled out includes a substep that checks the Condition on the object, but nothing similar exists for StatefulSets or DaemonSets.</p>

<h3 id="best-practices-for-conditions-in-custom-resources">Best Practices for Conditions in Custom Resources</h3>

<p>If you’re building a custom resource, you will invariably need to implement a <code>status</code> field. This means that a <code>conditions</code> field will almost surely follow. If so:</p>

<ol>
  <li>Implement a Condition of type <code>Ready</code> for long-running execution objects (think of Pods and Services), and a type <code>Succeeded</code> for bounded-execution objects (e.g. Jobs). Strive to always have an all-encompassing summary Condition for quickly assessing if the object is in a good state.</li>
  <li>When a Condition’s <code>"True"</code> status represents normal operations, it is referred to as a “positive-polarity” condition, whereas Conditions where <code>"False"</code> represents this state are “negative-polarity”. Standardize all your Conditions to use the same polarity to represent normal operations. This will help avoid confusion when scanning the <code>conditions</code> list and seeing a mix of <code>"True"</code> and <code>"False"</code> status values during normal conditions.</li>
  <li>Condition type names should always describe the current state of the observed object, never a transition phase. Think of <code>ScaledOut</code> as opposed to <code>Scaling</code> – the former can be set to <code>"True"</code> when successful, <code>"False"</code> when failed, and <code>"Unknown"</code> when the process is still ongoing.</li>
</ol>
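<p>Putting points 1 and 2 together, a controller might compute its summary Condition from the others. Here’s a hypothetical sketch, assuming uniformly positive polarity (the type and reason names are made up for illustration):</p>

```python
def summarize_ready(conditions):
    """Roll individual (positive-polarity) Conditions up into one
    all-encompassing Ready condition, per the best practices above."""
    sub = [c for c in conditions if c["type"] != "Ready"]
    if any(c["status"] == "False" for c in sub):
        status, reason = "False", "ComponentNotReady"
    elif any(c["status"] == "Unknown" for c in sub):
        status, reason = "Unknown", "ComponentStateUnknown"
    else:
        status, reason = "True", "AllComponentsReady"
    return {"type": "Ready", "status": status, "reason": reason}

conds = [
    {"type": "ScaledOut", "status": "True"},
    {"type": "CertificateProvisioned", "status": "Unknown"},
]
print(summarize_ready(conds))  # status is "Unknown": one component still pending
```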

<h3 id="takeaways">Takeaways</h3>

<p>The ubiquitous presence of Conditions in Kubernetes resources has an interesting history and represents the desire of the Kubernetes architects to provide a declarative, level-based, and observation-driven design. Even though the API still contains artifacts from past decisions (I’m looking at you, <code>phase</code>), Conditions signify an important step forward in standardizing the visualization of the observed state.</p>

<p><a href="https://feed.superorbital.io/">Subscribe (yes, we still ❤️ RSS)</a> or join our mailing list below to read more blog posts like this one!</p>
]]></content>
    <summary type="html">How to interpret the status and conditions fields in Kubernetes resources</summary>
  </entry>
  <entry>
    <id>tag:superorbital.io,2024-12-09:/blog/chaos-mesh/</id>
    <title type="html">Bring on the Chaos!</title>
    <published>2024-12-09T00:00:00Z</published>
    <updated>2024-12-09T00:00:00Z</updated>
    <link rel="alternate" href="https://superorbital.io/blog/chaos-mesh/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+SuperOrbital" type="text/html"/>
    <content type="html"><![CDATA[<!-- markdownlint-disable MD029 MD033 -->
<!-- This style tag will cause the individual lines in the code blocks to word wrap. -->
<!-- Unfortunately they will not be indented, but this is still better in many cases. -->
<style type="text/css" media="screen">
code[class*="language-"], pre[class*="language-"] {
    white-space: pre-wrap;
}
</style>

<p class="image"><img src="https://superorbital.io/blog/chaos-mesh/cover.jpg" alt="chaos - fractal flames"></p>

<details>
  <summary><em>Table of Contents</em></summary>

  <ul>
    <li><a href="https://superorbital.io#overview">Overview</a></li>
    <li>
<a href="https://superorbital.io#chaos-engineering">Chaos Engineering</a>
      <ul>
        <li><a href="https://superorbital.io#game-days">Game Days</a></li>
        <li><a href="https://superorbital.io#practice-makesbetter">Practice Makes…Better</a></li>
      </ul>
    </li>
    <li>
<a href="https://superorbital.io#kubernetes">Kubernetes</a>
      <ul>
        <li>
<a href="https://superorbital.io#chaos-mesh">Chaos Mesh</a>
          <ul>
            <li><a href="https://superorbital.io#installation">Installation</a></li>
            <li>
<a href="https://superorbital.io#chaos-experiments">Chaos Experiments</a>
              <ul>
                <li><a href="https://superorbital.io#resource-stress">Resource Stress</a></li>
                <li><a href="https://superorbital.io#pod-stability">Pod Stability</a></li>
                <li><a href="https://superorbital.io#network-latency">Network Latency</a></li>
              </ul>
            </li>
            <li><a href="https://superorbital.io#clean-up">Clean Up</a></li>
          </ul>
        </li>
        <li><a href="https://superorbital.io#conclusion">Conclusion</a></li>
      </ul>
    </li>
    <li><a href="https://superorbital.io#further-exploration">Further Exploration</a></li>
    <li><a href="https://superorbital.io#acknowledgments">Acknowledgments</a></li>
    <li><a href="https://superorbital.io#footnotes">Footnotes</a></li>
  </ul>

</details>

<hr>

<h2 id="overview">Overview</h2>

<p>In this article, we are going to explore the idea of Chaos Engineering and one tool, Chaos Mesh, that can help you simulate some common types of Kubernetes cluster disruptions. By testing how your applications respond to those events, you can use that knowledge to improve their resilience against similar planned or unplanned events in the future.</p>

<blockquote>
  <p><strong>NOTE</strong>: All of the <em>custom</em> files used in this post can be downloaded from the accompanying <code>git</code> repository at <a href="https://github.com/superorbital/chaos-mesh-playground">github.com/superorbital/chaos-mesh-playground</a>.</p>
</blockquote>

<h2 id="chaos-engineering">Chaos Engineering</h2>

<p>Like the weather, the internet and distributed systems are unreliable. In general, they do what we expect them to do, but they inevitably do something unexpected at the least convenient moment. In the case of the weather, we prepare for this guaranteed eventuality by buying a coat and umbrella, learning how to use them properly, keeping them in good shape, and having them close at hand when we head out for the day. With distributed systems, we need to build, install, test, and practice procedures that ensure our systems handle unplanned failures with grace and aplomb.</p>

<p><a href="https://en.wikipedia.org/wiki/Chaos_engineering">Chaos Engineering</a> is the art of intentionally injecting various forms of chaos, or failure scenarios, into a system, observing what happens, and then documenting, evaluating, and improving the system to better handle those events in the future. Although this type of testing has existed in one form or another for a very long time, the term Chaos Engineering is primarily attributed to engineers at <a href="https://www.netflix.com/">Netflix</a>, who, in 2011, released a tool called <a href="https://github.com/Netflix/chaosmonkey">Chaos Monkey</a>, which randomly terminated virtual machine (VM<sup id="fnref:vm" role="doc-noteref"><a href="https://superorbital.io#fn:vm" class="footnote" rel="footnote">1</a></sup>) instances and containers inside their <strong>production</strong> environment. The goal, as they undertook a massive <a href="https://netflixtechblog.com/5-lessons-weve-learned-using-aws-1f2a28588e4c">migration</a> into Amazon Web Services (AWS<sup id="fnref:aws" role="doc-noteref"><a href="https://superorbital.io#fn:aws" class="footnote" rel="footnote">2</a></sup>), was to directly expose developers to their applications’ failure cases and incentivize them to build resilient services. Chaos Monkey was so successful that it eventually spawned a whole series of tools that became known as the <a href="https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116">Netflix Simian Army</a>.</p>

<h3 id="game-days">Game Days</h3>

<p>Another idea that has been around for a long time but was actively popularized in the technology field by AWS is a Game Day. To directly quote the <a href="https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-23/wat.concept.gameday.en.html">AWS well-architected manual</a>, “A game day simulates a failure or event to test systems, processes, and team responses. The purpose is to actually perform the actions the team would perform as if an exceptional event happened. These should be conducted regularly so that your team builds ‘muscle memory’ on how to respond. Your game days should cover the areas of operations, security, reliability, performance, and cost.”</p>

<p>If you want to be prepared for a human health emergency, you might take a class to learn CPR<sup id="fnref:cpr" role="doc-noteref"><a href="https://superorbital.io#fn:cpr" class="footnote" rel="footnote">3</a></sup>, but unless you practice it on a regular basis, it is very likely that you will have forgotten how to do it properly when a real emergency arrives. You will either freeze or potentially even cause more damage by performing the procedure incorrectly.</p>

<p>Organizations and teams that really want to be prepared to handle emergencies as smoothly and effectively as possible must practice frequently. And effective practice requires a tight feedback loop that, at a minimum, includes most of the following steps: plan, test, observe, document, fix, test, and repeat.</p>

<h3 id="practice-makesbetter">Practice Makes…Better</h3>

<p>No process is ever perfect, but practice and follow-through can help move you in the right direction.</p>

<p>To get started, most organizations will want to have at least two environments: development and production. A development, integration, or staging environment often gives an organization enough redundancy to feel safe starting to experiment with chaos engineering and game days.</p>

<p>In these environments, it is recommended that you pick a scenario, plan it out, and then schedule a time to trigger the incident, allowing teams to observe and respond to what occurs. Some things will be expected, while others will be a complete surprise. This exercise gives teams a chance to discover many things, like previously unknown risks, unexpected edge cases, poor documentation, poor training, software bugs, issues in the incident management process, and much more.</p>

<p>This is a good start, but follow-up is <strong>critical</strong>! The teams that were involved must be given the space to do a thorough retrospective regarding the event, where they can discuss and document what happened and how it might be avoided or improved. When the retrospective ends, each team should have a list of action items that will be immediately converted into tickets for follow-up, design, and implementation.</p>

<p>As teams get more experienced with this exercise, the game days can evolve to mirror real life more accurately. Eventually, organizers can plan the event but leave the teams involved in the dark about what situation is going to be triggered. This ensures the teams can rely on nothing but their existing preparation, precisely as they would during an actual incident.</p>

<p>This not only lets you test the product and the teams that maintain it, but also lets you thoroughly test the incident management process.</p>

<ul>
  <li>How are communications handled?</li>
  <li>Did the right teams get notified at the right time?</li>
  <li>Were we able to quickly engage the right on-call people?</li>
  <li>Was anyone confused or uninformed about the status of the incident at any point?</li>
  <li>Did we properly simulate communication with customers, leadership, etc?</li>
</ul>

<p>Organizations and teams will improve as they practice and, just as importantly, follow up on their findings.</p>

<h2 id="kubernetes">Kubernetes</h2>

<p>So, how can this sort of testing be done within a Kubernetes cluster? There are many potential approaches, but one tool that can help mimic some of the potential failure cases that can occur within Kubernetes is <a href="https://chaos-mesh.org/">Chaos Mesh</a>, which we will discuss throughout the rest of the article.</p>

<h3 id="chaos-mesh">Chaos Mesh</h3>

<p>Chaos Mesh is an <a href="https://www.cncf.io/projects/chaosmesh/">incubating open-source project</a> in the Cloud Native Computing Foundation (CNCF<sup id="fnref:cncf" role="doc-noteref"><a href="https://superorbital.io#fn:cncf" class="footnote" rel="footnote">4</a></sup>) ecosystem. The project’s source code can be found on GitHub at <a href="https://github.com/chaos-mesh/chaos-mesh">chaos-mesh/chaos-mesh</a>, and it utilizes the <a href="https://cloud-native.slack.com/archives/C0193VAV272">CNCF Slack workspace</a> for community discussions.</p>

<p>This tool primarily consists of four in-cluster components, described below, and one optional CLI<sup id="fnref:cli" role="doc-noteref"><a href="https://superorbital.io#fn:cli" class="footnote" rel="footnote">5</a></sup> tool called <a href="https://chaos-mesh.org/docs/chaosctl-tool/">chaosctl</a>.</p>

<ul>
  <li>
<strong><code>chaos-controller-manager</code> <a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/">Deployment</a></strong> - The core component for orchestrating chaos experiments.</li>
  <li>
<strong><code>chaos-daemon</code> <a href="https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/">DaemonSet</a></strong> - The component on each node that injects and manages chaos targeting that system and its pods.</li>
  <li>
<strong><code>chaos-dashboard</code> Deployment</strong> - The GUI<sup id="fnref:gui" role="doc-noteref"><a href="https://superorbital.io#fn:gui" class="footnote" rel="footnote">6</a></sup> for managing, designing, and monitoring chaos experiments.</li>
  <li>
<strong><code>chaos-dns-server</code> Deployment</strong> - A special DNS<sup id="fnref:dns" role="doc-noteref"><a href="https://superorbital.io#fn:dns" class="footnote" rel="footnote">7</a></sup> service that is used to simulate DNS faults.</li>
  <li>
<strong><code>chaosctl</code> CLI</strong> - An optional tool to assist in debugging Chaos Mesh.</li>
</ul>

<h4 id="installation">Installation</h4>

<p>To install Chaos Mesh, you will need a Kubernetes cluster. In this article, we are going to utilize <a href="https://kind.sigs.k8s.io/docs/user/quick-start/">kind</a> along with <a href="https://www.docker.com/">Docker</a> to manage a local Kubernetes cluster, so if you want to follow along exactly, you will need these two tools installed. However, with a bit of adjustment to the commands, most of this should work in any Kubernetes cluster.</p>

<p>After taking a look at the <a href="https://mirrors.chaos-mesh.org/v2.6.3/install.sh">install script</a> to ensure that it is safe to run, you can instruct it to spin up a cluster with a single worker node via <code>kind</code> v0.24.0 and then install Chaos Mesh v2.6.3 into the cluster using the following command.</p>

<blockquote>
<p><strong>NOTE</strong>: Some of these examples assume that there is only a single worker node in the cluster. If you are using a different setup, you may need to tweak the YAML manifests and commands to ensure you are targeting the correct pods/nodes and observing the correct output.</p>
</blockquote>

<pre><code class="language-console">$ curl -sSL https://mirrors.chaos-mesh.org/v2.6.3/install.sh | \
    bash -s -- --local kind --kind-version v0.24.0 --node-num 1 \
    --k8s-version v1.31.0 --name chaos

Install kubectl client
kubectl Version 1.31.0 has been installed
Install Kind tool
Kind Version 0.24.0 has been installed
Install local Kubernetes chaos
No kind clusters found.
Clean data dir: ~/kind/chaos/data
start to create kubernetes cluster chaosCreating cluster "chaos" ...
DEBUG: docker/images.go:58] Image: kindest/node:v1.31.0 present locally
 ✓ Ensuring node image (kindest/node:v1.31.0) 🖼
 ✓ Preparing nodes 📦 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
 ✓ Joining worker nodes 🚜
Set kubectl context to "kind-chaos"
You can now use your cluster with:

kubectl cluster-info --context kind-chaos

Thanks for using kind! 😊
Install Chaos Mesh chaos-mesh
crd.apiextensions.k8s.io/awschaos.chaos-mesh.org created
…
Waiting for pod running
chaos-controller-manager-7fb5d7b648-… 0/1 ContainerCreating 0 10s
chaos-controller-manager-7fb5d7b648-… 0/1 ContainerCreating 0 10s
chaos-controller-manager-7fb5d7b648-… 0/1 ContainerCreating 0 10s
Waiting for pod running
Chaos Mesh chaos-mesh is installed successfully
</code></pre>

<blockquote>
  <p><strong>Note</strong>: Chaos Mesh can easily be installed into any cluster that your <code>kubectl</code> current context points at by simply running <code>curl -sSL https://mirrors.chaos-mesh.org/v2.6.3/install.sh | bash</code>.</p>
</blockquote>
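<p>If you prefer Helm, the project also publishes an official chart. A sketch of that route is shown below; the <code>chaosDaemon.runtime</code> and <code>chaosDaemon.socketPath</code> values are only needed on containerd-based clusters such as <code>kind</code>, so check the Chaos Mesh installation docs for the values appropriate to your environment.</p>

<pre><code class="language-console">$ helm repo add chaos-mesh https://charts.chaos-mesh.org
$ helm repo update
$ helm install chaos-mesh chaos-mesh/chaos-mesh \
    --namespace chaos-mesh --create-namespace --version 2.6.3 \
    --set chaosDaemon.runtime=containerd \
    --set chaosDaemon.socketPath=/run/containerd/containerd.sock
</code></pre>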

<p>If you utilized the installer that leverages <code>kind</code>, then you should be able to find the cluster config and related data volume storage in <em>${HOME}/kind/chaos</em>.</p>

<p>If you are curious, you can investigate the main components that were installed by running <code>kubectl get all -n chaos-mesh</code>.</p>
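<p>For example, listing just the Deployments and DaemonSet should produce output roughly like the following on the single-worker cluster (names, counts, and ages will vary with your setup):</p>

<pre><code class="language-console">$ kubectl get deployments,daemonsets -n chaos-mesh

NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/chaos-controller-manager   3/3     3            3           2m
deployment.apps/chaos-dashboard            1/1     1            1           2m
deployment.apps/chaos-dns-server           1/1     1            1           2m

NAME                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/chaos-daemon   1         1         1       1            1           &lt;none&gt;          2m
</code></pre>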

<p>Once Chaos Mesh is installed, you can verify that you have access to the GUI by opening up another terminal window and running:</p>

<pre><code class="language-console">$ kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333

Forwarding from 127.0.0.1:2333 -&gt; 2333
Forwarding from [::1]:2333 -&gt; 2333
</code></pre>

<p>Then, open up a web browser and point it to <a href="http://127.0.0.1:2333/#/dashboard">http://127.0.0.1:2333/#/dashboard</a>.</p>

<p>If all is well, then you should see this:</p>

<p class="image"><img src="https://superorbital.io/blog/chaos-mesh/chaos-mesh-gui.png" alt="Chaos Mesh GUI"></p>

<p>Because we will want to examine resource utilization throughout this article, we are also going to install <a href="https://github.com/google/cadvisor">Google’s cadvisor</a> to provide some simple resource monitoring. The Kubernetes YAML<sup id="fnref:yaml" role="doc-noteref"><a href="https://superorbital.io#fn:yaml" class="footnote" rel="footnote">8</a></sup> manifest below creates the <strong>cadvisor</strong> <a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/">Namespace</a>, <a href="https://kubernetes.io/docs/concepts/security/service-accounts/">ServiceAccount</a>, and DaemonSet. Copy it into a file called <em>cadvisor.yaml</em> and then run <code>kubectl apply -f ./cadvisor.yaml</code>.</p>

<details>
  <summary>cadvisor Kubernetes YAML Manifest</summary>

  <pre><code class="language-yaml">apiVersion: v1
kind: Namespace
metadata:
  labels:
    app: cadvisor
  name: cadvisor
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: cadvisor
  name: cadvisor
  namespace: cadvisor
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: docker/default
  labels:
    app: cadvisor
  name: cadvisor
  namespace: cadvisor
spec:
  selector:
    matchLabels:
      app: cadvisor
      name: cadvisor
  template:
    metadata:
      labels:
        app: cadvisor
        name: cadvisor
    spec:
      automountServiceAccountToken: false
      containers:
      - image: gcr.io/cadvisor/cadvisor:v0.49.1
        name: cadvisor
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        resources:
          limits:
            cpu: 4000m
            memory: 4000Mi
          requests:
            cpu: 1000m
            memory: 100Mi
        volumeMounts:
        - mountPath: /rootfs
          name: rootfs
          readOnly: true
        - mountPath: /var/run
          name: var-run
          readOnly: true
        - mountPath: /sys
          name: sys
          readOnly: true
        - mountPath: /var/lib/docker
          name: docker
          readOnly: true
        - mountPath: /dev/disk
          name: disk
          readOnly: true
      serviceAccountName: cadvisor
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /
        name: rootfs
      - hostPath:
          path: /var/run
        name: var-run
      - hostPath:
          path: /sys
        name: sys
      - hostPath:
          path: /var/lib/docker
        name: docker
      - hostPath:
          path: /dev/disk
        name: disk
</code></pre>

</details>

<p>You can verify that the <strong>cadvisor</strong> DaemonSet is in a good state by running <code>kubectl get daemonset -n cadvisor</code>, and ensuring that there is one pod per worker, which is both <strong>READY</strong> and <strong>AVAILABLE</strong>. Once everything is running, you can access the <strong>cadvisor</strong> dashboard on one of the nodes by opening up a new terminal and running:</p>

<pre><code class="language-console">$ kubectl port-forward -n cadvisor pods/$(kubectl get pods -o jsonpath="{.items[0].metadata.name}" -n cadvisor) 8080

Forwarding from 127.0.0.1:8080 -&gt; 8080
Forwarding from [::1]:8080 -&gt; 8080
</code></pre>

<p>Then, open up a web browser and point it to <a href="http://127.0.0.1:8080/containers/">http://127.0.0.1:8080/containers/</a>.</p>

<p>If everything has gone to plan up to this point, you should see something like this:</p>

<p class="image"><img src="https://superorbital.io/blog/chaos-mesh/cadvisor-gui.png" alt="cadvisor GUI"></p>

<h4 id="chaos-experiments">Chaos Experiments</h4>

<p>Chaos Mesh has three primary concepts that form the core of the tool and its capabilities. These include:</p>

<ul>
  <li>
<strong><a href="https://chaos-mesh.org/docs/run-a-chaos-experiment/">Experiments</a> (<a href="http://127.0.0.1:2333/#/experiments/new">local UI</a>)</strong> - which are used to define the parameters of a single chaos test that the user wants to run. This will include the type of chaos to inject into the system and specifically how that chaos will be shaped and what it will target.</li>
  <li>
<strong><a href="https://chaos-mesh.org/docs/create-chaos-mesh-workflow/">Workflows</a> (<a href="http://127.0.0.1:2333/#/workflows/new/next">local UI</a>)</strong> - this allows you to define a complex series of tests that should run in an environment to more closely simulate complex real-world outages.</li>
  <li>
<strong><a href="https://chaos-mesh.org/docs/define-scheduling-rules/">Schedules</a> (<a href="http://127.0.0.1:2333/#/schedules/new">local UI</a>)</strong> - expands upon Experiments by making them run on a defined schedule.</li>
</ul>
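<p>To make the Workflow concept concrete, here is a minimal sketch of a Workflow manifest that runs two experiments from this article back to back. The template names and deadlines are illustrative; the field layout follows the examples in the Chaos Mesh Workflow documentation, so verify it against the version you have installed before relying on it.</p>

<pre><code class="language-yaml">apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: example-outage
spec:
  entry: the-entry
  templates:
    # A Serial template runs its children one after another.
    - name: the-entry
      templateType: Serial
      deadline: 120s
      children:
        - cpu-stress
        - network-delay
    - name: cpu-stress
      templateType: StressChaos
      deadline: 30s
      stressChaos:
        mode: all
        selector:
          namespaces:
            - cadvisor
          labelSelectors:
            app: cadvisor
        stressors:
          cpu:
            load: 100
            workers: 20
    - name: network-delay
      templateType: NetworkChaos
      deadline: 30s
      networkChaos:
        action: netem
        mode: all
        selector:
          namespaces:
            - default
          labelSelectors:
            app: web-show
        delay:
          latency: 500ms
          jitter: 100ms
</code></pre>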

<p>In this article, we will primarily use Kubernetes manifests to demonstrate the functionality of Chaos Mesh, but many things can be done in the UI<sup id="fnref:ui" role="doc-noteref"><a href="https://superorbital.io#fn:ui" class="footnote" rel="footnote">9</a></sup>, and the workflows UI can be particularly helpful in building complex visual workflows.</p>

<p class="image"><img src="https://superorbital.io/blog/chaos-mesh/chaos-mesh-workflows.png" alt="Chaos Mesh Workflows visual editor"></p>

<h5 id="resource-stress">Resource Stress</h5>

<p>So, at this point, let’s go ahead and run a simple experiment by applying some CPU<sup id="fnref:cpu" role="doc-noteref"><a href="https://superorbital.io#fn:cpu" class="footnote" rel="footnote">10</a></sup> stress to our <strong>cadvisor</strong> pod.</p>

<p>We’ll start by getting a snapshot of the node’s resource utilization. Since the node is actually a container in Docker, we can check it like this:</p>

<pre><code class="language-console">$ docker stats chaos-worker --no-stream

CONTAINER ID   NAME           CPU %     MEM USAGE / LIMIT    MEM %     NET I/O          BLOCK I/O     PIDS
4a3385d7c565   chaos-worker   7.30%     581.4MiB / 15.6GiB   3.64%     369MB / 19.5GB   0B / 1.49GB   294
</code></pre>

<p>Next, let’s create and apply the following <a href="https://chaos-mesh.org/docs/simulate-heavy-stress-on-kubernetes/">StressChaos</a> Schedule with the command <code>kubectl apply -f ./resource-stress.yaml</code>. It will create a significant CPU load within the <strong>cadvisor</strong> pod for 10 seconds every 15 seconds.</p>

<pre><code class="language-yaml">apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: resource-stress-example
spec:
  schedule: '@every 15s'
  type: StressChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  stressChaos:
    mode: all
    duration: 10s
    selector:
      namespaces:
        - cadvisor
      labelSelectors:
        'app': 'cadvisor'
    stressors:
      cpu:
        load: 100
        workers: 20
</code></pre>

<p>If you then wait just over 15 seconds and take another snapshot of the resource utilization, you should see something like this:</p>

<pre><code class="language-console">$ sleep 15 &amp;&amp; docker stats chaos-worker --no-stream

CONTAINER ID   NAME           CPU %     MEM USAGE / LIMIT    MEM %     NET I/O          BLOCK I/O     PIDS
4a3385d7c565   chaos-worker   400.03%   617.6MiB / 15.6GiB   3.87%     371MB / 19.6GB   0B / 1.49GB   317
</code></pre>

<p>The <a href="http://127.0.0.1:8080/containers/"><strong>cadvisor</strong> UI</a> should also be giving you a very clear indication of this fluctuating CPU load.</p>

<p class="image"><img src="https://superorbital.io/blog/chaos-mesh/cadvisor-cpu-load.png" alt="Chaos Mesh CPU Stress cadvisor chart"></p>

<blockquote>
<p>It is worth noting that you can <strong>pause</strong> a scheduled experiment by annotating the <strong>Schedule</strong> like so:
<code>kubectl annotate schedules.chaos-mesh.org resource-stress-example experiment.chaos-mesh.org/pause=true</code>,
and then <strong>unpause</strong> it by running
<code>kubectl annotate schedules.chaos-mesh.org resource-stress-example experiment.chaos-mesh.org/pause-</code>.
If you check cadvisor while the experiment is paused, you will see that everything has dropped back down to a mostly steady baseline value.</p>
</blockquote>

<p>Now, let’s remove this schedule by running <code>kubectl delete -f ./resource-stress.yaml</code> so it doesn’t continue to utilize our precious CPU resources.</p>

<h5 id="pod-stability">Pod Stability</h5>

<p>For the next set of tests, let’s deploy three replicas of a small web application to our cluster by creating and applying the following manifest with <code>kubectl apply -f ./web-show.yaml</code>.</p>

<blockquote>
  <p><strong>NOTE</strong>: As written, this web app will attempt to continuously ping the Google DNS server(s) at 8.8.8.8; if you are unable to ping this IP address, you can replace the IP address in this manifest with something else in your network that will respond to a ping.</p>
</blockquote>

<pre><code class="language-yaml">apiVersion: v1
kind: Service
metadata:
  name: web-show
  labels:
    app: web-show
spec:
  selector:
    app: web-show
  ports:
    - protocol: TCP
      port: 8081
      targetPort: 8081
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-show
  labels:
    app: web-show
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-show
  template:
    metadata:
      labels:
        app: web-show
    spec:
      containers:
        - name: web-show
          image: ghcr.io/chaos-mesh/web-show
          imagePullPolicy: Always
          command:
            - /usr/local/bin/web-show
            - --target-ip=8.8.8.8
          env:
            - name: TARGET_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          ports:
            - name: web-port
              containerPort: 8081
          resources:
            requests:
              memory: "10Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "1000m"
</code></pre>

<p>Once applied, let’s open another terminal and monitor the pods that we just deployed.</p>

<pre><code class="language-console">$ kubectl get pods --watch

NAME                        READY   STATUS    RESTARTS   AGE
web-show-76b9dd8f44-5ks6j   1/1     Running   0          36s
web-show-76b9dd8f44-g9hrj   1/1     Running   0          35s
web-show-76b9dd8f44-mxx6z   1/1     Running   0          38s
</code></pre>

<p>In the original terminal, we can now apply the following Chaos Schedule, which will cause a <code>web-show</code> pod to fail every 10 seconds.</p>

<pre><code class="language-yaml">apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: web-show-pod-failure
spec:
  schedule: '@every 10s'
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-failure
    mode: one
    selector:
      namespaces:
      - default
      labelSelectors:
        app: web-show
</code></pre>

<p>After you create and apply this with <code>kubectl apply -f ./pod-failure.yaml</code>, you can observe what happens to the pods you are watching on the other terminal. The output should look something like the one shown below.</p>

<pre><code class="language-console">NAME                      READY  STATUS            RESTARTS    AGE
web-show-76b9dd8f44-5ks6j 1/1   Running           0           36s
web-show-76b9dd8f44-g9hrj 1/1   Running           0           35s
web-show-76b9dd8f44-mxx6z 1/1   Running           0           38s
web-show-76b9dd8f44-mxx6z 1/1   Running           0           5m36s
web-show-76b9dd8f44-mxx6z 0/1   RunContainerError 1 (0s ago)  5m37s
web-show-76b9dd8f44-mxx6z 0/1   RunContainerError 2 (0s ago)  5m38s
web-show-76b9dd8f44-mxx6z 0/1   CrashLoopBackOff  2 (1s ago)  5m39s
web-show-76b9dd8f44-mxx6z 0/1   RunContainerError 3 (1s ago)  5m52s
web-show-76b9dd8f44-mxx6z 0/1   RunContainerError 3 (15s ago) 6m6s
web-show-76b9dd8f44-mxx6z 0/1   CrashLoopBackOff  3 (15s ago) 6m6s
web-show-76b9dd8f44-g9hrj 1/1   Running           0           6m3s
web-show-76b9dd8f44-mxx6z 1/1   Running           4 (15s ago) 6m6s
web-show-76b9dd8f44-g9hrj 0/1   RunContainerError 1 (1s ago)  6m4s
web-show-76b9dd8f44-g9hrj 0/1   RunContainerError 2 (0s ago)  6m5s
web-show-76b9dd8f44-g9hrj 0/1   CrashLoopBackOff  2 (1s ago)  6m6s
web-show-76b9dd8f44-g9hrj 0/1   RunContainerError 3 (1s ago)  6m22s
web-show-76b9dd8f44-g9hrj 0/1   RunContainerError 3 (12s ago) 6m33s
web-show-76b9dd8f44-g9hrj 0/1   CrashLoopBackOff  3 (12s ago) 6m33s
web-show-76b9dd8f44-mxx6z 1/1   Running           4 (45s ago) 6m36s
web-show-76b9dd8f44-g9hrj 1/1   Running           4 (12s ago) 6m33s
</code></pre>

<p>Most types of Chaos have a few modes or actions that can be taken. Let’s remove this experiment using <code>kubectl delete -f ./pod-failure.yaml</code>.</p>

<p>Then, we can add a very similar experiment that will kill a pod instead of causing it to fail by applying the following YAML with <code>kubectl apply -f ./pod-kill.yaml</code>.</p>

<pre><code class="language-yaml">apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: web-show-pod-kill
spec:
  schedule: '@every 10s'
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one
    duration: 30s
    selector:
      namespaces:
      - default
      labelSelectors:
        app: web-show
</code></pre>

<p>Once the experiment has been applied, the output from <code>kubectl get pods --watch</code> should now display something like this:</p>

<pre><code class="language-text">NAME                      READY STATUS            RESTARTS AGE
web-show-76b9dd8f44-5clbk 1/1   Running           0        4s
web-show-76b9dd8f44-hzwn8 1/1   Running           0        5m18s
web-show-76b9dd8f44-rfxrw 1/1   Running           0        34s
web-show-76b9dd8f44-hzwn8 1/1   Terminating       0        5m24s
web-show-76b9dd8f44-hzwn8 1/1   Terminating       0        5m24s
web-show-76b9dd8f44-zcwbq 0/1   Pending           0        0s
web-show-76b9dd8f44-zcwbq 0/1   Pending           0        0s
web-show-76b9dd8f44-zcwbq 0/1   ContainerCreating 0        0s
web-show-76b9dd8f44-zcwbq 1/1   Running           0        1s
web-show-76b9dd8f44-zcwbq 1/1   Terminating       0        10s
web-show-76b9dd8f44-zcwbq 1/1   Terminating       0        10s
web-show-76b9dd8f44-xvnh2 0/1   Pending           0        0s
web-show-76b9dd8f44-xvnh2 0/1   Pending           0        0s
web-show-76b9dd8f44-xvnh2 0/1   ContainerCreating 0        0s
web-show-76b9dd8f44-xvnh2 1/1   Running           0        1s
</code></pre>

<p>If you compare the earlier pod behavior with this, you will notice that in the original Pod failure experiment, we see messages like <strong>RunContainerError</strong> and <strong>CrashLoopBackOff</strong>, while in this Pod kill experiment, we see messages like <strong>Terminating</strong>, <strong>Pending</strong>, and <strong>ContainerCreating</strong>. This is because the first experiment replicates an application crashing inside the existing pods, while the second experiment kills the pod outright, so the Deployment schedules a replacement with a new name.</p>

<p>We can regain our pod stability by removing the scheduled experiment from the cluster with <code>kubectl delete -f ./pod-kill.yaml</code>.</p>

<h5 id="network-latency">Network Latency</h5>

<p>Next, we will generate network latency for a set of our pods by defining a scheduled <a href="https://chaos-mesh.org/docs/simulate-network-chaos-on-kubernetes/">NetworkChaos</a> experiment. But first, let’s examine the web UI that the <code>web-show</code> application generates.</p>

<p>In another terminal window, run the following command to forward a host port to the <code>web-show</code> service.</p>

<pre><code class="language-console">$ kubectl port-forward service/web-show 8081

Forwarding from 127.0.0.1:8081 -&gt; 8081
Forwarding from [::1]:8081 -&gt; 8081
</code></pre>

<p>Now, you should be able to point your web browser at <a href="http://127.0.0.1:8081/">http://127.0.0.1:8081/</a> and see <code>web-show</code>’s simple latency chart. This chart is currently configured to show the latency between our pods and the Google DNS servers at 8.8.8.8 (<em>or whatever IP address you used in the web-show manifest</em>).</p>

<p class="image"><img src="https://superorbital.io/blog/chaos-mesh/web-show-start.png" alt="web-show UI - Showing standard latency"></p>

<p>Let’s leave the web-show UI running and then apply the following YAML file to the cluster, using <code>kubectl apply -f ./network-delay.yaml</code>.</p>

<pre><code class="language-yaml">apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: web-show-network-delay
spec:
  concurrencyPolicy: Forbid
  historyLimit: 5
  networkChaos:
    action: netem
    mode: all
    selector:
      namespaces:
        - default
      labelSelectors:
        'app': 'web-show'
    delay:
      latency: '500ms'
      correlation: '100'
      jitter: '100ms'
    duration: 10s
  schedule: '@every 20s'
  type: NetworkChaos
</code></pre>

<p>This YAML is using the network emulation action to introduce 500 milliseconds of delay with a 100-millisecond jitter (fluctuation) to the <code>web-show</code> pods’ network packets.</p>
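<p>If you would like a second, chart-free confirmation, you can measure the delay from inside one of the pods while an experiment window is active. This assumes the <code>web-show</code> image includes a <code>ping</code> binary and that your target IP answers ICMP:</p>

<pre><code class="language-console">$ kubectl exec deploy/web-show -- ping -c 3 8.8.8.8
</code></pre>

<p>During the 10-second chaos window, the reported round-trip times should hover around 500 milliseconds, plus or minus the configured jitter; between windows, they should return to your baseline.</p>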

<p>After it has run for a minute or two, the chart should look something like this:</p>

<p class="image"><img src="https://superorbital.io/blog/chaos-mesh/web-show-delayed.png" alt="webshow UI - Showing spiky latency"></p>

<p>As usual, to remove the experiment, we can simply run <code>kubectl delete -f ./network-delay.yaml</code> and then run <code>kubectl delete -f ./web-show.yaml</code> to remove the web-show Deployment and Service.</p>

<h4 id="clean-up">Clean Up</h4>

<p>At this point, you can go ahead and stop any <code>kubectl port-forward …</code> or <code>kubectl … --watch</code> commands that you still have running by switching to that terminal and pressing [Control-C]. Then you can use <code>kubectl delete …</code> to remove anything else that might still be lingering around.</p>

<p>If you are using a temporary cluster, you can de-provision it to ensure that everything is cleaned up. If you are using the <code>kind</code> cluster created by the installation script, then this should be as easy as running <code>kind delete cluster --name chaos</code>.</p>

<blockquote>
  <p><a href="https://chaos-mesh.org/docs/uninstallation/">Detailed instructions on uninstalling Chaos Mesh</a> from a cluster can be found in the documentation.</p>
</blockquote>

<h3 id="conclusion">Conclusion</h3>

<p>Chaos Mesh is an interesting tool for exploring some common failure modes that can impact applications running inside Kubernetes environments. It can complement other testing tools in the ecosystem, like <a href="https://testkube.io/">testkube</a>.</p>

<p>There are several open-source, cloud-native, and commercial tools that specialize in robust, Kubernetes-focused chaos engineering. For those who are just getting started with chaos engineering, however, Chaos Mesh provides a simple and approachable open-source tool that can help adopters understand some of the more significant resiliency risks in their stack, document those issues, and prioritize fixes.</p>

<p>So, what are you waiting for? There is no better time than right now to start practicing and improving your platform’s resiliency. You can take it slow, but creating a healthy habit takes practice and repetition.</p>

<h2 id="further-exploration">Further Exploration</h2>

<ul>
  <li>
<a href="https://chaos-mesh.org/">Chaos Mesh</a>
    <ul>
      <li><a href="https://github.com/chaos-mesh">github.com/chaos-mesh</a></li>
    </ul>
  </li>
  <li><a href="https://litmuschaos.io/">LitmusChaos</a></li>
  <li><a href="https://aws.amazon.com/fis/">AWS Fault Injection Simulator</a></li>
  <li><a href="https://azure.microsoft.com/en-us/products/chaos-studio">Azure Chaos Studio</a></li>
  <li><a href="https://www.gremlin.com/kubernetes-chaos-engineering">Gremlin</a></li>
  <li>
<a href="https://www.conf42.com/">Conf42</a> <a href="https://www.conf42.com/ce2025">Chaos Engineering 2025</a>
</li>
  <li><a href="https://www.amazon.com/_/dp/1492043869">Chaos Engineering: System Resiliency in Practice</a></li>
</ul>

<h2 id="acknowledgments">Acknowledgments</h2>

<blockquote>
  <ul>
    <li>Cover image by <a href="https://pixabay.com/users/darksouls1-2189876/">darksouls1</a> from <a href="https://pixabay.com/illustrations/fractal-light-light-fractal-fire-1764914/">Pixabay</a>
</li>
  </ul>
</blockquote>

<hr>

<h2 id="footnotes">Footnotes</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:vm" role="doc-endnote">
      <p><a href="https://en.wikipedia.org/wiki/Virtual_machine">Virtual Machine</a> <a href="https://superorbital.io#fnref:vm" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
    <li id="fn:aws" role="doc-endnote">
      <p><a href="https://en.wikipedia.org/wiki/Amazon_Web_Services">Amazon Web Services</a> <a href="https://superorbital.io#fnref:aws" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
    <li id="fn:cpr" role="doc-endnote">
      <p><a href="https://cpr.heart.org/en/resources/what-is-cpr">Cardiopulmonary Resuscitation</a> <a href="https://superorbital.io#fnref:cpr" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
    <li id="fn:cncf" role="doc-endnote">
      <p><a href="https://www.cncf.io/">Cloud Native Computing Foundation</a> <a href="https://superorbital.io#fnref:cncf" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
    <li id="fn:cli" role="doc-endnote">
      <p><a href="https://en.wikipedia.org/wiki/Command-line_interface">command line interface</a> <a href="https://superorbital.io#fnref:cli" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
    <li id="fn:gui" role="doc-endnote">
      <p><a href="https://en.wikipedia.org/wiki/Graphical_user_interface">graphical user interface</a> <a href="https://superorbital.io#fnref:gui" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
    <li id="fn:dns" role="doc-endnote">
      <p><a href="https://www.cloudflare.com/learning/dns/what-is-dns/">Domain Name Services</a> <a href="https://superorbital.io#fnref:dns" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
    <li id="fn:yaml" role="doc-endnote">
      <p><a href="https://yaml.org/">YAML Ain’t Markup Language</a> <a href="https://superorbital.io#fnref:yaml" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
    <li id="fn:ui" role="doc-endnote">
      <p><a href="https://en.wikipedia.org/wiki/User_interface">user interface</a> <a href="https://superorbital.io#fnref:ui" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
    <li id="fn:cpu" role="doc-endnote">
      <p><a href="https://aws.amazon.com/what-is/cpu/">Central Processing Unit</a> <a href="https://superorbital.io#fnref:cpu" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
  </ol>
</div>
]]></content>
    <summary type="html">Exploring Chaos Mesh and how it can be used to improve Kubernetes cluster resilience.</summary>
  </entry>
  <entry>
    <id>tag:superorbital.io,2024-11-06:/blog/from-helm-operator-to-go-controller/</id>
    <title type="html">From Helm Operator to Go Controller</title>
    <published>2024-11-06T00:00:00Z</published>
    <updated>2024-11-06T00:00:00Z</updated>
    <link rel="alternate" href="https://superorbital.io/blog/from-helm-operator-to-go-controller/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+SuperOrbital" type="text/html"/>
<content type="html"><![CDATA[<p>A recent client engagement asked us to replace a
<a href="https://sdk.operatorframework.io/docs/building-operators/helm/tutorial/">helm-operator</a>-based
operator with a Go controller in the <a href="http://kubebuilder.io">kubebuilder</a> framework, in
order to enable more sophisticated use cases than simple helm
charts and templating can provide. The operator enables application authors to self-manage
deploying a webapp, and it integrates with other operators to enable ingress,
secrets management, and autoscaling.</p>

<h3 id="what-is-helm-operator">What is Helm Operator?</h3>

<p><a href="https://sdk.operatorframework.io/docs/building-operators/helm/tutorial/">helm-operator</a>
is a project from Operator SDK that simplifies operator development by putting
most of the business logic of producing downstream resources into the templating
of an embedded helm chart. It’s convenient for very simple operators that only
need to apply resources in response to the upstream resource. The upstream
resource is made available as a chart value so that the CRD values can affect
the output.</p>
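<p>As a sketch of how that looks (the resource and value names here are illustrative, not from the client’s chart), a template in the embedded chart can read fields of the custom resource directly from <code>.Values</code>:</p>

```yaml
# Hypothetical chart template: helm-operator exposes the CR's spec as chart
# values, so spec.replicas on the custom resource arrives here as .Values.replicas.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}
spec:
  replicas: {{ .Values.replicas }}
```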

<h3 id="sounds-great-whats-the-problem">Sounds Great! What’s the Problem?</h3>

<p>Well, this simplicity comes at the cost of not allowing anything more
sophisticated. For example, it could not react to the status of the downstream
resources. And it would be unable to reach “Level 5” of the
<a href="https://sdk.operatorframework.io/docs/overview/operator-capabilities/">operator capability levels</a>
defined by Operator SDK.</p>

<p>The webapp operator used the
<a href="https://github.com/operator-framework/helm-operator-plugins/blob/main/docs/tutorial.md">hybrid helm operator</a>
approach, which runs the helm reconciler within a
<a href="https://github.com/kubernetes-sigs/controller-runtime">controller-runtime</a>
controller manager. With a hybrid operator, the helm reconciler can also be
configured with a translator to transform the resource into a more suitable
form, or to apply complex transformations that would be difficult to express in
helm templates. The hybrid approach is intended for either mixing with
controller-runtime reconcilers, or as a transition path to replacement of the
helm-operator.</p>

<p>This post is about how we achieved that replacement, and some of the challenges
along the way.</p>

<h2 id="the-game-plan">The Game Plan</h2>

<p>Since helm-operator was already being invoked as a controller added to a
controller-runtime controller Manager, it made sense to
<a href="https://book.kubebuilder.io/reference/markers/scaffold#to-set-up-a-controller">mount a second controller</a>
for the same resource, and gradually move resources from being managed by helm
to being managed by the Go controller. Transitioning one downstream resource at
a time would allow us to deploy to development and non-production clusters first
to shake out issues with the process and reveal unanticipated problems with the
migration. We carefully planned the order of resources to minimize risk and
impact to running services, starting with ConfigMaps, and ending with the
Deployment.</p>

<h2 id="challenges">Challenges</h2>

<p>The simplest part of the transition was translating each templated resource
in the chart into the corresponding Go struct, filled out and submitted to the
api-server via the controller-runtime client. Beyond that mostly mechanical
transformation, there were a few other details to work out.</p>

<h3 id="migrating-object-ownership">Migrating object ownership</h3>

<p>The helm reconciler uses helm as a library, and as such behaves very similarly
to invoking helm from the command line. When updating a helm release, if the new
version of the chart removes resources compared to the installed manifest, then
helm will delete those resources. This creates a challenge for migrating the
resources from helm over to the new controller. While some resources may be
more-or-less harmless to delete and recreate, others like Deployments would
cause workloads to restart across the whole cluster. Recreating an Ingress would
disrupt service as the cloud load balancer is recreated, which takes minutes.</p>

<div class="image-right" style="width: 250px">
  <p><img src="https://superorbital.io/blog/from-helm-operator-to-go-controller/helm.svg" alt="Helm logo"></p>
</div>

<p><strong>Controlling helm</strong>: Fortunately, helm honors
<a href="https://helm.sh/docs/howto/charts_tips_and_tricks/#tell-helm-not-to-uninstall-a-resource">an annotation</a>,
<code>helm.sh/resource-policy: keep</code>, which, when applied to a chart resource,
prevents helm from deleting that resource if it is removed from the release, or
if the release is deleted. Helm-operator also normally applies <code>ownerReferences</code>
to chart resources to control
<a href="https://kubernetes.io/docs/concepts/architecture/garbage-collection/">garbage collection</a>: when
the upstream resource is deleted, the owner references on the downstream
resources ensure they are cleaned up. Applying the annotation also
disables adding the owner references.</p>
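<p>In chart terms, phase one of the migration is just a metadata change to every templated resource (the ConfigMap here is an illustrative stand-in for the chart’s actual resources):</p>

```yaml
# Phase 1: tell helm to keep this resource when it later disappears from the chart.
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-config
  annotations:
    "helm.sh/resource-policy": keep
```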

<p>This requires two phases to roll out. An initial release must update the helm
chart to apply the annotation. Then a second rollout will migrate the resource
by removing it from the chart, and allowing the Go controller to adopt the
resource. The new Go controller will also apply owner references using
<code>controllerutil.SetControllerReference()</code>. But note that between the rollouts,
the downstream resource may not have any <code>ownerReferences</code> applied. In practice,
we found that when <code>resource-policy: keep</code> is added, helm-operator does not
remove the owner references that it added when the annotation was not present.
However, any new webapps created between the rollout phases would not have owner
references.</p>

<p><strong>Finalizer for deletion</strong>: Because <code>resource-policy: keep</code> suppresses
owner references, another problem arises: after the first phase of the rollout (where the
annotation is applied, but helm is still managing resources), the downstream
resources will not be deleted when the webapp is deleted. To address this, we
had the new Go controller apply a finalizer to the webapp resource. When a
webapp is deleted, the controller code for the finalizer deletes the
downstream resources in place of garbage collection.</p>
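<p>The finalizer flow can be sketched as a stdlib-only model (the type, function, and finalizer name below are illustrative; the real controller operates on Kubernetes objects through controller-runtime):</p>

```go
package main

const webappFinalizer = "example.com/webapp-cleanup" // hypothetical finalizer name

type webApp struct {
	deleted    bool     // stands in for a non-nil metadata.deletionTimestamp
	finalizers []string // stands in for metadata.finalizers
}

// reconcileFinalizer ensures the finalizer is present while the webapp lives;
// once the webapp is deleted, it deletes the downstream resources and then
// removes the finalizer so that deletion can complete.
func reconcileFinalizer(app *webApp, deleteDownstream func()) {
	if !app.deleted {
		for _, f := range app.finalizers {
			if f == webappFinalizer {
				return // already present
			}
		}
		app.finalizers = append(app.finalizers, webappFinalizer)
		return
	}
	// Deletion requested: clean up in place of garbage collection,
	// then release the finalizer.
	deleteDownstream()
	kept := app.finalizers[:0]
	for _, f := range app.finalizers {
		if f != webappFinalizer {
			kept = append(kept, f)
		}
	}
	app.finalizers = kept
}
```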

<h3 id="contingencies-for-rollback">Contingencies for rollback</h3>

<p>Any large migration like this comes with risk of unforeseen problems and
mistakes, so rollbacks must be considered within the plan.</p>

<p><strong>Feature flags</strong>: We decided to add a feature flag for each downstream resource
that directs it to be managed by the Go controller. When off, helm-operator
continues to include the resource in the chart, and the Go controller will not
manage creating or updating the resource. However, because we have prevented
helm from deleting anything, the Go controller always deletes resources that are
no longer needed due to a configuration change in the webapp, or—via the
finalizer—due to deleting the webapp. That is, the create-and-update code is
behind the feature flags, but the delete and finalizer code is not.</p>
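<p>A stdlib-only sketch of that gating (the names are ours, not the client’s code): create-and-update runs only when the resource’s flag is on, while deletion of obsolete resources always runs:</p>

```go
package main

type actions struct{ created, deleted bool }

// reconcileResource models one downstream resource per reconcile pass.
// goControllerOwns is the resource's feature flag; needed says whether the
// webapp's configuration still calls for the resource; exists says whether
// it is currently in the cluster.
func reconcileResource(goControllerOwns, needed, exists bool, a *actions) {
	if needed {
		if goControllerOwns {
			a.created = true // create or update the downstream resource
		}
		// Flag off: helm-operator still renders it from the chart.
		return
	}
	if exists {
		a.deleted = true // delete obsolete resources regardless of the flag
	}
}
```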

<p><strong>Equivalent controllers</strong>: The Go controller should, initially, precisely match
the output of the helm controller. New behaviors can come in subsequent
releases. Ensuring the controllers have equivalent output was taken on by the
unit tests. Existing unit tests in the project, using EnvTest, had pretty good
coverage of the expected output for a given input manifest. I adapted the tests
to be run twice, with each controller enabled.</p>

<p><strong>Testing roll-forward and -back</strong>: Because accidentally deleting resources
would cause a major disruption, I added tests of roll-forward and roll-back of
each feature flag that ensure each object is not deleted by mistake in the
process. A couple of techniques facilitate these tests:</p>

<ul>
  <li>Recording and checking the UID of an object ensures that it is indeed the same
object, and not an object that has been deleted and recreated with the same
object key (name, namespace, apiVersion, and kind).</li>
  <li>Adding a label, “reconciled-by”, to the downstream resources that is set
uniquely by the two controllers allows detecting when each controller has
reconciled after changing feature flags and restarting the controller-manager.</li>
</ul>
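<p>These two checks can be sketched with plain Go types (all names here are illustrative; the real tests run under EnvTest against Kubernetes objects):</p>

```go
package main

type objectKey struct{ apiVersion, kind, namespace, name string }

type trackedObject struct {
	key    objectKey
	uid    string // Kubernetes assigns a fresh UID to a recreated object
	labels map[string]string
}

// survivedFlagFlip reports whether "after" is the very same object as
// "before" (same UID, not a delete-and-recreate with the same key) and
// whether the expected controller has reconciled it since the flag change.
func survivedFlagFlip(before, after trackedObject, wantController string) bool {
	return before.key == after.key &&
		before.uid == after.uid &&
		after.labels["reconciled-by"] == wantController
}
```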

<div class="image-right" style="width: 250px">
  <p><img src="https://superorbital.io/blog/from-helm-operator-to-go-controller/cuckoo_nest.webp" alt="Cuckoo Nest">
<br>Much like a cuckoo, we tricked helm into adopting our children.
<br><a href="https://avianres.biomedcentral.com/articles/10.1186/s40657-020-00220-x/figures/1">Image source</a></p>
</div>

<p><strong>Tricking Helm</strong>: After a rollback, helm needs to resume control of the
downstream resources. Helm normally does not want to trample resources that it
did not create. We can trick Helm into adopting resources created by the Go
controller by applying the labels and annotations that it adds to chart
resources to mark them as helm managed. Adoption requires a label
(<code>"app.kubernetes.io/managed-by": "Helm"</code>) and two annotations
(<code>"meta.helm.sh/release-name"</code> and <code>"meta.helm.sh/release-namespace"</code>).
Helm-operator names the helm release after the upstream resource’s
<code>metadata.name</code>, so the controller assigns these annotations with the webapp’s
name and namespace.</p>
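<p>A small sketch of that adoption metadata (the helper name is ours; the label and annotation keys are the ones helm checks):</p>

```go
package main

// helmAdoptionMeta collects the one label and two annotations that helm
// checks before adopting an existing resource into a release. Helm-operator
// names the release after the upstream resource's metadata.name, so the
// webapp's name and namespace are passed in.
func helmAdoptionMeta(releaseName, releaseNamespace string) (labels, annotations map[string]string) {
	labels = map[string]string{
		"app.kubernetes.io/managed-by": "Helm",
	}
	annotations = map[string]string{
		"meta.helm.sh/release-name":      releaseName,
		"meta.helm.sh/release-namespace": releaseNamespace,
	}
	return labels, annotations
}
```

<p>Applying these to each downstream resource ahead of a potential rollback lets helm treat the objects as members of the release.</p>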

<h3 id="argo-sync-policy">Argo sync-policy</h3>

<p>In the customer’s environment, every WebApp resource is deployed by an Argo
Application. Argo is configured to
<a href="https://argo-cd.readthedocs.io/en/latest/user-guide/auto_sync/#automatic-pruning">automatically prune</a>
resources that are no longer part of the Application manifest. Argo labels the
resources it creates with <code>argocd.argoproj.io/instance</code>. If a resource has this
label, but is not part of the current manifest, then Argo will prune the
resource. Pruned resources are by default deleted with foreground propagation,
so downstream resources should be deleted as well. For example, if a Deployment is
pruned, then its ReplicaSets and Pods will be cleaned up by garbage collection.
Additionally, if a resource is owned (has an <code>ownerReferences</code> entry), then it
will not be pruned.</p>

<p>The webapp operator copies labels from the webapp resource to most downstream
resources. Since the webapp is labeled with <code>argocd.argoproj.io/instance</code>, but
its downstream resources are not part of the manifest, those downstream
resources are subject to pruning, unless they have <code>ownerReferences</code>.</p>

<p>Normally, a controller should own its downstream resources. But when we added
<code>helm.sh/resource-policy: keep</code> to resources, that also caused helm-operator to
stop adding <code>ownerReferences</code> to those resources<sup id="fnref:1" role="doc-noteref"><a href="https://superorbital.io#fn:1" class="footnote" rel="footnote">1</a></sup>. In combination with the
propagated <code>instance</code> labels, that makes them vulnerable to pruning.</p>

<p>During the window between rolling out the release where the helm chart is changed
to add <code>resource-policy: keep</code> and enabling the new Go controller (which
would re-add <code>ownerReferences</code>), there’s a risk that Argo may come along and
prune the very resources we are trying so hard to keep from being deleted.</p>

<p>Fortunately, Argo has an annotation that can be applied to resources that
disables pruning of that resource:
<code>"argocd.argoproj.io/sync-options": "Prune=false"</code>. So we added that to the list
of annotations added to downstream resources by both the helm chart and the Go
controller.</p>

<p>In a future update to the webapp operator, we can avoid this problem more simply
by being more selective about what labels are copied from parent resources to
child resources. Lesson learned: It’s not a good practice to blindly copy all
labels to child resources.</p>

<h2 id="cleaning-up">Cleaning up</h2>

<p>After successfully rolling out the feature flags to the various non-prod and
production clusters, it was time to remove the now-redundant helm chart and
helm-operator from the operator codebase. Those roll-forward-and-back tests had
served their purpose and were retired. All those labels and annotations? The
operator can stop adding those too. And the duty of the finalizer code to delete
the downstream resources can be re-assumed by the kubernetes garbage collector
now that <code>ownerReferences</code> are back in place.</p>

<p>Looking at the clusters, though, there is now an (empty) helm release for every
webapp. The last step was to write a script to <code>helm uninstall</code> each of those releases
after verifying it is indeed empty with <code>helm get manifest</code>. While we’re at it,
we don’t need those labels and annotations anymore. As a one-time change, it’s
simpler to remove them, and the <code>metadata.finalizers</code> entry, in the cleanup
script than to modify the controller to remove them.</p>

<h2 id="conclusion-and-summary">Conclusion and Summary</h2>

<p>In total, the transition from helm-operator to an equivalent Go controller was
aided by 6 labels and annotations on downstream resources, and a finalizer on
the upstream webapp resource, summarized in the listing below. Most of those can
subsequently be removed along with the feature flags and helm charts after the
transition is complete and stable. The rollback plan unfortunately needed to be
enacted, but fortunately it was successful in mitigating the impact of bugs in
the new Go controller.</p>

<pre><code class="language-yaml">---
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: default
  name: my-webapp-config
  labels:
    app.kubernetes.io/managed-by: Helm   # Trick helm into adopting this in case of roll-back
    reconciled-by: webapp-go-controller  # Keep track of which controller last reconciled
  annotations:
    helm.sh/resource-policy: keep        # Tell helm not to delete this when removed from the chart.
    meta.helm.sh/release-name: my-webapp # Inform helm which release this should be adopted into.
    meta.helm.sh/release-namespace: default
    argocd.argoproj.io/sync-options: Prune=false  # Don't let Argo prune this!
  ownerReferences:
  - apiVersion: example.com/v1alpha1     # Retain ownerReferences
    kind: WebApp
    blockOwnerDeletion: true
    controller: true
    name: my-webapp
...
</code></pre>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">

      <p>While helm-operator won’t <em>add</em> the owner reference anymore, neither will it
delete the existing entries. In practice, this means this problem only
affected webapp resources that were newly created (or recreated) during the
rollout window. <a href="https://superorbital.io#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
    </li>
  </ol>
</div>
]]></content>
    <summary type="html">Helm-operator is easy, but what do you do when you need more from your controller? It's time to migrate to Kubebuilder and write your controller in Go. </summary>
  </entry>
  <entry>
    <id>tag:superorbital.io,2024-11-06:/blog/managing-slurm-at-scale/</id>
    <title type="html">Managing Slurm at Scale</title>
    <published>2024-11-06T00:00:00Z</published>
    <updated>2024-11-06T00:00:00Z</updated>
    <link rel="alternate" href="https://superorbital.io/blog/managing-slurm-at-scale/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+SuperOrbital" type="text/html"/>
    <content type="html"><![CDATA[<details>
  <summary><em>Table of Contents</em></summary>
  <ul>
    <li>
<a href="https://superorbital.io#building-your-cluster-building-your-cluster">Building your cluster</a>
      <ul>
        <li><a href="https://superorbital.io#setup-munge-setup-munge">Setup Munge</a></li>
        <li><a href="https://superorbital.io#setup-slurm-setup-slurm">Setup Slurm</a></li>
      </ul>
    </li>
    <li>
<a href="https://superorbital.io#configuration-deep-dive-configuration-deep-dive">Configuration Deep Dive</a>
      <ul>
        <li><a href="https://superorbital.io#queue-and-workload-management-queue-and-workload-management">Queue and Workload Management</a></li>
        <li><a href="https://superorbital.io#handling-node-failures-handling-node-failures">Handling Node Failures</a></li>
        <li><a href="https://superorbital.io#useful-plugins-useful-plugins">Useful Plugins</a></li>
      </ul>
    </li>
    <li>
<a href="https://superorbital.io#essential-tools-for-managing-slurm-essential-tools-for-managing-slurm">Essential Tools for Managing Slurm</a>
      <ul>
        <li><a href="https://superorbital.io#sinfo-sinfo">sinfo</a></li>
        <li><a href="https://superorbital.io#scontrol-scontrol">scontrol</a></li>
        <li><a href="https://superorbital.io#srunsbatch-srun-sbatch">srun/sbatch</a></li>
        <li><a href="https://superorbital.io#submitit-submitit">Submitit</a></li>
        <li><a href="https://superorbital.io#slurm-exporter-slurm-exporter">slurm-exporter</a></li>
      </ul>
    </li>
    <li><a href="https://superorbital.io#in-conclusion">Conclusion</a></li>
    <li><a href="https://superorbital.io#further-reading-and-resources-further-reading-and-resources">Further Reading and Resources</a></li>
  </ul>
</details>

<p>In our <a href="https://superorbital.io/blog/slurm-an-hpc-scheduler-for-batch-workloads/">previous article</a>, Sean introduced <a href="https://slurm.schedmd.com/documentation.html">Slurm</a> as a powerful HPC scheduler for batch workloads. That post serves as an excellent jumping-off point for those new to Slurm, and I highly recommend reading it before this one if you’re unfamiliar with the basics. Today, we’re going to take a more in-depth look at Slurm configuration, provisioning, and management so that you can build and manage your own clusters. Slurm has gained significant traction in AI workloads recently, but we’ll be focusing on CPU-based workloads in this article to keep things manageable. We’ll explore using Slurm for GPU training with PyTorch, as well as other AI applications, in a future post.</p>

<h2 id="building-your-cluster">Building your cluster</h2>

<p>Before we dive into advanced configurations and management techniques, let’s start with setting up a basic Slurm cluster. A great resource for this is the <a href="https://github.com/SergioMEV/slurm-for-dummies">Slurm for Dummies</a> GitHub repository, which I’ve found useful when working with providers that don’t offer managed Slurm solutions. I’ll summarize the basics here. We’re going to assume you have a controller node plus some worker nodes that have Ubuntu installed and can communicate with each other over SSH. The controller node is just a Linux VM (let’s assume Ubuntu); it schedules jobs rather than running workloads, so it can typically be smaller than the worker nodes. Worker nodes are also just Linux VMs, but they have the resources necessary to run the workload, which may mean more CPUs and memory, or specialized accelerators like GPUs.</p>

<h3 id="setup-munge">Setup Munge</h3>

<p>Munge is used for authentication between nodes. Start with the controller node, and install the packages with:</p>

<pre><code class="language-console">sudo apt-get install munge libmunge2 libmunge-dev
</code></pre>

<p>You should now see a key installed at <code>/etc/munge/munge.key</code>; if not, run the following command to create one:</p>

<pre><code class="language-console">sudo /usr/sbin/mungekey
</code></pre>

<p>At this point, munge should have created a user, and all that remains is to give that user the correct file permissions, which can be done by running:</p>

<pre><code class="language-console">sudo chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
sudo chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
sudo chmod 0755 /run/munge/
sudo chmod 0700 /etc/munge/munge.key
sudo chown -R munge: /etc/munge/munge.key
</code></pre>

<p>Then to configure the munge service to run at startup:</p>

<pre><code class="language-console">systemctl enable munge
systemctl restart munge
</code></pre>

<p>Now, for the worker nodes, follow the same procedure, <strong>except</strong> copy the munge key at <code>/etc/munge/munge.key</code> from the controller instead of using the generated one. Make sure to do this before running the file permission commands, and your workers should also be good to go. You can test this by running:</p>

<pre><code class="language-console">munge -n | ssh &lt;CONTROLLER_NODE_HOSTNAME&gt; unmunge 
</code></pre>

<p>from the worker nodes.</p>

<h3 id="setup-slurm">Setup Slurm</h3>

<p>To start, on all nodes run:</p>

<pre><code class="language-console">sudo apt-get update
sudo apt-get install -y slurm-wlm
</code></pre>

<p>Next, you can use Slurm’s handy configuration file generator, which is located at <code>/usr/share/doc/slurmctld/slurm-wlm-configurator.html</code> (open the file with your browser) to create your configuration file. You can learn all about the configuration options <a href="https://slurm.schedmd.com/slurm.conf.html">here</a>, but you only need to configure the following to get started:</p>

<ul>
<li>ClusterName - whatever name you’d like for your cluster; it must be lowercase and 40 characters or less</li>
<li>SlurmctldHost - The hostname of the machine where the Slurm control daemon is executed (you can find this by running <code>hostname -s</code> on the machine). This hostname is optionally followed by either the IP address or another name by which the address can be identified, enclosed in parentheses, e.g.:</li>
</ul>

<pre><code class="language-conf">SlurmctldHost=slurmctl-primary(12.34.56.78)
</code></pre>

<ul>
  <li>NodeName - the output of <code>hostname -s</code> again, but for the worker nodes. Ideally, they are numbered, and you can refer to them like <code>&lt;hostname-prefix&gt;[1-4]</code>; otherwise, you can have multiple entries, one per worker node.</li>
  <li>The values for CPUs, Sockets, CoresPerSocket, and ThreadsPerCore are based on the results from running <code>lscpu</code> on a worker node.</li>
  <li>ProctrackType - LinuxProc, unless you’ve installed <strong><a href="https://slurm.schedmd.com/cgroups.html">proctrack/cgroup</a></strong>, in which case it will be used by default if you don’t set this option.</li>
</ul>

<p>Save the tool’s text output to <code>/etc/slurm/slurm.conf</code> and copy it to the same path on every worker node.</p>
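<p>Put together, a minimal <code>slurm.conf</code> might look something like this (the hostnames and hardware counts are examples only; use your own <code>hostname -s</code> and <code>lscpu</code> output):</p>

```conf
ClusterName=mycluster
SlurmctldHost=slurmctl-primary(12.34.56.78)
ProctrackType=proctrack/linuxproc
# Hardware values must match lscpu on the workers (1 socket x 4 cores x 2 threads = 8 CPUs).
NodeName=worker[1-4] CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
PartitionName=main Nodes=worker[1-4] Default=YES State=UP
```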

<p>Finally, you can enable the Slurm controller service to run on startup with the following (on the worker nodes, enable and restart <code>slurmd</code> instead of <code>slurmctld</code>):</p>

<pre><code class="language-console">systemctl enable slurmctld
systemctl restart slurmctld
</code></pre>

<p>At this point, you can check if the cluster is set up correctly by running:</p>

<pre><code class="language-console">srun hostname
</code></pre>

<h2 id="configuration-deep-dive">Configuration Deep Dive</h2>

<p>Now that we have a basic Slurm cluster up and running, let’s explore some more advanced configuration options and common use cases. For full reference docs, refer to <a href="https://slurm.schedmd.com/slurm.conf.html">this page</a>.</p>

<h3 id="queue-and-workload-management">Queue and Workload Management</h3>

<p>Slurm’s queuing system, known as partitions, allows you to organize and prioritize jobs efficiently. Here’s an example of defining partitions in your <code>slurm.conf</code>:</p>

<pre><code class="language-conf">PartitionName=debug Nodes=node[1-4] Default=YES MaxTime=01:00:00 State=UP

PartitionName=batch Nodes=node[5-20] MaxTime=08:00:00 State=UP
</code></pre>

<p>This configuration creates two partitions:</p>

<ol>
  <li>A “debug” partition for short-running jobs (max 1 hour) on nodes 1-4.</li>
  <li>A “batch” partition for longer-running jobs (max 8 hours) on nodes 5-20.</li>
</ol>

<p>You can further customize these partitions with options like:</p>

<ul>
  <li>
<code>PriorityTier</code>: Set priority levels for partitions.</li>
  <li>
<code>PreemptMode</code>: Configure how jobs can be preempted.</li>
  <li>
<code>OverSubscribe</code>: Allow multiple jobs to run on a single node simultaneously.</li>
</ul>
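<p>For example (the values below are illustrative), a higher-priority partition that preempts by requeueing and allows up to two jobs per node might be declared as:</p>

```conf
# Hypothetical partition combining the three options above.
PartitionName=urgent Nodes=node[1-4] MaxTime=04:00:00 State=UP PriorityTier=10 PreemptMode=REQUEUE OverSubscribe=FORCE:2
```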

<h3 id="handling-node-failures">Handling Node Failures</h3>

<p>Slurm provides robust tools for managing node states and handling failures:</p>

<ol>
  <li>
<strong>Draining Nodes:</strong> When you need to perform maintenance on a node, you can drain it:</li>
</ol>

<pre><code class="language-console">scontrol update NodeName=node5 State=DRAIN Reason="Scheduled maintenance"
</code></pre>

<p>This prevents new jobs from being scheduled on the node while allowing current jobs to complete.</p>

<ol start="2">
  <li>
<strong>Automatic Node Failure Detection:</strong> Configure the <code>SlurmdTimeout</code> option in <code>slurm.conf</code> to automatically mark nodes as down if they don’t respond:</li>
</ol>

<pre><code class="language-conf">SlurmdTimeout=300
</code></pre>

<ol start="3">
  <li>
<strong>ResumeProgram and SuspendProgram:</strong> These scripts can automatically handle node power management:</li>
</ol>

<pre><code class="language-conf">ResumeProgram=/usr/local/bin/slurm_resume.sh

SuspendProgram=/usr/local/bin/slurm_suspend.sh
</code></pre>

<h3 id="useful-plugins">Useful Plugins</h3>

<p>Slurm’s plugin architecture allows for extensive customization. Here are a few particularly useful plugins:</p>

<ol>
  <li>
<strong><a href="https://slurm.schedmd.com/job_submit_plugins.html">job_submit/lua</a>:</strong> Allows you to write custom job submission filters and modifications in Lua.</li>
  <li>
<strong><a href="https://slurm.schedmd.com/cgroups.html">proctrack/cgroup</a>:</strong> Provides better process tracking and resource management using Linux cgroups.</li>
  <li>
<strong><a href="https://slurm.schedmd.com/cons_tres.html">select/cons_tres</a>:</strong> Enables Trackable RESources (TRES) for more granular resource allocation.</li>
</ol>

<p>To enable a plugin, add it to the appropriate line in your <code>slurm.conf</code>, for example:</p>

<pre><code class="language-conf">JobSubmitPlugins=lua
ProctrackType=proctrack/cgroup
SelectType=select/cons_tres
</code></pre>

<h2 id="essential-tools-for-managing-slurm">Essential Tools for Managing Slurm</h2>

<h3 id="sinfo">sinfo</h3>

<p><code>sinfo</code> is your go-to command for getting an overview of your cluster’s state. Some useful options include:</p>

<ul>
  <li>
<code>sinfo -Nel</code>: Provides a detailed node-oriented view.</li>
  <li>
<code>sinfo -t idle,mix,alloc</code>: Shows nodes in specific states.</li>
  <li>
<code>sinfo -o "%n %c %m %t"</code>: Customizes output to show node name, CPUs, memory, and state.</li>
</ul>

<h3 id="scontrol">scontrol</h3>

<p><code>scontrol</code> is a powerful tool for viewing and modifying Slurm’s configuration. Some common uses:</p>

<ul>
  <li>
<code>scontrol show job &lt;job_id&gt;</code>: Displays detailed information about a specific job.</li>
  <li>
<code>scontrol update JobId=&lt;job_id&gt; TimeLimit=02:00:00</code>: Modifies a running job’s time limit.</li>
  <li>
<code>scontrol reconfigure</code>: Reloads the Slurm configuration without restarting services.</li>
</ul>

<h3 id="srun-sbatch">srun/sbatch</h3>

<p>These commands are the primary ways to submit jobs to your Slurm cluster. While <code>srun</code> is used for interactive jobs, <code>sbatch</code> handles batch job submissions.</p>

<p>Here are some examples of running interactive jobs with <code>srun</code>:</p>

<pre><code class="language-console"># Basic interactive job
srun --pty bash
# Request specific resources
srun --cpus-per-task=4 --mem=8G --time=2:00:00 --pty bash
# Run a specific command across multiple nodes
srun --nodes=2 hostname
</code></pre>

<p>Here’s an example of running a batch job with <code>sbatch</code>. Note that you can use the special <code>#SBATCH</code> comments to set command-line arguments, or pass them to the <code>sbatch</code> command at runtime, depending on your use case. I’ve also included some echo statements to print useful metadata:</p>

<pre><code class="language-console">#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

echo "Date start                = $(date)"
echo "Initiating Host           = $(hostname)"
echo "Working Directory         = $(pwd)"
echo ""
echo "Number of Nodes Allocated = ${SLURM_JOB_NUM_NODES}"
echo "Number of Tasks Allocated = ${SLURM_NTASKS}"
echo ""

python my_script.py

RETURN=${?}

echo ""
echo "Exit code                 = ${RETURN}"
echo "Date end                  = $(date)"
echo ""
</code></pre>

<p>Check out the mpi-ping-pong.py script from our <a href="https://superorbital.io/blog/slurm-an-hpc-scheduler-for-batch-workloads/">previous article</a> for a more realistic example of a task to play around with.</p>

<p>Another cool feature that you can take advantage of is job arrays. Job arrays are perfect for parameter sweeps or for processing multiple datasets. Here’s an example with <code>sbatch</code>:</p>

<pre><code class="language-console">#!/bin/bash
#SBATCH --array=0-15
#SBATCH --output=array_%A_%a.out
# $SLURM_ARRAY_TASK_ID contains the array index
python process.py --input-file=dataset_${SLURM_ARRAY_TASK_ID}.txt
</code></pre>

<p>You can also create workflows by introducing dependencies between jobs, for example:</p>

<pre><code class="language-console"># Wait for job completion
sbatch --dependency=afterok:12345 script.sh

# Wait for job start
sbatch --dependency=after:12345 script.sh

# Wait for multiple jobs
sbatch --dependency=afterany:12345:12346:12347 script.sh
</code></pre>
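<p>In practice, you’ll usually capture the job ID from the first submission rather than hard-coding it. <code>sbatch --parsable</code> prints just the job ID, which makes chaining straightforward (the script names here are hypothetical):</p>

<pre><code class="language-console"># Submit the first job and capture its ID
JOB_ID=$(sbatch --parsable preprocess.sh)

# Submit the second job to run only if the first succeeds
sbatch --dependency=afterok:${JOB_ID} train.sh
</code></pre>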

<p>You can read all about <code>sbatch</code> <a href="https://slurm.schedmd.com/sbatch.html">here</a> and <code>srun</code> <a href="https://slurm.schedmd.com/srun.html">here</a>.</p>

<h3 id="submitit">Submitit</h3>

<p><a href="https://github.com/facebookincubator/submitit">Submitit</a> is a Python package that provides a user-friendly interface for submitting and managing Slurm jobs. It’s particularly useful for data scientists and researchers who prefer working in Python environments.</p>

<p>Here’s a simple example of using submitit:</p>

<pre><code class="language-python">import submitit

def train_model(learning_rate, batch_size):
    # Your training code here; return the resulting metric
    accuracy = 0.0
    return accuracy

executor = submitit.SlurmExecutor(folder="log_test")
executor.update_parameters(time=60, mem_gb=8, cpus_per_task=4)

jobs = executor.map_array(train_model,
                          [0.01, 0.001, 0.0001],  # learning rates
                          [32, 64, 128])          # batch sizes

results = [job.result() for job in jobs]
</code></pre>

<p>This script submits three jobs, one per (learning rate, batch size) pair, since <code>map_array</code> pairs its argument lists element-wise like Python’s built-in <code>map</code>. Each job gets 4 CPUs, 8GB of memory, and a 60-minute time limit.</p>
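<p>To sweep the full 3x3 grid of nine hyperparameter combinations instead, you can expand the combinations before mapping, for example with <code>itertools.product</code>. Here’s a sketch that reuses the <code>train_model</code> and <code>executor</code> from the snippet above:</p>

<pre><code class="language-python">from itertools import product

learning_rates = [0.01, 0.001, 0.0001]
batch_sizes = [32, 64, 128]

# Expand to all 9 (learning_rate, batch_size) combinations
combos = list(product(learning_rates, batch_sizes))

# Unzip into the parallel lists that map_array expects
lrs, bss = zip(*combos)

# jobs = executor.map_array(train_model, lrs, bss)  # one job per combination
</code></pre>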

<h3 id="slurm-exporter">slurm-exporter</h3>

<p>The <a href="https://github.com/vpenso/prometheus-slurm-exporter">slurm-exporter</a> allows you to export Slurm metrics to Prometheus, enabling advanced monitoring and alerting capabilities.</p>

<p>To set it up:</p>

<ol>
  <li>Install and configure Prometheus.</li>
  <li>Install the slurm-exporter:</li>
</ol>

<pre><code class="language-console">go install github.com/vpenso/prometheus-slurm-exporter@latest
</code></pre>

<ol>
  <li>Run the exporter:</li>
</ol>

<pre><code class="language-console">prometheus-slurm-exporter
</code></pre>

<ol>
  <li>Add the following to your Prometheus scrape configuration:</li>
</ol>

<pre><code class="language-yaml">scrape_configs:
  - job_name: 'slurm'
    static_configs:
      - targets: ['localhost:8080']
</code></pre>

<p>With this setup, you can create detailed dashboards in Grafana to visualize your cluster’s performance and utilization.</p>

<h2 id="in-conclusion">In Conclusion</h2>

<p>In this article, we’ve covered how to set up your own simple Slurm cluster, walked through some useful configurations to make things more robust, and finally talked about the tools you’ll need to actually manage the cluster. Now you’re ready to start running jobs on your shiny new cluster! In future articles, we’ll explore topics like using Slurm for distributed PyTorch training, optimizing GPU utilization, and integrating Slurm with Docker. For now though, happy Slurming!</p>

<h2 id="further-reading-and-resources">Further Reading and Resources</h2>

<ul>
  <li><a href="https://superorbital.io/blog/slurm-an-hpc-scheduler-for-batch-workloads/">Slurm: An HPC Scheduler for Batch Workloads</a></li>
  <li><a href="https://slurm.schedmd.com/overview.html">Official Slurm Documentation</a></li>
  <li><a href="https://github.com/SergioMEV/slurm-for-dummies">Slurm for Dummies GitHub Repository</a></li>
  <li><a href="https://github.com/facebookincubator/submitit">Submitit GitHub Repository</a></li>
  <li><a href="https://github.com/vpenso/prometheus-slurm-exporter">Prometheus Slurm Exporter</a></li>
</ul>
]]></content>
    <summary type="html">Taking an in-depth look at Slurm configuration, provisioning, and management so that you can build and manage your own clusters</summary>
  </entry>
</feed>
