2024/09/12

Background

I set up a single-node microk8s Kubernetes cluster to manage my lightswitch and other dumb home automation tasks.

That same host also ran a few non-k8s workloads, like the controller for my home networking system.

I procured an extra PC that I planned to use as a redundant mirror of my single-node cluster. If the primary failed, I could simply change the IP address on the secondary to match the IP address of the failed primary, and all the devices around the house would observe no difference and keep on humming.

It turns out this was a reasonably good DR plan!

Literally two days after I got the secondary PC, the primary suffered a catastrophic failure and would not boot. I had not yet begun the mirroring process. Yikes!

Recovery

Despite not having set up the secondary PC yet, I was able to quickly recover most of my services. I followed a modified version of my original DR plan. On a fresh Linux install, I configured the new PC to have the old PC’s IP address. I followed public instructions and private notes for setting up the network controller software.

I had a backup of my network controller config, although it was old enough that it did not include my new edge router.

My custom applications - homeslice - were trivial to reinstall: a single pulumi up was all it took. IaC is a powerful ally!

The tricky part was configuring the microk8s container registry the way I needed it. Public documentation of my exact config is not available, and my notes weren’t great. Fortunately, someone blogged about the exact process - me! Following my own instructions, I finished up my recovery.

I had almost everything back up and running within a few distracted evening hours. homeslice, by design, hardcodes some IP addresses for light bulbs and so forth. After provisioning the new network controller, some devices got new IP addresses. The fix was simply to change the config and run pulumi up again. Did I mention how great IaC is?
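For illustration, here's roughly what that looks like (a simplified sketch, not homeslice's actual code - the config key name is made up): the device IP lives in Pulumi stack config rather than in application code, so a re-IP'd device is a one-line config change.

```python
import pulumi

# Hypothetical config key; set it with:
#   pulumi config set homeslice:lightswitch_ip 192.168.1.50
config = pulumi.Config("homeslice")
lightswitch_ip = config.require("lightswitch_ip")

# ...resources that talk to the device consume lightswitch_ip here...
pulumi.export("lightswitch_ip", lightswitch_ip)
```

Change the value, run pulumi up, and the stack converges on the new address.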

Root Cause

After mitigating the outage, I moved on to Root Cause Analysis. I obtained an SSD enclosure and a new SSD. The failed PC recognized the new SSD, but its original SSD, in the enclosure, was not readable. Mystery solved: SSD failure.

I put the new SSD in the old PC. Now I had two fully functional PCs, as originally planned. At this point, I admitted my deepest desire: a 3 node high-availability Kubernetes cluster. For my lightswitch.

I put in the order for a 3rd PC and started thinking about what I could do to make the next outage recoverable in minutes instead of hours.

What went well

- IaC: redeploying homeslice was a single pulumi up, and fixing re-IP'd devices was a config change plus another pulumi up.
- My own blog post covered the one fiddly step, the microk8s container registry setup.
- Most services were back within a few distracted evening hours.

What went poorly

- The primary died before I had started mirroring to the secondary.
- My network controller backup was stale and didn't include the new edge router.
- Too much of the recovery depended on notes and memory.

Where I got lucky

- The replacement PC arrived two days before the failure.
- The exact registry configuration I needed happened to be something I had already blogged about.

Proper Disaster Recovery

I can imagine two DR scenarios.

Single-node k8s cluster

If a single node cluster fails, recovery is straightforward and looks a lot like what I did: install a fresh OS, give the node the old IP address, reinstall the network controller and restore its config backup, set up microk8s and its container registry, and run pulumi up to redeploy homeslice.

Multi-node k8s cluster

A multi-node cluster of sufficient size (>= 3 nodes) should self-heal, rescheduling workloads off failed nodes onto ready nodes.

If the failed node isn’t the one running the network controller, there’s no urgency to do anything. Just repair or replace the failed node, eventually. That may require installing and configuring microk8s.

If the failed node is the one running the network controller, then recovery involves: installing the network controller on a healthy (or replacement) node, restoring its configuration from backup, and making sure the controller is reachable at its old IP address.

Those manual steps…

Both scenarios involve a lot of manual steps. If I lose my notes, or if the process for any of the steps changes, my next disaster recovery won’t go smoothly.

One plan to rule them all

I chose Ansible instead of “a script”. I had never used it before, but it was straightforward to get some working playbooks that install the network controller and microk8s, set static IPs, and perform the steps necessary to get microk8s working with my local container registry.

I tested it out on clusters of multipass VMs, then ran it against my other two nodes. I manually joined them to the primary, and then things started failing again!

False start

When my cluster had just one node, I used the hostpath-storage addon. With multiple nodes, that could still work, but only with a lot of headache to ensure stateful pods get scheduled on the node hosting their storage.

I decided to postpone multi-node and focus on perfecting my Ansible playbook and my backup jobs. Later I returned to the problem, and learned that OpenEBS is essentially a drop-in replacement for hostpath-storage (in my setup), and now I have everything I wanted: working Ansible playbooks, automated network controller backups, and a multi-node microk8s cluster in HA mode.
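To make “drop-in replacement” concrete, the swap mostly amounts to pointing PersistentVolumeClaims at a different StorageClass. A rough Pulumi sketch (the class name, namespace, and size are assumptions - check kubectl get storageclass on your own cluster):

```python
import pulumi_kubernetes as k8s

# Hypothetical PVC; names and sizes are placeholders.
pvc = k8s.core.v1.PersistentVolumeClaim(
    "chime-data",
    metadata={"name": "chime-data", "namespace": "homeslice"},
    spec={
        "accessModes": ["ReadWriteOnce"],
        # Was "microk8s-hostpath" on the single-node cluster; the OpenEBS
        # local class is, I believe, "openebs-hostpath".
        "storageClassName": "openebs-hostpath",
        "resources": {"requests": {"storage": "1Gi"}},
    },
)
```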

Along the way to getting working network controller backups, I practiced scheduling pods on specific nodes. The backup pod mounts the network controller’s local backup directory into the pod, so the pod has to run on the node that hosts the network controller.
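Here's the shape of that pinning as a Pulumi sketch (the hostname, image, paths, and destination PVC are placeholders, not my real values): a nodeSelector keeps the backup pod on the controller's node, and a hostPath volume exposes the controller's backup directory to it.

```python
import pulumi_kubernetes as k8s

backup = k8s.batch.v1.CronJob(
    "controller-backup",
    spec={
        "schedule": "0 3 * * *",  # nightly
        "jobTemplate": {"spec": {"template": {"spec": {
            # Pin to the node running the network controller (hypothetical hostname).
            "nodeSelector": {"kubernetes.io/hostname": "node-1"},
            "restartPolicy": "OnFailure",
            "containers": [{
                "name": "backup",
                "image": "alpine:3.20",
                "command": ["sh", "-c", "cp -r /controller-backups/. /dest/"],
                "volumeMounts": [
                    {"name": "controller-backups", "mountPath": "/controller-backups"},
                    {"name": "dest", "mountPath": "/dest"},
                ],
            }],
            "volumes": [
                # The controller's local backup directory on the host (hypothetical path).
                {"name": "controller-backups",
                 "hostPath": {"path": "/var/lib/controller/backups", "type": "Directory"}},
                # Wherever the copies should land (hypothetical PVC).
                {"name": "dest",
                 "persistentVolumeClaim": {"claimName": "controller-backup-pvc"}},
            ],
        }}}},
    },
)
```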

Avoiding bitrot

Scripts go out of date over time, including Ansible playbooks and Pulumi programs. Declarative IaC makes it easy to avoid surprises in a DR scenario. By design, declarative IaC is idempotent - you can apply your IaC repeatedly, and if there are no changes, it has no effect.

So the key to detecting mismatches between your IaC and the capabilities of the underlying infrastructure (new way to set static IP addresses, fresh untracked dependencies, etc.) is to apply your IaC periodically and fix what’s broken.

I can use my Ansible playbooks to perform OS upgrades, microk8s upgrades, and network controller upgrades. This way, I both test the current validity of my IaC and keep it current with my infrastructure.

I also learned about the brilliant multipass, and can use that (maybe even in an automation) to continuously validate my IaC and develop fixes.

A brighter future

The way to deal with setbacks is to come back stronger than before. That’s exactly what I did here. Now I have: working Ansible playbooks, automated network controller backups, a 3-node microk8s cluster in HA mode, and a DR plan that no longer depends on my memory or my notes.

Bonus Declarativeness

I also added some Pulumi code to populate the chime PV with the mp3 it needs to serve up.

Given the self-imposed constraints (the mp3 resides on my hard drive, is not available for GitHub image builds, and so forth), it was surprisingly challenging to come up with a declarative(-ish) way to get it done. Even without those constraints, the Kubernetes model makes it tricky to declare a pre-populated PV.
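A sketch of one way to approach it (not necessarily exactly what I shipped): use the pulumi-command provider to push the local file into a pod that mounts the PV, re-running only when the file's hash changes. The pod name, namespace, and paths below are placeholders.

```python
import hashlib

import pulumi_command as command

# The mp3 lives only on my workstation; this path is a placeholder.
MP3_PATH = "/home/me/music/chime.mp3"

# Hash the file so the copy re-runs only when the content changes.
with open(MP3_PATH, "rb") as f:
    mp3_sha = hashlib.sha256(f.read()).hexdigest()

populate_pv = command.local.Command(
    "populate-chime-pv",
    # Copy into a pod (placeholder name) that mounts the chime PV at /data.
    create=f"kubectl cp {MP3_PATH} homeslice/chime-0:/data/chime.mp3",
    triggers=[mp3_sha],
)
```

It's declarative-ish at best: Pulumi notices when the local file changes (via the hash trigger), but it can't see drift inside the PV itself.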

Still to do

The failed SSD was about 5 years old. It may have died due to excessive writes from logging, particularly once I installed Kubernetes on it. I’d like to reduce excessive logging to prolong SSD life on all my nodes.

I’d also like to eliminate the few remaining manual steps, like Homebridge backups, running pulumi up manually, and periodic manual k8s/OS upgrades.

Monitoring might have been useful here: I might have noticed a decline in write speeds or an increase in write errors. But honestly, I don’t want 3 AM alerts from my lightswitch. And even if I had gotten some advance notice of the failure, I still would have had to run a DR playbook - and now I have a good one!

Bottom line: The mindset

This episode (dare I say “incident”) reveals a mindset: when problems strike, mitigate them, then ensure they never happen again.

I could have let this go - after all, I did recover from the SSD crash. But developing a thorough DR plan was a worthwhile exercise, in the sense that fun is worthwhile.

There are overlaps and echoes here with my day job, though the clusters at work are a lot bigger than 3 nodes. But Yoda taught us not to judge things by their size.

Anyone can set up a one-node - or, with tools like multipass, an N-node - Kubernetes cluster and learn everything they need to administer and deploy apps to a 1000-node cluster.

Likewise, a 1-3 node cluster can be just as much fun as a 1000-node cluster, with the right mindset!