I haven’t upgraded the K3s cluster in the homelab for quite some time. Now seemed like a good time to do a round of upgrades before starting a new project.

What has been upgraded:

Item                       New version
K3s cluster                v1.35.1+k3s1
Calico                     v3.31.x
ArgoCD                     3.x
Longhorn                   1.10.x (Helm chart version)
Sealed Secrets             2.18.x (Helm chart version)
step-ca                    1.29.x (Helm chart version)
step-issuer                1.9.9 (Helm chart version)
MetalLB                    0.15.3 (Helm chart version)
Prometheus                 28.13.x (Helm chart version)
Loki                       6.53.x (Helm chart version)
opentelemetry-collector    0.146.1 (Helm chart version)
cert-manager               1.19.x (Helm chart version)
pve-exporter               3.8.1

An excerpt of the upgrade process is detailed in the following sections.

K3s

For K3s, the upgrade procedure boils down to:

Grab the latest version of the K3s installation script from https://get.k3s.io and overwrite the previous version stored in the Ansible role.

Change the release channel and version in the inventory:

--- a/inventory.ini
+++ b/inventory.ini
@@ -1,8 +1,8 @@
[all:vars]
# Channels and versions for k3s is located at
# https://update.k3s.io/v1-release/channels
-k3s_channel=v1.34
-k3s_version=v1.34.4+k3s1
+k3s_channel=v1.35
+k3s_version=v1.35.0+k3s1
k3s_install_script_src=installation_scripts/install_k3s.sh
k3s_install_script_dest=/usr/local/bin/install_k3s.sh
reinstall=false

Run the Ansible playbook with -e reinstall=true to trigger a rerun of the installation script using the new version. The playbook uses serial: 1 to upgrade a single node at a time, starting with the control plane nodes before proceeding to the worker nodes.
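The version pinning relies on the install script's standard environment variables. A minimal sketch of what the guarded install task might look like (the task name and role layout here are assumptions, not the actual role):

```yaml
# Hypothetical excerpt of the K3s Ansible role. INSTALL_K3S_CHANNEL and
# INSTALL_K3S_VERSION are the install script's documented environment
# variables; the task only fires when the playbook is invoked with
# `-e reinstall=true`.
- name: Run the K3s installation script
  ansible.builtin.command: "{{ k3s_install_script_dest }}"
  environment:
    INSTALL_K3S_CHANNEL: "{{ k3s_channel }}"
    INSTALL_K3S_VERSION: "{{ k3s_version }}"
  when: reinstall | bool
```

Combined with serial: 1 at the play level, this rolls the new version out one node at a time.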

That's it. I have yet to experience any issues with this approach, and I commend the maintainers of K3s for providing a great upgrade experience.

The screenshot below was taken during the upgrade to v1.32.0+k3s1. The control plane nodes have been upgraded, and the worker nodes are in the process of being upgraded:

Grafana terminal showing K3s upgrade process

Calico

When upgrading Calico there was an issue with the CPU architecture of the control plane node VMs in Proxmox. I had not switched the architecture from qemu64 to x86-64-v2-AES, resulting in this error:

csi-node-driver-registrar This program can only be run on AMD64 processors with v2 microarchitecture support.
calico-csi This program can only be run on AMD64 processors with v2 microarchitecture support.

When it started to fail I tried rolling back Calico to v3.30.x, but doing so proved more difficult (for several reasons) than just going ahead with the upgrade. I drained the control plane nodes, switched the VM CPU architecture, and rebooted; once all 3 nodes were fixed I had version v3.31.x running.
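The per-node procedure can be sketched as the usual drain/uncordon cycle (the node name below is a placeholder):

```shell
# Repeat for each control plane node, one at a time.
kubectl drain k3s-control-1 --ignore-daemonsets --delete-emptydir-data
# ...switch the VM's CPU type to x86-64-v2-AES in Proxmox and reboot it...
kubectl uncordon k3s-control-1
```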

Changes to the Terraform Proxmox VM config for control plane nodes:

--- a/environments/production/terraform.tfvars
+++ b/environments/production/terraform.tfvars
@@ -7,7 +7,7 @@ vms = [
     ip                  = "10.0.3.9/24"
     memory              = 4096
     cores               = 4,
-    cpu_architecture    = "qemu64"
+    cpu_architecture    = "x86-64-v2-AES"
     bootdisk_size       = "20"
     disks               = []
     tags                = ["k8s", "control", "ubuntu-24.04"]
@@ -20,7 +20,7 @@ vms = [
     ip                  = "10.0.3.10/24"
     memory              = 4096
     cores               = 4,
-    cpu_architecture    = "qemu64"
+    cpu_architecture    = "x86-64-v2-AES"
     bootdisk_size       = "20"
     disks               = []
     tags                = ["k8s", "control", "ubuntu-24.04"]
@@ -33,7 +33,7 @@ vms = [
     ip                  = "10.0.3.11/24"
     memory              = 4096
     cores               = 4
-    cpu_architecture    = "qemu64"
+    cpu_architecture    = "x86-64-v2-AES"
     bootdisk_size       = "20"
     disks               = []
     tags                = ["k8s", "control", "ubuntu-24.04"]

ArgoCD

ArgoCD had a long chain of releases since my last upgrade, so reading changelogs took most of the time. Not only was the major version of ArgoCD itself bumped from 2.x to 3.x (upgrade notes), but the Helm chart had also gone through a few major releases, from 7.x to 9.x. I had to follow the advice in this issue and disable redis.ha temporarily to be able to upgrade to 8.x. The upgrade from 8.x to 9.x did not cause any issues.
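The temporary workaround amounts to a values change for the argo-cd chart, roughly like this (the exact value keys are an assumption based on the chart's layout; verify against the chart version you are targeting):

```yaml
# argo-cd Helm values: temporarily run a single non-HA redis instance
# while upgrading the chart to 8.x, then revert afterwards.
redis-ha:
  enabled: false
redis:
  enabled: true
```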

Longhorn

Longhorn was time-consuming since each version upgrade was preceded by an offsite backup of all volumes. I upgraded from version 1.8.x to 1.10.x. My Longhorn installation has a tendency to attach all volumes to a single worker node, which sometimes overloads the node and causes instability. It does eventually stabilize after each upgrade, but I haven’t found the root cause yet.
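Offsite backups like these can also be scheduled declaratively with a Longhorn RecurringJob. A sketch, assuming a backup target is already configured (the name, schedule, and retention are made up for illustration):

```yaml
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: offsite-backup
  namespace: longhorn-system
spec:
  task: backup          # back up volumes to the configured backup target
  cron: "0 2 * * *"     # nightly at 02:00
  groups:
    - default           # applies to volumes in the default group
  retain: 2             # keep the two most recent backups
  concurrency: 1        # back up one volume at a time
```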

Prometheus

I temporarily lost some cluster metrics due to a missing port in a NetworkPolicy after upgrading the Prometheus chart from 27.x to 28.x (starting point was 26.x):

Screenshot of Grafana panel showing K3s cluster metrics disappearing before the NetworkPolicy is updated

I cross-checked the traffic in Whisker and calico-flow-logs-otlphttp-exporter to identify the blocked port:

Screenshot of Whisker showing blocked network traffic
Screenshot of Grafana panel showing blocked network traffic from calico-flow-logs-otlphttp-exporter

Updating the ports in the NetworkPolicy for traffic from Prometheus to the K3s VLAN fixed it:

--- a/prometheus/allow_egress_to_k3s_vlan_netpol.yaml
+++ b/prometheus/allow_egress_to_k3s_vlan_netpol.yaml
@@ -15,3 +15,4 @@ spec:
         ports:
           - 80
           - 7472
+          - 10250
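Port 10250 is the kubelet's port, which Prometheus scrapes for node and cAdvisor metrics. For context, the full policy presumably looks something like the following Calico NetworkPolicy (the name, selector, and CIDR are assumptions reconstructed from the diff and the node IPs earlier in the post):

```yaml
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: allow-egress-to-k3s-vlan
  namespace: prometheus
spec:
  selector: app.kubernetes.io/name == 'prometheus'
  types:
    - Egress
  egress:
    - action: Allow
      protocol: TCP
      destination:
        nets:
          - 10.0.3.0/24   # the K3s VLAN
        ports:
          - 80
          - 7472          # MetalLB metrics
          - 10250         # kubelet metrics (the missing port)
```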

Conclusion

Using Ansible to manage cluster configuration and ArgoCD to manage applications is one of the better investments I’ve made in the homelab. Most of my time spent upgrading is reading changelogs, taking backups of persistent data and updating config. I don’t really have to worry about versioning or how to apply changes.
