Making the K3s cluster control plane HA
I’ll admit, I haven’t been running the K3s control plane in HA. The architecture of 1 control plane node and 3 worker nodes is from a time when every node was running physically on Raspberry Pis. I then gradually migrated to VMs with no changes to the architecture.
There are a few reasons this is starting to become important:
- The nodes are long overdue for an update from Ubuntu 20.04 to Ubuntu 24.04
- In Ubuntu 20.04, the kernel version (5.15.x) can cause issues when upgrading to Longhorn version 1.6.x
- Upgrading the cluster, while mostly automated, is still a risk without HA and requires more planning
- Live migration of the control plane node VM between Proxmox hosts sometimes causes unpredictable behaviour
The new Proxmox host (Proxmox host 3) means I now have the physical resources required.
So here’s the plan:
- Add a load balancer in front of the control plane
- Change the existing single-node cluster to HA with embedded etcd
IP address assignments:
IP address | Host |
---|---|
10.0.3.100 | Load Balancer frontend VIP |
10.0.3.2 | k8s-control-1 |
10.0.3.7 | k8s-control-2 |
10.0.3.8 | k8s-control-3 |
Setting up the HAProxy plugin in OPNsense as the load balancer
I could spin up two VMs to create an HAProxy load balancer setup with failover, as in the official documentation, or I could use the OPNsense HAProxy plugin.
While I prefer IaC and configuration in code, I opted for OPNsense. Its physical host is currently underutilized, and it would be nice to get more out of it.
I won’t go into the specifics of the HAProxy plugin itself; see the official documentation for more information. Some of the screenshots were taken after the setup was confirmed working, so the titles sometimes display “Edit” instead of “New”.
Installing the plugin was fairly straightforward:
Once installed, the Services list is updated:
The VIP is defined as an IP Alias. I added this to the existing VLAN interface of the K3s cluster:
The VIP was added to a firewall alias, K3s_Control_Plane_Loadbalancer:
A new firewall rule was created on the K3s interface to allow traffic to the VIP from the worker nodes:
What I’m doing from this point on is more or less copying the HAProxy configuration from the documentation into OPNsense. I also used information from this forum post.
Adding the existing control plane host as a server:
Creating the health monitor:
Creating the backend pool:
Creating the public service (frontend):
Statistics page:
The Config Export:
# Frontend: k3s-controlplane (K3s controlplane hosts)
frontend k3s-controlplane
bind 10.0.3.100:6443 name 10.0.3.100:6443
mode tcp
default_backend k3s-controlplane
# logging options
option tcplog
# Backend: k3s-controlplane (Pool serving all K3s controlplane hosts)
backend k3s-controlplane
option log-health-checks
# health check: k3s-controlplane
mode tcp
balance roundrobin
server k8s-control-1 10.0.3.2:6443 check inter 10s
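A quick way to confirm the frontend actually forwards to the API server before going any further. This is just a sketch; it assumes the K3s default of allowing unauthenticated access to the health endpoints:
# Hit the API server health endpoint through the HAProxy VIP
# (-k skips certificate verification; the VIP is not in the cert SANs yet)
curl -k https://10.0.3.100:6443/healthz
# Expected output: ok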
This is everything that is required until the cluster is expanded with more control plane nodes. I’ll do that in the next section.
Changing the existing single-node cluster to HA with embedded etcd
The documentation on this is pretty good. The existing control plane node (k8s-control-1) will be the node bootstrapping the cluster.
Inventory of the Ansible playbook:
[all:vars]
...
# VIP
k3s_control_plane_vip=10.0.3.100
# There will only ever be one host designated to
# bootstrap the cluster
[initial_k3s_server]
10.0.3.2 node_name=k8s-control-1
[remaining_k3s_servers]
10.0.3.7 node_name=k8s-control-2
10.0.3.8 node_name=k8s-control-3
I added a new role, k3s-initial-server, with the following tasks:
- name: Check if k3s is already installed
  ansible.builtin.stat:
    path: /usr/local/bin/k3s
  register: k3s

- name: Bootstrap k3s cluster with etcd
  become: yes
  shell: "{{ k3s_install_script_dest }}"
  environment:
    INSTALL_K3S_CHANNEL: "{{ k3s_channel }}"
    INSTALL_K3S_VERSION: "{{ k3s_version }}"
    # I extracted the existing token from /var/lib/rancher/k3s/server/token
    # and set it as a variable to be passed into the Ansible playbook
    K3S_TOKEN: "{{ k3s_token }}"
    K3S_NODE_NAME: "{{ node_name }}"
    # Set the --tls-san to the VIP
    INSTALL_K3S_EXEC: "--disable servicelb --tls-san {{ k3s_control_plane_vip }} --cluster-init"
  when: (reinstall | bool) or not k3s.stat.exists
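For reference, a sketch of how the token mentioned in the comment gets from the existing server into the playbook run. The playbook and inventory file names here are placeholders:
# On k8s-control-1: read the existing cluster token
sudo cat /var/lib/rancher/k3s/server/token

# From the Ansible controller: pass it in as an extra variable
# (playbook and inventory names are placeholders)
ansible-playbook -i inventory.ini site.yml \
  -e reinstall=true \
  -e k3s_token='<token from above>'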
The role is applied to the initial_k3s_server group:
- name: Setup initial k3s server
  hosts: initial_k3s_server
  roles:
    - { role: k3s-initial-server, tags: [k3s-baseline] }
Running the Ansible playbook with -e reinstall=true changes the existing control plane installation from SQLite to etcd:
NAME STATUS ROLES AGE VERSION
k8s-control-1 Ready control-plane,etcd,master 296d v1.24.17+k3s1
At this stage I had only reinstalled k8s-control-1; none of the worker nodes were reinstalled until after I confirmed that the HAProxy setup was working.
The configuration file /etc/rancher/k3s/config.yaml now has the new VIP in it:
k8s-control-1:~$ cat /etc/rancher/k3s/config.yaml
token: X
tls-san:
- 10.0.3.100
This was not enough to update the existing certificates. Simply downloading the new kubeconfig from k8s-control-1 and changing the IP address to the VIP produced this:
$ kubectl config view
...
clusters:
- cluster:
...
server: https://10.0.3.100:6443
...
$ kubectl get pods -A
Unable to connect to the server: tls: failed to verify certificate: x509: certificate is valid for 10.0.3.2, 10.43.0.1, 127.0.0.1, ::1, not 10.0.3.100
I had to delete the old serving certificate and let K3s generate a new one. The solution comes from this comment:
On k8s-control-1:
k8s-control-1:~$ sudo k3s kubectl -n kube-system delete secrets/k3s-serving
k8s-control-1:~$ sudo mv /var/lib/rancher/k3s/server/tls/dynamic-cert.json /tmp/dynamic-cert.json
k8s-control-1:~$ sudo systemctl restart k3s
# After verifying that it works
k8s-control-1:~$ sudo rm /tmp/dynamic-cert.json
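To confirm that the regenerated serving certificate actually contains the VIP, a quick check with openssl from the workstation (assuming openssl is installed there):
# Print the SANs of the certificate served behind the VIP;
# 10.0.3.100 should now be in the list
echo | openssl s_client -connect 10.0.3.100:6443 2>/dev/null \
  | openssl x509 -noout -text \
  | grep -A1 'Subject Alternative Name'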
After downloading the kubeconfig from k8s-control-1 again and changing the IP address to the VIP, I was able to access the control plane over TLS:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-control-1 Ready control-plane,etcd,master 293d v1.24.17+k3s1
...
The new control plane nodes are initialized using the existing k3s-servers role with some changes:
- name: Check if k3s is already installed
  ansible.builtin.stat:
    path: /usr/local/bin/k3s
  register: k3s

- name: Install k3s on server
  become: yes
  shell: "{{ k3s_install_script_dest }}"
  environment:
    INSTALL_K3S_CHANNEL: "{{ k3s_channel }}"
    INSTALL_K3S_VERSION: "{{ k3s_version }}"
    K3S_TOKEN: "{{ k3s_token }}"
    # There will only ever be one host in this group
    K3S_URL: "https://{{ groups['initial_k3s_server'] | first }}:6443"
    K3S_NODE_NAME: "{{ node_name }}"
    # This needs to start with `server` so the node does not get registered as an agent.
    # Set the --tls-san to the VIP
    INSTALL_K3S_EXEC: "server --disable servicelb --tls-san {{ k3s_control_plane_vip }}"
  when: (reinstall | bool) or not k3s.stat.exists
The role is applied to the remaining_k3s_servers group:
- name: Setup remaining k3s servers
  hosts: remaining_k3s_servers
  roles:
    - { role: k3s-servers, tags: [k3s-baseline] }
This installs K3s and joins the remaining control plane nodes to the cluster:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-control-1 Ready control-plane,etcd,master 297d v1.24.17+k3s1
k8s-control-2 Ready control-plane,etcd,master 40h v1.24.17+k3s1
k8s-control-3 Ready control-plane,etcd,master 40h v1.24.17+k3s1
...
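To double-check the etcd membership specifically, the node-role label that K3s applies to etcd members can be used as a filter. A sketch, assuming that label:
# List only the nodes K3s has labelled as embedded etcd members
kubectl get nodes -l node-role.kubernetes.io/etcd=true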
The new control plane nodes were then added to the setup in HAProxy.
Creating the hosts:
Updating the backend pool:
Statistics page:
The Config Export:
# Frontend: k3s-controlplane (K3s controlplane hosts)
frontend k3s-controlplane
bind 10.0.3.100:6443 name 10.0.3.100:6443
mode tcp
default_backend k3s-controlplane
# logging options
option tcplog
# Backend: k3s-controlplane (Pool serving all K3s controlplane hosts)
backend k3s-controlplane
option log-health-checks
# health check: k3s-controlplane
mode tcp
balance roundrobin
server k8s-control-1 10.0.3.2:6443 check inter 10s
server k8s-control-2 10.0.3.7:6443 check inter 10s
server k8s-control-3 10.0.3.8:6443 check inter 10s
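At this point a quick failover test is cheap: stop K3s on one server and confirm that kubectl, going through the VIP, still answers while HAProxy takes the node out of rotation. A sketch, assuming SSH access to the nodes and a kubeconfig pointing at https://10.0.3.100:6443:
# Stop K3s on one control plane node
ssh k8s-control-2 'sudo systemctl stop k3s'

# The API should still answer through HAProxy; the node will eventually show NotReady
kubectl get nodes

# Bring it back
ssh k8s-control-2 'sudo systemctl start k3s'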
With the HAProxy setup confirmed working, I went on to reinstall the workers and join them to the cluster using the new VIP. The k3s-agents role:
- name: Check if k3s is already installed
  ansible.builtin.stat:
    path: /usr/local/bin/k3s
  register: k3s

- name: Install k3s on agent
  become: yes
  shell: "{{ k3s_install_script_dest }}"
  environment:
    INSTALL_K3S_CHANNEL: "{{ k3s_channel }}"
    INSTALL_K3S_VERSION: "{{ k3s_version }}"
    K3S_TOKEN: "{{ k3s_token }}"
    # The VIP is used instead of the k8s-control-1 IP address
    K3S_URL: "https://{{ k3s_control_plane_vip }}:6443"
    K3S_NODE_NAME: "{{ node_name }}"
  when: (reinstall | bool) or not k3s.stat.exists
/etc/systemd/system/k3s-agent.service.env after reinstalling the workers:
...
K3S_URL='https://10.0.3.100:6443'
I was expecting a reinstall with a new K3S_URL to break the existing setup in some way, but it didn’t. Everything worked and none of the workloads seemed to have been impacted. This made me suspicious enough that I had to verify the new setup.
I ended up:
- Draining and shutting down k8s-control-1 (see the sketch after this list)
- Performing operations that would trigger traffic in the cluster, such as deleting pods
- Observing that workers still functioned as normal
I considered the setup verified after this.
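Roughly, the verification came down to something like the sketch below; the namespace and pod name are placeholders, and the actual shutdown happened from Proxmox:
# Cordon and drain the original control plane node, then shut the VM down in Proxmox
kubectl drain k8s-control-1 --ignore-daemonsets --delete-emptydir-data

# Trigger API traffic while k8s-control-1 is offline (placeholder names)
kubectl delete pod -n some-namespace some-pod
kubectl get pods -A --watch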
Enabling the HAProxy plugin Prometheus exporter
I enabled the Prometheus exporter to gather metrics from HAProxy:
The Config Export:
frontend prometheus_exporter
bind *:8404
mode http
http-request use-service prometheus-exporter if { path /metrics }
Updating the Prometheus scrape config:
- job_name: 'k3s-controlplane-haproxy'
  scrape_interval: 15s
  static_configs:
    - targets:
        - '10.0.3.100:8404'
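A quick check that the exporter answers before pointing Prometheus at it; haproxy_frontend_bytes_in_total is one of the frontend metrics I’d expect from HAProxy’s built-in exporter:
# The exporter listens on port 8404; grep for the frontend byte counters
curl -s http://10.0.3.100:8404/metrics | grep '^haproxy_frontend_bytes_in_total'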
I imported the HAProxy 2 Grafana dashboard. Displaying metrics from the k3s-controlplane services:
The amount of traffic passing through HAProxy seemed very low compared to what I was expecting:
I could not make this add up. There is enough traffic between the worker nodes and the control plane that seeing no traffic for such a long period means something isn’t right. I had verified earlier that traffic was not being sent exclusively to k8s-control-1 after the reinstall.
I checked the logs of k8s-worker-1 and discovered this:
Oct 27 08:45:36 k8s-worker-1 k3s[2406293]: time="2024-10-27T08:45:36Z" level=info msg="Removing server from load balancer k3s-agent-load-balancer: 10.0.3.2:6443"
That seemed strange; the worker nodes aren’t configured with any knowledge of the HAProxy setup.
It turns out that K3s agents don’t actually use the Fixed Registration Address (the VIP in this case) for anything beyond initial registration, other than as a fallback if all known control plane nodes are down: https://docs.k3s.io/architecture#fixed-registration-address-for-agent-nodes
> In the high-availability server configuration, each node can also register with the Kubernetes API by using a fixed registration address, as shown in the diagram below.
>
> After registration, the agent nodes establish a connection directly to one of the server nodes.
This explains why there is close to no traffic going through the HAProxy frontend after the initial registration. There are some GitHub discussions that clarify this:
- https://github.com/k3s-io/k3s/discussions/4488#discussioncomment-1719009
- https://github.com/k3s-io/k3s/discussions/10991#discussioncomment-10848102
And sure enough, a restart of the k3s-agent.service produced the following:
Oct 27 18:57:32 k8s-worker-1 k3s[327800]: time="2024-10-27T18:57:32Z" level=info msg="Starting k3s agent v1.24.17+k3s1 (026bb0ec)"
Oct 27 18:57:32 k8s-worker-1 k3s[327800]: time="2024-10-27T18:57:32Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 10.0.3.100:6443"
Oct 27 18:57:32 k8s-worker-1 k3s[327800]: time="2024-10-27T18:57:32Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 10.0.3.7:6443"
Oct 27 18:57:32 k8s-worker-1 k3s[327800]: time="2024-10-27T18:57:32Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 10.0.3.8:6443"
Oct 27 18:57:32 k8s-worker-1 k3s[327800]: time="2024-10-27T18:57:32Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 10.0.3.2:6443"
Oct 27 18:57:32 k8s-worker-1 k3s[327800]: time="2024-10-27T18:57:32Z" level=info msg="Removing server from load balancer k3s-agent-load-balancer: 10.0.3.100:6443"
Oct 27 18:57:32 k8s-worker-1 k3s[327800]: time="2024-10-27T18:57:32Z" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [10.0.3.2:6443 10.0.3.7:6443 10.0.3.8:6443] [default: 10.0.3.100:6443]"
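The agent also persists the learned server list on disk, which is another way to see this without restarting anything. The path below assumes the default K3s data directory:
# On a worker: the agent load balancer state (path assumed from the default data dir)
sudo cat /var/lib/rancher/k3s/agent/etc/k3s-agent-load-balancer.json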
With that cleared up, I merged the changes to the playbook and called it a day.
Conclusion
One downside of the current setup is that I’m placing more responsibility on OPNsense by running HAProxy there. I don’t have a secondary instance, so there is no failover if OPNsense goes down. In most cases I opt for Terraform and Ansible; now I’m relying on UI configuration steps, with offsite backups in the event of failure.
That being said, the certificates in the cluster are already configured with the VIP. If I ever change my mind and want to move away from OPNsense, I should be fine as long as whatever load balancer I end up with serves the frontend on the same IP address.
It’s great to finally have a proper HA setup for the control plane.