Edit (26.08.2021): The initial playbooks did not include disabling/enabling of swap as part of the setup/teardown. I’ve updated the Ansible playbooks to reflect these changes.

I first started thinking about setting up a Kubernetes cluster on a few Raspberry Pis last year. At the time I ran into issues with the setup of K3s and MetalLB, which proved more difficult than I had initially thought. I ended up abandoning the project in favor of a homelab based only on Docker for running applications and services. This is how I self-hosted Pi-Hole.

I decided some time ago to restart the project of setting up Kubernetes on Raspberry Pi with K3s, while also investing a bit more effort into the setup. More specifically, I wanted it to be more fault-tolerant than the previous one. As I discovered when running Pi-Hole on a single host, SD-card and memory issues tend to strike at the worst possible time. This is especially annoying when the Pi also serves as your primary DNS server.

The Build

Part of the motivation behind this project was to tidy up the current setup. Previously I had one Raspberry Pi 4B and two Raspberry Pi 3B+, each plugged into its own power outlet and connected to the network over Wi-Fi. To improve the setup I wanted all Pis colocated in a rack tower, with wired connections to the network only.

Last time I tried running K3s I kept having reliability issues with larger workloads. I noticed that while the server node (4B) seemed to manage fine, the agents (3B+) struggled much more when running certain monitoring and logging stacks. To avoid this issue, I purchased another 4B to increase the total throughput of the cluster. One of the 4Bs will continue as the designated server, while the other will be an agent. Overall this should increase performance quite a bit. If the 3B+ agents continue to struggle, I can look into steering lighter workloads to them via the scheduler, or replace them with 4B agents.
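Should it come to that, one option is to taint the 3B+ nodes so the scheduler prefers placing pods on the 4Bs. A sketch, with an illustrative taint key (the node names are the inventory hostnames used later in this post):

```shell
# Hypothetical: hint the scheduler away from the 3B+ agents.
# PreferNoSchedule is a soft preference, not a hard ban.
kubectl taint nodes rpi3b1 node-class=light:PreferNoSchedule
kubectl taint nodes rpi3b2 node-class=light:PreferNoSchedule
```

Heavier workloads could additionally pin themselves to the 4Bs with a nodeSelector against a matching label.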

For the networking part, my router was one Ethernet port short of connecting all four Raspberry Pis directly. I ended up purchasing an 8-port switch, which also leaves some room for adding more devices in the future.

The final list of parts in the cluster ended up being the following (including devices I already had, and new additions):

  • S2Pi ZP-0088 Rack Tower
  • TP-Link TL-SG108E
  • 2x Raspberry Pi 4B
  • 2x Raspberry Pi 3B+
  • 4x Cat6 Ethernet cables, 0.5 meters
  • 4x SD-Cards
  • Power adapters

The Network

The physical network setup is fairly straightforward:

Router
  |
  V
Switch
  |
  V
Raspberry Pi [1:4]

The Raspberry Pis are all set up with DHCP reservations in the router, which also gives each Pi a DNS A record. Another option would have been static IP addresses; I'll explore that if the need arises.
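A quick way to sanity-check the A records after setting up the reservations is a small resolver loop (the hostnames match the Ansible inventory shown later):

```shell
# Succeeds if the name resolves through the system resolver (i.e. the router)
resolves() { getent hosts "$1" > /dev/null; }

for host in rpi4b1 rpi4b2 rpi3b1 rpi3b2; do
  resolves "$host" || echo "no A record for $host"
done
```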

I considered creating a VLAN in the router or the switch but decided against it to keep things simple for now.

The Cluster

I want the setup of the cluster structured as code. If an SD-card breaks, hardware fails, or something else should happen, I want to know that I can recreate parts of the cluster or the entire cluster by flashing a few SD-cards and running some commands.

There are several pre-configured options for setting up a Kubernetes cluster; I decided to stick with K3s since it is lightweight and I already had some experience with it. I considered using k3sup, since it has a dedicated guide for setting up K3s on Raspberry Pi. A drawback for me was that I would still have to manually modify kernel settings each time I flashed a new SD-card. I could have opted for a shell script to make the modifications, but then I would also have to make the script idempotent. In the end I opted to automate everything with Ansible by writing my own playbooks.

Creating the Ansible playbook to set up the cluster was one of the more time-consuming parts of the project, especially the bit related to configuring the kernel settings in /boot/cmdline.txt. I chose not to include removal of those same kernel settings in the playbook for uninstalling K3s, since most of my projects run as containers.
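The manual change the playbook automates can be sketched as a small idempotent shell function, assuming (as on Raspberry Pi OS) that the whole kernel command line lives on a single line:

```shell
# Idempotent sketch of the /boot/cmdline.txt change the playbook performs.
ensure_cgroup_flags() {
  boot_file=$1
  flags='cgroup_memory=1 cgroup_enable=memory'
  grep -q "$flags" "$boot_file" && return 0     # flags already present, nothing to do
  sed -i "1 s/\$/ $flags/" "$boot_file"         # append to the single kernel command line
}

# On a Pi, as root: ensure_cgroup_flags /boot/cmdline.txt
```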

Ansible playbooks

Playbook which sets up the cluster:

# site.yml
---
- name: Setup baseline for all k3s nodes 
  hosts: k3s_servers, k3s_agents
  vars:
    boot_file: /boot/cmdline.txt
  tasks:
  - name: Check if cgroups are set correctly
    become: yes
    shell: grep -c 'cgroup_memory=1 cgroup_enable=memory' '{{ boot_file }}' | cat
    register: cmdline
    ignore_errors: yes

  - name: Ensure cgroups are correctly set by updating raspberry pi
    become: yes
    ansible.builtin.lineinfile:
      state: present
      backrefs: yes
      path: '{{ boot_file }}'
      regexp: '(.+)(?!cgroup_memory=1 cgroup_enable=memory)'
      line: '\1 cgroup_memory=1 cgroup_enable=memory'
    when: cmdline.stdout == "0"

  - name: Disable swap with dphys-swapfile
    become: yes
    shell: dphys-swapfile swapoff && dphys-swapfile uninstall && update-rc.d dphys-swapfile remove

  - name: Disable dphys-swapfile service
    become: yes
    systemd:
      name: dphys-swapfile
      enabled: no
    register: swapfile_service

  - name: Reboot host if system settings were updated
    become: yes
    ansible.builtin.reboot:
      reboot_timeout: 3600
    when: cmdline.stdout == "0" or swapfile_service.changed

- name: Setup k3s servers
  hosts: k3s_servers
  tasks:
  - name: Check if k3s is already installed
    ansible.builtin.stat:
      path: /usr/local/bin/k3s
    register: k3s

  - name: Install k3s on server
    become: yes    
    shell: curl -sfL https://get.k3s.io | sh -
    environment:
      K3S_NODE_NAME: "{{ inventory_hostname }}"
      INSTALL_K3S_EXEC: "--disable servicelb"
    when: not k3s.stat.exists

  - name: Get node join token
    become: yes
    ansible.builtin.fetch:
      src: /var/lib/rancher/k3s/server/token
      dest: 'node_join_token'
      flat: yes

- name: Setup k3s agents
  hosts: k3s_agents
  tasks:
  - name: Check if k3s is already installed
    ansible.builtin.stat:
      path: /usr/local/bin/k3s
    register: k3s

  - name: Extract k3s server node token from control node
    local_action:
      module: shell
      cmd: cat node_join_token
    register: node_join_token

  - name: Install k3s on agent
    become: yes
    shell: curl -sfL https://get.k3s.io | sh -
    environment:
      K3S_TOKEN: "{{ node_join_token.stdout }}"
      # Select the first host in the group of k3s servers as the server for the agent
      K3S_URL: "https://{{ groups['k3s_servers'] | first }}:6443"
      K3S_NODE_NAME: "{{ inventory_hostname }}"
    when: not k3s.stat.exists

- name: Wait for all nodes to complete their registration
  hosts: k3s_servers
  vars:
    total_amount_of_nodes: "{{ groups['k3s_servers'] | count + groups['k3s_agents'] | count }}"
  tasks:
  - name: Wait until all agents are registered
    become: yes
    shell: k3s kubectl get nodes --no-headers | wc -l
    until: agents.stdout | int == total_amount_of_nodes | int
    register: agents
    retries: 10
    delay: 10

  - name: Copy kubectl config from server to temp .kube directory on control node 
    become: yes
    ansible.builtin.fetch:
      src: /etc/rancher/k3s/k3s.yaml
      dest: .kube/k3s-config
      # The kubeconfig should be identical for all servers
      flat: yes

- name: Setup kubectl on control node with new context
  hosts: localhost
  tasks:
  - name: Create $HOME/.kube directory if not present
    ansible.builtin.file:
      path: $HOME/.kube
      state: directory

  - name: Replace the server reference in k3s kube config with IP of a server node
    ansible.builtin.replace:
      path: .kube/k3s-config
      regexp: '127\.0\.0\.1'
      replace: "{{ groups['k3s_servers'] | first }}" 
      backup: yes

  - name: Copy k3s kube config to $HOME/.kube directory
    ansible.builtin.copy:
      src: .kube/k3s-config
      dest: $HOME/.kube
      mode: '600'

  - name: Remove node join token
    ansible.builtin.file:
      path: node_join_token
      state: absent

  - name: Remove temporary kube config directory
    ansible.builtin.file:
      path: .kube
      state: absent

Playbook for deleting the cluster, without reverting the kernel settings:

# teardown.yml
---
- name: Reset system configuration
  hosts: k3s_servers, k3s_agents
  tasks:
  - name: Enable swap with dphys-swapfile
    become: yes
    shell: dphys-swapfile setup && dphys-swapfile swapon && update-rc.d dphys-swapfile start

  - name: Enable dphys-swapfile service
    become: yes
    systemd:
      name: dphys-swapfile
      enabled: yes
      state: started
    register: swapfile_service

  - name: Reboot host if system settings were updated
    become: yes
    ansible.builtin.reboot:
      reboot_timeout: 3600
    when: swapfile_service.changed

- name: Uninstall k3s agents
  hosts: k3s_agents
  tasks:
  - name: Check if script for uninstalling k3s agent is present
    stat: 
      path: /usr/local/bin/k3s-agent-uninstall.sh
    register: k3s_present
  
  - name: Uninstall k3s from agent
    shell: /usr/local/bin/k3s-agent-uninstall.sh
    when: k3s_present.stat.exists

- name: Uninstall k3s servers
  hosts: k3s_servers
  tasks:
  - name: Check if script for uninstalling k3s server is present
    stat: 
      path: /usr/local/bin/k3s-uninstall.sh
    register: k3s_present

  - name: Uninstall k3s from server
    shell: /usr/local/bin/k3s-uninstall.sh
    when: k3s_present.stat.exists

Inventory:

[k3s_servers]
rpi4b1

[k3s_agents]
rpi4b2
rpi3b1
rpi3b2
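For completeness, this is roughly how the playbooks are run from the control node (the inventory filename here is an assumption):

```shell
# Bring the cluster up (inventory filename assumed to be "hosts.ini")
ansible-playbook -i hosts.ini site.yml

# Tear it down again, leaving the kernel settings in place
ansible-playbook -i hosts.ini teardown.yml
```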

The only manual process left in the setup is flashing the SD-cards and adding configuration to the boot partition to enable the SSH server.

With the setup of the cluster itself automated, it was time to configure it.

Cluster configuration

MetalLB

In order to run apps such as Pi-Hole, I need a way to ensure that a Kubernetes Service of type LoadBalancer is exposed with a valid IP address on the network. By default K3s ships with a load balancer named Klipper Load Balancer, which according to the documentation works by reserving ports on the nodes. This means Services are reached from outside the cluster through the nodes' own IP addresses combined with the reserved ports. Also, once a certain port is reserved on all nodes in the cluster, it can no longer be used for any new Service. Instead of this approach, I would rather make use of the available address space in my network for Services.

Enter MetalLB, a network load balancer implementation for bare-metal Kubernetes clusters. MetalLB uses standard networking protocols to advertise each Service of type LoadBalancer. It is compatible with Flannel, which ships with K3s by default. Using MetalLB requires disabling the Klipper Load Balancer. MetalLB operates in one of two modes, Layer 2 or BGP; since my router does not support BGP, I opted for Layer 2.

MetalLB needs a pool of IP addresses that it can allocate to Services. I selected a range of IP addresses outside the range of the router's existing DHCP server. I then created two pools in MetalLB, one for Pi-Hole and one default. The Pi-Hole pool contains only a single IP address. The reason for locking down the IP address used for the Pi-Hole Service is to avoid reconfiguring the static DNS servers in the router if the Service is ever recreated. With this setup, a “static” IP assignment can be configured using only pools and annotations in MetalLB (more on this in the section about installing Pi-Hole).
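When picking the pools it is easy to fat-finger a boundary, so a tiny helper for checking whether an address falls inside a given range can be handy. This is just a sketch for double-checking, not part of the actual setup:

```shell
# Convert a dotted-quad IPv4 address to an integer for comparison
ip_to_int() {
  echo "$1" | { IFS=. read -r a b c d; echo $(( (a << 24) + (b << 16) + (c << 8) + d )); }
}

# in_range <ip> <low> <high>: succeeds if <ip> lies within the inclusive range
in_range() {
  ip=$(ip_to_int "$1"); lo=$(ip_to_int "$2"); hi=$(ip_to_int "$3")
  [ "$ip" -ge "$lo" ] && [ "$ip" -le "$hi" ]
}
```

For example, `in_range 192.168.1.153 192.168.1.154 192.168.1.254` fails, confirming that the pinned Pi-Hole address sits outside the default pool.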

To manage applications in the cluster I use Helm. There is an official MetalLB chart available.

The values.yaml for the MetalLB chart:

# values.yaml
configInline:
  address-pools:
  - name: default
    protocol: layer2
    addresses:
    - 192.168.1.154-192.168.1.254
  # Ensures there is only ever a single IP address
  # that can be given to the pihole service
  - name: pihole
    protocol: layer2
    addresses:
    - 192.168.1.153-192.168.1.153

Installation of MetalLB: helm upgrade --install metallb metallb/metallb -f values.yaml.

Installing Pi-Hole

The final step to replicate the old homelab is to install Pi-Hole. I did this using the chart from Mojo2600.

The annotation metallb.universe.tf/allow-shared-ip ensures that the UDP and TCP Services are colocated on the same IP address.

The values.yaml for the chart:

replicaCount: 2

dnsmasq:
  customDnsEntries:
  # Add custom DNS records in 
  # dnsmasq-installation of Pi-Hole
  - address=/pihole.local/192.168.1.153

persistentVolumeClaim:
  enabled: false

serviceWeb:
  # The static LoadBalancer IP address for serviceWeb and
  # serviceDns does not have to be set, since the pool "pihole" 
  # in metallb will only contain a single IP address that can 
  # be allocated when using the address-pool "pihole".
  annotations:
    # Ensures that the pihole receives IP address from
    # predefined pool in metallb
    metallb.universe.tf/address-pool: pihole
    # This ensures that port 53 for TCP and UDP is colocated
    # on the same IP address. 
    metallb.universe.tf/allow-shared-ip: pihole-svc
  type: LoadBalancer

serviceDns:
  annotations:
    # Ensures that the pihole receives IP address from
    # predefined pool in metallb
    metallb.universe.tf/address-pool: pihole
    # This ensures that port 53 for TCP and UDP is colocated
    # on the same IP address. 
    metallb.universe.tf/allow-shared-ip: pihole-svc
  type: LoadBalancer

Installation of Pi-Hole: helm upgrade --install pihole mojo2600/pihole -f values.yaml --set adminPassword=<pihole-admin-password>.
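Once the chart is installed and MetalLB has handed out the address, DNS can be verified directly against the pinned IP from any machine on the network:

```shell
# Query the custom dnsmasq record through the load-balanced Service IP
dig @192.168.1.153 pihole.local +short

# And a regular upstream lookup, to confirm forwarding works
dig @192.168.1.153 example.com +short
```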

Wrapping up

At this point I have a 4 node cluster colocated in a rack tower with only physical networking. K3s has a network load balancer implementation providing IP addresses within the network. Pi-Hole is installed and serves as the primary DNS server in the router, blocking ads across the network.

The entire setup is checked into code using Ansible and Helm. In the event of a single-node failure or a cluster-wide failure, I can recreate almost everything by flashing some SD-cards and running a few commands. I say almost, because nothing in this setup takes storage and recovery of data into account. This is something I'll have to look into once I start deploying workloads that rely on data persistence.

Although this is a work in progress, it serves as a good baseline for a new homelab.

Resources

While researching this project, I came across several well-written articles that deserve a mention: