One of the main benefits of using Proxmox in the homelab is that I can create and destroy VMs on demand. Proxmox also has a Terraform provider that I can use to define VMs in HCL, which means creating, changing and destroying resources works much like with any other cloud provider. Getting to this stage required a non-trivial amount of effort, which is what this post is about.

Setting up cloud-init Virtual Machine Templates

It’s possible to create VMs using the Terraform provider without using cloud-init VM Templates. However, I want a baseline template that I can use for all types of workloads and that also supports specifying:

  • Network interface with a static IP address
  • Predefined SSH keys added to authorized keys

The end goal is that once a VM is provisioned by Terraform, I can run an Ansible playbook to configure it directly. The only manual step required should be accepting the host key of the VM on the Ansible control node.
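
One way to handle that single manual step up front is ssh-keyscan instead of an interactive prompt. A minimal sketch, with <IP address> as a placeholder for the static IP assigned to the VM:

# Append the new VM's host key to known_hosts on the Ansible control node
ssh-keyscan -H <IP address> >> ~/.ssh/known_hosts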

When setting up all of this I have relied on the following sources of information:

Big shoutout to all of them for good writeups!

To create the VM templates I added a new role to the existing Ansible playbook for the Proxmox hosts. This role basically does the following:

  1. Downloads cloud-init images for multiple distributions
  2. Customizes the images by installing the QEMU guest agent
  3. Creates a VM template from each image

Files involved:

  • roles/tasks/download_iso_images.yaml
  • roles/tasks/install_qemu_guest_agent.yaml
  • roles/tasks/create_cloud_init_template.yaml
  • roles/tasks/main.yaml
  • roles/vars/main.yaml

Definitions:

roles/tasks/download_iso_images.yaml:

- name: Download ISO images to directory "{{ pvesm_local_storage_path }}/{{ pvesm_local_storage_iso_subpath }}"
  loop: "{{ iso_images_to_download }}"
  get_url:
    url: "{{ item.url }}"
    dest: "{{ pvesm_local_storage_path }}/{{ pvesm_local_storage_iso_subpath }}/{{ item.filename }}"

roles/tasks/install_qemu_guest_agent.yaml:

- name: Install libguestfs-tools on proxmox nodes
  apt:
    name: libguestfs-tools
    update_cache: true

- name: Install qemu-guest-agent in all images in directory "{{ pvesm_local_storage_path }}/{{ pvesm_local_storage_iso_subpath }}"
  loop: "{{ iso_images_to_download }}"
  command: virt-customize -a "{{ pvesm_local_storage_path }}/{{ pvesm_local_storage_iso_subpath }}/{{ item.filename }}" --install qemu-guest-agent
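
With the single image defined in roles/vars/main.yaml below, this loop boils down to one concrete command:

virt-customize -a /var/lib/vz/template/iso/ubuntu-22.04-server-cloudimg-amd64.img --install qemu-guest-agent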

roles/tasks/create_cloud_init_template.yaml:

- name: Check if template already exists
  shell: qm list | grep "{{ vm_id }}"
  ignore_errors: true
  register: template_exists_cmd

- name: Delete template {{ vm_id }} if it already exists
  command: qm destroy {{ vm_id }}
  when: template_exists_cmd.rc == 0

- name: Create the VM to be used as a cloud-init base for template {{ vm_instance_template_name }}
  command: qm create {{ vm_id }} -name {{ vm_instance_template_name }} -memory 1024 -net0 virtio,bridge=vmbr0 -cores 1 -sockets 1

- name: Import the cloud-init image as a disk
  command: qm importdisk {{ vm_id }} {{ iso_image_path }} {{ storage }}

- name: Attach the disk to the virtual machine
  command: qm set {{ vm_id }} -scsihw virtio-scsi-pci -virtio0 "{{ storage }}:vm-{{ vm_id }}-disk-0"

- name: Set a second drive of the cloud-init type
  command: qm set {{ vm_id }} -ide2 "{{ storage }}:cloudinit"

- name: Add serial output
  command: qm set {{ vm_id }} -serial0 socket -vga serial0

- name: Set the bootdisk to the imported disk
  command: qm set {{ vm_id }} -boot c -bootdisk virtio0

- name: Enable qemu guest agent
  command: qm set {{ vm_id }} --agent enabled=1
  when: enable_qemu_guest_agent is defined

- name: Create a template from the instance
  command: qm template {{ vm_id }}
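
Once these tasks have run, the result can be sanity-checked directly on the Proxmox host, using the VM ID 9001 defined in the variables below:

# The template should appear in the list, with the imported disk, cloud-init drive,
# serial console and (optionally) the guest agent configured
qm list | grep 9001
qm config 9001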

roles/tasks/main.yaml:

- name: Download ISO images
  import_tasks: download_iso_images.yaml

- name: Install qemu-guest-agent in ISO images
  import_tasks: install_qemu_guest_agent.yaml

- name: Set up cloud-init for all downloaded ISO images that have cloud-init
  loop: "{{ iso_images_to_download }}"
  include_tasks: create_cloud_init_template.yaml
  when: item.is_cloud_init_image is defined
  vars:
    vm_id: "{{ item.cloud_init_vm_id }}"
    vm_instance_template_name: "{{ item.cloud_init_vm_instance_template_name }}"
    iso_image_path: "{{ pvesm_local_storage_path }}/{{ pvesm_local_storage_iso_subpath }}/{{ item.filename }}"
    storage: local-lvm
    enable_qemu_guest_agent: "{{ item.enable_qemu_guest_agent }}"
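
Running the role is a normal playbook run against the Proxmox hosts; the file names here are placeholders for my own inventory and playbook:

# Re-running is safe: existing templates are destroyed and recreated from freshly customized images
ansible-playbook -i <inventory file> <playbook for the Proxmox hosts>.yaml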

roles/vars/main.yaml:

pvesm_local_storage_path: /var/lib/vz
pvesm_local_storage_iso_subpath: template/iso

iso_images_to_download:
- filename: ubuntu-22.04-server-cloudimg-amd64.img
  url: https://cloud-images.ubuntu.com/releases/22.04/release/ubuntu-22.04-server-cloudimg-amd64.img
  is_cloud_init_image: true
  cloud_init_vm_id: 9001
  cloud_init_vm_instance_template_name: ubuntu-22.04-server-cloudimg-amd64
  enable_qemu_guest_agent: true
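
Adding more distributions is just a matter of appending entries to this list. A hypothetical second entry, with the VM ID, filename and URL as illustrative placeholders:

- filename: debian-12-generic-amd64.qcow2
  url: <URL of the Debian 12 cloud image>
  is_cloud_init_image: true
  cloud_init_vm_id: 9002
  cloud_init_vm_instance_template_name: debian-12-generic-amd64
  enable_qemu_guest_agent: true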

I opted to recreate the VM templates on each run so they always reflect the latest images. One thing to note is that I use full_clone for all VMs, so no VM keeps a reference to a template that would break when the template is deleted and recreated.

Once the playbook has run, the VM templates are ready and it’s time to define the VMs using HCL.

Writing the Terraform configuration for the VMs

I won’t spend time talking about initializing the Terraform provider or creating the user account in Proxmox since this is already covered in the documentation:

Once everything has been set up, it’s time to define the VMs.

Most of the configuration is similar enough that defining multiple resources was not necessary in my case:

k8s.tf:

resource "proxmox_vm_qemu" "k8s" {
  for_each = {
    for vm in var.vms : vm.name => vm
  }

  name        = each.value.name
  desc        = each.value.desc
  target_node = each.value.pve_node

  # Setting the OS type to cloud-init
  os_type    = "cloud-init"
  # Set to the name of the cloud-init VM template created earlier
  clone      = var.cloud_init_template
  # Ensure each VM is cloned in full to avoid
  # a dependency on the original VM template
  full_clone = true

  cores  = each.value.cores
  memory = each.value.memory

  # Define a static IP on the primary network interface
  ipconfig0 = "ip=${each.value.ip},gw=${var.gateway_ip}"

  ciuser  = var.username
  sshkeys = var.ssh_public_key

  # Enable the QEMU guest agent
  agent = 1

  # This matches the default OS drive
  # that comes from the cloned template.
  disk {
    type    = "virtio"
    storage = "local-lvm"
    size    = "20G"
  }

  # Additional storage drive for Longhorn.
  disk {
    type    = "virtio"
    storage = "local-lvm"
    size    = each.value.disk_size
    format  = "raw"
  }

  lifecycle {
    ignore_changes = [
      target_node,
      network,
      clone,
      full_clone,
      qemu_os
    ]
  }
}

vars.tf:

variable "cloud_init_template" {
  type        = string
  description = "Name of the cloud-init template to use"
  default     = "ubuntu-22.04-server-cloudimg-amd64"
}

variable "username" {
  type        = string
  description = "Username of the cloud-init user"
  sensitive   = true
}

variable "ssh_public_key" {
  type        = string
  description = "Public SSH Key to add to authorized keys"
  sensitive   = true
}

variable "gateway_ip" {
  type        = string
  description = "IP of gateway"
  default     = "<IP address>"
}

variable "vms" {
  type = list(object({
    name      = string
    pve_node  = string
    desc      = string
    ip        = string
    memory    = number
    cores     = number
    disk_size = string
  }))
  default = []
}
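
For convenience, the assigned addresses can also be exposed as a Terraform output. A small sketch built purely from the input variables rather than from provider attributes, which comes in handy when filling in ~/.ssh/config later:

# outputs.tf (optional)
output "vm_ips" {
  description = "Map of VM name to its static IP"
  value       = { for vm in var.vms : vm.name => vm.ip }
}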

My current configuration is shown below. The 120GB drive on each of the 3 worker nodes is dedicated to Longhorn. The 30GB drive on the control plane node was previously used for Longhorn storage and will be removed in the future.

terraform.tfvars:

vms = [
  {
    name      = "k8s-control-1"
    pve_node  = "<Proxmox host containing VM templates in the Proxmox cluster>"
    desc      = "Kubernetes control plane node 1"
    ip        = "<IP address in CIDR notation>"
    memory    = 4096
    cores     = 4
    disk_size = "30G"
  },
  {
    name      = "k8s-worker-1"
    pve_node  = "<Proxmox host containing VM templates in the Proxmox cluster>"
    desc      = "Kubernetes worker node 1"
    ip        = "<IP address in CIDR notation>"
    memory    = 4096
    cores     = 2
    disk_size = "120G"
  },
  {
    name      = "k8s-worker-2"
    pve_node  = "<Proxmox host containing VM templates in the Proxmox cluster>"
    desc      = "Kubernetes worker node 2"
    ip        = "<IP address in CIDR notation>"
    memory    = 4096
    cores     = 2
    disk_size = "120G"
  },
  {
    name      = "k8s-worker-3"
    pve_node  = "<Proxmox host containing VM templates in the Proxmox cluster>"
    desc      = "Kubernetes worker node 3"
    ip        = "<IP address in CIDR notation>"
    memory    = 4096
    cores     = 2
    disk_size = "120G"
  }
]
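
Provisioning or changing the VMs is then the usual Terraform workflow:

# Review the planned changes, then create or update the VMs on Proxmox
terraform plan
terraform apply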

Since all the VMs have predefined IP addresses and usernames, I can specify all of this in ~/.ssh/config:

Host k8s-control-1
    HostName <IP address>
    User <Username>

Host k8s-worker-1
    HostName <IP address>
    User <Username>

Host k8s-worker-2
    HostName <IP address>
    User <Username>

Host k8s-worker-3
    HostName <IP address>
    User <Username>

Once each VM is up, I can SSH into all of them without any passwords involved.
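
The same host aliases carry over to the Ansible inventory for the K3s playbook mentioned below. A hypothetical inventory, with the group names purely illustrative:

k3s_cluster:
  children:
    control:
      hosts:
        k8s-control-1:
    workers:
      hosts:
        k8s-worker-1:
        k8s-worker-2:
        k8s-worker-3: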

The Ansible playbook for the K3s cluster adds any new hosts placed in the inventory to the cluster. Replacing a node in the cluster is now reduced to the following steps (a command-level sketch follows the list):

  • Drain the node
  • Remove it from the cluster
  • Destroy and recreate the VM of the node using Terraform
  • Replace the old SSH host key of the VM on the Ansible control node
  • Rerun the Ansible playbook to configure the new node
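
A hedged sketch of what those steps look like on the command line, using k8s-worker-2 as an example (the exact kubectl flags depend on the workloads running on the node):

# Drain the node and remove it from the cluster
kubectl drain k8s-worker-2 --ignore-daemonsets --delete-emptydir-data
kubectl delete node k8s-worker-2

# Destroy and recreate only that VM
terraform apply -replace='proxmox_vm_qemu.k8s["k8s-worker-2"]'

# Swap the old host key for the new one on the Ansible control node
ssh-keygen -R <IP address>
ssh-keyscan -H <IP address> >> ~/.ssh/known_hosts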

Unplugging the power from Raspberry Pis, reflashing SD cards and so on is now a thing of the past.

Given my current setup, this is about as close as I’m going to get to a fully automated solution for managing the underlying infrastructure of the K3s nodes.

Conclusion

At this point I have migrated the entire K3s cluster away from Raspberry Pis and onto VMs on Proxmox. Some of the larger pain points I had previously are no longer a concern:

  • No more nodes failing and becoming unresponsive due to overload, which required physically reconnecting the power of the Raspberry Pi to force a reboot
  • No more silently failing network interfaces that could only be fixed by rebooting the Raspberry Pi

I now have IaC for the entire K3s cluster. Combined with external backups of stateful data, the setup is robust enough that I can destroy and recreate the entire cluster when necessary.

As an added bonus, anything I cannot deploy as a container in the K3s cluster can be deployed as a VM using the same baseline template as the K3s nodes.

Some future improvements:

  • Replication of Virtual Machine Templates is currently not possible in my Proxmox cluster. This forces me to use a single Proxmox host for first-time provisioning, which is why target_node is in the lifecycle ignore_changes list. After the initial provisioning I have to migrate VMs manually to other nodes. To fix this I would have to look into Proxmox storage that supports replication and use it for storing the templates.
  • Maybe look into using existing Terraform modules such as fvumbaca/k3s/proxmox to remove the need for having my own Ansible playbook for the initial configuration of the cluster. The Longhorn configuration would still be necessary.
