Upgrades

Scope

This runbook covers three upgrade paths:

Kubernetes version upgrades — moving the cluster control plane and nodes to a new minor release.
Addon/chart upgrades — bumping Helm chart versions for cluster addons (ArgoCD, cert-manager, Capsule, etc.).
Tooling upgrades — updating CLI tools and CI pipelines (kubectl, helm, argocd CLI).

Each follows the same pre-flight → upgrade → verify → rollback-if-needed pattern.

Before any upgrade:

Read the upstream changelog / release notes for breaking changes.
Confirm the target version is tested in the staging cluster (if available).
Notify tenants of the maintenance window via the RCS ops channel.
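Take an etcd snapshot on an etcd member: etcdctl snapshot save /tmp/etcd-backup-$(date +%Y%m%d).db. Supply --endpoints, --cacert, --cert, and --key as the cluster's etcd setup requires, and copy the snapshot off the node afterwards, since /tmp is not durable storage.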
Verify current cluster health: kubectl get nodes, kubectl get pods -A | grep -v Running.
Confirm ArgoCD shows all Applications as Synced and Healthy.
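
The cluster health checks in this list can be scripted. The following is a minimal sketch, not the platform's actual tooling: the function names (not_ready_nodes, unhealthy_pods, preflight_ok) are hypothetical, and it treats Completed pods as healthy, which a plain grep -v Running does not.

```shell
# Pre-flight health helpers (illustrative sketch; function names are hypothetical).

# Print nodes whose STATUS column is not exactly "Ready".
# This also reports cordoned nodes ("Ready,SchedulingDisabled").
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1 " " $2}'
}

# Print pods that are neither Running nor Completed as "namespace/name status".
unhealthy_pods() {
  kubectl get pods -A --no-headers |
    awk '$4 != "Running" && $4 != "Completed" {print $1 "/" $2 " " $4}'
}

# Return 0 only if both checks come back empty; report findings on stderr.
preflight_ok() {
  local nodes pods
  nodes="$(not_ready_nodes)"
  pods="$(unhealthy_pods)"
  [ -z "$nodes" ] || printf 'Nodes not Ready:\n%s\n' "$nodes" >&2
  [ -z "$pods" ] || printf 'Pods not Running/Completed:\n%s\n' "$pods" >&2
  [ -z "$nodes" ] && [ -z "$pods" ]
}
```

Running preflight_ok before opening the maintenance window gives a single pass/fail answer. A Running pod can still have unready containers, so the READY column is worth a glance as well.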

Control plane upgrade:

Update the Kubernetes version in the kubespray inventory/playbook variables for the cluster.
Run the kubespray upgrade playbook — the control plane components (api-server, controller-manager, scheduler) roll one at a time.
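Verify: kubectl version should show the new server version. kubectl get --raw /healthz should return ok (prefer this over the deprecated kubectl get cs). On Kubernetes 1.16 and later, /healthz is itself deprecated in favour of /readyz and /livez, which accept ?verbose for a per-check breakdown.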
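
The verification step can be wrapped in a small check. This is a sketch under assumptions: the function names are hypothetical, and the server version is read from the API server's /version endpoint with sed rather than a JSON parser, which is adequate for the pretty-printed output but not robust in general.

```shell
# Return 0 if the given API server health endpoint answers "ok".
# Defaults to /healthz; pass /readyz or /livez to query those instead.
apiserver_healthy() {
  local endpoint="${1:-/healthz}" body
  body="$(kubectl get --raw "$endpoint" 2>/dev/null)" || return 1
  [ "$body" = "ok" ]
}

# Print the server's gitVersion (for example v1.29.4) from the /version endpoint.
server_version() {
  kubectl get --raw /version |
    sed -n 's/.*"gitVersion": *"\([^"]*\)".*/\1/p'
}

# Return 0 if the API server is healthy and reports the expected version.
control_plane_at_version() {
  local expected="$1"
  apiserver_healthy && [ "$(server_version)" = "$expected" ]
}
```

For example, control_plane_at_version v1.30.2 would succeed only once the upgraded API server is serving and reports that exact version string.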

Node upgrade:

Cordon and drain nodes one at a time: kubectl cordon <node> && kubectl drain <node> --ignore-daemonsets --delete-emptydir-data. (kubectl drain cordons the node itself, so the explicit cordon is optional.)
Update the node OS image or kubelet version (depends on provisioning method).
Uncordon the node: kubectl uncordon <node>.
Verify the node rejoins as Ready with the new kubelet version: kubectl get node <node> -o wide.
Repeat for each node. Wait for all workloads to reschedule before moving to the next node.
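
The per-node loop above could be sketched as follows. UPGRADE_NODE_CMD is a hypothetical hook standing in for whatever the provisioning method requires (a kubespray run, an image swap, a package upgrade); the function names and the 10m timeout are likewise assumptions.

```shell
# Roll one node: drain, run the site-specific upgrade step, uncordon, wait for Ready.
# UPGRADE_NODE_CMD is a hypothetical hook; it is called with the node name.
roll_node() {
  local node="$1"
  if [ -z "${UPGRADE_NODE_CMD:-}" ]; then
    echo "UPGRADE_NODE_CMD is not set" >&2
    return 1
  fi
  # drain cordons the node before evicting pods
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
  "$UPGRADE_NODE_CMD" "$node" || return 1
  kubectl uncordon "$node" || return 1
  kubectl wait --for=condition=Ready "node/$node" --timeout=10m
}

# Roll the given nodes strictly one at a time, stopping at the first failure.
roll_nodes() {
  local node
  for node in "$@"; do
    echo "Upgrading $node"
    roll_node "$node" || { echo "Stopped at $node" >&2; return 1; }
  done
}
```

The sketch stops at the first failing node so a bad upgrade does not spread. It does not check that workloads have rescheduled; that remains a manual step between nodes.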

Addon/chart upgrade:

Identify the addon to upgrade and its current chart version in ubernetes-applications.
Check the chart’s changelog for breaking changes (especially CRD updates — these often require manual steps).
Update the chart version in the appropriate values file.
If the chart includes CRD changes: apply CRDs manually first (kubectl apply -f crds/). Helm installs CRDs from a chart's crds/ directory on first install only and does not upgrade them afterwards.
Commit, merge, and let ArgoCD sync.
Verify the Application returns to Synced + Healthy. Check pod logs for startup errors.
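
The final verification step can be polled from the command line instead of watching the dashboard. A minimal sketch, assuming Applications live in the argocd namespace (override with the hypothetical ARGOCD_NS variable); the function names, retry count, and delay are illustrative.

```shell
# Print "<sync status> <health status>" for an ArgoCD Application.
# Assumes the Application lives in the "argocd" namespace unless ARGOCD_NS is set.
app_status() {
  local app="$1" ns="${ARGOCD_NS:-argocd}"
  kubectl get applications.argoproj.io "$app" -n "$ns" \
    -o jsonpath='{.status.sync.status} {.status.health.status}'
}

# Poll until the Application is Synced and Healthy, or give up after N tries.
# Usage: wait_for_app <app> [tries] [delay-seconds]
wait_for_app() {
  local app="$1" tries="${2:-30}" delay="${3:-10}" i=0
  while [ "$i" -lt "$tries" ]; do
    [ "$(app_status "$app")" = "Synced Healthy" ] && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$app did not reach Synced + Healthy" >&2
  return 1
}
```

With the defaults this waits up to about five minutes. Once it succeeds, pod logs still need a manual look for startup errors.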

Addon	Watch for
ArgoCD	RBAC policy changes, UI breaking changes, ApplicationSet CRD updates
cert-manager	CRD version jumps, webhook certificate rotation
Capsule	Tenant CRD spec changes, admission webhook compatibility
Traefik	IngressRoute/middleware CRD changes, static/dynamic config key renames

Kubernetes does not support in-place downgrade of the control plane. If a version upgrade fails, prefer fixing forward when possible and treat any rollback as a last-resort procedure.

Addon rollback:

Revert the version-bump MR in ubernetes-applications.
If CRDs were updated: check whether the old chart version is compatible with the new CRDs. If not, manually restore the old CRDs from the previous chart tarball.
ArgoCD should pick up the reverted values within ~60 seconds; if it does not, trigger a refresh or sync manually.
Verify the addon pods restart with the previous version.
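
Restoring old CRDs from a previous chart tarball might look like the sketch below. It assumes the chart keeps its CRDs under <chart>/crds/ inside the tarball; charts that template their CRDs need a different approach. The function name is hypothetical.

```shell
# Re-apply the CRDs shipped in a previous chart tarball.
# Assumes the tarball unpacks to <chart>/crds/; the tarball path is the only argument.
restore_crds_from_tarball() {
  local tarball="$1" workdir crd_dir rc
  workdir="$(mktemp -d)" || return 1
  tar -xzf "$tarball" -C "$workdir" || { rm -rf "$workdir"; return 1; }
  # Look for a crds/ directory directly under the chart's top-level directory.
  crd_dir="$(find "$workdir" -mindepth 2 -maxdepth 2 -type d -name crds | head -n 1)"
  if [ -z "$crd_dir" ]; then
    echo "no crds/ directory found in $tarball" >&2
    rm -rf "$workdir"
    return 1
  fi
  kubectl apply -f "$crd_dir"
  rc=$?
  rm -rf "$workdir"
  return "$rc"
}
```

Re-applying older CRDs can drop fields or versions that existing custom resources rely on, so check the objects already in the cluster before running anything like this.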

After any upgrade:

kubectl get nodes -o wide — all nodes Ready, correct version.
kubectl get pods -A | grep -v Running — no unexpected non-running pods.
ArgoCD dashboard — all Applications Synced and Healthy.
Spot-check tenant workloads — ask one or two active tenants to confirm their services are operating normally.
Update the lastVerified date on this page and any affected platform-admin docs.
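
The "correct version" part of the node check is easy to miss by eye on a large cluster. The sketch below lists any node whose kubelet version differs from the expected one; the function name is hypothetical and the version string is supplied by the caller.

```shell
# Print "<node> <kubeletVersion>" for every node not at the expected version.
# Prints nothing when all nodes match.
nodes_not_at_version() {
  local expected="$1"
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' |
    awk -v want="$expected" '$2 != want {print $1 " " $2}'
}
```

An empty result from nodes_not_at_version with the target version as its argument means every node has picked up the new kubelet.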