Upgrades
This runbook covers three upgrade paths:
- Kubernetes version upgrades — moving the cluster control plane and nodes to a new minor release.
- Addon/chart upgrades — bumping Helm chart versions for cluster addons (ArgoCD, cert-manager, Capsule, etc.).
- Tooling upgrades — updating CLI tools and CI pipelines (kubectl, helm, argocd CLI).
Each follows the same pre-flight → upgrade → verify → rollback-if-needed pattern.
Pre-flight checklist
Before any upgrade:
- Read the upstream changelog / release notes for breaking changes.
- Confirm the target version is tested in the staging cluster (if available).
- Notify tenants of the maintenance window via the RCS ops channel.
- Take an etcd snapshot: `etcdctl snapshot save /tmp/etcd-backup-$(date +%Y%m%d).db`.
- Verify current cluster health: `kubectl get nodes`, `kubectl get pods -A | grep -v Running`.
- Confirm ArgoCD shows all Applications as `Synced` and `Healthy`.
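The two health checks above can be scripted so that they return a clear pass/fail instead of relying on eyeballing `grep` output. The following is a minimal sketch, not part of the original runbook: the function names are hypothetical, and the `argocd` namespace in the usage comment is an assumption about where ArgoCD is installed.

```shell
#!/bin/sh
# Pre-flight sketch. Both filters read kubectl output on stdin,
# so they can be exercised without a live cluster.

# Print pods that are neither Running nor Completed.
# Input columns (kubectl get pods -A --no-headers):
#   NAMESPACE NAME READY STATUS RESTARTS AGE
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Exit 0 only if every Application is Synced and Healthy.
# Input: one "name sync-status health-status" triple per line.
# Offending lines are echoed to stderr.
argocd_all_green() {
  awk '$2 != "Synced" || $3 != "Healthy" { bad = 1; print $0 > "/dev/stderr" }
       END { exit bad }'
}

# Usage against a cluster (namespace "argocd" is an assumption):
#   kubectl get pods -A --no-headers | unhealthy_pods
#   kubectl get applications.argoproj.io -n argocd --no-headers \
#     -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status \
#     | argocd_all_green
```

Unlike `grep -v Running`, the pod filter also ignores `Completed` pods left behind by finished Jobs, which are usually not a problem.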
Kubernetes version upgrade
Control plane
- Update the Kubernetes version in the kubespray inventory/playbook variables for the cluster.
- Run the kubespray upgrade playbook — the control plane components (api-server, controller-manager, scheduler) roll one at a time.
- Verify: `kubectl version` should show the new server version. `kubectl get --raw /healthz` should return `ok` (prefer this over the deprecated `kubectl get cs`; current releases also expose the more specific `/readyz` and `/livez` endpoints).
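The verification step can be turned into a single check that fails loudly when the API server is still on the old version. This is a sketch under assumptions: the function names are hypothetical, and the version is parsed with `sed` from the JSON served at `/version` purely to avoid a `jq` dependency.

```shell
#!/bin/sh
# Extract gitVersion (e.g. v1.29.3) from the API server's /version JSON.
server_version() {
  kubectl get --raw /version | sed -n 's/.*"gitVersion": *"\([^"]*\)".*/\1/p'
}

# Succeed only if the server reports the expected version and /healthz says ok.
# $1: expected version string, including the leading "v".
verify_control_plane() {
  want=$1
  got=$(server_version)
  if [ "$got" != "$want" ]; then
    echo "server is $got, expected $want" >&2
    return 1
  fi
  [ "$(kubectl get --raw /healthz)" = "ok" ]
}

# Usage (the version shown is only an example):
#   verify_control_plane v1.29.3 && echo "control plane OK"
```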
Worker nodes
- Cordon and drain nodes one at a time: `kubectl cordon <node> && kubectl drain <node> --ignore-daemonsets --delete-emptydir-data`.
- Update the node OS image or kubelet version (depends on provisioning method).
- Uncordon the node: `kubectl uncordon <node>`.
- Verify the node rejoins as `Ready` with the new kubelet version: `kubectl get node <node> -o wide`.
- Repeat for each node. Wait for all workloads to reschedule before moving to the next node.
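The per-node loop above could be sketched as follows. The function names and the `upgrade_node` hook are hypothetical: the hook stands in for whatever your provisioning method does to re-image the node or bump the kubelet, and the 10-minute timeouts are arbitrary placeholders. The sketch waits only for the node to report `Ready`; confirming that workloads have rescheduled is still left to the operator.

```shell
#!/bin/sh
# Roll one worker node: cordon, drain, upgrade, uncordon, wait for Ready.
# upgrade_node is a hypothetical hook that you must define yourself.
# On failure the node is deliberately left cordoned for inspection.
roll_node() {
  node=$1
  kubectl cordon "$node" || return 1
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m || return 1
  upgrade_node "$node" || return 1
  kubectl uncordon "$node" || return 1
  kubectl wait --for=condition=Ready "node/$node" --timeout=10m
}

# Roll the named nodes strictly one at a time, stopping at the first failure.
roll_nodes() {
  for node_name in "$@"; do
    roll_node "$node_name" || { echo "stopped at $node_name" >&2; return 1; }
  done
}

# Usage (node names are examples):
#   upgrade_node() { ssh "$1" 'sudo ...'; }   # provisioning-specific
#   roll_nodes worker-1 worker-2 worker-3
```

Stopping at the first failure keeps the blast radius to a single node, which matches the one-at-a-time intent of the procedure.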
Addon / chart upgrades
- Identify the addon to upgrade and its current chart version in `ubernetes-applications`.
- Check the chart’s changelog for breaking changes (especially CRD updates, which often require manual steps).
- Update the chart version in the appropriate values file.
- If the chart includes CRD changes: apply CRDs manually first (`kubectl apply -f crds/`), because Helm does not upgrade CRDs automatically.
- Commit, merge, and let ArgoCD sync.
- Verify the Application returns to `Synced` + `Healthy`. Check pod logs for startup errors.
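For the manual CRD step, it can be safer to preview the change before applying it. The sketch below wraps `kubectl diff`, which exits 0 when nothing would change, 1 when differences exist, and a higher code on error. The function name is hypothetical, and the `crds/` path assumes you have unpacked the target chart version locally.

```shell
#!/bin/sh
# Preview, then apply, the CRD manifests in a directory.
# $1: path to the chart's crds/ directory.
apply_chart_crds() {
  crd_dir=$1
  diff_rc=0
  kubectl diff -f "$crd_dir" || diff_rc=$?
  case $diff_rc in
    0) echo "CRDs already up to date"; return 0 ;;
    1) kubectl apply -f "$crd_dir" ;;
    *) echo "kubectl diff failed (exit $diff_rc)" >&2; return "$diff_rc" ;;
  esac
}

# Usage:
#   apply_chart_crds crds/
```

In practice you would probably pause between the diff and the apply to read the output; the sketch applies immediately to keep the control flow short.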
Common addon-specific notes
| Addon | Watch for |
|---|---|
| ArgoCD | RBAC policy changes, UI breaking changes, ApplicationSet CRD updates |
| cert-manager | CRD version jumps, webhook certificate rotation |
| Capsule | Tenant CRD spec changes, admission webhook compatibility |
| Traefik | IngressRoute/middleware CRD changes, static/dynamic config key renames |
Rollback procedure
Kubernetes version
Kubernetes does not support in-place downgrade of the control plane. If a version upgrade fails:
- Restore the etcd snapshot taken during pre-flight.
- Re-provision control plane components at the previous version.
- Worker nodes: drain, re-image to the previous version, uncordon.
This is a last-resort procedure — prefer fixing forward when possible.
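Before starting a restore, it helps to confirm which snapshot you would actually restore from. The following sketch picks the newest non-empty file matching the pre-flight naming scheme (`/tmp/etcd-backup-YYYYMMDD.db`). The function name is hypothetical, and the restore itself is not shown because it depends on how etcd was deployed on the control plane hosts.

```shell
#!/bin/sh
# Print the newest non-empty snapshot in a directory, or fail if none exists.
# $1: directory to search (defaults to /tmp, as used in the pre-flight step).
# Relies on the YYYYMMDD suffix sorting chronologically in glob order.
latest_snapshot() {
  dir=${1:-/tmp}
  newest=
  for f in "$dir"/etcd-backup-*.db; do
    if [ -s "$f" ]; then
      newest=$f
    fi
  done
  if [ -z "$newest" ]; then
    echo "no usable snapshot in $dir" >&2
    return 1
  fi
  echo "$newest"
}

# Usage:
#   snap=$(latest_snapshot /tmp) || exit 1
#   echo "would restore from $snap"
```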
Addon / chart
- Revert the version-bump MR in `ubernetes-applications`.
- If CRDs were updated: check whether the old chart version is compatible with the new CRDs. If not, manually restore the old CRDs from the previous chart tarball.
- ArgoCD syncs the reverted values within ~60 seconds.
- Verify the addon pods restart with the previous version.
Post-upgrade verification
After any upgrade:
- `kubectl get nodes -o wide`: all nodes `Ready`, correct version.
- `kubectl get pods -A | grep -v Running`: no unexpected non-running pods.
- ArgoCD dashboard: all Applications `Synced` and `Healthy`.
- Spot-check tenant workloads: ask one or two active tenants to confirm their services are operating normally.
- Update the `lastVerified` date on this page and any affected platform-admin docs.
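The node check in the list above can be reduced to a filter that prints only the nodes needing attention. This is a minimal sketch with a hypothetical function name. It flags nodes that are still cordoned (`Ready,SchedulingDisabled`) as well as nodes on the wrong kubelet version.

```shell
#!/bin/sh
# Print nodes that are not plain "Ready" or whose version differs from the target.
# $1: expected kubelet version, including the leading "v".
# Input columns (kubectl get nodes --no-headers):
#   NAME STATUS ROLES AGE VERSION [...]
nodes_needing_attention() {
  awk -v want="$1" '$2 != "Ready" || $5 != want { print $1, $2, $5 }'
}

# Usage (the version shown is only an example):
#   kubectl get nodes --no-headers | nodes_needing_attention v1.29.3
```

Empty output means every node is schedulable, `Ready`, and on the target version.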