Break and fix
How to use this page
Each entry follows a symptom → cause → fix format. Check the symptom heading, confirm the cause matches, then apply the fix. If the issue is not listed here, escalate to RCS via the channels on the Support and contact page.
Pod stuck in Pending
Symptom: A tenant reports pods staying in Pending for more than 5 minutes.
Likely cause: The namespace has hit its ResourceQuota ceiling, or the cluster is out of schedulable capacity.
Fix:
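- Run kubectl describe pod <pod> -n <ns> and check the Events section for FailedScheduling.
- If quota-related: run kubectl get resourcequota -n <ns> and compare used against hard. Quota is enforced at admission, so a quota rejection usually appears as a FailedCreate event on the owning ReplicaSet rather than on the pod; kubectl get events -n <ns> lists these.
- If cluster-capacity-related: run kubectl top nodes and check for nodes near 100% of allocatable. The scheduler works from requests rather than live usage, so also check Allocated resources in kubectl describe node <node>.
- For quota issues, the tenant admin can request a quota increase via the process in the tenant-admin docs. For capacity issues, escalate to the RCS infrastructure team.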
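The quota-versus-capacity decision above can be sketched as a small filter over event messages. This is a rough heuristic, not part of the documented procedure: the function name classify_pending is hypothetical, and the two patterns assume the usual Kubernetes wording ("exceeded quota" from quota admission, "Insufficient" from the scheduler).

```shell
# classify_pending: read event messages on stdin and print a likely category.
# Hypothetical helper; the matched phrases are assumptions about event wording.
classify_pending() {
  events=$(cat)
  case "$events" in
    *"exceeded quota"*) echo "quota" ;;     # rejected by ResourceQuota admission
    *Insufficient*)     echo "capacity" ;;  # scheduler found no node with room
    *)                  echo "unknown" ;;   # neither pattern matched; read the events by hand
  esac
}

# Example against a live cluster (not executed here):
#   kubectl get events -n <ns> \
#     -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' | classify_pending
```

An "unknown" result only means neither phrase appeared; taints, node selectors, and unbound volumes also keep pods in Pending and still need a manual read of the events.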
Namespace creation denied
Symptom: A tenant owner gets namespaces "foo" is forbidden when creating a namespace.
Likely cause: The Capsule namespace quota is exhausted, or the user’s Keycloak group mapping is stale.
Fix:
- Run kubectl get tenant <tenant-name> -o jsonpath='{.status.size}' and compare the result with spec.namespaceOptions.quota.
- If at quota: the tenant needs to delete unused namespaces or request a quota increase.
- If under quota: check that the user’s OIDC token includes the correct group claim. kubectl oidc-login with --v=6 shows the token payload.
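The size-versus-quota comparison can be done in one step. The sketch below assumes the two Tenant fields named above (.status.size and .spec.namespaceOptions.quota) are readable with the caller's credentials; the function name tenant_ns_headroom is hypothetical.

```shell
# tenant_ns_headroom: report whether a Capsule tenant has reached its namespace quota.
# Hypothetical helper; reads the same Tenant fields as the manual steps.
tenant_ns_headroom() {
  tenant=$1
  size=$(kubectl get tenant "$tenant" -o jsonpath='{.status.size}')
  quota=$(kubectl get tenant "$tenant" -o jsonpath='{.spec.namespaceOptions.quota}')
  size=${size:-0}   # treat a missing size as zero namespaces
  if [ -z "$quota" ]; then
    echo "no namespace quota set"
  elif [ "$size" -ge "$quota" ]; then
    echo "at quota ($size/$quota)"
  else
    echo "under quota ($size/$quota)"
  fi
}

# Example (not executed here):
#   tenant_ns_headroom <tenant-name>
```

An "under quota" result points toward the group-claim check rather than a quota increase.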
ArgoCD Application stuck OutOfSync
Symptom: An Application shows OutOfSync indefinitely after a merge.
Likely cause: A schema validation error in the rendered manifests, or a webhook is rejecting the resource.
Fix:
- Run kubectl -n argo-cd describe application <app-name> and read the sync error message.
- If schema error: check the diff in the ArgoCD UI. The cause is usually a CRD version mismatch or a removed field.
- If webhook rejection: run kubectl get events -n <target-ns> --sort-by=.lastTimestamp and look for admission webhook errors.
- Fix the manifest in ubernetes-applications, merge, and wait for the next sync cycle.
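To separate the two causes quickly, the sync error text can be passed through a small classifier. This is a heuristic sketch: classify_sync_error is a hypothetical name, the matched phrases are assumptions about common Kubernetes error wording, and the usage line assumes the message is available under .status.operationState.message on the Application.

```shell
# classify_sync_error: read a sync error message on stdin and guess its cause.
# Hypothetical helper; the phrases below are assumed, not an exhaustive list.
classify_sync_error() {
  msg=$(cat)
  case "$msg" in
    *"admission webhook"*)
      echo "webhook" ;;   # an admission webhook denied the resource
    *"no matches for kind"*|*"unknown field"*|*ValidationError*)
      echo "schema" ;;    # CRD version mismatch or removed field
    *)
      echo "other" ;;     # read the full message manually
  esac
}

# Example (not executed here):
#   kubectl -n argo-cd get application <app-name> \
#     -o jsonpath='{.status.operationState.message}' | classify_sync_error
```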
Certificate not issuing
Symptom: A new Ingress shows CertificateNotReady or the browser shows a TLS warning.
Likely cause: cert-manager cannot complete the ACME challenge — usually a DNS propagation delay or a rate-limit hit.
Fix:
- Run kubectl get certificate -n <ns> and check the READY column.
- Run kubectl describe certificaterequest -n <ns> and look for ACME solver errors.
- Run kubectl logs -n cert-manager deploy/cert-manager -c cert-manager --tail=50 and check for rate-limit or authorization failures.
- If DNS propagation: wait 5-10 minutes and re-check. If rate-limited: check Let’s Encrypt rate limit status and wait for the window to reset.
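In a namespace with many certificates, the first step can be narrowed to the ones that are not Ready. The sketch below assumes each Certificate reports a condition of type Ready in its status; the function name not_ready_certs is hypothetical.

```shell
# not_ready_certs: print the names of Certificates in a namespace whose
# Ready condition is anything other than "True" (including missing).
# Hypothetical helper built on kubectl's jsonpath output.
not_ready_certs() {
  kubectl get certificate -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' |
    awk -F '\t' '$1 != "" && $2 != "True" { print $1 }'
}

# Example (not executed here):
#   not_ready_certs <ns>
```

Each name printed is a candidate for the describe and log checks above.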
Node NotReady
Symptom: kubectl get nodes shows one or more nodes as NotReady.
Likely cause: The kubelet has stopped reporting, usually due to resource exhaustion (disk pressure, memory pressure) or a VM-level issue on the OpenStack host.
Fix:
- Run kubectl describe node <node> and check Conditions for DiskPressure, MemoryPressure, or NetworkUnavailable.
- If disk pressure: identify and clean up large files (container images, logs). crictl rmi --prune removes unused images.
- If the node is unreachable: check the VM status in the OpenStack dashboard. Reboot the VM if it is in an error state.
- If the node does not recover within 15 minutes, cordon and drain it: kubectl cordon <node> && kubectl drain <node> --ignore-daemonsets --delete-emptydir-data.
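Because a node may recover while it is being investigated, it can help to re-check its Ready condition immediately before draining. The sketch below wraps the cordon and drain command from the last step in that check; drain_if_not_ready is a hypothetical name, and the 15-minute wait is left to the operator.

```shell
# drain_if_not_ready: cordon and drain a node only if its Ready condition
# is not "True" at the moment of the call. Hypothetical helper.
drain_if_not_ready() {
  node=$1
  ready=$(kubectl get node "$node" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  if [ "$ready" = "True" ]; then
    echo "$node is Ready, skipping drain"
    return 0
  fi
  # Same command pair as the manual procedure.
  kubectl cordon "$node" &&
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
}

# Example (not executed here):
#   drain_if_not_ready <node>
```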