Break and fix

Each entry follows a symptom → cause → fix format. Check the symptom heading, confirm the cause matches, then apply the fix. If the issue is not listed here, escalate to RCS via the channels on the Support and contact page.


Symptom: A tenant reports pods staying in Pending for more than 5 minutes.

Likely cause: The namespace has hit its ResourceQuota ceiling, or the cluster is out of schedulable capacity.

Fix:

  1. kubectl describe pod <pod> -n <ns> — check the Events section for FailedScheduling.
  2. If quota-related: kubectl get resourcequota -n <ns> — compare used vs hard.
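  3. If cluster-capacity-related: kubectl top nodes shows live usage as a percentage of allocatable. The scheduler works from requests rather than usage, so also check the Allocated resources section of kubectl describe node <node> for nodes whose requests are near 100% of allocatable.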
  4. For quota issues, the tenant admin can request a quota increase via the process in the tenant-admin docs. For capacity issues, escalate to the RCS infrastructure team.
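
To decide quickly between the quota path and the capacity path, the event messages can be classified with a small filter. This is a minimal sketch, not part of the RCS tooling: the function name classify_pending is hypothetical, and it assumes the usual Kubernetes wording, where quota rejections contain "exceeded quota" (often on the owning ReplicaSet or Job rather than on the pod) and scheduler failures contain "Insufficient <resource>".

```shell
# Classify why pods are not starting, based on event messages read from stdin.
# Prints "quota", "capacity", or "unknown". The function name is illustrative only.
classify_pending() {
  awk '
    /exceeded quota/ { quota = 1 }     # ResourceQuota admission rejection
    /Insufficient /  { capacity = 1 }  # scheduler found no node with enough unrequested resources
    END {
      if (quota) { print "quota" }
      else if (capacity) { print "capacity" }
      else { print "unknown" }
    }'
}

# Example against a live namespace (requires cluster access):
#   kubectl get events -n <ns> -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' | classify_pending
```

A result of "unknown" means neither pattern appeared, in which case the full Events output from step 1 still needs to be read by hand.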

Symptom: A tenant owner gets namespaces "foo" is forbidden when creating a namespace.

Likely cause: The Capsule namespace quota is exhausted, or the user’s Keycloak group mapping is stale.

Fix:

  1. kubectl get tenant <tenant-name> -o jsonpath='{.status.size}' — compare with spec.namespaceOptions.quota.
  2. If at quota: the tenant needs to delete unused namespaces or request a quota increase.
  3. If under quota: check that the user’s OIDC token includes the correct group claim — kubectl oidc-login with --v=6 shows the token payload.
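
The comparison in steps 1 and 2 can be sketched as a small helper. It assumes the two values have already been read with the jsonpath queries shown in the comments; the function name tenant_quota_state and its output labels are hypothetical.

```shell
# Compare a tenant's namespace count with its Capsule namespace quota.
# $1 = value of .status.size (treated as 0 if empty)
# $2 = value of .spec.namespaceOptions.quota (may be empty if no quota is set)
tenant_quota_state() {
  size=${1:-0}
  quota=$2
  if [ -z "$quota" ]; then
    echo "no-quota-set"
  elif [ "$size" -ge "$quota" ]; then
    echo "at-quota"
  else
    echo "under-quota"
  fi
}

# Example against a live cluster (requires access to the Tenant resource):
#   size=$(kubectl get tenant <tenant-name> -o jsonpath='{.status.size}')
#   quota=$(kubectl get tenant <tenant-name> -o jsonpath='{.spec.namespaceOptions.quota}')
#   tenant_quota_state "$size" "$quota"
```

"at-quota" points to step 2; "under-quota" or "no-quota-set" points to the group-claim check in step 3.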

Symptom: An Application shows OutOfSync indefinitely after a merge.

Likely cause: A schema validation error in the rendered manifests, or a webhook is rejecting the resource.

Fix:

  1. kubectl -n argo-cd describe application <app-name> — read the sync error message.
  2. If schema error: check the diff in the ArgoCD UI — usually a CRD version mismatch or a removed field.
  3. If webhook rejection: kubectl get events -n <target-ns> --sort-by=.lastTimestamp — look for admission webhook errors.
  4. Fix the manifest in kubernetes-applications, merge, and wait for the next sync cycle.
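
When the event list is long, the webhook names can be pulled out of the rejection messages. The sketch below assumes the standard API server wording, admission webhook "<name>" denied the request; the function name webhook_denials is hypothetical.

```shell
# Print the distinct admission webhooks that denied a request,
# based on event or error text read from stdin.
webhook_denials() {
  sed -n 's/.*admission webhook "\([^"]*\)" denied the request.*/\1/p' | sort -u
}

# Example against a live namespace (requires cluster access):
#   kubectl get events -n <target-ns> --sort-by=.lastTimestamp | webhook_denials
```

The same filter can be fed the output of the describe command from step 1, since the sync error usually repeats the webhook message verbatim.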

Symptom: A new Ingress shows CertificateNotReady or the browser shows a TLS warning.

Likely cause: cert-manager cannot complete the ACME challenge — usually a DNS propagation delay or a rate-limit hit.

Fix:

  1. kubectl get certificate -n <ns> — check the READY column.
  2. kubectl describe certificaterequest -n <ns> — look for ACME solver errors.
  3. kubectl logs -n cert-manager deploy/cert-manager -c cert-manager --tail=50 — check for rate-limit or authorization failures.
  4. If DNS propagation: wait 5-10 minutes and re-check. If rate-limited: check Let’s Encrypt rate limit status and wait for the window to reset.
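
In a namespace with many certificates, step 1 can be narrowed to the ones that are not ready. This is a sketch that assumes the default column order of kubectl get certificate (NAME, READY, SECRET, AGE); the function name not_ready_certs is hypothetical.

```shell
# Print the names of certificates whose READY column is anything other than "True".
# Expects "kubectl get certificate --no-headers" output on stdin.
not_ready_certs() {
  awk 'NF && $2 != "True" { print $1 }'
}

# Example against a live namespace (requires cluster access):
#   kubectl get certificate -n <ns> --no-headers | not_ready_certs
```

Each name printed is a candidate for the describe and log checks in steps 2 and 3.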

Symptom: kubectl get nodes shows one or more nodes as NotReady.

Likely cause: The kubelet has stopped reporting, usually due to resource exhaustion (disk pressure, memory pressure) or a VM-level issue on the OpenStack host.

Fix:

  1. kubectl describe node <node> — check Conditions for DiskPressure, MemoryPressure, or NetworkUnavailable.
  2. If disk pressure: identify and clean up large files (container images, logs). crictl rmi --prune removes unused images.
  3. If the node is unreachable: check the VM status in the OpenStack dashboard. Reboot the VM if it is in an error state.
  4. If the node does not recover within 15 minutes, cordon and drain it: kubectl cordon <node> && kubectl drain <node> --ignore-daemonsets --delete-emptydir-data.
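
On larger clusters it helps to list only the affected nodes before working through the steps above. The sketch below assumes the default column order of kubectl get nodes (NAME, STATUS, ROLES, AGE, VERSION); the function name not_ready_nodes is hypothetical.

```shell
# Print the names of nodes whose STATUS starts with "NotReady"
# (this also matches "NotReady,SchedulingDisabled" on cordoned nodes).
# Expects "kubectl get nodes --no-headers" output on stdin.
not_ready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}

# Example against a live cluster (requires cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Each name printed can be passed to kubectl describe node in step 1.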