Moving data

Getting data onto and off the cluster is one of the first things you will need to do. This page covers the supported patterns, their tradeoffs, and the anti-patterns that cause trouble.

Patterns

kubectl cp (small transfers)

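kubectl cp copies files between your local machine and a running Pod. It streams a tar archive over the exec channel through the Kubernetes API server, so no direct network path to the Pod is needed. Because the archive is unpacked inside the container, the container image must include a tar binary.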

# Local → Pod
kubectl cp ./local-file.csv <your-tenant>-prod/my-pod:/data/local-file.csv

# Pod → Local
kubectl cp <your-tenant>-prod/my-pod:/data/output.csv ./output.csv

This works well for small files (under ~500 MB). For anything larger, use a PVC-backed Job or rsync sidecar instead — see the anti-patterns section below for why.
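The size guideline above can be enforced with a small wrapper. This is a minimal sketch, not a supported tool: the function names guarded_cp and small_enough_for_cp and the LIMIT_BYTES variable are hypothetical, and the limit simply mirrors the ~500 MB rule of thumb.

```shell
# Hypothetical helper: refuse kubectl cp for files above a size limit.
# LIMIT_BYTES mirrors the ~500 MB guideline; adjust it to taste.
LIMIT_BYTES=$((500 * 1024 * 1024))

# Exit status 0 if the file is at or below LIMIT_BYTES.
small_enough_for_cp() {
  size=$(wc -c < "$1" | tr -d ' ')
  [ "$size" -le "$LIMIT_BYTES" ]
}

# Copy one local file to a Pod, but only if it is under the limit.
# Usage: guarded_cp ./local-file.csv my-namespace/my-pod:/data/local-file.csv
guarded_cp() {
  if [ ! -f "$1" ]; then
    echo "not a regular file: $1" >&2
    return 2
  fi
  if small_enough_for_cp "$1"; then
    kubectl cp "$1" "$2"
  else
    echo "refusing: $1 exceeds $LIMIT_BYTES bytes; use a Job or rsync instead" >&2
    return 1
  fi
}
```

The wrapper only handles single files; for directories you would need to total the sizes first (for example with du), which is left out here.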

PVC-backed Jobs (bulk data processing)

A Job that mounts a PVC can pull data from an external source (object store, HTTP endpoint, database dump), process it, and write results back to the PVC. This is the standard pattern for bulk data movement because the transfer runs inside the cluster network, is retried automatically by the Job controller if the Pod fails, and does not depend on your local machine staying connected. Note that a retry reruns the container command, so in the example below the download starts again from the beginning rather than resuming.

apiVersion: batch/v1
kind: Job
metadata:
  name: data-import
  namespace: <your-tenant>-prod
spec:
  template:
    spec:
      containers:
        - name: import
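          image: curlimages/curl:latest  # consider pinning a specific tag for reproducible runs
          command:
            - sh
            - -c
            # -f fails on HTTP errors, -S shows errors, -L follows redirects
            - "curl -fSL https://example.com/dataset.tar.gz | tar xz -C /data"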
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: my-data
      restartPolicy: OnFailure

See Jobs & CronJobs for the full recipe with PSS-compliant security context, active deadlines, and TTL cleanup.
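Once the manifest is written, submitting it and waiting for the result might look like the sketch below. The function name run_job_and_wait, the manifest file name, and the 30-minute default timeout are assumptions for illustration; the kubectl subcommands (apply, wait, logs) are standard.

```shell
# Submit a Job manifest and block until it completes or the timeout expires.
# Usage: run_job_and_wait my-namespace data-import data-import.yaml 30m
run_job_and_wait() {
  ns=$1 job=$2 manifest=$3 timeout=${4:-30m}
  kubectl apply -n "$ns" -f "$manifest" || return 1
  # kubectl wait returns non-zero if the condition is not met in time.
  # It does not return early when the Job fails, so a failed Job is
  # reported only once the timeout expires.
  if kubectl wait -n "$ns" --for=condition=complete "job/$job" --timeout="$timeout"; then
    echo "job $job completed"
  else
    # Show recent log lines to help diagnose the failure or timeout.
    kubectl logs -n "$ns" "job/$job" --tail=50
    return 1
  fi
}
```

Because the Job runs inside the cluster, you can also close your terminal after the apply step and check on it later with kubectl get job.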

NFS mount from outside (if your tenant has one)

If RCS has provisioned a per-tenant NFS volume for you (the same static PV used for shared RWX storage — see Persistent volumes in practice), you can mount that NFS export directly from a workstation or transfer node that has access to it. This gives you native filesystem access to the same data your Pods see through the NFS-backed PVC — no Kubernetes API involved.

sudo mkdir -p /mnt/tenant-data   # the mount point must exist before mounting
sudo mount -t nfs rsfshare.uvic.ca:/mlp /mnt/tenant-data

This option is not available by default and depends on network access to the NFS server. Ask RCS whether an external NFS mount is available for your use case.

rsync sidecar (resumable transfers)

For large or unreliable transfers, run an rsync server as a sidecar container alongside your workload. The sidecar mounts the same PVC and exposes rsync over a ClusterIP Service. You then kubectl port-forward to the Service and rsync from your local machine:

kubectl port-forward -n <your-tenant>-prod svc/rsync-sidecar 8873:873 &
rsync -avz --progress ./large-dataset/ rsync://localhost:8873/data/

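Rerunning the same rsync command after an interruption skips files that have already arrived, sends only the changed blocks of files that differ, and reports progress as it goes. kubectl cp can do none of these. By default rsync discards a file that was only partly transferred; add --partial to keep it so the next run can continue that file instead of sending it again from the start.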
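For unattended transfers, the rsync command can be wrapped in a retry loop. The sketch below is illustrative only: the function name rsync_with_retry and its default attempt count and delay are hypothetical, and it assumes the port-forward is still running (if it has dropped, restart it before retrying).

```shell
# Retry rsync until it succeeds or the attempt budget runs out.
# --partial keeps partly transferred files so the next attempt can
# continue them instead of starting each file again.
# Usage: rsync_with_retry ./large-dataset/ rsync://localhost:8873/data/
rsync_with_retry() {
  src=$1 dest=$2 max_attempts=${3:-5} delay=${4:-10}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if rsync -avz --partial --progress "$src" "$dest"; then
      return 0
    fi
    echo "rsync attempt $attempt of $max_attempts failed" >&2
    attempt=$((attempt + 1))
    # Pause before the next attempt, but not after the final one.
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

The third and fourth arguments are optional and set the maximum number of attempts and the delay in seconds between them.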

Anti-patterns

Large kubectl cp transfers

kubectl cp streams data through the Kubernetes API server as a tar pipe over an exec channel. There is no resume, no progress indication, no delta transfer, and no compression. A network hiccup drops the entire transfer, and you start over. For anything over ~500 MB, use a PVC-backed Job or rsync sidecar instead.

Storing data in the container filesystem

The container filesystem is ephemeral. Anything written outside a mounted PVC is lost when the container restarts or when the Pod is rescheduled or evicted. Treat it as temporary scratch space, not as a storage option.

If your workload writes output to a local path, make sure that path is backed by a PVC volumeMount. Otherwise the data survives only as long as the specific container instance does.
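One way to check which of a Pod's volumes are PVC-backed is to query the Pod spec. The sketch below is an assumption-laden helper rather than an official check: the function name pvc_backed_volumes is hypothetical, and it lists only volumes, so you still need to compare the volume names against the container's volumeMounts to see which paths they cover.

```shell
# Print "volume-name -> claim-name" for each PVC-backed volume of a Pod.
# Volumes of other types (emptyDir, configMap, ...) have no claim name
# and are filtered out.
# Usage: pvc_backed_volumes my-namespace my-pod
pvc_backed_volumes() {
  kubectl get pod "$2" -n "$1" \
    -o jsonpath='{range .spec.volumes[*]}{.name}{"\t"}{.persistentVolumeClaim.claimName}{"\n"}{end}' |
    awk -F '\t' '$2 != "" { print $1 " -> " $2 }'
}
```

If the path your workload writes to is not under a mountPath of one of the listed volumes, its output lives in the container filesystem.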

Downloading large datasets at Pod startup

Pulling a multi-gigabyte dataset in an initContainer on every Pod start is slow and wasteful. The download runs every time the Pod reschedules, consumes network bandwidth, and delays readiness. Instead, download the dataset once into a PVC (via a Job) and mount that PVC into your workload.
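The download-once idea can be made explicit with a completion marker on the PVC, so a rerun of the Job skips work that is already done. This is a rough sketch under stated assumptions: the function name fetch_once and the marker file name .download-complete are hypothetical, and the URL is the placeholder used earlier on this page.

```shell
# Download and unpack a dataset only if a completion marker is absent.
# Usage (inside the Job container):
#   fetch_once https://example.com/dataset.tar.gz /data
fetch_once() {
  url=$1 dest=$2 marker="$2/.download-complete"
  if [ -f "$marker" ]; then
    echo "dataset already present in $dest, skipping download"
    return 0
  fi
  # The pipeline's exit status is tar's. tar exits non-zero on truncated
  # or empty input, so a failed download is normally caught as well.
  # The marker is written only when the pipeline succeeds.
  curl -fSL "$url" | tar xz -C "$dest" && touch "$marker"
}
```

Workload Pods can then mount the same PVC and, if needed, check for the marker before starting instead of downloading anything themselves.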