@@ -12,7 +12,7 @@
user_questions:
- How can I deploy Ray clusters on Giant Swarm using KubeRay?
- How do I configure KubeRay for distributed machine learning workloads?
last_review_date: 2025-10-21
last_review_date: 2026-05-19
---

[Ray](https://www.ray.io/) is a unified framework for scaling AI and Python applications. It provides a simple, universal API for building distributed applications and includes libraries for machine learning, reinforcement learning, and hyperparameter tuning. [KubeRay](https://ray-project.github.io/kuberay/) is the official Kubernetes operator for Ray that automates the deployment, scaling, and management of Ray clusters on Kubernetes.
@@ -51,30 +51,32 @@
--name=kuberay-operator \
--organization=${ORGANIZATION} \
--target-namespace=kuberay-system \
--version=1.1.0 > kuberay-operator.yaml
--version=1.1.0 2>/dev/null > kuberay-operator.yaml

kubectl apply -f kuberay-operator.yaml
```

**Note**: Recent releases of `kubectl gs` may print a deprecation banner when you run `kubectl gs template app`, because the way apps are deployed is changing. That's why the command redirects `stderr` to `/dev/null`.

### Verifying the installation

Check that the KubeRay operator is running:

```bash
kubectl get pods -n kuberay-system
```nohighlight
$ kubectl get pods -n kuberay-system

NAME READY STATUS RESTARTS AGE
kuberay-operator-7b5c8f6d4b-xyz12 1/1 Running 0 2m
```

Verify that the Custom Resource Definitions (CRDs) are installed:

```bash
kubectl get crd | grep ray
```nohighlight
$ kubectl get crd | grep ray

rayclusters.ray.io 2025-10-12T10:00:00Z
rayjobs.ray.io 2025-10-12T10:00:00Z
rayservices.ray.io 2025-10-12T10:00:00Z
rayclusters.ray.io 2026-05-19T10:00:00Z
rayjobs.ray.io 2026-05-19T10:00:00Z
rayservices.ray.io 2026-05-19T10:00:00Z
```

## Deploying a Ray cluster
@@ -83,26 +85,53 @@

### Basic Ray cluster configuration

Create a basic Ray cluster configuration:
Create a basic Ray cluster configuration. The manifest below works on a standard Giant Swarm workload cluster where Kyverno enforces the restricted profile of the Pod Security Standards (PSS), which is the default on most installations:

```yaml
apiVersion: ray.io/v1alpha1
apiVersion: ray.io/v1
kind: RayCluster
metadata:
name: sample-raycluster
namespace: default
spec:
rayVersion: '2.50.1'
enableInTreeAutoscaling: true
# The operator injects an autoscaler sidecar into the head pod when
# enableInTreeAutoscaling is true. PSS-restricted clusters reject it
# unless we set its securityContext explicitly.
autoscalerOptions:
securityContext:
runAsNonRoot: true
runAsUser: 1000
allowPrivilegeEscalation: false
seccompProfile:
type: RuntimeDefault
capabilities:
drop: [ALL]
headGroupSpec:
rayStartParams:
dashboard-host: '0.0.0.0'
block: 'true'
template:
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1000
runAsGroup: 100
fsGroup: 100
seccompProfile:
type: RuntimeDefault
containers:
- name: ray-head
image: rayproject/ray:2.50.1
securityContext:
runAsNonRoot: true
runAsUser: 1000
allowPrivilegeEscalation: false
seccompProfile:
type: RuntimeDefault
capabilities:
drop: [ALL]
ports:
- containerPort: 6379
name: gcs-server
@@ -111,12 +140,15 @@
- containerPort: 10001
name: client
resources:
# 4Gi memory is the practical minimum for the head: the Ray
# dashboard subprocesses use about 1.94Gi when idle, so a 2Gi
# limit OOMs the moment you submit a job.
limits:
cpu: "2"
memory: "2Gi"
memory: "4Gi"
requests:
cpu: "1"
memory: "1Gi"
memory: "2Gi"
workerGroupSpecs:
- replicas: 2
minReplicas: 1
@@ -125,24 +157,24 @@
rayStartParams: {}
template:
spec:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: nvidia.com/gpu.present
operator: In
values:
- "true"
runtimeClassName: nvidia
tolerations:
- key: nvidia.com/gpu
value: "true"
effect: NoSchedule
securityContext:
runAsNonRoot: true
runAsUser: 1000
runAsGroup: 100
fsGroup: 100
seccompProfile:
type: RuntimeDefault
containers:
- name: ray-worker
image: rayproject/ray:2.50.1
securityContext:
runAsNonRoot: true
runAsUser: 1000
allowPrivilegeEscalation: false
seccompProfile:
type: RuntimeDefault
capabilities:
drop: [ALL]
resources:
limits:
cpu: "2"
@@ -158,28 +190,32 @@
kubectl apply -f ray-cluster.yaml
```

**Note**: The manifest above schedules `Ray` workers on any node. To place workers on GPU nodes, add `runtimeClassName: nvidia` and a toleration for the `nvidia.com/gpu` taint to the worker `template.spec`. Leave those settings out on non-GPU clusters, because they prevent scheduling there.
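
As a rough illustration of the GPU settings described in the note, the sketch below patches a `RayCluster` manifest that has already been loaded into a Python dictionary. The helper name `add_gpu_scheduling` is hypothetical, and the toleration `value` and `effect` are taken from the earlier version of this manifest; they're assumptions that must match the taints on your own GPU nodes.

```python
import copy


def add_gpu_scheduling(ray_cluster: dict) -> dict:
    """Return a copy of a RayCluster manifest whose workers target GPU nodes.

    Hypothetical helper: the taint value and effect below are assumptions
    and must match the taints configured on your GPU nodes.
    """
    patched = copy.deepcopy(ray_cluster)
    toleration = {"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}

    for group in patched["spec"].get("workerGroupSpecs", []):
        pod_spec = group["template"]["spec"]
        # Use the NVIDIA container runtime for the worker pods.
        pod_spec["runtimeClassName"] = "nvidia"
        # Allow the workers onto tainted GPU nodes, without adding duplicates.
        tolerations = pod_spec.setdefault("tolerations", [])
        if toleration not in tolerations:
            tolerations.append(toleration)

    return patched
```

The function returns a copy instead of modifying its input, so the original manifest stays usable for clusters without GPU nodes.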

### Verifying the Ray cluster deployment

Check the status of your Ray cluster:

```bash
kubectl get raycluster
```nohighlight
$ kubectl get raycluster

NAME DESIRED WORKERS AVAILABLE WORKERS STATUS AGE
sample-raycluster 2 2 ready 3m
NAME DESIRED WORKERS AVAILABLE WORKERS CPUS MEMORY GPUS STATUS AGE
sample-raycluster 2 2 6 6Gi 0 ready 3m
```

List the Ray cluster pods:

```bash
kubectl get pods -l ray.io/cluster=sample-raycluster
```nohighlight
$ kubectl get pods -l ray.io/cluster=sample-raycluster

NAME READY STATUS RESTARTS AGE
sample-raycluster-head-xxxxx 1/1 Running 0 3m
sample-raycluster-worker-small-group-xxxxx 1/1 Running 0 3m
sample-raycluster-worker-small-group-yyyyy 1/1 Running 0 3m
sample-raycluster-head-xxxxx 2/2 Running 0 3m
sample-raycluster-small-group-worker-xxxxx 1/1 Running 0 3m
sample-raycluster-small-group-worker-yyyyy 1/1 Running 0 3m
```

The head pod shows `2/2` containers because the operator injects an autoscaler sidecar alongside the Ray head container when `enableInTreeAutoscaling` is `true`.

## Accessing the Ray cluster

### Using the Ray Dashboard
@@ -190,37 +226,64 @@
kubectl port-forward service/sample-raycluster-head-svc 8265:8265
```

`sample-raycluster-head-svc` is a headless service (`ClusterIP: None`), but `kubectl port-forward` resolves it to the head pod and works the same way.

Open your browser and navigate to `http://localhost:8265` to access the Ray Dashboard.

![Ray UI](ray-ui.png)

## Running a test job

Once your Ray cluster is running, you can submit a computing job using the Ray Job Submission SDK to test the cluster capabilities.
Once your Ray cluster is running, submit a computing job to validate it. We'll calculate the value of π using the Monte Carlo method. The Python script lives [in this gist](https://gist.githubusercontent.com/pipo02mix/a32771ec8358d338426c915e2b7a8078/raw/9bb509f37dba7edf09f042cee5e71f78aa0ccb10/dt.py).
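
For readers unfamiliar with the method, here is a minimal plain-Python sketch of a Monte Carlo estimate of π. It isn't the gist's script, which may be structured differently; the function names and batch sizes are illustrative assumptions. It samples random points in the unit square and counts how many fall inside the quarter circle.

```python
import random


def count_hits(num_samples: int, seed: int) -> int:
    """Count random points in the unit square that land inside the quarter circle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(num_batches: int = 8, samples_per_batch: int = 50_000) -> float:
    """Combine independent batches into a single estimate of pi."""
    # Every batch is independent of the others. That independence is what
    # makes the method easy to spread across Ray workers as separate tasks.
    total_hits = sum(count_hits(samples_per_batch, seed) for seed in range(num_batches))
    total_samples = num_batches * samples_per_batch
    # The quarter circle covers pi/4 of the unit square.
    return 4.0 * total_hits / total_samples


if __name__ == "__main__":
    print(f"pi is approximately {estimate_pi():.4f}")
```

In a distributed version, each call to `count_hits` could run as its own remote task, with only the final sum computed in one place.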

First, make sure you have the Ray client on your local machine:
Make sure the dashboard port is still forwarded:

```bash
pip install -U "ray[default]"
kubectl port-forward service/sample-raycluster-head-svc 8265:8265
```

Set up port forwarding to access your Ray cluster:
You can submit the job in two ways.

### Option A: Ray CLI

Install the Ray client if you don't already have it:

```bash
kubectl port-forward service/sample-raycluster-head-svc 8265:8265
pip install -U "ray[default]"
```

Let's calculate the value of pi using the Monte Carlo method. The Python script can be found [in this gist file](https://gist.githubusercontent.com/pipo02mix/a32771ec8358d338426c915e2b7a8078/raw/9bb509f37dba7edf09f042cee5e71f78aa0ccb10/dt.py). You can use this command to submit the job to the Ray cluster API.
Then submit the job. The `working_dir` points at the gist so you don't need a local copy:

```bash
# Submit a job using Ray CLI
ray job submit \
--address="http://localhost:8265" \
--runtime-env-json='{"pip": ["numpy"], "working_dir": "."}' \
-- python https://gist.githubusercontent.com/pipo02mix/a32771ec8358d338426c915e2b7a8078/raw/9bb509f37dba7edf09f042cee5e71f78aa0ccb10/dt.py
--runtime-env-json='{"pip": ["numpy"], "working_dir": "https://gist.githubusercontent.com/pipo02mix/a32771ec8358d338426c915e2b7a8078/archive/9bb509f37dba7edf09f042cee5e71f78aa0ccb10.zip"}' \
-- python dt.py
```

### Option B: REST API (no Python required)

If you don't have Python or `ray` installed locally, submit the job directly with `curl`:

```bash
curl -X POST http://localhost:8265/api/jobs/ \
-H "Content-Type: application/json" \
-d '{
"entrypoint": "python dt.py",
"runtime_env": {
"pip": ["numpy"],
"working_dir": "https://gist.githubusercontent.com/pipo02mix/a32771ec8358d338426c915e2b7a8078/archive/9bb509f37dba7edf09f042cee5e71f78aa0ccb10.zip"
}
}'
```
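
If you do have Python available but not the `ray` package, the same request can be sketched with the standard library alone. The helper names `build_job_request` and `submit_job` are hypothetical; the endpoint, payload shape, and `submission_id` field mirror the `curl` call in this section.

```python
import json
import urllib.request

# Archive URL copied from the curl example in this section.
GIST_ARCHIVE = (
    "https://gist.githubusercontent.com/pipo02mix/"
    "a32771ec8358d338426c915e2b7a8078/archive/"
    "9bb509f37dba7edf09f042cee5e71f78aa0ccb10.zip"
)


def build_job_request(address: str, entrypoint: str, working_dir: str,
                      pip_packages: list) -> urllib.request.Request:
    """Build the POST request for the job submission endpoint (hypothetical helper)."""
    payload = {
        "entrypoint": entrypoint,
        "runtime_env": {"pip": list(pip_packages), "working_dir": working_dir},
    }
    return urllib.request.Request(
        url=f"{address.rstrip('/')}/api/jobs/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_job(request: urllib.request.Request) -> str:
    """Send the request and return the submission ID from the JSON response."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["submission_id"]


if __name__ == "__main__":
    # Requires the port-forward to localhost:8265 to be active.
    req = build_job_request("http://localhost:8265", "python dt.py",
                            GIST_ARCHIVE, ["numpy"])
    print(submit_job(req))
```

Building the request separately from sending it makes it possible to inspect the payload before anything reaches the cluster.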

The response includes a `submission_id`. Poll the status with:

```bash
curl -s http://localhost:8265/api/jobs/<submission_id>
```
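
The polling step can also be scripted. The sketch below assumes that the job details returned by the endpoint contain a `status` field and that `SUCCEEDED`, `FAILED`, and `STOPPED` are the terminal states; check both against the Ray version you run. The helper names are hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable

# Assumed terminal job states; verify against your Ray version.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "STOPPED"}


def fetch_job(address: str, submission_id: str) -> dict:
    """Fetch the job details for one submission from the dashboard API."""
    url = f"{address.rstrip('/')}/api/jobs/{submission_id}"
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def wait_for_job(get_job: Callable[[], dict], interval_seconds: float = 2.0,
                 max_polls: int = 150) -> str:
    """Poll until the job reaches a terminal state, then return that state."""
    for _ in range(max_polls):
        status = get_job().get("status", "UNKNOWN")
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(interval_seconds)
    raise TimeoutError(f"job still not finished after {max_polls} polls")
```

With the port-forward active, you could call it as `wait_for_job(lambda: fetch_job("http://localhost:8265", submission_id))`, where `submission_id` is the value returned when you submitted the job. Passing the fetch function as an argument keeps the polling loop independent of the HTTP call.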

Observe in the dashboard how the job is executed in parallel and how resources are scaled based on load.
Either way, observe in the dashboard how the job is executed in parallel and how resources are scaled based on load.

![Ray Job UI](job-ui.png)
