Submitting RayJobs

The CodeFlare SDK provides a RayJob interface for submitting and managing Ray jobs via the KubeRay operator (RayJob custom resource). You can either create a short-lived Ray cluster for the job (managed by the operator and cleaned up after the job finishes) or run the job on an existing Ray cluster.

Import the following to use RayJob:

from codeflare_sdk import RayJob, ManagedClusterConfig

Submitting a job with a new cluster (ManagedClusterConfig)

When you provide cluster_config, the KubeRay operator creates a Ray cluster for the job and tears it down after the job completes. You do not need to manage the cluster lifecycle yourself.

Required: job_name (str), entrypoint (str), cluster_config (ManagedClusterConfig).

Optional: namespace, runtime_env, ttl_seconds_after_finished, active_deadline_seconds, local_queue, priority_class.

from codeflare_sdk import RayJob, ManagedClusterConfig

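# Resources for the short-lived cluster that the operator creates for this job
cluster_config = ManagedClusterConfig(
    # Head node memory (request and limit)
    head_memory_requests=6,
    head_memory_limits=8,
    # Worker count and per-worker CPU and memory (request and limit)
    num_workers=2,
    worker_cpu_requests=1,
    worker_cpu_limits=1,
    worker_memory_requests=4,
    worker_memory_limits=6,
    # Accelerator counts per node; 0 requests no GPUs
    head_accelerators={"nvidia.com/gpu": 0},
    worker_accelerators={"nvidia.com/gpu": 0},
)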

job = RayJob(
    job_name="my-rayjob",
    entrypoint="python -c 'print(\"Hello from RayJob!\")'",
    cluster_config=cluster_config,
    namespace="default",
)
job.submit()

Submitting a job to an existing cluster

When you provide cluster_name, the job runs on an existing Ray cluster. The cluster is not shut down when the job finishes.

Required: job_name (str), entrypoint (str), cluster_name (str).

Optional: namespace, runtime_env, active_deadline_seconds, local_queue, priority_class.

Note: ttl_seconds_after_finished cannot be set when using an existing cluster.

from codeflare_sdk import RayJob

job = RayJob(
    job_name="my-rayjob",
    entrypoint="python my_script.py",
    cluster_name="my-existing-cluster",
    namespace="default",
)
job.submit()
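The two submission modes take different arguments, and a small pre-flight check can catch a bad combination before anything reaches the cluster. The sketch below is illustrative only: check_rayjob_args is a hypothetical helper, not part of the SDK, and it assumes that cluster_config and cluster_name are mutually exclusive (the text above implies this but does not state it). The ttl_seconds_after_finished rule comes from the note above.

```python
from typing import Any, Optional


def check_rayjob_args(
    cluster_config: Optional[Any] = None,
    cluster_name: Optional[str] = None,
    ttl_seconds_after_finished: Optional[int] = None,
) -> str:
    """Return "managed" or "existing", or raise ValueError for an invalid mix.

    Hypothetical helper: it mirrors the rules described in this guide and
    does not call the SDK.
    """
    # Assumption: exactly one of the two cluster arguments must be given.
    if (cluster_config is None) == (cluster_name is None):
        raise ValueError("Provide exactly one of cluster_config or cluster_name")
    # From the note above: no TTL when the job runs on an existing cluster.
    if cluster_name is not None and ttl_seconds_after_finished is not None:
        raise ValueError(
            "ttl_seconds_after_finished cannot be set with an existing cluster"
        )
    return "managed" if cluster_config is not None else "existing"
```

Calling check_rayjob_args(cluster_name="my-existing-cluster") returns "existing"; adding ttl_seconds_after_finished=60 to that call raises ValueError.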

RayJob methods

job.submit() — Submits the RayJob to the KubeRay operator. Returns the job name on success. When using cluster_config, the operator creates the cluster and runs the job; when using cluster_name, the job is submitted to the specified cluster.
job.status(print_to_console=True) — Returns the job status (e.g. RUNNING, COMPLETE, FAILED) and a ready flag; optionally prints a formatted status to the console.
job.stop() — Suspends the Ray job.
job.resubmit() — Resubmits the Ray job.
job.delete() — Deletes the RayJob custom resource (and the cluster if it was created by this RayJob).
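
A common pattern after job.submit() is to poll job.status() until the job reaches a terminal state. The sketch below is one possible approach, not an SDK feature: wait_for_rayjob is a hypothetical helper, and it assumes that status() returns a (status, ready) pair and that COMPLETE and FAILED are the terminal states. The status value is reduced to a string, so it should work whether the SDK returns an enum member or plain text.

```python
import time
from typing import Any

# Assumption: these are the states in which the job will not progress further.
TERMINAL_STATES = {"COMPLETE", "FAILED"}


def wait_for_rayjob(
    job: Any, poll_seconds: float = 5.0, timeout_seconds: float = 600.0
) -> str:
    """Poll job.status() until a terminal state is seen or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        # Assumption: status() returns a (status, ready) pair.
        status, _ready = job.status(print_to_console=False)
        # Accept either an enum member (use its name) or a plain string.
        state = getattr(status, "name", str(status))
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"RayJob still in state {state!r} after timeout")
        time.sleep(poll_seconds)
```

With a submitted job, final_state = wait_for_rayjob(job) would block until the job finishes, after which job.delete() can clean up the custom resource.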

Runtime environment

You can pass runtime_env when creating a RayJob to set the Ray runtime environment (e.g. working directory, pip packages, environment variables). It can be either a Ray RuntimeEnv object (from ray.runtime_env) or a plain dict with keys such as working_dir, pip, and env_vars. For example: runtime_env={"working_dir": "./my-scripts", "pip": ["requests"]}. See the Ray documentation for the full list of runtime environment options.
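
As a minimal sketch of the dict form, the helper below assembles a runtime_env from the three keys mentioned above and omits any that are not supplied. build_runtime_env and the variable names LOG_LEVEL and NUM_RETRIES are hypothetical, chosen only for illustration. Values in env_vars are converted to strings because Ray expects string values there.

```python
from typing import Any, Dict, List, Optional


def build_runtime_env(
    working_dir: Optional[str] = None,
    pip: Optional[List[str]] = None,
    env_vars: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble a runtime_env dict, leaving out keys that were not supplied."""
    runtime_env: Dict[str, Any] = {}
    if working_dir is not None:
        runtime_env["working_dir"] = working_dir
    if pip:
        runtime_env["pip"] = list(pip)
    if env_vars:
        # Ray expects environment variable values to be strings.
        runtime_env["env_vars"] = {key: str(value) for key, value in env_vars.items()}
    return runtime_env


runtime_env = build_runtime_env(
    working_dir="./my-scripts",
    pip=["requests"],
    env_vars={"LOG_LEVEL": "INFO", "NUM_RETRIES": 3},  # hypothetical variables
)
```

The resulting dict can then be passed as runtime_env=runtime_env when constructing the RayJob.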

Kueue integration

When Kueue is installed, you can set local_queue to the name of a Kueue LocalQueue and priority_class to a WorkloadPriorityClass name for preemption control. These apply to both new clusters (cluster_config) and existing clusters (cluster_name). For Kueue setup, see Basic Kueue Resources configuration.
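
Because the same two Kueue arguments apply in both submission modes, one way to keep them consistent is to merge them into the keyword arguments before constructing the job. This is a sketch under assumptions: with_kueue is a hypothetical helper, and the queue and priority class names are placeholders that must match resources that actually exist in your namespace.

```python
from typing import Any, Dict, Optional


def with_kueue(
    job_kwargs: Dict[str, Any],
    local_queue: str,
    priority_class: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a copy of job_kwargs with Kueue settings added."""
    merged = dict(job_kwargs)  # copy so the caller's dict is not modified
    merged["local_queue"] = local_queue
    if priority_class is not None:
        merged["priority_class"] = priority_class
    return merged


base_kwargs = {
    "job_name": "my-rayjob",
    "entrypoint": "python my_script.py",
    "cluster_name": "my-existing-cluster",
    "namespace": "default",
}
# Placeholder names: replace with your LocalQueue and WorkloadPriorityClass.
queued_kwargs = with_kueue(base_kwargs, "team-a-queue", priority_class="high-priority")
```

The job would then be created with RayJob(**queued_kwargs) and submitted as usual.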

Note

The RayJob class manages the RayJob custom resource, whose batch job lifecycle is handled by the KubeRay operator. To submit jobs interactively to an already-running cluster through the Ray dashboard API, use RayJobClient instead; see the Code Documentation (modules) for the API reference.