Submitting RayJobs
==================

The CodeFlare SDK provides a ``RayJob`` interface for submitting and
managing Ray jobs through the KubeRay operator's RayJob custom resource.
You can either run the job on a short-lived Ray cluster, which the
operator creates for the job and cleans up after it finishes, or run it
on an existing Ray cluster.

Import the following to use RayJob:

.. code:: python

   from codeflare_sdk import RayJob, ManagedClusterConfig

Submitting a job with a new cluster (ManagedClusterConfig)
----------------------------------------------------------

When you provide ``cluster_config``, the KubeRay operator creates a
Ray cluster for the job and tears it down after the job completes. You
do not need to manage the cluster lifecycle yourself.

| Required: ``job_name`` (str), ``entrypoint`` (str), ``cluster_config`` (ManagedClusterConfig).
| Optional: ``namespace``, ``runtime_env``, ``ttl_seconds_after_finished``, ``active_deadline_seconds``, ``local_queue``, ``priority_class``.

.. code:: python

   from codeflare_sdk import RayJob, ManagedClusterConfig

   cluster_config = ManagedClusterConfig(
       head_memory_requests=6,
       head_memory_limits=8,
       num_workers=2,
       worker_cpu_requests=1,
       worker_cpu_limits=1,
       worker_memory_requests=4,
       worker_memory_limits=6,
       head_accelerators={"nvidia.com/gpu": 0},
       worker_accelerators={"nvidia.com/gpu": 0},
   )

   job = RayJob(
       job_name="my-rayjob",
       entrypoint="python -c 'print(\"Hello from RayJob!\")'",
       cluster_config=cluster_config,
       namespace="default",
   )
   job.submit()

Submitting a job to an existing cluster
---------------------------------------

When you provide ``cluster_name``, the job runs on an existing Ray
cluster. The cluster is not shut down when the job finishes.

| Required: ``job_name`` (str), ``entrypoint`` (str), ``cluster_name`` (str).
| Optional: ``namespace``, ``runtime_env``, ``active_deadline_seconds``, ``local_queue``, ``priority_class``.
| Note: ``ttl_seconds_after_finished`` cannot be set when using an existing cluster.

.. code:: python

   from codeflare_sdk import RayJob

   job = RayJob(
       job_name="my-rayjob",
       entrypoint="python my_script.py",
       cluster_name="my-existing-cluster",
       namespace="default",
   )
   job.submit()
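
The argument rules for the two submission modes can be checked before a
``RayJob`` is constructed. The following is a minimal sketch of such a
pre-flight check. ``check_rayjob_args`` is a hypothetical helper, not part
of the SDK, and treating ``cluster_config`` and ``cluster_name`` as
mutually exclusive is an assumption based on the either/or description in
this document.

.. code:: python

   def check_rayjob_args(cluster_config=None, cluster_name=None,
                         ttl_seconds_after_finished=None):
       """Return a list of problems with a planned RayJob argument set.

       Hypothetical helper that mirrors the rules described in this
       document; the SDK may enforce them differently.
       """
       problems = []
       if cluster_config is None and cluster_name is None:
           problems.append("provide either cluster_config or cluster_name")
       if cluster_config is not None and cluster_name is not None:
           # Assumption: the two modes cannot be combined.
           problems.append("provide only one of cluster_config or cluster_name")
       if cluster_name is not None and ttl_seconds_after_finished is not None:
           # Documented rule: no TTL when the job runs on an existing cluster.
           problems.append(
               "ttl_seconds_after_finished cannot be set with cluster_name"
           )
       return problems


   # An existing-cluster job with a TTL is reported as a problem.
   for problem in check_rayjob_args(
       cluster_name="my-existing-cluster", ttl_seconds_after_finished=60
   ):
       print(problem)

An empty list means the combination is consistent with the rules above.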

RayJob methods
--------------

| ``job.submit()`` — Submits the RayJob to the KubeRay operator. Returns the job name on success. When using ``cluster_config``, the operator creates the cluster and runs the job; when using ``cluster_name``, the job is submitted to the specified cluster.
| ``job.status(print_to_console=True)`` — Returns the job status (e.g. RUNNING, COMPLETE, FAILED) and a ready flag; optionally prints a formatted status to the console.
| ``job.stop()`` — Suspends the Ray job.
| ``job.resubmit()`` — Resubmits the Ray job.
| ``job.delete()`` — Deletes the RayJob custom resource (and the cluster if it was created by this RayJob).
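
A common pattern is to submit a job and then poll ``job.status()`` until
it reaches a terminal state. The following is a minimal sketch, not part
of the SDK. ``wait_for_job`` is a hypothetical helper, and it assumes that
``status()`` returns a ``(status, ready)`` pair whose first element is
either a string or an enum member with a ``name`` such as ``COMPLETE`` or
``FAILED``.

.. code:: python

   import time


   def wait_for_job(job, poll_seconds=10, timeout_seconds=600,
                    terminal=("COMPLETE", "FAILED")):
       """Poll job.status() until a terminal state or the timeout is reached.

       Hypothetical helper; assumes status() returns (status, ready).
       """
       deadline = time.monotonic() + timeout_seconds
       while True:
           status, _ready = job.status(print_to_console=False)
           # Accept either an enum member (use its name) or a plain string.
           state = getattr(status, "name", str(status))
           if state in terminal:
               return state
           if time.monotonic() >= deadline:
               raise TimeoutError(f"job still in state {state} after timeout")
           time.sleep(poll_seconds)

With a submitted job, ``wait_for_job(job)`` would return the name of the
final state, after which ``job.delete()`` can be called to clean up.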

Runtime environment
-------------------

You can pass ``runtime_env`` when creating a ``RayJob`` to set the Ray
runtime environment (e.g. working directory, pip packages, environment
variables). It can be a Ray ``RuntimeEnv`` object from ``ray.runtime_env``
or a dict with keys such as ``working_dir``, ``pip``, ``env_vars``. For
example: ``runtime_env={"working_dir": "./my-scripts", "pip": ["requests"]}``.
See the Ray documentation for runtime environment options.
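
As an illustration, the dict form can be built once and passed to the
constructor. This is a sketch under stated assumptions: the ``LOG_LEVEL``
variable, the job and cluster names, and the ``submit_with_runtime_env``
wrapper are hypothetical, and the SDK import is placed inside the function
only so that the dict can be inspected without a cluster.

.. code:: python

   # Dict form of a Ray runtime environment, using the keys named above.
   runtime_env = {
       "working_dir": "./my-scripts",
       "pip": ["requests"],
       # Ray expects env_vars keys and values to be strings.
       "env_vars": {"LOG_LEVEL": "INFO"},  # hypothetical variable
   }


   def submit_with_runtime_env(cluster_name="my-existing-cluster"):
       """Submit a job that uses runtime_env (requires a reachable cluster)."""
       from codeflare_sdk import RayJob

       job = RayJob(
           job_name="my-rayjob",
           entrypoint="python my_script.py",
           cluster_name=cluster_name,
           namespace="default",
           runtime_env=runtime_env,
       )
       return job.submit()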

Kueue integration
-----------------

When Kueue is installed, you can set ``local_queue`` to the name of a
Kueue LocalQueue and ``priority_class`` to a WorkloadPriorityClass name
for preemption control. These apply to both new clusters (``cluster_config``)
and existing clusters (``cluster_name``). For Kueue setup, see :doc:`./setup-kueue`.
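
The following sketch shows one way to pass the Kueue options. The queue
name ``user-queue``, the priority class ``high-priority``, and both helper
functions are hypothetical; the LocalQueue is assumed to exist in the
job's namespace, since LocalQueues are namespaced objects in Kueue.

.. code:: python

   def queued_job_kwargs(local_queue, priority_class=None):
       """Build RayJob keyword arguments that include the Kueue options."""
       kwargs = {
           "job_name": "my-rayjob",
           "entrypoint": "python my_script.py",
           "cluster_name": "my-existing-cluster",
           "namespace": "default",
           "local_queue": local_queue,
       }
       # priority_class is optional, so it is only added when given.
       if priority_class is not None:
           kwargs["priority_class"] = priority_class
       return kwargs


   def submit_queued_job():
       """Submit the job through Kueue (requires the SDK and a cluster)."""
       from codeflare_sdk import RayJob

       return RayJob(**queued_job_kwargs("user-queue", "high-priority")).submit()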

.. note::

   ``RayJob`` is used for the **RayJob custom resource** (batch job
   lifecycle managed by the KubeRay operator). For submitting jobs
   interactively to an already-running cluster via the Ray dashboard API,
   the SDK exposes ``RayJobClient``; see the Code Documentation (modules)
   for the API reference.