codeflare_sdk.ray.cluster package

Submodules

codeflare_sdk.ray.cluster.build_ray_cluster module

This sub-module exists primarily to be used internally by the Cluster object (in the cluster sub-module) for RayCluster/AppWrapper generation.

codeflare_sdk.ray.cluster.build_ray_cluster.add_queue_label(cluster: Cluster, labels: dict)[source]

The add_queue_label() function updates the given base labels with the local queue label if Kueue is available on the cluster.

codeflare_sdk.ray.cluster.build_ray_cluster.build_ray_cluster(cluster: Cluster)[source]

The build_ray_cluster() function creates a Ray Cluster/AppWrapper dict.

The resource is a dict template which uses Kubernetes Objects for creating metadata, resource requests, specs and containers. The result is sanitised and returned either as a dict or written as a yaml file.

codeflare_sdk.ray.cluster.build_ray_cluster.gen_names(name)[source]

Generates a unique name for the AppWrapper and Ray Cluster.
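As a rough illustration of the idea, the sketch below derives both names from one random identifier when no name is supplied. The helper name `sketch_gen_names`, the `appwrapper-`/`cluster-` prefixes and the UUID suffix are assumptions made for this example, not confirmed SDK behaviour.

```python
import uuid
from typing import Optional, Tuple


def sketch_gen_names(name: Optional[str]) -> Tuple[str, str]:
    """Return (appwrapper_name, cluster_name).

    Hypothetical helper: the prefixes and the UUID suffix are assumptions.
    """
    if name:
        # A user-supplied name is reused for both resources.
        return name, name
    # No name given: derive both names from one random identifier so that
    # the AppWrapper and the Ray Cluster can be matched to each other.
    suffix = uuid.uuid4().hex
    return f"appwrapper-{suffix}", f"cluster-{suffix}"
```

Sharing one suffix between the two names keeps the pair easy to correlate when listing resources.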

codeflare_sdk.ray.cluster.build_ray_cluster.generate_env_vars(cluster: Cluster)[source]

The generate_env_vars() function builds and returns a V1EnvVar object populated by user-specified environment variables.

codeflare_sdk.ray.cluster.build_ray_cluster.generate_image_pull_secrets(cluster: Cluster)[source]

The generate_image_pull_secrets() function generates a list of V1LocalObjectReference objects, one for each of the specified image pull secrets.

codeflare_sdk.ray.cluster.build_ray_cluster.get_default_local_queue(cluster: Cluster, labels: dict)[source]

The get_default_local_queue() function attempts to find a local queue whose default label is set to true. If one is found, the labels variable is updated with that local queue.
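The lookup described above can be sketched over plain dicts. This is a minimal illustration, assuming the default marker is stored under a `kueue.x-k8s.io/default-queue` key in each queue's annotations and that the queue is recorded under Kueue's `kueue.x-k8s.io/queue-name` label; `sketch_default_local_queue` and the dict layout are hypothetical.

```python
from typing import Dict, List, Optional

# Assumed key that marks a LocalQueue as the default one.
DEFAULT_QUEUE_KEY = "kueue.x-k8s.io/default-queue"
# Label Kueue reads to assign a workload to a LocalQueue.
QUEUE_NAME_LABEL = "kueue.x-k8s.io/queue-name"


def sketch_default_local_queue(
    local_queues: List[dict], labels: Dict[str, str]
) -> Optional[str]:
    """Add the default queue's name to `labels` and return it, else None."""
    for queue in local_queues:
        metadata = queue.get("metadata", {})
        annotations = metadata.get("annotations") or {}
        if annotations.get(DEFAULT_QUEUE_KEY) == "true":
            labels[QUEUE_NAME_LABEL] = metadata["name"]
            return metadata["name"]
    # No default queue: leave the labels untouched.
    return None
```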

codeflare_sdk.ray.cluster.build_ray_cluster.get_head_container_spec(cluster: Cluster)[source]

The get_head_container_spec() function builds and returns a V1Container object including user-defined resource requests/limits.

codeflare_sdk.ray.cluster.build_ray_cluster.get_labels(cluster: Cluster)[source]

The get_labels() function generates a dict “labels” which includes the base label, the local queue label and user-defined labels.
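A minimal sketch of how the three label sources might be merged. The base label shown is a placeholder, and the merge order (user labels applied last) is an assumption of this example rather than documented behaviour.

```python
from typing import Dict, Optional

# Label Kueue reads to assign a workload to a LocalQueue.
QUEUE_NAME_LABEL = "kueue.x-k8s.io/queue-name"


def sketch_get_labels(
    user_labels: Dict[str, str], local_queue: Optional[str]
) -> Dict[str, str]:
    # Placeholder base label; the SDK's real base label is not shown here.
    labels = {"example.com/managed-by": "sdk-sketch"}
    if local_queue is not None:
        labels[QUEUE_NAME_LABEL] = local_queue
    # Assumption: user-defined labels are applied last and win on clashes.
    labels.update(user_labels)
    return labels
```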

codeflare_sdk.ray.cluster.build_ray_cluster.get_metadata(cluster: Cluster)[source]

The get_metadata() function builds and returns a V1ObjectMeta object using cluster configuration parameters.

codeflare_sdk.ray.cluster.build_ray_cluster.get_nb_annotations()[source]

The get_nb_annotations() function generates the annotation for NB Prefix if the SDK is running in a notebook

codeflare_sdk.ray.cluster.build_ray_cluster.get_pod_spec(cluster: Cluster, containers)[source]

The get_pod_spec() function generates a V1PodSpec for the head/worker containers

codeflare_sdk.ray.cluster.build_ray_cluster.get_resources(cpu_requests: int | str, cpu_limits: int | str, memory_requests: int | str, memory_limits: int | str, custom_extended_resource_requests: Dict[str, int] | None = None)[source]

The get_resources() function generates a V1ResourceRequirements object for CPU/memory requests/limits and GPU resources.
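The shape of the result can be illustrated with plain dicts instead of the Kubernetes client model. In this sketch, `sketch_get_resources` is a hypothetical stand-in, and extended resources are copied into both requests and limits because Kubernetes requires the two to be equal for extended resources.

```python
from typing import Dict, Optional, Union

Quantity = Union[int, str]


def sketch_get_resources(
    cpu_requests: Quantity,
    cpu_limits: Quantity,
    memory_requests: Quantity,
    memory_limits: Quantity,
    custom_extended_resource_requests: Optional[Dict[str, int]] = None,
) -> dict:
    """Build a requests/limits mapping as plain dicts (illustrative only)."""
    requests = {"cpu": cpu_requests, "memory": memory_requests}
    limits = {"cpu": cpu_limits, "memory": memory_limits}
    # Extended resources (for example GPUs) get the same quantity in both.
    for resource, quantity in (custom_extended_resource_requests or {}).items():
        requests[resource] = quantity
        limits[resource] = quantity
    return {"requests": requests, "limits": limits}
```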

codeflare_sdk.ray.cluster.build_ray_cluster.get_worker_container_spec(cluster: Cluster)[source]

The get_worker_container_spec() function builds and returns a V1Container object including user-defined resource requests/limits.

codeflare_sdk.ray.cluster.build_ray_cluster.head_worker_extended_resources_from_cluster(cluster: Cluster) Tuple[dict, dict][source]

The head_worker_extended_resources_from_cluster() function returns two dicts, for the head and worker respectively, populated by the GPU type requested by the user.

codeflare_sdk.ray.cluster.build_ray_cluster.head_worker_gpu_count_from_cluster(cluster: Cluster) Tuple[int, int][source]

The head_worker_gpu_count_from_cluster() function returns the total number of requested GPUs for the head and worker separately
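One way to picture the counting: sum the quantities of every extended resource that a mapping classifies as a GPU. The sketch below assumes the mapping sends Kubernetes resource names to Ray resource names, with "GPU" marking GPU-type devices; the function name and the mapping passed in are illustrative.

```python
from typing import Dict, Tuple, Union


def sketch_gpu_counts(
    head_requests: Dict[str, Union[str, int]],
    worker_requests: Dict[str, Union[str, int]],
    resource_mapping: Dict[str, str],
) -> Tuple[int, int]:
    """Return (head_gpus, worker_gpus) under the assumed mapping."""

    def count(requests: Dict[str, Union[str, int]]) -> int:
        # Only resources mapped to "GPU" contribute to the total.
        return sum(
            int(quantity)
            for resource, quantity in requests.items()
            if resource_mapping.get(resource) == "GPU"
        )

    return count(head_requests), count(worker_requests)
```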

codeflare_sdk.ray.cluster.build_ray_cluster.local_queue_exists(cluster: Cluster)[source]

The local_queue_exists() function checks whether the user-specified local_queue exists in the given namespace and returns a bool.

codeflare_sdk.ray.cluster.build_ray_cluster.update_image(image) str[source]

The update_image() function automatically sets the image config parameter to a preset image based on Python version if not specified. If no Ray image exists for the given Python version a warning is produced.
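The selection logic can be sketched as a lookup keyed by Python version. The preset table, the image names and the extra `python_version` parameter are placeholders invented for this example; only the overall behaviour (keep an explicit image, otherwise pick a preset or warn) follows the description above.

```python
import sys
import warnings
from typing import Dict, Optional, Tuple

# Hypothetical preset table; the image names are placeholders.
PRESET_IMAGES: Dict[Tuple[int, int], str] = {
    (3, 9): "registry.example.com/ray:py39",
    (3, 11): "registry.example.com/ray:py311",
}


def sketch_update_image(
    image: str, python_version: Optional[Tuple[int, int]] = None
) -> str:
    if image:
        # An explicitly configured image is always kept.
        return image
    version = tuple(python_version or sys.version_info[:2])
    preset = PRESET_IMAGES.get(version)
    if preset is None:
        warnings.warn(
            f"No preset Ray image for Python {version[0]}.{version[1]}; "
            "set the image parameter explicitly."
        )
        return ""
    return preset
```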

codeflare_sdk.ray.cluster.build_ray_cluster.wrap_cluster(cluster: Cluster, appwrapper_name: str, ray_cluster_yaml: dict)[source]

Wraps the pre-built Ray Cluster dict in an AppWrapper.
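A minimal sketch of the wrapping step, assuming the AppWrapper carries the Ray Cluster as a single component template. The `apiVersion` value and the field layout are assumptions about the AppWrapper resource, and `sketch_wrap_cluster` takes a namespace string instead of a Cluster object to stay self-contained.

```python
def sketch_wrap_cluster(
    namespace: str, appwrapper_name: str, ray_cluster: dict
) -> dict:
    """Embed a Ray Cluster dict in an AppWrapper-shaped dict (illustrative)."""
    return {
        "apiVersion": "workload.codeflare.dev/v1beta2",  # assumed API version
        "kind": "AppWrapper",
        "metadata": {"name": appwrapper_name, "namespace": namespace},
        # The wrapped resource is carried unchanged as a component template.
        "spec": {"components": [{"template": ray_cluster}]},
    }
```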

codeflare_sdk.ray.cluster.build_ray_cluster.write_to_file(cluster: Cluster, resource: dict)[source]

The write_to_file() function writes the built Ray Cluster/AppWrapper dict as a YAML file in the .codeflare folder.

codeflare_sdk.ray.cluster.cluster module

The cluster sub-module contains the definition of the Cluster object, which represents the resources requested by the user. It also contains functions for checking the cluster setup queue, a list of all existing clusters, and the user’s working namespace.

class codeflare_sdk.ray.cluster.cluster.Cluster(config: ClusterConfiguration)[source]

Bases: object

An object for requesting, bringing up, and taking down resources. It can also be used to view the cluster's status and details.

Note that currently, the underlying implementation is a Ray cluster.

cluster_dashboard_uri() str[source]

Returns a string containing the cluster’s dashboard URI.

cluster_uri() str[source]

Returns a string containing the cluster’s URI.

create_resource()[source]

Called upon cluster object creation, creates an AppWrapper yaml based on the specifications of the ClusterConfiguration.

details(print_to_console: bool = True) RayCluster[source]

Retrieves details about the Ray Cluster.

This method returns a copy of the Ray Cluster information and optionally prints the details to the console.

Args:
print_to_console (bool):

Flag to determine if the cluster details should be printed to the console. Defaults to True.

Returns:
RayCluster:

A copy of the Ray Cluster details.

down()[source]

Deletes the AppWrapper YAML, scaling down and deleting all resources associated with the cluster.

is_dashboard_ready() bool[source]

Checks if the cluster’s dashboard is ready and accessible.

This method attempts to send a GET request to the cluster dashboard URI. If the request is successful (HTTP status code 200), it returns True. If an SSL error occurs, it returns False, indicating the dashboard is not ready.

Returns:
bool:

True if the dashboard is ready, False otherwise.
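The check described above can be sketched with the HTTP call injected as a callable, which keeps the example runnable without a live cluster. `sketch_is_dashboard_ready` and the `http_get` parameter are hypothetical. How errors other than SSL errors are handled is not stated above, so this sketch lets them propagate.

```python
import ssl
from typing import Callable


def sketch_is_dashboard_ready(
    dashboard_uri: str, http_get: Callable[[str], int]
) -> bool:
    """`http_get` is a hypothetical callable returning an HTTP status code."""
    try:
        status_code = http_get(dashboard_uri)
    except ssl.SSLError:
        # An SSL error is treated as "dashboard not ready yet".
        return False
    # Only a 200 response counts as ready.
    return status_code == 200
```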

property job_client

job_logs(job_id: str) str[source]

Accesses the Ray head node in your cluster and returns the logs for the provided job ID.

job_status(job_id: str) str[source]

Accesses the Ray head node in your cluster and returns the job status for the provided job ID.

list_jobs() List[source]

Accesses the Ray head node in your cluster and lists the running jobs.

local_client_url()[source]

Constructs the URL for the local Ray client.

Returns:
str:

The Ray client URL based on the ingress domain.

status(print_to_console: bool = True) Tuple[CodeFlareClusterStatus, bool][source]

Returns the requested cluster’s status, as well as whether or not it is ready for use.

up()[source]

Applies the Cluster YAML, pushing the resource request onto the Kueue LocalQueue.

wait_ready(timeout: int | None = None, dashboard_check: bool = True)[source]

Waits for the requested cluster to be ready, up to an optional timeout.

This method checks the status of the cluster every five seconds until it is ready or the timeout is reached. If dashboard_check is enabled, it will also check for the readiness of the dashboard.

Args:
timeout (Optional[int]):

The maximum time to wait for the cluster to be ready in seconds. If None, waits indefinitely.

dashboard_check (bool):

Flag to determine if the dashboard readiness should be checked. Defaults to True.

Raises:
TimeoutError:

If the timeout is reached before the cluster or dashboard is ready.
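The polling behaviour described above can be sketched as a simple loop. The readiness check is passed in as a callable and the poll interval is a parameter so the example runs without a cluster; both are conveniences of this sketch, not part of the documented signature.

```python
import time
from typing import Callable, Optional


def sketch_wait_ready(
    is_ready: Callable[[], bool],
    timeout: Optional[int] = None,
    poll_interval: float = 5.0,
) -> None:
    """Poll `is_ready` until it returns True or `timeout` seconds elapse."""
    waited = 0.0
    while not is_ready():
        # With timeout=None the loop waits indefinitely.
        if timeout is not None and waited >= timeout:
            raise TimeoutError(f"wait_ready timed out after {timeout} seconds")
        time.sleep(poll_interval)
        waited += poll_interval
```

Counting elapsed time in poll intervals keeps the sketch simple; a real implementation might measure wall-clock time instead.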

codeflare_sdk.ray.cluster.cluster.get_cluster(cluster_name: str, namespace: str = 'default', verify_tls: bool = True, write_to_file: bool = False)[source]

Retrieves an existing Ray Cluster or AppWrapper as a Cluster object.

This function fetches an existing Ray Cluster or AppWrapper from the Kubernetes cluster and returns it as a Cluster object, including its YAML configuration under Cluster.resource_yaml.

Args:
cluster_name (str):

The name of the Ray Cluster or AppWrapper.

namespace (str, optional):

The Kubernetes namespace where the Ray Cluster or AppWrapper is located. Default is “default”.

verify_tls (bool, optional):

Whether to verify TLS when connecting to the cluster. Default is True.

write_to_file (bool, optional):

If True, writes the resource configuration to a YAML file. Default is False.

Returns:
Cluster:

A Cluster object representing the retrieved Ray Cluster or AppWrapper.

Raises:
Exception:

If the Ray Cluster or AppWrapper cannot be found or does not exist.

codeflare_sdk.ray.cluster.cluster.get_current_namespace()[source]

Retrieves the current Kubernetes namespace.

Returns:
str:

The current namespace or None if not found.

codeflare_sdk.ray.cluster.cluster.list_all_clusters(namespace: str, print_to_console: bool = True)[source]

Returns (and prints by default) a list of all clusters in a given namespace.

codeflare_sdk.ray.cluster.cluster.list_all_queued(namespace: str, print_to_console: bool = True, appwrapper: bool = False)[source]

Returns (and prints by default) a list of all currently queued-up Ray Clusters in a given namespace.

codeflare_sdk.ray.cluster.cluster.remove_autogenerated_fields(resource)[source]

Recursively remove autogenerated fields from a dictionary.
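A minimal sketch of such a recursive clean-up. The set of keys treated as autogenerated is illustrative and may differ from the SDK's list. Unlike a function that might edit the input in place, this version returns a cleaned copy.

```python
from typing import Any

# Illustrative set of server-populated keys; the SDK's own list may differ.
AUTOGENERATED_KEYS = {
    "creationTimestamp",
    "resourceVersion",
    "uid",
    "generation",
    "managedFields",
    "selfLink",
}


def sketch_remove_autogenerated(resource: Any) -> Any:
    """Return a copy of `resource` without autogenerated keys at any depth."""
    if isinstance(resource, dict):
        return {
            key: sketch_remove_autogenerated(value)
            for key, value in resource.items()
            if key not in AUTOGENERATED_KEYS
        }
    if isinstance(resource, list):
        return [sketch_remove_autogenerated(item) for item in resource]
    # Scalars are returned unchanged.
    return resource
```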

codeflare_sdk.ray.cluster.config module

The config sub-module contains the definition of the ClusterConfiguration dataclass, which is used to specify resource requirements and other details when creating a Cluster object.

class codeflare_sdk.ray.cluster.config.ClusterConfiguration(name: str, namespace: str | None = None, head_cpu_requests: int | str = 2, head_cpu_limits: int | str = 2, head_cpus: int | str | None = None, head_memory_requests: int | str = 8, head_memory_limits: int | str = 8, head_memory: int | str | None = None, head_gpus: int | None = None, head_extended_resource_requests: ~typing.Dict[str, str | int] = <factory>, worker_cpu_requests: int | str = 1, worker_cpu_limits: int | str = 1, min_cpus: int | str | None = None, max_cpus: int | str | None = None, num_workers: int = 1, worker_memory_requests: int | str = 2, worker_memory_limits: int | str = 2, min_memory: int | str | None = None, max_memory: int | str | None = None, num_gpus: int | None = None, appwrapper: bool = False, envs: ~typing.Dict[str, str] = <factory>, image: str = '', image_pull_secrets: ~typing.List[str] = <factory>, write_to_file: bool = False, verify_tls: bool = True, labels: ~typing.Dict[str, str] = <factory>, worker_extended_resource_requests: ~typing.Dict[str, str | int] = <factory>, extended_resource_mapping: ~typing.Dict[str, str] = <factory>, overwrite_default_resource_mapping: bool = False, local_queue: str | None = None)[source]

Bases: object

This dataclass is used to specify resource requirements and other details, and is passed in as an argument when creating a Cluster object.

Args:
name:

The name of the cluster.

namespace:

The namespace in which the cluster should be created.

head_cpus:

The number of CPUs to allocate to the head node.

head_memory:

The amount of memory to allocate to the head node.

head_gpus:

The number of GPUs to allocate to the head node. (Deprecated, use head_extended_resource_requests)

head_extended_resource_requests:

A dictionary of extended resource requests for the head node. Example: {"nvidia.com/gpu": 1}

min_cpus:

The minimum number of CPUs to allocate to each worker.

max_cpus:

The maximum number of CPUs to allocate to each worker.

num_workers:

The number of workers to create.

min_memory:

The minimum amount of memory to allocate to each worker.

max_memory:

The maximum amount of memory to allocate to each worker.

num_gpus:

The number of GPUs to allocate to each worker. (Deprecated, use worker_extended_resource_requests)

appwrapper:

A boolean indicating whether to use an AppWrapper.

envs:

A dictionary of environment variables to set for the cluster.

image:

The image to use for the cluster.

image_pull_secrets:

A list of image pull secrets to use for the cluster.

write_to_file:

A boolean indicating whether to write the cluster configuration to a file.

verify_tls:

A boolean indicating whether to verify TLS when connecting to the cluster.

labels:

A dictionary of labels to apply to the cluster.

worker_extended_resource_requests:

A dictionary of extended resource requests for each worker. Example: {"nvidia.com/gpu": 1}

extended_resource_mapping:

A dictionary of custom resource mappings that map extended resource requests to RayCluster resource names.

overwrite_default_resource_mapping:

A boolean indicating whether to overwrite the default resource mapping.
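To illustrate how a deprecated argument such as num_gpus could be folded into its replacement, the sketch below uses a cut-down dataclass. `SketchConfig`, the warning text and the choice of "nvidia.com/gpu" as the target resource name are assumptions for this example, not the SDK's confirmed behaviour.

```python
import warnings
from dataclasses import dataclass, field
from typing import Dict, Optional, Union


@dataclass
class SketchConfig:
    """Cut-down, hypothetical stand-in for ClusterConfiguration."""

    name: str
    num_gpus: Optional[int] = None  # deprecated argument
    worker_extended_resource_requests: Dict[str, Union[str, int]] = field(
        default_factory=dict
    )

    def __post_init__(self) -> None:
        if self.num_gpus is not None:
            warnings.warn(
                "num_gpus is deprecated; use worker_extended_resource_requests",
                DeprecationWarning,
                stacklevel=2,
            )
            # Assumption: the legacy GPU count maps to "nvidia.com/gpu".
            # An explicitly supplied entry is not overwritten.
            self.worker_extended_resource_requests.setdefault(
                "nvidia.com/gpu", self.num_gpus
            )
```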

appwrapper: bool = False
envs: Dict[str, str]
extended_resource_mapping: Dict[str, str]
head_cpu_limits: int | str = 2
head_cpu_requests: int | str = 2
head_cpus: int | str | None = None
head_extended_resource_requests: Dict[str, str | int]
head_gpus: int | None = None
head_memory: int | str | None = None
head_memory_limits: int | str = 8
head_memory_requests: int | str = 8
image: str = ''
image_pull_secrets: List[str]
labels: Dict[str, str]
local_queue: str | None = None
max_cpus: int | str | None = None
max_memory: int | str | None = None
min_cpus: int | str | None = None
min_memory: int | str | None = None
name: str
namespace: str | None = None
num_gpus: int | None = None
num_workers: int = 1
overwrite_default_resource_mapping: bool = False
verify_tls: bool = True
worker_cpu_limits: int | str = 1
worker_cpu_requests: int | str = 1
worker_extended_resource_requests: Dict[str, str | int]
worker_memory_limits: int | str = 2
worker_memory_requests: int | str = 2
write_to_file: bool = False

codeflare_sdk.ray.cluster.pretty_print module

This sub-module exists primarily to be used internally by the Cluster object (in the cluster sub-module) for pretty-printing cluster status and details.

codeflare_sdk.ray.cluster.pretty_print.print_app_wrappers_status(app_wrappers: List[AppWrapper], starting: bool = False)[source]

codeflare_sdk.ray.cluster.pretty_print.print_cluster_status(cluster: RayCluster)[source]

Pretty prints the status of a passed-in cluster.

codeflare_sdk.ray.cluster.pretty_print.print_clusters(clusters: List[RayCluster])[source]

codeflare_sdk.ray.cluster.pretty_print.print_no_resources_found()[source]

codeflare_sdk.ray.cluster.pretty_print.print_ray_clusters_status(app_wrappers: List[AppWrapper], starting: bool = False)[source]

codeflare_sdk.ray.cluster.status module

The status sub-module defines Enums containing information for Ray cluster states and CodeFlare cluster states, as well as dataclasses to store information for Ray clusters.

class codeflare_sdk.ray.cluster.status.CodeFlareClusterStatus(value)[source]

Bases: Enum

Defines the possible reportable states of a CodeFlare cluster.

FAILED = 5
QUEUED = 3
QUEUEING = 4
READY = 1
STARTING = 2
SUSPENDED = 7
UNKNOWN = 6

class codeflare_sdk.ray.cluster.status.RayCluster(name: str, status: ~codeflare_sdk.ray.cluster.status.RayClusterStatus, head_cpu_requests: int, head_cpu_limits: int, head_mem_requests: str, head_mem_limits: str, num_workers: int, worker_mem_requests: str, worker_mem_limits: str, worker_cpu_requests: int | str, worker_cpu_limits: int | str, namespace: str, dashboard: str, worker_extended_resources: ~typing.Dict[str, int] = <factory>, head_extended_resources: ~typing.Dict[str, int] = <factory>)[source]

Bases: object

For storing information about a Ray cluster.

dashboard: str
head_cpu_limits: int
head_cpu_requests: int
head_extended_resources: Dict[str, int]
head_mem_limits: str
head_mem_requests: str
name: str
namespace: str
num_workers: int
status: RayClusterStatus
worker_cpu_limits: int | str
worker_cpu_requests: int | str
worker_extended_resources: Dict[str, int]
worker_mem_limits: str
worker_mem_requests: str

class codeflare_sdk.ray.cluster.status.RayClusterStatus(value)[source]

Bases: Enum

Defines the possible reportable states of a Ray cluster.

FAILED = 'failed'
READY = 'ready'
SUSPENDED = 'suspended'
UNHEALTHY = 'unhealthy'
UNKNOWN = 'unknown'
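To show how the two enums might relate, the sketch below mirrors their values locally and translates a raw Ray state string into a (status, ready) pair, similar in shape to what Cluster.status() returns. The mirror classes, the translation table and `sketch_status` are hypothetical; the SDK's actual mapping is not documented here.

```python
from enum import Enum
from typing import Tuple


class RayStatus(Enum):
    """Local mirror of the RayClusterStatus values listed above."""
    READY = "ready"
    UNHEALTHY = "unhealthy"
    FAILED = "failed"
    UNKNOWN = "unknown"
    SUSPENDED = "suspended"


class CodeFlareStatus(Enum):
    """Local mirror of the CodeFlareClusterStatus values listed above."""
    READY = 1
    STARTING = 2
    QUEUED = 3
    QUEUEING = 4
    FAILED = 5
    UNKNOWN = 6
    SUSPENDED = 7


# Hypothetical translation table, chosen for illustration only.
_TRANSLATION = {
    RayStatus.READY: CodeFlareStatus.READY,
    RayStatus.UNHEALTHY: CodeFlareStatus.FAILED,
    RayStatus.FAILED: CodeFlareStatus.FAILED,
    RayStatus.SUSPENDED: CodeFlareStatus.SUSPENDED,
    RayStatus.UNKNOWN: CodeFlareStatus.UNKNOWN,
}


def sketch_status(raw_state: str) -> Tuple[CodeFlareStatus, bool]:
    """Translate a raw state string into (status, ready)."""
    try:
        ray_status = RayStatus(raw_state)
    except ValueError:
        # Unrecognised strings fall back to UNKNOWN.
        ray_status = RayStatus.UNKNOWN
    status = _TRANSLATION[ray_status]
    return status, status is CodeFlareStatus.READY
```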

Module contents