codeflare_sdk.ray.cluster package

Submodules

codeflare_sdk.ray.cluster.build_ray_cluster module

This sub-module exists primarily to be used internally by the Cluster object (in the cluster sub-module) for RayCluster/AppWrapper generation.

codeflare_sdk.ray.cluster.build_ray_cluster.add_queue_label(cluster: Cluster, labels: dict)[source]: The add_queue_label() function updates the given base labels with the local queue label if Kueue exists on the cluster.

codeflare_sdk.ray.cluster.build_ray_cluster.build_ray_cluster(cluster: Cluster)[source]

The build_ray_cluster() function creates a Ray Cluster/AppWrapper dict.

The resource is a dict template which uses Kubernetes Objects for creating metadata, resource requests, specs and containers. The result is sanitised and returned either as a dict or written as a yaml file.

codeflare_sdk.ray.cluster.build_ray_cluster.gen_names(name)[source]: Generates a unique name for the appwrapper and Ray Cluster

codeflare_sdk.ray.cluster.build_ray_cluster.generate_custom_storage(provided_storage: list, default_storage: list)[source]: The generate_custom_storage function updates the volumes/volume mounts configs with the default volumes/volume mounts.

codeflare_sdk.ray.cluster.build_ray_cluster.generate_env_vars(cluster: Cluster)[source]: The generate_env_vars() function builds and returns a V1EnvVar object which is populated by user-specified environment variables.

codeflare_sdk.ray.cluster.build_ray_cluster.generate_image_pull_secrets(cluster: Cluster)[source]: The generate_image_pull_secrets() function generates a list of V1LocalObjectReference objects, one for each of the specified image pull secrets.

codeflare_sdk.ray.cluster.build_ray_cluster.get_default_local_queue(cluster: Cluster, labels: dict)[source]: The get_default_local_queue() function attempts to find a local queue with the default label == true. If one is found, the labels variable is updated with that local queue.

codeflare_sdk.ray.cluster.build_ray_cluster.get_head_container_spec(cluster: Cluster)[source]: The get_head_container_spec() function builds and returns a V1Container object including user defined resource requests/limits

codeflare_sdk.ray.cluster.build_ray_cluster.get_labels(cluster: Cluster)[source]: The get_labels() function generates a dict “labels” which includes the base label, local queue label and user defined labels
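
The label handling described for get_labels() and add_queue_label() can be pictured with plain dicts. The following is only a minimal sketch, not the SDK's implementation: the helper name merge_labels, the base label, and the precedence of user labels are all assumptions made for illustration. kueue.x-k8s.io/queue-name is the label Kueue reads to select a local queue; that the SDK sets exactly this key is assumed here.

```python
from typing import Dict, Optional

# Hypothetical base label; the SDK's real base label is not shown in this reference.
BASE_LABELS = {"example.com/managed-by": "codeflare-sdk"}
# Label key Kueue uses to pick a local queue (assumed to be what the SDK writes).
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def merge_labels(user_labels: Dict[str, str], local_queue: Optional[str] = None) -> Dict[str, str]:
    """Combine base, local queue and user-defined labels into one new dict."""
    labels = dict(BASE_LABELS)         # copy so the module-level base is never mutated
    if local_queue:                    # only add the queue label when a queue is known
        labels[QUEUE_LABEL] = local_queue
    labels.update(user_labels)         # assumption: user labels are applied last
    return labels
```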

codeflare_sdk.ray.cluster.build_ray_cluster.get_metadata(cluster: Cluster)[source]: The get_metadata() function builds and returns a V1ObjectMeta Object using cluster configuration parameters

codeflare_sdk.ray.cluster.build_ray_cluster.get_pod_spec(cluster: Cluster, containers: List, tolerations: List[V1Toleration]) → V1PodSpec[source]: The get_pod_spec() function generates a V1PodSpec for the head/worker containers

codeflare_sdk.ray.cluster.build_ray_cluster.get_resources(cpu_requests: int | str, cpu_limits: int | str, memory_requests: int | str, memory_limits: int | str, custom_extended_resource_requests: Dict[str, int] = None)[source]: The get_resources() function generates a V1ResourceRequirements object for cpu/memory request/limits and GPU resources
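
To make the shape of the get_resources() result easier to picture, here is a hedged sketch that builds the same kind of requests/limits structure as a plain dict instead of a V1ResourceRequirements object. The helper name resources_as_dict is hypothetical, and no unit conversion is attempted because this reference does not describe one. Extended resources are written to both requests and limits because Kubernetes requires the two to be equal when both are set.

```python
from typing import Dict, Optional, Union

Quantity = Union[int, str]


def resources_as_dict(
    cpu_requests: Quantity,
    cpu_limits: Quantity,
    memory_requests: Quantity,
    memory_limits: Quantity,
    custom_extended_resource_requests: Optional[Dict[str, int]] = None,
) -> Dict[str, Dict[str, Quantity]]:
    """Plain-dict stand-in for a V1ResourceRequirements object."""
    requests: Dict[str, Quantity] = {"cpu": cpu_requests, "memory": memory_requests}
    limits: Dict[str, Quantity] = {"cpu": cpu_limits, "memory": memory_limits}
    # Extended resources (for example GPUs) get identical requests and limits.
    for name, count in (custom_extended_resource_requests or {}).items():
        requests[name] = count
        limits[name] = count
    return {"requests": requests, "limits": limits}
```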

codeflare_sdk.ray.cluster.build_ray_cluster.get_worker_container_spec(cluster: Cluster)[source]: The get_worker_container_spec() function builds and returns a V1Container object including user defined resource requests/limits

codeflare_sdk.ray.cluster.build_ray_cluster.head_worker_extended_resources_from_cluster(cluster: Cluster) → Tuple[dict, dict][source]: The head_worker_extended_resources_from_cluster() function returns 2 dicts for head/worker respectively populated by the GPU type requested by the user

codeflare_sdk.ray.cluster.build_ray_cluster.head_worker_gpu_count_from_cluster(cluster: Cluster) → Tuple[int, int][source]: The head_worker_gpu_count_from_cluster() function returns the total number of requested GPUs for the head and worker separately
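
One plausible way to arrive at separate head and worker GPU totals is to sum the extended resource requests whose mapped Ray resource name is "GPU". The sketch below assumes exactly that; the function name gpu_counts and the contents of ASSUMED_MAPPING are illustrative and are not taken from the SDK source.

```python
from typing import Dict, Optional, Tuple, Union

# Illustrative mapping from Kubernetes extended resource names to Ray resource names.
ASSUMED_MAPPING = {"nvidia.com/gpu": "GPU", "amd.com/gpu": "GPU"}


def gpu_counts(
    head_requests: Dict[str, Union[str, int]],
    worker_requests: Dict[str, Union[str, int]],
    mapping: Optional[Dict[str, str]] = None,
) -> Tuple[int, int]:
    """Return (head_gpus, worker_gpus) for the given extended resource requests."""
    mapping = ASSUMED_MAPPING if mapping is None else mapping

    def count(requests: Dict[str, Union[str, int]]) -> int:
        # Only resources that map to the Ray "GPU" resource are counted.
        return sum(int(n) for name, n in requests.items() if mapping.get(name) == "GPU")

    return count(head_requests), count(worker_requests)
```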

codeflare_sdk.ray.cluster.build_ray_cluster.local_queue_exists(cluster: Cluster)[source]: The local_queue_exists() function checks whether the user-specified local_queue exists in the given namespace and returns a bool.

codeflare_sdk.ray.cluster.build_ray_cluster.update_image(image) → str[source]: The update_image() function automatically sets the image config parameter to a preset image based on Python version if not specified. If no Ray image exists for the given Python version a warning is produced.
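
The fallback behaviour described for update_image() can be sketched as follows. This is an illustration under stated assumptions, not the SDK's code: pick_image and PRESET_IMAGES are hypothetical names, the image reference is a placeholder, and returning an empty string when no preset exists is a choice made for this sketch.

```python
import sys
import warnings
from typing import Dict, Optional, Tuple

# Hypothetical preset images keyed by (major, minor) Python version.
PRESET_IMAGES: Dict[Tuple[int, int], str] = {(3, 11): "registry.example.com/ray:py311"}


def pick_image(image: str, version: Optional[Tuple[int, int]] = None) -> str:
    """Return the user's image if set, else a preset for the Python version."""
    if image:
        return image  # an explicitly configured image always wins
    if version is None:
        version = (sys.version_info.major, sys.version_info.minor)
    preset = PRESET_IMAGES.get(version)
    if preset is None:
        # No preset for this interpreter: warn instead of failing.
        warnings.warn(
            f"No preset Ray image for Python {version[0]}.{version[1]}; "
            "set the image parameter explicitly."
        )
        return ""
    return preset
```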

codeflare_sdk.ray.cluster.build_ray_cluster.with_nb_annotations(annotations: dict)[source]: The with_nb_annotations() function generates the annotation for NB Prefix if the SDK is running in a notebook and appends any user set annotations
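
A rough sketch of the notebook annotation logic, assuming the notebook is detected through the NB_PREFIX environment variable and that its value is stored under the app.kubernetes.io/managed-by annotation key. Both of those details, and the helper name notebook_annotations, are assumptions for illustration rather than facts stated in this reference.

```python
import os
from typing import Dict, Mapping, Optional


def notebook_annotations(
    annotations: Dict[str, str],
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return user annotations plus a notebook annotation when NB_PREFIX is set."""
    environ = os.environ if environ is None else environ
    result: Dict[str, str] = {}
    nb_prefix = environ.get("NB_PREFIX")
    if nb_prefix:
        # Assumed annotation key; the real key is not given in this reference.
        result["app.kubernetes.io/managed-by"] = nb_prefix
    result.update(annotations)  # append any user-set annotations
    return result
```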

codeflare_sdk.ray.cluster.build_ray_cluster.wrap_cluster(cluster: Cluster, appwrapper_name: str, ray_cluster_yaml: dict)[source]: Wraps the pre-built Ray Cluster dict in an AppWrapper

codeflare_sdk.ray.cluster.build_ray_cluster.write_to_file(cluster: Cluster, resource: dict)[source]: The write_to_file function writes the built Ray Cluster/AppWrapper dict as a yaml file in the .codeflare folder

codeflare_sdk.ray.cluster.cluster module

The cluster sub-module contains the definition of the Cluster object, which represents the resources requested by the user. It also contains functions for checking the cluster setup queue, listing all existing clusters, and retrieving the user’s working namespace.

class codeflare_sdk.ray.cluster.cluster.Cluster(config: ClusterConfiguration)[source]

Bases: object

An object for requesting, bringing up, and taking down resources. It can also be used to view the cluster’s status and details.

Note that currently, the underlying implementation is a Ray cluster.
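
A typical lifecycle built from the methods documented for this class (up, wait_ready, details, down) might look like the sketch below. The helper run_with_cluster is hypothetical and accepts any object exposing those four methods, so it can be exercised without a live Kubernetes cluster.

```python
from typing import Any


def run_with_cluster(cluster: Any, timeout: int = 600) -> Any:
    """Bring a cluster up, wait for it, read its details, and always tear it down."""
    cluster.up()
    try:
        # Raises TimeoutError if the cluster is not ready in time.
        cluster.wait_ready(timeout=timeout)
        return cluster.details(print_to_console=False)
    finally:
        # Runs on success and on failure so resources are not left behind.
        cluster.down()
```

With the SDK installed and an authenticated Kubernetes session, the cluster argument would be a Cluster built from a ClusterConfiguration as documented on this page.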

apply(force=False)[source]: Applies the Cluster yaml using server-side apply. If ‘force’ is set to True, conflicts will be forced.

cluster_dashboard_uri() → str[source]: Returns a string containing the cluster’s dashboard URI.

cluster_uri() → str[source]: Returns a string containing the cluster’s URI.

config_check()[source]

create_resource()[source]: Called upon cluster object creation, creates an AppWrapper yaml based on the specifications of the ClusterConfiguration.

details(print_to_console: bool = True) → RayCluster[source]

Retrieves details about the Ray Cluster.

This method returns a copy of the Ray Cluster information and optionally prints the details to the console.

Args:

print_to_console (bool):: Flag to determine if the cluster details should be printed to the console. Defaults to True.

Returns:

RayCluster:: A copy of the Ray Cluster details.

down()[source]: Deletes the AppWrapper yaml, scaling-down and deleting all resources associated with the cluster.

get_dynamic_client()[source]

is_dashboard_ready() → bool[source]

Checks if the cluster’s dashboard is ready and accessible.

This method attempts to send a GET request to the cluster dashboard URI. If the request is successful (HTTP status code 200), it returns True. If an SSL error occurs, it returns False, indicating the dashboard is not ready.

Returns:

bool:: True if the dashboard is ready, False otherwise.
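
The readiness check described above reduces to "status 200 means ready, an SSL error means not ready". A minimal sketch of that decision, with the HTTP call injected so the logic can be tried offline, is shown here. The name dashboard_ready and the get_status callable are hypothetical, and the standard library's ssl.SSLError stands in for whatever exception type the SDK's HTTP client raises.

```python
import ssl
from typing import Callable


def dashboard_ready(uri: str, get_status: Callable[[str], int]) -> bool:
    """Return True only when a GET to the dashboard URI yields HTTP 200."""
    try:
        return get_status(uri) == 200
    except ssl.SSLError:
        # Treated as "not ready yet", mirroring the behaviour described above.
        return False
```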

property job_client

job_logs(job_id: str) → str[source]: This method accesses the head ray node in your cluster and returns the logs for the provided job id.

job_status(job_id: str) → str[source]: This method accesses the head ray node in your cluster and returns the job status for the provided job id.

list_jobs() → List[source]: This method accesses the head ray node in your cluster and lists the running jobs.

local_client_url()[source]

Constructs the URL for the local Ray client.

Returns:

str:: The Ray client URL based on the ingress domain.

status(print_to_console: bool = True) → Tuple[CodeFlareClusterStatus, bool][source]: Returns the requested cluster’s status, as well as whether or not it is ready for use.

up()[source]: Applies the Cluster yaml, pushing the resource request onto the Kueue localqueue.

wait_ready(timeout: int | None = None, dashboard_check: bool = True)[source]

Waits for the requested cluster to be ready, up to an optional timeout.

This method checks the status of the cluster every five seconds until it is ready or the timeout is reached. If dashboard_check is enabled, it will also check for the readiness of the dashboard.

Args:

timeout (Optional[int]):: The maximum time to wait for the cluster to be ready in seconds. If None, waits indefinitely.
dashboard_check (bool):: Flag to determine if the dashboard readiness should be checked. Defaults to True.

Raises:

TimeoutError:: If the timeout is reached before the cluster or dashboard is ready.
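
The polling behaviour of wait_ready() (check, sleep five seconds, repeat until ready or timed out) can be sketched generically as below. This is an illustration of the pattern, not the SDK's implementation: wait_until_ready, is_ready and the injectable sleep are hypothetical names, and elapsed time is approximated by counting sleep intervals.

```python
import time
from typing import Callable, Optional


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout: Optional[float] = None,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll is_ready() every `interval` seconds; return the seconds waited."""
    waited = 0.0
    while not is_ready():
        # With timeout=None the loop waits indefinitely.
        if timeout is not None and waited >= timeout:
            raise TimeoutError(f"not ready after {timeout} seconds")
        sleep(interval)
        waited += interval
    return waited
```

Passing a no-op sleep makes the loop instantaneous in tests while keeping the timeout arithmetic intact.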

codeflare_sdk.ray.cluster.cluster.get_cluster(cluster_name: str, namespace: str = 'default', verify_tls: bool = True, write_to_file: bool = False)[source]

Retrieves an existing Ray Cluster or AppWrapper as a Cluster object.

This function fetches an existing Ray Cluster or AppWrapper from the Kubernetes cluster and returns it as a Cluster object, including its YAML configuration under Cluster.resource_yaml.

Args:

cluster_name (str):: The name of the Ray Cluster or AppWrapper.
namespace (str, optional):: The Kubernetes namespace where the Ray Cluster or AppWrapper is located. Default is “default”.
verify_tls (bool, optional):: Whether to verify TLS when connecting to the cluster. Default is True.
write_to_file (bool, optional):: If True, writes the resource configuration to a YAML file. Default is False.

Returns:

Cluster:: A Cluster object representing the retrieved Ray Cluster or AppWrapper.

Raises:

Exception:: If the Ray Cluster or AppWrapper cannot be found or does not exist.

codeflare_sdk.ray.cluster.cluster.get_current_namespace()[source]

Retrieves the current Kubernetes namespace.

Returns:

str:: The current namespace or None if not found.
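
One common way to discover the current namespace from inside a pod is to read the service account namespace file that Kubernetes mounts into containers. The sketch below shows only that step and returns None when the file is absent; whether get_current_namespace() uses this mechanism, and any fallback it applies, is an assumption here. The helper name namespace_from_service_account is hypothetical.

```python
import os
from typing import Optional

# Standard location of the namespace file mounted into pods by Kubernetes.
SERVICE_ACCOUNT_NAMESPACE_FILE = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"


def namespace_from_service_account(path: str = SERVICE_ACCOUNT_NAMESPACE_FILE) -> Optional[str]:
    """Return the namespace stored at `path`, or None if it cannot be found."""
    if not os.path.isfile(path):
        return None
    with open(path, "r") as handle:
        # An empty file is treated the same as a missing one.
        return handle.read().strip() or None
```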

codeflare_sdk.ray.cluster.cluster.list_all_clusters(namespace: str, print_to_console: bool = True)[source]: Returns (and prints by default) a list of all clusters in a given namespace.

codeflare_sdk.ray.cluster.cluster.list_all_queued(namespace: str, print_to_console: bool = True, appwrapper: bool = False)[source]: Returns (and prints by default) a list of all currently queued-up Ray Clusters in a given namespace.

codeflare_sdk.ray.cluster.cluster.remove_autogenerated_fields(resource)[source]: Recursively remove autogenerated fields from a dictionary.
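
Recursively stripping server-populated fields from a resource dict can be sketched as follows. The key set below lists metadata fields that the Kubernetes API server fills in; the exact set removed by remove_autogenerated_fields() is not given in this reference, so AUTOGENERATED_KEYS and the helper name strip_autogenerated are assumptions.

```python
from typing import Any, FrozenSet

# Assumed set of server-populated keys; the SDK's actual list may differ.
AUTOGENERATED_KEYS: FrozenSet[str] = frozenset(
    {"creationTimestamp", "resourceVersion", "uid", "managedFields", "generation", "selfLink"}
)


def strip_autogenerated(resource: Any, keys: FrozenSet[str] = AUTOGENERATED_KEYS) -> Any:
    """Remove autogenerated keys in place, descending into dicts and lists."""
    if isinstance(resource, dict):
        for key in list(resource.keys()):  # copy keys: the dict is mutated below
            if key in keys:
                del resource[key]
            else:
                strip_autogenerated(resource[key], keys)
    elif isinstance(resource, list):
        for item in resource:
            strip_autogenerated(item, keys)
    return resource
```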

codeflare_sdk.ray.cluster.config module

The config sub-module contains the definition of the ClusterConfiguration dataclass, which is used to specify resource requirements and other details when creating a Cluster object.

class codeflare_sdk.ray.cluster.config.ClusterConfiguration(name: str, namespace: str | None = None, head_cpu_requests: int | str = 2, head_cpu_limits: int | str = 2, head_cpus: int | str | None = None, head_memory_requests: int | str = 8, head_memory_limits: int | str = 8, head_memory: int | str | None = None, head_extended_resource_requests: Dict[str, str | int] = <factory>, head_tolerations: List[V1Toleration] | None = None, worker_cpu_requests: int | str = 1, worker_cpu_limits: int | str = 1, num_workers: int = 1, worker_memory_requests: int | str = 2, worker_memory_limits: int | str = 2, worker_tolerations: List[V1Toleration] | None = None, appwrapper: bool = False, envs: Dict[str, str] = <factory>, image: str = '', image_pull_secrets: List[str] = <factory>, write_to_file: bool = False, verify_tls: bool = True, labels: Dict[str, str] = <factory>, worker_extended_resource_requests: Dict[str, str | int] = <factory>, extended_resource_mapping: Dict[str, str] = <factory>, overwrite_default_resource_mapping: bool = False, local_queue: str | None = None, annotations: Dict[str, str] = <factory>, volumes: list[V1Volume] = <factory>, volume_mounts: list[V1VolumeMount] = <factory>, enable_gcs_ft: bool = False, enable_usage_stats: bool = False, redis_address: str | None = None, redis_password_secret: Dict[str, str] | None = None, external_storage_namespace: str | None = None)[source]

Bases: object

This dataclass is used to specify resource requirements and other details, and is passed in as an argument when creating a Cluster object.

Args:

name:: The name of the cluster.
namespace:: The namespace in which the cluster should be created.
head_cpus:: The number of CPUs to allocate to the head node.
head_memory:: The amount of memory to allocate to the head node.
head_extended_resource_requests:: A dictionary of extended resource requests for the head node. ex: {“nvidia.com/gpu”: 1}
head_tolerations:: List of tolerations for head nodes.
num_workers:: The number of workers to create.
worker_tolerations:: List of tolerations for worker nodes.
appwrapper:: A boolean indicating whether to use an AppWrapper.
envs:: A dictionary of environment variables to set for the cluster.
image:: The image to use for the cluster.
image_pull_secrets:: A list of image pull secrets to use for the cluster.
write_to_file:: A boolean indicating whether to write the cluster configuration to a file.
verify_tls:: A boolean indicating whether to verify TLS when connecting to the cluster.
labels:: A dictionary of labels to apply to the cluster.
worker_extended_resource_requests:: A dictionary of extended resource requests for each worker. ex: {“nvidia.com/gpu”: 1}
extended_resource_mapping:: A dictionary of custom resource mappings to map extended resource requests to RayCluster resource names
overwrite_default_resource_mapping:: A boolean indicating whether to overwrite the default resource mapping.
annotations:: A dictionary of annotations to apply to the cluster.
volumes:: A list of V1Volume objects to add to the Cluster
volume_mounts:: A list of V1VolumeMount objects to add to the Cluster
enable_gcs_ft:: A boolean indicating whether to enable GCS fault tolerance.
enable_usage_stats:: A boolean indicating whether to capture and send Ray usage stats externally.
redis_address:: The address of the Redis server to use for GCS fault tolerance, required when enable_gcs_ft is True.
redis_password_secret:: Kubernetes secret reference containing Redis password. ex: {“name”: “secret-name”, “key”: “password-key”}
external_storage_namespace:: The storage namespace to use for GCS fault tolerance. By default, KubeRay sets it to the UID of RayCluster.
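
As a usage illustration for the arguments listed above, the sketch below collects a set of keyword arguments and passes them to ClusterConfiguration. The parameter names and import paths come from this reference; the values (cluster name, label, GPU request) are made up for the example. The SDK imports are deferred into build_cluster() so the keyword dict can be inspected without the SDK installed; calling build_cluster() itself requires the SDK and access to a Kubernetes cluster.

```python
from typing import Any, Dict


def example_config_kwargs() -> Dict[str, Any]:
    """Keyword arguments for a small two-worker cluster (illustrative values)."""
    return dict(
        name="demo-cluster",
        namespace="default",
        num_workers=2,
        head_cpu_requests=2,
        head_cpu_limits=2,
        head_memory_requests=8,
        head_memory_limits=8,
        worker_cpu_requests=1,
        worker_cpu_limits=1,
        worker_memory_requests=2,
        worker_memory_limits=2,
        worker_extended_resource_requests={"nvidia.com/gpu": 1},
        labels={"team": "ml"},
        write_to_file=False,
    )


def build_cluster() -> Any:
    """Create a Cluster from the example configuration (needs the SDK installed)."""
    from codeflare_sdk.ray.cluster.cluster import Cluster
    from codeflare_sdk.ray.cluster.config import ClusterConfiguration

    return Cluster(ClusterConfiguration(**example_config_kwargs()))
```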

annotations: Dict[str, str]

appwrapper: bool = False

enable_gcs_ft: bool = False

enable_usage_stats: bool = False

envs: Dict[str, str]

extended_resource_mapping: Dict[str, str]

external_storage_namespace: str | None = None

head_cpu_limits: int | str = 2

head_cpu_requests: int | str = 2

head_cpus: int | str | None = None

head_extended_resource_requests: Dict[str, str | int]

head_memory: int | str | None = None

head_memory_limits: int | str = 8

head_memory_requests: int | str = 8

head_tolerations: List[V1Toleration] | None = None

image: str = ''

image_pull_secrets: List[str]

labels: Dict[str, str]

local_queue: str | None = None

name: str

namespace: str | None = None

num_workers: int = 1

overwrite_default_resource_mapping: bool = False

redis_address: str | None = None

redis_password_secret: Dict[str, str] | None = None

verify_tls: bool = True

volume_mounts: list[V1VolumeMount]

volumes: list[V1Volume]

worker_cpu_limits: int | str = 1

worker_cpu_requests: int | str = 1

worker_extended_resource_requests: Dict[str, str | int]

worker_memory_limits: int | str = 2

worker_memory_requests: int | str = 2

worker_tolerations: List[V1Toleration] | None = None

write_to_file: bool = False

codeflare_sdk.ray.cluster.pretty_print module

This sub-module exists primarily to be used internally by the Cluster object (in the cluster sub-module) for pretty-printing cluster status and details.

codeflare_sdk.ray.cluster.pretty_print.print_app_wrappers_status(app_wrappers: List[AppWrapper], starting: bool = False)[source]

codeflare_sdk.ray.cluster.pretty_print.print_cluster_status(cluster: RayCluster)[source]: Pretty prints the status of a passed-in cluster

codeflare_sdk.ray.cluster.pretty_print.print_clusters(clusters: List[RayCluster])[source]

codeflare_sdk.ray.cluster.pretty_print.print_no_resources_found()[source]

codeflare_sdk.ray.cluster.pretty_print.print_ray_clusters_status(app_wrappers: List[AppWrapper], starting: bool = False)[source]

codeflare_sdk.ray.cluster.status module

The status sub-module defines Enums containing information for Ray cluster states and CodeFlare cluster states, as well as dataclasses to store information for Ray clusters.

class codeflare_sdk.ray.cluster.status.CodeFlareClusterStatus(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Defines the possible reportable states of a CodeFlare cluster.

FAILED = 5

QUEUED = 3

QUEUEING = 4

READY = 1

STARTING = 2

SUSPENDED = 7

UNKNOWN = 6

class codeflare_sdk.ray.cluster.status.RayCluster(name: str, status: RayClusterStatus, head_cpu_requests: int, head_cpu_limits: int, head_mem_requests: str, head_mem_limits: str, num_workers: int, worker_mem_requests: str, worker_mem_limits: str, worker_cpu_requests: int | str, worker_cpu_limits: int | str, namespace: str, dashboard: str, worker_extended_resources: Dict[str, int] = <factory>, head_extended_resources: Dict[str, int] = <factory>)[source]

Bases: object

For storing information about a Ray cluster.

dashboard: str

head_cpu_limits: int

head_cpu_requests: int

head_extended_resources: Dict[str, int]

head_mem_limits: str

head_mem_requests: str

name: str

namespace: str

num_workers: int

status: RayClusterStatus

worker_cpu_limits: int | str

worker_cpu_requests: int | str

worker_extended_resources: Dict[str, int]

worker_mem_limits: str

worker_mem_requests: str

class codeflare_sdk.ray.cluster.status.RayClusterStatus(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Defines the possible reportable states of a Ray cluster.

FAILED = 'failed'

READY = 'ready'

SUSPENDED = 'suspended'

UNHEALTHY = 'unhealthy'

UNKNOWN = 'unknown'
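
Because RayClusterStatus members carry string values, a raw state string can be converted to a member by value lookup. The sketch below re-declares the enum locally with the values listed above so that it runs without the SDK; the parse_status helper and its fallback to UNKNOWN are hypothetical additions for illustration.

```python
from enum import Enum
from typing import Optional


class RayClusterStatus(Enum):
    """Local copy of the values documented above, for illustration only."""
    FAILED = "failed"
    READY = "ready"
    SUSPENDED = "suspended"
    UNHEALTHY = "unhealthy"
    UNKNOWN = "unknown"


def parse_status(raw: Optional[str]) -> RayClusterStatus:
    """Map a raw state string to an enum member, defaulting to UNKNOWN."""
    if not raw:
        return RayClusterStatus.UNKNOWN
    try:
        return RayClusterStatus(raw.lower())  # lookup by value, case-insensitive
    except ValueError:
        return RayClusterStatus.UNKNOWN
```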
