codeflare_sdk.ray.cluster package
Submodules
codeflare_sdk.ray.cluster.build_ray_cluster module
This sub-module exists primarily to be used internally by the Cluster object (in the cluster sub-module) for RayCluster/AppWrapper generation.
- codeflare_sdk.ray.cluster.build_ray_cluster.add_queue_label(cluster: Cluster, labels: dict)[source]
The add_queue_label() function updates the given base labels with the local queue label if Kueue is available on the cluster.
- codeflare_sdk.ray.cluster.build_ray_cluster.build_ray_cluster(cluster: Cluster)[source]
build_ray_cluster is used for creating a Ray Cluster/AppWrapper dict
The resource is a dict template which uses Kubernetes Objects for creating metadata, resource requests, specs and containers. The result is sanitised and returned either as a dict or written as a yaml file.
- codeflare_sdk.ray.cluster.build_ray_cluster.gen_names(name)[source]
Generates a unique name for the AppWrapper and Ray Cluster.
- codeflare_sdk.ray.cluster.build_ray_cluster.generate_env_vars(cluster: Cluster)[source]
The generate_env_vars() function builds and returns a V1EnvVar object populated with the user-specified environment variables.
- codeflare_sdk.ray.cluster.build_ray_cluster.generate_image_pull_secrets(cluster: Cluster)[source]
The generate_image_pull_secrets() function generates a list of V1LocalObjectReference objects, one for each of the specified image pull secrets.
- codeflare_sdk.ray.cluster.build_ray_cluster.get_default_local_queue(cluster: Cluster, labels: dict)[source]
The get_default_local_queue() function attempts to find a local queue whose default label is set to true. If one is found, the labels variable is updated with that local queue.
- codeflare_sdk.ray.cluster.build_ray_cluster.get_head_container_spec(cluster: Cluster)[source]
The get_head_container_spec() function builds and returns a V1Container object including user defined resource requests/limits
- codeflare_sdk.ray.cluster.build_ray_cluster.get_labels(cluster: Cluster)[source]
The get_labels() function generates a dict “labels” which includes the base label, local queue label and user defined labels
- codeflare_sdk.ray.cluster.build_ray_cluster.get_metadata(cluster: Cluster)[source]
The get_metadata() function builds and returns a V1ObjectMeta object using cluster configuration parameters.
- codeflare_sdk.ray.cluster.build_ray_cluster.get_nb_annotations()[source]
The get_nb_annotations() function generates the annotation for NB Prefix if the SDK is running in a notebook
- codeflare_sdk.ray.cluster.build_ray_cluster.get_pod_spec(cluster: Cluster, containers)[source]
The get_pod_spec() function generates a V1PodSpec for the head/worker containers
- codeflare_sdk.ray.cluster.build_ray_cluster.get_resources(cpu_requests: int | str, cpu_limits: int | str, memory_requests: int | str, memory_limits: int | str, custom_extended_resource_requests: Dict[str, int] | None = None)[source]
The get_resources() function generates a V1ResourceRequirements object for cpu/memory request/limits and GPU resources
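The shape of the result can be illustrated without the Kubernetes client. The following is a minimal sketch that uses a plain dict as a stand-in for V1ResourceRequirements; the name sketch_resources is hypothetical, and copying extended resources into both requests and limits is an assumption based on Kubernetes not allowing extended resources to be overcommitted, not a statement about the SDK's implementation.

```python
from typing import Dict, Optional, Union

Quantity = Union[int, str]


def sketch_resources(
    cpu_requests: Quantity,
    cpu_limits: Quantity,
    memory_requests: Quantity,
    memory_limits: Quantity,
    custom_extended_resource_requests: Optional[Dict[str, int]] = None,
) -> Dict[str, Dict[str, Quantity]]:
    """Return a plain-dict stand-in for a V1ResourceRequirements object."""
    requests = {"cpu": cpu_requests, "memory": memory_requests}
    limits = {"cpu": cpu_limits, "memory": memory_limits}
    # Assumption: extended resources (for example GPUs) are copied into both
    # requests and limits, because Kubernetes does not allow overcommitting them.
    for name, count in (custom_extended_resource_requests or {}).items():
        requests[name] = count
        limits[name] = count
    return {"requests": requests, "limits": limits}


resources = sketch_resources(1, 2, "4G", "8G", {"nvidia.com/gpu": 1})
print(resources["requests"])
print(resources["limits"])
```

The real function returns a Kubernetes client object rather than a dict, so attribute access would replace the key lookups shown here.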
- codeflare_sdk.ray.cluster.build_ray_cluster.get_worker_container_spec(cluster: Cluster)[source]
The get_worker_container_spec() function builds and returns a V1Container object including user defined resource requests/limits
- codeflare_sdk.ray.cluster.build_ray_cluster.head_worker_extended_resources_from_cluster(cluster: Cluster) -> Tuple[dict, dict] [source]
The head_worker_extended_resources_from_cluster() function returns 2 dicts for head/worker respectively populated by the GPU type requested by the user
- codeflare_sdk.ray.cluster.build_ray_cluster.head_worker_gpu_count_from_cluster(cluster: Cluster) -> Tuple[int, int] [source]
The head_worker_gpu_count_from_cluster() function returns the total number of requested GPUs for the head and worker separately
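One way to picture this counting step is to sum every extended resource request that maps to a GPU. The sketch below assumes a mapping from Kubernetes resource names to Ray resource names; EXAMPLE_RESOURCE_MAPPING, count_gpus and head_worker_gpu_counts are hypothetical names, and the mapping entries are illustrative rather than the SDK's defaults.

```python
from typing import Dict, Tuple

# Hypothetical mapping from Kubernetes extended resource names to Ray
# resource names; the SDK's real defaults may differ.
EXAMPLE_RESOURCE_MAPPING = {
    "nvidia.com/gpu": "GPU",
    "amd.com/gpu": "GPU",
    "example.com/tpu": "TPU",
}


def count_gpus(requests: Dict[str, int], mapping: Dict[str, str]) -> int:
    """Sum the requested counts of every resource that maps to "GPU"."""
    return sum(
        int(count) for name, count in requests.items() if mapping.get(name) == "GPU"
    )


def head_worker_gpu_counts(
    head_requests: Dict[str, int],
    worker_requests: Dict[str, int],
    mapping: Dict[str, str] = EXAMPLE_RESOURCE_MAPPING,
) -> Tuple[int, int]:
    """Return (head GPU count, worker GPU count) as two separate totals."""
    return count_gpus(head_requests, mapping), count_gpus(worker_requests, mapping)


head_gpus, worker_gpus = head_worker_gpu_counts(
    {"nvidia.com/gpu": 1},
    {"nvidia.com/gpu": 2, "example.com/tpu": 4},
)
print(head_gpus, worker_gpus)
```

Resources that map to something other than "GPU" are ignored, which is why the TPU request does not affect the worker total.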
- codeflare_sdk.ray.cluster.build_ray_cluster.local_queue_exists(cluster: Cluster)[source]
The local_queue_exists() function checks whether the user-specified local_queue exists in the given namespace and returns a bool.
- codeflare_sdk.ray.cluster.build_ray_cluster.update_image(image) -> str [source]
The update_image() function automatically sets the image config parameter to a preset image based on the Python version if no image is specified. If no Ray image exists for the given Python version, a warning is produced.
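The fallback logic described for update_image() can be sketched as a lookup keyed by Python version. Everything in this example is an assumption made for illustration: the image names in EXAMPLE_IMAGES are placeholders, sketch_update_image is a hypothetical name, and returning the unchanged empty string when no preset exists is a guess at the behavior rather than something the reference states.

```python
import sys
import warnings
from typing import Optional

# Placeholder image names for illustration only; the SDK's presets differ.
EXAMPLE_IMAGES = {
    "3.9": "registry.example.com/ray:py39",
    "3.11": "registry.example.com/ray:py311",
}


def sketch_update_image(image: str, python_version: Optional[str] = None) -> str:
    """Return the user's image, or a preset chosen by Python version."""
    if image:
        # An explicitly configured image always wins.
        return image
    if python_version is None:
        python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    preset = EXAMPLE_IMAGES.get(python_version)
    if preset is None:
        # Assumption: warn and leave the image unset when no preset matches.
        warnings.warn(
            f"No preset Ray image for Python {python_version}; specify an image."
        )
        return image
    return preset


print(sketch_update_image("", "3.11"))
print(sketch_update_image("quay.io/example/custom:latest", "3.11"))
```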
codeflare_sdk.ray.cluster.cluster module
The cluster sub-module contains the definition of the Cluster object, which represents the resources requested by the user. It also contains functions for checking the cluster setup queue, a list of all existing clusters, and the user’s working namespace.
- class codeflare_sdk.ray.cluster.cluster.Cluster(config: ClusterConfiguration)[source]
Bases: object
An object for requesting, bringing up, and taking down resources. Can also be used for seeing the resource cluster status and details.
Note that currently, the underlying implementation is a Ray cluster.
- create_resource()[source]
Called when the Cluster object is created. Creates an AppWrapper YAML based on the specifications of the ClusterConfiguration.
- details(print_to_console: bool = True) -> RayCluster [source]
Retrieves details about the Ray Cluster.
This method returns a copy of the Ray Cluster information and optionally prints the details to the console.
- Args:
- print_to_console (bool):
Flag to determine if the cluster details should be printed to the console. Defaults to True.
- Returns:
- RayCluster:
A copy of the Ray Cluster details.
- down()[source]
Deletes the AppWrapper YAML, scaling down and deleting all resources associated with the cluster.
- is_dashboard_ready() -> bool [source]
Checks if the cluster’s dashboard is ready and accessible.
This method attempts to send a GET request to the cluster dashboard URI. If the request is successful (HTTP status code 200), it returns True. If an SSL error occurs, it returns False, indicating the dashboard is not ready.
- Returns:
- bool:
True if the dashboard is ready, False otherwise.
- property job_client
- job_logs(job_id: str) -> str [source]
This method accesses the head Ray node in your cluster and returns the logs for the provided job ID.
- job_status(job_id: str) -> str [source]
This method accesses the head Ray node in your cluster and returns the job status for the provided job ID.
- list_jobs() -> List [source]
This method accesses the head Ray node in your cluster and lists the running jobs.
- local_client_url()[source]
Constructs the URL for the local Ray client.
- Returns:
- str:
The Ray client URL based on the ingress domain.
- status(print_to_console: bool = True) -> Tuple[CodeFlareClusterStatus, bool] [source]
Returns the requested cluster’s status, as well as whether or not it is ready for use.
- wait_ready(timeout: int | None = None, dashboard_check: bool = True)[source]
Waits for the requested cluster to be ready, up to an optional timeout.
This method checks the status of the cluster every five seconds until it is ready or the timeout is reached. If dashboard_check is enabled, it will also check for the readiness of the dashboard.
- Args:
- timeout (Optional[int]):
The maximum time to wait for the cluster to be ready in seconds. If None, waits indefinitely.
- dashboard_check (bool):
Flag to determine if the dashboard readiness should be checked. Defaults to True.
- Raises:
- TimeoutError:
If the timeout is reached before the cluster or dashboard is ready.
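The polling behavior described for wait_ready() follows a common pattern: check, sleep for a fixed interval, and give up once the timeout has elapsed. The sketch below shows that pattern in isolation. The function wait_until_ready and its is_ready, poll_interval and sleep parameters are hypothetical, introduced so the loop can be run without a cluster; it is not the SDK's implementation.

```python
import time
from typing import Callable, Optional


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout: Optional[int] = None,
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll is_ready() until it returns True; return the seconds waited."""
    waited = 0.0
    while not is_ready():
        # With timeout=None this check never fires, so the loop waits indefinitely.
        if timeout is not None and waited >= timeout:
            raise TimeoutError(f"not ready after {timeout} seconds")
        sleep(poll_interval)
        waited += poll_interval
    return waited


# Usage with a fake readiness check that succeeds on the third poll.
# The sleep function is replaced with a no-op so the example runs instantly.
polls = iter([False, False, True])
elapsed = wait_until_ready(lambda: next(polls), timeout=60, sleep=lambda s: None)
print(elapsed)
```

Injecting the readiness check and the sleep function keeps the loop testable; a dashboard check could be folded into is_ready as an additional condition.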
- codeflare_sdk.ray.cluster.cluster.get_cluster(cluster_name: str, namespace: str = 'default', verify_tls: bool = True, write_to_file: bool = False)[source]
Retrieves an existing Ray Cluster or AppWrapper as a Cluster object.
This function fetches an existing Ray Cluster or AppWrapper from the Kubernetes cluster and returns it as a Cluster object, including its YAML configuration under Cluster.resource_yaml.
- Args:
- cluster_name (str):
The name of the Ray Cluster or AppWrapper.
- namespace (str, optional):
The Kubernetes namespace where the Ray Cluster or AppWrapper is located. Default is “default”.
- verify_tls (bool, optional):
Whether to verify TLS when connecting to the cluster. Default is True.
- write_to_file (bool, optional):
If True, writes the resource configuration to a YAML file. Default is False.
- Returns:
- Cluster:
A Cluster object representing the retrieved Ray Cluster or AppWrapper.
- Raises:
- Exception:
If the Ray Cluster or AppWrapper cannot be found or does not exist.
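A typical use of get_cluster() is to reattach to a cluster created earlier and query its state. The sketch below assumes the import path and keyword arguments shown in this reference; check_existing_cluster and its get_cluster_fn parameter are hypothetical, added so the flow can be exercised with a stand-in object when the SDK or a Kubernetes cluster is not available.

```python
from typing import Callable, Optional


def check_existing_cluster(
    cluster_name: str,
    namespace: str = "default",
    get_cluster_fn: Optional[Callable] = None,
):
    """Fetch an existing cluster and return its (status, ready) tuple."""
    if get_cluster_fn is None:
        # Imported lazily so the sketch can be read and run without the SDK.
        from codeflare_sdk.ray.cluster.cluster import get_cluster as get_cluster_fn
    cluster = get_cluster_fn(cluster_name=cluster_name, namespace=namespace)
    # status() is documented to return a (CodeFlareClusterStatus, bool) tuple.
    return cluster.status(print_to_console=False)


class FakeCluster:
    """Stand-in used only for this demonstration."""

    def status(self, print_to_console: bool = True):
        return ("READY", True)


status, ready = check_existing_cluster(
    "demo-cluster", get_cluster_fn=lambda **kwargs: FakeCluster()
)
print(status, ready)
```

When get_cluster_fn is omitted, the function calls the real get_cluster(), which raises an exception if the named resource does not exist.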
- codeflare_sdk.ray.cluster.cluster.get_current_namespace()[source]
Retrieves the current Kubernetes namespace.
- Returns:
- str:
The current namespace or None if not found.
- codeflare_sdk.ray.cluster.cluster.list_all_clusters(namespace: str, print_to_console: bool = True)[source]
Returns (and prints by default) a list of all clusters in a given namespace.
codeflare_sdk.ray.cluster.config module
The config sub-module contains the definition of the ClusterConfiguration dataclass, which is used to specify resource requirements and other details when creating a Cluster object.
- class codeflare_sdk.ray.cluster.config.ClusterConfiguration(name: str, namespace: str | None = None, head_cpu_requests: int | str = 2, head_cpu_limits: int | str = 2, head_cpus: int | str | None = None, head_memory_requests: int | str = 8, head_memory_limits: int | str = 8, head_memory: int | str | None = None, head_gpus: int | None = None, head_extended_resource_requests: Dict[str, str | int] = <factory>, worker_cpu_requests: int | str = 1, worker_cpu_limits: int | str = 1, min_cpus: int | str | None = None, max_cpus: int | str | None = None, num_workers: int = 1, worker_memory_requests: int | str = 2, worker_memory_limits: int | str = 2, min_memory: int | str | None = None, max_memory: int | str | None = None, num_gpus: int | None = None, appwrapper: bool = False, envs: Dict[str, str] = <factory>, image: str = '', image_pull_secrets: List[str] = <factory>, write_to_file: bool = False, verify_tls: bool = True, labels: Dict[str, str] = <factory>, worker_extended_resource_requests: Dict[str, str | int] = <factory>, extended_resource_mapping: Dict[str, str] = <factory>, overwrite_default_resource_mapping: bool = False, local_queue: str | None = None)[source]
Bases: object
This dataclass is used to specify resource requirements and other details, and is passed in as an argument when creating a Cluster object.
- Args:
- name:
The name of the cluster.
- namespace:
The namespace in which the cluster should be created.
- head_cpus:
The number of CPUs to allocate to the head node.
- head_memory:
The amount of memory to allocate to the head node.
- head_gpus:
The number of GPUs to allocate to the head node. (Deprecated, use head_extended_resource_requests)
- head_extended_resource_requests:
A dictionary of extended resource requests for the head node. Example: {"nvidia.com/gpu": 1}
- min_cpus:
The minimum number of CPUs to allocate to each worker.
- max_cpus:
The maximum number of CPUs to allocate to each worker.
- num_workers:
The number of workers to create.
- min_memory:
The minimum amount of memory to allocate to each worker.
- max_memory:
The maximum amount of memory to allocate to each worker.
- num_gpus:
The number of GPUs to allocate to each worker. (Deprecated, use worker_extended_resource_requests)
- appwrapper:
A boolean indicating whether to use an AppWrapper.
- envs:
A dictionary of environment variables to set for the cluster.
- image:
The image to use for the cluster.
- image_pull_secrets:
A list of image pull secrets to use for the cluster.
- write_to_file:
A boolean indicating whether to write the cluster configuration to a file.
- verify_tls:
A boolean indicating whether to verify TLS when connecting to the cluster.
- labels:
A dictionary of labels to apply to the cluster.
- worker_extended_resource_requests:
A dictionary of extended resource requests for each worker. Example: {"nvidia.com/gpu": 1}
- extended_resource_mapping:
A dictionary of custom resource mappings that map extended resource requests to RayCluster resource names.
- overwrite_default_resource_mapping:
A boolean indicating whether to overwrite the default resource mapping.
- appwrapper: bool = False
- envs: Dict[str, str]
- extended_resource_mapping: Dict[str, str]
- head_cpu_limits: int | str = 2
- head_cpu_requests: int | str = 2
- head_cpus: int | str | None = None
- head_extended_resource_requests: Dict[str, str | int]
- head_gpus: int | None = None
- head_memory: int | str | None = None
- head_memory_limits: int | str = 8
- head_memory_requests: int | str = 8
- image: str = ''
- image_pull_secrets: List[str]
- labels: Dict[str, str]
- local_queue: str | None = None
- max_cpus: int | str | None = None
- max_memory: int | str | None = None
- min_cpus: int | str | None = None
- min_memory: int | str | None = None
- name: str
- namespace: str | None = None
- num_gpus: int | None = None
- num_workers: int = 1
- overwrite_default_resource_mapping: bool = False
- verify_tls: bool = True
- worker_cpu_limits: int | str = 1
- worker_cpu_requests: int | str = 1
- worker_extended_resource_requests: Dict[str, str | int]
- worker_memory_limits: int | str = 2
- worker_memory_requests: int | str = 2
- write_to_file: bool = False
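As an illustration of how these fields fit together, the sketch below collects a set of keyword arguments drawn from the documented signature and passes them to ClusterConfiguration inside a helper. The values are arbitrary examples, make_example_cluster is a hypothetical name, and the SDK imports are deferred into the helper because constructing a Cluster is assumed to require the SDK to be installed and a Kubernetes cluster to be reachable.

```python
# Keyword arguments taken from the documented ClusterConfiguration signature.
# The values are illustrative, not recommendations.
example_config_kwargs = {
    "name": "demo-cluster",
    "namespace": "default",
    "num_workers": 2,
    "head_cpu_requests": 2,
    "head_cpu_limits": 2,
    "head_memory_requests": 8,
    "head_memory_limits": 8,
    "worker_cpu_requests": 1,
    "worker_cpu_limits": 1,
    "worker_memory_requests": 2,
    "worker_memory_limits": 2,
    # Preferred over the deprecated num_gpus field.
    "worker_extended_resource_requests": {"nvidia.com/gpu": 1},
    "labels": {"team": "research"},
    "write_to_file": False,
}


def make_example_cluster():
    """Build a Cluster from the kwargs above (needs codeflare_sdk installed)."""
    from codeflare_sdk.ray.cluster.cluster import Cluster
    from codeflare_sdk.ray.cluster.config import ClusterConfiguration

    config = ClusterConfiguration(**example_config_kwargs)
    return Cluster(config)


print(sorted(example_config_kwargs))
```

Keeping the arguments in a dict makes it easy to review or log a configuration before any resources are requested.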
codeflare_sdk.ray.cluster.pretty_print module
This sub-module exists primarily to be used internally by the Cluster object (in the cluster sub-module) for pretty-printing cluster status and details.
- codeflare_sdk.ray.cluster.pretty_print.print_app_wrappers_status(app_wrappers: List[AppWrapper], starting: bool = False)[source]
- codeflare_sdk.ray.cluster.pretty_print.print_cluster_status(cluster: RayCluster)[source]
Pretty prints the status of a passed-in cluster
- codeflare_sdk.ray.cluster.pretty_print.print_clusters(clusters: List[RayCluster])[source]
- codeflare_sdk.ray.cluster.pretty_print.print_ray_clusters_status(app_wrappers: List[AppWrapper], starting: bool = False)[source]
codeflare_sdk.ray.cluster.status module
The status sub-module defines Enums containing information for Ray cluster states and CodeFlare cluster states, as well as dataclasses to store information for Ray clusters.
- class codeflare_sdk.ray.cluster.status.CodeFlareClusterStatus(value)[source]
Bases: Enum
Defines the possible reportable states of a CodeFlare cluster.
- FAILED = 5
- QUEUED = 3
- QUEUEING = 4
- READY = 1
- STARTING = 2
- SUSPENDED = 7
- UNKNOWN = 6
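The members above are listed alphabetically, which hides their numeric order. The sketch below reproduces the documented values in a local Enum, ordered by value, and adds a small helper for grouping them. ExampleClusterStatus and coarse_state are hypothetical names, and the three-way grouping is an illustrative choice rather than anything defined by the SDK.

```python
from enum import Enum


class ExampleClusterStatus(Enum):
    """Local copy of the documented values, for illustration only."""

    READY = 1
    STARTING = 2
    QUEUED = 3
    QUEUEING = 4
    FAILED = 5
    UNKNOWN = 6
    SUSPENDED = 7


def coarse_state(status: ExampleClusterStatus) -> str:
    """Group statuses into ready, pending, or needs attention."""
    if status is ExampleClusterStatus.READY:
        return "ready"
    pending = (
        ExampleClusterStatus.STARTING,
        ExampleClusterStatus.QUEUED,
        ExampleClusterStatus.QUEUEING,
    )
    if status in pending:
        return "pending"
    # FAILED, UNKNOWN and SUSPENDED all call for a closer look.
    return "needs attention"


for member in ExampleClusterStatus:
    print(member.value, member.name, coarse_state(member))
```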
- class codeflare_sdk.ray.cluster.status.RayCluster(name: str, status: RayClusterStatus, head_cpu_requests: int, head_cpu_limits: int, head_mem_requests: str, head_mem_limits: str, num_workers: int, worker_mem_requests: str, worker_mem_limits: str, worker_cpu_requests: int | str, worker_cpu_limits: int | str, namespace: str, dashboard: str, worker_extended_resources: Dict[str, int] = <factory>, head_extended_resources: Dict[str, int] = <factory>)[source]
Bases: object
For storing information about a Ray cluster.
- dashboard: str
- head_cpu_limits: int
- head_cpu_requests: int
- head_extended_resources: Dict[str, int]
- head_mem_limits: str
- head_mem_requests: str
- name: str
- namespace: str
- num_workers: int
- status: RayClusterStatus
- worker_cpu_limits: int | str
- worker_cpu_requests: int | str
- worker_extended_resources: Dict[str, int]
- worker_mem_limits: str
- worker_mem_requests: str