# Fault Tolerance

## Overview of Capabilities
The AppWrapper controller is designed to enhance and extend the fault tolerance capabilities provided by the controllers of its wrapped resources. If Autopilot is deployed on the cluster, the AppWrapper controller can automate both the injection of Node anti-affinities to avoid scheduling workloads on unhealthy Nodes and the migration of running workloads away from unhealthy Nodes. Throughout the execution of a workload, the AppWrapper controller monitors both the status of the contained top-level resources and the status of all Pods created by the workload. If a workload is determined to be unhealthy, the AppWrapper controller first waits for a bounded time period to allow the underlying controllers to correct the problem. If they fail to do so, the AppWrapper controller will reset the workload by removing all created resources and then, if the maximum number of retries has not been exceeded, recreating the workload. This reset process is carefully engineered to ensure that it will always make progress and eventually succeed in completely removing all Pods and other resources created by a failed workload.
## Progress Guarantees
When the AppWrapper controller decides to delete the resources for a workload, it proceeds through several phases. First it does a normal delete of the top-level resources, allowing the primary resource controllers time to cascade the deletion through all child resources. If they are not able to successfully delete all of the workload's Pods and resources within a `ForcefulDeletionGracePeriod`, the AppWrapper controller then initiates a forceful deletion of all remaining Pods and resources by deleting them with a `GracePeriod` of `0`. An AppWrapper will continue to have its `ResourcesDeployed` condition be `True` until all resources and Pods are successfully deleted. This process ensures that when `ResourcesDeployed` becomes `False`, which indicates to Kueue that the quota has been released, all resources created by a failed workload will have been totally removed from the cluster.
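
For orientation, a `ResourcesDeployed` condition in an AppWrapper's `status.conditions` might look like the sketch below. Only the condition type and its `True`/`False` meaning come from the description above; the `reason` and `message` strings are hypothetical placeholders.

```yaml
status:
  conditions:
  - type: ResourcesDeployed          # True while workload resources may still exist on the cluster
    status: "True"
    reason: ComponentsCreated        # hypothetical reason string
    message: "Resources for the workload are deployed"   # hypothetical message
    lastTransitionTime: "2024-05-01T12:00:00Z"
```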
## Detailed Description
The `podSets` contained in the AppWrapper specification enable the AppWrapper controller to inject labels into every Pod that is created by the workload during its execution (a minimal sketch follows the list below). Throughout the execution of the workload, the AppWrapper controller monitors the number and health of all labeled Pods. It also watches the top-level created resources and, for selected resource types, understands how to interpret their status information. This information is combined to determine if a workload is unhealthy. A workload can be deemed unhealthy if any of the following conditions are true:

- There are a non-zero number of `Failed` Pods.
- It takes longer than the `AdmissionGracePeriod` for the expected number of Pods to reach the `Pending` state.
- It takes longer than the `WarmupGracePeriod` for the expected number of Pods to reach the `Running` state.
- A non-zero number of `Running` Pods are using resources that Autopilot has tagged as `NoExecute`.
- The status information of a batch/v1 Job or PyTorchJob indicates that it has failed.
- A top-level wrapped resource is externally deleted.
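
To make the role of `podSets` concrete, here is a minimal sketch of an AppWrapper wrapping a single batch/v1 Job. The field layout is assumed from the AppWrapper v1beta2 API rather than spelled out in this section, and the Job body is trimmed; the `path` tells the controller where the Pod template (and thus where labels are injected) lives inside the wrapped resource.

```yaml
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  name: example                        # hypothetical name
spec:
  components:
  - podSets:
    - replicas: 1                      # expected number of Pods for this template
      path: template.spec.template     # location of the PodTemplateSpec inside the wrapped Job
    template:
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: example
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: busybox
              command: ["sh", "-c", "sleep 600"]
              resources:
                requests:
                  cpu: 100m
```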
If a workload is determined to be unhealthy by one of the first three Pod-level conditions above, the AppWrapper controller first waits for a `FailureGracePeriod` to allow the primary resource controller an opportunity to react and return the workload to a healthy state. The `FailureGracePeriod` is skipped for the remaining conditions because the primary resource controller is not expected to take any further action. If the `FailureGracePeriod` passes and the workload is still unhealthy, the AppWrapper controller will reset the workload by deleting its resources, waiting for a `RetryPausePeriod`, and then creating new instances of the resources.
During this retry pause, the AppWrapper does not release the workload's quota; this ensures that when the resources are recreated they will still have sufficient quota to execute. The number of times an AppWrapper is reset is tracked as part of its status; if the number of resets exceeds the `RetryLimit`, then the AppWrapper moves into a `Failed` state and its resources are deleted (thus finally releasing its quota). If at any time during this retry loop an AppWrapper is suspended (i.e., Kueue decides to preempt the AppWrapper), the AppWrapper controller will respect this request by proceeding to delete the resources. Workload resets that are initiated in response to Autopilot are subject to the `RetryLimit` but do not increment the `retryCount`.
External deletion of a top-level wrapped resource will cause the AppWrapper to directly enter the `Failed` state independent of the `RetryLimit`.
To support debugging `Failed` workloads, an annotation can be added to an AppWrapper that adds a `DeletionOnFailureGracePeriod` between the time the AppWrapper enters the `Failed` state and when the process of deleting its resources begins. Since the AppWrapper continues to consume quota during this delayed deletion period, this annotation should be used sparingly and only when interactive debugging of the failed workload is being actively pursued.
All child resources of an AppWrapper that has successfully completed will be automatically deleted once the `SuccessTTL` has elapsed after the AppWrapper entered the `Succeeded` state.
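
Both the deletion-on-failure delay and the success TTL can be set per AppWrapper using the annotations listed in the Configuration Details table below. As a hedged sketch (the duration values are illustrative and assumed to be expressed as Go-style duration strings), the following metadata would hold a `Failed` workload's resources for 8 hours of debugging and clean up a `Succeeded` workload after 1 hour:

```yaml
metadata:
  annotations:
    # keep the resources of a Failed workload around for 8 hours of debugging
    workload.codeflare.dev.appwrapper/deletionOnFailureGracePeriodDuration: 8h
    # delete the resources of a Succeeded workload after 1 hour
    workload.codeflare.dev.appwrapper/successTTLDuration: 1h
```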
## Configuration Details
The parameters of the retry loop described above are configured at the operator level and can be customized on a per-AppWrapper basis by adding annotations. The table below lists the parameters, their default values, and the annotations that can be used to customize them.
| Parameter | Default Value | Annotation |
|---|---|---|
| AdmissionGracePeriod | 1 Minute | workload.codeflare.dev.appwrapper/admissionGracePeriodDuration |
| WarmupGracePeriod | 5 Minutes | workload.codeflare.dev.appwrapper/warmupGracePeriodDuration |
| FailureGracePeriod | 1 Minute | workload.codeflare.dev.appwrapper/failureGracePeriodDuration |
| RetryPausePeriod | 90 Seconds | workload.codeflare.dev.appwrapper/retryPausePeriodDuration |
| RetryLimit | 3 | workload.codeflare.dev.appwrapper/retryLimit |
| DeletionOnFailureGracePeriod | 0 Seconds | workload.codeflare.dev.appwrapper/deletionOnFailureGracePeriodDuration |
| ForcefulDeletionGracePeriod | 10 Minutes | workload.codeflare.dev.appwrapper/forcefulDeletionGracePeriodDuration |
| SuccessTTL | 7 Days | workload.codeflare.dev.appwrapper/successTTLDuration |
| GracePeriodMaximum | 24 Hours | Not Applicable |
The `GracePeriodMaximum` imposes a system-wide upper limit on all other grace periods to limit the potential impact of user-added annotations on overall system utilization.
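
For example, to tune the retry loop for a single AppWrapper, the relevant annotations from the table above can be added to its metadata. The values below are illustrative assumptions (durations as Go-style duration strings, `retryLimit` as a quoted integer), not recommended settings:

```yaml
metadata:
  annotations:
    # react to unhealthy Pods after 30 seconds instead of the 1 minute default
    workload.codeflare.dev.appwrapper/failureGracePeriodDuration: 30s
    # pause for 2 minutes between deleting and recreating resources
    workload.codeflare.dev.appwrapper/retryPausePeriodDuration: 120s
    # allow up to 5 resets before moving to Failed
    workload.codeflare.dev.appwrapper/retryLimit: "5"
```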
The set of resources monitored by Autopilot and the associated labels that identify unhealthy resources can be customized as part of the AppWrapper operator’s configuration. The default Autopilot configuration used by the controller is:
```yaml
autopilot:
  injectAntiAffinities: true
  monitorNodes: true
  resourceTaints:
    nvidia.com/gpu:
    - key: autopilot.ibm.com/gpuhealth
      value: ERR
      effect: NoSchedule
    - key: autopilot.ibm.com/gpuhealth
      value: EVICT
      effect: NoExecute
```
`resourceTaints` is a map from resource names to taints. For this example configuration, for exactly those Pods that have a non-zero resource request for `nvidia.com/gpu`, the AppWrapper controller will automatically inject the stanza below into the `affinity` portion of their Spec.
```yaml
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
    - matchExpressions:
      - key: autopilot.ibm.com/gpuhealth
        operator: NotIn
        values:
        - ERR
        - EVICT
```