Node Monitoring

The AppWrapper controller can optionally monitor Kubernetes Nodes and dynamically adjust the lendingLimits on a designated ClusterQueue to account for dynamically unavailable resources. This capability is designed to enable cluster admins of an MLBatch cluster to fully automate the small scale quota adjustments required to maintain full cluster utilization in the presence of isolated node failures and/or minor maintenance activities. The monitoring detects both Nodes that are marked as Unscheduable via standard Kubernetes mechanisms and Nodes that have resources that Autopilot has flagged as unhealthy (see Fault Tolerance). The lendingLimit of a designated slack capacity ClusterQueue is automatically adjusted to reflect the current dynamically unavailable resources.

Node monitoring is enabled by the following additional configuration:

slackQueueName: "slack-queue"
autopilot:
  monitorNodes: true

See node_health_monitor.go for the implementation.