Metrics

Inspect Karpenter Metrics

Karpenter exposes several metrics in Prometheus format so that you can monitor cluster provisioning status. By default, these metrics are available at karpenter.karpenter.svc.cluster.local:8080/metrics. The port is configurable via the METRICS_PORT environment variable.
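As a rough illustration of what the endpoint returns, the sketch below parses Prometheus text-format output into Python tuples. The sample payload, its values, and the `provisioner` label name are made up for the example and are not taken from a real cluster. The `parse_metrics` helper is hypothetical, not part of Karpenter or any Prometheus client library.

```python
import re

# Matches: metric_name{label="value",...} 12.5   (the label block is optional)
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Hypothetical payload; in a cluster the text would come from an HTTP GET on
# http://karpenter.karpenter.svc.cluster.local:8080/metrics
SAMPLE = """\
# TYPE karpenter_nodes_created counter
karpenter_nodes_created{provisioner="default"} 7
karpenter_consistency_errors 0
"""

samples = parse_metrics(SAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```

In practice a Prometheus server scrapes this endpoint directly; manual parsing like this is mainly useful for quick inspection or debugging.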

Controller Runtime Metrics

`controller_runtime_active_workers`

Number of currently used workers per controller

`controller_runtime_max_concurrent_reconciles`

Maximum number of concurrent reconciles per controller

`controller_runtime_reconcile_errors_total`

Total number of reconciliation errors per controller

`controller_runtime_reconcile_time_seconds`

Length of time per reconciliation per controller

`controller_runtime_reconcile_total`

Total number of reconciliations per controller

Consistency Metrics

`karpenter_consistency_errors`

Number of consistency checks that have failed.

Deprovisioning Metrics

`karpenter_deprovisioning_actions_performed`

Number of deprovisioning actions performed. Labeled by deprovisioner.

`karpenter_deprovisioning_eligible_machines`

Number of machines eligible for deprovisioning by Karpenter. Labeled by deprovisioner.

`karpenter_deprovisioning_evaluation_duration_seconds`

Duration of the deprovisioning evaluation process in seconds.

`karpenter_deprovisioning_replacement_machine_initialized_seconds`

Amount of time required for a replacement machine to become initialized.

Interruption Metrics

`karpenter_interruption_actions_performed`

Number of notification actions performed. Labeled by action.

`karpenter_interruption_deleted_messages`

Count of messages deleted from the SQS queue.

`karpenter_interruption_message_latency_time_seconds`

Length of time between message creation in queue and an action taken on the message by the controller.

`karpenter_interruption_received_messages`

Count of messages received from the SQS queue. Broken down by message type and whether the message was actionable.

Machines Metrics

`karpenter_machines_created`

Number of machines created in total by Karpenter. Labeled by reason the machine was created and the owning provisioner.

`karpenter_machines_initialized`

Number of machines initialized in total by Karpenter. Labeled by the owning provisioner.

`karpenter_machines_launched`

Number of machines launched in total by Karpenter. Labeled by the owning provisioner.

`karpenter_machines_registered`

Number of machines registered in total by Karpenter. Labeled by the owning provisioner.

`karpenter_machines_terminated`

Number of machines terminated in total by Karpenter. Labeled by reason the machine was terminated.

Provisioner Metrics

`karpenter_provisioner_limit`

The Provisioner Limits are the limits specified on the provisioner that restrict the quantity of resources provisioned. Labeled by provisioner name and resource type.

`karpenter_provisioner_scheduling_duration_seconds`

Duration of scheduling process in seconds. Broken down by provisioner and error.

`karpenter_provisioner_usage`

The Provisioner Usage is the amount of resources that have been provisioned by a particular provisioner. Labeled by provisioner name and resource type.

`karpenter_provisioner_usage_pct`

The Provisioner Usage Percentage is the percentage of each resource used, in the range [0,100], based on the resources provisioned and the limits that have been configured. Labeled by provisioner name and resource type.
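The relationship between the usage, limit, and usage percentage metrics can be sketched as a simple ratio. This is an illustration of the arithmetic described above, not Karpenter's actual implementation. The `usage_pct` function and its handling of a missing or zero limit are assumptions.

```python
def usage_pct(usage, limit):
    """Return usage as a percentage of limit, or None if no positive limit is set.

    Assumption: both values are expressed in the same unit for one resource
    type (for example, CPU cores or bytes of memory).
    """
    if limit <= 0:
        return None
    return 100.0 * usage / limit


# Hypothetical values: 250 CPU cores provisioned against a limit of 1000.
cpu_pct = usage_pct(250, 1000)
print(cpu_pct)
```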

Nodes Metrics

`karpenter_nodes_allocatable`

Node allocatable is the amount of each resource that nodes make available for scheduling pods.

`karpenter_nodes_created`

Number of nodes created in total by Karpenter. Labeled by owning provisioner.

`karpenter_nodes_system_overhead`

Node system daemon overhead is the amount of resources reserved for system overhead: the difference between the node’s capacity and allocatable values, as reported in the node status.
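The overhead described above is a per-resource subtraction of allocatable from capacity. The sketch below shows that calculation on plain dictionaries. The `system_overhead` helper, the resource names, and the sample numbers are hypothetical, and real node status values are Kubernetes quantity strings that would need parsing first.

```python
def system_overhead(capacity, allocatable):
    """Return capacity minus allocatable for every resource in capacity.

    Assumption: both mappings use the same numeric unit per resource
    (here CPU in millicores and memory in bytes).
    """
    return {
        resource: capacity[resource] - allocatable.get(resource, 0)
        for resource in capacity
    }


# Hypothetical node: 4 vCPUs and 16 GiB of memory, with part of each reserved.
capacity = {"cpu": 4000, "memory": 16 * 1024**3}
allocatable = {"cpu": 3920, "memory": 15 * 1024**3}

overhead = system_overhead(capacity, allocatable)
print(overhead)
```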

`karpenter_nodes_terminated`

Number of nodes terminated in total by Karpenter. Labeled by owning provisioner.

`karpenter_nodes_termination_time_seconds`

The time taken between a node’s deletion request and the removal of its finalizer.

`karpenter_nodes_total_daemon_limits`

Node total daemon limits are the resources specified by DaemonSet pod limits.

`karpenter_nodes_total_daemon_requests`

Node total daemon requests are the resources requested by DaemonSet pods bound to nodes.

`karpenter_nodes_total_pod_limits`

Node total pod limits are the resources specified by non-DaemonSet pod limits.

`karpenter_nodes_total_pod_requests`

Node total pod requests are the resources requested by non-DaemonSet pods bound to nodes.

Pods Metrics

`karpenter_pods_startup_time_seconds`

The time from pod creation until the pod is running.

`karpenter_pods_state`

Pod state is the current state of pods. This metric can be used in several ways, as it is labeled by the pod name, namespace, owner, node, provisioner name, zone, architecture, capacity type, instance type and pod phase.

Cloudprovider Metrics

`karpenter_cloudprovider_duration_seconds`

Duration of cloud provider method calls. Labeled by the controller, method name and provider.

`karpenter_cloudprovider_errors_total`

Total number of errors returned from CloudProvider calls.

`karpenter_cloudprovider_instance_type_cpu_cores`

Number of vCPU cores for a given instance type.

`karpenter_cloudprovider_instance_type_memory_bytes`

Memory, in bytes, for a given instance type.

`karpenter_cloudprovider_instance_type_price_estimate`

Estimated hourly price used when making informed decisions on node cost calculation. This is updated once on startup and then every 12 hours.

Cloudprovider Batcher Metrics

`karpenter_cloudprovider_batcher_batch_size`

Size of the request batch per batcher

`karpenter_cloudprovider_batcher_batch_time_seconds`

Duration of the batching window per batcher