Kubernetes v1.36: Tracking Route Sync Efficiency with a New Counter Metric

Asked 2026-05-01 20:18:07 Category: Cloud Computing

Kubernetes v1.36 introduces a new alpha metric, route_controller_route_sync_total, in the Cloud Controller Manager (CCM) route controller implementation. This counter lets operators monitor how often routes are synchronized with the underlying cloud provider. It is designed to complement the CloudControllerManagerWatchBasedRoutesReconciliation feature gate (introduced in v1.35) by making it possible to A/B test the two reconciliation strategies. Switching from a fixed-interval loop to a watch-based approach can reduce unnecessary API calls and ease pressure on cloud provider rate limits and quota. This article answers key questions about the metric, its purpose, and how to use it effectively.

What is the new metric introduced in Kubernetes v1.36 for the Cloud Controller Manager?

The new metric is an alpha counter named route_controller_route_sync_total, implemented in the CCM route controller code at k8s.io/cloud-provider. Each increment of this counter represents a single route synchronization between the Kubernetes cluster and the cloud provider. Operators can collect it by scraping the cloud controller manager's metrics endpoint, for example with Prometheus. It is a direct indicator of reconciliation activity, which makes it easier to evaluate the impact of different route syncing methods.
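
As a rough illustration of reading the counter from scrape output, the sketch below parses Prometheus text-format exposition and returns the metric's value. The sample payload, its help text, and the helper name read_counter are made up for illustration; only the metric name comes from the release.

```python
METRIC = "route_controller_route_sync_total"

# Hypothetical scrape output; the help text is illustrative, not the real one.
SAMPLE = """\
# HELP route_controller_route_sync_total (illustrative help text)
# TYPE route_controller_route_sync_total counter
route_controller_route_sync_total 60
"""


def read_counter(exposition_text: str, name: str = METRIC) -> float:
    """Sum every sample of `name` found in Prometheus text-format output."""
    total = 0.0
    found = False
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like: <name>[{labels}] <value> [timestamp]
        if "{" in line:
            metric, _, rest = line.partition("{")
            rest = rest.rpartition("}")[2]  # drop the label set
        else:
            metric, _, rest = line.partition(" ")
        if metric != name:
            continue
        total += float(rest.split()[0])
        found = True
    if not found:
        raise KeyError(f"metric {name!r} not present in scrape output")
    return total


print(read_counter(SAMPLE))
```

In practice a monitoring system such as Prometheus does this parsing for you; the helper is only useful for quick manual checks against saved scrape output.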

Why was this metric added to Kubernetes?

The primary motivation was to provide a reliable, observable way to validate the CloudControllerManagerWatchBasedRoutesReconciliation feature gate, which debuted in Kubernetes v1.35. Before this metric existed, engineers had limited visibility into whether the new watch-based reconciliation logic was actually reducing syncs. By tracking route_controller_route_sync_total, operations teams can now quantitatively compare behavior with the feature gate enabled versus disabled. This data-driven approach helps confirm that the watch-based method only triggers syncs when node changes occur, thereby eliminating unnecessary calls to infrastructure APIs.

How does the CloudControllerManagerWatchBasedRoutesReconciliation feature gate work?

When the feature gate is disabled (the default), the route controller runs a fixed-interval loop that synchronizes routes at a constant rate, regardless of whether any nodes have changed. This produces a steady stream of API calls even in stable clusters. Enabling the feature gate switches the controller to a watch-based approach: instead of polling, it listens for node add, remove, and update events and reconciles routes only when necessary. This reduces the number of requests sent to the cloud provider, which lowers pressure on rate-limited APIs and makes better use of available quota. The benefit is greatest in large clusters and in clusters whose node set rarely changes.

How can operators A/B test the feature gate using the new metric?

Operators can A/B test by running two comparable clusters, one with the feature gate disabled (the default) and one with it enabled, or by running the same cluster with the gate off and then on. The gate is configured on the cloud controller manager rather than on individual nodes, so the unit of comparison is a CCM instance. Collect baseline data with the gate off, enable it on a subset of clusters, and compare how much route_controller_route_sync_total grows over windows of equal length. Compare growth rather than raw values, because a counter restarts from zero whenever the CCM process restarts. In clusters where nodes rarely change, the watch-based approach should show much lower counter growth.
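
One way to make that comparison concrete is to compute the counter's increase over each window and derive the fraction of syncs avoided. The sketch below assumes you have already exported ordered counter samples for both arms; the function names and the sample values are hypothetical, and the reset handling mirrors the usual convention that a drop in a counter means the process restarted.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase across ordered counter samples, tolerating resets."""
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole
        # current value counts as new increase.
        increase += cur - prev if cur >= prev else cur
    return increase


def sync_reduction(baseline: Sequence[float], candidate: Sequence[float]) -> float:
    """Fraction of syncs avoided by the candidate over an equal-length window."""
    base = counter_increase(baseline)
    if base == 0:
        raise ValueError("baseline shows no syncs; cannot compute a reduction")
    return 1.0 - counter_increase(candidate) / base


# Hypothetical samples taken at the same three instants in each arm.
gate_off = [0, 60, 120]
gate_on = [1, 1, 2]
print(f"syncs avoided: {sync_reduction(gate_off, gate_on):.1%}")
```

The result is only meaningful when both windows cover the same duration and a similar level of node churn.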

What does the counter show with and without the feature gate? (Example scenarios)

Without the feature gate (fixed-interval loop), the counter increments at a constant rate even if nothing changes. The figures below correspond to one sync every 10 seconds:

  • After 10 minutes with no node changes: route_controller_route_sync_total = 60
  • After 20 minutes, still no changes: route_controller_route_sync_total = 120

With the feature gate enabled (watch-based), the counter only increases when actual node events occur:

  • After 10 minutes with no node changes: route_controller_route_sync_total = 1
  • After 20 minutes, still no changes: counter remains 1
  • After a new node joins: route_controller_route_sync_total = 2

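Over the same 20 idle minutes, the fixed-interval loop performs 120 syncs while the watch-based controller performs 1, which shows how large the efficiency gain can be in stable clusters.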
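
The scenario numbers can be reproduced with a toy model. It is a sketch under two assumptions read off the figures above, not a description of the controller's code: the fixed loop fires every 10 seconds, and the watch-based path performs one initial sync plus one sync per node event. The function names and the event timestamp are hypothetical.

```python
def fixed_interval_syncs(elapsed_s: int, interval_s: int = 10) -> int:
    """Syncs performed by a loop that fires every interval_s seconds."""
    return elapsed_s // interval_s


def watch_based_syncs(elapsed_s: int, node_event_times_s: list[int]) -> int:
    """One initial sync plus one sync per node event seen so far."""
    events_seen = sum(1 for t in node_event_times_s if t <= elapsed_s)
    return 1 + events_seen


# Hypothetical timeline: a single node joins 21 minutes after startup.
node_events = [21 * 60]

for minutes in (10, 20, 25):
    elapsed = minutes * 60
    print(
        minutes,
        fixed_interval_syncs(elapsed),
        watch_based_syncs(elapsed, node_events),
    )
```

The model ignores details a real controller may have, such as coalescing several node events into one sync, so treat its output as an upper-level intuition rather than a prediction.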

Where can I give feedback and learn more about this feature?

Feedback is welcome through these channels:

  • The #sig-cloud-provider channel on Kubernetes Slack
  • The KEP-5237 issue on GitHub
  • The SIG Cloud Provider community page for other communication methods

For deeper understanding, refer to the full KEP-5237 document, which outlines the original motivation, design decisions, and future plans. The Kubernetes community encourages early testing and feedback to refine this alpha feature before it graduates.