Troubleshooting a Persistent CUBIC Congestion Window Stuck Bug in QUIC

Introduction

CUBIC, standardized in RFC 9438, is the default congestion controller in Linux and is widely used by QUIC stacks. It governs how a connection probes for bandwidth, backs off on loss, and recovers. At Cloudflare, our QUIC implementation (quiche) uses CUBIC. We encountered a bug where the congestion window (cwnd) got permanently stuck at its minimum after a congestion collapse and never recovered. This guide walks through diagnosing and fixing that bug: an optimization from the Linux kernel behaved unexpectedly when ported to QUIC, and a one-line fix resolved it.

Source: blog.cloudflare.com

What You Need

Step-by-Step Debugging and Fix Guide

Step 1: Understand CUBIC's Congestion Window Dynamics

CUBIC's core mechanism adjusts the congestion window (cwnd)—the sender-side limit on bytes in flight. Normally, cwnd grows when no loss is detected and shrinks upon loss (multiplicative decrease). After a severe congestion collapse (e.g., heavy early loss), cwnd can be reduced to its minimum value (typically 2 MSS). The algorithm should then probe for capacity and grow cwnd again. In our bug, this growth never happened.
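
The collapse to the minimum window can be sketched in a few lines. This is a simplified model, not quiche's code: the 1200-byte MSS and the helper names are assumptions made for illustration, and only the 0.7 decrease factor comes from RFC 9438.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants, chosen only for this illustration. */
#define MSS_BYTES      1200u               /* assumed datagram size */
#define MIN_CWND_BYTES (2u * MSS_BYTES)    /* minimum window: 2 MSS */

/* Multiplicative decrease: scale cwnd by 0.7 (the CUBIC factor in
 * RFC 9438), but never drop below the minimum window. */
size_t cwnd_after_loss(size_t cwnd)
{
    assert(cwnd >= MIN_CWND_BYTES);  /* the floor is an invariant */
    size_t reduced = cwnd * 7 / 10;
    return reduced < MIN_CWND_BYTES ? MIN_CWND_BYTES : reduced;
}

/* Apply `events` back-to-back loss events and return the resulting cwnd. */
size_t cwnd_after_losses(size_t cwnd, unsigned events)
{
    while (events-- > 0)
        cwnd = cwnd_after_loss(cwnd);
    return cwnd;
}
```

Starting from 10 MSS (12000 bytes), five consecutive loss events are enough to pin this model at the 2400-byte floor, and further losses leave it there. Only growth on ACKs can lift it again.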

Step 2: Reproduce the Symptom – A Failing Test

Set up a test scenario that applies heavy packet loss right after connection establishment, for example 50% loss for the first few RTTs, and then stops dropping packets. Run the test many times (e.g., 100 iterations). With the bug present, about 61% of runs never recover their throughput: cwnd stays at its minimum even after the loss stops. The fact that the failure is intermittent rather than constant is the first clue.
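
Two small pieces of such a test can be sketched as follows. Both are hypothetical helpers rather than part of quiche or any test framework: a deterministic loss schedule that drops every other packet during the lossy RTTs, and a counter for runs that ended with cwnd still at the minimum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical loss schedule: drop every other packet (50% loss) during
 * the first `lossy_rtts` round trips, then deliver everything. */
bool should_drop(unsigned rtt_index, unsigned packet_index, unsigned lossy_rtts)
{
    return rtt_index < lossy_rtts && (packet_index % 2u) == 1u;
}

/* Count the runs whose final cwnd is still at (or below) the minimum. */
size_t count_stuck_runs(const size_t *final_cwnd, size_t runs, size_t min_cwnd)
{
    assert(final_cwnd != NULL || runs == 0);
    size_t stuck = 0;
    for (size_t i = 0; i < runs; i++) {
        if (final_cwnd[i] <= min_cwnd)
            stuck++;
    }
    return stuck;
}
```

Dividing the result of count_stuck_runs() by the number of runs gives the failure rate to compare before and after a fix. A fixed schedule like this makes each run's loss pattern reproducible; whatever randomness remains comes from the stack under test.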

Step 3: Narrow Down the Root Cause – The App-Limited Exclusion

The bug originated from a Linux kernel change that implemented the app-limited exclusion described in RFC 9438 §4.2–12. This rule prevents cwnd from growing when the connection is app-limited (i.e., not sending enough data to fully use the window). In TCP, this exclusion is applied after recovery. When ported to QUIC's CUBIC, the condition inadvertently blocked cwnd growth even when the connection was network-limited—because of differences in how QUIC tracks application pacing. Examine your CUBIC code for a condition like:

if (app_limited) return; // do not update cwnd

This check might be too broad during the recovery phase.
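
To make the guard concrete, here is a minimal sketch under stated assumptions. The struct and function names are invented for illustration. The app-limited test uses one common definition (fewer bytes in flight than the window allows), and the growth rule is a plain "add the acked bytes" step chosen for brevity, not CUBIC's actual growth function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sender state, for illustration only. */
struct cc_state {
    size_t cwnd;             /* congestion window, in bytes */
    size_t bytes_in_flight;  /* sent but not yet acknowledged, in bytes */
};

/* Assumed definition: the sender is app-limited when it has less data
 * in flight than the window would allow. */
bool is_app_limited(const struct cc_state *s)
{
    return s->bytes_in_flight < s->cwnd;
}

/* Guarded ACK handler: cwnd grows only if the window was fully used. */
void on_ack(struct cc_state *s, size_t acked_bytes)
{
    assert(acked_bytes <= s->bytes_in_flight);
    bool app_limited = is_app_limited(s);  /* sample before accounting */
    s->bytes_in_flight -= acked_bytes;
    if (app_limited)
        return;                            /* do not update cwnd */
    s->cwnd += acked_bytes;                /* simplified growth step */
}
```

In this model the guard is harmless as long as the app-limited signal is recomputed from current state on every ACK. It becomes a problem when the signal is stale.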

Step 4: Identify the Exact Code Path

In quiche, the bug manifested in the cubic_ack() function (or equivalent). When an ACK arrives after congestion collapse, the code checks whether the connection is app-limited using a flag that remains set from an earlier idle period. Due to a race condition in QUIC's event handling, the flag is never cleared after recovery, so every subsequent ACK takes the early-return path and cwnd stays at its minimum.
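
The stuck state can be traced with a toy model. Everything here is hypothetical (the struct, the function names, the growth step) and is meant only to show why a flag that is set once and never cleared freezes the window, not to reproduce quiche's actual code path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the buggy state. */
struct cubic_model {
    size_t cwnd;         /* congestion window, in bytes */
    bool   app_limited;  /* sticky flag: set when idle, cleared nowhere */
};

/* An idle period marks the connection as app-limited. */
void model_on_idle(struct cubic_model *c)
{
    c->app_limited = true;
}

/* ACK handler with the early return described above. */
void model_on_ack(struct cubic_model *c, size_t acked_bytes)
{
    assert(c->cwnd > 0);
    if (c->app_limited)
        return;              /* stale flag: growth is skipped every time */
    c->cwnd += acked_bytes;  /* simplified growth step */
}

/* Replay `acks` ACKs on a copy of the state and return the final cwnd. */
size_t cwnd_after_acks(struct cubic_model c, unsigned acks, size_t acked_bytes)
{
    for (unsigned i = 0; i < acks; i++)
        model_on_ack(&c, acked_bytes);
    return c.cwnd;
}
```

With the flag set, no number of ACKs moves cwnd off 2400 bytes in this model. With the flag clear, the same ACK sequence grows the window normally.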

The fix requires clearing that flag at the right moment—ideally right when the recovery period ends.

Step 5: Apply the One-Line Fix

The solution is to reset the app-limited flag (or ensure the exclusion condition is not met) when the first ACK after recovery is processed. In code, this might look like:

if (cubic->recovery_end && ack_time > cubic->recovery_end) { clear_app_limited_flag(); }

This single line breaks the cycle: once recovery ends, the connection is no longer treated as app-limited, allowing cwnd to grow normally.
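
A self-contained sketch of that fix, again on a hypothetical model: the struct layout, the millisecond timestamps, and the convention that recovery_end == 0 means "never in recovery" are all assumptions, and the growth step is simplified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the state after the fix. */
struct cubic_fixed {
    size_t   cwnd;          /* congestion window, in bytes */
    bool     app_limited;   /* may be stale from an earlier idle period */
    uint64_t recovery_end;  /* ms timestamp; 0 = never in recovery */
};

void fixed_on_ack(struct cubic_fixed *c, uint64_t ack_time, size_t acked_bytes)
{
    assert(c != NULL);
    /* The fix: an ACK that arrives after recovery ended clears the flag. */
    if (c->recovery_end && ack_time > c->recovery_end)
        c->app_limited = false;
    if (c->app_limited)
        return;                 /* do not update cwnd */
    c->cwnd += acked_bytes;     /* simplified growth step */
}
```

Because this sketch mirrors the one-line condition literally, it clears the flag on every ACK after recovery. A real implementation would presumably also set the flag again when the connection genuinely becomes app-limited, so the exclusion keeps working outside of recovery.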

Step 6: Verify the Fix

Re-run your heavy-loss test suite. The failure rate should drop to near zero. Also test other scenarios (light loss, no loss) to ensure the fix doesn't cause regressions. Monitor cwnd traces to confirm normal growth after recovery.
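
Checking a cwnd trace can be automated with a small predicate. The helper below is a hypothetical sketch: it assumes the trace is an array of cwnd samples in bytes and that the index of the sample taken when recovery ended is known.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the last sample of the trace is above both the minimum
 * window and the sample taken when recovery ended, i.e. cwnd grew again. */
bool trace_recovers(const size_t *trace, size_t len,
                    size_t recovery_end_index, size_t min_cwnd)
{
    assert(trace != NULL && recovery_end_index < len);
    size_t last = trace[len - 1];
    return last > min_cwnd && last > trace[recovery_end_index];
}
```

A stuck run produces a trace that flattens at the minimum and fails this check; a healthy run climbs after the recovery point and passes. Asserting the predicate on every iteration of the heavy-loss test turns the visual inspection into a regression test.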

Tips for Avoiding Similar Bugs

By following these steps, you can diagnose and fix a stuck cwnd bug that might otherwise remain hidden. The key takeaway: even a well-meaning optimization can break in unexpected contexts—test thoroughly and be ready to trace back to the original change.
