Little Rabbit (Part 3) - The Night of 500,000 Connections

8:09 PM

January 11, 2023.

I was about to leave the office when my phone rang. SOC calling. The voice on the other end sounded more tense than usual: "Hey, RabbitMQ has a problem. Message rate dropped to 0."

Message rate at 0. Not decreased. Zero.

I opened my laptop and accessed the dashboard. What appeared on the screen froze me in place.

Connections were at 500,000. Normal was 50,000. Ten times the usual level.

Message rate was jumping like the heartbeat of a dying man - up a few hundred then dropping to 0, up then down again. The system was gasping for breath.

I immediately called Hieu. "RabbitMQ is dead."

Haunting Numbers

While waiting for the team to come online, I looked at the business metrics dashboard. This is when my heart truly started racing.

1.8 million transactions lost. 446,000 users affected.

All services using RabbitMQ were dead. Core banking, payment, user service, notification. Everything. The entire payment system was "clinically dead" - still breathing but no longer alive.

Hunting the Culprit

When the whole team was online, the first question was: What triggered this incident?

Checking the logs, we found the first clue. Around 8 PM - the same time as the incident - the Core system had sent a large batch of commands to the Database, causing locks. Database connections spiked dramatically.

But that was just the trigger, not the root cause. What does database trouble have to do with RabbitMQ?

The answer lies in a phenomenon we later called Connection Storm.

The Deadly Herd Effect

Imagine a small bridge crossing a river. Normally, 100 people cross per minute - comfortable, no pushing.

One day, a small incident causes the bridge to jam for 30 seconds. People waiting start getting impatient. They push forward trying to cross faster. But the more they push, the more jammed it gets. The more jammed, the more people push.

And then, the bridge collapses.

That's exactly what happened to our RabbitMQ.

When the Database lock occurred, some services depending on the database started slowing down. Slowing down meant holding RabbitMQ connections longer. Connections held longer meant RabbitMQ started overloading.

And this is when our "auto reconnect" config became the killer.

// Config at that time
.setReconnectAttempts(Integer.MAX_VALUE)  // INFINITE retry
.setReconnectInterval(1000L)               // Retry every 1 second

Integer.MAX_VALUE. Infinite retry. Every second.

When RabbitMQ started slowing down, services started timing out. And when they timed out, what did they do? Reconnect. One second later. No backoff. No limit.

100 workers, each capable of creating up to 3,000 connections. That is up to 300,000 connections, all reconnecting simultaneously. Not once, but again every second.

RabbitMQ was already overloaded, now having to handle hundreds of thousands of connection requests per second. It got slower. More timeouts. More reconnects.

Vicious cycle. Panicked herd.

Within minutes, the system went from "a bit slow" to "completely dead."
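
To put rough numbers on that cycle, here is a back-of-the-envelope sketch in Java. It uses the figures from this story (100 workers, up to 3,000 connections each, a 1-second retry interval). The class and method names are made up for illustration, and it assumes the worst case where every connection has dropped and retries exactly on schedule.

```java
/** Back-of-the-envelope estimate of reconnect load during a connection storm. */
final class ReconnectStormEstimate {

    /** Worst case: every connection of every worker has dropped and retries on a fixed interval. */
    static long attemptsPerSecond(int workers, int maxConnectionsPerWorker, long retryIntervalMillis) {
        long retryingConnections = (long) workers * maxConnectionsPerWorker;
        // Each connection makes one attempt per interval, so the rate is connections / interval.
        return retryingConnections * 1000L / retryIntervalMillis;
    }

    public static void main(String[] args) {
        // Figures from the incident: 100 workers, up to 3,000 connections each, retry every 1,000 ms.
        System.out.println(attemptsPerSecond(100, 3000, 1_000L));  // 300000
        // The same fleet with a 10-second interval between attempts.
        System.out.println(attemptsPerSecond(100, 3000, 10_000L)); // 30000
    }
}
```

Even in this simplified model, a broker that is already struggling is asked to accept hundreds of thousands of new connections every second, and nothing in the config ever makes the clients stop.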

TIME_WAIT - The Silent Killer

But the story doesn't end there.

When checking HAProxy, we discovered another problem: hundreds of thousands of connections in TIME_WAIT state.

To understand TIME_WAIT, you need to know how TCP connections work. When a connection is closed, it doesn't disappear immediately. The side that closes first keeps it in a "waiting" state for about 60 seconds, so that the final ACK can be resent if it gets lost and delayed packets from the old connection are not mistaken for part of a new one.

Normally, this isn't a problem. But with 500,000 connections continuously closing and reopening every second, the number of TIME_WAIT connections grew uncontrollably. Each TIME_WAIT connection no longer holds a file descriptor, but it still occupies a port and a portion of kernel memory.

The system was drowning in its own sea of connections.
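
A rough way to see why the pile grows so fast: in steady state, the number of sockets in TIME_WAIT is roughly the close rate multiplied by how long each one waits. The Java sketch below is only an estimate under stated assumptions. The 5,000 closes per second is a hypothetical figure, not a measurement from that night. The port range is the Linux default (32768 to 60999), which may differ on your hosts. The port ceiling only applies when TIME_WAIT accumulates on the side that opens connections toward a single destination IP and port, and it ignores kernel options that allow reuse.

```java
/** Rough steady-state estimate of sockets parked in TIME_WAIT. */
final class TimeWaitEstimate {

    // Linux default for net.ipv4.ip_local_port_range is 32768..60999 (assumed unchanged on the host).
    static final int EPHEMERAL_PORTS = 60999 - 32768 + 1; // 28232

    /** Sockets in TIME_WAIT once closes have arrived at a steady rate for at least holdSeconds. */
    static long socketsInTimeWait(long closesPerSecond, long holdSeconds) {
        return closesPerSecond * holdSeconds;
    }

    /** Highest close rate one source IP can sustain toward one destination IP:port before ports run out. */
    static long maxSustainableClosesPerSecond(long holdSeconds) {
        return EPHEMERAL_PORTS / holdSeconds;
    }

    public static void main(String[] args) {
        // Hypothetical churn of 5,000 closes per second with a 60-second TIME_WAIT.
        System.out.println(socketsInTimeWait(5_000, 60));      // 300000
        System.out.println(maxSustainableClosesPerSecond(60)); // 470
    }
}
```

Under these assumptions, a few hundred closes per second per destination is already the ceiling. A reconnect storm goes far beyond that.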

The Hardest Decision

At this point, we faced a difficult decision.

Normally, when there's an incident, you try to fix the problem without shutting down the service. Downtime is the enemy. Every minute of downtime is money, reputation, customers.

But looking at the current situation, Hieu made the decision no one wanted to hear: "Stop everything. Both HAProxy and RabbitMQ."

Stop completely. Not restart. Not graceful shutdown. Stop.

This was a decision I later understood to be absolutely correct. When a system is in "thrashing" state - when every remediation action only makes things worse - the only way is to stop completely and start from scratch.

Like when a computer is hard frozen, sometimes the only option is to power off and restart.

Rising from the Ashes

After stopping both HAProxy and RabbitMQ, we began a controlled "resurrection" process.

First step: limit connections. Previously, each IP could create up to 3,000 connections. We reduced it to 300. One tenth. The goal wasn't permanent restriction, but to ensure that when the system restarted, no one could create another connection storm.

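# New config for HAProxy
stick-table type ip size 100k expire 30s store conn_cur
tcp-request connection reject if { src_conn_cur ge 300 }
# Track each source IP so that its open connections are counted
tcp-request connection track-sc0 src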
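
For readers who don't work with HAProxy, the idea behind that rule can be modeled in a few lines of Java. This is a conceptual sketch only, not how HAProxy is implemented. The class name and methods are hypothetical: keep a counter of open connections per source IP, and refuse new ones once the counter reaches the limit.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Conceptual model of a per-source limit on concurrent connections. */
final class PerSourceConnectionLimiter {
    private final int maxPerSource;
    private final ConcurrentHashMap<String, Integer> open = new ConcurrentHashMap<>();

    PerSourceConnectionLimiter(int maxPerSource) {
        this.maxPerSource = maxPerSource;
    }

    /** Returns true and counts the connection if the source is below its limit, false to reject. */
    boolean tryOpen(String sourceIp) {
        AtomicBoolean accepted = new AtomicBoolean(false);
        // compute() runs atomically per key, so check-and-increment cannot race.
        open.compute(sourceIp, (ip, count) -> {
            int current = (count == null) ? 0 : count;
            if (current >= maxPerSource) {
                return count; // at the limit: leave the counter unchanged
            }
            accepted.set(true);
            return current + 1;
        });
        return accepted.get();
    }

    /** Call once for every accepted connection when it closes. */
    void close(String sourceIp) {
        // Drop the entry entirely when the last connection from this source closes.
        open.computeIfPresent(sourceIp, (ip, count) -> count > 1 ? count - 1 : null);
    }

    public static void main(String[] args) {
        PerSourceConnectionLimiter limiter = new PerSourceConnectionLimiter(300);
        int accepted = 0;
        for (int i = 0; i < 1000; i++) {
            if (limiter.tryOpen("10.0.0.7")) accepted++;
        }
        System.out.println(accepted); // 300
    }
}
```

However hard one source pushes, it can never hold more than its share. That is exactly the property we wanted during the restart.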

Second step: start RabbitMQ first. Check status. Ensure it's stable. Ensure TIME_WAIT connections have dropped to normal levels.

Third step: start HAProxy. Still no traffic came in, because all services had been scaled down to 0.

Final and most important step: start services in priority order. Not all at once. Core banking first. Then payment. Then user profile. Then authentication. Finally, the auxiliary services.

Each time we started a group of services, we paused and monitored. Are connections increasing too fast? Is message rate stable? Only when everything was OK did we continue to the next group.

This process took nearly 2 hours. But when it ended, the system was alive again. Stable. Controllable.
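
The staged startup can be sketched as a small loop. This is a simplified illustration in Java, not the tooling we actually used. The `startGroup` hook and the `systemHealthy` gate are hypothetical stand-ins for "scale this group up" and "connections and message rate look normal". In a real run, the gate would watch the dashboards for a while rather than answer instantly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

/** Starts service groups one at a time and stops as soon as the health gate fails. */
final class StagedStartup {

    /**
     * @param groupsInPriorityOrder groups to start, most critical first
     * @param startGroup            hypothetical hook that scales a group up
     * @param systemHealthy         hypothetical gate checked after each group starts
     * @return the groups that were started and passed the gate
     */
    static List<String> run(List<String> groupsInPriorityOrder,
                            Consumer<String> startGroup,
                            BooleanSupplier systemHealthy) {
        List<String> started = new ArrayList<>();
        for (String group : groupsInPriorityOrder) {
            startGroup.accept(group);
            if (!systemHealthy.getAsBoolean()) {
                // Halt the rollout; the remaining groups stay down until a human decides.
                break;
            }
            started.add(group);
        }
        return started;
    }

    public static void main(String[] args) {
        List<String> order = List.of(
                "core-banking", "payment", "user-profile", "authentication", "auxiliary");
        List<String> started = run(order, g -> System.out.println("starting " + g), () -> true);
        System.out.println("started and healthy: " + started);
    }
}
```

The important design choice is that the loop stops on the first failed check instead of pressing on. During recovery, a half-started system that is stable beats a fully started one that is about to fall over again.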

The Fatal Mistake: Infinite Retry

After the incident, when we sat down to analyze, one question haunted me: Why did we configure infinite reconnect?

The painful answer was: because we thought it was "resilient."

The logic at the time sounded right: "If the connection is lost, just reconnect until successful. Service must always be available. Never give up."

But "never give up" in the context of distributed systems is actually a terrible idea.

Think about it: if RabbitMQ is overloaded, and 1,000 clients are simultaneously retrying every second, what are you doing? You're DDoSing your own system.

Infinite reconnect isn't resilience. It's collective suicide.

The correct config should be:

.setReconnectAttempts(5)        // Only retry 5 times
.setReconnectInterval(10000L)   // 10 seconds between each attempt

If 5 reconnects in 50 seconds don't work, stop. Log the error. Alert. Let the circuit breaker handle it. Let a human step in.

Don't try to reconnect infinitely. You're not a hero. You're making things worse.
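
If your client library doesn't offer these two settings, the same behavior is easy to write by hand. Below is a minimal Java sketch, assuming a hypothetical `connectOnce` callable that performs a single connection attempt. All names are made up for illustration. It tries a fixed number of times with a fixed pause, then throws, so the caller can log, alert, or trip a circuit breaker.

```java
import java.util.concurrent.Callable;

/** Bounded retry: a fixed number of attempts with a fixed pause, then give up loudly. */
final class BoundedReconnect {

    /** Thrown when all attempts are used up, so the caller can alert or trip a circuit breaker. */
    static final class ReconnectExhaustedException extends Exception {
        ReconnectExhaustedException(int attempts, Throwable lastFailure) {
            super("gave up after " + attempts + " attempts", lastFailure);
        }
    }

    static <T> T connect(Callable<T> connectOnce, int maxAttempts, long intervalMillis)
            throws ReconnectExhaustedException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connectOnce.call();
            } catch (InterruptedException e) {
                throw e; // do not swallow interruption
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
            if (attempt < maxAttempts) {
                Thread.sleep(intervalMillis); // pause only between attempts
            }
        }
        throw new ReconnectExhaustedException(maxAttempts, last);
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Hypothetical connect step that fails twice, then succeeds.
        String result = connect(() -> {
            if (++calls[0] < 3) throw new java.io.IOException("broker unavailable");
            return "connected";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

One caveat: a fixed interval still lets many clients retry in lockstep. Adding a random delay to each pause is a common refinement, though it is not something the config above does.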

Architectural Lessons

This incident taught us many things, but one lesson goes beyond technical: don't put all your eggs in one basket.

Before the incident, all services - from critical to non-critical - shared the same RabbitMQ cluster. Payment and notification in the same place. Core banking and analytics in the same place.

This meant: when RabbitMQ died, EVERYTHING died. There was no way to "sacrifice" notification to "save" payment.

After the incident, we split into two clusters:

Primary cluster: Only for critical services - payment, core banking, user authentication
Secondary cluster: For non-critical services - notification, logging, analytics

If the secondary cluster has problems, users can still make payments. They just won't receive notifications immediately - an acceptable trade-off.

This is the principle of graceful degradation: when the system encounters problems, it doesn't die completely, but "degrades" in a controlled manner. The most important features are protected. Less important features can be temporarily sacrificed.
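
In code, the split comes down to a routing decision made per service. The Java sketch below is illustrative only. The service identifiers and class name are hypothetical, and it assumes that anything not explicitly listed as critical goes to the secondary cluster.

```java
import java.util.Set;

/** Routes each service to the primary or secondary cluster by criticality. */
final class ClusterRouting {
    enum Cluster { PRIMARY, SECONDARY }

    // Critical services named above; the identifiers themselves are hypothetical.
    private static final Set<String> CRITICAL =
            Set.of("payment", "core-banking", "user-authentication");

    /** Anything not listed as critical is routed to the secondary cluster. */
    static Cluster clusterFor(String service) {
        return CRITICAL.contains(service) ? Cluster.PRIMARY : Cluster.SECONDARY;
    }

    /** A service keeps working as long as the cluster it is routed to is up. */
    static boolean canPublish(String service, Set<Cluster> clustersUp) {
        return clustersUp.contains(clusterFor(service));
    }

    public static void main(String[] args) {
        // Simulate an outage of the secondary cluster only.
        Set<Cluster> up = Set.of(Cluster.PRIMARY);
        System.out.println(canPublish("payment", up));      // true
        System.out.println(canPublish("notification", up)); // false
    }
}
```

Defaulting unknown services to the secondary cluster is a deliberate choice in this sketch: a new service has to earn its place next to payment, rather than landing there by accident.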

Playbook - What No One Wants to Write But Everyone Needs

One of the first things we did after the incident was write a playbook - a detailed document on how to handle similar RabbitMQ incidents.

Playbooks aren't for casual reading. They're for use at 2 AM, when you've just been woken by PagerDuty, eyes still blurry, brain not yet awake. At that moment, you need a clear checklist: step 1 do this, step 2 do that, verify this way.

We also started regular drills - once a quarter, simulating incidents and following the playbook. Not for "fun," but to ensure that when a real incident happens, everyone knows what to do.

The first drill took 45 minutes to recover. The second, 30 minutes. The most recent, only 15 minutes.

Practice makes perfect. Especially for things you don't want to have to do.

Looking Back

The night of January 11, 2023 was my team's longest night.

1.8 million transactions. 446,000 users. Haunting numbers that I still remember vividly.

But from that night, we learned lessons that no textbook teaches:

Little Rabbit is very powerful, but also very fragile when not configured correctly. An "infinite reconnect" config that seemed harmless could kill the entire system.

Connection storm is a silent killer. It doesn't come from outside, but from your own services. The very "effort" of services killed the system.

Sometimes, the best fix is to stop completely. You don't always need to "fix on the fly." When the system is thrashing, stopping and starting from scratch might be the right choice.

Graceful degradation is mandatory, not optional. Separate critical and non-critical services. When there's an incident, know what needs to be protected and what can be sacrificed.

Little Rabbit is still our reliable companion. But now, we understand it better. Respect it more. And most importantly - never take it for granted again.

Not the End

But the story of Little Rabbit doesn't end here.

A few months later, a new problem emerged. On peak days - the 5th, the 10th of each month - the system would jam again. But this time, the strange thing was all metrics were normal.

Server said processing was fast. Producer said calls were slow. DevOps said RabbitMQ had no issues.

Three teams, three sets of metrics, and everyone was right. But the system was still slow.

Who was right? Who was wrong? And how did a simple mathematical formula called Little's Law help the CTO see what no one else could?

I'll tell that story in the final part of the series.

"Little Rabbit" Series

Part	Title	Key Lesson
Part 1	When HTTP Is No Longer Enough	Competing consumers, natural load balancing
Part 2	The Deadly Traps	Singleton pattern, channel/queue management
Part 3	The Night of 500,000 Connections	Connection storm, graceful degradation
Part 4	A War Without Winners	Little's Law, accept uncertainty

Stay tuned for the final part!