When the System Went Down, the Emergency Switch Didn't Work Either
This is the story of an ordinary Tuesday afternoon — until it wasn't.
12:00 PM: A Decision That Seemed Harmless
The infra team was doing their usual system check. They opened the Redis dashboard and noticed something odd: a Redis server with zero active connections.
No connections. Meaning, they concluded, nobody was using it. (An idle Pub/Sub subscriber sends no commands, so it can look inactive on a dashboard that tracks activity rather than open sockets.)
In their heads, it probably sounded like: "Perfect time to clean up the config — no one's connected anyway."
They enabled maintenance mode on Redis.
Nobody knew that decision would set off 12 hours of chaos.
12:05 PM: Something Is Wrong
Five minutes later, the monitoring dashboard started turning red.
Transactions through the TCB channel — stuck. Not one or two. Dozens. The counter on the dashboard climbing every second.
The team jumped in immediately. Standard procedure: switch to the backup channel, enable maintenance mode to stop accepting new transactions while investigating.
Click. No response.
Click again. Still nothing.
Enable maintenance mode. Command sent. Nothing changed. The system kept accepting new transactions, kept freezing them, kept pushing the counter higher.
"Why can't we switch channels? This has nothing to do with TCB or ACB."
That's the moment everyone realized: this wasn't a normal incident. Normally when transactions freeze, you switch channels and investigate later. This time, the tool for switching channels had died too.
It was like your house was on fire, and the fire hose had no water.
12:05–12:15 PM: Ten Minutes Following the Thread
Can't switch channels → the problem isn't with TCB/ACB.
Maintenance mode won't enable → the problem isn't in the business logic.
Two unrelated mechanisms, both dead at the same time. That meant they shared a common dependency somewhere.
The team started tracing backwards through the code.
Transaction commit mechanism: partner notifies → API receives → publishes to Redis Pub/Sub channel → subscriber listens → commits to DB.
Maintenance mechanism: admin sends command → publishes to Redis Pub/Sub channel → services subscribe → reload config.
Both routes go through Redis Pub/Sub.
And Redis Pub/Sub had just been shut off.
The answer was painfully clear. It wasn't two things dying. It was one thing dying — but it sat at the foundation of everything.
Redis Pub/Sub: The Walkie-Talkie That Doesn't Record
To understand why this was so bad, you need to understand what Redis Pub/Sub actually is.
Imagine a walkie-talkie. You press the button and speak, and whoever's holding a receiver with it switched on hears you. Whoever isn't holding one, or has theirs switched off, hears nothing. And more importantly: nothing is recorded. The message goes out and disappears. There's no way to replay it.
Redis Pub/Sub works exactly the same way. Publisher sends a message into a channel. Subscribers currently listening receive it. Subscriber offline, subscriber disconnected, subscriber not yet connected — that message is gone forever. No retry. No replay. No dead letter queue.
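The semantics are easiest to see in a toy in-memory model (this is the delivery behavior, not the real redis-py API): a message published while nobody is subscribed is simply gone.

```python
# Toy model of fire-and-forget delivery: a published message is handed to
# whoever is subscribed *right now* and never stored anywhere.

class FireAndForgetBroker:
    def __init__(self):
        self.subscribers = {}  # channel -> list of callbacks

    def subscribe(self, channel, callback):
        self.subscribers.setdefault(channel, []).append(callback)

    def unsubscribe(self, channel):
        self.subscribers.pop(channel, None)

    def publish(self, channel, message):
        # Delivered only to currently attached subscribers; nothing is buffered.
        for cb in self.subscribers.get(channel, []):
            cb(message)


broker = FireAndForgetBroker()
received = []

broker.subscribe("tx.commit", received.append)
broker.publish("tx.commit", "tx-1001")   # subscriber listening: delivered

broker.unsubscribe("tx.commit")          # e.g. the Redis maintenance window
broker.publish("tx.commit", "tx-1002")   # nobody listening: lost, no replay

broker.subscribe("tx.commit", received.append)
broker.publish("tx.commit", "tx-1003")   # delivered again after reconnect

print(received)  # ['tx-1001', 'tx-1003']; tx-1002 is unrecoverable
```

Everything published during the gap behaves like tx-1002 here, which is exactly what happened to the 25 minutes of partner notifications.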
This property has a name in distributed systems: fire-and-forget.
It's perfectly suited for ephemeral things: cache invalidation signals, live dashboard updates, real-time notifications. Things where missing one message is fine because the next one comes right away.
But committing a financial transaction is not that kind of thing. Each notify from TCB or ACB is a one-time event. If the subscriber wasn't listening at that exact moment — the window to commit that transaction is gone.
Compare this to Kafka: messages are written to disk with offsets, consumers can reconnect after a disruption and read back from exactly where they left off. Kafka has durability — like a recording device that never erases the tape. Redis Pub/Sub does not.
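The contrast shows up in the same toy setting if the broker is modeled as a durable log (illustrative, not the real Kafka client): messages are retained, and a consumer that tracks its own offset can replay everything it missed while disconnected.

```python
# Toy model of a Kafka-style durable log: messages are appended and kept,
# and each consumer remembers an offset it can resume from.

class DurableLog:
    def __init__(self):
        self.entries = []

    def append(self, message):
        self.entries.append(message)

    def read_from(self, offset):
        return self.entries[offset:]


log = DurableLog()
consumer_offset = 0
processed = []

log.append("tx-1001")
for msg in log.read_from(consumer_offset):  # consumer online: reads tx-1001
    processed.append(msg)
consumer_offset = len(log.entries)

log.append("tx-1002")                       # published while consumer is down
log.append("tx-1003")

for msg in log.read_from(consumer_offset):  # reconnect: replays the backlog
    processed.append(msg)
consumer_offset = len(log.entries)

print(processed)  # ['tx-1001', 'tx-1002', 'tx-1003']; nothing lost
```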
This was a fundamental mismatch between the tool and the problem it was solving. And it had been sitting in production, quietly waiting, since the day the system was built.
The Blueprint Nobody Questioned
Tracing further, the team saw the full picture.
The cashout system had two layers:
Data plane — handles transactions: receives partner notify → publishes to Redis → subscriber commits to DB.
Control plane — controls the system: admin publishes config → Redis → services reload → switch channels/enable maintenance.
Both layers depended on the same Redis Pub/Sub instance.
In systems theory, this is the most classic form of a Single Point of Failure: one component whose failure removes the system's ability to recover itself. What made this particularly dangerous: it didn't just kill the data plane — it killed the control plane along with it. You had no hands left to fix anything.
This is a worse problem than "the car breaks down on the highway." This is "the car breaks down, and the radio to call for help breaks down too."
Who designed it this way? Nobody, really. More accurately: nobody asked the right question early enough.
When the system was first built, Redis Pub/Sub was fast, simple, and worked fine in dev. Nobody stopped to ask: "If Redis goes down, can we still control the system?"
12:20–12:45 PM: Getting Control Back
12:20 PM, the team reached out to infra: "Something is wrong with Redis — we need it restored immediately."
The infra team only now understood what had happened. They began rolling back.
12:30 PM, Redis restored. Pub/Sub reconnected. Subscribers started listening again.
But the transactions that had been stuck from 12:05 to 12:30 — 25 minutes of partner notifications with no subscriber to receive them — those messages were gone. Redis Pub/Sub doesn't buffer. Nothing to replay.
Total backlog: approximately 22,000 transactions requiring manual processing.
1:00 PM – Midnight: The Real Cost
This is the part incident reports underemphasize. It's also the heaviest part.
From 1:00 PM to 6:00 PM, the team coordinated with TCB and ACB to re-notify each transaction individually. Not a script. Manual coordination — confirming each batch, tracking each round of commits. Five hours.
Recipients had money in their accounts since noon. Senders opened the app and still saw "transaction processing." They called. They messaged. They worried. CS tickets surged. The phones never stopped.
And the thing nobody said out loud but everyone understood: customers don't distinguish between "the system is being fixed" and "the system can't be trusted." Every stuck transaction is one more moment a user asks themselves whether to come back.
From 6:00 PM to midnight, the accounting team handled what remained — transactions the partners couldn't automatically re-notify, reconciled manually, one line at a time.
Midnight. Incident closed. The actual financial damage: zero. But 12 hours lost, trust eroded, and a lot of people's evenings gone.
Déjà Vu
Back in September 2025, something almost identical happened.
Kafka disconnected. VA transaction sync dropped. Same pattern: message broker as Single Point of Failure, no fallback mechanism, no early detection.
A report was written. Lessons were documented. Action items were assigned.
Six months later, March 24, 2026 — same pattern. Different broker. Different feature.
The lessons from last time hadn't been applied broadly enough.
That sentence in the incident report sounds neutral. It isn't.
Three Things to Fix, in Order
The first and non-negotiable priority: migrate the transaction commit flow from Redis Pub/Sub to Kafka. Kafka has durability, consumer offsets, and replay after reconnection — a consumer that drops and comes back can pick up exactly where it left off. That's the minimum bar for anything touching money.
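A minimal sketch of the at-least-once pattern this migration buys, with in-memory stand-ins for the broker log and the DB (hypothetical names throughout): the consumer advances its committed offset only after the DB write succeeds, so a crash between receiving and committing causes a replay, never a lost transaction.

```python
# At-least-once consumption, sketched: commit the offset only after the
# DB write. Replays are possible, so the DB write must be idempotent.

log = ["tx-1001", "tx-1002", "tx-1003"]  # stand-in for the durable broker log
db = []                                   # stand-in for the transactions table
committed_offset = 0                      # persisted consumer offset

def run_consumer(fail_on=None):
    """Process from the last committed offset; optionally crash on one tx."""
    global committed_offset
    for offset in range(committed_offset, len(log)):
        tx = log[offset]
        if tx == fail_on:
            return  # crash before commit: the offset was not advanced
        if tx not in db:                   # idempotent: safe under replay
            db.append(tx)
        committed_offset = offset + 1      # commit only after the DB write

run_consumer(fail_on="tx-1002")  # consumer dies mid-batch
run_consumer()                   # restart: replays tx-1002, then tx-1003

print(db)  # ['tx-1001', 'tx-1002', 'tx-1003']
```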
In parallel, Redis cache needs a DB fallback with a write-through strategy — DB as source of truth, Redis as the fast layer on top. When Redis dies, the system gets slower, but it doesn't break.
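A minimal sketch of that strategy, with dicts standing in for the DB and Redis (illustrative names): writes go to the DB first, reads try the cache and fall back, so a dead cache degrades speed rather than correctness.

```python
# Write-through cache with DB fallback: the DB is the source of truth,
# the cache is only an accelerator.

class Store:
    def __init__(self):
        self.db = {}           # source of truth
        self.cache = {}        # fast layer
        self.cache_up = True   # flips to False when the cache dies

    def write(self, key, value):
        self.db[key] = value             # DB first: the write is never lost
        if self.cache_up:
            self.cache[key] = value      # then keep the cache in sync

    def read(self, key):
        if self.cache_up and key in self.cache:
            return self.cache[key]       # fast path
        return self.db[key]              # fallback: slower, still correct


store = Store()
store.write("tx-1001", "committed")

store.cache_up = False                   # cache outage
store.write("tx-1002", "committed")      # writes still land in the DB
print(store.read("tx-1001"))             # committed
print(store.read("tx-1002"))             # committed
```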
And just as important: the control plane can no longer depend on Pub/Sub. Instead, services should self-poll config from DB every 30 seconds. No Redis push required, no Pub/Sub needed. When Redis dies, the control plane stays alive — you keep your hands free to act.
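One way that polling loop might look, sketched with a dict standing in for the config table and a short interval so it runs quickly (30 seconds in production):

```python
# Control-plane self-polling: each service re-reads its config from the DB
# on a timer instead of waiting for a Pub/Sub push.

import threading
import time

config_db = {"channel": "TCB", "maintenance": False}  # stand-in for a DB row
current = dict(config_db)                             # in-memory config

def poll_config(interval_s, stop):
    """Refresh the in-memory config from the DB every interval."""
    while not stop.is_set():
        current.update(config_db)   # one cheap SELECT per interval
        stop.wait(interval_s)

stop = threading.Event()
poller = threading.Thread(target=poll_config, args=(0.05, stop))
poller.start()

config_db["maintenance"] = True     # admin flips the flag in the DB
time.sleep(0.2)                     # wait out a couple of poll intervals
stop.set()
poller.join()

print(current["maintenance"])  # True, and no broker was involved
```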
The principle behind all three comes down to one sentence: the control plane must be independent from the data plane. When the data plane has an incident, you need to retain the ability to direct the system. If you lose both at once, you're just a spectator.
The Question Worth Asking in Every Design Review
Looking back at this incident, one question — asked early enough — would have changed everything:
"If this component goes down, can we still turn the system off?"
Not "will the system keep running." But "can we still control it."
That's the question that separates a resilient architecture from a brittle one. Resilient doesn't mean never failing. Resilient means when you fail, you still have enough tools to manage the situation.
One more thing: shared infrastructure needs change management. Redis, Kafka, shared databases — before changing any of them, notify every dependent team first. Not because of distrust. Because nobody can know every dependency in a system large enough to matter.
A short maintenance window at noon, executed by a team that had no way of knowing someone was using that Redis instance.
22,000 transactions. 12 hours. One very long night for the accounting team.
Single Points of Failure don't advertise themselves. They're not red on the architecture diagram. They don't warn you at deploy time.
They just sit there, patient, waiting for the exact day someone decides to run maintenance at noon.