3 AM, 47 Alerts, and Lessons About AI in Monitoring
Monitoring · Incident Response · AI · DevOps · Architecture


Phuoc Nguyen · February 10, 2026 · 10 min read

3:07 AM. Tien's phone buzzes.

Then again. Then nonstop.

Tien opens his eyes, reaches for his phone. The screen glares bright in the darkness: 47 notifications from PagerDuty.

Database timeout. API latency spike. Queue backlog. 5xx errors spiking. Memory pressure on 3 nodes. Certificate warning. Consumer lag. Connection pool exhausted.

Tien sits up, trying to stay alert. Mind still foggy. Looking at the chaos of alerts, one question lingers: "Where do I even start?"


When monitoring becomes noise

Tien was the on-call engineer that week. He had five years of experience; this wasn't his first nighttime page. But 47 alerts at once... that was new.

Tien started reading each alert one by one. Database timeout - could be the root cause. API latency - could be a consequence. Queue backlog - also possibly a consequence. Memory pressure - related? Certificate warning - probably unrelated, but why did it appear at the same time?

30 minutes passed. Tien was still fumbling.

That's when Hieu - a senior engineer on the team - woke up from the Slack notification. Hieu jumped into the channel and said:

"Tien, look at the timeline. DB connection pool exhausted at 3:01. Everything else is a domino effect. Focus on DB first."

5 minutes later, Tien found the problem: a deployment at 2:58 AM had changed the connection pool config. Connection limit dropped from 100 to 10 due to a typo in the config file.

Rollback. Restart. Production recovered.

But the bigger question remained: Why did Tien spend 30 minutes fumbling while Hieu only needed 30 seconds to see it?

And more importantly: Why couldn't the monitoring system do that itself?

Alert on phone

Lessons from that night

The next morning, the whole team held a postmortem.

Hung - the DevOps lead - opened: "Our monitoring system did its job. It detected all the problems. But it sent 47 separate alerts instead of telling Tien: 'Hey, 46 of these are consequences, only 1 is the root cause.'"

Tien nodded. "I wasted 30 minutes because I didn't know which one was most important. Every alert was red, every one critical."

Hieu added: "I knew immediately because I'd seen this pattern before. DB connection pool exhausted is always the starting point of a domino chain. But this experience was in my head, not in the system."

Silence for a moment.

Then Hung said: "So what's the solution? Use AI to make alerting smarter?"

That question opened a long discussion - and the lessons I want to share.


Two layers of monitoring: Nervous system and Brain

After that postmortem, the team realized monitoring systems have two very different layers.

Layer 1 - The Nervous System. Like when you touch fire, nerves react immediately - no thinking needed. CPU > 95%? Alert. Disk > 90%? Alert. Service not responding? Alert. Fast. Simple. Reliable. No room for creativity.

Layer 2 - The Brain. Like when you realize "ah, my hand hurts because I just touched fire, and the pot is boiling because I forgot to turn off the stove, and there's smoke because water overflowed." The brain correlates information, reasons, understands context, suggests actions. Slower, but handles complexity.

The most common mistake is confusing these two layers.

Some teams use AI for Layer 1 tasks - the system becomes unpredictable, sometimes missing important alerts.

Some teams try to use rule-based for Layer 2 tasks - resulting in hundreds of complex rules that still don't cover all cases.

Let's dive into each layer.

Monitoring dashboard

Layer 1: Where cron job is king

Remember the article about scalpels and paring knives? This is paring knife territory.

Tasks at this layer need to run with 99.99% reliability. Never miss an alert. No surprises. Simple, deterministic, and boring - exactly as it should be.

Health check every 30 seconds. Job pings each service, response 200 means OK, otherwise count consecutive failures. Hit 3 in a row, alert fires. Logic is if status != 200: alert(). No AI understanding needed.

Resource monitoring every minute. CPU > 85% for 5 continuous minutes means warning, > 95% means critical. Memory, disk similar. Simple threshold comparison. Write once, run forever.

Certificate expiry every day. 30 days left means email warning, 7 days means critical. Date arithmetic. Nothing ambiguous.

Queue depth every minute. RabbitMQ > 10,000 messages and consumer < 3, alert "queue is backlogging." Fixed logic. No need to understand what's in the messages.

Database replication lag every 30 seconds. Lag > 5 seconds means warning, > 30 seconds means critical. Number versus number. Completely deterministic.

Synthetic transaction every 5 minutes. Run fixed scenario: create test order → verify appears in DB → cleanup. Pass or fail. No gray area.

Hung once told the team: "These jobs are like the body's nervous system. You don't need to think to pull your hand back when touching fire. Reflexes must be fast and reliable. If you had to 'think' every time you touched fire, you'd already be burned."

Common pattern: input is numbers, logic is comparison, output is pass/fail. No semantic understanding needed, no context required. This is cron job territory.
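The health-check logic above can be sketched in a few lines. This is a minimal illustration, not the team's actual tooling; the service name and the shape of the alert message are hypothetical. The point is that the entire decision is a counter and a comparison - nothing a model could improve.

```python
# Layer 1 sketch: deterministic health check with consecutive-failure counting.
# No AI anywhere; the only "logic" is a counter compared against a threshold.
FAILURE_THRESHOLD = 3  # consecutive non-200 responses before firing


class HealthChecker:
    def __init__(self, threshold=FAILURE_THRESHOLD):
        self.threshold = threshold
        self.failures = {}  # service -> current consecutive failure count
        self.fired = []     # alerts emitted so far

    def record(self, service, status_code):
        """Feed one probe result; fire exactly once when the threshold is hit."""
        if status_code == 200:
            self.failures[service] = 0  # any success resets the streak
            return
        self.failures[service] = self.failures.get(service, 0) + 1
        if self.failures[service] == self.threshold:
            self.fired.append(f"{service}: {self.threshold} consecutive failures")


checker = HealthChecker()
for code in (200, 500, 500, 500):  # one OK, then three straight failures
    checker.record("payments-api", code)
print(checker.fired)  # → ['payments-api: 3 consecutive failures']
```

A real deployment would run this from a cron job or a Prometheus blackbox probe, but the decision rule stays exactly this boring.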


Layer 2: Where AI Agent shines

Back to that night with 47 alerts.

Cron job did its job - detected all 47 problems. But it couldn't do what Hieu did: look at the timeline, correlate information, and recognize which was root cause, which was domino effect.

This is AI Agent territory.

Correlate alerts and find root cause. Agent reads all 47 alerts, analyzes timeline: DB connection pool exhausted at 3:01, then API timeout at 3:02, queue backlog at 3:03, 5xx errors at 3:04. Agent consolidates into one message: "Root cause likely DB connection pool. 46 other alerts are domino effect. Recommend: check recent deployment and connection pool config."

Each incident, different alert combinations. No rule covers them all.
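The mechanical core of what Hieu did by eye can be sketched as a timeline sort: order the alerts by timestamp and treat the earliest one in a burst as the root-cause candidate, the rest as likely dominoes. A real agent would combine this with a service dependency graph and model-based reasoning; the alert data below is hypothetical, and this shows only the timeline step.

```python
# Timeline correlation sketch: earliest alert in a burst is the suspect,
# everything that fires shortly after is a likely domino effect.
from datetime import datetime, timedelta


def correlate(alerts, window=timedelta(minutes=10)):
    """Group alerts firing within `window` of the earliest one."""
    ordered = sorted(alerts, key=lambda a: a["time"])
    root, rest = ordered[0], ordered[1:]
    dominoes = [a for a in rest if a["time"] - root["time"] <= window]
    return {
        "root_cause_candidate": root["name"],
        "likely_dominoes": [a["name"] for a in dominoes],
    }


alerts = [
    {"name": "api_latency",       "time": datetime(2026, 2, 10, 3, 2)},
    {"name": "db_pool_exhausted", "time": datetime(2026, 2, 10, 3, 1)},
    {"name": "queue_backlog",     "time": datetime(2026, 2, 10, 3, 3)},
    {"name": "5xx_spike",         "time": datetime(2026, 2, 10, 3, 4)},
]
# db_pool_exhausted (3:01) is flagged as the root-cause candidate;
# the other three become likely dominoes.
print(correlate(alerts))
```

Timestamp ordering alone is a heuristic, not proof of causality - which is exactly why this layer produces recommendations, not decisions.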

Read and understand error logs. System generates thousands of error log lines daily. Cron job counts them, but doesn't know which lines matter. Agent distinguishes: NullPointerException at PaymentService appearing 200 times in 10 minutes - serious bug, affects users. TimeoutException at RecommendationService 500 times - low severity, only affects non-critical recommendation feature.

Agent classifies based on business impact, not just counting.

Suggest runbook actions. 3 AM, Tien's mind still foggy. Agent says: "These symptoms resemble incident INC-4521 last month. Root cause then was connection pool config after deployment. Step 1: check recent deployment. Step 2: verify connection pool setting. Step 3: if confirmed, rollback."

Agent doesn't just match keywords but understands pattern similarity.

Auto-write incident summary. After incidents, team must write postmortem. Information is scattered - alert history from PagerDuty, deployment logs from CI/CD, Slack conversation, git commits. Agent collects from multiple sources, consolidates into draft with timeline accurate to the minute.

Each incident unfolds completely differently - can't be templated.

Common pattern: input is unstructured data (logs, alert stream, conversation), needs context understanding to decide, output is nuanced recommendations. This is AI Agent territory.


The story of when AI almost killed production

After the postmortem, Hung decided to experiment with AI in monitoring. He set up an Agent to "smartify" alerting - the Agent would evaluate each alert and decide whether to escalate or not.

Initially everything worked great. Agent suppressed many false positives. Team was bothered less by alert noise. Hung proudly demoed it in sprint review.

Until one Saturday night.

Truong - a junior engineer on-call - didn't receive any alerts all night. Sunday morning, a customer called: "App hasn't been working since 2 AM."

When the team checked, they discovered: the traditional monitoring system had caught the error right at 2:03 AM. But the AI Agent evaluated it as a false positive - because the pattern was "similar" to some unimportant alerts before - and suppressed it.

Production down 6 hours. No one knew.

Hung sat writing the postmortem with a pained expression. Tien sat beside him, silent.

Finally Hung wrote: "Root cause: AI Agent wrongly suppressed critical alert. Lesson learned: AI in monitoring should be an advisor, never a gatekeeper."


Golden principles the team learned

After that incident, the team established several principles:

1. AI must never have authority to suppress alerts.

Agent can say "this alert might be false positive," but the alert must still reach the on-call engineer. Final decision authority stays with humans, not models.
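In code, the advisor-not-gatekeeper rule means there is simply no suppress path: the AI's opinion is attached as metadata, and the alert is delivered no matter what the model says - or whether the model is even reachable. The `ai_assess` function below is a hypothetical stand-in for a model call.

```python
# Principle 1 sketch: the AI may annotate, but can never block delivery.
def ai_assess(alert):
    """Hypothetical stand-in for a model call; may be wrong or unavailable."""
    return {"verdict": "possible false positive", "confidence": 0.7}


def route_alert(alert, pager):
    try:
        alert["ai_note"] = ai_assess(alert)  # advisory metadata only
    except Exception:
        alert["ai_note"] = None              # AI failure never blocks paging
    pager.append(alert)                      # ALWAYS page; no suppress branch
    return alert


pager = []
route_alert({"name": "db_pool_exhausted", "severity": "critical"}, pager)
assert len(pager) == 1  # the alert reached on-call regardless of the verdict
```

Note the structural guarantee: `pager.append` sits outside every conditional, so no model output, bug, or outage in the AI layer can stop a page.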

2. Layer 1 must be independent of AI.

If AI Agent dies, Layer 1 must still work normally. Cron job still pings health check, still compares thresholds, still sends alerts. AI is a supplementary layer, not a replacement layer.

3. The closer AI is to production decisions, the smaller its authority.

At the post-incident layer (writing postmortems, analyzing patterns), AI can roam free. At the response layer (suggesting runbooks), AI only suggests. At the original alerting layer, AI shouldn't intervene at all.

4. Don't use AI to compensate for poor alert config.

Hieu once said: "Many teams think they need AI because the system has too much noise. But the real question is: why do you have so much noise? Are thresholds reasonable? Is alert granularity correct? 70-80% of alert noise problems can be solved by better rule tuning - no AI needed."

Using AI to filter alert noise is like using painkillers instead of curing the disease.


The monitoring architecture the team chose

After all those lessons, here's the architecture Hung's team built:

Layer 1: Collection and Alerting (100% rule-based)

Prometheus, Grafana, AlertManager, PagerDuty. No AI here. This layer must be as simple and reliable as possible. This is the foundation the team bets production on.

Layer 2: Enrichment and Correlation (AI Agent)

Agent receives alert stream from layer below, correlates, adds context, assesses true severity, groups related alerts. But has no authority to suppress or modify original alerts. It only adds an information layer on top.

Layer 3: Response Assistance (AI Agent)

Suggests runbooks, finds similar past incidents, drafts stakeholder communications. But all actual actions (restart service, rollback deploy) are still performed by engineers.

Layer 4: Post-incident (AI Agent)

Draft postmortem, extract action items, identify recurring patterns. This is the least risky place for AI - mistakes don't affect production.

Overarching philosophy: AI is an advisor sitting beside you, not autopilot.

Server room

What I worry about most

There's one thing I always remind the team: don't lose the ability to debug manually.

Debugging is a skill that needs practice. You need to read the logs, reason for yourself, form hypotheses, and verify them. If, for every incident, engineers just ask the Agent and follow its suggestions, then after a year or two, when the Agent suggests something wrong or faces a situation it's never seen, no one on the team will have enough hands-on experience to handle it themselves.

Tien learned a lot after that night of 47 alerts. He started reading logs more carefully, learned to correlate information, built mental models of the system. A few months later, Tien could spot patterns like Hieu.

If that night had AI Agent doing everything, would Tien have learned anything?


Closing thoughts

Back to the postmortem after the 47-alert night.

Hung ended with: "Monitoring is where boring technology wins. A cron job running health checks every 30 seconds isn't sexy, nothing to demo, but it saves production at 3 AM more reliably than any AI Agent."

Tien nodded. Hieu nodded.

"Build the boring foundation rock solid," Hung continued, "then add AI on top as a value-adding layer. Never use AI to replace that foundation."

That's a lesson the team learned the hard way - through sleepless nights and painful incidents.

And that's the lesson I want to share with you.


Boring is beautiful. Reliable is everything.

And never let AI become a single point of failure in your monitoring stack.
