Building Reactive Systems

The Reactive Manifesto

On September 16, 2014, Jonas Bonér and his co-authors published version 2.0 of The Reactive Manifesto, a document defining the core characteristics of a Reactive System.

Reactive Systems are systems that are highly responsive, elastically scalable, fault-tolerant, and built on a message-driven architecture.

The 4 Pillars of Reactive Systems

Pillar	Meaning	Role
RESPONSIVE	Fast response	Ultimate goal - Good UX
RESILIENT	Recovery capability	Stay responsive when failures occur
ELASTIC	Flexible scaling	Stay responsive when load changes
MESSAGE DRIVEN	Message-oriented	Technical foundation for everything

Relationship: Message Driven → (Resilient + Elastic) → Responsive

1. Responsive

Systems must respond in a timely manner if at all possible. Responsiveness is the foundation of usability.

Characteristics of Responsive Systems:

Characteristic	Description
Consistent response time	Rapid, predictable response times with a reliable upper bound
Simplified error handling	Consistent behavior makes problems easier to detect and handle
User confidence	Users trust a system that answers reliably
Encourage interaction	Fast feedback encourages further interaction

Key insight: Responsiveness is the result of properly applying resilience and elasticity.

2. Resilient

Systems must remain responsive when failures occur. Any system that is not resilient will be unresponsive after failure.

Resilience is achieved through:

Technique	Description
Replication	Replicating data/services for failover
Containment	Containing failures, preventing spread
Isolation	Separating components, reducing coupling
Delegation	Delegating recovery handling to other components

3. Elastic

Systems must remain responsive under varying workload. They can increase or decrease the resources allocated to them based on demand.

Requirements for Elasticity:

Requirement	Explanation
No central bottlenecks	No single component that every request must pass through
No contention points	No shared resource that components must wait on
Shard/Replicate components	Components can be sharded or replicated across nodes
Distribute inputs	Incoming work is spread across the shards or replicas
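
One simple way to distribute inputs across shards is to hash a routing key. The sketch below is only an illustration: the function name `shard_for` and the choice of SHA-256 are assumptions, not something the original system is known to use.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a routing key (for example a user ID) to a shard index."""
    # A stable hash keeps the same key on the same shard across processes,
    # unlike Python's built-in hash(), which is randomized per process.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo sharding moves most keys when `num_shards` changes; consistent hashing is a common alternative when shards are added or removed often.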

4. Message Driven

Reactive Systems rely on asynchronous message-passing to establish boundaries between components.

Benefits of Message-Driven:

Loose coupling - Components don't depend directly on each other
Isolation - Clear boundaries between components
Location transparency - No need to know physical location
Error delegation - Errors are passed as messages
Back-pressure - Flow control when overloaded
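
Back-pressure is easiest to see with a bounded queue between a fast producer and a slow consumer. The following is a minimal in-process sketch using Python's `asyncio.Queue`; the function names and the capacity of 5 are assumptions chosen for illustration.

```python
import asyncio


async def producer(queue: asyncio.Queue, count: int, sizes: list) -> None:
    for i in range(count):
        # put() suspends the producer while the queue is full.
        # That suspension is the back-pressure signal.
        await queue.put(i)
        sizes.append(queue.qsize())
    await queue.put(None)  # sentinel: no more messages


async def consumer(queue: asyncio.Queue, received: list) -> None:
    while True:
        item = await queue.get()
        if item is None:
            break
        await asyncio.sleep(0.001)  # simulate slow processing
        received.append(item)


async def run_pipeline(count: int = 20, capacity: int = 5):
    queue = asyncio.Queue(maxsize=capacity)
    sizes, received = [], []
    await asyncio.gather(
        producer(queue, count, sizes),
        consumer(queue, received),
    )
    return sizes, received
```

The producer never gets more than `capacity` messages ahead of the consumer, so memory use stays bounded even when the consumer is slow.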

Commands vs Events

Characteristic	Commands	Events
Sent to	Unicast (one target)	Broadcast/multicast (any number of subscribers)
Purpose	Request a specific action	Notify that something happened
Response	Sender expects a response	Sender expects no response
Example	"Transfer $100 to User X"	"Transaction ABC completed"
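
The difference can be sketched with a tiny message bus. This is an in-process, synchronous illustration only (a real reactive system would pass these messages asynchronously), and `MessageBus`, `TransferMoney`, and `TransactionCompleted` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferMoney:
    """Command: imperative name, addressed to exactly one handler."""
    to_user: str
    amount: int


@dataclass(frozen=True)
class TransactionCompleted:
    """Event: past-tense name, a fact that has already happened."""
    transaction_id: str


class MessageBus:
    def __init__(self):
        self._command_handlers = {}
        self._event_subscribers = defaultdict(list)

    def register_command(self, command_type, handler):
        if command_type in self._command_handlers:
            raise ValueError("a command type has exactly one handler")
        self._command_handlers[command_type] = handler

    def subscribe(self, event_type, subscriber):
        self._event_subscribers[event_type].append(subscriber)

    def send(self, command):
        # Unicast: one handler, and the sender gets a response back.
        return self._command_handlers[type(command)](command)

    def publish(self, event):
        # Broadcast: zero or more subscribers, nothing is returned.
        for subscriber in self._event_subscribers[type(event)]:
            subscriber(event)
```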

Non-Blocking I/O with Netty

The Problem with Blocking I/O

Blocking I/O	Non-Blocking I/O
1 thread = 1 connection	A few threads = thousands of connections
Thread blocked while waiting for I/O	Thread free to serve other connections while I/O is pending
10K connections = 10K threads	10K connections = a few threads
High memory cost (one stack per thread)	Low memory cost
Context switching overhead	Minimal context switching

Netty Architecture

Netty Architecture by Layer:

Layer	Component	Role
1	Channels	Represent connections (conn1, conn2, conn3...)
2	Selector	Multiplexing: monitors many channels at once for readiness events
3	Event Loop	One thread handles all events for the channels registered with it

Key insight: With Netty, a single event loop thread can manage thousands of connections thanks to non-blocking I/O and multiplexing.
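
The same channel/selector/event-loop idea can be sketched without Netty. The example below is not Netty code: it uses Python's standard `selectors` module as a stand-in for the selector layer, and local socket pairs as stand-ins for client connections. The function name `run_event_loop` is an assumption.

```python
import selectors
import socket


def run_event_loop(num_connections: int = 100) -> dict:
    """Serve many connections from one thread by multiplexing with a selector."""
    selector = selectors.DefaultSelector()
    clients = []
    for conn_id in range(num_connections):
        client, channel = socket.socketpair()
        channel.setblocking(False)
        # Register the server side of each connection for read events.
        selector.register(channel, selectors.EVENT_READ, data=conn_id)
        client.sendall(f"hello {conn_id}".encode())
        clients.append(client)

    pending = num_connections
    while pending:
        # A single select() call waits on all registered channels at once.
        for key, _events in selector.select(timeout=1.0):
            channel = key.fileobj
            payload = channel.recv(1024)
            channel.sendall(payload.upper())  # reply with the upper-cased request
            selector.unregister(channel)
            channel.close()
            pending -= 1
    selector.close()

    replies = {}
    for conn_id, client in enumerate(clients):
        replies[conn_id] = client.recv(1024).decode()
        client.close()
    return replies
```

No thread is ever parked on a single connection: the loop only touches channels that the selector reports as ready.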

Resilience Patterns

When building payment systems at MoMo, we applied the following patterns:

1. Retry Pattern

Purpose: Retry operations when transient failures occur.

Step	State	Wait Time
1	Request → Fail	-
2	Wait	1s
3	Retry → Fail	-
4	Wait (longer)	2s
5	Retry → Fail	-
6	Wait (even longer)	4s
7	Retry → Success	-

Exponential Backoff: Wait time doubles after each failure.

Best Practices:

Use exponential backoff (1s → 2s → 4s → 8s)
Limit retry attempts (max 3-5)
Distinguish between retryable and non-retryable errors
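
The practices above can be sketched as a small helper. This is a minimal illustration, not the production implementation: `TransientError`, `retry_with_backoff`, and the parameter names are hypothetical, and the `sleep` argument exists only so the delays can be observed in tests.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s, ...)."""


def retry_with_backoff(operation, max_attempts=4, base_delay=1.0,
                       factor=2.0, sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise              # retries exhausted: surface the failure
            sleep(delay)           # waits 1s, then 2s, then 4s, ...
            delay *= factor
        # Any other exception is treated as non-retryable and propagates at once.
```

Many implementations also add random jitter to each delay so that clients which failed together do not all retry at the same instant.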

2. Circuit Breaker Pattern

Purpose: Prevent continuous calls to a failing service.

Circuit Breaker State Flow:

From State	Condition	To State
CLOSED	Consecutive failures reach a threshold	OPEN
OPEN	Reset timeout elapses	HALF-OPEN
HALF-OPEN	Test request succeeds	CLOSED
HALF-OPEN	Test request fails	OPEN

State	Behavior
CLOSED	Normal operation: requests pass through and failures are counted
OPEN	Requests are rejected immediately without calling downstream
HALF-OPEN	Test requests are allowed through to check whether downstream has recovered
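
The state flow above fits in a few lines of code. The following is a simplified, single-threaded sketch: the class name, the default threshold of 3, and the 30-second reset timeout are assumptions, and the `clock` parameter is there only to make the timing testable.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("rejected without calling downstream")
            self.state = "HALF_OPEN"  # timeout elapsed: let a test request through
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed test request, or too many consecutive failures, opens the circuit.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()

    def _on_success(self):
        self.failures = 0
        self.state = "CLOSED"
```

A production breaker would also need thread safety and usually limits how many test requests run concurrently in the half-open state.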

3. Rate Limiter Pattern

Purpose: Control the number of requests within a time period.

Algorithms:

Algorithm	Description	Use case
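Token Bucket	Tokens refill at a fixed rate; each request consumes 1 token, so bursts up to the bucket size are allowed	API rate limiting
Leaky Bucket	Requests queue up and are processed at a fixed rate	Traffic shaping
Fixed Window	Counts requests within fixed time periods	Simple counting
Sliding Window	Counts requests over a rolling window, avoiding bursts at fixed-window boundaries	Smooth limiting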
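
As an illustration of the first algorithm, here is a minimal token bucket. It is a single-threaded sketch; `capacity`, `refill_rate`, and the injectable `clock` are hypothetical parameter names.

```python
import time


class TokenBucket:
    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock
        self.tokens = float(capacity)   # start with a full bucket
        self.last_refill = clock()

    def allow(self) -> bool:
        """Return True and consume one token if the request may proceed."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill lazily based on elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `capacity=3` and `refill_rate=1`, a client can send a burst of 3 requests and is then limited to about 1 request per second.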

4. Bulkhead Pattern

Purpose: Isolate parts of the system so failures don't spread.

Like ship compartments:

Compartments (bulkheads) are separated
Water entering one compartment doesn't sink the entire ship

Application with Bulkhead Pattern:

Thread Pool	Function	Isolation
Pool A	Bank Integration	Separate
Pool B	Payment Processing	Separate
Pool C	User Service	Separate

Each pool is completely isolated - if Bank Integration is overloaded, Payment and User still work normally.
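
A minimal sketch of this isolation uses one thread pool per dependency. The `Bulkheads` class, the pool names, and the pool sizes below are assumptions for illustration; `demo()` simulates a bank integration that hangs.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Bulkheads:
    """One dedicated thread pool per dependency."""

    def __init__(self, sizes):
        self._pools = {
            name: ThreadPoolExecutor(max_workers=size, thread_name_prefix=name)
            for name, size in sizes.items()
        }

    def submit(self, name, fn, *args):
        return self._pools[name].submit(fn, *args)

    def shutdown(self):
        for pool in self._pools.values():
            pool.shutdown(wait=True)


def demo():
    bulkheads = Bulkheads({"bank": 2, "payment": 2, "user": 2})
    bank_recovered = threading.Event()
    # Five bank calls hang: two occupy the bank pool's threads, three queue up.
    stuck = [bulkheads.submit("bank", bank_recovered.wait) for _ in range(5)]
    # Payment work still completes because it runs on its own threads.
    paid = bulkheads.submit("payment", lambda: "paid").result(timeout=2)
    bank_still_stuck = not any(future.done() for future in stuck)
    bank_recovered.set()  # let the hanging calls finish so shutdown can return
    bulkheads.shutdown()
    return paid, bank_still_stuck
```

Note that `ThreadPoolExecutor` queues excess work without limit, so a real bulkhead would usually also cap the queue or reject calls when the pool is saturated.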

5. Fallback Pattern

Purpose: Provide an alternative result when the primary operation fails.

Examples:

Return cached data when the database is unavailable
Use default values when an external service is down
Redirect to a backup service
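
A fallback chain can be sketched as a helper that tries each source in order. The helper name `with_fallback` and the two data sources below are hypothetical and exist only to make the example runnable.

```python
def with_fallback(primary, *fallbacks):
    """Try primary, then each fallback in order; re-raise the last error if all fail."""
    last_error = None
    for source in (primary, *fallbacks):
        try:
            return source()
        except Exception as error:  # real code should catch only expected failures
            last_error = error
    raise last_error


# Hypothetical data sources for illustration.
_cache = {"balance:user-1": 250}


def load_from_database():
    raise ConnectionError("database is unavailable")


def load_from_cache():
    return _cache["balance:user-1"]
```

For example, `with_fallback(load_from_database, load_from_cache)` returns the cached value, and appending `lambda: 0` would add a default as the last resort.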

6. Timeout Pattern

Purpose: Set time limits to avoid waiting indefinitely.

Timeout	Effect
Too short	False positives: requests are canceled before they could succeed
Too long	Resources are held too long, which can lead to cascading failures
Appropriate	Based on the SLA and historical latency data
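
In non-blocking code, a timeout is typically wrapped around the awaited call. The sketch below uses Python's `asyncio.wait_for`; `call_with_timeout` and `slow_bank_call` are hypothetical names, and the delays are arbitrary.

```python
import asyncio


async def call_with_timeout(make_call, timeout_s, fallback=None):
    """Await make_call(), giving up and returning fallback after timeout_s seconds."""
    try:
        return await asyncio.wait_for(make_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the pending call, so its resources are released.
        return fallback


async def slow_bank_call(delay_s):
    await asyncio.sleep(delay_s)  # stands in for a remote call
    return "approved"
```

Returning a fallback on timeout combines this pattern with the Fallback pattern; re-raising the timeout instead is equally valid when the caller should decide.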

Real Experience at MoMo

Applied Architecture:

1. Message-Driven Architecture:

Apache Kafka as message broker
Event Sourcing for transaction history
CQRS to separate read/write operations

2. Resilience Patterns:

Circuit Breaker for bank integrations (30+ banks)
Retry with exponential backoff
Bulkhead to isolate critical payment flows

3. Non-Blocking I/O:

Vert.x for core services
Reactive streams with back-pressure
Connection pooling with non-blocking drivers

Results Achieved:

Metric	Before	After
Throughput	10K TPS	100K+ TPS
Latency P99	500ms	50ms
Availability	99.9%	99.99%
Resource usage	High	Optimized
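
To put the availability row in perspective, the percentages can be converted into allowed downtime per year. This is plain arithmetic on the figures in the table, assuming a 365-day year; the function name is hypothetical.

```python
def downtime_minutes_per_year(availability_percent: float) -> float:
    """Convert an availability percentage into allowed downtime per year."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes, ignoring leap years
    return (1 - availability_percent / 100) * minutes_per_year
```

Under that assumption, 99.9% allows about 525.6 minutes (roughly 8.8 hours) of downtime per year, while 99.99% allows about 52.6 minutes.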

Key Takeaways

Reactive is not just a technical choice - It's an architectural decision that affects the entire system design
The 4 pillars must be applied together - Missing one reduces overall effectiveness
Resilience patterns are mandatory - In distributed systems, failure is normal, not an exception
Message-driven is the foundation - It enables loose coupling and location transparency
Non-blocking I/O is the technical enabler - It makes high throughput possible with few threads
