Building Reactive Systems - From Manifesto to Practice
The Reactive Manifesto
On September 16, 2014, Jonas Bonér and his co-authors published version 2.0 of The Reactive Manifesto, a document defining the core characteristics of a Reactive System.
Reactive Systems are systems that are highly responsive, elastically scalable, fault-tolerant, and built on a message-driven architecture.
The 4 Pillars of Reactive Systems
| Pillar | Meaning | Role |
|---|---|---|
| RESPONSIVE | Fast response | Ultimate goal - Good UX |
| RESILIENT | Recovery capability | Stay responsive when failures occur |
| ELASTIC | Flexible scaling | Stay responsive when load changes |
| MESSAGE DRIVEN | Message-oriented | Technical foundation for everything |
Relationship: Message Driven → (Resilient + Elastic) → Responsive
1. Responsive
Systems must respond in a timely manner if at all possible. Responsiveness is the foundation of usability.
Characteristics of Responsive Systems:
| Characteristic | Description |
|---|---|
| Consistent response time | Response times are fast and have reliable upper bounds |
| Simplified error handling | Predictable behavior makes problems easier to detect and handle |
| User confidence | Consistent quality of service builds trust with users |
| Encourage interaction | Users who trust the system keep interacting with it, which supports growth |
Key insight: Responsiveness is the result of properly applying resilience and elasticity.
2. Resilient
Systems must remain responsive when failures occur. Any system that is not resilient will be unresponsive after failure.
Resilience is achieved through:
| Technique | Description |
|---|---|
| Replication | Replicating data/services for failover |
| Containment | Containing failures, preventing spread |
| Isolation | Separating components, reducing coupling |
| Delegation | Delegating recovery handling to other components |
3. Elastic
Systems must remain responsive under varying workload. They can increase or decrease the resources allocated to serving requests as demand changes.
Requirements for Elasticity:
| Requirement | Explanation |
|---|---|
| No central bottlenecks | No single component that every request must pass through |
| No contention points | No shared resource that components must compete for |
| Shard/Replicate components | Ability to shard and replicate |
| Distribute inputs | Distribute input across components |
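The "shard" and "distribute inputs" requirements can be illustrated with a small hash-based router. This is a minimal sketch, not a description of any particular system: the function names `shard_for` and `distribute` are hypothetical, and simple modulo hashing is an assumption chosen for brevity.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a stable hash.

    Python's built-in hash() is salted per process, so a digest is used
    to get the same answer on every node.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def distribute(keys, shard_count):
    """Group keys by the shard that would own them."""
    shards = {index: [] for index in range(shard_count)}
    for key in keys:
        shards[shard_for(key, shard_count)].append(key)
    return shards
```

Because every node computes the same shard for the same key, no central router is needed. One caveat: with modulo hashing, changing `shard_count` moves most keys to a different shard, which is why systems that resize often tend to use consistent hashing instead.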
4. Message Driven
Reactive Systems rely on asynchronous message-passing to establish boundaries between components.
Benefits of Message-Driven:
- Loose coupling - Components don't depend directly on each other
- Isolation - Clear boundaries between components
- Location transparency - No need to know physical location
- Error delegation - Errors are passed as messages
- Back-pressure - Flow control when overloaded
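To make back-pressure concrete, here is a minimal sketch of a bounded mailbox. The class name `Mailbox` and its `offer`/`poll` methods are hypothetical; the point is only that a full queue gives the sender an explicit signal instead of growing without limit.

```python
import queue


class Mailbox:
    """Bounded inbox for a component; a full inbox pushes back on senders."""

    def __init__(self, capacity: int):
        self._queue = queue.Queue(maxsize=capacity)

    def offer(self, message) -> bool:
        """Try to enqueue without blocking; False tells the sender to slow down."""
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            return False

    def poll(self):
        """Take the next message, or None if the inbox is empty."""
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None
```

What the sender does on `False` (drop, buffer upstream, or retry later) is a design decision this sketch leaves open.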
Commands vs Events
| Characteristic | Commands | Events |
|---|---|---|
| Send to | Unicast (1 target) | Broadcast/Multicast |
| Purpose | Request specific action | Notify something happened |
| Response | Expect response | Don't expect response |
| Example | "Transfer $100 to User X" | "Transaction ABC completed" |
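The difference in the table can be sketched with a tiny in-process bus. All names here (`MessageBus`, `TransferMoney`, `TransactionCompleted`) are hypothetical, and the dispatch is synchronous for readability; in a real message-driven system both paths would typically go through an asynchronous broker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferMoney:
    """Command: asks one specific handler to do something."""
    to_user: str
    amount: int


@dataclass(frozen=True)
class TransactionCompleted:
    """Event: states that something already happened."""
    transaction_id: str


class MessageBus:
    def __init__(self):
        self._command_handlers = {}
        self._event_subscribers = {}

    def register_handler(self, command_type, handler):
        # Unicast: a command type has exactly one handler.
        if command_type in self._command_handlers:
            raise ValueError(f"{command_type.__name__} already has a handler")
        self._command_handlers[command_type] = handler

    def subscribe(self, event_type, subscriber):
        # Multicast: an event type can have any number of subscribers.
        self._event_subscribers.setdefault(event_type, []).append(subscriber)

    def send(self, command):
        """Deliver a command to its single handler and return the response."""
        return self._command_handlers[type(command)](command)

    def publish(self, event):
        """Notify every subscriber; the publisher gets nothing back."""
        for subscriber in self._event_subscribers.get(type(event), []):
            subscriber(event)
```

Note that `send` returns a value while `publish` does not, which mirrors the "Response" row above.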
Non-Blocking I/O with Netty
The Problem with Blocking I/O
| Blocking I/O | Non-Blocking I/O |
|---|---|
| 1 thread = 1 connection | Few threads = thousands of connections |
| Thread blocked while waiting for I/O | Thread is free to do other work while I/O is pending |
| 10K connections = 10K threads | 10K connections = few threads |
| High memory cost | Low memory cost |
| Context switching overhead | Minimal switching |
Netty Architecture
Netty Architecture by Layer:
| Layer | Component | Role |
|---|---|---|
| 1 | Channels | Represent connections (conn1, conn2, conn3...) |
| 2 | Selector | Multiplexing - monitor multiple channels simultaneously |
| 3 | Event Loop | One thread handles all events for its channels |
Key insight: With Netty, one event loop thread can manage thousands of connections thanks to non-blocking I/O and multiplexing.
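Netty itself is a Java library, so the following is not Netty code. It is a minimal sketch of the same channel / selector / event loop idea using Python's standard `selectors` module. The function name `run_echo_loop` and the uppercase-echo behavior are assumptions made for illustration.

```python
import selectors
import socket


def run_echo_loop(connections, expected_messages):
    """Serve every connection from one thread until enough reads are handled.

    Returns the number of reads that were echoed back.
    """
    selector = selectors.DefaultSelector()
    for conn in connections:
        conn.setblocking(False)                      # never block on one channel
        selector.register(conn, selectors.EVENT_READ)

    handled = 0
    while handled < expected_messages and selector.get_map():
        events = selector.select(timeout=1.0)        # wait on ALL channels at once
        if not events:
            break                                    # nothing arrived: stop the demo
        for key, _mask in events:
            conn = key.fileobj
            data = conn.recv(1024)
            if not data:                             # peer closed the connection
                selector.unregister(conn)
                continue
            conn.sendall(data.upper())               # small reply; fits the send buffer
            handled += 1
    selector.close()
    return handled
```

The single loop never waits on one particular socket; it asks the selector which sockets are ready and only touches those. A production event loop would also handle write readiness and partial writes, which this sketch skips.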
Resilience Patterns
When building payment systems at MoMo, we applied the following patterns:
1. Retry Pattern
Purpose: Retry operations when transient failures occur.
| Step | State | Wait Time |
|---|---|---|
| 1 | Request → Fail | - |
| 2 | Wait | 1s |
| 3 | Retry → Fail | - |
| 4 | Wait (longer) | 2s |
| 5 | Retry → Fail | - |
| 6 | Wait (even longer) | 4s |
| 7 | Retry → Success | - |
Exponential backoff: the wait time doubles after each failure.
Best Practices:
- Use exponential backoff (1s → 2s → 4s → 8s)
- Limit retry attempts (max 3-5)
- Distinguish between retryable and non-retryable errors
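The practices above can be sketched as a small helper. This is a minimal illustration under stated assumptions: `TransientError` is a hypothetical marker exception for retryable failures, and `retry_with_backoff` is not the API of any specific library.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, busy servers)."""


def retry_with_backoff(operation, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call operation(), retrying only TransientError with doubling delays.

    Any other exception is treated as non-retryable and propagates at once.
    The sleep function is injectable so the delays can be tested without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise                                  # retry budget exhausted
            sleep(base_delay * 2 ** (attempt - 1))     # 1s, 2s, 4s, ...
```

With the defaults, three failures followed by a success produce waits of 1s, 2s, and 4s, matching the table. In practice a random jitter is often added to each delay so that many clients do not retry at the same instant.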
2. Circuit Breaker Pattern
Purpose: Prevent continuous calls to a failing service.
Circuit Breaker State Flow:
| From State | Condition | To State |
|---|---|---|
| CLOSED | Multiple consecutive failures | OPEN |
| OPEN | After timeout | HALF-OPEN |
| HALF-OPEN | Test request succeeds | CLOSED |
| HALF-OPEN | Test request fails | OPEN |
| State | Behavior |
|---|---|
| Closed | Normal operation, counting failures |
| Open | Reject requests immediately, don't call downstream |
| Half-Open | Allow test requests to check recovery |
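The state transitions in the two tables can be sketched in a few lines. This is a simplified, single-threaded illustration: the class names, the default threshold, and the default timeout are assumptions, and a production breaker would also need locking and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock            # injectable so tests can control time
        self.state = "CLOSED"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, operation):
        if self.state == "OPEN":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("downstream call skipped")
            self.state = "HALF_OPEN"   # timeout elapsed: let a test request through
        try:
            result = operation()
        except Exception:
            self._record_failure()
            raise
        self._failures = 0             # any success closes the circuit
        self.state = "CLOSED"
        return result

    def _record_failure(self):
        self._failures += 1
        if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
            self.state = "OPEN"
            self._opened_at = self._clock()
```

Callers wrap each downstream call, for example `breaker.call(lambda: client.charge(order))`, where `client.charge` stands for whatever remote call is being protected.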
3. Rate Limiter Pattern
Purpose: Control the number of requests within a time period.
Algorithms:
| Algorithm | Description | Use case |
|---|---|---|
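| Token Bucket | Tokens refill at a fixed rate and each request consumes 1 token, so short bursts are allowed up to the bucket size | API rate limiting |
| Leaky Bucket | Requests queue up and are processed at a fixed rate | Traffic shaping |
| Fixed Window | Count requests within fixed time periods | Simple counting |
| Sliding Window | Count requests over a rolling window, avoiding bursts at fixed-window boundaries | Smooth limiting |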
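As an illustration of the first row, here is a minimal token bucket. The class and parameter names are hypothetical, and the sketch is single-threaded; a shared limiter would need a lock or an atomic store.

```python
import time


class TokenBucket:
    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock                  # injectable so tests can control time
        self._tokens = float(capacity)       # start full: an initial burst is allowed
        self._last_refill = clock()

    def try_acquire(self) -> bool:
        """Consume one token if available; False means the request is rejected."""
        now = self._clock()
        elapsed = now - self._last_refill
        # Refill lazily based on elapsed time, never exceeding the capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        self._last_refill = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

`capacity` bounds the burst size and `refill_per_second` sets the sustained rate, so the two can be tuned independently.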
4. Bulkhead Pattern
Purpose: Isolate parts of the system so failures don't spread.
Like ship compartments:
- Compartments (bulkheads) are separated
- Water entering one compartment doesn't sink the entire ship
Application with Bulkhead Pattern:
| Thread Pool | Function | Isolation |
|---|---|---|
| Pool A | Bank Integration | Separate |
| Pool B | Payment Processing | Separate |
| Pool C | User Service | Separate |
Each pool is completely isolated: if Bank Integration is overloaded, Payment Processing and User Service still work normally.
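The pool-per-function layout can be sketched with one executor per dependency. The class name `Bulkheads` and the pool names are assumptions for illustration, not the configuration used in any real deployment.

```python
from concurrent.futures import ThreadPoolExecutor


class Bulkheads:
    """One fixed-size thread pool per function, so one cannot starve another."""

    def __init__(self, pool_sizes):
        self._pools = {
            name: ThreadPoolExecutor(max_workers=size, thread_name_prefix=name)
            for name, size in pool_sizes.items()
        }

    def submit(self, name, fn, *args):
        """Run fn on the pool reserved for `name` and return its Future."""
        return self._pools[name].submit(fn, *args)

    def shutdown(self):
        for pool in self._pools.values():
            pool.shutdown(wait=True)
```

If every worker in the `"bank"` pool is stuck on a slow bank, new bank tasks queue up behind them, but tasks submitted to `"payment"` still run on their own threads. The executor's queue is unbounded, so a fuller version would also cap the queue and reject work when the compartment is full.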
5. Fallback Pattern
Purpose: Provide alternative values when the primary operation fails.
Examples:
- Return cached data when database is unavailable
- Use default values when external service is down
- Redirect to backup service
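The examples above share one shape: try the primary source, then fall through an ordered list of alternatives. A minimal sketch, where `with_fallback` is a hypothetical helper name:

```python
def with_fallback(primary, fallbacks):
    """Return the first result that succeeds, trying primary and then each fallback.

    If every source fails, the last error is re-raised so the caller still
    sees a failure instead of a silent None.
    """
    last_error = None
    for source in (primary, *fallbacks):
        try:
            return source()
        except Exception as error:   # broad on purpose in this sketch
            last_error = error
    raise last_error
```

For example, `with_fallback(read_from_db, [read_from_cache, lambda: DEFAULT])` would cover the first two bullets, assuming `read_from_db`, `read_from_cache`, and `DEFAULT` exist in the calling code. Real code would usually catch only the error types that indicate unavailability, not programming errors.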
6. Timeout Pattern
Purpose: Set time limits to avoid waiting indefinitely.
| Timeout setting | Effect |
|---|---|
| Too short | False positives: requests that would have succeeded are canceled early |
| Too long | Resources are held too long, which can cause cascading failures |
| Appropriate | Derived from the SLA and historical latency data |
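A caller-side timeout can be sketched with the standard `concurrent.futures` module. The helper name `call_with_timeout` is hypothetical, and the sketch has one important limitation noted in the comments.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def call_with_timeout(executor, fn, timeout_seconds, *args):
    """Run fn on the executor and stop waiting after timeout_seconds."""
    future = executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # cancel() only prevents a task that has not started yet; a task that is
        # already running keeps its thread until it finishes on its own.
        future.cancel()
        raise TimeoutError(f"no result within {timeout_seconds}s") from None
```

Because the worker thread is not interrupted, this frees the caller but not the resource. That is one reason timeouts are usually also set on the underlying client (socket or HTTP timeouts), so the work itself is abandoned.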
Real Experience at MoMo
Applied Architecture:
1. Message-Driven Architecture:
- Apache Kafka as message broker
- Event Sourcing for transaction history
- CQRS to separate read/write operations
2. Resilience Patterns:
- Circuit Breaker for bank integrations (30+ banks)
- Retry with exponential backoff
- Bulkhead to isolate critical payment flows
3. Non-Blocking I/O:
- Vert.x for core services
- Reactive streams with back-pressure
- Connection pooling with non-blocking drivers
Results Achieved:
| Metric | Before | After |
|---|---|---|
| Throughput | 10K TPS | 100K+ TPS |
| Latency P99 | 500ms | 50ms |
| Availability | 99.9% | 99.99% |
| Resource usage | High | Optimized |
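To put the availability row in perspective, the percentages can be converted into a yearly downtime budget. The helper below is only arithmetic; the function name is hypothetical and a 365-day year is assumed.

```python
def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability target."""
    minutes_per_year = 365 * 24 * 60   # 525,600 minutes, ignoring leap years
    return minutes_per_year * (1 - availability_percent / 100)


before = downtime_minutes_per_year(99.9)    # roughly 525.6 minutes
after = downtime_minutes_per_year(99.99)    # roughly 52.56 minutes
```

In other words, going from 99.9% to 99.99% shrinks the budget from about 8.8 hours to about 53 minutes per year.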
Key Takeaways
- Reactive is not just a technical choice - It's an architectural decision affecting the entire system design
- The 4 pillars must be applied together - Missing one reduces overall effectiveness
- Resilience patterns are mandatory - In distributed systems, failure is normal, not an exception
- Message-driven is the foundation - Enables loose coupling and location transparency
- Non-blocking I/O is the technical enabler for high throughput