Saga Pattern - Distributed Transactions Guide

Saga Pattern Deep Dive

Managing distributed transactions across microservices without distributed locks

Quick Summary (30 seconds):
The Saga pattern manages distributed transactions by breaking them into local transactions, each with a compensating action. Two implementations: Choreography (decentralized) and Orchestration (centralized).

The Problem

In a monolith, a single ACID database transaction guarantees consistency. In microservices, each service owns its own database, so no single database transaction can span multiple services. The Saga pattern addresses this.

Core Challenge:

Distributed transactions (for example, two-phase commit) require all participants to be available (reduces availability)
They hold locks for long periods (reduces scalability)
Many NoSQL databases and message brokers don't support distributed transactions

What is a Saga?

A saga is a sequence of local transactions where each transaction updates its local database and publishes an event or sends a message. If a local transaction fails, the saga executes compensating transactions to undo previous changes.

Example Order Saga:

Step 1: Order Service -> Create Order (PENDING)
Step 2: Payment Service -> Process Payment
Step 3: Inventory Service -> Reserve Stock
Step 4: Shipping Service -> Create Shipment

If Step 3 fails:
  Compensating Step 2: Payment Service -> Refund Payment
  Compensating Step 1: Order Service -> Cancel Order
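
The order saga above can be sketched as a list of action/compensation pairs that run in order and unwind in reverse when a step fails. This is a minimal in-memory sketch, not a definitive implementation: `SagaStep`, `run_saga`, and the step functions are hypothetical names, and a real saga would call remote services and persist its progress.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]        # forward local transaction
    compensation: Callable[[], None]  # undoes the action if a later step fails


def run_saga(steps: List[SagaStep], log: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
            log.append(f"done: {step.name}")
            completed.append(step)
        except Exception as exc:
            log.append(f"failed: {step.name} ({exc})")
            for done in reversed(completed):
                done.compensation()
                log.append(f"compensated: {done.name}")
            return False
    return True


def reserve_stock() -> None:
    # Hypothetical failure used to trigger the compensation path.
    raise RuntimeError("out of stock")


def noop() -> None:
    # Placeholder for a real service call.
    pass


log: List[str] = []
steps = [
    SagaStep("create order", noop, noop),
    SagaStep("process payment", noop, noop),
    SagaStep("reserve stock", reserve_stock, noop),
    SagaStep("create shipment", noop, noop),
]
ok = run_saga(steps, log)
print(ok)
print("\n".join(log))
```

The failing step itself is not compensated, because its local transaction never committed. Only the two steps that completed before it are undone, most recent first.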

Choreography vs Orchestration

Choreography (Decentralized)

Services react to each other's events, typically through a message broker - no central coordinator

Order Service --event--> Payment Service
Payment Service --event--> Inventory Service
Inventory Service --event--> Shipping Service

Pros: No central point of failure, simple for basic flows, good scalability

Cons: Hard to debug, risk of cyclic dependencies, no central visibility

Orchestration (Centralized)

A central coordinator service manages the entire flow

Orchestrator --calls--> Order Service
Orchestrator --calls--> Payment Service
Orchestrator --calls--> Inventory Service
Orchestrator --calls--> Shipping Service

Pros: Central visibility and control, easier for complex flows, clear audit trail

Cons: Orchestrator can become a bottleneck, risk of a single point of failure, more complexity

Implementation 1: Choreography Saga

// Order Service - Publishes ORDER_CREATED; cancels the order on PAYMENT_FAILED
class OrderService {
    function createOrder(request) {
        order = saveOrder(request, status=PENDING);
        publishEvent("ORDER_CREATED", order.id, order.amount);
        return order;
    }

    // Compensating transaction: listens for PAYMENT_FAILED
    function handlePaymentFailed(event) {
        order = findOrder(event.orderId);
        order.status = CANCELLED;
        saveOrder(order);
    }
}

// Payment Service - Listens for ORDER_CREATED
class PaymentService {
    function handleOrderCreated(event) {
        try {
            processPayment(event.orderId, event.amount);
            publishEvent("PAYMENT_SUCCEEDED", event.orderId);
        } catch(error) {
            publishEvent("PAYMENT_FAILED", event.orderId);
        }
    }
}
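
To see a full choreographed failure path end to end, the sketch below wires the services together through a synchronous in-memory event bus. It is an illustration under stated assumptions: `EventBus` stands in for a real message broker, and the event names `STOCK_RESERVED`, `STOCK_RESERVATION_FAILED`, and `PAYMENT_REFUNDED` are hypothetical additions that extend the flow to an inventory failure.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class EventBus:
    """Synchronous in-memory stand-in for a message broker."""

    def __init__(self) -> None:
        self.handlers: Dict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []  # event types in publication order

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.history.append(event_type)
        for handler in list(self.handlers[event_type]):
            handler(payload)


bus = EventBus()
orders: Dict[str, str] = {}  # order id -> status
stock = {"widget": 0}        # nothing in stock, so the reservation fails


def create_order(order_id: str, item: str) -> None:
    # Order Service: local transaction, then publish.
    orders[order_id] = "PENDING"
    bus.publish("ORDER_CREATED", {"order_id": order_id, "item": item})


def on_order_created(event: dict) -> None:
    # Payment Service: this sketch always charges successfully.
    bus.publish("PAYMENT_SUCCEEDED", event)


def on_payment_succeeded(event: dict) -> None:
    # Inventory Service: reserve stock or announce the failure.
    if stock.get(event["item"], 0) > 0:
        stock[event["item"]] -= 1
        bus.publish("STOCK_RESERVED", event)
    else:
        bus.publish("STOCK_RESERVATION_FAILED", event)


def on_stock_reservation_failed(event: dict) -> None:
    # Payment Service compensation: refund, then tell the others.
    bus.publish("PAYMENT_REFUNDED", event)


def on_payment_refunded(event: dict) -> None:
    # Order Service compensation: cancel the pending order.
    orders[event["order_id"]] = "CANCELLED"


bus.subscribe("ORDER_CREATED", on_order_created)
bus.subscribe("PAYMENT_SUCCEEDED", on_payment_succeeded)
bus.subscribe("STOCK_RESERVATION_FAILED", on_stock_reservation_failed)
bus.subscribe("PAYMENT_REFUNDED", on_payment_refunded)

create_order("o-1", "widget")
print(bus.history)
print(orders)
```

No service calls another directly. Each one only knows which events it consumes and which it emits, which is why tracing a failed saga requires correlating events across services.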

Implementation 2: Orchestration Saga

// Saga Orchestrator - Central coordinator
class OrderSagaOrchestrator {
    function executeOrderSaga(request) {
        sagaId = generateId();
        compensations = [];  // undo actions for the steps that have completed

        try {
            // Step 1
            order = orderClient.createOrder(request);
            compensations.push(() => cancelOrder(order.id));

            // Step 2
            payment = paymentClient.processPayment(order.id, request.amount);
            compensations.push(() => refundPayment(payment.id));

            // Step 3
            inventory = inventoryClient.reserveStock(order.id, request.items);
            compensations.push(() => releaseInventory(order.id));

            // Step 4
            shipping = shippingClient.createShipment(order.id, request.address);

            markSagaComplete(sagaId);

        } catch(error) {
            compensate(sagaId, compensations);
        }
    }

    function compensate(sagaId, compensations) {
        // Undo only the steps that completed, in reverse order
        for each undo in reverse(compensations) {
            undo();
        }
    }
}

Comparison Table

Aspect	Choreography	Orchestration
Complexity	Simple for basic flows	Better for complex workflows
Coupling	Loose (event-based)	Tighter (depends on orchestrator)
Visibility	Low - need distributed logging	High - central coordinator tracks state
Scalability	Excellent - fully distributed	Limited by orchestrator capacity
Single Point of Failure	No	Yes (orchestrator)
Debugging	Difficult	Easier
Best For	Simple, linear workflows	Complex, branching workflows

Compensating Transactions Reference

Transaction	Compensation
Create Order	Cancel Order
Process Payment	Refund Payment
Reserve Inventory	Release Inventory
Book Hotel	Cancel Booking
Send Email	Send Reversal Email

Important: All operations and compensations MUST be idempotent. Retries and redelivered messages mean the same operation may be invoked more than once, and a repeated call must not change the result.
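
One common way to get idempotency is to have the caller send an idempotency key and have the service return the stored result when it sees the same key again. The sketch below assumes a hypothetical `PaymentService` with an in-memory dictionary; in a real service the key would be stored in the database under a unique constraint so that the check and the write are atomic.

```python
from typing import Dict


class PaymentService:
    """Toy payment service that deduplicates requests by idempotency key."""

    def __init__(self) -> None:
        self.results: Dict[str, str] = {}  # idempotency key -> payment id
        self.charges = 0                   # number of real charges performed

    def process_payment(self, idempotency_key: str, amount: int) -> str:
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        self.charges += 1
        payment_id = f"pay-{self.charges}"
        self.results[idempotency_key] = payment_id
        return payment_id


service = PaymentService()
first = service.process_payment("order-42:payment", 100)
retry = service.process_payment("order-42:payment", 100)  # e.g. a retry after a timeout
print(first, retry, service.charges)
```

Deriving the key from the saga or order identifier plus the step name, as in the hypothetical `"order-42:payment"`, lets every retry of the same step produce the same key without extra coordination.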

Real-World Example: Travel Booking Saga

Travel Booking Flow:

1. Book Flight -> Compensation: Cancel Flight
2. Book Hotel -> Compensation: Cancel Hotel
3. Book Car -> Compensation: Cancel Car
4. Process Payment -> Compensation: Refund Payment

If Car booking fails (payment has not run yet, so there is nothing to refund):
  Step 1: Cancel Hotel
  Step 2: Cancel Flight
  Step 3: Send Cancellation Notice

Common Pitfalls and Solutions

Pitfall	Solution
Non-idempotent operations	Use idempotency keys + unique constraints
Missing compensations	Design compensation for every step
Long-running sagas	Add timeout and checkpoint mechanisms
No visibility	Use central saga state repository
Compensation failure	Use dead letter queue + manual alerts
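
The last row can be sketched as a retry loop with exponential backoff that parks the compensation for manual follow-up once it runs out of attempts. This is a sketch under assumptions: `compensate_with_retry` is a hypothetical helper, and a plain list stands in for a real dead letter queue and alerting.

```python
import time
from typing import Callable, List, Tuple


def compensate_with_retry(
    name: str,
    compensation: Callable[[], None],
    dead_letters: List[Tuple[str, str]],
    max_attempts: int = 3,
    base_delay: float = 0.01,
) -> bool:
    """Retry a compensation with exponential backoff; park it if it keeps failing."""
    last_error = ""
    for attempt in range(max_attempts):
        try:
            compensation()
            return True
        except Exception as exc:
            last_error = str(exc)
            if attempt < max_attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    # Out of attempts: record it for manual follow-up instead of dropping it.
    dead_letters.append((name, last_error))
    return False


attempts = {"refund": 0}


def flaky_refund() -> None:
    # Hypothetical compensation that fails twice, then succeeds.
    attempts["refund"] += 1
    if attempts["refund"] < 3:
        raise ConnectionError("payment gateway unavailable")


def broken_release() -> None:
    # Hypothetical compensation that never succeeds.
    raise RuntimeError("inventory service down")


dead_letters: List[Tuple[str, str]] = []
refunded = compensate_with_retry("refund payment", flaky_refund, dead_letters)
released = compensate_with_retry("release inventory", broken_release, dead_letters)
print(refunded, released, dead_letters)
```

Retrying is only safe because compensations are idempotent: a refund that succeeded on the server but timed out on the client will be attempted again.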

Best Practices

Design compensations first - before implementing the forward transaction
Make all operations idempotent - use idempotency keys for all APIs
Store saga state persistently - allows recovery after crashes
Implement timeouts - don't wait forever for responses
Monitor sagas actively - detect stuck or long-running sagas
Test compensation paths - they will execute in production
Keep sagas short - long sagas increase failure probability
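
Persistent state, timeouts, and monitoring can share one table that is updated after every step. The sketch below uses Python's built-in `sqlite3` with an in-memory database purely for illustration. The `saga_state` schema and the `checkpoint`, `next_step`, and `find_stuck` helpers are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
import sqlite3
from typing import List, Optional

conn = sqlite3.connect(":memory:")  # a real deployment would use a durable database
conn.execute(
    "CREATE TABLE saga_state ("
    " saga_id TEXT PRIMARY KEY,"
    " last_completed_step INTEGER NOT NULL,"
    " status TEXT NOT NULL,"
    " updated_at REAL NOT NULL)"
)


def checkpoint(saga_id: str, step: int, status: str, now: float) -> None:
    """Record the saga's progress after each local transaction."""
    conn.execute(
        "INSERT OR REPLACE INTO saga_state VALUES (?, ?, ?, ?)",
        (saga_id, step, status, now),
    )
    conn.commit()


def next_step(saga_id: str) -> Optional[int]:
    """After a crash, return the step to resume from, or None if not running."""
    row = conn.execute(
        "SELECT last_completed_step, status FROM saga_state WHERE saga_id = ?",
        (saga_id,),
    ).fetchone()
    if row is None or row[1] != "RUNNING":
        return None
    return row[0] + 1


def find_stuck(now: float, timeout_seconds: float) -> List[str]:
    """Return running sagas that have made no progress within the timeout."""
    rows = conn.execute(
        "SELECT saga_id FROM saga_state "
        "WHERE status = 'RUNNING' AND updated_at < ? ORDER BY saga_id",
        (now - timeout_seconds,),
    ).fetchall()
    return [row[0] for row in rows]


checkpoint("saga-1", 2, "RUNNING", now=100.0)    # stalled after step 2
checkpoint("saga-2", 4, "COMPLETED", now=105.0)  # finished
checkpoint("saga-3", 1, "RUNNING", now=158.0)    # recently active

print(next_step("saga-1"))
print(find_stuck(now=160.0, timeout_seconds=30.0))
```

A recovery job could call `find_stuck` periodically and then either resume each saga from `next_step` or start compensating it, depending on the business rules.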

Selection Guide

Choose Choreography when:

The workflow is simple and linear (2-4 steps)
Teams want autonomy
You have good event-driven infrastructure

Choose Orchestration when:

The workflow is complex with branching logic
You need central visibility and audit trail
You have multiple sagas that share steps
You need to resume sagas after failures

Expert Insight:
"Sagas sacrifice ACID for BASE (Basically Available, Soft state, Eventual consistency). The choice between choreography and orchestration is organizational, not technical. Start with choreography for simplicity; add orchestration when you feel the pain of distributed debugging."

Key Takeaways

Sagas replace distributed transactions in microservices
Two patterns: Choreography (decentralized) vs Orchestration (centralized)
Every transaction needs a compensating action
Idempotency is mandatory for all operations
Store saga state for recovery and monitoring
Test compensation paths as thoroughly as success paths