Resilience Patterns

Why Resilience Patterns Matter

Resilience patterns prevent cascading failures in distributed systems by gracefully handling errors, timeouts, and service degradation. Without them, a slow or failing service can take down the services that depend on it and cause a complete outage. Timeouts, retries, circuit breakers, and bulkheads together contain failures, improve stability, and allow graceful degradation under load.

Core benefits:

  • Fault isolation: Failures don't cascade to dependent services
  • Graceful degradation: System remains partially functional
  • Fast failure detection: Timeouts prevent hanging requests
  • Automatic recovery: Retry logic handles transient errors

Problem: The standard library provides context.WithTimeout for basic timeouts but no circuit breaker, retry logic, or bulkhead patterns. Implementing these by hand is complex and error-prone.

Solution: Start with context.WithTimeout to understand timeout fundamentals and identify its limitations (no retry, no circuit breaking). Then add retries with exponential backoff and a production library such as gobreaker for circuit breaking.

Standard Library: Timeouts with Context

Go's context package provides timeout and cancellation support.

Pattern from standard library:

package main
 
import (
    "context"
    // => Standard library for timeout and cancellation
    // => context.WithTimeout creates timed context
    "errors"
    // => errors.Is matches wrapped errors
    "fmt"
    "io"
    // => io.ReadAll reads the full response body
    "net/http"
    // => Standard library HTTP client
    "time"
    // => Standard library for time operations
)
 
func fetchWithTimeout(url string, timeout time.Duration) (string, error) {
    // => Makes HTTP request with timeout
    // => Returns response body or timeout error
 
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    // => ctx is context that expires after timeout
    // => cancel is function to cancel context early
    // => CRITICAL: always call cancel() to release resources
 
    defer cancel()
    // => Ensures cancel() called when function returns
    // => Releases context resources (timers)
    // => Called even if function returns early
 
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    // => NewRequestWithContext creates request with context
    // => Request automatically cancelled when context expires
    // => ctx controls request lifecycle
 
    if err != nil {
        return "", err
    }
 
    client := &http.Client{}
    // => Creates HTTP client
    // => Production: reuse client (connection pooling)
 
    resp, err := client.Do(req)
    // => Executes request with timeout
    // => Returns error if timeout exceeded
    // => err is a *url.Error that wraps context.DeadlineExceeded on timeout
 
    if err != nil {
        // => Request failed or timed out
 
        if errors.Is(err, context.DeadlineExceeded) {
            // => Timeout occurred
            // => errors.Is unwraps the *url.Error; a plain == comparison never matches
 
            return "", fmt.Errorf("request timed out after %v", timeout)
        }
 
        return "", err
        // => Other error (network, DNS, etc.)
    }
 
    defer resp.Body.Close()
    // => Close response body to prevent resource leak
 
    body, err := io.ReadAll(resp.Body)
    // => Reads the whole body (a single Read call may return only part of it)
    // => The context deadline also applies while the body is being read
 
    if err != nil {
        return "", err
    }
 
    return string(body), nil
    // => Return response body as string
}
 
func main() {
    // Fast response (within timeout)
    result, err := fetchWithTimeout("https://httpbin.org/delay/1", 3*time.Second)
    // => URL delays 1 second, timeout is 3 seconds
    // => Request completes successfully
 
    if err != nil {
        fmt.Println("Error:", err)
    } else {
        fmt.Println("Success:", len(result), "bytes")
        // => Output: Success: 329 bytes
    }
 
    // Slow response (exceeds timeout)
    result, err = fetchWithTimeout("https://httpbin.org/delay/5", 2*time.Second)
    // => URL delays 5 seconds, timeout is 2 seconds
    // => Request times out
 
    if err != nil {
        fmt.Println("Error:", err)
        // => Output: Error: request timed out after 2s
    }
}

Pattern: Database Query Timeout:

package main
 
import (
    "context"
    "database/sql"
    "errors"
    "fmt"
    "time"
)
 
func queryUserWithTimeout(db *sql.DB, id int, timeout time.Duration) (string, error) {
    // => Queries database with timeout
    // => Prevents hanging queries
 
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()
    // => Context expires after timeout
    // => Query cancelled automatically
 
    var username string
 
    err := db.QueryRowContext(ctx, "SELECT username FROM users WHERE id = $1", id).Scan(&username)
    // => QueryRowContext uses context for timeout
    // => Query cancelled if context expires
    // => err matches context.DeadlineExceeded on timeout (drivers may wrap it)
 
    if err != nil {
        if errors.Is(err, context.DeadlineExceeded) {
            // => errors.Is also matches when the driver wraps the error
            return "", fmt.Errorf("query timed out after %v", timeout)
        }
        return "", err
    }
 
    return username, nil
}

Limitations for production resilience:

  • No retry logic (single attempt only)
  • No exponential backoff (retries immediately or not at all)
  • No circuit breaker (doesn't stop hammering failing service)
  • No bulkhead (no resource isolation between services)
  • Manual timeout handling (verbose context plumbing)

Production Pattern: Exponential Backoff with Jitter

Exponential backoff with jitter prevents retry storms by spacing out retries with randomness.

Pattern using only the standard library:

package main
 
import (
    "fmt"
    "math"
    "math/rand"
    "time"
)
 
func retryWithBackoff(maxRetries int, operation func() error) error {
    // => Retries operation with exponential backoff
    // => maxRetries is the maximum number of attempts (first try included)
    // => operation is function to retry
 
    var err error
 
    for attempt := 0; attempt < maxRetries; attempt++ {
        // => Retry loop (0 to maxRetries-1)
 
        err = operation()
        // => Execute operation
        // => err is nil on success
 
        if err == nil {
            // => Operation succeeded
 
            return nil
            // => Exit retry loop
        }
 
        // Operation failed, calculate backoff
        if attempt < maxRetries-1 {
            // => Not last attempt (still have retries left)
 
            backoff := calculateBackoff(attempt)
            // => Calculate wait time before retry
            // => Exponential: 1s, 2s, 4s, 8s, 16s...
 
            fmt.Printf("Attempt %d failed: %v. Retrying in %v...\n", attempt+1, err, backoff)
 
            time.Sleep(backoff)
            // => Wait before retry
            // => Gives service time to recover
        }
    }
 
    // All retries exhausted
    return fmt.Errorf("operation failed after %d attempts: %w", maxRetries, err)
}
 
func calculateBackoff(attempt int) time.Duration {
    // => Calculates exponential backoff with jitter
    // => attempt is retry number (0, 1, 2, ...)
 
    baseDelay := 1 * time.Second
    // => Base delay (starting point)
 
    maxDelay := 30 * time.Second
    // => Maximum delay (cap exponential growth)
 
    exponentialDelay := time.Duration(math.Min(
        float64(baseDelay)*math.Pow(2, float64(attempt)),
        float64(maxDelay),
    ))
    // => 2^0 * 1s = 1s
    // => 2^1 * 1s = 2s
    // => 2^2 * 1s = 4s
    // => 2^3 * 1s = 8s
    // => Doubles on each retry, capped at 30 seconds
    // => Capping in float64 avoids integer overflow for large attempt values
 
    jitter := time.Duration(rand.Int63n(int64(exponentialDelay / 2)))
    // => Jitter is random value in [0, exponentialDelay/2)
    // => Prevents retry storms (many clients retry simultaneously)
    // => Spreads retries over time
 
    return exponentialDelay + jitter
    // => Final backoff with randomness
    // => Example: 4s + 1.5s = 5.5s
    // => At the cap the total can reach just under 45s (30s plus jitter)
}
 
func main() {
    rand.Seed(time.Now().UnixNano())
    // => Seed random number generator
    // => Needed only before Go 1.20; newer versions seed automatically
    // => and mark rand.Seed as deprecated
 
    attemptCount := 0
    operation := func() error {
        // => Simulates failing operation
        // => Succeeds on 4th attempt
 
        attemptCount++
        if attemptCount < 4 {
            return fmt.Errorf("temporary error")
        }
        return nil
    }
 
    err := retryWithBackoff(5, operation)
    // => Up to 5 attempts (1 initial try plus 4 retries)
 
    if err != nil {
        fmt.Println("Final error:", err)
    } else {
        fmt.Println("Operation succeeded!")
        // => Output: Operation succeeded!
        // => After 3 retries
    }
}
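The retry loop above sleeps with time.Sleep, so a caller cannot abandon it early. One possible variation, sketched here under assumptions, waits on a timer and on ctx.Done() at the same time. The name `retryWithContext`, its parameters, and the helpers `failUntil`, `recoversAfterRetries`, and `stopsAtDeadline` are hypothetical, and jitter is left out to keep the sketch short.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryWithContext calls operation up to maxAttempts times, doubling the wait
// from baseDelay between attempts, and gives up early once ctx is done.
func retryWithContext(ctx context.Context, maxAttempts int, baseDelay time.Duration, operation func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	delay := baseDelay
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = operation(); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break
		}
		timer := time.NewTimer(delay)
		select {
		case <-timer.C:
			delay *= 2 // exponential growth; a real version would also cap and add jitter
		case <-ctx.Done():
			timer.Stop()
			return fmt.Errorf("retry abandoned after %d attempts: %w", attempt, ctx.Err())
		}
	}
	return fmt.Errorf("operation failed after %d attempts: %w", maxAttempts, err)
}

// failUntil returns an operation that fails until its n-th call.
func failUntil(n int, calls *int) func() error {
	return func() error {
		*calls++
		if *calls < n {
			return errors.New("temporary error")
		}
		return nil
	}
}

// recoversAfterRetries shows a transient failure that clears on the 3rd call.
func recoversAfterRetries() (int, error) {
	calls := 0
	err := retryWithContext(context.Background(), 5, 10*time.Millisecond, failUntil(3, &calls))
	return calls, err
}

// stopsAtDeadline shows the context deadline cutting the backoff wait short.
func stopsAtDeadline() (int, bool) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Millisecond)
	defer cancel()
	calls := 0
	err := retryWithContext(ctx, 5, 200*time.Millisecond, failUntil(10, &calls))
	return calls, errors.Is(err, context.DeadlineExceeded)
}

func main() {
	calls, err := recoversAfterRetries()
	fmt.Println("calls:", calls, "err:", err)

	calls, timedOut := stopsAtDeadline()
	fmt.Println("calls:", calls, "stopped by deadline:", timedOut)
}
```

Wrapping ctx.Err() with %w keeps the cause inspectable through errors.Is, so callers can tell an exhausted retry budget apart from a cancelled request.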

Production Framework: Circuit Breaker with gobreaker

A circuit breaker prevents cascading failures by stopping requests to a failing service until it has had time to recover.

Adding gobreaker:

go get github.com/sony/gobreaker
# => Installs circuit breaker library
# => Widely used open-source implementation from Sony

Pattern: Circuit Breaker:

package main
 
import (
    "fmt"
    "time"
 
    "github.com/sony/gobreaker"
    // => Circuit breaker library
    // => Three states: Closed, Open, Half-Open
)
 
var cb *gobreaker.CircuitBreaker
// => Global circuit breaker instance
 
func init() {
    // => Initializes circuit breaker
 
    settings := gobreaker.Settings{
        Name: "API Circuit Breaker",
        // => Circuit breaker name (for logging)
 
        MaxRequests: 3,
        // => Max requests allowed in Half-Open state
        // => After 3 successes, circuit closes
 
        Interval: 10 * time.Second,
        // => Interval to reset failure count in Closed state
        // => Counts failures per 10-second window
 
        Timeout: 30 * time.Second,
        // => Timeout in Open state before transitioning to Half-Open
        // => After 30s, allows test requests
 
        ReadyToTrip: func(counts gobreaker.Counts) bool {
            // => Determines when to open circuit
            // => counts contains failure statistics
 
            failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
            // => Failure ratio: TotalFailures / Requests
            // => Example: 5 failures / 10 requests = 0.5 (50%)
 
            return counts.Requests >= 10 && failureRatio >= 0.5
            // => Open circuit if:
            // => - At least 10 requests (minimum sample size)
            // => - 50%+ failure rate
        },
 
        OnStateChange: func(name string, from gobreaker.State, to gobreaker.State) {
            // => Called when circuit state changes
            // => from: previous state (Closed, Open, Half-Open)
            // => to: new state
 
            fmt.Printf("Circuit breaker %s: %s -> %s\n", name, from, to)
            // => Output: Circuit breaker API Circuit Breaker: open -> half-open
            // => State.String() returns "closed", "open", or "half-open"
            // => Logging for monitoring
        },
    }
 
    cb = gobreaker.NewCircuitBreaker(settings)
    // => Creates circuit breaker with settings
}
 
func callExternalAPI() (string, error) {
    // => Calls external API through circuit breaker
    // => Returns response or circuit breaker error
 
    result, err := cb.Execute(func() (interface{}, error) {
        // => Execute wraps operation in circuit breaker
        // => Only executes if circuit Closed or Half-Open
 
        return fetchFromAPI()
        // => Actual API call
        // => Success or failure recorded by circuit breaker
    })
    // => result is interface{} (type assertion needed)
    // => err is operation error, gobreaker.ErrOpenState (circuit open),
    // => or gobreaker.ErrTooManyRequests (Half-Open and over MaxRequests)
 
    if err != nil {
        // => Operation failed or circuit open
 
        if err == gobreaker.ErrOpenState {
            // => Circuit is open (too many failures)
            // => Fast-fail without calling API
 
            return "", fmt.Errorf("circuit breaker open: service unavailable")
        }
 
        return "", err
        // => Operation error
    }
 
    return result.(string), nil
    // => Type assertion to string
}
 
func fetchFromAPI() (string, error) {
    // => Simulates external API call
    // => Production: actual HTTP request
 
    time.Sleep(100 * time.Millisecond)
    // => Simulate API latency
 
    // Simulate failures: fails during 7 of every 10 wall-clock seconds
    // => Roughly 70% of calls over time, in bursts rather than at random
    if time.Now().Unix()%10 < 7 {
        return "", fmt.Errorf("API error")
    }
 
    return "API response", nil
}
 
func main() {
    // Make multiple requests
    for i := 0; i < 50; i++ {
        result, err := callExternalAPI()
 
        if err != nil {
            fmt.Printf("Request %d failed: %v\n", i+1, err)
        } else {
            fmt.Printf("Request %d succeeded: %s\n", i+1, result)
        }
 
        time.Sleep(200 * time.Millisecond)
        // => Delay between requests
    }
 
    // Circuit breaker state transitions:
    // 1. Closed: Normal operation (all requests allowed)
    // 2. Open: Too many failures (fast-fail, no API calls)
    // 3. Half-Open: Test if service recovered (limited requests)
    // 4. Back to Closed if tests succeed
}

Circuit Breaker States:

  1. Closed (Normal): All requests pass through, failures counted
  2. Open (Failing): All requests rejected immediately (fast-fail), no API calls
  3. Half-Open (Testing): Limited test requests to check if service recovered
  4. Closed (Recovered): If test requests succeed, resume normal operation
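The four transitions listed above can be traced with a small state machine. This is a simplified illustration using only the standard library and is not how gobreaker is implemented: it trips on consecutive failures, lets a single probe through when half-open, and has no locking. All names (`Breaker`, `NewBreaker`, `Call`, `walkThrough`) are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

func (s State) String() string {
	switch s {
	case Closed:
		return "closed"
	case Open:
		return "open"
	default:
		return "half-open"
	}
}

var ErrOpen = errors.New("circuit open")

// Breaker is a single-goroutine sketch; a real one needs a mutex.
type Breaker struct {
	state       State
	failures    int              // consecutive failures
	maxFailures int              // failures that trip the breaker
	openFor     time.Duration    // time to stay open before probing
	openedAt    time.Time        // when the breaker last opened
	now         func() time.Time // injectable clock for the demo
}

func NewBreaker(maxFailures int, openFor time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, openFor: openFor, now: time.Now}
}

// Call runs op unless the breaker is open and still within its cool-down.
func (b *Breaker) Call(op func() error) error {
	if b.state == Open {
		if b.now().Sub(b.openedAt) < b.openFor {
			return ErrOpen // fast-fail without calling op
		}
		b.state = HalfOpen // cool-down elapsed: allow a probe
	}
	if err := op(); err != nil {
		b.failures++
		if b.state == HalfOpen || b.failures >= b.maxFailures {
			b.state = Open
			b.openedAt = b.now()
		}
		return err
	}
	b.state = Closed
	b.failures = 0
	return nil
}

// walkThrough drives a breaker through the transitions with a fake clock
// and records the state after each call.
func walkThrough() []State {
	clock := time.Unix(0, 0)
	b := NewBreaker(2, 30*time.Second)
	b.now = func() time.Time { return clock }

	fail := func() error { return errors.New("boom") }
	ok := func() error { return nil }

	var seen []State
	step := func(op func() error) {
		_ = b.Call(op)
		seen = append(seen, b.state)
	}

	step(fail) // first failure: still closed
	step(fail) // second failure: trips to open
	step(ok)   // rejected while open
	clock = clock.Add(31 * time.Second)
	step(fail) // probe fails: back to open
	clock = clock.Add(31 * time.Second)
	step(ok) // probe succeeds: closed again
	return seen
}

func main() {
	fmt.Println(walkThrough())
}
```

Injecting the clock keeps the walk-through deterministic; with the default time.Now the same transitions would take about a minute of real time.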

Production Pattern: Bulkhead

A bulkhead isolates resources (connections, goroutines) so that one overloaded dependency cannot exhaust them for everything else.

Pattern: Worker Pool Bulkhead:

package main
 
import (
    "fmt"
    "time"
)
 
type Bulkhead struct {
    semaphore chan struct{}
    // => Semaphore controls concurrent operations
    // => Buffer size limits concurrency
}
 
func NewBulkhead(maxConcurrent int) *Bulkhead {
    // => Creates bulkhead with concurrency limit
    // => maxConcurrent is max parallel operations
 
    return &Bulkhead{
        semaphore: make(chan struct{}, maxConcurrent),
        // => Buffered channel with maxConcurrent capacity
        // => Full channel blocks new operations
    }
}
 
func (b *Bulkhead) Execute(fn func() error) error {
    // => Executes function with concurrency control
    // => Blocks if concurrency limit reached
 
    b.semaphore <- struct{}{}
    // => Acquire semaphore (blocks if full)
    // => Adds token to channel
    // => Blocks when channel full (maxConcurrent operations running)
 
    defer func() {
        <-b.semaphore
        // => Release semaphore
        // => Removes token from channel
        // => Allows next operation to proceed
    }()
 
    return fn()
    // => Execute operation
    // => Guaranteed: at most maxConcurrent executing concurrently
}
 
func main() {
    bulkhead := NewBulkhead(5)
    // => Limit to 5 concurrent operations
    // => 6th operation blocks until slot available
 
    for i := 0; i < 20; i++ {
        go func(id int) {
            // => Launch goroutine (non-blocking)
 
            err := bulkhead.Execute(func() error {
                // => Operation controlled by bulkhead
                // => At most 5 operations running simultaneously
 
                fmt.Printf("Task %d started\n", id)
                time.Sleep(1 * time.Second)
                // => Simulate work
                fmt.Printf("Task %d completed\n", id)
                return nil
            })
 
            if err != nil {
                fmt.Printf("Task %d error: %v\n", id, err)
            }
        }(i)
    }
 
    time.Sleep(10 * time.Second)
    // => Wait for all tasks to complete
    // => Production: use sync.WaitGroup
}

Why bulkhead matters:

  • Prevents resource exhaustion (connection pool, goroutines)
  • Isolates failures (one service can't consume all resources)
  • Maintains system stability under load
  • Enables graceful degradation
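The Execute method shown earlier blocks for as long as the bulkhead is full. To degrade gracefully, a caller usually needs a bound on that wait. The sketch below is one way to add it, assuming a context-aware variant of Execute; the changed signature and the `demo` helper are illustrative assumptions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Bulkhead limits concurrency with a buffered channel used as a semaphore.
type Bulkhead struct {
	slots chan struct{}
}

func NewBulkhead(maxConcurrent int) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, maxConcurrent)}
}

// Execute waits for a free slot, but only until ctx is done.
func (b *Bulkhead) Execute(ctx context.Context, fn func() error) error {
	select {
	case b.slots <- struct{}{}:
		// Slot acquired.
	case <-ctx.Done():
		return fmt.Errorf("bulkhead wait abandoned: %w", ctx.Err())
	}
	defer func() { <-b.slots }()
	return fn()
}

// demo holds the only slot and shows a second caller giving up after 20ms
// instead of blocking indefinitely.
func demo() (bool, error) {
	b := NewBulkhead(1)
	var inner error
	outer := b.Execute(context.Background(), func() error {
		ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
		defer cancel()
		inner = b.Execute(ctx, func() error { return nil })
		return nil
	})
	return errors.Is(inner, context.DeadlineExceeded), outer
}

func main() {
	innerTimedOut, outerErr := demo()
	fmt.Println("second caller timed out:", innerTimedOut, "first caller error:", outerErr)
}
```

The caller that times out can then fall back to a cached value or a reduced response, which is the graceful degradation the list above refers to.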

Trade-offs: When to Use Each

Comparison table:

Pattern         | Purpose                  | Use Case
Timeout         | Limit operation duration | All external calls (HTTP, database, RPC)
Retry           | Handle transient errors  | Network failures, temporary unavailability
Circuit Breaker | Stop cascading failures  | Degraded or failing services
Bulkhead        | Resource isolation       | Prevent resource exhaustion

When to use timeouts:

  • All external calls (HTTP, database, RPC)
  • Long-running operations (file I/O, computation)
  • User-facing requests (prevent hanging)
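  • Suggested starting points: 30s for HTTP, 5s for database (tune to measured latency; Go's http.Client applies no timeout unless one is set)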
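For user-facing requests, a per-call timeout is most useful when it is derived from the incoming request's context rather than from context.Background(), because a child context can never outlive its parent. The sketch below illustrates this property of context.WithTimeout; the helper name `cappedByParent` and the durations are assumptions chosen for the illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// cappedByParent reports whether a child context created with childTimeout
// ends up with exactly the parent's deadline, which happens when the parent
// expires first.
func cappedByParent(parentTimeout, childTimeout time.Duration) bool {
	parent, cancelParent := context.WithTimeout(context.Background(), parentTimeout)
	defer cancelParent()

	child, cancelChild := context.WithTimeout(parent, childTimeout)
	defer cancelChild()

	parentDeadline, _ := parent.Deadline()
	childDeadline, _ := child.Deadline()
	return childDeadline.Equal(parentDeadline)
}

func main() {
	// Request budget 1s, downstream call asks for 30s: the 1s budget wins.
	fmt.Println(cappedByParent(1*time.Second, 30*time.Second))

	// Request budget 30s, downstream call asks for 1s: the child is shorter.
	fmt.Println(cappedByParent(30*time.Second, 1*time.Second))
}
```

In an HTTP handler the parent would typically be r.Context(), so that work stops when the client disconnects or the overall request budget runs out.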

When to use retries:

  • Transient network errors (temporary DNS failures)
  • Rate limiting (429 status codes)
  • Temporary service unavailability (503 status)
  • Idempotent operations (safe to retry)
  • NOT for non-idempotent operations (payment processing)
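The criteria above can be folded into a single check before a retry is scheduled. The following is a deliberately conservative sketch: the function name `shouldRetry` is hypothetical, and refusing to retry POST even on 429 is an assumption that many systems relax once idempotency keys are in place.

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldRetry reports whether a response with the given status to a request
// with the given method is a reasonable retry candidate.
func shouldRetry(method string, status int) bool {
	// Only statuses that signal a temporary condition qualify.
	switch status {
	case http.StatusTooManyRequests, http.StatusServiceUnavailable:
	default:
		return false
	}
	// Only methods that HTTP defines as idempotent are retried blindly.
	switch method {
	case http.MethodGet, http.MethodHead, http.MethodPut, http.MethodDelete, http.MethodOptions:
		return true
	}
	return false
}

func main() {
	fmt.Println(shouldRetry(http.MethodGet, http.StatusServiceUnavailable))  // true
	fmt.Println(shouldRetry(http.MethodGet, http.StatusTooManyRequests))     // true
	fmt.Println(shouldRetry(http.MethodPost, http.StatusServiceUnavailable)) // false
	fmt.Println(shouldRetry(http.MethodGet, http.StatusNotFound))            // false
}
```

For 429 and 503 responses, a server may also send a Retry-After header; honoring it in place of the computed backoff is a common refinement.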

When to use circuit breakers:

  • Cascading failure prevention (service degradation)
  • Fast-fail requirements (immediate error response)
  • Service-to-service communication (microservices)
  • External API calls (third-party services)

When to use bulkheads:

  • Resource exhaustion prevention (connection pools)
  • Multi-tenant systems (per-tenant resource limits)
  • Mixed criticality workloads (prioritize critical operations)
  • Goroutine pool management (limit concurrency)
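For the multi-tenant case, one possible arrangement keeps a separate semaphore per tenant so that a busy tenant is rejected without affecting the others. This is a sketch under assumptions: the type `TenantBulkheads`, the `ErrTenantBusy` error, and the fail-fast behavior are all illustrative choices.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrTenantBusy = errors.New("tenant concurrency limit reached")

// TenantBulkheads lazily creates one semaphore per tenant.
type TenantBulkheads struct {
	mu    sync.Mutex
	limit int
	slots map[string]chan struct{}
}

func NewTenantBulkheads(perTenantLimit int) *TenantBulkheads {
	return &TenantBulkheads{limit: perTenantLimit, slots: make(map[string]chan struct{})}
}

// slotsFor returns the tenant's semaphore, creating it on first use.
func (t *TenantBulkheads) slotsFor(tenant string) chan struct{} {
	t.mu.Lock()
	defer t.mu.Unlock()
	ch, ok := t.slots[tenant]
	if !ok {
		ch = make(chan struct{}, t.limit)
		t.slots[tenant] = ch
	}
	return ch
}

// TryExecute runs fn if the tenant has a free slot and fails fast otherwise.
func (t *TenantBulkheads) TryExecute(tenant string, fn func() error) error {
	ch := t.slotsFor(tenant)
	select {
	case ch <- struct{}{}:
		defer func() { <-ch }()
		return fn()
	default:
		return ErrTenantBusy
	}
}

func main() {
	pools := NewTenantBulkheads(1)
	noop := func() error { return nil }

	var sameTenant, otherTenant error
	_ = pools.TryExecute("tenant-a", func() error {
		// tenant-a's only slot is held while this function runs.
		sameTenant = pools.TryExecute("tenant-a", noop)
		otherTenant = pools.TryExecute("tenant-b", noop)
		return nil
	})

	fmt.Println("tenant-a while busy:", sameTenant)
	fmt.Println("tenant-b meanwhile:", otherTenant)
}
```

The map grows with the number of tenants seen, so a long-running service would also need a policy for evicting idle entries.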

Production Best Practices

Combine timeout + retry + circuit breaker:

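// GOOD: defense in depth (multiple resilience layers)
// => Assumes cb and retryWithBackoff from the sections above
// => Assumes a helper fetchWithContext(ctx, url) (string, error)
func callServiceResilience(url string) (string, error) {
    result, err := cb.Execute(func() (interface{}, error) {
        // => Layer 3: Circuit breaker

        var body string
        err := retryWithBackoff(3, func() error {
            // => Layer 2: Retry with backoff

            ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
            defer cancel()
            // => Layer 1: Timeout (fresh deadline for each attempt)

            var fetchErr error
            body, fetchErr = fetchWithContext(ctx, url)
            return fetchErr
        })
        return body, err
        // => Breaker records one result per call, after retries are exhausted
    })
    if err != nil {
        return "", err
    }

    return result.(string), nil
    // => Type assertion: Execute returns interface{}
}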
 
// BAD: timeout only (no retry, no circuit breaker)
func callServiceUnsafe(url string) (string, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    return fetchWithContext(ctx, url)
}

Set appropriate timeout values:

// GOOD: reasonable timeouts based on operation
httpTimeout := 30 * time.Second    // HTTP requests
dbTimeout := 5 * time.Second       // Database queries
rpcTimeout := 10 * time.Second     // RPC calls
 
// BAD: too short (false positives) or too long (hanging)
tooShort := 100 * time.Millisecond  // Too short for HTTP
tooLong := 5 * time.Minute          // Too long (blocks resources)

Use idempotency keys for retries:

// GOOD: idempotency key prevents duplicate operations
func processPaymentWithRetry(paymentID string, amount float64) error {
    idempotencyKey := fmt.Sprintf("payment-%s", paymentID)
    // => Idempotency key identifies unique operation
    // => Same key = same operation (not duplicate)
 
    return retryWithBackoff(3, func() error {
        return processPayment(paymentID, amount, idempotencyKey)
        // => Server checks idempotency key
        // => Duplicate requests return original result
    })
}
 
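// BAD: retry non-idempotent operation (double payment)
func processPaymentUnsafe(paymentID string, amount float64) error {
    return retryWithBackoff(3, func() error {
        return processPayment(paymentID, amount, "")  // No idempotency key
        // => A retry after a lost response can charge the customer twice
    })
}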
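The comments above say that the server checks the idempotency key and returns the original result for duplicates. A minimal in-memory sketch of that server-side check could look like the following. The type `PaymentServer`, its fields, and the result format are hypothetical; a real service would persist keys in durable storage and expire them.

```go
package main

import (
	"fmt"
	"sync"
)

// PaymentServer remembers the result of each idempotency key it has seen.
type PaymentServer struct {
	mu      sync.Mutex
	results map[string]string // idempotency key -> stored result
	charges int               // number of charges actually performed
}

func NewPaymentServer() *PaymentServer {
	return &PaymentServer{results: make(map[string]string)}
}

// Charge performs the charge once per key. A repeated key returns the stored
// result without charging again.
func (s *PaymentServer) Charge(key string, amountCents int64) string {
	s.mu.Lock()
	defer s.mu.Unlock()

	if result, seen := s.results[key]; seen {
		return result // duplicate request: replay the original outcome
	}

	s.charges++
	result := fmt.Sprintf("charge-%d for %d cents", s.charges, amountCents)
	s.results[key] = result
	return result
}

func main() {
	server := NewPaymentServer()

	first := server.Charge("payment-42", 1999)
	retry := server.Charge("payment-42", 1999) // client retried after a timeout

	fmt.Println(first)
	fmt.Println(retry)
	fmt.Println("charges performed:", server.charges)
}
```

Holding the mutex for the whole call serializes all charges, which keeps the sketch simple; per-key locking or a unique constraint in the database are the usual alternatives.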

Monitor circuit breaker state:

// Track circuit breaker metrics
// => Closed: healthy service
// => Open: failing service (alert ops team)
// => Half-Open: testing recovery
 
OnStateChange: func(name string, from gobreaker.State, to gobreaker.State) {
    metrics.RecordCircuitBreakerState(name, to)
    // => Send metrics to monitoring system (Prometheus, Datadog)
 
    if to == gobreaker.StateOpen {
        alerts.SendAlert(fmt.Sprintf("Circuit breaker %s is OPEN", name))
        // => Alert operations team immediately
    }
}

Summary

Resilience patterns prevent cascading failures in distributed systems through fault isolation and graceful degradation. The standard library provides context.WithTimeout for basic timeouts but no circuit breaker, retry logic, or bulkhead patterns. Production systems combine timeouts, retries with exponential backoff and jitter, gobreaker for circuit breaking, and semaphore-based bulkheads. Use timeouts for all external calls, retries for transient errors, circuit breakers to stop hammering failing services, and bulkheads to prevent resource exhaustion. Monitor circuit breaker state and combine multiple resilience layers for defense in depth.

Key takeaways:

  • Use context.WithTimeout for all external calls (HTTP, database, RPC)
  • Implement exponential backoff with jitter for retry logic
  • Use gobreaker circuit breaker to stop cascading failures
  • Implement bulkhead pattern to isolate resources
  • Combine timeout + retry + circuit breaker for defense in depth
  • Set appropriate timeouts based on operation type (30s HTTP, 5s database)
  • Use idempotency keys for safe retries
  • Monitor circuit breaker state transitions (alert on Open state)

Last updated February 3, 2026
