Resilience Patterns

Understanding Resilience Patterns

Resilience means systems continue functioning despite failures. Distributed systems face network failures, service outages, and transient errors. Resilience patterns prevent cascading failures and enable graceful degradation.

Why resilience matters:

  • Failures are normal: Network partitions, timeouts, service crashes
  • Cascading failures: One failure triggers others
  • User experience: Partial functionality beats complete failure
  • Cost efficiency: Retries can succeed without manual intervention

This guide covers retry patterns with exponential backoff and circuit breaker patterns for protecting systems from overload.

Retry Pattern with Exponential Backoff

Problem: Transient failures (network glitches, temporary overload) cause operations to fail even though a slightly later retry would likely succeed. Naive immediate retries, however, create a thundering herd problem.

Recognition signals:

  • Operations fail intermittently
  • Network timeouts occur sporadically
  • Services report temporary unavailability (503 Service Unavailable)
  • Immediate retries create load spikes
  • No delay between retry attempts

Solution: Retry with exponential backoff - increasing delays between attempts.

Characteristic | Naive Retry | Exponential Backoff
Retry timing | Immediate | Increasing delays (1s, 2s, 4s, 8s)
Load pattern | Thundering herd | Distributed load
Success probability | Same each attempt | Increases with time
Resource usage | High (constant hammering) | Bounded (max attempts)

Example:

import java.time.Duration;
import java.util.Random;
import java.util.function.Supplier;

public class RetryWithBackoff {
    private final int maxAttempts;
    // => Maximum retry attempts before giving up
    private final Duration initialDelay;
    // => Starting delay (e.g., 1 second)
    private final double multiplier;
    // => Exponential factor (e.g., 2.0 doubles each retry)
    private final Random random = new Random();
    // => For jitter calculation (prevents thundering herd)

    public RetryWithBackoff(int maxAttempts, Duration initialDelay, double multiplier) {
        this.maxAttempts = maxAttempts;
        // => Store: Max attempts (e.g., 5)
        this.initialDelay = initialDelay;
        // => Store: Initial delay (e.g., 1s)
        this.multiplier = multiplier;
        // => Store: Backoff multiplier (e.g., 2.0 for exponential)
    }

    public <T> T execute(Supplier<T> operation) {
        Exception lastException = null;
        // => Track last failure for final exception
        long delay = initialDelay.toMillis();
        // => Current delay in milliseconds (starts at initialDelay)

        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // => Loop: attempt 1, 2, 3... up to maxAttempts
            try {
                return operation.get();
                // => TRY OPERATION: Call provided function
                // => SUCCESS PATH: Return immediately (no more retries)
            } catch (Exception e) {
                lastException = e;
                // => FAILURE: Store exception
                if (attempt == maxAttempts) {
                    break;
                    // => EXHAUSTED: Last attempt failed, exit loop
                }

                // => EXPONENTIAL BACKOFF with jitter
                long jitter = (long) (delay * 0.1 * random.nextDouble());
                // => JITTER: Random 0-10% of delay
                // => Purpose: Prevents synchronized retries (thundering herd)
                // => Example: 1000ms base delay plus 0-100ms of randomness
                long sleepTime = delay + jitter;
                // => TOTAL DELAY: base delay + random jitter

                try {
                    Thread.sleep(sleepTime);
                    // => WAIT: Sleep before next retry
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    // => INTERRUPTED: Restore interrupt flag
                    throw new RuntimeException("Retry interrupted", ie);
                    // => ABORT: Throw exception if interrupted
                }

                delay = (long) (delay * multiplier);
                // => EXPONENTIAL INCREASE: Multiply delay for next attempt
                // => Example: 1s → 2s → 4s → 8s (with multiplier 2.0)
            }
        }

        throw new RuntimeException("Failed after " + maxAttempts + " attempts", lastException);
        // => FINAL FAILURE: All retries exhausted, throw with last exception
    }
}

// => USAGE EXAMPLE
RetryWithBackoff retry = new RetryWithBackoff(
    5,  // => MAX 5 attempts (1 initial + 4 retries)
    Duration.ofSeconds(1),  // => START with 1 second delay
    2.0  // => DOUBLE delay each time (exponential backoff)
);
// => Timing: 0s, 1s, 2s, 4s, 8s (total ~15s before final failure)

String result = retry.execute(() -> {
    return externalService.call();
    // => Operation that MIGHT FAIL transiently
    // => Network glitches, temporary overload, etc.
});

Timing example:

Attempt | Delay | Total Time
1 | 0s | 0s
2 | 1s | 1s
3 | 2s | 3s
4 | 4s | 7s
5 | 8s | 15s
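
The schedule above follows directly from the parameters in the usage example (5 attempts, 1-second initial delay, multiplier 2.0). The small standalone snippet below, illustrative only, prints the same attempt times:

import java.time.Duration;

public class BackoffSchedule {
    public static void main(String[] args) {
        int maxAttempts = 5;
        long delay = Duration.ofSeconds(1).toMillis();  // 1s initial delay
        double multiplier = 2.0;
        long total = 0;                                 // elapsed time in ms

        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            System.out.printf("attempt %d at t=%ds%n", attempt, total / 1000);
            total += delay;                       // wait before the next attempt
            delay = (long) (delay * multiplier);  // 1s -> 2s -> 4s -> 8s
        }
        // Prints: t=0s, 1s, 3s, 7s, 15s - matching the table
    }
}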

Jitter: Random variance prevents synchronized retries.

// WITHOUT JITTER: All clients retry at the same moments
long delay = (long) (baseDelay * Math.pow(multiplier, attempt - 1));  // first retry waits baseDelay

// WITH JITTER: Spread retries out
long jitter = (long) (delay * 0.1 * random.nextDouble());  // adds 0-10% of the delay
long delayWithJitter = delay + jitter;

Retry Decision Logic

Not all failures should be retried:

public boolean shouldRetry(Exception e) {
    // RETRY: Transient network errors
    if (e instanceof SocketTimeoutException) return true;
    if (e instanceof ConnectException) return true;

    // HTTP errors: RETRY 429 (Too Many Requests) and 5xx (500, 503, 504, ...)
    // DON'T RETRY other 4xx client errors: the request won't succeed on retry
    if (e instanceof HttpStatusException) {
        int status = ((HttpStatusException) e).getStatusCode();
        return status == 429 || status >= 500;
    }

    // DON'T RETRY: Authentication failures
    if (e instanceof AuthenticationException) return false;

    return false;  // DEFAULT: Don't retry
}
HTTP Status | Retry? | Reason
200-299 | N/A | Success
400 Bad Request | No | Client error (won't change on retry)
401 Unauthorized | No | Authentication issue
404 Not Found | No | Resource doesn't exist
429 Too Many Requests | Yes | Rate limit (with backoff)
500 Internal Server Error | Yes | Transient server issue
503 Service Unavailable | Yes | Temporary overload
504 Gateway Timeout | Yes | Temporary timeout
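
The RetryWithBackoff class above retries every exception. A small sketch of wiring a retry predicate such as shouldRetry into the backoff loop is shown below; the SelectiveRetry class name and the Predicate<Exception> parameter are illustrative choices, and jitter is omitted for brevity:

import java.time.Duration;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class SelectiveRetry {
    private final int maxAttempts;
    private final Duration initialDelay;
    private final double multiplier;
    private final Predicate<Exception> retryable;
    // => e.g., this::shouldRetry from the decision logic above

    public SelectiveRetry(int maxAttempts, Duration initialDelay,
                          double multiplier, Predicate<Exception> retryable) {
        this.maxAttempts = maxAttempts;
        this.initialDelay = initialDelay;
        this.multiplier = multiplier;
        this.retryable = retryable;
    }

    public <T> T execute(Supplier<T> operation) {
        long delay = initialDelay.toMillis();
        Exception last = null;

        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (Exception e) {
                last = e;
                // Give up immediately on non-retryable errors (4xx, auth failures, ...)
                if (!retryable.test(e) || attempt == maxAttempts) {
                    break;
                }
                try {
                    Thread.sleep(delay);  // wait before the next attempt
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException("Retry interrupted", ie);
                }
                delay = (long) (delay * multiplier);  // exponential increase
            }
        }
        throw new RuntimeException("Failed after " + maxAttempts + " attempt(s)", last);
    }
}

A caller would construct it with the same timing parameters plus the predicate, e.g. new SelectiveRetry(5, Duration.ofSeconds(1), 2.0, this::shouldRetry).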

Circuit Breaker Pattern

Problem: When a downstream service fails, continued retries waste resources and delay failure detection, and every retrying client amplifies the load on the already struggling service.

Recognition signals:

  • Repeated failures to same service
  • Timeouts exhaust connection pools
  • Cascading failures across services
  • Slow degradation instead of fast failure
  • No recovery time for failing service

Solution: A circuit breaker stops sending requests to the failing service, giving it time to recover.

Circuit Breaker States

  stateDiagram-v2
    [*] --> CLOSED
    CLOSED --> OPEN: Failure threshold exceeded
    OPEN --> HALF_OPEN: Timeout elapsed
    HALF_OPEN --> CLOSED: Success threshold met
    HALF_OPEN --> OPEN: Failure detected
    CLOSED --> CLOSED: Success
    OPEN --> OPEN: Requests blocked

    style CLOSED fill:#029E73,stroke:#000,color:#fff
    style OPEN fill:#CC3311,stroke:#000,color:#fff
    style HALF_OPEN fill:#DE8F05,stroke:#000,color:#000

State | Behavior | Transition
CLOSED | Requests pass through | Failure threshold → OPEN
OPEN | Requests fail immediately | Timeout → HALF_OPEN
HALF_OPEN | Limited requests test service | Success → CLOSED, Failure → OPEN

Implementation:

import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class CircuitBreaker {
    private enum State { CLOSED, OPEN, HALF_OPEN }
    // => CLOSED: Normal operation (requests flow through)
    // => OPEN: Circuit tripped (requests fail immediately)
    // => HALF_OPEN: Testing (allow limited requests to check recovery)

    private final int failureThreshold;
    // => How many failures before opening circuit (e.g., 5)
    private final Duration timeout;
    // => How long to wait before testing recovery (e.g., 30 seconds)
    private final AtomicInteger failureCount = new AtomicInteger(0);
    // => Thread-safe failure counter (atomic operations)
    private final AtomicReference<State> state = new AtomicReference<>(State.CLOSED);
    // => Thread-safe state holder (starts CLOSED)
    private volatile Instant openedAt;
    // => Timestamp when circuit opened (volatile for visibility)

    public CircuitBreaker(int failureThreshold, Duration timeout) {
        this.failureThreshold = failureThreshold;
        // => Store: Failures needed to open circuit
        this.timeout = timeout;
        // => Store: Wait time before attempting recovery
    }

    public <T> T execute(Supplier<T> operation) {
        if (state.get() == State.OPEN) {
            // => CIRCUIT OPEN: Check if timeout elapsed
            if (Instant.now().isAfter(openedAt.plus(timeout))) {
                // => TIMEOUT ELAPSED: Time to test recovery
                state.set(State.HALF_OPEN);
                // => TRANSITION: OPEN → HALF_OPEN (allow test request)
            } else {
                // => STILL OPEN: Timeout not elapsed, fail fast
                throw new CircuitBreakerOpenException("Circuit breaker is OPEN");
                // => FAST FAIL: No request to failing service (prevents overload)
            }
        }

        try {
            T result = operation.get();
            // => EXECUTE OPERATION: Call underlying service
            // => May throw exception if service fails
            onSuccess();
            // => SUCCESS: Record successful execution
            return result;
            // => Return: Operation result
        } catch (Exception e) {
            onFailure();
            // => FAILURE: Record failed execution
            throw e;
            // => Rethrow: Propagate exception to caller
        }
    }

    private void onSuccess() {
        failureCount.set(0);  // RESET
        if (state.get() == State.HALF_OPEN) {
            state.set(State.CLOSED);  // TRANSITION: Service recovered
        }
    }

    private void onFailure() {
        int failures = failureCount.incrementAndGet();
        if (failures >= failureThreshold) {
            state.set(State.OPEN);  // TRANSITION: Too many failures
            openedAt = Instant.now();
        }
    }

    public State getState() {
        return state.get();
    }
}

// USAGE
CircuitBreaker breaker = new CircuitBreaker(
    5,  // OPEN after 5 failures
    Duration.ofSeconds(30)  // WAIT 30 seconds before retry
);

try {
    String response = breaker.execute(() -> {
        return externalService.call();
    });
} catch (CircuitBreakerOpenException e) {
    // FAST FAIL: Circuit is open
    return cachedResponse();  // FALLBACK
}
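
The examples throw CircuitBreakerOpenException, which is not a JDK class; a minimal definition (shown here as an assumption) only needs to be an unchecked exception:

// Assumed definition: the circuit breaker examples only need an unchecked exception type
public class CircuitBreakerOpenException extends RuntimeException {
    public CircuitBreakerOpenException() {
        super("Circuit breaker is OPEN");
    }

    public CircuitBreakerOpenException(String message) {
        super(message);
    }
}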

Behavior timeline:

Time | Failures | State | Action
0s | 0 | CLOSED | Requests pass
5s | 5 | OPEN | Circuit opens, blocks requests
10s | 5 | OPEN | Requests blocked (fail fast)
35s | 5 | HALF_OPEN | Allow one test request
35s | 0 | CLOSED | Test succeeds, circuit closes (failure count resets)

Advanced Circuit Breaker
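
The basic breaker counts consecutive failures. A more robust variant tracks the failure rate over a sliding window of recent calls and requires several consecutive successes in HALF_OPEN before closing. The window implementation and constructor below are one reasonable way to fill in those details, not the only one.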

import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class AdvancedCircuitBreaker {
    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final double failureRateThreshold;  // Failure rate (0.0-1.0) that opens the circuit
    private final int successThreshold;         // Consecutive successes to close (HALF_OPEN → CLOSED)
    private final Duration timeout;             // Wait in OPEN before testing recovery
    private final SlidingWindow failureWindow;  // Track outcomes of recent calls
    private final AtomicInteger halfOpenSuccesses = new AtomicInteger(0);
    private final AtomicReference<State> state = new AtomicReference<>(State.CLOSED);
    private volatile Instant openedAt;

    public AdvancedCircuitBreaker(double failureRateThreshold, int successThreshold,
                                  Duration timeout, int windowSize) {
        this.failureRateThreshold = failureRateThreshold;
        this.successThreshold = successThreshold;
        this.timeout = timeout;
        this.failureWindow = new SlidingWindow(windowSize);
    }

    private static class SlidingWindow {
        private final boolean[] results;  // Ring buffer of recent outcomes (true = success)
        private int index = 0;
        private int count = 0;

        SlidingWindow(int size) {
            this.results = new boolean[size];
        }

        synchronized void record(boolean success) {
            results[index] = success;
            index = (index + 1) % results.length;
            if (count < results.length) count++;
        }

        synchronized int getFailureCount() {
            int failures = 0;
            for (int i = 0; i < count; i++) {
                if (!results[i]) failures++;
            }
            return failures;
        }

        synchronized double getFailureRate() {
            return count == 0 ? 0.0 : (double) getFailureCount() / count;
        }

        synchronized void clear() {
            index = 0;
            count = 0;
        }
    }

    public <T> T execute(Supplier<T> operation) {
        if (state.get() == State.OPEN) {
            if (shouldAttemptReset()) {
                state.set(State.HALF_OPEN);
                halfOpenSuccesses.set(0);
            } else {
                throw new CircuitBreakerOpenException();
            }
        }

        try {
            T result = operation.get();
            onSuccess();
            return result;
        } catch (Exception e) {
            onFailure();
            throw e;
        }
    }

    private boolean shouldAttemptReset() {
        return openedAt != null && Instant.now().isAfter(openedAt.plus(timeout));
    }

    private void onSuccess() {
        failureWindow.record(true);

        if (state.get() == State.HALF_OPEN) {
            int successes = halfOpenSuccesses.incrementAndGet();
            if (successes >= successThreshold) {
                state.set(State.CLOSED);  // RECOVERED
                failureWindow.clear();
            }
        }
    }

    private void onFailure() {
        failureWindow.record(false);

        if (state.get() == State.HALF_OPEN) {
            state.set(State.OPEN);  // STILL FAILING
            openedAt = Instant.now();
        } else if (failureWindow.getFailureRate() >= failureRateThreshold) {
            state.set(State.OPEN);  // Failure rate over the window exceeded
            openedAt = Instant.now();
        }
    }
}
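
With the constructor sketched above (the parameter values here are illustrative), usage mirrors the basic breaker:

AdvancedCircuitBreaker breaker = new AdvancedCircuitBreaker(
    0.5,                     // OPEN when 50% or more of recent calls failed
    3,                       // CLOSE after 3 consecutive HALF_OPEN successes
    Duration.ofSeconds(30),  // WAIT 30 seconds before testing recovery
    20                       // SLIDING WINDOW over the last 20 calls
);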

Combining Patterns

Problem: Retry and circuit breaker address different failure modes: retries handle brief transient errors, while circuit breakers handle sustained outages. Neither alone is enough; combined, they provide robust resilience.

Solution: Compose them with the circuit breaker on the outside and the retry on the inside. The breaker then records one outcome per call even when the inner retry makes several attempts, and once the circuit opens, no retries are attempted at all.

public class ResilientHttpClient {
    private final CircuitBreaker circuitBreaker;
    private final RetryWithBackoff retry;

    public ResilientHttpClient() {
        this.circuitBreaker = new CircuitBreaker(5, Duration.ofSeconds(30));
        this.retry = new RetryWithBackoff(3, Duration.ofSeconds(1), 2.0);
    }

    public String call(String url) {
        return circuitBreaker.execute(() -> {  // CIRCUIT BREAKER: Fail fast if open
            return retry.execute(() -> {  // RETRY: Handle transient failures
                return httpClient.get(url);
            });
        });
    }
}

Decision flowchart:

  graph TD
    Start["Request"] --> CircuitOpen{"Circuit<br/>Open?"}
    CircuitOpen -->|Yes| FailFast["Fail Fast"]
    CircuitOpen -->|No| Attempt["Attempt Request"]
    Attempt --> Success{"Success?"}
    Success -->|Yes| Return["Return Result"]
    Success -->|No| Retryable{"Retryable<br/>Error?"}
    Retryable -->|Yes| AttemptsLeft{"Attempts<br/>Remaining?"}
    Retryable -->|No| RecordFailure["Record Failure"]
    AttemptsLeft -->|Yes| Backoff["Exponential<br/>Backoff"]
    Backoff --> Attempt
    AttemptsLeft -->|No| RecordFailure
    RecordFailure --> ThresholdExceeded{"Threshold<br/>Exceeded?"}
    ThresholdExceeded -->|Yes| OpenCircuit["Open Circuit"]
    ThresholdExceeded -->|No| FailRequest["Fail Request"]

    style Start fill:#0173B2,stroke:#000,color:#fff
    style Return fill:#029E73,stroke:#000,color:#fff
    style FailFast fill:#CC3311,stroke:#000,color:#fff
    style OpenCircuit fill:#CC3311,stroke:#000,color:#fff
    style FailRequest fill:#DE8F05,stroke:#000,color:#000

Fallback Strategies

When all else fails, provide degraded functionality:

public class ResilientService {
    private final CircuitBreaker breaker;
    private final Cache<String, String> cache;

    public String getData(String key) {
        try {
            return breaker.execute(() -> {
                return externalService.fetch(key);
            });
        } catch (CircuitBreakerOpenException e) {
            // FALLBACK 1: Return cached data
            String cached = cache.get(key);
            if (cached != null) {
                return cached;
            }

            // FALLBACK 2: Return default value
            return getDefaultValue(key);
        }
    }

    private String getDefaultValue(String key) {
        return "Service temporarily unavailable";
    }
}

Guidelines

When to use retry:

  • ✓ Transient network failures
  • ✓ Temporary service overload
  • ✓ Read operations (idempotent)
  • ✓ Operations with bounded time

When to avoid retry:

  • ✗ Client errors (4xx)
  • ✗ Non-idempotent writes (unless idempotency key used)
  • ✗ Real-time operations with strict latency requirements

When to use circuit breaker:

  • ✓ Calls to external services
  • ✓ Operations that can fail in cascading manner
  • ✓ Services with known failure modes
  • ✓ Operations where fast failure preferred over slow failure

Best practices:

  1. Combine patterns: Circuit breaker + retry + fallback
  2. Monitor metrics: Track failure rates and circuit state transitions (see the counter sketch after this list)
  3. Set realistic thresholds: Balance sensitivity vs. false positives
  4. Test failures: Chaos engineering, fault injection
  5. Provide fallbacks: Cached data, default values, degraded functionality
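
For the metrics practice above, one lightweight option is a thin wrapper around the CircuitBreaker from earlier that counts successes, failures, and fast-fail rejections (a spike in rejections means the circuit has opened). The InstrumentedCircuitBreaker name and plain counters are illustrative; in practice the counts would be exported to whatever metrics system is in use:

import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

public class InstrumentedCircuitBreaker {
    private final CircuitBreaker delegate;
    private final AtomicLong successes = new AtomicLong();
    private final AtomicLong failures = new AtomicLong();
    private final AtomicLong rejections = new AtomicLong();  // fast-fails while OPEN

    public InstrumentedCircuitBreaker(CircuitBreaker delegate) {
        this.delegate = delegate;
    }

    public <T> T execute(Supplier<T> operation) {
        try {
            T result = delegate.execute(operation);
            successes.incrementAndGet();
            return result;
        } catch (CircuitBreakerOpenException e) {
            rejections.incrementAndGet();  // circuit was open, request never sent
            throw e;
        } catch (Exception e) {
            failures.incrementAndGet();
            throw e;
        }
    }

    public long getSuccesses()  { return successes.get(); }
    public long getFailures()   { return failures.get(); }
    public long getRejections() { return rejections.get(); }
}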

Conclusion

Resilience patterns enable fault-tolerant systems:

  • Retry with backoff: Handle transient failures gracefully
  • Circuit breaker: Prevent cascading failures, enable recovery
  • Fallbacks: Degrade gracefully instead of complete failure

Build resilience in layers: retry for transient issues, circuit breaker for sustained failures, and fallbacks for complete outages. Monitor and tune thresholds based on real-world behavior. Resilience is not optional in distributed systems.
