Retry Strategies
Why This Matters
Flaky tests are one of the most frustrating challenges in end-to-end testing. A test that passes locally fails in CI, then passes when re-run without any code changes. This undermines confidence in the test suite, wastes developer time investigating false failures, and eventually leads teams to ignore legitimate failures. In production environments, flaky tests can block deployments, create noise in monitoring systems, and erode trust in automated testing altogether.
The root cause of flakiness often lies in timing issues: network latency, animation delays, asynchronous rendering, race conditions between test actions and application state. While Playwright provides auto-waiting for many scenarios, complex production applications require explicit retry strategies to handle edge cases like rate limiting, transient network errors, and eventually-consistent systems. Understanding when and how to retry operations is critical for building resilient test suites that accurately reflect production behavior.
Production retry strategies must balance reliability with execution time. Naive retry approaches mask real bugs by retrying indiscriminately. Smart retry strategies distinguish between retriable failures (network timeouts, transient errors) and non-retriable failures (assertion errors, logic bugs). This guide shows how to progress from Playwright’s built-in retry mechanisms to sophisticated production patterns with exponential backoff, conditional retries, and comprehensive failure tracking.
Standard Library Approach: Playwright Built-in Retries
Playwright provides built-in test and assertion retries through configuration:
// playwright.config.ts
import { defineConfig } from "@playwright/test";
// => Import Playwright configuration helper
// => defineConfig provides type-safe config
export default defineConfig({
retries: process.env.CI ? 2 : 0,
// => Retries entire test 2 times in CI, 0 locally
// => Applies to all tests in project
// => Test fails only if all retries fail
// => Each retry runs from beginning of test
use: {
actionTimeout: 10000,
// => Maximum time for single action (click, fill, etc.)
// => Playwright auto-waits up to this timeout
// => Throws TimeoutError if action incomplete
// => Default: 0 (no per-action limit; the test timeout still applies)
navigationTimeout: 30000,
// => Maximum time for page navigation
// => Applies to goto(), reload(), goBack(), goForward()
// => Separate from actionTimeout for slower page loads
// => Critical for production apps with heavy assets
},
expect: {
timeout: 5000,
// => Maximum time for expect() assertions
// => Applies to all expect() calls
// => Playwright retries assertions until pass or timeout
// => Handles eventually-consistent UI updates
},
});
Using built-in retries in tests:
// tests/checkout.spec.ts
import { test, expect } from "@playwright/test";
// => Import test runner and assertion library
// => Inherits retry config from playwright.config.ts
test("checkout flow completes successfully", async ({ page }) => {
// => test() marks function for Playwright execution
// => page fixture provides browser context
// => Will retry up to 2 times (CI) if test fails
await page.goto("https://example.com/checkout");
// => Navigate to checkout page
// => Auto-waits for page load (up to navigationTimeout)
// => A failed navigation (e.g. net::ERR_CONNECTION_REFUSED) throws; only test-level retries re-run it
await page.click('[data-testid="submit-order"]');
// => Click submit button
// => Auto-waits for element to be visible, enabled, stable
// => Retries action up to actionTimeout
await expect(page.locator(".confirmation")).toBeVisible();
// => Assert confirmation message appears
// => Retries assertion up to expect.timeout (5000ms)
// => Handles race conditions in async rendering
});
Limitations for production:
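- Indiscriminate retries: Re-runs the entire test on any failure. A test that passes on retry is reported as "flaky" rather than failed, so intermittent real bugs are easy to overlook
- Coarse retry customization: expect.toPass() and expect.poll() can re-run a block at configurable intervals, but neither can decide per error whether retrying makes sense
- No backoff between test retries: A retried test starts over without a configurable delay, so rate-limited APIs are hit again right away
- Limited retry context: Cannot distinguish retriable from non-retriable failures
- No retry metrics: Apart from the flaky status, there is no per-operation record of retries to reveal systemic issues
- Coarse configuration: Retry counts can be set globally, per project, or per describe block (test.describe.configure({ retries })), but not per operation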
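Retry counts do not have to be a single global value. A sketch of a project-level override, assuming a hypothetical split of specs into `*.ui.spec.ts` and `*.api.spec.ts` files:

```typescript
// playwright.config.ts (sketch; project names and file patterns are hypothetical)
import { defineConfig } from "@playwright/test";

export default defineConfig({
  // Default for every project: 2 retries in CI, none locally
  retries: process.env.CI ? 2 : 0,
  projects: [
    // No retries key here, so this project inherits the top-level value
    { name: "ui", testMatch: /.*\.ui\.spec\.ts/ },
    // Project-level retries override the top-level value for this project only
    { name: "api", testMatch: /.*\.api\.spec\.ts/, retries: 3 },
  ],
});
```

Either way, the unit that gets retried is still a whole test.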
Production Framework: Custom Retry Utilities
Production environments require granular retry control with exponential backoff and conditional logic:
// utils/retry.ts
export interface RetryOptions {
maxAttempts: number;
// => Maximum retry attempts before failure
// => First attempt counts toward total
// => Example: maxAttempts=3 means 1 initial + 2 retries
initialDelay: number;
// => Milliseconds to wait before first retry
// => Base value for exponential backoff
// => Typical: 100-1000ms depending on operation
maxDelay: number;
// => Maximum delay between retries (backoff ceiling)
// => Prevents infinite backoff growth
// => Typical: 5000-30000ms for API calls
backoffMultiplier: number;
// => Exponential growth factor for delays
// => delay = min(initialDelay * multiplier^(attempt - 1), maxDelay)
// => Typical: 2.0 (doubles delay each retry)
shouldRetry?: (error: Error) => boolean;
// => Optional predicate to determine if error is retriable
// => Return false to abort retries immediately
// => Prevents retrying logic errors, 4xx responses
}
export async function retryWithBackoff<T>(operation: () => Promise<T>, options: RetryOptions): Promise<T> {
// => Generic retry wrapper for any async operation
// => T is return type of operation (type-safe)
// => Throws if all retries exhausted
const {
maxAttempts,
initialDelay,
maxDelay,
backoffMultiplier,
shouldRetry = () => true,
// => Default: retry all errors (can be overridden)
} = options;
let lastError: Error;
// => Store most recent error for final throw
// => Used if all retries fail
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
// => Attempt 1 is initial try, 2+ are retries
// => Loop continues until success or maxAttempts reached
try {
const result = await operation();
// => Execute operation (Playwright action, API call)
// => If succeeds, return immediately (no more retries)
// => If throws, catch block handles retry logic
if (attempt > 1) {
console.log(`✓ Retry succeeded on attempt ${attempt}`);
// => Log successful retry for metrics/debugging
// => Only logged if not first attempt
}
return result;
// => Return successful result to caller
// => Exits retry loop
} catch (error) {
lastError = error as Error;
// => Cast to Error type for type safety
// => Store for potential final throw
if (!shouldRetry(lastError)) {
console.error(`✗ Non-retriable error: ${lastError.message}`);
throw lastError;
// => Abort retries for non-retriable errors
// => Example: assertion failures, 404 errors
}
if (attempt === maxAttempts) {
console.error(`✗ All ${maxAttempts} attempts failed`);
throw lastError;
// => Exhausted all retries, propagate error
// => The original error is rethrown unchanged
}
const delay = Math.min(initialDelay * Math.pow(backoffMultiplier, attempt - 1), maxDelay);
// => Calculate exponential backoff delay
// => attempt=1: initialDelay * 2^0 = initialDelay
// => attempt=2: initialDelay * 2^1 = 2x initialDelay
// => Capped at maxDelay to prevent excessive waits
console.warn(`⚠ Attempt ${attempt} failed: ${lastError.message}. ` + `Retrying in ${delay}ms...`);
// => Log retry attempt for debugging
// => Includes error message and calculated delay
await new Promise((resolve) => setTimeout(resolve, delay));
// => Sleep for calculated delay before retry
// => Prevents hammering rate-limited endpoints
// => Allows transient errors to resolve
}
}
throw lastError!;
// => Unreachable when maxAttempts >= 1 (every iteration returns or throws)
// => The `!` only satisfies TypeScript's definite-assignment check
}
Using custom retry in tests:
// tests/checkout-with-retry.spec.ts
import { test, expect } from "@playwright/test";
import { retryWithBackoff } from "../utils/retry";
// => Import custom retry utility
// => Provides granular control over specific operations
test("checkout with smart retries", async ({ page }) => {
// => Production test with conditional retry logic
// => Only retries transient failures, not logic errors
await page.goto("https://example.com/checkout");
// => Initial navigation uses Playwright auto-wait
// => No custom retry needed for simple navigation
await retryWithBackoff(
async () => {
await page.click('[data-testid="submit-order"]');
// => Click submit button
// => May fail transiently due to animations, loading states
// => Caution: if an earlier click did register, a retry resubmits the order (see Common Pitfalls)
await expect(page.locator(".loading-spinner")).toBeHidden();
// => Wait for spinner to disappear
// => Indicates order processing started
},
{
maxAttempts: 3,
// => 1 initial + 2 retries
// => Sufficient for most transient UI issues
initialDelay: 500,
// => 500ms first retry delay
// => Allows animations/renders to complete
maxDelay: 5000,
// => Cap at 5 seconds
// => Prevents excessive wait on repeated failures
backoffMultiplier: 2.0,
// => Doubles delay: 500ms before attempt 2, 1000ms before attempt 3
// => Standard exponential backoff pattern
shouldRetry: (error) => {
// => Custom retry logic based on error type
// => Prevents retrying non-retriable errors
if (error.name === "TimeoutError") {
return true;
// => Retry timeout errors (slow rendering, network)
// => Playwright sets name "TimeoutError"; the message itself reads "Timeout ...ms exceeded"
}
if (error.message.toLowerCase().includes("element is not visible")) {
return true;
// => Retry visibility errors (animations, lazy loading)
// => Playwright's call log writes this phrase in lowercase
}
return false;
// => Don't retry assertion failures, logic errors
// => Prevents masking real bugs
},
},
);
const confirmationText = await retryWithBackoff(
async () => {
const locator = page.locator(".confirmation");
// => Find confirmation message element
// => May not exist immediately after submission
await expect(locator).toBeVisible();
// => Assert confirmation visible
// => Throws if not visible within expect timeout
const text = await locator.textContent();
// => Extract confirmation text
// => Typed string | null; an empty string means the text has not rendered yet
if (!text) {
throw new Error("Confirmation text is empty");
// => Turn "not ready yet" into an error shouldRetry can recognize
}
return text;
},
{
maxAttempts: 5,
// => More retries for backend-dependent operation
// => Order confirmation requires server processing
// => Worst case: 5 × 5000ms expect timeout + 8125ms of delays ≈ 33s,
// => longer than Playwright's default 30s test timeout (raise it with test.setTimeout())
initialDelay: 1000,
// => Longer initial delay for server processing
// => Backend may need time to persist order
maxDelay: 10000,
// => Higher ceiling for backend delays
// => Production systems can be slow under load
backoffMultiplier: 1.5,
// => Gentler growth than doubling
// => Delays between the 5 attempts: 1000ms, 1500ms, 2250ms, 3375ms
shouldRetry: (error) => {
if (error.message.includes("toBeVisible")) {
return true;
// => Retry failed visibility assertions
// => Playwright's message names the matcher, e.g. expect(locator).toBeVisible()
}
if (error.message === "Confirmation text is empty") {
return true;
// => Retry until the confirmation text has rendered
// => May appear after async updates
}
return false;
// => Don't retry unexpected errors
},
},
);
expect(confirmationText).toContain("Order confirmed");
// => Final assertion on extracted text
// => No retry here (already have confirmed text)
// => Fails immediately if text doesn't match
});
Mermaid Diagram: Retry Decision Flow
flowchart TD
A[Execute Operation] -->|Success| B[Return Result]
A -->|Error| C{Should Retry?}
C -->|No - Non-retriable| D[Throw Error]
C -->|Yes - Check Attempts| E{Attempts < Max?}
E -->|No| F[All Retries Exhausted]
F --> D
E -->|Yes| G[Calculate Backoff Delay]
G --> H["delay = min(initial * multiplier^(attempt - 1), max)"]
H --> J[Log Retry Attempt]
J --> I[Wait for Delay]
I --> A
style A fill:#0173B2,stroke:#014d7a,color:#fff
style B fill:#029E73,stroke:#017a52,color:#fff
style D fill:#DE8F05,stroke:#b67304,color:#fff
style G fill:#CC78BC,stroke:#a35f97,color:#fff
style H fill:#CA9161,stroke:#a3714d,color:#fff
Diagram explanation:
- Blue (Operation): Initial execution attempt
- Green (Success): Successful result returned
- Orange (Failure): Non-retriable or exhausted retries
- Purple (Backoff): Exponential delay calculation
- Brown (Formula): Mathematical backoff formula
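Written out with the same indexing as `retryWithBackoff`, where $d_0$ is `initialDelay`, $m$ is `backoffMultiplier`, $d_{\max}$ is `maxDelay`, and $N$ is `maxAttempts`, the wait before attempt $k+1$ and the total sleep time of a fully failing run are:

```latex
\begin{aligned}
d_k &= \min\!\left(d_0\, m^{\,k-1},\ d_{\max}\right), \qquad k = 1, \dots, N-1 \\
W &= \sum_{k=1}^{N-1} d_k \\
\text{checkout example } (N = 3,\ d_0 = 500\,\text{ms},\ m = 2):\quad W &= 500 + 1000 = 1500\,\text{ms}
\end{aligned}
```

Because the first attempt counts toward `maxAttempts`, there are only $N-1$ waits, and $W$ excludes the time each failing attempt itself takes.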
Production Patterns and Best Practices
Pattern 1: Retry Configuration Hierarchy
Configure retries at different granularity levels for flexibility:
// config/retry-presets.ts
import type { RetryOptions } from "../utils/retry";
// => Type-only import, erased at runtime
// => The annotation below makes the compiler check every preset
export const RETRY_PRESETS: Record<"fast" | "standard" | "api" | "network", RetryOptions> = {
// => Centralized retry configurations
// => Promotes consistency across test suite
// => Easy to adjust globally
// => "Total max wait" counts only the sleeps between attempts (maxAttempts - 1 of them)
fast: {
maxAttempts: 2,
initialDelay: 100,
maxDelay: 1000,
backoffMultiplier: 2.0,
// => Minimal retries for fast-failing operations
// => Local development, simple UI interactions
// => Total max wait: 100ms (a single retry)
},
standard: {
maxAttempts: 3,
initialDelay: 500,
maxDelay: 5000,
backoffMultiplier: 2.0,
// => Default for most test operations
// => Balances reliability and speed
// => Total max wait: 500ms + 1000ms = 1500ms
},
api: {
maxAttempts: 5,
initialDelay: 1000,
maxDelay: 30000,
backoffMultiplier: 2.0,
// => API calls with rate limiting
// => Backend operations requiring processing time
// => Total max wait: 1s + 2s + 4s + 8s = 15s (the 30s cap is never reached)
},
network: {
maxAttempts: 4,
initialDelay: 2000,
maxDelay: 15000,
backoffMultiplier: 1.5,
// => Network-dependent operations
// => Slower backoff for transient network issues
// => Total max wait: 2s + 3s + 4.5s = 9.5s
},
};
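To double-check "Total max wait" figures like these, a small helper can replay the delay formula from `retryWithBackoff` without actually sleeping. This is a sketch: `backoffSchedule`, `worstCaseWait`, and the `RetryOptionsLike` interface are hypothetical names introduced here, and the totals count only the sleeps, not the time each failing attempt takes.

```typescript
// Hypothetical helper types; RetryOptionsLike mirrors the numeric fields of RetryOptions
interface RetryOptionsLike {
  maxAttempts: number;
  initialDelay: number;
  maxDelay: number;
  backoffMultiplier: number;
}

// Delay slept before retry k (k = 1 .. maxAttempts - 1), same formula as retryWithBackoff
function backoffSchedule(o: RetryOptionsLike): number[] {
  const delays: number[] = [];
  for (let k = 1; k < o.maxAttempts; k++) {
    delays.push(Math.min(o.initialDelay * Math.pow(o.backoffMultiplier, k - 1), o.maxDelay));
  }
  return delays;
}

// Total sleep time when every attempt fails (the operations' own run time is not included)
function worstCaseWait(o: RetryOptionsLike): number {
  return backoffSchedule(o).reduce((sum, d) => sum + d, 0);
}

// Copies of the preset values, so this block runs on its own
const presets: Record<string, RetryOptionsLike> = {
  fast: { maxAttempts: 2, initialDelay: 100, maxDelay: 1000, backoffMultiplier: 2.0 },
  standard: { maxAttempts: 3, initialDelay: 500, maxDelay: 5000, backoffMultiplier: 2.0 },
  api: { maxAttempts: 5, initialDelay: 1000, maxDelay: 30000, backoffMultiplier: 2.0 },
  network: { maxAttempts: 4, initialDelay: 2000, maxDelay: 15000, backoffMultiplier: 1.5 },
};

for (const [name, o] of Object.entries(presets)) {
  console.log(`${name}: [${backoffSchedule(o).join(", ")}] = ${worstCaseWait(o)}ms`);
}
```

Running it prints each preset's delays and their sum, which makes it easy to keep preset comments honest after changing a value.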
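// tests/example.spec.ts
import { test } from "@playwright/test";
import { retryWithBackoff } from "../utils/retry";
import { RETRY_PRESETS } from "../config/retry-presets";
// => Import test runner, retry utility, and presets
// => Ensures consistent retry behavior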
test("using preset configurations", async ({ page }) => {
// => Demonstrates preset usage patterns
// => Easy to swap presets for different scenarios
await retryWithBackoff(async () => page.click(".button"), {
...RETRY_PRESETS.fast,
// => Spread operator clones preset
// => Allows customization without mutating preset
shouldRetry: (error) => error.message.includes("Timeout"),
// => Override shouldRetry for specific needs
// => Preset provides base config, custom logic added
});
const apiData = await retryWithBackoff(
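async () =>
page.evaluate(async () => {
const response = await fetch("/api/data");
// => fetch() resolves on HTTP error statuses, so turn them into exceptions
if (!response.ok) throw new Error(`HTTP ${response.status}`);
return response.json();
}),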
RETRY_PRESETS.api,
// => Use api preset as-is
// => No customization needed for standard API calls
);
});
Pattern 2: Exponential Backoff with Jitter
Add randomization to prevent thundering herd in distributed systems:
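// utils/retry-jitter.ts
import type { RetryOptions } from "./retry";
export interface JitterRetryOptions extends RetryOptions {
jitterFactor?: number;
// => Randomization factor (0.0-1.0)
// => Each delay varies by up to ±(jitterFactor / 2) of its base value
// => 0.0 = no jitter, 1.0 = ±50%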
// => Typical: 0.1-0.3 for reasonable variation
}
export async function retryWithJitter<T>(operation: () => Promise<T>, options: JitterRetryOptions): Promise<T> {
// => Enhanced retry with randomized delays
// => Prevents synchronized retries in distributed tests
// => Critical for CI systems running parallel tests
const {
maxAttempts,
initialDelay,
maxDelay,
backoffMultiplier,
jitterFactor = 0.2,
// => Default 20% jitter (reasonable variation)
// => Can be overridden per operation
shouldRetry = () => true,
} = options;
let lastError: Error;
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
try {
const result = await operation();
// => Execute operation, return on success
if (attempt > 1) {
console.log(`✓ Retry succeeded on attempt ${attempt}`);
}
return result;
} catch (error) {
lastError = error as Error;
if (!shouldRetry(lastError)) {
throw lastError;
// => Abort immediately for non-retriable errors
}
if (attempt === maxAttempts) {
console.error(`✗ All ${maxAttempts} attempts failed. ` + `Last error: ${lastError.message}`);
throw lastError;
// => Exhausted retries, propagate final error
}
const baseDelay = Math.min(initialDelay * Math.pow(backoffMultiplier, attempt - 1), maxDelay);
// => Calculate exponential backoff base delay
// => Same formula as basic retry
const jitterAmount = baseDelay * jitterFactor * (Math.random() - 0.5);
// => Calculate random jitter: ±(baseDelay * jitterFactor * 0.5)
// => Example: baseDelay=1000, jitterFactor=0.2
// => jitterAmount = 1000 * 0.2 * [-0.5, 0.5] = [-100, +100]
// => Adds/subtracts up to 10% of baseDelay
const delay = Math.max(0, baseDelay + jitterAmount);
// => Final delay = baseDelay ± jitter
// => Math.max ensures delay never negative
// => Example: 1000ms ± 100ms = [900ms, 1100ms]
console.warn(
`⚠ Attempt ${attempt}/${maxAttempts} failed: ${lastError.message}. ` +
`Retrying in ${delay.toFixed(0)}ms (base: ${baseDelay}ms, ` +
`jitter: ${jitterAmount.toFixed(0)}ms)...`,
);
// => Detailed logging shows jitter calculation
// => Useful for debugging retry patterns
await new Promise((resolve) => setTimeout(resolve, delay));
// => Wait for jittered delay
// => Prevents synchronized retries across parallel tests
}
}
throw lastError!;
}
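Because `Math.random()` makes a jittered delay hard to inspect, one option is to inject the random source. A minimal sketch, assuming a hypothetical `jitteredDelay` helper that uses the same formula as `retryWithJitter`, so the range can be checked deterministically:

```typescript
// Same jitter formula as retryWithJitter, with the random source passed in.
// random must return a value in [0, 1), like Math.random.
function jitteredDelay(
  baseDelay: number,
  jitterFactor: number,
  random: () => number = Math.random,
): number {
  // (random() - 0.5) lies in [-0.5, 0.5), so the offset is at most baseDelay * jitterFactor / 2
  const jitterAmount = baseDelay * jitterFactor * (random() - 0.5);
  // Never sleep a negative amount of time
  return Math.max(0, baseDelay + jitterAmount);
}

// Pin the random source to show the bottom, middle, and top of the range
const lowest = jitteredDelay(1000, 0.3, () => 0); // 1000 - 150 = 850
const middle = jitteredDelay(1000, 0.3, () => 0.5); // no offset: 1000
const nearTop = jitteredDelay(1000, 0.3, () => 0.999); // just under 1150
console.log(lowest, middle, nearTop);
```

With the `jitterFactor: 0.3` used in the parallel checkout tests, the first retry therefore waits somewhere between 850ms and 1150ms.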
// tests/parallel-tests.spec.ts
import { test } from "@playwright/test";
import { retryWithJitter } from "../utils/retry-jitter";
// => Import jitter-enabled retry utility
// => Essential for parallel test execution
test.describe("parallel checkout tests", () => {
test.describe.configure({ mode: "parallel" });
// => Runs the tests in this group in parallel (multiple workers)
// => Replaces the older test.describe.parallel() spelling
// => Risk of thundering herd without jitter
test("checkout test 1", async ({ page }) => {
// => First parallel test instance
await page.goto("https://example.com/checkout");
await retryWithJitter(async () => page.click('[data-testid="submit"]'), {
maxAttempts: 4,
initialDelay: 1000,
maxDelay: 8000,
backoffMultiplier: 2.0,
jitterFactor: 0.3,
// => 30% factor varies each delay by up to ±15%
// => If 10 tests retry simultaneously, jitter spreads load
// => First retry waits anywhere from 850ms to 1150ms
});
});
test("checkout test 2", async ({ page }) => {
// => Second parallel test instance
// => Jitter makes it unlikely that both tests retry at the same moment
await page.goto("https://example.com/checkout");
await retryWithJitter(async () => page.click('[data-testid="submit"]'), {
maxAttempts: 4,
initialDelay: 1000,
maxDelay: 8000,
backoffMultiplier: 2.0,
jitterFactor: 0.3,
// => Same base config as test 1
// => Random jitter creates different actual delays
// => Reduces backend load spikes from synchronized retries
});
});
});
Pattern 3: Conditional Retry with Error Classification
Implement intelligent retry decisions based on error types:
// utils/retry-conditional.ts
import type { RetryOptions } from "./retry";
export enum ErrorCategory {
RETRIABLE = "RETRIABLE",
// => Transient errors safe to retry
// => Examples: timeouts, network errors, rate limits
NON_RETRIABLE = "NON_RETRIABLE",
// => Permanent errors should not retry
// => Examples: assertion failures, 404s, logic bugs
UNKNOWN = "UNKNOWN",
// => Unclassified errors (default to non-retriable)
// => Fail fast to avoid masking unexpected issues
}
export function classifyError(error: Error): ErrorCategory {
// => Analyze error to determine retriability
// => Centralized classification logic
// => Returns category for retry decision
const message = error.message.toLowerCase();
// => Normalize to lowercase for pattern matching
// => Case-insensitive error detection
// Retriable patterns
const retriablePatterns = [
"timeout",
"network",
"connection reset",
"econnrefused",
"enotfound",
"rate limit",
"too many requests",
"service unavailable",
"gateway timeout",
"element is not visible",
"element is not stable",
"element is not enabled",
];
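// => Common transient error patterns
// => Network issues, timing issues, rate limits
// => Safe to retry after delay
// => Caution: substring matching is heuristic; Playwright messages include a call log,
// => so a failed expect() may also mention "timeout"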
if (retriablePatterns.some((pattern) => message.includes(pattern))) {
return ErrorCategory.RETRIABLE;
// => Error matches known retriable pattern
// => Caller should retry with backoff
}
// Non-retriable patterns
const nonRetriablePatterns = [
"assertion failed",
"expected",
"received",
"not found",
"invalid",
"forbidden",
"unauthorized",
"bad request",
];
// => Errors indicating logic bugs or permanent failures
// => Retrying will not fix these issues
// => Fail fast for quick feedback
if (nonRetriablePatterns.some((pattern) => message.includes(pattern))) {
return ErrorCategory.NON_RETRIABLE;
// => Error matches known non-retriable pattern
// => Caller should throw immediately
}
return ErrorCategory.UNKNOWN;
// => Unknown error type, default to non-retriable
// => Conservative approach prevents masking bugs
// => Can be adjusted based on production patterns
}
export async function retryConditional<T>(operation: () => Promise<T>, options: RetryOptions): Promise<T> {
// => Retry with automatic error classification
// => Simplifies test code (no manual shouldRetry)
const {
maxAttempts,
initialDelay,
maxDelay,
backoffMultiplier,
shouldRetry,
// => Optional custom shouldRetry overrides classification
} = options;
let lastError: Error;
const retriedErrors: Array<{ attempt: number; error: Error; category: ErrorCategory }> = [];
// => Track retry history for debugging
// => Useful for identifying patterns in flaky tests
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
try {
const result = await operation();
if (attempt > 1) {
console.log(`✓ Retry succeeded on attempt ${attempt}. ` + `Previous errors: ${retriedErrors.length}`);
// => Log successful retry with error count
// => Helps identify flaky test patterns
retriedErrors.forEach(({ attempt: a, error, category }) => {
console.log(` Attempt ${a}: [${category}] ${error.message}`);
});
// => Log all errors leading to success
// => Useful for debugging intermittent issues
}
return result;
} catch (error) {
lastError = error as Error;
const category = classifyError(lastError);
// => Automatically classify error type
// => Determines retriability without manual logic
retriedErrors.push({ attempt, error: lastError, category });
// => Record error for retry history
// => Enables detailed failure reporting
const isRetriable = shouldRetry?.(lastError) ?? (category === ErrorCategory.RETRIABLE);
// => Use custom shouldRetry when it returns true or false
// => Fall back to classification when it is absent or returns undefined
if (!isRetriable) {
console.error(`✗ Non-retriable error [${category}]: ${lastError.message}`);
throw lastError;
// => Abort retries for non-retriable errors
// => Fail fast for quick feedback
}
if (attempt === maxAttempts) {
console.error(`✗ All ${maxAttempts} attempts failed. Error history:`);
retriedErrors.forEach(({ attempt: a, error, category }) => {
console.error(` Attempt ${a}: [${category}] ${error.message}`);
});
throw lastError;
// => Exhausted retries, log full history
// => Provides debugging context
}
const delay = Math.min(initialDelay * Math.pow(backoffMultiplier, attempt - 1), maxDelay);
console.warn(
`⚠ Attempt ${attempt}/${maxAttempts} failed [${category}]: ` +
`${lastError.message}. Retrying in ${delay}ms...`,
);
await new Promise((resolve) => setTimeout(resolve, delay));
}
}
throw lastError!;
}
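The substring matching in `classifyError` depends on how error messages happen to be worded. When an operation has an HTTP status in hand (for example from Playwright's `APIResponse.status()`), classifying by status code is less fragile. A sketch under that assumption: `classifyHttpStatus` is a hypothetical helper, `ErrorCategory` is redeclared so the block runs on its own, and treating 408, 429, 500, 502, 503, and 504 as retriable is a common convention rather than a rule.

```typescript
enum ErrorCategory {
  RETRIABLE = "RETRIABLE",
  NON_RETRIABLE = "NON_RETRIABLE",
  UNKNOWN = "UNKNOWN",
}

// Status codes that usually signal a transient condition (assumed convention)
const RETRIABLE_STATUSES = new Set([408, 429, 500, 502, 503, 504]);

// Intended for failed responses only
function classifyHttpStatus(status: number): ErrorCategory {
  // Request timeout, rate limiting, temporary server or gateway failures
  if (RETRIABLE_STATUSES.has(status)) return ErrorCategory.RETRIABLE;
  // Remaining 4xx: the request itself is wrong (bad input, auth, missing resource)
  if (status >= 400 && status < 500) return ErrorCategory.NON_RETRIABLE;
  // Anything else (e.g. 501): stay conservative, like classifyError's UNKNOWN
  return ErrorCategory.UNKNOWN;
}

console.log(classifyHttpStatus(429), classifyHttpStatus(404), classifyHttpStatus(501));
```

To use it from `shouldRetry`, the status has to travel with the thrown error, for example as a custom property on the error object.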
// tests/conditional-retry-example.spec.ts
import { test, expect } from "@playwright/test";
import { retryConditional, classifyError, ErrorCategory } from "../utils/retry-conditional";
import { RETRY_PRESETS } from "../config/retry-presets";
// => Import conditional retry, classifier, and presets
// => Simplifies test code with automatic classification
test("checkout with automatic retry classification", async ({ page }) => {
// => Test using conditional retry
// => No manual shouldRetry logic needed
await page.goto("https://example.com/checkout");
await retryConditional(
async () => {
await page.click('[data-testid="submit-order"]');
// => Caution: retrying a submit click can duplicate the order (see Common Pitfalls)
await expect(page.locator(".confirmation")).toBeVisible();
},
RETRY_PRESETS.standard,
// => Uses preset configuration
// => Automatic error classification
// => Retries only errors classified RETRIABLE (timeouts, visibility)
// => Fails fast on errors classified NON_RETRIABLE or UNKNOWN
);
const orderNumber = await retryConditional(
async () => {
const text = await page.locator(".order-number").textContent();
// => Extract order number text
// => Empty until the backend has filled it in
if (!text) {
throw new Error("Order number not found");
// => Throw custom error for missing text
// => Classified NON_RETRIABLE (matches the "not found" pattern)
// => Without the override below, the test would fail immediately
}
return text;
},
{
...RETRY_PRESETS.api,
shouldRetry: (error) => {
// => Override classification for specific case
// => "Order number not found" should be retriable
if (error.message === "Order number not found") {
return true;
// => Treat as retriable (backend processing delay)
// => Overrides the default NON_RETRIABLE classification
}
return classifyError(error) === ErrorCategory.RETRIABLE;
// => Fall back to automatic classification for other errors
// => Keeps shouldRetry's boolean return type honest
},
},
);
expect(orderNumber).toMatch(/^ORD-\d{8}$/);
// => Final assertion on extracted order number
// => Fails immediately (no retry needed)
});
Trade-offs and When to Use
Standard Approach (Playwright Built-in Retries):
- Use when: Simple test suites, development environments, quick prototyping
- Benefits: Zero configuration, works out of box, simple mental model
- Costs: Indiscriminate retries can hide intermittent bugs (a pass on retry is only flagged as flaky), no granular control, slower feedback on real failures
Production Framework (Custom Retry with Backoff):
- Use when: CI/CD pipelines, production monitoring, rate-limited APIs, distributed systems
- Benefits: Granular control, exponential backoff prevents hammering, conditional logic reduces false retries, comprehensive logging and metrics
- Costs: Additional code to maintain, complexity overhead, requires error classification knowledge
Production recommendation: Use Playwright built-in retries as baseline (retries: 2 in CI), then add custom retry utilities for specific high-value operations like API calls, critical user flows, and rate-limited endpoints. This hybrid approach balances simplicity with production resilience. Configure retry presets centrally to promote consistency and make adjustments easy.
Security Considerations
Credential exposure in logs: Never log authentication tokens, API keys, or credentials in retry error messages. Sanitize error messages before logging to prevent sensitive data leakage.
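A minimal sanitizer sketch, assuming secrets show up as `Bearer` tokens or as `token`, `api_key`/`apikey`, `password`, or `secret` key-value pairs. The patterns and the `sanitizeForLog` name are illustrative and should be adapted to the formats an application actually uses:

```typescript
// Illustrative redaction patterns (assumptions, not an exhaustive list)
const REDACTIONS: Array<[RegExp, string]> = [
  // Authorization header values: "Bearer eyJhbGci..."
  [/Bearer\s+[A-Za-z0-9\-._~+\/]+=*/g, "Bearer [REDACTED]"],
  // key=value or key: value pairs for common secret names, keeping the key and separator
  [/\b(api[_-]?key|token|password|secret)(\s*[=:]\s*)[^\s&"',]+/gi, "$1$2[REDACTED]"],
];

function sanitizeForLog(message: string): string {
  let text = message;
  for (const [pattern, replacement] of REDACTIONS) {
    text = text.replace(pattern, replacement);
  }
  return text;
}

// In a retry utility, log the sanitized message instead of the raw one
const raw = "GET /api/data?api_key=abc123 failed: Authorization: Bearer eyJhbGciOi.xyz";
console.warn(`⚠ Attempt 1 failed: ${sanitizeForLog(raw)}`);
```

Only the log line is sanitized; the original error object can still be rethrown so the test report keeps its stack trace.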
Rate limit respect: Implement exponential backoff with appropriate max delays to respect API rate limits. Aggressive retries can trigger security blocks (IP bans, account suspension) or overload protection systems.
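When a rate-limited API answers 429 or 503 with a `Retry-After` header, that value is a better lower bound than the computed backoff. Per HTTP semantics (RFC 9110) the header holds either a number of seconds or an HTTP date. A sketch, assuming the header value has already been read from the response; `retryAfterMs` and `nextDelay` are hypothetical helpers:

```typescript
// Parse a Retry-After value into milliseconds, or undefined if absent or invalid.
// now is injectable so the date form can be tested deterministically.
function retryAfterMs(value: string | null | undefined, now: number = Date.now()): number | undefined {
  if (!value) return undefined;
  const trimmed = value.trim();
  // delay-seconds form, e.g. "120"
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  // HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
  const date = Date.parse(trimmed);
  if (Number.isNaN(date)) return undefined;
  // A date in the past means "retry now"
  return Math.max(0, date - now);
}

// Wait at least as long as the server asks, but never longer than maxDelay
function nextDelay(backoffDelay: number, retryAfter: string | null | undefined, maxDelay: number): number {
  const serverDelay = retryAfterMs(retryAfter) ?? 0;
  return Math.min(Math.max(backoffDelay, serverDelay), maxDelay);
}

console.log(nextDelay(1000, "5", 30000), nextDelay(1000, null, 30000));
```

Capping at `maxDelay` keeps test runtime bounded; if the server asks for much more than that, failing the test is often more useful than waiting.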
Distributed denial of service (DDoS) prevention: Use jitter in parallel test execution so retries do not fire in synchronized waves (the thundering herd problem). Thousands of tests retrying at the same moment can look like a DDoS attack to backend systems.
Error message information disclosure: Error classification depends on detailed error messages, and those messages (plus the retry logs that record them) can expose system internals. Ensure error messages in production don’t reveal sensitive implementation details (database schemas, internal endpoints).
Retry exhaustion monitoring: Track retry patterns and failures to detect potential security incidents (brute force attacks, credential stuffing). Unusual retry patterns may indicate malicious activity or system compromise.
Common Pitfalls
Retrying non-idempotent operations: Retrying non-idempotent requests (typically POST and PATCH) without idempotency keys can create duplicate records, charge users multiple times, or corrupt state. PUT and DELETE are idempotent by HTTP semantics, but only if the server implements them that way. Ensure operations are idempotent before enabling retries, or implement idempotency tokens.
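The essential detail of an idempotency token is that it is generated once, outside the retry loop, so every attempt carries the same value. A self-contained sketch: `FakeOrderApi` stands in for a backend that deduplicates on the key, and both it and `submitWithRetries` are hypothetical. The key format is illustrative; a real suite would more likely use a UUID.

```typescript
// Stand-in for a server that stores one order per idempotency key
class FakeOrderApi {
  readonly orders = new Map<string, string>();
  private calls = 0;

  async createOrder(idempotencyKey: string, item: string): Promise<string> {
    this.calls++;
    // Record the order only the first time this key is seen
    if (!this.orders.has(idempotencyKey)) {
      this.orders.set(idempotencyKey, item);
    }
    // Simulate a lost response on the first call: the write happened, the client saw an error
    if (this.calls === 1) throw new Error("network: connection reset");
    return idempotencyKey;
  }
}

async function submitWithRetries(api: FakeOrderApi, item: string, maxAttempts = 3): Promise<string> {
  // Generated once, so every retry is recognized as the same logical request
  const key = `order-${Date.now()}-${Math.random().toString(36).slice(2)}`;
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await api.createOrder(key, item);
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}
```

Generating the key inside the loop instead would give each attempt a fresh key, and the same run would store two orders.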
Infinite retry loops: Missing max attempt limits or incorrect shouldRetry logic can create infinite loops. Always set maxAttempts, make sure shouldRetry returns false for non-retriable errors, and use an overall timeout as a failsafe.
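An overall deadline caps the total time a retry loop may take, independent of `maxAttempts` and `shouldRetry`. A sketch using `Promise.race`; `withDeadline` is a hypothetical helper, and it stops waiting rather than cancelling the underlying operation:

```typescript
// Reject if the promise has not settled within ms; the timer is cleared either way.
async function withDeadline<T>(promise: Promise<T>, ms: number, label = "operation"): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} exceeded deadline of ${ms}ms`)), ms);
  });
  try {
    // Whichever settles first wins; the losing promise keeps running in the background
    return await Promise.race([promise, deadline]);
  } finally {
    clearTimeout(timer);
  }
}
```

A call such as `await withDeadline(retryWithBackoff(op, RETRY_PRESETS.api), 20_000, "load data")` fails after 20 seconds even if the retry settings would allow more. Inside Playwright tests the test timeout already bounds the whole test; a per-operation deadline fails sooner and with a clearer message.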
Masking real bugs with retries: Overly aggressive retry strategies hide legitimate test failures. If tests consistently need retries to pass, investigate root cause rather than increasing retry attempts. Retries should handle transient issues, not paper over bugs.
Ignoring retry metrics: Operations that retry frequently point to flaky tests or system issues, and without tracking they go unnoticed. Log retry attempts with context (error type, attempt number, delay) and aggregate metrics to identify problematic tests or endpoints.
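A minimal in-memory aggregator sketch for those metrics. `RetryMetrics` and its method names are hypothetical, and a real suite would more likely export the counts to a test reporter or metrics backend than print them:

```typescript
// Hypothetical aggregator; field and method names are illustrative
interface OperationStats {
  calls: number; // times the retry wrapper ran for this operation
  retries: number; // attempts beyond the first, summed over all calls
  exhausted: number; // calls that still failed after the last attempt
}

class RetryMetrics {
  private readonly stats = new Map<string, OperationStats>();

  // Record one finished call: how many attempts it used and whether it succeeded
  record(operation: string, attempts: number, succeeded: boolean): void {
    const s = this.stats.get(operation) ?? { calls: 0, retries: 0, exhausted: 0 };
    s.calls += 1;
    s.retries += Math.max(0, attempts - 1);
    if (!succeeded) s.exhausted += 1;
    this.stats.set(operation, s);
  }

  // Operations ordered by average retries per call, worst first
  report(): Array<{ operation: string; retriesPerCall: number; exhausted: number }> {
    return Array.from(this.stats.entries())
      .map(([operation, s]) => ({ operation, retriesPerCall: s.retries / s.calls, exhausted: s.exhausted }))
      .sort((a, b) => b.retriesPerCall - a.retriesPerCall);
  }
}

// Example: feed it from a retry utility's success and failure paths
const metrics = new RetryMetrics();
metrics.record("submit-order", 1, true);
metrics.record("submit-order", 3, true);
metrics.record("load-confirmation", 5, false);
console.table(metrics.report());
```

Sorting by retries per call rather than by raw retry count keeps frequently called but healthy operations from crowding out rarely called ones that retry every time.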
Fixed-delay retries in production: Using a constant delay (100ms, 100ms, 100ms) instead of exponential backoff (100ms, 200ms, 400ms) keeps pressure on rate-limited systems and gives transient issues little time to resolve. Use exponential backoff with jitter for production systems.