Supervision
Need fault-tolerant applications that recover from crashes automatically? This guide teaches you OTP supervision patterns with restart strategies, child specifications, and hierarchical trees for building resilient systems.
Prerequisites
- Understanding of GenServer
- Basic OTP concepts
- Completed Intermediate Tutorial or equivalent
Problem
Processes crash due to bugs, invalid input, or external failures. Manual crash handling with try/catch is error-prone and doesn’t scale. You need automatic recovery and isolation.
Challenges:
- Recovering from process crashes automatically
- Isolating failures (preventing cascade)
- Choosing appropriate restart strategies
- Managing process dependencies
- Monitoring system health
Solution
Use a Supervisor: an OTP process that monitors child processes and restarts them according to a defined strategy. Supervisors nest into hierarchical trees, so a failure is handled at the lowest level that can recover from it.
Key Concepts
- Supervisor - Monitors and restarts children
- Child Specification - Defines how to start/restart child
- Restart Strategy - Determines which children restart on failure
- Restart Intensity - Limits restart frequency to prevent crash loops
How It Works
1. Basic Supervisor
defmodule MyApp.Supervisor do
use Supervisor
def start_link(init_arg) do
Supervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
end
@impl true
def init(_init_arg) do
children = [
{Counter, 0},
{KeyValueStore, []},
{SessionStore, []}
]
# Restart strategy: if one child dies, restart only that child
Supervisor.init(children, strategy: :one_for_one)
end
end
{:ok, _pid} = MyApp.Supervisor.start_link([])
Anatomy:
- start_link/1 - Start supervisor
- init/1 - Define children and strategy
- children - List of child specs (module + args)
- strategy - How to handle child failures
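The supervisor above refers to a `Counter` child that this guide does not define. The following is a minimal sketch of what such a worker could look like, assuming a GenServer that holds an integer; the `increment/0` and `value/0` functions are hypothetical names chosen for illustration. The script then kills the worker to show that the supervisor brings it back with its initial state.

```elixir
defmodule Counter do
  use GenServer

  # Hypothetical worker: the guide names Counter but does not show its code.
  def start_link(initial) when is_integer(initial) do
    GenServer.start_link(__MODULE__, initial, name: __MODULE__)
  end

  def increment, do: GenServer.call(__MODULE__, :increment)
  def value, do: GenServer.call(__MODULE__, :value)

  @impl true
  def init(initial), do: {:ok, initial}

  @impl true
  def handle_call(:increment, _from, count), do: {:reply, count + 1, count + 1}
  def handle_call(:value, _from, count), do: {:reply, count, count}
end

# Supervise a single Counter, starting at 0.
{:ok, _sup} = Supervisor.start_link([{Counter, 0}], strategy: :one_for_one)

Counter.increment()
old_pid = Process.whereis(Counter)

# Simulate a crash.
Process.exit(old_pid, :kill)

# The restart happens asynchronously, so poll until a new process holds the name.
new_pid =
  Enum.find_value(1..100, fn _attempt ->
    case Process.whereis(Counter) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ ->
        Process.sleep(10)
        nil
    end
  end)

IO.inspect(new_pid != old_pid, label: "restarted with a new pid")
# The restarted process begins again from the initial argument, 0.
IO.inspect(Counter.value(), label: "state after restart")
```

Note that the count accumulated before the crash is lost: a supervisor restarts the process, it does not restore its state.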
2. Restart Strategies
:one_for_one - Restart Only Failed Child
defmodule OneForOneSupervisor do
use Supervisor
def start_link(_opts) do
Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)
end
@impl true
def init(:ok) do
children = [
{WorkerA, []},
{WorkerB, []},
{WorkerC, []}
]
# If WorkerB crashes, only WorkerB restarts
# WorkerA and WorkerC keep running
Supervisor.init(children, strategy: :one_for_one)
end
end
Use when: Children are independent (no shared state).
:one_for_all - Restart All Children
defmodule OneForAllSupervisor do
use Supervisor
def start_link(_opts) do
Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)
end
@impl true
def init(:ok) do
children = [
{DatabaseConnection, []},
{CacheServer, []},
{APIHandler, []}
]
# If ANY child crashes, ALL children restart
# Ensures consistent state across dependent processes
Supervisor.init(children, strategy: :one_for_all)
end
end
Use when: Children are interdependent (must restart together for consistency).
:rest_for_one - Restart Failed Child and Following Children
defmodule RestForOneSupervisor do
use Supervisor
def start_link(_opts) do
Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)
end
@impl true
def init(:ok) do
children = [
{DatabaseConnection, []}, # 1
{CacheServer, []}, # 2 (depends on 1)
{APIHandler, []} # 3 (depends on 2)
]
# If CacheServer crashes:
# - DatabaseConnection keeps running
# - CacheServer restarts
# - APIHandler restarts (depends on CacheServer)
Supervisor.init(children, strategy: :rest_for_one)
end
end
Use when: Children have sequential dependencies (pipeline).
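To compare the three strategies side by side, here is a small experiment. It is a sketch, not part of the original guide: the `StrategyDemo` module and its function name are made up for illustration, and plain `Agent` processes stand in for real workers. It starts children `:a`, `:b`, and `:c` in that order, kills `:b`, and reports which children ended up with a new pid.

```elixir
defmodule StrategyDemo do
  # Starts three Agents (:a, :b, :c) under the given strategy, kills :b,
  # and returns the ids of the children whose pid changed.
  def restarted_after_killing_b(strategy) do
    children =
      for id <- [:a, :b, :c] do
        Supervisor.child_spec({Agent, fn -> id end}, id: id)
      end

    {:ok, sup} = Supervisor.start_link(children, strategy: strategy)
    before = pids(sup)

    Process.exit(before[:b], :kill)
    await_restart(sup, :b, before[:b])

    after_restart = pids(sup)
    Supervisor.stop(sup)

    for id <- [:a, :b, :c], before[id] != after_restart[id], do: id
  end

  # Maps child id => pid for every child of the supervisor.
  defp pids(sup) do
    for {id, pid, _type, _modules} <- Supervisor.which_children(sup), into: %{} do
      {id, pid}
    end
  end

  # Polls until the child with the given id runs under a new pid.
  defp await_restart(sup, id, old_pid, attempts \\ 100) when attempts > 0 do
    case pids(sup)[id] do
      pid when is_pid(pid) and pid != old_pid ->
        :ok

      _ ->
        Process.sleep(10)
        await_restart(sup, id, old_pid, attempts - 1)
    end
  end
end

for strategy <- [:one_for_one, :one_for_all, :rest_for_one] do
  IO.inspect(StrategyDemo.restarted_after_killing_b(strategy), label: to_string(strategy))
end

# one_for_one: [:b]
# one_for_all: [:a, :b, :c]
# rest_for_one: [:b, :c]
```

Under `:rest_for_one`, `:a` survives because it was started before `:b`, which is why the order of the `children` list matters for that strategy.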
3. Child Specifications
Simple Format
{Counter, 0}
{KeyValueStore, [name: :kv_store]}
Full Format with Options
%{
id: Counter, # Unique identifier
start: {Counter, :start_link, [0]}, # {module, function, args}
restart: :permanent, # :permanent | :temporary | :transient
shutdown: 5000, # Timeout in ms or :brutal_kill
type: :worker # :worker | :supervisor
}
Using child_spec/1
defmodule ConfigurableWorker do
use GenServer
def start_link(opts) do
name = Keyword.get(opts, :name, __MODULE__)
GenServer.start_link(__MODULE__, opts, name: name)
end
def child_spec(opts) do
%{
id: Keyword.get(opts, :name, __MODULE__),
start: {__MODULE__, :start_link, [opts]},
restart: Keyword.get(opts, :restart, :permanent),
shutdown: 5000,
type: :worker
}
end
# GenServer callbacks...
def init(opts), do: {:ok, opts}
end
children = [
{ConfigurableWorker, name: :worker1, restart: :transient},
{ConfigurableWorker, name: :worker2, restart: :permanent}
]
4. Restart Types
:permanent - Always Restart
%{
id: CriticalService,
start: {CriticalService, :start_link, []},
restart: :permanent # ALWAYS restart on exit (normal or crash)
}
Use for: Essential services that must always run.
:temporary - Never Restart
%{
id: OneTimeTask,
start: {OneTimeTask, :start_link, []},
restart: :temporary # NEVER restart (even on crash)
}
Use for: Tasks meant to run once.
:transient - Restart Only on Abnormal Exit
%{
id: Worker,
start: {Worker, :start_link, []},
restart: :transient # Restart on crash, NOT on normal exit
}
Use for: Workers that may exit normally but should recover from crashes.
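The difference between the three restart types is easiest to see by ending a child in two ways, once normally and once with a crash. The sketch below does that with an `Agent` standing in for the worker; `RestartTypeDemo` and `restarted?/2` are hypothetical names used only for this illustration.

```elixir
defmodule RestartTypeDemo do
  # Starts one Agent with the given restart type, ends it either normally or
  # with a crash, and returns true if the supervisor started a replacement.
  def restarted?(restart, how) when how in [:normal, :crash] do
    spec = Supervisor.child_spec({Agent, fn -> :state end}, id: :child, restart: restart)
    {:ok, sup} = Supervisor.start_link([spec], strategy: :one_for_one)
    [{:child, old_pid, _type, _modules}] = Supervisor.which_children(sup)

    case how do
      # Agent.stop/1 makes the process exit with reason :normal.
      :normal -> Agent.stop(old_pid)
      # An untrappable kill counts as an abnormal exit.
      :crash -> Process.exit(old_pid, :kill)
    end

    result =
      case await_exit_handled(sup, old_pid) do
        [{:child, new_pid, _type, _modules}] when is_pid(new_pid) -> true
        _not_running -> false
      end

    Supervisor.stop(sup)
    result
  end

  # Polls until the supervisor no longer lists the old pid, meaning it has
  # processed the child's exit, then returns the current child list.
  defp await_exit_handled(sup, old_pid) do
    children = Supervisor.which_children(sup)

    if Enum.any?(children, fn {_id, pid, _type, _modules} -> pid == old_pid end) do
      Process.sleep(10)
      await_exit_handled(sup, old_pid)
    else
      children
    end
  end
end

for restart <- [:permanent, :transient, :temporary], how <- [:normal, :crash] do
  IO.puts("#{restart} / #{how}: restarted? #{RestartTypeDemo.restarted?(restart, how)}")
end
```

Only `:permanent` comes back after a normal exit, `:transient` comes back only after the crash, and `:temporary` never comes back.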
5. Dynamic Supervisors
For dynamically starting/stopping children at runtime:
defmodule TaskSupervisor do
use DynamicSupervisor
def start_link(init_arg) do
DynamicSupervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
end
@impl true
def init(_init_arg) do
DynamicSupervisor.init(strategy: :one_for_one)
end
# Client API
def start_task(task_fun) do
child_spec = %{
id: Task,
start: {Task, :start_link, [task_fun]},
restart: :temporary
}
DynamicSupervisor.start_child(__MODULE__, child_spec)
end
def stop_task(pid) do
DynamicSupervisor.terminate_child(__MODULE__, pid)
end
def list_tasks do
DynamicSupervisor.which_children(__MODULE__)
end
end
{:ok, _sup_pid} = TaskSupervisor.start_link([])
{:ok, task_pid} = TaskSupervisor.start_task(fn ->
Process.sleep(1000)
IO.puts("Task complete!")
end)
TaskSupervisor.list_tasks() # [{:undefined, task_pid, :worker, [Task]}]
6. Hierarchical Supervision Trees
defmodule MyApp.Application do
use Application
@impl true
def start(_type, _args) do
children = [
# Subsystem supervisors, started in order by the top-level supervisor
{MyApp.DatabaseSupervisor, []},
{MyApp.CacheSupervisor, []},
{MyApp.WebSupervisor, []},
{MyApp.WorkerSupervisor, []}
]
opts = [strategy: :one_for_one, name: MyApp.Supervisor]
Supervisor.start_link(children, opts)
end
end
defmodule MyApp.DatabaseSupervisor do
use Supervisor
def start_link(_opts) do
Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)
end
@impl true
def init(:ok) do
children = [
{MyApp.Repo, []},
{MyApp.DatabaseMonitor, []}
]
Supervisor.init(children, strategy: :one_for_all)
end
end
defmodule MyApp.WebSupervisor do
use Supervisor
def start_link(_opts) do
Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)
end
@impl true
def init(:ok) do
children = [
{Plug.Cowboy, scheme: :http, plug: MyApp.Router, options: [port: 4000]},
{MyApp.Endpoint, []},
{Phoenix.PubSub, name: MyApp.PubSub}
]
Supervisor.init(children, strategy: :one_for_one)
end
end
Tree Structure:
MyApp.Supervisor (one_for_one)
├─ MyApp.DatabaseSupervisor (one_for_all)
│ ├─ MyApp.Repo
│ └─ MyApp.DatabaseMonitor
├─ MyApp.CacheSupervisor
├─ MyApp.WebSupervisor (one_for_one)
│ ├─ Plug.Cowboy
│ ├─ MyApp.Endpoint
│ └─ Phoenix.PubSub
└─ MyApp.WorkerSupervisor
7. Max Restarts and Intensity
Prevent infinite crash loops with restart intensity limits:
defmodule LimitedSupervisor do
use Supervisor
def start_link(_opts) do
Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)
end
@impl true
def init(:ok) do
children = [
{FlakyWorker, []}
]
# Max 3 restarts in 5 seconds
# If exceeded, Supervisor itself crashes (escalates to parent)
Supervisor.init(
children,
strategy: :one_for_one,
max_restarts: 3,
max_seconds: 5
)
end
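end
Default: max_restarts: 3, max_seconds: 5
When limit exceeded: The supervisor shuts down its remaining children and exits. The failure then escalates to its parent supervisor, which applies its own restart strategy, so a crash loop stays contained in one branch of the tree.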
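The escalation can be observed directly. In this sketch a `Task` that exits immediately plays the flaky worker, and the calling process traps exits so it can see the supervisor give up. `IntensityDemo` is a hypothetical module written for this illustration.

```elixir
defmodule IntensityDemo do
  # Supervises a child that crashes right after every start and returns
  # {number_of_starts, supervisor_exit_reason}.
  def run(max_restarts) do
    # Trap exits so the linked supervisor's exit arrives as a message
    # instead of taking this process down with it.
    Process.flag(:trap_exit, true)
    parent = self()

    flaky = fn ->
      send(parent, :started)
      exit(:boom)
    end

    # Task children default to restart: :temporary, so override it.
    children = [Supervisor.child_spec({Task, flaky}, restart: :permanent)]

    {:ok, sup} =
      Supervisor.start_link(children,
        strategy: :one_for_one,
        max_restarts: max_restarts,
        max_seconds: 5
      )

    reason =
      receive do
        {:EXIT, ^sup, reason} -> reason
      after
        5_000 -> :still_running
      end

    {count_starts(0), reason}
  end

  # Drains the :started messages that the child sent on each start.
  defp count_starts(n) do
    receive do
      :started -> count_starts(n + 1)
    after
      0 -> n
    end
  end
end

IO.inspect(IntensityDemo.run(3))
# {4, :shutdown}
```

With `max_restarts: 3` the child runs four times, the initial start plus three restarts. The fourth crash would need a fourth restart inside the 5-second window, so the supervisor exits with reason `:shutdown` instead.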
8. Shutdown Strategies
Control how children are terminated:
children = [
# Worker with 5 second graceful shutdown
%{
id: Worker,
start: {Worker, :start_link, []},
shutdown: 5000 # Wait 5s for graceful termination
},
# Worker with brutal kill (terminated at once via Process.exit(pid, :kill), no cleanup runs)
%{
id: FastWorker,
start: {FastWorker, :start_link, []},
shutdown: :brutal_kill
},
# Supervisor with infinity timeout (wait for all children)
%{
id: SubSupervisor,
start: {SubSupervisor, :start_link, []},
type: :supervisor,
shutdown: :infinity # Supervisors should use :infinity
}
]
Variations
Registry-Based Dynamic Supervisors
defmodule SessionSupervisor do
use DynamicSupervisor
def start_link(_opts) do
DynamicSupervisor.start_link(__MODULE__, :ok, name: __MODULE__)
end
@impl true
def init(:ok) do
DynamicSupervisor.init(strategy: :one_for_one)
end
def start_session(user_id) do
child_spec = {Session, user_id}
DynamicSupervisor.start_child(__MODULE__, child_spec)
end
def stop_session(pid) do
DynamicSupervisor.terminate_child(__MODULE__, pid)
end
end
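defmodule SessionRegistry do
  # Assumes {Registry, keys: :unique, name: SessionRegistry} is started in the
  # supervision tree, and that Session.start_link/1 registers itself with
  # name: SessionRegistry.via(user_id). Registry.register/3 always registers
  # the calling process, so it cannot register a session on its behalf.
  def via(user_id), do: {:via, Registry, {SessionRegistry, user_id}}
  def start_session(user_id) do
    case SessionSupervisor.start_session(user_id) do
      {:ok, pid} -> {:ok, pid}
      {:error, {:already_started, pid}} -> {:ok, pid}
      error -> error
    end
  end
  def lookup_session(user_id) do
    case Registry.lookup(SessionRegistry, user_id) do
      [{pid, _value}] -> {:ok, pid}
      [] -> {:error, :not_found}
    end
  end
end
Task.Supervisor for Concurrent Tasks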
children = [
{Task.Supervisor, name: MyApp.TaskSupervisor}
]
Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
# Supervised task (crashes isolated)
perform_work()
end)
Task.Supervisor.async(MyApp.TaskSupervisor, fn ->
# Supervised async task with result
fetch_data()
end) |> Task.await()
Pitfalls
Wrong Restart Strategy
children = [
{Database, []}, # 1
{Cache, []}, # 2 (needs Database)
{APIHandler, []} # 3 (needs Cache)
]
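# Bad: if Database restarts, Cache and APIHandler keep running against stale state
Supervisor.init(children, strategy: :one_for_one)
# Good: a crash also restarts every child started after the one that failed
Supervisor.init(children, strategy: :rest_for_one)
Restart Loops (No Intensity Limit)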
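# Risky: relies on the defaults (3 restarts in 5 seconds) without checking that they fit the workload
Supervisor.init(children, strategy: :one_for_one)
# Better: set the limits explicitly, based on how often the children are expected to fail
Supervisor.init(
children,
strategy: :one_for_one,
max_restarts: 10,
max_seconds: 60
)
Blocking init/1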
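# Bad: the supervisor's init/1 waits on a slow HTTP call
def init(:ok) do
  # This blocks supervision tree startup!
  result = HTTPoison.get!("http://slow-service.com/config")
  config = parse_config(result)
  children = [{Worker, config}]
  Supervisor.init(children, strategy: :one_for_one)
end
# Good: the supervisor starts the worker without waiting for the config
def init(:ok) do
  children = [{Worker, :fetch_config_async}]
  Supervisor.init(children, strategy: :one_for_one)
end
# In Worker (a GenServer): return from init/1 at once and defer the fetch.
# Fetching inside init/1 would still block the supervisor, because
# start_link does not return until init/1 does.
def init(:fetch_config_async) do
  {:ok, %{config: nil}, {:continue, :fetch_config}}
end
def handle_continue(:fetch_config, state) do
  # Runs after start_link has returned
  config = HTTPoison.get!("http://slow-service.com/config") |> parse_config()
  {:noreply, %{state | config: config}}
end
Not Using Hierarchical Trees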
# Bad: one flat supervisor, so unrelated processes share a single restart budget
children = [
{DatabaseConnection, []},
{DatabaseMonitor, []},
{CacheServer, []},
{Worker1, []},
{Worker2, []},
{Worker3, []}
]
# Good: one supervisor per subsystem, so failures stay inside their own branch
children = [
{DatabaseSupervisor, []}, # Manages DB connection + monitor
{CacheSupervisor, []}, # Manages cache-related processes
{WorkerSupervisor, []} # Manages worker pool
]
Use Cases
Application Startup:
- Supervise all critical services
- Ensure consistent startup order
- Handle initialization failures
Worker Pools:
- Dynamically start/stop workers
- Isolate worker failures
- Manage resource limits
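As a rough sketch of the worker-pool use case, a DynamicSupervisor can cap the number of live workers with its `:max_children` option. The `WorkerPool` module and its functions are hypothetical, and an `Agent` stands in for a real worker.

```elixir
defmodule WorkerPool do
  use DynamicSupervisor

  def start_link(opts) do
    DynamicSupervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(opts) do
    DynamicSupervisor.init(
      strategy: :one_for_one,
      # Resource limit: start_child/2 returns {:error, :max_children} beyond this.
      max_children: Keyword.get(opts, :max_children, 2)
    )
  end

  # Starts a worker holding the given state (an Agent, for illustration).
  def start_worker(initial_state) do
    spec = %{
      id: :worker,
      start: {Agent, :start_link, [fn -> initial_state end]},
      restart: :transient
    }

    DynamicSupervisor.start_child(__MODULE__, spec)
  end

  def stop_worker(pid), do: DynamicSupervisor.terminate_child(__MODULE__, pid)

  def count, do: DynamicSupervisor.count_children(__MODULE__).active
end

{:ok, _sup} = WorkerPool.start_link(max_children: 2)
{:ok, w1} = WorkerPool.start_worker(:job_1)
{:ok, _w2} = WorkerPool.start_worker(:job_2)

# The pool is full, so a third worker is refused rather than queued.
{:error, :max_children} = WorkerPool.start_worker(:job_3)

# Stopping a worker frees a slot.
:ok = WorkerPool.stop_worker(w1)
{:ok, _w3} = WorkerPool.start_worker(:job_3)

IO.inspect(WorkerPool.count(), label: "active workers")
```

The limit rejects excess work instead of queueing it, so callers need their own policy for the `{:error, :max_children}` case, such as retrying later or reporting back-pressure.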
Connection Management:
- Database connection pools
- External API client supervision
- WebSocket connection supervision
Fault Isolation:
- Prevent cascading failures
- Automatic recovery from crashes
- System resilience
Related Resources
- GenServer Guide - Building supervised workers
- Intermediate Tutorial - Supervision fundamentals
- Error Handling - Complementary error strategies
- Cookbook - Supervision recipes
Next Steps
- Build multi-level supervision tree for your application
- Experiment with different restart strategies
- Implement dynamic worker pool with DynamicSupervisor
- Learn OTP Application behavior for full app supervision
- Study Phoenix supervision tree structure