Monitoring
Need production observability? Use Logger, Telemetry, and metrics for comprehensive monitoring.
Prerequisites
- Production Elixir application
- Understanding of processes and supervision
- Completed Advanced Tutorial
Problem
Production applications need visibility into performance, errors, resource usage, and business metrics. You need structured logging for debugging, real-time metrics for monitoring, and telemetry for understanding system behavior without impacting performance.
Challenges:
- Collecting metrics without performance degradation
- Correlating logs across distributed processes
- Monitoring system health in real-time
- Tracking business metrics and SLAs
- Debugging production issues without access to the machine
Solution
Use Telemetry for instrumentation, Logger for structured logging, and external services (Prometheus, Datadog, New Relic) for aggregation and visualization.
How It Works
1. Structured Logging with Logger
Basic logging:
require Logger
defmodule MyApp.UserService do
def create_user(params) do
Logger.info("Creating user", params: sanitize(params))
with {:ok, validated} <- validate(params),
{:ok, user} <- insert_user(validated) do
Logger.info("User created successfully",
user_id: user.id,
email: user.email,
duration_ms: calculate_duration()
)
{:ok, user}
else
{:error, :validation} = error ->
Logger.warning("User validation failed",
errors: inspect(params),
reason: "invalid_data"
)
error
{:error, reason} = error ->
Logger.error("User creation failed",
reason: inspect(reason),
params: sanitize(params),
stacktrace: Process.info(self(), :current_stacktrace)
)
error
end
end
endConfigure Logger:
config :logger,
backends: [:console],
level: :info,
compile_time_purge_matching: [
[level_lower_than: :info] # Remove debug logs in prod
]
config :logger, :console,
format: "$time $metadata[$level] $message\n",
metadata: [:request_id, :user_id, :module, :function, :file, :line],
colors: [enabled: false] # Disable in production
config :logger, :console,
format: {MyApp.LogFormatter, :format},
metadata: :allCustom JSON formatter:
defmodule MyApp.LogFormatter do
def format(level, message, timestamp, metadata) do
%{
timestamp: format_timestamp(timestamp),
level: level,
message: to_string(message),
metadata: Map.new(metadata)
}
|> Jason.encode!()
|> Kernel.<>("\n")
end
defp format_timestamp({date, {h, m, s, ms}}) do
with {:ok, datetime} <- NaiveDateTime.new(date, {h, m, s, ms * 1000}),
{:ok, datetime} <- DateTime.from_naive(datetime, "Etc/UTC") do
DateTime.to_iso8601(datetime)
end
end
end2. Telemetry Events
Instrument your code:
defmodule MyApp.Accounts do
def create_user(params) do
start_time = System.monotonic_time()
result = do_create_user(params)
duration = System.monotonic_time() - start_time
:telemetry.execute(
[:my_app, :accounts, :create_user],
%{duration: duration},
%{result: elem(result, 0), user_id: extract_user_id(result)}
)
result
end
endAttach handlers:
defmodule MyApp.Telemetry do
require Logger
def setup do
events = [
[:my_app, :accounts, :create_user],
[:my_app, :api, :request],
[:phoenix, :endpoint, :stop],
[:my_app, :repo, :query]
]
:telemetry.attach_many(
"my-app-handler",
events,
&handle_event/4,
nil
)
end
def handle_event([:my_app, :accounts, :create_user], measurements, metadata, _config) do
Logger.info("User creation",
duration_ms: measurements.duration / 1_000_000,
result: metadata.result,
user_id: metadata.user_id
)
# Send to metrics system
MyApp.Metrics.histogram("user.create.duration", measurements.duration)
end
def handle_event([:phoenix, :endpoint, :stop], measurements, metadata, _config) do
Logger.info("HTTP request completed",
method: metadata.conn.method,
path: metadata.conn.request_path,
status: metadata.conn.status,
duration_ms: measurements.duration / 1_000_000
)
end
def handle_event([:my_app, :repo, :query], measurements, metadata, _config) do
if measurements.total_time > 100_000_000 do # > 100ms
Logger.warning("Slow query detected",
query: metadata.query,
duration_ms: measurements.total_time / 1_000_000,
source: metadata.source
)
end
end
endStart telemetry in application:
def start(_type, _args) do
MyApp.Telemetry.setup()
children = [
# ... other children
]
Supervisor.start_link(children, strategy: :one_for_one)
end3. Prometheus Metrics
Add dependencies:
{:telemetry_metrics, "~> 0.6"},
{:telemetry_poller, "~> 1.0"},
{:telemetry_metrics_prometheus, "~> 1.1"}Define metrics:
defmodule MyApp.Metrics do
use TelemetryMetricsPrometheus
def metrics do
[
# Counters
counter("my_app.accounts.create_user.count",
event_name: [:my_app, :accounts, :create_user],
tags: [:result]
),
# Distributions (histograms)
distribution("my_app.accounts.create_user.duration",
event_name: [:my_app, :accounts, :create_user],
measurement: :duration,
unit: {:native, :millisecond},
tags: [:result],
buckets: [10, 100, 500, 1000, 5000]
),
# HTTP request metrics
counter("http.requests.total",
event_name: [:phoenix, :endpoint, :stop],
tags: [:method, :status]
),
distribution("http.request.duration",
event_name: [:phoenix, :endpoint, :stop],
measurement: :duration,
unit: {:native, :millisecond},
tags: [:method, :route],
buckets: [10, 50, 100, 200, 500, 1000]
),
# Database metrics
counter("db.queries.total",
event_name: [:my_app, :repo, :query],
tags: [:source, :result]
),
distribution("db.query.duration",
event_name: [:my_app, :repo, :query],
measurement: :total_time,
unit: {:native, :millisecond},
tags: [:source],
buckets: [1, 5, 10, 50, 100, 500]
),
# System metrics
last_value("vm.memory.total",
event_name: [:vm, :memory],
measurement: :total,
unit: :byte
),
last_value("vm.total_run_queue_lengths.total",
event_name: [:vm, :total_run_queue_lengths],
measurement: :total
)
]
end
def setup do
TelemetryMetricsPrometheus.init(
metrics: metrics(),
port: 9568,
name: :my_app_metrics
)
end
endExpose metrics endpoint:
scope "/metrics" do
pipe_through :api
get "/", MyAppWeb.MetricsController, :index
end
defmodule MyAppWeb.MetricsController do
use MyAppWeb, :controller
def index(conn, _params) do
metrics = TelemetryMetricsPrometheus.scrape(:my_app_metrics)
conn
|> put_resp_content_type("text/plain")
|> send_resp(200, metrics)
end
end4. Phoenix LiveDashboard
Add to dependencies:
{:phoenix_live_dashboard, "~> 0.8"}Configure:
import Phoenix.LiveDashboard.Router
scope "/" do
pipe_through :browser
live_dashboard "/dashboard",
metrics: MyApp.Telemetry,
additional_pages: [
route_name: MyCustomPage
]
endDefine telemetry module:
defmodule MyApp.Telemetry do
import Telemetry.Metrics
def metrics do
[
# Phoenix metrics
summary("phoenix.endpoint.stop.duration",
unit: {:native, :millisecond}
),
summary("phoenix.router_dispatch.stop.duration",
tags: [:route],
unit: {:native, :millisecond}
),
# Database metrics
summary("my_app.repo.query.total_time", unit: {:native, :millisecond}),
summary("my_app.repo.query.decode_time", unit: {:native, :millisecond}),
summary("my_app.repo.query.query_time", unit: {:native, :millisecond}),
summary("my_app.repo.query.queue_time", unit: {:native, :millisecond}),
summary("my_app.repo.query.idle_time", unit: {:native, :millisecond}),
# VM metrics
summary("vm.memory.total", unit: {:byte, :kilobyte}),
summary("vm.total_run_queue_lengths.total"),
summary("vm.total_run_queue_lengths.cpu"),
summary("vm.total_run_queue_lengths.io")
]
end
end5. Request ID Tracking
Implement request ID middleware:
defmodule MyAppWeb.Plugs.RequestID do
import Plug.Conn
require Logger
def init(opts), do: opts
def call(conn, _opts) do
request_id = get_request_id(conn)
conn
|> put_resp_header("x-request-id", request_id)
|> Logger.metadata(request_id: request_id)
end
defp get_request_id(conn) do
case get_req_header(conn, "x-request-id") do
[request_id] -> request_id
[] -> generate_request_id()
end
end
defp generate_request_id do
binary = <<
System.system_time(:nanosecond)::64,
:erlang.phash2({node(), self()})::32,
:erlang.unique_integer()::32
>>
Base.url_encode64(binary)
end
end
plug MyAppWeb.Plugs.RequestID6. Error Tracking with Sentry
Add Sentry:
{:sentry, "~> 10.0"}Configure:
config :sentry,
dsn: System.get_env("SENTRY_DSN"),
environment_name: System.get_env("ENV", "production"),
enable_source_code_context: true,
root_source_code_paths: [File.cwd!()],
tags: %{
env: System.get_env("ENV", "production"),
version: Application.spec(:my_app, :vsn)
}
config :logger,
backends: [:console, Sentry.LoggerBackend]Use in code:
try do
risky_operation()
rescue
exception ->
Sentry.capture_exception(exception,
stacktrace: __STACKTRACE__,
extra: %{user_id: user_id, context: additional_context}
)
reraise exception, __STACKTRACE__
end7. Application Performance Monitoring (APM)
New Relic integration:
{:new_relic_agent, "~> 1.27"}config :new_relic_agent,
app_name: "MyApp",
license_key: System.get_env("NEW_RELIC_LICENSE_KEY")
config :new_relic_agent,
automatic_attributes: [:request_id, :user_id]Custom transactions:
defmodule MyApp.BackgroundJob do
use NewRelic.Transaction
@transaction_name "BackgroundJob/process"
def perform(data) do
NewRelic.start_transaction(@transaction_name)
result = process_data(data)
NewRelic.stop_transaction()
result
end
end8. Health Checks
Comprehensive health check:
defmodule MyAppWeb.HealthController do
use MyAppWeb, :controller
require Logger
def index(conn, _params) do
checks = [
database: check_database(),
redis: check_redis(),
external_api: check_external_api()
]
all_healthy = Enum.all?(checks, fn {_name, status} -> status == :ok end)
status_code = if all_healthy, do: 200, else: 503
response = %{
status: if(all_healthy, do: "healthy", else: "unhealthy"),
checks: format_checks(checks),
timestamp: DateTime.utc_now()
}
conn
|> put_status(status_code)
|> json(response)
end
defp check_database do
case Ecto.Adapters.SQL.query(MyApp.Repo, "SELECT 1", [], timeout: 5_000) do
{:ok, _} -> :ok
{:error, _} -> :error
end
catch
:exit, _ -> :error
end
defp check_redis do
case Redix.command(:redix, ["PING"]) do
{:ok, "PONG"} -> :ok
_ -> :error
end
catch
:exit, _ -> :error
end
defp check_external_api do
case HTTPoison.get("https://api.example.com/health", [], recv_timeout: 5_000) do
{:ok, %{status_code: 200}} -> :ok
_ -> :error
end
catch
:exit, _ -> :error
end
defp format_checks(checks) do
Enum.map(checks, fn {name, status} ->
%{
name: name,
status: if(status == :ok, do: "healthy", else: "unhealthy")
}
end)
end
end9. Custom Metrics
Business metrics:
defmodule MyApp.Metrics.Business do
def track_signup(user) do
:telemetry.execute(
[:my_app, :business, :signup],
%{count: 1},
%{
plan: user.plan,
referral_source: user.referral_source,
country: user.country
}
)
end
def track_purchase(order) do
:telemetry.execute(
[:my_app, :business, :purchase],
%{
amount: order.total,
count: 1
},
%{
payment_method: order.payment_method,
currency: order.currency,
user_tier: order.user.tier
}
)
end
def track_feature_usage(feature_name, user) do
:telemetry.execute(
[:my_app, :features, :usage],
%{count: 1},
%{
feature: feature_name,
user_plan: user.plan,
user_tenure_days: calculate_tenure(user)
}
)
end
endVariations
StatsD Integration
{:statix, "~> 1.4"}
config :statix,
host: System.get_env("STATSD_HOST", "localhost"),
port: String.to_integer(System.get_env("STATSD_PORT", "8125"))
Statix.increment("user.signup", 1, tags: ["plan:premium"])
Statix.histogram("api.request.duration", duration_ms)
Statix.gauge("queue.size", queue_length)CloudWatch Integration
{:ex_aws, "~> 2.0"},
{:ex_aws_cloudwatch, "~> 2.0"}
defmodule MyApp.Metrics.CloudWatch do
def put_metric(namespace, metric_name, value, unit \\ "Count") do
ExAws.Cloudwatch.put_metric_data([
%{
metric_name: metric_name,
value: value,
unit: unit,
timestamp: DateTime.utc_now()
}
], namespace)
|> ExAws.request()
end
endAdvanced Patterns
1. Distributed Tracing
{:opentelemetry, "~> 1.0"},
{:opentelemetry_phoenix, "~> 1.0"},
{:opentelemetry_ecto, "~> 1.0"}
config :opentelemetry,
span_processor: :batch,
traces_exporter: :otlp
config :opentelemetry_exporter,
otlp_endpoint: System.get_env("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318")2. Log Aggregation
Centralized logging with structured JSON:
config :logger, :console,
format: {MyApp.LogFormatter, :format},
metadata: :all
Logger.metadata(
deployment: "production",
region: "us-east-1",
instance_id: System.get_env("INSTANCE_ID")
)3. Alerting
defmodule MyApp.Alerts do
require Logger
def check_error_rate do
rate = calculate_error_rate()
if rate > 0.05 do # > 5% error rate
Logger.error("High error rate detected",
error_rate: rate,
alert: true,
severity: "critical"
)
send_pagerduty_alert("High error rate: #{rate}")
end
end
endUse Cases
Debugging:
- Request tracing across microservices
- Identifying slow queries
- Understanding error patterns
Performance:
- Response time distributions
- Resource utilization trends
- Bottleneck identification
Business:
- User signup funnels
- Revenue metrics
- Feature adoption rates
Troubleshooting
Missing Metrics
:telemetry.list_handlers([:my_app, :accounts, :create_user])
:telemetry.list_handlers(:all)High Memory Usage from Logs
config :logger, level: :info
config :logger, :console, format: "$message\n"Best Practices
- Use structured logging: Key-value pairs, not interpolated strings
- Tag metrics appropriately: Enable filtering and grouping
- Set reasonable retention: 30 days for logs, 1 year for metrics
- Monitor the monitors: Alert on monitoring system failures
- Sanitize sensitive data: Never log passwords, tokens, PII
- Use sampling for high-volume events: Don’t log every request
- Set up dashboards: Visualize key metrics
- Define SLOs: Service Level Objectives for uptime, latency
Common Pitfalls
- Logging too much: Disk fills up, costs increase
- Not logging enough: Can’t debug production issues
- Ignoring metrics: Problems go undetected
- Over-alerting: Alert fatigue reduces response
- No request correlation: Can’t trace issues across services