Observability Beyond Dashboards
A wall of green dashboards tells you the things you already thought to measure are fine. It says nothing about the failure you didn't anticipate, which is, by definition, the one that pages you. Real observability is the ability to ask new questions of your running system without shipping new code.
That capability rests on three signals (metrics, logs, and traces), plus the discipline to define what "good" actually means. Let's take them in turn, then watch them move.
Metrics, logs, traces, and what each is for
- Metrics are cheap, aggregated numbers over time: request rate, error rate, latency, saturation. They tell you that something is wrong, fast.
- Logs are discrete events with context. They tell you what happened on a specific request. Structured (JSON) logs are queryable; unstructured text is a graveyard.
- Traces follow a single request across services. In a distributed system they tell you where the time went and which hop failed.
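To make the "structured logs are queryable" point concrete, here is a minimal sketch using only Python's standard `logging` module. The logger name `payment-api` and the fields `trace_id`, `route`, and `status` are assumptions chosen for illustration, not a required schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed via `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
            "route": getattr(record, "route", None),
            "status": getattr(record, "status", None),
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Attach a JSON-formatting handler that writes to `stream`."""
    logger = logging.getLogger("payment-api")  # hypothetical service name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def emit_example() -> dict:
    """Log one failed request and parse the emitted line back into a dict."""
    buffer = io.StringIO()
    log = build_logger(buffer)
    log.error(
        "charge failed",
        extra={"trace_id": "abc123", "route": "/charge", "status": 502},
    )
    return json.loads(buffer.getvalue())
```

Because every line is a JSON object, a log backend can filter on `status` or join on `trace_id` instead of grepping free text.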
The three connect, or they're three silos
A spike on a metric should link to the traces in that window, which link to the logs for those requests. If your tooling can't pivot between them, you have three dashboards, not observability.
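The pivot described above can be sketched with toy in-memory data. Everything here (the `Span` and `LogEvent` shapes, the sample records, the window bounds) is hypothetical; the point is only that a shared `trace_id` is what lets a metric's time window lead to traces and then to logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    trace_id: str
    service: str
    start_s: float      # start time, seconds on some shared clock
    duration_ms: float
    error: bool


@dataclass(frozen=True)
class LogEvent:
    trace_id: str
    message: str


def traces_in_window(spans, start_s, end_s):
    """Trace IDs with at least one failing span that started inside the window."""
    return sorted(
        {s.trace_id for s in spans if start_s <= s.start_s < end_s and s.error}
    )


def logs_for_traces(logs, trace_ids):
    """Log events that belong to any of the given traces."""
    wanted = set(trace_ids)
    return [e for e in logs if e.trace_id in wanted]


def pivot(spans, logs, start_s, end_s):
    """Metric spike window -> failing traces -> their log messages."""
    ids = traces_in_window(spans, start_s, end_s)
    return {tid: [e.message for e in logs_for_traces(logs, [tid])] for tid in ids}


# Toy data: two slow, failing payment spans inside the spike window.
SPANS = [
    Span("t1", "gateway", 100.0, 40.0, False),
    Span("t2", "payments", 130.0, 2100.0, True),
    Span("t3", "payments", 135.0, 1900.0, True),
    Span("t4", "payments", 300.0, 35.0, False),
]
LOGS = [
    LogEvent("t2", "upstream timeout calling card processor"),
    LogEvent("t3", "upstream timeout calling card processor"),
    LogEvent("t4", "charge ok"),
]
```

Calling `pivot(SPANS, LOGS, 120.0, 180.0)` walks from the spike window to traces `t2` and `t3` and their log lines. Without a shared identifier across the three signals, that walk is manual.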
The signals, live
Below is a live service dashboard: the golden signals updating in real time against an SLO. Hit "Inject incident" to watch latency and error rate climb, the SLO burn rate accelerate, and the panel cross from healthy into alerting. This is the shape of every incident you've ever been paged for.
SLOs turn noise into decisions
An alert that fires on "CPU > 80%" wakes you for a non-problem. An alert that fires on "we're burning the error budget fast enough to miss our reliability target" wakes you for something that matters. That's the difference SLOs make: they tie alerting to user-visible reliability instead of arbitrary thresholds.
    slo: payment-api-availability
    objective: 99.9   # 43m of allowed downtime / month
    indicator:
      good: "http_requests{code!~'5..'}"
      total: "http_requests"
    alerting:
      # page only on FAST burn, multi-window, multi-burn-rate
      - { burn_rate: 14.4, window: 1h, severity: page }
      - { burn_rate: 6, window: 6h, severity: ticket }
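The numbers in that config follow from simple arithmetic, sketched below. The sketch assumes a 30-day budget period, and the function names and `RULES` table are illustrative, not part of any particular SLO tool. Burn rate is the observed error ratio divided by the error ratio the objective allows, so a burn rate of 1 spends the budget exactly over the full period.

```python
def error_budget_minutes(objective_pct: float, period_days: int = 30) -> float:
    """Allowed 'bad' minutes in the period for an availability objective."""
    return (1 - objective_pct / 100) * period_days * 24 * 60


def burn_rate(good: int, total: int, objective_pct: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total == 0:
        return 0.0  # no traffic, nothing burned
    observed_error_ratio = 1 - good / total
    allowed_error_ratio = 1 - objective_pct / 100
    return observed_error_ratio / allowed_error_ratio


def budget_consumed(rate: float, window_hours: float, period_days: int = 30) -> float:
    """Fraction of the period's budget spent if `rate` holds for the whole window."""
    return rate * window_hours / (period_days * 24)


# Mirrors the config above: (burn-rate threshold, window label, severity).
RULES = [(14.4, "1h", "page"), (6.0, "6h", "ticket")]


def evaluate(rates_by_window: dict, rules=RULES) -> list:
    """Severities whose window's burn rate meets or exceeds its threshold."""
    return [
        severity
        for threshold, window, severity in rules
        if rates_by_window.get(window, 0.0) >= threshold
    ]
```

With these definitions, a 99.9% objective allows roughly 43.2 minutes per 30 days. A burn rate of 14.4 held for one hour spends about 2% of the monthly budget, and a burn rate of 6 held for six hours spends about 5%. A service returning 2% errors against that objective is burning at roughly 20x, which would page under the first rule.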
Instrument once, with OpenTelemetry
The pragmatic default now is OpenTelemetry: instrument your code against one vendor-neutral standard and export metrics, logs, and traces to whatever backend you choose. It decouples your instrumentation from your tooling, so switching observability vendors doesn't mean re-instrumenting the entire estate.
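The decoupling idea can be illustrated with a standard-library-only toy. This is not the OpenTelemetry API; every class and function name below is hypothetical. It only shows the shape of the pattern: application code talks to one neutral interface, and the backend is a swappable exporter chosen at startup.

```python
from typing import Protocol


class Exporter(Protocol):
    """Anything that can ship a finished span to some backend."""

    def export(self, name: str, attributes: dict) -> None: ...


class ConsoleExporter:
    """Backend A: print spans to stdout."""

    def export(self, name: str, attributes: dict) -> None:
        print(f"span={name} {attributes}")


class InMemoryExporter:
    """Backend B: keep spans in a list (handy for tests)."""

    def __init__(self) -> None:
        self.spans = []

    def export(self, name: str, attributes: dict) -> None:
        self.spans.append((name, dict(attributes)))


class Telemetry:
    """The only telemetry object application code ever touches."""

    def __init__(self, exporter: Exporter) -> None:
        self._exporter = exporter

    def record_span(self, name: str, **attributes) -> None:
        self._exporter.export(name, attributes)


def charge(telemetry: Telemetry, amount_cents: int) -> bool:
    """Business logic, instrumented once against the neutral interface."""
    ok = amount_cents > 0
    telemetry.record_span("charge", amount_cents=amount_cents, ok=ok)
    return ok
```

Swapping `ConsoleExporter` for `InMemoryExporter` changes where the data goes without touching `charge`. That is the property the paragraph above attributes to instrumenting against a vendor-neutral standard.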
Key takeaways
- Metrics say that something broke, traces say where, logs say what.
- The three signals must interconnect; otherwise they're three silos.
- Alert on error-budget burn rate, not arbitrary CPU thresholds.
- Instrument once with OpenTelemetry to decouple code from your backend choice.