
Observability Beyond Dashboards

A wall of green panels tells you that what you already thought to measure is fine. Real observability is asking new questions of a live system without shipping new code.

Cloud X Ops Team, DevOps & SRE Consultancy
April 30, 2026

A wall of green dashboards tells you the things you already thought to measure are fine. It says nothing about the failure you didn't anticipate, which is, by definition, the one that pages you. Real observability is the ability to ask new questions of your running system without shipping new code.

That capability rests on three signals (metrics, logs, and traces) plus the discipline to define what "good" actually means. Let's take them in turn, then watch them move.

Metrics, logs, traces, and what each is for

  • Metrics are cheap, aggregated numbers over time: request rate, error rate, latency, saturation. They tell you that something is wrong, fast.
  • Logs are discrete events with context. They tell you what happened on a specific request. Structured (JSON) logs are queryable; unstructured text is a graveyard.
  • Traces follow a single request across services. In a distributed system they tell you where the time went and which hop failed.
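
To make the "structured logs are queryable" point concrete, here is a minimal sketch using only Python's standard library. The logger name, the field names (`trace_id`, `route`, `status`), and the sample events are assumptions made up for illustration, not part of any particular service.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed via `extra=` land on the record as attributes.
            "trace_id": getattr(record, "trace_id", None),
            "route": getattr(record, "route", None),
            "status": getattr(record, "status", None),
        }
        return json.dumps(payload)


def build_logger(stream: io.StringIO) -> logging.Logger:
    """Hypothetical demo logger that writes JSON lines to an in-memory stream."""
    logger = logging.getLogger("payment-api-demo")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)
log.info("charge ok", extra={"trace_id": "t-1", "route": "/charge", "status": 200})
log.error("charge failed", extra={"trace_id": "t-2", "route": "/charge", "status": 502})

# Because every line is JSON, "which requests failed?" is a filter, not a regex.
events = [json.loads(line) for line in stream.getvalue().splitlines()]
failed = [e["trace_id"] for e in events if e["status"] >= 500]
print(failed)
```

Carrying a `trace_id` on every event is what later lets a log line be joined back to the trace it belongs to.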

The three connect, or they're three silos

A spike on a metric should link to the traces in that window, which link to the logs for those requests. If your tooling can't pivot between them, you have three dashboards, not observability.
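
The pivot described above can be sketched as two small lookups joined on a shared trace ID. This is an in-memory illustration only: the `Span` and `LogEvent` types, their fields, and the sample data are hypothetical, and a real backend would run the same joins as queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    trace_id: str
    service: str
    start_s: float      # seconds since an arbitrary epoch
    duration_ms: float
    error: bool


@dataclass(frozen=True)
class LogEvent:
    trace_id: str
    message: str


def traces_in_window(spans, start_s, end_s):
    """Trace IDs with at least one failing span that started inside the window."""
    return sorted({s.trace_id for s in spans if start_s <= s.start_s < end_s and s.error})


def logs_for_traces(logs, trace_ids):
    """Log events that share a trace ID with the selected traces."""
    wanted = set(trace_ids)
    return [e for e in logs if e.trace_id in wanted]


spans = [
    Span("t-1", "gateway", 100.0, 40.0, False),
    Span("t-2", "gateway", 130.0, 900.0, True),
    Span("t-2", "ledger", 130.2, 850.0, True),
    Span("t-3", "gateway", 400.0, 35.0, False),
]
logs = [
    LogEvent("t-1", "charge ok"),
    LogEvent("t-2", "ledger timeout after 800ms"),
    LogEvent("t-3", "charge ok"),
]

# Step 1: the metric spiked between t=120s and t=180s, so select failing traces there.
suspect = traces_in_window(spans, 120.0, 180.0)
# Step 2: pull only the logs that belong to those traces.
evidence = logs_for_traces(logs, suspect)
print(suspect, [e.message for e in evidence])
```

Without the shared `trace_id`, step 2 has nothing to join on, which is the "three silos" failure mode.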

The signals, live

Below is a live service dashboard: the golden signals updating in real time against an SLO. Hit "inject incident" to watch latency and error rate climb, the SLO burn rate accelerate, and the panel cross from healthy into alerting. This is the shape of every incident you've ever been paged for.

[Interactive panel: payment-api golden signals, status healthy. Latency p99: 42 ms. Error rate: 0.2%. Throughput: 1.2k/s. SLO: 99.9% availability, error budget: 100%, burn rate 1.0× (nominal). Caption: live golden signals plus error-budget burn.]
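
For readers without the interactive panel, here is a rough sketch of how three of the signals it shows (latency p99, error rate, throughput) could be derived from a window of raw requests. The `Request` type, the nearest-rank percentile method, and the synthetic data are assumptions for illustration. Saturation is omitted because it comes from resource metrics rather than request records.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    latency_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_s):
    """Summarise one time window of requests into three headline numbers."""
    latencies = [r.latency_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p99_ms": percentile(latencies, 99),
        "error_rate": errors / len(requests),
        "throughput_rps": len(requests) / window_s,
    }


# 1,000 synthetic requests in a 10-second window: 998 fast successes, 2 slow 5xx.
window = [Request(40.0, 200)] * 998 + [Request(950.0, 503)] * 2
signals = golden_signals(window, window_s=10)
print(signals)
```

In this sample the p99 stays at 40 ms even though two requests took 950 ms, because only 0.2% of requests were slow. That is one reason to watch error rate alongside latency rather than either alone.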

SLOs turn noise into decisions

An alert that fires on "CPU > 80%" wakes you for a non-problem. An alert that fires on "we're burning the error budget fast enough to miss our reliability target" wakes you for something that matters. That's the difference SLOs make: they tie alerting to user-visible reliability instead of arbitrary thresholds.
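
The arithmetic behind "burning the error budget" is short enough to sketch. The function names below are hypothetical and a 30-day period is assumed. Burn rate here means the observed error ratio divided by the error ratio the objective allows.

```python
def error_budget_minutes(objective_pct: float, period_days: int = 30) -> float:
    """Minutes of full unavailability the objective allows over the period."""
    return (1 - objective_pct / 100) * period_days * 24 * 60


def burn_rate(bad: int, total: int, objective_pct: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows.

    1.0 means the budget runs out exactly at the end of the period;
    higher values exhaust it proportionally sooner.
    """
    if total == 0:
        return 0.0
    allowed = 1 - objective_pct / 100
    return (bad / total) / allowed


# A 99.9% objective leaves roughly 43 minutes of downtime in a 30-day month.
print(round(error_budget_minutes(99.9), 1))

# 144 failures out of 10,000 requests is a 1.44% error ratio,
# which is 14.4 times the 0.1% the objective allows.
print(round(burn_rate(bad=144, total=10_000, objective_pct=99.9), 1))
```

A CPU threshold has no equivalent of this ratio: it says nothing about whether users saw failures, which is why it pages for non-problems.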

slo.yaml
slo: payment-api-availability
objective: 99.9          # 43m of allowed downtime / month
indicator:
  good:  "http_requests{code!~'5..'}"
  total: "http_requests"
alerting:
  # multi-window, multi-burn-rate: page only on FAST burn, ticket on slower burn
  - { burn_rate: 14.4, window: 1h, severity: page }
  - { burn_rate: 6,    window: 6h, severity: ticket }
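
To show what the two thresholds in this config mean in practice, here is a hedged sketch that assumes a 30-day SLO period. The `BurnRule` type and `firing` helper are hypothetical names, and each rule is checked against a single window for brevity, whereas a full multi-window setup would also require a shorter confirming window.

```python
from dataclasses import dataclass

PERIOD_HOURS = 30 * 24  # assumed 30-day SLO period


@dataclass(frozen=True)
class BurnRule:
    burn_rate: float
    window_hours: float
    severity: str

    def budget_consumed(self) -> float:
        """Fraction of the period's budget spent if this rate holds for the full window."""
        return self.burn_rate * self.window_hours / PERIOD_HOURS

    def hours_to_exhaustion(self) -> float:
        """How long a full budget lasts at this burn rate."""
        return PERIOD_HOURS / self.burn_rate


# Mirrors the two alerting entries in the config above.
RULES = [
    BurnRule(14.4, 1, "page"),
    BurnRule(6, 6, "ticket"),
]


def firing(observed: dict) -> list:
    """Severities whose threshold is met, given observed burn rate keyed by window hours."""
    return [r.severity for r in RULES if observed.get(r.window_hours, 0.0) >= r.burn_rate]


for rule in RULES:
    print(
        rule.severity,
        f"{rule.budget_consumed():.0%} of budget per window,",
        f"exhausted in {rule.hours_to_exhaustion():.0f}h",
    )

# Fast burn over the last hour, but the 6-hour average is still below its threshold.
print(firing({1: 16.0, 6: 4.0}))
```

Under these assumptions, 14.4× for one hour spends about 2% of the monthly budget and would exhaust it in about 50 hours, while 6× for six hours spends about 5% and would exhaust it in about 120 hours.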
"You don't rise to the level of your dashboards; you fall to the level of the questions you can ask after midnight." (Why "monitoring" ≠ "observability")

Instrument once, with OpenTelemetry

The pragmatic default now is OpenTelemetry: instrument your code against one vendor-neutral standard and export metrics, logs, and traces to whatever backend you choose. It decouples your instrumentation from your tooling, so switching observability vendors doesn't mean re-instrumenting the entire estate.
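
The decoupling idea can be illustrated without any library. The sketch below is deliberately not the OpenTelemetry API: `Tracer`, `SpanExporter`, and `InMemoryExporter` are hypothetical stand-ins written with the standard library, meant only to show application code depending on one instrumentation interface while the backend is swapped behind it. In OpenTelemetry itself, the tracer API and pluggable exporters play these roles.

```python
import time
from contextlib import contextmanager
from typing import Protocol


class SpanExporter(Protocol):
    """Anything that can receive a finished span (hypothetical interface)."""

    def export(self, span: dict) -> None: ...


class InMemoryExporter:
    """Stand-in backend that keeps finished spans in a list."""

    def __init__(self) -> None:
        self.spans: list = []

    def export(self, span: dict) -> None:
        self.spans.append(span)


class Tracer:
    """The only object application code talks to."""

    def __init__(self, exporter: SpanExporter) -> None:
        self._exporter = exporter

    @contextmanager
    def span(self, name: str, **attributes):
        start = time.perf_counter()
        record = {"name": name, "attributes": attributes, "error": False}
        try:
            yield record
        except Exception:
            record["error"] = True
            raise
        finally:
            # Always export, whether the wrapped code succeeded or raised.
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self._exporter.export(record)


def charge(tracer: Tracer, amount_cents: int) -> str:
    # Instrumented once; nothing here names a vendor or backend.
    with tracer.span("charge", amount_cents=amount_cents):
        return "ok"


backend_a, backend_b = InMemoryExporter(), InMemoryExporter()
charge(Tracer(backend_a), 1200)
charge(Tracer(backend_b), 1200)  # same code path, different backend
print(backend_a.spans[0]["name"], len(backend_b.spans))
```

Switching from `backend_a` to `backend_b` touches only the line that constructs the tracer, while `charge` stays unchanged. That is the property the vendor-neutral standard is meant to give you across a whole estate.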

Key takeaways

  • Metrics say that something broke, traces say where, logs say what.
  • The three signals must interconnect, otherwise they're three silos.
  • Alert on error-budget burn rate, not arbitrary CPU thresholds.
  • Instrument once with OpenTelemetry to decouple code from your backend choice.

