Observability for Distributed Systems

Baalvion Strategic Brief • June 11, 2026

Strategic Intelligence by Baalvion Engineering

Monitoring tells you what; observability tells you why

Monitoring and observability are routinely conflated, and the conflation is expensive. Monitoring answers questions you anticipated: is CPU above 80 percent, is the queue backing up, is the error rate over last quarter's threshold. It is built on known failure modes and pre-defined dashboards. Observability is the property that lets you answer questions you did not anticipate — to interrogate a running system about a novel failure without shipping new code to instrument it. The difference shows up at three in the morning, when a latency spike hits one tenant in one region through a code path nobody flagged, and the only useful question is one you never put on a dashboard.

At Baalvion we run the Baalvion Operating System — a multi-tenant trade infrastructure connecting commerce, finance, compliance, logistics, and intelligence across 198 markets and 180+ jurisdictions. When a single payment can traverse a dozen services across regulated borders, a system handling 500K+ transactions cannot be operated on dashboards built for last year's incidents. This article walks through the three pillars — logs, metrics, traces — then the connective tissue that makes them useful: service level objectives and alerting that does not train operators to ignore it.

The three pillars, and why they are not interchangeable

Logs, metrics, and traces are often called the three pillars of observability, which is accurate but misleading if it suggests they are three ways of doing the same thing. They answer different questions, have radically different cost profiles, and fail in different ways. A mature practice uses each for what it is good at and resists forcing one to do another's job — indexing every log line as if it were a metric, or sampling traces so aggressively the one you needed was the one you dropped.

  • Metrics answer 'how much' and 'how often' — aggregatable numbers over time (request rate, error rate, latency percentiles), cheap to store and the right substrate for alerting and SLOs.
  • Logs answer 'what exactly happened' — discrete, high-cardinality events with full context, invaluable for forensics but costly to index at scale.
  • Traces answer 'where did the time go and which path did this request take' — the causal chain of a single request across service boundaries, the only pillar that natively explains latency in a distributed call graph.

The unifying move of recent years is to stop treating these as three disconnected tools and emit them through a single instrumentation standard. We standardise on OpenTelemetry across BOS for exactly this reason: a vendor-neutral API and wire format for traces, metrics, and logs, so the instrumentation in our code is decoupled from the backend that stores it. That decoupling is strategic — we can route to Prometheus, Tempo, Loki, or a commercial backend without re-instrumenting, the same separation of concerns we bring to any cloud-native infrastructure work.

Logs: structured, correlated, and budgeted

The single most valuable change a team can make to its logs is to stop writing prose and start writing structured events. An unstructured log line is something you grep. A structured log is a typed record, emitted as JSON, with consistent field names: timestamp, level, service, tenant_id, trace_id, and event-specific fields. That is the difference between searching for the substring 'timeout' and querying for every event where the duration exceeded a threshold for a specific tenant in a specific region — forensics by archaeology versus a question with an answer.

The field that earns its keep above all others is the trace identifier. A log line carrying the trace_id of the request that produced it is no longer an isolated event: it is a node in a causal graph, and you can pivot from a single suspicious log to the entire request across every service it touched. For a platform that must satisfy SOC 2 Type II and ISO 27001 auditors, structured, correlated logs are also evidence, provided they are kept in append-only, tamper-evident storage: a record of who did what, when, and in which tenant context. The trade-off is cost. Logs are the most expensive pillar to retain at full fidelity, so we tier them: recent logs hot and queryable, older logs in cheaper object storage, with tight control over what gets indexed.

  • Emit structured events, never interpolated prose — typed field names you can query and aggregate.
  • Always carry trace_id and tenant_id so a log can be pivoted to its full request and isolated to one tenant.
  • Set log levels with intent, and never log secrets, tokens, or PII.
  • Budget retention by tiering hot, warm, and cold storage rather than indexing everything at full fidelity forever.
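
The checklist above can be sketched with nothing but Python's standard library. This is a minimal illustration, not the BOS logging pipeline: the `JsonFormatter` class, the `build_logger` helper, the `fields` key, and the sample service and tenant names are all hypothetical, chosen only to show one JSON object per event with the field names listed in the text.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with consistent field names."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Event-specific fields (trace_id, tenant_id, ...) arrive through
        # `extra={"fields": {...}}` on the logging call. The "fields" key is
        # a convention of this sketch, not part of the logging module.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event, sort_keys=True)


def build_logger(service: str, stream=None) -> logging.Logger:
    """Return a logger that writes structured JSON events to `stream` (stderr by default)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("pricing")  # hypothetical service name
    log.warning(
        "upstream timeout",
        extra={"fields": {
            "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
            "tenant_id": "t-42",
            "region": "eu-west",
            "duration_ms": 1840,
        }},
    )
```

Because `duration_ms`, `tenant_id`, and `region` are typed fields rather than substrings of a sentence, a log backend can filter and aggregate on them directly, and `trace_id` is the pivot to the full request.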

Metrics: the cheap signal that drives alerting

Metrics are numeric time series, and their economy makes them the foundation of alerting and SLOs: a counter or histogram costs a fraction of what a log costs to store and query. The discipline that separates useful metrics from dashboard noise is choosing what to measure. The most durable framing is the one Google's SRE practice popularised — the four golden signals: latency, traffic, errors, and saturation. Instrument those for every service and you can answer the question that matters first in almost any incident: is it slow, overloaded, erroring, or out of a resource?

The recurring mistake is measuring latency as an average. Averages hide the tail, and the tail is where users feel pain — a service with a healthy mean can still be failing the slowest one percent of requests, which in a fan-out call graph may be a large fraction of real user sessions. We track latency as percentiles, p50, p95, and p99, and alert on the high percentiles, because a p99 regression signals a real cohort degraded even while the average looks fine. The other discipline is cardinality control: a metric tagged with an unbounded dimension — a raw user ID, a full URL with path parameters — explodes the time-series count and can take down the metrics backend. High cardinality belongs in traces and logs; metrics should carry bounded labels like region, tenant tier, or endpoint template.
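
A small numeric sketch makes the point about averages concrete. The latency values below are synthetic, and `percentile` (nearest-rank method) and `chance_of_slow_call` are illustrative helpers, not part of any metrics library. The fan-out calculation additionally assumes that downstream calls are independent, which real systems only approximate.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def chance_of_slow_call(fan_out, slow_fraction=0.01):
    """Probability that at least one of `fan_out` independent calls lands in the slow tail."""
    return 1 - (1 - slow_fraction) ** fan_out


# Synthetic latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies = [20] * 98 + [2000] * 2

mean = sum(latencies) / len(latencies)  # 59.6 ms, which looks healthy
p50 = percentile(latencies, 50)         # 20 ms
p95 = percentile(latencies, 95)         # 20 ms
p99 = percentile(latencies, 99)         # 2000 ms, the tail the mean hides

# A session that fans out to 50 downstream calls, each with a 1% slow tail.
session_risk = chance_of_slow_call(50)

if __name__ == "__main__":
    print(f"mean={mean} p50={p50} p95={p95} p99={p99}")
    print(f"chance a 50-call session hits the tail: {session_risk:.1%}")
```

Under the independence assumption, a 1 percent tail per call becomes roughly a 39 percent chance that a 50-call session touches at least one slow call, which is why the slowest one percent of requests can affect a large fraction of sessions.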

Traces: explaining latency across service boundaries

Distributed tracing is the pillar built for the problem monoliths never had: when one request fans out across many services, no single service's logs or metrics can tell you where the time went or which downstream call failed. A trace stitches the request back together — a tree of spans, each a timed unit of work in one service, linked by a propagated trace context. When a checkout is slow, a trace shows immediately whether the time went to the pricing service, the tax calculation, the foreign-exchange lookup, or a saturated database connection pool.
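
To illustrate how a span tree answers "where did the time go", the sketch below attributes self-time (a span's duration minus the duration of its direct children) to each service. The `Span` record, the service names, and the timings are hypothetical, and the calculation assumes children run sequentially inside their parent's window; parallel children would need interval merging.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    start_ms: int
    end_ms: int


def self_time_by_service(spans):
    """Sum, per service, each span's duration minus its direct children's durations."""
    child_total = defaultdict(int)
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] += span.end_ms - span.start_ms

    totals = defaultdict(int)
    for span in spans:
        duration = span.end_ms - span.start_ms
        totals[span.service] += duration - child_total[span.span_id]
    return dict(totals)


# A hypothetical slow checkout: one root span and its downstream calls.
trace = [
    Span("a", None, "checkout", 0, 900),
    Span("b", "a", "pricing", 10, 110),
    Span("c", "a", "tax", 120, 200),
    Span("d", "a", "fx-lookup", 210, 810),
    Span("e", "d", "database", 230, 790),
]

breakdown = self_time_by_service(trace)

if __name__ == "__main__":
    for service, ms in sorted(breakdown.items(), key=lambda item: -item[1]):
        print(f"{service:10s} {ms:4d} ms")
```

In this made-up trace the foreign-exchange lookup looks slow from the outside, but 560 of its 600 milliseconds are spent waiting on the database span beneath it, which is the kind of attribution neither per-service logs nor per-service metrics provide on their own.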

Context propagation makes this work, and it is also where tracing most often breaks: every service must forward the trace context (the W3C Trace Context headers, which OpenTelemetry propagates by default) to its downstream calls, or the trace fragments at the boundary you most need to see. The unavoidable trade-off is sampling. Tracing every request at full fidelity is prohibitively expensive at high volume, so you sample. Head-based sampling, however, decides at the start of a request, before you know whether it is the slow or failed one you want. We favour tail-based sampling for the paths that matter: the decision is made after the trace completes, so slow or errored traces are kept regardless of the baseline rate. In the settlement paths behind our real-time cross-border settlement work, a trace capturing a failed multi-leg payment is worth keeping every time.
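
Both ideas can be sketched in a few lines. The first function forwards a W3C `traceparent` header (format `version-traceid-parentid-flags`) by keeping the trace ID and minting a new span ID. The second is a tail-based keep/drop decision. In practice an OpenTelemetry SDK and collector would handle both; the function names, the 1000 ms slowness threshold, and the 1 percent baseline rate here are assumptions for illustration only.

```python
import re
import secrets

# Version 00 of the W3C traceparent header: 32 hex trace ID, 16 hex parent span ID, 2 hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def child_traceparent(incoming):
    """Build the header for a downstream call: same trace ID, fresh span ID."""
    match = _TRACEPARENT.match(incoming or "")
    if match and match.group(1) != "0" * 32 and match.group(2) != "0" * 16:
        trace_id, flags = match.group(1), match.group(3)
    else:
        # Missing or invalid context: start a new trace. Marking it sampled
        # ("01") is a choice made for this sketch.
        trace_id, flags = secrets.token_hex(16), "01"
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def keep_trace(trace_id, duration_ms, had_error, slow_ms=1000, baseline_rate=0.01):
    """Tail-based sampling: decide after the trace has finished.

    Slow or errored traces are always kept. Everything else is kept at
    `baseline_rate`, decided deterministically from the trace ID so every
    collector instance reaches the same verdict.
    """
    if had_error or duration_ms >= slow_ms:
        return True
    bucket = int(trace_id[:8], 16) / 2**32  # value in [0, 1)
    return bucket < baseline_rate


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(child_traceparent(incoming))
    print(keep_trace("4bf92f3577b34da6a3ce929d0e0e4736", 2500, had_error=False))
```

A service that drops the incoming header, or forwards a regenerated trace ID, produces exactly the fragmentation described above: the downstream spans exist but belong to a different trace.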

SLOs: turning signals into a contract

Raw signals tell you what the system is doing, not whether it is healthy enough. The bridge between data and decisions is the service level objective. An SLO is a target for a service level indicator — a measurable property like the fraction of requests served faster than 300 milliseconds — set at a deliberately imperfect level such as 99.9 percent rather than the fantasy of 100 percent. That gap between target and perfection is the error budget: the 0.1 percent of requests you are permitted to fail in the window before the objective is breached.
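
The arithmetic behind an error budget is simple enough to show directly. The function names and the request counts below are hypothetical; the 99.9 percent target is the one used in the text, and the 30-day window is an assumption.

```python
def allowed_downtime_minutes(slo_target, window_days=30):
    """Minutes of full unavailability the window can absorb before the SLO is breached."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target, good_requests, total_requests):
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # nothing served, nothing spent
    allowed_bad = (1 - slo_target) * total_requests
    actual_bad = total_requests - good_requests
    return 1 - actual_bad / allowed_bad


if __name__ == "__main__":
    # 99.9% over 30 days leaves about 43.2 minutes of total outage.
    print(round(allowed_downtime_minutes(0.999), 1))
    # 1,000,000 requests, 600 too slow or failed: 1,000 were allowed, so 40% remains.
    print(round(budget_remaining(0.999, 999_400, 1_000_000), 3))
```

Expressing the budget as a remaining fraction is what lets the policy described next be mechanical: the number is either above zero or it is not.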

The error budget reframes reliability from an argument into a measurement. When the budget is healthy, teams ship features; when it is spent, the policy is to stop shipping and invest in reliability until it recovers — ending the unwinnable velocity-versus-stability debate. It also tells you what to alert on: not raw CPU, but the rate at which you are burning the budget. We treat SLOs as the operating contract for every tier of BOS, the same way we treat them as a deliverable in enterprise software engagements — a number both sides agree defines 'good enough,' measured continuously rather than argued about after an outage.

Alerting that works: page on symptoms, not causes

An observability stack that pages an operator forty times a night for things that resolve themselves has not improved reliability — it has degraded it, because alert fatigue trains people to dismiss the page that finally matters. The first principle of alerting that works is to alert on symptoms the user feels, not causes that may affect no one. A node running hot is a cause; it is not worth a 3 a.m. page if the SLO is still being met. A burning error budget is a symptom — real users are being harmed right now — and it is worth waking someone.

The technique that operationalises this is multi-window, multi-burn-rate alerting on the error budget. Burn rate is the speed at which the budget is being consumed relative to the rate that would spend it exactly over the SLO window. A fast burn, one that consumes a large slice of the budget in a short window, pages immediately, because at that rate the budget will be exhausted long before the window ends. A slow, low-grade erosion opens a ticket for the next working day, because it is real but not urgent. This catches both the sudden outage and the silent regression without flooding the rotation. Every page should also be actionable: a link to the relevant trace, the affected tenant and region, and a runbook. We build that correlation into our DevOps practice so an alert is the start of a diagnosis, and pair it with blameless post-incident review so each alert becomes a lesson.

  • Page on user-facing symptoms and SLO burn, not internal causes with no user impact.
  • Use multi-burn-rate alerting: page on fast burn, ticket on slow burn — match urgency to impact.
  • Make every page actionable: trace link, affected tenant and region, and a runbook.
  • Review alerts continuously — delete the ones that never lead to action, because every false page erodes trust in the real ones.
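
Multi-window, multi-burn-rate alerting can be sketched as a small rule table. Burn rate here is the observed error ratio divided by the ratio the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The thresholds and window pairs below are the commonly cited ones from Google's SRE Workbook for a 30-day window (14.4 corresponds to 2 percent of the budget in one hour); they are used as an illustration and are not a statement of Baalvion's production configuration. `BurnRule`, `RULES`, and `evaluate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    long_window: str   # confirms the burn is significant
    short_window: str  # confirms the burn is still happening
    threshold: float   # burn rate both windows must exceed
    action: str        # "page" or "ticket"


# Ordered from most to least urgent.
RULES = [
    BurnRule("1h", "5m", 14.4, "page"),
    BurnRule("6h", "30m", 6.0, "page"),
    BurnRule("3d", "6h", 1.0, "ticket"),
]


def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo_target)


def evaluate(error_ratios, slo_target, rules=RULES):
    """Return the most urgent action triggered, or None.

    `error_ratios` maps a window name (e.g. "5m") to the fraction of
    requests that missed the SLI in that window.
    """
    for rule in rules:
        long_burn = burn_rate(error_ratios[rule.long_window], slo_target)
        short_burn = burn_rate(error_ratios[rule.short_window], slo_target)
        if long_burn > rule.threshold and short_burn > rule.threshold:
            return rule.action
    return None


if __name__ == "__main__":
    outage = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.004, "3d": 0.0005}
    erosion = {"5m": 0.002, "30m": 0.002, "1h": 0.002, "6h": 0.002, "3d": 0.0015}
    print(evaluate(outage, 0.999), evaluate(erosion, 0.999))
```

Requiring both windows to exceed the threshold is what keeps the rotation quiet: the long window filters out blips, and the short window stops the alert from firing for hours after an incident has already recovered.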

The Baalvion view

Observability is not a dashboard you buy; it is a discipline that starts in the code with structured, correlated instrumentation and ends with an on-call engineer who trusts the page that wakes them. The three pillars are complementary, not interchangeable: metrics for cheap, alertable signal; logs for forensic detail; traces for causality across service boundaries — all stitched together by a propagated trace context through a single standard like OpenTelemetry. SLOs turn those signals into a contract engineering and the business both accept, and error-budget alerting turns that contract into a page that means something. Done well, observability lets a platform like BOS stay auditable and available across 198 markets while still moving fast. Done poorly, it is an expensive pile of telemetry nobody reads until the outage is over — the difference we work to engineer out in every technology consulting engagement.

Frequently Asked Questions

What is the difference between monitoring and observability?

Monitoring answers questions you anticipated by tracking predefined metrics against known thresholds — it tells you the system is down. Observability is the property that lets you answer questions you did not anticipate, interrogating a running system about a novel failure without shipping new instrumentation. Monitoring is a subset of what a well-instrumented, observable system enables; the difference matters most during unexpected incidents that no existing dashboard was built to explain.

Are logs, metrics, and traces interchangeable?

No. They answer different questions and have very different cost profiles. Metrics answer 'how much' and 'how often' cheaply and are the right substrate for alerting and SLOs. Logs answer 'what exactly happened' with full forensic context but are expensive to index at scale. Traces answer 'where did the time go and which path did the request take' across service boundaries. A mature practice uses each for its strength rather than forcing one to do another's job.

Why is a trace ID so important in logs?

A log line carrying the trace ID of the request that produced it stops being an isolated event and becomes a node in a causal graph. You can pivot from a single suspicious log to the entire request that produced it, across every service it touched. This correlation is what collapses the gap between the three pillars, turning disconnected telemetry into a coherent picture of a single request's journey through a distributed system.

What is an error budget and how is it used?

An error budget is the gap between an SLO target and perfection. If your objective is 99.9 percent success, the budget is the 0.1 percent of requests you are permitted to fail in the measurement window. It reframes reliability from an argument into a measurement: when the budget is healthy, teams ship features; when it is spent, the policy is to stop shipping and invest in reliability until it recovers — making the velocity-versus-stability trade-off explicit and data-driven.

How should alerting be configured to avoid alert fatigue?

Alert on user-facing symptoms and SLO error-budget burn, not on internal causes that may have no user impact. Use multi-window, multi-burn-rate alerting: page immediately on a fast budget burn, but only open a ticket on a slow burn. Make every page actionable with a trace link, affected tenant and region, and a runbook. Continuously delete alerts that never lead to action, because every false page erodes trust in the alerts that genuinely matter.

Why use OpenTelemetry instead of a single vendor's agent?

OpenTelemetry provides a vendor-neutral API and wire format for traces, metrics, and logs, so your instrumentation is decoupled from the backend that stores it. That decoupling lets you route telemetry to Prometheus, Tempo, Loki, or a commercial backend without re-instrumenting your code, avoiding lock-in and letting the storage decision evolve independently of the application code. It also standardises context propagation, which is what keeps distributed traces from fragmenting at service boundaries.