Abstract


Trace: User request to /checkout
  ├── Span: API Gateway handler
  ├── Span: Auth service call
  ├── Span: Payment service call
  │     └── Span: External Stripe API
  └── Span: Inventory service call
  • A DAG of spans that represents the execution path of a single request across services (see the sketch below).
  • Once we identify which span accounts for the latency, we know where to focus optimisation.
  • We can use tools such as Datadog APM, Tempo, and Zipkin.
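
For illustration, here is a minimal sketch using the OpenTelemetry Python SDK (one possible instrumentation library; the operation names mirror the diagram above and are purely illustrative). Spans opened inside one another share a trace ID and form exactly this kind of tree:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Minimal SDK setup: print finished spans to stdout so the structure is visible.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-demo")

# Nested "with" blocks create parent/child spans, i.e. the trace tree above.
with tracer.start_as_current_span("POST /checkout"):              # root span
    with tracer.start_as_current_span("auth-service.call"):       # child
        pass
    with tracer.start_as_current_span("payment-service.call"):    # child
        with tracer.start_as_current_span("stripe.api"):          # grandchild
            pass
    with tracer.start_as_current_span("inventory-service.call"):  # child
        pass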

How traces are sent to a server

Step 1: Instrumentation

  • Code is instrumented (manually or via auto-instrumentation libraries) to emit trace data whenever requests, DB queries, RPC calls, or background jobs happen.
  • Each unit of work (an HTTP request, a DB query, an RPC call) generates a span; see the sketch below.
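
As a hedged sketch of what this looks like in practice (assuming a Flask app and the OpenTelemetry Flask instrumentation package; the route and attribute names are made up for the example), auto-instrumentation patches the framework, while manual spans wrap the units of work you care about:

from flask import Flask
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)

# Auto-instrumentation: patches Flask's request handling so every incoming
# request automatically starts a server span.
FlaskInstrumentor().instrument_app(app)

tracer = trace.get_tracer(__name__)

@app.route("/checkout", methods=["POST"])
def checkout():
    # Manual instrumentation: wrap an interesting unit of work in its own span.
    with tracer.start_as_current_span("charge-card") as span:
        span.set_attribute("payment.provider", "stripe")  # illustrative tag
        # ... call the payment provider here ...
    return "ok"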

Step 2: Local collector/agent (optional but common)

  • Instead of sending traces directly, the app often sends spans to a lightweight agent or collector running on the same host/container.
  • Protocols used: OTLP (gRPC/HTTP), Zipkin/Jaeger formats, or vendor-proprietary formats (see the sketch below).
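
As one concrete example (Datadog's ddtrace library; the host and port values below are the library's defaults for the local Datadog Agent), the application only ever talks to the agent on the same host, and the agent forwards the spans to the backend:

import os

# Point the tracing library at the local agent (sidecar or host daemon).
# Must be set before the tracer initialises; 8126 is the agent's default
# trace intake port.
os.environ.setdefault("DD_AGENT_HOST", "localhost")
os.environ.setdefault("DD_TRACE_AGENT_PORT", "8126")

from ddtrace import patch_all

patch_all()  # monkey-patch supported libraries so their spans flow to the agent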

Step 3: Batching and exporting

  • Spans are batched to reduce overhead.
  • Exporters forward them to the tracing backend (Jaeger, Datadog, Tempo, etc.), often over HTTP/gRPC with retries and compression (see the sketch below).
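
With the OpenTelemetry Python SDK, for instance, batching is handled by BatchSpanProcessor. The endpoint below assumes a local OTLP collector, and the numbers shown match the usual SDK defaults; they are included only to show which knobs exist:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Finished spans are buffered in memory and flushed in batches from a background
# thread, instead of making one network call per span.
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="localhost:4317", insecure=True),
    max_queue_size=2048,          # spans buffered before new ones are dropped
    schedule_delay_millis=5000,   # how often the batch is flushed
    max_export_batch_size=512,    # spans sent per export call
)

provider = TracerProvider()
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)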

Step 4: Backend processing

  • The backend correlates spans into full traces, stores them, and exposes them for querying/visualization; a simplified view of that correlation is sketched below.
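
Conceptually (this is a simplified model for illustration, not any backend's actual storage format), every span carries a trace ID and a parent span ID, and that is all the backend needs to stitch spans back into a tree:

# Simplified span records as a backend might receive them (fields illustrative).
spans = [
    {"span_id": "a1", "parent_id": None, "trace_id": "t-123", "name": "POST /checkout"},
    {"span_id": "b2", "parent_id": "a1", "trace_id": "t-123", "name": "payment-service.call"},
    {"span_id": "c3", "parent_id": "b2", "trace_id": "t-123", "name": "stripe.api"},
]

# Group spans of one trace by parent, then walk the parent -> child links.
children = {}
for span in spans:
    children.setdefault(span["parent_id"], []).append(span)

def print_tree(parent_id=None, depth=0):
    for span in children.get(parent_id, []):
        print("  " * depth + span["name"])
        print_tree(span["span_id"], depth + 1)

print_tree()  # prints the nested structure shown in the Abstract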

Trace Collection


  • There are two main approaches: intrusive (in-app instrumentation) and non-intrusive (agent-level / auto-instrumentation), compared aspect by aspect below; a short code sketch of the contrast follows the comparison
  • Definition
    • Intrusive (in-app SDK): We import the tracing SDK into the app. The library hooks into frameworks (Flask, Django, Spring, etc.) and records spans for requests, DB queries, caching calls, etc.
    • Non-intrusive (agent/auto-patching): “Non-intrusive” does not mean passive sniffing; it still patches runtime/library calls, but without touching application code.
  • Mechanism
    • Intrusive: Manual instrumentation with the tracing API (span.start, span.set_tag), running inside the process.
    • Non-intrusive: Injects into runtimes (Python, Java, Node, .NET, PHP) using LD_PRELOAD (Linux), the JVM -javaagent flag, the .NET CLR profiler, or monkey-patching in dynamic languages.
  • Context captured
    • Intrusive: Both business and technical context: user ID, session ID, customer tier, transaction IDs, error codes, domain-specific tags.
    • Non-intrusive: Mostly technical context only: SQL query text, HTTP path, timings, status codes.
  • SQL/DB visibility
    • Intrusive: Full DB span plus business context (e.g., “user_id=42 ran this query”).
    • Non-intrusive: Intercepts DB client library calls (e.g., psycopg2, mysqlclient, JDBC); it sees SQL query strings because they are passed as plain arguments to functions before execution.
  • Coverage
    • Intrusive: Any part of the code you choose: API handlers, background jobs, domain logic, custom error handling.
    • Non-intrusive: Only known, patched libraries: DB drivers, HTTP clients, RPC frameworks.
  • Error details
    • Intrusive: Full stack traces, custom error messages, domain error codes.
    • Non-intrusive: Generic runtime errors (e.g., DB exception, HTTP 500) without business meaning.
  • Flexibility
    • Intrusive: High; you decide what to trace, where to put spans, and what metadata to attach.
    • Non-intrusive: Low; limited to what the tracer can infer automatically.
  • Best use case
    • Intrusive: When you need business observability: which customers are affected, which orders are delayed, custom KPIs.
    • Non-intrusive: When you need fast setup and baseline visibility without changing app code.
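
To make the contrast concrete, here is a hedged sketch using Datadog's ddtrace: the non-intrusive path runs the unmodified app under the launcher, while the intrusive path imports the SDK inside application code and attaches business context (the function, tag names, and order fields are hypothetical):

# Non-intrusive: no code changes; the launcher auto-patches supported libraries.
#   $ ddtrace-run python app.py

# Intrusive: import the SDK in application code and attach business context.
from ddtrace import tracer

def process_checkout(order):
    with tracer.trace("checkout.process") as span:
        # Business tags that auto-instrumentation could never infer on its own.
        span.set_tag("user_id", order.user_id)        # hypothetical field
        span.set_tag("customer_tier", order.tier)     # hypothetical field
        span.set_tag("order_total", order.total)      # hypothetical field
        # ... domain logic, DB queries, payment call ...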

Terminology


Runtime Metrics

Instrumented

  • Code or tooling has been added to the application to monitor, measure, or analyze its behavior during execution
  • Instrumentation provides insight into the application’s performance, functionality, and other operational characteristics
  • This is particularly useful for debugging, performance tuning, and monitoring purposes

Host-side

  • We can tune sampling on the host side (see the sketch below)
    • Sampling rules configured in the tracing library override sample rates computed by the Agent
  • Trace metrics are metrics emitted directly by the instrumented application and are calculated from 100% of the app’s traffic, regardless of sampling
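
For example, with ddtrace the sampling decision can be pinned in the library through environment variables (these are real ddtrace settings, but the rates below are arbitrary); once set, the library's decision applies instead of the agent-computed rates:

import os

# Library (head) sampling configured for this service. Must be set before the
# tracer initialises, e.g. in the container environment rather than in code.
os.environ["DD_TRACE_SAMPLE_RATE"] = "0.2"  # keep roughly 20% of traces
os.environ["DD_TRACE_SAMPLING_RULES"] = (
    '[{"service": "payment-service", "sample_rate": 1.0}]'  # always keep payments
)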

Datadog backend side

  • Live Search lets us search spans using any span tag
  • Custom metrics can be generated from spans
  • Retention Filters control which traces are retained and for how long
  • Dashboards give a visual representation of the app for optimisation and debugging