OpenTelemetry in 2026: The Standard for Modern Observability

You cannot fix what you cannot see. In distributed systems with dozens of microservices, a single user request might touch 10 services, 3 databases, and 2 message queues. When something goes wrong, finding the root cause without proper observability is like debugging in the dark. OpenTelemetry has become the industry standard for making distributed systems visible — and in 2026, it is mature enough for every team to adopt.

What Is OpenTelemetry

OpenTelemetry (OTel) is a vendor-neutral, open-source observability framework. It provides APIs, SDKs, and tools to generate, collect, and export three types of telemetry data:

Traces — The journey of a request across services (distributed tracing)

Metrics — Numerical measurements over time (counters, histograms, gauges)

Logs — Structured event records with context

The key word is vendor-neutral. You instrument your code once with OpenTelemetry, and you can export to any backend — Jaeger, Grafana Tempo, Datadog, New Relic, AWS X-Ray, or any combination.

The Three Pillars in Practice

Traces answer: "What happened to this specific request?"

A trace follows a single request from the frontend through every service it touches. Each step is a span — a named, timed operation with metadata.

User Request → API Gateway (12ms)
                └→ Auth Service (8ms)
                └→ Order Service (45ms)
                    └→ PostgreSQL Query (15ms)
                    └→ Payment Service (120ms)
                        └→ Stripe API (95ms)
                    └→ Notification Service (5ms)
                        └→ Redis Pub/Sub (2ms)
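
The span tree above can be modeled in a few lines of plain Java (an illustration of the span concept only — the `Span` record here is hypothetical, not the OpenTelemetry API): each span is a named, timed operation with a link to its parent, and the slowest span in the tree is usually the first place to look.

```java
import java.util.Comparator;
import java.util.List;

public class TraceSketch {
    // Minimal span model (sketch only): a named, timed operation with a parent.
    record Span(String name, String parent, long durationMs) {}

    // The span with the longest duration is the natural starting point for debugging.
    static Span slowest(List<Span> trace) {
        return trace.stream()
            .max(Comparator.comparingLong(Span::durationMs))
            .orElseThrow();
    }

    public static void main(String[] args) {
        List<Span> trace = List.of(
            new Span("API Gateway", null, 12),
            new Span("Auth Service", "API Gateway", 8),
            new Span("Order Service", "API Gateway", 45),
            new Span("PostgreSQL Query", "Order Service", 15),
            new Span("Payment Service", "Order Service", 120),
            new Span("Stripe API", "Payment Service", 95));

        System.out.println(slowest(trace).name());  // Payment Service
    }
}
```

In the real API each span also carries a span ID, timestamps, and arbitrary key-value attributes; the parent link is what lets a backend reassemble the tree shown above.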

Metrics answer: "How is the system performing overall?"

Request rate: 1,250 req/s

Error rate: 0.3%

P99 latency: 450ms

Active database connections: 42/100
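
A "P99 latency" figure like the one above is simply the 99th percentile of observed request latencies. A minimal nearest-rank sketch in plain Java (in practice your metrics backend computes this for you, usually from histogram buckets rather than raw samples):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Percentile {
    // Nearest-rank percentile: sort the samples, take the value at ceil(p * n).
    static long percentile(List<Long> latenciesMs, double p) {
        List<Long> sorted = new ArrayList<>(latenciesMs);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.size());
        return sorted.get(Math.max(rank - 1, 0));
    }

    public static void main(String[] args) {
        List<Long> latencies = new ArrayList<>();
        for (long i = 1; i <= 100; i++) latencies.add(i);  // 1ms..100ms

        System.out.println(percentile(latencies, 0.99));  // 99
    }
}
```

The practical point: P99 says "99% of requests were at least this fast" — it surfaces tail latency that an average would hide.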

Logs answer: "What exactly happened at this moment?"

{
  "timestamp": "2026-02-23T10:15:32Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "message": "Payment processing failed",
  "error": "Stripe API timeout after 30s",
  "customer_id": "cust_42",
  "amount": 99.99
}

The power comes from correlation. The trace_id in the log connects to the same trace in your tracing backend, which connects to the same request in your metrics. One ID links all three pillars.
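
That join can be sketched in a few lines of plain Java (hypothetical record shapes, not a real backend API): starting from the trace_id on an error log, pull every record that carries the same ID.

```java
import java.util.List;
import java.util.Map;

public class Correlation {
    // Sketch only: every pillar's records carry the same trace_id,
    // so one ID is enough to join logs, spans, and exemplar metrics.
    static List<Map<String, String>> recordsFor(
            String traceId, List<Map<String, String>> records) {
        return records.stream()
            .filter(r -> traceId.equals(r.get("trace_id")))
            .toList();
    }

    public static void main(String[] args) {
        List<Map<String, String>> logs = List.of(
            Map.of("trace_id", "abc123def456", "message", "Payment processing failed"),
            Map.of("trace_id", "zzz999", "message", "Cache warmed"));

        // Start from the failing request's trace_id and recover its records
        System.out.println(recordsFor("abc123def456", logs).size());  // 1
    }
}
```

Backends like Grafana do exactly this lookup when you click "view trace" on a log line.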

Instrumenting a Spring Boot Application

Spring Boot has excellent OpenTelemetry support through Micrometer and the OTel Java Agent:

<!-- pom.xml -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>

# application.yml
management:
  tracing:
    sampling:
      probability: 1.0  # Sample 100% in dev, lower in production
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces

logging:
  pattern:
    console: "%d{HH:mm:ss} [%X{traceId}] %-5level %logger{36} - %msg%n"

@RestController
@RequestMapping("/api/orders")
public class OrderController {

    private final OrderService orderService;
    private final ObservationRegistry registry;

    public OrderController(OrderService orderService, ObservationRegistry registry) {
        this.orderService = orderService;
        this.registry = registry;
    }

    @GetMapping("/{id}")
    public OrderResponse getOrder(@PathVariable Long id) {
        // Automatic span creation via Spring Observation
        return Observation.createNotStarted("order.fetch", registry)
            .lowCardinalityKeyValue("order.type", "standard")
            .observe(() -> orderService.findById(id));
    }
}

@Service
public class OrderService {

    private final JdbcTemplate jdbc;
    private final PaymentClient paymentClient;

    public OrderService(JdbcTemplate jdbc, PaymentClient paymentClient) {
        this.jdbc = jdbc;
        this.paymentClient = paymentClient;
    }

    // Custom span for business logic
    @Observed(name = "order.process")
    public OrderResponse findById(Long id) {
        // JDBC calls are auto-instrumented — each query becomes a span
        Order order = jdbc.queryForObject(
            "SELECT * FROM orders WHERE id = ?", orderRowMapper, id);

        // HTTP calls to other services are auto-traced
        Payment payment = paymentClient.getPayment(order.getPaymentId());

        return new OrderResponse(order, payment);
    }
}

With the OTel Java Agent, most instrumentation is automatic — JDBC queries, HTTP client calls, Kafka producers/consumers, and Redis commands all generate spans without code changes.

The OpenTelemetry Collector

The OTel Collector is a vendor-agnostic proxy that receives, processes, and exports telemetry data. It decouples your application from the backend:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

  # Add resource attributes
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert

  # Filter out health check spans
  filter:
    spans:
      exclude:
        match_type: strict
        attributes:
          - key: http.route
            value: /health

  # Tail-based sampling — keep errors, sample normal traffic
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 1000 }
      - name: default
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  prometheus:
    endpoint: 0.0.0.0:8889

  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource, filter, tail_sampling]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [loki]

This configuration receives telemetry via OTLP, processes it (batching, filtering, sampling), and exports traces to Grafana Tempo, metrics to Prometheus, and logs to Loki.

Custom Metrics That Matter

Beyond auto-instrumented metrics, define custom ones for your business:

@Component
public class BusinessMetrics {

    private final MeterRegistry registry;
    private final Counter ordersPlaced;
    private final Timer orderProcessingTime;
    private final AtomicInteger activeCheckouts;

    public BusinessMetrics(MeterRegistry registry) {
        this.registry = registry;

        this.ordersPlaced = Counter.builder("business.orders.placed")
            .description("Total orders placed")
            .tag("channel", "web")
            .register(registry);

        this.orderProcessingTime = Timer.builder("business.orders.processing_time")
            .description("Time to process an order")
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry);

        this.activeCheckouts = registry.gauge(
            "business.checkouts.active",
            new AtomicInteger(0)
        );
    }

    public void recordOrder(String type, double amount) {
        ordersPlaced.increment();
        registry.counter("business.revenue",
            "type", type,
            "currency", "USD"
        ).increment(amount);
    }
}

Structured Logging with Trace Context

Logs become powerful when they carry trace context:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

@Service
public class PaymentService {

    private static final Logger log = LoggerFactory.getLogger(PaymentService.class);

    private final StripeClient stripeClient;

    public PaymentService(StripeClient stripeClient) {
        this.stripeClient = stripeClient;
    }

    public PaymentResult processPayment(PaymentRequest request) {
        // trace_id and span_id are automatically injected into MDC
        log.info("Processing payment for customer={} amount={}",
            request.getCustomerId(), request.getAmount());

        try {
            PaymentResult result = stripeClient.charge(request);
            log.info("Payment successful transaction_id={}", result.getTransactionId());
            return result;
        } catch (PaymentException e) {
            log.error("Payment failed for customer={} error={}",
                request.getCustomerId(), e.getMessage(), e);
            throw e;
        }
    }
}

In Grafana, you can jump from a log line directly to its trace, see every service that request touched, and identify exactly where the failure occurred.

Sampling Strategies for Production

At scale, collecting 100% of telemetry is prohibitively expensive. Smart sampling strategies are essential:

Strategy     Description                           Use When
Head-based   Decide at request start (random %)    Simple, predictable cost
Tail-based   Decide after request completes        Need to keep all errors/slow requests
Priority     Always sample certain request types   Critical paths need 100% visibility
Adaptive     Adjust rate based on traffic volume   Variable traffic patterns

The collector configuration above demonstrates tail-based sampling: 100% of errors and slow requests are kept, while normal traffic is sampled at 10%.
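
Head-based sampling stays consistent across services when the decision is derived from the trace ID itself, so every service keeps or drops the same traces. A minimal sketch of that idea in plain Java (OTel's built-in TraceIdRatioBased sampler works on the same principle, but this is not its exact code):

```java
public class RatioSampler {
    // Decide from the low 8 bytes of a 32-hex-char trace ID, so every
    // service reaches the same verdict for the same trace (sketch only).
    static boolean sample(String traceIdHex, double ratio) {
        long id = Long.parseUnsignedLong(traceIdHex.substring(16), 16);
        long bound = (long) (ratio * Long.MAX_VALUE);
        return Math.abs(id) < bound;
    }

    public static void main(String[] args) {
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";

        System.out.println(sample(traceId, 0.10));  // same answer in every service
        System.out.println(sample(traceId, 0.0));   // false: nothing kept at 0%
    }
}
```

Because the verdict is a pure function of the trace ID, no coordination between services is needed — unlike tail-based sampling, which requires the collector to buffer complete traces before deciding.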

The Grafana Stack: Putting It All Together

The most popular open-source observability stack in 2026:

Grafana Tempo — Distributed tracing backend (trace storage and search)

Prometheus — Metrics collection and alerting

Grafana Loki — Log aggregation with label-based indexing

Grafana — Unified dashboard and exploration UI

All three backends are connected in Grafana through exemplars and trace-to-logs correlations. Click a spike in a latency graph, see the traces that caused it, click a trace span, see the logs from that exact moment. This workflow transforms debugging from hours to minutes.

Getting Started Checklist

Add the OTel Java Agent (or SDK for your language) — auto-instrumentation covers 80% of needs

Deploy an OTel Collector as a sidecar or daemonset

Export to your backend of choice (Grafana stack is free and excellent)

Add trace IDs to your structured logs

Define 3–5 custom business metrics that matter to your team

Set up tail-based sampling to control costs while keeping error traces

Build dashboards with RED metrics (Rate, Errors, Duration) for each service

Create alerts on SLO violations, not raw thresholds

Observability is not optional for distributed systems. OpenTelemetry makes it achievable without vendor lock-in, and in 2026, the tooling has matured to the point where there is no excuse not to implement it.
