OpenTelemetry in 2026: The Standard for Modern Observability

You cannot fix what you cannot see. In distributed systems with dozens of microservices, a single user request might touch 10 services, 3 databases, and 2 message queues. When something goes wrong, finding the root cause without proper observability is like debugging in the dark. OpenTelemetry has become the industry standard for making distributed systems visible — and in 2026, it is mature enough for every team to adopt.

What Is OpenTelemetry

OpenTelemetry (OTel) is a vendor-neutral, open-source observability framework. It provides APIs, SDKs, and tools to generate, collect, and export three types of telemetry data:

Traces — The journey of a request across services (distributed tracing)

Metrics — Numerical measurements over time (counters, histograms, gauges)

Logs — Structured event records with context

The key word is vendor-neutral. You instrument your code once with OpenTelemetry, and you can export to any backend — Jaeger, Grafana Tempo, Datadog, New Relic, AWS X-Ray, or any combination.

The Three Pillars in Practice

Traces answer: "What happened to this specific request?"

A trace follows a single request from the frontend through every service it touches. Each step is a span — a named, timed operation with metadata.

User Request → API Gateway (12ms)
                └→ Auth Service (8ms)
                └→ Order Service (45ms)
                    └→ PostgreSQL Query (15ms)
                    └→ Payment Service (120ms)
                        └→ Stripe API (95ms)
                    └→ Notification Service (5ms)
                        └→ Redis Pub/Sub (2ms)
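
The span tree above can be modeled in a few lines of plain Java (an illustration of the span concept only — the `Span` record here is hypothetical, not the OpenTelemetry API): each span is a named, timed operation with a link to its parent, and the slowest span in the tree is usually the first place to look.

```java
import java.util.Comparator;
import java.util.List;

public class TraceSketch {
    // Minimal span model (sketch only): a named, timed operation with a parent.
    record Span(String name, String parent, long durationMs) {}

    // The span with the longest duration is the natural starting point for debugging.
    static Span slowest(List<Span> trace) {
        return trace.stream()
            .max(Comparator.comparingLong(Span::durationMs))
            .orElseThrow();
    }

    public static void main(String[] args) {
        List<Span> trace = List.of(
            new Span("API Gateway", null, 12),
            new Span("Auth Service", "API Gateway", 8),
            new Span("Order Service", "API Gateway", 45),
            new Span("PostgreSQL Query", "Order Service", 15),
            new Span("Payment Service", "Order Service", 120),
            new Span("Stripe API", "Payment Service", 95));

        System.out.println(slowest(trace).name());  // Payment Service
    }
}
```

In the real API each span also carries a span ID, timestamps, and arbitrary key-value attributes; the parent link is what lets a backend reassemble the tree shown above.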

Metrics answer: "How is the system performing overall?"

Request rate: 1,250 req/s

Error rate: 0.3%

P99 latency: 450ms

Active database connections: 42/100
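
A "P99 latency" figure like the one above is simply the 99th percentile of observed request latencies. A minimal nearest-rank sketch in plain Java (in practice your metrics backend computes this for you, usually from histogram buckets rather than raw samples):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Percentile {
    // Nearest-rank percentile: sort the samples, take the value at ceil(p * n).
    static long percentile(List<Long> latenciesMs, double p) {
        List<Long> sorted = new ArrayList<>(latenciesMs);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.size());
        return sorted.get(Math.max(rank - 1, 0));
    }

    public static void main(String[] args) {
        List<Long> latencies = new ArrayList<>();
        for (long i = 1; i <= 100; i++) latencies.add(i);  // 1ms..100ms

        System.out.println(percentile(latencies, 0.99));  // 99
    }
}
```

The practical point: P99 says "99% of requests were at least this fast" — it surfaces tail latency that an average would hide.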

Logs answer: "What exactly happened at this moment?"

{
  "timestamp": "2026-02-23T10:15:32Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "message": "Payment processing failed",
  "error": "Stripe API timeout after 30s",
  "customer_id": "cust_42",
  "amount": 99.99
}

The power comes from correlation. The trace_id in the log connects to the same trace in your tracing backend, which connects to the same request in your metrics. One ID links all three pillars.
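
That join can be sketched in a few lines of plain Java (hypothetical record shapes, not a real backend API): starting from the trace_id on an error log, pull every record that carries the same ID.

```java
import java.util.List;
import java.util.Map;

public class Correlation {
    // Sketch only: every pillar's records carry the same trace_id,
    // so one ID is enough to join logs, spans, and exemplar metrics.
    static List<Map<String, String>> recordsFor(
            String traceId, List<Map<String, String>> records) {
        return records.stream()
            .filter(r -> traceId.equals(r.get("trace_id")))
            .toList();
    }

    public static void main(String[] args) {
        List<Map<String, String>> logs = List.of(
            Map.of("trace_id", "abc123def456", "message", "Payment processing failed"),
            Map.of("trace_id", "zzz999", "message", "Cache warmed"));

        // Start from the failing request's trace_id and recover its records
        System.out.println(recordsFor("abc123def456", logs).size());  // 1
    }
}
```

Backends like Grafana do exactly this lookup when you click "view trace" on a log line.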

Instrumenting a Spring Boot Application

Spring Boot has excellent OpenTelemetry support through Micrometer and the OTel Java Agent:

<!-- pom.xml -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>

# application.yml
management:
  tracing:
    sampling:
      probability: 1.0  # Sample 100% in dev, lower in production
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces

logging:
  pattern:
    console: "%d{HH:mm:ss} [%X{traceId}] %-5level %logger{36} - %msg%n"

@RestController
@RequestMapping("/api/orders")
public class OrderController {

    private final OrderService orderService;
    private final ObservationRegistry registry;

    public OrderController(OrderService orderService, ObservationRegistry registry) {
        this.orderService = orderService;
        this.registry = registry;
    }

    @GetMapping("/{id}")
    public OrderResponse getOrder(@PathVariable Long id) {
        // Automatic span creation via Spring Observation
        return Observation.createNotStarted("order.fetch", registry)
            .lowCardinalityKeyValue("order.type", "standard")
            .observe(() -> orderService.findById(id));
    }
}

@Service
public class OrderService {

    private final JdbcTemplate jdbc;
    private final PaymentClient paymentClient;

    public OrderService(JdbcTemplate jdbc, PaymentClient paymentClient) {
        this.jdbc = jdbc;
        this.paymentClient = paymentClient;
    }

    // Custom span for business logic
    @Observed(name = "order.process")
    public OrderResponse findById(Long id) {
        // JDBC calls are auto-instrumented — each query becomes a span
        Order order = jdbc.queryForObject(
            "SELECT * FROM orders WHERE id = ?", orderRowMapper, id);

        // HTTP calls to other services are auto-traced
        Payment payment = paymentClient.getPayment(order.getPaymentId());

        return new OrderResponse(order, payment);
    }
}

With the OTel Java Agent, most instrumentation is automatic — JDBC queries, HTTP client calls, Kafka producers/consumers, and Redis commands all generate spans without code changes.

The OpenTelemetry Collector

The OTel Collector is a vendor-agnostic proxy that receives, processes, and exports telemetry data. It decouples your application from the backend:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

  # Add resource attributes
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert

  # Filter out health check spans
  filter:
    spans:
      exclude:
        match_type: strict
        attributes:
          - key: http.route
            value: /health

  # Tail-based sampling — keep errors, sample normal traffic
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 1000 }
      - name: default
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  prometheus:
    endpoint: 0.0.0.0:8889

  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource, filter, tail_sampling]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [loki]

This configuration receives telemetry via OTLP, processes it (batching, filtering, sampling), and exports traces to Grafana Tempo, metrics to Prometheus, and logs to Loki.

Custom Metrics That Matter

Beyond auto-instrumented metrics, define custom ones for your business:

@Component
public class BusinessMetrics {

    private final MeterRegistry registry;
    private final Counter ordersPlaced;
    private final Timer orderProcessingTime;
    private final AtomicInteger activeCheckouts;

    public BusinessMetrics(MeterRegistry registry) {
        this.registry = registry;

        this.ordersPlaced = Counter.builder("business.orders.placed")
            .description("Total orders placed")
            .tag("channel", "web")
            .register(registry);

        this.orderProcessingTime = Timer.builder("business.orders.processing_time")
            .description("Time to process an order")
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry);

        this.activeCheckouts = registry.gauge(
            "business.checkouts.active",
            new AtomicInteger(0)
        );
    }

    public void recordOrder(String type, double amount) {
        ordersPlaced.increment();
        registry.counter("business.revenue",
            "type", type,
            "currency", "USD"
        ).increment(amount);
    }
}

Structured Logging with Trace Context

Logs become powerful when they carry trace context:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

@Service
public class PaymentService {

    private static final Logger log = LoggerFactory.getLogger(PaymentService.class);

    private final StripeClient stripeClient;

    public PaymentService(StripeClient stripeClient) {
        this.stripeClient = stripeClient;
    }

    public PaymentResult processPayment(PaymentRequest request) {
        // trace_id and span_id are automatically injected into MDC
        log.info("Processing payment for customer={} amount={}",
            request.getCustomerId(), request.getAmount());

        try {
            PaymentResult result = stripeClient.charge(request);
            log.info("Payment successful transaction_id={}", result.getTransactionId());
            return result;
        } catch (PaymentException e) {
            log.error("Payment failed for customer={} error={}",
                request.getCustomerId(), e.getMessage(), e);
            throw e;
        }
    }
}

In Grafana, you can jump from a log line directly to its trace, see every service that request touched, and identify exactly where the failure occurred.

Sampling Strategies for Production

At scale, collecting 100% of telemetry is prohibitively expensive. Smart sampling strategies are essential:

Strategy     Description                           Use When
Head-based   Decide at request start (random %)    Simple, predictable cost
Tail-based   Decide after request completes        Need to keep all errors/slow requests
Priority     Always sample certain request types   Critical paths need 100% visibility
Adaptive     Adjust rate based on traffic volume   Variable traffic patterns

The collector configuration above demonstrates tail-based sampling: 100% of errors and slow requests are kept, while normal traffic is sampled at 10%.
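
Head-based sampling stays consistent across services when the decision is derived from the trace ID itself, so every service keeps or drops the same traces. A minimal sketch of that idea in plain Java (OTel's built-in TraceIdRatioBased sampler works on the same principle, but this is not its exact code):

```java
public class RatioSampler {
    // Decide from the low 8 bytes of a 32-hex-char trace ID, so every
    // service reaches the same verdict for the same trace (sketch only).
    static boolean sample(String traceIdHex, double ratio) {
        long id = Long.parseUnsignedLong(traceIdHex.substring(16), 16);
        long bound = (long) (ratio * Long.MAX_VALUE);
        return Math.abs(id) < bound;
    }

    public static void main(String[] args) {
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";

        System.out.println(sample(traceId, 0.10));  // same answer in every service
        System.out.println(sample(traceId, 0.0));   // false: nothing kept at 0%
    }
}
```

Because the verdict is a pure function of the trace ID, no coordination between services is needed — unlike tail-based sampling, which requires the collector to buffer complete traces before deciding.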

The Grafana Stack: Putting It All Together

The most popular open-source observability stack in 2026:

Grafana Tempo — Distributed tracing backend (trace storage and search)

Prometheus — Metrics collection and alerting

Grafana Loki — Log aggregation with label-based indexing

Grafana — Unified dashboard and exploration UI

All three backends are connected in Grafana through exemplars and trace-to-logs correlations. Click a spike in a latency graph, see the traces that caused it, click a trace span, see the logs from that exact moment. This workflow transforms debugging from hours to minutes.

Getting Started Checklist

Add the OTel Java Agent (or SDK for your language) — auto-instrumentation covers 80% of needs

Deploy an OTel Collector as a sidecar or daemonset

Export to your backend of choice (Grafana stack is free and excellent)

Add trace IDs to your structured logs

Define 3–5 custom business metrics that matter to your team

Set up tail-based sampling to control costs while keeping error traces

Build dashboards with RED metrics (Rate, Errors, Duration) for each service

Create alerts on SLO violations, not raw thresholds

Observability is not optional for distributed systems. OpenTelemetry makes it achievable without vendor lock-in, and in 2026, the tooling has matured to the point where there is no excuse not to implement it.
