OpenTelemetry in 2026: The Standard for Modern Observability
You cannot fix what you cannot see. In distributed systems with dozens of microservices, a single user request might touch 10 services, 3 databases, and 2 message queues. When something goes wrong, finding the root cause without proper observability is like debugging in the dark. OpenTelemetry has become the industry standard for making distributed systems visible — and in 2026, it is mature enough for every team to adopt.
What Is OpenTelemetry?
OpenTelemetry (OTel) is a vendor-neutral, open-source observability framework. It provides APIs, SDKs, and tools to generate, collect, and export three types of telemetry data:
– Traces — The journey of a request across services (distributed tracing)
– Metrics — Numerical measurements over time (counters, histograms, gauges)
– Logs — Structured event records with context
The key word is vendor-neutral. You instrument your code once with OpenTelemetry, and you can export to any backend — Jaeger, Grafana Tempo, Datadog, New Relic, AWS X-Ray, or any combination.
The Three Pillars in Practice
Traces answer: "What happened to this specific request?"
A trace follows a single request from the frontend through every service it touches. Each step is a span — a named, timed operation with metadata.
User Request → API Gateway (12ms)
  └→ Auth Service (8ms)
  └→ Order Service (45ms)
      └→ PostgreSQL Query (15ms)
  └→ Payment Service (120ms)
      └→ Stripe API (95ms)
  └→ Notification Service (5ms)
      └→ Redis Pub/Sub (2ms)
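A useful quantity you can read off such a tree is a span's self time: its own duration minus the time spent in its direct children. Here is a plain-Java sketch of that idea; the `Span` record is a hypothetical stand-in for a real SDK span, not the OpenTelemetry API.

```java
import java.util.List;

// Illustrative sketch only: a hypothetical stand-in for a real SDK span
// (a named, timed operation with a link to its parent).
record Span(String name, long durationMs, String parentName) {}

public class SelfTime {
    // Self time: a span's own duration minus the time spent in its direct children.
    static long selfTimeMs(List<Span> trace, String name) {
        long own = trace.stream()
                .filter(s -> s.name().equals(name))
                .mapToLong(Span::durationMs)
                .sum();
        long inChildren = trace.stream()
                .filter(s -> name.equals(s.parentName()))
                .mapToLong(Span::durationMs)
                .sum();
        return own - inChildren;
    }

    public static void main(String[] args) {
        List<Span> trace = List.of(
                new Span("Payment Service", 120, "API Gateway"),
                new Span("Stripe API", 95, "Payment Service"));
        // 120ms total, 95ms of it spent waiting on Stripe
        System.out.println(selfTimeMs(trace, "Payment Service")); // prints 25
    }
}
```

Reading the tree this way shows that the Payment Service spends most of its 120ms waiting on the Stripe API rather than doing work of its own.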
Metrics answer: "How is the system performing overall?"
– Request rate: 1,250 req/s
– Error rate: 0.3%
– P99 latency: 450ms
– Active database connections: 42/100
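The P99 figure above is a percentile over a window of latency samples. A rough plain-Java sketch of the idea, using the nearest-rank method (Micrometer computes percentiles differently, typically from histograms):

```java
import java.util.Arrays;

public class Percentile {
    // Nearest-rank percentile: sort the samples, take the value at ceil(p * n) - 1.
    static long percentile(long[] samplesMs, double p) {
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        long[] latencies = new long[100];
        for (int i = 0; i < 100; i++) latencies[i] = i + 1; // 1..100 ms
        // P99 of 1..100 ms: 99 of 100 requests were at or below this value
        System.out.println(percentile(latencies, 0.99)); // prints 99
    }
}
```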
Logs answer: "What exactly happened at this moment?"
{
  "timestamp": "2026-02-23T10:15:32Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "message": "Payment processing failed",
  "error": "Stripe API timeout after 30s",
  "customer_id": "cust_42",
  "amount": 99.99
}
The power comes from correlation. The trace_id in the log line points to the full trace in your tracing backend, and exemplars attach sampled trace IDs to individual metric data points. One ID links all three pillars.
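Between services, that shared ID travels in the W3C `traceparent` HTTP header, formatted as version-traceid-spanid-flags. A minimal plain-Java sketch of generating one (in real systems the SDK handles this, including edge cases such as the all-zero IDs the spec forbids):

```java
import java.security.SecureRandom;
import java.util.HexFormat;

public class TraceParent {
    private static final SecureRandom RNG = new SecureRandom();

    // Build a W3C Trace Context header: 00-<32 hex trace id>-<16 hex span id>-<flags>
    static String newTraceparent() {
        byte[] traceId = new byte[16]; // 128-bit trace id -> 32 hex chars
        byte[] spanId = new byte[8];   // 64-bit span id   -> 16 hex chars
        RNG.nextBytes(traceId);
        RNG.nextBytes(spanId);
        HexFormat hex = HexFormat.of();
        // "01" flags = sampled
        return "00-" + hex.formatHex(traceId) + "-" + hex.formatHex(spanId) + "-01";
    }

    public static void main(String[] args) {
        // e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
        System.out.println(newTraceparent());
    }
}
```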
Instrumenting a Spring Boot Application
Spring Boot has excellent OpenTelemetry support through Micrometer and the OTel Java Agent:
<!-- pom.xml -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>
# application.yml
management:
  tracing:
    sampling:
      probability: 1.0  # Sample 100% in dev, lower in production
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces

logging:
  pattern:
    console: "%d{HH:mm:ss} [%X{traceId}] %-5level %logger{36} - %msg%n"
@RestController
@RequestMapping("/api/orders")
public class OrderController {

    private final OrderService orderService;
    private final ObservationRegistry registry;

    public OrderController(OrderService orderService, ObservationRegistry registry) {
        this.orderService = orderService;
        this.registry = registry;
    }

    @GetMapping("/{id}")
    public OrderResponse getOrder(@PathVariable Long id) {
        // Automatic span creation via Spring Observation
        return Observation.createNotStarted("order.fetch", registry)
            .lowCardinalityKeyValue("order.type", "standard")
            .observe(() -> orderService.findById(id));
    }
}
@Service
public class OrderService {

    private final JdbcTemplate jdbc;
    private final PaymentClient paymentClient;
    private final RowMapper<Order> orderRowMapper;

    public OrderService(JdbcTemplate jdbc, PaymentClient paymentClient,
                        RowMapper<Order> orderRowMapper) {
        this.jdbc = jdbc;
        this.paymentClient = paymentClient;
        this.orderRowMapper = orderRowMapper;
    }

    // Custom span for business logic (requires an ObservedAspect bean)
    @Observed(name = "order.process")
    public OrderResponse findById(Long id) {
        // JDBC calls are auto-instrumented — each query becomes a span
        Order order = jdbc.queryForObject(
            "SELECT * FROM orders WHERE id = ?", orderRowMapper, id);

        // HTTP calls to other services are auto-traced
        Payment payment = paymentClient.getPayment(order.getPaymentId());

        return new OrderResponse(order, payment);
    }
}
With the OTel Java Agent, most instrumentation is automatic — JDBC queries, HTTP client calls, Kafka producers/consumers, and Redis commands all generate spans without code changes.
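Attaching the agent is a single JVM flag plus standard `otel.*` system properties. The jar path, service name, and endpoint below are example values; adjust them for your environment:

```shell
# Download the published opentelemetry-javaagent.jar, then attach it at startup.
# No code changes needed — the agent instruments JDBC, HTTP clients, Kafka, Redis, etc.
java -javaagent:./opentelemetry-javaagent.jar \
     -Dotel.service.name=order-service \
     -Dotel.exporter.otlp.endpoint=http://otel-collector:4318 \
     -jar app.jar
```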
The OpenTelemetry Collector
The OTel Collector is a vendor-agnostic proxy that receives, processes, and exports telemetry data. It decouples your application from the backend:
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

  # Add resource attributes
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert

  # Filter out health check spans
  filter:
    spans:
      exclude:
        match_type: strict
        attributes:
          - key: http.route
            value: /health

  # Tail-based sampling — keep errors, sample normal traffic
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 1000 }
      - name: default
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource, filter, tail_sampling]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [loki]
This configuration receives telemetry via OTLP, processes it (batching, filtering, sampling), and exports traces to Grafana Tempo, metrics to Prometheus, and logs to Loki. Note that the filter, tail_sampling, and loki components ship in the Collector's contrib distribution, not the core one.
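One way to try this locally is the contrib Docker image, which bundles the filter and tail_sampling processors used above. The mount path assumes the image's default config location, and the config filename matches the example above:

```shell
# Run the contrib distribution of the Collector with the config mounted in.
# Ports 4317 (gRPC) and 4318 (HTTP) are the standard OTLP ports.
docker run --rm \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml" \
  -p 4317:4317 -p 4318:4318 \
  otel/opentelemetry-collector-contrib:latest
```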
Custom Metrics That Matter
Beyond auto-instrumented metrics, define custom ones for your business:
@Component
public class BusinessMetrics {

    private final MeterRegistry registry;
    private final Counter ordersPlaced;
    private final Timer orderProcessingTime;
    private final AtomicInteger activeCheckouts;

    public BusinessMetrics(MeterRegistry registry) {
        this.registry = registry;
        this.ordersPlaced = Counter.builder("business.orders.placed")
            .description("Total orders placed")
            .tag("channel", "web")
            .register(registry);
        this.orderProcessingTime = Timer.builder("business.orders.processing_time")
            .description("Time to process an order")
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry);
        this.activeCheckouts = registry.gauge(
            "business.checkouts.active",
            new AtomicInteger(0)
        );
    }

    public void recordOrder(String type, double amount) {
        ordersPlaced.increment();
        registry.counter("business.revenue",
            "type", type,
            "currency", "USD"
        ).increment(amount);
    }
}
Structured Logging with Trace Context
Logs become powerful when they carry trace context:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@Service
public class PaymentService {

    private static final Logger log = LoggerFactory.getLogger(PaymentService.class);

    private final StripeClient stripeClient;

    public PaymentService(StripeClient stripeClient) {
        this.stripeClient = stripeClient;
    }

    public PaymentResult processPayment(PaymentRequest request) {
        // trace_id and span_id are automatically injected into MDC
        log.info("Processing payment for customer={} amount={}",
            request.getCustomerId(), request.getAmount());
        try {
            PaymentResult result = stripeClient.charge(request);
            log.info("Payment successful transaction_id={}", result.getTransactionId());
            return result;
        } catch (PaymentException e) {
            log.error("Payment failed for customer={} error={}",
                request.getCustomerId(), e.getMessage(), e);
            throw e;
        }
    }
}
In Grafana, you can jump from a log line directly to its trace, see every service that request touched, and identify exactly where the failure occurred.
Sampling Strategies for Production
At scale, collecting 100% of telemetry is prohibitively expensive. Smart sampling strategies are essential:
| Strategy | Description | Use When |
|---|---|---|
| Head-based | Decide at request start (random %) | Simple, predictable cost |
| Tail-based | Decide after request completes | Need to keep all errors/slow requests |
| Priority | Always sample certain request types | Critical paths need 100% visibility |
| Adaptive | Adjust rate based on traffic volume | Variable traffic patterns |
The collector configuration above demonstrates tail-based sampling: 100% of errors and slow requests are kept, while normal traffic is sampled at 10%.
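The decision logic of those three policies can be sketched in a few lines of plain Java. The real evaluation happens inside the Collector after `decision_wait` expires; the random value is passed in as a parameter here to keep the sketch deterministic:

```java
public class TailSampler {
    // Mirrors the three policies from the collector config:
    // keep all errors, keep all slow traces, otherwise keep ~10% at random.
    static boolean keep(boolean hasError, long durationMs, double random) {
        if (hasError) return true;           // "errors" policy (status_code)
        if (durationMs >= 1000) return true; // "slow-requests" policy (threshold_ms: 1000)
        return random < 0.10;                // "default" policy (sampling_percentage: 10)
    }

    public static void main(String[] args) {
        System.out.println(keep(true, 50, 0.99));    // error -> kept: true
        System.out.println(keep(false, 1500, 0.99)); // slow -> kept: true
        System.out.println(keep(false, 50, 0.50));   // sampled out: false
    }
}
```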
The Grafana Stack: Putting It All Together
The most popular open-source observability stack in 2026:
– Grafana Tempo — Distributed tracing backend (trace storage and search)
– Prometheus — Metrics collection and alerting
– Grafana Loki — Log aggregation with label-based indexing
– Grafana — Unified dashboard and exploration UI
All three backends are connected in Grafana through exemplars and trace-to-logs correlations. Click a spike in a latency graph, see the traces that caused it, click a trace span, see the logs from that exact moment. This workflow transforms debugging from hours to minutes.
Getting Started Checklist
– Add the OTel Java Agent (or SDK for your language) — auto-instrumentation covers 80% of needs
– Deploy an OTel Collector as a sidecar or daemonset
– Export to your backend of choice (Grafana stack is free and excellent)
– Add trace IDs to your structured logs
– Define 3–5 custom business metrics that matter to your team
– Set up tail-based sampling to control costs while keeping error traces
– Build dashboards with RED metrics (Rate, Errors, Duration) for each service
– Create alerts on SLO violations, not raw thresholds
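As a sketch of what a RED dashboard computes over a time window, here is a hedged plain-Java example; the `Request` record is a hypothetical stand-in for real metric data, and production systems compute these from aggregated time series rather than raw requests:

```java
import java.util.List;

// Hypothetical per-request record for illustration: duration plus error flag.
record Request(long durationMs, boolean error) {}

public class RedMetrics {
    // Rate: requests per second over the window
    static double rate(List<Request> window, double windowSeconds) {
        return window.size() / windowSeconds;
    }

    // Errors: fraction of requests that failed
    static double errorRate(List<Request> window) {
        long errors = window.stream().filter(Request::error).count();
        return (double) errors / window.size();
    }

    // Duration: mean latency here (dashboards usually also show percentiles)
    static double avgDurationMs(List<Request> window) {
        return window.stream().mapToLong(Request::durationMs).average().orElse(0);
    }

    public static void main(String[] args) {
        List<Request> window = List.of(
            new Request(100, false), new Request(300, true),
            new Request(200, false), new Request(400, false));
        System.out.println(rate(window, 2.0));     // prints 2.0
        System.out.println(errorRate(window));     // prints 0.25
        System.out.println(avgDurationMs(window)); // prints 250.0
    }
}
```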
Observability is not optional for distributed systems. OpenTelemetry makes it achievable without vendor lock-in, and in 2026, the tooling has matured to the point where there is no excuse not to implement it.