Microservices Observability with Spring Boot and OpenTelemetry


When a request fails in a microservices architecture, finding the root cause means correlating logs, traces, and metrics across 5-10 services. Without observability, you are guessing. Spring Boot's OpenTelemetry integration provides distributed tracing, metrics collection, and log correlation out of the box. This guide shows you how to instrument a Spring Boot application with OpenTelemetry and connect it to Grafana, Jaeger, and Prometheus for full production observability.

The Three Pillars: Traces, Metrics, Logs

Observability rests on three signal types that work together. Traces show the journey of a single request through your system — which services it hit, how long each took, and where it failed. Metrics show aggregate behavior — request rate, error rate, latency percentiles, CPU usage. Logs show detailed events with context. The real power comes from correlating all three: a spike in the error-rate metric leads you to traces of failed requests, and each trace carries a trace ID you can search in the logs for detailed error messages.

OpenTelemetry unifies all three under a single standard. Before OpenTelemetry, you needed separate libraries for tracing (Zipkin, Jaeger), metrics (Micrometer, Prometheus client), and logging (SLF4J, Log4j), and correlating across them required manual trace ID propagation. OpenTelemetry handles instrumentation, collection, and export for all three signals with a single SDK.

Setting Up OpenTelemetry with Spring Boot

Spring Boot 3.x has excellent OpenTelemetry support through Micrometer and the OpenTelemetry Java agent. The simplest approach is the Java agent, which auto-instruments your application without code changes; the dependency-based setup shown next uses Micrometer's OpenTelemetry bridge instead, which gives you finer control from inside the application.
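If you take the agent route, you download the published agent jar and attach it at JVM startup. A sketch of the launch command — the version, jar path, service name, and collector endpoint here are illustrative; the `otel.*` system properties are the agent's standard configuration keys:

```shell
# Fetch the published agent jar once (pin a specific version in real builds)
curl -sLO https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar

# Attach it at startup; otel.* properties set the service name and export target
java -javaagent:./opentelemetry-javaagent.jar \
     -Dotel.service.name=order-service \
     -Dotel.exporter.otlp.endpoint=http://otel-collector:4318 \
     -jar target/order-service.jar
```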

<!-- pom.xml dependencies -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-otlp</artifactId>
</dependency>

# application.yml — OpenTelemetry configuration
management:
  tracing:
    sampling:
      probability: 1.0  # Sample 100% in dev, 10% in production
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces
    metrics:
      export:
        enabled: true
        step: 30s
        url: http://otel-collector:4318/v1/metrics

# Set service name for trace identification
spring:
  application:
    name: order-service

# Logging with trace correlation
logging:
  pattern:
    console: "%d{yyyy-MM-dd HH:mm:ss} [%X{traceId}-%X{spanId}] %-5level %logger{36} - %msg%n"

The logging pattern includes traceId and spanId from MDC (Mapped Diagnostic Context). OpenTelemetry automatically injects these into every log statement. When you see an error in logs, you copy the traceId, search for it in Jaeger, and see the complete request flow across all services.
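Under the hood, the trace ID that lands in your logs travels between services in the W3C traceparent HTTP header, which OpenTelemetry propagates by default. A minimal, dependency-free sketch of pulling the trace ID out of that header (the class and method names are ours, not an OpenTelemetry API):

```java
import java.util.Optional;

public class TraceParent {

    /**
     * Extracts the trace ID from a W3C traceparent header of the form
     * "version-traceid-spanid-flags", e.g.
     * "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01".
     */
    public static Optional<String> extractTraceId(String header) {
        if (header == null) return Optional.empty();
        String[] parts = header.split("-");
        if (parts.length != 4) return Optional.empty();
        String traceId = parts[1];
        // 32 lowercase hex chars; the spec forbids an all-zero trace ID
        if (!traceId.matches("[0-9a-f]{32}") || traceId.matches("0{32}")) {
            return Optional.empty();
        }
        return Optional.of(traceId);
    }
}
```

The extracted 32-hex-character value is exactly the ID you would paste into Jaeger's trace search.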

Spring Boot observability monitoring dashboard
OpenTelemetry correlates traces, metrics, and logs under a single trace ID for end-to-end debugging

Distributed Tracing: Following Requests Across Services

OpenTelemetry automatically traces HTTP calls, database queries, message queue operations, and gRPC calls in Spring Boot. Each operation creates a span — a named, timed operation with metadata. Spans nest hierarchically to form a trace that shows the complete request journey.

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

@RestController
@RequiredArgsConstructor
public class OrderController {

    private final OrderService orderService;
    private final Tracer tracer;  // Inject the OpenTelemetry tracer

    @PostMapping("/orders")
    public ResponseEntity<Order> createOrder(@RequestBody OrderRequest request) {
        // Automatic span: "POST /orders" (Spring MVC instrumentation)

        // Custom span for business logic
        Span span = tracer.spanBuilder("validate-order")
            .setAttribute("order.items.count", request.getItems().size())
            .setAttribute("order.customer.id", request.getCustomerId())
            .startSpan();

        try (Scope scope = span.makeCurrent()) {
            Order order = orderService.createOrder(request);

            span.setAttribute("order.id", order.getId());
            span.setAttribute("order.total", order.getTotal().doubleValue());
            span.setStatus(StatusCode.OK);

            return ResponseEntity.ok(order);
        } catch (Exception e) {
            span.setStatus(StatusCode.ERROR, e.getMessage());
            span.recordException(e);
            throw e;
        } finally {
            span.end();
        }
    }
}

@Service
@RequiredArgsConstructor
public class OrderService {

    private final RestClient paymentClient;
    private final JdbcTemplate jdbcTemplate;
    private final KafkaTemplate<String, OrderEvent> kafka;

    @Observed(name = "order.creation")  // Micrometer observation
    public Order createOrder(OrderRequest request) {
        // Span: "SELECT orders..." (JDBC instrumentation)
        Order order = saveOrder(request);

        // Span: "POST payment-service/charge" (HTTP client instrumentation)
        paymentClient.post()
            .uri("/charge")
            .body(new ChargeRequest(order.getTotal()))
            .retrieve()
            .body(ChargeResponse.class);

        // Span: "send order-events" (Kafka instrumentation)
        kafka.send("order-events", order.getId(),
            new OrderEvent("ORDER_CREATED", order));

        return order;
    }
}

The resulting trace in Jaeger shows: the HTTP request (150ms total), nested database query (5ms), HTTP call to payment service (80ms, which shows its own nested spans), and Kafka message send (15ms). If the payment call takes 500ms instead of 80ms, you immediately see the bottleneck. Additionally, you can compare traces from fast and slow requests to identify what is different.

Metrics: RED Method and Custom Business Metrics

The RED method tracks three metrics for every service: Rate (requests per second), Errors (failed requests per second), and Duration (latency distribution). OpenTelemetry with Micrometer provides these automatically for HTTP endpoints. Add custom business metrics for domain-specific monitoring.
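As a concrete illustration of the three definitions (not Micrometer's implementation), here is a dependency-free sketch that computes RED values from a window of completed requests; the Request record and window length are illustrative:

```java
import java.util.List;

public class RedMetrics {

    /** One completed request in the observation window. */
    public record Request(long durationMs, boolean failed) {}

    /** Rate: requests per second over the window. */
    public static double rate(List<Request> window, double windowSeconds) {
        return window.size() / windowSeconds;
    }

    /** Errors: failed requests per second over the window. */
    public static double errorRate(List<Request> window, double windowSeconds) {
        return window.stream().filter(Request::failed).count() / windowSeconds;
    }

    /** Duration: nearest-rank percentile (0 < p <= 100) of request latency. */
    public static long durationPercentile(List<Request> window, double p) {
        List<Long> sorted = window.stream()
            .map(Request::durationMs)
            .sorted()
            .toList();
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        return sorted.get(Math.max(0, rank - 1));
    }
}
```

In practice Micrometer computes these continuously and with far better percentile estimation (histograms rather than sorting), but the quantities being tracked are the same.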

@Component
@RequiredArgsConstructor
public class OrderMetrics {

    private final MeterRegistry registry;
    private final OrderRepository orderRepository;

    // Counter: track business events
    public void orderCreated(String paymentMethod, BigDecimal amount) {
        registry.counter("orders.created",
            "payment_method", paymentMethod,
            "amount_range", amountRange(amount)
        ).increment();
    }

    // Gauge: track current state. MeterRegistry.gauge holds only a weak
    // reference to the value, so keep a strong reference in a field and
    // update it, rather than re-registering a fresh number each time.
    private final AtomicLong pendingOrders = new AtomicLong();

    @Scheduled(fixedRate = 30000)
    public void trackPendingOrders() {
        registry.gauge("orders.pending", pendingOrders);  // idempotent registration
        pendingOrders.set(orderRepository.countByStatus("PENDING"));
    }

    // Timer: track operation duration with percentiles
    public <T> T timeOperation(String name, Supplier<T> operation) {
        return registry.timer("order.operation",
            "operation", name
        ).record(operation);
    }

    // Distribution summary: track value distributions
    public void recordOrderValue(BigDecimal total) {
        registry.summary("order.value",
            "currency", "USD"
        ).record(total.doubleValue());
    }
}
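The amountRange helper used in orderCreated is not shown above; the point of such a helper is to bucket raw amounts into a handful of fixed tag values, because a high-cardinality tag (like the exact amount) multiplies the number of time series your backend must store. One plausible sketch, with boundaries chosen purely for illustration:

```java
import java.math.BigDecimal;

public class AmountBuckets {

    /** Maps an order total to a low-cardinality range tag. */
    public static String amountRange(BigDecimal amount) {
        if (amount.compareTo(new BigDecimal("50")) < 0)   return "0-50";
        if (amount.compareTo(new BigDecimal("200")) < 0)  return "50-200";
        if (amount.compareTo(new BigDecimal("1000")) < 0) return "200-1000";
        return "1000+";
    }
}
```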
Grafana metrics dashboard for Spring Boot
RED metrics — Rate, Errors, Duration — provide immediate visibility into service health

The OpenTelemetry Collector: Central Pipeline

The OpenTelemetry Collector receives telemetry from your applications, processes it (batching, filtering, sampling), and exports it to your backends (Jaeger for traces, Prometheus for metrics, Loki for logs). Running a collector instead of exporting directly from your applications decouples your instrumentation from your backend choices.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }
      - name: default
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls: { insecure: true }
  prometheus:
    endpoint: 0.0.0.0:8889
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

Tail sampling in the collector is especially valuable. Instead of deciding at the application level whether to sample a trace, tail sampling waits until the trace is complete and then keeps all error traces, all slow traces, and a random 10% sample of everything else. This ensures you never miss important traces while keeping storage costs manageable.
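The decision logic behind those three policies can be sketched in plain Java (this is an illustration of the idea, not the collector's code; the Span record is simplified and the first span is assumed to be the root):

```java
import java.util.List;
import java.util.function.DoubleSupplier;

public class TailSampler {

    /** A completed span; the first span in a trace is assumed to be the root. */
    public record Span(long durationMs, boolean error) {}

    private final DoubleSupplier random;  // injectable so the decision is testable

    public TailSampler(DoubleSupplier random) {
        this.random = random;
    }

    /** Decide once the whole trace is in hand, mirroring the collector policies. */
    public boolean keep(List<Span> trace) {
        if (trace.stream().anyMatch(Span::error)) {
            return true;                          // errors policy: keep all
        }
        if (trace.get(0).durationMs() > 1000) {
            return true;                          // latency policy: threshold 1000 ms
        }
        return random.getAsDouble() < 0.10;       // probabilistic policy: 10%
    }
}
```

In the real collector, a trace is kept if any configured policy votes to sample it; passing something like `new Random()::nextDouble` as the supplier gives the non-deterministic production behavior.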

OpenTelemetry collector and backend architecture
The OpenTelemetry Collector decouples instrumentation from backends — switch from Jaeger to Tempo without code changes

In conclusion, Spring Boot with OpenTelemetry provides production-grade observability with minimal effort. Auto-instrumentation covers HTTP, database, and messaging operations. Micrometer bridges metrics to OpenTelemetry format. Log correlation with trace IDs connects the three pillars. Start with auto-instrumentation and the OpenTelemetry Collector, add custom spans for business logic, and build Grafana dashboards that show RED metrics for every service.
