OpenTelemetry in 2026: The Standard for Modern Observability
You cannot fix what you cannot see. In distributed systems with dozens of microservices, a single user request might touch 10 services, 3 databases, and 2 message queues. When something goes wrong, finding the root cause without proper observability is like debugging in the dark. OpenTelemetry has become the industry standard for making distributed systems visible — and in 2026, it is mature enough for every team to adopt.
What Is OpenTelemetry?
OpenTelemetry (OTel) is a vendor-neutral, open-source observability framework. It provides APIs, SDKs, and tools to generate, collect, and export three types of telemetry data:
– Traces — The journey of a request across services (distributed tracing)
– Metrics — Numerical measurements over time (counters, histograms, gauges)
– Logs — Structured event records with context
The key word is vendor-neutral. You instrument your code once with OpenTelemetry, and you can export to any backend — Jaeger, Grafana Tempo, Datadog, New Relic, AWS X-Ray, or any combination.
The Three Pillars in Practice
Traces answer: "What happened to this specific request?"
A trace follows a single request from the frontend through every service it touches. Each step is a span — a named, timed operation with metadata.
User Request → API Gateway (12ms)
  └→ Auth Service (8ms)
  └→ Order Service (45ms)
      └→ PostgreSQL Query (15ms)
  └→ Payment Service (120ms)
      └→ Stripe API (95ms)
  └→ Notification Service (5ms)
      └→ Redis Pub/Sub (2ms)
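A useful quantity you can read off such a tree is a span's self time: its own duration minus the time spent in its direct children. Here is a plain-Java sketch of that idea; the `Span` record is a hypothetical stand-in for a real SDK span, not the OpenTelemetry API.

```java
import java.util.List;

// Illustrative sketch only: a hypothetical stand-in for a real SDK span
// (a named, timed operation with a link to its parent).
record Span(String name, long durationMs, String parentName) {}

public class SelfTime {
    // Self time: a span's own duration minus the time spent in its direct children.
    static long selfTimeMs(List<Span> trace, String name) {
        long own = trace.stream()
                .filter(s -> s.name().equals(name))
                .mapToLong(Span::durationMs)
                .sum();
        long inChildren = trace.stream()
                .filter(s -> name.equals(s.parentName()))
                .mapToLong(Span::durationMs)
                .sum();
        return own - inChildren;
    }

    public static void main(String[] args) {
        List<Span> trace = List.of(
                new Span("Payment Service", 120, "API Gateway"),
                new Span("Stripe API", 95, "Payment Service"));
        // 120ms total, 95ms of it spent waiting on Stripe
        System.out.println(selfTimeMs(trace, "Payment Service")); // prints 25
    }
}
```

Reading the tree this way shows that the Payment Service spends most of its 120ms waiting on the Stripe API rather than doing work of its own.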
Metrics answer: "How is the system performing overall?"
– Request rate: 1,250 req/s
– Error rate: 0.3%
– P99 latency: 450ms
– Active database connections: 42/100
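The P99 figure above is a percentile over a window of latency samples. A rough plain-Java sketch of the idea, using the nearest-rank method (Micrometer computes percentiles differently, typically from histograms):

```java
import java.util.Arrays;

public class Percentile {
    // Nearest-rank percentile: sort the samples, take the value at ceil(p * n) - 1.
    static long percentile(long[] samplesMs, double p) {
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        long[] latencies = new long[100];
        for (int i = 0; i < 100; i++) latencies[i] = i + 1; // 1..100 ms
        // P99 of 1..100 ms: 99 of 100 requests were at or below this value
        System.out.println(percentile(latencies, 0.99)); // prints 99
    }
}
```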
Logs answer: "What exactly happened at this moment?"
{
  "timestamp": "2026-02-23T10:15:32Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "message": "Payment processing failed",
  "error": "Stripe API timeout after 30s",
  "customer_id": "cust_42",
  "amount": 99.99
}
The power comes from correlation. The trace_id in the log line points to the full trace in your tracing backend, and exemplars attach sampled trace IDs to individual metric data points. One ID links all three pillars.
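Between services, that shared ID travels in the W3C `traceparent` HTTP header, formatted as version-traceid-spanid-flags. A minimal plain-Java sketch of generating one (in real systems the SDK handles this, including edge cases such as the all-zero IDs the spec forbids):

```java
import java.security.SecureRandom;
import java.util.HexFormat;

public class TraceParent {
    private static final SecureRandom RNG = new SecureRandom();

    // Build a W3C Trace Context header: 00-<32 hex trace id>-<16 hex span id>-<flags>
    static String newTraceparent() {
        byte[] traceId = new byte[16]; // 128-bit trace id -> 32 hex chars
        byte[] spanId = new byte[8];   // 64-bit span id   -> 16 hex chars
        RNG.nextBytes(traceId);
        RNG.nextBytes(spanId);
        HexFormat hex = HexFormat.of();
        // "01" flags = sampled
        return "00-" + hex.formatHex(traceId) + "-" + hex.formatHex(spanId) + "-01";
    }

    public static void main(String[] args) {
        // e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
        System.out.println(newTraceparent());
    }
}
```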
Instrumenting a Spring Boot Application
Spring Boot has excellent OpenTelemetry support through Micrometer and the OTel Java Agent:
<!-- pom.xml -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>
# application.yml
management:
  tracing:
    sampling:
      probability: 1.0  # Sample 100% in dev, lower in production
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces

logging:
  pattern:
    console: "%d{HH:mm:ss} [%X{traceId}] %-5level %logger{36} - %msg%n"
@RestController
@RequestMapping("/api/orders")
public class OrderController {

    private final OrderService orderService;
    private final ObservationRegistry registry;

    public OrderController(OrderService orderService, ObservationRegistry registry) {
        this.orderService = orderService;
        this.registry = registry;
    }

    @GetMapping("/{id}")
    public OrderResponse getOrder(@PathVariable Long id) {
        // Automatic span creation via Spring Observation
        return Observation.createNotStarted("order.fetch", registry)
            .lowCardinalityKeyValue("order.type", "standard")
            .observe(() -> orderService.findById(id));
    }
}
@Service
public class OrderService {

    private final JdbcTemplate jdbc;
    private final PaymentClient paymentClient;
    private final RowMapper<Order> orderRowMapper;

    public OrderService(JdbcTemplate jdbc, PaymentClient paymentClient,
                        RowMapper<Order> orderRowMapper) {
        this.jdbc = jdbc;
        this.paymentClient = paymentClient;
        this.orderRowMapper = orderRowMapper;
    }

    // Custom span for business logic (requires an ObservedAspect bean)
    @Observed(name = "order.process")
    public OrderResponse findById(Long id) {
        // JDBC calls are auto-instrumented — each query becomes a span
        Order order = jdbc.queryForObject(
            "SELECT * FROM orders WHERE id = ?", orderRowMapper, id);

        // HTTP calls to other services are auto-traced
        Payment payment = paymentClient.getPayment(order.getPaymentId());

        return new OrderResponse(order, payment);
    }
}
With the OTel Java Agent, most instrumentation is automatic — JDBC queries, HTTP client calls, Kafka producers/consumers, and Redis commands all generate spans without code changes.
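Attaching the agent is a single JVM flag plus standard `otel.*` system properties. The jar path, service name, and endpoint below are example values; adjust them for your environment:

```shell
# Download the published opentelemetry-javaagent.jar, then attach it at startup.
# No code changes needed — the agent instruments JDBC, HTTP clients, Kafka, Redis, etc.
java -javaagent:./opentelemetry-javaagent.jar \
     -Dotel.service.name=order-service \
     -Dotel.exporter.otlp.endpoint=http://otel-collector:4318 \
     -jar app.jar
```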
The OpenTelemetry Collector
The OTel Collector is a vendor-agnostic proxy that receives, processes, and exports telemetry data. It decouples your application from the backend:
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

  # Add resource attributes
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert

  # Filter out health check spans
  filter:
    spans:
      exclude:
        match_type: strict
        attributes:
          - key: http.route
            value: /health

  # Tail-based sampling — keep errors, sample normal traffic
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 1000 }
      - name: default
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource, filter, tail_sampling]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [loki]
This configuration receives telemetry via OTLP, processes it (batching, filtering, sampling), and exports traces to Grafana Tempo, metrics to Prometheus, and logs to Loki. Note that the filter, tail_sampling, and loki components ship in the Collector's contrib distribution, not the core one.
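One way to try this locally is the contrib Docker image, which bundles the filter and tail_sampling processors used above. The mount path assumes the image's default config location, and the config filename matches the example above:

```shell
# Run the contrib distribution of the Collector with the config mounted in.
# Ports 4317 (gRPC) and 4318 (HTTP) are the standard OTLP ports.
docker run --rm \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml" \
  -p 4317:4317 -p 4318:4318 \
  otel/opentelemetry-collector-contrib:latest
```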
Custom Metrics That Matter
Beyond auto-instrumented metrics, define custom ones for your business:
@Component
public class BusinessMetrics {

    private final MeterRegistry registry;
    private final Counter ordersPlaced;
    private final Timer orderProcessingTime;
    private final AtomicInteger activeCheckouts;

    public BusinessMetrics(MeterRegistry registry) {
        this.registry = registry;
        this.ordersPlaced = Counter.builder("business.orders.placed")
            .description("Total orders placed")
            .tag("channel", "web")
            .register(registry);
        this.orderProcessingTime = Timer.builder("business.orders.processing_time")
            .description("Time to process an order")
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry);
        this.activeCheckouts = registry.gauge(
            "business.checkouts.active",
            new AtomicInteger(0)
        );
    }

    public void recordOrder(String type, double amount) {
        ordersPlaced.increment();
        registry.counter("business.revenue",
            "type", type,
            "currency", "USD"
        ).increment(amount);
    }
}
Structured Logging with Trace Context
Logs become powerful when they carry trace context:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@Service
public class PaymentService {

    private static final Logger log = LoggerFactory.getLogger(PaymentService.class);

    private final StripeClient stripeClient;

    public PaymentService(StripeClient stripeClient) {
        this.stripeClient = stripeClient;
    }

    public PaymentResult processPayment(PaymentRequest request) {
        // trace_id and span_id are automatically injected into MDC
        log.info("Processing payment for customer={} amount={}",
            request.getCustomerId(), request.getAmount());
        try {
            PaymentResult result = stripeClient.charge(request);
            log.info("Payment successful transaction_id={}", result.getTransactionId());
            return result;
        } catch (PaymentException e) {
            log.error("Payment failed for customer={} error={}",
                request.getCustomerId(), e.getMessage(), e);
            throw e;
        }
    }
}
In Grafana, you can jump from a log line directly to its trace, see every service that request touched, and identify exactly where the failure occurred.
Sampling Strategies for Production
At scale, collecting 100% of telemetry is prohibitively expensive. Smart sampling strategies are essential:
| Strategy | Description | Use When |
|---|---|---|
| Head-based | Decide at request start (random %) | Simple, predictable cost |
| Tail-based | Decide after request completes | Need to keep all errors/slow requests |
| Priority | Always sample certain request types | Critical paths need 100% visibility |
| Adaptive | Adjust rate based on traffic volume | Variable traffic patterns |
The collector configuration above demonstrates tail-based sampling: 100% of errors and slow requests are kept, while normal traffic is sampled at 10%.
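The decision logic of those three policies can be sketched in a few lines of plain Java. The real evaluation happens inside the Collector after `decision_wait` expires; the random value is passed in as a parameter here to keep the sketch deterministic:

```java
public class TailSampler {
    // Mirrors the three policies from the collector config:
    // keep all errors, keep all slow traces, otherwise keep ~10% at random.
    static boolean keep(boolean hasError, long durationMs, double random) {
        if (hasError) return true;           // "errors" policy (status_code)
        if (durationMs >= 1000) return true; // "slow-requests" policy (threshold_ms: 1000)
        return random < 0.10;                // "default" policy (sampling_percentage: 10)
    }

    public static void main(String[] args) {
        System.out.println(keep(true, 50, 0.99));    // error -> kept: true
        System.out.println(keep(false, 1500, 0.99)); // slow -> kept: true
        System.out.println(keep(false, 50, 0.50));   // sampled out: false
    }
}
```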
The Grafana Stack: Putting It All Together
The most popular open-source observability stack in 2026:
– Grafana Tempo — Distributed tracing backend (trace storage and search)
– Prometheus — Metrics collection and alerting
– Grafana Loki — Log aggregation with label-based indexing
– Grafana — Unified dashboard and exploration UI
All three backends are connected in Grafana through exemplars and trace-to-logs correlations. Click a spike in a latency graph, see the traces that caused it, click a trace span, see the logs from that exact moment. This workflow transforms debugging from hours to minutes.
Getting Started Checklist
– Add the OTel Java Agent (or SDK for your language) — auto-instrumentation covers 80% of needs
– Deploy an OTel Collector as a sidecar or daemonset
– Export to your backend of choice (Grafana stack is free and excellent)
– Add trace IDs to your structured logs
– Define 3–5 custom business metrics that matter to your team
– Set up tail-based sampling to control costs while keeping error traces
– Build dashboards with RED metrics (Rate, Errors, Duration) for each service
– Create alerts on SLO violations, not raw thresholds
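As a sketch of what a RED dashboard computes over a time window, here is a hedged plain-Java example; the `Request` record is a hypothetical stand-in for real metric data, and production systems compute these from aggregated time series rather than raw requests:

```java
import java.util.List;

// Hypothetical per-request record for illustration: duration plus error flag.
record Request(long durationMs, boolean error) {}

public class RedMetrics {
    // Rate: requests per second over the window
    static double rate(List<Request> window, double windowSeconds) {
        return window.size() / windowSeconds;
    }

    // Errors: fraction of requests that failed
    static double errorRate(List<Request> window) {
        long errors = window.stream().filter(Request::error).count();
        return (double) errors / window.size();
    }

    // Duration: mean latency here (dashboards usually also show percentiles)
    static double avgDurationMs(List<Request> window) {
        return window.stream().mapToLong(Request::durationMs).average().orElse(0);
    }

    public static void main(String[] args) {
        List<Request> window = List.of(
            new Request(100, false), new Request(300, true),
            new Request(200, false), new Request(400, false));
        System.out.println(rate(window, 2.0));     // prints 2.0
        System.out.println(errorRate(window));     // prints 0.25
        System.out.println(avgDurationMs(window)); // prints 250.0
    }
}
```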
Observability is not optional for distributed systems. OpenTelemetry makes it achievable without vendor lock-in, and in 2026, the tooling has matured to the point where there is no excuse not to implement it.