OpenTelemetry in 2026: The Standard for Modern Observability
You cannot fix what you cannot see. In distributed systems with dozens of microservices, a single user request might touch 10 services, 3 databases, and 2 message queues. When something goes wrong, finding the root cause without proper observability is like debugging in the dark. OpenTelemetry has become the industry standard for making distributed systems visible — and in 2026, it is mature enough for every team to adopt.
What Is OpenTelemetry?
OpenTelemetry (OTel) is a vendor-neutral, open-source observability framework. It provides APIs, SDKs, and tools to generate, collect, and export three types of telemetry data:
- Traces — The journey of a request across services (distributed tracing)
- Metrics — Numerical measurements over time (counters, histograms, gauges)
- Logs — Structured event records with context
The key word is vendor-neutral. You instrument your code once with OpenTelemetry, and you can export to any backend — Jaeger, Grafana Tempo, Datadog, New Relic, AWS X-Ray, or any combination.
The Three Pillars in Practice
Traces answer: "What happened to this specific request?"
A trace follows a single request from the frontend through every service it touches. Each step is a span — a named, timed operation with metadata.
```
User Request → API Gateway (12ms)
  └→ Auth Service (8ms)
  └→ Order Service (45ms)
       └→ PostgreSQL Query (15ms)
  └→ Payment Service (120ms)
       └→ Stripe API (95ms)
  └→ Notification Service (5ms)
       └→ Redis Pub/Sub (2ms)
```
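Under the hood, a span is nothing exotic: a named operation with a start time, an end time, and a parent. The toy sketch below (plain Java, no OpenTelemetry dependency; all names are illustrative) shows the core idea that tracing UIs render as the tree above:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a span: a named, timed operation with a parent.
// Illustrative only. The real OpenTelemetry API adds context
// propagation, attributes, and events on top of this idea.
public class ToySpan {
    final String name;
    final ToySpan parent;
    final List<ToySpan> children = new ArrayList<>();
    long startNanos;
    long endNanos;

    ToySpan(String name, ToySpan parent) {
        this.name = name;
        this.parent = parent;
        if (parent != null) parent.children.add(this);
    }

    ToySpan start() { startNanos = System.nanoTime(); return this; }
    void end()      { endNanos = System.nanoTime(); }
    long durationNanos() { return endNanos - startNanos; }

    // Render the span tree the way a tracing UI would: indented by depth.
    String render(int depth) {
        StringBuilder sb = new StringBuilder();
        sb.append("  ".repeat(depth)).append(name).append("\n");
        for (ToySpan child : children) sb.append(child.render(depth + 1));
        return sb.toString();
    }
}
```

Because each child records its parent, the backend can reassemble the full request tree from individually reported spans.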
Metrics answer: "How is the system performing overall?"
- Request rate: 1,250 req/s
- Error rate: 0.3%
- P99 latency: 450ms
- Active database connections: 42/100
Logs answer: "What exactly happened at this moment?"
```json
{
  "timestamp": "2026-02-23T10:15:32Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "message": "Payment processing failed",
  "error": "Stripe API timeout after 30s",
  "customer_id": "cust_42",
  "amount": 99.99
}
```
The power comes from correlation. The trace_id in the log connects to the same trace in your tracing backend, which connects to the same request in your metrics. One ID links all three pillars.
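The trace ID travels between services via the W3C Trace Context `traceparent` HTTP header, which is what makes cross-pillar correlation possible in the first place. A minimal parser sketch for that header format (the class and method names are illustrative):

```java
// Minimal parser for the W3C Trace Context "traceparent" header, which is
// how OpenTelemetry propagates the trace ID between services over HTTP.
// Format: version "-" trace-id (32 hex) "-" parent-span-id (16 hex) "-" flags
public class TraceParent {
    public final String traceId;
    public final String spanId;
    public final boolean sampled;

    private TraceParent(String traceId, String spanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    public static TraceParent parse(String header) {
        String[] parts = header.split("-");
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16) {
            throw new IllegalArgumentException("malformed traceparent: " + header);
        }
        // Bit 0 of the flags byte is the "sampled" flag.
        boolean sampled = (Integer.parseInt(parts[3], 16) & 0x01) == 1;
        return new TraceParent(parts[1], parts[2], sampled);
    }
}
```

Every service that receives this header continues the same trace, and every log line it writes carries the same `trace_id`, which is exactly the join key the three pillars share.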
Instrumenting a Spring Boot Application
Spring Boot has excellent OpenTelemetry support through Micrometer and the OTel Java Agent:
```xml
<!-- pom.xml -->
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>
```
```yaml
# application.yml
management:
  tracing:
    sampling:
      probability: 1.0  # Sample 100% in dev, lower in production
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces

logging:
  pattern:
    console: "%d{HH:mm:ss} [%X{traceId}] %-5level %logger{36} - %msg%n"
```
```java
@RestController
@RequestMapping("/api/orders")
public class OrderController {

    private final OrderService orderService;
    private final ObservationRegistry registry;

    public OrderController(OrderService orderService, ObservationRegistry registry) {
        this.orderService = orderService;
        this.registry = registry;
    }

    @GetMapping("/{id}")
    public OrderResponse getOrder(@PathVariable Long id) {
        // Automatic span creation via Spring Observation
        return Observation.createNotStarted("order.fetch", registry)
            .lowCardinalityKeyValue("order.type", "standard")
            .observe(() -> orderService.findById(id));
    }
}
```

```java
@Service
public class OrderService {

    private final JdbcTemplate jdbc;
    private final PaymentClient paymentClient;

    public OrderService(JdbcTemplate jdbc, PaymentClient paymentClient) {
        this.jdbc = jdbc;
        this.paymentClient = paymentClient;
    }

    // Custom span for business logic
    @Observed(name = "order.process")
    public OrderResponse findById(Long id) {
        // JDBC calls are auto-instrumented — each query becomes a span
        Order order = jdbc.queryForObject(
            "SELECT * FROM orders WHERE id = ?", orderRowMapper, id);

        // HTTP calls to other services are auto-traced
        Payment payment = paymentClient.getPayment(order.getPaymentId());

        return new OrderResponse(order, payment);
    }
}
```
With the OTel Java Agent, most instrumentation is automatic — JDBC queries, HTTP client calls, Kafka producers/consumers, and Redis commands all generate spans without code changes.
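Attaching the agent is a single JVM flag plus a couple of system properties; no code or build changes are required. A typical launch command, as a sketch (the jar names, service name, and collector endpoint below are illustrative for this article's setup):

```shell
# Download the agent jar once, then attach it at JVM startup.
# otel.service.name labels all telemetry from this process;
# otel.exporter.otlp.endpoint points at the collector's OTLP/gRPC port.
java -javaagent:./opentelemetry-javaagent.jar \
     -Dotel.service.name=order-service \
     -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
     -jar order-service.jar
```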
The OpenTelemetry Collector
The OTel Collector is a vendor-agnostic proxy that receives, processes, and exports telemetry data. It decouples your application from the backend:
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

  # Add resource attributes
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert

  # Filter out health check spans
  filter:
    spans:
      exclude:
        match_type: strict
        attributes:
          - key: http.route
            value: /health

  # Tail-based sampling — keep errors, sample normal traffic
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 1000 }
      - name: default
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource, filter, tail_sampling]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [loki]
```
This configuration receives telemetry via OTLP, processes it (batching, filtering, sampling), and exports traces to Grafana Tempo, metrics to Prometheus, and logs to Loki.
Custom Metrics That Matter
Beyond auto-instrumented metrics, define custom ones for your business:
```java
import java.util.concurrent.atomic.AtomicInteger;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;

@Component
public class BusinessMetrics {

    private final MeterRegistry registry;
    private final Counter ordersPlaced;
    private final Timer orderProcessingTime;
    private final AtomicInteger activeCheckouts;

    public BusinessMetrics(MeterRegistry registry) {
        this.registry = registry;
        this.ordersPlaced = Counter.builder("business.orders.placed")
            .description("Total orders placed")
            .tag("channel", "web")
            .register(registry);
        this.orderProcessingTime = Timer.builder("business.orders.processing_time")
            .description("Time to process an order")
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry);
        this.activeCheckouts = registry.gauge(
            "business.checkouts.active",
            new AtomicInteger(0)
        );
    }

    public void recordOrder(String type, double amount) {
        ordersPlaced.increment();
        registry.counter("business.revenue",
            "type", type,
            "currency", "USD"
        ).increment(amount);
    }
}
```
Structured Logging with Trace Context
Logs become powerful when they carry trace context:
```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@Service
public class PaymentService {

    private static final Logger log = LoggerFactory.getLogger(PaymentService.class);

    private final StripeClient stripeClient;

    public PaymentService(StripeClient stripeClient) {
        this.stripeClient = stripeClient;
    }

    public PaymentResult processPayment(PaymentRequest request) {
        // trace_id and span_id are automatically injected into the MDC
        log.info("Processing payment for customer={} amount={}",
            request.getCustomerId(), request.getAmount());
        try {
            PaymentResult result = stripeClient.charge(request);
            log.info("Payment successful transaction_id={}", result.getTransactionId());
            return result;
        } catch (PaymentException e) {
            log.error("Payment failed for customer={} error={}",
                request.getCustomerId(), e.getMessage(), e);
            throw e;
        }
    }
}
```
In Grafana, you can jump from a log line directly to its trace, see every service that request touched, and identify exactly where the failure occurred.
Sampling Strategies for Production
At scale, collecting 100% of telemetry is prohibitively expensive. Smart sampling strategies are essential:
| Strategy | Description | Use When |
|---|---|---|
| Head-based | Decide at request start (random %) | Simple, predictable cost |
| Tail-based | Decide after request completes | Need to keep all errors/slow requests |
| Priority | Always sample certain request types | Critical paths need 100% visibility |
| Adaptive | Adjust rate based on traffic volume | Variable traffic patterns |
The collector configuration above demonstrates tail-based sampling: 100% of errors and slow requests are kept, while normal traffic is sampled at 10%.
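One detail worth understanding about head-based sampling: the decision should be derived from the trace ID, not from a fresh random number in each service, so every participant in a request makes the same keep-or-drop decision. A simplified sketch of that idea in plain Java (the class name is illustrative; the real SDK's ratio-based sampler works on the same principle):

```java
// Sketch of a head-based (probabilistic) sampling decision.
// Deriving the decision from the trace ID, rather than rolling a random
// number in each service, guarantees that every hop in the request makes
// the SAME decision, so traces are never half-collected. Simplified from
// the idea behind the SDK's trace-ID-ratio sampler.
public class HeadSampler {
    public static boolean shouldSample(String traceId, double ratio) {
        // Treat the low 8 hex chars of the 32-char trace ID as an unsigned value
        // uniformly distributed in [0, 2^32).
        long slice = Long.parseLong(traceId.substring(24), 16);
        long bound = (long) (ratio * 0x100000000L);
        return slice < bound;
    }
}
```

Because the input is the trace ID itself, a service deep in the call graph reaches the identical verdict as the edge gateway without any coordination.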
The Grafana Stack: Putting It All Together
The most popular open-source observability stack in 2026:
- Grafana Tempo — Distributed tracing backend (trace storage and search)
- Prometheus — Metrics collection and alerting
- Grafana Loki — Log aggregation with label-based indexing
- Grafana — Unified dashboard and exploration UI
All three backends are connected in Grafana through exemplars and trace-to-logs correlations. Click a spike in a latency graph, see the traces that caused it, click a trace span, see the logs from that exact moment. This workflow transforms debugging from hours to minutes.
Getting Started Checklist
- Add the OTel Java Agent (or SDK for your language) — auto-instrumentation covers 80% of needs
- Deploy an OTel Collector as a sidecar or DaemonSet
- Export to your backend of choice (the Grafana stack is free and excellent)
- Add trace IDs to your structured logs
- Define 3–5 custom business metrics that matter to your team
- Set up tail-based sampling to control costs while keeping error traces
- Build dashboards with RED metrics (Rate, Errors, Duration) for each service
- Create alerts on SLO violations, not raw thresholds
For further reading, the official OpenTelemetry documentation and the Grafana Tempo, Loki, and Prometheus documentation are comprehensive references.
Observability is not optional for distributed systems. OpenTelemetry makes it achievable without vendor lock-in, and in 2026, the tooling has matured to the point where there is no excuse not to implement it.