OpenTelemetry Collector Pipelines for Production Observability
OpenTelemetry Collector pipelines are the backbone of modern observability infrastructure. The Collector acts as a vendor-agnostic proxy that receives, processes, and exports telemetry data — traces, metrics, and logs — from your applications to any backend. Instead of coupling your services to specific monitoring vendors, you route everything through the Collector and swap backends without code changes.
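In practice, re-pointing an instrumented service at a Collector is a configuration change, not a code change, because every SDK honors the standard OTLP environment variables. A minimal sketch, assuming a Kubernetes pod spec and a Collector reachable at the hypothetical service name `otel-collector`:

```yaml
# Pod spec fragment (sketch): point the OpenTelemetry SDK at the Collector.
# The service name "otel-collector", port 4317 (OTLP/gRPC), and the service
# name "checkout-service" are placeholders for this example.
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector:4317"
  - name: OTEL_SERVICE_NAME
    value: "checkout-service"
```

Swapping backends later then means editing the Collector's exporters, not redeploying services.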
This guide covers building production-grade Collector configurations, from simple single-node setups to multi-tier architectures handling millions of spans per second. You will learn how to filter noise, enrich data, manage sampling, and reduce observability costs without losing visibility.
Understanding Collector Architecture
The OpenTelemetry Collector has three main component types: receivers (data ingestion), processors (data transformation), and exporters (data output). These components chain together into pipelines, with each pipeline handling one signal type: traces, metrics, or logs.
# Basic collector configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true

processors:
  batch:
    send_batch_size: 8192
    timeout: 200ms
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
    spike_limit_mib: 256
  resourcedetection:
    detectors: [env, system, docker, gcp, aws, azure]

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resourcedetection, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, batch]
      exporters: [loki]

Advanced Processor Pipelines
Raw telemetry data is noisy and expensive to store. Processors let you filter, sample, transform, and enrich data before it reaches your backend, and the right processor chain can cut storage costs substantially (often on the order of 70%) while keeping the signals that matter.
# Advanced processor configuration
processors:
  # Tail-based sampling — keep errors + slow traces + sample rest
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 2000}
      - name: probabilistic-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
  # Attribute processing — redact PII, add context
  attributes:
    actions:
      - key: user.email
        action: hash
      - key: http.request.header.authorization
        action: delete
      - key: deployment.environment
        value: production
        action: upsert
  # Span filtering — drop health checks
  filter/traces:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
        - 'attributes["http.route"] == "/readyz"'
        - 'name == "SELECT 1"'
  # Metric transformation — aggregate and rename
  transform/metrics:
    metric_statements:
      - context: datapoint
        statements:
          - set(attributes["service.namespace"], "production")
      - context: metric
        statements:
          - set(description, "") where name == "process.runtime.jvm.memory.usage"
  # Log body parsing — extract structured fields
  transform/logs:
    log_statements:
      - context: log
        statements:
          - merge_maps(attributes, ParseJSON(body), "insert") where IsMatch(body, "^\\{")
          - set(severity_text, "ERROR") where IsMatch(body, "(?i)error|exception|fatal")

Processor ordering matters significantly. Always place memory_limiter first to prevent out-of-memory crashes, follow it with filtering processors to reduce volume early, then enrichment processors, and put batch last.
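Putting that ordering into a pipeline definition looks like this (a sketch that reuses the processor names defined above and the `otlp/jaeger` exporter from the basic configuration):

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter first, filtering and sampling early,
      # enrichment next, batch last
      processors: [memory_limiter, filter/traces, tail_sampling, attributes, batch]
      exporters: [otlp/jaeger]
```

Placing filter/traces and tail_sampling before attributes means the enrichment work only runs on spans you actually intend to keep.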
Multi-Tier Collector Architecture
For production deployments handling significant volume, use a two-tier architecture. Agent Collectors run as sidecars or a DaemonSet, collecting and forwarding data. Gateway Collectors run as a central service, performing heavy processing like tail sampling that requires cross-service context.
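The agent tier can stay thin: receive OTLP locally, apply the memory limiter, and forward. One sketch, assuming the contrib loadbalancing exporter and a hypothetical headless service named `otel-gateway-headless`, so that all spans of a trace reach the same gateway instance (which tail sampling requires):

```yaml
# Agent Collector (daemonset): minimal sketch
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  batch:
    timeout: 200ms
exporters:
  # Route by trace ID so each trace lands on exactly one gateway instance
  loadbalancing:
    routing_key: "traceID"
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-gateway-headless  # assumption: your gateway's headless Service
        port: 4317
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loadbalancing]
```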
# Gateway Collector — central processing tier
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 16

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 4096
  # Group spans by trace ID for tail sampling
  groupbytrace:
    wait_duration: 15s
    num_traces: 500000
    num_workers: 4
  tail_sampling:
    decision_wait: 20s
    num_traces: 500000
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-high-latency
        type: latency
        latency: {threshold_ms: 1500}
      - name: sample-rest
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
  batch:
    send_batch_size: 16384
    timeout: 500ms

exporters:
  otlp/tempo:
    endpoint: tempo-distributor:4317
    tls:
      insecure: true
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
    retry_on_failure:
      enabled: true
      max_elapsed_time: 300s

service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, groupbytrace, tail_sampling, batch]
      exporters: [otlp/tempo]

Kubernetes Deployment with Helm
# values.yaml for opentelemetry-collector Helm chart
mode: daemonset

config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
    filelog:
      include:
        - /var/log/pods/*/*/*.log
      exclude:
        - /var/log/pods/*/otc-container/*.log
      operators:
        - type: router
          routes:
            - output: parser-docker
              expr: 'body matches "^\\{"'
        - id: parser-docker
          type: json_parser
          timestamp:
            parse_from: attributes.time
            layout: '%Y-%m-%dT%H:%M:%S.%LZ'
  processors:
    k8sattributes:
      extract:
        metadata:
          - k8s.pod.name
          - k8s.namespace.name
          - k8s.deployment.name
          - k8s.node.name
      pod_association:
        - sources:
            - from: resource_attribute
              name: k8s.pod.ip

resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 128Mi

When NOT to Use OpenTelemetry Collector
The Collector adds network hops and operational complexity. If your organization uses a single observability vendor and runs fewer than about ten services, sending telemetry directly from the SDKs to the backend is simpler. And in regulated environments where data must not pass through intermediary services, direct export may be required for compliance.
Small teams running simple architectures should therefore start with direct SDK exporters and introduce the Collector when they need vendor-agnostic routing, data transformation, or cost optimization through sampling. The Collector shines at scale but adds overhead for simple setups.
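Direct export in that scenario is again just the standard OTLP environment variables, pointed at the vendor instead of a Collector. A sketch, where the endpoint URL and header value are placeholders for your vendor's actual values:

```yaml
# Pod spec fragment (sketch): SDK exports straight to the backend,
# no Collector in the path.
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "https://otlp.example-vendor.com:4317"  # placeholder endpoint
  - name: OTEL_EXPORTER_OTLP_HEADERS
    value: "api-key=YOUR_KEY"  # placeholder credential
```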
Key Takeaways
OpenTelemetry Collector pipelines provide the flexibility and power needed for production observability at scale. Start with a simple configuration, add processors as your needs grow, and consider a gateway tier when volume demands tail sampling. The vendor-agnostic approach protects your investment regardless of which backends you choose today or switch to tomorrow.
- Route all telemetry through OTLP so you can swap backends without touching application code
- Order processors deliberately: memory_limiter first, filtering early, enrichment next, batch last
- Drop health-check spans and redact PII before data leaves your infrastructure
- Move tail-based sampling to a gateway tier once volume grows: keep errors and slow traces, sample the rest
- Monitor the Collector's own telemetry (metrics on port 8888), test configuration changes in staging, and document sampling decisions so future team members know why data is missing
For related DevOps topics, explore our guide on Kubernetes monitoring with Prometheus and Docker container best practices. Additionally, the official OpenTelemetry Collector documentation and contrib repository are essential references.
In conclusion, OpenTelemetry Collector pipelines are a foundational piece of modern observability infrastructure. Apply the patterns covered in this guide incrementally: start with the fundamentals, iterate on your configuration, and continuously measure results to ensure you are getting the most value from your telemetry.