OpenTelemetry Collector Pipelines: Building Production Observability Stacks

OpenTelemetry Collector pipelines are the backbone of modern observability infrastructure. The Collector acts as a vendor-agnostic proxy that receives, processes, and exports telemetry data — traces, metrics, and logs — from your applications to any backend. Instead of coupling your services to specific monitoring vendors, you route everything through the Collector and swap backends without code changes.

This guide covers building production-grade Collector configurations, from simple single-node setups to multi-tier architectures handling millions of spans per second. You will learn how to filter noise, enrich data, manage sampling, and reduce observability costs without losing visibility.

Understanding Collector Architecture

The OpenTelemetry Collector has three main component types: receivers (data ingestion), processors (data transformation), and exporters (data output). These components chain together into pipelines, with each pipeline handling one signal type: traces, metrics, or logs.

[Figure: Collector pipeline architecture, showing receivers, processors, and exporters for each signal type]

# Basic collector configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true

processors:
  batch:
    send_batch_size: 8192
    timeout: 200ms
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
    spike_limit_mib: 256
  resourcedetection:
    detectors: [env, system, docker, gcp, aws, azure]

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

Advanced Processor Pipelines

Raw telemetry data is noisy and expensive to store. Processors let you filter, sample, transform, and enrich data before it reaches your backend. A well-designed processor chain can reduce storage costs by as much as 70% while keeping the signals that matter.

# Advanced processor configuration
processors:
  # Tail-based sampling — keep errors + slow traces + sample rest
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 2000}
      - name: probabilistic-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

  # Attribute processing — redact PII, add context
  attributes:
    actions:
      - key: user.email
        action: hash
      - key: http.request.header.authorization
        action: delete
      - key: deployment.environment
        value: production
        action: upsert

  # Span filtering — drop health checks
  filter/traces:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
        - 'attributes["http.route"] == "/readyz"'
        - 'name == "SELECT 1"'

  # Metric transformation — aggregate and rename
  transform/metrics:
    metric_statements:
      - context: datapoint
        statements:
          - set(attributes["service.namespace"], "production")
      - context: metric
        statements:
          - set(description, "") where name == "process.runtime.jvm.memory.usage"

  # Log body parsing — extract structured fields
  transform/logs:
    log_statements:
      - context: log
        statements:
          - merge_maps(attributes, ParseJSON(body), "insert") where IsMatch(body, "^\\{")
          - set(severity_text, "ERROR") where IsMatch(body, "(?i)error|exception|fatal")

Processor ordering matters. Place memory_limiter first to prevent out-of-memory crashes, follow it with filtering processors to cut volume early, then run enrichment processors, and put batching last, just before export.
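
Applied to the processors defined earlier, a traces pipeline following this ordering might look like the sketch below (adjust the component names to your own config):

```yaml
# Sketch: recommended processor ordering for a traces pipeline
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors:
        - memory_limiter  # protect the Collector from OOM first
        - filter/traces   # drop health checks before doing more work
        - attributes      # redact PII, add context
        - batch           # batch last, just before export
      exporters: [otlp/jaeger]
```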

Multi-Tier Collector Architecture

For production deployments handling significant volume, use a two-tier architecture. Agent Collectors run as sidecars or DaemonSets, collecting and forwarding data. Gateway Collectors run as a central service, performing heavy processing, such as tail sampling, that requires cross-service context.
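
A minimal agent-tier sketch that pairs with the gateway config below might look like this (the `otel-gateway` endpoint is an assumed service name; substitute your own):

```yaml
# Agent Collector sketch: light local processing, forward everything to the gateway
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  batch:
    timeout: 200ms

exporters:
  otlp:
    endpoint: otel-gateway:4317  # assumed gateway Service name
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```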

# Gateway Collector — central processing tier
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 16

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 4096

  # Group spans by trace ID for tail sampling
  groupbytrace:
    wait_duration: 15s
    num_traces: 500000
    num_workers: 4

  tail_sampling:
    decision_wait: 20s
    num_traces: 500000
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-high-latency
        type: latency
        latency: {threshold_ms: 1500}
      - name: sample-rest
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

  batch:
    send_batch_size: 16384
    timeout: 500ms

exporters:
  otlp/tempo:
    endpoint: tempo-distributor:4317
    tls:
      insecure: true
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
    retry_on_failure:
      enabled: true
      max_elapsed_time: 300s

service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, groupbytrace, tail_sampling, batch]
      exporters: [otlp/tempo]

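Tail sampling only works if every span of a trace reaches the same gateway instance. When running multiple gateway replicas, agents can route by trace ID using the contrib loadbalancing exporter; a sketch is below (the headless Service name is an assumption):

```yaml
# Agent-side loadbalancing exporter: routes all spans of a trace
# to the same gateway replica so tail sampling sees complete traces
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-gateway-headless  # assumed headless Service for gateway pods
        port: 4317
```
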
Kubernetes Deployment with Helm

# values.yaml for opentelemetry-collector Helm chart
mode: daemonset

config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
    filelog:
      include:
        - /var/log/pods/*/*/*.log
      exclude:
        - /var/log/pods/*/otc-container/*.log
      operators:
        - type: router
          routes:
            - output: parser-docker
              expr: 'body matches "^\\{"'
        - id: parser-docker
          type: json_parser
          timestamp:
            parse_from: attributes.time
            layout: '%Y-%m-%dT%H:%M:%S.%LZ'

  processors:
    k8sattributes:
      extract:
        metadata:
          - k8s.pod.name
          - k8s.namespace.name
          - k8s.deployment.name
          - k8s.node.name
      pod_association:
        - sources:
            - from: resource_attribute
              name: k8s.pod.ip

resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 128Mi
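
The values above define the k8sattributes processor and filelog receiver but do not show the pipeline wiring. A sketch of the matching service section, merged into the chart's defaults, might be (the otlp exporter target and the memory_limiter/batch processors are assumed to come from the chart's default config):

```yaml
# Sketch: wiring k8sattributes and filelog into the agent pipelines
config:
  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [memory_limiter, k8sattributes, batch]
        exporters: [otlp]
      logs:
        receivers: [otlp, filelog]
        processors: [memory_limiter, k8sattributes, batch]
        exporters: [otlp]
```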

When NOT to Use OpenTelemetry Collector

The Collector adds network hops and operational complexity. If your organization uses a single observability vendor and runs fewer than about ten services, sending telemetry directly from the SDKs to the backend is simpler. If you operate in a regulated environment where telemetry must not pass through intermediary services, direct export may be required for compliance.

Small teams running simple architectures should start with direct SDK exporters and introduce the Collector once they need vendor-agnostic routing, data transformation, or cost optimization through sampling. The Collector shines at scale but adds overhead for simple setups.
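
Direct export needs no Collector-specific code: most OpenTelemetry SDKs honor the standard environment variables. A pod-spec sketch (endpoint, header, and service names are placeholders):

```yaml
# Sketch: direct SDK-to-backend export via standard OTel env vars
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "https://otlp.example-vendor.com:4317"
  - name: OTEL_EXPORTER_OTLP_HEADERS
    value: "api-key=YOUR_KEY"
  - name: OTEL_SERVICE_NAME
    value: "checkout-service"
  - name: OTEL_TRACES_SAMPLER
    value: "parentbased_traceidratio"
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.1"
```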

Key Takeaways

OpenTelemetry Collector pipelines provide the flexibility and power needed for production observability at scale. Start with a simple configuration, add processors as your needs grow, and consider a gateway tier when volume demands tail sampling. The vendor-agnostic approach protects your investment regardless of which backends you choose today or switch to tomorrow.

For related DevOps topics, explore our guides on Kubernetes monitoring with Prometheus and on Docker container best practices. The official OpenTelemetry Collector documentation and the contrib repository are also essential references.
