Argo Workflows for Kubernetes Batch Processing
Argo Workflows is a cloud-native pipeline orchestration engine for Kubernetes batch processing that runs directly on your cluster. Unlike external orchestration tools such as Airflow, which manage Kubernetes jobs remotely, Argo Workflows is a Kubernetes-native CRD (Custom Resource Definition) that defines workflows as Kubernetes resources, giving you the full power of Kubernetes scheduling, resource management, and observability.
This guide covers building production batch processing pipelines with Argo Workflows, from simple sequential steps to complex DAG-based workflows with conditional execution, parameter passing, and artifact management. You will also learn retry strategies, resource optimization, and integration patterns with data warehouses and ML training pipelines.
Why Argo Workflows Over Airflow
Apache Airflow is the industry standard for workflow orchestration, but it was designed before Kubernetes became the dominant deployment platform. Airflow requires its own infrastructure — a web server, scheduler, metadata database, and workers. Additionally, Airflow DAGs are defined in Python and stored on a shared filesystem, creating deployment and versioning challenges.
Argo Workflows runs as a lightweight controller on Kubernetes. Workflows are YAML manifests versioned in Git alongside your application code. Each step runs in its own container with resource limits, and Kubernetes handles scheduling, scaling, and failure recovery. Furthermore, Argo Workflows integrates natively with Kubernetes RBAC, service accounts, and secrets.
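To make the CRD-native model concrete, here is a minimal sketch of a single-step Workflow manifest; the names and image are illustrative, not from the pipeline built later in this guide:

```yaml
# hello-workflow.yaml — a minimal single-step Workflow
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-        # controller appends a random suffix per run
spec:
  entrypoint: say-hello       # template to run first
  templates:
    - name: say-hello
      container:
        image: busybox:latest
        command: [echo]
        args: ["hello from Argo"]
```

Submitted with `argo submit hello-workflow.yaml --watch`, each step runs as an ordinary pod that Kubernetes schedules like any other workload.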
Installing Argo Workflows
# Install Argo Workflows
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.6.0/install.yaml
# Install Argo CLI
curl -sLO https://github.com/argoproj/argo-workflows/releases/download/v3.6.0/argo-linux-amd64.gz
gunzip argo-linux-amd64.gz
chmod +x argo-linux-amd64
sudo mv argo-linux-amd64 /usr/local/bin/argo
# Verify installation
argo version
Building a Data Pipeline with Argo Workflows
# data-pipeline.yaml — ETL workflow with DAG
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: data-pipeline-
  namespace: data-processing
spec:
  entrypoint: etl-pipeline
  arguments:
    parameters:
      - name: date
        value: "2026-03-23"
      - name: source-bucket
        value: "s3://raw-data"
      - name: dest-bucket
        value: "s3://processed-data"
  # Artifact repository configuration
  artifactRepositoryRef:
    configMap: artifact-repositories
    key: default-v1
  # Garbage-collect completed pods after the workflow finishes
  podGC:
    strategy: OnWorkflowCompletion
    deleteDelayDuration: 600s
  templates:
    - name: etl-pipeline
      dag:
        tasks:
          # Extract from multiple sources in parallel
          - name: extract-orders
            template: extract
            arguments:
              parameters:
                - name: source
                  value: "orders"
          - name: extract-customers
            template: extract
            arguments:
              parameters:
                - name: source
                  value: "customers"
          - name: extract-products
            template: extract
            arguments:
              parameters:
                - name: source
                  value: "products"
          # Transform — depends on all extractions completing
          - name: transform
            template: transform-data
            dependencies: [extract-orders, extract-customers, extract-products]
            arguments:
              artifacts:
                - name: orders-data
                  from: "{{tasks.extract-orders.outputs.artifacts.extracted-data}}"
                - name: customers-data
                  from: "{{tasks.extract-customers.outputs.artifacts.extracted-data}}"
                - name: products-data
                  from: "{{tasks.extract-products.outputs.artifacts.extracted-data}}"
          # Validate transformed data
          - name: validate
            template: validate-data
            dependencies: [transform]
            arguments:
              artifacts:
                - name: transformed-data
                  from: "{{tasks.transform.outputs.artifacts.transformed-data}}"
          # Load — only if validation passes
          - name: load
            template: load-data
            dependencies: [validate]
            when: "{{tasks.validate.outputs.parameters.validation-status}} == passed"
            arguments:
              artifacts:
                - name: transformed-data
                  from: "{{tasks.transform.outputs.artifacts.transformed-data}}"
          # Notify on completion or failure
          - name: notify-success
            template: notify
            dependencies: [load]
            arguments:
              parameters:
                - name: status
                  value: "Pipeline completed successfully"
          - name: notify-failure
            template: notify
            dependencies: [validate]
            when: "{{tasks.validate.outputs.parameters.validation-status}} != passed"
            arguments:
              parameters:
                - name: status
                  value: "Pipeline failed: validation errors"

    # Extract template — reusable for each source
    - name: extract
      # Retry transient extraction failures with exponential backoff
      retryStrategy:
        limit: 3
        retryPolicy: Always
        backoff:
          duration: "30s"
          factor: 2
          maxDuration: "5m"
      inputs:
        parameters:
          - name: source
      outputs:
        artifacts:
          - name: extracted-data
            path: /tmp/output/
      container:
        image: myregistry/data-extractor:latest
        command: [python, extract.py]
        args: ["--source", "{{inputs.parameters.source}}",
               "--date", "{{workflow.parameters.date}}",
               "--output", "/tmp/output/"]
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
          limits:
            memory: "4Gi"
            cpu: "2"

    # Transform template
    - name: transform-data
      inputs:
        artifacts:
          - name: orders-data
            path: /tmp/input/orders/
          - name: customers-data
            path: /tmp/input/customers/
          - name: products-data
            path: /tmp/input/products/
      outputs:
        artifacts:
          - name: transformed-data
            path: /tmp/output/
      container:
        image: myregistry/data-transformer:latest
        command: [python, transform.py]
        args: ["--input", "/tmp/input/", "--output", "/tmp/output/"]
        resources:
          requests:
            memory: "8Gi"
            cpu: "4"
          limits:
            memory: "16Gi"
            cpu: "8"

    # Validation template — receives the transformed artifact as input
    - name: validate-data
      inputs:
        artifacts:
          - name: transformed-data
            path: /tmp/input/
      outputs:
        parameters:
          - name: validation-status
            valueFrom:
              path: /tmp/validation-result.txt
      container:
        image: myregistry/data-validator:latest
        command: [python, validate.py]
        args: ["--input", "/tmp/input/"]

    # Load template — receives the transformed artifact as input
    - name: load-data
      inputs:
        artifacts:
          - name: transformed-data
            path: /tmp/input/
      container:
        image: myregistry/data-loader:latest
        command: [python, load.py]
        args: ["--input", "/tmp/input/", "--dest", "{{workflow.parameters.dest-bucket}}"]

    # Notification template
    - name: notify
      inputs:
        parameters:
          - name: status
      container:
        image: curlimages/curl:latest
        command: [sh, -c]
        args:
          - |
            curl -X POST https://hooks.slack.com/services/T.../B.../xxx \
              -H 'Content-Type: application/json' \
              -d '{"text": "Data Pipeline: {{inputs.parameters.status}}"}'

CronWorkflows for Scheduled Batch Jobs
For recurring batch jobs, use CronWorkflows: the Kubernetes-native equivalent of cron jobs, but with Argo's full workflow capabilities, including retries, DAGs, and artifact management.
# cron-workflow.yaml — Daily data pipeline
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: daily-data-pipeline
  namespace: data-processing
spec:
  schedule: "0 2 * * *" # 2 AM daily
  timezone: "UTC"
  concurrencyPolicy: Replace
  startingDeadlineSeconds: 300
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 3
  workflowSpec:
    entrypoint: etl-pipeline
    arguments:
      parameters:
        - name: date
          value: '{{workflow.scheduledTime.Format "2006-01-02"}}'
    # ... same templates as above

When NOT to Use Argo Workflows
Argo Workflows requires Kubernetes expertise: if your team does not run Kubernetes in production, the learning curve is steep, and Airflow or simple cron jobs may be more appropriate. In addition, the YAML-based workflow definition can become verbose for complex logic that would be more naturally expressed in Python code.
For real-time stream processing, Argo Workflows is not the right tool — use Apache Flink, Kafka Streams, or similar streaming frameworks instead. Argo is designed for batch processing and pipeline orchestration, not continuous data processing.
Key Takeaways
Argo Workflows brings batch pipeline orchestration natively into your Kubernetes cluster, eliminating the need for external orchestration infrastructure. DAG-based workflows with parallel execution, retry strategies, and artifact passing enable complex data processing pipelines, and CronWorkflows provide reliable scheduling with full workflow capabilities.
Start by migrating your simplest cron job to an Argo Workflow and gradually tackle more complex pipelines. For comprehensive documentation, see the Argo Workflows docs and the Argo Workflows examples repository. Our guides on Karpenter for node autoscaling and GitHub Actions self-hosted runners complement your Kubernetes operations toolkit.
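The day-to-day CLI loop for that migration looks roughly like this; the file names, namespace, and parameter values match the examples above and are illustrative:

```shell
# Submit the pipeline and stream its status until it finishes
argo submit -n data-processing data-pipeline.yaml \
  -p date=2026-03-23 --watch

# List recent runs and inspect the logs of the latest one
argo list -n data-processing
argo logs -n data-processing @latest

# Create or update the scheduled version
kubectl apply -n data-processing -f cron-workflow.yaml
```

Because both manifests live in Git, promoting a pipeline change is an ordinary pull request followed by `kubectl apply`, with no separate DAG deployment step.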