Zero-Downtime Database Migrations: Patterns That Actually Work

Every developer has been there: you need to rename a column, change a data type, or add a NOT NULL constraint, but your application serves thousands of requests per second and any downtime means lost revenue. Zero-downtime database migrations solve this by restructuring your schema while the application continues running. This guide covers the patterns, tools, and real-world strategies that let you evolve your database without taking your application offline.

Why Traditional Migrations Break Production

The fundamental problem is that schema changes and application code must be coordinated. When you run ALTER TABLE users RENAME COLUMN name TO full_name, every running instance of your application immediately breaks because it still queries the old column name. Moreover, long-running ALTER TABLE operations on large tables can lock the entire table for minutes or hours, blocking all reads and writes.

Consider a table with 50 million rows. Adding a column with a default value in PostgreSQL versions before 11 rewrites every row, holding an ACCESS EXCLUSIVE lock the entire time. During this lock, no queries can execute — not even SELECT statements. For a table that handles 5,000 queries per second, that means 5,000 failed requests per second for however long the migration takes.

The solution is to break every dangerous migration into a series of safe, backwards-compatible steps. Each step can be deployed independently, and the application works correctly at every intermediate state. This is the expand-contract pattern.

The Expand-Contract Pattern for Zero-Downtime Migrations

The expand-contract pattern splits every breaking change into three phases: expand (add new structure), migrate (copy data), and contract (remove old structure). At no point does the application break because both old and new structures coexist during the transition.

Here is a concrete example — renaming a column from name to full_name:

-- Phase 1: EXPAND — Add the new column (non-blocking)
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);

-- Create a trigger to keep both columns in sync
CREATE OR REPLACE FUNCTION sync_user_name() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' OR TG_OP = 'UPDATE' THEN
        IF NEW.full_name IS NULL AND NEW.name IS NOT NULL THEN
            NEW.full_name := NEW.name;
        ELSIF NEW.name IS NULL AND NEW.full_name IS NOT NULL THEN
            NEW.name := NEW.full_name;
        END IF;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_sync_user_name
    BEFORE INSERT OR UPDATE ON users
    FOR EACH ROW EXECUTE FUNCTION sync_user_name();

-- Phase 2: MIGRATE — Backfill existing rows (in batches)
UPDATE users SET full_name = name
WHERE full_name IS NULL
AND id BETWEEN 1 AND 100000;
-- Repeat in batches to avoid long-running transactions

-- Phase 3: CONTRACT — After ALL app instances use full_name
-- Deploy app code reading full_name instead of name
-- Wait for all old instances to drain
DROP TRIGGER trg_sync_user_name ON users;
DROP FUNCTION sync_user_name();
ALTER TABLE users DROP COLUMN name;

Each phase is a separate deployment. The expand phase adds infrastructure without breaking anything. The migrate phase copies data while the application reads from either column. The contract phase cleans up after every application instance has been updated. If anything goes wrong during any phase, you can roll back without data loss because the old structure still exists.
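
The Phase 2 backfill loop is usually driven from a script rather than a single SQL statement. Here is a minimal sketch of the batching logic using Python's sqlite3 as a stand-in for your real driver; the table name, batch size, and sleep interval are illustrative assumptions:

```python
import sqlite3
import time

BATCH_SIZE = 10_000  # tune so each transaction stays short

def backfill_full_name(conn: sqlite3.Connection) -> int:
    """Copy name -> full_name in small batches; returns rows updated."""
    total = 0
    while True:
        cur = conn.execute(
            # Limit each UPDATE to one batch so no transaction
            # holds row locks for long.
            """
            UPDATE users SET full_name = name
            WHERE id IN (
                SELECT id FROM users
                WHERE full_name IS NULL AND name IS NOT NULL
                LIMIT ?
            )
            """,
            (BATCH_SIZE,),
        )
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
        time.sleep(0.1)  # brief pause to let other traffic through

# Demo on an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, full_name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [(f"user{i}",) for i in range(25)])
updated = backfill_full_name(conn)
print(updated)  # 25
```

Keeping batches small bounds the duration of each transaction, which is what prevents lock pile-ups behind the backfill.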

[Figure: The expand-contract pattern keeps both old and new structures active during the transition period]

Common Migration Patterns with Zero Downtime

Beyond column renames, several other schema changes require careful handling. Adding a NOT NULL constraint is deceptively dangerous — in PostgreSQL, ALTER TABLE ... SET NOT NULL scans the entire table to verify no NULLs exist, holding a lock the whole time. The safe approach uses a CHECK constraint instead:

-- Safe NOT NULL: add a CHECK constraint as NOT VALID (brief lock, no table scan)
ALTER TABLE orders
ADD CONSTRAINT orders_status_not_null
CHECK (status IS NOT NULL) NOT VALID;

-- Validate separately (takes ShareUpdateExclusiveLock, not blocking writes)
ALTER TABLE orders VALIDATE CONSTRAINT orders_status_not_null;

-- In PG 12+, SET NOT NULL is now instant: the validated constraint proves no NULLs
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
ALTER TABLE orders DROP CONSTRAINT orders_status_not_null;

Changing a column type is another common need. You cannot safely cast a VARCHAR to an INTEGER in place. Instead, add a new INTEGER column, backfill it with converted data, update the application to write to both columns, then drop the old one. Consequently, every type change becomes an expand-contract operation.
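
The same choreography for a type change can be sketched end to end. The example below uses Python's sqlite3 as a stand-in (table and column names are illustrative; DROP COLUMN requires SQLite 3.35+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, quantity TEXT)")
conn.executemany("INSERT INTO orders (quantity) VALUES (?)", [("3",), ("12",)])

# EXPAND: add the new column alongside the old one
conn.execute("ALTER TABLE orders ADD COLUMN quantity_int INTEGER")

# MIGRATE: backfill with converted values (batch this in production)
conn.execute(
    "UPDATE orders SET quantity_int = CAST(quantity AS INTEGER) "
    "WHERE quantity_int IS NULL"
)
conn.commit()

# The app now writes both columns and reads only quantity_int.
# CONTRACT: once no code references the old column, drop it.
conn.execute("ALTER TABLE orders DROP COLUMN quantity")

rows = conn.execute("SELECT quantity_int FROM orders ORDER BY id").fetchall()
print(rows)  # [(3,), (12,)]
```

In a real system, rows that fail the cast need a cleanup pass before the contract phase; the sketch assumes all values convert cleanly.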

Index creation should always use CREATE INDEX CONCURRENTLY, which builds the index without blocking writes. The downside is that concurrent index creation takes 2-3x longer and cannot run inside a transaction. If it fails, you must drop the invalid index and retry.
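
Because a failed concurrent build leaves an INVALID index behind, the drop-and-retry loop is worth automating. A generic sketch follows; the execute callable and SQL strings are placeholders for your driver, not a specific library's API:

```python
import time

def create_index_with_retry(execute, index_name, create_sql, *, retries=3, delay=5.0):
    """Attempt CREATE INDEX CONCURRENTLY; on failure, drop the
    leftover INVALID index and try again."""
    for attempt in range(1, retries + 1):
        try:
            execute(create_sql)
            return True
        except Exception:
            # A failed concurrent build leaves an INVALID index that
            # must be removed before retrying.
            execute(f"DROP INDEX IF EXISTS {index_name}")
            if attempt == retries:
                raise
            time.sleep(delay)
    return False

# Demo with a stub executor that fails once, then succeeds
calls = []
state = {"failed": False}
def fake_execute(sql):
    calls.append(sql)
    if sql.startswith("CREATE") and not state["failed"]:
        state["failed"] = True
        raise RuntimeError("deadlock detected")

ok = create_index_with_retry(
    fake_execute, "idx_users_email",
    "CREATE INDEX CONCURRENTLY idx_users_email ON users (email)",
    delay=0.0,
)
print(ok, len(calls))  # True 3
```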

Flyway and Liquibase for Migration Management

Migration tools like Flyway and Liquibase track which migrations have been applied and ensure they run in order. However, neither tool enforces zero-downtime patterns — that responsibility falls on you. The tool manages versioning; you manage safety.

-- Flyway migration file: V15__add_full_name_column.sql
-- Phase 1: Expand — safe to run while the app serves traffic
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);

-- V16__backfill_full_name.sql
-- Phase 2: Migrate — run during low traffic, batched as shown earlier
UPDATE users SET full_name = name WHERE full_name IS NULL;

-- V17__drop_name_column.sql
-- Phase 3: Contract — run AFTER all app instances are updated
-- WARNING: only run after verifying no app code references 'name'
ALTER TABLE users DROP COLUMN name;

Liquibase offers a rollback mechanism that Flyway lacks in its free tier. However, rollbacks for schema changes are inherently risky — you cannot un-drop a column with its data. Therefore, the best strategy is forward-only migrations with the expand-contract pattern, where every state is valid and you never need to roll back.

[Figure: Migration tools track versioning; zero-downtime safety is your responsibility]

Blue-Green Database Migrations for Large Changes

Some changes are too complex for expand-contract — restructuring a normalized schema into a denormalized one, splitting a table, or migrating between database engines. For these, blue-green database migrations create a parallel database with the new schema, replicate data continuously, and switch traffic atomically.

The workflow uses change data capture (CDC) to stream changes from the old database to the new one. Tools like Debezium capture every INSERT, UPDATE, and DELETE from the old database's transaction log and apply them to the new database in real time. When the new database is caught up, you switch the application's connection string. Moreover, you can keep replication running in reverse for a quick rollback.
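
The apply side of that pipeline boils down to translating change events into statements against the target. A simplified sketch, loosely modeled on Debezium's envelope (op "c"/"u"/"d" with before/after row images — real payloads carry much more metadata, and the primary-key handling here is an assumption):

```python
def event_to_sql(table, event):
    """Translate one simplified CDC event into a parameterized SQL statement."""
    op = event["op"]
    if op == "c":  # create (insert): use the after-image columns
        cols = list(event["after"])
        placeholders = ", ".join(["?"] * len(cols))
        sql = f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})"
        return sql, [event["after"][c] for c in cols]
    if op == "u":  # update: match on the primary key from the before-image
        sets = ", ".join(f"{c} = ?" for c in event["after"])
        sql = f"UPDATE {table} SET {sets} WHERE id = ?"
        return sql, [*event["after"].values(), event["before"]["id"]]
    if op == "d":  # delete: match on the primary key from the before-image
        return f"DELETE FROM {table} WHERE id = ?", [event["before"]["id"]]
    raise ValueError(f"unknown op: {op}")

sql, params = event_to_sql("users", {"op": "c", "after": {"id": 1, "full_name": "Ada"}})
print(sql)  # INSERT INTO users (id, full_name) VALUES (?, ?)
```

Applying events in transaction-log order is what keeps the target consistent; a real consumer also has to handle schema changes and idempotent replays.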

This approach works for cross-engine migrations too — moving from MySQL to PostgreSQL, or from a monolithic database to sharded instances. The application never experiences downtime because the switch is a DNS or connection string change, and the new database already has all the data.

Practical Checklist for Safe Migrations

Before running any migration in production, verify these points:

  • Can the current application code work with the new schema? If not, deploy the code change first.
  • Does the migration acquire locks? Use pg_locks in a test environment to verify.
  • How long does it take on production-sized data? Test with a copy of production.
  • Is there a safe rollback? Document it before running the migration.
  • Are there long-running transactions that could block the migration? Check pg_stat_activity.
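
The last check can be scripted as a preflight gate. The sketch below evaluates rows you would fetch from pg_stat_activity; the five-minute threshold and the (pid, xact_start, state) row shape are assumptions for illustration:

```python
from datetime import datetime, timedelta, timezone

MAX_XACT_AGE = timedelta(minutes=5)  # illustrative threshold

def blocking_transactions(activity_rows, now=None):
    """Return (pid, age) for transactions old enough to block a
    migration's lock request. Rows mimic pg_stat_activity records."""
    now = now or datetime.now(timezone.utc)
    return [
        (pid, now - xact_start)
        for pid, xact_start, state in activity_rows
        if xact_start is not None        # session has an open transaction
        and state != "idle"              # "idle in transaction" still counts
        and now - xact_start > MAX_XACT_AGE
    ]

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    (101, now - timedelta(minutes=30), "active"),   # would block the migration
    (102, now - timedelta(seconds=10), "active"),   # fine
    (103, None, "idle"),                            # no open transaction
]
offenders = blocking_transactions(rows, now=now)
print(offenders)  # [(101, datetime.timedelta(seconds=1800))]
```

If the gate returns offenders, wait or terminate them before running DDL; otherwise the migration's lock request queues behind them and blocks everything queued after it.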
[Figure: Always test migrations against production-sized data before running them live]

Zero-downtime database migrations require discipline, but they follow repeatable patterns. The expand-contract approach handles 90% of schema changes safely. For the remaining 10%, blue-green migrations with CDC provide a path forward. The key insight is that every migration is a series of backwards-compatible steps, never a single breaking change.
