Production Infrastructure

Zero-Downtime Deployments: Blue-Green, Canary, and Rolling Updates

Deployment strategies that avoid 502/503 errors during releases — blue-green switching, canary traffic splitting, rolling updates with readiness probes.

Why Deployments Cause Downtime

A naive deployment — stop old version, start new version — creates a window where no instances are available to serve traffic. Even a 10-second gap produces 502/503 errors for users and failed health checks in monitoring.

Four root causes of deployment-induced downtime:

  • Connection draining gap — load balancer sends requests to a terminating instance
  • Cold start latency — new instance not warm before receiving traffic
  • Database migrations — schema changes that break the running version
  • Incompatible API versions — old and new code simultaneously serving requests

Each deployment strategy addresses these differently.
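
The connection-draining gap in particular can be closed in application code: on SIGTERM, fail the readiness probe first, wait for the load balancer to notice, then let in-flight requests finish before exiting. A minimal Python sketch of that shutdown sequence (the App class, field names, and timings are illustrative, not tied to any specific framework):

```python
import signal
import time

class App:
    """Tracks readiness and in-flight requests for graceful shutdown."""

    def __init__(self):
        self.ready = True      # what the /readyz probe reports
        self.in_flight = 0     # requests currently executing

    def begin_shutdown(self, drain_seconds=5.0, poll=0.05):
        # 1. Fail readiness so the LB stops sending new traffic
        self.ready = False
        # 2. Wait long enough for the LB to observe the failing probe
        time.sleep(drain_seconds)
        # 3. Let in-flight requests complete before the process exits
        while self.in_flight > 0:
            time.sleep(poll)

app = App()
# Wire SIGTERM (what orchestrators send on termination) to the drain sequence
signal.signal(signal.SIGTERM, lambda signum, frame: app.begin_shutdown())
```

This is the application-side half of the fix; the `preStop` sleep shown in the rolling-update section later is the orchestrator-side half.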

Blue-Green Deployment

Maintain two identical production environments (blue and green). At any time, one is live and the other is idle or staging.

                    ┌─────────────┐
Users → LB ─────→  │  Blue (v1)  │  ← currently LIVE
                    └─────────────┘
                    ┌─────────────┐
            idle →  │ Green (v2)  │  ← deploy new version here
                    └─────────────┘

# After deploy + smoke test:
                    ┌─────────────┐
            idle →  │  Blue (v1)  │  ← keep for instant rollback
                    └─────────────┘
                    ┌─────────────┐
Users → LB ─────→  │ Green (v2)  │  ← now LIVE
                    └─────────────┘

AWS ALB Target Group Switch

# Deploy v2 to green target group (Blue is still live)
aws ecs update-service \
    --cluster prod \
    --service app-green \
    --task-definition app:42

# Wait for green to be healthy
aws ecs wait services-stable --cluster prod --services app-green

# Switch ALB listener to green target group
aws elbv2 modify-listener \
    --listener-arn $LISTENER_ARN \
    --default-actions Type=forward,TargetGroupArn=$GREEN_TG_ARN

# Rollback: switch back to blue in seconds
aws elbv2 modify-listener \
    --listener-arn $LISTENER_ARN \
    --default-actions Type=forward,TargetGroupArn=$BLUE_TG_ARN
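
The listener switch above is the entire cutover, which is why rollback is instant: the control flow is just a two-slot state machine. A toy model of it in Python (no AWS calls; names invented for illustration):

```python
class BlueGreen:
    """Toy model of blue-green cutover state; not an AWS client."""

    def __init__(self):
        self.versions = {"blue": "v1", "green": None}
        self.live = "blue"

    @property
    def idle(self):
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version):
        # A new version always lands on the idle environment
        self.versions[self.idle] = version

    def switch(self):
        # Cutover: idle goes live; the old environment is kept for rollback
        self.live = self.idle

bg = BlueGreen()
bg.deploy("v2")      # green now runs v2; blue (v1) still serves traffic
bg.switch()          # green is live
assert bg.versions[bg.live] == "v2"
bg.switch()          # instant rollback: blue (v1) is live again
assert bg.versions[bg.live] == "v1"
```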

Database Considerations

Blue-green is simplest when both versions share the same database schema. If your migration changes the schema in a backward-incompatible way (renaming a column, dropping a field), you cannot safely have both versions running simultaneously.

Use the expand-contract pattern (covered later in this guide) to handle schema changes safely during blue-green deployments.

Advantages: instant rollback, clean cutover, easy smoke testing.

Disadvantages: double infrastructure cost, complex with stateful services.

Canary Deployment

Route a small percentage of traffic to the new version while the rest continues hitting the old version. Monitor error rates and latency before expanding.

Users ──┬──→ (95%)  App v1  ─→
        └──→  (5%)  App v2  ─→ monitor errors

AWS ALB Weighted Target Groups

# Start canary: 5% to v2
aws elbv2 modify-listener --listener-arn $LISTENER_ARN \
    --default-actions 'Type=forward,ForwardConfig={
        TargetGroups=[
            {TargetGroupArn='"$V1_TG_ARN"',Weight=95},
            {TargetGroupArn='"$V2_TG_ARN"',Weight=5}
        ]
    }'

# Expand to 50% after monitoring
# Then 100% to complete
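
The promotion loop behind those weight changes — raise the canary's share only while its error rate stays acceptable, otherwise send all traffic back to v1 — can be sketched as (step sizes and the 1% threshold are illustrative):

```python
def run_canary(error_rate_at, steps=(5, 20, 50, 100), max_error_rate=0.01):
    """Walk through canary weights; return the final v2 weight:
    100 = promoted, 0 = rolled back. error_rate_at(weight) reports the
    observed error rate while that weight is serving traffic."""
    for weight in steps:
        if error_rate_at(weight) > max_error_rate:
            return 0          # roll back: all traffic to v1
    return 100                # promote: all traffic to v2

assert run_canary(lambda w: 0.001) == 100   # healthy canary is promoted
assert run_canary(lambda w: 0.05) == 0      # 5% errors triggers rollback
```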

Kubernetes Canary with Argo Rollouts

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5      # 5% canary
        - pause: {duration: 5m}
        - setWeight: 20
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 10m}
        # Full rollout
      analysis:
        templates:
          - templateName: error-rate
        args:
          - name: service-name
            value: app

Automatic Rollback Triggers

# Argo Rollouts AnalysisTemplate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  metrics:
    - name: error-rate
      interval: 1m
      # Roll back if error rate exceeds 1%
      successCondition: result[0] < 0.01
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status=~"5.."}[1m]))
            /
            sum(rate(http_requests_total[1m]))
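
The template's evaluation semantics are worth spelling out: the query runs every interval, each result is checked against successCondition, and the analysis fails once more than failureLimit measurements have violated it. An approximate model of that loop (a simplification of Argo's actual controller logic):

```python
def analyze(measurements, success=lambda r: r < 0.01, failure_limit=3):
    """Return 'Failed' once more than failure_limit measurements violate
    the success condition; otherwise 'Successful'."""
    failures = 0
    for result in measurements:          # one result per interval
        if not success(result):
            failures += 1
            if failures > failure_limit:
                return "Failed"          # triggers automatic rollback
    return "Successful"

assert analyze([0.001, 0.002, 0.003]) == "Successful"
assert analyze([0.02, 0.02, 0.02, 0.02]) == "Failed"  # 4 failures > limit 3
```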

Rolling Updates

Replace instances one by one (or in small batches), waiting for each batch to pass health checks before proceeding.

Kubernetes Rolling Update Strategy

spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2        # Allow 2 extra pods above desired count
      maxUnavailable: 0  # Never go below desired count
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: app
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8000
            # Pod only added to LB after 2 consecutive successes
            successThreshold: 2
            failureThreshold: 3
            periodSeconds: 5
          lifecycle:
            preStop:
              exec:
                # Give LB time to stop routing before SIGTERM
                command: ["sleep", "5"]

With maxUnavailable: 0, Kubernetes adds new pods and waits for them to pass readiness probes before removing old pods. No traffic gap.
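
That invariant can be checked with a small simulation (illustrative bookkeeping, not the actual kube-controller algorithm): with 10 replicas, surge 2, and unavailable 0, the ready-pod count never drops below 10, while maxUnavailable: 1 lets it dip to 9.

```python
def rolling_update(replicas, max_surge, max_unavailable):
    """Simulate batched pod replacement; return the minimum ready-pod count seen."""
    assert max_surge + max_unavailable > 0  # k8s rejects both set to zero
    old_ready, new_ready = replicas, 0
    min_available = replicas
    while old_ready > 0:
        # Scale up: total pods may exceed the desired count by max_surge
        creating = min(replicas + max_surge - (old_ready + new_ready),
                       replicas - new_ready)
        new_ready += creating  # new pods pass their readiness probes
        # Scale down: ready pods must not drop below replicas - max_unavailable
        removable = max(old_ready + new_ready - (replicas - max_unavailable), 0)
        old_ready -= min(old_ready, removable)
        min_available = min(min_available, old_ready + new_ready)
    new_ready = replicas  # final scale-up of the new set to the full count
    return min_available

assert rolling_update(10, max_surge=2, max_unavailable=0) == 10  # no dip
assert rolling_update(10, max_surge=0, max_unavailable=1) == 9   # dips to 9
```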

Database Migrations: The Expand-Contract Pattern

The expand-contract (aka parallel-change) pattern makes schema changes backward-compatible across deployment boundaries.

Scenario: rename column user_name to username

Phase 1 — Expand (deploy with old + new column)
  Migration: ADD COLUMN username VARCHAR(255);
  Code: writes to both columns, reads from old

Phase 2 — Migrate data (background job or migration)
  UPDATE users SET username = user_name WHERE username IS NULL;

Phase 3 — Switch reads (deploy)
  Code: writes to both columns, reads from new

Phase 4 — Contract (deploy, remove old column)
  Code: writes to new column only
  Migration: DROP COLUMN user_name;

Each phase is a separate deployment. During blue-green or rolling updates, both old and new code versions must work with the schema in each phase.
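
In application code, the phases differ only in which columns are written and read. A condensed sketch of the write and read paths, with rows modeled as plain dicts for illustration:

```python
def write_name(row, value, phase):
    """Dual-write through phases 1-3; new column only after contract."""
    if phase < 4:
        row["user_name"] = value   # old column, dropped in phase 4
    row["username"] = value        # new column, added in phase 1

def read_name(row, phase):
    # Reads switch from the old column to the new one in phase 3
    return row["username"] if phase >= 3 else row["user_name"]

# Phase 1: dual-write, read old
row = {}
write_name(row, "alice", phase=1)
assert read_name(row, phase=1) == "alice"
assert row == {"user_name": "alice", "username": "alice"}

# Phase 4: old column is no longer written
row = {}
write_name(row, "bob", phase=4)
assert row == {"username": "bob"}
```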

Feature Flags for Data Changes

# Use a feature flag to gate reads from new column
from django.conf import settings

def get_username(user):
    if settings.USE_NEW_USERNAME_COLUMN:
        return user.username
    return user.user_name

This lets you flip the read path in production without a code deployment — just update the environment variable and reload the app.
