Why Deployments Cause Downtime
A naive deployment — stop old version, start new version — creates a window where no instances are available to serve traffic. Even a 10-second gap produces 502/503 errors for users and failed health checks in monitoring.
Four root causes of deployment-induced downtime:
- Connection draining gap — load balancer sends requests to a terminating instance
- Cold start latency — new instance not warm before receiving traffic
- Database migrations — schema changes that break the running version
- Incompatible API versions — old and new code simultaneously serving requests
Each deployment strategy addresses these differently.
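Every strategy below also depends on the application shutting down gracefully: on SIGTERM, stop accepting new requests and drain the in-flight ones before exiting. A minimal sketch of that drain logic (the `InFlightTracker` class and its method names are illustrative, not from any framework):

```python
import threading

class InFlightTracker:
    """Counts in-flight requests so shutdown can wait for them to drain."""

    def __init__(self):
        self._count = 0
        self._lock = threading.Lock()
        self._drained = threading.Event()
        self._drained.set()  # no requests in flight yet
        self.accepting = True

    def start_request(self):
        """Returns False if the server is shutting down (caller should 503)."""
        with self._lock:
            if not self.accepting:
                return False
            self._count += 1
            self._drained.clear()
            return True

    def end_request(self):
        with self._lock:
            self._count -= 1
            if self._count == 0:
                self._drained.set()

    def shutdown(self, timeout=30.0):
        """Stop accepting new work, then wait for in-flight requests."""
        with self._lock:
            self.accepting = False
            if self._count == 0:
                self._drained.set()
        return self._drained.wait(timeout)
```

In a real service you would wire `shutdown()` to a SIGTERM handler (via `signal.signal`) so the orchestrator's grace period is actually used for draining rather than cut short.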
Blue-Green Deployment
Maintain two identical production environments (blue and green). At any time, one is live and the other is idle or staging.
                  ┌─────────────┐
Users → LB ─────→ │  Blue (v1)  │  ← currently LIVE
                  └─────────────┘
                  ┌─────────────┐
           idle → │ Green (v2)  │  ← deploy new version here
                  └─────────────┘

# After deploy + smoke test:

                  ┌─────────────┐
           idle → │  Blue (v1)  │  ← keep for instant rollback
                  └─────────────┘
                  ┌─────────────┐
Users → LB ─────→ │ Green (v2)  │  ← now LIVE
                  └─────────────┘
AWS ALB Target Group Switch
# Deploy v2 to the green target group (blue is still live)
aws ecs update-service \
  --cluster prod \
  --service app-green \
  --task-definition app:42

# Wait for green to be healthy
aws ecs wait services-stable --cluster prod --services app-green

# Switch the ALB listener to the green target group
aws elbv2 modify-listener \
  --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=$GREEN_TG_ARN

# Rollback: switch back to blue in seconds
aws elbv2 modify-listener \
  --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=$BLUE_TG_ARN
Database Considerations
Blue-green is simplest when both versions share the same database schema. If your migration changes the schema in a backward-incompatible way (renaming a column, dropping a field), you cannot safely have both versions running simultaneously.
Use the expand-contract pattern (covered later in this guide) to handle schema changes safely during blue-green deployments.
Advantages: instant rollback, clean cutover, easy smoke testing.
Disadvantages: double infrastructure cost, complex with stateful services.
Canary Deployment
Route a small percentage of traffic to the new version while the rest continues hitting the old version. Monitor error rates and latency before expanding.
Users ──┬──→ (95%) App v1
        └──→ (5%)  App v2 ─→ monitor errors
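The split itself is just weighted random selection over backends. A toy sketch of the routing decision (a Python stand-in for the load balancer, not how ALB implements it):

```python
import random

def pick_backend(weights, rng=random.random):
    """Pick a backend name by weight, e.g. {"v1": 95, "v2": 5}."""
    total = sum(weights.values())
    r = rng() * total
    cumulative = 0
    for name, weight in weights.items():
        cumulative += weight
        if r < cumulative:
            return name
    return name  # fallback for floating-point rounding at the boundary
```

Over many requests roughly 5% land on v2, which is exactly what makes a small canary statistically meaningful only after enough traffic has flowed through it.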
AWS ALB Weighted Target Groups
# Start canary: 5% to v2
aws elbv2 modify-listener --listener-arn $LISTENER_ARN \
  --default-actions 'Type=forward,ForwardConfig={
    TargetGroups=[
      {TargetGroupArn='"$V1_TG_ARN"',Weight=95},
      {TargetGroupArn='"$V2_TG_ARN"',Weight=5}
    ]
  }'

# Expand to 50% after monitoring, then to 100% to complete the rollout
Kubernetes Canary with Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5          # 5% canary
        - pause: {duration: 5m}
        - setWeight: 20
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 10m}
        # Full rollout follows the last pause
      analysis:
        templates:
          - templateName: error-rate
        args:
          - name: service-name
            value: app
Automatic Rollback Triggers
# Argo Rollouts AnalysisTemplate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  metrics:
    - name: error-rate
      interval: 1m
      # Roll back if the error rate exceeds 1%
      successCondition: result[0] < 0.01
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status=~"5.."}[1m]))
            /
            sum(rate(http_requests_total[1m]))
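Stripped of Prometheus, the guard is simple arithmetic on request counters. A sketch of the rate check and a simplified reading of `failureLimit` (names are illustrative; consult the Argo Rollouts docs for the exact measurement semantics):

```python
def error_rate(error_count, total_count):
    """Fraction of requests in the interval that returned 5xx."""
    if total_count == 0:
        return 0.0
    return error_count / total_count

def should_rollback(interval_rates, threshold=0.01, failure_limit=3):
    """Simplified failureLimit semantics: a measurement fails when the
    rate breaches the threshold; roll back once failures exceed the limit."""
    failures = sum(1 for r in interval_rates if r >= threshold)
    return failures > failure_limit
```

The `failureLimit: 3` buffer exists so one noisy scrape interval does not abort an otherwise healthy canary.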
Rolling Updates
Replace instances one by one (or in small batches), waiting for each batch to pass health checks before proceeding.
Kubernetes Rolling Update Strategy
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2        # Allow 2 extra pods above the desired count
      maxUnavailable: 0  # Never go below the desired count
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: app
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8000
            # Pod only added to the LB after 2 consecutive successes
            successThreshold: 2
            failureThreshold: 3
            periodSeconds: 5
          lifecycle:
            preStop:
              exec:
                # Give the LB time to stop routing before SIGTERM
                command: ["sleep", "5"]
With maxUnavailable: 0, Kubernetes adds new pods and waits for them to pass readiness probes before removing old pods. No traffic gap.
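The surge arithmetic can be sanity-checked with a small simulation. This sketch assumes `maxUnavailable: 0` and that every new pod passes its readiness probe before old pods are retired (`rolling_update` is an illustrative name, not a Kubernetes API):

```python
def rolling_update(desired, max_surge):
    """Simulate a rolling update with maxUnavailable: 0.

    Each iteration surges up by max_surge new pods, waits for readiness,
    then retires the same number of old pods. Returns the sequence of
    (old_ready, new_ready) states."""
    old, new = desired, 0
    states = [(old, new)]
    while old > 0:
        batch = min(old, max_surge)
        new += batch                 # surge: briefly above the desired count
        states.append((old, new))
        old -= batch                 # retire old pods once new ones are ready
        states.append((old, new))
    return states
```

With 10 replicas and `maxSurge: 2` the cluster briefly runs 12 ready pods but never fewer than 10, so there is no traffic gap at any point in the rollout.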
Database Migrations: The Expand-Contract Pattern
The expand-contract (aka parallel-change) pattern makes schema changes backward-compatible across deployment boundaries.
Scenario: rename column user_name to username

Phase 1 — Expand (deploy with both columns)
  Migration: ALTER TABLE users ADD COLUMN username VARCHAR(255);
  Code: writes to both columns, reads from the old one

Phase 2 — Migrate data (background job or migration)
  UPDATE users SET username = user_name WHERE username IS NULL;

Phase 3 — Switch reads (deploy)
  Code: writes to both columns, reads from the new one

Phase 4 — Contract (deploy, then remove the old column)
  Code: writes to the new column only
  Migration: ALTER TABLE users DROP COLUMN user_name;
Each phase is a separate deployment. During blue-green or rolling updates, both old and new code versions must work with the schema in each phase.
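The per-phase read/write rules can be captured in a few lines. A sketch using a dict as a stand-in for the database row (`write_name` and `read_name` are illustrative helpers, not from any ORM):

```python
def write_name(row, value, phase):
    """Dual-write rules per expand-contract phase (1..4)."""
    if phase <= 3:
        row["user_name"] = value   # old column; dropped in phase 4
    row["username"] = value        # new column; added in phase 1

def read_name(row, phase):
    """Reads move from the old column to the new one in phase 3."""
    return row["username"] if phase >= 3 else row["user_name"]
```

Because writes cover both columns through phase 3, a version reading the old column and a version reading the new one can serve traffic side by side during any single deployment.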
Feature Flags for Data Changes
# Use a feature flag to gate reads from the new column
from django.conf import settings

def get_username(user):
    if settings.USE_NEW_USERNAME_COLUMN:
        return user.username
    return user.user_name
This lets you flip the read path in production without a code deployment — just update the environment variable and reload the app.