Implement zero-downtime deployments using a blue/green strategy, with health check gates, a gradual traffic shift, and automatic rollback on failure.
## Task
Zero-downtime deployment with blue/green traffic switching and auto-rollback.
## Requirements
- Platform: AWS ECS, Kubernetes, or generic Docker hosts
- Load balancer: ALB, nginx, or Traefik
- Health checks: HTTP + deep check (DB connectivity, dependencies)
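
A deep check of the kind required above might be sketched as follows. This is a minimal illustration, not a prescribed implementation: `build_health_report` and the probe callables are hypothetical names, and the `status` plus per-dependency `checks` payload is one possible shape. Real probes would open a DB connection, ping the cache, and so on.

```python
from typing import Callable, Dict


def build_health_report(probes: Dict[str, Callable[[], bool]], version: str) -> dict:
    """Run every dependency probe and aggregate the results into one report.

    `probes` maps a dependency name to a hypothetical callable that returns
    True when that dependency is reachable.
    """
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = "ok" if probe() else "fail"
        except Exception:
            # A probe that raises is treated as a failed dependency.
            results[name] = "fail"
    # Overall status is "ok" only when every dependency check passed.
    status = "ok" if all(r == "ok" for r in results.values()) else "fail"
    return {"status": status, "checks": {**results, "version": version}}
```

An HTTP handler would typically return this report with a 200 status when `status` is `"ok"` and a 503 otherwise, so the load balancer can act on the status code alone.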
## Deployment Flow
```
1. DEPLOY GREEN: Start new version alongside old (blue)
2. HEALTH CHECK: Wait for green to pass health checks (30s timeout)
3. SMOKE TEST: Run automated tests against green (internal only)
4. SHIFT 10%: Route 10% of traffic to green (canary)
5. MONITOR: Watch error rate for 2 minutes
6. SHIFT 100%: If healthy, route all traffic to green
7. DRAIN BLUE: Wait for in-flight requests (30s), stop blue
8. ROLLBACK: If any step fails, route 100% back to blue
Health check endpoints:
GET /health → { status: "ok", checks: { db: "ok", redis: "ok", version: "1.2.3" } }
GET /health/ready → 200 (after warmup complete)
GET /health/live → 200 (process is alive)
```
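
The flow above can be sketched as a small orchestrator. This is an illustrative outline under stated assumptions: the `Hooks` fields are hypothetical callables that the caller wires to the actual platform (ECS, Kubernetes, nginx, and so on), and the timeouts from the flow (30s health check, 2 minute monitoring window) are assumed to live inside those hooks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hooks:
    """Platform-specific operations injected by the caller (all hypothetical)."""
    start_green: Callable[[], None]
    green_healthy: Callable[[], bool]        # step 2, waits up to its own timeout
    smoke_test: Callable[[], bool]           # step 3
    set_green_weight: Callable[[int], None]  # percent of traffic routed to green
    canary_healthy: Callable[[], bool]       # step 5, watches metrics for a window
    drain_and_stop_blue: Callable[[], None]  # step 7
    stop_green: Callable[[], None]


def deploy(hooks: Hooks) -> bool:
    """Run the blue/green flow. Returns True on success, False after a rollback."""
    try:
        hooks.start_green()
        if not hooks.green_healthy():
            raise RuntimeError("green failed health checks")
        if not hooks.smoke_test():
            raise RuntimeError("smoke tests failed")
        hooks.set_green_weight(10)
        if not hooks.canary_healthy():
            raise RuntimeError("canary unhealthy")
        hooks.set_green_weight(100)
    except Exception:
        # Rollback: send all traffic back to blue, then remove green.
        hooks.set_green_weight(0)
        hooks.stop_green()
        return False
    # Blue is drained only after green holds 100% of traffic. In this sketch a
    # failure here is not rolled back, because blue may already be partly stopped.
    hooks.drain_and_stop_blue()
    return True
```

Keeping blue running until the final step is what makes the rollback cheap: reverting is a weight change on the load balancer rather than a redeploy.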
## Implementation Notes
1. The deep health check (`/health`) must verify ALL dependencies (DB, cache, external APIs)
2. Separate liveness (is process alive?) from readiness (can it serve traffic?)
3. Warm-up period: pre-load caches, establish connection pools before ready
4. Connection draining: configure the LB to stop sending new requests to blue, then wait for in-flight requests to finish
5. Database migrations: must be backward-compatible (run BEFORE deploy, not during)
6. Feature flags: decouple deploy from release — deploy code, enable feature separately
7. Metrics to watch during canary: error rate, P95 latency, 5xx count
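
The canary metrics in the last note could feed a simple pass/fail gate. The sketch below is one possible shape: `CanaryStats`, `canary_passes`, and every threshold (1% error rate, 1.5x the blue P95 latency, a 100-request minimum) are assumptions chosen for illustration and should be tuned per service.

```python
from dataclasses import dataclass


@dataclass
class CanaryStats:
    """Metrics collected over the monitoring window (hypothetical structure)."""
    requests: int
    errors_5xx: int
    p95_latency_ms: float


def canary_passes(
    green: CanaryStats,
    blue: CanaryStats,
    max_error_rate: float = 0.01,    # assumed threshold: 1% of requests
    max_latency_ratio: float = 1.5,  # assumed threshold: 1.5x blue's P95
    min_requests: int = 100,         # assumed minimum sample size
) -> bool:
    """Return True when green looks healthy enough to take all traffic."""
    if green.requests < min_requests:
        # Too little traffic to judge; fail closed rather than promote blindly.
        return False
    if green.errors_5xx / green.requests > max_error_rate:
        return False
    if blue.p95_latency_ms > 0 and (
        green.p95_latency_ms > blue.p95_latency_ms * max_latency_ratio
    ):
        return False
    return True
```

Comparing green against blue over the same window, rather than against a fixed latency number, keeps the gate meaningful when overall load shifts during the deployment.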