Runbook: Portfolio App Deployment Failure

Quick Reference


Scenario	New deployment failed or introduced breaking changes
Severity	High (SEV-2) — Service potentially broken or unavailable
MTTR Target	5 minutes (rollback execution time)
On-Call	Yes — Notify immediately via Slack + PagerDuty
Escalation	VP Engineering if unable to rollback within 5 minutes

Overview

This runbook provides fast rollback procedures when a deployment fails or introduces critical bugs. The goal is to restore service within 5 minutes using either Vercel UI promotion or Git revert.

When to use this runbook:

Deployment shows "Failed" status in Vercel dashboard
Health endpoint returns 500 immediately after deployment
Build logs show compilation/type errors
All routes return 500 errors after deployment
Users report complete site breakage

When NOT to use this runbook:

Gradual degradation (use service degradation runbook)
Performance issues without errors (use performance troubleshooting)
Partial feature unavailability (use service degradation runbook)

Failure Indicators

How to Detect Deployment Failure

1. Vercel Dashboard Indicators

Build Failure (pre-deployment):

Deployment status: ❌ "Failed" (red X icon)
Build logs contain error messages
No production URL generated
Previous deployment still live (good — no user impact yet)

Runtime Failure (post-deployment):

Deployment status: ✅ "Ready" (green checkmark) BUT
Health endpoint returns 500
Routes return 500 errors
Error logs spike immediately after deployment

2. Automated Alerts

Health check monitoring: Returns 500 status
Error rate alert: >50% of requests return 5xx
Uptime monitor: Site unresponsive or returns errors

3. User Reports

"Your site is completely broken"
"I'm getting an error page on every link"
"Portfolio app shows 'Something went wrong'"

4. Manual Health Check

# Test health endpoint
curl -s https://portfolio-app.vercel.app/api/health

# If returns {"status":"unhealthy","error":"..."}
# OR if curl fails with connection error
# → Deployment failure confirmed

Response Procedure

Phase 1: Triage (1 minute)

Objective: Confirm deployment failure and identify last known good deployment.

Step 1: Check Vercel Deployments Page

Go to Vercel Deployments
Identify most recent deployment
Check status:
- ❌ Red X = Build failed (good — not deployed yet)
- ✅ Green checkmark = Build succeeded (check if runtime error)

Step 2: Verify Health Endpoint

# Test production health
curl -s https://portfolio-app.vercel.app/api/health | jq '.'

# Possible outcomes:
# 1. {"status":"healthy"} → False alarm, deployment is fine
# 2. {"status":"unhealthy","error":"..."} → Confirmed runtime failure
# 3. Connection timeout/error → Confirmed deployment failure

Decision tree:

Health Check Result	Action
Returns `status: "healthy"`	STOP. Deployment is fine. Investigate other cause.
Returns `status: "unhealthy"` (500)	Continue to rollback (runtime failure)
Connection timeout or DNS error	Continue to rollback (deployment completely broken)
Returns `status: "degraded"` (503)	Use service degradation runbook instead

Step 3: Identify Last Known Good Deployment

In Vercel Deployments page:

Scroll down to find most recent deployment with green checkmark ✅
Verify it's truly good: Click deployment → Click "Visit" → Test health endpoint
Note the commit SHA and deployment URL

# Example: Verify previous deployment is healthy
curl -s https://portfolio-app-<unique-id>.vercel.app/api/health | jq '.status'

# Expected: "healthy"

Step 4: Assess Failure Type

Build Failure (preferred — no user impact):

Build logs show TypeScript errors, lint failures, or missing dependencies
Deployment never went live
Previous deployment still serving traffic
Impact: None (broken code never deployed)

Runtime Failure (user-impacting):

Build succeeded but application crashes at startup
Health endpoint returns 500 or times out
Routes return "Application Error" page
Impact: High (users see broken site)

Phase 2: Immediate Containment (1 minute)

Objective: Prevent further deployments and notify team.

Step 1: Halt Deployments

Post to deployment channel immediately:

🛑 DEPLOYMENT HOLD

Reason: Deployment failure detected (INC-20260126-002)
Affected: portfolio-app production
Status: ROLLING BACK
ETA: 5 minutes

DO NOT merge or push to main until all-clear posted.

Step 2: Notify Incident Channel

Post to Slack #incidents:

🔴 INCIDENT: Portfolio App Deployment Failure

Incident ID: INC-20260126-002
Severity: SEV-2 (High)
Started: 2026-01-26 15:45 UTC
Impact: Complete service unavailability
Assigned: @oncall-engineer

Action: Executing rollback to last known good deployment
MTTR Target: 5 minutes
Next update: 2 minutes

Phase 3: Execute Rollback (2–3 minutes)

Objective: Restore service by promoting last known good deployment or reverting code.

Two rollback options available (choose fastest):

Option 1: Vercel UI Rollback (FASTEST — Recommended)

When to use: Runtime failure; previous deployment still exists in Vercel

Steps:

Go to Vercel Deployments
Find last known good deployment (verified in Phase 1, Step 3)
Click on that deployment row
Click "Promote to Production" button (top right)
Confirm: "Yes, promote this deployment"
Wait for promotion to complete (~30 seconds)

Verify immediately:

curl -s https://portfolio-app.vercel.app/api/health | jq '.status'
# Expected: "healthy"

MTTR: ~2 minutes (navigate UI + promote + verify)

Pros:

Fastest method (no code changes)
Instant rollback (30 seconds)
No Git history pollution

Cons:

Requires Vercel dashboard access
Doesn't fix underlying code issue (still needs separate fix)

Option 2: Git Revert Rollback (Fallback)

When to use:

Vercel UI rollback unavailable
Need to permanently revert broken commit from Git history
Build failure (no deployment to promote)

Steps:

# 1. Identify broken commit
git log --oneline | head -5

# Example output:
# abc1234 (HEAD -> main) feat: add broken feature  ← This is broken
# def5678 docs: update README                     ← Last known good
# ghi9012 feat: performance improvements

# 2. Revert the broken commit
git revert abc1234 --no-edit

# This creates a new commit that undoes abc1234

# 3. Push revert commit
git push origin main

# 4. Wait for Vercel auto-deploy (~2 minutes)
# Monitor deployment: https://vercel.com/bryce-seefieldts-projects/portfolio-app/deployments

# 5. Verify health check
curl -s https://portfolio-app.vercel.app/api/health | jq '.status'
# Expected: "healthy"

MTTR: ~3–4 minutes (revert + push + Vercel deploy + verify)

Pros:

Fixes code permanently (revert commit stays in history)
Works even if Vercel UI unavailable
Clear audit trail in Git

Cons:

Slower than Vercel UI promotion
Requires Git command line access
Pollutes Git history with revert commits

Phase 4: Verification (1–2 minutes)

Objective: Confirm rollback was successful and service is fully restored.

Verification Checklist

Health check returns 200 (healthy):

curl -s https://portfolio-app.vercel.app/api/health | jq '.status'
# Expected: "healthy"

Vercel Deployments shows latest as "Ready":
- Green checkmark ✅ on most recent deployment
- No error logs in function logs

Homepage accessible (no 500s):

curl -I https://portfolio-app.vercel.app/ | grep HTTP
# Expected: HTTP/2 200

Projects page loads:

curl -I https://portfolio-app.vercel.app/projects | grep HTTP
# Expected: HTTP/2 200

No new errors in logs (last 2 minutes):
- Check Vercel Dashboard → Functions → Logs
- Filter: Last 2 minutes
- Search: "error" (should be empty or only historical)

All-Clear Notification

Once all checks pass, post to incident and deployment channels:

✅ RESOLVED: Portfolio App Deployment Failure

Incident ID: INC-20260126-002
Resolution Time: 4 minutes (MTTR target: 5 min ✓)
Rollback Method: Vercel UI promotion to deployment <deployment-id>
Root Cause: [TBD — pending investigation]

Status: All routes operational, health check returns 200
Impact: 4 minutes complete service unavailability
Deployments: UNFROZEN — normal deployments can resume

Next Steps: Root cause investigation and fix (async)
Postmortem: Scheduled for 2026-01-27 10:00 UTC

Post-Failure Investigation (Async, within 2 hours)

Objective: Understand what went wrong and prevent recurrence.

Step 1: Examine Build Logs

For build failures:

Go to Vercel Deployments
Click failed deployment
Click "Building" tab → View build logs
Search for error keywords: error, Error:, failed, FAILED

Common error patterns:

Error Message	Root Cause	Prevention
`TS2304: Cannot find name 'X'`	TypeScript type error	Add pre-commit `pnpm typecheck`
`Module not found: Can't resolve 'X'`	Missing dependency	Run `pnpm install` before commit
`ESLint: X errors, Y warnings`	Linting failure	Add pre-commit `pnpm lint`
`Out of memory`	Build too large or infinite loop	Investigate bundle size or build script

Step 2: Examine Broken Commit

# View files changed in broken commit
git show <broken-commit-sha> --stat

# Review actual code changes
git show <broken-commit-sha>

# Identify suspicious changes

Root cause categories:

Code Issue: Syntax error, missing import, logic error causing crash
Dependency Issue: Missing or conflicting package, version incompatibility
Environment Issue: Missing environment variable, changed config
Tooling Issue: TypeScript version, Next.js version, build tool issue

Step 3: Reproduce Locally

# Check out broken commit
git checkout <broken-commit-sha>

# Install dependencies
pnpm install

# Try to build
pnpm build

# Expected: Same error as Vercel build

# If build succeeds locally but fails in Vercel:
# → Environment difference (Node version, env vars, etc.)

Implement Fix & Redeploy (Async)

Step 1: Fix the Issue

Create a fix branch:

# Return to main (which now has rollback)
git checkout main
git pull

# Create fix branch
git checkout -b fix/deployment-failure-<issue-number>

# Implement fix (e.g., fix TypeScript error)
# Edit files...

# Test locally
pnpm verify

# Expected: All checks pass

Step 2: Commit and Push

# Commit fix
git add .
git commit -m "fix: resolve deployment failure - <brief description>

Root cause: <explanation>
Fix: <what was changed>

Refs: INC-20260126-002"

# Push fix branch
git push origin fix/deployment-failure-<issue-number>

Step 3: Create PR and Get Review

# Create PR via GitHub CLI
gh pr create \
  --title "fix: resolve deployment failure - <description>" \
  --body "## Problem
Deployment failed due to <root cause>.

## Solution
<fix description>

## Testing
- [x] pnpm verify passes
- [x] Build succeeds locally
- [x] Health check tested

## Incident Reference
Refs: INC-20260126-002

## Reviewers
@team-lead" \
  --label "bug,hotfix,incident-followup"

PR requirements:

All CI checks must pass (pnpm lint, typecheck, build)
Code review approved by team lead
Test evidence included in PR description

Step 4: Merge and Deploy

# Once approved, merge PR
gh pr merge --squash --delete-branch

# Vercel auto-deploys on merge to main
# Monitor deployment progress

Step 5: Verify Fix Deployed

# Wait ~2 minutes for deployment

# Check health
curl -s https://portfolio-app.vercel.app/api/health | jq '.'

# Verify fix is live (check commit SHA)
curl -s https://portfolio-app.vercel.app/api/health | jq '.commit'

# Expected: Shows new commit SHA (not rollback commit)

Postmortem (Within 24 hours)

Template: docs/_meta/templates/template-postmortem.md

Required preventive actions examples:

Action: Add pnpm typecheck to required CI checks (if TypeScript error)
Action: Add pre-commit hook to run pnpm lint (if linting error)
Action: Add automated health check after deployment (Vercel Deployment Checks)
Action: Add build test step before merge (GitHub Actions)

Common Deployment Failures & Quick Fixes

Failure 1: TypeScript Type Error

Build log:

Error: src/app/page.tsx:25:10 - error TS2304: Cannot find name 'Project'.

Cause: Missing type import or typo in type name
Prevention: Add pnpm typecheck to CI required checks
Quick fix: Add missing import, fix typo, redeploy

Failure 2: Missing Dependency

Build log:

Error: Module not found: Can't resolve '@/lib/new-module'

Cause: Imported module not installed in package.json
Prevention: Always run pnpm install <package> before using
Quick fix: pnpm install <package>, commit lockfile, redeploy

Failure 3: Environment Variable Missing

Runtime error in logs:

{
  "level": "error",
  "message": "Config error",
  "context": { "error": "NEXT_PUBLIC_NEW_VAR is not defined" }
}

Cause: New env var referenced in code but not set in Vercel
Prevention: Document env var requirements in PR, verify in deployment checklist
Quick fix: Add env var in Vercel settings, redeploy

Escalation

Duration	Action	Notify
0–5 min	Execute rollback (this runbook)	On-call engineer only
5–15 min	Escalate if rollback fails	VP Engineering via Slack + PagerDuty
>15 min	Full team escalation	All team + CEO (if customer-facing)

Tools & References

Quick Commands

# Health check
curl -s https://portfolio-app.vercel.app/api/health | jq '.'

# Git revert
git revert <commit-sha> --no-edit && git push

# View deployment logs (requires Vercel CLI)
vercel logs <deployment-url>

# Check latest deployment status
vercel ls | head -5

External Links

Last Updated: 2026-01-26
Maintained By: Portfolio Operations Team
Review Schedule: After each deployment failure + quarterly

Quick Reference​

Overview​

Failure Indicators​

How to Detect Deployment Failure​

1. Vercel Dashboard Indicators​

2. Automated Alerts​

3. User Reports​

4. Manual Health Check​

Response Procedure​

Phase 1: Triage (1 minute)​

Step 1: Check Vercel Deployments Page​

Step 2: Verify Health Endpoint​

Step 3: Identify Last Known Good Deployment​

Step 4: Assess Failure Type​

Phase 2: Immediate Containment (1 minute)​

Step 1: Halt Deployments​

Step 2: Notify Incident Channel​

Phase 3: Execute Rollback (2–3 minutes)​

Option 1: Vercel UI Rollback (FASTEST — Recommended)​

Option 2: Git Revert Rollback (Fallback)​

Phase 4: Verification (1–2 minutes)​

Verification Checklist​

All-Clear Notification​

Post-Failure Investigation (Async, within 2 hours)​

Step 1: Examine Build Logs​

Step 2: Examine Broken Commit​

Step 3: Reproduce Locally​

Implement Fix & Redeploy (Async)​

Step 1: Fix the Issue​

Step 2: Commit and Push​

Step 3: Create PR and Get Review​

Step 4: Merge and Deploy​

Step 5: Verify Fix Deployed​

Postmortem (Within 24 hours)​

Common Deployment Failures & Quick Fixes​

Failure 1: TypeScript Type Error​

Failure 2: Missing Dependency​

Failure 3: Environment Variable Missing​

Escalation​

Tools & References​

Quick Commands​

External Links​

Related Runbooks​