Incident Response Handbook

Purpose

Provide a consolidated incident response handbook that supports on-call responders with severity guidance, quick selectors, and operational patterns. This handbook complements the runbook catalog at /docs/50-operations/runbooks.

Scope

In scope

  • quick selectors to choose the right runbook
  • severity-based response guidance
  • operational patterns for detection, recovery, and verification
  • tools and utilities used during response
  • runbook improvement standards

Out of scope

  • step-by-step procedures (see runbooks)
  • architecture rationale (see ADRs)

Quick Selector: Find Your Runbook

Match your scenario to the appropriate runbook:

| I'm seeing... | Use this runbook |
| --- | --- |
| ❌ Deployment shows "Failed" in Vercel | Deployment Failure |
| ⚠️ Health endpoint returns 503 | Service Degradation |
| 🔴 All routes return 500 | Deployment Failure |
| 🐌 Pages load slowly (>3s) but no errors | Performance Troubleshooting |
| 📦 Bundle size too large (>30MB) | Performance Troubleshooting |
| ❓ Unclear incident, need framework | General Incident Response |
| 🔍 Want to understand monitoring setup | Observability & Health Checks |
| ⚡ Want to improve performance proactively | Performance Optimization |
| 🔐 CVE alert or dependency vulnerability | Dependency Vulnerability Response |
| 🚨 Suspected secret leak in repo | Secrets Incident Response |
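
Several rows of the selector depend only on observable HTTP behaviour, so the first triage decision can be sketched as a small shell function. This is a minimal sketch, not an official tool: the function name `select_runbook` is hypothetical, a single status code stands in for "all routes", and the paths are taken from the runbook index in this handbook.

```shell
# Hypothetical helper: suggest a runbook from an observed HTTP status code
# and, optionally, a response time in seconds.
# $1 = HTTP status code, $2 = response time in seconds (defaults to 0).
select_runbook() {
  status="$1"
  seconds="${2:-0}"
  base="docs/50-operations/runbooks"
  case "$status" in
    503) echo "$base/rbk-portfolio-service-degradation.md" ;;
    500) echo "$base/rbk-portfolio-deployment-failure.md" ;;
    200)
      # No errors: only suggest a runbook when the page is slow (>3s).
      if awk -v t="$seconds" 'BEGIN { exit !(t + 0 > 3) }'; then
        echo "$base/rbk-portfolio-performance-troubleshooting.md"
      else
        return 1
      fi
      ;;
    # Anything else is treated as unclear and falls back to the general framework.
    *) echo "$base/rbk-portfolio-incident-response.md" ;;
  esac
}
```

For example, `select_runbook 200 4.2` suggests the performance troubleshooting runbook, while `select_runbook 200 0.8` prints nothing and returns a non-zero status because no runbook applies.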

Severity-Based Quick Reference

Critical Incident (SEV-1) — Immediate Response

Symptoms: Complete service outage, all users affected, all routes return 500

Quick Steps:

  1. Page on-call engineer + VP Engineering (Slack + SMS + phone)
  2. Create incident channel: #incident-INC-YYYYMMDD-NNN
  3. Execute runbook: Deployment Failure if recent deployment, otherwise General Incident Response
  4. Post updates every 5 minutes
  5. Announce all-clear once the incident is resolved
  6. Schedule postmortem within 24 hours

MTTR Target: 15 minutes

If Secrets Incident: Execute Secrets Incident Response immediately; MTTR ≤5 min for critical secrets
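
The channel naming pattern in step 2 can be generated rather than typed by hand under pressure. The sketch below assumes that NNN is a zero-padded sequence number for the day; the handbook does not define it further, and the function name `incident_channel` is hypothetical.

```shell
# Hypothetical helper: build the incident channel name from step 2.
# $1 = date as YYYYMMDD
# $2 = sequence number for that day, without leading zeros (assumed meaning of NNN)
incident_channel() {
  printf '#incident-INC-%s-%03d\n' "$1" "$2"
}
```

A typical call would be `incident_channel "$(date -u +%Y%m%d)" 1`; using UTC for the date is also an assumption. Pass the sequence number without leading zeros, since `printf` reads a leading zero as an octal prefix.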

High Severity (SEV-2) — Urgent Response

Symptoms: Significant user impact, core features broken, partial outage

Quick Steps:

  1. Notify on-call engineer via Slack + PagerDuty
  2. Execute runbook: Service Degradation or Deployment Failure
  3. Target resolution: less than 1 hour
  4. Post updates every 10 minutes
  5. Schedule postmortem within 48 hours

MTTR Target: 1 hour

If Dependency CVE (High): Execute Dependency Vulnerability Response; MTTR 48 hours

Medium Severity (SEV-3) — Normal Response

Symptoms: Minor user impact, slow performance, non-critical features unavailable

Quick Steps:

  1. Notify team lead via Slack
  2. Create GitHub issue to track
  3. Execute runbook: Service Degradation or Performance Troubleshooting
  4. Investigate during business hours
  5. No formal postmortem (document learnings in issue)

MTTR Target: 4 hours

Low Severity (SEV-4) — Low Priority

Symptoms: Cosmetic issues, documentation errors, non-user-facing problems

Quick Steps:

  1. Create GitHub issue with appropriate label
  2. Fix during next sprint
  3. No incident response required

MTTR Target: 24 hours or next sprint
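
The targets above can be kept in one lookup so that scripts and alert tooling stay consistent with this handbook. This is a sketch only: the function name `severity_targets` and the output format are hypothetical, and "24 hours or next sprint" is represented as 1440 minutes.

```shell
# Hypothetical helper: print "<MTTR minutes> <update interval minutes>" for a
# severity level, using the targets listed in this section.
# "-" means the handbook defines no update cadence for that level.
severity_targets() {
  case "$1" in
    SEV-1) echo "15 5" ;;
    SEV-2) echo "60 10" ;;
    SEV-3) echo "240 -" ;;
    SEV-4) echo "1440 -" ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```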


Common Operational Patterns

Error Detection Patterns

| Pattern | Where to Look | What to Search For |
| --- | --- | --- |
| Deployment errors | Vercel Deployments → Build logs | error, Error:, failed, FAILED |
| Runtime errors | Vercel Functions → Logs | "level":"error", 500, timeout |
| Performance issues | Vercel Analytics | Response time >3s, LCP >2.5s |
| Data issues | Health endpoint | projectCount: 0, status: "degraded" |
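
When a log has been saved to a local file, the build-log search terms can be applied with `grep` instead of scanning by eye. The sketch below assumes the log is already on disk; the function name `count_error_lines` is hypothetical.

```shell
# Hypothetical helper: count the lines in a saved log file that match the
# build-log search terms (error, Error:, failed, FAILED).
# $1 = path to the log file
count_error_lines() {
  grep -c -E 'error|Error:|failed|FAILED' "$1"
}
```

Note that `grep -c` prints `0` and returns a non-zero exit status when nothing matches, so callers should read the printed count rather than rely on the exit status. The `error` term also matches structured entries such as `"level":"error"`.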

Recovery Patterns

| Issue Category | Recovery Method | Example |
| --- | --- | --- |
| Deployment failure | Vercel UI rollback or Git revert | Promote previous deployment |
| Data corruption | Restore from backup commit | git show <commit>:file.yml > file.yml |
| Config issue | Revert environment variable | Vercel Settings → Env Vars → Restore |
| Resource exhaustion | Clear cache or scale up | Vercel Cache → Clear All |
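
The "restore from backup commit" pattern redirects `git show` straight into the target file, which leaves that file empty if the commit or path is mistyped. A slightly more defensive variant is sketched below; the function name `restore_from_commit` is hypothetical, and it assumes the command is run from the repository root.

```shell
# Hypothetical helper: restore one file from a known-good commit.
# $1 = commit, $2 = file path relative to the repository root.
# Run from the repository root so the same path works for both git and the shell.
restore_from_commit() {
  tmp="$(mktemp)" || return 1
  # Write to a temporary file first so a bad commit or path
  # does not truncate the working copy.
  if git show "$1:$2" > "$tmp"; then
    # Copy the content rather than moving the file, so the target keeps its mode.
    cat "$tmp" > "$2"
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

The restored file still has to be reviewed, committed, and pushed as usual; this sketch only changes the working copy.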

Verification Patterns

After any fix, always verify:

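# 1. Health check returns HTTP 200 and reports its status
curl -s -o /dev/null -w '%{http_code}\n' https://portfolio-app.vercel.app/api/health
curl -s https://portfolio-app.vercel.app/api/health | jq '.status'

# 2. Routes are accessible
curl -sI https://portfolio-app.vercel.app/ | grep HTTP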

# 3. No errors in recent logs
# Check Vercel Dashboard → Functions → Logs (last 5 minutes)

# 4. Response times normal
time curl -s https://portfolio-app.vercel.app/projects > /dev/null
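
The response-time check can be turned into a pass/fail result instead of a number read by eye. The sketch below uses curl's `time_total` write-out variable and the 3-second figure used elsewhere in this handbook as the default threshold; both function names are hypothetical.

```shell
# Hypothetical helper: succeed when a measured time in seconds is within the
# threshold. $1 = measured seconds, $2 = threshold in seconds (defaults to 3).
within_threshold() {
  awk -v t="$1" -v max="${2:-3}" 'BEGIN { exit !(t + 0 <= max + 0) }'
}

# Hypothetical wrapper: time one request with curl and apply the threshold.
# $1 = URL, $2 = threshold in seconds (defaults to 3).
check_response_time() {
  elapsed="$(curl -s -o /dev/null -w '%{time_total}' "$1")" || return 1
  echo "time_total=${elapsed}s"
  within_threshold "$elapsed" "${2:-3}"
}
```

For example, `check_response_time https://portfolio-app.vercel.app/projects` prints the measured time and returns non-zero when the request took longer than 3 seconds. A single request is a rough signal only, so repeat it a few times before drawing conclusions.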

Tools & Utilities

Quick Commands

# Health check
curl -s https://portfolio-app.vercel.app/api/health | jq '.'

# Test route
curl -sI https://portfolio-app.vercel.app/projects | grep HTTP

# View recent deployments (requires Vercel CLI)
vercel ls | head -10

# View logs (requires Vercel CLI)
vercel logs --follow

# Git rollback
git revert <commit-sha> --no-edit && git push

External Dashboards

Monitoring Integrations

  • UptimeRobot: (to be configured)
  • PagerDuty: (to be configured)
  • Slack Alerts: #incidents, #deployments, #alerts

Runbook Improvement & Feedback

Review Schedule

  • After each use: Document any deviations from procedure
  • After incidents: Update with new learnings from postmortem
  • Quarterly: Full review of all runbooks for accuracy and completeness
  • After platform changes: Update commands/screenshots if Vercel UI changes

Submitting Improvements

If you use a runbook and encounter issues:

  • Unclear steps: Create GitHub issue to clarify
  • Missing steps: Add to runbook and submit PR
  • Incorrect commands: Test and correct in PR
  • MTTR targets not achievable: Reassess and update target

Template for runbook improvements:

gh issue create \
--title "Runbook improvement: [runbook-name]" \
--body "Issue found: [description]

Suggested improvement: [what to change]

Context: Used during INC-YYYYMMDD-NNN" \
--label "documentation,runbook,ops" \
--assignee ops-team-lead

Complete Runbook Index

Documentation App Runbooks

  • docs/50-operations/runbooks/rbk-docs-deploy.md
  • docs/50-operations/runbooks/rbk-docs-rollback.md
  • docs/50-operations/runbooks/rbk-docs-broken-links-triage.md

Portfolio App Runbooks (Current Baseline)

Core runbooks:

  • docs/50-operations/runbooks/rbk-vercel-setup-and-promotion-validation.md — Vercel setup
  • docs/50-operations/runbooks/rbk-portfolio-deploy.md
  • docs/50-operations/runbooks/rbk-portfolio-rollback.md
  • docs/50-operations/runbooks/rbk-portfolio-ci-triage.md
  • docs/50-operations/runbooks/rbk-portfolio-secrets-incident.md — secrets incident response
  • docs/50-operations/runbooks/rbk-portfolio-project-publish.md — project publication workflow
  • docs/50-operations/runbooks/troubleshooting-portfolio-publish.md — publication troubleshooting
  • docs/50-operations/runbooks/rbk-portfolio-environment-promotion.md — environment promotion
  • docs/50-operations/runbooks/rbk-portfolio-environment-rollback.md — environment rollback

Performance and incident runbooks:

  • docs/50-operations/runbooks/rbk-portfolio-performance-optimization.md — proactive performance tuning
  • docs/50-operations/runbooks/rbk-portfolio-performance-troubleshooting.md — performance troubleshooting
  • docs/50-operations/runbooks/rbk-portfolio-incident-response.md — incident response framework
  • docs/50-operations/runbooks/rbk-portfolio-service-degradation.md — service degradation procedures
  • docs/50-operations/runbooks/rbk-portfolio-deployment-failure.md — deployment failure recovery

Related resources:

  • Runbook template: docs/_meta/templates/template-runbook.md (internal-only)
  • ADRs: docs/10-architecture/adr/
  • Threat models: docs/40-security/threat-models/
  • Observability: docs/30-devops-platform/observability-health-checks.md

Last Updated: 2026-02-04
Maintained By: Portfolio Operations Team
Next Review: 2026-05-04 (Quarterly)