Incident Response Handbook
Purpose
Provide a consolidated incident response handbook that supports on-call responders with severity guidance, quick selectors, and operational patterns. This handbook complements the runbook catalog at /docs/50-operations/runbooks.
Scope
In scope
- quick selectors to choose the right runbook
- severity-based response guidance
- operational patterns for detection, recovery, and verification
- tools and utilities used during response
- runbook improvement standards
Out of scope
- step-by-step procedures (see runbooks)
- architecture rationale (see ADRs)
Quick Selector: Find Your Runbook
Match your scenario to the appropriate runbook:
| I'm seeing... | Use this runbook |
|---|---|
| ❌ Deployment shows "Failed" in Vercel | Deployment Failure |
| ⚠️ Health endpoint returns 503 | Service Degradation |
| 🔴 All routes return 500 | Deployment Failure |
| 🐌 Pages load slowly (>3s) but no errors | Performance Troubleshooting |
| 📦 Bundle size too large (>30MB) | Performance Troubleshooting |
| ❓ Unclear incident, need framework | General Incident Response |
| 🔍 Want to understand monitoring setup | Observability & Health Checks |
| ⚡ Want to improve performance proactively | Performance Optimization |
| 🔐 CVE alert or dependency vulnerability | Dependency Vulnerability Response |
| 🚨 Suspected secret leak in repo | Secrets Incident Response |
Severity-Based Quick Reference
Critical Incident (SEV-1) — Immediate Response
Symptoms: Complete service outage, all users affected, all routes return 500
Quick Steps:
- Page on-call engineer + VP Engineering (Slack + SMS + phone)
- Create incident channel: #incident-INC-YYYYMMDD-NNN
- Execute runbook: Deployment Failure if recent deployment, otherwise General Incident Response
- Post updates every 5 minutes
- All-clear when resolved
- Schedule postmortem within 24 hours
MTTR Target: 15 minutes
If Secrets Incident: Execute Secrets Incident Response immediately; MTTR ≤5 min for critical secrets
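The channel naming convention in the steps above can be generated rather than typed by hand during a SEV-1. The following is a minimal sketch, not an established team script: the function name `incident_channel` is hypothetical, and it assumes NNN is a zero-padded per-day sequence number that the responder supplies as a plain decimal (no leading zeros).

```shell
#!/bin/sh
# Build a channel name in the handbook's #incident-INC-YYYYMMDD-NNN format.
# Assumption: NNN is a per-day sequence number, zero-padded to three digits.
incident_channel() {
  # $1 = date as YYYYMMDD, $2 = sequence number as plain decimal (e.g. 1, 42)
  printf '#incident-INC-%s-%03d\n' "$1" "$2"
}

# Example: first incident of the current UTC day.
incident_channel "$(date -u +%Y%m%d)" 1
```

Using the UTC date keeps channel names consistent when responders sit in different time zones; whether the team actually keys incident IDs on UTC is an assumption to confirm.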
High Severity (SEV-2) — Urgent Response
Symptoms: Significant user impact, core features broken, partial outage
Quick Steps:
- Notify on-call engineer via Slack + PagerDuty
- Execute runbook: Service Degradation or Deployment Failure
- Target resolution: less than 1 hour
- Post updates every 10 minutes
- Schedule postmortem within 48 hours
MTTR Target: 1 hour
If Dependency CVE (High): Execute Dependency Vulnerability Response; MTTR 48 hours
Medium Severity (SEV-3) — Normal Response
Symptoms: Minor user impact, slow performance, non-critical features unavailable
Quick Steps:
- Notify team lead via Slack
- Create GitHub issue to track
- Execute runbook: Service Degradation or Performance Troubleshooting
- Investigate during business hours
- No formal postmortem (document learnings in issue)
MTTR Target: 4 hours
Low Severity (SEV-4) — Low Priority
Symptoms: Cosmetic issues, documentation errors, non-user-facing problems
Quick Steps:
- Create GitHub issue with appropriate label
- Fix during next sprint
- No incident response required
MTTR Target: 24 hours or next sprint
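For use in scripts or chat tooling, the severity targets listed above can be collapsed into a single lookup. This is an illustrative sketch only: the function name `sev_targets` and the `key=value` output format are hypothetical, and fields the handbook does not specify for a level are reported as `n/a` rather than guessed.

```shell
#!/bin/sh
# Print the response targets this handbook lists for a severity level.
# Values the handbook does not define for a level are shown as n/a.
sev_targets() {
  case "$1" in
    SEV-1) echo "mttr=15m updates=5m postmortem=24h" ;;
    SEV-2) echo "mttr=1h updates=10m postmortem=48h" ;;
    SEV-3) echo "mttr=4h updates=n/a postmortem=none" ;;
    SEV-4) echo "mttr=24h-or-next-sprint updates=n/a postmortem=none" ;;
    *)
      # Unknown level: report on stderr and signal failure to the caller.
      echo "unknown severity: $1" >&2
      return 1
      ;;
  esac
}

# Example: show the targets for a critical incident.
sev_targets SEV-1
```

The secrets (MTTR ≤5 min) and high-severity CVE (MTTR 48 hours) overrides are deliberately left out of the lookup, since they depend on incident type rather than severity alone.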
Common Operational Patterns
Error Detection Patterns
| Pattern | Where to Look | What to Search For |
|---|---|---|
| Deployment errors | Vercel Deployments → Build logs | error, Error:, failed, FAILED |
| Runtime errors | Vercel Functions → Logs | "level":"error", 500, timeout |
| Performance issues | Vercel Analytics | Response time >3s, LCP >2.5s |
| Data issues | Health endpoint | projectCount: 0, status: "degraded" |
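The search terms in the table can be applied to logs saved locally. The sketch below assumes the build or function log has been copied or exported from the Vercel dashboard into a text file; the function names and the example file names are hypothetical. Matching on `500` is restricted to a standalone number so that values such as `1500` are not flagged.

```shell
#!/bin/sh
# Scan a saved build log for the handbook's deployment error terms.
# Prints matching lines with line numbers; exits non-zero if nothing matches.
scan_build_log() {
  grep -nE 'error|Error:|failed|FAILED' "$1"
}

# Scan a saved function log for runtime error terms:
# "level":"error", the word timeout, or a standalone 500.
scan_runtime_log() {
  grep -nE '"level":"error"|timeout|(^|[^0-9])500([^0-9]|$)' "$1"
}

# Usage (file names are placeholders for wherever the logs were saved):
#   scan_build_log build.log
#   scan_runtime_log functions.log
```

Because `grep` exits non-zero when nothing matches, either function can gate a follow-up step, for example `scan_build_log build.log || echo "no build errors found"`.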
Recovery Patterns
| Issue Category | Recovery Method | Example |
|---|---|---|
| Deployment failure | Vercel UI rollback or Git revert | Promote previous deployment |
| Data corruption | Restore from backup commit | git show <commit>:file.yml > file.yml |
| Config issue | Revert environment variable | Vercel Settings → Env Vars → Restore |
| Resource exhaustion | Clear cache or scale up | Vercel Cache → Clear All |
Verification Patterns
After any fix, always verify:
# 1. Health check returns 200
curl -s https://portfolio-app.vercel.app/api/health | jq '.status'
# 2. Routes are accessible
curl -sI https://portfolio-app.vercel.app/ | grep HTTP
# 3. No errors in recent logs
# Check Vercel Dashboard → Functions → Logs (last 5 minutes)
# 4. Response times normal
time curl -s https://portfolio-app.vercel.app/projects > /dev/null
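The HTTP checks above can be bundled into one pass that reports every failing route instead of stopping at the first. This is a sketch under stated assumptions, not the team's verification script: the function names, the `BASE_URL` and `CURL` override variables, and the 10-second timeout are all hypothetical choices. It checks status codes only; inspecting `.status` in the health payload and reviewing recent logs remain separate steps.

```shell
#!/bin/sh
# Base URL from this handbook; set BASE_URL to target another deployment.
BASE_URL="${BASE_URL:-https://portfolio-app.vercel.app}"
# CURL can be overridden, e.g. with a stub, to exercise the logic offline.
CURL="${CURL:-curl}"

# Succeed only if the given path answers with HTTP 200.
check_route() {
  # -w '%{http_code}' prints just the status code; the body is discarded.
  code=$("$CURL" -s -o /dev/null --max-time 10 -w '%{http_code}' "$BASE_URL$1")
  if [ "$code" = "200" ]; then
    echo "OK   $1 ($code)"
  else
    echo "FAIL $1 ($code)"
    return 1
  fi
}

# Check the health endpoint, the root route, and /projects.
# Returns non-zero if any of them failed.
verify_all() {
  failed=0
  for route in /api/health / /projects; do
    check_route "$route" || failed=1
  done
  return "$failed"
}

# Usage:
#   verify_all && echo "all routes verified"
```

Running all checks before returning gives a fuller picture in the incident channel than a script that aborts on the first failure.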
Tools & Utilities
Quick Commands
# Health check
curl -s https://portfolio-app.vercel.app/api/health | jq '.'
# Test route
curl -sI https://portfolio-app.vercel.app/projects | grep HTTP
# View recent deployments (requires Vercel CLI)
vercel ls | head -10
# View logs (requires Vercel CLI)
vercel logs --follow
# Git rollback
git revert <commit-sha> --no-edit && git push
External Dashboards
- Vercel Dashboard: https://vercel.com/bryce-seefieldts-projects/portfolio-app
- Vercel Deployments: https://vercel.com/bryce-seefieldts-projects/portfolio-app/deployments
- Vercel Logs: https://vercel.com/bryce-seefieldts-projects/portfolio-app/logs
- Vercel Status: https://www.vercel-status.com/
- GitHub Repository: https://github.com/bryce-seefieldt/portfolio-app
Monitoring Integrations
- UptimeRobot: (to be configured)
- PagerDuty: (to be configured)
- Slack Alerts: #incidents, #deployments, #alerts
Runbook Improvement & Feedback
Review Schedule
- After each use: Document any deviations from procedure
- After incidents: Update with new learnings from postmortem
- Quarterly: Full review of all runbooks for accuracy and completeness
- After platform changes: Update commands/screenshots if Vercel UI changes
Submitting Improvements
If you use a runbook and encounter issues:
- Unclear steps: Create GitHub issue to clarify
- Missing steps: Add to runbook and submit PR
- Incorrect commands: Test and correct in PR
- MTTR targets not achievable: Reassess and update target
Template for runbook improvements:
gh issue create \
--title "Runbook improvement: [runbook-name]" \
--body "Issue found: [description]
Suggested improvement: [what to change]
Context: Used during INC-YYYYMMDD-NNN" \
--label "documentation,runbook,ops" \
--assignee ops-team-lead
Complete Runbook Index
Documentation App Runbooks
- docs/50-operations/runbooks/rbk-docs-deploy.md
- docs/50-operations/runbooks/rbk-docs-rollback.md
- docs/50-operations/runbooks/rbk-docs-broken-links-triage.md
Portfolio App Runbooks (Current Baseline)
Core runbooks:
- docs/50-operations/runbooks/rbk-vercel-setup-and-promotion-validation.md: Vercel setup
- docs/50-operations/runbooks/rbk-portfolio-deploy.md
- docs/50-operations/runbooks/rbk-portfolio-rollback.md
- docs/50-operations/runbooks/rbk-portfolio-ci-triage.md
- docs/50-operations/runbooks/rbk-portfolio-secrets-incident.md: secrets incident response
- docs/50-operations/runbooks/rbk-portfolio-project-publish.md: project publication workflow
- docs/50-operations/runbooks/troubleshooting-portfolio-publish.md: publication troubleshooting
- docs/50-operations/runbooks/rbk-portfolio-environment-promotion.md: environment promotion
- docs/50-operations/runbooks/rbk-portfolio-environment-rollback.md: environment rollback
Performance and incident runbooks:
- docs/50-operations/runbooks/rbk-portfolio-performance-optimization.md: proactive performance tuning
- docs/50-operations/runbooks/rbk-portfolio-performance-troubleshooting.md: performance troubleshooting
- docs/50-operations/runbooks/rbk-portfolio-incident-response.md: incident response framework
- docs/50-operations/runbooks/rbk-portfolio-service-degradation.md: service degradation procedures
- docs/50-operations/runbooks/rbk-portfolio-deployment-failure.md: deployment failure recovery
Related Documentation
- Runbook template: docs/_meta/templates/template-runbook.md (internal-only)
- ADRs: docs/10-architecture/adr/
- Threat models: docs/40-security/threat-models/
- Observability: docs/30-devops-platform/observability-health-checks.md
Last Updated: 2026-02-04
Maintained By: Portfolio Operations Team
Next Review: 2026-05-04 (Quarterly)