Runbook: General Incident Response
Overview
This runbook provides a framework for responding to any incident affecting the portfolio platform. It defines severity levels, notification procedures, investigation phases, and postmortem processes applicable to all incident types.
When to use this runbook:
- Any incident not covered by a specific runbook (e.g., deployment failure, service degradation)
- Security incidents
- Data integrity issues
- Multi-component failures
- Unclear or complex incidents requiring structured triage
Severity Levels
| Severity | Definition | Response Time | Escalation | Example |
|---|---|---|---|---|
| SEV-4 (Low) | Non-user-impacting; cosmetic bugs, documentation errors | <24 hours | None | Typo in homepage text |
| SEV-3 (Medium) | Users minimally affected; slow pages, minor features unavailable | <4 hours | Team lead informed | Projects page slow (>5s) |
| SEV-2 (High) | Significant user impact; broken features, data loss risk | <1 hour | On-call engineer paged | Projects page returns 500 |
| SEV-1 (Critical) | Complete outage; all users affected, revenue/reputation impact | Immediate | VP Eng + full team | All routes return 500 |
Incident Classification
How to Determine Severity
Ask these questions during initial triage:
Question 1: Are users blocked from core functionality?
- Yes → SEV-2 or SEV-1 (depending on scale)
- No → SEV-3 or SEV-4 (depending on visibility)
Question 2: How many users are affected?
- All users → SEV-1 (critical)
- >50% of users → SEV-2 (high)
- <50% of users → SEV-3 (medium)
- 0 users (internal only) → SEV-4 (low)
Question 3: Is there a workaround?
- No workaround → Increase severity by 1 level
- Workaround available → Keep current severity
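If you want the flow scripted, here is a minimal sketch of a triage helper (a hypothetical script, not part of the runbook tooling; the thresholds mirror the three questions above):
#!/usr/bin/env bash
# classify-severity.sh (hypothetical): mirrors the triage questions above.
# Usage: ./classify-severity.sh <blocked: yes|no> <pct_affected: 0-100> <workaround: yes|no>
blocked=$1; pct=$2; workaround=$3
if [ "$blocked" = "yes" ]; then
  if [ "$pct" -ge 100 ]; then sev=1; else sev=2; fi
else
  if [ "$pct" -gt 0 ]; then sev=3; else sev=4; fi
fi
# No workaround bumps severity one level (SEV-1 stays SEV-1).
if [ "$workaround" = "no" ] && [ "$sev" -gt 1 ]; then sev=$((sev - 1)); fi
echo "SEV-$sev"
# Example: ./classify-severity.sh yes 100 no  ->  SEV-1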
Example Classifications
| Scenario | Severity | Rationale |
|---|---|---|
| All routes return 500 | SEV-1 | Complete outage, all users blocked, no workaround |
| Projects page returns 404 | SEV-2 | Core feature broken, many users affected |
| Contact form email not sending | SEV-3 | Non-core feature, users can email directly (workaround) |
| Typo in CV page | SEV-4 | Cosmetic issue, no functional impact |
| Slow homepage (>5s load) | SEV-3 | Performance degradation, users can still access |
| Analytics not loading | SEV-4 | Non-user-facing feature, internal metrics only |
Incident Notification & Escalation
Who to Notify
| Severity | Notify | Channel | Urgency | Expected Response |
|---|---|---|---|---|
| SEV-4 | Team lead | Slack #portfolio-updates | ASAP (business hours) | Acknowledge within 1 day |
| SEV-3 | On-call eng + Team lead | Slack #incidents | Within 15 min | Start investigation within 15 min |
| SEV-2 | On-call eng + VP Eng | Slack #incidents + PagerDuty page | Immediately | Start investigation within 5 min |
| SEV-1 | Full team + VP Eng + CEO | Slack #incidents + SMS + Phone | Immediately (wake up if needed) | All hands on deck; respond within 2 min |
Notification Templates
SEV-1 (Critical)
🚨 CRITICAL INCIDENT: Portfolio App Complete Outage
Incident ID: INC-20260126-003
Severity: SEV-1 (CRITICAL)
Started: 2026-01-26 16:00 UTC
Impact: ALL ROUTES RETURN 500 - COMPLETE SERVICE OUTAGE
Assigned: @oncall-engineer
Incident Commander: @vp-eng
Action: ALL HANDS - Join #incident-INC-20260126-003
MTTR Target: 15 minutes
Next update: 2 minutes
@channel @here
SEV-2 (High)
🔴 HIGH SEVERITY INCIDENT: Projects Page Broken
Incident ID: INC-20260126-004
Severity: SEV-2 (High)
Started: 2026-01-26 16:00 UTC
Impact: Projects page returns 500, homepage working
Assigned: @oncall-engineer
Action: Investigating root cause
MTTR Target: 1 hour
Next update: 10 minutes
SEV-3 (Medium)
⚠️ INCIDENT: Performance Degradation
Incident ID: INC-20260126-005
Severity: SEV-3 (Medium)
Started: 2026-01-26 16:00 UTC
Impact: Slow page loads (>5s), all routes functional
Assigned: @oncall-engineer
Action: Investigating performance bottleneck
MTTR Target: 4 hours
Next update: 30 minutes
Response Phases
Phase 1: Triage (First 5 minutes)
Objective: Verify incident, assess scope, classify severity, notify stakeholders.
Step 1: Verify the Incident
- Can you reproduce it? Try accessing affected routes/features
- How many users affected? Check logs, analytics, user reports
- What environment? Production? Staging? Both?
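To confirm reproduction quickly, a loop like the following curls the core routes and prints HTTP status codes (the route list is illustrative; substitute the routes actually affected):
# Spot-check core routes; adjust the list to the reported symptoms.
for route in / /projects /cv /contact /api/health; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "https://portfolio-app.vercel.app$route")
  echo "$route -> $code"
done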
Step 2: Assess Scope
- What's broken? Specific routes? All routes? Specific features?
- What's still working? Homepage? Other pages? Health endpoint?
- When did it start? Exact timestamp (helps identify deployments, env changes)
Step 3: Classify Severity
Use the severity matrix above to classify the incident as SEV-1, SEV-2, SEV-3, or SEV-4.
Step 4: Assign Incident Number
Format: INC-YYYYMMDD-NNN
Example: INC-20260126-001
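A sketch for generating the next ID in this format (the sequence source is an assumption; here it is just a local counter file):
# Hypothetical ID generator: INC-YYYYMMDD-NNN, sequence kept in a local file.
date_part=$(date -u +%Y%m%d)
seq_file=".incident-seq-$date_part"
n=$(( $(cat "$seq_file" 2>/dev/null || echo 0) + 1 ))
echo "$n" > "$seq_file"
printf 'INC-%s-%03d\n' "$date_part" "$n"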
Step 5: Open Incident Channel
For SEV-1 and SEV-2 only:
- Create Slack channel: #incident-INC-20260126-001
- Post initial assessment:
- Incident number
- Severity
- Impact description
- Assigned responder
- Initial hypothesis (if any)
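If you automate the initial post, Slack's chat.postMessage API works from the command line. A minimal sketch, assuming a bot token in SLACK_BOT_TOKEN and that the incident channel already exists with the bot invited:
# Assumes SLACK_BOT_TOKEN is set; channel and text are examples.
curl -s -X POST https://slack.com/api/chat.postMessage \
  -H "Authorization: Bearer $SLACK_BOT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "channel": "#incident-INC-20260126-001",
    "text": "INC-20260126-001 | SEV-2 | Projects page returns 500 | Assigned: @oncall-engineer | Hypothesis: recent deployment"
  }'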
Phase 2: Investigation (10–30 minutes)
Objective: Identify root cause through systematic diagnosis.
Step 1: Gather Context
What changed recently?
- Check recent deployments (last 24 hours)
- Check environment variable changes (Vercel settings)
- Check infrastructure status (Vercel status page)
- Check external dependencies (docs site, GitHub API)
Timeline analysis:
# When did it start?
# Example: Started at 15:30 UTC
# What happened at 15:30?
# Check Vercel deployments:
vercel ls | head -10
# Check Git commits:
git log --since="2 hours ago" --oneline
# Check recent env var changes:
# (View in Vercel dashboard → Settings → Environment Variables → Activity Log)
Step 2: Check Monitoring
- Health endpoint: curl https://portfolio-app.vercel.app/api/health | jq '.'
- Vercel Logs: Filter by time range (when incident started)
- Error patterns: Search logs for keywords related to symptom
- Metrics: Error rate, response time, resource usage
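While investigating, it helps to keep the health endpoint under continuous watch. A simple polling loop (the interval is arbitrary):
# Poll the health endpoint every 15 seconds, logging status codes with timestamps.
while true; do
  code=$(curl -s -o /dev/null -w '%{http_code}' https://portfolio-app.vercel.app/api/health)
  echo "$(date -u +%H:%M:%S) /api/health -> $code"
  sleep 15
done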
Step 3: Identify Root Cause
Follow specific runbooks if applicable:
| Symptom | Likely Runbook | Root Cause Category |
|---|---|---|
| All routes return 500 after deployment | Deployment Failure | Deployment issue |
| Health endpoint returns 503 | Service Degradation | Data/config issue |
| Slow response times but no errors | Performance Troubleshooting | Performance issue |
| Security breach suspected | Security Incident Runbook (future) | Security issue |
If no runbook applies:
Categorize into:
- Code issue: Bug in application logic, missing error handling
- Infrastructure issue: Vercel platform problem, CDN issue
- Data issue: Database corruption, missing/invalid data
- External dependency: Third-party API down or slow
- Configuration issue: Wrong env var, misconfigured setting
Step 4: Assign Ownership
- Who owns the affected component?
- Who can implement the fix?
- Who needs to be involved in resolution?
Phase 3: Mitigation & Resolution (5–60 minutes)
Objective: Fix the issue or implement temporary mitigation.
Option A: Temporary Mitigation (if full fix takes too long)
Goal: Reduce user impact while root cause fix is developed.
Examples:
- Serve cached/stale data instead of live data
- Disable broken feature temporarily
- Route users to fallback page
- Show graceful degradation message
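One way to disable a broken feature on Vercel without a code change is an environment-variable feature flag, if the application reads one. A sketch, assuming a hypothetical NEXT_PUBLIC_FEATURE_PROJECTS flag that the app already checks:
# Hypothetical flag; only works if application code reads this variable.
printf 'false' | vercel env add NEXT_PUBLIC_FEATURE_PROJECTS production
# Env var changes require a redeploy to take effect:
vercel --prod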
Option B: Full Resolution
Steps:
- Develop the fix on a fix branch
- Test locally: pnpm verify must pass
- Get code review: If SEV-1 and time-critical, skip review but document the change afterward
- Deploy fix: Merge to main, Vercel auto-deploys
- Monitor: Watch logs for 10 minutes post-deployment
- Verify: Health check returns 200, error rate returns to normal
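The resolution steps above, as a shell sequence (branch name and merge flow are illustrative; follow the normal PR process when time allows):
# Illustrative fix workflow; adapt branch naming and review flow to your process.
git checkout -b fix/inc-20260126-001
# ...implement the fix...
pnpm verify                      # must pass before deploying
git commit -am "fix: restore projects.yml (INC-20260126-001)"
git checkout main && git merge fix/inc-20260126-001
git push origin main             # Vercel auto-deploys main
curl -s https://portfolio-app.vercel.app/api/health | jq '.'   # verify after deploy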
Phase 4: Communication
During Incident
Post updates every 5–10 minutes in the #incident-* channel:
Update format:
[HH:MM] {STATUS}: {update message}
Example updates:
[15:35] INVESTIGATING: Found error in project loading, checking database
[15:40] IDENTIFIED: Root cause is corrupted projects.yml file
[15:45] MITIGATING: Deploying fix - restoring projects.yml from backup
[15:50] MONITORING: Fix deployed, watching error rate
[15:55] RESOLVED: Error rate back to normal, all routes operational
Status keywords:
- INVESTIGATING — Diagnosis in progress
- IDENTIFIED — Root cause found
- MITIGATING — Implementing fix
- MONITORING — Fix deployed, watching for regression
- RESOLVED — Issue fixed, service restored
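A tiny helper keeps updates consistently formatted; this sketch assumes a Slack incoming-webhook URL in SLACK_WEBHOOK_URL (an assumption; posting by hand works just as well):
# Usage: ./update.sh IDENTIFIED "Root cause is corrupted projects.yml"
status=$1; message=$2
line="[$(date -u +%H:%M)] $status: $message"
echo "$line"
# Note: naive JSON quoting; the message must not contain double quotes.
curl -s -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d "{\"text\": \"$line\"}"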
After Incident (All-Clear)
Post in #incidents when fully resolved:
✅ RESOLVED: INC-20260126-001 - Portfolio App Degradation
Duration: 25 minutes (started 15:30, resolved 15:55 UTC)
Impact: Projects page unavailable, homepage operational
Root Cause: Corrupted projects.yml after merge conflict
Resolution: Restored projects.yml from commit abc1234
Postmortem: Scheduled for 2026-01-27 10:00 UTC
Attendees: @oncall-engineer, @team-lead, @developer-who-committed
Preventive Actions (tracked in GitHub):
- #123: Add registry validation to required CI checks
- #124: Add pre-commit hook for YAML syntax validation
Postmortem Phase (Within 24 hours)
Purpose
- Understand why the incident happened (not who to blame)
- Identify gaps in processes, testing, or monitoring
- Implement preventive controls to avoid recurrence
- Share learnings with team
Postmortem Structure
Template: docs/_meta/templates/template-postmortem.md
Required Sections
- Incident Summary
  - Incident ID, severity, duration, impact
  - Timeline of key events
- Root Cause Analysis
  - What happened (technical description)
  - Why it happened (contributing factors)
  - Why it wasn't caught earlier (process/testing gaps)
- Resolution
  - How was it fixed?
  - Who fixed it?
  - How long did it take?
- Preventive Actions
  - What will prevent this from happening again?
  - Assigned owner for each action
  - Target completion date
- Lessons Learned
  - What went well?
  - What could be improved?
  - Process or tool recommendations
Conducting the Postmortem Meeting
Attendees: Responders, team lead, affected developers, observers
Duration: 30–60 minutes
Agenda:
- Timeline review (10 min) — Walk through incident chronologically
- Root cause deep-dive (15 min) — Why did this happen? Contributing factors?
- Five Whys exercise (10 min) — Drill down to fundamental cause
- Preventive actions (15 min) — Brainstorm what could have prevented this
- Action items (10 min) — Assign owners, set deadlines
Rules:
- Blameless — Focus on systems, not individuals
- Fact-based — Use logs, timestamps, metrics
- Forward-looking — What can we improve?
Follow-Up
Preventive Controls
Implement actions identified in postmortem:
Examples:
- Technical controls:
  - Add automated validation (e.g., pnpm registry:validate in CI)
  - Add health checks after deployment
  - Add error monitoring alerts
- Process improvements:
  - Update runbooks with new learnings
  - Improve deployment checklist
  - Add testing requirements
- Monitoring enhancements:
  - Add new alerts (e.g., error rate >5%)
  - Improve log visibility
  - Add custom metrics
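For the error-rate alert, a first approximation can be computed from exported logs. A sketch, assuming a plain-text export with one request per line ending in an HTTP status code (the log format is an assumption):
# Assumes logs.txt has one request per line ending in an HTTP status code.
total=$(wc -l < logs.txt)
errors=$(grep -cE ' 5[0-9][0-9]$' logs.txt)
awk -v e="$errors" -v t="$total" 'BEGIN { printf "error rate: %.1f%%\n", (t ? 100*e/t : 0) }'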
Track Preventive Actions
Create GitHub issues for each action:
gh issue create \
--title "Add registry validation to CI (postmortem INC-20260126-001)" \
--body "Prevent corrupted projects.yml from deploying." \
--label "ci,enhancement,postmortem-followup" \
--milestone "Phase 4.3" \
--assignee oncall-engineer
Team Training
- Incident response drill — Simulate incidents quarterly
- Runbook review — Update runbooks after each major incident
- Lessons learned sharing — Discuss in team meeting
Severity-Based Quick Reference
SEV-1 (Critical) — Immediate Response
- Page everyone: VP Eng + full team (Slack + SMS + phone)
- Create incident channel: #incident-INC-YYYYMMDD-NNN
- Assign incident commander: VP Eng or senior engineer
- Execute relevant runbook: Deployment failure or service degradation
- Post updates every 5 min
- All-clear when resolved
- Postmortem within 24 hours
MTTR Target: 15 minutes
SEV-2 (High) — Urgent Response
- Notify on-call engineer via PagerDuty
- Create incident channel (if multi-person response)
- Execute relevant runbook: Service degradation, deployment failure
- Post updates every 10 min
- Escalate to VP Eng if >1 hour
- Postmortem within 48 hours
MTTR Target: 1 hour
SEV-3 (Medium) — Normal Response
- Notify team lead via Slack
- Create GitHub issue to track investigation
- Investigate during business hours
- Fix within 4 hours (or provide ETA)
- No formal postmortem (document learnings in issue)
MTTR Target: 4 hours
SEV-4 (Low) — Low Priority
- Create GitHub issue with bug or documentation label
- Fix during next sprint
- No incident response required
MTTR Target: 24 hours (or next sprint)
Tools & References
Incident Management
- Slack channels: #incidents, #portfolio-updates
- PagerDuty: portfolio-app on-call rotation
- GitHub Issues: Incident tracking label
Monitoring & Logs
- Health endpoint: https://portfolio-app.vercel.app/api/health
- Vercel Logs: filter by incident time range
- Vercel status page: infrastructure status
Related Runbooks
- Service Degradation
- Deployment Failure
- Performance Optimization
- Performance Troubleshooting
- Observability & Health Checks
Last Updated: 2026-01-26
Maintained By: Portfolio Operations Team
Review Schedule: Quarterly + after each SEV-1 incident