Skip to main content

Incident Response

Purpose

Define the incident response governance model for the portfolio platform. This section sets the minimum operating standard for detecting, triaging, communicating, and learning from incidents.

Incident response is treated as evidence of operational maturity: clear severity definitions, deterministic triage steps, and documented learning loops.

Scope

In scope

  • severity definitions and escalation rules
  • incident lifecycle: detect → triage → contain → recover → review
  • communication expectations and update cadence
  • postmortem and corrective action standards
  • references to runbooks and operational evidence

Out of scope

  • detailed step-by-step procedures (use runbooks)
  • architecture rationale (use ADRs)
  • security threat enumeration (use threat models)

Prereqs / Inputs

Incident responders should know:

  • which system is affected and which environment(s) are in scope
  • where to find logs/health checks for validation
  • which runbook to execute for a given failure mode
  • how to communicate status updates safely (public-safe only)

Incident Lifecycle (Governance)

  1. Detect — alert, report, or health check indicates failure
  2. Triage — confirm impact and assign severity
  3. Contain — stop the bleeding (rollback, disable, isolate)
  4. Recover — restore service and validate health
  5. Review — document root cause, corrective actions, and follow-ups

Severity Model (Minimum Standard)

Severity must map to user impact and recovery urgency. Use the severity guidance in the handbook:

Communications (Minimum Standard)

  • SEV-1/SEV-2: updates every 5–10 minutes
  • SEV-3: updates every 15–30 minutes
  • SEV-4: issue tracker update only

Keep all communications public-safe and avoid sensitive operational detail.

Required Artifacts

Incidents must reference:

  • the executing runbook (for deterministic recovery)
  • related ADRs (if the incident exposes a decision gap)
  • related threat models (if security-relevant)

References