Operations, Runbooks, Incident Response, and DR/BCP
Purpose
This section documents how the portfolio web app and its demos are operated in a production-like manner. The goal is to demonstrate the ability to run software, not just build it.
It includes:
- runbooks (deploy/rollback/triage/config changes)
- incident response practices and postmortems
- DR/BCP planning aligned to service impact
- operational checklists and validation steps
Scope
In scope
- runbooks with prerequisites, step-by-step procedures, validation, rollback
- incident response model: severity, comms, triage, escalation
- postmortem template and expectations (blameless, corrective actions)
- DR/BCP service impact analysis, RTO/RPO targets, recovery playbooks
- operational ownership: who does what, and where responsibilities live
Out of scope
- CI/CD implementation detail (belongs in 30-devops-platform/)
- threat model details (belongs in 40-security/) except as operational references
Runbook standards (mandatory format)
Every runbook must contain:
- Purpose
- Scope (when to use / when not to use)
- Prereqs / Access requirements
- Procedure (step-by-step)
- Validation (how to confirm success)
- Rollback / Recovery steps
- Failure modes / Troubleshooting
- References (related ADRs, alerts, dashboards, security notes)
Runbooks must be copy/paste safe: every command should run as written, and each step must state which environment it targets.
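The mandatory format above lends itself to an automated check. The following is a minimal sketch, not part of the existing tooling: the function name `missing_sections` is hypothetical, and it assumes each required section appears as a line that starts with the section name, optionally preceded by Markdown `#` markers.

```python
import re

# Required sections, taken from the mandatory runbook format.
REQUIRED_SECTIONS = [
    "Purpose",
    "Scope",
    "Prereqs",
    "Procedure",
    "Validation",
    "Rollback",
    "Failure modes",
    "References",
]


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required sections that have no matching heading line."""
    # Assumption: a heading is any non-empty line that begins with the
    # section name once leading '#' markers and whitespace are removed.
    headings = [
        re.sub(r"^#+\s*", "", line.strip()).lower()
        for line in runbook_text.splitlines()
        if line.strip()
    ]
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(h.startswith(section.lower()) for h in headings)
    ]
```

A check like this could run in review or CI and fail when the returned list is non-empty. It only confirms that the sections exist, not that their content is adequate.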
Incident response model (minimum viable enterprise)
Document:
- severity definition (SEV levels and impact criteria)
- triage process and initial containment guidance
- communications templates (public-safe)
- postmortem process: timeline, contributing factors, corrective actions
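A severity definition is easier to apply consistently when the impact criteria are written as ordered rules. The sketch below is illustrative only: the three levels, the `Impact` fields, and the `classify` function are hypothetical placeholders, since this section does not fix the actual SEV levels or criteria.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical levels; a lower number means a more severe incident.
    SEV1 = 1  # site unreachable for all visitors
    SEV2 = 2  # a core page or demo is broken, site still reachable
    SEV3 = 3  # minor or cosmetic degradation


@dataclass(frozen=True)
class Impact:
    # Hypothetical impact criteria observed during triage.
    site_unreachable: bool
    core_feature_broken: bool


def classify(impact: Impact) -> Severity:
    """Map observed impact to a severity level, checking the most severe rule first."""
    if impact.site_unreachable:
        return Severity.SEV1
    if impact.core_feature_broken:
        return Severity.SEV2
    return Severity.SEV3
```

Evaluating the most severe rule first means an incident that matches several criteria is always assigned the highest applicable severity.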
DR/BCP expectations
Treat the portfolio app as a service:
- identify dependencies (hosting, DNS, CI/CD, third-party services)
- define recovery objectives (RTO/RPO) appropriate for a portfolio service
- document recovery playbooks and validation steps
- document “known hard failures” and mitigation/acceptance
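Recovery objectives are most useful when a drill or real outage can be compared against them. The following is a minimal sketch under stated assumptions: the `RecoveryObjective` type, the `breaches` function, and the example durations are all hypothetical and are not the targets defined for this service.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    dependency: str  # e.g. hosting, DNS, CI/CD
    rto: timedelta   # maximum acceptable time to restore service
    rpo: timedelta   # maximum acceptable window of lost data


def breaches(
    objective: RecoveryObjective,
    downtime: timedelta,
    data_loss: timedelta,
) -> list[str]:
    """List which objectives a recovery drill or real outage exceeded."""
    result = []
    if downtime > objective.rto:
        result.append(f"{objective.dependency}: RTO exceeded by {downtime - objective.rto}")
    if data_loss > objective.rpo:
        result.append(f"{objective.dependency}: RPO exceeded by {data_loss - objective.rpo}")
    return result


# Placeholder values for illustration only.
hosting = RecoveryObjective("hosting", rto=timedelta(hours=4), rpo=timedelta(hours=24))
print(breaches(hosting, downtime=timedelta(hours=5), data_loss=timedelta(hours=1)))
```

An empty list means the recovery stayed within both objectives. Any entries are candidates for the "known hard failures" list or for corrective actions.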
Validation and expected outcomes
Ops docs are “correct” when:
- a reviewer can deploy and roll back deterministically
- incident response steps are actionable and ordered
- recovery guidance exists for common dependency failures
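One way to read "deterministically" is that a deploy always ends in one of a small set of known states. The sketch below shows that control flow with injected callables. The function name, the three callables, and the result strings are hypothetical, and no real deployment tool or CLI is assumed.

```python
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[], None],
    validate: Callable[[], bool],
    rollback: Callable[[], None],
) -> str:
    """Deploy, validate, and roll back (then re-validate) if validation fails."""
    deploy()
    if validate():
        return "deployed"
    rollback()
    # A rollback only counts as complete once the restored version validates.
    return "rolled_back" if validate() else "rollback_failed"
```

In a runbook, `validate` corresponds to the Validation section and `rollback` to the Rollback / Recovery steps. A `rollback_failed` result is the point at which the incident response process would take over.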
Failure modes and troubleshooting
- Runbooks without validation: procedures end without confirming success → add explicit checks.
- Rollback missing: deployment is documented but rollback is not → fix immediately.
- IR is theoretical: no severity model or comms plan → add minimal viable IR scaffolding.
Runbooks & Procedures
See Runbooks Index for operational procedures:
- General Incident Response — Framework for all incidents (severity levels, triage, postmortem)
- Service Degradation — Diagnose and resolve performance/availability issues (MTTR: 10 min)
- Deployment Failure Recovery — Detect and rollback failed deployments (MTTR: 5 min)
For observability architecture and monitoring setup, see Observability & Health Checks.
References
Operational changes must be synchronized with:
- CI/CD pipeline and environment documentation (30-devops-platform/)
- security controls and threat model impacts (40-security/)
- release notes for meaningful runtime changes (00-portfolio/release-notes/)