Operations, Runbooks, Incident Response, and DR/BCP

Purpose

This section documents how the portfolio web app and its demos are operated in a production-like manner. The goal is to demonstrate the ability to run software, not just build it.

It includes:

runbooks (deploy/rollback/triage/config changes)
incident response practices and postmortems
DR/BCP planning aligned to service impact
operational checklists and validation steps

Scope

In scope

runbooks with prerequisites, step-by-step procedures, validation, rollback
incident response model: severity, comms, triage, escalation
postmortem template and expectations (blameless, corrective actions)
DR/BCP service impact analysis, RTO/RPO targets, recovery playbooks
operational ownership: who does what, and where responsibilities live

Out of scope

CI/CD implementation detail (belongs in 30-devops-platform/)
threat model details (belongs in 40-security/) except as operational references

Runbook standards (mandatory format)

Every runbook must contain:

Purpose
Scope (when to use / when not to use)
Prereqs / Access requirements
Procedure (step-by-step)
Validation (how to confirm success)
Rollback / Recovery steps
Failure modes / Troubleshooting
References (related ADRs, alerts, dashboards, security notes)

Runbook commands must be safe to copy and paste as written, and each step must state the environment it targets.
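The mandatory format above lends itself to an automated check. The following is a minimal sketch, assuming runbooks are stored as Markdown files whose section headings start with `#`; the function name `missing_sections` and the heading-prefix matching rule are illustrative assumptions, not part of the documented tooling.

```python
# Section names taken from the mandatory runbook format.
# Matching is by heading prefix, so "Prereqs / Access requirements"
# satisfies "Prereqs" (an assumption made for this sketch).
REQUIRED_SECTIONS = [
    "Purpose",
    "Scope",
    "Prereqs",
    "Procedure",
    "Validation",
    "Rollback",
    "Failure modes",
    "References",
]


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required sections that have no matching heading."""
    # Assumption: headings are Markdown lines that begin with '#'.
    headings = [
        line.lstrip("#").strip().lower()
        for line in runbook_text.splitlines()
        if line.startswith("#")
    ]
    return [
        name
        for name in REQUIRED_SECTIONS
        if not any(h.startswith(name.lower()) for h in headings)
    ]
```

A check like this could run in CI against every runbook file and fail the build when the returned list is non-empty.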

Incident response model (minimum viable enterprise)

Document:

severity definition (SEV levels and impact criteria)
triage process and initial containment guidance
communications templates (public-safe)
postmortem process: timeline, contributing factors, corrective actions
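A severity definition is easier to apply consistently when the impact criteria are written as explicit rules. The sketch below is one hypothetical three-level model; the level names, the `Impact` fields, and the thresholds are assumptions chosen for illustration, not the severity model this documentation set defines.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical levels; lower number means higher urgency.
    SEV1 = 1  # site unreachable
    SEV2 = 2  # a core feature is broken, site still reachable
    SEV3 = 3  # minor or cosmetic issue


@dataclass(frozen=True)
class Impact:
    # Illustrative impact criteria (assumed, not from the source docs).
    site_unreachable: bool
    core_feature_broken: bool


def classify(impact: Impact) -> Severity:
    """Map observed impact to a severity level, most severe rule first."""
    if impact.site_unreachable:
        return Severity.SEV1
    if impact.core_feature_broken:
        return Severity.SEV2
    return Severity.SEV3
```

Evaluating the most severe rule first means an incident that matches several criteria is always classified at the highest applicable level.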

DR/BCP expectations

Treat the portfolio app as a service:

identify dependencies (hosting, DNS, CI/CD, third-party services)
define recovery objectives (RTO/RPO) appropriate for a portfolio service
document recovery playbooks and validation steps
document “known hard failures” and mitigation/acceptance
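Recovery objectives become testable once a drill is measured against them. The sketch below assumes the usual definitions: RTO bounds the time from outage start to service restoration, and RPO bounds the gap between the last good backup and the outage start. The names `RecoveryObjectives` and `evaluate_drill` are hypothetical, and any concrete targets passed in are illustrative rather than values defined by this document.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # maximum tolerated time to restore service
    rpo: timedelta  # maximum tolerated window of lost data


def evaluate_drill(
    objectives: RecoveryObjectives,
    outage_start: datetime,
    service_restored: datetime,
    last_good_backup: datetime,
) -> dict[str, bool]:
    """Compare one recovery drill against the RTO and RPO targets."""
    downtime = service_restored - outage_start
    data_loss_window = outage_start - last_good_backup
    return {
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }
```

For example, with an assumed RTO of 4 hours and RPO of 24 hours, a drill that restores service in 2 hours from a 30-hour-old backup meets the RTO but misses the RPO.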

Validation and expected outcomes

Ops docs are “correct” when:

a reviewer can deploy and roll back deterministically
incident response steps are actionable and ordered
recovery guidance exists for common dependency failures

Failure modes and troubleshooting

Runbooks without validation: procedures end without confirming success → add explicit checks.
Rollback missing: deployment is documented but rollback is not → add the rollback procedure before the runbook is used.
IR is theoretical: no severity model or comms plan → add minimal viable IR scaffolding.

Runbooks & Procedures

See Runbooks Index for operational procedures:

General Incident Response: Framework for all incidents (severity levels, triage, postmortem)
Service Degradation: Diagnose and resolve performance/availability issues (MTTR: 10 min)
Deployment Failure Recovery: Detect and roll back failed deployments (MTTR: 5 min)
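The MTTR figures listed for these runbooks can be checked against recorded incident durations. The sketch below assumes MTTR is the arithmetic mean of time-to-recovery across incidents; the dictionary keys and the function names are hypothetical labels for this example, and only the 10-minute and 5-minute targets come from the list above.

```python
from datetime import timedelta
from statistics import mean

# Targets from the runbook list; the keys are hypothetical identifiers.
MTTR_TARGETS = {
    "service-degradation": timedelta(minutes=10),
    "deployment-failure": timedelta(minutes=5),
}


def mttr(durations: list[timedelta]) -> timedelta:
    """Mean time to recovery over a non-empty list of incident durations."""
    return timedelta(seconds=mean(d.total_seconds() for d in durations))


def meets_target(runbook: str, durations: list[timedelta]) -> bool:
    """True when the observed MTTR is within the runbook's target."""
    return mttr(durations) <= MTTR_TARGETS[runbook]
```

Because a mean can hide one very slow recovery, it may be worth reviewing the worst-case duration alongside this figure.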

For observability architecture and monitoring setup, see Observability & Health Checks.

References

Operational changes must be synchronized with:

CI/CD pipeline and environment documentation (30-devops-platform/)
security controls and threat model impacts (40-security/)
release notes for meaningful runtime changes (00-portfolio/release-notes/)
