Observability & Health Checks

Overview

Observability is the ability to understand the internal state of an application by examining its outputs (logs, metrics, traces). Unlike traditional monitoring, which tells you when something is broken, observability enables you to understand why it is broken and how to fix it.

Why Observability Matters

  • Faster Incident Response: Detect issues before users report them (MTTR target: < 10 minutes)
  • Proactive Problem Detection: Identify degradation trends before they become outages
  • Evidence-Based Debugging: Structured logs provide context for reproducing and fixing issues
  • Operational Confidence: Clear health status enables automated monitoring and alerting

Observability Pillars

Production applications should implement three observability pillars:

  1. Logs — Structured JSON logs for debugging and audit trails
  2. Health Checks — Application status endpoint for monitoring
  3. Metrics — Quantitative performance measurements (future enhancement)

Health Check Endpoint

Endpoint Specification

URL: GET /api/health
Cache Policy: No caching (revalidate: 0)
Response Time Target: < 500ms

Response Format

The health endpoint returns JSON with the following fields:

| Field | Type | Description | Example |
| ----- | ---- | ----------- | ------- |
| status | "healthy", "degraded", or "unhealthy" | Application health state | "healthy" |
| timestamp | string (ISO 8601) | Response timestamp | "2026-01-26T15:30:45.123Z" |
| environment | string | Deployment environment | "production" |
| commit | string | Git commit SHA (first 7 chars) | "a2058c7" |
| buildTime | string (ISO 8601) | Build timestamp | "2026-01-26T15:20:00.000Z" |
| projectCount | number | Number of projects/items loaded | 8 |
| message | string (optional) | Error or degradation description | "No projects loaded" |
| error | string (optional) | Error message (unhealthy only) | "Cannot load registry" |

Status Codes

| HTTP Status | State | Meaning | Action |
| ----------- | ----- | ------- | ------ |
| 200 OK | Healthy | All systems operational, all checks passed | No action needed |
| 503 Service Unavailable | Degraded | Core functionality works, but some features unavailable | Monitor for 5 minutes; escalate if it persists |
| 500 Internal Server Error | Unhealthy | Critical failure; service broken | Execute incident runbook immediately |
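
For reference, here is a minimal sketch of how a handler could map check results to these status codes, assuming a Next.js App Router route; loadProjects and the BUILD_TIME variable are hypothetical stand-ins for the application's actual registry loader and build metadata:

// app/api/health/route.ts — illustrative sketch, not the actual implementation
import { loadProjects } from '@/lib/registry'; // hypothetical registry loader

export const revalidate = 0; // disable caching, per the endpoint specification

export async function GET() {
  const base = {
    timestamp: new Date().toISOString(),
    environment: process.env.VERCEL_ENV ?? 'development',
    commit: (process.env.VERCEL_GIT_COMMIT_SHA ?? 'unknown').slice(0, 7),
    buildTime: process.env.BUILD_TIME, // hypothetical, stamped at build time
  };

  try {
    const projects = await loadProjects();
    if (projects.length === 0) {
      // Core route works but the registry is empty → degraded
      return Response.json(
        { status: 'degraded', message: 'No projects loaded', ...base },
        { status: 503 },
      );
    }
    return Response.json({ status: 'healthy', projectCount: projects.length, ...base });
  } catch (err) {
    // Any uncaught failure in the check itself → unhealthy
    const error = err instanceof Error ? err.message : String(err);
    return Response.json({ status: 'unhealthy', error, ...base }, { status: 500 });
  }
}

Keeping all check logic in one handler means the status-to-code mapping above stays the single source of truth.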

Response Examples

Healthy (200 OK)

{
  "status": "healthy",
  "timestamp": "2026-01-26T15:30:45.123Z",
  "environment": "production",
  "commit": "a2058c7",
  "buildTime": "2026-01-26T15:20:00.000Z",
  "projectCount": 8
}

Degraded (503 Service Unavailable)

{
  "status": "degraded",
  "message": "Registry data incomplete",
  "timestamp": "2026-01-26T15:30:45.123Z",
  "environment": "production",
  "commit": "a2058c7"
}

Unhealthy (500 Internal Server Error)

{
  "status": "unhealthy",
  "error": "Cannot read properties of undefined (reading 'length')",
  "timestamp": "2026-01-26T15:30:45.123Z",
  "environment": "production"
}

Testing the Health Endpoint

Local development:

# Start dev server
pnpm dev

# Test health endpoint
curl http://localhost:3000/api/health | jq .

Production:

# Test deployed instance
curl https://production-domain.com/api/health | jq .

# Automated monitoring
curl -sf https://production-domain.com/api/health | jq -e '.status == "healthy"'

Structured Logging

Log Format

All logs follow the LogEntry interface:

interface LogEntry {
  timestamp: string;                           // ISO 8601 timestamp
  level: 'info' | 'warn' | 'error' | 'debug';  // Severity level
  message: string;                             // Human-readable description
  context?: Record<string, unknown>;           // Structured metadata
  environment?: string;                        // Deployment environment
}

Example log output:

{
  "timestamp": "2026-01-26T15:30:45.123Z",
  "level": "error",
  "message": "Failed to load project",
  "context": {
    "slug": "portfolio-app",
    "error": "Not found",
    "route": "/projects/portfolio-app"
  },
  "environment": "production"
}
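
A logger producing this shape can be small. The following is an illustrative sketch, not the project's actual module; it emits one JSON object per line so output stays parseable (e.g., with jq):

// lib/logger.ts — illustrative sketch
interface LogEntry { // as defined above
  timestamp: string;
  level: 'info' | 'warn' | 'error' | 'debug';
  message: string;
  context?: Record<string, unknown>;
  environment?: string;
}

type Level = LogEntry['level'];

function log(level: Level, message: string, context?: Record<string, unknown>): void {
  // debug is development-only (see Log Levels below)
  if (level === 'debug' && process.env.NODE_ENV === 'production') return;

  const entry: LogEntry = {
    timestamp: new Date().toISOString(),
    level,
    message,
    ...(context ? { context } : {}),
    environment: process.env.NODE_ENV,
  };
  // One JSON object per line keeps logs machine-parseable
  const sink = level === 'debug' ? console.log : console[level];
  sink(JSON.stringify(entry));
}

export const logger = {
  info: (m: string, c?: Record<string, unknown>) => log('info', m, c),
  warn: (m: string, c?: Record<string, unknown>) => log('warn', m, c),
  error: (m: string, c?: Record<string, unknown>) => log('error', m, c),
  debug: (m: string, c?: Record<string, unknown>) => log('debug', m, c),
};

With this in place, logger.error('Failed to load project', { slug: 'portfolio-app', route: '/projects/portfolio-app' }) yields an entry like the example above.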

Log Levels

| Level | Usage | Examples | Alerting |
| ----- | ----- | -------- | -------- |
| info | Normal operations, state changes | "User loaded page", "Registry loaded" | No alerts |
| warn | Unexpected but non-critical issues | "Slow render (3.5s)", "404 Not Found" | Alert if >10/min |
| error | Failures requiring attention | "Failed to load", "API timeout" | Alert immediately |
| debug | Development debugging only | "Cache hit", "Props: {...}" | Disabled in prod |

Context Guidelines

Do include in context:

  • Route/URL: route: '/projects/portfolio-app'
  • Operation: operation: 'loadProjects'
  • Timing: renderTime: 2500, cacheHit: false
  • Non-sensitive IDs: slug: 'portfolio-app'

Do NOT include in context:

  • Passwords or API keys
  • User emails or PII
  • Sensitive internal URLs
  • Secrets or tokens
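
One defensive pattern (illustrative, not from the project) is to scrub context objects before they reach the logger; the key list below is an assumption, not exhaustive:

// Redact obviously sensitive keys from log context before emitting.
const SENSITIVE_KEYS = /password|secret|token|api[-_]?key|email|authorization/i;

function sanitizeContext(context: Record<string, unknown>): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(context).map(([key, value]) =>
      SENSITIVE_KEYS.test(key) ? [key, '[REDACTED]'] : [key, value],
    ),
  );
}

// Usage:
// logger.error('Failed to load', sanitizeContext({ route: '/x', apiKey: 'abc' }));
// → context.apiKey becomes '[REDACTED]'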

Viewing Logs

Local development:

pnpm dev
# Logs appear in terminal as JSON

Production (Vercel):

  1. Go to Vercel Dashboard
  2. Click "Deployments" → Select deployment
  3. Click "Functions" → View function logs
  4. Filter by time range and keywords

Parsing with jq:

# Filter by error level
cat logs.json | jq 'select(.level == "error")'

# Count errors by route
cat logs.json | jq -r '.context.route' | sort | uniq -c

# Find slow operations (>3s)
cat logs.json | jq 'select(.context.renderTime > 3000)'

Failure Modes Definition

State Definitions

| State | Definition | User Impact | HTTP Status | Detection |
| ----- | ---------- | ----------- | ----------- | --------- |
| Healthy | All routes render successfully, no errors | None | 200 | Health endpoint returns healthy |
| Degraded | Core routes work, but some features unavailable | Minor | 503 | Health endpoint returns degraded OR median response time >3s |
| Unhealthy | Critical routes fail, registry empty, build failed | Major | 500 | Health endpoint returns unhealthy OR 500 errors on all routes |

State Transition Diagram

Healthy ──→ Degraded ──→ Unhealthy
   ↑            │ (fix)        │ (rollback)
   └────────────┴──────────────┘

Healthy → Degraded: Data load issue, slow performance, partial failure

Degraded → Unhealthy: Critical failure, build error, configuration issue

Recovery: Fix deployed or rollback to previous version

Detection Methods

Automated monitoring:

# External uptime monitor polls every 1 minute
curl -s https://production-domain.com/api/health | jq '.status'
# Expected: "healthy"

Alert if:

  • Status changes from healthy to degraded or unhealthy
  • Endpoint unresponsive >30 seconds
  • Response time > 5 seconds
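
For environments without an external monitor, a small standalone check can approximate this. The script below is a hypothetical sketch suitable for a cron job or CI step; it exits non-zero when the endpoint is not healthy or unresponsive (Node 18+ for the global fetch):

// check-health.ts — hypothetical standalone monitor
const HEALTH_URL = 'https://production-domain.com/api/health';

async function check(): Promise<void> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 5_000); // 5s budget, per alert rules
  try {
    const res = await fetch(HEALTH_URL, { signal: controller.signal });
    const body = (await res.json()) as { status?: string };
    if (res.status !== 200 || body.status !== 'healthy') {
      console.error(`Health check failed: HTTP ${res.status}, status=${body.status}`);
      process.exit(1);
    }
    console.log('Health check passed');
  } catch (err) {
    console.error('Health endpoint unresponsive or errored:', err);
    process.exit(1);
  } finally {
    clearTimeout(timer);
  }
}

check();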

Monitoring Integration

External Monitoring Setup

Configure an external uptime monitor with the following settings:

| Setting | Value |
| ------- | ----- |
| Monitor Type | HTTP(S) |
| URL | https://production-domain.com/api/health |
| Check Interval | 1 minute (or 5 minutes on free tier) |
| Alert Condition | Status ≠ 200 OR response time > 5s OR missing "status":"healthy" |
| Alert Channels | Email, Slack, PagerDuty |

Alert Thresholds

| Condition | Threshold | Severity | Notification | Response Time |
| --------- | --------- | -------- | ------------ | ------------- |
| Health = degraded | Immediate | Medium | Slack + Email | 15 minutes |
| Health = unhealthy | Immediate | High | Slack + SMS + PagerDuty | 5 minutes |
| Response time > 3s | 3 checks | Low | Email | 1 hour |
| Error rate > 5% | 1 minute | High | Slack + PagerDuty | 10 minutes |
| Endpoint unresponsive | 30 seconds | Critical | All channels | Immediate |

Operational Readiness Checklist

Pre-Deployment

  • Health endpoint deployed and accessible at /api/health
  • Health endpoint returns 200 in production
  • Structured logging active in logs
  • Environment variables correctly set
  • No secrets in logs or context
  • Response time < 500ms median

Monitoring Setup

  • External monitor configured
  • Alerts configured for 503/500 status
  • Alert channels verified
  • Baseline response time documented
  • Failure threshold tuned

Team Readiness

  • On-call rotation defined
  • Runbooks accessible
  • Team trained on failure modes
  • Escalation procedures documented
  • Postmortem template ready

Testing & Validation

  • Health check tested locally
  • Degraded state tested (503 response)
  • Error state tested (500 response)
  • Structured logs verified as JSON
  • Alert test executed

Future Enhancements

Metrics Export (Planned Enhancement)

Add quantitative metrics for performance tracking:

  • Build metrics: Build time, bundle size, routes generated
  • Runtime metrics: Request count, error rate, response time percentiles (p50/p95/p99)
  • Resource metrics: Memory usage, CPU usage

Implementation approach: Prometheus metrics at /api/metrics or StatsD export
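
As a sketch of the Prometheus option using the prom-client library (metric names and the /api/metrics route are illustrative, not decided):

import client from 'prom-client';

const register = new client.Registry();
client.collectDefaultMetrics({ register }); // process CPU/memory, event loop lag, etc.

export const requestCounter = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests served',
  labelNames: ['route', 'status'],
  registers: [register],
});

// app/api/metrics/route.ts (hypothetical route)
export async function GET() {
  return new Response(await register.metrics(), {
    headers: { 'Content-Type': register.contentType },
  });
}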

Distributed Tracing (Planned Enhancement)

Add request tracing for debugging cross-service calls:

  • Trace ID: Unique ID for each request
  • Spans: Measure time in each function/component
  • Context propagation: Link logs to specific traces

Tools: OpenTelemetry, Vercel APM, Datadog APM
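
A hypothetical sketch with the OpenTelemetry API, wrapping one operation in a span so its duration and errors are captured (loadProjects is again a stand-in for the application's registry loader):

import { trace, SpanStatusCode } from '@opentelemetry/api';
import { loadProjects } from '@/lib/registry'; // hypothetical registry loader

const tracer = trace.getTracer('portfolio-app'); // tracer name is illustrative

async function loadProjectsTraced() {
  return tracer.startActiveSpan('loadProjects', async (span) => {
    try {
      const projects = await loadProjects();
      span.setAttribute('project.count', projects.length);
      return projects;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end(); // closes the span so its duration is recorded
    }
  });
}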

Automated Incident Creation (Planned Enhancement)

Auto-create GitHub issues when incidents occur:

  • Trigger: Health check fails >3 times OR error rate >10%
  • Action: Create issue with severity label, link logs, assign to on-call
  • Integration: GitHub Actions workflow + webhook
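
The issue-creation step might look like this sketch using Octokit; the owner, repo, and label names are placeholders:

import { Octokit } from '@octokit/rest';

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

export async function createIncidentIssue(
  status: 'degraded' | 'unhealthy',
  logsUrl: string,
) {
  await octokit.rest.issues.create({
    owner: 'example-org', // placeholder
    repo: 'example-repo', // placeholder
    title: `[incident] Health check reports ${status}`,
    body: `Automated incident report.\n\nStatus: ${status}\nLogs: ${logsUrl}`,
    labels: ['incident', status === 'unhealthy' ? 'severity:high' : 'severity:medium'],
  });
}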
