incident-runbook-templates

Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use this skill when building a service outage runbook for a payment processing system; creating database incident procedures covering connection pool exhaustion, replication lag, and disk space alerts; onboarding new on-call engineers who need step-by-step recovery guides written for a 3 AM brain; or standardizing escalation matrices across multiple engineering teams.

The full skill

---
name: incident-runbook-templates
description: Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use this skill when building a service outage runbook for a payment processing system; creating database incident procedures covering connection pool exhaustion, replication lag, and disk space alerts; onboarding new on-call engineers who need step-by-step recovery guides written for a 3 AM brain; or standardizing escalation matrices across multiple engineering teams.
---

# Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

## When to Use This Skill

- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers

## Core Concepts

### 1. Incident Severity Levels

| Severity | Impact                     | Response Time     | Example                 |
| -------- | -------------------------- | ----------------- | ----------------------- |
| **SEV1** | Complete outage, data loss | 15 min            | Production down         |
| **SEV2** | Major degradation          | 30 min            | Critical feature broken |
| **SEV3** | Minor impact               | 2 hours           | Non-critical bug        |
| **SEV4** | Minimal impact             | Next business day | Cosmetic issue          |

### 2. Runbook Structure

```
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
```

## Runbook Templates

### Template 1: Service Outage Runbook

````markdown
# [Service Name] Outage Runbook

## Overview

**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall

## Impact Assessment

- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection

### Alerts

- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)

### Dashboards

- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)

## Initial Triage (First 5 Minutes)

### 1. Assess Scope

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates (-G with --data-urlencode avoids curl globbing on {} and [])
curl -s -G "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

### 2. Quick Health Checks

- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history

### 3. Initial Classification

| Symptom              | Likely Cause        | Go To Section |
| -------------------- | ------------------- | ------------- |
| All requests failing | Service down        | Section 4.1   |
| High latency         | Database/dependency | Section 4.2   |
| Partial failures     | Code bug            | Section 4.3   |
| Spike in errors      | Traffic surge       | Section 4.4   |

## Mitigation Procedures

### 4.1 Service Completely Down

```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```

### 4.2 High Latency

```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl -s localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
# The WHERE clause repeats the expression because SQL cannot filter on a SELECT alias.
psql -h $DB_HOST -U $DB_USER -c "
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;"

# Step 3: Kill a long-running query if needed (set PID to a pid from Step 2)
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend($PID);"

# Step 4: Check external dependency latency
# (requires a local curl-format.txt that defines the -w output format)
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```

### 4.3 Partial Failures (Specific Errors)

```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
  -H "Content-Type: application/json" \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
SELECT * FROM audit_log
WHERE table_name = 'payment_methods'
  AND created_at > now() - interval '1 hour';"
```

### 4.4 Traffic Surge

```bash
# Step 1: Check current load (kubectl top reports CPU/memory, a proxy for request rate)
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```

## Verification Steps

```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
curl -s -G "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))" \
  | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -s -G "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))" \
  | jq

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```

## Rollback Procedures

```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -H "Content-Type: application/json" \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```

## Escalation Matrix

| Condition                     | Escalate To         | Contact             |
| ----------------------------- | ------------------- | ------------------- |
| > 15 min unresolved SEV1      | Engineering Manager | @manager (Slack)    |
| Data breach suspected         | Security Team       | #security-incidents |
| Financial impact > $10k       | Finance + Legal     | @finance-oncall     |
| Customer communication needed | Support Lead        | @support-lead       |

## Communication Templates

### Initial Notification (Internal)

```
🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
```

### Status Update

```
📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes
```

### Resolution Notification

```
✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```
````

### Template 2: Database Incident Runbook

````markdown
# Database Incident Runbook

## Quick Reference

| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(<pid>);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion

```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate connections that have been idle for more than 10 minutes
-- (state_change records when the connection entered its current state)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '10 minutes';
```

## Replication Lag

```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```

## Disk Space Critical

```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM to reclaim space
# WARNING: VACUUM FULL takes an ACCESS EXCLUSIVE lock and writes a full copy of
# the table, so it needs free disk space and blocks reads and writes while it runs.
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```
````

## Best Practices

### Do's

- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress

### Don'ts

- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident

## Troubleshooting

### Runbook steps work in staging but fail during a real incident

Steps often assume preconditions that hold in a healthy environment but not during an outage. For each command in your runbook, add a prerequisite check and a "what to do if this command fails" note:

```bash
# Step: Check pod status
kubectl get pods -n payments
# Prerequisites: kubectl configured, kubeconfig points to correct cluster
# If this fails: run `aws eks update-kubeconfig --name prod-cluster --region us-east-1`
# Expected output: pods in Running state
```

### On-call engineer panics and runs steps out of order

Add a numbered checklist at the top of the runbook that mirrors the section numbers, so responders can track progress under stress without reading the full document:

```markdown
## Quick Checklist

- [ ] 1. Declare incident severity and open war room
- [ ] 2. Check service health (Section 4.1)
- [ ] 3. Check recent deployments (Section 4.1)
- [ ] 4. Roll back if deploy is suspect (Section 4.1)
- [ ] 5. Post initial notification to #payments-incidents
- [ ] 6. Escalate if > 15 min unresolved
```

### Runbook is outdated: commands reference old cluster names or endpoints

Runbooks rot because they are updated manually. Include a "Last Verified" date and owner at the top, and add a CI check that validates that all `curl` endpoints and `kubectl` context names are still valid:

```markdown
## Runbook Metadata

| Field | Value |
|---|---|
| Last verified | 2024-11-15 |
| Owner | @platform-team |
| Review cadence | After every SEV1/SEV2 |
```

### Stakeholder communication is delayed while engineers are heads-down

Assign a dedicated incident communicator role (separate from the incident commander) whose only job is to post status updates. Add a standing agenda to the communication template:

```
Update every 15 minutes (even if no new information):
- Current status (Investigating / Mitigating / Monitoring)
- Impact (what is broken, who is affected, % of traffic)
- What we are doing right now
- Next update in: 15 minutes
```

### Database runbook commands cause additional downtime when run incorrectly

Add explicit warnings before destructive SQL commands and require a dry-run output check before executing:

```sql
-- WARNING: This terminates idle connections. Verify count first.
-- DRY RUN (check count before terminating):
SELECT count(*)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '10 minutes';

-- EXECUTE only after verifying count is reasonable (< 50):
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '10 minutes';
```

## Related Skills

- `postmortem-writing` - After resolving an incident, use postmortem templates to capture root cause and preventive actions
- `on-call-handoff-patterns` - Structure shift handoffs so the incoming responder has full context on active incidents
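
The CI check suggested in the troubleshooting entry on outdated runbooks could start from something like the sketch below. It is an illustration under assumptions, not part of the skill itself: the 90-day threshold, the function names, and the choice of Python (used here to keep the date arithmetic portable) are all hypothetical. The script only parses the runbook text. It reads the "Last verified" row from the metadata table and lists the URLs that appear on `curl` lines, leaving the actual endpoint probing to whatever your CI environment can reach.

```python
import re
from datetime import date, timedelta

# Hypothetical threshold; the skill does not prescribe a maximum age.
MAX_AGE_DAYS = 90

# Matches the "| Last verified | YYYY-MM-DD |" row of the metadata table.
LAST_VERIFIED_RE = re.compile(
    r"\|\s*Last verified\s*\|\s*(\d{4}-\d{2}-\d{2})\s*\|", re.IGNORECASE
)
# Captures the first URL on any line that invokes curl.
CURL_URL_RE = re.compile(r"curl\b[^\n]*?(https?://[^\s\"'`)]+)")


def last_verified(runbook_text):
    """Return the 'Last verified' date from the metadata table, or None."""
    match = LAST_VERIFIED_RE.search(runbook_text)
    return date.fromisoformat(match.group(1)) if match else None


def is_stale(runbook_text, today, max_age_days=MAX_AGE_DAYS):
    """Treat a runbook as stale if the date is missing or too old."""
    verified = last_verified(runbook_text)
    return verified is None or (today - verified) > timedelta(days=max_age_days)


def curl_urls(runbook_text):
    """Collect the distinct URLs that appear on lines invoking curl."""
    return sorted(set(CURL_URL_RE.findall(runbook_text)))


if __name__ == "__main__":
    sample = """
| Field | Value |
|---|---|
| Last verified | 2024-11-15 |

- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
"""
    print("stale:", is_stale(sample, today=date.today()))
    print("endpoints to validate:", curl_urls(sample))
```

In a pipeline, a non-empty list from `curl_urls` would feed a reachability probe, and a `True` result from `is_stale` would fail the build or open a review ticket for the runbook owner.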