Skills › Content & Creative › Writing & copy
postmortem-writing
Write effective blameless postmortems with root cause analysis, timelines, and action items. Use when conducting incident reviews, writing postmortem documents, or improving incident response processes.
The full skill
—
name: postmortem-writing
description: Write effective blameless postmortems with root cause analysis, timelines, and action items. Use when conducting incident reviews, writing postmortem documents, or improving incident response processes.
—
# Postmortem Writing
Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.
## When to Use This Skill
– Conducting post-incident reviews
– Writing postmortem documents
– Facilitating blameless postmortem meetings
– Identifying root causes and contributing factors
– Creating actionable follow-up items
– Building organizational learning culture
## Core Concepts
### 1. Blameless Culture
| Blame-Focused | Blameless |
|—————|———–|
| "Who caused this?" | "What conditions allowed this?" |
| "Someone made a mistake" | "The system allowed this mistake" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
| Fear of speaking up | Psychological safety |
### 2. Postmortem Triggers
– SEV1 or SEV2 incidents
– Customer-facing outages > 15 minutes
– Data loss or security incidents
– Near-misses that could have been severe
– Novel failure modes
– Incidents requiring unusual intervention
## Quick Start
### Postmortem Timeline
“`
Day 0: Incident occurs
Day 1-2: Draft postmortem document
Day 3-5: Postmortem meeting
Day 5-7: Finalize document, create tickets
Week 2+: Action item completion
Quarterly: Review patterns across incidents
“`
## Templates
### Template 1: Standard Postmortem
“`markdown
# Postmortem: [Incident Title]
**Date**: 2024-01-15
**Authors**: @alice, @bob
**Status**: Draft | In Review | Final
**Incident Severity**: SEV2
**Incident Duration**: 47 minutes
## Executive Summary
On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was a database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.
**Impact**:
– 12,000 customers unable to complete purchases
– Estimated revenue loss: $45,000
– 847 support tickets created
– No data loss or security implications
## Timeline (All times UTC)
| Time | Event |
|——|——-|
| 14:23 | Deployment v2.3.4 completed to production |
| 14:31 | First alert: `payment_error_rate > 5%` |
| 14:33 | On-call engineer @alice acknowledges alert |
| 14:35 | Initial investigation begins, error rate at 23% |
| 14:41 | Incident declared SEV2, @bob joins |
| 14:45 | Database connection exhaustion identified |
| 14:52 | Decision to rollback deployment |
| 14:58 | Rollback to v2.3.3 initiated |
| 15:10 | Rollback complete, error rate dropping |
| 15:18 | Service fully recovered, incident resolved |
## Root Cause Analysis
### What Happened
The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently-called endpoint. Each request opened a new database connection instead of reusing pooled connections.
### Why It Happened
1. **Proximate Cause**: Code change in `PaymentRepository.java` replaced pooled `DataSource` with direct `DriverManager.getConnection()` calls.
2. **Contributing Factors**:
– Code review did not catch the connection handling change
– No integration tests specifically for connection pool behavior
– Staging environment has lower traffic, masking the issue
– Database connection metrics alert threshold was too high (90%)
3. **5 Whys Analysis**:
– Why did the service fail? â Database connections exhausted
– Why were connections exhausted? â Each request opened new connection
– Why did each request open new connection? â Code bypassed connection pool
– Why did code bypass connection pool? â Developer unfamiliar with codebase patterns
– Why was developer unfamiliar? â No documentation on connection management patterns
### System Diagram
“`
[Client] â [Load Balancer] â [Payment Service] â [Database]
â
Connection Pool (broken)
â
Direct connections (cause)
“`
## Detection
### What Worked
– Error rate alert fired within 8 minutes of deployment
– Grafana dashboard clearly showed connection spike
– On-call response was swift (2 minute acknowledgment)
### What Didn't Work
– Database connection metric alert threshold too high
– No deployment-correlated alerting
– Canary deployment would have caught this earlier
### Detection Gap
The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster.
## Response
### What Worked
– On-call engineer quickly identified database as the issue
– Rollback decision was made decisively
– Clear communication in incident channel
### What Could Be Improved
– Took 10 minutes to correlate issue with recent deployment
– Had to manually check deployment history
– Rollback took 12 minutes (could be faster)
## Impact
### Customer Impact
– 12,000 unique customers affected
– Average impact duration: 35 minutes
– 847 support tickets (23% of affected users)
– Customer satisfaction score dropped 12 points
### Business Impact
– Estimated revenue loss: $45,000
– Support cost: ~$2,500 (agent time)
– Engineering time: ~8 person-hours
### Technical Impact
– Database primary experienced elevated load
– Some replica lag during incident
– No permanent damage to systems
## Lessons Learned
### What Went Well
1. Alerting detected the issue before customer reports
2. Team collaborated effectively under pressure
3. Rollback procedure worked smoothly
4. Communication was clear and timely
### What Went Wrong
1. Code review missed critical change
2. Test coverage gap for connection pooling
3. Staging environment doesn't reflect production traffic
4. Alert thresholds were not tuned properly
### Where We Got Lucky
1. Incident occurred during business hours with full team available
2. Database handled the load without failing completely
3. No other incidents occurred simultaneously
## Action Items
| Priority | Action | Owner | Due Date | Ticket |
|———-|——–|——-|———-|——–|
| P0 | Add integration test for connection pool behavior | @alice | 2024-01-22 | ENG-1234 |
| P0 | Lower database connection alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |
| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |
| P1 | Implement deployment-correlated alerting | @bob | 2024-02-05 | OPS-568 |
| P2 | Evaluate canary deployment strategy | @charlie | 2024-02-15 | ENG-1235 |
| P2 | Load test staging with production-like traffic | @dave | 2024-02-28 | QA-123 |
## Appendix
### Supporting Data
#### Error Rate Graph
[Link to Grafana dashboard snapshot]
#### Database Connection Graph
[Link to metrics]
### Related Incidents
– 2023-11-02: Similar connection issue in User Service (POSTMORTEM-42)
### References
– [Connection Pool Best Practices](internal-wiki/connection-pools)
– [Deployment Runbook](internal-wiki/deployment-runbook)
“`
### Template 2: 5 Whys Analysis
“`markdown
# 5 Whys Analysis: [Incident]
## Problem Statement
Payment service experienced 47-minute outage due to database connection exhaustion.
## Analysis
### Why #1: Why did the service fail?
**Answer**: Database connections were exhausted, causing all new requests to fail.
**Evidence**: Metrics showed connection count at 100/100 (max), with 500+ pending requests.
—
### Why #2: Why were database connections exhausted?
**Answer**: Each incoming request opened a new database connection instead of using the connection pool.
**Evidence**: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`.
—
### Why #3: Why did the code bypass the connection pool?
**Answer**: A developer refactored the repository class and inadvertently changed the connection acquisition method.
**Evidence**: PR #1234 shows the change, made while fixing a different bug.
—
### Why #4: Why wasn't this caught in code review?
**Answer**: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.
**Evidence**: Review comments only discuss business logic.
—
### Why #5: Why isn't there a safety net for this type of change?
**Answer**: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.
**Evidence**: Test suite has no tests for connection handling; wiki has no article on database connections.
## Root Causes Identified
1. **Primary**: Missing automated tests for infrastructure behavior
2. **Secondary**: Insufficient documentation of architectural patterns
3. **Tertiary**: Code review checklist doesn't include infrastructure considerations
## Systemic Improvements
| Root Cause | Improvement | Type |
|————|————-|——|
| Missing tests | Add infrastructure behavior tests | Prevention |
| Missing docs | Document connection patterns | Prevention |
| Review gaps | Update review checklist | Detection |
| No canary | Implement canary deployments | Mitigation |
“`
### Template 3: Quick Postmortem (Minor Incidents)
“`markdown
# Quick Postmortem: [Brief Title]
**Date**: 2024-01-15 | **Duration**: 12 min | **Severity**: SEV3
## What Happened
API latency spiked to 5s due to cache miss storm after cache flush.
## Timeline
– 10:00 – Cache flush initiated for config update
– 10:02 – Latency alerts fire
– 10:05 – Identified as cache miss storm
– 10:08 – Enabled cache warming
– 10:12 – Latency normalized
## Root Cause
Full cache flush for minor config update caused thundering herd.
## Fix
– Immediate: Enabled cache warming
– Long-term: Implement partial cache invalidation (ENG-999)
## Lessons
Don't full-flush cache in production; use targeted invalidation.
“`
## Facilitation Guide
### Running a Postmortem Meeting
“`markdown
## Meeting Structure (60 minutes)
### 1. Opening (5 min)
– Remind everyone of blameless culture
– "We're here to learn, not to blame"
– Review meeting norms
### 2. Timeline Review (15 min)
– Walk through events chronologically
– Ask clarifying questions
– Identify gaps in timeline
### 3. Analysis Discussion (20 min)
– What failed?
– Why did it fail?
– What conditions allowed this?
– What would have prevented it?
### 4. Action Items (15 min)
– Brainstorm improvements
– Prioritize by impact and effort
– Assign owners and due dates
### 5. Closing (5 min)
– Summarize key learnings
– Confirm action item owners
– Schedule follow-up if needed
## Facilitation Tips
– Keep discussion on track
– Redirect blame to systems
– Encourage quiet participants
– Document dissenting views
– Time-box tangents
“`
## Anti-Patterns to Avoid
| Anti-Pattern | Problem | Better Approach |
|————–|———|—————–|
| **Blame game** | Shuts down learning | Focus on systems |
| **Shallow analysis** | Doesn't prevent recurrence | Ask "why" 5 times |
| **No action items** | Waste of time | Always have concrete next steps |
| **Unrealistic actions** | Never completed | Scope to achievable tasks |
| **No follow-up** | Actions forgotten | Track in ticketing system |
## Best Practices
### Do's
– **Start immediately** – Memory fades fast
– **Be specific** – Exact times, exact errors
– **Include graphs** – Visual evidence
– **Assign owners** – No orphan action items
– **Share widely** – Organizational learning
### Don'ts
– **Don't name and shame** – Ever
– **Don't skip small incidents** – They reveal patterns
– **Don't make it a blame doc** – That kills learning
– **Don't create busywork** – Actions should be meaningful
– **Don't skip follow-up** – Verify actions completed
## Resources
– [Google SRE – Postmortem Culture](https://sre.google/sre-book/postmortem-culture/)
– [Etsy's Blameless Postmortems](https://codeascraft.com/2012/05/22/blameless-postmortems/)
– [PagerDuty Postmortem Guide](https://postmortems.pagerduty.com/)