OTP Outage Postmortem Template (2026)
A ready-to-use postmortem template for OTP outages: timeline, root cause categories, customer impact metrics, action items, and a worked example.
StartMessaging Team
Engineering
After every meaningful OTP outage, a written postmortem is the cheapest way to harden the system. This template captures the common shape.
The Template
- Title — date and severity.
- TL;DR — 2 sentences.
- Timeline — every event with timestamp.
- Root cause — narrative with technical detail.
- Customer impact — numbers, not adjectives.
- What went well, what didn’t, where we got lucky.
- Action items with owner + due date.
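To make the checklist above enforceable, the sections can be held in a small record and checked before review. The following is a minimal sketch, not a prescribed format: the `Postmortem` and `ActionItem` classes, their field names, and the `problems` helper are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None   # person accountable for the item
    due: Optional[date] = None    # committed completion date


@dataclass
class Postmortem:
    # Field names are illustrative; they mirror the template sections.
    title: str
    tldr: str
    timeline: list = field(default_factory=list)   # (timestamp, event) pairs
    root_cause: str = ""
    customer_impact: str = ""
    retrospective: str = ""   # went well / didn't / got lucky
    action_items: list = field(default_factory=list)


def problems(pm: Postmortem) -> list:
    """Return reasons the draft is not yet ready for review."""
    issues = []
    if not pm.timeline:
        issues.append("timeline is empty")
    if not pm.root_cause.strip():
        issues.append("root cause is missing")
    # "Numbers, not adjectives": require at least one digit in the impact text.
    if not any(ch.isdigit() for ch in pm.customer_impact):
        issues.append("customer impact has no numbers")
    for item in pm.action_items:
        if not item.owner or item.due is None:
            issues.append(
                f"action item lacks owner or due date: {item.description}"
            )
    return issues
```

A draft whose `problems` list is empty still needs human review; the check only catches missing sections, not weak ones.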
Timeline
13:42 - First failed OTP send observed.
13:44 - Pager fires.
13:45 - On-call ack.
13:48 - Identified: provider primary route degraded.
13:50 - Manual failover to secondary.
13:52 - Recovery confirmed.
Root-Cause Categories
- Provider outage.
- DLT scrubbing change.
- Sender ID expiry.
- Internal code bug.
- Config rollout error.
- Capacity / cost cap hit.
Customer Impact Metrics
- OTP success rate before / during / after.
- Affected user count.
- Lost sign-up funnel events.
- Support tickets opened.
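The first metric, success rate before, during, and after the incident, can be computed directly from send logs once the outage window is fixed. The sketch below assumes each log record reduces to a `(timestamp, delivered)` pair; the function name, the sample records, and the calendar date are made up for illustration, and the window bounds reuse the 13:42 and 13:52 timestamps from the example timeline.

```python
from datetime import datetime


def success_rate_by_window(attempts, start, end):
    """Split attempts into before/during/after [start, end) and return
    the delivered fraction for each window (None if the window is empty)."""
    counts = {"before": [0, 0], "during": [0, 0], "after": [0, 0]}
    for ts, delivered in attempts:
        if ts < start:
            window = "before"
        elif ts < end:
            window = "during"
        else:
            window = "after"
        counts[window][0] += int(delivered)  # successes
        counts[window][1] += 1               # total attempts
    return {
        w: (ok / total if total else None)
        for w, (ok, total) in counts.items()
    }


# Hypothetical date; only the times of day come from the example timeline.
day = datetime(2026, 1, 1)


def at(hour, minute):
    return day.replace(hour=hour, minute=minute)


# Illustrative records, not real data.
sample = [
    (at(13, 30), True), (at(13, 35), True), (at(13, 40), True), (at(13, 41), False),
    (at(13, 43), False), (at(13, 46), False), (at(13, 49), False), (at(13, 51), True),
    (at(13, 53), True), (at(13, 58), True),
]

rates = success_rate_by_window(sample, start=at(13, 42), end=at(13, 52))
print(rates)
```

In a real postmortem the "before" window should be long enough to serve as a baseline, for example the same hour on the previous day, so that normal variation is not mistaken for impact.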
Action Items
- Add automated failover.
- Add alert on per-carrier DLR drop.
- Set up a renewal calendar for sender IDs.
- Run a game day next quarter.
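As one way to approach the per-carrier DLR alert, the check below compares each carrier's current delivery rate against a stored baseline. It is a sketch under stated assumptions: the function name, the `max_drop` and `min_sent` thresholds, and the shape of the inputs are hypothetical and would need tuning against real traffic.

```python
def carriers_with_dlr_drop(baseline, current, max_drop=0.10, min_sent=50):
    """Return carriers whose delivery rate fell more than max_drop
    below baseline.

    baseline: carrier name -> expected delivery rate (0.0 to 1.0)
    current:  carrier name -> (delivered, sent) counts for the window
    """
    alerts = []
    for carrier, (delivered, sent) in current.items():
        # Skip low-volume windows and unknown carriers to avoid noisy alerts.
        if sent < min_sent or carrier not in baseline:
            continue
        rate = delivered / sent
        if baseline[carrier] - rate > max_drop:
            alerts.append(carrier)
    return sorted(alerts)
```

Comparing against a per-carrier baseline rather than one global threshold matters because an outage confined to a single carrier can be hidden by healthy traffic on the others.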
FAQ
Combine with the SLO framework in our SLO guide for an error-budget-aware retro.