A3 (Intermediate)
Monday Morning Quarterback (Postmortem Reading Group)
60 min, one per week
Format: Read real system failure reports, analyze root causes, and think about prevention.
Required Postmortem Reading List:
| # | Incident | Core Lesson |
|---|---|---|
| 1 | Amazon S3 Outage (2017) | A single mistyped command took far more servers offline than intended, and the outage cascaded to many dependent sites and services |
| 2 | Replit AI Database Deletion (2025) | AI agent deleted production database and tried to cover it up |
| 3 | Tea App Data Breach (2025) | AI-generated code often ships without authentication unless you ask for it; ~72,000 images (including ~13,000 ID documents) leaked |
| 4 | Knight Capital Trading Disaster (2012) | Code deployment error caused $440 million loss in 45 minutes |
| 5 | Cloudflare Outage (2019) | A single regular expression in a firewall rule exhausted CPU across the global network |
| 6 | GitHub Database Failure (2018) | A brief network partition triggered a database primary-replica failover that caused about 24 hours of degraded service |
| 7 | GitLab Database Deletion (2017) | Engineer accidentally deleted production database while fatigued |
Analysis Template for Each Incident:
- What happened? (Describe the facts in 3 sentences)
- What was the root cause? (Not the surface cause — ask "why" 5 times)
- Why wasn't it prevented? (Where was the systemic flaw?)
- If you were the architect, how would you design the system to prevent this?
- What does this mean for your own projects?
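If you want to keep your weekly analyses in a consistent shape, the template above can be sketched as a small record type. This is only one possible layout, not part of the reading group's required format: the class name `PostmortemAnalysis`, its field names, and the `problems()` helper are all hypothetical, and the sample answers are placeholders rather than findings from the actual GitLab report.

```python
from dataclasses import dataclass, field


@dataclass
class PostmortemAnalysis:
    """One filled-in analysis template for a single incident (hypothetical layout)."""

    incident: str
    what_happened: list[str]                           # the facts, in at most 3 sentences
    whys: list[str] = field(default_factory=list)      # answers to successive "why?" questions
    systemic_flaw: str = ""                            # why it wasn't prevented
    prevention_design: str = ""                        # what you would build as the architect
    takeaways: list[str] = field(default_factory=list) # what it means for your own projects

    def problems(self) -> list[str]:
        """Return the gaps left in this analysis; an empty list means it is complete."""
        issues = []
        if not 1 <= len(self.what_happened) <= 3:
            issues.append("describe the facts in 1 to 3 sentences")
        if len(self.whys) < 5:
            issues.append(f"only {len(self.whys)} of 5 whys answered")
        if not self.systemic_flaw:
            issues.append("systemic flaw not identified")
        if not self.prevention_design:
            issues.append("no prevention design proposed")
        if not self.takeaways:
            issues.append("no takeaways for your own projects")
        return issues


# Placeholder draft: the answers here are illustrative, not taken from the real report.
draft = PostmortemAnalysis(
    incident="GitLab Database Deletion (2017)",
    what_happened=["An engineer accidentally deleted the production database."],
    whys=["The engineer was fatigued."],
)

for issue in draft.problems():
    print("TODO:", issue)
```

Running `problems()` before each session is a quick way to notice that an analysis stopped at the surface cause, since a draft with fewer than five "why" answers is flagged as incomplete.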
What You Will Learn
Learning from others' mistakes is much cheaper than making your own.