A3 (Intermediate)

Monday Morning Quarterback (Postmortem Reading Group)

60 min, one per week

Format: Read real system failure reports, analyze root causes, and think about prevention.

Required Postmortem Reading List:

| # | Incident | Core Lesson |
|---|----------|-------------|
| 1 | Amazon S3 Outage (2017) | A single mistyped command input removed far more server capacity than intended, taking down S3 in us-east-1 and the many sites that depend on it |
| 2 | Replit AI Database Deletion (2025) | An AI agent deleted a production database and then tried to cover it up |
| 3 | Tea App Data Breach (2025) | AI generates unauthenticated code by default; about 72,000 images (including about 13,000 ID documents) leaked |
| 4 | Knight Capital Trading Disaster (2012) | A faulty code deployment caused a $440 million loss in 45 minutes |
| 5 | Cloudflare Outage (2019) | A single regex with excessive backtracking exhausted CPU across the global CDN |
| 6 | GitHub Database Failure (2018) | A database primary-replica failover caused about 24 hours of degraded service |
| 7 | GitLab Database Deletion (2017) | A fatigued engineer accidentally deleted the production database |

Analysis Template for Each Incident:

  1. What happened? (Describe the facts in 3 sentences)
  2. What was the root cause? (Not the surface cause — ask "why" 5 times)
  3. Why wasn't it prevented? (Where was the systemic flaw?)
  4. If you were the architect, how would you design to prevent this?
  5. What does this mean for your own projects?
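If you keep your notes in a structured form, the template above can be checked mechanically. The following is only a minimal sketch: the class name `PostmortemNotes`, its field names, and the example strings are all hypothetical placeholders, not part of the reading group's materials. It treats the last answer in the "why" chain as the candidate root cause.

```python
from dataclasses import dataclass


@dataclass
class PostmortemNotes:
    """One reader's notes on one incident (hypothetical structure)."""
    incident: str
    what_happened: list[str]        # Q1: the facts, at most 3 sentences
    why_chain: list[str]            # Q2: successive answers to "why?"
    systemic_flaw: str              # Q3: why it was not prevented
    prevention_design: str          # Q4: how you would design against it
    takeaway: str                   # Q5: what it means for your own projects

    def problems(self) -> list[str]:
        """Return template violations; an empty list means the notes are complete."""
        issues = []
        if not 1 <= len(self.what_happened) <= 3:
            issues.append("Q1 should be 1 to 3 sentences")
        if len(self.why_chain) < 5:
            issues.append("Q2 needs at least 5 'why' answers")
        free_text = [
            ("Q3", self.systemic_flaw),
            ("Q4", self.prevention_design),
            ("Q5", self.takeaway),
        ]
        for label, text in free_text:
            if not text.strip():
                issues.append(f"{label} is empty")
        return issues

    def root_cause(self) -> str:
        """The deepest 'why' answer recorded so far, or '' if none."""
        return self.why_chain[-1] if self.why_chain else ""


# Placeholder content only; replace with your own reading of an incident.
notes = PostmortemNotes(
    incident="Example incident (placeholder)",
    what_happened=["Sentence one.", "Sentence two.", "Sentence three."],
    why_chain=[f"Answer to why #{i}" for i in range(1, 6)],
    systemic_flaw="Placeholder systemic flaw.",
    prevention_design="Placeholder design change.",
    takeaway="Placeholder takeaway.",
)
print(notes.problems())
print(notes.root_cause())
```

The check is deliberately shallow: it only confirms that each question was answered and that the "why" chain went five levels deep. It cannot judge whether the answers are any good, and that is what the group discussion is for.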

What You Will Learn

Learning from others' mistakes is much cheaper than making your own.
