It’s 2 AM. Your phone is screaming. The Prometheus alerts we set up earlier show a 100% error rate on the checkout-service. Here is the architect's framework for incident response.
Step 1: Triage and Communication
Before you touch a keyboard, declare an incident.
- The "Commander": One person handles communication (Slack/Statuspage).
- The "Investigator": One person (usually you) dives into the logs.
- Avoid "Bystander Effect": If five people join the call, assign specific roles immediately.
Step 2: The Search (Using grep and tail)
Go straight to the source. Use the patterns we discussed in the "Mastering Grep" post to find the specific trace IDs.
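Once you have the error lines, pull the trace IDs out so you can follow a single failing request across services. A minimal sketch, assuming your logs tag requests with a `trace_id=` key (adjust the pattern to your actual log format):

```shell
# Extract the unique trace IDs from recent 500s.
# NOTE: "trace_id=" is an assumed log format -- change it to match yours.
tail -n 1000 app.log \
  | grep "HTTP 500" \
  | grep -oE 'trace_id=[A-Za-z0-9-]+' \
  | sort -u
```

`grep -o` prints only the matched fragment instead of the whole line, and `sort -u` de-duplicates, so you end up with a short list of requests worth chasing.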
# Show any 500 errors (with 5 lines of surrounding context) from the last 1000 lines
tail -n 1000 app.log | grep -C 5 "HTTP 500"

Step 3: Mitigation vs. Root Cause
Your goal at 2 AM is Mitigation. If a recent deploy caused the spike, roll back. Do not try to "roll forward" or fix the bug in production while the site is down.
- Mitigate: Restart the pod, scale the cluster, or revert the commit.
- Post-Mortem: Save the logs for a "Blame-Free Post-Mortem" the next morning to find the Root Cause.
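If you run on Kubernetes, the three mitigation options above map onto one-liners. A sketch only, assuming a Deployment named checkout-service (adapt the resource name and replica count to your cluster):

```shell
# Hypothetical mitigation commands, assuming a Kubernetes Deployment
# named "checkout-service" -- names and numbers here are placeholders.

# Revert the commit: roll the Deployment back to its previous revision
kubectl rollout undo deployment/checkout-service

# Restart the pod(s): recreate them in place with the current spec
kubectl rollout restart deployment/checkout-service

# Scale the cluster: add headroom while you investigate
kubectl scale deployment/checkout-service --replicas=6
```

Whichever one you run, say so in the incident channel before you run it, so the Commander can announce it and you have a timestamped record for the morning's post-mortem.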