It’s 2 AM. Your phone is screaming. The Prometheus alerts we set up earlier show a 100% error rate on the checkout-service. Here is the architect's framework for incident response.
Step 1: Triage and Communication
Before you touch a keyboard, declare an incident.
- The "Commander": One person handles communication (Slack/Statuspage).
- The "Investigator": One person (usually you) dives into the logs.
- Avoid "Bystander Effect": If five people join the call, assign specific roles immediately.
Step 2: The Search (Using grep and tail)
Go straight to the source. Use the patterns we discussed in the "Mastering Grep" post to find the specific trace IDs.
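Once you have the error lines, pull the trace IDs out so you can follow a single failing request across services. A minimal sketch, assuming your logs tag requests with a `trace_id=` key (adjust the pattern to your actual log format):

```shell
# Extract the unique trace IDs from recent 500s.
# NOTE: "trace_id=" is an assumed log format -- change it to match yours.
tail -n 1000 app.log \
  | grep "HTTP 500" \
  | grep -oE 'trace_id=[A-Za-z0-9-]+' \
  | sort -u
```

`grep -o` prints only the matched fragment instead of the whole line, and `sort -u` de-duplicates, so you end up with a short list of requests worth chasing.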
# Show any 500 errors (with 5 lines of surrounding context) from the last 1000 lines
tail -n 1000 app.log | grep -C 5 "HTTP 500"

Step 3: Mitigation vs. Root Cause
Your goal at 2 AM is Mitigation. If a recent deploy caused the spike, roll back. Do not try to "roll forward" or fix the bug in production while the site is down.
- Mitigate: Restart the pod, scale the cluster, or revert the commit.
- Post-Mortem: Save the logs for a "Blame-Free Post-Mortem" the next morning to find the Root Cause.
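If you run on Kubernetes, the three mitigation options above map onto one-liners. A sketch only, assuming a Deployment named checkout-service (adapt the resource name and replica count to your cluster):

```shell
# Hypothetical mitigation commands, assuming a Kubernetes Deployment
# named "checkout-service" -- names and numbers here are placeholders.

# Revert the commit: roll the Deployment back to its previous revision
kubectl rollout undo deployment/checkout-service

# Restart the pod(s): recreate them in place with the current spec
kubectl rollout restart deployment/checkout-service

# Scale the cluster: add headroom while you investigate
kubectl scale deployment/checkout-service --replicas=6
```

Whichever one you run, say so in the incident channel before you run it, so the Commander can announce it and you have a timestamped record for the morning's post-mortem.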