Ask HN: How did you learn to debug production incidents?

1 point | by binora 11 hours ago

3 comments

dhavalt 8 hours ago
While debugging a crash or incident, I used to get lost in a specific line of code and try to fix symptoms. TLA+ gave me a new perspective on debugging, and I started treating system crashes as state problems rather than just code errors. I stopped asking 'Why is this line failing?' and started asking 'How did the system get into this state?'
And whenever a fix or patch feels like duct tape to me, I start looking at the architecture and keep asking myself how it can be refactored to be more resilient.
I realized my brain naturally wants to 'fill in the gaps' with assumptions. Learning to suppress that urge and force myself to verify what is actually happening rather than what I think is happening has been the most important part of my growth.
It's a continuous process; you never stop learning.
jackfranklyn 10 hours ago
Honestly, by being thrown into fires and surviving them. A few things that accelerated it:
1. Building mental models of the entire stack - when you understand how data flows from request to database and back, you can narrow down "where" faster than "what"
2. Getting comfortable reading logs before reaching for the debugger. Production debugging is 80% log archaeology
3. Keeping a personal incident journal. After each outage I'd write what I assumed, what was actually wrong, and what evidence I missed. You start recognizing your own blind spots
4. Pair debugging with more senior engineers when possible. Watching how someone else navigates the same chaos teaches patterns you'd never discover alone
The uncomfortable truth is there's no shortcut - you have to accumulate scar tissue from actual incidents. But deliberate reflection on each one compounds faster than just moving on.
marco_z 11 hours ago
Through tears and mud, how else?