A couple of years back, I spent two months reading through more than two thousand outage incidents from our incident database: what caused each one, how people identified the root cause, and what they did to prevent it from happening again. It was more interesting than reading Sherlock Holmes stories. These were real-life incidents, real lessons learned, and real actions taken.
From that study, I have compiled a checklist. I find it quite useful, as it:
- Provides a holistic view of a system's operations, so you can quickly build situational awareness.
- Shows a side-by-side comparison of the health of multiple systems.
- Identifies which systems need significant investment in operational improvement.
- Identifies which systems are well ahead of the rest and can share their lessons.
I hope this helps others. Feel free to comment and suggest more items.
Here’s the Google Sheet: http://bit.ly/opschklst