Signal-to-noise ratio is key to a productive on-call experience and, ultimately, to maintaining system reliability. Information is only useful if it can be understood and, more importantly, acted upon. Too often we drown in a sea of information that was originally intended to protect us by providing intimate visibility into our running systems; without any method to actually analyze, distill, and present actionable highlights, it just becomes noise. Noise is the enemy of attention.
Addressing this noise was a theme that came up several times throughout the conference, in a number of contexts. I'll try to cover some of them here as I understood them.
Dan Slimmon's (Exosite) session on Smoke Alarms, Car Alarms and Monitoring dealt with the observation that very few people pay any attention to car alarms today, but if the fire alarm in a building goes off, people are very likely to take notice and act on that alert by leaving the building. Car alarms have essentially lost their credibility because they are too prone to false positives. Fire alarms rarely go off unintentionally; they are reliable, and as such they get attention when they are heard. Systems and application monitoring must follow the same example. If every exception or out-of-bounds metric triggers an alert requesting action when there is no action to reasonably take, the resulting false positives lead to on-call fatigue and ultimately train the on-call engineer to start ignoring alerts. This leads to a state where genuinely actionable alerts are also ignored, allowing systemic issues to balloon into failures that affect customers and the business's reputation. It is critical that noise be eliminated so that truly actionable events remain valuable and reliable.
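The false-positive problem can be made concrete with a quick base-rate calculation (my own illustration, not a figure from Dan's talk): even a check that is rarely wrong produces mostly noise when real failures are rare.

```python
def alert_precision(base_rate, sensitivity, specificity):
    """Probability that a firing alert reflects a real problem (Bayes' rule).

    base_rate:   fraction of time the system is genuinely broken
    sensitivity: fraction of real failures the check catches
    specificity: fraction of healthy periods the check stays quiet for
    """
    true_pos = sensitivity * base_rate
    false_pos = (1 - specificity) * (1 - base_rate)
    return true_pos / (true_pos + false_pos)

# A check that catches 99% of real failures and falsely fires only 1% of
# the time, watching a system that is genuinely broken 0.1% of the time:
print(round(alert_precision(0.001, 0.99, 0.99), 2))  # 0.09
```

Only about 9% of those pages would be real: the car-alarm effect in numbers.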
Adam Jacobs also touched on these concepts in his previously referenced How to be Great at Operations session, where he described the importance of MTTD (Mean Time To Diagnose) and MTTR (Mean Time To Repair/Recover). With poor, noisy monitoring, both MTTD and MTTR degrade badly as more and more alerts are simply ignored as unimportant or un-actionable.
In much the same vein as Dan's talk, the session by Ryan Frantz and Laurie Denness from Etsy, Mean Time to Sleep: Quantifying the On-Call Experience, focused on reducing noise for the health and well-being of engineers. The emphasis was on how important good sleep habits are to productivity, and on achieving them through the analysis and reduction of paging noise. Etsy took it to the next level, using sleep-tracking equipment to correlate sleep patterns with recorded alert notifications and determine when engineers were awake at night because they had been alerted. This allowed Etsy to tune or discard a number of alerts that weren't critical or actionable enough to warrant disturbing the person. The result was happier, more engaged, and more productive engineers, which in turn resulted in a more reliable system overall. Culling disruptive, un-actionable alerts is key to improving systems.
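Etsy's correlation relied on real sleep-tracker data; a minimal sketch of the idea, with hypothetical page timestamps and a fixed overnight window standing in for the tracker:

```python
from datetime import datetime

# Hypothetical pages pulled from a notification log.
pages = [
    datetime(2014, 6, 10, 2, 13),   # 2:13 AM  -- sleep-disrupting
    datetime(2014, 6, 10, 14, 40),  # mid-afternoon
    datetime(2014, 6, 11, 3, 55),   # 3:55 AM  -- sleep-disrupting
]

def disrupts_sleep(ts, sleep_start=23, sleep_end=7):
    """True if a page lands inside the overnight sleep window."""
    return ts.hour >= sleep_start or ts.hour < sleep_end

night_pages = [p for p in pages if disrupts_sleep(p)]
print(len(night_pages))  # 2 of the 3 pages woke someone up
```

Any alert that repeatedly shows up in the overnight bucket is a candidate for tuning, deferral, or deletion.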
Finally, the concept of planned emergency drills and recovery procedures was covered in The Practical Gamemaster session presented by Adele Shakal of Metacloud, Inc. Practicing the handling of disasters and emergencies ensures the team can actually execute the processes put in place when a real emergency occurs. Just as a backup is worthless unless it is tested to ensure it can actually be restored, an emergency plan is worthless if it is not practiced to ensure it can be followed. Plans should be in place, accessible (even offline), and practiced so that everyone on the team can confidently perform when a real emergency strikes (as it inevitably will). She covered the importance of having trained people in charge of coordinating emergency response, with their contact information readily available when needed. She also discussed keeping an emergency binder on-site, regularly updated with relevant information about the architecture, contacts (both internal and vendor), contract information, etc., in an easily accessible physical medium that can be read offline. The word 'offline' is extremely important, as the emergency may be affecting access to the online information that is so often taken for granted.
A well-planned, well-practiced emergency plan takes the panic out of emergency response, allowing everyone involved to focus on resolution rather than trying to figure out policies.
Additionally, it is critical to properly document the systems in place and provide clear guidance for dealing with alerts, so there isn't any guesswork required of the on-call engineer when resolving issues. Linking to monitoring documentation in each alert message is a fantastic way for an engineer to quickly understand the system they are being alerted about and how to go about troubleshooting it.
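In Nagios, for instance, one way to do this is the `notes_url` directive on a service definition, which attaches a documentation link to the alert in the web UI and exposes it to notification commands via the `$SERVICENOTESURL$` macro (the host name, check command, and wiki URL below are hypothetical):

```cfg
define service {
    host_name              web01                       ; hypothetical host
    service_description    HTTP response time
    check_command          check_http_response_time    ; hypothetical check
    ; Runbook link shown with the alert, and available to notification
    ; commands through the $SERVICENOTESURL$ macro.
    notes_url              https://wiki.example.com/runbooks/http-response-time
    use                    generic-service
}
```

Whatever the monitoring system, the principle is the same: the page itself should carry the path to its own runbook.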
One interesting concept that was brought up (I can't remember the source) is to alert on what matters to customers. Instead of alerting on every detailed failure in the system, alert when the system isn't performing its function properly. For instance, when serving a web application via a cluster of Apache nodes, it is less important to alert on a single node or Apache process being down than to alert when the application as a whole isn't performing as expected (perhaps response time slips above a threshold, or the number of requests handled falls too low). The idea was extended to include the use of automation to self-heal systems when possible, perhaps alerting only when the healing process fails. Tools like Nagios event handlers can be used for these kinds of tasks. The danger in this kind of automation is, of course, masking underlying issues, so care should be taken when implementing self-healing tasks, and appropriate logging/record keeping should be in place for later forensics. The point being: don't wake someone up just to restart a service that a computer could easily restart itself, allowing the person to diagnose the issue during waking hours.
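A minimal sketch of the decision logic such an event handler might use (the state names mirror Nagios's `$SERVICESTATE$` and `$SERVICESTATETYPE$` macros; the restart-and-log behavior is my own illustration, not a prescribed pattern):

```python
import logging
import subprocess

log = logging.getLogger("self_heal")

def should_attempt_restart(state, state_type):
    """Restart only on a confirmed failure: Nagios retries a failing
    check several times (SOFT state) before declaring it HARD."""
    return state == "CRITICAL" and state_type == "HARD"

def handle_event(service, state, state_type):
    """Self-heal where safe, but keep a forensic record either way so
    the automation never silently masks an underlying issue."""
    if not should_attempt_restart(state, state_type):
        log.info("no action for %s (%s/%s)", service, state, state_type)
        return False
    log.warning("auto-restarting %s after HARD CRITICAL", service)
    subprocess.run(["systemctl", "restart", service], check=True)
    return True
```

If the restart doesn't resolve the check, the normal notification path still fires, and the log shows exactly what the automation tried overnight.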
Reviewing on-call rotations is another way to understand which problem areas need to be addressed, either because they are unduly noisy or because they are a legitimate pain point that should be prioritized for resolution. Etsy offers an interesting tool called OpsWeekly to help generate weekly on-call reports tracking the frequency and timing of notifications. These automatically generated reports can be the centerpiece of a weekly operations review meeting in which the team agrees on actions to reduce noise going forward. Metrics help quantify the results of these efforts.
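A toy sketch of the heart of such a report (the check names and timestamps are hypothetical, and this is far simpler than what OpsWeekly itself does): count pages per check for the week and surface the worst offenders.

```python
from collections import Counter
from datetime import datetime

# Hypothetical week of pages: (check name, timestamp).
pages = [
    ("disk_free on web01", datetime(2014, 6, 9, 3, 12)),
    ("disk_free on web01", datetime(2014, 6, 10, 2, 47)),
    ("disk_free on web01", datetime(2014, 6, 12, 4, 5)),
    ("http_response_time", datetime(2014, 6, 11, 15, 30)),
]

report = Counter(name for name, _ in pages)
for name, count in report.most_common():
    print(f"{name}: {count} page(s)")
# disk_free paged three times overnight -- a prime agenda item for the
# weekly review: tune the threshold, automate cleanup, or fix the leak.
```

Even a crude tally like this turns "on-call felt rough" into a concrete, prioritized list.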
Ultimately, our monitoring systems should be comprehensive but not prone to false positives or noise. We should have good insight into the health of our systems and applications, but that insight should include proper analysis to distill only the things that are actionable to alert on. Alerts found to be un-actionable should be audited, and perhaps removed, on a regular basis. New alerts should be vetted and documented. The priority should be maintaining system reliability through both proper alerting and the health of those who maintain it.