Articles on ITIL
There’s nothing like grabbing a cup of hot cocoa on a cold blustery day during the festive month of December and curling up with some reading on ITIL theory courtesy of the IT Infrastructure Library. Bah, humbug? It would be more fun to string up a hundred strands of half burned-out Christmas lights during an ice storm on a three story house with a 13/12 pitch roof, you say? Well…take a step back for a second. Think about the investment your hospital has made in the latest, greatest, super-high availability, bulletproof, auto-redundant, nimbus cloud-ious infrastructure to keep your MEDITECH and enterprise systems up and your customers happy. Then why do we find ourselves still managing downtime with its corresponding incidents and problems?
Enter “Availability” and “Problem Management” under the IT Infrastructure Library. While we could spend your entire Christmas break discussing these two key ITIL processes in depth, let’s just spend a few minutes on a sub-activity that is central to both Availability Management (AVM) and Problem Management (PM): the Major Problem Review under PM, or the Service Outage Analysis (SOA) under AVM. If you’ve lived the military life, a similar process is the AAR, or After Action Review. Far too often we implement the coolest High Availability (HA) technology and don’t spend enough time testing and understanding how it works (another blog entry!), only to find ourselves managing incidents or problems we didn’t think could ever occur.
When we get in that situation, we can use the Service Outage Analysis and corresponding After Action Review to examine several general things: a) What happened? b) How effectively did we manage the event through to a workaround or resolution? c) What is the root cause, and how do we prevent it from happening again? ITIL defines at least a half dozen tools, such as the Component Failure Impact Analysis, Technical Observation Post, Service Outage Analysis, and others, to deal with AVM and PM (and you should check them ALL out on New Year's Eve).
One of the more effective tools I have found for managing an SOA/AAR, and one that is commonly overlooked, is the Expanded Incident Lifecycle, which breaks down an entire incident response into phases. Though the diagram is to some degree self-explanatory, let me embellish a bit. The “Detection Time” is useful in determining if your notification processes for service disruptions work efficiently. In general, there are two ways notification happens: a customer contacts the service or help desk, or more ideally, your system monitoring tools detect the event automatically and notify your service desk. In the SOA/AAR, a critical question to ask is “How did we learn about this?” Did a customer report something we should have detected automatically? In either case, how long had the event actually been happening before it was detected? Approximately eighty percent (80%) of all incidents occur from a change that was made in the environment (configuration change, code upgrade at OS, firmware, application, etc.), and often you will learn that, regardless of how it was detected, the event began well before detection. Our job is to minimize that time by looking at both the service desk and monitoring process time.
Next comes the “Response Time” element. Response time is simply the latency from the time of detection to the time a responder, whether a person or an automated process, begins diagnosis and remediation activity. More commonly it’s the notification and response time of a person and process. This part of the review just checks the integrity of that handoff.
Finally, the “Time to Repair” is looked at. I divide this into two phases: the first is the time to diagnose the problem and the second is the time to repair it. Attention should be given to both steps. A technique I learned from a great US Military Veteran, Electrical Engineer, and Professor is the “half-split” method, in which diagnosis attempts to eliminate 50% of the possible causes at each step. You begin by attempting to cut the number of possible components or causes in half using techniques such as process of elimination, repeatability, etc. Next, divide the remaining 50% in half to leave 25%, and so on until the cause is isolated. While the concept is easy, employing it in today’s interwoven IT infrastructure can be tricky. But just having the mindset will alter the course of diagnostic thinking in some cases.
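For readers who like to see the half-split idea spelled out, here is a minimal sketch in Python. It rests on some loud assumptions that real outages rarely honor: the components form a single ordered chain, exactly one of them is at fault, and you have a reliable way to test whether things still work up to a given point. The function name `half_split`, the `probe` helper, and the component names are all made up for illustration.

```python
from typing import Callable, Sequence

def half_split(components: Sequence[str],
               works_up_to: Callable[[int], bool]) -> str:
    """Return the faulty component in an ordered chain with one fault.

    works_up_to(i) must return True when everything from the start of
    the chain through components[i] is healthy, and False otherwise.
    """
    lo, hi = 0, len(components) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if works_up_to(mid):
            lo = mid + 1   # fault lies downstream of the midpoint
        else:
            hi = mid       # fault is at the midpoint or upstream of it
    return components[lo]

# Hypothetical path from a client device to an application.
chain = [
    "workstation NIC", "wall jack", "patch panel", "access switch",
    "distribution switch", "core router", "firewall", "application server",
]

faulty = "distribution switch"   # the answer we pretend not to know
checks = []                      # records which midpoints were tested

def probe(i: int) -> bool:
    """Stand-in for a real test such as a ping or a link check."""
    checks.append(chain[i])
    return i < chain.index(faulty)

result = half_split(chain, probe)
print(result, "found after", len(checks), "checks")
```

With eight candidates, three checks are enough, where walking the chain end to end could take up to seven. In a real environment the "chain" is rarely this tidy, which is why the mindset usually matters more than the arithmetic.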
Moving on to the “Recovery” stage, this step evaluates the speed and effectiveness of applying the repair at the component level and returning the full “Service” to operation, including making sure the customer understands that the service is back and is able to consume it again. In some cases, a final action may be needed by the customer. A simple example would be if customers had to log out and log back in, restart their web browser, or even reboot their workstation, iPad, or iPhone.
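The phases walked through so far (detection, response, diagnosis, repair, and recovery) are easy to turn into numbers for an SOA/AAR if you capture a timestamp at each handoff. The sketch below is one way to do that in Python; the milestone names, the `phase_minutes` function, and the timestamps are all invented for illustration and are not taken from ITIL or from any particular service desk tool.

```python
from datetime import datetime

# Phase boundaries, in the order the lifecycle is described above.
MILESTONES = ["occurred", "detected", "responded",
              "diagnosed", "repaired", "recovered"]
PHASES = ["detection", "response", "diagnosis", "repair", "recovery"]

def phase_minutes(timestamps):
    """Return minutes spent in each phase, given one datetime per milestone."""
    minutes = {}
    for phase, start, end in zip(PHASES, MILESTONES, MILESTONES[1:]):
        delta = (timestamps[end] - timestamps[start]).total_seconds() / 60
        if delta < 0:
            raise ValueError(f"{end} is earlier than {start}")
        minutes[phase] = delta
    return minutes

# Made-up timestamps for a single outage.
outage = {
    "occurred":  datetime(2024, 12, 2, 2, 0),
    "detected":  datetime(2024, 12, 2, 2, 45),
    "responded": datetime(2024, 12, 2, 2, 55),
    "diagnosed": datetime(2024, 12, 2, 3, 35),
    "repaired":  datetime(2024, 12, 2, 3, 55),
    "recovered": datetime(2024, 12, 2, 4, 15),
}

durations = phase_minutes(outage)
total_downtime = sum(durations.values())
longest_phase = max(durations, key=durations.get)

for phase, mins in durations.items():
    print(f"{phase:<10} {mins:>5.0f} min")
print(f"total      {total_downtime:>5.0f} min, longest phase: {longest_phase}")
```

In this invented outage the longest phase is detection, which would steer the review toward monitoring and notification rather than toward the repair itself.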
Up until this time, we have been evaluating the “response” to the Incident. The real AVM and PM work begins in the actual evaluation of the most fundamental, or root, cause. Though it may seem obvious in some cases that the repair was targeted at the root cause, often the initial correction activities are aimed at the symptom and not the actual cause. Sometimes the evaluation is not critical enough and the real root cause slides right under our nose as easily as Santa comes down our chimney on Christmas Eve.
Let me give you an example. Your Service Desk is notified that a good portion of the west wing of St. Nicholas Hospital, obviously located at the North Pole, is “down” and no client workstations or wireless devices work on that wing. Upon quick investigation it is learned that a power supply in a network switch failed, resulting in the entire switch and wing being down. “Power supplies are high failure items and go out all the time,” you think to yourself as you sip on a cup of eggnog. “Just swap it out and get back to working on that new iTunes HA Cluster,” you shout to your engineer. Then you remember that all of your distribution switches were installed with redundant power supplies, and you learn that the first power supply actually went out two months ago, but no one was notified due to a misconfiguration in the SNMP alerts which feed your monitoring system. So now we have two new pieces of data: a definitive issue with the configuration and testing of your monitoring process, and new evidence that not one, but two power supplies in the same switch failed within about two months of each other. Since no one knew the first power supply had failed, due to the faulty reporting configuration, no one has been in the communications closet where the switch is located. As the engineer logs into the switch to fix the SNMP configuration, he learns that the switch has been burping up temperature warnings and alerts for the past 90 days. Upon a physical visit to the closet, it is discovered that a new rack of Fetal Monitoring equipment for the LDRP Unit was slammed in right beside the communication racks by the Biomedical Engineering team, but no one realized that the closet has no dedicated cooling and is way too warm. Aaaaaahhhhhh, is this another issue? Lack of change management? Multiple departments having uncontrolled access to critical equipment facilities? The root cause gets deeper and deeper, but I think you get the point.
Keep turning over the stones until there are none left and you have identified all issues and courses of action. Often there is more than one of each.
ITIL processes are best practice guidelines established from years of study of some of the best run IT operations across the world. While ITIL doesn’t provide a lot of “how to” detail, it does an excellent job of establishing what processes, activities, and methods need to be in place to run an A+ Class operation. I look forward to examining a few more of these with you in the future.
Mark Middleton is the Director of Cloud Services at Park Place International. In the past, Mark worked at Christus Health as the System Director for IT Architecture. Mark has been a finalist in the Data Center Executive Excellence Awards and holds degrees in Biomedical Technology and Business Administration, as well as the highest level ITIL Expert Certification.