IT DOES MATTER - The causes and costs of data center system downtime
System downtime may be unavoidable, but planning and training can help restore operations sooner and mitigate the impact on most organizations.
IT managers hate system downtime, but the harsh reality is that even the best plans and preparation cannot cover every circumstance, and even the simplest oversights can snowball into serious events that are difficult and costly to remediate. Let us evaluate the underlying causes of data center downtime, the effect on personnel stress and morale, the costs involved, and the steps IT staff can take to mitigate the effects.
Reputable studies have concluded that as much as 75 percent of downtime is the result of some sort of human error. But what is behind those human errors? It's easy to say "lack of training," but even the best-trained people still make mistakes when they are in a rush, are tired, weren't really thinking, or just thought they could get away with taking a shortcut. My answer leans toward "lack of planning." There is a saying: "Failing to plan is planning to fail." It has always been my contention that many things (data centers in particular) invite human mistakes simply because they are illogical in their layout, poorly labeled (if labeled at all), and generally doomed to trap some poor soul into an error that would not have been made if what was being worked on had made sense in the first place.
For example, almost everything today is "dual corded," connecting to two different power sources that are supposed to come from two different power centers. Left to their own devices, electricians may connect one source to, say, breaker 7 in panel A, and the other source to breaker 16 in panel B. Further, they may put circuit labels on the outlets inside a cabinet that are impossible to read, and use identifications on the panel schedules that don't correspond to the cabinet numbering. This makes it all too easy to turn off circuits in the wrong cabinet, or to fail to power down the intended cabinet.
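The kind of mismatch described above can be caught before it causes an outage by auditing wiring records. Below is a minimal sketch, with entirely hypothetical data structures and cabinet names, of a check that each dual-corded cabinet draws its two feeds from two different panels and that every feed matches the panel schedule:

```python
# Hypothetical sanity check for dual-corded cabinet wiring records.
# Each cabinet should draw its two feeds from two different power
# centers (panels), and every breaker a cabinet claims should be
# recorded for that cabinet on the panel schedule.

def check_cabinet_feeds(cabinets, panel_schedules):
    """Return a list of human-readable wiring problems.

    cabinets: dict mapping cabinet id -> list of (panel, breaker) feeds
    panel_schedules: dict mapping panel id -> dict of breaker -> cabinet label
    """
    problems = []
    for cab, feeds in cabinets.items():
        panels = {panel for panel, _ in feeds}
        if len(feeds) < 2:
            problems.append(f"{cab}: only {len(feeds)} feed(s); expected dual-corded")
        elif len(panels) < 2:
            # Both cords land on the same panel: no real redundancy.
            problems.append(f"{cab}: both feeds come from panel {panels.pop()}")
        for panel, breaker in feeds:
            schedule = panel_schedules.get(panel, {})
            if breaker not in schedule:
                problems.append(f"{cab}: panel {panel} breaker {breaker} not on schedule")
            elif schedule[breaker] != cab:
                # Schedule and cabinet labeling disagree: the trap that
                # gets the wrong circuit switched off.
                problems.append(
                    f"{cab}: panel {panel} breaker {breaker} is labeled "
                    f"'{schedule[breaker]}' on the schedule"
                )
    return problems

# Illustrative records only: CAB-01 is wired correctly, CAB-02 is not.
cabinets = {
    "CAB-01": [("A", 7), ("B", 16)],   # two feeds, two panels
    "CAB-02": [("A", 3), ("A", 4)],    # two feeds, one panel
}
panel_schedules = {
    "A": {7: "CAB-01", 3: "CAB-02", 4: "CAB-02"},
    "B": {16: "CAB-01"},
}
for p in check_cabinet_feeds(cabinets, panel_schedules):
    print(p)   # flags CAB-02 for drawing both feeds from panel A
```

The point is not this particular script but the discipline it represents: if the as-built wiring is kept in machine-readable form, redundancy and labeling mistakes can be found at a desk instead of at a dead cabinet.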
Morale is seriously affected by system downtime, because IT lives in dread of failures. Small events are bad enough, but big ones suck the life out of staff. IT has become the new "utility." Systems are expected to simply be there, just as power, gas and water are not expected to fail, and are expected to be restored quickly if they do. IT staff know very well that a failure that truly affects the business, or that puts people's lives at risk, will be investigated and maybe even publicized, possibly resulting in job loss. There is daily pressure to avoid downtime, but there is astronomical stress during recovery. I have seen only one data center where the uptime was publicized regularly.
The most often overlooked cost of system downtime is damage to corporate image. It varies greatly by business, but for some companies the harm to their reputation could be beyond monetary valuation. Another is loss of customers. Suppose a manufacturer supplying the auto industry suddenly found that its shipping system, which depended on its central data center, was interrupted by a downtime event. A car company that relies on "just in time" parts delivery would switch to its second source as soon as the delay was recognized. That customer may never come back.
It's hard to mitigate downtime. IT is a pressure business. There's always another server to be installed or another application to roll out, and rarely enough time or resources to do it carefully or to fully document it. Sometimes it's necessary to stand up to management and say, "This timetable isn't realistic, and it's an invitation to a disaster down the road." There has to be discipline and an insistence on proper planning and procedure, which includes all the things noted above. Human beings are failure-prone. We can't push an IT staff into mistakes and then act surprised when downtime occurs.