Network issues are causing more data-center outages

As enterprise computing environments grow more complex, IT system failures and network errors are bringing down data centers in greater numbers, causing more unplanned downtime.

Power failures are a common cause of data-center outages, but they’re not the only culprit. As enterprise computing environments grow more complex, IT system and network failures are bringing down data centers in greater numbers.

The Uptime Institute has been studying publicly reported outages to track what’s causing unplanned downtime. Over the past three years, it has culled information from 162 outages reported in traditional media or on social media. During that time, the amount of data available has steadily climbed; researchers collected data from 27 outages in 2016, 57 outages in 2017, and 78 outages in 2018.

“Public outages make the news with ever increasing regularity,” said Andy Lawrence, executive director of research at Uptime Institute, which offers resiliency services, advice on building and running data centers, and certification services.

The industry is now recording “significant outages on a near daily basis somewhere around the world,” Lawrence said as the group unveiled its research findings. That doesn’t necessarily mean that the number of outages is spiking, but downtime is gaining more attention, and “it’s clear to us that the impact of outages is certainly increasing,” he said.

A key finding from Uptime Institute’s research: Power is less implicated in overall failures, while the network and IT systems are more implicated.

One reason for the shift is that power systems are performing more reliably than they have in the past, which is reducing the number of on-premises data-center power failures.

Over the past two decades, the tech industry has focused on how to design power systems in a way that allows IT assets to continue to operate even if there’s a fault or failure somewhere in the power system, said Chris Brown, CTO of Uptime Institute. “The advent of 2N electrical distribution systems feeding dual-corded IT equipment allows systems in IT to continue to operate through a number of single incidents and events,” Brown said.

Meanwhile, the increasing complexity of IT environments is leading to greater numbers of IT- and network-related problems. “Data now is spread across multiple places with some critical dependencies upon the network, the way that applications [are architected], and the way that databases replicate. It’s a very complex system, and it takes less today to perturb that system than perhaps in years past,” said Todd Traver, Uptime Institute’s vice president of IT optimization and strategy.

Rating the severity of data-center outages

To distinguish between an outage that threatens to bring down the business and one that is merely an inconvenience, Uptime Institute has come up with a scale. The rating system allows researchers to see how patterns change over time, Lawrence said. Uptime Institute’s scale has five tiers:

  • Level 1 is a negligible outage. The outage is recordable, but there’s little or no obvious impact on services and no service disruptions.
  • Level 2 is characterized as a minimal service outage. Services are disrupted, but there’s minimal effect on users, customers or reputation.
  • Level 3 is a business-significant service outage. It involves customer or user service interruptions, mostly of limited scope, duration or effect. There’s minimal to no financial impact. Some reputational or compliance impact is incurred.
  • Level 4 is a serious business or service outage. Disruption of service and/or operations is involved. Ramifications include some financial losses, compliance breaches, reputation damage and possibly safety concerns. Customer losses are possible.
  • Level 5 is a business- or mission-critical outage involving major and damaging disruption of services and/or operations. There are possible large financial losses, safety issues, compliance breaches, customer losses and reputational damage.
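As a rough illustration of how a five-level rating lets patterns be compared over time, the scale could be modeled as an enumeration and tallied per year. This is only a sketch: the member names are shorthand for the descriptions above, `count_by_year` is a hypothetical helper, and the sample records are invented for the example rather than taken from Uptime Institute's data.

```python
from collections import Counter
from enum import IntEnum


class OutageSeverity(IntEnum):
    """Five-level scale from the article; member names are shorthand."""
    NEGLIGIBLE = 1   # recordable, little or no obvious impact
    MINIMAL = 2      # services disrupted, minimal effect on users
    SIGNIFICANT = 3  # business-significant, mostly limited scope
    SERIOUS = 4      # some financial loss, compliance or reputation damage
    CRITICAL = 5     # business- or mission-critical, major disruption


def count_by_year(records):
    """Return {year: Counter({severity: count})} for (year, level) records."""
    tally = {}
    for year, level in records:
        tally.setdefault(year, Counter())[OutageSeverity(level)] += 1
    return tally


# Hypothetical records, NOT Uptime Institute data.
sample = [(2017, 2), (2017, 4), (2018, 2), (2018, 2), (2018, 5)]
tally = count_by_year(sample)
```

Because the levels are ordered integers, a filter such as `level >= OutageSeverity.SERIOUS` would pick out only the most damaging incidents.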

When Uptime Institute examined all publicly reported data-center outages (Levels 1 to 5) over the three-year period, IT system and network problems outstripped power as the primary cause (see graphic).

[Graphic: data-center outages pie chart (Network World)]

The trend is particularly pronounced when year-over-year causes are compared. In 2017, power was the main culprit in 28% of outages. The following year, power was cited as the primary cause in just 11% of outages. IT system-related failures stayed fairly consistent; they were the primary cause in 32% of outages in 2017 and 35% of outages in 2018. The network as a primary cause of outages rose significantly: 19% of outages in 2017 were blamed on the network compared to 32% in 2018.
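The year-over-year shift can be checked with a few lines of arithmetic. The sketch below uses only the percentages quoted above; the dictionary keys and the `point_change` helper are illustrative names, not part of Uptime Institute's methodology.

```python
# Primary-cause shares (percent of outages) as quoted in the article.
shares = {
    2017: {"power": 28, "it_systems": 32, "network": 19},
    2018: {"power": 11, "it_systems": 35, "network": 32},
}


def point_change(before, after):
    """Percentage-point change per cause between two years."""
    return {cause: after[cause] - before[cause] for cause in before}


changes = point_change(shares[2017], shares[2018])
# Combined share of network- and IT-related outages in 2018.
network_and_it_2018 = shares[2018]["network"] + shares[2018]["it_systems"]
```

On these figures, power falls by 17 percentage points, the network rises by 13, and IT systems rise by 3, so network and IT together account for 67% of the 2018 outages.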

“It’s the interconnectedness of things. That’s why the big uptick in network outages causing disruption,” Traver said of the 2018 spike. “Things are connected across not one or two sites, but three or four sites or more. Network is playing a bigger and bigger role” in IT resilience.

In addition, handing more IT resources off to service providers, where they are no longer under the direct control of the organization using them, increases management and operational complexity.

“Two-thirds of the [2018] outages are network- and IT-related. That’s a big change from years past,” Traver said.

Digging into data-center downtime

Uptime Institute’s research digs into the specific causes of data-center outages. On the network front, common causes of outages include:

  • Fiber cuts outside the data center, with insufficient routing alternatives.
  • Intermittent failure of major switches, with secondary routers not deployed.
  • Major switch failure without backup.
  • Incorrect configuration of traffic during maintenance.
  • Incorrectly configured routers and software-defined networks.
  • Loss of power to non-backed-up single components, such as switches and routers.

Incorrectly configured routers and software-defined networks are “common network problems. They should have been detected with testing,” Traver said.

When it comes to fiber cuts, companies often weren’t aware that they had a single point of failure, Traver said. “They might have had two separate providers, but unknown to them, the fiber was running in the same trench. And they hadn’t done the proper due diligence to determine that.”
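The due diligence Traver describes amounts to comparing the physical routes of supposedly independent links. A minimal sketch, assuming each provider can supply its route as a list of segment identifiers; the segment names and the `shared_segments` helper are hypothetical and exist only for this example.

```python
def shared_segments(path_a, path_b):
    """Return the physical segments that two fiber paths have in common."""
    return set(path_a) & set(path_b)


# Hypothetical route data for two carriers serving the same site.
carrier_a = ["dc-handoff-1", "trench-7", "pop-east"]
carrier_b = ["dc-handoff-2", "trench-7", "pop-west"]

overlap = shared_segments(carrier_a, carrier_b)
if overlap:
    print("Possible single point of failure:", sorted(overlap))
```

Any non-empty overlap means a single cut could take down both links, even though they are billed by separate providers.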

When IT is the culprit, some of the causes cited include:

  • A poorly managed upgrade with insufficient testing at the software level.
  • The failure and subsequent data corruption of large disk drives or storage area networks. This is likely caused by hardware failure, exacerbated by configuration or programming errors.
  • Failure of synchronization or programming errors across load-balancing or traffic-management systems.
  • Incorrectly programmed failover/synchronization or disaster-recovery systems.
  • Loss of power to non-backed-up single components, such as servers or large disk drives.

Speaking to the load-balancing/traffic management issue, Lawrence said programming errors and synchronization problems can occur as companies try to distribute IT resources more broadly. “It’s often part of a wider strategy to reduce dependency on a single site; it’s as though you squeeze the balloon and the problem pops up elsewhere,” Lawrence said.

Problems occur when companies “haven’t really planned across all the platforms that their applications and data span or they haven’t tested them regularly,” Traver added.

When power is the culprit, some of the leading causes of outages include:

  • Lightning strikes, leading to surges and lost power. Back-up software/configuration failed.
  • Intermittent failures with transfer switches, leading to failure to start generators, or transfers to second data center.
  • UPS failures and failure to transfer to secondary system.
  • Operator errors, turning off or misconfiguring power.
  • Utility power loss and subsequent failure of generator or UPS.
  • Damage to IT equipment caused by power surges.
  • IT gear not equipped with dual power supplies to switch to secondary feed.

There’s nothing unfamiliar or surprising among the power-related culprits, Brown said. “These are the things data-center engineers have been struggling with – how to design around, how to mitigate with their designs – for decades,” he said.

In general, companies need to pay more attention to data center resiliency, Traver said. “Know how your system is designed. Understand it fully – all the interdependencies. And also know how it fails, and plan for failure. That’s the piece that I think is missing,” he said.

“Equipment is getting better, management is getting better, experience is getting better. It’s becoming a more mature industry,” Lawrence summed up. “But even so, outages continue to be a really major and expensive problem.”

This story, "Network issues are causing more data-center outages" was originally published by Network World.