Data centres are critical national infrastructure. As was recently demonstrated when London Heathrow Airport was forced to close through loss of power, when critical facilities stop working the impact is costly and long-lasting.
What words best describe data center design and operations challenges in 2025? Complexity? Scale? Density? Liquid?
That AI, power densification, and liquid cooling have created new pressures for data center design and operation engineering teams is no longer news to anyone with any kind of responsibility for ‘keeping the lights on’ in a data hall.
That changes are happening ‘in flight’ as supply chains become stretched and waiting times for new equipment lengthen provides an interesting context.
The bar on the risks associated with equipment failure has been raised.
Downtime is becoming more expensive than ever. The question is: Is the frequency of M+E equipment failures rising at a time when service criticality is growing?
These new layers of complexity have hit just when capacity scale out is the new normal and at a time when speed to market is a key customer requirement.
It is enough to cause sleepless nights.
Don’t Panic
Readers of a certain age will recognise the words “Don’t Panic” from the iconic Hitchhiker’s Guide to the Galaxy series of books. While we are not dealing with the end of the Universe, it would be foolish to pretend that there is nothing more to supply chain issues than simply sourcing equipment.
So, while we are not panicking, we know we are dealing with supply chain disruptions that can’t be ignored. According to Datamation, global supply chain issues have extended lead times for critical components. Electrical equipment such as switchgear and generators, which previously had lead times of up to six months, may now take 18 months to two years to arrive.
Construction materials are delayed too. A 2023 survey reported by Statista indicated that the majority of respondents experienced an increase in data center construction material lead times, with respondents reporting delays of more than eight weeks on average.
It is no surprise the demand for MEP equipment for critical services has never been greater.
Huge growth in the data center industry, alongside mass electrification across every sector from automotive to marine transport to manufacturing, means power chain equipment makers have full order books. This has come in the aftermath of the huge impact of the Covid pandemic on global supply chains.
Increased complexity and incident responses to multiple failures?
As digitalisation accelerates across every aspect of life, data centre operators have become the infrastructure foundation for the world’s critical services.
Colocation service providers such as Serverfarm want to invest in equipment of high build quality, that operates well in the field, needs minimal servicing and does not fail.
Of course, no one likes to talk about equipment failures. But failures are a reality. And given the pressures being felt by manufacturers, it is reasonable to ask whether we are going to see more equipment failures in the future.
At this time of surging demand the challenge for manufacturers is to ensure quality does not suffer. But investing in quality equipment is just part of the solution. Effective day-to-day operations means knowing what to do when serious incidents occur, even where a cascading series of events might see multiple pieces of equipment fail concurrently.
For example, everyone who uses a colocation provider has asked: “What happens if you lose the grid?” In normal circumstances losing the grid means transferring to a second feed.
But how many follow up with the question - what happens if there is an equipment failure and the second feed is offline? And what then happens if one of the gensets fails to start up? How much pressure can the UPS and battery equipment take? And what happens if a failure happens here too?
Is such a series of events unimaginable? Not if you are an engineer.
If such events were to occur in sequence, that is exactly when standards, consistency, processes and professional expertise can prove the difference between ‘incident’ and ‘catastrophe.’
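The chain of questions above can be sketched as a simple priority walk through the power sources. This is a minimal illustration under stated assumptions, not a model of any real facility: the `PowerSource` class, the `select_source` function and the source names are all hypothetical, and a real power chain has the UPS bridging the load in line while generators start, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PowerSource:
    """One link in a hypothetical power chain."""
    name: str
    available: bool


def select_source(chain: List[PowerSource]) -> Optional[PowerSource]:
    """Return the first available source in priority order, or None."""
    for source in chain:
        if source.available:
            return source
    return None


# Hypothetical cascade: both feeds are lost and one genset fails to start.
chain = [
    PowerSource("grid feed A", available=False),
    PowerSource("grid feed B", available=False),
    PowerSource("generator 1", available=False),
    PowerSource("generator 2", available=True),
    PowerSource("UPS battery", available=True),
]

active = select_source(chain)
print(active.name if active else "total loss of power")
```

Marking every source as unavailable makes `select_source` return `None`, which is the ‘catastrophe’ case that drilled incident response is meant to keep from happening.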
In a well-run data center, operational expertise includes the on-site teams having forensic insight into the design and construction of every aspect of the facility. The best way to achieve this is through those teams participating at the design and construction stage of the data center.
This level of collaboration means that ‘how this facility is actually going to run’ is understood by operations even before the first GPU server is wheeled into a hall and onto a rack.
This must be backed by demonstrating the highest operational standards. Having proven incident response processes (which are globally consistent, drilled and practiced) and having the right expertise on site could make the difference between maintaining uptime and total shutdown.
Furthermore, being open, transparent and engaging with customers provides reassurances and communicates: ‘This is how we run things.’
Future tense
The last 10 years have seen major changes in how data centers are operated. The next 10 years will see greater and fundamental changes at every layer of the stack.
In terms of uptime, emergency incident response and planned downtime these changes will come in many forms.
More resilience will be baked in at different software layers, from the application to system software.
At the platform layer, cloud-based failover and disaster recovery strategies will provide business continuity by leveraging multi-cloud and hybrid cloud environments, with automated disaster recovery (DR) systems activated instantly during outages. AI-powered failure analysis will prevent disruptions and enhance recovery strategies.
Redundant hardware combined with AI-driven infrastructure will allow seamless reconfiguration or replacement of components. Edge computing and distributed architectures will minimize the impact of failures by decentralizing processing, so that workloads can continue.
As these innovations continue to evolve, data centers will become more resilient, efficient, and capable of handling unexpected failures with minimal downtime.
But those running data centers for AI and cloud providers today must not become overly distracted by outside noise and shiny new things.
At an operations level we can embrace new methods and technologies without falling into the trap of becoming over-focused on future solutions and missing today’s SLAs.
Today’s reality
Supply chains are not simply about “How quickly can I get it?” If failure rates creep up then speed of supply is not the answer. A focus on build quality that provides improved hardware resilience contributes to our ability to provide service reliability.
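The link between failure rates and service reliability can be made concrete with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below is illustrative only: the function names are hypothetical and the MTBF (mean time between failures) and MTTR (mean time to repair) figures are made-up numbers, not Serverfarm or vendor data.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_hours_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected unavailable hours per year for a single, non-redundant unit."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


# Two hypothetical units with the same repair time but different failure rates.
robust = expected_downtime_hours_per_year(mtbf_hours=50_000, mttr_hours=8)
fragile = expected_downtime_hours_per_year(mtbf_hours=20_000, mttr_hours=8)

print(f"robust unit:  {robust:.2f} h/year expected downtime")
print(f"fragile unit: {fragile:.2f} h/year expected downtime")
```

In this simplified model, the unit that fails more often carries more expected downtime however quickly it was delivered, which is the point about speed of supply not being the answer.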
At Serverfarm we know that we live in a design and operations world of extended lead times for critical equipment, and we must plan carefully to meet project timelines. Proactive measures such as phased delivery strategies are essential to mitigate potential disruptions and ensure seamless operations.
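As a rough illustration of planning around extended lead times, the sketch below works backwards from the date equipment is needed on site to the latest date it can be ordered. Everything here is an assumption for illustration: the `latest_order_date` helper, the phase names, the dates and the four-week buffer are hypothetical, and the 78-week and 104-week lead times simply echo the 18 months to two years quoted earlier.

```python
from datetime import date, timedelta


def latest_order_date(needed_on_site: date, lead_time_weeks: int,
                      buffer_weeks: int = 0) -> date:
    """Latest order date that still meets the on-site date, with optional buffer."""
    return needed_on_site - timedelta(weeks=lead_time_weeks + buffer_weeks)


# Hypothetical phased delivery plan: (item, needed on site, lead time in weeks).
phases = [
    ("Phase 1 switchgear", date(2027, 1, 4), 78),
    ("Phase 2 generators", date(2027, 7, 5), 104),
]

for item, needed, lead_weeks in phases:
    order_by = latest_order_date(needed, lead_weeks, buffer_weeks=4)
    print(f"{item}: order by {order_by.isoformat()}")
```

Splitting delivery into phases gives each phase its own order-by date, so a slip on one item does not automatically hold up the others.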
We opened this article with a list of the challenges we face in design and operation of modern data centers. For us to serve our customers we want to invest in equipment of the highest build quality, with long MTTM and even longer MTBF numbers. When we invest in equipment we calculate TCO not simply through a capex and opex lens but also through a quality of service lens for our customers. This is our definition of not failing.