This is important for many businesses, but is especially critical for cloud or managed service providers who provide a platform for multiple organisations. In these cases, an outage or performance issue can have devastating results, not only for a provider’s reputation but also for its bottom line.
This reality has hit home for customers of Dimension Data in Australia (see http://forums.theregister.co.uk/forum/1/2014/07/04/dimension_data_in_cloud_outage/ ). It’s hard to know exactly what went on, but Dimension Data has been quite open in admitting it suffered an outage on its EMC storage implementation. The result was no service to customers for more than 24 hours – OUCH!
Sadly, this is an all-too-common occurrence for storage architectures implemented in “cloud” data centres, but it doesn’t have to be this way. The two most common causes of storage failures in enterprise data centres are:
1. Human error (e.g. knocked cable, wrong controller rebooted, wrong drive pulled);
2. Drive failure, where either a RAID rebuild or multiple drive failures cause an outage.
Both of these scenarios are entirely avoidable with true zero-touch storage. The storage industry has done a fantastic job of conditioning storage buyers and administrators into believing that hard disk failure and subsequent replacement are entirely acceptable and pose no risk. This couldn’t be further from the truth.
I myself have worked (many, many years ago) as a storage field engineer and I’ve seen, heard and (I have to confess) been involved in horror stories involving either human error or multiple drive failures resulting in outage and/or data loss. It doesn’t have to be this way.
A trend is emerging of all-flash array vendors arguing these issues can easily be solved by moving to an all-SSD/flash architecture, but they forget to mention that the same problem can easily occur again. Don’t believe the hype that says “there’s no moving parts so there’s nothing to fail”. The truth is that all drives can fail, spinning or non-spinning. The only way to avoid the consequences of drive failure is to be able to repair drives in situ with no impact on the workload.
The ideal storage for cloud providers is something that is:
1. Truly zero-touch. Many cloud and managed service providers have remote or even third-party data centres – wouldn’t it be nice if they never had to go near a storage array?
2. Consistent. Storage should give consistent performance and reliability regardless of its utilisation. It should give the same performance at 99% capacity utilisation as it does at 1%.
3. Scalable. You shouldn’t have to buy a 500-disk monster array upfront to get predictable performance, nor should you have to suffer when you add an extra shelf of disks.
4. Commercially viable. At the end of the day, it’s key that predictability and reliability don’t come at a cost that breaks the business model of the service provider.
The good news for cloud providers is that this kind of storage is already here. It’s true that not all storage is created equal, but that doesn’t mean the right storage for cloud providers doesn’t exist. All they have to do is take the time to look for it.