DR and Backup: like chalk and cheese
Disaster Recovery (DR), a term often used interchangeably with Business Continuity (BC), is the ability to quickly recover entire critical application sets. This is a vastly different scope from data backup, although the two are often placed in the same category. The confusion stems from the evolution of IT and the wider business world. In the 1970s and 80s, at the fastest growing point in the information technology adoption curve, the working day was still effectively 9 to 5 for most of the business world. Shops, banks and even governments would effectively shut down IT systems at the day’s end, at which point the overnight backup would take place. In a pre-internet era, where manual input was required for many processing tasks, the lengthy overnight backup was not considered a burden.
But as the world moved towards a more 24/7 operational approach and IT systems increasingly automated many more transactional and processing tasks, organisations needed alternatives to avoid the downtime caused by the overnight backup. In addition, the growing volume of data and systems meant that even the fastest backup systems were not completing within the rapidly diminishing overnight window. With the arrival of virtualisation and ubiquitous high-bandwidth IP connectivity, organisations started to reconsider approaches to disaster recovery, recognising that although backup is still valuable for certain types of historical data, it was starting to become a burden rather than a benefit.
In essence, disaster recovery is a different conceptual approach. Instead of taking snapshots of data for recovery at some time in the future, DR is the notion that both the data and the critical applications must always be available in a secondary location, often in a dormant state, with the ability to recover quickly in the event of a primary production system failure.
If you consider a financial services organisation processing thousands of pounds’ worth of transactions each minute, a failure that requires a data recovery from the previous day’s backup would equate to massive financial loss and huge reputational damage. In general, prior to virtualisation, creating a DR solution for these environments required site-to-site replication. This meant building a complete replica of the production environment off site and synchronising applications and data continually.
Every change in software and hardware at the production site would need to be replicated at the DR site, leading to extremely high CapEx and ongoing OpEx. The arrival of virtualisation, which effectively abstracted the application and dataset from the underlying hardware, storage and operating system (OS), allowed DR to become suitable for organisations without the resources to build these expensive redundant site solutions. With around 60% of production systems now virtualised, most modern DR solutions are designed to work within these environments. Even within these environments, however, DR solutions vary greatly.
Understanding DR Solutions
Modern DR falls into two broad categories. The first is array- and appliance-based solutions. These work on the principle of taking a copy of the virtualised application and data set directly from the storage. This can be done either by a hardware module within the storage array or by a standalone appliance. In many cases, these hardware solutions require that the storage comes from a certain vendor, and these array-based technologies will replicate the entire volume to a secondary storage system. Even if only one virtual machine in the volume needs to be replicated, a full copy is taken, which underutilises the storage and results in what is known as “storage sprawl.”
There are also software variants, known as guest or OS-based replication, which require software components to be installed on each individual physical and virtual server. Being software, this removes the need for the additional specific hardware found in the first array/appliance method but, conversely, each server within the infrastructure needs individual management. Although more flexible than hardware approaches, this still limits scalability for larger deployments and is also challenging in environments that are constantly changing.
A new type of DR for virtualised environments is called hypervisor-based replication. This is effectively a software element that replicates the individual virtual machines, with the option to plug directly into the virtual management console, such as VMware’s vCenter or Microsoft’s VMM, to simplify control of the underlying virtual infrastructure replication. Because hypervisor-based replication is integral to the control plane, anything that happens within the entire virtualised domain can be replicated in real time.
Hypervisor-based replication also uses a Virtual Replication Appliance (VRA) that is automatically deployed by the virtual management console to the ESXi or Hyper-V hosts. The VRA continuously replicates data from user-selected virtual machines, compressing and sending that data to a remote site or storage target over LAN/WAN links.
Because it is installed directly inside the virtual infrastructure, the VRA is able to tap into a virtual machine’s IO stream which means that each time the virtual machine writes to its virtual disks, the write command is captured, cloned, and sent to the recovery site.
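The capture, clone and send sequence described above can be sketched in a few lines of Python. This is an in-memory toy only, not how any particular product is built: the `SplitDisk` and `RecoverySite` classes are hypothetical names introduced for illustration, and a real VRA would intercept writes inside the hypervisor, compress them and ship them over the network.

```python
from dataclasses import dataclass, field


@dataclass
class RecoverySite:
    """Hypothetical stand-in for the remote target: an ordered journal of writes."""
    journal: list = field(default_factory=list)

    def receive(self, offset, data):
        # Keep writes in arrival order so they can be replayed later.
        self.journal.append((offset, data))

    def rebuild(self, size):
        """Replay the journal in order to reconstruct the disk image."""
        disk = bytearray(size)
        for offset, data in self.journal:
            disk[offset:offset + len(data)] = data
        return bytes(disk)


class SplitDisk:
    """A toy virtual disk whose every write is also cloned to the recovery site."""

    def __init__(self, size, site):
        self.data = bytearray(size)
        self.site = site

    def write(self, offset, data):
        self.data[offset:offset + len(data)] = data  # normal production write
        self.site.receive(offset, bytes(data))       # cloned copy for the recovery site


site = RecoverySite()
disk = SplitDisk(16, site)
disk.write(0, b"order-1")
disk.write(8, b"paid")
disk.write(0, b"order-2")  # overwrites the first write

replica = site.rebuild(16)
print(replica == bytes(disk.data))
```

Because every write is journalled in order, replaying the journal at the recovery site reproduces the production disk up to the last captured write, which is why the recovery point can be so recent.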
Hypervisor-based replication offers several advantages. It simplifies management when compared to OS-based replication and is more flexible than array-based methods because it removes the need to replicate a whole storage Logical Unit Number (LUN), which reduces storage volumes and bandwidth utilisation.
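To put rough numbers on the difference between copying a whole LUN and copying only the virtual machines that need protection, consider the small Python sketch below. The virtual machine names and sizes are invented for illustration, and the function is a simplification that ignores compression and incremental changes.

```python
def replicated_gb(vm_sizes_gb, protected, per_volume):
    """Return the GB copied to the secondary site.

    vm_sizes_gb: mapping of VM name -> size in GB, all on one volume (hypothetical figures).
    protected:   names of the VMs that actually need DR protection.
    per_volume:  True models array-based replication (the whole volume is copied),
                 False models per-VM replication.
    """
    if per_volume:
        return sum(vm_sizes_gb.values())
    return sum(vm_sizes_gb[name] for name in protected)


# Hypothetical volume holding one critical VM and three less critical ones.
volume = {"billing": 200, "web": 50, "test-01": 300, "test-02": 450}
needed = ["billing"]

array_copy = replicated_gb(volume, needed, per_volume=True)
vm_copy = replicated_gb(volume, needed, per_volume=False)
print(array_copy, vm_copy)
```

With these made-up figures, protecting a single 200 GB virtual machine means copying the full 1,000 GB volume under the array-based model, which is the storage sprawl effect in miniature.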
Recovery and failover differences
Irrespective of which method is in use, the differences between DR and backup are striking. Where backup and restore of an environment using legacy methods may take several hours, DR solutions that are constantly synchronising data may take as little as an hour to recover in the case of array-based solutions and as little as 10 minutes for hypervisor-based replication. This also leads to vastly different recovery points. An overnight backup could leave as much as a 23-hour differential between data sets, while hypervisor-based replication is continuous, meaning that following an outage, a recovery will roll the system back to the last write in the virtual machine’s IO stream that was captured, which may well be just a few seconds earlier.
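The gap between these recovery points is easier to appreciate with some simple arithmetic. The Python sketch below compares the value of transactions at risk for a 23-hour recovery point and a recovery point of a few seconds. The transaction rate of £5,000 per minute and the 10-second figure are assumptions chosen purely for illustration.

```python
def exposure_gbp(rpo_seconds, value_per_minute_gbp):
    """Value of transactions processed since the last recovery point."""
    return rpo_seconds * value_per_minute_gbp / 60


VALUE_PER_MINUTE = 5_000  # hypothetical transaction value in pounds per minute

# Overnight backup: up to 23 hours of transactions could be lost.
backup = exposure_gbp(23 * 3600, VALUE_PER_MINUTE)

# Continuous replication: assume the last captured write is 10 seconds old.
continuous = exposure_gbp(10, VALUE_PER_MINUTE)

print(f"Overnight backup:       £{backup:,.0f}")
print(f"Continuous replication: £{continuous:,.0f}")
```

Under these assumptions the overnight backup leaves £6.9 million of transactions exposed, against less than £1,000 for continuous replication.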
The last area of major difference between backup and DR is the ability to both conduct and test failover and failback. In the old days of tape backup, many organisations were simply unable to test the failback position and had to live with the assumption that all would work if the worst happened. With virtualisation, a modern DR solution can be tested by provisioning a replica virtualised infrastructure based on the replicated data set.
This type of testing does not even need to take place on site, as the technology is particularly well suited to cloud-based DR service providers. Yet for all the technology, organisations need to have a viable strategy for recovery. For any plan to be viable, it must be properly tested and updated based on current business conditions and threats. This is arguably one of the most advantageous features of cloud-based DR. The ability to spin up an entire replica production environment on virtualised servers in the cloud allows organisations to test the ability to recover critical systems across scenarios ranging from loss of local power and connectivity to massive disasters that may render a site unusable, such as fire, flood or evacuation.
These tests range from “passive” to “active”. A passive test validates that remotely backed-up data is accurate and that all software is patched and theoretically ready to run in the event of an outage. However, organisations should also conduct an active test that starts up a complete replica of a production environment in parallel with production systems, and then switches over to this backup infrastructure to conduct real-world transactions running on the DR cloud for a period of time. The active test would conclude with a full-scale failback to the primary IT environment and synchronisation of all data.
These types of DR tests were incredibly difficult and disruptive in the legacy world of backup tapes, and running them even once a year was a rarity for many organisations. It is worth confirming with a cloud DR provider that these full-scale tests are possible and how often they can be carried out. Testing is vital to ensure that a DR strategy is effective and should be a regular occurrence, as IT systems continually evolve and a future software patch or hardware upgrade may well impact the DR processes.
Final thoughts
Although DR is clearly starting to overtake legacy backup, the older sibling still has a role to play. For non-production environments and for long-term archival purposes, traditional backup, even on tape, provides some significant benefits. The relatively low cost of the media and the ability to work across hybrid physical and virtualised environments are both major advantages.
Yet, moving forward, as organisations increase deployment of virtualisation and adoption of cloud accelerates, the notion of backup and restore will start to fade. One word of warning: although backup technologies are starting to rebrand as disaster recovery, it is worth comparing the actual capabilities against the marketing claims. If a solution has multi-hour recovery times, forces the use of specific storage hardware and fails to continually protect production environments with rapid recovery, then it’s probably a backup solution irrespective of what the marketing messages say.
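The three warning signs above can be turned into a simple checklist for comparing vendor claims. The Python sketch below is one possible way to write it; the function name is invented for illustration and the two-hour threshold for “multi-hour” is an assumption, not an industry definition.

```python
def warning_signs(recovery_hours, needs_specific_storage, continuous_protection):
    """List the signs that a 'DR' product behaves more like backup."""
    signs = []
    if recovery_hours >= 2:  # assumption: "multi-hour" means two hours or more
        signs.append("multi-hour recovery time")
    if needs_specific_storage:
        signs.append("tied to specific storage hardware")
    if not continuous_protection:
        signs.append("no continuous protection")
    return signs


# Hypothetical product with a six-hour restore from a vendor-specific array.
rebranded = warning_signs(6, needs_specific_storage=True, continuous_protection=False)

# Hypothetical hypervisor-based product recovering in about 10 minutes.
hypervisor = warning_signs(10 / 60, needs_specific_storage=False, continuous_protection=True)

print(rebranded)
print(hypervisor)
```

A product that trips all three checks is, by the reasoning above, probably a backup solution whatever its label says.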
For organisations seeking to move from backup to true disaster recovery, software-based solutions offer the most flexibility, and close integration with the virtual management console will help to deliver true business continuity solutions that are able to grow in line with the changing nature of virtualised infrastructure.