Data warehouse applications have been around since the late 1980s. Commonly referred to as a reporting or analytics database, where data is ingested from multiple sources and then transformed into meaningful sets, a data warehouse is regarded as essential to many businesses. Data warehouse applications can range enormously in their complexity and are able to read, manipulate, write and update large amounts of data. But why is there a need for a separate data warehouse system? Why not just report directly from production? The answer is simple: the need for speed. And since we are all too aware that storage based on traditional hard disk drives restricts application performance, can flash storage help data warehouse administrators finally overcome the issues related to latency?
This article will examine these questions with an emphasis on how flash storage, a fast-growing technology, can improve the data warehouse market and bring in new innovations.

Essentially, the purpose of data warehouse applications is to take raw data and present it in an insightful way that makes sense for specific financial or operational purposes. The advantages of a data warehouse are wide, and include the cleansing, improvement and restructuring of data to generate sets that provide detailed insights for reporting or trend analysis. A data warehouse turns data into valuable information, yet to do so it requires high-performing and efficient technologies.
So, what is the problem with running data warehouse applications on disk alone? There are two distinct obstacles. The first is sustaining performance. It is well known that spinning disk technology has fallen behind performance demands over the past ten years and is now the slowest component in a datacentre. Disks are unable to maintain performance under load and are slow at random I/O, both of which are critical issues for most tier one applications. Flash negates these issues thanks to its superior performance over disk.
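As a rough illustration of that weakness, the minimal sketch below times reads of the same 4 KB blocks from a file twice: once in order and once in a shuffled order. The file name and sizes are arbitrary assumptions, and on a machine where the file fits in the OS page cache the two timings converge, so a meaningful result needs a spinning disk, cold caches and a file larger than memory.

```python
# Minimal sketch: time sequential vs. random 4 KB reads of the same file.
# Illustrative only -- path and sizes are assumptions, and the OS page
# cache will hide the difference unless the file exceeds available memory.
import os
import random
import time

BLOCK = 4096                       # 4 KB per read
FILE_SIZE = 64 * 1024 * 1024       # 64 MB scratch file (assumed)
PATH = "seek_test.bin"             # hypothetical scratch path

with open(PATH, "wb") as f:        # create a throwaway test file
    f.write(os.urandom(FILE_SIZE))

def timed_reads(offsets):
    """Read one block at each offset; return elapsed seconds."""
    start = time.perf_counter()
    with open(PATH, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)
    return time.perf_counter() - start

offsets = [i * BLOCK for i in range(FILE_SIZE // BLOCK)]
shuffled = random.sample(offsets, k=len(offsets))   # same blocks, random order

print(f"sequential: {timed_reads(offsets):.2f}s")
print(f"random:     {timed_reads(shuffled):.2f}s")  # far slower on a cold disk
os.remove(PATH)
```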
Second, achieving the best possible performance from a disk requires data to be stored sequentially, to reduce head movement and seek times. This entails employing a single-threaded process to first sort the data and then physically write it to disk, which for larger datasets is a slow, drawn-out process. This presents a significant challenge when managing batch processes, for example, which must also allow for the risk of failure or of the process overrunning into the production window. There have been attempts to improve data warehouse performance through data partitioning or sharding, but ultimately the underlying issue of slow disks remains.
So what if we could get rid of the unnecessary and complex steps of a data warehouse application? Inevitably, some of these (such as data cleansing and transformations) are key to the creation of accurate and useful reports. But if we could remove the sequential data sort and write, for example, the process would be simpler, allowing for increased performance, shorter batch times and reduced costs. Removing the sequential sort and write from the batch process would mean that data remains random, and hence all I/O would remain random by nature. And this is where traditional disk faces its biggest competition from what is now a very fast-growing technology: flash storage.
Flash storage has been hitting the headlines repeatedly over the past couple of years, with ever more impressive claims of unprecedented IOPS and low latency. As the technology gains a track record, more and more organisations look to deploy it to accelerate mission-critical applications and eliminate bottlenecks. The availability of flash storage has also made it possible to overcome some of the long-standing data warehouse performance problems, because its speed far surpasses that of disk, especially for random I/O workloads. Some vendors have also addressed the latency-under-load issue to ensure sustained low response times. Thanks to flash arrays, loading up production with more and more transactions or workloads no longer carries an I/O penalty.
Large data warehouse processing
A typical data warehouse comprises various steps including – but not limited to – importing, staging, transformation, integration and sorting. While some warehouses may add or exclude certain layers, the sequential sort and the disk write are the two key stages. Disks are effective when dealing with sequential I/O but poor at random I/O workloads. For reporting purposes, it is vital that data is stored sequentially on disk to maintain good read performance. Yet writing the data to disk requires the row order to be sequential too, so the physical I/O writes must be completed by a single-threaded process to ensure the data stays in order. Done this way, the disk head doesn’t have to bounce around trying to find the data on large reads, as it is now stored together, which maintains decent response times.
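A minimal sketch of that sort-then-write stage, assuming rows keyed by a report date (the names and CSV layout are illustrative; a real warehouse sorts datasets larger than memory, which forces a slow external merge on disk):

```python
import csv

def sort_and_write(rows, path):
    """Single-threaded stage: sort the whole dataset by key, then write
    the rows one after another so they land sequentially on disk."""
    rows.sort(key=lambda r: r["report_date"])       # global sort first
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["report_date", "amount"])
        writer.writeheader()
        for row in rows:                            # one thread, row by row
            writer.writerow(row)

staged = [
    {"report_date": "2024-06-02", "amount": 120.0},
    {"report_date": "2024-06-01", "amount": 75.5},
]
sort_and_write(staged, "warehouse_fact.csv")
```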
However, two significant problems remain:
1. More storage capacity is required as multiple copies of data are stored.
Working through the warehouse can require multiple copies of data. For example, when doing transformations or sorting, the application might create a new dataset while retaining the raw data. Therefore, for a period of time, the staging area in the database might be two or three times its normal size, holding multiple copies of the same data. This is costly, as storage is overprovisioned yet only used during the batch process.
2. Processes such as transformations, sorting and writing to disk with a single-threaded process take time.
A data warehouse mainly deals with large datasets: one day’s or night’s worth of data from multiple sources can be terabytes in size. Therefore, as each layer has to manipulate terabytes of data, the process can take a long time. Businesses tend to batch these jobs into overnight activities to avoid negative impacts on production and to ensure data is ready for use the next working day. Commonly, these batch processes take hours to complete, and the practice carries a significant risk of running into normal working hours. The rough calculation below gives a sense of the scale involved.
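Under the assumed (purely illustrative) figures below, a single sequential write pass over one night’s data already consumes a large slice of the batch window, before any import, transformation or sorting time is counted:

```python
# Back-of-envelope batch-window arithmetic; both figures are assumptions.
DATASET_TB = 2.0            # one night's data across all sources (assumed)
WRITE_MB_PER_S = 150.0      # sustained single-stream disk write (assumed)

dataset_mb = DATASET_TB * 1024 * 1024
hours = dataset_mb / WRITE_MB_PER_S / 3600
print(f"~{hours:.1f} hours just to write the sorted data once")  # ~3.9 hours
```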
Flash storage gives excellent raw random I/O performance with very low (microsecond) latency even under heavy load, but how can this enhance the current data warehouse process?
As random I/O is very fast, there is no longer any need to sort data sequentially, nor to write it out in a single sequential stream; at a stroke, two very slow stages have been eliminated, boosting performance. Subsequent writes and reads complete within a fraction of a second, so there is no concern over batch or reporting performance. Furthermore, less storage is required, further reducing costs.
By contrast, the same warehouse process on flash storage comprises just two steps, sketched in code below:
Step 1: Import data from multiple sources into a staging area. This involves lots of random I/O reads and writes.
Step 2: Data is ready! This may require some transformations, but there is no more sorting and no single-threaded sequential write.
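A minimal sketch of this two-step flow, with hypothetical source names and a simple CSV staging area standing in for the warehouse: each source is imported in parallel, in whatever order its rows arrive, because random writes no longer carry a penalty on flash.

```python
import csv
from concurrent.futures import ThreadPoolExecutor

def import_source(name, rows):
    """Step 1: write one source's rows to staging as they arrive -- no sort."""
    with open(f"staging_{name}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["report_date", "amount"])
        writer.writerows(rows)      # arrival order; random I/O is cheap on flash

sources = {                         # hypothetical feeds
    "crm":   [("2024-06-02", 120.0), ("2024-06-01", 75.5)],
    "sales": [("2024-06-01", 42.0)],
}
with ThreadPoolExecutor() as pool:  # all sources load concurrently
    for name, rows in sources.items():
        pool.submit(import_source, name, rows)
# Step 2: data is ready to query -- no single-threaded sequential write stage.
```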
As flash storage is good at random I/O, this technology can also be introduced into the transformation infrastructure or stages, further improving performance.
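A toy sketch of that idea, assuming the staged data can be split into independent partitions (the names and the cleansing rule are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def transform_partition(rows):
    """Cleanse one partition: drop rows with no amount, normalise the rest."""
    return [
        {**r, "amount": round(float(r["amount"]), 2)}
        for r in rows
        if r.get("amount") is not None
    ]

partitions = [                      # independent slices of staged data
    [{"report_date": "2024-06-01", "amount": "75.50"}],
    [{"report_date": "2024-06-02", "amount": None},
     {"report_date": "2024-06-02", "amount": "120"}],
]
with ThreadPoolExecutor() as pool:  # partitions transform in parallel
    cleansed = [row for part in pool.map(transform_partition, partitions)
                for row in part]
print(cleansed)
```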
Data Marts
The production of copy data marts, i.e. full copies of production for reporting, is used by some businesses as an alternative to complicated data warehouse systems. Using a separate database in this way ensures the reporting load will not affect production, protecting the performance of the core infrastructure. The primary reason for using data marts is that legacy disk storage cannot be relied upon to sustain performance when reporting and production workloads share it.
Reading from production also causes a lot of random I/O and, as already mentioned, disk performance is very poor under this type of workload. Although data marts reduce the complexity of managing a bigger staged warehouse, they add an extra layer to the application infrastructure. With flash storage, the need for data marts can be removed completely. If I/O is no longer a penalty and the application can achieve microsecond latency and high IOPS with sustained performance under load, then reporting can be done straight from production. This is a major benefit to businesses for the following simple reasons:
First, it enables true real-time reporting and analytics, giving speed and flexibility: a business can analyse data as it is being manipulated by users and customers, opening up new areas such as real-time trend analysis and faster decision-making.
Second, it reduces infrastructure costs by removing the need for duplicated storage and the cost of additional database licensing. Finally, the operational aspect is considerably simplified: by removing an entire database and hardware infrastructure, the operational management of the estate becomes more streamlined, allowing a company’s IT department to be more productive with its support resources.
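To make the “report straight from production” idea concrete, here is a minimal sketch using sqlite3 as a stand-in for a production database, with an invented orders table. The point is simply that the report becomes an ordinary read-only query against live data rather than against a copy.

```python
import sqlite3

conn = sqlite3.connect("production.db")   # stand-in for the production DB
conn.execute("CREATE TABLE IF NOT EXISTS orders (placed_at TEXT, amount REAL)")
conn.execute("INSERT INTO orders VALUES ('2024-06-02 09:15', 120.0)")

# Real-time trend report run directly on production: no data mart, no
# overnight copy -- just random-I/O reads, which flash sustains under load.
for day, total in conn.execute(
    "SELECT substr(placed_at, 1, 10) AS day, SUM(amount) "
    "FROM orders GROUP BY day ORDER BY day"
):
    print(day, total)
conn.close()
```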
These are only some examples of how flash storage, with its very fast sustained performance, can transform the data warehouse environment. By replacing slow, unpredictable disks with flash, the warehouse batch process is sped up, moving the business towards real-time analytics while saving time and budget. Removing process steps that are no longer necessary can dramatically help a business drive up its productivity and, ultimately, its bottom line.