Data has been growing at an exponential rate since the birth of the Internet, and it is expected to continue growing at an accelerated rate through 2020. In particular, the big data market is still in its infancy relative to its anticipated growth over the next 10 years. The vast majority of the data being created is file-based and unstructured, challenging traditional enterprise storage environments designed and optimized for structured data sets in a database format. Where is all this data coming from? The main drivers of today’s data explosion are Web 2.0, digital devices, and machine-generated data, also termed sensor data. McKinsey Global Institute1 predicts that by 2020, over 31 billion digital devices will be connected to the Internet, fueling the phenomenal explosion in data created. In addition, Web 2.0 applications, including Google, Facebook, Netflix, and YouTube, will all contribute to the growth of data creation.
IDC has predicted that stored data will grow at a rate of over 40 times per year through 2020, reaching over 15,000 zettabytes. To put that number in context, a zettabyte is approximately one billion terabytes. We are only expected to cross the one-zettabyte threshold in 2014, and it will take over 500 million drives to store that much data. Worldwide manufacturing capacity of flash-based technology is less than 1% that of spinning disk, and it will take some time before flash represents a significant piece of the storage pie. So while flash may be growing at an exponential rate in the storage industry, the data growth rate will drive a continued need for high-density spinning disk.
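As a back-of-envelope check of the drive-count claim, the minimal sketch below divides one zettabyte by an assumed average drive capacity of 2TB; the per-drive capacity is an assumption for illustration, not a figure from this paper.

```python
# Back-of-envelope check: how many drives does one zettabyte require?
# The 2 TB average drive capacity is an assumption for illustration only.
ZETTABYTE_IN_TB = 1_000_000_000      # 1 ZB ~= one billion terabytes
AVG_DRIVE_CAPACITY_TB = 2            # assumed average capacity per drive

drives_needed = ZETTABYTE_IN_TB / AVG_DRIVE_CAPACITY_TB
print(f"Drives needed for 1 ZB: {drives_needed:,.0f}")   # -> 500,000,000
```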
A further limiter to the use of flash across the storage industry is the massive price difference between spinning disk and solid-state disk. A one-terabyte hard disk retails for as little as 5 cents per gigabyte, while SSD-based technology commands close to 60 cents per gigabyte at the low end of the scale2. While the gap between the two technologies is closing, the price delta will continue to drive users to hard disk storage for capacity expansion, particularly in light of the massive onslaught of new data being generated every year. Another illustration of the convergence of SSD and spinning disk prices can be seen in Pingdom’s tracking survey3.
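To make the delta concrete, the sketch below applies the per-gigabyte figures cited above to a hypothetical 1 PB raw-capacity build-out; the petabyte figure is chosen purely for illustration.

```python
# Rough cost comparison for a hypothetical 1 PB raw-capacity build-out,
# using the retail per-gigabyte figures cited above.
PETABYTE_IN_GB = 1_000_000
HDD_PRICE_PER_GB = 0.05   # ~5 cents per gigabyte
SSD_PRICE_PER_GB = 0.60   # ~60 cents per gigabyte

hdd_cost = PETABYTE_IN_GB * HDD_PRICE_PER_GB
ssd_cost = PETABYTE_IN_GB * SSD_PRICE_PER_GB
print(f"HDD: ${hdd_cost:,.0f}  SSD: ${ssd_cost:,.0f}  "
      f"(delta: {ssd_cost / hdd_cost:.0f}x)")   # -> roughly a 12x price delta
```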
The excitement within the storage industry over SSD-based technology comes from the huge performance advantage of NAND flash over spinning disk. Hard disk drives have physical limitations that severely constrain their performance, including a maximum rotational speed of 15,000 revolutions per minute. In essence, hard disk product development has focused on making drives fatter, not faster.
By contrast, SSD-based technology delivers lightning-fast reads and writes compared to spinning disk, because it does not have to battle the basic physics of rotating media. Independent tests consistently show SSDs outperforming the fastest SAS disk drives by a factor of 20, and 5,400 RPM SATA disks by more than 50. SSDs have been game-changing for I/O-hungry applications, and no other technology has had such a major impact on the storage industry in years. Put simply, SSD technology has been transformational in managing small-file workloads in the storage industry. However, at roughly 12 times the cost and only one-hundredth the capacity, SSD does not pose a huge threat to spinning disk any time soon.
Spinning disk and SSD, living in perfect harmony
SSDs were first deployed as a caching layer to dramatically accelerate IOPS performance from slower spinning disks. However, unlike spinning disk, SSDs wear out upon reaching their specified lifetime writes, and using SSD as a caching layer dramatically shortens a drive’s life due to the high volume of writes and rewrites. Write wear can be alleviated by using the highest-quality drives, but the cost is prohibitive for all but the most critical applications. The vast majority of enterprise SSD deployments to date have been directed at accelerating mission-critical database applications in a SAN environment. SSD deployments in the big data space have been focused within Web 2.0 companies such as Facebook and Amazon to dramatically improve their ability to serve up pages on the Internet. But for the vast majority of companies, SSD is too expensive as a big data store.
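A rough way to see why cache-style write traffic shortens SSD life is to divide a drive’s rated lifetime writes by its daily write volume. The sketch below uses purely hypothetical endurance and workload figures; it is an illustration of the relationship, not the specification of any particular drive.

```python
# Hypothetical illustration of SSD write wear: estimated service life falls
# as the daily write volume (e.g. from cache churn) rises.
# All numbers are assumptions for illustration, not real drive specs.
RATED_ENDURANCE_TB = 1_500            # assumed lifetime writes (TBW)

for daily_writes_tb in (0.5, 2, 10):  # light use vs. heavy cache-layer churn
    lifetime_years = RATED_ENDURANCE_TB / daily_writes_tb / 365
    print(f"{daily_writes_tb:>4} TB written/day -> ~{lifetime_years:.1f} years of life")
```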
Panasas has served the scale-out big data market in critical applications in science, engineering, finance, biotech, and core research for many years, and has seen demand for data storage grow at an exponential rate. Last year, for example, Panasas alone shipped over 50 petabytes of disk storage to its big data customers. That is almost 5x the total amount of SSD shipped to the entire storage industry in the same year. Our many years of servicing high-performance, high-bandwidth big data applications in design and discovery have provided significant insight into the nature of customer file types and sizes. When we quizzed our customers about their data sets, across industries as varied as nuclear research and financial quantitative analysis, the vast majority believed they had a “large file” workload. This “need for speed” had tailored Panasas products to deliver the highest bandwidth per disk drive in the industry. However, customer data revealed that while over 90% of disk space was taken up by large files, small files (files less than 64KB) ranged from 50% of the total file count at the low end to over 80% for some applications. Put simply, the big data market has a small file problem too, and could greatly benefit from the massive small-file performance gains of SSD.
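A quick way to test whether a data set fits this profile is to walk its directory tree and compare the small-file share of the file count with its share of total capacity. The minimal sketch below applies the same 64KB threshold used above; the root path is a placeholder.

```python
import os

SMALL_FILE_THRESHOLD = 64 * 1024   # 64 KB, matching the threshold used above

def small_file_profile(root):
    """Report what fraction of files, and of capacity, falls below the threshold."""
    small_count = total_count = 0
    small_bytes = total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue                      # skip files that vanish or deny access
            total_count += 1
            total_bytes += size
            if size < SMALL_FILE_THRESHOLD:
                small_count += 1
                small_bytes += size
    if total_count:
        print(f"small files: {small_count / total_count:.0%} of file count, "
              f"{small_bytes / max(total_bytes, 1):.0%} of capacity")

small_file_profile("/path/to/dataset")   # placeholder path
```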
Figure: Sample set of small files across different big data workloads
With this insight, we built a storage system that places the right files on the right medium to begin with: small files are automatically stored on SSD, while large files are stored on high-capacity, low-cost SATA drives. In designing the ActiveStor 14, Panasas also placed file metadata on SSD, dramatically improving operations such as directory listings and overall file system responsiveness. This seemingly subtle change in storage medium produced a dramatic change in product performance. Compared with the previous ActiveStor generation, customers now enjoy a 9x improvement in small-file reads per second, 6x faster directory listings, 5x faster file statistics, and 2x faster NFS operations per second. With the file system improvements, RAID reconstruction rates have also increased dramatically: even though we have increased the physical capacity of the system by 33%, RAID rebuild times have remained the same. The product is available in three models, ranging from as little as 1% SSD up to 10% SSD for IOPS-intensive workloads. And because we use SSD as persistent storage rather than as a caching layer, we do not suffer the write-wear problem associated with P/E cycles. Consequently, we are able to use much cheaper MLC technology, minimizing the cost burden of SSD. By utilizing the highest-density 4TB drives, we have actually reduced the cost per gigabyte by 15% while dramatically increasing system performance and scalable capacity.
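The sketch below is not the PanFS implementation; it is a minimal illustration of the size-based placement idea described above, with the 64KB cutoff carried over from the earlier file-size analysis and the tier names chosen purely for illustration.

```python
# Minimal sketch of size-based placement (NOT the actual PanFS logic):
# small files and metadata go to the SSD tier, large files to SATA.
SMALL_FILE_CUTOFF = 64 * 1024   # assumed cutoff, mirroring the 64KB figure above

def choose_tier(size_bytes, is_metadata=False):
    """Return the storage tier for a new object under this simple policy."""
    if is_metadata or size_bytes < SMALL_FILE_CUTOFF:
        return "ssd"
    return "sata"

print(choose_tier(4 * 1024))              # 'ssd'  -> small file
print(choose_tier(512 * 1024 * 1024))     # 'sata' -> large file
print(choose_tier(0, is_metadata=True))   # 'ssd'  -> metadata always on flash
```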
The final solution is a storage platform optimized for mixed-file workloads, the Achilles’ heel of the storage industry. All other platforms are broken up into user- or vendor-defined tiers for high IOPS or high bandwidth. The ActiveStor 14 has taken the Steve Jobs approach of not making the customer decide, instead integrating the intelligence up front so the product behaves optimally out of the box. The result is an industry-first use of intelligent tiering for mixed workloads.
The onus is on the storage industry to rethink how we deploy storage systems given this game-changing new technology. Instead of looking to SSD as a way to claw back some margin advantage in return for higher performance, SSD needs to become a building block in a hybrid solution that maximizes capacity scale while maintaining a cost structure that can sustain the data growth in big data markets. Object-aware file systems, such as Panasas PanFS, provide the flexibility to use SSD in the smartest possible way, resulting in a very high-performing, cost-effective storage platform for big data.
1 McKinsey Global Institute, “Big data: The next frontier for innovation, competition, and productivity,” May 2011, by James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers.
2 http://news.techworld.com/storage/3420021/micron-1tb-ssd-slashes-average-per-gigabyte-pricing/
3 http://royal.pingdom.com/2011/12/19/would-you-pay-7260-for-a-3-tb-drive-charting-hdd-and-ssd-prices-over-time/