Public Health England [PHE], the executive agency of the Department of Health, is using a new big data storage system based on DataDirect Networks [DDN] storage alongside an existing high-performance server cluster to enable faster and more effective analysis of genome sequences. These sequences are then used in PHE activities for diagnostics and surveillance of infectious diseases.
The implementation, configuration and integration of the various components of the big data system and cluster have been supported by OCF, a provider of big data processing, management, storage and analytics.
In December 2012, Prime Minister David Cameron announced the ‘100,000 Genome Project’, in which the personal DNA code of up to 100,000 patients, or infections in patients, will be decoded. The Department of Health prioritised a number of areas, with infectious disease sequencing undertaken by PHE. PHE has laboratories across England receiving thousands of biological samples per week from patients with unidentified and potentially aggressive pathogens that need urgent identification. This project supports PHE’s goal of being a leader in the adoption of genomics in clinical microbiology to support public health interventions more quickly and cost-effectively.
PHE uses Illumina sequencing machines (HiSeq, MiSeq) to generate DNA sequence data from diverse pathogen (bacteria and viruses) samples. A high-performance computing cluster, integrated in July 2013 by OCF, is used to assemble and analyse genetic information to provide accurate diagnostics and rapid identification of outbreaks, thereby helping patients and delivering public health interventions more effectively.
Specifically, the cluster parallelises the analysis of generated sequences, significantly reducing the time taken to analyse hundreds of genomes to as little as a couple of hours (or less), compared with many hours on a normal workstation where the analysis is done sequentially.
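The idea of analysing many genomes side by side rather than one after another can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PHE's pipeline: the sample names, the sequences and the toy `gc_content` analysis are hypothetical, and a thread pool stands in for the cluster scheduler that would dispatch real jobs across compute nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_content(sequence: str) -> float:
    """Toy per-genome analysis (hypothetical): fraction of G and C bases."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def analyse_sequential(genomes: dict) -> dict:
    """Workstation-style run: one genome after another."""
    return {name: gc_content(seq) for name, seq in genomes.items()}


def analyse_parallel(genomes: dict, workers: int = 4) -> dict:
    """Cluster-style run: independent genomes are handed to a pool of workers."""
    names = list(genomes)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with names.
        results = list(pool.map(gc_content, (genomes[n] for n in names)))
    return dict(zip(names, results))


# Hypothetical samples, far shorter than real pathogen genomes.
genomes = {"sample_a": "ATGCGC", "sample_b": "ATATAT", "sample_c": "GGCC"}
parallel_results = analyse_parallel(genomes)
sequential_results = analyse_sequential(genomes)
```

Because each genome is analysed independently, both runs give the same results; only the scheduling differs. In pure Python a thread pool does not speed up CPU-bound work, so a real deployment would rely on separate processes or a cluster job scheduler to obtain the reduction in run time described above.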
PHE is also now using 300TB of high-performance DDN™ SFA® storage integrated by OCF. PHE keeps data for around 3-4 months, enabling numerous researchers to analyse data sets simultaneously. The data is then tiered off to a DDN storage archive and also made available for sharing with clinical partners and other research organisations.
Technology
The system uses DDN, HP and IBM hardware; open-source software, including Linux and xCAT; and commercial software and technical support from Univa and Red Hat. The system includes:
· HP BladeSystem c7000 with 16 HP BL460c Gen8 server blades
· IBM x3650 nodes for data management services
· To support massive performance and data growth requirements, OCF installed DDN SFA storage and an EXAScaler™ appliance with the Lustre® file system
o Configured usable capacity: 300TB (7x 8+P+P)
o Configured performance: 2.5GB/s (OSS-Storage Capability)
o Maximum performance: 6GB/s
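The storage figures above lend themselves to a short piece of arithmetic. The sketch below assumes that "7x 8+P+P" means seven drive groups, each with eight data drives and two parity drives; that reading is an assumption rather than something the configuration states, and the function and variable names are hypothetical.

```python
def raid_layout(groups: int, data_per_group: int, parity_per_group: int) -> dict:
    """Drive counts and space efficiency for identical parity-protected groups."""
    data_drives = groups * data_per_group
    parity_drives = groups * parity_per_group
    total_drives = data_drives + parity_drives
    return {
        "total_drives": total_drives,
        "data_drives": data_drives,
        "parity_drives": parity_drives,
        # Share of raw capacity left for data once parity is set aside.
        "efficiency": data_drives / total_drives,
    }


# Assumed reading of "7x 8+P+P": 7 groups of 8 data + 2 parity drives.
layout = raid_layout(groups=7, data_per_group=8, parity_per_group=2)

# Throughput figures taken from the list above, in GB/s.
configured_gb_s = 2.5
maximum_gb_s = 6.0
throughput_share = configured_gb_s / maximum_gb_s

print(layout)
print(f"Configured throughput is {throughput_share:.0%} of the stated maximum")
```

Under that assumption the layout comprises 70 drives, of which 56 hold data, so four fifths of the raw capacity is usable. The configured 2.5GB/s is a little over 40% of the 6GB/s maximum.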
The future
In the coming few weeks PHE will expand the high-performance IBM cluster again with another 16 compute nodes, further increasing the possible parallelisation of the analysis of generated sequences. PHE is also expanding its archiving storage capacity with DDN WOS® cloud storage (with 250TB of capacity) and implementing the open-source data grid software iRODS to help organise, share, protect and preserve scientific data. This additional system will also enable:
· Creation of a private cloud environment where researchers can access geographically dispersed and replicated file data using the fastest nodes [normally the data source], not necessarily the nodes closest to their location, improving researchers' productivity
· Simultaneous access to data, increasing collaboration among researchers
· Addition of metadata to standard file system data, enabling researchers to search, browse and retrieve data more quickly
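The last point, attaching metadata to files so that they can be found by attribute rather than by path, can be illustrated with a minimal in-memory catalogue. This is a plain Python sketch, not the iRODS API: the `Catalogue` class, the file paths and the metadata keys are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """A file path plus free-form key/value metadata."""
    path: str
    metadata: dict = field(default_factory=dict)


class Catalogue:
    """Hypothetical in-memory index of files and their metadata."""

    def __init__(self) -> None:
        self._objects = []

    def add(self, path: str, **metadata: str) -> None:
        self._objects.append(DataObject(path, dict(metadata)))

    def search(self, **criteria: str) -> list:
        """Return paths whose metadata matches every given key/value pair."""
        return [
            obj.path
            for obj in self._objects
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        ]


catalogue = Catalogue()
# Hypothetical paths and attributes, for illustration only.
catalogue.add("/archive/run1/a.fastq", organism="salmonella", instrument="MiSeq")
catalogue.add("/archive/run1/b.fastq", organism="influenza", instrument="HiSeq")
catalogue.add("/archive/run2/c.fastq", organism="salmonella", instrument="HiSeq")

salmonella_files = catalogue.search(organism="salmonella")
salmonella_hiseq = catalogue.search(organism="salmonella", instrument="HiSeq")
```

With metadata attached, a researcher can ask for all files for a given organism or instrument without knowing where they sit in the directory tree.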