I’ve been working with some clients on their “Big Data” service. I purposely put “Big Data” in quotes because, to me, it is a marketing term rather than the name of a specific technology. I think we’ve all heard it used where a real technology term should have been. It gets used interchangeably with analytics, for example, even though analytics has had a perfectly good name of its own since it became a thing many years ago. It also gets bandied around as the cure for all ills, and this is where the problems begin and the reason behind my opening question.
The problem is, just as we have learnt in the past that not all data is equal, it is also true that not all analytics are equal. It is pretty much horses for courses. Take for instance an online banking system. This produces quite a lot of data of differing types. For this example we will take a simplistic view of it. Let us assume that using an online banking system we generate the following data:
A transaction record. This is the actual record of what money has moved from where to where. It is essentially an event that has caused a change in one or more entities.
An audit entry. This records who triggered the event, some details about where the event was triggered from, and the degree of authority that the task executor had.
A modification of the account record. The change that the event has triggered is applied to the persistent record for the impacted account or accounts.
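To make those three pieces of data concrete, here is a minimal sketch of how they might be modelled. The field names are my own illustration rather than anything prescribed by a real banking system.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class TransactionRecord:
    """The event itself: what money has moved from where to where."""
    transaction_id: str
    source_account: str
    target_account: str
    amount: float
    timestamp: datetime

@dataclass
class AuditEntry:
    """Who triggered the event, from where, and with what authority."""
    transaction_id: str
    triggered_by: str
    channel: str           # e.g. web, mobile, branch
    authority_level: str

@dataclass
class AccountRecord:
    """The persistent record that the event modifies."""
    account_id: str
    balance: float
    last_updated: datetime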
Let’s leave it at those three for the sake of this example. What analytics could apply to this one transaction? Firstly, there is a real-time analysis that is required to assess whether this transaction is consistent with the previous usage pattern or is an outlier that could indicate a fraudulent transaction that needs to be blocked, or at the very least verified by some other means. This needs to happen extremely quickly so that the transaction can flow without unacceptable delay to the customer executing it. Another basic analysis is assessing whether the transaction is possible based on the balances and policies around the accounts involved. This does not actually need to be a real-time analysis.
The process can be based on data that could potentially be 24 hours old. Sometimes this can be advantageous to the institution managing the transaction, as insufficient funds could result in a charge to the customer, or advantageous to the customer, as a later credit might provide a positive balance at the end of the day.
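As a flavour of how those two checks differ, the sketch below flags a transaction as an outlier if its amount sits far outside the account’s recent usage, and separately tests the balance-and-policy question that can wait for older data. The threshold and statistics are placeholders for whatever model a real fraud engine would use.

import statistics

def looks_fraudulent(amount: float, recent_amounts: list[float],
                     threshold: float = 3.0) -> bool:
    """Real-time check: flag a clear outlier against recent usage.

    A real engine would use a far richer model; a simple z-score on the
    amount is enough to show the shape of the check.
    """
    if len(recent_amounts) < 2:
        return False  # not enough history to judge
    mean = statistics.mean(recent_amounts)
    stdev = statistics.stdev(recent_amounts)
    if stdev == 0:
        return amount != mean
    return abs(amount - mean) / stdev > threshold

def sufficient_funds(balance: float, amount: float,
                     overdraft_limit: float = 0.0) -> bool:
    """Slower, policy-based check: can run against day-old balance data."""
    return balance + overdraft_limit >= amount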
Finally, at least in the case of this example, there are some after-the-fact analytics that can occur: trend analysis on the account to feed into future anti-fraud calculations, and transaction analysis to identify future upsell possibilities, such as suitability for various credit products.
So we have three different types of analytics that can form around a single transaction, and on the whole each type is handled better by a different technology. The three we have are real-time trend analysis, structured query-based testing and large-scale batch data processing. I will not talk about specific technology at this point in terms of the analytics engine, but I will point out that the fashion for the modern analytics engine, usually built around MapReduce, is to use locally managed, usually directly attached, disks.
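To show what the batch end of that spectrum looks like, here is a toy map/reduce pass, written in plain Python rather than any particular framework, that totals transaction volume per account. A real MapReduce job would distribute the same two steps across many nodes and their local disks, which is exactly why those engines favour directly attached storage.

from collections import defaultdict

def map_phase(transactions):
    """Map: emit (account, amount) pairs from raw transaction records."""
    for tx in transactions:
        yield tx["source_account"], tx["amount"]

def reduce_phase(pairs):
    """Reduce: sum the amounts for each account key."""
    totals = defaultdict(float)
    for account, amount in pairs:
        totals[account] += amount
    return dict(totals)

transactions = [
    {"source_account": "A1", "amount": 20.0},
    {"source_account": "A2", "amount": 150.0},
    {"source_account": "A1", "amount": 5.0},
]
print(reduce_phase(map_phase(transactions)))  # {'A1': 25.0, 'A2': 150.0}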
This locally attached approach works perfectly well when taken in isolation and with only a single analytics workload in play. However, when that data then has to be passed on, in either processed or raw form, to a different engine or a structured relational database, a data movement is suddenly required.
If the data sets being considered are large, which is more commonly the case these days given that it has earned the moniker “Big” data, this movement is not insignificant. Not only is the move itself a heavy workload, but the transformation often required to make the data native to the next engine adds to it. With the three different types of analytics outlined above, that could mean three moves and three transformations. The other problem with moving data around in this manner is that each move creates a new version that could potentially be updated. There are now three locations of data, any of which could potentially be the most valid and need to be synchronised into the others.
So what is the alternative? Well, not to move the data at all, of course, but this is not usually a solution that is considered. Anyone who has been involved in a data migration, particularly one involving a change of application, is very wary of any change of this kind. What surprises me is that these are normally the same people who are perfectly happy to accept the three moves that have become business as usual.
The answer is to design the whole system around a single data store that allows the data to be exported where necessary and the data objects to be tagged in a way that allows them to be tracked through their life.
With this method, the data does not need to move. Interfaces can be used to expose the data in its static form to the various engines. If a certain amount of transformation is required, then caching capabilities can be used to create a read-only subset of the data, even on faster media such as solid state should it be required, against which the analytics can be run before the results are deposited back in the single store. This eliminates the need for mass data movements, reins in the number of active copies that can lead to a “split brain” and provides a single instance from which to make disaster recovery and backup copies where required.
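A rough sketch of that pattern follows: a single store hands out read-only cached subsets for each engine to work against, and the results land back in the same store. The class and method names here are my own illustration, not any particular product’s API.

from types import MappingProxyType

class SingleStore:
    """Stand-in for the single authoritative data store."""
    def __init__(self):
        self.objects = {}   # object_id -> data
        self.results = {}   # analysis name -> result

    def read_only_subset(self, object_ids):
        """Hand out a cached, read-only view (e.g. pinned on faster media)."""
        return MappingProxyType(
            {oid: self.objects[oid] for oid in object_ids})

    def deposit_result(self, name, result):
        """Results come back to the single store, not to a new copy of the data."""
        self.results[name] = result

store = SingleStore()
store.objects.update({"tx1": 20.0, "tx2": 150.0, "tx3": 5.0})

cache = store.read_only_subset(["tx1", "tx3"])   # no mass move, no writable copy
store.deposit_result("small_tx_total", sum(cache.values()))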
As for the tagging capability, my suggestion is to investigate the Cloud Data Management Interface (CDMI) standard. This is a Storage Networking Industry Association (SNIA) standard that provides a rich metadata layer that can not only describe the data in its current state, but can also describe how the data needs to be treated in different circumstances and link that data to relevant copies or instances of it in different layers of the architecture.
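As a flavour of what that tagging might look like, the sketch below attaches user metadata to a CDMI data object over the standard’s HTTP interface. The endpoint, tag names and values are my own illustration, and the headers and body layout follow my reading of the CDMI specification, so they should be checked against the current version of the standard.

import json
import requests

# Hypothetical CDMI endpoint and object path -- illustration only.
url = "https://cdmi.example.com/cdmi/analytics/transactions/tx-000123"

headers = {
    "Content-Type": "application/cdmi-object",
    "X-CDMI-Specification-Version": "1.1.1",
}

body = {
    "mimetype": "application/json",
    # User metadata tags that let the object be tracked through its life;
    # names without the reserved "cdmi_" prefix are free-form user metadata.
    "metadata": {
        "lifecycle_stage": "raw",
        "source_system": "online-banking",
        "linked_copies": "cache-tier-01",
        # Data system metadata (redundancy, placement, retention) also lives
        # here, but the exact names should be taken from the CDMI specification.
    },
    "value": json.dumps({"transaction_id": "tx-000123", "amount": 20.0}),
}

response = requests.put(url, headers=headers, data=json.dumps(body))
print(response.status_code)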
Analytics are not new, but the means of applying those techniques to large pools of data, using commodity hardware and less specialist skill and software than previously required, certainly are. However, in the rush to adopt them, little thought is given to where the data must live and what restrictions these new applications place on that data. It is far better to remove technology from the original application planning and to think about the use cases that can be applied to the data. What types of query or analysis are possible? What benefits will they bring, and at what cost? Only when in possession of this degree of insight into how the data functions in the business can a true data-orientated architecture be realised.
To learn more about Analytics & Big Data, CDMI, SNIA and SNIA Europe, please visit www.snia-europe.org