“We find that whole communities suddenly fix their minds upon one object, and go mad in its pursuit; that millions of people become simultaneously impressed with one delusion, and run after it, till their attention is caught by some new folly more captivating than the first.”
Extraordinary Popular Delusions and the Madness of Crowds.
That the IT industry is primarily driven by fashions is fairly obvious. Even the MIT Sloan Management Review published an article about the career benefits of becoming a dedicated follower of IT fashion. Big data is the latest IT fad to get the fashionistas drooling, and as with all IT fashions, some organizations will look much better and some much worse for adorning themselves with this latest garment. But at a time of crowd madness such subtleties are largely ignored.
To level the playing field, here is a quick overview of what big data is about. Commentators are forever telling us that the quantity, diversity, velocity and volatility of data are rapidly increasing. Yawn. Yes, we all know this, and some bright spark contorted the issue sufficiently that five words beginning with ‘V’ could be used to explain the phenomenon. But let’s not go there.

The traditional relational database does a lot of work making sure things hang together (not being able to delete a customer record while there are still open orders referring to that customer, for example). Various types of integrity are maintained, and because databases have traditionally been used for transactional data, the details of each transaction are stored as a single unit. For very large amounts of data this is not a good scenario. Scalability is limited by the fact that some central entity has to make the whole thing hang together, and for analytics work the row-based relational model is a poor fit, because totaling one attribute means reading every row in full.

So it has gradually dawned on people that stripping away the overhead of the relational model and doing away with the row-based paradigm might be a good idea. The result is big data technology, which is characterized by massive scalability and great flexibility. The central construct used in big data is the key-value pair. Instead of storing a transaction as a single record, it is broken down into multiple key-value pairs. So if customer Joe Smith has a key of 101234 then we might see several key-value pairs: 101234:Joe, 101234:Smith, 101234:50, 101234:New York and so on. Using the common key, the details for Joe can be reconstituted if needed, although this really isn’t a particularly efficient thing to do. But if we wanted to total the sales for the current month then we just need to pull out the ‘purchased this month’ key-value pairs and total them. These key-value pairs can be distributed over multiple servers with a minimum of centralized control (a job that Hadoop performs).
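To make the key-value idea concrete, here is a minimal Python sketch. It is an illustration only, not how Hadoop or any particular product stores data. The field names, the second customer and the `reconstitute`, `total_field` and `partition` helpers are all hypothetical, and the sketch assumes each key carries a field name alongside the customer number so that the pairs can be told apart.

```python
from collections import defaultdict

# Hypothetical sample data: every fact is stored as its own key-value pair.
# Assumption: the key is (customer id, field name); the field names are invented.
pairs = [
    (("101234", "first_name"), "Joe"),
    (("101234", "last_name"), "Smith"),
    (("101234", "purchased_this_month"), 50),
    (("101234", "city"), "New York"),
    (("101235", "first_name"), "Ann"),
    (("101235", "purchased_this_month"), 75),
]


def reconstitute(customer_id, pairs):
    """Rebuild one customer's record by scanning all pairs for the common key."""
    return {field: value for (cid, field), value in pairs if cid == customer_id}


def total_field(field_name, pairs):
    """Total a single field across all customers, ignoring every other field."""
    return sum(value for (_, field), value in pairs if field == field_name)


def partition(pairs, n_servers):
    """Spread pairs over n_servers buckets by hashing the key (no central index)."""
    shards = defaultdict(list)
    for key, value in pairs:
        shards[hash(key) % n_servers].append((key, value))
    return shards


record = reconstitute("101234", pairs)
monthly_total = total_field("purchased_this_month", pairs)

# Each server can total its own pairs; the partial totals are then combined.
shards = partition(pairs, 3)
combined = sum(total_field("purchased_this_month", s) for s in shards.values())
```

Note the trade-off the sketch exposes: rebuilding Joe's record means scanning for every pair that shares his key, while totaling one field only touches the pairs for that field and can be done shard by shard.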
As with all things these benefits do not come for free. All the tying together that relational databases performed now has to be implemented in program code, and complexity mushrooms.
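As a rough illustration of what moves into program code, here is a Python sketch of the "no deleting a customer with open orders" rule that a relational database would otherwise enforce. The in-memory dictionaries, the `OpenOrdersError` exception and the `delete_customer` function are hypothetical stand-ins, not any real product's API.

```python
class OpenOrdersError(Exception):
    """Raised when a customer still has open orders (hypothetical error type)."""


# Hypothetical in-memory stand-ins for two separate key-value stores.
customers = {"101234": "Joe Smith", "101235": "Ann Lee"}
orders = {
    "A1": {"customer_id": "101234", "status": "open"},
    "A2": {"customer_id": "101235", "status": "closed"},
}


def delete_customer(customer_id, customers, orders):
    """Delete a customer only if no open order still refers to them."""
    open_orders = [
        order_id
        for order_id, order in orders.items()
        if order["customer_id"] == customer_id and order["status"] == "open"
    ]
    if open_orders:
        raise OpenOrdersError(
            f"customer {customer_id} has open orders: {open_orders}"
        )
    del customers[customer_id]


delete_customer("101235", customers, orders)  # allowed: only a closed order

try:
    delete_customer("101234", customers, orders)  # refused: order A1 is open
    blocked = False
except OpenOrdersError:
    blocked = True
```

Even this small check hides a problem: once the data is spread over several servers, the lookup and the delete are separate steps, so a new order could arrive in between. Closing that gap is exactly the sort of work the application now has to do for itself.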
In fact big data is so new that we don’t really know how damaging this complexity will be.
It reminds me very much of the early days of client/server computing, which was also about distributing stuff that had once been centrally controlled on a mainframe. Disaster stories became the order of the day as system management issues reared their ugly heads.
The rational approach to big data, and to any other new IT fashion, is as follows:
- If you really, really need big data then by all means go for it. But be aware that this is a high-risk route, and the risk should be balanced by a solid conviction that the benefits will outweigh it.
- If you need big data, but not tomorrow, then by all means prototype. Take your time and let others make the mistakes. This also allows the skills market to mature and the price for such skills to fall. Then do it when prices are low and lessons have been learned.
- Stick with what you have got if there is no need for big data. Five to ten years down the track, when you possibly might need it, the technology will be mature, the skills less expensive and the risks much smaller.
Please note the use of the word ‘rational’. Generally speaking we are not rational, and most certainly not where IT is concerned. Personal career agendas, emotions, cognitive biases and so on make us anything but rational, despite the pretense. So here is what will happen. Big data will be very ‘big’. A few years from now we will start to see disaster stories emerging, although in reality these will only be the tip of the iceberg, since most IT cockups are hidden from public view. Social networks just make the whole scenario much more likely as consultants, managers and technicians jostle for position.
This is not an anti-big-data article. Big data is here to stay, although in typical fashion we overestimate the short-term effects and underestimate the long-term effects, which will be profound. But that is another story.