Big data answers the age-old question of how we deal with proliferating quantities of increasingly diverse data. Of course, what counts as big data today will probably be small data ten years from now; the word ‘big’ is relative, as well as being one of the most successful buzzwords the IT industry has ever created. The situation facing us can be summarized simply: existing data technologies are inadequate, and alternative ways of handling the glut of new data types organizations must deal with are emerging.
As always there are no free lunches, and the extra capability big data technology delivers comes at a price. There are the hard dollars needed to buy the technology and skills, but big data also introduces considerable complexity – more than most in the IT industry would be willing to admit. Right now the big data technology marketplace resembles a zoo, with strange, esoteric species making an almost daily appearance. For technicians this is the nearest thing to nirvana; for business managers, a bewildering jumble of names and acronyms. Big data technologies are still very immature, and unless there are pressing needs it is best to wait for the dust to settle somewhat.
The technology itself is based on a ‘divide and conquer’ approach to handling data. Traditional data management involves centralized databases, often deployed on specialized hardware and system software, and this in itself restricts the way these databases can grow. Moreover, the dominant model for database technology, the relational database management system (RDBMS), is unsuited to many types of data (text, for example) and is hopelessly inefficient at handling others (e.g. streaming financial data). Big data technology typically utilizes commodity hardware and systems software (lots of it) and divides a workload so that it is split over many computers to execute in parallel. It sounds simple enough, but the supporting software technology needed to make this happen is sophisticated and complex. The best-known environment for big data is called Hadoop. At its heart are a distributed file system, HDFS (which lets files spread over multiple computers be managed as one), and MapReduce, a technique for dividing the workload (map) and then pulling the results back together again (reduce). Hadoop also has a seemingly endless set of associated utilities to handle the many issues that arise as soon as work and data are distributed over multiple computers.
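The map-then-reduce data flow described above can be sketched in miniature on a single machine. The following Python word count is purely illustrative – Hadoop distributes these same phases across many commodity computers, with a shuffle step between nodes in place of the in-memory grouping shown here:

```python
from functools import reduce
from itertools import groupby

def map_phase(document):
    # map: emit a (key, value) pair for every word in the input split
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    # shuffle: group emitted pairs by key (in Hadoop this happens
    # over the network, routing each key to one reducer)
    pairs = sorted(pairs, key=lambda kv: kv[0])
    return {key: [v for _, v in group]
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

def reduce_phase(grouped):
    # reduce: combine each key's values into a single result
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in grouped.items()}

documents = ["big data big hype", "data data everywhere"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(pairs))
# counts == {"big": 2, "data": 3, "everywhere": 1, "hype": 1}
```

Because each mapper works only on its own split and each reducer only on its own keys, the phases can run on many machines at once – which is exactly how the workload gets spread over commodity hardware.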
Most importantly, big data technologies can handle a diverse set of data types, including traditional record-based data, text data, high-volume streaming data, graph data, spatial data and pretty much anything else an organization might want to store and use. A multitude of database systems exists to address each of these needs, and so far at least there is no single database that can handle all of them. Clearly this increases complexity considerably, with the need for a diverse set of skills – something only larger businesses can afford.
This does not mean we have to throw the baby out with the bathwater. Trusty RDBMS-based systems can probably be left alone unless there are serious capacity issues. The newer and less well catered-for types of data, however, will find a home in a big data environment. Text-based data, often around 80% of the data an organization handles, can be readily accommodated, as can huge volumes of social data, sensor data, streaming financial data, and so on. Big data allows us to place a few more pieces of the jigsaw on the board as we all strive to build a complete picture of what has happened, is happening and will happen in our businesses.
It should be clear by now that big data will almost certainly be an ongoing, substantial project, deserving of careful planning and well-defined business objectives. New opportunities will present themselves, most notably the ability to analyze huge amounts of data at lightning speed, making real-time analytics an achievable objective for those who need it.
The best strategy right now is to wait and see, unless of course there are serious pressing issues which simply cannot wait. In just a few years technology and skill costs will have dropped and lessons will have been learned (at the expense of others). There will inevitably be high-profile big data screw-ups where overconfidence and a poor appreciation of complexity led to over-ambitious projects. As always with information technology: start small, learn and grow.