“We find that whole communities suddenly fix their minds upon one object, and go mad in its pursuit; that millions of people become simultaneously impressed with one delusion, and run after it, till their attention is caught by some new folly more captivating than the first.”
Extraordinary Popular Delusions and the Madness of Crowds.
It’s starting to feel like a dot com bubble. Big data is everywhere – on TV, in magazines, on the web and on your coffee cup. Of course this hysteria is being whipped up by the second largest industry on the planet, namely the tech industry, and it is very hard to ignore. And once again, as with the dot com era, there is a general perception that funky names are best.
Ultimately these trendy new kids must sell to businesses, or face the inevitable fate that awaits startups at a time like this – go broke, or get swallowed up by the men in suits. So beneath the cute logos and ‘it’s so easy’ claims there is a need to show that this stuff works. Enter the anoraks and the suits.
By anoraks I mean the people who are in love with the technology, who are eager to jump on every new trend and every new product. By suits I mean the people who have to manage budget and make sure that an investment shows a return. Information technology has always been dominated by these two archetypes and Putt’s law encapsulates it very nicely – the people with budget do not understand the technology, and the people who do understand have no budget. And it is just as well really, or large corporations would become a technology playground.
So back to big data. The actual technology is clever, innovative, but hardly revolutionary. It’s just a different way of storing data so that much larger volumes can be handled with greater speed. It has some downsides – complexity being the most serious, but there is no such thing as a free lunch. The anoraks tend to overlook the downside – they are enthusiasts after all. The complexity issues associated with big data are of sufficient magnitude that I will make a prediction. In just a few years from now we will start to see disaster stories emerging. I’m an old hand in this industry and have seen the waves of irrational exuberance many times. Eventually the complexity issues will be addressed, but not before big data gets some big egg on its face.
The adolescent enthusiasms that grip the anoraks mean that way too much is expected in the short term, and that the really significant long term implications are overlooked. Right now we are full of over-expectation, but typically don’t give much thought to the long term implications of big data – which are sociological, as well as commercial.
So far we’ve only really talked about the data – more of it, processed at greater speed. The analytics which can be applied to this big data tend to get lumped in with the term. And here too we see unreasonable expectations of what this technology can deliver, and only scant attention to the dangers. If you get someone who has used analytics techniques for some time to spill the beans, they may well tell you that the process of building predictive models is not really all that scientific. In fact it’s all a bit clunky, although the process can be made somewhat more rational by engaging business domain experts – since they tend to understand the data and what makes sense. Even so, it is perfectly normal to apply a data mining algorithm to data which has been generated randomly and for the algorithm to find convincing patterns, both in the training and test data. More sophisticated validation techniques, such as leave-one-out and cross-validation, can also be conned. Data mining bias is very hard to avoid – and the harder you mine, the more likely you are to find what you are looking for. As a result we can say with some certainty that businesses are already using patterns that are nothing more than ghosts in the data. Here too we will start to see disaster stories emerging.
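For anyone who doubts that pure noise can pass a train/test check, here is a rough sketch of the effect in Python. It is only an illustration under assumptions of my own: the function names, sample sizes and the "keep the candidate that looks best on both sets" selection rule are hypothetical, not any particular product's method. Every "feature" and the target are coin flips, so there is nothing real to find.

```python
import random


def coin_flips(rng, n):
    """Return n random 0/1 values: pure noise, no signal at all."""
    return [rng.randint(0, 1) for _ in range(n)]


def accuracy(feature, target):
    """Fraction of rows where the feature value equals the target value."""
    hits = sum(1 for f, t in zip(feature, target) if f == t)
    return hits / len(target)


def mine_random_data(n_candidates=2000, n_train=40, n_test=40,
                     n_fresh=10000, seed=0):
    """Search many random 'predictors' and keep the one that looks best
    on BOTH the training and the test set (all sizes are illustrative)."""
    rng = random.Random(seed)
    y_train = coin_flips(rng, n_train)
    y_test = coin_flips(rng, n_test)

    best = None
    for j in range(n_candidates):
        train_acc = accuracy(coin_flips(rng, n_train), y_train)
        test_acc = accuracy(coin_flips(rng, n_test), y_test)
        # Rank each candidate by its weaker result, so the winner has to
        # look convincing on the training set AND on the test set.
        score = min(train_acc, test_acc)
        if best is None or score > best["score"]:
            best = {"score": score, "candidate": j,
                    "train_acc": train_acc, "test_acc": test_acc}

    # The winning 'rule' is still a coin flip, so on fresh data it can
    # only agree with a fresh random target about half of the time.
    best["fresh_acc"] = accuracy(coin_flips(rng, n_fresh),
                                 coin_flips(rng, n_fresh))
    return best


if __name__ == "__main__":
    result = mine_random_data()
    print("train accuracy:", result["train_acc"])
    print("test accuracy: ", result["test_acc"])
    print("fresh accuracy:", round(result["fresh_acc"], 3))
```

With a couple of thousand candidates and only 40 rows in each set, the winner should comfortably beat a coin toss on both training and test data, and then fall back to roughly 50% on data it was not selected against. The test set stops being independent the moment you use it to choose between thousands of candidates.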
The Big Data Bubble Countdown
- Present – 2015. It’s ‘big data’ everything. A proliferation of suppliers, with managers, technicians and consultants desperate to get ‘big data’ on their résumé.
- 2015 – 2017. Big data disasters, caused by massive over-confidence and poor appreciation of complexity, start to appear in the press. Many big data startups fail.
- 2017 onwards. Once the bubble has burst the big data tech market consolidates (a handful of large suppliers of tech), an oversupply of skills means rates fall, and the serious work begins. Big data is dead – long live big data.
- 2020 onwards. The next bubble begins.
I appreciate this is a very unpopular message, but some prudence is required. Right now we are being told that the mathematics behind analytics holds some sort of magical key – it doesn’t. I’m an ex-mathematician, but as Albert Einstein very wisely said, “As far as the laws of mathematics refer to reality, they are not certain; as far as they are certain, they do not refer to reality.”
The best strategy to pursue is to let the technology mature. This has the advantage of letting others pay for consultants to train, letting others make the cockups, letting the price of skills and technology fall, and letting you sleep more soundly at night. If there is a compelling reason to adopt big data technologies then proceed with caution – and whatever you do, do not trust the math. It’s just a sausage machine that will happily take garbage as input and deliver very convincing garbage as output (even with very nice p-values).