Of course it’s easy to sit back and say ‘I told you so’, but there are several pre-Brexit articles on this web site about the dangers of big data and analytics; they are listed at the end. That the poll forecasters got it so wrong for Brexit and the US Presidential election, and that Hillary Clinton’s big data effort seemed so ineffective, points to something. After half a decade of unadulterated big data hype, some dissenters are starting to come out of the woodwork. There have been voices of reason all through this journey, but they are typically not listened to. Michael Jordan (no, not the sports personality) has been warning for some time that big data isn’t just bigger small data. He is a respected academic, and also sits on the boards of several big data analytics technology companies. His talks stress that big data is new, and that the problems associated with it are not merely poorly understood: we don’t yet know what the problems are.
This isn’t an attempt to diminish the importance of big data, but it is an attempt to urge caution, because we don’t know where the monsters are hiding. The best example of big data misuse I have come across, also mentioned in other articles, is the UK insurance company that uses 400 features (attributes, variables) in its big data analytics. This is like saying to the data: show me every spurious pattern. Because there are so many, some of them will be certified as sane. The danger of big data is that power laws run rampant. 400 variables give rise to 2^400 combinations of variables, more than the number of atoms in the universe. Such is the nature of a power law (strictly speaking, exponential growth): 2^n in this case. As Taleb says in his book Antifragile:
A very rarely discussed property of data: it is toxic in large quantities – even in moderate quantities.
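To make the arithmetic concrete, here is a minimal Python sketch. It assumes purely random data, so any "pattern" it finds is spurious by construction. The function names, the 100-row sample size, the 0.2 correlation threshold and the figure of roughly 10^80 atoms in the observable universe are all illustrative assumptions, not details from the insurance example.

```python
import random
import statistics


def subset_count(n_variables: int) -> int:
    """Number of distinct subsets of n variables: 2^n."""
    return 2 ** n_variables


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spurious_hits(n_rows=100, n_features=400, threshold=0.2, seed=0):
    """Count random features that 'correlate' with a random target.

    Every column is independent noise, so each hit is a false pattern.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    hits = 0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        if abs(pearson(feature, target)) > threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    # Rough, commonly quoted estimate for atoms in the observable universe.
    atoms_estimate = 10 ** 80
    print(subset_count(400) > atoms_estimate)
    print(spurious_hits(), "of 400 noise features look 'significant'")
```

This only looks at single features against one target. Searching over combinations of features, as the 2^n figure implies, makes the supply of false patterns vastly larger.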
But there is a human element in all of this too. Perhaps the most widely read paper on statistics during the past decade is Why Most Published Research Findings Are False, by a brave and very smart statistician, John Ioannidis. His findings refer specifically to life sciences research, but the lessons have broad application. In fact Bayer was so challenged by this paper that it conducted research to see whether it was true. Sure enough, two-thirds of its analyses could not be confirmed when they were repeated. In essence the paper says that when analyzing data we tend to get what we are looking for: we confirm our biases. This is probably not too wide of the mark when considering Brexit and Trump’s victory. Very few expected the British to reject the EU and very few expected Trump to win, and the analysis was a classic case of confirming biases.
I wrote a short article prior to the Trump victory claiming that big data may well be damaging value in businesses. After the failure of most polling organizations to get the Brexit and Trump forecasts right, and the failure of Clinton’s big data efforts, such a claim does not seem all that far-fetched.
So what is the solution? Firstly, we should avoid inviting power laws into our analysis. Fewer attributes and smaller amounts of data are safer. The recent book Signal by Stephen Few throws doubt on the need for big data in many business applications anyway. Depending on the task at hand, ten thousand rows of data may tell you just as much as ten million. Of course some large corporations need to store huge amounts of data, and there is nothing wrong with that. It’s when we come to analyze it that the fun begins.
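The point about ten thousand rows can be illustrated with a small Python sketch. It assumes the task is estimating a simple average, here a made-up "spend per customer" figure with mean 50 and standard deviation 20. The function name and all the numbers are hypothetical, and the population is one million rows rather than ten million only so that it runs quickly.

```python
import random
import statistics


def compare_sample_to_full(population_size=1_000_000, sample_size=10_000, seed=1):
    """Return (mean of all rows, mean of a random sample of the rows)."""
    rng = random.Random(seed)
    # Hypothetical spend-per-customer values: mean 50, standard deviation 20.
    population = [rng.gauss(50, 20) for _ in range(population_size)]
    # Simple random sample without replacement.
    sample = rng.sample(population, sample_size)
    return statistics.fmean(population), statistics.fmean(sample)


if __name__ == "__main__":
    full_mean, sample_mean = compare_sample_to_full()
    print(f"all rows:    {full_mean:.2f}")
    print(f"10,000 rows: {sample_mean:.2f}")
```

With a standard deviation of 20, the standard error of a 10,000-row sample mean is about 20 / sqrt(10,000) = 0.2, so the two figures should agree closely. This holds for simple aggregates like this one. Rare events or very fine-grained segments are among the tasks where more rows genuinely help.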
Finally, there is another elephant in the room. Oddly enough, the application of deep learning to object recognition in images is a fairly safe use of machine learning. A cat is a cat is a cat, even a thousand years from now. The future is not going to nullify our finest efforts at recognizing cats, people, cars, bicycles and so on. But businesses do not have this luxury. The sentiments and preferences of customers change by the day. All that big data analysis of historical data counts for nothing if a trendy new fashion appears on the scene. It’s one of the ironies of machine learning and big data that the seemingly difficult tasks, such as object recognition, language processing and drug discovery, are actually relatively easy and safe. Trying to use big data to predict how customers will behave is like shooting at a moving target with no idea where it will move next. The positioning of big data analytics to businesses is usually trite and massively oversimplified. So the motto to follow is: small is beautiful, except when it isn’t. That way we avoid the power laws as much as possible, and make life somewhat easier for ourselves. If we do have to stir the power law beast from its slumber, then we had better be at least partially aware of what we are dealing with.