It would be easy to think right now that the road to prosperity and fortune is paved with analytics tools. Supplier claims are endless – better sales, more accurate marketing, improved customer retention, and so on. As is usually the case with an exuberant technology market, no one cares to even suggest that there might be risks involved. With analytics there are plenty of risks, but you would have to be a real party skunk to point them out when the euphoria is so contagious!
Let’s start with the easy stuff – data visualization. We can now aggregate, manipulate, visualize and analyze our data in 3D, with animation and designer color schemes, and create visually stunning graphs and charts. The elephant in the room is whether the analyses actually mean anything, or whether this is simply the triumph of form over content. Various authors (Taleb, Kahneman) have repeated that our innate tendency to find patterns where none exist leads to all sorts of problems. That trend shown on a visually pleasing graphic – is it real or just a random accident? If it points up and shows increasing sales, then sales management will undoubtedly claim it is real. If it points down, then maybe it is just random variation. This political use of data visualization is a major use of the technology (hence its popularity). Knowing whether a visualization actually means anything requires some understanding of probability and statistics. No doubt some users will know how to determine the likelihood that a rising trend represents something real – but most don’t. You only have to look at the silly marketing messages put out by suppliers to appreciate that they are not appealing to reasoned analysis. The result of all this will be changes in business strategy based on nothing more than random noise dressed up to look like meaningful information.
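For readers who want to see what "determining the likelihood that a rising trend is real" can look like, here is a minimal sketch of a permutation test. It is only one of several possible approaches, and the function names, the number of shuffles and the sales figures are all made up for illustration. It also assumes the observations could plausibly be reordered at random, which is not true of strongly seasonal or autocorrelated data.

```python
import random

def slope(ys):
    """Least-squares slope of ys plotted against 0, 1, 2, ..."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

def trend_p_value(ys, n_shuffles=2000, seed=0):
    """Share of random reorderings whose upward slope is at least as steep
    as the observed one. A small value suggests the rise is unlikely to be
    an accident of ordering; a large value suggests it could easily be noise."""
    rng = random.Random(seed)
    observed = slope(ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if slope(shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)

# Hypothetical monthly sales figures, invented for this example.
monthly_sales = [102, 98, 105, 101, 99, 107, 103, 100, 106, 104, 108, 102]
print(f"slope per month: {slope(monthly_sales):.2f}")
print(f"chance of a slope this steep from ordering alone: {trend_p_value(monthly_sales):.3f}")
```

The point is not the particular test. It is that the question "is this trend real?" has a numerical answer that a chart alone does not give you.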
Let’s move on to data mining. The aim here is to trawl through historical data to find patterns of behavior which might be exploited. The technology suppliers are making it very easy for us to access large, diverse sets of data and with the press of a button generate a model that promises to predict future behavior. Applications include identifying targets for a marketing campaign, fraud detection, credit risk, hospital readmission and so on. The people who really know how to make this stuff work also know how difficult it is, and how the whole exercise is strewn with traps. It is very, very easy to generate predictive models from random data, and sometimes, purely by chance, they will even work on test data that has not been seen before. I have had demonstrations from suppliers who proudly showed a predictive model that delivered more than 90 percent accuracy on training data. This was proof of their technology as far as they were concerned. When I asked for the same model to be run on a test data set the predictive accuracy was near 50 percent – you might as well flip a coin. Less experienced people can now load data, generate a model and start using it with a speed that is frightening. And we should be frightened. The title of one of Taleb’s first books says it all – ‘Fooled By Randomness’ – and there is plenty of that happening right now.
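The gap between training accuracy and test accuracy is easy to reproduce. The sketch below is not the suppliers' technology. It uses a deliberately simple nearest-neighbor classifier, chosen here only because it memorizes its training data, and feeds it features and labels that are pure random noise. All names and sizes are assumptions made for the illustration.

```python
import random

def make_random_data(n_rows, n_features, rng):
    """Features and labels are both pure noise: there is nothing to learn."""
    X = [[rng.random() for _ in range(n_features)] for _ in range(n_rows)]
    y = [rng.randint(0, 1) for _ in range(n_rows)]
    return X, y

def predict_1nn(train_X, train_y, row):
    """Return the label of the closest training row (squared Euclidean distance)."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], row)),
    )
    return train_y[best]

def accuracy(train_X, train_y, X, y):
    """Fraction of rows in X whose predicted label matches y."""
    correct = sum(
        predict_1nn(train_X, train_y, row) == label for row, label in zip(X, y)
    )
    return correct / len(y)

rng = random.Random(42)
train_X, train_y = make_random_data(200, 10, rng)
test_X, test_y = make_random_data(200, 10, rng)

train_acc = accuracy(train_X, train_y, train_X, train_y)
test_acc = accuracy(train_X, train_y, test_X, test_y)
print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
```

On the training rows the model scores perfectly, because each row's nearest neighbor is itself. On fresh random rows it should land close to a coin flip. A demonstration that reports only the first number tells you nothing.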
Finally, let’s create a stink around big data – one of the most successful buzzwords ever created by the tech industry. A few days ago I was speaking with the CEO of an analytics company who surprised me by saying that big data was just so much rhetoric. He then quoted a professor from a German university who quite rightly said that analyzing two billion data instances is not really likely to show much more than analyzing a hundred thousand. It’s a delicate topic, but the volume of data needed to create useful insights and models is usually relatively small. Of course big data is concerned with other issues too, but the word ‘big’ is actually fairly meaningless. Those riding the big data technology wave with gusto can expect complexity to become a pressing issue. Goodness knows we’ve found it difficult enough to manage relatively straightforward relational databases. Add in text, spatial, streaming and other data sources and images of the sorcerer’s apprentice come to mind.
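One way to get a feel for the professor's point is to look at how slowly precision improves with sample size. The sketch below uses the textbook margin-of-error approximation for a proportion. It assumes independent, randomly sampled rows and a single quantity being estimated, so it is an illustration of diminishing returns and not a rule for every problem: rare events, or models sliced across many small segments, can genuinely need far more data.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion estimated
    from n independent samples (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (1_000, 100_000, 2_000_000_000):
    points = 100 * margin_of_error(n)
    print(f"n = {n:>13,}: +/- {points:.4f} percentage points")
```

At a hundred thousand rows the uncertainty is already around 0.31 percentage points. Going to two billion rows shrinks it to roughly 0.002 percentage points, a difference that rarely changes a business decision.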
So having created something of a smell at this party, it is worth saying that all these technologies are immensely valuable. It is the way they are being used that is worrying. The best strategy is to make small moves, ignore supplier claims, have at least one person around who understands the statistics of randomness, learn from the mistakes of others by not being too eager to adopt these technologies, and above all trust people more than machines. People know that cheese production and sheep farming in Bangladesh do not affect the world’s major stock indexes – yet a real predictive model built on just such a relationship was reported in the Wall Street Journal.