It’s becoming fairly well known that big data can be problematic, particularly when we come to analysis. Perhaps the best known issue is the combinatorial explosion that accompanies the proliferation of attributes and features often associated with big data. Instead of holding maybe twenty or thirty customer attributes, we can now store hundreds and engineer hundreds more. Since most machine learning methods combine these attributes, we soon run into exponential growth (2^n combinations of n attributes, to be precise) where billions of combinations create all sorts of ghosts in the data – the appearance of significant patterns which turn out to be nothing other than random noise making itself look respectable.
So that data science teams can get to grips with this phenomenon, it would be wise to create a big dataset populated with wholly random values. The attributes should be the same as those used in the genuine article, but the instances should be populated with random values. This is not difficult to do. We can ascertain the distribution of values in a real database fairly easily, and then populate our phony database with a similar distribution of values. Anyone familiar with R and/or Python should be able to do this.
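One way to build such a phony database is sketched below in plain Python (the function name and the toy data are hypothetical). Sampling each column independently, with replacement, from the real column preserves every per-attribute distribution while destroying all cross-attribute structure – so any pattern found in the result is noise by construction.

```python
import random

def make_phony_table(real_rows, n_rows, seed=0):
    """Build a synthetic table whose columns match the real table's
    marginal value distributions but whose rows are pure noise.

    Each column is sampled independently (with replacement) from the
    corresponding real column, so per-attribute distributions survive
    while every cross-attribute relationship is destroyed.
    """
    rng = random.Random(seed)
    columns = list(zip(*real_rows))                   # column-wise view
    phony_columns = [[rng.choice(col) for _ in range(n_rows)]
                     for col in columns]
    return [list(row) for row in zip(*phony_columns)]

# Toy 'real' data: (region, product, churned) -- hypothetical attributes.
real = [("north", "A", 1), ("south", "B", 0),
        ("north", "B", 0), ("east", "A", 1)]
phony = make_phony_table(real, n_rows=6)
```

For continuous attributes a histogram or kernel estimate would replace the simple bootstrap, but the principle is the same.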
We should then apply our machine learning algorithms to the random data and note how many ‘useful’ patterns emerge – particularly as we increase the number of attributes we make available to the algorithms. By contrast, we could think carefully about which attributes are made available, and ascertain whether we are still getting patterns.
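The attribute-count effect is easy to demonstrate without any machine learning library at all. The sketch below (hypothetical names, plain Python) correlates each of many random attributes with a random target and counts how many clear a threshold an analyst might treat as interesting – every hit is a ghost, since there is no signal anywhere in the data.

```python
import random
from statistics import mean

def pearson(x, y):
    """Plain sample Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spurious_hits(n_attrs, n_rows=200, threshold=0.2, seed=1):
    """Count random attributes whose correlation with a random target
    exceeds `threshold`. All data is noise, so every hit is spurious."""
    rng = random.Random(seed)
    target = [rng.random() for _ in range(n_rows)]
    attrs = ([rng.random() for _ in range(n_rows)] for _ in range(n_attrs))
    return sum(1 for attr in attrs if abs(pearson(attr, target)) > threshold)
```

With a handful of attributes the count is usually zero; with thousands, a steady trickle of ‘significant’ correlations appears out of nothing.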
It would also be instructive to give this data to people who analyze data visually, and see how many ‘insights’ they get. We could of course be sneaky and run something like a blind trial, so that the users of the data do not know whether they are analyzing real or random data.
The whole point of this exercise is to instill caution in those who believe that patterns discovered by machine learning algorithms, or those that look significant on a graph, are necessarily real. It should be added that even validation techniques such as cross-validation can be fooled by random data if the number of candidate patterns is large enough.
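As a concrete illustration of how validation gets fooled, the sketch below (hypothetical `best_feature_accuracy`, plain Python) selects the best of many random features using all the rows. Because the selection step has already seen every row, the winner’s score is inflated well above the 50% it deserves – and cross-validating on those same rows after selection inherits the same bias, which is why feature selection must happen inside each fold.

```python
import random

def best_feature_accuracy(n_feats, n_rows=40, seed=2):
    """Generate random binary features and a random binary target,
    then report the accuracy of the feature that best matches the
    target across ALL rows.

    Nothing here is predictive, yet with enough candidate features
    the best in-sample score looks impressively high. Any validation
    run on these same rows after this selection step reports a
    similarly inflated number.
    """
    rng = random.Random(seed)
    target = [rng.randint(0, 1) for _ in range(n_rows)]
    feats = [[rng.randint(0, 1) for _ in range(n_rows)]
             for _ in range(n_feats)]

    def acc(f):
        return sum(a == b for a, b in zip(f, target)) / n_rows

    return max(acc(f) for f in feats)
```

On genuinely fresh rows the chosen feature would score around 50%, since it carries no signal at all.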
There are many other problems associated with big data, not least the long tail effect. Most databases are dominated by a few values – maybe 80% of your sales relate to 20% of the products your business sells. These long tail effects mean that even in very large databases, the data available for analysis of many of the variable values might be quite sparse. And we shouldn’t think that adding yet more data will address the issue – all we do is create sparse data for new variable values.
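The 80/20 effect is easy to simulate. In the sketch below (hypothetical names; a Zipf-like popularity law stands in for real sales), even a table of 100,000 sales leaves hundreds of products with too few rows for any reliable analysis.

```python
import random
from collections import Counter

def simulate_sales(n_sales, n_products=1000, seed=3):
    """Simulate sales where product i is bought with probability
    proportional to 1/(i+1) -- a Zipf-like long tail."""
    rng = random.Random(seed)
    weights = [1.0 / (i + 1) for i in range(n_products)]
    return Counter(rng.choices(range(n_products), weights=weights, k=n_sales))

counts = simulate_sales(100_000)
# A small head of products dominates the table...
top_share = sum(c for _, c in counts.most_common(200)) / 100_000
# ...while hundreds of tail products are too sparse to analyze.
sparse = sum(1 for p in range(1000) if counts.get(p, 0) < 30)
```

Adding more sales rows mostly fattens the head and adds new near-empty tail products; the sparsity problem does not go away.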
That many businesses using big data will be making these kinds of error is a certainty; indeed, I have come across some in my own experience. The solution, as always, is to tread carefully and apply a good measure of common sense – something that machine learning has yet to master. A big data analytics benchmark of this nature will soon sort out the wheat from the chaff.
Michael Jordan – advisor to multiple analytics technology companies and Professor at Berkeley – talks about some of these topics in an IEEE interview.