Several years ago John Ioannidis published a paper titled “Why Most Published Research Findings Are False”. It was aimed squarely at medical research, and in a nutshell it argued that even when the statistics for a study stack up, this is not enough to validate many research findings. When a study was repeated it very often failed to reach the same conclusions, and the failure rate for studies which are routinely accepted was claimed to be in excess of ninety per cent – hence the title of the paper.
The IT industry is currently enjoying one of the most riotous parties it has ever hosted. The star attractions are data mining and big data. No one has the slightest inclination to suggest that things might be getting slightly out of hand; they are all having such a good time – the hangovers are some way off yet. Consultants spar with each other to show how clever they are and how such-and-such corporation really could not have done without them. Suppliers are teasing claims out of their customers that profits have risen by some unlikely amount. Everyone wants to get on the bandwagon, and it will be some time before reports of significant project failures and business damage start to dampen the atmosphere (as they always do).
So how do these two phenomena connect? Well, it’s simple really. Like any powerful tool (nuclear energy, for example), there is always the possibility that the power can be used constructively or destructively, and data mining technologies are very powerful. So someone could quite legitimately publish another paper called “Why Most Data Mining Findings are False”, and it would contain more than a grain of truth. Using data mining technologies well calls for techniques that are simple to state but difficult to apply.
Here is the typical data mining process. We find we have a pile of data and suspect that there are some hidden and exploitable relationships within it. So the data is pre-processed to make it suitable for analysis, and we start to apply data mining techniques (I’ll include statistics within this). We are presented with a neat list of results which satisfy some requirement for statistical significance, and we congratulate ourselves. But wait a minute. Just how many combinations of variables were considered to find those patterns? Typically this might be millions, and so we would expect that hundreds of these might satisfy the usual statistical significance tests – just by chance. The patterns we have found might be real and they might not – there is no way of telling. So we get a little cleverer and split our data into two parts: one for finding patterns (the training data) and one for testing them, to see whether they hold up on data that has not been seen before (the test data). Out of the hundreds of patterns that are found, maybe ten hold up in the test data. But again, wait a minute. With a hundred patterns we would expect some to fall into the statistically significant category with the test data just by accident. So once again, which ones are real and which ones just happen to show statistical significance in both training and test data? And so on, and so on.
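The multiple-comparisons trap described above is easy to demonstrate for yourself. Here is a minimal sketch (assuming NumPy and SciPy are available): the data is pure random noise, so every “significant” pattern it finds is, by construction, spurious.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_rows, n_vars = 200, 50
# Pure noise: there are no real relationships in this data at all.
data = rng.normal(size=(n_rows, n_vars))

spurious, tests = 0, 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        r, p = stats.pearsonr(data[:, i], data[:, j])
        tests += 1
        if p < 0.05:          # the usual significance threshold
            spurious += 1

print(f"{spurious} of {tests} variable pairs look 'significant' by chance alone")
```

With 50 variables there are 1,225 pairs to test, so at the 5% level we should expect roughly sixty of them to pass the significance test despite being nothing but noise.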
There is an unspoken truth in data mining, and it is this: the harder you look for patterns, the more likely it is that the patterns you find are spurious. To be much more confident we need to repeat the data mining exercise on several sets of data. Happily there are ways to achieve this, k-fold cross-validation being one of them. It does, however, mean slicing the data set into several pieces, and as such long-term trends might get overlooked.
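As a rough sketch of how k-fold cross-validation slices the data (plain Python; the fold count and sample size are arbitrary, and real toolkits such as scikit-learn provide this ready-made):

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for f in range(k):
        size = fold_size + (1 if f < remainder else 0)
        folds.append(idx[start:start + size])
        start += size
    return folds

# Each fold takes a turn as the held-out test set;
# the remaining k-1 folds form the training set.
folds = k_fold_indices(100, 5)
for f, test_idx in enumerate(folds):
    train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
    # fit the model on train_idx, evaluate on test_idx,
    # then average the k scores for a more honest estimate
```

Note that the shuffling step is exactly what can obscure long-term trends: for time-ordered data a chronological split is usually more appropriate.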
If we can solve the problem of significance, the next monster that rears its ugly head is persistence. In some domains patterns are fairly persistent, meaning they apply over long time periods. Many of the sciences are afforded the luxury of persistent patterns, since physical and biological behaviour does not change very much over time. In business, however, we cannot make the same assumption. What was fashionable last year may be definitely uncool this year. The people who purchased a certain product over the last decade might not have done so the decade before, or during the coming decade. This is where human judgement comes into the equation, and there is no substitute for it. Those fiendishly clever data mining algorithms are unaware of nuances and the broader world.
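To make the persistence problem concrete, here is a deliberately synthetic sketch: a purchasing rule mined from last decade’s data that reverses in the current decade. The function names and the drift itself are invented purely for illustration.

```python
def buying_propensity(age, period):
    # Hypothetical drift: over-40s bought the product last decade,
    # under-40s buy it this decade - the pattern has reversed.
    if period == "last_decade":
        return 1 if age > 40 else 0
    return 1 if age <= 40 else 0

ages = range(20, 70)
last_decade = [(a, buying_propensity(a, "last_decade")) for a in ages]
this_decade = [(a, buying_propensity(a, "this_decade")) for a in ages]

def rule(age):
    # The "model" mined from last decade's data: predict a purchase if age > 40.
    return 1 if age > 40 else 0

acc_last = sum(rule(a) == y for a, y in last_decade) / len(last_decade)
acc_this = sum(rule(a) == y for a, y in this_decade) / len(this_decade)
print(f"accuracy last decade: {acc_last:.0%}, this decade: {acc_this:.0%}")
```

No amount of statistical validation on last decade’s data would have warned us; only domain knowledge about why the pattern held could do that.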
Data mining is actually a very subtle art with a dash of science thrown in. If you trust the science too much (as the banks did before the 2008 meltdown), then you are asking for trouble. Please do not trust a model that has been created on a single unpartitioned data set – there is a good chance it will be erroneous. And do engage domain experts to sanity-check the results.
Unfortunately the IT industry is keen to persuade us that analytics and data mining are things we can do by dragging and dropping a few icons, after which we are supposed to stand back in amazement. Eventually you might stand back in amazement – but for completely different reasons. Right now there are thousands of organisations using sexy data mining tools and putting the resulting models into production. We can safely assume that “Most Data Mining Findings are False” – it’s just not that easy.
Finally, do not be blinded by science, despite all its pretensions to superhuman knowledge and accuracy. To quote Albert Einstein:
“So far as the theories of mathematics are about reality, they are not certain; so far as they are certain, they are not about reality.”