If you have any experience at all of using data mining and statistical methods then, unless you are in a deep state of denial, you have probably found two very large elephants standing in the analytics room.
Generally speaking it is best to ignore them, for the sake of your career, and of course no one wants to be seen as a troublemaker. But they can, and very often do, trample over the best analytics efforts, and no amount of clever mathematics, big data technology or vendor PowerPoint presentations will scare these particular elephants away.
The first of these elephants is well known to the financial services industry. No doubt you have seen the ‘past performance is not an indicator of future performance’ disclaimer – a useful get-out for investment companies who do not have a clue what they are doing. The same applies to data mining and statistics, and it is to some extent captured by the statistical notion of stationarity: the assumption that the process generating the data behaves the same way tomorrow as it did yesterday. There is no earthly reason why past behaviours should continue into the future, and so all that effort to find patterns in training data from the past might be a totally futile exercise. People with blond hair who tend to purchase gold-rimmed spectacles might not oblige over the coming year. A judgement call is needed, along with close monitoring of the performance of predictive models.
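For the programmers in the room, the 'close monitoring' part can be sketched in a few lines of Python. This is only an illustration under made-up assumptions: the function names, the window of 50 predictions, the 0.6 accuracy floor and the invented customer stream are all hypothetical and not taken from any real system. The stream pretends that a 'blond customers buy gold-rimmed spectacles' rule is right four times in five for the first 200 customers and only once in five after that.

```python
from collections import deque


def drifting_stream(n_before=200, n_after=200):
    """Yield (predicted, actual) pairs for an invented customer stream.

    The model always predicts a purchase (1). In the first regime the
    prediction is right 4 times in 5; afterwards only 1 time in 5.
    """
    for i in range(n_before + n_after):
        predicted = 1
        if i < n_before:
            actual = 0 if i % 5 == 0 else 1   # the old behaviour
        else:
            actual = 1 if i % 5 == 0 else 0   # behaviour has changed
        yield predicted, actual


def monitor_accuracy(pairs, window=50, floor=0.6):
    """Yield (index, rolling_accuracy) whenever the accuracy over the
    last `window` predictions falls below `floor`."""
    recent = deque(maxlen=window)
    for i, (predicted, actual) in enumerate(pairs):
        recent.append(predicted == actual)
        if len(recent) == window:
            accuracy = sum(recent) / window
            if accuracy < floor:
                yield i, accuracy


alerts = list(monitor_accuracy(drifting_stream()))
first_index, first_accuracy = alerts[0]
print(f"first alert at customer {first_index}, "
      f"rolling accuracy {first_accuracy:.2f}")
```

Nothing here predicts when behaviour will change. It only shortens the time between the change and someone noticing it, which is where the judgement call comes in.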
The second elephant might be called the ‘seek and you will find’ elephant. Data mining will often oblige by finding patterns even though they may be spurious or unusable. This behaviour is variously called curve fitting, overfitting, data mining bias and so on. Testing patterns on data which has not been seen before is part of the solution, but even here blind chance can make a totally fictitious pattern look valid. And yes, you can be absolutely certain that there are thousands of businesses around the globe basing their activities on patterns that are simply fits to randomness.
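The second elephant is easy to summon in a small simulation. The sketch below is an illustration only: the function names, the 500 attributes and the sample sizes are arbitrary assumptions. Every customer attribute and every purchase outcome is an independent coin flip, so by construction there is nothing to find. The code then searches all 500 'customers with attribute k buy' rules for the one that fits best.

```python
import random


def make_customers(rng, n_customers, n_attributes):
    """Return (attributes, outcomes). Outcomes are coin flips that have
    nothing to do with the attributes."""
    attributes = [[rng.getrandbits(1) for _ in range(n_attributes)]
                  for _ in range(n_customers)]
    outcomes = [rng.getrandbits(1) for _ in range(n_customers)]
    return attributes, outcomes


def rule_accuracy(attributes, outcomes, k):
    """Accuracy of the rule 'a customer buys if and only if attribute k is 1'."""
    hits = sum(1 for row, y in zip(attributes, outcomes) if row[k] == y)
    return hits / len(outcomes)


def search_for_pattern(seed=0, n_attributes=500, n_seen=100, n_fresh=4000):
    rng = random.Random(seed)
    seen_x, seen_y = make_customers(rng, n_seen, n_attributes)
    # Seek and you will find: keep whichever rule scores best on the
    # data we happen to have.
    best_k = max(range(n_attributes),
                 key=lambda k: rule_accuracy(seen_x, seen_y, k))
    best_seen = rule_accuracy(seen_x, seen_y, best_k)
    # Score the same rule on customers that played no part in the search.
    fresh_x, fresh_y = make_customers(rng, n_fresh, n_attributes)
    best_fresh = rule_accuracy(fresh_x, fresh_y, best_k)
    return best_seen, best_fresh


best_seen, best_fresh = search_for_pattern()
print(f"best rule on the data searched: {best_seen:.2f}")
print(f"same rule on fresh customers:   {best_fresh:.2f}")
```

The winning rule should score well above a coin toss on the 100 customers that were searched and close to 0.5 on the fresh ones. The same arithmetic applies to a hold-out set: if enough candidate patterns are checked against the same 'unseen' data, one of them is likely to pass by luck alone.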
These two elephants are hardly ever discussed; it just isn’t polite. Suppliers have no interest in mentioning them, and neither do consultants, data scientists or managers eager to get ‘big data’, ‘predictive analytics’ or whatever the fashion of the day happens to be onto their CV. But they have been mentioned here just for the record, so that two or three years from now, when big data and data mining disasters are hitting the headlines, I can say ‘I told you so’!