Some of the predictive models currently used by organizations of all types will be causing damage. Suppliers and consultants don’t talk about this because it isn’t going to generate business (quite the opposite). When a few disaster stories eventually hit the headlines, no doubt everyone will become more interested in predictive model error rates.
First of all we need to clarify a point. The term ‘error rate’ is used in data mining to describe the rate of false predictions a model gives. In this article I’m using it at a higher level – the number of erroneous models as a fraction of all models. Yes, organizations are deploying defective models for a variety of reasons – some of which can be addressed, and some of which cannot. Here is a rundown of why some of the predictive models used in your organization might be erroneous (that is, their apparent prediction rates are statistically meaningless):
- The people developing models do not understand the nuances and procedures necessary to ensure a model is fit for purpose (the most common reason). Do not assume that a PhD Statistician will have the requisite knowledge – their thesis may well have been concerned with some obscure mathematical properties of a particular distribution function. Finding people who know about the application of important techniques such as the Bonferroni correction is not so easy.
- Your data may be messed up and simply unable to deliver accurate models. Even mild amounts of mess can produce incorrect models (this is particularly true of techniques such as decision trees, which are inherently unstable).
- Suppliers have convinced your management that it’s all about dragging visually appealing icons around a graphical interface and pressing the ‘go’ button. The ‘ease-of-use’ promise is as old as the hills, and is the technology supplier’s best friend when selling to confused managers. The trouble is that the pitch works as a sales tactic, but the approach always leads to problems later.
- Searching through a large space of variable, variable value, parameter and data set combinations means that a very high percentage of the combinations examined are meaningless – but of course your algorithms do not know this. Such a scenario is ideal for a simple application of Bayes’ rule, which invariably shows that error rates are going to be much higher than one might imagine. Read Big Data – Fool’s Gold if you want more on this topic.
- Political pressure. ‘We’ve got piles of this item in stock. Produce a predictive model that shows that if we drop its price by 50% we will also sell 200% more of this other item. Oh, and by the way, if it all goes belly-up I’ll blame the model.’ – the Sales Director. There is not much to say here, other than that this was common practice in the banks prior to the 2008 collapse – and no doubt still is.
- Things change. The data used to build a model are always historical (unless you have discovered time travel). What was fashionable one year (remember Facebook) might not be fashionable the next. Predictive models assume everything remains the same – it doesn’t.
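To make the Bayes rule point in the list above concrete, here is a minimal sketch. The numbers are purely illustrative assumptions, not measurements: suppose only 1 in 100 of the candidate patterns searched is real, the method detects a real pattern 80% of the time, and it flags a meaningless pattern 5% of the time. The function name is my own invention for this example.

```python
def fraction_of_findings_that_are_real(prior, power, false_positive_rate):
    """Bayes rule: probability a flagged pattern is real, P(real | flagged)."""
    true_hits = prior * power                          # real patterns that get flagged
    false_hits = (1.0 - prior) * false_positive_rate   # meaningless patterns that get flagged
    return true_hits / (true_hits + false_hits)


# Illustrative assumptions only (none of these figures come from real data).
prior = 0.01   # share of searched combinations that are genuinely meaningful
power = 0.80   # chance a genuine pattern is detected
alpha = 0.05   # chance a meaningless pattern is flagged anyway

ppv = fraction_of_findings_that_are_real(prior, power, alpha)
print(f"{ppv:.1%} of flagged patterns are real")
```

Under these assumed figures only about 14% of the patterns flagged are real, so roughly 86% of the 'discoveries' are spurious, even though a 5% false positive rate sounds perfectly respectable.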
I imagine there are other reasons, but six is already one too many for me. Reasons 1 and 2 are addressable – 3, 4, 5 and 6 probably not.
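As an example of the kind of know-how reason 1 refers to, here is a rough sketch of the Bonferroni correction: when m hypotheses are tested at once, each p-value is compared against alpha / m rather than alpha. The p-values below are hypothetical, and the function name is just a label for this illustration.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant using the Bonferroni threshold alpha / m."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)  # stricter cut-off as more tests are run
    return [p <= threshold for p in p_values]


# Hypothetical p-values from four tests run on the same data set.
p_values = [0.001, 0.02, 0.04, 0.30]

naive = [p <= 0.05 for p in p_values]          # no correction applied
corrected = bonferroni_significant(p_values)   # threshold is 0.05 / 4 = 0.0125

print(naive)      # three of the four tests look significant
print(corrected)  # only the first survives the correction
```

With only four tests the correction already discards two of the three apparent findings. Bonferroni is conservative, and other corrections exist, but the principle is the same: the more things you test, the stronger the evidence each one needs.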
Predictive models are being used to great effect, but everyone will also be using models that are just plain wrong. The key to eliminating defective models is monitoring and management on an ongoing basis. Without such vigilance you may just end up with the dumbest smart applications in your industry.
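What ongoing monitoring looks like will vary, but a very simple sketch is to compare a model's recent hit rate against the accuracy it showed when it was signed off, and raise a flag when it drops too far. The function name, the tolerance of 0.05 and the sample outcomes are all assumptions made for illustration.

```python
def needs_review(baseline_accuracy, recent_outcomes, tolerance=0.05):
    """Return True if recent accuracy has fallen more than `tolerance` below baseline.

    recent_outcomes is a list of (predicted, actual) pairs.
    """
    if not recent_outcomes:
        raise ValueError("no recent outcomes to evaluate")
    correct = sum(1 for predicted, actual in recent_outcomes if predicted == actual)
    recent_accuracy = correct / len(recent_outcomes)
    return recent_accuracy < baseline_accuracy - tolerance


# Hypothetical recent results: 6 correct predictions out of 10.
recent = [(1, 1)] * 6 + [(1, 0)] * 4

if needs_review(0.80, recent):
    print("Model accuracy has degraded; schedule a review.")
```

A real monitoring process would also need enough recent cases for the comparison to mean anything, which is the same statistical care the rest of this article argues for.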