Here is a real idiot’s guide to predictive analytics:
- Get the relevant data.
- Let the predictive analytics algorithms loose on the data.
- Find the best patterns.
- Apply them to new data to predict outcomes – who might be a bad loan risk, which customers will respond to an offer, and so on.
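Taken at face value, the four steps above can be sketched in a few lines. Here is a deliberately naive Python "pattern miner" – the function names and the loan records are purely illustrative, not any vendor's API:

```python
from collections import Counter

def find_best_pattern(records):
    """Steps 2-3: 'let the algorithm loose' and find the best pattern.
    This toy miner just picks the attribute value that co-occurs most
    often with a positive outcome in the sample it happens to see."""
    counts = Counter()
    for attrs, outcome in records:
        if outcome:
            for key, value in attrs.items():
                counts[(key, value)] += 1
    return counts.most_common(1)[0][0]

def predict(pattern, attrs):
    """Step 4: apply the discovered pattern to a new record."""
    key, value = pattern
    return attrs.get(key) == value

# Step 1: get the relevant data (hypothetical loan-response records).
training = [
    ({"age_band": "young", "zip": "90210"}, True),
    ({"age_band": "young", "zip": "10001"}, True),
    ({"age_band": "old",   "zip": "90210"}, False),
]
pattern = find_best_pattern(training)
print(pattern)                                        # ('age_band', 'young')
print(predict(pattern, {"age_band": "young", "zip": "60601"}))
```

The trouble, as the rest of this piece argues, is the middle step: "best" here just means "most frequent in the sample we happened to see".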
Suppliers may talk in these terms because it is in their interest to make the process sound easy and risk-free – the opposite is true. There are many reasons why your predictive models might end up being more of a liability than an asset, but I'll focus on just one: curve fitting (better known as overfitting), which goes by several other names too. An example will clarify. Imagine tossing a coin ten times, recording each outcome as H for heads or T for tails. Let's say the outcome is H H T T H H T H H T.
Now any pattern-detection software worth its salt will proudly deliver the following pattern: after two heads, a tail follows. But wait a minute. If you are willing to wager that this will keep happening, ruin is almost certain to be the outcome. Each flip of the coin is an independent random event. We all know this.
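A quick simulation makes the point. This minimal Python sketch (the names are mine) checks the "after two heads comes a tail" rule on the observed flips, then measures how often it holds on fresh, independent flips:

```python
import random

def rule_holds_everywhere(seq):
    """Does 'after two heads comes a tail' hold at every opportunity in seq?"""
    return all(seq[i + 2] == "T"
               for i in range(len(seq) - 2)
               if seq[i:i + 2] == "HH")

def fresh_hit_rate(n=100_000, seed=1):
    """Hit rate of the same rule on brand-new independent coin flips."""
    rng = random.Random(seed)
    seq = "".join(rng.choice("HT") for _ in range(n))
    chances = [i for i in range(n - 2) if seq[i:i + 2] == "HH"]
    hits = sum(seq[i + 2] == "T" for i in chances)
    return hits / len(chances)

sample = "HHTTHHTHHT"
print(rule_holds_everywhere(sample))  # the 'pattern' is perfect in-sample
print(fresh_hit_rate())               # but roughly 0.5 out of sample
```

Perfect on the ten flips we saw; a coin toss on everything we haven't.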
Now scale the whole thing up to billions of records in a database with possibly hundreds of attributes (Name, Phone, Age, Salary, Zip, etc.). Is it possible that random patterns appear in the data to mislead us? Yes, absolutely. The people who conduct the analysis generally know about these things, and so they will reserve part of the data set to test a pattern. In fact they may use something called k-fold cross-validation, where the reserved section of data is varied across multiple attempts to build a model. But look at our sequence of heads and tails: if we had reserved the last three flips to test our hypothesis, we would still have concluded that it is true. These random patterns, which data mining algorithms will happily throw out as candidate predictive models, are just ghosts in your data.
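Here is that holdout trap, sketched on the same ten flips: reserve the last three (H H T) as a test set, and the spurious rule still passes validation:

```python
def rule_confirmed(seq):
    """Check 'after two heads comes a tail' wherever it can be checked in seq.
    Returns None if the sequence offers no opportunity to test the rule."""
    outcomes = [seq[i + 2] == "T"
                for i in range(len(seq) - 2)
                if seq[i:i + 2] == "HH"]
    return all(outcomes) if outcomes else None

flips = "HHTTHHTHHT"
train, test = flips[:7], flips[7:]   # reserve the last three flips: H H T

print(rule_confirmed(train))  # True: pattern 'discovered' on the training data
print(rule_confirmed(test))   # True: and 'validated' on the held-out flips
```

Holdout testing (and k-fold cross-validation) reduces the risk of being fooled; on a small or unlucky sample it cannot eliminate it.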
The whole issue of whether a pattern is valid or not is actually extremely complicated, and well beyond the understanding of many who practice in this field. The more frequently a data set is interrogated for patterns, the more likely we are to find them, and the more likely they are to be spurious – this is called data mining bias, among other things. Big data with thousands of attributes brings problems of its own, and despite the popular view that billions of records can't tell lies, in reality they tell a whole new genre of lies.
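Data mining bias is easy to demonstrate. In the sketch below (pure Python, illustrative numbers), a thousand columns of pure noise are "interrogated" against a random target; the best of them looks impressively predictive in-sample and collapses to chance on fresh data:

```python
import random

rng = random.Random(0)

def accuracy(col, target):
    """Fraction of rows where a binary column matches the binary target."""
    return sum(c == t for c, t in zip(col, target)) / len(target)

n_rows, n_cols = 20, 1000
target  = [rng.choice((0, 1)) for _ in range(n_rows)]
columns = [[rng.choice((0, 1)) for _ in range(n_rows)] for _ in range(n_cols)]

# Interrogate the data 1000 times: the best noise column 'predicts'
# the target well in-sample, purely by chance.
best = max(columns, key=lambda col: accuracy(col, target))
print("in-sample accuracy:", accuracy(best, target))   # well above 0.5

# On fresh rows from the same structureless process the magic vanishes.
fresh_col    = [rng.choice((0, 1)) for _ in range(200)]
fresh_target = [rng.choice((0, 1)) for _ in range(200)]
print("fresh-data accuracy:", accuracy(fresh_col, fresh_target))  # near 0.5
```

The more columns you try, the more impressive the in-sample winner looks – and the more meaningless that impressiveness is.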
Fortunately there is a sanity check on all of this. Domain experts will know whether a pattern makes sense. If it doesn't, it may just be newly discovered knowledge – but with a domain expert at hand we can stay suspicious until those suspicions are laid to rest. Just another example of how truly stupid computers are.