Guide to Predictive Analytics
The goal of predictive analytics software is to create models that can predict the outcome of future events more accurately than existing methods. This is achieved by trawling through historical data, looking for patterns that might be useful predictors of future events. A real, practical example is that of a bank deciding whether a loan should be granted to an applicant. By analysing historical data, patterns can be found which indicate whether the applicant is a good risk or not. An example of such a pattern might be: if someone earns over US$60,000 a year, has no mortgage and no dependents, they are a good credit risk. This pattern, learned from historical data, is then applied to new loan applicants. In reality there may be hundreds of such patterns, which might be stored in a business rule management system.
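A learned pattern of this kind amounts to a simple rule. Here is a minimal sketch in Python, where the applicant fields (`income`, `has_mortgage`, `dependents`) and the threshold are illustrative assumptions, not a real bank's criteria:

```python
def is_good_credit_risk(applicant):
    """Apply one learned pattern: high income, no mortgage, no dependents.
    The field names and the $60,000 threshold are illustrative only."""
    return (applicant["income"] > 60_000
            and not applicant["has_mortgage"]
            and applicant["dependents"] == 0)

applicant = {"income": 72_000, "has_mortgage": False, "dependents": 0}
print(is_good_credit_risk(applicant))  # → True
```

A business rule management system would hold hundreds of such rules and evaluate them against each new application.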
Let’s get some terminology out of the way. Predictive patterns are found by data mining – the act of analysing historical data with the aim of finding useful patterns. We are, in effect, ‘mining’ data. Data mining in turn uses a variety of algorithms, many of which come from the field of machine learning. As the name suggests, a machine learns something when it can perform a task more effectively by using a discovered pattern. In the example above, we might mine data using machine learning algorithms, and the resulting patterns might help the machine (computer) decide which loan applicants should be approved, more effectively than before. So predictive analytics employs data mining methods, which in turn employ machine learning algorithms.
It turns out there are four primary types of data mining. These are:
- Classification – where we want to classify new data as it comes in. In the loan example, we simply want to classify applications as approved or declined. Some classification schemes might have many classes – approve with favourable rates, approve with standard rate, refer to manual approval, decline. Classification is the most common form of predictive analytics and is widely employed in diverse business applications.
- Regression – is when we want to predict a value. In the loan example we might want to predict the loan threshold a candidate should be eligible for. Another example might be the prediction of next month’s sales.
- Clustering – is the act of looking for groups of entities that are similar in some way. Clusters are groups of objects which are near to each other, using some measure of distance. This is often used in predictive marketing applications where we are looking for prospects and customers who share similar profiles.
- Association – looks for items that are associated with each other in some way. The classic application is market basket analysis – identifying items which tend to be purchased together. If shoppers typically buy bread when they buy milk, then it might make sense to move the dairy near to the bakery.
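The market basket idea can be sketched in a few lines of Python that simply count how often pairs of items occur in the same basket. The baskets and item names below are invented for illustration, and real association mining also weighs measures such as support and confidence rather than raw counts:

```python
from itertools import combinations
from collections import Counter

# Each basket is the set of items bought together in one transaction (made-up data).
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pair suggests an association worth investigating.
print(pair_counts.most_common(1))  # → [(('bread', 'milk'), 3)]
```

In this toy data, bread and milk co-occur in three of the four baskets – the kind of finding that might prompt the dairy-near-the-bakery decision above.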
There are two modes that can be used in data mining. Supervised mining is when we look for patterns in historical data and give the algorithm the answer for each instance of data. So in the loan application example, we not only supply all the previous loan candidate details, but we also supply whether a loan was repaid without any hitches. In this way we are ‘supervising’ the learning – telling the algorithm what works and what doesn’t. Once it has learned the relevant patterns they can be applied to new loan applicants.
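As a minimal sketch of supervised learning, the following Python classifies a new applicant by the label of its single nearest historical example (a one-nearest-neighbour rule – just one of many possible algorithms). The training data, field choices and labels are hypothetical:

```python
def nearest_neighbour_predict(train, new_point):
    """Predict the label of new_point from its single nearest training example."""
    def sq_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, label = min(train, key=lambda row: sq_distance(row[0], new_point))
    return label

# Each supervised instance pairs features with the known outcome ("the answer"):
# ((income in $1000s, number of dependents), outcome) – invented data.
history = [
    ((75, 0), "repaid"),
    ((62, 1), "repaid"),
    ((30, 3), "defaulted"),
    ((25, 2), "defaulted"),
]

print(nearest_neighbour_predict(history, (70, 0)))  # → repaid
```

The key supervised ingredient is the outcome attached to every historical instance – without those labels the algorithm would have nothing to learn from.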
Unsupervised learning simply requires that we present the data to the algorithms and let them pick out interesting patterns. In association mining for example we simply provide a database of shopping baskets – items that were purchased together in a single basket. The algorithm then trawls through the data looking for items that were commonly purchased together – without any guidance from us. Clustering is also usually unsupervised – the algorithm homing in on the attributes that cluster the data most effectively.
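Clustering can be illustrated with a stripped-down two-centre k-means in pure Python: each point is assigned to its nearer centre, and each centre is then moved to the mean of its points. This is a sketch only – the first-and-last-point initialisation is a simplification (real implementations use random restarts), and the customer data is invented:

```python
def two_means(points, iterations=10):
    """Minimal 2-means clustering: assign each point to the nearer of two
    centres, then move each centre to the mean of its assigned points."""
    def sq_dist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    # Naive initialisation: first and last point (a simplifying assumption).
    centres = [points[0], points[-1]]
    for _ in range(iterations):
        clusters = ([], [])
        for p in points:
            nearer = 0 if sq_dist(p, centres[0]) <= sq_dist(p, centres[1]) else 1
            clusters[nearer].append(p)
        centres = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            for c in clusters
        ]
    return centres

# Two obvious groups: low-spend/low-visit and high-spend/high-visit customers.
points = [(1, 2), (2, 1), (1, 1), (9, 10), (10, 9), (10, 10)]
print(two_means(points))  # → centres near (1.3, 1.3) and (9.7, 9.7)
```

Notice that no labels were supplied – the algorithm homes in on the two groups purely from the distances between points.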
If it were as simple as throwing data at algorithms and discovering useful patterns, we could all mine data and apply the results. Unfortunately, as clever as the technology seems, it is actually quite stupid. Machine learning algorithms not only find useful patterns, they also find patterns that have no bearing on reality at all. All data contains noise – data instances that carry no meaning whatsoever. Unfortunately the algorithms do not know this, and so they will find patterns in the noise and present them as valid. To avoid this behaviour we split our data into two parts. The first part is used to find patterns, and the second part is used to test them. The basic idea is that the random noise in the training data will not be duplicated in the test data. This certainly weeds out many false patterns, but not all of them. And so things got a bit more sophisticated with something called k-fold cross-validation: the training and test segments of data are repeatedly changed for each execution of the algorithms. This is much better, but again, not totally foolproof. And so we have to be vigilant and apply some common sense. If a rule says that more barbecues are sold when Henry, who works on the checkout, is on holiday, then we know we can probably discount that rule.
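The k-fold idea can be sketched in a few lines of Python: split the data into k folds, hold each fold out in turn for testing, and train on the rest. The "model" here is deliberately trivial (predict whichever label is in the majority), and the data is invented:

```python
def k_fold_cross_validate(data, k, train_and_test):
    """Rotate which fold is held out for testing; return the score per fold."""
    fold_size = len(data) // k
    scores = []
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        scores.append(train_and_test(train, test))
    return scores

# A toy "model": predict the majority label seen in the training folds.
def majority_label_accuracy(train, test):
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    correct = sum(1 for _, label in test if label == majority)
    return correct / len(test)

# Invented data: 8 repaid loans followed by 4 defaults.
data = [(x, "repaid") for x in range(8)] + [(x, "defaulted") for x in range(4)]
print(k_fold_cross_validate(data, 4, majority_label_accuracy))
```

With ordered data like this the fold scores vary widely – one reason data is usually shuffled (or stratified) before being split into folds.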
If you want to start a journey in predictive analytics, and specifically data mining and the use of machine learning technologies, there is plenty of choice. Some free options include:
- RapidMiner Community – a full-blown data mining workbench with hundreds of functions.
- KNIME – another excellent data mining workbench that is totally open source.
- Orange – an open source platform with a somewhat gentler learning curve.
- WEKA – a collection of machine learning algorithms and an interface to use them.
You might also want to try some cloud based machine learning platforms. Free ones include:
- Microsoft Azure Machine Learning – offers a basic free subscription tier alongside paid subscriptions.
- Amazon Machine Learning – free for the first year.
Many businesses will buy predictive analytics solutions instead of building their own analytics team. Such solutions already exist for a broad range of applications, although predictive sales and marketing is particularly heavily targeted. Either way, predictive analytics is here to stay and early adopters will gain the advantage. Eventually predictive analytics software will become just another cost of doing business, and we’ll be looking for the next technology that might make a difference.