This article explains a number of predictive analytics techniques (applications of data mining technology) that are frequently used in predictive modeling.
Supervised Learning Techniques
The techniques shown below are used in a supervised learning scenario: a data set with known outcomes is provided for the tools to learn from, so that new data can be classified or a value predicted through regression.
Bayes Classifiers
Bayesian classifiers use a probabilistic approach to classifying data. Unlike many data mining algorithms, Bayesian classifiers often work well with a large number of input attributes without hitting what is called the curse of dimensionality. Naive Bayes is the technique most often employed – the term ‘naive’ coming from the assumption that the input attributes are independent of each other (i.e. there are no correlations between them). Despite the fact that this is often not true, naive Bayes still gives good results. Unfortunately it is often overlooked for more esoteric methods, when it should actually be a first port of call where it is relevant to the problem and most attributes are categorical (i.e. categorised).
Bayes works by combining what are called conditional probabilities into an overall probability of an event being true. Explaining Bayes well is difficult (as evidenced by the large number of explanatory videos on YouTube), but if you want to learn more an introductory article can be found here.
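To make this a little more concrete, here is a minimal sketch of a naive Bayes classifier using scikit-learn (one of many tools that implement it). The applicants, attributes and integer encodings are invented purely for illustration.

```python
# Minimal naive Bayes sketch with scikit-learn; the applicants are invented for illustration.
from sklearn.naive_bayes import CategoricalNB

# Categorical attributes encoded as integers: employment (0=unemployed, 1=employed)
# and housing (0=renter, 1=owner).
X = [[1, 1], [1, 0], [0, 0], [1, 1], [0, 1], [0, 0]]
y = ["good", "good", "bad", "good", "bad", "bad"]  # invented loan outcomes

model = CategoricalNB()
model.fit(X, y)

# Naive Bayes combines the conditional probabilities of each attribute value
# into an overall probability for each class.
print(model.predict([[1, 0]]))        # an employed renter
print(model.predict_proba([[1, 0]]))  # the class probabilities behind the prediction
```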
Decision Trees
Decision trees are a favorite tool in data mining simply because they are so easy to understand. A decision tree is literally a tree of decisions, and it conveniently creates rules that are easy to understand and code. We start with all the data in our training set and apply a decision. If the data contains demographics then the first decision may be to segment the data based on age. In practice the decision may contain several categories for segmentation – young/middle-aged/old. Having done this we might then create the next level of the tree by segmenting on salary, and so on. In the context of data mining we normally want the tree to categorise a target variable for us – whether someone is a good candidate for a loan, for example.
The clever bit is how we order the decisions, or more accurately the order in which we apply attributes to create the tree. Should we use age first and then salary, or would the converse produce a better tree? To this end, decision tree implementations in data mining use a number of algorithms to create the best tree. The most popular criteria are Gini impurity (which uses probability calculations to determine split quality) and information gain (which uses entropy calculations).
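As a rough illustration, here is a sketch using scikit-learn's DecisionTreeClassifier, where the criterion parameter switches between Gini and entropy (information gain), and the fitted tree can be printed as a set of readable rules. The data and feature names are invented.

```python
# Sketch: comparing Gini and entropy splitting criteria on invented loan data.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented stand-in data: 1,000 records, 5 attributes, binary target (good/bad loan).
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

for criterion in ("gini", "entropy"):          # entropy corresponds to information gain
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=0)
    tree.fit(X, y)
    print(criterion, "training accuracy:", tree.score(X, y))

# The fitted tree is literally a set of readable rules.
print(export_text(tree, feature_names=["age", "salary", "tenure", "debt", "dependents"]))
```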
When large data sets are used there is a very real possibility that the leaf nodes (the final nodes, where the target variable is categorised) become sparsely populated, with just a few entries in each leaf. This is not useful because the generalisation is poor, and the predictive capability drops off when the leaves contain only a few records. To this end most data mining tools support pruning, where we can specify a minimum number of records to be included in a leaf. There is no magical formula that says what the level of pruning should be; it’s a matter of trial and error to see what gives the best predictive capability.
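Since there is no formula for the right level of pruning, the usual approach is to try a few values and compare performance on held-out data. A minimal sketch, again assuming scikit-learn and invented data:

```python
# Sketch: choosing a pruning level (minimum records per leaf) by trial and error.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=10, random_state=1)  # invented data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

for min_leaf in (1, 10, 50, 200):
    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=1)
    tree.fit(X_train, y_train)
    # Larger leaves tend to generalise better up to a point, then start to underfit.
    print(f"min_samples_leaf={min_leaf}: test accuracy={tree.score(X_test, y_test):.3f}")
```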
Virtually all data mining tools implement decision trees and some offer elaborations on the basic concept – regression trees for example where the tree is used to predict a value, rather than categorise.
Decision trees are often used to get a feel for data even if they are not part of the resulting model, although they also give good results in their own right in many business applications.
Nearest Neighbors (k-NN)
Entities can often be classified by the neighborhood they live in. Simply ask whether your own neighborhood gives a fair representation of you in terms of income, education, aspirations, values and so on. It doesn’t always work – but usually it does – birds of a feather and all that. A similar mechanism has been developed to classify data, by establishing which neighborhood a particular record lives in. The official name for this algorithm is k-Nearest Neighbors, or k-NN for short.
The essential idea is this. Imagine you are interested in buying a second-hand car. Mileage, fuel efficiency, service history and a number of other attributes will typically be of interest. Someone you know has a database of used cars that includes these details, and each car is categorised as a peach or a lemon. Given the details of the car you are interested in, the k-NN algorithm will find the five (so k = 5 in this instance) cars that most closely match yours. If more are peaches than lemons then you might have a good car – and that’s it.
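A toy sketch of the used-car example, assuming scikit-learn and invented figures; the attributes are scaled first so that mileage does not swamp the other magnitudes when distances are calculated.

```python
# Sketch of the used-car example with k-NN (k=5); the figures are invented.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Attributes per car: mileage, fuel efficiency (mpg), number of services on record.
cars = [[30000, 40, 6], [120000, 28, 2], [45000, 38, 5], [90000, 30, 3],
        [20000, 42, 7], [150000, 25, 1], [60000, 35, 4], [110000, 27, 2]]
labels = ["peach", "lemon", "peach", "lemon", "peach", "lemon", "peach", "lemon"]

# Scale the attributes so no single magnitude dominates the distance calculation.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(cars, labels)

# The 5 nearest cars vote on whether this one is a peach or a lemon.
print(model.predict([[70000, 33, 4]]))
```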
Obviously it gets a bit more involved with large commercial data sets, but the idea is simple enough. It works best where most of the attributes are numbers that measure some sort of magnitude, so that the algorithm can establish where the nearest neighbors are. Attributes that represent categories can be a problem, and so k-NN may not always be suitable. Even so, this simple algorithm is widely used and can deliver good results.
Neural Networks
If decision trees represent transparency and good behaviour then neural networks epitomise opaqueness and temperamental behaviour. But what else would you expect from a sometimes brilliant and other times obstinate technology? Neural networks are used for prediction and classification, and through the development of self-organising maps (SOM), for clustering. They are called neural networks because they supposedly mimic the behaviour of neurons in the nervous system, taking inputs from the environment, processing them and creating an output. And just as neurons are linked together, so are the nodes in a neural network. As with other data mining techniques, neural networks demand that a good selection of relevant inputs is available, that the target output is well understood and that copious amounts of data are available for training.
The most commonly used type of neural network is called a feed-forward network. As the name suggests, it works by feeding the outputs from each node forward to the next node as its inputs. The flow is essentially in one direction, although a technique called back propagation is used to tune the network by comparing the network’s estimate of a value against the actual value. Nodes in a network do two things. First, they combine the inputs by multiplying each input by a weight (to simulate its importance) and summing the products – this is called the combination function. Other combination functions are used, but this is the most common. Secondly, the output from the combination function (a single number) is fed into a transfer function, which usually takes the form of a sigmoid (an S-shaped curve) or a hyperbolic tangent. These curves allow the network to deal with non-linear behaviour: in essence they are roughly linear for small values but flatten out for large values. This form of non-linearity is an assumption – but it often works well. The output from the transfer function is then fed to the next node in the network.
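To make the combination and transfer functions concrete, here is a minimal sketch of what a single node computes, with invented inputs and weights:

```python
# Sketch: what a single node in a feed-forward network computes (weights are invented).
import math

def node_output(inputs, weights):
    # Combination function: multiply each input by its weight and sum the products.
    combined = sum(x * w for x, w in zip(inputs, weights))
    # Transfer function: a sigmoid – roughly linear near zero, flattening out for large values.
    return 1.0 / (1.0 + math.exp(-combined))

# Three inputs arriving at a node, with weights reflecting their importance.
print(node_output([0.5, 0.2, 0.9], [0.8, -0.4, 0.3]))  # this output feeds the next node
```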
Most neural networks have three layers – the input layer, a hidden layer, and the output layer. The hidden layer is so named because it has no direct contact with the inputs or outputs. Knowing how large to make the hidden layer is one of the crucial issues in using a neural network. Make it too large and the network will simply memorise the training set and generalise very poorly to new data. Make it too small and useful patterns will be missed.
Using a neural network requires a considerable amount of skill and the results can range from the sublime to the ridiculous simply by modifying any one of a number of parameters. The most important parameters include:
- The size of the training set.
- The number of hidden layers and the number of nodes in each hidden layer.
- Parameters affecting how quickly the network learns.
- The features to use as input.
- The combination functions and transfer functions.
This list is by no means exhaustive, and issues such as normalising inputs, converting categorical inputs and so on all have a profound effect on the quality of the network produced. Some of the plug-and-play analytics tools omit neural networks altogether, and for good reason: other methods often produce equally good results without the temperamental behaviour. Having said this, neural networks can detect patterns that evade detection by other means, and they are very good at picking up some non-linear behaviours.
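As a sketch of how these parameters surface in practice, here is one possible set-up using scikit-learn's MLPClassifier (just one of many implementations); the values are illustrative rather than recommendations, and the data is invented.

```python
# Sketch: a feed-forward network with the key parameters made explicit (illustrative values).
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20, random_state=2)  # invented data

model = make_pipeline(
    StandardScaler(),                        # normalising inputs matters a great deal
    MLPClassifier(hidden_layer_sizes=(10,),  # one hidden layer of 10 nodes
                  activation="tanh",         # transfer function (hyperbolic tangent)
                  learning_rate_init=0.01,   # how quickly the network learns
                  max_iter=500,
                  random_state=2),
)
model.fit(X, y)
print(model.score(X, y))
```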
Support Vector Machines
Support Vector Machines (SVMs) are one of the most powerful classes of predictive analytics technologies. They work by separating the data into regions (by hyperplanes in multi-dimensional spaces, for those interested), and as such classify the data. Oracle, for example, has a predictive analytics Excel add-on that uses SVMs exclusively. Having said this, they are not the right tool for every analytics problem and can over-fit the data in much the same way as neural networks – although there are mechanisms for minimizing this effect.
SVMs are an essential component in any analytics toolkit and virtually all suppliers include an implementation.
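A minimal sketch using scikit-learn's SVC, where the C parameter is one of the usual mechanisms for controlling over-fitting; the data is invented.

```python
# Sketch: a support vector machine classifier; C controls the tolerance for over-fitting.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, random_state=3)  # invented data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

# A smaller C gives a wider margin (less over-fitting); a larger C fits the data more tightly.
for C in (0.1, 1.0, 10.0):
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=C))
    model.fit(X_train, y_train)
    print(f"C={C}: test accuracy={model.score(X_test, y_test):.3f}")
```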
Unsupervised Learning Techniques
These techniques are used to find relationships within data without being given a labeled data set to learn from. As such there is no special nominated attribute in the data set that is to be categorized or calculated (or scored, in the lingo of predictive analytics). Despite this, these techniques do allow new data to be allocated to a cluster or associated with a rule. The two dominant techniques here are clustering and association.
Clustering
Clustering is similar in spirit to the k-NN technique mentioned above, but without specifying a particular attribute that is to be classified. Data are simply presented to the clustering algorithm, which then creates clusters using any one of a number of techniques. It is as much an exploratory technique as a predictive one. A typical example might be clustering patients with similar symptoms.
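A minimal sketch of the patient example using k-means (just one of many clustering algorithms), assuming the symptoms have already been turned into numeric scores; all the numbers are invented.

```python
# Sketch: clustering patients by symptom scores with k-means (invented numbers).
from sklearn.cluster import KMeans

# Each row is a patient: [fever score, cough score, fatigue score]
patients = [[0.1, 0.8, 0.2], [0.2, 0.9, 0.3], [0.9, 0.1, 0.8],
            [0.8, 0.2, 0.9], [0.15, 0.85, 0.25], [0.85, 0.15, 0.7]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=4)
labels = kmeans.fit_predict(patients)
print(labels)                        # which cluster each patient falls into

# New data can still be allocated to a cluster, even though no target was given in training.
print(kmeans.predict([[0.2, 0.7, 0.3]]))
```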
Association Rule Mining
Summary – use Association Rule Mining (ARM) when you want to discover rules relating attribute values to each other. The general form of a rule is ‘IF this THEN that’; in an analysis of supermarket shopping habits an example might be IF milk THEN bread. Establishing the usefulness of the rules found is a major part of most ARM projects.
On the face of it, association rule mining (ARM) is a simple enough activity. We simply let the ARM algorithms loose on the data of our choice and a plethora of rules will be found. The general format of these rules is as follows:
IF this THEN that
To add some meat to the bones let’s consider the habits of shoppers at a supermarket. A specific rule might say:
IF (bread and milk) THEN (jam)
The ARM algorithm will have plowed through millions of combinations of products to identify this particular rule and thousands of others. In fact this is one of the problems associated with ARM: it finds an excess of rules, many of which are simply not useful.
ARM is generally categorized as an unsupervised learning method. Supervised learning involves giving the data mining algorithms a target variable to classify (a good or bad loan prospect for example), and a data set to learn from. Unsupervised learning does not involve a target variable and is more exploratory in nature.
In the supermarket example the ARM algorithm will typically produce thousands of rules of the form mentioned above. Another one might be IF (cheese and apples) THEN (beef and oranges). When we look at the data we might find that the number of people buying cheese and apples together is quite small, and that only 100 cases out of millions of shopping baskets obeyed this rule – clearly it doesn’t mean very much.
So that we are left with a meaningful set of rules, there are a number of measures that can be used to filter out the more trivial ones. The use of such measures is a major part of using ARM, so we will give it some attention, with the help of a Venn diagram to make things easier to understand.
Picture a Venn diagram with three basic shapes. The blue rectangle represents all the shopping baskets we wish to consider – a total of Nt baskets. The yellow circle represents all the baskets containing the ‘this’ combination of goods, the green circle represents all the baskets containing the ‘that’ combination, and the intersection of the two circles represents our rule. Going back to our basic rule formulation we can rephrase it as:
IF left THEN right
Both ‘left’ and ‘right’ represent combinations of items, and are labeled as such because they appear on the left- and right-hand sides of the rule. So Nl is the number of instances (shopping baskets) that satisfy the ‘left’ combination of items (cheese and apples, say), and Nr is the number of instances that satisfy the ‘right’ combination of items (beef and oranges). The baskets that satisfy the actual rule are represented by the intersection of the two circles and number Nb. Having got the terminology out of the way, we can define three central measures used to filter useful rules.
The first is confidence, calculated as Nb/Nl. This is the number of baskets that satisfy the whole rule as a fraction of the baskets that satisfy the left-hand combination. To make this meaningful we will go back to our first rule:
IF (bread and milk) THEN (jam)
Let’s say there are 500,000 shopping baskets that contained bread and milk (Nl) out of a total of 1,000,000 shopping baskets (Nt), and there are 300,000 shopping baskets that contained jam (Nr). The number of shopping baskets that contained bread, milk and jam is 200,000 (Nb). Confidence for this rule would be 200,000 / 500,000 = 0.4, or 40%. Confidence gives a measure of how exclusive a rule is: in this case 60% of the bread-and-milk baskets did not contain jam. Confidence should really be much higher than this – around 75%. So knowing that someone purchased bread and milk does not really tell us all that much.
Support is specified as Nb/Nt. This gives us some measure of how ‘big’ this rule is. Typically users of ARM methods want a support of greater than 1%. A rule really doesn’t mean very much if there are only five instances of it out of 1,000,000. In our example support is calculated as 200,000 / 1,000,000 = 0.2 or 20% – a very respectable number.
Completeness is specified by Nb/Nr. This tells us whether the rule accounts for a substantial fraction of the instances where the ‘right’ combination is present; if it only covers a small fraction then it would not be much use in promoting the products on the ‘right’. The completeness of our rule is 200,000 / 300,000 ≈ 0.67, or 67% – once again a very acceptable figure. There are many other measures of a rule, but these are the most common, and we will look at others in subsequent articles.
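Putting the three measures together for the worked example, a few lines of arithmetic confirm the figures:

```python
# The worked example above, expressed as code.
Nt = 1_000_000  # all shopping baskets
Nl = 500_000    # baskets containing bread and milk (the 'left' side)
Nr = 300_000    # baskets containing jam (the 'right' side)
Nb = 200_000    # baskets containing bread, milk and jam (the rule itself)

confidence = Nb / Nl    # 0.4   -> 40%
support = Nb / Nt       # 0.2   -> 20%
completeness = Nb / Nr  # 0.666 -> roughly 67%
print(confidence, support, completeness)
```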
ARM is used for many purposes other than analyzing shopping baskets – credit card purchases and medical symptoms, to name just two. In the next article we will get further into the algorithms used for ARM and their relative strengths and weaknesses.
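To close, here is a deliberately naive, brute-force sketch of rule mining over a handful of invented baskets, using the measures defined above as filters. Production ARM algorithms such as Apriori search the space of item combinations far more efficiently and handle multi-item rules, but the filtering idea is the same; the thresholds here are purely illustrative.

```python
# Sketch: a brute-force miner for single-item rules (IF {a} THEN {b}) over invented baskets.
from itertools import permutations

baskets = [{"bread", "milk", "jam"}, {"bread", "milk"}, {"milk", "jam"},
           {"bread", "milk", "jam", "cheese"}, {"bread", "jam"}]
items = set().union(*baskets)
Nt = len(baskets)

MIN_SUPPORT, MIN_CONFIDENCE = 0.4, 0.75  # illustrative thresholds

for left, right in permutations(items, 2):
    Nl = sum(1 for b in baskets if left in b)                 # baskets with the 'left' item
    Nr = sum(1 for b in baskets if right in b)                # baskets with the 'right' item
    Nb = sum(1 for b in baskets if left in b and right in b)  # baskets satisfying the rule
    support, confidence, completeness = Nb / Nt, Nb / Nl, Nb / Nr
    if support >= MIN_SUPPORT and confidence >= MIN_CONFIDENCE:
        print(f"IF ({left}) THEN ({right}): support={support:.2f}, "
              f"confidence={confidence:.2f}, completeness={completeness:.2f}")
```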