The central notion behind data mining is this: we look for patterns of behavior in historical data with the aim of exploiting those patterns in the future. For example, we might find patterns that identify a certain group of customers (by age, income and education, say) as being more likely to respond to the promotion of a certain product. These patterns are found by analyzing historical data, and a decision then has to be made about the likelihood of the relationship still holding true. In fact this is one of the most common applications of data mining technology, and there are many technology suppliers providing solutions in this domain.
The term ‘data mining’ embraces a large number of technologies and techniques, but for our purposes we will separate out statistical methods, since these are fundamentally different from many of the methods used in data mining. We are all familiar with simple statistical methods – the mean, the standard deviation, and regression, where we fit a line to a set of data. Statistics are predefined metrics of a data set: either they make sense for the data or they do not, and it is often very difficult to establish which is the case. Data mining, on the other hand, often determines the metrics as a model is built. We’ll get to some of the methods used later, but this is an important point and worth remembering.
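As a concrete illustration of these predefined metrics, here is a minimal sketch (with made-up numbers) computing a mean, a standard deviation and a straight-line fit using NumPy.

```python
# Predefined metrics on a tiny, invented data set: mean, standard deviation
# and a least-squares straight line fitted to the points.
import numpy as np

ages = np.array([23, 35, 41, 52, 60], dtype=float)
spend = np.array([120, 180, 210, 260, 300], dtype=float)

print("mean age:", ages.mean())
print("std dev of age:", ages.std())

# Fit a straight line spend = a * age + b (simple linear regression).
a, b = np.polyfit(ages, spend, deg=1)
print(f"fitted line: spend = {a:.2f} * age + {b:.2f}")
```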
If data mining were as simple as applying the various techniques to data and exploiting the patterns that are found, there would be little need to go any further. Alas, this is not the case. There are two major issues that data mining throws up. The first is that these methods often discover patterns that are nothing more than accidental fits with the data. It might just happen that, during the time period analyzed, customers with a certain profile purchased a particular product, but the pattern has no validity outside this time period. Various names are given to this phenomenon – over-fitting being one of the most common. The algorithms will find patterns in data that are essentially random. There are ways to minimize this behavior, but the danger is always present.
The second major challenge presented by data mining is the problem of knowing whether the patterns that have been found are persistent. Even if we can establish that the patterns represent real behavior, is it reasonable to assume that the same behavior is still manifesting? This is where human judgment is needed, and there is no substitute for it. In some cases an extrapolation of the behavior into the future will be perfectly logical; in other cases transient fashions and biases should make us much more suspicious.
In reality data mining is as much an art as a science, even to the extent that the way data are presented to the data mining algorithms will make a huge difference to the patterns that are found. Consider an attribute such as age. If we feed age into an algorithm simply as a number of years, many algorithms will have problems dealing with a continuous variable of this nature. It might be much more productive to categorize age as twenties/thirties/forties and so on, as in the sketch below. There really is no substitute for experience, although the technology is becoming more user-friendly, with built-in knowledge of common attributes used in a business environment.
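By way of illustration, here is a minimal sketch of that kind of age banding, assuming the data sit in a pandas DataFrame with an ‘age’ column (the column name and band boundaries are invented).

```python
# Bucket a continuous 'age' attribute into bands before handing it to an algorithm.
import pandas as pd

df = pd.DataFrame({"age": [22, 27, 34, 38, 45, 51, 63]})

df["age_band"] = pd.cut(
    df["age"],
    bins=[0, 29, 39, 49, 59, 120],                        # band boundaries (invented)
    labels=["20s_or_under", "30s", "40s", "50s", "60_plus"],
)
print(df)
```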
Data Mining Methods
The techniques explained below are frequently used across many types of data mining activity.
Supervised Learning Techniques
The techniques shown below are used in a supervised learning scenario. This is where a data set is provided for the tools to learn from, so that new data can be classified or a value predicted through regression.
Bayes Classifiers
Bayesian classifiers use a probabilistic approach to classifying data. Unlike many data mining algorithms, Bayesian classifiers often work well with a large number of input attributes, without hitting what is called the dimensionality problem. Naive Bayes is the technique most often employed – the term ‘naive’ coming from the assumption that the input attributes are independent of each other (i.e. there are no correlations between them). Despite the fact that this is often not true, naive Bayes still gives good results. Unfortunately it is often overlooked in favor of more esoteric methods, whereas it should actually be a first port of call where it is relevant to the problem and most attributes are categorical.
Bayes works by combining what are called conditional probabilities into an overall probability of an event being true. Explaining Bayes is difficult (as evidenced by the large number of explanatory videos on YouTube). I have made a video which can be seen here, which will hopefully explain things further for those interested.
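For those who prefer code to equations, here is a minimal sketch of a naive Bayes classifier on invented customer data, using scikit-learn’s CategoricalNB; the columns, categories and outcomes are all made up for illustration.

```python
# Naive Bayes with categorical inputs: encode the categories as integers,
# fit the classifier, then ask for the probability of a response.
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({
    "income":    ["high", "low", "medium", "high", "low", "medium"],
    "age_band":  ["30s",  "20s", "40s",    "40s",  "30s", "20s"],
    "responded": [1, 0, 1, 1, 0, 0],   # did the customer respond to the promotion?
})

encoder = OrdinalEncoder()
X = encoder.fit_transform(train[["income", "age_band"]])
y = train["responded"]

model = CategoricalNB()
model.fit(X, y)

# Probability that a new high-income customer in their 40s responds.
new_customer = encoder.transform([["high", "40s"]])
print(model.predict_proba(new_customer))
```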
Decision Trees
Decision trees are a favorite mechanism for finding rules which classify data, simply because they are so easy to understand. We’ve all seen decision trees. They start at a single root, which represents all the data, and branches move out which represent the various values of one of the variables in the data set. For example we might have three branches which represent high, medium and low income for bank loan customers. Each of these branches is then subdivided by the categories of another variable in the data set (age for example – young, middle-aged and old). This process continues until all the attributes have been accounted for, terminating in a classification of the data – a good or bad loan candidate, for example.
The real trick that produces meaningful decision trees is the order in which attributes are selected to split the data on the way down to the terminating leaf nodes. This is usually an internal function, but for those interested, the most common method is to calculate how much information is gained by splitting on each particular attribute – the attributes with the highest information gain are used first.
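Here is a minimal sketch of the loan example, using scikit-learn’s DecisionTreeClassifier with an entropy (information gain) splitting criterion; the loan data are invented.

```python
# Fit a small decision tree on made-up loan data and print the learned splits.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

loans = pd.DataFrame({
    "income":    ["high", "high", "medium", "low", "low", "medium", "high", "low"],
    "age":       ["young", "middle", "middle", "old", "young", "old", "old", "middle"],
    "good_loan": [1, 1, 1, 0, 0, 1, 1, 0],
})

# One-hot encode the categorical attributes so the tree can split on them.
X = pd.get_dummies(loans[["income", "age"]])
y = loans["good_loan"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=list(X.columns)))
```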
Nearest Neighbors (k-NN)
Entities can often be classified by the neighborhood they live in. Simply ask whether your own neighborhood gives a fair representation of you, in terms of income, education, aspirations, values and so on. It doesn’t always work – but usually it does – birds of a feather and all that. A similar mechanism has been developed to classify data, by establishing which neighborhood a particular record lives in. The official name for this algorithm is k-Nearest Neighbors, or k-NN for short.
The essential idea is this. Imagine you are interested in buying a second-hand car. Mileage, fuel efficiency, service history and a number of other attributes will typically be of interest. Someone you know has a database of used cars which includes these details, and each car is categorized as a peach or a lemon. When you enter the details of the car you are interested in, the k-NN algorithm will find the five cars (so k=5 in this instance) with the closest match to yours. If more of them are peaches than lemons then you might have a good car – and that’s it.
Obviously it gets a bit more involved with large commercial data sets, but the idea is simple enough. It works best where most of the attributes are numbers that measure some sort of magnitude, so that the algorithm can establish where the nearest neighbors are. Attributes that represent categories can be a problem, and so k-NN may not always be suitable. Even so, this simple algorithm is widely used and can deliver good results.
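Here is a minimal sketch of the used-car example with scikit-learn’s k-NN implementation; the cars, measurements and labels are invented, and the features are scaled so that no single attribute dominates the distance calculation.

```python
# Classify a candidate car as a peach or a lemon from its 5 nearest neighbours.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Columns: mileage (thousands), fuel efficiency (mpg), services per year.
cars = np.array([
    [30, 40, 1.0],
    [120, 28, 0.2],
    [45, 38, 1.0],
    [150, 25, 0.1],
    [60, 35, 0.8],
    [90, 30, 0.5],
])
labels = ["peach", "lemon", "peach", "lemon", "peach", "lemon"]

scaler = StandardScaler()
X = scaler.fit_transform(cars)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, labels)

candidate = scaler.transform([[70, 33, 0.6]])
print(knn.predict(candidate))
```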
Neural Networks
As the name suggests, a neural network is modeled on the way neurons within our nervous system work (or how we believe they work). There are many variations of neural networks, but they are mainly used for the classification of data and work most naturally with continuous input variables (usually, anyway). While neural networks have proven useful in many fields, such as pattern recognition (fingerprints, face recognition and so on), they tend to over-fit the data – finding patterns in random noise. The growing variety of neural networks, and the skill needed to make them do anything useful, means they should only be used by someone who really knows what they are doing. It is noticeable that some of the ‘plug-and-play’ analytics suppliers avoid them altogether (11Ants for example).
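For completeness, here is a minimal sketch of a small neural network classifier on a synthetic data set, using scikit-learn’s MLPClassifier; holding back a test set is one simple way of checking for the over-fitting mentioned above.

```python
# Train a small feed-forward network and check accuracy on held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```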
Support Vector Machines
Support Vector Machines (SVMs) are one of the most powerful classes of predictive analytics technologies. They work by separating data into regions (by hyperplanes in multi-dimensional spaces, for those interested), and as such classify the data. Oracle, for example, has a predictive analytics Excel add-on that uses SVMs exclusively. Having said this, they are not the right tool for every analytics problem and can over-fit the data in the same way as neural networks – although there are mechanisms for minimizing this effect.
SVMs are an essential component in any analytics toolkit and virtually all suppliers include an implementation.
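As a minimal sketch, scikit-learn’s SVC is one such implementation; the data set below is synthetic, and the kernel and C parameter are the usual knobs for controlling over-fitting.

```python
# Fit a support vector machine with an RBF kernel and score it on held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

svm = SVC(kernel="rbf", C=1.0)
svm.fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))
```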
Unsupervised Learning Techniques
These techniques are used to find relationships within data without being given labeled examples to learn from. As such there is no nominated attribute in the data set that is to be categorized or calculated (or scored, in the lingo of predictive analytics). Despite this, these techniques do allow new data to be allocated to a cluster or matched against a rule. The two dominant techniques here are clustering and association.
Clustering
Clustering is very similar to the k-NN technique mentioned above, but without specifying a particular attribute that is to be classified. Data are simply presented to the clustering algorithm, which then creates clusters using any one of a number of techniques. It is as much an exploratory technique as a predictive one. A typical example might be clustering patients with similar symptoms, as in the sketch below.
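Here is a minimal sketch of that patient example using k-means from scikit-learn (one of many possible clustering techniques); the measurements are invented.

```python
# Group patients into two clusters from their (scaled) measurements;
# no target attribute is supplied.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Columns: temperature (C), heart rate (bpm), white cell count.
patients = np.array([
    [36.8, 70, 6.0],
    [38.9, 95, 12.5],
    [37.0, 72, 6.5],
    [39.2, 101, 13.0],
    [36.6, 65, 5.5],
    [38.7, 90, 11.8],
])

X = StandardScaler().fit_transform(patients)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
print(kmeans.fit_predict(X))   # cluster label for each patient
```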
Association Rule Mining
Unlike the supervised learning methods, association rule mining is unsupervised and is concerned with the discovery of any rules which might exist between attributes. This sounds fairly straightforward, but is riddled with potholes – the most common being the discovery of hundreds (or thousands) of rules that are either trivial or spurious. However, used well, this technique does unearth previously unknown relationships and forms the backbone of basket analysis – a common application in retail.
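To make the idea concrete, here is a minimal sketch that computes the support and confidence of a single candidate rule over a handful of invented shopping baskets; real association rule miners search over all such rules automatically.

```python
# Support and confidence for the rule "bread -> butter" over toy baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"beer", "crisps"},
    {"bread", "milk"},
    {"butter", "milk"},
]

antecedent, consequent = {"bread"}, {"butter"}

n = len(baskets)
both = sum(1 for b in baskets if antecedent <= b and consequent <= b)
ante = sum(1 for b in baskets if antecedent <= b)

support = both / n        # how often the items appear together
confidence = both / ante  # how often the rule holds when the antecedent occurs
print(f"bread -> butter: support={support:.2f}, confidence={confidence:.2f}")
```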
Predictive Analytics
Predictive analytics is a particular application of data mining technologies. The typical mechanism used to predict in predictive analytics is scoring. We might wish to score the creditworthiness of a new customer, or score the likelihood of machine failure in a manufacturing plant. A large number of algorithms are available to find the patterns in historical data, which are then used to score new data. The names given to these algorithms are all suitably off-putting, but in essence most of them rest on fairly simple ideas – I’ll be exploring them in other articles. There is no reason why a manager should not know when regression is used, or when it shouldn’t be used.
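By way of illustration, here is a minimal sketch of scoring with logistic regression in scikit-learn; the customer history, features and column names are all invented.

```python
# Learn from historical outcomes, then score a new applicant with a
# probability of default between 0 and 1.
import pandas as pd
from sklearn.linear_model import LogisticRegression

history = pd.DataFrame({
    "income":         [25, 60, 40, 80, 30, 55],   # thousands per year
    "years_employed": [1, 8, 4, 12, 2, 6],
    "defaulted":      [1, 0, 1, 0, 1, 0],
})

model = LogisticRegression()
model.fit(history[["income", "years_employed"]], history["defaulted"])

new_applicant = pd.DataFrame({"income": [45], "years_employed": [3]})
print(model.predict_proba(new_applicant)[0, 1])
```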
The overwhelming use of predictive analytics is in sales and marketing – trying to identify receptive candidates for a marketing campaign, or customers who should be targeted for up-sell/cross-sell activity. Other uses include fraud detection, credit rating and, increasingly, health-care analysis. The potential application of predictive technologies is as broad as human activity, so these initial applications are just the early uses of the technology.
The suppliers of predictive technologies fall into three main camps:
- Enterprise solution suppliers who bolt on a predictive analytics capability. These are typically quite weak offerings, although IBM is a notable exception.
- Proprietary analytics solutions aimed at large organisations. KXEN, Angoss, SAS and others come into this category.
- Open source offerings are often the most capable, but the least user-friendly. Revolution Analytics and Rapid-i have taken open source solutions and made them enterprise-ready – these are some of the best tools available to experienced analysts.
In my opinion predictive analytics is the thin end of a very broad wedge, representing a move away from dumb applications to smart applications. IBM was one of the first major suppliers to see this shift and has profited from it accordingly. The effective use of smart technologies is becoming a real differentiator, and will perhaps become the differentiator in many markets.