Data mining is the act of searching through historical data with the aim of finding useful patterns of behaviour. A bank for example might scan through customer records to find patterns associated with people who have difficulty paying back loans. These would then be used to classify new loan applicants.
Generally speaking, the patterns we are looking for are used to establish the value of some variable. In the case of the bank this variable would have values of ‘approve’ or ‘reject’ a loan application. The variables we wish to calculate are called labels. When historical data are provided with known values for the labels the mode of data mining is known as supervised learning. When no labels are provided, and the algorithms are looking for other patterns, this is known as unsupervised learning.
There are four broad categories of learning that data mining can perform:
- Classification (as with the bank example) aims to classify records by some pre-defined classification scheme (accept or reject for example).
- Regression is used to calculate the value of numeric labels – the number of hours before a machine should be serviced for example.
- Association seeks to find any relationships which might exist between any of the variables. This is often used in market basket analysis where items that tend to be purchased together are identified.
- Clustering groups records using some measure of similarity. We might cluster customers to establish similarities.
The ‘records’ we refer to are often called instances or examples. Data mining is often used in predictive analytics, where the aim is to predict some variable (or variables) associated with new instances. Examples include customers who are likely to defect, patients who might be readmitted to hospital, movies you might want to watch, people who should be targeted in a marketing campaign – and so on.
The golden rule in data mining is to keep things as simple as possible. Complexity has a large number of problems associated with it, and not least the fact that the data mining algorithms use complexity to simply fit patterns to the data with no generalisation. This means they fit patterns to random noise with no predictive capability.
Ultimately the success or otherwise of data mining activities will depend on the skills of the analyst and the know-how of domain experts.