What is Data Mining?
Data mining is the act of trawling through historical data looking for patterns that might have a future use. Imagine a database of customers and their orders. We can mine this data for useful patterns: perhaps people who purchased one product also tend to purchase another. If so, we can use this information when dealing with future customers, for example by offering both products together at a slight discount.
Data mining is used in predictive analytics to find patterns that can be used to make predictions. The methods, or algorithms, used in data mining are numerous and apply to different types of mining. Many of them come from machine learning, where, as the name suggests, the algorithms learn from the data. Others come from statistics.
There are four main types of data mining:
- Classification is used when we want to assign something to a category. Maybe we want to classify customers as a good credit risk or a bad one. In classification we select one attribute of our data as the one to be predicted. We might hold all kinds of data on customers, such as age, income, previous buying history and location, but we hold one attribute back: the one we want to classify. For example, we might want to classify whether a customer should be targeted with a special offer, based on people who have previously purchased an item. The reserved attribute is called a label, or target. This is the value we want to predict for customers who have not purchased the item before.
- Regression is used when we want to predict a value, that is, a number. We might, for example, want to predict next month’s sales based on sales history. Or we might want to calculate the amount of credit a customer should have, based on the customer’s credit history.
- Clustering is used when we want to find similarities in our data. For example, we might want to cluster customers by their demographics, income, age and any other features that we think might be important. This is usually used as a means of discovery: finding similarities that might not have been previously considered.
- Association mining is used to find associations between items. The classic application is market basket analysis, which finds items that are commonly purchased together. For example, it might be that people tend to buy bread when they buy butter. This information can then be used to determine the layout of items in the store.
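The market basket idea can be sketched in a few lines of Python. This is only an illustration: the baskets are invented, the function names (`pair_counts`, `support`, `confidence`) are hypothetical, and real association rule miners use more efficient algorithms than this brute-force count. Support is the fraction of baskets containing a set of items, and confidence is how often the second item appears given that the first one does.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented purely for illustration.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"butter", "jam"},
    {"bread", "milk"},
    {"bread", "butter", "jam"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair one canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def support(itemset, baskets):
    """Fraction of all baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing antecedent, the fraction that also
    contain consequent. Assumes antecedent occurs at least once."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [b for b in baskets if antecedent <= b]
    with_both = [b for b in with_antecedent if consequent <= b]
    return len(with_both) / len(with_antecedent)


top_pair, top_count = pair_counts(baskets).most_common(1)[0]
print("Most frequent pair:", top_pair, "in", top_count, "baskets")
print("Support of bread + butter:", support({"bread", "butter"}, baskets))
print("Confidence butter -> bread:", confidence({"butter"}, {"bread"}, baskets))
```

A rule such as "butter implies bread" is usually only considered interesting when both its support and its confidence are above thresholds chosen by the analyst.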
Classification and regression are usually referred to as supervised learning. It is called this because we give the learning algorithm the answers while the learning process is taking place. In the customer credit example, for every customer we supply information on whether there were any problems with payment. In this way the algorithm can learn to identify which attributes predict credit abuse.
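As a small sketch of supervised learning, the regression case of predicting next month's sales can be written with ordinary least squares. The sales figures are made up, `fit_line` is a hypothetical helper, and a straight line is assumed to be a reasonable model; the point is only that the known answers (past sales) are supplied during learning.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical sales history: month number and units sold.
months = [1, 2, 3, 4, 5, 6]
sales = [100, 110, 125, 130, 145, 150]

# "Supervised": the algorithm sees both the inputs and the known answers.
slope, intercept = fit_line(months, sales)

# Use the learned line to predict a month we have no answer for yet.
forecast = slope * 7 + intercept
print(f"Forecast for month 7: {forecast:.1f}")
```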
The other form of learning is called unsupervised learning. We simply present the data to the algorithm and it finds patterns and associations that we might not have expected. Clustering and association usually operate in this way.
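To illustrate unsupervised learning, here is a minimal k-means sketch. The customer data (age, income in thousands) is invented, the `kmeans` function is a hypothetical helper, and the starting centroids are naively taken from the first k points; practical implementations choose starting points more carefully and usually rescale the features first. No labels are supplied: the algorithm groups the customers on its own.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means. Returns (centroids, labels)."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = new_centroids
    return centroids, labels


# Hypothetical customers: (age, income in thousands). No labels are given.
customers = [(25, 30), (27, 32), (24, 28), (52, 80), (55, 85), (50, 78)]

centroids, labels = kmeans(customers, k=2)
for customer, label in zip(customers, labels):
    print(customer, "-> cluster", label)
```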
To check that patterns are likely to hold in the future, we divide our data into training and testing data sets. The training data set is used by the algorithms to learn, and is usually the larger part of the data. Once we have found useful patterns, we apply them to the test data, which the algorithm has not seen. If the patterns hold up in the test data then we can be more certain that they will work in the future.
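The training and testing procedure might look like the following sketch. The customer rows are invented, the 70/30 split and the seed are arbitrary assumptions, and a one-nearest-neighbour classifier is used only because it is short to write; `train_test_split`, `predict_1nn` and `accuracy` are hypothetical helper names.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def predict_1nn(train, features):
    """Return the label of the training row closest to features."""
    nearest = min(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], features)),
    )
    return nearest[1]


def accuracy(train, test):
    """Fraction of test rows whose label is predicted correctly."""
    correct = sum(predict_1nn(train, x) == y for x, y in test)
    return correct / len(test)


# Hypothetical customers: ((age, income in thousands), credit label).
data = [
    ((25, 20), "bad"), ((45, 25), "bad"), ((35, 30), "bad"),
    ((22, 18), "bad"), ((41, 28), "bad"),
    ((30, 50), "good"), ((50, 60), "good"), ((28, 70), "good"),
    ((38, 65), "good"), ((47, 55), "good"),
]

train, test = train_test_split(data)
print(f"{len(train)} training rows, {len(test)} test rows")
print(f"Accuracy on unseen test rows: {accuracy(train, test):.2f}")
```

The test rows play no part in learning, so the accuracy measured on them is a fairer estimate of how the pattern will behave on future customers.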
Common data mining algorithms include decision trees, naïve Bayes, neural networks, k-means clustering and association rule mining. Decision trees are a good place to start because the resulting diagrams give insight into the rules that have been found. There are hundreds of algorithms, and knowing which ones to use in which situations is largely a matter of experience.
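To show why decision trees are easy to interpret, the sketch below builds the simplest possible tree, a single split often called a decision stump. This is a simplified stand-in rather than a full decision tree algorithm; the data is invented and `best_stump` is a hypothetical helper that tries every threshold on every feature and keeps the one with the fewest mistakes.

```python
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def best_stump(rows):
    """Find the single feature/threshold split with the fewest errors.

    rows is a list of (feature_dict, label) pairs.
    """
    best = None
    for feature in rows[0][0]:
        values = sorted({x[feature] for x, _ in rows})
        # Candidate thresholds lie midway between neighbouring values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for x, y in rows if x[feature] <= threshold]
            right = [y for x, y in rows if x[feature] > threshold]
            left_label, right_label = majority(left), majority(right)
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if best is None or errors < best["errors"]:
                best = {
                    "feature": feature,
                    "threshold": threshold,
                    "left": left_label,
                    "right": right_label,
                    "errors": errors,
                }
    return best


# Hypothetical customers with a known credit label.
rows = [
    ({"age": 25, "income": 20}, "bad"),
    ({"age": 45, "income": 25}, "bad"),
    ({"age": 35, "income": 30}, "bad"),
    ({"age": 30, "income": 50}, "good"),
    ({"age": 50, "income": 60}, "good"),
    ({"age": 28, "income": 70}, "good"),
]

stump = best_stump(rows)
print(
    f"IF {stump['feature']} <= {stump['threshold']} THEN {stump['left']} "
    f"ELSE {stump['right']} ({stump['errors']} training errors)"
)
```

The learned rule can be read directly as an if/else statement, which is the kind of insight a full decision tree diagram provides at every branch.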
There are many data mining tools available, and some of them are free or open source. So when someone asks "What is data mining?", the answer is not as obscure as many suspect.