Decision trees are a favorite tool used in data mining simply because they are so easy to understand. A decision tree is literally a tree of decisions and it conveniently creates rules which are easy to understand and code. We start with all the data in our training data set and apply a decision. If the data contains demographics then the first decision may be to segment the data based on age. In practice the decision may contain several categories for segmentation – young/middle age/old. Having done this we might then create the next level of the tree by segmenting on salary – and so on. In the context of data mining we normally want the tree to categorise a target variable for us – whether someone is a good candidate for a loan for example.
The clever bit is how we order the decisions, or more accurately the order in which we apply attributes to create the tree. Should we use age first and then salary – or would the converse produce a better tree? To this end decision trees in data mining uses a number of algorithms to create the best tree. The most popular algorithms are Gini (which uses probability calculations to determine tree quality) and information gain (which uses entropy calculations).
When large data sets are used there is the very real possibility that the leaf nodes (the very last nodes where the target variable is categorised) become sparsely populated with just a few entries in each leaf. This is not useful because the generalisation is poor. It is also the case that the predictive capability drops off when the leaves contain only a few records. To this end most data mining tools support pruning, where we can specify a minimum number of records to be included in a leaf. There is no magical formula that will say what the level of pruning should be, it’s just a matter of trial and error to see what gives the best predictive capability.
Virtually all data mining tools implement decision trees and some offer elaborations on the basic concept – regression trees for example where the tree is used to predict a value, rather than categorise.
Decision trees are often used to get a feel for data even if they are not part of the resulting model, although good results are to be got from decision trees in many business applications.