Data that come in the form of numerical values can vary enormously in range. If we ran a second-hand auto business we might record the mileage of each vehicle and the number of services it has had. The recorded mileage might range from 2,000 to 300,000 miles, whereas the number of services might vary between 1 and 20. For many analytical purposes it is useful to have both measures on the same scale. Looking for clusters of vehicles with k-means, for example, effectively requires that the two measures be normalized to a common scale. Fortunately this is very easy: we subtract the minimum value from each value and divide the result by the full range. This is shown in the equation below:

x_normalized = (x − x_min) / (x_max − x_min)

Most analytical tools contain a normalization function (e.g. KNIME, RapidMiner), and some will normalize data automatically when necessary (Tableau, for example). Working through an example might help. Going back to our vehicle mileage measure, the range of values is derived by subtracting the minimum from the maximum: 300,000 − 2,000 = 298,000, which is the denominator in the above equation. The minimum mileage normalizes to zero, since the numerator is zero, and the maximum normalizes to one, since the numerator and denominator are equal. Any mileage between the minimum (2,000) and maximum (300,000) maps to a value between 0 and 1.
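The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a production routine; the function name and the sample mileages are invented for the example:

```python
def min_max_normalize(values):
    """Scale a list of numbers to the 0-1 range using min-max normalization."""
    lo, hi = min(values), max(values)
    span = hi - lo  # 298,000 for the mileage example
    return [(v - lo) / span for v in values]

# Minimum maps to 0, maximum to 1, and 151,000 sits exactly halfway.
mileages = [2_000, 151_000, 300_000]
print(min_max_normalize(mileages))  # [0.0, 0.5, 1.0]
```

Note that the function fails (division by zero) if every value is identical; real tools typically guard against a zero range.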

In many instances this approach works quite well, although problems occur when the data contain significant outliers. Because the outliers stretch the range, normalization compresses most of the ‘normal’ data into a small portion of the 0 to 1 scale, with the outliers sparsely scattered over the rest of it. Various work-arounds exist, such as taking the logarithm of values before normalization. If your data have extreme outliers you will need to think very carefully about how you handle them.
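The log-then-normalize workaround can be sketched as follows. This is an illustrative example only, assuming strictly positive values (the logarithm is undefined otherwise), and the sample data including the 1,000,000-mile outlier are invented:

```python
import math

def log_min_max_normalize(values):
    """Take the logarithm before min-max scaling to damp extreme outliers."""
    logs = [math.log(v) for v in values]  # assumes all values are positive
    lo, hi = min(logs), max(logs)
    return [(x - lo) / (hi - lo) for x in logs]

# With a raw min-max scale the outlier would squash the first three values
# below 0.06; after the log transform they spread across the 0-1 range.
mileages = [2_000, 20_000, 60_000, 1_000_000]
print([round(x, 2) for x in log_min_max_normalize(mileages)])  # [0.0, 0.37, 0.55, 1.0]
```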