Human beings like categorizing things, and categories are so fundamental to the way we process information that the world would be largely incomprehensible without them. So it would make sense for our business analytics toolsets to support categories. In fact, we employ elementary forms of categorizing almost every time we use analytical tools, although the more common name is grouping. We might group sales by period or territory, and we might group customers by the value of orders raised.
But this article is about advanced analytics: analytical methods that display some level of intelligence. So instead of simple (or even more complex) types of grouping, there are times when we would like the data itself to tell us how data points should be grouped. The technique that enables this is clustering. As the name suggests, it groups data instances that an algorithm judges to be similar, where similarity is measured by how close the data points are to each other.
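To make the idea of closeness concrete, here is a minimal sketch of k-means, one widely used clustering algorithm. It is an illustration under stated assumptions rather than the method any particular product uses: the function names, the sample points, and the naive choice of starting centers are all hypothetical.

```python
import math

def nearest(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def k_means(points, k, iterations=20):
    """Group points into k clusters by repeatedly assigning and re-centering."""
    # Assumption: start from the first k points. Real tools choose more carefully.
    centers = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Step 1: assign every point to its closest center.
        labels = [nearest(p, centers) for p in points]
        # Step 2: move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(vals) / len(members) for vals in zip(*members))
    return labels, centers

# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
labels, centers = k_means(points, k=2)
print(labels)
print(centers)
```

The six points fall into two groups, one near (1, 1) and one near (8, 8), without anyone having specified where the boundary between them lies.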
The classic business application for clustering is market segmentation. We might use measures such as age and income to cluster items purchased by customers. There may be some useful clusters here, and there may not. Clustering does, however, allow sales and marketing people to explore the way their markets segment.
Clustering is not particularly new, and it is available in many analytics toolkits. Traditionally, however, using it has needed a high level of skill, since issues such as data normalization and manual adjustment of the number of clusters had to be addressed. This is no longer the case, and a few suppliers of easy-to-use visual analytics tools now include clustering. Tableau is perhaps the best example, since it introduced wholly automated clustering in version 10 of its product. By wholly automated we mean that normalization, the number of clusters, and other parameters are handled automatically. No business user wants to grapple with issues such as whether Euclidean or Manhattan distances should be used!
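For readers curious about what is being automated, the sketch below shows why normalization matters and how the two distance measures differ. It is a minimal illustration, not a description of how Tableau or any other tool works internally; the customer figures and function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def standardize(column):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # a constant column carries no information
    return [(x - mu) / sigma for x in column]

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of the absolute differences along each measure."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical customers: age in years, income in dollars.
ages = [25, 26, 65, 64]
incomes = [40000, 60000, 41000, 61000]

raw = list(zip(ages, incomes))
scaled = list(zip(standardize(ages), standardize(incomes)))

# On raw values, income dwarfs age simply because its numbers are bigger.
print(euclidean(raw[0], raw[1]), euclidean(raw[0], raw[2]))
# After standardizing, a 40-year age gap counts about as much as a 20,000 income gap.
print(euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2]))
# The two measures generally disagree on how far apart two points are.
print(euclidean(scaled[0], scaled[3]), manhattan(scaled[0], scaled[3]))
```

Without the rescaling step, the 25-year-old and the 65-year-old on similar incomes look almost identical, which is rarely what a marketer would want.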
The accompanying diagram shows how Tableau has automatically clustered some astronomical data. The axes plot the radius of each planet’s orbit against the actual radius of the planet, and the size of the dots indicates the planet’s temperature. Although this is a 2-D display, the measures used in the clustering calculation can number as many as are needed, and do not have to correspond with the measures shown on the graph. Tableau has decided that 3 clusters best categorize these data, although users can manually specify the number of clusters if they wish.
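One way software can settle on a number of clusters is to score several candidate groupings and keep the best. The sketch below uses the Calinski-Harabasz index, a standard score that rewards clusters that are tight internally and far apart from one another. It illustrates the general idea only and should not be read as a description of Tableau's internals; the data and the two candidate groupings are hypothetical.

```python
def centroid(points):
    """Mean position of a list of points."""
    return tuple(sum(vals) / len(points) for vals in zip(*points))

def sq_dist(p, q):
    """Squared straight-line distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def calinski_harabasz(points, labels):
    """Ratio of between-cluster spread to within-cluster spread (higher is better).

    Assumes at least two clusters and at least one cluster with some spread.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    n, k = len(points), len(clusters)
    overall = centroid(points)
    between = sum(len(m) * sq_dist(centroid(m), overall) for m in clusters.values())
    within = sum(sq_dist(p, centroid(m)) for m in clusters.values() for p in m)
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical data: three small, well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (1, 9), (2, 8)]

three_groups = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # one label per visible group
two_groups = [0, 0, 0, 1, 1, 1, 1, 1, 1]    # last two groups lumped together

score_three = calinski_harabasz(points, three_groups)
score_two = calinski_harabasz(points, two_groups)
print(score_three, score_two)
```

The three-cluster grouping scores far higher than the two-cluster one, which is the kind of evidence an automated tool can use to prefer 3 clusters over 2.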
Further to this, Tableau will display the statistics that relate to a cluster analysis, as shown below. There is nothing particularly bewildering here, and it should be within the capabilities of even novice users to perform a cluster analysis.
Having said all this, users should be cautious about their use of clustering and apply a certain level of common sense. Clustering will work with almost any data, but that doesn’t mean the analysis is meaningful. Determining meaning is wholly down to domain expertise: someone in sales or marketing will know whether a particular market segmentation makes sense or not.
Clustering is a powerful tool for those who need to segment their data in some way. Easy-to-use clustering tools make this technique available to a much broader audience, and many business users would be well advised to at least see what it can do for them. They may be pleasantly surprised.