“As far as the laws of mathematics refer to reality, they are not certain, and as far as they are certain, they do not refer to reality.” – Einstein
As people progress with their visual analytics tools, they move on from pie charts and histograms and start to embrace more ‘advanced analytics’. Some suppliers of these tools call linear regression an advanced analytic, although in reality there is nothing particularly advanced about it. If we plot a series of data points (a scatterplot), a linear regression is just a best fit line. ‘Best fit’ usually means that the sum of the squared vertical distances between the line and the points is as small as possible.
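To make ‘best fit’ concrete, here is a minimal Python sketch of an ordinary least-squares line, which is what most tools mean by linear regression. The function name `fit_line` and the data points are invented for illustration; they are not the house price figures discussed in this post.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented points that happen to lie exactly on y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f}x + {intercept:.2f}")
```

That is all there is to it: two averages and two sums, which is why calling it ‘advanced’ is a stretch.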
To illustrate how misleading linear regression can be I’ve used an index of house prices in the US, plotted against year (0 means 2000, 1 means 2001, and so on). Below we have this data for the years 2000 to 2008. It shows the data points and a linear regression line drawn through these points. I’ve also shown the value of R squared. The nearer this value is to unity, the better the line fits the data. The value on this graph, 0.9745, is the sort of thing statisticians dream of. And so the sub-prime mortgage crisis was nowhere to be seen except by people who actually looked at the detailed data (rising defaults for example). The illusion of mathematical certainty was just too good to ignore, so much so that the rating agencies did not even consider that this line might be a complete illusion.
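For readers who want to see what R squared actually measures, here is a rough Python sketch. The index values are invented for illustration and are not the house price data from the chart; `fit_line`, `r_squared` and `later_value` are hypothetical names. It shows that a value close to 1 only describes the points already on the chart and says nothing about the ones still to come.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Share of the variation in ys that the line accounts for."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented index for years 0..8 (NOT real house price data).
years = list(range(9))
index = [100, 108, 117, 128, 142, 158, 172, 180, 176]

slope, intercept = fit_line(years, index)
r2 = r_squared(years, index, slope, intercept)

# Extend the line to year 12 and compare with an invented later value.
predicted = slope * 12 + intercept
later_value = 140  # assumption: a made-up post-crash reading
print(f"R squared = {r2:.3f}")
print(f"line says {predicted:.0f}, invented reality says {later_value}")
```

The fit looks superb on the first nine points, yet the extended line is nowhere near the later value. R squared was never asked about the future, so it has nothing to say about it.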
And so reality bites. Our reassuring regression line was giving us absolutely no indication of future behavior. The graph below shows what happened after 2008 – as we all know. Some people, less impressed with linear regression than most, actually saw the crash coming – well documented in the movie “The Big Short”. In fact simple visual representations of data very rarely give any insight into future behaviors, and while analyzing the past is all fine and dandy, if the analysis says nothing meaningful about the future then it is just navel gazing.
But not to be thwarted, some visual analytics tools provide forecasting capabilities. These generally work by fitting an equation to the set of data points in a time series, and as shown below can be just as useless as a linear regression. The dotted line shows a curve that is a second order polynomial – in other words a curve where the highest term is to the power of 2 – or squared. But as you can see it gives a fairly poor representation of the data.
So, being keen to get a better fit, we can create a polynomial curve that has an order of 5 (3 and 4 were also poor fits). This is quite a good fit, as shown below. But here is the rub. The higher the order of the polynomial, the more closely it follows the existing data, quirks and noise included, and the less forecasting power it has.
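The trade-off between fit and forecast can be sketched in a few lines of Python. To keep the example free of external libraries, this is not the least-squares polynomial fit a charting tool would use; it swaps in the exact interpolating polynomial (Lagrange form), which for six points is an order 5 curve passing through every one of them. The data and the name `lagrange_eval` are invented for illustration.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate at x the unique polynomial through all (xs, ys) points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        # Basis term: equals 1 at xi and 0 at every other known x.
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Six invented points, so the curve is a polynomial of order 5.
xs = [0, 1, 2, 3, 4, 5]
ys = [100, 110, 125, 135, 130, 120]

# The curve reproduces every known point: a 'perfect' fit.
fitted = [lagrange_eval(xs, ys, x) for x in xs]

# Now ask it about a point three steps past the end of the data.
forecast = lagrange_eval(xs, ys, 8)
print("fitted:  ", [round(v, 2) for v in fitted])
print("forecast:", round(forecast, 2))
```

The fitted values match the data exactly, while the forecast lands far outside anything the data ever did. A perfect fit and a useful forecast are two very different things.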
All the methods discussed above are abstractions, moving further and further away from the detail, and detail matters. The guy who won big in the sub-prime mortgage crisis did so because he worked through all the detail. The half-wits in many of the rating agencies and banks, however, believed in the abstractions. The same applies to your business. You will move further away from understanding what is really going on as you indulge in more visual abstractions. Don’t do it. The detail is where the gold lies, but of course that requires hard work. It might be easier to draw a few linear regressions after all.