Statistics are wholly contrived metrics that attempt to characterise the data they are applied to. There is nothing sacred about a mean, a median, a standard deviation or any other measure used in statistical analysis. This approach to data analysis belongs to an era when we thought everything could be represented by an equation of some sort. Of course we could not have managed without statistics, particularly in the sciences, but even here the approach is flawed. One of the most heavily downloaded statistics papers of the last decade deals with flawed statistical methods in various research domains: ‘Why Most Published Research Findings Are False’ is an in-depth study of the unreliability of statistical analysis in crucial areas of life such as clinical studies, and for those with deep statistical knowledge it is often something of a revelation. Many commentators have taken to task the naive statistical assumptions researchers and business managers make, and Taleb (The Black Swan) blames our blind faith in the normal distribution for much of the trouble.
It has all become very complex and clunky (a sure sign that something is wrong), and multivariate analysis often assumes that each of the variables is independent, even though they usually are not. In an attempt to redress this situation (termed multicollinearity) statisticians introduced interaction terms, which Nisbet, Elder and Miner, in their book ‘Handbook of Statistical Analysis and Data Mining Applications’, call a ‘magnificent kludge’. It is all starting to look like the complex models of planetary movement before the realisation that the Earth orbits the Sun rather than the converse: a simple ellipse suffices to describe planetary motion, instead of many circles within circles.
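To make the kludge concrete, here is a minimal sketch (with made-up numbers, not taken from the book) of what an interaction term actually is: a new column formed by multiplying two predictors, so that a linear model can pick up their joint, non-independent effect.

```python
# Sketch: constructing an interaction term by hand.
# x1 and x2 are two predictors that are not independent of one another;
# the product column x1*x2 is the "interaction term" bolted onto the
# design matrix so a linear model can capture their joint effect.

rows = [
    (2.0, 3.0),
    (1.0, 5.0),
    (4.0, 0.5),
]

def add_interaction(rows):
    """Append the product x1*x2 as an extra feature to each row."""
    return [(x1, x2, x1 * x2) for x1, x2 in rows]

design = add_interaction(rows)
# each row now carries three features: x1, x2, and x1*x2
```

The model then fits a separate coefficient to the product column; it is a patch applied after the fact, which is precisely why it earns the label ‘kludge’.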
Despite these reservations and weaknesses the statistical approach has been beaten into a shape that does often produce good results. But there are many, many potholes.
Data mining, on the other hand, makes no assumptions about distributions, metrics or any other artificially imposed measures of the data. The typical data mining algorithm simply ploughs through the data looking for patterns, and its primary weakness is that it will find patterns in abundance: those that are trivial, unusable or simply incorrect, and of course those that are extremely useful and create great insights. This is a bottom-up approach: if 300 instances of data show a certain behaviour then maybe all instances with similar attribute values will too. Various tests are used to validate whether this is indeed the case, cross-validation techniques being the most common. Some data mining techniques (decision trees, nearest neighbour) provide strong visual insights while others (neural networks) provide none at all.
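The cross-validation idea mentioned above can be sketched in a few lines. This is an illustrative, hand-rolled k-fold routine with a deliberately trivial ‘model’ (a majority-class predictor); the data, the fold count and the toy model are all assumptions for demonstration, not any particular library's implementation.

```python
def k_fold_indices(n, k):
    """Deal the indices 0..n-1 round-robin into k folds."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def cross_validate(data, labels, train_fn, predict_fn, k=5):
    """Average held-out accuracy over k train/test splits."""
    folds = k_fold_indices(len(data), k)
    accuracies = []
    for fold in folds:
        test = set(fold)
        train_X = [data[i] for i in range(len(data)) if i not in test]
        train_y = [labels[i] for i in range(len(data)) if i not in test]
        model = train_fn(train_X, train_y)
        correct = sum(predict_fn(model, data[i]) == labels[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)

# Hypothetical toy run: the "model" just memorises the majority label.
data = list(range(10))
labels = ['a'] * 8 + ['b'] * 2
train = lambda X, y: max(set(y), key=y.count)   # model = majority label
predict = lambda model, x: model                # always predict it
score = cross_validate(data, labels, train, predict, k=5)
```

The point is the shape of the procedure, not the model: every pattern the algorithm finds is rechecked against data it never trained on, which is how the trivial and incorrect patterns are weeded out from the useful ones.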
Much research into machine learning techniques is ongoing, and over the last decade ensemble methods have largely revolutionised our understanding of how to construct powerful predictive models. These techniques are often counter-intuitive (always a signal that something interesting is happening): many weak models can be combined to outperform a single strong model, and throwing random variations at the learning algorithms can produce stunningly accurate composite models.
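The counter-intuitive claim that weak models combine into a strong one can be seen in a toy majority-vote sketch. The three ‘models’ below are fabricated prediction lists, chosen so their errors do not overlap; real ensembles achieve something similar by training diverse models on random variations of the data.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model prediction lists by taking the most common label
    for each item (one column per item across the models)."""
    return [Counter(column).most_common(1)[0][0]
            for column in zip(*predictions)]

def accuracy(predicted, truth):
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

# Ten items whose true label is always 1. Each weak model is only
# 70% accurate, but each one is wrong on a *different* three items.
truth = [1] * 10
m1 = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]   # wrong on items 0-2
m2 = [1, 1, 1, 0, 0, 0, 1, 1, 1, 1]   # wrong on items 3-5
m3 = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1]   # wrong on items 6-8

combined = majority_vote([m1, m2, m3])
# each model alone scores 0.7; because their mistakes never coincide,
# the vote corrects every error
```

The effect depends on the models' errors being uncorrelated, which is exactly why injecting randomness into the training process helps: it pushes the individual models to fail in different places.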
Some commentators have gone as far as suggesting that classical analytical techniques are as good as dead. In reality there is room for both, but the advantage always comes from new technology developments, and globally all eyes are on machine learning, much to the annoyance of the statistics community.