Machine learning can be viewed as a sophisticated form of search – searching through billions of combinations of variables, variable values, algorithm parameters and data selections to find useful patterns.
Now many of the patterns a machine learning algorithm will identify are simply ‘accidents’ of the data, with no counterpart in reality. These phantoms in the data tend to be weeded out by testing the patterns on previously unseen data, but even here some of these misleading patterns will escape detection.
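To make these 'accidents' concrete, here is a minimal sketch (plain Python, with made-up sizes) in which both the attributes and the target are pure coin flips; searching enough random attributes still turns up one that appears to predict the target noticeably better than chance.

```python
# Illustrative sketch only: purely random attributes versus a purely random
# target still yield an apparently "useful" pattern, simply because we search
# over many candidates. The sizes below are made up for the example.
import random

random.seed(1)

n_rows, n_attributes = 200, 500
target = [random.randint(0, 1) for _ in range(n_rows)]

best_attr, best_agreement = None, 0.0
for attr in range(n_attributes):
    column = [random.randint(0, 1) for _ in range(n_rows)]
    agreement = sum(c == t for c, t in zip(column, target)) / n_rows
    if agreement > best_agreement:
        best_attr, best_agreement = attr, agreement

# The "winner" typically matches the target on close to 60% of rows, well
# above the 50% expected by chance -- a pattern with no counterpart in reality.
print(f"best attribute agrees with the target on {best_agreement:.0%} of rows")
```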
When it comes to big data the situation can be much, much worse. This is particularly true when the data are represented by many attributes. The ability to collect ever more variables about an entity (customer, supplier, machine, etc.) carries some fairly profound dangers. As we increase the number of attributes, the machine learning algorithms will happily form a very large number of combinations of these attributes as they look for useful patterns. In fact the number of combinations grows exponentially: n attributes give 2^n possible combinations. A simple example will demonstrate the dangers here.
Imagine we start with 10 attributes – this gives 2^10 = 1,024 different combinations. Now suppose we believe the big data mantra and expand our attributes to 40 – a 4x increase. The number of combinations is now 2^40, roughly 1.1 trillion – about a billion times as many. The machine learning algorithms can have a party, picking up every spurious pattern within the data.
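A quick back-of-the-envelope check (Python) of the 2^n arithmetic:

```python
# The 2**n arithmetic behind the figures above.
for n in (10, 40, 100):
    print(f"{n} attributes -> {2**n:,} possible combinations")

# 10 attributes -> 1,024 possible combinations
# 40 attributes -> 1,099,511,627,776 possible combinations (about 1.1 trillion)
# 100 attributes -> 1,267,650,600,228,229,401,496,703,205,376 possible combinations
```

Note that going from 10 to 40 attributes multiplies the number of combinations by 2^30, i.e. roughly a billion – the billion-fold figure referred to below.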
Some will argue that having more instances (rows) of data will offset this tendency – but typically our data volumes will not be increasing by a factor of a billion. And of course, when we start to consider hundreds of attributes the situation becomes totally untenable.
Testing will not eliminate all the defective patterns that have no counterpart in reality. Sure, it will eliminate many, but certainly not all. And of course, while we revel in the increasing speed of hardware and algorithms, a billion-fold increase in processing requirements is not easily satisfied.
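A rough Monte Carlo sketch (hypothetical sizes, and a hypothetical 60% "looks predictive" threshold) illustrates why testing alone is not enough: generate thousands of purely random attributes, keep those that look predictive on the training rows, and count how many also pass a held-out test.

```python
# Illustrative only: with enough purely random attributes, a few look
# predictive on the training data AND still pass the held-out test, entirely
# by chance. All sizes and the 0.6 threshold are made up for this sketch.
import random

random.seed(2)

n_train, n_test, n_attributes = 100, 100, 10_000
train_y = [random.randint(0, 1) for _ in range(n_train)]
test_y = [random.randint(0, 1) for _ in range(n_test)]

def agreement(xs, ys):
    """Fraction of rows on which an attribute's value matches the target."""
    return sum(x == y for x, y in zip(xs, ys)) / len(ys)

survivors = 0
for _ in range(n_attributes):
    train_x = [random.randint(0, 1) for _ in range(n_train)]
    test_x = [random.randint(0, 1) for _ in range(n_test)]
    # "Discovered" on the training rows and not rejected by the test rows.
    if agreement(train_x, train_y) >= 0.6 and agreement(test_x, test_y) >= 0.6:
        survivors += 1

print(f"{survivors} of {n_attributes:,} random attributes passed both checks")
```

With these settings a handful of meaningless attributes typically survive both checks; scale the number of candidate attributes or combinations up by further orders of magnitude and the survivors multiply accordingly.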
So we need to be careful with our big data, and not simply trust that collecting more attributes will deliver more accurate predictive models. In some ways big data is fool's gold, unless some real understanding is used to sanitize the whole process.
This is not a popular message, and without doubt some organizations, unaware of issues such as this one, will be deploying patterns derived from data mining that are simply wrong. And the bigger the data gets, the more of these erroneous patterns they will find.