Big data is not just bigger small data. It introduces issues that never really reared their heads with traditional, modestly sized transaction databases. This is particularly true of analytics – and apart from regulatory requirements, why would we possibly want to store petabytes of data other than for analysis? Overfitting, or curve fitting, is one of the perils of data mining and machine learning. It simply means that our algorithms have failed to generalize, and have instead created patterns from the noise within the data. Traditional wisdom says that if we have enough data, this behavior is much less likely. However, that depends on what we mean by ‘enough data’. Not only are we holding more data instances, we are also collecting more data features, or variables. Most algorithms create combinations of these variables during their pattern searching, and so the more variables we make available, the more combinations the algorithms will produce. This growth is exponential: with n variables an algorithm can, generally speaking, create up to 2^n combinations. Many algorithms have some intelligence built in and will attempt to establish the most relevant variables, but even so, the exponential growth is still at play.
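As a rough illustration (a minimal sketch in plain Python, not tied to any particular tool or algorithm), this is what that doubling looks like:

```python
# The number of possible subsets of n variables is 2**n, so every extra
# variable doubles the size of the space an exhaustive search could explore.
for n in (10, 20, 30):
    print(f"{n} variables -> {2**n:,} possible variable combinations")

# Prints:
# 10 variables -> 1,024 possible variable combinations
# 20 variables -> 1,048,576 possible variable combinations
# 30 variables -> 1,073,741,824 possible variable combinations
```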
As a rule of thumb we should have as many instances of data as there are combinations of variables, algorithms, algorithm parameters, and data sets. This soon becomes a very large number. For example, take twenty variables, five parameters (each of which cycles through 10 values), five algorithms used to find patterns, and 10 data sets. This creates roughly five trillion combinations. Adding just one more variable bumps this up to around 10 trillion. Large numbers of patterns will be discovered, most of which are nothing more than noise in the data that happens to look respectable by pure accident. Part of this is called the curse of dimensionality – something well known to anyone who has used data mining and machine learning technologies. The answer to this problem is more data. But trillions of rows of data are usually not available, and this is just a modest example. I have seen data scientists boast of using hundreds of variables. They might use techniques such as principal component analysis to reduce the number of variables (or dimensions), but this barely dents the magnitude of the problem. Skilled analysts will choose their variables very carefully and keep them to a minimum – perhaps in the low tens. Even so, the problem is still very apparent.
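Putting rough numbers to the example above (a back-of-the-envelope sketch using only the figures already quoted: twenty variables, five parameters each cycling through ten values, five algorithms and ten data sets):

```python
# Back-of-the-envelope count of the search space described above.
variable_subsets   = 2 ** 20    # combinations of twenty variables
parameter_settings = 10 ** 5    # five parameters, each with ten values
algorithms         = 5
data_sets          = 10

total = variable_subsets * parameter_settings * algorithms * data_sets
print(f"{total:,}")        # 5,242,880,000,000  (~five trillion)

# One extra variable doubles the variable subsets, and hence the whole count:
print(f"{total * 2:,}")    # 10,485,760,000,000 (~ten trillion)
```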
The inevitable conclusion is that we need more data – big, big data. And remember, every extra dimension or variable doubles the data we need. This isn’t really a solution, and as Stephen Few points out in his book Signal, maybe we ought to drop the big data mentality and start looking for ways to use smaller amounts of data, with just a handful of variables and restricted algorithm tuning options.
This exponential appetite for data is well documented, but rarely talked about, since it undermines the big data mentality to some extent. At present there seems to be no way around it, and so we had better be somewhat more modest in our ambitions if we are not to deliver completely spurious patterns to business users – patterns that are nothing more than noise dressed up to look pretty.