Big data is exciting, full of potential, and dangerous. The term now encapsulates data mining, business intelligence, predictive analytics, data visualization, and pretty much everything else we might want to do with data. So, here are five ways you can really damage your business with big data:
Predictive Analytics and Data Mining are Simple – no they aren’t. Technology suppliers, keen to make a fast buck from the latest technology craze, are claiming that their platforms include easy-to-use data mining tools. For those unfamiliar with the term, data mining is the act of trawling through data in the hope of finding patterns of behavior that might be useful in the future; typically these methods are applied to customer data. The ease-of-use claim comes from some big, and not so big, names who either don’t know better or are simply saying what users want to hear to make a sale. The data mining process is tedious, labor intensive, and demands real skill. Don’t believe the ‘automatically build decision trees’ rap. Decision trees are one of the least reliable methods used in data mining: they are unstable, and changing just a few values in a few rows of your million-row database may well produce an entirely different tree. Don’t take my word for it: read Witten, Frank and Hall (Data Mining), page 353. The same applies to many other methods. So the first way to screw your business is to throw data at an algorithm and then start using the resulting model.
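If you want to see this instability for yourself, here is a minimal sketch – assuming scikit-learn and a synthetic dataset, not any particular vendor’s platform – that fits two unpruned decision trees on data differing in only a handful of rows and compares the results:

```python
# Minimal sketch of decision tree instability (assumes scikit-learn is installed).
# Two trees are trained on datasets that are 99.8% identical; only 20 labels differ.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# Perturb the labels of just 20 rows (0.2% of the data).
y_perturbed = y.copy()
flip = np.random.default_rng(1).choice(len(y), size=20, replace=False)
y_perturbed[flip] = 1 - y_perturbed[flip]

tree_a = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_b = DecisionTreeClassifier(random_state=0).fit(X, y_perturbed)

# The tree typically grows extra branches to chase the flipped rows, and the
# chosen split features can shift as well.
print("Tree A: root feature", tree_a.tree_.feature[0], "nodes", tree_a.tree_.node_count)
print("Tree B: root feature", tree_b.tree_.feature[0], "nodes", tree_b.tree_.node_count)
```

A model whose structure changes when 0.2% of the data changes is not something to deploy without question, however easy the tool made it to build.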
Use data visualizations to make decisions. Oh dear. Give a data visualization to ten people and each will interpret it differently. This is qualitative analysis, reliant on human judgement and interpretation. I’m sorry, but we are very, very good at interpreting random noise as meaningful – even random noise displayed in multi-colored, animated, three-dimensional data visualizations. Read Taleb, Kahneman and numerous other writers in this area if you need convincing. The net result, of course, is that our shiny new data visualization platform, sucking in data from throbbing big data servers, may well be leading us up the garden path. Qualitative analysis is extremely useful for describing data and for diagnostic work. But we need quantitative methods to tell us whether that trend on a scatterplot is real, or just a random blip of the kind that has occurred many times before. And, just to put the final nail in the coffin – more data doesn’t mean better data, as you will see in a moment.
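To make the point concrete, here is a minimal sketch – assuming NumPy and SciPy, and deliberately using pure noise rather than any real business data – of the sort of quantitative check that should sit behind the chart:

```python
# Pure noise plotted against time will often "look" like a trend to the eye.
# A simple regression test tells us whether the slope is distinguishable from chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = np.arange(50)
y = rng.normal(size=50)          # no underlying trend at all

slope, intercept, r, p_value, stderr = stats.linregress(x, y)
print(f"fitted slope = {slope:.4f}, p-value = {p_value:.3f}")
# A p-value well above 0.05 says the drift you think you can see is
# indistinguishable from random variation, no matter how striking it looks
# in an animated, three-dimensional dashboard.
```

The eye will happily find a story in that chart; the test will not.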
More data is better data. So instead of building our predictive models from ten thousand rows, we can use ten million. Or, if you really want to big yourself up at a big data conference, why not use a billion rows? In most data mining applications ten thousand rows will tell you just as much as a billion. Stick with the smaller sample and the cost of producing models drops, and life becomes much simpler. Trying to ensure data quality on a billion rows is a Herculean task – and often an unnecessary one. And the issues of model instability apply just as much to big data as they do to small data. Read this article in The Wharton Journal if you need convincing (and there are plenty of others).
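The diminishing returns are easy to demonstrate. Here is a minimal sketch – again assuming scikit-learn and synthetic data, with a plain logistic regression standing in for whatever model you actually use – that measures hold-out accuracy as the training set grows:

```python
# Hold-out accuracy as a function of training set size (assumes scikit-learn).
# Past a few thousand rows the curve usually flattens; the extra rows buy very little.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

for n in (1_000, 10_000, 100_000):
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{n:>7} training rows -> accuracy {acc:.3f}")
```

If the last line of that output is barely better than the first, the extra ninety thousand rows were a storage bill, not an insight.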
More attributes are better attributes. This should really have been the first way to screw your business, but I didn’t want to be so cruel; hopefully you’ve become a bit acclimatized by now. I recently saw a representative from a large UK insurance business speaking about the predictive models being built with the aim (as always) of selling more stuff to customers. She proudly declared that the models they build often involve four hundred customer attributes. We no longer hold just half a dozen customer details – name, address, phone, age maybe – but can now collect information on what was purchased, when it was purchased, the intervals between purchases, dependents, number of cars, marital status, employment status, and so on. Data mining algorithms love this sort of data – they can throw out all sorts of nonsense and sometimes get away with it. In the process of mining, the algorithms combine attributes in the search for meaningful patterns. With just two attributes there are four possible combinations – neither, the first, the second, or both. In general the number of combinations follows an exponential law, 2^n. So with 40 attributes the algorithms can have a wonderful time with over a trillion combinations (and 400 attributes would give more combinations than there are atoms in the universe!). All that random noise can sometimes be made to look respectable – just by accident. This is where skill is needed, to select the most relevant attributes rather than throw a whole bucketful at the algorithms just because we can. So if you deal with this particular insurance company, don’t be surprised if you get an offer to insure your elephant at a discounted rate.
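The arithmetic behind that 2^n claim is worth seeing written down – a few lines of Python will do it:

```python
# Every attribute is either in or out of a candidate combination,
# so n attributes give 2**n possible subsets for an algorithm to explore.
for n in (2, 40, 400):
    print(f"{n:>3} attributes -> {2**n:.3e} possible combinations")
# For comparison, the observable universe holds roughly 10**80 atoms.
```

Search a space that size on noisy data and something spurious will eventually look significant, purely by accident.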
Believing Suppliers. I don’t want to be unnecessarily unkind to big data technology suppliers – we all need to put food on the table. But big data is new, experience is limited, expectations are sky high, and so it is easy to make promises that make big data seem all warm and cuddly. As Michael Jordan (the Berkeley machine learning researcher, and perhaps the person who knows most about big data analytics) says, most of the problems associated with big data analytics are not only not solved, they are not even understood. So this is going to be a journey of discovery, and not a ‘light the blue fuse and retire’ exercise.
The best advice here is the advice that applies to all new technology adoption. If you have a desperate need, and a clearly defined business advantage that might be realized, then dip your toe in the water. Move carefully, and better still, try to engage suppliers who have been doing analytics (big or little data) for a long time; they will know the dangers, and how to avoid them. If there isn’t a pressing need, then sit back, let the screw-ups happen elsewhere and learn from them. And when the technology, skills and methods have matured somewhat, it becomes much more likely that an investment in big data will return the benefits expected.