Abstract:
The overwhelming use of IT in most organizations centers on the filing and retrieval of data. The filing may be electronic, but it is still just a more efficient means of acquiring, sorting and retrieving data. This is a cost of running a business, and it adds very little value in itself. Data mining is a determined attempt to extract value from the terabytes of data most organizations acquire on an ongoing basis. Without such technology these data are wasted assets, and the organization misses opportunities to gain insights into customers, operational activities, suppliers and even employees. These insights are often far from trivial.
What is Data Mining?
Data mining is the search for patterns of behavior that are reliable enough to be used in current operational activity, improving both efficiency and efficacy. The process typically involves analysis of historical data, discovery of patterns, validation of those patterns, a sanity check by domain experts and, if the patterns are accepted, their implementation in business processes. Perhaps the best-known example is deciding whether a candidate is a good risk for a loan. Mining historical data will usually establish which customers proved to be bad risks and which proved to be good risks. Data mining algorithms will often use attributes such as salary and age to characterize the behavior of customers. Once the patterns have been discovered they can be implemented in the systems that support loan approval. Data mining is also very common in marketing, where discovered patterns often indicate the best candidates to approach in a campaign, or the best customers for up-sell or cross-sell opportunities. The younger brother of data mining is predictive analytics. This is simply a particular use of data mining technologies to predict behavior based on scoring. The loan approval process described above is one such application. Predictive analytics has assumed a high profile, probably because the name is more appealing than ‘data mining’, but it uses exactly the same technologies.
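The loan example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how any real lending system works: the records in HISTORY are invented, the function names (accuracy, discover_threshold, approve) are hypothetical, and salary is the only attribute considered, whereas real algorithms weigh many attributes together. The sketch discovers a salary cut-off from older records, validates it against records held back from discovery, and then applies it as a scoring rule.

```python
# Hypothetical historical loan records: (salary, repaid).
# The figures are invented purely for illustration.
HISTORY = [
    (18000, False), (22000, False), (25000, False), (27000, True),
    (28000, True), (30000, False), (34000, True), (38000, True),
    (47000, True), (52000, True), (21000, False), (31000, False),
]


def accuracy(records, threshold):
    """Share of records where 'salary >= threshold' matches the outcome."""
    hits = sum((salary >= threshold) == repaid for salary, repaid in records)
    return hits / len(records)


def discover_threshold(records):
    """Pick the salary cut-off that best separates good from bad risks."""
    candidates = sorted({salary for salary, _ in records})
    return max(candidates, key=lambda t: accuracy(records, t))


# Discover the pattern on the older records only, then validate it
# on records that played no part in the discovery.
train, holdout = HISTORY[:8], HISTORY[8:]
threshold = discover_threshold(train)


def approve(salary):
    """Scoring rule as it might be embedded in a loan approval system."""
    return salary >= threshold


print("cut-off:", threshold)
print("accuracy on older records:", accuracy(train, threshold))
print("accuracy on held-back records:", accuracy(holdout, threshold))
```

Here the cut-off fits seven of the eight older records but only three of the four held-back ones, which is why validation and review by domain experts come before implementation.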
To a greater extent than business intelligence (BI), data mining and predictive analytics provide mechanisms to extract value from otherwise wasted information. While none of these terms has a rigorous definition, most of us identify manual data exploration and reporting activities with BI, and automated processes with data mining and predictive analytics. As data sources grow in number, and each provides larger volumes and more diverse types of data, the need for automated analysis increases. Of course data mining does not replace BI activities; it is an essential activity that complements them.
Another IT Fad?
The IT industry is no stranger to hype and inflated expectations. We’ve seen it with customer relationship management (CRM), service-oriented architecture (SOA) and any number of other three-letter acronyms. However, data mining has been around for well over a decade, and some of the statistical methods it uses are centuries old, so it is here to stay. It may be polished up and given go-faster stripes now and then (as is really the case with predictive analytics), but the underlying technology and philosophy are not going away. The departure from how we have traditionally used IT needs to be emphasized. Until recently IT has been almost exclusively targeted at cost reduction through labor displacement, with computers used as glorified filing cabinets and calculators. Data mining is something quite different. With data mining we are looking for the technology to add value and make our operations more effective. This may mean targeting customers more accurately, predicting machine failure on a production line, or analyzing which suppliers are most responsive. In a nutshell, we are talking about the introduction of intelligence into the systems we use. In many ways it highlights the distinction between efficiency and efficacy. Efficiency is necessary for business success but not sufficient; the extra ingredient is efficacy. We may be able to produce green mugs with pink handles more efficiently than anyone else, but does anyone want them? Where is the fashion in mugs moving? What associated products could be sold (mats, for example)? Data mining provides the means to answer these questions (through targeted market research).
Costs and Risks
Since data mining is a relatively new notion for many organizations, it is a good idea to phase implementation so that risks are reduced. The best approach is to start small and take on larger projects as confidence grows. Of course this goes against many human biases, but experience shows over and over again that an incremental approach is best.
Starting small may mean prototyping with an Excel add-in or using one of several open source data mining tools. Organizations with an existing commitment to a particular supplier of data mining tools should also start small, with a number of pilot projects. The initial projects should be brief and well defined, and they should address areas of business functionality where mistakes can easily be spotted and where impact is measurable but minimal. Data mining technologies are complex, and validating the patterns they find is a subtle process.
If an organization employs experienced professionals, the risk of curve fitting, over-fitting, data mining bias and so on should be minimized. These problems arise when the patterns found are no more than accidental fits, and when we mine data until it is forced to show us what we want to see. Even with the various statistical checks and cross-checking methods that data mining employs, it is best to involve domain experts. They will have a good feel for whether or not a pattern represents something real.
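The danger of accidental fits can be illustrated with a short Python sketch. It is a toy under stated assumptions: every attribute and every outcome is generated at random, so there is no real pattern to find, and the names and sizes used (N_ATTRIBUTES, N_SAMPLE, N_HOLDOUT, make_records) are hypothetical. The sketch mines a small sample for whichever attribute best "predicts" the outcome, then checks that attribute against fresh records.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

N_ATTRIBUTES = 200  # hypothetical yes/no attributes, all pure noise
N_SAMPLE = 40       # the small sample that gets mined
N_HOLDOUT = 1000    # fresh records used only for validation


def make_records(n):
    """Random attributes and random outcomes: there is nothing to find."""
    return [
        ([random.random() < 0.5 for _ in range(N_ATTRIBUTES)],
         random.random() < 0.5)
        for _ in range(n)
    ]


def accuracy(records, attribute):
    """How often the chosen attribute agrees with the outcome."""
    hits = sum(attrs[attribute] == outcome for attrs, outcome in records)
    return hits / len(records)


sample = make_records(N_SAMPLE)
holdout = make_records(N_HOLDOUT)

# "Mine" the sample: keep whichever attribute happens to fit best.
best = max(range(N_ATTRIBUTES), key=lambda a: accuracy(sample, a))

in_sample = accuracy(sample, best)
out_of_sample = accuracy(holdout, best)
print(f"in-sample accuracy {in_sample:.2f}, holdout accuracy {out_of_sample:.2f}")
```

Because the search tries many candidate patterns on few records, the winner will usually look convincing on the sample it was mined from and fall back to roughly coin-toss accuracy on the fresh records. Holding data back in this way is one of the cross-checks referred to above; it does not replace the judgment of domain experts.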
Initial costs will be modest and will relate more to skills than to technology. However, as confidence builds there may be a decision to embrace big data, and this will require considerable infrastructure. Eventually a full team of analysts, statisticians, programmers and domain experts may be needed, although clearly someone has to measure delivered benefits against ongoing costs.
Benefits
Assuming we know how to deal with the risks associated with data mining, the benefits are potentially as broad as the operations of the organization. At the time of writing it has been revealed that HP used data mining to identify employees who might leave. Data mining therefore has application in every part of the business, and the benefits will range from the trivial (predicting the demand for stationery in various departments) to the profound (discovering a simple adjustment of marketing message that produces a significant boost in sales).
Analytics will become the primary information processing activity in most organizations, demanding the most resources and delivering the most benefits. Traditional transaction-based systems will exist for administrative purposes and as fodder for analytics.
Summary
The business case for data mining is simple. Every aspect of the systems used in organizations should be complemented with intelligence. Much of this will eventually happen by default, as suppliers of solutions embed data mining processes into their products. But as always it will be the bespoke analytics that deliver the most benefit, and for this skills and resources will be needed. In a nutshell, this is a move from the efficient enterprise to the intelligent enterprise, the latter encompassing the former.