CRISP-DM (Cross Industry Standard Process for Data Mining) is a suggested sequence of steps a user of data mining technologies should employ to best develop a model. Its origins go back to a consortium including SPSS, Teradata, Daimler AG, NCR and OHRA, and the first version was made available in 1999.
The steps involved in CRISP-DM are fairly straightforward and include:
- Business Understanding – as the name suggests, this involves understanding project objectives from a business perspective and translating them into a data mining task.
- Data Understanding – is concerned with familiarization with the data: its quality, its quantity, and which parts are most relevant to the project.
- Data Preparation – involves cleaning, transformation and data selection. This is an iterative process.
- Modeling – is the process of applying data mining algorithms to the data, selecting parameters and homing in on the techniques that produce the most meaningful models. Data preparation will usually have to be revisited as models evolve.
- Evaluation – is validation of the models from a business perspective, and sanity checking by domain experts.
- Deployment – may be as simple as creating a report, or may involve embedding data mining models into production systems.
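The loop between data preparation and modeling in the steps above can be sketched in a few lines of Python. This is only an illustration, not part of the CRISP-DM specification: the names `Phase` and `run_project` and the `rework_rounds` parameter are hypothetical, and only the one loop-back mentioned in the list (from Modeling to Data Preparation) is modeled.

```python
from enum import Enum
from typing import List


class Phase(Enum):
    """The six CRISP-DM phases, in their nominal order."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


def run_project(rework_rounds: int) -> List[Phase]:
    """Return the phases visited, in order.

    Hypothetical helper: each time Modeling is reached and rework is still
    pending, the project steps back to Data Preparation before continuing.
    """
    visited = []
    phase = Phase.BUSINESS_UNDERSTANDING
    remaining = rework_rounds
    while True:
        visited.append(phase)
        if phase is Phase.DEPLOYMENT:
            return visited
        if phase is Phase.MODELING and remaining > 0:
            # The model exposed a data problem: revisit Data Preparation.
            remaining -= 1
            phase = Phase.DATA_PREPARATION
        else:
            # Otherwise move on to the next phase in order.
            phase = Phase(phase.value + 1)


if __name__ == "__main__":
    for step in run_project(rework_rounds=1):
        print(step.name)
```

With `rework_rounds=0` the project passes through each phase exactly once; every extra round adds one more pass through Data Preparation and Modeling.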
CRISP-DM is used by many data mining practitioners, but is most widely promoted by IBM and is incorporated into SPSS Modeler. As with all methodologies of this nature, users can get bogged down in the details, or they can take a more moderate approach and treat it as a general framework.