Text mining is the analysis of a set of documents for meaningful patterns. We might, for example, mine email conversations between customer services and customers to establish whether there are any patterns in behaviour: telling a customer that she is a nuisance might correlate with the closing of accounts.
There are a number of fairly well-defined stages associated with text mining. The first involves gaining access to the relevant documents (emails, PDFs, word-processing files, web pages, etc.). In their raw state these documents typically cannot be mined, and so preprocessing is necessary. This involves extracting features from the documents so that they share a common representation and can be compared with one another. Preprocessing is often the most difficult part of the process and may involve taxonomies, ontologies and a number of other constructs that add meaning.
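As a rough illustration of what feature extraction can look like, the sketch below turns raw text into term-count vectors over a shared vocabulary (a simple bag-of-words model). This is only one of many possible representations and ignores taxonomies and ontologies entirely; the function names and sample sentences are invented for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents):
    """Collect every distinct token across the documents, in sorted order."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def vectorize(text, vocabulary):
    """Map one document to a list of term counts aligned with the vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


# Hypothetical sample documents, for illustration only.
documents = [
    "Please close my account.",
    "Your account is now closed.",
]

vocabulary = build_vocabulary(documents)
vectors = [vectorize(doc, vocabulary) for doc in documents]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Every document now has the same length and column meaning, which is what allows the mining stage to compare them. Note that "close" and "closed" remain separate features here; real preprocessing would usually add stemming or similar normalisation.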
Once the documents are in this state we can apply various mining techniques to discover patterns. Typically the patterns are communicated to users via a presentation layer in the form of graphical displays. There are many modes of display including clusters, hierarchies and so on.
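One very simple mining technique is to count which terms tend to appear together in the same document. The sketch below assumes each document has already been reduced to a set of terms; the data and the function name are hypothetical, and real tools use considerably more sophisticated methods.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(token_sets):
    """Count how many documents contain each unordered pair of terms."""
    pair_counts = Counter()
    for tokens in token_sets:
        # Sorting gives each pair one canonical (alphabetical) ordering.
        for pair in combinations(sorted(tokens), 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical preprocessed documents, one set of terms per document.
token_sets = [
    {"nuisance", "account", "closed"},
    {"nuisance", "closed"},
    {"account", "balance"},
]

pairs = cooccurrence_counts(token_sets)
for pair, count in pairs.most_common():
    print(pair, count)
```

In this toy data the pair ("closed", "nuisance") occurs in two of the three documents, which is the kind of pattern a presentation layer might then surface graphically.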
Text mining tends to provide an over-supply of patterns, and so some form of trimming is usually carried out to home in on the most pertinent information. In many ways text mining is just a specialised application of data mining, with the added complications that unstructured data brings; hence the emphasis on considerable preprocessing. Contemporary text mining tools will automate much of the donkey work and provide a rich, configurable user interface.
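Trimming can be as simple as discarding patterns that occur in too few documents and keeping only the strongest of the remainder. The sketch below assumes patterns are stored as counts of the documents in which they appear; the threshold values, names and data are illustrative assumptions, not recommended settings.

```python
def trim_patterns(pattern_counts, total_documents, min_support=0.5, top_k=10):
    """Keep patterns whose support meets the threshold, strongest first.

    Support is the fraction of documents in which a pattern occurs.
    """
    kept = [
        (pattern, count / total_documents)
        for pattern, count in pattern_counts.items()
        if count / total_documents >= min_support
    ]
    # Order by descending support, breaking ties by pattern for stable output.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept[:top_k]


# Hypothetical pattern counts taken from three documents.
pattern_counts = {
    ("closed", "nuisance"): 2,
    ("account", "closed"): 1,
    ("account", "balance"): 1,
}

trimmed = trim_patterns(pattern_counts, total_documents=3, min_support=0.5)
for pattern, support in trimmed:
    print(pattern, round(support, 2))
```

Raising `min_support` or lowering `top_k` trims more aggressively; the right balance depends on how many patterns a user can reasonably review.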