It is often said that 80% of the data held by a typical organisation is unstructured, and much of this is text. Email, social content, customer notes, medical records, contracts and a seemingly endless array of other text-based data are produced in prolific quantities every day. The 80% statistic doesn’t tell us how important these data are, but in the main, text is much richer in content than the structured data held in databases. Important hidden knowledge and nuances are held in textual data, and considerable value can be created by analysing it. But, and here is the rub, extracting information from text is a far more subtle art, and a more demanding science, than the analysis of structured data. In fact there are several approaches, each serving a fairly well-defined need, although they overlap.
Information Retrieval simply serves the need of gaining access to text data when it is needed. This is much the same as a Google search, where a search term is entered and relevant documents are returned. The term Enterprise Search has been coined to embrace this approach, and the technology serves the purpose of providing a ‘private Google search’ on enterprise data. Without such a facility an organisation has the equivalent of its own Internet but without any search facility, which is a true waste of resources. Many suppliers offer solutions to this need, including IBM, Oracle and Autonomy (now part of HP).
Natural Language Processing embraces a wide variety of capabilities, many of them quite specialised. These include language translation, identifying relationships among entities, speech recognition, question answering (used in some automated assistants), and text summarisation. Some organisations need extensive NLP capabilities, particularly those engaged in knowledge-intensive business.
Text Mining has recently received a great deal of attention, not least because it holds so much promise. The basic idea is simple enough. Establish a collection of documents, process them so the text becomes meaningful to data mining algorithms, and then mine the data to look for meaningful patterns. The ultimate aim is usually to find patterns that enhance future decision making in sales, marketing, customer relationships, human resources, and any area of the business where meaningful patterns might be found. Surprisingly simple methods can give good results. The simplest is to treat documents as a ‘bag of words’, stripping away all sequencing, punctuation, grammar and semantics. The frequency with which words occur in a specific document or within a corpus of documents often reveals meaningful patterns when these data are combined with associated structured data (e.g. processing customer comments in conjunction with customer records in a database). Other approaches assign documents to topics which have been discovered in a corpus. Analysing the topics can then reveal insights that were previously unknown.
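The ‘bag of words’ idea can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the customer comments, the customer ids and the `churned` field are all invented for the example, and a real project would add steps such as removing very common words.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count the words in a document, ignoring order, case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical customer comments, keyed by a made-up customer id.
comments = {
    "C001": "Delivery was late. Very late, and nobody called.",
    "C002": "Great service, fast delivery!",
}

# Hypothetical structured records for the same customers.
records = {
    "C001": {"churned": True},
    "C002": {"churned": False},
}

# One bag of words per document.
bags = {cid: bag_of_words(text) for cid, text in comments.items()}

# Word frequencies across the whole corpus.
corpus_counts = Counter()
for bag in bags.values():
    corpus_counts.update(bag)

# Combine with the structured data: word frequencies per customer outcome.
counts_by_outcome = {True: Counter(), False: Counter()}
for cid, bag in bags.items():
    counts_by_outcome[records[cid]["churned"]].update(bag)

print(bags["C001"].most_common(3))
print(counts_by_outcome[True]["late"], counts_by_outcome[False]["late"])
```

Comparing the two counters in `counts_by_outcome` is the simplest form of the pattern hunting described above: words that are frequent for one group of customers and rare for the other are candidates for closer inspection.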
It is in text mining that most progress is being made. The results are very often probabilistic in nature, providing insight into the level of certainty associated with a predicted outcome (e.g. a certain customer might have a 60% probability of defaulting on a payment, and as such this may not be high enough for action to be taken).
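To make the point about probabilistic results concrete, the short sketch below shows how a predicted probability might be turned into a decision. The customer ids, the probabilities and the 75% action threshold are all assumptions made up for illustration; in practice the threshold would be chosen by weighing the cost of acting against the cost of not acting.

```python
def flag_for_action(probabilities, threshold):
    """Mark each customer whose predicted probability reaches the threshold."""
    return {cid: p >= threshold for cid, p in probabilities.items()}


# Hypothetical model output: probability of defaulting on a payment.
predicted_default = {"C001": 0.60, "C002": 0.85}

# Assumed business rule: only act when the model is at least 75% certain.
ACTION_THRESHOLD = 0.75

flags = flag_for_action(predicted_default, ACTION_THRESHOLD)
print(flags)
```

Here the customer with a 60% probability of default falls below the assumed threshold and is not flagged, which mirrors the example in the text.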
At present, text mining requires considerable skill and as such will be out of reach for smaller organisations. However, it is likely that technology suppliers will build applications addressing specific text mining needs, making the technology accessible to a much broader audience.