Natural language text is not a medium readily understood by computer systems, in contrast to the neatly arranged rows and columns of a database. This is the primary reason that text analytics had such a long gestation before it could be usefully employed in a business arena. It also means that much of the effort involved in text analytics is preparatory work, ensuring the data are in a format that text applications can process.
The first stage in dealing with text data is nearly always the identification of individual words and phrases (usually called tokens). Even this is not as simple as it sounds, since abbreviations, acronyms, synonyms and ambiguity make the task quite involved (the word ‘wave’ has multiple meanings, for example). It is also usually necessary to identify parts of speech, and specifically which words are nouns, verbs, adjectives and so on. Many words are meaningless as far as text analysis is concerned and can be ‘stopped out’. Words such as ‘a’, ‘it’, ‘and’ and ‘to’ can usually be removed and, unsurprisingly, are called stop words. A significant part of natural language processing is dedicated to these tasks, and they are a prerequisite for other work. At the heart of this approach is an attempt to infer some level of meaning within documents, by identifying important entities, concepts and categories.
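The tokenisation and stop word steps described above can be sketched in a few lines of Python. The stop list here is tiny and purely illustrative; real systems use much larger lists (and libraries such as NLTK or spaCy handle tokenisation and part-of-speech tagging properly):

```python
import re

# A tiny illustrative stop list; production systems use far larger ones.
STOP_WORDS = {"a", "it", "and", "to", "the", "of", "in", "is"}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that carry little meaning for analysis."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("The customer asked to extend the loan and it was approved.")
print(remove_stop_words(tokens))
# ['customer', 'asked', 'extend', 'loan', 'was', 'approved']
```

Even this toy version shows how much of the raw text falls away before any analysis begins.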
A wholly different approach can be adopted by using statistical methods. Here we might simply count the number of times various words appear in a corpus of documents and infer some level of importance from the resulting frequencies. One of the most useful metrics based on this approach is term frequency-inverse document frequency (tf-idf). A word's score increases when it appears frequently in a given document but is not common across all documents. The word ‘loan’ may appear frequently throughout a corpus of documents and so carry no particular importance in any one of them, whereas the word ‘default’ would appear less often (hopefully) and so have more significance in a specific document. This approach can give useful results, but context and meaning are almost entirely lost. In an attempt to address this, short sequences of words called n-grams can be processed, which at least offers the opportunity to identify frequent combinations of words. Significantly more sophisticated approaches are often used in commercial text analytics products, a good example being probabilistic latent semantic analysis, where documents can be assigned to discovered topics.
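A minimal sketch of the tf-idf idea, together with an n-gram helper, might look like the following. The three-document corpus is invented purely to illustrate the ‘loan’ versus ‘default’ point made above:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Contiguous word sequences of length n (bigrams for n=2, and so on)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(term, doc_tokens, corpus):
    """Score rises with frequency in this document, and falls as the
    term becomes common across the whole corpus."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    doc_freq = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / doc_freq)  # assumes the term occurs somewhere
    return tf * idf

corpus = [
    ["loan", "approved", "promptly"],
    ["loan", "rejected", "sadly"],
    ["loan", "default", "risk"],
]
print(tf_idf("loan", corpus[2], corpus))     # 0.0: 'loan' is in every document
print(tf_idf("default", corpus[2], corpus))  # ~0.37: rare, hence significant
```

The common word scores zero while the rare word stands out, which is exactly the behaviour the metric is designed to deliver.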
The above methods, and many others, can be used to generate input to data mining activities. We might have a detailed transactional history of customer activity, but little notion of how happy or otherwise the customers are. To this end we might use some of the above methods to identify customer sentiment and add the results as additional variables (usually called features) to our customer records. This approach is proving successful in many applications.
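Deriving such a feature can be sketched as follows. The word lists and scoring rule here are hypothetical and deliberately crude; a real deployment would use a trained sentiment model or a commercial text analytics product:

```python
# Hypothetical lexicon-based sentiment scoring, for illustration only.
POSITIVE = {"happy", "great", "excellent", "thanks"}
NEGATIVE = {"unhappy", "terrible", "complaint", "cancel"}

def sentiment_score(text):
    """Crude score: (positive - negative word count) over total words."""
    words = text.lower().split()
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)

# A structured customer record gains a feature derived from free text.
customer = {"id": 1042, "transactions": 37}
customer["sentiment"] = sentiment_score("terrible service I want to cancel")
print(customer["sentiment"])  # negative: an unhappy customer
```

The point is the shape of the pipeline: unstructured text in, a numeric feature out, appended alongside the existing transactional fields.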
There are two ways to address the complexities associated with text analytics. The first is simply to buy a ‘solution’ for the task at hand. Various suppliers provide analytics solutions for a range of vertical and horizontal business needs. The benefits of this approach include fast implementation, a reduced need for in-house staff and the ability to call upon a supplier with experience in a particular domain. The downsides are usually a lack of tailoring to particular needs, less insight into how an application actually works, and a potential dead end if a solution cannot be modified sufficiently. The alternative is to build an in-house capability, with the associated costs, but with the opportunity to address specific needs accurately.
Text analytics can deliver simple-to-use functionality, often seen in information retrieval and characterised by some form of search capability. But it is also making its way into data mining activities, and it is here that more capable organisations will realise significant advantage.
Next article in this series – Unstructured Meets Structured Data
Text Analytics: a business guide
A report for business and technology managers wishing to understand the impact of rapidly evolving text analytics capabilities, and their application in business.