Text data are typically held as notes, documents and various forms of electronic correspondence (emails, for example). Structured data, on the other hand, are usually contained in databases with fixed schemas. Many data mining techniques have been developed to extract useful patterns from structured data, and this process is often enhanced by the addition of variables (called features) which add new ‘dimensions’, providing information that is not contained, even implicitly, in the existing features. Appropriate processing of text data can supply such new features, improving the effectiveness of predictive models or providing new insights.
Incorporating features derived from customer notes, email exchanges and comments can improve lead targeting, flag possible defection and even contribute to the identification of fraud. The methods used to extract useful features from text depend on the domain, the nature of the business application and the characteristics of the text data. However, a statistical approach based on a frequency matrix (a count of the words appearing in various text sources) often yields useful new features once appropriate statistical techniques have been applied. Other approaches employ named entity extraction (NEE), where probabilities are assigned to the likelihood that a particular document refers to a given entity (people, places, products, dates and so on).
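As a concrete illustration of the frequency-matrix idea, the sketch below builds a simple word-count matrix from a handful of invented customer notes. It is a minimal sketch only, assuming Python with scikit-learn available; the note texts are illustrative and not drawn from any real data.

```python
# Minimal frequency-matrix sketch: one row per note, one column per word.
# Assumes scikit-learn is installed; the notes below are invented examples.
from sklearn.feature_extraction.text import CountVectorizer

notes = [
    "Customer unhappy with delivery delay, considering cancellation",
    "Requested quote for premium plan, very positive call",
    "Complaint about billing error, second incident this month",
]

vectorizer = CountVectorizer(stop_words="english")
freq_matrix = vectorizer.fit_transform(notes)

print(vectorizer.get_feature_names_out())  # the words that become columns
print(freq_matrix.toarray())               # counts that can feed statistical techniques
```

The columns of such a matrix (or statistical summaries derived from them) are the candidate features that can then be added to a structured dataset.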
A prerequisite for combining text and structured data is some form of integrated data environment, since the sources of data can be highly diverse and volatile. While predictive models can be built using various interfaces to unstructured data, implementing the resulting models requires tight data integration and a scalable computing environment. This can be achieved through big data infrastructure such as Hadoop and its associated technologies, although this is not a trivial undertaking. The alternative is to embrace the integrated infrastructure and tools provided by some of the larger suppliers in this domain.
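At the modelling level, combining the two kinds of data can be as simple as feeding text-derived features and structured fields into one pipeline. The following is a hedged sketch only, assuming Python with pandas and scikit-learn; the column names, sample records and churn target are illustrative assumptions, not part of the original article.

```python
# Sketch of one predictive model that uses both structured fields and free-text
# notes. All data, column names and the churn label are invented for illustration.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

data = pd.DataFrame({
    "tenure_months": [3, 48, 12, 30],
    "monthly_spend": [20.0, 75.0, 30.0, 55.0],
    "agent_notes": [
        "threatened to cancel after repeated outages",
        "happy with service, asked about loyalty rewards",
        "complained about price increase, considering rivals",
        "routine query about invoice date",
    ],
    "churned": [1, 0, 1, 0],
})

# Text column is vectorised; numeric columns are passed straight through.
features = ColumnTransformer([
    ("text", TfidfVectorizer(), "agent_notes"),
    ("structured", "passthrough", ["tenure_months", "monthly_spend"]),
])

model = Pipeline([("features", features), ("classifier", LogisticRegression())])
model.fit(data[["tenure_months", "monthly_spend", "agent_notes"]], data["churned"])
```

Building such a model on a laptop is straightforward; the harder part, as noted above, is deploying it against diverse, volatile sources at production scale.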
Such is the complexity of integrating text analytics with structured data that most organisations will opt to buy solutions for their needs. There is no point reinventing the wheel here, and advanced solutions are already emerging for customer and marketing applications where text data are incorporated into data mining activities. Topic-modelling techniques such as probabilistic latent semantic analysis and, in particular, Latent Dirichlet Allocation (LDA) are sophisticated methods for associating documents with topics. Specialist knowledge is needed to apply them, and many businesses will simply opt to buy this capability rather than master such complex statistical methods. In any case, the techniques themselves are only a small part of the story: infrastructure, skill sets, presentation methods, management and performance monitoring represent the larger part of the effort.
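For readers curious about what a topic model looks like in practice, the sketch below fits a small Latent Dirichlet Allocation model. It is a minimal sketch assuming Python with scikit-learn; the documents and the choice of two topics are arbitrary illustrations rather than a recommended configuration.

```python
# Sketch of Latent Dirichlet Allocation over a tiny, invented document set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "late delivery refund request courier complaint",
    "upgrade premium plan pricing discount offer",
    "billing error duplicate charge statement query",
]

counts = CountVectorizer().fit_transform(documents)

# Fit an LDA model with an assumed, small number of topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)

# Each row is a document's distribution over topics.
print(doc_topic)
```

The per-document topic probabilities produced in this way are precisely the kind of derived features that can be joined back to structured customer records.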
The integration of text data sources with structured data will see significant progress over the next few years. Organisations willing to integrate the 80% of their data that text represents, and which is currently missing from analytical activities, will gain insights and operational improvements that would otherwise not be possible.