Text mining concerns itself with discovering structure and patterns in unstructured data – usually text. There are many different approaches to this task: some focus on ancillary structures such as taxonomies and ontologies, some focus on semantics and natural language processing, while others use various algorithms to categorise and summarise. Which approach is most appropriate depends on the need.
GATE (General Architecture for Text Engineering)
This is a large full-lifecycle open source text mining software suite with several components:
* GATE Developer is an integrated environment consisting of language processing components which incorporate the widely used Information Extraction system along with other plugins.
* GATE Teamware provides a collaborative environment for document annotation. This is built around a workflow paradigm.
* GATE Embedded is a Java object library that provides an interface to other applications within the organisation.
KNIME Text processing is a plug-in to the (free) KNIME data mining suite. It supports a six-step text processing workflow that starts with the reading and parsing of text, followed by named entity recognition, filtering and manipulation, word counting and keyword extraction, bag-of-words (BoW) and vector representation, and finally visualisation.
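The word counting and bag-of-words steps can be illustrated with a minimal Python sketch. This is not KNIME's API: the function names `tokenize` and `bag_of_words` and the two sample documents are assumptions made purely for illustration, and the tokenizer is a deliberately simple regular expression.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(docs):
    """Return (vocabulary, vectors): one term-count vector per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocab] for c in counts]
    return vocab, vectors


# Hypothetical sample documents, for illustration only.
docs = [
    "Text mining finds patterns in text.",
    "Data mining finds patterns in data.",
]
vocab, vectors = bag_of_words(docs)

# Keyword extraction in its simplest form: the most frequent term.
top_keyword = Counter(tokenize(docs[0])).most_common(1)

for row in zip(vocab, *vectors):
    print(row)  # (term, count in doc 1, count in doc 2)
print(top_keyword)
```

Each document becomes a vector of term counts over a shared vocabulary, which is the form that downstream clustering or classification steps typically consume.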
LPU (Learning from Positive and Unlabeled Examples)
This is a text learning and classification system that utilises support vector machines (SVM) and EM (Expectation Maximisation) techniques. It runs in a DOS window.
This is an add-in to the free Orange data mining suite. It operates within the visual analytics tools provided in Orange and adds the ability to process unstructured data.
RapidMiner Text Extension
This provides operators for the RapidMiner environment for statistical text analysis. Many data sources are supported, including plain text, HTML and PDF. A large number of filtering techniques are available, along with support for tokenization, stemming, stopword filtering and n-gram generation. All of this sits within the graphical interface provided by RapidMiner (which is a free data mining suite), and many tasks can be completed through drag and drop.
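To make the terms concrete, the following Python sketch chains tokenization, stopword filtering, stemming and n-gram generation. It does not use RapidMiner operators: every name here (`STOPWORDS`, `crude_stem`, `ngrams`) is hypothetical, the stopword list is a tiny sample, and the stemmer is a crude suffix stripper standing in for a proper algorithm such as Porter's.

```python
import re

# Tiny illustrative stopword list; real lists contain hundreds of entries.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "are", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    """Drop very common words that carry little topical meaning."""
    return [t for t in tokens if t not in STOPWORDS]


def crude_stem(token):
    """Strip a few common English suffixes, keeping at least 3 letters."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


text = "The miners are mining the documents"
tokens = [crude_stem(t) for t in remove_stopwords(tokenize(text))]
bigrams = ngrams(tokens, 2)
print(tokens)
print(bigrams)
```

The crude stemmer reduces "mining" to "min" rather than "mine", which shows why production tools rely on more careful stemming rules.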
Other products of interest include:
The Apache OpenNLP library is a machine-learning-based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron based machine learning.
Apache Mahout supports recommendation mining, which takes users’ behavior and from it tries to find items users might like. Clustering takes, for example, text documents and groups them into sets of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabelled documents to the (hopefully) correct category.
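The classification idea can be sketched in a few lines of Python. This is not Mahout code: it is a small multinomial naive Bayes classifier chosen here for brevity, and the class name `NaiveBayes`, the category labels and the training sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Minimal multinomial naive Bayes text classifier (illustrative only)."""

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocab = set()

    def train(self, text, category):
        self.doc_counts[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1
            self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        words = tokenize(text)
        best, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Start from the log prior, then add per-word log likelihoods.
            score = math.log(n_docs / total_docs)
            for word in words:
                # Add-one (Laplace) smoothing avoids log(0) for unseen words.
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocab))
                )
            if score > best_score:
                best, best_score = category, score
        return best


# Hypothetical labelled documents, for illustration only.
clf = NaiveBayes()
clf.train("the team won the football match", "sport")
clf.train("the striker scored a late goal", "sport")
clf.train("the bank raised interest rates", "finance")
clf.train("shares fell as the market closed", "finance")

print(clf.classify("the team scored a goal"))
print(clf.classify("interest rates and the market"))
```

The classifier learns word frequencies per category from the labelled documents and assigns an unlabelled document to whichever category makes its words most probable.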