Luminoso identifies concepts within text data by drawing on a decade’s worth of research from the MIT Media Lab, specifically its ConceptNet project. It requires no training and no setting up of ontologies, taxonomies or dictionaries, and at present understands English, French, Spanish, German, Italian, Portuguese, Japanese and Chinese. The service runs in a web browser and only requires documents to be uploaded to the SaaS in CSV format. An API is also available for embedding text understanding within existing systems.
The concept-based approach makes the technology well suited to sentiment analysis, and this is indeed one of its major uses. Its primary interface is the dashboard, a rich graphical interface supporting the exploration of concepts found within text data. Concepts can in turn be grouped into topics. Positive sentiment, for example, might include the concepts ‘happy’, ‘satisfied’, ‘very good’ and so on. A concept cluster chart is also available to explore correlations between concepts: the concept ‘happy’ might correlate with the concept ‘inexpensive’, for example.
Luminoso will also auto-tag a corpus of documents, and can do this in three ways:
- Unsupervised topic-document correlation generates numerical scores for each document relative to a set of interesting topics. Example: a news article is tagged with “Nuclear Energy” because it’s about the Fukushima Plant.
- Classification with supervised machine learning associates a document with one of many classes established from ground-truth. Example: a social-media post is classified as spam based on its similarity to other known spam posts.
- Regression with supervised machine learning generates a numerical probability that a document belongs to each of many ‘classes’ established from ground-truth.
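To make the first of these modes more concrete, here is a minimal sketch of topic-document scoring in Python. It is not Luminoso's method or API: it assumes plain bag-of-words vectors and cosine similarity in place of Luminoso's concept model, and the names `bag_of_words`, `cosine`, `topic_scores` and the topic term lists are all hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_scores(document, topics):
    """Score one document against each topic (assumed: a topic is a list of terms)."""
    doc_vec = bag_of_words(document)
    return {
        name: cosine(doc_vec, bag_of_words(" ".join(terms)))
        for name, terms in topics.items()
    }


# Hypothetical topics and article, echoing the Fukushima example above.
topics = {
    "Nuclear Energy": ["nuclear", "reactor", "plant", "radiation"],
    "Sports": ["match", "team", "score", "league"],
}
article = (
    "Engineers at the Fukushima plant reported that the reactor "
    "cooling system is stable."
)

scores = topic_scores(article, topics)
best_topic = max(scores, key=scores.get)
print(best_topic, scores)
```

Each document receives one numerical score per topic, and the highest-scoring topic can be used as its tag. A word-count comparison like this only matches literal terms; a concept-based system would also be expected to relate words that never appear in the topic definition.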
Luminoso strikes a good balance between ease of use and sophistication. The product requires minimal training, and someone who knows their domain will quickly be extracting useful understanding from text data that would otherwise be ignored.