Self Service Data Preparation Platforms Summary
Greater data variety generally translates into greater data complexity. As businesses embrace diverse data sources, there is a pressing need to make those data more easily accessible and to combine them in ways that add value. This variety comes in the form of text data (customer comments, for example), social data, location data, data from devices and sensors, and the traditional rows and columns of relational databases. We shouldn’t assume that a given organization will be using all of these, but there is an undeniable trend toward more diverse and complex data in many businesses. It would also be folly to assume ‘big data’ has to be a part of this. Many businesses would be happy if they could easily access, combine and process their traditional data sources.
The platforms and tools listed below all aim to enable self-service data preparation. This means that savvy business users and business analysts can pre-process their data prior to use via easy-to-use interfaces where much of the grunt work has already been done. To this end, these platforms often use machine learning techniques to automate the identification of field types (numeric, date, zip code, telephone number, product and so on). They will also find relationships between data sources, and attempt to correct erroneous values. None of this is foolproof, but all these platforms provide a useful dialogue between the user and the platform’s assumptions about the nature of the data being analyzed. The end result is curated data that is suitable for various forms of analytics, including data visualization, data mining, and business intelligence applications.
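To illustrate the kind of field-type identification described above, here is a minimal sketch using simple pattern rules. The platforms themselves typically use machine learning; the heuristics, patterns and type names below are assumptions chosen purely for illustration, and the ambiguities they leave (a zip code is also a valid number, for instance) are exactly why the platforms keep the user in the loop.

```python
# A toy, rule-based sketch of field-type inference. Real platforms use ML;
# the regexes, check order, and type names here are illustrative assumptions.
import re
from datetime import datetime

def infer_field_type(values):
    """Guess a column's type from a sample of its string values."""
    def all_match(pred):
        return all(pred(v) for v in values if v)

    def is_date(v):
        for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
            try:
                datetime.strptime(v, fmt)
                return True
            except ValueError:
                pass
        return False

    # Check order matters: dates and zip codes would otherwise be
    # swallowed by the looser numeric and telephone patterns.
    if all_match(is_date):
        return "date"
    if all_match(lambda v: re.fullmatch(r"\d{5}(-\d{4})?", v)):
        return "zip_code"
    if all_match(lambda v: re.fullmatch(r"-?\d+(\.\d+)?", v)):
        return "numeric"
    if all_match(lambda v: re.fullmatch(r"\+?[\d\s().-]{7,15}", v)):
        return "telephone"
    return "text"
```

Even this toy version shows why user confirmation is needed: a five-digit product code and a zip code are indistinguishable by pattern alone.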
ClearStory provides a cloud-based platform that takes business users from data to visualization with minimal need for technical skills. In common with several other suppliers it uses machine learning techniques and Apache Spark in-memory processing to take data from its raw state to a state where business users can create the data visualizations they need. As data sources become more diverse, businesses are increasingly using the Hadoop big data platform as a ‘data lake’, and ClearStory supports this along with many other diverse data sources.
Users are presented with a collaborative environment where data discovery, exploration and visualization efforts can be shared via StoryBoards. These provide one or more visualizations in the context of a story line, and various users can annotate and comment as the StoryBoard evolves.
FORMCEPT provides the data analysis platform MECBot, available both on-premises and as a cloud deployment. At its heart is the C3 (compare, correlate and classify) engine, allowing organizations to build data-driven cognitive applications. This determines patterns and linkages across data and automatically generates views that are stored for future use. MECBot employs several open source technologies including Hadoop, HBase, and Solr. Overall it automates much of the effort associated with creating a unified view of data.
Paxata automates many data preparation tasks, and offers high levels of performance with its Spark based architecture. Big data preparation tasks are well accommodated, and business analysts will find the platform easy to use.
The core capability of Paxata leverages Hadoop and specifically Spark, so that large-scale in-memory processing is available for the machine learning algorithms that give Paxata much of its power. Paxata can be deployed on-premises or accessed as a cloud service. The on-premises deployment requires a Hadoop environment (either dedicated or shared).
Platfora provides an end-to-end big data discovery and exploration platform that starts at data ingestion and ends with visualization. In many ways it is the right product at the right time. Had Platfora tried to deliver its platform just four or five years ago, it would have been shooting at a moving target, since big data technologies were very immature. Today, however, the growing use and acceptance of Spark in-memory processing, in addition to a maturing of Hadoop, means that Platfora can deliver massively scalable data exploration and discovery tools that overcome many of the problems associated with traditional data warehousing platforms. These are built with predefined needs in mind, using limited data sets, and so are inflexible and slow to design, build and use. A three-month latency between need and capability is common with such platforms.
Platfora uses the Hadoop distributed file system (HDFS) as a data store, ingesting data from transaction-based systems, devices, external data feeds, and so on. Data is catalogued and prepared using Spark machine learning and in-memory processing. What this means in practice is that users get to see the connections between data sources, and data that has been prepared for analysis.
Tamr addresses the seemingly straightforward task of cataloging data sources, identifying attributes, integrating metadata over multiple sources, cleansing data, and publishing a cohesive view of data to the applications, services and tools that need it. Of course, as anyone who concerns themselves with such issues will know, it isn’t straightforward at all. Even tasks as simple as deduplication are beset with difficulties. To make the whole task somewhat more feasible Tamr employs machine learning technologies to carry out much of the grunt work. The algorithms work alongside relevant domain experts so that ambiguities and other issues can be resolved.
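A tiny example makes the deduplication difficulty concrete. Exact matching misses near-duplicates (“Acme Corp.” versus “ACME Corp”), so some form of normalization plus a similarity threshold is needed. The sketch below is not Tamr’s method; it uses Python’s standard-library `difflib`, and the normalization rules and threshold are illustrative assumptions.

```python
# A toy illustration of why deduplication is hard: exact comparison fails
# on near-duplicates, so we normalize and score similarity instead.
# The normalization rules and 0.85 threshold are assumptions, not Tamr's.
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, strip punctuation, and collapse whitespace."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())

def likely_duplicates(a, b, threshold=0.85):
    """Flag two names as probable duplicates above a similarity threshold."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold
```

Any fixed threshold will produce both false positives and false negatives, which is why Tamr pairs its algorithms with human domain experts rather than fully automating the decision.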
At the simplest level Tamr allows users to register data sources in a centralized catalog, facilitates the creation of a unified schema, cleanses data, and publishes via a RESTful interface a single version of the truth.
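The unified-schema idea can be sketched very simply: attributes from each registered source are mapped onto one canonical record shape, and everything downstream consumes that single shape. The source names, field names and mappings below are invented for illustration and are not Tamr’s actual model.

```python
# A minimal sketch of a unified view over multiple sources: each source's
# attribute names are mapped onto a canonical schema. All names here are
# invented for illustration.
CANONICAL = ["customer_name", "email", "postal_code"]

SOURCE_MAPPINGS = {
    "crm":     {"name": "customer_name", "mail": "email", "zip": "postal_code"},
    "billing": {"cust": "customer_name", "email_addr": "email", "postcode": "postal_code"},
}

def to_unified(source, record):
    """Re-shape one source record into the canonical schema."""
    mapping = SOURCE_MAPPINGS[source]
    unified = {field: None for field in CANONICAL}
    for src_field, value in record.items():
        if src_field in mapping:
            unified[mapping[src_field]] = value
    return unified
```

In practice discovering these mappings across hundreds of sources is the hard part, and is where the machine learning and domain-expert review described above come in.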
Trifacta provides a self-service data preparation platform that automates many data preparation tasks and allows users to interrogate their data in an efficient manner. The platform learns as users refine their data so that subsequent operations become more automated. It is a highly visual platform with copious graphical representations of data to aid the data wrangling process.
Machine learning algorithms sit at the heart of the platform, typically providing ranked lists of suggested operations relevant to the data. A complete data preparation task is stored as a script which can be compiled to run on a variety of systems.
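The ranked-suggestion idea can be sketched as scoring candidate clean-up operations by how much of a data sample each one would change, then presenting the highest-impact operations first. This is a stand-in for the ML-driven approach described above, not Trifacta’s implementation, and the candidate operations are invented for illustration.

```python
# A hedged sketch of ranking suggested operations: score each candidate
# by the fraction of sample values it would change, and suggest the
# highest-impact ones first. The candidates are illustrative assumptions.
def rank_suggestions(values):
    candidates = {
        "trim_whitespace": lambda v: v != v.strip(),
        "lowercase":       lambda v: v != v.lower(),
        "remove_empty":    lambda v: v == "",
    }
    scores = {
        name: sum(applies(v) for v in values) / len(values)
        for name, applies in candidates.items()
    }
    # Highest-impact operations first; zero-impact ones are dropped.
    return [name for name, s in sorted(scores.items(), key=lambda kv: -kv[1]) if s > 0]
```

Accepted suggestions would then be appended to the stored preparation script, which is how the platform “learns” a repeatable pipeline from interactive refinements.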