Big data analytics is moving on. Until recently it meant wrestling with MapReduce and gluing a dozen or more components together to make a viable big data platform. The big data analytics tools listed below ‘shrink wrap’ big data technologies to some degree and make the whole effort much more palatable. ClearStory, Platfora and Datameer provide business user interfaces, while Databricks and ZoomData are geared more toward the creation of production analytics applications. Paxata, on the other hand, is primarily concerned with everything that happens before data visualization.
ClearStory provides a cloud-based platform that takes business users from data to visualization with minimal need for technical skills. In common with several other suppliers (Tamr, Paxata, Platfora and others), it uses machine learning techniques and Apache Spark in-memory processing to take data from its raw state to a state where business users can create the data visualizations they need. As data sources become more diverse, businesses are using the Hadoop big data platform as a ‘data lake’, and ClearStory increasingly supports this alongside many other data sources. Users are presented with a collaborative environment where data discovery, exploration and visualization efforts can be shared via StoryBoards. Each StoryBoard presents one or more visualizations in the context of a story line, and users can annotate and comment as the StoryBoard evolves.
Databricks provides a platform as a service (PaaS) environment for business analysts and data scientists to conduct analysis in a performant big data environment, utilizing the Spark in-memory architecture and its associated ecosystem. In fact, Databricks has been the major contributor to the development of Spark, and there is probably no other organization that knows more about it. The essence of the Databricks offering is that it makes data ingestion, data preparation, analysis and the deployment of analytical models much simpler, and certainly simpler than wrestling with MapReduce code. Databricks uses ‘Notebooks’ as a workspace for analysts, and these can be converted into executable jobs if needed. The Spark platform comes with Spark SQL for familiar SQL processing of data, APIs for Java, Python, Scala and R, the MLlib machine learning library, support for streaming data, and the GraphX engine for processing graph-structured data. It is not a business user platform; it is much better suited to the creation of production analytics jobs, and particularly to predictive models that need to execute in near real-time.
Datameer provides the means to bring large, highly diverse data sets (text, relational, streaming data, etc.) into the Hadoop environment. Once the data is there, Datameer supplies the data wrangling tools necessary to profile it and transform it into usable formats. Analysts and data scientists can then use the large set of algorithms provided by Datameer to create predictive models and perform other forms of quantitative analysis. Finally, business users can visualize data using a wide variety of charts and dashboards, and more advanced visualizations such as clustering and decision trees can be created via an easy-to-use interface.
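To make the idea of profiling concrete, the following is a minimal sketch in plain Python of the kind of per-column summary a data wrangling tool might compute before any transformation. It is not Datameer's implementation or API; the function names (`profile_column`, `profile`) and the sample records are hypothetical and chosen only for illustration.

```python
from collections import Counter


def infer_type(value):
    """Classify a raw value as 'numeric' if it parses as a float, else 'text'."""
    try:
        float(value)
        return "numeric"
    except (TypeError, ValueError):
        return "text"


def profile_column(values):
    """Summarize one column: row count, missing values, distinct values, dominant type."""
    non_missing = [v for v in values if v not in (None, "")]
    types = Counter(infer_type(v) for v in non_missing)
    return {
        "rows": len(values),
        "missing": len(values) - len(non_missing),
        "distinct": len(set(non_missing)),
        "inferred_type": types.most_common(1)[0][0] if types else "unknown",
    }


def profile(rows):
    """Profile every column of a list of dict records (hypothetical input shape)."""
    columns = rows[0].keys() if rows else []
    return {c: profile_column([r.get(c) for r in rows]) for c in columns}


# Hypothetical sample data: one missing value in the 'spend' column.
sample = [
    {"customer": "a17", "spend": "120.50"},
    {"customer": "b02", "spend": ""},
    {"customer": "a17", "spend": "87"},
]
report = profile(sample)
```

A summary like this tells an analyst which columns need cleaning (missing values, mixed types) before they are transformed into a usable format.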
Paxata automates many data preparation tasks, and offers high levels of performance with its Spark-based architecture. Big data preparation tasks are well accommodated, and business analysts will find the platform easy to use. The core capability of Paxata builds on Hadoop, and specifically Spark, so that large-scale in-memory processing is available for the machine learning algorithms that give Paxata much of its power. Paxata can be deployed on premises or accessed as a cloud service. The on-premises deployment requires a Hadoop environment (either dedicated or shared).
Platfora provides an end-to-end data discovery and exploration platform for big data that starts at data ingestion and ends with visualization. The growing use and acceptance of Spark in-memory processing, together with the maturing of Hadoop, means that Platfora can deliver massively scalable data exploration and discovery tools that overcome many of the problems associated with traditional data warehousing platforms. Traditional data warehouses are built with predefined needs in mind, using limited data sets, and so are inflexible and slow to design, build and use. A three-month latency between need and capability is common with these platforms. Platfora uses the Hadoop distributed file system (HDFS) as a data store, ingesting data from transaction-based systems, devices, external data feeds, and so on. Data is catalogued and prepared using Spark machine learning and in-memory processing. What this means in practice is that users get to see the connections between data sources, and work with data that has already been prepared for analysis.
Skytree will primarily appeal to large organizations with some experience in the use of machine learning technologies, and in fact Skytree positions itself as ‘The Machine Learning Company’. Its Infinity platform provides the tools for analysts and data scientists to create predictive models in a manner that is both productive and effective. Productivity benefits come from the automation of many tedious tasks that typically require weeks of fine tuning. The effectiveness of the resulting predictive models is due mainly to the extensive data exploration tools, model performance monitoring, and the fact that most of the algorithms have been designed from the ground up specifically for big data analytics.
ZoomData connects to most data sources: big data stores, databases, text data, and so on. It also supports in-memory processing in the form of Spark and Databricks. The connectors are a real feature, and the speed of processing is in part due to native connectors to Cloudera Impala, Amazon Redshift and MongoDB. Its micro-query facility means users do not have to wait for a large query to complete; the visualizations are built incrementally as data are processed. The Visualization Studio uses various JavaScript libraries (D3, Leaflet, NVD3), and users can include other libraries if needed. Text search is particularly sophisticated, with faceted search and filters (Elasticsearch and Solr are supported). ZoomData supports the creation of complex dashboards and the embedding of visualizations into production applications.
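The micro-query idea, building a result incrementally rather than waiting for one large query to finish, can be illustrated with a short plain-Python sketch. This is not ZoomData's implementation or API; the function names (`micro_batches`, `progressive_totals`), the batch size and the sample records are all assumptions made for illustration. Each yielded snapshot stands in for one refresh of a chart.

```python
from collections import defaultdict
from itertools import islice


def micro_batches(records, batch_size):
    """Yield successive fixed-size batches from any iterable of records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def progressive_totals(records, key, value, batch_size=2):
    """Yield a snapshot of running totals per key after each batch is processed."""
    totals = defaultdict(float)
    for batch in micro_batches(records, batch_size):
        for record in batch:
            totals[record[key]] += record[value]
        # Copy the totals so each snapshot is independent of later updates.
        yield dict(totals)


# Hypothetical sample data standing in for a large query result.
sales = [
    {"region": "east", "amount": 10.0},
    {"region": "west", "amount": 5.0},
    {"region": "east", "amount": 2.5},
    {"region": "west", "amount": 1.0},
    {"region": "east", "amount": 4.0},
]
snapshots = list(progressive_totals(sales, "region", "amount"))
```

The first snapshot is available after only two records have been read, and the last one equals the result a single full query would return, which is the trade-off an incremental approach offers: early approximate views that converge on the complete answer.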