Datameer Review Summary
Datameer makes big data, and specifically the Hadoop platform, accessible to business users and analysts. It was never likely that most businesses would employ an army of technicians to wrestle with MapReduce and other aspects of Hadoop just to make it work. Big data technologies are inexorably being shrink-wrapped and given a more approachable interface, and Datameer is at the forefront of this trend. It is no exaggeration to say that Datameer will be a one-stop shop for many organisations wishing to handle big data. Unlike some platforms, it shows no obvious weakness in any of the four primary functions it addresses: data integration, data preparation, quantitative analytics and data visualisation. Tableau can be used to further enhance the last of these.
In summary, Datameer provides the means to bring large and highly diverse data sets (text, relational, streaming data and so on) into the Hadoop environment. Once the data is there, Datameer supplies the data wrangling tools needed to profile it and transform it into usable formats. Analysts and data scientists can then use the large set of algorithms provided by Datameer to create predictive models and perform other forms of quantitative analysis. Finally, business users can visualise data using a wide variety of charts and dashboards, and more advanced visualisations such as clustering and decision trees can be created via an easy-to-use interface.
Datameer Professional is a SaaS big data analytics platform designed for department-specific deployments. It provides a Hadoop as a Service platform for business users to do their own analysis.
Under the hood, Datameer has recently been enhanced to optimise workloads, and it provides the means to ensure adequate governance and compliance with regulatory requirements. As ‘big data’ becomes more prevalent, as it surely will with the emerging Internet of Things (IoT), Datameer will find itself in a prime position to address the needs of many organisations. It is largely unchallenged in its ability to make big data useful to those who most need to use it – business analysts and users.
Big Data Shrink Wrapped
Datameer offers over sixty connectors to data sources of various kinds – relational, unstructured, streaming, log data, cloud data sources, application data, and so on. These pull data into the Hadoop Distributed File System (HDFS) in its native format. Data can then be wrangled with a rich set of tools that support transformations, joining of related data sets and data enrichment. Profiling tools enable statistical analysis of data, management of metadata and impact analysis. These tools are designed for ease of use: data sources can be pulled in using a wizard-based tool, and data set profiles can be viewed simply by ‘flipping’ a tabular display of the data (a bit gimmicky, but very useful).
Loading of data can be incremental, with time-based or trigger-based schedules, and tolerances can be set so that dirty data generates email notifications. Governance information such as data lineage (the history of the data), audit trails and a field-level catalogue can all be queried via a REST interface.
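The tolerance idea can be illustrated with a short sketch. This is not Datameer's implementation or API: the function names, the definition of a 'dirty' row (a required field that is missing or blank) and the threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LoadReport:
    total_rows: int
    dirty_rows: int


def is_dirty(row, required_fields):
    """Assumed rule: a row is dirty if any required field is missing or blank."""
    return any(row.get(field) in (None, "") for field in required_fields)


def check_tolerance(rows, required_fields, max_dirty_ratio):
    """Return (report, should_notify) for one load.

    should_notify is True when the share of dirty rows exceeds the
    tolerance, which is the point where a notification would be sent.
    """
    rows = list(rows)
    dirty = sum(1 for row in rows if is_dirty(row, required_fields))
    report = LoadReport(total_rows=len(rows), dirty_rows=dirty)
    ratio = dirty / len(rows) if rows else 0.0
    return report, ratio > max_dirty_ratio
```

A scheduler would call check_tolerance after each incremental load and send the notification only when the second return value is True.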
Resulting data sets can be exported to other data platforms, including data warehouses, relational databases and even spreadsheets, where they can be used for further analysis. The embedding of analytics in other applications is enabled via a REST API.
Execution of analytical tasks has been enhanced by what Datameer calls Smart Execution. This uses the profiles of data sets, which are known to Datameer, to select an optimum compute configuration. A job’s data flow graph is also used to aid this process. Optimisations include running large data analyses on a distributed cluster using Tez-based optimised MapReduce, and running small data analyses on a single YARN node or as a distributed, in-memory process. Smart Execution leverages recent advances in the Hadoop ecosystem, including YARN and Apache Tez.
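To make the idea concrete, the sketch below chooses between the three execution modes named above using data set size alone. The decision rule, the headroom factor and every name in it are assumptions for illustration; Datameer does not publish its selection logic in this form, and the real feature also draws on the job's data flow graph.

```python
from enum import Enum


class Mode(Enum):
    SINGLE_NODE = "single YARN node"
    IN_MEMORY = "distributed in-memory"
    TEZ_CLUSTER = "Tez on the cluster"


def choose_mode(input_bytes, node_memory_bytes, cluster_nodes, headroom=0.5):
    """Pick an execution mode from the profiled size of the input.

    Assumed rule: only a fraction (headroom) of each node's memory is
    usable for data. Inputs that fit on one node run there, inputs that
    fit in the cluster's combined memory run in memory across nodes, and
    anything larger goes to the disk-backed Tez path.
    """
    usable_per_node = node_memory_bytes * headroom
    if input_bytes <= usable_per_node:
        return Mode.SINGLE_NODE
    if input_bytes <= usable_per_node * cluster_nodes:
        return Mode.IN_MEMORY
    return Mode.TEZ_CLUSTER
```

The point of the sketch is that the choice is made from metadata already held about the data set, so the user never has to pick an engine.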
Datameer offers nearly 300 pre-packaged algorithms that can be employed in a spreadsheet-like interface. It also provides some useful advanced visualisations in the form of clustering (using K-means), decision trees, variable dependency diagrams and recommendations. Support is provided for Predictive Model Markup Language (PMML), allowing integration with analytical models created in other languages (R and SAS, for example).
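For readers unfamiliar with K-means, the clustering method mentioned above, a minimal sketch of the standard algorithm (Lloyd's algorithm) follows. It is a generic textbook version in plain Python, not Datameer's code, and the starting centroids are passed in explicitly so that the result is reproducible.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Group each point with its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, initial_centroids, max_iterations=100):
    """Alternate assignment and centroid update until nothing moves."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        # New centroid = mean of the members; an empty cluster keeps its old one.
        updated = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else old
            for old, members in zip(centroids, clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)
```

Given a handful of two-dimensional points and two starting centroids, kmeans returns the final centroids together with the points grouped around each one, which is the information a clustering visualisation plots.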
Data visualisation is largely a drag-and-drop affair, similar to many other data visualisation tools. Over 30 visual widgets are supported, but if this isn’t enough then Tableau can be used as a visualisation front end. Mobile devices are supported without extra work because the visualisations are rendered in HTML5.
Management and Governance
Since a great deal of data is sensitive in nature, and HDFS is not well known for data security, Datameer adds several layers of security control. These include role-based access control permissions and integration with Apache Sentry. It also adds column and row security functions and data lineage at the artefact/file level, which extends to dependency graphs and worksheet lineage. The profiling tools support governance and regulatory tasks by allowing dirty and invalid data to be identified at any stage of the analytics process.
In full, the security features comprise LDAP/Active Directory integration, role-based access control, permissions and sharing, integration with Apache Sentry 1.4, and column and row security/anonymisation functions, all of which provide enhanced security on top of the Hadoop Distributed File System’s built-in capabilities.
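Column and row security is easiest to understand with an example. The sketch below filters rows by role and replaces sensitive columns with a salted hash. The policy structure, the role name and the masking scheme are hypothetical and are not taken from Datameer; they only show what this kind of control does to the data an analyst sees.

```python
import hashlib


def mask_value(value, salt):
    """Replace a value with a short salted SHA-256 digest (assumed scheme)."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return digest[:12]


def secure_view(rows, role, policy, salt="demo-salt"):
    """Return only the rows a role may see, with its masked columns hidden.

    policy maps a role name to a dict with two keys:
      "row_filter":     function(row) -> bool, True if the row is visible
      "masked_columns": set of column names to anonymise
    """
    rules = policy[role]
    view = []
    for row in rows:
        if not rules["row_filter"](row):
            continue
        view.append({
            column: mask_value(value, salt) if column in rules["masked_columns"] else value
            for column, value in row.items()
        })
    return view
```

Because the same input always produces the same digest, masked columns can still be used for joins and counts without exposing the underlying values.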
Built on top of a thin microkernel foundation, Datameer is composed of hundreds of plugins, in a similar way to architectures such as Eclipse. This allows the system to be extremely flexible, extensible and robust. Individual plugins run in a shared-nothing, sandboxed classloader environment, and different plugins can run different versions of code without any effect on the rest of the system. This architecture allows individual parts of the system to be extended or replaced with custom plugins, for example to decrypt data during import.
Datameer includes a no-cost SDK that makes it possible to write custom plugins for import, export, functions and visualisations to meet future needs. Integrating existing user-defined functions from Apache Pig, Hive or other systems takes minimal development effort, and developers can go much further by integrating functions from R, Python or other languages and systems.
All of Datameer’s major components are exposed through REST APIs. This enables system monitoring with any “industry standard” or custom monitoring tool. The system can also be provisioned with import/export jobs, workbooks and infographics, all through the REST API.
Obvious alternatives to Datameer include Platfora and ClearStory. These use Spark to add intelligence to the data preparation task, something not yet available in Datameer. However, Datameer is much more of a shell around Hadoop than either of these products, and so they serve somewhat different functions. Many BI and data visualisation tools will access HDFS and other big data databases, but none provides the secure, robust, easily managed environment of Datameer.