It really is quite difficult to fault Pentaho Community (PC). It does of course come with the issues associated with open source solutions, particularly when they involve multiple components as PC does. The most significant of these is making sure all the parts play together well, and this is not a trivial problem. Nonetheless PC provides a true enterprise solution provided the relevant skills are available to make it all work. The alternative of course is to go for the Enterprise Edition of Pentaho – but that is a separate issue.
The main components of PC are the reporting tools, the data integration platform, the ROLAP analytics platform and the data mining tools. Here is a description of them taken from the PC web site:
Reporting
With the Pentaho-Report-Designer you can create report-definitions in a graphical environment. Reports are usually published to the Pentaho-Platform, which allows you to manage, run and schedule the reports you created. If you are new to Pentaho-Reporting, you probably want to start with the Pentaho Report-Designer.
Internally, reports are executed by the Pentaho Reporting Classic Engine. Pentaho Reporting encompasses more than two dozen software projects that facilitate creating and publishing data-driven business reports. If you are a developer or power user, the book “Pentaho Reporting 3.5 for Java Developers” (Packt Publishing) provides a great reference guide for all your needs.
Reporting’s development is driven by the goal of creating a flexible yet simple-to-use reporting engine. It is a suite of Open Source tools that includes the Report Designer, Reporting Engine and Reporting SDK. The Report Designer is a desktop reporting tool that provides a visual design environment to easily create sophisticated and rich reports. It is geared towards experienced and power users who are familiar with the concepts and data sources used. The Reporting Engine is an embeddable Java reporting library that is used by the Report Designer to generate reports. The library can be used in both server-side and client-side scenarios. Originally known as JFreeReport, it was designed to follow the approach of banded reporting with absolutely positioned elements. The Reporting Software Developers Kit (SDK) is a packaging of the Classic Engine, documentation and all of the supporting libraries required to embed the Pentaho Reporting Engine into your application.
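To give a feel for how the embeddable engine is used, here is a minimal Java sketch that loads a report definition produced by the Report Designer and renders it to PDF. The file names are placeholders, and the snippet assumes the classic engine API as documented around the 3.5 release; details may differ between engine versions.

```java
import java.io.File;

import org.pentaho.reporting.engine.classic.core.ClassicEngineBoot;
import org.pentaho.reporting.engine.classic.core.MasterReport;
import org.pentaho.reporting.engine.classic.core.modules.output.pageable.pdf.PdfReportUtil;
import org.pentaho.reporting.libraries.resourceloader.Resource;
import org.pentaho.reporting.libraries.resourceloader.ResourceManager;

public class EmbeddedReportDemo {
  public static void main(String[] args) throws Exception {
    // Boot the reporting engine once per JVM before any report is processed.
    ClassicEngineBoot.getInstance().start();

    // Load a report definition (.prpt) created with the Report Designer.
    // "sales-report.prpt" is a placeholder path, not a file shipped with Pentaho.
    ResourceManager manager = new ResourceManager();
    manager.registerDefaults();
    Resource resource = manager.createDirectly(new File("sales-report.prpt"), MasterReport.class);
    MasterReport report = (MasterReport) resource.getResource();

    // Render the report to PDF; other output targets follow the same pattern.
    PdfReportUtil.createPDF(report, "sales-report.pdf");
  }
}
```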
Data Integration
Pentaho Data Integration (PDI, also called Kettle) is the component of Pentaho responsible for the Extract, Transform and Load (ETL) processes. Though ETL tools are most frequently used in data warehouse environments, PDI can also be used for other purposes:
- Migrating data between applications or databases
- Exporting data from databases to flat files
- Loading data massively into databases
- Data cleansing
- Integrating applications
PDI is easy to use. Every process is created with a graphical tool where you specify what to do without writing code to indicate how to do it; because of this, you could say that PDI is metadata-oriented. PDI can be used as a standalone application, or it can be used as part of the larger Pentaho Suite. It is the most popular open source ETL tool available. PDI supports a vast array of input and output formats, including text files, spreadsheets, and commercial and free database engines. Moreover, the transformation capabilities of PDI allow you to manipulate data with very few limitations.
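Transformations are normally designed and run from the graphical tool (Spoon) or the command-line utilities, but the engine can also be driven from code. Below is a minimal sketch using the Kettle Java API, with load-customers.ktr standing in for any transformation file; it is an illustration of the idea rather than a recommended integration pattern.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunTransformation {
  public static void main(String[] args) throws Exception {
    // Initialise the Kettle environment (plugins, logging) once per JVM.
    KettleEnvironment.init();

    // Load a transformation designed graphically in Spoon and saved as a .ktr file.
    // "load-customers.ktr" is a placeholder name used for illustration only.
    TransMeta transMeta = new TransMeta("load-customers.ktr");
    Trans trans = new Trans(transMeta);

    // Run the transformation and wait for all steps to complete.
    trans.execute(null);
    trans.waitUntilFinished();

    if (trans.getErrors() > 0) {
      throw new RuntimeException("The transformation finished with errors.");
    }
  }
}
```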
Analysis
Pentaho Analysis consists of the Mondrian ROLAP engine, an analysis schema creation tool called Schema Workbench, and an analysis cube performance enhancement tool called Aggregation Designer. Pentaho Analysis is most commonly seen from an end-user perspective through the Pentaho BI Server’s analysis view interface.
The Mondrian OLAP System uses Partitioned Cubes, allowing several fact tables. It consists of four layers, working from the user's view inward to the data center: the Presentation Layer, the Dimensional Layer, the Star Layer, and the Storage Layer.
- The Presentation Layer determines what is shown on the user's screen and how the user can interact to ask new questions. There are many ways to present multidimensional data sets, including pivot tables, pie, line or bar charts and advanced visualization tools such as clickable maps and dynamic graphics.
- The Dimensional Layer parses, validates and executes Multidimensional Expressions (MDX) queries (a sketch of executing an MDX query appears at the end of this section). A query is evaluated in multiple phases: the axes are computed first, followed by the values of the cells within the axes.
- The Star Layer is responsible for maintaining an aggregate cache. An aggregation is a set of measure values (cells) in memory, qualified by a set of dimension column values.
- The Storage Layer is a Relational Database Management System (RDBMS) and is responsible for providing aggregated cell data and members from dimension tables.
Partitioned Cubes – Whereas a regular cube has a single fact table, a Partitioned Cube has several fact tables, which are joined together. One partition might contain today’s data, while another might hold historical data, which makes it useful for real-time analysis.
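To illustrate how a query travels through these layers, here is a sketch that executes an MDX statement against Mondrian through its olap4j driver. The connection string, schema file and cube (the FoodMart sample) are assumptions made for the example; in practice they would point at your own star schema and a schema file built with Schema Workbench.

```java
import java.sql.DriverManager;

import org.olap4j.CellSet;
import org.olap4j.OlapConnection;
import org.olap4j.OlapStatement;

public class MdxQueryDemo {
  public static void main(String[] args) throws Exception {
    // The connection string is illustrative: Jdbc= points at the star schema,
    // Catalog= at a Mondrian schema file created with Schema Workbench.
    Class.forName("mondrian.olap4j.MondrianOlap4jDriver");
    OlapConnection connection = DriverManager
        .getConnection("jdbc:mondrian:Jdbc=jdbc:mysql://localhost/foodmart;"
            + "Catalog=file:FoodMart.xml")
        .unwrap(OlapConnection.class);

    // The dimensional layer parses and executes the MDX; the star and storage
    // layers resolve the cell values from the cache or the underlying RDBMS.
    OlapStatement statement = connection.createStatement();
    CellSet cells = statement.executeOlapQuery(
        "SELECT {[Measures].[Unit Sales]} ON COLUMNS, "
        + "{[Product].[All Products].Children} ON ROWS "
        + "FROM [Sales]");

    // Print the first cell of the result as a simple sanity check.
    System.out.println(cells.getCell(0).getFormattedValue());
    connection.close();
  }
}
```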
Data Mining
Pentaho Data Mining, based on the Weka project, is a comprehensive set of tools for machine learning and data mining. Its broad suite of classification, regression, association rule and clustering algorithms can be used to help you understand your business better, and can also be exploited to improve future performance through predictive analytics.
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a data set or called from your own Java code. It is also well suited for developing new machine learning schemes. Weka’s main user interface is the Explorer, featuring several panels which provide access to the main components of the workbench: the Preprocess Panel, the Classify Panel, the Associate Panel, the Cluster Panel, the Select Attributes Panel, and the Visualize Panel.
- The Preprocess Panel has facilities for importing data from a database, a CSV file, or other data file types, and for preprocessing this data using a so-called filtering algorithm. These filters can be used to transform the data (e.g., turning numeric attributes into discrete ones) and make it possible to delete instances and attributes according to specific criteria.
- The Classify Panel enables the user to apply classification and regression algorithms (indiscriminately called classifiers in Weka) to the resulting data set, to estimate the accuracy of the resulting predictive model, and to visualize erroneous predictions, ROC curves, etc., or the model itself (if the model is amenable to visualization, such as a decision tree). A short example of driving a classifier from Java follows this list.
- The Associate Panel provides access to association rule learners that attempt to identify all important interrelationships between attributes in the data.
- The Cluster Panel gives access to the clustering techniques in Weka, e.g., the simple k-means algorithm. There is also an implementation of the expectation maximization algorithm for learning a mixture of normal distributions.
- The Select Attributes Panel provides algorithms for identifying the most predictive attributes in a data set.
- The Visualize Panel shows a scatter plot matrix, where individual scatter plots can be selected, enlarged and analyzed using various selection operators.
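As noted above, everything the Explorer does can also be called from Java code. The following sketch mirrors the Classify Panel's workflow: load a data set, build a decision tree, and estimate its accuracy with cross-validation. It assumes a local copy of iris.arff, one of the sample data sets shipped with Weka.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifyDemo {
  public static void main(String[] args) throws Exception {
    // Load the data and declare the last attribute as the class to predict.
    Instances data = DataSource.read("iris.arff");
    data.setClassIndex(data.numAttributes() - 1);

    // J48 is Weka's implementation of the C4.5 decision tree learner.
    J48 tree = new J48();

    // Estimate accuracy with 10-fold cross-validation and print a summary.
    Evaluation evaluation = new Evaluation(data);
    evaluation.crossValidateModel(tree, data, 10, new Random(1));
    System.out.println(evaluation.toSummaryString());
  }
}
```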
The Weka Scoring Plugin is a tool that allows classification and clustering models created with Weka to be used to “score” new data as part of a Kettle transform. “Scoring” simply means attaching a prediction to an incoming row of data. The Weka scoring plugin can handle all types of classifiers and clusterers that can be constructed in Weka.
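The plugin itself is configured inside a Kettle transformation rather than in code, but the idea behind scoring can be sketched directly against the Weka library: deserialize a trained model and attach a prediction to each incoming row. The file names below are placeholders.

```java
import weka.classifiers.Classifier;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class ScoreRows {
  public static void main(String[] args) throws Exception {
    // Load a model previously trained and serialised, e.g. from the Weka Explorer.
    Classifier model = (Classifier) SerializationHelper.read("churn.model");

    // The incoming rows must have the same attribute structure as the training
    // data; the class value itself may be missing ("?") since it is predicted here.
    Instances rows = DataSource.read("new-customers.arff");
    rows.setClassIndex(rows.numAttributes() - 1);

    // "Scoring": attach a predicted class label to each incoming row.
    for (int i = 0; i < rows.numInstances(); i++) {
      double prediction = model.classifyInstance(rows.instance(i));
      System.out.println(rows.classAttribute().value((int) prediction));
    }
  }
}
```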
The ARFF Output Plugin is a tool that allows you to output data from Kettle to a file in Weka’s Attribute Relation File Format (ARFF). ARFF format is essentially the same as comma separated values (CSV) format, except with the addition of metadata on the attributes (fields) in the form of a header.
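For example, a small (hypothetical) data set would look like this in ARFF; everything above the @data line is the metadata header, and the rows below it are ordinary comma-separated values:

```
@relation customers

@attribute age      numeric
@attribute segment  {consumer, corporate}
@attribute churned  {yes, no}

@data
34, consumer,  no
51, corporate, yes
```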
Weka packages are bundles of additional functionality, separate from the capabilities supplied in the core system. A package consists of some jar files, documentation, metadata, and possibly source code. This allows users to select and install only what they need or are interested in, and also provides a simple mechanism for people to use when contributing to Weka. Some of the existing packages are provided by the Weka team, while others come from third parties. Weka includes a facility for managing packages and a mechanism to load them dynamically at runtime; both a command-line and a GUI package manager are provided.