Predictive Analytics Software Summary
Predictive analytics works on the simple notion that historical data contain useful patterns of behavior that can be discovered by applying pattern-seeking algorithms to those data. Any candidate patterns can then be tested and validated in various ways and, if found fit, used to predict future behavior: that of people taking out loans, of manufacturing machinery, of employees and, of course, of customers.
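The discover-then-validate idea can be illustrated with a minimal sketch. It assumes a made-up history of credit scores and repayment outcomes, a very simple "pattern" (a single score cut-off), and a holdout set used only for validation. The names learn_threshold and accuracy are hypothetical and do not come from any product listed here.

```python
# Made-up history: (credit score, repaid loan?) pairs, for illustration only.
history = [(score, score >= 60) for score in range(100)]

# Hold out every third record; the pattern is never fitted on these rows.
train = [row for i, row in enumerate(history) if i % 3 != 0]
holdout = [row for i, row in enumerate(history) if i % 3 == 0]

def learn_threshold(rows):
    """Find the score cut-off that classifies the most training rows correctly."""
    best_cutoff, best_correct = None, -1
    for cutoff in sorted({score for score, _ in rows}):
        correct = sum((score >= cutoff) == repaid for score, repaid in rows)
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff

def accuracy(cutoff, rows):
    """Share of rows where the rule 'score >= cutoff' matches the outcome."""
    return sum((score >= cutoff) == repaid for score, repaid in rows) / len(rows)

threshold = learn_threshold(train)            # pattern discovered on history
holdout_accuracy = accuracy(threshold, holdout)  # pattern validated on unseen rows
print(f"cut-off {threshold}, holdout accuracy {holdout_accuracy:.2f}")
```

A pattern that scores well on the training rows but poorly on the holdout rows would not be considered fit for prediction.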
A number of terms are used in this field. Data mining is the act of actually trawling through data looking for patterns, and the algorithms that are used typically come from a field known as machine learning. These terms are often used interchangeably.
In reality, most of the time and effort involved in building a predictive model goes into data preparation. This is a big topic, but in short data have to be checked for errors, cleaned, profiled and transformed. Running the pattern-seeking algorithms might account for just 20% of the total effort. It is therefore very important that software for building predictive models supports the preparation of data and makes it as painless as possible. It should also support the creation of deployable models, possibly as Java or C++ code, or as PMML (Predictive Model Markup Language).
Much of the free predictive analytics software listed here comes in the form of libraries or languages. However, platforms such as KNIME, RapidMiner and Orange rival commercial products in their ease of use.
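As a rough illustration of the preparation steps named above (checking, cleaning, profiling and transforming), here is a small sketch using only the Python standard library. The records, field names and validity rules are invented for the example; real projects will have their own.

```python
from statistics import mean, pstdev

# Invented raw records, as they might arrive from a CSV export.
raw_rows = [
    {"age": "34", "income": "52000"},
    {"age": "", "income": "61000"},    # missing age
    {"age": "29", "income": "-1"},     # impossible income
    {"age": "41", "income": "58000"},
]

def parse(value):
    """Return a float, or None when the field is blank or not a number."""
    try:
        return float(value)
    except ValueError:
        return None

def clean(rows):
    """Check and clean: drop rows with missing or out-of-range values."""
    cleaned = []
    for row in rows:
        age, income = parse(row["age"]), parse(row["income"])
        if age is None or income is None:
            continue
        if not (0 < age < 120) or income < 0:
            continue
        cleaned.append({"age": age, "income": income})
    return cleaned

def profile(rows, field):
    """Profile: summarize one field so oddities are easy to spot."""
    values = [row[field] for row in rows]
    return {"count": len(values), "min": min(values),
            "max": max(values), "mean": mean(values)}

def standardize(rows, field):
    """Transform: rescale a field to zero mean and unit standard deviation."""
    values = [row[field] for row in rows]
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

rows = clean(raw_rows)
income_profile = profile(rows, "income")
income_scaled = standardize(rows, "income")
print(income_profile, income_scaled)
```

Dropping bad rows is only one possible cleaning policy; imputing missing values is a common alternative, and which is appropriate depends on the data.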
Commercial Predictive Analytics Software
Alteryx is an interesting product, filling a void that most other analytic platforms do not address. It allows skilled business users and analysts to analyze their data using a combination of data visualization and predictive analytics tools. It also supports spatial analytics where location is important. Alteryx places a great deal of emphasis on the ease-of-use of its platform and the creation of workflows via a drag-and-drop graphical interface. Resulting analysis can be shared, both the analytic models and the results of analysis. Virtually all commonly used data sources are supported, and via a process of data blending users can shape data into the form they need for analysis. Various analytic techniques can then be applied to the data, and the results output in visual formats and/or formats which can be consumed by other tools (e.g. Tableau and QlikView).
Angoss provides a broad suite of analytical tools and solutions which cover predictive analytics, text analytics, document exploration, scorecards and advanced modeling in an integrated environment. It has recently expanded the capability of its products considerably to meet general model building needs and provide model management capabilities. It is truly an enterprise solution for analytical needs, providing the infrastructure and management controls necessary to deploy predictive models into a production environment. Angoss provides an analytics platform of considerable breadth and capability, and joins an elite group of no more than five suppliers who truly offer enterprise capability.
Dataiku Data Science Studio (DSS) provides a productive data-to-production analytics workbench. Many of the time consuming steps that slow down analytic model production have been automated and streamlined, enabling skilled business analysts and data scientists to quickly prepare and understand data, build a model, and quickly integrate it into a production environment. Data scientists and business analysts alike will find that Dataiku DSS provides a productive, flexible, analytics workbench that is capable of addressing virtually all analytical needs. A free community version is available that is limited to 100,000 rows of data.
Datameer provides the means to bring large, highly diverse data sets (text, relational, streaming data, etc.) into the Hadoop environment. Once there, Datameer supplies the data wrangling tools necessary to profile and transform data into usable formats. Analysts and data scientists can then use the large set of algorithms provided by Datameer to create predictive models and perform other forms of quantitative analysis. Finally, business users can visualize data using a wide variety of charts and dashboards, and more advanced visualizations such as clustering and decision trees can be created via an easy-to-use interface.
IBM SPSS provides statistical and data mining capability, and has associated predictive applications, particularly in marketing. It is most likely to be of interest to existing IBM customers.
Predixion Software distinguishes itself by supporting the deployment of predictive analytics on devices (a machine for example), in the cloud and at points of data aggregation. It is firmly focused on the Internet of Things, and real-time predictive analytics dealing with data from sensors, devices and other streaming data sources.
Salford Systems delivers a portfolio of products capable of traditional descriptive analytics and predictive analytics. What distinguishes this company is the lack of hype around the technology it offers and a willingness to discuss the pitfalls and traps associated with predictive analytics – which ironically is a prerequisite for successful analytics. The SPM Salford Predictive Modeler supports both traditional descriptive and predictive analytics. CART (Classification and Regression Tree) supports classification and the discovery of hidden relationships between attributes. It embodies a number of proprietary methods and patented extensions to the original work done in the eighties.
SAS predictive analytics is part of the very broad analytics capability offered by SAS. It not only offers the data preparation and data mining tools, but also a run-time environment. The main complaint is the cost.
Skytree will primarily appeal to large organizations with some experience in the use of machine learning technologies, and in fact Skytree positions itself as ‘The Machine Learning Company’. Its Infinity platform provides the tools for analysts and data scientists to create predictive models in a manner that is both productive and effective. Productivity benefits come from the automation of many tedious tasks that typically require weeks of fine tuning. The effectiveness of the resulting predictive models is due mainly to the extensive data exploration tools, model performance monitoring, and the fact that most of the algorithms have been designed from the ground up specifically for big data analytics. It is hard to see why a large, sophisticated organization might not consider Skytree in its efforts to realize the benefits associated with decision automation.
Statistica, now part of Dell, is an integrated predictive analytics tool, which includes text analytics and statistical analysis. It has recently gone through a makeover in Version 13, with support for streaming data. There’s a lot going on here and it is well worth a look.
Commercial Cloud Based Software
Algorithmia provides a cloud based platform for algorithm developers to share their work, and for application developers to incorporate algorithms into their applications. Hundreds of algorithms are already available addressing most conceivable tasks including text analytics, computer vision, graphs, machine learning and others. Costing is based on the frequency of algorithm usage and compute time.
algorithms.io provides a cloud hosted service to collect data, generate classification models and score new data. Code is added to web and portable device applications which stream data to the algorithms.io service, where it is captured and processed using random forest, support vector machine, K-Means, decision tree, logistic regression and neural network algorithms. The resulting model is then used to categorize new data. The results are passed back as a parsed data stream to power apps, or as reports and visualizations. A set of APIs is provided for developers to integrate machine learning into web and mobile applications. The algorithms are categorized as anomaly detection, clustering, classification and collaborative filtering.
Amazon Machine Learning is a service that makes it easy for developers of all skill levels to use machine learning technology. Amazon Machine Learning provides visualization tools and wizards that guide you through the process of creating machine learning (ML) models without having to learn complex ML algorithms and technology. Once your models are ready, Amazon Machine Learning makes it easy to get predictions for your application using simple APIs, without having to implement custom prediction generation code, or manage any infrastructure.
Dotplot provides data mining, statistics, text mining and predictive analytics tools in an integrated, highly graphical cloud based environment. All that is needed to use dotplot is a browser. Resulting models can be integrated with other applications via web services using SOAP and REST protocols. Dig a little deeper and dotplot is actually a much needed graphical front end to R and Weka functions. This accounts for the very large number of functions supported and the broad capability.
FICO arguably has the most experience of any supplier in the application of statistical and machine learning technologies to business problems. FICO Analytic Cloud embraces machine learning, statistics, optimization and business rules management, in the context of a well managed environment. It also serves as a marketplace for developers of analytic solutions and users who have a need for them.
Microsoft Machine Learning Studio features a library of time-saving sample experiments, R and Python packages and best-in-class algorithms from Microsoft businesses like Xbox and Bing. Azure ML also supports R and Python custom code, which can be dropped directly into your workspace.
BigML is a cloud based machine learning platform with an easy to use graphical interface. It also provides simple mechanisms to incorporate predictive models into production applications through its REST API. The platform combines supervised learning (to build predictive models), unsupervised learning (to understand behavior), anomaly detection (used in fraud detection), data visualization tools (scatter-plots and Sunburst diagrams) and many mechanisms for exploring data. The modest pricing will make it attractive to medium and large businesses who want the benefits associated with machine learning without large upfront costs and implementation delays. BigML is a pragmatic, low cost, easy to use platform for building powerful predictive models.
Ersatz is a web-based general purpose platform for machine learning with support for GPU-based deep learning. It’s geared towards aspiring and working data scientists with stuff to do. Ersatz has a number of components designed to make modern machine learning workflows much more efficient. Primarily, these include tools for data wrangling, model training, and machine learning infrastructure.
Google Prediction API can integrate with App Engine, and the RESTful API is available through libraries for many popular languages, such as Python, JavaScript and .NET. The Prediction API provides pattern-matching and machine learning capabilities.
IBM’s Watson Analytics provides a great deal of intelligence for data handling and exploration, and a conversational type interface. It automatically does the hard math to show the most relevant facts, patterns and relationships. A free version is offered with limitations on data volumes.
wise.io provides an extremely fast implementation of the Random Forest machine learning technique that is suitable for classifying complex, high dimensional data orders of magnitude more quickly than most alternatives. The technology was originally developed to search data generated by astronomical observations. The power of the technology is well expressed by its ability to learn and categorize handwriting within just a few minutes, compared with a week of learning using the favorite technology for this problem, the support vector machine.
Yottamine includes comprehensive capabilities for importing and applying models in a real-world setting. It is designed to allow users to take full advantage of scalable on-demand cloud computing, and eliminate the high costs of a dedicated infrastructure. The Yottamine Predictive Service allows for building models or making predictions in two simple steps. Via integration with scalable cloud computing it provides high speed and efficiency. It also conforms with the SSL industry security standard and can export to PMML those models that are supported by the standard. Data scientists can connect to and control Yottamine Predictive Web Services from the R programming language via the YottamineR package.
Open Source and Free
Apache Mahout machine learning on Spark supports three main use cases. Recommendation mining takes users’ behavior and from that tries to find items users might like. Clustering takes, for example, text documents and groups them into sets of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabelled documents to the (hopefully) correct category.
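To make the first of these use cases concrete, here is a small sketch of recommendation mining by item co-occurrence. It is plain Python and is not Mahout's API; the users, items and the function names cooccurrence and recommend are invented for illustration.

```python
from collections import Counter
from itertools import permutations

# Invented behavior data: which items each user has interacted with.
histories = {
    "ann": {"book", "lamp", "desk"},
    "bob": {"book", "lamp"},
    "cat": {"lamp", "desk", "chair"},
}

def cooccurrence(histories):
    """Count how often each ordered pair of items appears in the same history."""
    counts = Counter()
    for items in histories.values():
        for a, b in permutations(sorted(items), 2):
            counts[(a, b)] += 1
    return counts

def recommend(user, histories, counts, top_n=2):
    """Score unseen items by how often they co-occur with the user's items."""
    seen = histories[user]
    scores = Counter()
    for (a, b), n in counts.items():
        if a in seen and b not in seen:
            scores[b] += n
    return [item for item, _ in scores.most_common(top_n)]

counts = cooccurrence(histories)
bob_suggestions = recommend("bob", histories, counts)
print(bob_suggestions)
```

Here "desk" ranks above "chair" for bob because it appears alongside both of his items in other users' histories.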
Jubatus describes itself as the first open source platform for online distributed machine learning on the data streams of Big Data. Jubatus uses a loose model sharing architecture for efficient training and sharing of machine learning models, defining three fundamental operations (Update, Mix and Analyze) in a similar way to the Map and Reduce operations in Hadoop.
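The three operations can be pictured with a toy sketch. This is an assumption-laden illustration, not Jubatus code or its API: each hypothetical Node keeps a local linear model, update applies a perceptron-style step to one streamed example, mix averages the local models and shares the result, and analyze predicts with the current local model.

```python
class Node:
    """One worker holding its own local copy of a linear model (hypothetical)."""

    def __init__(self, n_features):
        self.weights = [0.0] * n_features

    def score(self, features):
        return sum(w * x for w, x in zip(self.weights, features))

    def update(self, features, label):
        """Update: learn locally from one streamed example (label is +1 or -1)."""
        if label * self.score(features) <= 0:  # misclassified or undecided
            self.weights = [w + label * x for w, x in zip(self.weights, features)]

    def analyze(self, features):
        """Analyze: predict using whatever model this node currently holds."""
        return 1 if self.score(features) > 0 else -1

def mix(nodes):
    """Mix: average the local models and share the result with every node."""
    n = len(nodes)
    averaged = [sum(ws) / n for ws in zip(*(node.weights for node in nodes))]
    for node in nodes:
        node.weights = list(averaged)

nodes = [Node(2), Node(2)]
nodes[0].update([1.0, 0.0], +1)   # each node sees a different slice of the stream
nodes[1].update([0.0, 1.0], -1)
mix(nodes)                        # afterwards both nodes know both examples
print(nodes[0].analyze([0.0, 1.0]), nodes[1].analyze([1.0, 0.0]))
```

The point of the sketch is that nodes train independently and only exchange model parameters occasionally, rather than exchanging the data itself.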
KEEL is an open source (GPLv3) Java software tool to assess evolutionary algorithms for Data Mining problems including regression, classification, clustering, pattern mining and so on. It contains a big collection of classical knowledge extraction algorithms, preprocessing techniques (training set selection, feature selection, discretization, imputation methods for missing values, etc.), Computational Intelligence based learning algorithms, including evolutionary rule learning algorithms based on different approaches (Pittsburgh, Michigan and IRL, …), and hybrid models such as genetic fuzzy systems, evolutionary neural networks, etc.
KNIME is arguably the premier open source platform for creating predictive models. It provides a drag and drop graphical interface for the creation of workflows of any complexity. The basic edition is free, and various extensions are available, along with support and training for business use.
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
mlpy is a Python module for Machine Learning built on top of NumPy/SciPy and the GNU Scientific Libraries. mlpy provides a wide range of state-of-the-art machine learning methods for supervised and unsupervised problems and it is aimed at finding a reasonable compromise among modularity, maintainability, reproducibility, usability and efficiency.
Orange is a very capable open source set of visualization and data mining tools with an easy-to-use interface. Most analysis can be achieved through its visual programming interface (drag and drop of widgets), and most visual tools are supported, including scatterplots, bar charts, trees, dendrograms and heatmaps. A large number of widgets (over 100) are supported.
R is described as a project for statistical computing, but it might be more accurately described as the lingua franca of analytics. A large number of commercial analytics tools support R (Oracle, Microsoft, FICO, TIBCO, Angoss …), simply because it does pretty much everything. The out-of-the-box runtime environment is fairly slow, and so vendors such as TIBCO, Microsoft and Lavastorm provide faster runtime support. The great thing about R is that almost everything is possible; the downside is that it is a language and needs to be programmed. However, various packages exist to make the whole thing more productive and easier.
Rattle (the R Analytical Tool To Learn Easily) presents statistical and visual summaries of data, transforms data into forms that can be readily modeled, builds both unsupervised and supervised models from the data, presents the performance of models graphically, and scores new datasets.
RapidMiner is similar to KNIME, in that it delivers a graphical, drag and drop interface for the creation of predictive models. Like KNIME it provides hundreds of functions to prepare data, process data and find patterns, along with visuals to display graphs and charts. An open source version is available; the commercial edition comes with big data support and greater sophistication.
scikit-learn provides many easy-to-use tools for data mining and analysis. It is built on Python, and specifically on NumPy, SciPy and matplotlib.
TANAGRA is a free set of data mining tools for academic and research purposes. It offers several data mining methods drawn from exploratory data analysis, statistical learning, machine learning and databases.
WEKA is a set of data mining tools incorporated into many other products (KNIME and RapidMiner, for example), but it is also a stand-alone platform for many data mining tasks including preprocessing, clustering, regression, classification and visualization. Support for data sources is extended through Java Database Connectivity, but the default format for data is the flat file.