It is clear from the positioning of these startups that machine learning may soon be wrested from the hands of machine learning experts, and put into the hands of the people who most need it – namely business users. All five of these startups are targeting the business user, and although this potentially has significant dangers associated with it, the advantages are obvious. This is after all the way of all technology – moving from the esoteric to the everyday, and it is very encouraging to see.
Alpine Data Labs provides an enterprise solution to the problems of data access, model building, model deployment and model performance monitoring for predictive analytics applications. As data mining and machine learning technologies move out of the realm of esoterica, so organizations will need a full implementation and management environment. I spoke with Steven Hillion, the Chief Product Officer at Alpine Data Labs and he was very aware of the need to satisfy the requirements of business management, as well as those of IT and data scientists. Many analytics technologies focus on the technical aspects with scant regard for the monitoring of model performance and the sharing of information in a collaborative environment. Although this is one of the less glamorous aspects of predictive technologies, in many ways it is one of the most important. Without the means to establish confidence in predictive models the technology will always be underexploited and untrusted.
One of the central philosophies of the Alpine approach concerns the movement, or rather lack of it, of data. Most data mining technologies require that data be extracted from the production or data warehousing environment and set aside for model development. Alpine takes a different approach and executes data exploration, data transformation, modeling, testing and deployment in the native database environment. In practice this means that models are created and execute in the same operating environment and on the same hardware as the database. Hadoop, Greenplum and most relational database systems are supported.
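The idea of leaving the data where it lives can be illustrated with a minimal sketch. This is not Alpine's implementation; it assumes an in-memory SQLite database and a hypothetical table `sales` with columns `spend` and `revenue`. The database computes the aggregates needed for a simple linear regression, so only five summary numbers leave it rather than the raw rows.

```python
import sqlite3

# Hypothetical data set held in the database (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (spend REAL, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
)

# The database does the heavy lifting: one pass over the table,
# returning only the sufficient statistics for least squares.
n, sx, sy, sxx, sxy = conn.execute(
    """
    SELECT COUNT(*), SUM(spend), SUM(revenue),
           SUM(spend * spend), SUM(spend * revenue)
    FROM sales
    """
).fetchone()

# Closed-form simple linear regression from the aggregates.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n

# Scoring also runs inside the database, next to the data.
scored = conn.execute(
    "SELECT spend, ? * spend + ? FROM sales ORDER BY spend",
    (slope, intercept),
).fetchall()
```

The same pattern, pushing computation to the data instead of pulling data to the computation, matters far more when the table holds billions of rows rather than four.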
One of Alpine’s central messages, cross-functional analytics, is a direct outcome of its ability to access data sources directly. Cross-functional analytics is certainly technically possible with Alpine, and some organizations will be able to put it into practice. In reality, however, highly stovepiped businesses will have a hard time exploiting this capability.
The user interface to Alpine’s capabilities is through a web browser, accessing a server that simply functions as an interface to data resources and as the broadcaster of the web-based environment. At the data exploration phase it provides a plethora of visualization tools – frequency diagrams, box plots, scatter charts and so on. It also provides the tools for data transformation, modeling, testing and scoring.
All the popular mining algorithms are supported and there is ongoing activity aimed at creating fast implementations of these. Logistic regression and support vector machines are two methods which have been subjected to this treatment. Random Forest comes in two flavors. One is essentially exploratory in nature and executes faster than the full-blown implementation, which is more likely to be used for creating a final model.
A new feature called Chorus is one of the stronger differentiators. This supports strong collaboration between data scientists, IT and management, and allows information pertaining to models to be freely shared. It’s an effective way of building up a knowledge base for those working on and using models, and will eventually give management the visibility they need. It seems likely that Alpine will partner with BI and data visualization tools providers to open up the analysis of data, and performance monitoring to a wide audience.
Alpine is one of a new generation of predictive analytics solutions, providing a platform that satisfies all those with a stake in the exploitation of data mining technologies, and specifically IT, data scientists and management.
BigML offers a cloud based predictive analytics capability that is both refreshingly straightforward and extremely powerful. These two qualities are usually mutually exclusive, but by using decision trees and decision tree ensembles in conjunction with some very useful visualizations, the whole process of building and testing predictive models becomes so much easier.
The steps required to build, test and use a model are simple enough. Upload data to the BigML SaaS platform, build a model (which might be just a single operation), test it on test data and if all is well download it as Java, Python, PMML or any one of several formats. Then plug it in to your production systems.
The users of BigML will range from skilled business users through to data scientists and consultants. Obviously some level of knowledge and training is necessary, but a savvy business user should get the hang of things very quickly.
While BigML will produce a decision tree with considerable speed (typically in seconds or minutes), the real power is to be found in the decision tree ensembles where many trees are created and an ‘average’ created. A technique known as bagging is used where the data are randomly sampled multiple times (with replacement) and a tree created from each sample. It emulates having a much larger data set and nearly always produces much more accurate models.
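Bagging as described above can be sketched in a few lines. This is an illustration of the general technique, not BigML's code: it uses one-split trees ("stumps") on a single numeric feature, and the data set, the function names and the number of trees are all assumptions chosen to keep the example short.

```python
import random
from collections import Counter


def fit_stump(rows):
    """Fit a one-split decision tree on (x, label) pairs."""
    best = None
    for threshold in sorted({x for x, _ in rows}):
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        if not left or not right:
            continue
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left)
        errors += sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:
        # No usable split: always predict the majority label.
        majority = Counter(y for _, y in rows).most_common(1)[0][0]
        return (float("inf"), majority, majority)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_label, right_label = stump
    return left_label if x <= threshold else right_label


def bagged_ensemble(rows, n_trees=25, seed=0):
    """Train one stump per bootstrap sample of the data."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        # Bootstrap: same size as the data, drawn with replacement.
        sample = [rng.choice(rows) for _ in rows]
        stumps.append(fit_stump(sample))
    return stumps


def predict_ensemble(stumps, x):
    """Combine the trees by majority vote."""
    votes = Counter(predict_stump(s, x) for s in stumps)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: values below 5 are "low", the rest "high".
rows = [(x, "low" if x < 5 else "high") for x in range(10)]
stumps = bagged_ensemble(rows)
```

Each tree sees a slightly different sample, so individual trees disagree near the class boundary, and the vote smooths out those disagreements. For a numeric target the vote would be replaced by an average of the trees' predictions.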
The decision tree graphics are not only visually appealing, but also contain a great deal of information and are interactive. A Sunburst visualization shows which classifications have most support and confidence in a highly graphic manner, allowing users to quickly home in on the most useful classifications.
In my opinion the focus on decision tree ensembles is very appropriate. Various ensemble methods have won the vast majority of machine learning competitions in recent years, and have been called the most significant development in machine learning over the last decade. This is a good strategic decision by the founders of BigML.
The technology has found a broad range of applications including predictive marketing, fraud detection, recommendation systems, image analysis, pricing optimization and many others that satisfy very specific needs.
BigML went live to the public less than a year ago and plans to roll out further capabilities and products. These include additional learning methods (k-means for example), non-linear decision trees and time series analysis. It will also be beefing up its cloud offerings to include virtual private clouds (VPC), and multi-cloud support (for other cloud platforms, e.g. Azure).
BigML very well satisfies the requirement that ‘Things should be made as simple as possible, but not any simpler’, and is worthy of investigation by any organization (of any size) that needs to employ predictive tools.
Context Relevant is mainly an analytics solutions provider and ships with behavioral analytics libraries for finance, content personalization and the online travel industry. This is another startup targeting users who really do not want to know about the technology. There are dangers here and it isn’t really clear how they are addressed.
SkyTree is primarily a server based approach to implementing data mining and machine learning techniques. An end-user tool (Adviser) is also being introduced to give power users access to machine learning techniques. Looks promising, but more information is needed. It isn’t clear how the resulting models are put into a production environment, or how the end-user tool protects users from invalid models.
SkyTree Server connects with most commonly used data sources and executes many data mining methods to address categorization, clustering, association and regression. It incorporates advanced versions of many data mining and machine learning algorithms that both speed up execution and enhance capability.
The capabilities of SkyTree can be extended through Power Pack plugins. These include the Nonparametric Power Pack, which provides specialized nearest neighbor algorithms, the Prediction Power Pack for testing the validity of models, and the Ensemble Power Pack for utilizing ensemble-based methods (Random Forest and Gradient Boosted methods).
SkyTree Adviser is in beta at the time of writing and is targeted at power users who want to do their own analysis. This is fraught with dangers, but it is to be hoped that the product makes users aware of these and offers mechanisms for addressing them. Adviser will handle up to 100,000 rows and will connect to databases, local data, and the web.
A variety of services are offered to provide initial support and ongoing training and support.
wise.io provides an extremely fast implementation of the Random Forest machine learning technique that is suitable for classifying complex, high dimensional data orders of magnitude more quickly than most alternatives. The technology was originally developed to search data generated by astronomical observations. The power of the technology is well expressed by its ability to learn and categorize handwriting within just a few minutes, compared with a week of learning using the favorite technology for this problem, the support vector machine.
I spoke with Joshua Bloom, the CEO of the company and a UC Berkeley Associate Professor of Astrophysics. He was keen to emphasize that the learning algorithm is just a small part of the overall model production and deployment cycle, although having a resource that executes at this speed makes many otherwise difficult problems amenable to a solution.
The company offers a SaaS facility called Machine Intelligence Engine where users upload their data, build a model and then either download it, or have it execute in the SaaS environment. Fees are levied according to the level of usage, and users typically require a period of hand-holding, which may range from a few hours to a few days. WiseRF, on the other hand, is downloadable and allows models to be built in-house. It comes in three flavors – Pine, Oak and Sequoia – with increasing scalability and capability. A 15-day trial can be downloaded.
Applications range from OTC trading through to industrial safety, and while the accuracy of Random Forest is widely appreciated, having these very high levels of performance means that ‘real-time’ problems can be addressed.
wise.io is supported by its customers and is cash flow positive (a rare state for a startup) and will undoubtedly make a nice acquisition should the firm wish to go that way.