Summary: Dataiku Data Science Studio (DSS) is a data-to-production analytics workbench. It automates and streamlines many of the time-consuming steps that slow down analytic model production, so that skilled business analysts and data scientists can prepare and understand data, build a model, and integrate it into a production environment quickly. Both groups will find DSS a flexible workbench that is capable of addressing virtually all analytical needs. A free community version is available, limited to 100,000 rows of data.
Context: Analysts and data scientists who build analytical models know that most of their time is spent accessing, preparing and engineering data into a usable form. Model building is also an iterative process, and in some instances there is a need to drop down into code to refine models. DSS is a true workbench, with the tools to accelerate many of these processes and deliver significant improvements in productivity. Model building is also usually a collaborative effort, with domain experts, analysts, business managers and data scientists all needing to provide input and validate that a model is fit for purpose. DSS provides a collaborative environment for this work, from raw data through to production application.
Technology: The DSS interface is primarily graphical, with the ability to drop down into code if necessary. This environment addresses three main activities associated with analytic model production:
- Data preparation: Most databases, file formats and big data sources are supported with specific connectors (Excel, SAS, JSON, Hadoop, Cassandra, MongoDB, S3 …). Data encoding problems are significantly eased because DSS automatically detects data types and infers relevant formats. Commonly used data variables (gender, IP address, URL, date etc.) are also inferred automatically. Once DSS has correctly interpreted the data, they can be visualized and explored with graphs, charts and various statistics. Data cleansing and enrichment can be extended with bespoke code if needed, although most common activities in this domain are accomplished with easy-to-use tools: over 60 graphical processors are available to both correct and enrich data. These data preparation ‘recipes’ can be saved for reuse.
- Analysis: A large number of machine learning algorithms are available in DSS, which primarily calls upon the scikit-learn Python library. Classification, regression, clustering and dimensionality reduction are all available, with algorithms such as SVM, nearest neighbor, decision tree, naïve Bayes, random forest and most other commonly used methods. These models can then be integrated within the overall workflow. DSS trains and tests candidate models to find the best fit, choosing relevant algorithms according to the size, complexity and shape of the data. Various statistical indicators and graphics reveal how the algorithms are processing the data, and it is simple to modify parameters and see the results immediately. The built-in support for quick iteration means that analysts can combine and transform variables on the fly. For more advanced users a Python script can be generated, and this can be modified at will in the DSS programming interface. Other languages are also supported, including R, SQL, Hive and Pig.
- Production: DSS supports an integrated model build, test, and deployment environment. This means that build, production, data refresh and model update activities all take place within the DSS platform. Rebuilds can be scheduled to run automatically, and DSS will detect problematic new data and raise alerts. Models and predicted values are accessible to other applications through a REST API, and can also be delivered to other destinations such as Elasticsearch, FTP servers or internal data warehouses.
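The automatic inference of common variable types described under data preparation can be pictured with a small sketch. This is not how DSS implements it; it is an illustration, under stated assumptions, of one simple way to guess whether a column holds genders, IP addresses, URLs, dates or numbers. The function names, the gender labels, the date formats and the 80% agreement threshold are all hypothetical choices made for the example.

```python
import ipaddress
from collections import Counter
from datetime import datetime
from urllib.parse import urlparse

# Hypothetical label set and date formats, chosen only for illustration.
GENDER_VALUES = {"m", "f", "male", "female"}
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y-%m-%d %H:%M:%S")


def guess_meaning(value: str) -> str:
    """Return a best-guess semantic type for a single raw string value."""
    text = value.strip()
    if not text:
        return "empty"
    if text.lower() in GENDER_VALUES:
        return "gender"
    # IPv4 or IPv6 address?
    try:
        ipaddress.ip_address(text)
        return "ip_address"
    except ValueError:
        pass
    # URL with a recognised scheme and a host part?
    try:
        parsed = urlparse(text)
    except ValueError:
        parsed = None
    if parsed is not None and parsed.scheme in ("http", "https", "ftp") and parsed.netloc:
        return "url"
    # Date in one of the assumed formats?
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "date"
        except ValueError:
            continue
    # Plain number, otherwise free text.
    try:
        float(text)
        return "number"
    except ValueError:
        return "text"


def infer_column_meaning(values, threshold=0.8):
    """Label a column with its dominant meaning if enough non-empty values agree."""
    guesses = [guess_meaning(v) for v in values]
    non_empty = [g for g in guesses if g != "empty"]
    if not non_empty:
        return "empty"
    meaning, count = Counter(non_empty).most_common(1)[0]
    # Fall back to free text when no single meaning dominates the column.
    return meaning if count / len(non_empty) >= threshold else "text"
```

For instance, `infer_column_meaning(["10.0.0.1", "10.0.0.2", "::1"])` labels the column `ip_address`, while a column that mixes dates, numbers and free text falls back to `text`. Voting over the whole column rather than trusting a single value keeps a few dirty cells from changing the inferred type.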
Application: Dataiku DSS has found effective use in a number of medium and large businesses. One of the more recent applications concerns the availability of parking in Paris, where a predictive model informs motorists where parking is likely to be difficult or easy to find. Many of the more usual applications of predictive models are easily addressed using Dataiku, including customer churn, predictive maintenance, fraud detection, marketing and logistics.
Dataiku is headquartered in Paris, France, and was founded by some of the leading technologists from Exalead, the search engine company. It has just raised US$3.6 million in funding for further growth.