Trifacta Review Summary
Trifacta provides a self-service data preparation platform that automates many data preparation tasks and allows users to interrogate their data in an efficient manner. The platform learns as users refine their data so that subsequent operations become more automated. It is a highly visual platform with copious graphical representations of data to aid the data wrangling process.
Machine learning algorithms sit at the heart of the platform, typically providing a ranked list of suggested operations that are relevant to the data. A complete data preparation task is stored as a script, which can be compiled to run on a variety of systems.
This is a very sophisticated product that will be of interest to large organizations struggling to prepare large data sets for analytical purposes.
Functionality
A broad range of functions includes data assessment, shaping, enrichment, transformation and others, all within the framework of a well-governed environment. Assessment gives a high-level overview of data quality, with detection of missing and unusual values, gaps and data skew, and automatic detection of data types. Enrichment involves combining data from different sources to complete the data picture. This might involve using various dictionaries, joining data, creating derived fields and aggregating. Useful transformations can be saved as a reusable script and shared with others in the organization. Shaping is concerned with generating data at the right level of granularity, and Trifacta uses data inference techniques to introspect the data and automatically apply initial shaping and metadata recommendations for the user.
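To make the assessment step concrete, here is a minimal sketch in plain Python of the kind of check described above: counting missing values and flagging unusual numeric values in a single column. It is an illustration only and does not use or represent Trifacta's own implementation. The function names, the list of missing-value markers and the z-score rule are all assumptions chosen for the example.

```python
from statistics import mean, pstdev

# Assumed markers for "missing"; a real tool would make this configurable.
MISSING_MARKERS = {"", "na", "n/a", "null", "none"}


def is_missing(value):
    """Treat None and a few common placeholder strings as missing."""
    return value is None or str(value).strip().lower() in MISSING_MARKERS


def assess_column(values, z_threshold=3.0):
    """Summarise one column: missing count, non-numeric count, unusual values.

    "Unusual" here means a numeric value whose z-score exceeds z_threshold.
    This is a deliberately simple stand-in for whatever detection a real
    product applies.
    """
    present = [v for v in values if not is_missing(v)]

    numbers = []
    for v in present:
        try:
            numbers.append(float(v))
        except (TypeError, ValueError):
            pass  # not numeric; counted under "non_numeric" below

    unusual = []
    if len(numbers) >= 2:
        mu, sigma = mean(numbers), pstdev(numbers)
        if sigma > 0:
            unusual = [x for x in numbers if abs(x - mu) / sigma > z_threshold]

    return {
        "total": len(values),
        "missing": len(values) - len(present),
        "non_numeric": len(present) - len(numbers),
        "unusual": unusual,
    }


if __name__ == "__main__":
    column = ["10", "11", "9", "10", "", "N/A", "10", "11", "9", "10", "500"]
    print(assess_column(column, z_threshold=2.0))
```

A lower threshold is passed in the usage example because, with so few rows, a single extreme value inflates the standard deviation and cannot reach a z-score of 3.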
Trifacta calls its approach to data preparation Predictive Interaction. This relies on a two-way interaction between users and the Trifacta platform, in which Trifacta recommends operations and users accept or modify the recommendations. The output of this interaction is a script, and the whole user interface is designed around menus and drag and drop so that the actual act of coding can be avoided.
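The recommend-and-accept loop can be sketched in a few lines of Python. This is a hypothetical illustration of the idea, not Trifacta's algorithm or script format: the Suggestion and Session classes, the transform names and the fixed 0.1 bonus per prior acceptance are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Suggestion:
    name: str     # hypothetical transform name, e.g. "split_on_comma"
    score: float  # assumed base relevance score for the current selection


@dataclass
class Session:
    accepted_counts: dict = field(default_factory=dict)
    script: list = field(default_factory=list)

    def rank(self, candidates):
        """Order candidates by base score plus a small bonus for each time
        the user has previously accepted the same transform."""
        def adjusted(s):
            return s.score + 0.1 * self.accepted_counts.get(s.name, 0)
        return sorted(candidates, key=adjusted, reverse=True)

    def accept(self, suggestion, **params):
        """Record the user's choice and append it to the output script."""
        count = self.accepted_counts.get(suggestion.name, 0)
        self.accepted_counts[suggestion.name] = count + 1
        self.script.append((suggestion.name, params))


if __name__ == "__main__":
    candidates = [Suggestion("split_on_comma", 0.6), Suggestion("extract_digits", 0.5)]
    session = Session()
    print([s.name for s in session.rank(candidates)])
    session.accept(candidates[1], column="phone")
    session.accept(candidates[1], column="fax")
    print([s.name for s in session.rank(candidates)])
    print(session.script)
```

The point of the sketch is the shape of the interaction: the ranking shifts as the user makes choices, and the accepted steps accumulate into a replayable script without the user writing code.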
Architecture
Trifacta is composed of three integrated layers. The first supports direct data manipulation, allowing users to select and modify data as needed. As they do so, machine learning algorithms that learn as the platform is used generate recommendations, presented as a ranked list of suggested transforms relevant to the task in hand. The effect of every transformation can be viewed in real time on the actual data.
The Learning Layer is the province of the machine learning algorithms. As soon as a data source is connected, these process the data and transform it into a usable format. This includes detecting delimiters, identifying data types (a URL, for example) and attribute properties, and making an initial assessment of the statistical distribution of values.
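As a rough illustration of this kind of initial processing, the sketch below guesses a delimiter, infers a type for each column (including URLs) and counts the most frequent values. It uses simple hand-written rules rather than machine learning, and it is not Trifacta's implementation. The candidate delimiters, the URL pattern and all function names are assumptions.

```python
import re
from collections import Counter

CANDIDATE_DELIMITERS = [",", "\t", ";", "|"]  # assumed candidate set
URL_PATTERN = re.compile(r"^https?://\S+$", re.IGNORECASE)


def guess_delimiter(lines):
    """Pick the candidate that appears the same non-zero number of times
    on every line; return None if no candidate qualifies."""
    best, best_count = None, 0
    for delimiter in CANDIDATE_DELIMITERS:
        counts = {line.count(delimiter) for line in lines}
        if len(counts) == 1:
            n = counts.pop()
            if n > best_count:
                best, best_count = delimiter, n
    return best


def classify(value):
    """Classify a single cell as url, integer, decimal or string."""
    if URL_PATTERN.match(value):
        return "url"
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "decimal"
    except ValueError:
        return "string"


def infer_type(values):
    """Return the most common cell type among the non-empty values."""
    kinds = Counter(classify(v) for v in values if v != "")
    return kinds.most_common(1)[0][0] if kinds else "empty"


def profile(text):
    """Split delimited text and report a type and top values per column.
    Quoted fields are not handled; this is a sketch, not a parser."""
    lines = [line for line in text.splitlines() if line.strip()]
    delimiter = guess_delimiter(lines)
    if delimiter is None:
        raise ValueError("no consistent delimiter found")
    header, *rows = [line.split(delimiter) for line in lines]
    columns = list(zip(*rows))
    return {
        name: {"type": infer_type(col), "top_values": Counter(col).most_common(3)}
        for name, col in zip(header, columns)
    }


if __name__ == "__main__":
    sample = "id;site;score\n1;https://example.com/a;0.5\n2;https://example.com/b;0.75\n"
    for column, summary in profile(sample).items():
        print(column, summary)
```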
The Data Layer supports most data sources, from CSV files to Hadoop installations.