Self-Service Data Preparation Summary
Once upon a time data was neatly arranged into rows and columns. It was a natural format for the highly structured data that accompanied most administrative functions within the business. Sales orders, employee records, purchase orders, accounting information – each transaction had exactly the same format as every other. It made life relatively easy, and database designers could create descriptions of the data (schemas) that were set in concrete, with very little change as the years ticked by.
Nearly all data served the purpose of helping businesses become more efficient, and the focus was firmly on the bottom line. Very few businesses considered that IT might be used for anything other than labour displacement and cost reduction. My own research shows that just a few years ago only one in ten business managers considered that information might be used for top line growth. In fact there was an unwritten assumption that such ideas were little more than flights of fancy. If senior managers in your business still think this way it might be time to jump ship, because businesses in almost every industry are starting to use data in a wholly different way.
Today data is gushing everywhere – social data, data from devices and sensors, text data (customer comments, documents), click-stream data from web sites – and so on. That doesn’t mean all this data is useful. Much of it isn’t, but some of it is very useful indeed, and the new science of data is concerned with teasing out the data that might open up a new market opportunity, make customers happier, reveal a key product enhancement, or do a thousand other things that could contribute to top line growth.
This change of emphasis needs a change of mindset. Administrative functions are prescriptive in nature, and business intelligence platforms have served to dish out regular reports and, more recently, various charts and dashboards. These can tell a business user what has happened or is happening, but they cannot disclose why these things have happened. For that we need causal analysis, and not the descriptive and diagnostic analysis associated with business intelligence (BI). While regular reporting and dashboards with KPIs are as prescriptive as the processes they describe, exploring data for the nuggets that might enhance customer relationships, or reveal a product opportunity, is an iterative, exploratory process.
Now if the only data resource we had was our traditional transaction oriented data there might be little opportunity to significantly enhance the top line. Fortunately businesses increasingly have access to much richer sources of data, which add wholly new dimensions to understanding customers and markets. Text data in the form of customer comments is a rich source of sentiment, and with the relevant tools can reveal insights that might not be gained any other way. Location data freely transmitted from mobile devices gives businesses an opportunity to communicate with customers when they are near a store, a gas station, a hotel, an airport or any other location-sensitive business opportunity. Connections between customers, particularly those revealed in social networks, can be analyzed using graph databases. Streaming data from sensors and devices, both domestic and industrial, allows businesses to anticipate problems and opportunities. Clearly this is something different, and we increasingly find ourselves dealing with data that is much more unstructured, which in turn denies us the opportunity of assuming the data will fit neatly into existing data stores.
The net result is that data are becoming more complex, more diverse and more urgent, and of course are being generated in much greater volumes – enter big data.
Big Data
It is increasingly acknowledged that while the term ‘big data’ is one of the most successful technology buzzwords ever invented, data does not have to be big to qualify. In fact it is diversity of data that is generally much more important when we want to understand customers better, or seize market opportunities in a more timely manner.
The term ‘big data’ arose naturally from the development of the Hadoop platform – expressly designed to handle very large volumes of data using low cost infrastructure. Companies like Google, Yahoo and Facebook needed to store petabytes of data in a scalable manner, and were the main innovators of big data technology. Many businesses, however, do not necessarily want to store petabytes of data, but they do want to process a greater diversity of data types, and process it in near real-time. In typical fashion, the people who create technology clichés came up with the three Vs of big data – volume, variety and velocity. Some added others (also starting with V) such as veracity. If we were only dealing with volume the problem would be almost trivial – set up Hadoop infrastructure and just keep pumping the data in. The much harder problem is that of variety. We now have protocols for data transmission and specification that support an almost infinite variety of data structures (XML and JSON for example). The good news is that we can capture complex data. The bad news is that we need to make sense of it and mould it into a form that is amenable to analytics activities – and it is this requirement that is being met by a new generation of data preparation platforms.
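To make the variety problem concrete, the short Python sketch below flattens a nested JSON document into the rows and columns that analytical tools expect. The order record, its field names and the use of pandas here are purely illustrative, and not a feature of any particular platform.

```python
import pandas as pd

# A hypothetical nested JSON record of the kind an e-commerce API might emit.
order = {
    "order_id": "A-1001",
    "customer": {"id": 42, "name": "J. Smith", "country": "UK"},
    "items": [
        {"sku": "X100", "qty": 2, "price": 9.99},
        {"sku": "Y200", "qty": 1, "price": 24.50},
    ],
}

# Flatten the hierarchy into one row per order line, carrying the
# order-level and customer-level fields down onto each item.
flat = pd.json_normalize(
    order,
    record_path="items",
    meta=["order_id", ["customer", "id"], ["customer", "country"]],
)
print(flat)
#     sku  qty  price order_id  customer.id customer.country
# 0  X100    2   9.99   A-1001           42               UK
# 1  Y200    1  24.50   A-1001           42               UK
```

Once the data is in this flat, self-describing form it can be fed to reporting, visualization or data mining tools that have no notion of hierarchy.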
At the present time big data, and especially the Hadoop ecosystem of technologies, is something of a ‘do it yourself’ assembly kit. This is exacerbated by the continuing rapid evolution of the various components, and users of big data technologies need a considerable skill set both to implement a big data installation and to keep up with developments. It is currently estimated that the number of organizations using big data in anger is in the low thousands – perhaps three or four thousand. Big data will not become mainstream until it becomes more of a ‘plug-and-play’ infrastructure. This will happen, but it is difficult right now to say when and how.
Data Wrangling
So our ‘big data’ infrastructure will be handling a much greater diversity of data. It is no longer simply a matter of connecting to a set of well structured data sources and displaying reports and charts. The data has to be shaped for analysis. Without getting too technical, much modern data may be hierarchical in nature, or formatted on a record by record basis, and analytical tools simply cannot make sense of data in these formats. So the term ‘data wrangling’ has been introduced to cover the processes that are necessary to prepare data for analysis. There are three quite well defined steps needed to take structured, semi-structured and unstructured data and make it fit for purpose:
Exploration – obviously the first thing to do is determine exactly what is in our data stores. With structured data this is easy – the schema will tell us. With semi-structured data we need to extract the ‘schema’ as the data is being explored (a simplified sketch of this follows the list below). The end result should be a description of the data (metadata) and the nature of the content.
Curation – the process of building a structure that can be used by analytical processes once exploration has disclosed the nature of the data.
Production – should the resulting data prove to be useful, it may be desirable to put the creation of the resulting data sets onto a production footing. A host of additional issues then arise, including the robustness of the processes, security, and the automation itself.
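As a deliberately simplified illustration of the exploration step, the Python sketch below infers a rough ‘schema’ from a handful of semi-structured records by noting which fields appear, how often, and with what types. The records and field names are invented for the example.

```python
from collections import defaultdict

# A handful of semi-structured records; not every record carries every field.
records = [
    {"id": 1, "amount": 120.5, "country": "UK"},
    {"id": 2, "amount": 80.0, "country": "FR", "coupon": "SPRING10"},
    {"id": 3, "amount": None, "country": "DE"},
]

# Walk the records and accumulate, per field, the types observed and
# how often the field is present - a crude 'schema on read'.
schema = defaultdict(lambda: {"types": set(), "count": 0})
for rec in records:
    for field, value in rec.items():
        schema[field]["types"].add(type(value).__name__)
        schema[field]["count"] += 1

for field, info in schema.items():
    coverage = info["count"] / len(records)
    print(f"{field}: types={sorted(info['types'])}, present in {coverage:.0%} of records")
```

Commercial platforms do something far more sophisticated, but the principle is the same: the description of the data is discovered from the data itself rather than read from a predefined schema.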
Various roles are involved in this whole process, including the data analyst, the database administrator, the data scientist and, of course, the end user. All three steps are driven by need, and data may be needed for BI, data mining and/or data visualization purposes.
In the era of pure structured data much of this task was accomplished through extract, transform and load (ETL) tools. With diverse semi-structured data types this needs to be augmented with profiling.
Accessing data sources is the first step in the technical process. Many contemporary data preparation platforms come with a large number of connectors to big data sources, traditional relational databases, external applications and cloud-based data sources.
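At its simplest a connector is just code that reads from one kind of source and hands the result on to the next stage. The Python sketch below pulls rows from a local relational database and JSON from a REST endpoint; the database file, table name and URL are hypothetical, and commercial platforms wrap many such connectors behind a point-and-click interface.

```python
import sqlite3
import requests  # third-party HTTP library (pip install requests)

# Structured rows from a relational source (file and table names are placeholders).
conn = sqlite3.connect("sales.db")
rows = conn.execute("SELECT order_id, amount FROM orders").fetchall()
conn.close()

# Semi-structured JSON from a cloud/REST source (the URL is hypothetical).
response = requests.get("https://api.example.com/v1/customers", timeout=30)
customers = response.json()

print(f"{len(rows)} order rows, {len(customers)} customer records")
```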
The next step is transformation of the data, although transformation and profiling are iterative. The aim of transformation is to create a usable schema and variables that are relevant to the task in hand. This process is complete when data quality is adequate and the data are well described by metadata. The ideal scenario is to have records that are self-contained, in that all the information relevant to an event is contained in a single record. Data can also be enriched with data from other data sources, and of course cleaned for erroneous values, missing values and other discrepancies.
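The following Python/pandas sketch (with invented tables and values) shows what cleaning and enrichment can look like at the simplest level: an erroneous value is removed, a missing value is filled, and reference data is joined in so that each record is self-contained.

```python
import pandas as pd

# Invented sample data: a fact table of orders and a reference table of customers.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [42, 42, 7, 99],
    "amount": [120.5, -80.0, None, 65.0],   # -80.0 and the missing value are 'dirty'
})
customers = pd.DataFrame({
    "customer_id": [42, 7, 99],
    "segment": ["Retail", "Wholesale", "Retail"],
})

# Clean: treat negative amounts as errors, then fill missing amounts with the median.
orders.loc[orders["amount"] < 0, "amount"] = None
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

# Enrich: join in the customer segment so each record is self-contained.
enriched = orders.merge(customers, on="customer_id", how="left")
print(enriched)
```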
Profiling is largely concerned with creating descriptive statistics – the distribution of values in an order value field, for example. In fact there are two types of profiling: type-based, which indicates the percentage of values in a field that meet the field’s specification, and distributional, which identifies deviation-based anomalies.
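A minimal Python sketch of both kinds of profiling is shown below, using an invented order value field: type-based profiling reports the percentage of values that match the expected numeric format, and distributional profiling flags values more than two standard deviations from the mean.

```python
import re
import statistics

# Invented order values; '12O.50' contains a letter O and 'n/a' is clearly invalid.
order_values = ["100.00", "110.00", "90.00", "120.00", "80.00", "95.00",
                "105.00", "115.00", "85.00", "9400.00", "12O.50", "n/a"]

# Type-based profiling: what percentage of values match the expected numeric format?
numeric_pattern = re.compile(r"^\d+(\.\d+)?$")
valid = [v for v in order_values if numeric_pattern.match(v)]
print(f"type conformance: {len(valid) / len(order_values):.0%}")

# Distributional profiling: flag values more than two standard deviations from the mean.
numbers = [float(v) for v in valid]
mean, stdev = statistics.mean(numbers), statistics.stdev(numbers)
outliers = [n for n in numbers if abs(n - mean) > 2 * stdev]
print(f"deviation-based anomalies: {outliers}")

# Prints: type conformance: 83%
#         deviation-based anomalies: [9400.0]
```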
This is not a trivial undertaking, and while we all want to see dashboards and use algorithms to find useful patterns, the bulk of the work (currently estimated at fifty to eighty per cent of the analytics task) is data preparation. It should be clear, then, that technologies that can automate this process, partly or completely, are of considerable value.
Data Preparation Platforms
A number of alternatives present themselves for data preparation. At the most basic level programming languages can be used to transform data – R, SQL, Python and SAS are good examples – and even the Unix shell contains commands that can be used for data transformation, although none of these can be called self-service. Others might choose to use specialized spreadsheet applications, although these do not scale particularly well, and a considerable amount of skill is required.
Traditional ETL tools now often come with a visual interface where boxes that perform various functions are dragged onto a canvas and connected to create a workflow. This is definitely a step up from writing program code, but these tools generally offer little added intelligence to aid productivity.
Fortunately a new generation of data preparation tools is available that does contain intelligent functions to considerably speed the data preparation task. These often employ big data technologies, and specifically Spark with its associated machine learning tools, to automate many data preparation tasks. They are typically visual in nature, with rich profiling information and instant feedback on the effects of various transformations. They also tend to suggest transformations based on the nature of the data being processed, and will often join data sources together based on common features.
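The sketch below is not how any particular vendor’s product works internally; it simply illustrates, in Python/PySpark, the kind of Spark capability these tools build on – automatic schema inference over semi-structured JSON, and a crude join on whatever columns two datasets happen to share. The file names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("prep-sketch").getOrCreate()

# Spark infers a schema directly from semi-structured JSON files
# (the file names are placeholders for illustration).
orders = spark.read.json("orders.json")
customers = spark.read.json("customers.json")
orders.printSchema()

# A crude version of 'join on common features': use whatever columns
# the two datasets share as the join key.
common_cols = [c for c in orders.columns if c in customers.columns]
enriched = orders.join(customers, on=common_cols, how="left")
enriched.show(5)
```

The commercial tools go much further, using machine learning to rank candidate joins and transformations, but the underlying engine is frequently Spark doing work of this kind at scale.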
Several products have been reviewed on this web site, including:
ClearStory – data prep to visualization aided with machine learning
Datameer – data prep to visualization with good handling of big data
Datawatch – data prep for complex data types and real-time data visualization
Lavastorm – data prep, visualization, predictive models, run time monitoring
Paxata – data prep using machine learning
Platfora – data prep to visualization with machine learning
Tamr – data prep aided with machine learning
Trifacta – data prep aided with machine learning