Fat Free Guide to Big Data Analytics


Big data analytics is about creating value from data whose variety and volume, and often the speed at which it arrives, exceed what traditional data management infrastructures can handle. Until value is extracted from big data, it is a pure infrastructure cost. New types of database make the analytical options much broader than they were in the past. Before big data, analysis largely consisted of reporting on operational activities, typically as tabular reports and perhaps a few charts. Big data changes all this.

Big data databases accommodate documents, text, row-based transaction data, data streams, graphs (data that is rich in relationships) and more specialised data types such as geospatial data. In parallel, a large number of analytical techniques have evolved to exploit this wide variety of data.

Hadoop is the best-known and most popular big data platform. It scales to massive volumes of data, with databases that support a wide range of data types. Until recently Hadoop was primarily a batch processing environment, although the emergence of Spark, which processes data in memory, has facilitated real-time processing of data at very high speeds.

Here are the main categories of big data analytics:

Business intelligence and data exploration – this is the more traditional analytical use case, primarily focused on transaction data. Because big data supports much greater volume and variety, many more data types can be integrated: social data, web data, location data and data from sensors. Newer BI and data exploration technologies support these data types and enable much richer reporting and data visualisation. This activity is always a look into the past, usually for diagnostic purposes but also to monitor operational activities. The promise of big data is that it facilitates any analysis at any time. We are not there yet, but it’s coming.

Predictive analytics – this concerns itself with trawling through historical data to find patterns of behaviour that might be useful in the future. A good example is identifying customers who might be receptive to a new marketing campaign, based on the profiles of people who responded to a similar campaign in the past. Big data is characterised not only by many more instances (rows) of data, but increasingly by many more attributes, the details that relate to a person or thing. It is now common for hundreds of attributes to be used when looking for exploitable patterns. This is the new Klondike gold rush: the belief that nuggets can be found which will reveal exploitable behaviour of customers, suppliers, machines and even employees! There are many dangers associated with predictive analytics, some of which are unique to big data, but there are also many opportunities.

The techniques used in predictive analytics are drawn primarily from machine learning, but also from statistics. Data mining is the name often given to the process in which these techniques are used to find exploitable patterns. Some judgement is required to establish how much data is relevant to a particular predictive model. A large number of suppliers are crowding into the machine learning and predictive analytics space. Cloud machine learning platforms are attracting attention because they offer virtually unlimited storage and compute resources during model building, when algorithms are searching data for useful patterns and need those resources most.
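
To make the campaign example concrete, here is a minimal sketch of one simple machine learning technique, k-nearest neighbours, written in plain Python. It is only an illustration: the customer attributes (age and purchases in the last year), the sample records and the function name predict_responsive are all invented for this example, and a real model would use many more attributes and far more rows.

```python
from math import dist

# Hypothetical history: (age, purchases_last_year) -> responded to a past campaign?
history = [
    ((25, 2), False),
    ((31, 9), True),
    ((47, 12), True),
    ((52, 1), False),
    ((38, 8), True),
    ((29, 1), False),
]

def predict_responsive(profile, history, k=3):
    """Majority vote among the k past customers most similar to `profile`."""
    # Sort past customers by Euclidean distance to the new profile, keep the closest k.
    nearest = sorted(history, key=lambda row: dist(profile, row[0]))[:k]
    votes = sum(1 for _, responded in nearest if responded)
    return votes * 2 > k

likely = predict_responsive((35, 10), history)    # resembles past responders
unlikely = predict_responsive((27, 1), history)   # resembles past non-responders
print(likely, unlikely)
```

The attributes here are used unscaled, which is fine for a toy example but would let large-valued attributes dominate the distance in practice.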

Text analytics – this comes in many flavours. Natural language processing concerns itself with tasks such as identifying entities (people, places, things) in documents, identifying concepts and themes, translating between languages, and classifying documents. These uses are well known, but big data has enabled new applications, particularly sentiment analysis. This typically involves trawling through huge quantities of social and customer data to identify trends and sentiment toward a company’s own products and those of its competitors. One German car maker interrogated terabytes of Twitter data to discover what made people dissatisfied with their current car. It then adjusted its marketing campaign to great effect. Big data has seen the emergence of many document databases, although the term should not be taken literally. These are often capable of handling graphs and tabular data as well as text data (often using the JSON format). Text data is a large untapped resource in many businesses, and it adds a dimension to analysis that cannot be obtained any other way.
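
As a rough illustration of sentiment analysis, the sketch below scores short posts by counting words from a positive and a negative word list. This is the simplest lexicon-based approach, not the method used by the car maker mentioned above; the word lists, sample posts and the function name sentiment_score are invented for the example, and production systems typically rely on much larger lexicons or trained models.

```python
import re

# Tiny hypothetical lexicon, for illustration only.
POSITIVE = {"love", "great", "reliable", "comfortable"}
NEGATIVE = {"hate", "noisy", "unreliable", "expensive"}

def sentiment_score(text):
    """Return (positive word count) minus (negative word count) for a piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

posts = [
    "Love my new car, so comfortable",
    "The engine is noisy and repairs are expensive",
]
scores = [sentiment_score(p) for p in posts]
print(scores)
```

A score above zero suggests a favourable post and below zero an unfavourable one; word counting of this kind misses negation and sarcasm, which is one reason real text analytics tools are considerably more elaborate.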

Graph analytics – this is a new form of analytics directly driven by the abundance of graph databases and analytic tools. A graph is a set of nodes with connections between them. In a traditional relational database it is the objects themselves that are of interest (customers, employees, suppliers, products etc), and the relationships between them take second place. These relationships are a secondary construct, usually expressed by defining foreign keys. In a graph database the relationship often takes on primary importance. Relationships are represented by edges on the graph, and graph analysis has found fruitful use in the analysis of social networks. Not only can the direct relationships between people be found, but also secondary and tertiary ones, to whatever depth is needed. This technology is often used by security agencies keen to understand an individual’s context. In business, graph analytics can show who is influencing the influencers who influence customers. Graph analytics is still in its infancy, but it’s going to be very big.
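
Finding direct, secondary and tertiary relationships amounts to a breadth-first walk over the edges of the graph. The sketch below shows the idea in plain Python on a small in-memory graph. The people, the connections and the function name connections_by_depth are made up for illustration, and a graph database would normally do this through its own query language rather than hand-written code.

```python
from collections import deque

# Hypothetical social graph as an adjacency list (each edge listed in both directions).
graph = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat", "eve"],
    "eve": ["dan"],
}

def connections_by_depth(graph, start, max_depth):
    """Breadth-first search returning {person: hops from start}, up to max_depth hops."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if depth[person] == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbour in graph.get(person, []):
            if neighbour not in depth:
                depth[neighbour] = depth[person] + 1
                queue.append(neighbour)
    del depth[start]  # the starting person is not one of their own connections
    return depth

reach = connections_by_depth(graph, "ann", 2)
print(reach)
```

Raising max_depth to 3 would also bring in "eve", a tertiary connection reached through "dan".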

Streaming analytics – this is already widely used by firms that operate in capital markets. Time series data, in the form of price information, has to be processed in real time with the aim of identifying patterns and events of interest. The emerging Internet of Things (IoT) is a strong driver for streaming analytics, with devices and sensors delivering high volumes of streaming data. To this end a growing number of stream processing platforms are emerging, with suppliers such as Microsoft offering cloud-based streaming platforms.
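
The core idea of streaming analytics, examining each value as it arrives against a small window of recent history, can be sketched in a few lines of Python. The example below flags prices that deviate sharply from a moving average. The function name detect_spikes, the window size, the 5% threshold and the sample prices are all assumptions chosen for illustration; real stream processing platforms add distribution, fault tolerance and much richer pattern matching.

```python
from collections import deque

def detect_spikes(prices, window=3, threshold=0.05):
    """Yield (index, price) when a price deviates from the moving average of the
    previous `window` prices by more than `threshold` (expressed as a fraction)."""
    recent = deque(maxlen=window)  # holds only the most recent `window` prices
    for i, price in enumerate(prices):
        if len(recent) == window:
            average = sum(recent) / window
            if abs(price - average) / average > threshold:
                yield i, price
        recent.append(price)

# Hypothetical price ticks; any iterable works, including an unbounded feed.
ticks = [100.0, 101.0, 100.5, 100.8, 110.0, 101.0]
spikes = list(detect_spikes(ticks))
print(spikes)
```

Because the function is a generator that keeps only the last few values, it processes one tick at a time and never needs the whole series in memory, which is the property that matters for a continuous feed.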

Other forms of analytics are emerging too. Spatial analytics, for example, can be used by retailers to offer customers a discount on their favourite purchases when they are close to a store. These are early days, and there is a long way to go. Welcome to the big data analytics age.