The economic driver is the rapidly falling cost of acquiring data (from websites, customer feedback, social data, etc.), which means we accumulate much more data at a lower cost.
Big data addresses data volumes, data variety and data processing speeds. Any one of these is enough to qualify, and so big data does not have to be ‘big’! Data variety includes documents, column-based data, tabular data, geospatial data, graph data, streaming data, and so on. High throughput is needed for any number of reasons – real-time streaming data, real-time querying and reporting, and scientific processing (CERN for example). Data from devices and sensors (the Internet of Things) will soon be adding to the need for big data technologies.
Best strategy – wait until the dust settles if you can. If not, then make sure you understand the risks (building castles on shifting sand is a risky business), and have a large enough business return to warrant the risks.
Examples of big data technology include:
Hadoop – not a single product but an ecosystem of data management technologies with distributed processing capability (so that commodity hardware can be used). Batch processing has been its dominant model of data processing, with MapReduce at the heart of its processing methods. MapReduce allows processing to be mapped out to multiple computers, with the partial results then reduced back into a single result. The Hadoop Distributed File System (HDFS) is a clustered approach to managing files in a big data environment.
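The map-then-reduce idea can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not Hadoop's own API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample text are made up for the example, and a thread pool stands in for the separate machines of a real cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group the emitted values by key across all chunks."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks, workers=3):
    # Threads stand in for the worker machines of a cluster (an assumption
    # made for this sketch; Hadoop distributes the work across nodes).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


# Each string plays the role of one block of a much larger file.
chunks = ["big data is big", "data is a cost", "big return"]
counts = word_count(chunks)
print(counts["big"], counts["data"])
```

Each chunk is processed independently in the map step, which is what lets the work be spread over many cheap machines; only the grouped intermediate results need to be brought together for the reduce step.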
Other components of Hadoop include:
- YARN (Yet Another Resource Negotiator) – allocates cluster resources to jobs so that the workload is well distributed.
- Hive – a batch-oriented data warehousing layer with SQL facilities.
- Pig – a script-based language for simpler access to Hadoop facilities.
- Spark (currently flavour of the month) supports in-memory processing and can run outside Hadoop!
Databases
- Riak – a key-value pair database best used with social data and communities.
- MongoDB and CouchDB – document databases suited for high volume content management.
- HBase – a columnar database well suited for messaging systems.
- Neo4j – a graph database often used in social networking applications.
- PostGIS/OpenGeo – for spatial data. Used in 3D modelling and spatially distributed sensor analysis.
If your head is already spinning then be thankful I haven’t mentioned literally dozens of other components that enable other types of big data processing.
Big data is complex, costly (particularly in skills), and fairly risky, because it is still a Wild West frontier. But if you absolutely have to process big data then the technology does at least exist to do it.
Overall it is a technician’s dream (lots of complicated stuff) and a business manager’s nightmare. And remember – big data is a cost. Not until you use the data in some way (analytics for example) does it provide any return.