Big Data Stream Processing Platforms


Hadoop has primarily been used to support batch processing: running tasks to completion against very large amounts of stored data.

As big data continues to evolve, the need to handle real-time streaming data has grown with it. Three candidates stand out at the current time:

InfoSphere Streams is part of IBM’s extensive InfoSphere data management architecture. It supports very high throughput for real-time analytics and is available in a free Quick Start Edition, so users can get a feel for stream computing.

S4 is currently an Apache Incubator project that provides a distributed, scalable, fault-tolerant platform for processing continuous, unbounded streams of data. S4 was released by Yahoo in 2010 and has been used extensively by Yahoo to support thousands of search queries per second.

Storm is an open source real-time computation system that supports real-time analytics, machine learning, continuous computation and any other workload where a stream of data needs real-time attention.

Processes on Storm are called topologies and, unlike a Hadoop job, they never finish: a topology keeps running until it is explicitly stopped. This makes them a natural fit for stream processing. In Storm a stream is defined as an unbounded sequence of tuples, and one of the primary functions of Storm is to distribute the stream across multiple servers in an optimal manner. A full tutorial is available in the Storm documentation.
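To make the idea of an unbounded stream of tuples spread across several servers more concrete, here is a minimal sketch in plain Python. It does not use Storm's API. The names `sentence_source`, `split_words`, `worker_for` and `run` are hypothetical, the "workers" are just in-memory counters, and routing by a hash of the word is an assumption chosen for illustration, loosely similar to what Storm calls a fields grouping. A real topology would run forever, so the sketch takes a `max_tuples` cap purely so that it terminates.

```python
import itertools
import zlib
from collections import Counter
from typing import Iterator, List, Tuple

WordTuple = Tuple[str]


def sentence_source() -> Iterator[str]:
    """Hypothetical source: an unbounded stream that repeats two sentences forever."""
    sentences = ["the cow jumped over the moon", "the moon is bright"]
    return itertools.cycle(sentences)


def split_words(sentences: Iterator[str]) -> Iterator[WordTuple]:
    """Turn each sentence into a sequence of one-field tuples, one per word."""
    for sentence in sentences:
        for word in sentence.split():
            yield (word,)


def worker_for(tup: WordTuple, num_workers: int) -> int:
    """Pick a worker from a stable hash of the word, so equal words share a worker."""
    return zlib.crc32(tup[0].encode("utf-8")) % num_workers


def run(num_workers: int, max_tuples: int) -> List[Counter]:
    """Route the first max_tuples tuples to workers and count words on each worker."""
    counts = [Counter() for _ in range(num_workers)]
    stream = split_words(sentence_source())
    # islice is only here so the demonstration stops; the stream itself never ends.
    for tup in itertools.islice(stream, max_tuples):
        counts[worker_for(tup, num_workers)][tup[0]] += 1
    return counts


if __name__ == "__main__":
    for index, counter in enumerate(run(num_workers=3, max_tuples=10)):
        print(index, dict(counter))
```

Because routing depends only on the word, every occurrence of a given word is counted on the same worker, which is what lets each worker keep a correct local count without coordinating with the others. How Storm actually places work on servers is more involved than this modulo scheme.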