Today developers are analyzing terabytes and petabytes of data in the Hadoop ecosystem, and many projects are helping to accelerate that innovation. All of these projects rely on batch and streaming processing, but what is the difference between the two? Let's dive into the debate around batch vs. streaming.
What is Streaming Processing in the Hadoop Ecosystem?
Streaming processing typically takes place as the data enters the big data workflow. Think of streaming as processing data that has yet to enter the data lake: while the data is queued, it is being analyzed. As new data arrives, it is read and the results are recalculated. A streaming processing job is often called a real-time application because of its ability to process rapidly changing data. While streaming processing is very fast, it is near real-time rather than truly real-time (maybe some day).
The reason streaming processing is so fast is that it analyzes the data before it hits disk. Reading data from disk incurs more latency than reading from RAM. Of course, this speed comes at a cost: you are bound by how much data you can fit in memory (for now...).
To understand the differences between batch and streaming processing, let's use a real-time traffic application as an example. The traffic application is a community-driven driving application that provides real-time traffic data. As drivers report conditions on their commutes, the data is processed and shared with other commuters. The data is extremely time sensitive, since finding out about a traffic stop or fender bender an hour later would be worthless. Streaming processing is used to provide updates on traffic conditions, estimate time to destination, and recommend alternative routes.
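To make that concrete, here is a minimal sketch of what the streaming side of such an application might look like, using Spark Structured Streaming. The feed format, host, port, and field names are all assumptions for illustration, not the traffic application's actual pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, split, window

spark = SparkSession.builder.appName("TrafficStream").getOrCreate()

# Hypothetical feed: each line arrives as "timestamp,road_segment,speed_mph"
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

reports = lines.select(
    split(col("value"), ",").getItem(0).cast("timestamp").alias("ts"),
    split(col("value"), ",").getItem(1).alias("segment"),
    split(col("value"), ",").getItem(2).cast("double").alias("speed"),
)

# Average speed per road segment over the last 5 minutes, recalculated
# as new reports stream in -- the data never has to land in HDFS first.
speeds = (reports
          .withWatermark("ts", "10 minutes")
          .groupBy(window(col("ts"), "5 minutes"), col("segment"))
          .agg(avg("speed").alias("avg_speed")))

query = speeds.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```

The key point is that the aggregation is recomputed continuously on data held in memory, which is why the results stay fresh enough to reroute a commuter.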
What is Batch Processing in the Hadoop Ecosystem?
Batch processing and Hadoop are often thought of as the same thing. All data is loaded into HDFS, and then MapReduce kicks off a batch job to process it. If the data changes, the job needs to be run again. It is step-by-step processing that can be paused or interrupted, but the data set itself does not change mid-job. For a MapReduce job, the data typically already exists on disk in HDFS. Since the data lives on the DataNodes, it must be read from each disk in the cluster that holds a piece of it. Shuffling that data and its results across the cluster becomes the constraint in batch processing.
That is not a big deal unless the batch process takes longer than the data stays valuable. Using the data lake analogy, batch analysis takes place on data in the lake (on disk), not on the streams (data feeds) entering the lake.
Let's step back into the traffic application to see how batch is used. What happens when a user wants to find out what her commute time will be for a future trip? In that case real-time data will be less important (the further away the commute, the more so), and the historic data will be the key to building that model. Predicting the commute can be handled by a batch engine because the data has typically already been collected.
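A sketch of the batch side might look like the following, again using Spark for consistency with the streaming example above. The HDFS paths and schema are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, dayofweek, hour

spark = SparkSession.builder.appName("CommuteModel").getOrCreate()

# Historic reports already sitting on disk in HDFS (hypothetical path
# and schema: columns ts, segment, speed)
history = spark.read.parquet("hdfs:///data/traffic/reports/")

# Average speed per road segment by day of week and hour of day -- the
# kind of aggregate a commute-time prediction model would be built on.
profile = (history
           .groupBy("segment",
                    dayofweek("ts").alias("day_of_week"),
                    hour("ts").alias("hour_of_day"))
           .agg(avg("speed").alias("avg_speed")))

profile.write.mode("overwrite").parquet("hdfs:///data/traffic/segment_profile/")
```

Nothing here is time sensitive: the job reads everything off disk, crunches it once, and writes the results back, which is exactly the shape of work batch processing is built for.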
Is Streaming Better Than Batch?
Asking if streaming is better than batch is like asking which Avenger is better. If the Hulk can tear down buildings, does that make him better than Iron Man? What about the scene in Age of Ultron where the Avengers were trying to reflect off the Vibranium core? How would all that strength have helped there? In that case, Iron Man and Thor were better than the Hulk.
Just like with the Avengers, streaming and batch are better when working together. Streaming processing is extremely well suited to cases where time matters. Batch processing shines when all the data has been collected and is ready for testing models. There is no "one is better than the other" argument right now with batch vs. streaming.
The Hadoop ecosystem is seeing a huge shift toward coupling streaming and batch together to provide both processing models. Both workflow types come at a cost, so analyzing data with a streaming workflow when a batch workflow would do adds cost to the results. Be sure the workflow matches the business objective.
Batch and Streaming Projects
Both workflows are fundamental to analyzing data in the Hadoop ecosystem. Here are some of the projects and which workflow camp they fall into:
MapReduce – MapReduce is where it all began. Hadoop 1.0 was all about storing your data in HDFS and using MapReduce to analyze it once it was loaded. The process could take hours or days depending on the amount of data.
Storm – Storm is a real-time analysis engine. Where MapReduce processes data in batches, Storm does its analysis on streams, as data is ingested. Once seen as the de facto streaming analysis engine, it has lost some of that momentum with the emergence of other streaming projects.
Spark – A processing engine for streaming data at scale, and the most popular streaming engine in the Hadoop ecosystem, with the most active contributors. It does not require data to be in HDFS for analysis.
Flink – A hybrid processing engine that supports both streaming and batch processing models. Data is broken down into bounded (batch) and unbounded (streaming) sets. At its core it is a stream processing engine that incorporates batch processing.
Beam – Another hybrid processing engine that unifies streaming and batch processing. It runs on both Spark and Flink, and it has heavy support from the Google family. There is a great deal of optimism around Beam in the Hadoop ecosystem right now because of its ability to run both batch and streaming processing depending on the workload; see the sketch after this list.
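To show what that unified model looks like, here is a minimal Beam pipeline sketch in Python. The input path, field layout, and output location are assumptions for illustration; the same transforms would apply to an unbounded (streaming) source, and the pipeline can be pointed at a Spark or Flink runner through its options.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The same transform chain works whether the source is bounded (batch) or
# unbounded (streaming); only the read step changes. Paths are hypothetical,
# and input lines are assumed to look like "timestamp,segment,speed".
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "Read" >> beam.io.ReadFromText("/data/traffic/reports-*.csv")
     | "Parse" >> beam.Map(lambda line: line.split(","))
     | "KeyBySegment" >> beam.Map(lambda fields: (fields[1], float(fields[2])))
     | "AvgSpeed" >> beam.combiners.Mean.PerKey()
     | "Format" >> beam.Map(lambda kv: "%s,%.1f" % kv)
     | "Write" >> beam.io.WriteToText("/data/traffic/avg_speed"))
```

Swapping the read step for an unbounded source turns the same pipeline into a streaming job, which is exactly the flexibility driving the optimism around Beam.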
Advice on Batch and Streaming Processing
At the end of the day, a solid developer will want to understand both workflows. It’s all going to come down to the use case and how either workflow will help meet the business objective.