Spark & Hadoop Workloads are Huge
Data Engineers and Big Data Developers spend a lot of time developing their skills in both Hadoop and Spark. For years, Hadoop’s MapReduce was king of the processing portion of Big Data applications. However, over the last few years Spark has emerged as the go-to for processing Big Data sets. Still, it can be unclear what the differences are between Spark & Hadoop. In this video I’ll break down the differences every Data Engineer should know between Spark & Hadoop.
Make sure to watch the video below to find out the differences and subscribe to never miss an episode of Big Data Big Questions.
Transcript – What is the Difference Between Spark & Hadoop
Hi, folks, Thomas Henson here with thomashenson.com, and today is another episode of Big Data Big Questions. And so, today’s question is one I’ve been wanting to tackle for a long time. I’m not sure why I haven’t gotten to it, but I’m ready to get to it. So, it’s the ultimate showdown. What’s the difference between Hadoop and Spark, and which one will win in the fight? So, find out how I’ll answer that question right after this.
Welcome back. So, today’s question comes in from a user. It came on through a YouTube comment section. So, post your question down here below. You can actually go to my website and go to Big Questions. So, thomashenson.com/bigquestions. Put it out on Twitter. Use the hashtag Big questions. I’ll look it up, try to answer those questions.
So, today’s question comes in and it says, YouTube comment, “Nowadays, there are predominantly two softwares that are used for dealing with big data, Hadoop Ecosystem and Spark. Could you elaborate on the similarities and differences in those two technologies?”
So, that’s an amazing question. It’s one that we hear all the time. So, Hadoop is a very mature technology, it’s been out there. Really, it’s associated with a lot of the things that are going on in the Big Data community, and when you say big data, it’s almost synonymous that you’re going to say Hadoop as well. But with Hadoop being over 10 years, maybe 13 years old, just depending on how you look at it, a lot of people are calling for its death, and Spark is the one that’s going to do that.
But there’s a little bit of difference. Like I said, we say that Hadoop is this all-encompassing thing. You hear me say it all the time, the Hadoop Ecosystem. So, I call it an ecosystem because a lot of things get pulled into the Hadoop Ecosystem. A lot of people say things, like assuming that Hadoop runs and does all the processing, and has all the functionality for your applications, or if you’re running it. But in a lot of data centers, you can run big data clusters and not be using Hadoop or not be using MapReduce.
And so, let me explain a little bit what I mean really by the true definition for Hadoop and then we’ll talk a little bit about Spark. So, Hadoop is built of two components. So, we separate it out into two different components. And so, the first one we’re going to break down is MapReduce. So, you’ve probably heard of MapReduce. That’s what started us being able to process large datasets, and it’s somewhat of an indexing way to process data. So, if you have a cluster, you’re able to run your mapper and your reducer jobs, and be able to process data that way, and that functionality is called MapReduce. That’s one portion of Hadoop.
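To make the mapper and reducer idea a little more concrete, here is a minimal sketch of a word count in plain Python. It is not Hadoop's actual API, and it runs on one machine rather than a cluster. The function names (`mapper`, `shuffle`, `reducer`, `run_job`) and the sample lines are assumptions made up for illustration. It only shows the general shape: mappers emit key-value pairs, the pairs are grouped by key, and reducers combine each group.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    groups = shuffle(pairs)
    return dict(reducer(key, values) for key, values in groups.items())


if __name__ == "__main__":
    # Hypothetical sample input; a real job would read its lines from storage.
    sample = ["spark and hadoop", "hadoop stores data", "spark processes data"]
    print(run_job(sample))
```

On a real cluster the mappers and reducers run in parallel on many nodes, and the framework handles the grouping step between them. The sketch keeps everything in one process so the three phases are easy to see.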
Another part of Hadoop, the really cool part, the part that I’ve been involved with a ton, is called the Hadoop Distributed File System or HDFS. And so, HDFS is the way that all the data is stored. And so, we have our MapReduce that’s controlling how the data is going to be processed, but HDFS is how we store that data. And so, many applications, whether they’re in the Hadoop Ecosystem, or new to data processing, or even just scripting, use HDFS to be able to pull data and be able to use your data as a file system.
And so, you have those two pieces right there and those two components.
When they talk about Hadoop being old, or Hadoop being slow, or portions of Hadoop that people aren’t interested in, most of the time they’re talking about the MapReduce portion. And so, there’s been a lot of things that have come out. So, there’s been MapReduce 1, and then MapReduce version 2, and Tez, and just different components around to compete with MapReduce, and Spark is one of those technologies as well.
And so, Spark is a framework. It’s called lightning fast, but it’s a framework for processing data. And so, you can still process your data that exists in HDFS, or that exists in S3. There are other places that your data can exist and be processed by Spark, but predominantly, in most of the data centers, they still have their data in HDFS. So, things were built upon HDFS. HDFS is where your data is housed and so you process it whether you’re using Spark, whether you’re using Tez, whether you’re using any new way of processing the data, or you still may be using MapReduce, but you can have all that in HDFS. So, when you think about it, the two do compete, but primarily as processing engines.
And so, I’ve got a couple of blog posts out there that I’ll link to here in the show notes, but you can go out and see where I break down the difference between batch and streaming, and some of those different workloads. And so, Spark really came on whenever we started talking about being able to stream data and being able to process data faster as it comes in.
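The batch versus streaming distinction can be sketched in a few lines of plain Python. This is only a conceptual illustration and does not use Spark's or Hadoop's APIs. The function names (`batch_count`, `streaming_count`) and the sample events are assumptions invented for the example. The batch version waits for the whole dataset and then produces one answer. The streaming version updates a running answer as each record arrives.

```python
from collections import Counter


def batch_count(events):
    """Batch style: take the full dataset, then process it in one pass."""
    return Counter(events)


def streaming_count(event_stream):
    """Streaming style: update a running result as each record arrives,
    yielding a snapshot of the counts after every update."""
    running = Counter()
    for event in event_stream:
        running[event] += 1
        yield dict(running)


if __name__ == "__main__":
    # Hypothetical events; in practice these would arrive from a live source.
    events = ["click", "view", "click"]

    print("batch result:", dict(batch_count(events)))

    for snapshot in streaming_count(iter(events)):
        print("streaming snapshot:", snapshot)
```

Both approaches reach the same final counts. The difference is that the streaming version has a usable answer after every record instead of only at the end.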
And so, that’s why you see a lot of people talking about Hadoop being the past technology and Spark being the newer technology that’s going to take over the world. There are still going to be components from traditional Hadoop, like we talked about with HDFS. That’s probably still going to be used, I think, for a long time. Like I said, there’s still a ton of people and a ton of developers still using MapReduce. And so, MapReduce has its functionalities for when we talk about batch workloads, and there’s still development going on with MapReduce 1, and then Tez and some other platforms that are encompassed in the Hadoop community.
So, I would say, if you’re looking at it from a learning perspective, all right, which one do I want to learn, do I want to learn Hadoop, or do I want to learn Spark, and thinking that it’s all or nothing. I would say it’s not. I would focus mainly depending on what you’re looking to do, but I would definitely focus and learn HDFS, and so understand how the file system works and how you can compress, and how you can make those calls because chances are you’re going to be using HDFS and you’re also going to be using Spark, and Tez, and HBase, and Pig, and Hive, and a lot of different other tools in the ecosystem.
And so, I would say, it’s not an either-or. You’re not going to pick and say, “I’m only going to do Spark,” or, “I’m only going to do Hadoop.” You’re more than likely going to be using a lot, using Spark for your streaming applications and for your processing of your data, but you’re still using Hadoop, and the things in Hadoop with HDFS, and being able to manage your data maybe with the [INAUDIBLE 00:06:28], and some of the other functionalities that are in that ecosystem. So, it’s not an all or nothing thing. And so, learning one is not going to stop you from getting a job or prevent you from learning the other one. So, it’s not an either-or thing, but if you’re asking who will win in the future, I would say they both win.
Well, that’s all I have for today. Make sure to subscribe to the channel so you never miss an episode. We’ve got a lot of things that we’re working on, so we’ve got some Isilon quick tips that are still rolling out. We’ve got some book reviews, and we’re starting to get some interviews, so you can see some interviews that [INAUDIBLE 00:07:03] in the past, and then also these Big Data Big Questions. And so, anything that you want to see, just pop it here in the comment section and I’ll try to answer it or try to tackle it the best I can. Thanks again.