Comparing Growth in Kubernetes & Hadoop
How do you select a career path in Data Engineering? There are so many options for technologies to learn for System Administrators in Data Engineering. While Hadoop has been king for the past decade, we must make way for a higher-level technology. Kubernetes (K8s) is a distributed computing technology, but it is vastly different from Hadoop on the data processing side. K8s is an open-source technology for automating the deployment, management, and scaling of containers. If we are comparing it to Hadoop, think of the comparison on the Systems Administration side of Data Engineering, not the Software Engineering side.
The question is: how does the K8s ecosystem compare to that of Hadoop?
Should I add K8s to my list of must-learn technologies?
Kubernetes vs. Hadoop Transcript
Hi, folks. Thomas Henson here, with thomashenson.com. Today is another episode of Big Data Big Questions. Today, in this episode we’re going to be talking and breaking down Kubernetes versus Hadoop and talking about specifically which one I would prefer, if I was starting out today, to learn as a data engineer. Before we even get into it, I’m not saying that these are the same technologies, but I am comparing the popularity and I’m also comparing what some of the innovations are that we want to study as data engineers, or just people in the industry as a whole, where do we see those markets. So, let’s jump right into that right after this.
All right, so, today’s question comes in… We’re going to talk a little bit about the popularity of Kubernetes specifically, and Hadoop, and where we’re at. One of the biggest things right now is the popularity of Kubernetes in the container world has eclipsed the popularity of Hadoop and the Hadoop ecosystem. If we were looking at a chart, we would see the number of contributions and development to those open source platforms, that Hadoop has just been eclipsed by it. Now, Kubernetes is not replacing Hadoop, but it is changing the way… And there are innovations in Hadoop that are taking advantage of containers and specifically Kubernetes. So, let’s break it down and talk a little bit about each one of these and then we’ll do a comparison of where we see it going for data engineers and then also provide recommendations for which one I would choose if I was starting fresh today.
Kubernetes is an open source orchestration system for automating application deployment, scaling, and management. It was originally designed by Google. Hmmm. That sounds familiar, right? [Laughs] So many things in the open source world come from there. But if you think about Kubernetes… If you’ve done anything with containers, and specifically around Docker, Kubernetes is that orchestration, that layer that allows for you to cluster these together. Think of how YARN works in the Hadoop world. We’ll talk a little bit about Hadoop here soon. Think of it as just being able to orchestrate, “I have all these different nodes that are deploying my application or running different portions of my application.” Kubernetes is the secret sauce to say, “Hey, I need three of these, or four of these,” and be able to not only do some of the load balancing and some of the other pieces, but the orchestration to scale those up and make them elastic and scale them down.
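The "I need three of these, or four of these" idea can be sketched in a few lines of Python. This is only a toy model of declaring a desired count and letting an orchestrator converge on it. `ToyCluster`, `reconcile`, and the `app-N` names are hypothetical and are not part of the Kubernetes API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Hypothetical stand-in for a cluster; this is not the Kubernetes API."""
    desired_replicas: int
    running: list = field(default_factory=list)
    _next_id: int = 0

    def reconcile(self) -> None:
        """Start or stop copies until the running count matches the desired count."""
        # Scale up: start new copies while we have fewer than requested.
        while len(self.running) < self.desired_replicas:
            self.running.append(f"app-{self._next_id}")
            self._next_id += 1
        # Scale down: stop the most recently started copies while we have too many.
        while len(self.running) > self.desired_replicas:
            self.running.pop()


cluster = ToyCluster(desired_replicas=3)   # "I need three of these"
cluster.reconcile()
print(cluster.running)

cluster.desired_replicas = 4               # "or four of these"
cluster.reconcile()
print(cluster.running)

cluster.desired_replicas = 2               # elastic: scale back down
cluster.reconcile()
print(cluster.running)
```

The point of the sketch is the shape of the interaction: you state how many copies you want, and the orchestrator does the work of getting there, whether that means scaling up or scaling down.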
Kubernetes is synonymous with cloud-native. What’s going on with cloud-native? Being able to move applications from Azure to AWS and make it seamless, or on-prem. To be able to develop something on-prem and be able to push it out. Really cool, really popular. Just another open source piece that’s come out of Google. Man, thank goodness for Google and all the things that they’ve done for the open source world. But, just another way to do that orchestration. So, from a data engineer’s perspective, really cool for us because it changes how we can deploy and use our applications. Back to what we were talking about, even the cloud-native piece. Being able to deploy out, start something, from a POC perspective, to be able to start and take advantage of being able to use something on-prem or do it in the cloud and then pull it back down, that’s really cool. And then also changes that application layer, but first let’s talk a little bit more about Hadoop, then we’ll talk about how it all fits together.
So, Hadoop, we’ve talked about it a bit here. Been around for a long time, synonymous with being able to scale out and make large-scale data decisions, like, be able to have that storage layer and also be able to analyze your data too. So, think of it as a node-based architecture similar to what we were talking about with Kubernetes, little bit different, but a node-based architecture that’s going to allow for us to analyze data and be able to make decisions with it over a large cluster, like, you’re building out a huge system here. So, synonymous with, originally, in early days, MapReduce, but it’s taken on more of a Hadoop ecosystem where we’re talking about not only that storage layer of HDFS, but also that processing layer that actually is going to allow for us to process the data on individual nodes, bring those back to the user, and be able to take advantage of all that under the covers. Also another product built and written off of research that was published by Google. Not specifically open sourced by Google, but some of the research papers that they pushed out there led to the writing of a research paper and the open source portion of Hadoop that became popular. It’s been around… If you follow this channel, you’ve heard me talk about Hadoop a good bit too.
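The "process the data on individual nodes, bring those back to the user" idea is the MapReduce pattern. Here is a minimal sketch of it in plain Python as a word count. It runs in one process and does not use Hadoop; each string in `chunks` stands in for a block of data that would live on a different node, and the function names are made up for illustration.

```python
from collections import defaultdict


def map_phase(chunk):
    """Runs independently on each chunk: emit a (word, 1) pair for every word."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key so each key can be reduced in one place."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for a block of data stored on a different node.
chunks = ["big data big questions", "big cluster"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

In a real cluster the map step runs where the data is stored and only the smaller intermediate results move across the network, which is what makes the pattern scale.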
Now, let’s talk a little bit about the architecture. When we’re talking about Hadoop and the architecture, we’re talking about running our data processing maybe with Spark, or Hive, or it could be Pig, or anything like that. Then you’ve got your layer that is your orchestration and that’s where YARN comes in. It’s going to manage the resources that we have spread out on all these different nodes. And then your data layer comes in at HDFS. That’s where my data is stored, from an HDFS perspective. Well, what Kubernetes can do, so how it fits in the Big Data world and where we really see it is… Think of Spark, TensorFlow, any of those tools that we were talking about. Now our orchestration layer can be actually with Kubernetes. And then we can do persistent storage in our databases, S3, or there are some innovations out there that you can do it in HDFS, but you’re seeing it used more in a cloud-native perspective, so little bit different change of the architecture. The really cool thing about it is you’re actually abstracting away, so you’re not only just using YARN just for building out this cluster, building out a system, you’ll be using Kubernetes, which can also do your application development too, like, you are… I’m sorry, application deployment. So, you know, being able to build cloud-native applications that are not just for data analytics but maybe serving out your web host and some of the other pieces. So, really a lot of innovation around that and really cool. There’s a lot of stuff and I just can’t go into it. I’m trying to give you a high level from here but there’s a lot of resources, a lot of courses out there, a lot of other things that maybe we should pick up at some point to talk about, around the innovations with Kubernetes. It’s really cool. It’s really changing, it’s in flux. If anybody is saying that, “Hey, it’s always going to be this way,” it’s one of those things that’s continuing to innovate, just like the data analytics area too.
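To make the "orchestration layer can be Kubernetes" point a little more concrete, here is a rough sketch of how a Spark job might be pointed at a Kubernetes cluster instead of YARN. It only builds and prints a `spark-submit` command; it does not run anything. The API server URL, image name, and job name are placeholder assumptions, and the flags shown are the commonly documented ones for Spark on Kubernetes, so check them against the documentation for your Spark version.

```python
import shlex


def build_spark_submit(api_server, image, app_path, executors=3):
    """Assemble a spark-submit command that targets Kubernetes rather than YARN."""
    return [
        "spark-submit",
        # The k8s:// prefix tells Spark to use Kubernetes as the cluster manager.
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", "example-job",  # placeholder job name
        # How many executor pods to request.
        "--conf", f"spark.executor.instances={executors}",
        # Container image the driver and executor pods are started from.
        "--conf", f"spark.kubernetes.container.image={image}",
        app_path,
    ]


cmd = build_spark_submit(
    api_server="https://k8s.example.com:6443",      # placeholder API server
    image="example.registry/spark:latest",          # placeholder image
    app_path="local:///opt/spark/examples/src/main/python/pi.py",
)
print(shlex.join(cmd))
```

Notice what is missing compared to a classic Hadoop setup: there is no YARN resource manager in the picture, and the job's data would typically come from object storage such as S3 rather than from HDFS on the same nodes.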
At the end of the day, I guess the question is, if I were starting out today, would I focus solely on Hadoop, would I focus solely on Kubernetes if I can only choose one? Which, you can never only choose one, but I appreciate and I love these types of questions too because they really push me to make a decision. So, today… One of the things, the biggest thing… Cloud is a huge topic and being able to do things cloud-natively, so being able to support it on-prem, being able to support it and push it out to different multi-cloud… There’s so many different topics and buzzwords in that, but if we really look at that and that decision making, Kubernetes is really one of the big portions around that, and that has a huge impact on what we’re doing from a data analytics perspective.
And frankly, because Hadoop is not so easy to use in the cloud, I think that’s one of the reasons that we’ve seen Hadoop wane in its growth. We’ve seen Hortonworks, Cloudera merge together, MapR be picked up and purchased, and then also IBM’s BigInsights, because of the fact that these were systems that would only work on-prem. You had other options in other cloud perspectives, but AWS had their version versus Azure had their version under the covers, but if you wanted to really pull it back into your own on-prem area or push it out, it was a little more complicated, and you couldn’t just move it from AWS to Azure. Kubernetes has really pushed that, not just from the analytics world, but from that perspective.
So, if you made me choose today and you said, “Hey, man, you can only choose one and it’s something that you’ve got to get skilled up on in the next three to six months,” I’d choose Kubernetes. Not saying that I would not learn Hadoop from that perspective, but if I had to choose between the two of those right now, I think there’d be a bigger opportunity for data engineers and specifically systems administrators and those kinds of people that are more hands-on with the administration piece. I think that’s where we’re going to see a lot… And you’ve seen a lot with the open source tools out there in the Hadoop ecosystem, like Spark and some of the things going on with Project Submarine, just being able to support containers.
That’s all I have today for Big Data Big Questions. If you have a question make sure you put it in the comment section here below or reach out to me. I’ll do my best to answer those questions here, on the next episode of Big Data Big Questions.