Major Hadoop Release!
Hadoop 3.0 has dropped! There is a lot of excitement in the Hadoop community about the 3.0 release. Now is the time to find out what’s new in Hadoop 3.0 so you can plan an upgrade for your existing Hadoop clusters. In this video I will explain the major changes in Hadoop 3.0 that every Data Engineer should know.
Transcript – What’s New in Hadoop 3.0?
Hi, folks. I’m Thomas Henson with thomashenson.com, and today is another episode of Big Data, Big Questions. In today’s episode, we’re going to talk about some exciting new changes in Hadoop 3.0 and why Hadoop has decided to go with a major release in Hadoop 3.0, and what all is in it. Find out more right after this.
Thomas: So, today I wanted to talk to you about the changes that are coming in Hadoop 3.0. So, it’s already gone through the alpha, and now we’re actually in the beta phase, so you can actually go out there and download it and play with it. But what are these changes that are in Hadoop 3.0, and then why did we go with such a major release for Hadoop 3.0? So, what all is in this one? There are two major ones that we’re going to talk about, but let me talk about some of the other ones that are involved with this change, too. So, the first one is more support for containerization. And so if you go through Hadoop 3.0, when you go to the website, you can actually go through some of the documentation and see where they’re starting to support some of the Docker pieces. And so this is just more evidence for the containerization of the world. We’ve seen it with Kubernetes. There’s a lot of different other pieces that are out there with Docker. It’s almost like a buzzword to some extent, but it’s really, really been popularized.
They’re really cool changes, too, when you think about it. Because if we go back to when we were in Hadoop 1.0 and even 2.0, it’s kind of been a third rail to say, “Hey, we’re going to virtualize Hadoop.” And now we’re fast forwarding and switching to some of the containers, and so that’s going to be some really cool changes that are coming. Obviously there’s going to be more and more changes that are going to happen [INAUDIBLE 00:01:37], but this is really laying some of the foundation for that support for Docker and some of the other major container players out there in the IT industry.
Another big change that we’re starting to see… Once again, this is another… I won’t say it’s a monumental change, but it’s just more evidence for support for the cloud. And so the first one is there’s some expanded support for Azure Data Lake. So, think the unstructured data there. Maybe some of our HDFS components. And then also some big changes for Amazon’s AWS S3. So, with S3, they’re actually going to allow for easier management of your metadata with DynamoDB, which is a huge NoSQL database used in the AWS platform. So, those are two…I would say some of the minor changes. Those changes alone probably wouldn’t have pushed it to be a Hadoop 3.0 or a major release.
The two major changes…and these are going to deal with the way that we store data, and also with the way that we protect our data for disaster recovery, when you start thinking of those enterprise features that you need to have. And so the first one is support for more than two namenodes. We’ve had support since Hadoop 2.0 for a standby namenode. Before we had a standby namenode, even with a secondary namenode, if your namenode went down, your Hadoop cluster was all the way down, right?
Because that’s where all your metadata is stored, and it knows what data is allocated on each of the data nodes. And so in Hadoop 2.0 we were able to have that standby namenode and that shared journal, where if one namenode went down, another one could take over. But when we start thinking about fault tolerance and disaster recovery for enterprises, we probably want to be able to expand that out. And so this is one of the ways that we’re actually going to tackle that in the enterprise.
So, being able to support more than two namenodes. And so if you think about it with just doing some calculations, one of the examples is if you have three namenodes, and you have five journal nodes, you can actually take two losses of a namenode. So, you could lose two namenodes, and your Hadoop cluster would still be up and running, still be able to run your MapReduce jobs, or if you’re using Spark or something like that, you still have your access to your Hadoop cluster there. And so that’s a huge change when we start to think about where we’re going with the enterprise and just the enterprise adoption. So, you’re seeing a lot of features and requests that are coming from the enterprise customer saying, “Hey, this is the way that we do DR. We’d like to have more fault tolerance built in.” And you’re starting to see that.
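The arithmetic behind that example can be sketched in a few lines of Python. This is only an illustration, not anything that ships with Hadoop: the function names are made up, and it assumes the journal nodes need a strict majority alive to keep accepting edits and that a single surviving namenode is enough to keep the cluster serving.

```python
def journal_failures_tolerated(journal_nodes: int) -> int:
    """Journal nodes that can fail while a strict majority is still alive."""
    return (journal_nodes - 1) // 2


def namenode_failures_tolerated(namenodes: int) -> int:
    """Namenodes that can fail while one is still left to act as active."""
    return namenodes - 1


if __name__ == "__main__":
    # Hadoop 2.0 style pair versus the three-namenode example from the talk.
    for namenodes, journals in [(2, 3), (3, 5)]:
        print(
            f"{namenodes} namenodes, {journals} journal nodes: "
            f"survives {namenode_failures_tolerated(namenodes)} namenode failure(s) "
            f"and {journal_failures_tolerated(journals)} journal node failure(s)"
        )
```

Under those assumptions, going from two namenodes with three journal nodes to three namenodes with five journal nodes moves the tolerated failures from one to two on both sides.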
So, that was a huge change. One caveat around that…there’s support for those extra namenodes, but they’re still in standby mode. So, they’re not what we would talk about when we talk about HDFS federation. So, it’s not supporting three or four different namenodes in different portions of HDFS. I’ve actually got a blog post that you can check out about HDFS federation and kind of where I see that going and how that’s a little bit different, too. So, that was a big change.
And then the huge change…I’ve seen some of the results on this before it even came out to the alpha. I think they did some testing at Yahoo Japan. But it’s about using erasure coding for storing the data. So, if you think about how we store data in HDFS… If you remember the default three, so three times replication. So, as data comes in, your namenode places it on one of your data nodes, and then two [INAUDIBLE 00:05:04] copies are moved to a different rack on two different data nodes. And so that’s to give you that fault tolerance there. So, if you lose one data node, you’re able to get your data and have your data in a separate rack that still would be able to run your MapReduce jobs or your Spark jobs, or whatever you’re trying to do with your data. Maybe just trying to pull it back.
That’s how we traditionally stored it. If you needed more protection, you just bumped it up. But that’s really inefficient. Sometimes we would talk about that being 200% of your data for one portion of your data block. But really, it’s more than that because most customers will have a DR cluster, and so they have it triple replicated over there. So, when you start to think about, “Okay, in our Hadoop cluster, we have it triple replicated. In our DR Hadoop cluster, we have it triple replicated.” Oh, and the data may exist somewhere else as the source data outside of your Hadoop clusters. That’s seven copies of the data. And how efficient is that for data that’s maybe mostly archive? Or maybe it’s compliance data. You want to keep it in your Hadoop cluster.
Maybe you run [INAUDIBLE 00:06:03] over it once a year. Maybe not. Maybe it’s just something you want to hold on to so if you do want to run a job, you can. So, what erasure coding is going to do is it’s going to give you the ability to store that at a different rate. So, instead of having to triple replicate it, what erasure coding basically does is it says, “Okay, if we have data, we’re going to break it into six different data blocks, and then we’re going to store three [INAUDIBLE 00:06:27] versus when we’re doing triple replication think of having 12.” And so the ability to break that data down and be able to pull the data back from the [INAUDIBLE 00:06:36] is going to give you that ability to get a better ratio for how you’re going to store that data and what your efficiency rate is, too.
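The idea of pulling lost data back from a small amount of extra stored information can be shown with a toy example. This sketch uses plain XOR parity over three made-up blocks, which is a much simpler stand-in for the Reed-Solomon style coding behind a six-data, three-parity layout; the block contents and function names are assumptions for illustration only, and XOR parity can only rebuild one lost block.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving: List[bytes], parity: bytes) -> bytes:
    """Recover the single missing data block from the survivors plus parity."""
    return xor_blocks(surviving + [parity])


# Made-up, equal-size data blocks standing in for pieces of a file.
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]

# One extra parity block is stored instead of full extra copies.
parity = xor_blocks(data_blocks)

# Pretend the node holding the middle block is lost.
lost = data_blocks[1]
recovered = rebuild_missing([data_blocks[0], data_blocks[2]], parity)

print(recovered == lost)  # prints True
```

The trade-off this hints at is that a rebuild has to read the surviving blocks and do some computation, where replication just reads another copy.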
So, instead of 200%, maybe it’s going to be closer to 125 or 150. It’s just going to depend as you scale. Just something to look forward to. But it’s really cool because that gives you the ability to, one, store more data: bring in more data, hold on to it, and not think so much about…okay, this is going to take up three times the data just for how big the file is. And so it gives you the ability to hold on to more data and take somewhat more of a risk and be like, “Hey, I don’t know that we need that data right now, but let’s hold on to it because we know that we can use erasure coding, and we can store it at a different rate. And then as we start to need it, or if it’s something that we need later on, we can bring that back and take that away.” So, think of erasure coding as more of an archive for your data in HDFS.
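To make those percentages concrete, here is a rough sketch of the storage math. It assumes the six-data, three-parity layout mentioned above and compares it with plain triple replication; the function names are hypothetical, and real clusters will differ with block sizes, small files, and whichever policy is actually configured.

```python
def replicated_blocks(data_blocks: int, copies: int) -> int:
    """Raw blocks stored when every data block is kept `copies` times."""
    return data_blocks * copies


def erasure_coded_blocks(data_blocks: int, parity_blocks: int) -> int:
    """Raw blocks stored when parity blocks are added to the data blocks."""
    return data_blocks + parity_blocks


def overhead_percent(raw_blocks: int, data_blocks: int) -> float:
    """Extra storage on top of the original data, as a percentage."""
    return 100.0 * (raw_blocks - data_blocks) / data_blocks


if __name__ == "__main__":
    data = 6
    replicated = replicated_blocks(data, copies=3)              # 18 raw blocks
    coded = erasure_coded_blocks(data, parity_blocks=3)         # 9 raw blocks
    print(f"3x replication: {replicated} blocks, "
          f"{overhead_percent(replicated, data):.0f}% extra")
    print(f"6 data + 3 parity: {coded} blocks, "
          f"{overhead_percent(coded, data):.0f}% extra")
```

With these assumptions, triple replication carries 200% extra storage, while the six-plus-three layout carries 50% extra, which is 150% of the original size in total.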
And so those are the major changes in Hadoop 3.0. I just wanted to talk to you guys about that and just kind of get that out there. Feel free to send me any questions. So, if you have any questions for Big Data, Big Questions, feel free to go to my website, put it on Twitter, #bigdatabigquestions, put it in the comments section here below. I’ll answer those questions here for you. And then as always, make sure you subscribe so you never miss an episode. Always talking big data, always talking big questions and maybe some other tidbits in there, too. Until next time. See everyone then. Thanks.
Show Notes
Hadoop Summit Slides on Yahoo Japan Hadoop 3.0 Testing
DynamoDB NoSQL Database on AWS