Ultimate Battle TensorFlow vs. Hadoop

The Battle for #BigData

This post has been a long time coming!

Today I talk about the difference between TensorFlow and Hadoop. While Hadoop was built for processing data in a distributed fashion, there are some similarities with TensorFlow. One is that both trace back to Google: TensorFlow was developed there, and Hadoop grew out of the papers Google published on its file system and MapReduce. Another is that both were created to bring insight to data, although they take different approaches to that mission.

Who is now the king of #BigData? To be fair, the comparison is not like for like, but most of the time the two get lumped together as if it has to be one or the other. Find my thoughts on TensorFlow vs. Hadoop in the latest episode of Big Data Big Questions.

Transcript – Ultimate Battle TensorFlow vs. Hadoop

Hi folks! Thomas Henson here with thomashenson.com. Today is another episode of Big Data Big Questions. Today’s question is really a conversation that I heard about from, actually, my little brother, when he was talking about something that he heard at a conference. He brought it to my attention. “Hey, Thomas, you’re involved in big data. I was talking to some folks at a GIS conference around Hadoop and TensorFlow.” He’s like, “One person came up to me and said, ‘Ah! Hadoop’s dead. It’s all TensorFlow now.’” I really wanted to take today to talk about the differences between Hadoop and TensorFlow, and just do a level set for all data engineers out there, all big data developers, or people that are just interested in finding out, “Okay, what’s happening in the marketplace?” Today’s question is going to come in around TensorFlow versus Hadoop and find out all the things that we need to know from a data engineering perspective. At the end, we’ll even talk about which one’s going to be around in five years. Find out more right after this.

Welcome back. Today, as promised, we’re going to tackle the question around which is better, what are the differences between TensorFlow and Hadoop, and where does each fit in data analytics, the marketplace, and solving the world’s problems? If you’re watching this channel, and you’re invested in the data analytics community, you know how we feel about it, and we’re passionate about being able to solve problems using data. First thing we’re going to do is break them down, and then at the end, we’re going to talk about some of the differences, where we see the market going, and which one is going to make it in five years. Or, will both? Who knows.

First, what is TensorFlow? We’ve talked about it a good bit on this channel, but TensorFlow is a framework to do deep learning. Deep learning is a subset, a branch of machine learning, but it’s still just about processing data. The really cool thing about TensorFlow, and the reason TensorFlow and similar frameworks in the deep learning realm are so awesome, is that it gives you the portability to run and analyze your data on your local machine or even spread it out in a distributed environment. It comes with a lot of different algorithms and neural networks that you can use and incorporate into solving problems. One of the cool things about deep learning is the ability to actually look at and analyze video data or do voice recognition, right? Or, if you’re going on Instagram or you’re going on YouTube, and you’re looking for examples of deep learning, chances are somebody’s going to build some kind of video or photo identification that will help you identify a cat. That’s the classic example that you’ll see: “Hey, can we detect a cat by feeding in data, and looking, and analyzing this?” TensorFlow doesn’t use Hadoop, but TensorFlow uses big data. You use these large data sets to train your models that can be used on edge devices.

If you’ve ever used a drone, or if you’ve ever used a remote control that uses natural language processing to change the channel, then you’ve used some portion of deep learning or natural language processing. Not saying it’s TensorFlow, but that’s what TensorFlow really does. It’s very popular, developed by Google, open sourced, and housed by Google. There are a lot of free resources out there, and for data scientists and machine learning engineers, it’s a very, very exciting product to be able to build with and start analyzing your data quicker and in a very popular fashion. Couple together the excitement for deep learning and the ease of use of TensorFlow, and that’s why the market has just been hot for TensorFlow and those other frameworks.

What is Hadoop? Hadoop, it’s all about elephants, right? Hadoop has really been around since, I don’t know, we’re probably 12 to 13 years into it being open source. Think back to what we did to analyze data that was coming in from the web, think about being able to index the entire web. That’s kind of where it came from. Google’s work helped inspire it, and Yahoo, and a lot of the other teams from Cloudera and Hortonworks, really helped to push Hadoop into the open source arena. Hadoop is synonymous with big data. You can’t say big data without thinking about Hadoop. Hadoop’s been around for a long time. There are a lot of different components to Hadoop, and even on this channel, whenever we talk about Hadoop, we’re really talking about the ecosystem: the ability to process data, but also the ability to store large amounts of data with HDFS, the Hadoop Distributed File System. There are a lot of components in there. There are APIs, and there are other tools that help you do it, but one of the things that I really like to think about when we talk about Hadoop, and why it was so groundbreaking and really opened the market for big data, is the ability to set up distributed systems and analyze large amounts of data. These large amounts of data would be more on the unstructured side, so think of it not being in a database, but a lot of it would still be text-based. A very popular example is going out there, setting up an API to pull in Twitter data, and doing sentiment analysis over that. Not so much the deep learning. They’re trying to get into the deep learning area right now, but more of machine learning, using algorithms like singular value decomposition or nearest neighbor, but being able to do that over large sets of data. Large sets of data with multiple machines. Hadoop’s been around for a while, and it’s more seen as replacing the enterprise data warehouse.
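To make the tweet example a little more concrete, here is a toy, single-process sketch in plain Python of the map-and-reduce pattern behind that kind of job. It is not Hadoop or Spark code. The word lists and the function names (`score_tweet`, `merge_counts`, `sentiment_summary`) are made up for illustration, and a real job would spread the map step across many machines and read the tweets from something like HDFS.

```python
from collections import Counter
from functools import reduce

# Hypothetical word lists, invented for this illustration only.
POSITIVE = {"love", "great", "happy"}
NEGATIVE = {"hate", "awful", "unhappy"}


def score_tweet(text):
    """Map step: label one tweet by counting positive and negative words."""
    words = [w.strip(".,!?#@").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        label = "happy"
    elif score < 0:
        label = "unhappy"
    else:
        label = "neutral"
    # Emit a partial count so results from many workers can be merged later.
    return Counter({label: 1})


def merge_counts(left, right):
    """Reduce step: combine the partial label counts from two workers."""
    return left + right


def sentiment_summary(tweets):
    """Run the map step over every tweet, then reduce to one set of totals."""
    return dict(reduce(merge_counts, map(score_tweet, tweets), Counter()))


if __name__ == "__main__":
    sample = [
        "I love this cluster!",
        "I hate waiting on slow jobs",
        "Hadoop runs on commodity hardware",
    ]
    print(sentiment_summary(sample))
```

Because each tweet is scored on its own and the partial counts simply add together, the map step can be split across as many machines as you have, which is the part a Hadoop-style system handles for you.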
With TensorFlow now on the scene, where does Hadoop fit in, and what’s going on, and what are some of the differences?

Hadoop was written in Java. TensorFlow was written in C++. Both of them have APIs. Whenever we’re talking about the processing of data, you can do it in Java, you can do it in Python, you can do it in Scala. There are a lot of different options there from a Hadoop perspective. TensorFlow, too. You can see C++. You can also see it in Python. Python’s one of the more popular ones. I actually did a course using TF Learn and TensorFlow to show that.

When we think about the tools, it’s a little bit different. When we think about Hadoop, we’re actually building out a distributed system. Then, we’re using things like maybe Spark. Think of using Spark to analyze that data. We’re going to pull insight from that data, back to our sentiment analysis that’s going to say, “Hey, these specific words in here, when we see them, this tweet is unhappy,” or, “This tweet is happy.” Versus TensorFlow, same thing. It’s more of a processing engine, a framework to pull in and analyze the data, and give you insights on whether that image contained a cat or not a cat. You’re starting to see some of the differences. We talked about Python versus Java. For both of them, there are different APIs that you can start to use. I’ll say right now that I haven’t seen a lot of Java with TensorFlow, but I’m sure somebody has an API or some kind of framework out there that works with it.

Another big difference, too, is the way that the processing is done. The Hadoop ecosystem’s really trying to get into it right now, but from a TensorFlow perspective, we’re really seeing it on GPUs, right? Think of being able to use GPUs to process data 10 to 20 times, a lot faster than what we see on a CPU. Hadoop is more CPU-based. The way that we’re solving problems with Hadoop is we’re throwing a lot of CPUs in a distributed model to process the data and then pull it back in. TensorFlow, same thing, distributed networks. As you start to scale out your data, you really need to distribute those systems, but we’re doing it with GPUs. That’s speeding up the process. Little bit of a difference there, just in the approach, but that’s one of the big key differences.

If we’re a data engineer, and we’re evaluating these, where do they come in? Ease of use: with Hadoop, you’re building out your distributed system. It’s really Java-based, so if you have a Java background, it’s really good, but you can get by without it in some areas. It’s really not so much of a comparison on ease of use, but if we’re talking about just being able to stand something up and start messing around with it, it’s going to be a little bit more complicated and harder to do from a Hadoop perspective. With TensorFlow, you can actually look at an NFS file system. You can feed in data from different file systems, where with Hadoop, you’re building that system out, and also building out a file system. You’re building out distributed systems, and you’re building out disaster recovery and some of the other components. It’s harder to do from a Hadoop perspective, and there’s more expertise involved, because you’re actually building out a whole solution set, versus TensorFlow, which is the processing system that you’re using. If somebody tries to talk to you about that comparison, explain that these are two different systems, right?

When we’re talking about which one we’re using, that’s really what it comes down to. If you’re looking at a project, and somebody says, “Hey! Should we use TensorFlow here, or Hadoop?” it’s going to be pretty easy to spot, I think. If you think of Hadoop, think of something that’s replacing or falling in line with the enterprise data warehouse. What are we doing? Do we have massive amounts of data? It could be structured or semi-structured, but you’re wanting to offload it, and you’re wanting to run huge analytics over it. Then, that’s probably going to be a Hadoop project. We’re probably building out that system when we think of the traditional enterprise data warehouse. That’s the bucket that we’re going to fall in. If we’re talking about doing some sort of artificial intelligence or doing some things with deep learning, maybe not so much in the machine learning area, you’re going to want to look at TensorFlow. Especially, listen for keywords like, hey, what are we doing from the perspective of images, or video, or voice? With any of those media-rich types of data, you’re probably going to use TensorFlow. If you have machine learning engineers and data scientists, and you’re trying to do rich media, TensorFlow’s going to be your really popular one. If you have more data analysts, or even data scientists, but you’re looking at large amounts of data and wanting to marry it together, and you have it in some kind of structure and some kind of standardized system, then Hadoop may be your bucket.

Which one of these is going to be around in five years? I think they’ll both be around. I will say that the popularity of Hadoop will continue to some degree, but it’s more about continuing to replace that enterprise data warehouse. Think of what you do from a traditional perspective in holding all your company’s information. Where we’re seeing more product development, more media-rich things being done from an artificial intelligence perspective, we’ll see more TensorFlow there. Will TensorFlow still be the number one deep learning framework in five years? Will deep learning? I can’t answer that here. Would I learn it if I were just starting out as a data engineer? Yeah, definitely. Definitely from the perspective of, I want to learn how to implement it and how to use it. You don’t have to become an expert. We’re not trying to become a data scientist from that perspective, but start looking at some of the frameworks, go through some of the simple examples that they have, and then get heavy use out of Docker, containers, and that whole world of being able to build those out. That’ll help you if you’re really trying to look into, hey, what could be next for data engineers? Or, what’s going on now? What’s cutting edge from that perspective? I hope you enjoyed this video. Please, if you have any comments on it, or if I missed something, put it in the comments section here below. I’m always happy to carry on the discussion. Until next time, see you again on Big Data Big Questions.

Want More Data Engineering Tips?

Sign up for my newsletter so you never miss a post or YouTube episode of Big Data Big Questions, where I answer questions from the community about data engineering.