Big Data Big Questions: Big Data Kappa Architecture Explained

Learning how to develop streaming architectures can be tricky. In Big Data, the Kappa Architecture has become a popular streaming architecture because of the growing need to analyze streaming data. For the past few years the Lambda Architecture has been king, but in the past year the Big Data community has seen a shift toward the Kappa Architecture.

What is the Kappa Architecture? How can you implement the Kappa Architecture in your environment? Watch this video and find out!

Transcript

(Forgive any errors; the video was transcribed by a machine.)

Hi folks, Thomas Henson here with thomashenson.com, and this is another episode of Big Data Big Questions. Today we’re going to tackle the Kappa architecture and explain how we can use it in Big Data and why it’s so popular right now. Find out more right after this.

[Music]

So in a previous episode we talked about the Lambda architecture and how the Lambda architecture is kind of the standard that we’ve seen in big data before we had Spark and streaming and Flink and all those processing engines that work in big data to do streaming. You can find that video right here. Oh, check it out, we’re in the same shirt. Pretty cool.

After you watch that video, we need to talk about the Kappa architecture. The reason we’re going to talk about Kappa is that it’s based on, and has really morphed from, the Lambda architecture. When we talked about the Lambda architecture, we talked about how we had a dualistic framework: you have your speed layer and your batch or MapReduce layer, the more transactional one. So you have two layers. You’re still moving your data into HDFS, and you’re still putting your data into a queue.

With the Kappa architecture, what we’re trying to do, and where the industry is going, is to not have to support two different frameworks. Anytime you’re supporting two versions of code or two different layers of code, it’s more complicated. You need more developers, and it’s just more risk. Look at the 80/20 rule: you’re always going to have probably 20% of bugs causing 80% of your problems, so why manage two different layers? What we’re starting to see is that we’re moving all our data into one system where we can interact with it through an API, whether we’re running a Flink job or some kind of distributed search, maybe using Solr or Elasticsearch. We want to collapse all of that down into one framework.

Okay, that sounds pretty simple, but it’s not really implemented like we think. One of the big tips, and one thing I want you to pay attention to, is this: when you’re talking about the Kappa architecture, you’re saying, okay, I’m going to have this one layer that everything interacts with, and I want to run all my jobs through it, whether I’m running through Spark or through Flink. That’s how we’re going to process this data. What you want to make sure is that you’re not just using Kafka or some kind of message queue, where you’re still pulling through your APIs and still running your streaming jobs from there, but you’re also still taking that data, moving it into HDFS, and running some processing there.

Really, what we want to see with the Kappa architecture is where we’re taking our data into whatever our queuing system is. You can check out pravega.io, and there’s some information there about that architecture layer. Your source data comes in, and you want your data to exist in a kind of queuing system, but then you also want that to feed your app. You don’t want your APIs writing directly to HDFS, because then you’re just writing to two different systems as well. So you want something to abstract away all that storage. Whether your data comes in and it’s more archival, sitting in HDFS or in some kind of object-based storage, or it’s the streaming applications and you’re trying to pull that data off as fast as you can, you only want to interact with that one system. That’s what we mean when we talk about Kappa, and that’s what Kappa really is intended to be.

So remember, you want to abstract away that storage layer behind your queuing system, where you’re only dealing with APIs. You want to be pulling your Spark jobs, your Flink jobs, and your distributed search through one pipeline, not through two different pipelines where you’re breaking up your speed layer and your batch or transactional layer.

So that’s the Kappa architecture explained. Make sure you subscribe so you never miss an episode; you definitely want to keep up with what’s going on in Big Data. If you have any questions, submit those Big Data Big Questions in the comments below, send me an email, or go to the Big Data Big Questions section on my blog. Thanks again, and I’ll see you next time.
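
To make the "one pipeline" idea from the transcript a little more concrete, here is a minimal Python sketch. It is not from the video and does not use Kafka, Pravega, Spark, or Flink; the `EventLog` class, the `totals_by_key` function, and the event shape are all hypothetical names made up for illustration. The assumption is that a single append-only log hides whether data sits in a fast streaming tier or in archival storage, and that one processing function serves both a full replay and a read of recent data.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[str, int]  # (key, value); hypothetical event shape


class EventLog:
    """Toy append-only log that hides hot vs. archival storage behind one API."""

    def __init__(self, hot_capacity: int = 3) -> None:
        self._archive: List[Event] = []  # stands in for HDFS or object storage
        self._hot: List[Event] = []      # stands in for the streaming tier
        self._hot_capacity = hot_capacity

    def append(self, event: Event) -> int:
        """Append an event and return its offset in the log."""
        self._hot.append(event)
        # Move the oldest events to the archive; readers never see this move.
        while len(self._hot) > self._hot_capacity:
            self._archive.append(self._hot.pop(0))
        return len(self._archive) + len(self._hot) - 1

    def read(self, from_offset: int = 0) -> Iterator[Event]:
        """Yield events in order from from_offset, regardless of tier."""
        combined = self._archive + self._hot
        yield from combined[from_offset:]


def totals_by_key(events: Iterable[Event]) -> Counter:
    """The single processing code path, used for replay and recent data alike."""
    totals: Counter = Counter()
    for key, value in events:
        totals[key] += value
    return totals


log = EventLog(hot_capacity=2)
for event in [("clicks", 1), ("views", 5), ("clicks", 2), ("views", 1)]:
    log.append(event)

full_history = totals_by_key(log.read(0))  # batch-style replay from the start
recent_only = totals_by_key(log.read(2))   # streaming-style read of newer events
```

The point of the sketch is that `totals_by_key` is written once. In a Lambda-style setup, the replay and the recent-data read would typically be two separate codebases against two separate stores.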