Big Data Big Questions: Kappa Architecture for Real-Time

Should I Use Kappa Architecture For Real-Time Analytics?

Analytics architectures are challenging to design. If you follow the latest trends in Big Data, you’ll see a lot of different architecture patterns to choose from.

Architects have a fear of choosing the wrong pattern. It’s what keeps them up at night.

What architecture should be used for designing a real-time analytics application? Should I use the Kappa Architecture for real-time analytics? Watch this video and find out!

Video

Transcript

Hi, I’m Thomas Henson, with thomashenson.com. And today is another episode of Big Data, Big Questions. Today’s question is all about the Kappa architecture and real-time analytics. So, our question today came in from a user, and it’s going to be about how we can tackle the Kappa architecture, and is it a good fit for those real-time analytics, for sensor networks, how it all kind of works together. Find out more, right after this.

So, today’s question came in from Francisco. And it’s Francisco from Chile, and he says, “Best regards from Chile.” So, Francisco, thanks for your question, and thanks for watching. So, his question is, “Hi, I’m building a system for processing sensor network data in near real-time. All this time, I’ve been studying the Lambda architecture in order to achieve this. But now, I’ve run into the Kappa architecture, and I’m having trouble deciding between the two.” He says, what he wants to do is, he wants to analyze this data in near real-time. So, as the data is coming from the sensors, he wants to obtain knowledge, and then push that out in some kind of a UI. So, some kind of charts and graphs, and he’s asking, do we have any suggestions about which one of these architectures we would recommend for him? Well, thanks again, Francisco, for your question. And so, yes, I have some thoughts about how we should set up that network. But let’s review, real quick, what we’ve talked about in previous videos about the Lambda architecture, and what the Kappa architecture is, and then how we’re going to implement those.

So, if you remember, the Lambda architecture – we have two different streams. And so, we have a batch-level stream and we have a real-time stream. So, as your data comes in, it might come in through a queueing system, such as Kafka, or it’s in an area right there, where we’re just using it to queue all the data as it comes in. And so, for that real-time, you will follow that real-time stream, and so, you might use Spark, or Flink, or some kind of real-time processing engine that’s going to do the analytics, and push that out to some of your dashboards for data that’s just as it’s coming in, right? So, as soon as that data comes in, you want to analyze it as quick as you can – it’s what we call near real-time, right?

But, you also have your batch layer. So, for your batch processing, for your storing of the data, right? Because, at some point, your queueing system, whether it’s Kafka or something else, it’s going to get very, very large, and some of that data’s going to be old, and you don’t need to have it in an area where you can stream it out and analyze it all the time. So, you want it to be able to tier, or you want to move that data off to, maybe, HDFS, or S3 object storage. And so, from there, you can use your distributed search, you can have it in HDFS, use Cassandra, or some other kind of… maybe it’s HBase, or some kind of NoSQL database that’s working on top of Hadoop. And then, you also can run your batch jobs there. So, you can run your MapReduce jobs there, whether it’s traditional MapReduce, or whether it’s Spark’s batch-level processing. But, you have two layers.
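To make the two layers concrete, here is a minimal Python sketch of the Lambda idea. It is only an illustration under assumptions: the sensor readings, the `batch_view` function, the `SpeedLayer` class, and the `serve_average` function are all hypothetical names, and plain in-memory structures stand in for Kafka, HDFS, and a processing engine. The point it shows is that the same average ends up being implemented twice, once per layer.

```python
from collections import defaultdict

# Hypothetical sensor readings are (sensor_id, value) tuples.

def batch_view(archived_readings):
    """Batch layer: recompute (total, count) per sensor over all archived data."""
    totals = defaultdict(lambda: [0.0, 0])
    for sensor_id, value in archived_readings:
        totals[sensor_id][0] += value
        totals[sensor_id][1] += 1
    return {sensor: (total, count) for sensor, (total, count) in totals.items()}

class SpeedLayer:
    """Speed layer: separately written incremental logic for recent events."""
    def __init__(self):
        self.state = defaultdict(lambda: [0.0, 0])

    def on_event(self, sensor_id, value):
        self.state[sensor_id][0] += value
        self.state[sensor_id][1] += 1

def serve_average(sensor_id, batch, speed):
    """Serving step: merge the batch view with the speed layer's recent state."""
    batch_total, batch_count = batch.get(sensor_id, (0.0, 0))
    speed_total, speed_count = speed.state.get(sensor_id, (0.0, 0))
    count = batch_count + speed_count
    return (batch_total + speed_total) / count if count else None

# Usage: older data went through the batch path, one new event through the speed path.
batch = batch_view([("s1", 10.0), ("s1", 20.0)])
speed = SpeedLayer()
speed.on_event("s1", 30.0)
merged_average = serve_average("s1", batch, speed)
```

Notice that the averaging logic lives in two places, which is exactly the maintenance cost of running two layers.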

And so, that’s one of the challenges with the Lambda architecture – you have these two different layers, right? So, you’re supporting two levels of code, and for a lot of your processing, a lot of your data that’s coming in, maybe you’re just using the real-time there, but maybe the batch processing is used every month. But, you’re still having to support those two different levels of code.

And so, that’s why we talk about the Kappa architecture, right? So, the Kappa architecture, it simplifies it. So, as your data comes in, you want to have your data in one queueing system, or one storage device – where your data comes in, you can do your analytics on it, so you can do your real-time processing, and push that data out to your dashboards, your web applications, or however you’re trying to consume that data. Then, also do your distributed search, as well. So, if you’re using ElasticSearch, or some other kind of distributed search, maybe it’s Solr, or some of the other ones, you can analyze that data and have it supporting that real-time search, as well.

But, you might use Spark, and Flink, for your real-time analytics, but you also want to do your batch, too. So, you’re going to have some batch processing that’s going to be done. But, instead of creating a whole ‘nother tier, you want to be able to do that within that queueing system that you have. And so, whether you’re using Kafka, or whether you’re using Pravega, which is a new open-source product that just was released by Dell, you want to be able to have all that data in one spot, so that when you’re queueing that data, you know that it’s going to be there. But, you can also do your analytics on it. So, you can use it for your distributed search, you can use it for those streaming analytics jobs, but also, whenever you go back to do some of your batch, or some of your transactional processing, you know that it’s in that same location, too. That way, there’s not as much redundancy, right? So, you’re not having to store data in multiple locations, and it’s taking up more room than you really need.
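Here is a small Python sketch of that single-location idea. It is not how Kafka or Pravega are actually used; the `EventLog` class and the `max_per_sensor` function are hypothetical, in-memory stand-ins chosen only to show the pattern: one append-only log, one piece of processing code, and "batch" done by replaying the same log from the start.

```python
class EventLog:
    """A tiny in-memory stand-in for a durable, append-only log."""
    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the event just written

    def read_from(self, offset=0):
        return iter(self._events[offset:])

def max_per_sensor(events, state=None):
    """The single processing function, used for both live and replayed data."""
    state = {} if state is None else state
    for sensor_id, value in events:
        if value > state.get(sensor_id, float("-inf")):
            state[sensor_id] = value
    return state

log = EventLog()
for reading in [("s1", 20.5), ("s2", 18.0), ("s1", 23.1)]:
    log.append(reading)

# Live path: process what is in the log, then continue from the next offset.
live_state = max_per_sensor(log.read_from(0))
next_offset = 3
log.append(("s2", 19.4))
live_state = max_per_sensor(log.read_from(next_offset), live_state)

# "Batch" path: replay the whole log through the same function.
reprocessed = max_per_sensor(log.read_from(0))
```

Because both paths run the same function over the same log, the live result and the reprocessed result agree, and there is only one codebase to support.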

And so, this is what we call the Kappa architecture, and this is why it’s so popular right now: it simplifies that workstream. And so, when we start deciding between those two, back to Francisco’s question – Francisco, your application – it seems like it has a real need for real-time, right? So, there’s a lot of things that are going on there from the network, and a lot of traffic that’s coming in.

And so, this is going to be where we break down a couple of different concepts. And so, we talk about bounded and unbounded data. So, a bounded dataset is data where we know how much data is going to come in, right? Or, we wait, and do the processing on that data after it’s already come in. And so, when you think of bounded data, think of sales orders, think of inventory numbers. And so, that’s largely what we would consider transactional data, so we know all the data as it’s coming in, and then we’re running the calculation then.

But, what your data is, is unbounded. And so, when we talk about unbounded data, you don’t know how much data is coming in, right? And it’s infinite. It’s not going to end. So, with network traffic, you don’t know how long that’s going to be going on. So, the network traffic’s going to continue to come in, you don’t know… You might get one terabyte at one point, you might get ten terabytes, you might scale up all in one second. And then, as the data comes in, it might come in uneven, right? So, you might have some that’s timestamped a little bit earlier than other data that’s coming in, too. And so, that’s what we call unbounded data.
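One common way to handle data that never ends and arrives unevenly is to group events by their own timestamp instead of by arrival order. The Python sketch below is an illustration under assumptions, not something from the video: the `windowed_counts` generator and the sample events are hypothetical, and it leaves out concerns a real engine handles, such as watermarks and expiring old windows.

```python
from collections import defaultdict

def windowed_counts(events, window_seconds):
    """Count events per fixed event-time window, emitting an update per event.

    `events` can be any iterable, including one that never ends, because
    results are yielded incrementally rather than returned at the end.
    """
    counts = defaultdict(int)
    for event_time, _payload in events:
        # Assign the event to a window by its own timestamp, so a late
        # arrival still lands in the window it belongs to.
        window_start = (event_time // window_seconds) * window_seconds
        counts[window_start] += 1
        yield window_start, counts[window_start]

# The third event arrives last but carries an earlier timestamp.
sample_events = [(0, "a"), (65, "b"), (30, "c")]
updates = list(windowed_counts(sample_events, window_seconds=60))
```

With 60-second windows, the late event with timestamp 30 updates the first window rather than the most recent one.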

And so, for unbounded data, the Kappa architecture works really well. It also works really well for bounded data, too. So, when we start to look at that, and looking at your project, my recommendation is to use the Kappa architecture – go ahead and use it because you’re using real-time data. But then, for those batch levels, and I’m sure that you’ll start having some processing and some pieces that you start doing that are batch – you can also consume that in the Kappa architecture, as well. And so, there are some things you can look into, so, you can choose streaming analytics, with Spark Streaming, you can look at Flink, Beam – those are some of the applications you can use. But, you can also use distributed search, so you can use Solr, you can use ElasticSearch – all those are going to work well, whether you choose the Kappa architecture, or whether you choose the Lambda architecture. My recommendation is, go with the Kappa architecture.

Well, thanks guys, that’s another episode of Big Data, Big Questions. Make sure you subscribe, so that you never miss an episode. If you have any questions, have your question answered on Big Data, Big Questions – just go to the website, put your comments below, reach out to me on Twitter, however you want. Submit those questions, and have me answer those questions here on Big Data, Big Questions. Thanks again, guys.