AI Archives | simplyblock
https://www.simplyblock.io/blog/tags/ai/

Network Infrastructure for AI | Marc Austin
https://www.simplyblock.io/blog/network-infrastructure-for-ai-marc-austin/ (Tue, 17 Sep 2024)

The post Network Infrastructure for AI | Marc Austin appeared first on simplyblock.

Introduction:

In this episode of the Cloud Frontier Podcast, Marc Austin, CEO and co-founder of Hedgehog, explores network infrastructure for AI. He covers the need for high-performance, cost-effective AI networks similar to AWS, Azure, and Google Cloud. Discover how Hedgehog democratizes AI networking through open-source innovation.

This interview is part of the simplyblock Cloud Frontier Podcast, available on YouTube, Spotify, iTunes/Apple Podcasts, and our show site.

Key Takeaways

What Infrastructure is Needed for AI Workloads?

AI workloads need scalable, high-performance infrastructure, especially for networking and GPUs. Marc explains how hyperscalers like AWS, Azure, and Google Cloud set the standard for AI networks. Hedgehog seeks to match this with open-source networking software. As a result, it enables efficient AI workloads without the high costs of public cloud services.

How does AI Change Cloud Infrastructure Design?

AI drives big changes in cloud infrastructure, particularly through distributed cloud models. AI inference often requires edge computing, deploying models in settings like vehicles or factories. This need spurs the development of flexible infrastructure that operates seamlessly across public, private, and edge clouds.

What is the Role of GPUs in AI Cloud Networks?

GPUs are crucial for AI workloads, especially for training and inference. Marc discusses how Luminar, a leader in autonomous vehicle tech, chose private cloud infrastructure for efficient GPU use. By using private GPUs, they avoided public cloud costs, recovering their investment within six months compared to a 36-month AWS commitment.

EP4: Network Infrastructure for AI | Marc Austin

In addition to highlighting the key takeaways, it’s essential to provide context that enriches the listener’s understanding of the episode. By offering this added layer of information, we ensure that when you tune in, you’ll have a clearer grasp of the nuances behind the discussion. This approach helps shed light on the reasoning and perspective behind the thoughtful questions posed by our host, Rob Pankow. Ultimately, this allows for a more immersive and insightful listening experience.

Key Learnings

How do you Optimize Network Performance for AI Workloads?

Optimizing network performance for AI workloads involves reducing latency and ensuring high bandwidth to avoid bottlenecks in communication between GPUs. Simplyblock enhances performance by offering a multi-attach feature, which allows multiple high-availability (HA) instances to use a single volume, reducing storage demand and improving IOPS performance. This optimization is critical for AI cloud infrastructure, where job completion times are directly impacted by network efficiency.

Simplyblock Insight:

Simplyblock’s approach to optimizing network performance includes intelligent storage tiering and thin provisioning, which help reduce costs while maintaining ultra-low latency. By tiering data between fast NVMe layers and cheaper S3 storage, simplyblock ensures that hot data is readily available while cold data is stored more economically, driving down storage costs by up to 75%.
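To make the tiering idea concrete, here is a minimal sketch of a hot/cold placement policy. The thresholds and tier names are hypothetical, chosen only for illustration; a real tiering engine would be configurable and would typically also weigh access frequency and object size, not just recency.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds -- real tiering policies are configurable
# and usually weigh access frequency and size as well as recency.
HOT_THRESHOLD = timedelta(hours=1)    # accessed recently: keep on NVMe
COLD_THRESHOLD = timedelta(days=7)    # untouched for a week: demote to S3

def choose_tier(last_access: datetime, now: datetime) -> str:
    """Pick a storage tier for a block based on its last access time."""
    age = now - last_access
    if age <= HOT_THRESHOLD:
        return "nvme"       # hot: lowest latency
    if age <= COLD_THRESHOLD:
        return "nvme-warm"  # warm: still on fast media, eligible for demotion
    return "s3"             # cold: cheapest capacity

now = datetime(2024, 9, 1, 12, 0)
print(choose_tier(now - timedelta(minutes=5), now))  # nvme
print(choose_tier(now - timedelta(days=2), now))     # nvme-warm
print(choose_tier(now - timedelta(days=30), now))    # s3
```

The economics follow directly from the policy: only the small, recently accessed working set occupies expensive NVMe capacity, while the long tail sits on object storage.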

What are the Hardware Requirements for AI Cloud Infrastructure?

The hardware requirements for AI cloud infrastructure are primarily centered around GPUs, high-speed networking, and scalable storage solutions. Marc points out that AI workloads, especially for training models, rely heavily on GPU clusters to handle the large datasets involved. Ensuring low-latency connections between these GPUs is crucial to avoid delays in processing.

Simplyblock Insight:

Simplyblock addresses these hardware needs by optimizing storage performance with NVMe-oF (NVMe over Fabrics) architecture, which allows data centers to deploy high-speed, low-latency storage networks. This architecture, combined with storage tiering from NVMe to Amazon S3, ensures that AI workloads can access both fast storage for active data and cost-effective storage for archival data, optimizing resource utilization.

Additional Nugget of Information

Why is Multi-cloud Infrastructure Important for AI Workloads?

Multi-cloud infrastructure provides the flexibility to distribute AI workloads across different cloud environments, reducing reliance on a single provider and enhancing data control. For AI, this allows enterprises to run training tasks in one environment and inference at the edge, across multiple clouds. Multi-cloud strategies also prevent vendor lock-in and enable enterprises to use the best cloud services for specific workloads, enhancing both performance and cost efficiency.

Conclusion

Marc Austin’s journey with Hedgehog reveals a strong commitment to making AI network infrastructure accessible to companies of all sizes. By leveraging open-source software and focusing on distributed cloud strategies, Hedgehog is enabling organizations to run their AI workloads with the same efficiency as hyperscalers — without the excessive costs. With AI infrastructure evolving rapidly, it’s clear that companies will increasingly turn to innovative solutions like Hedgehog to optimize their networks for the future of AI.

Tune in to future Cloud Frontier Podcast episodes for insights on cloud startups, entrepreneurship, and bringing visionary ideas to market. Stay updated with expert insights that can help shape the next generation of cloud infrastructure innovations!

AI Storage: How To Build Scalable Data Infrastructures for AI Workloads?
https://www.simplyblock.io/blog/how-to-build-scale-out-ai-storage-for-ai-workloads/ (Tue, 30 Apr 2024)

The post AI Storage: How To Build Scalable Data Infrastructures for AI workloads? appeared first on simplyblock.

AI workloads bring new requirements to your AI storage infrastructure, marking a significant change compared to the “ML era” of Big Data storage. The average scale of an AI dataset is multiple times higher than ML data sets used in training. This triggers a question of whether the approach to data infrastructure needs to be revisited accordingly and with respect to the massive scale and performance requirements of AI workloads.

In this article, we explore the impact of unstructured data on data volumes. We’ll emphasize the shift from ML to AI. Finally, we underscore the significance of a forward-looking data architecture for businesses aiming to be data-first in the era of AI. Scale-out storage infrastructure plays a key role in this process. We will put that in the context of Intelligent Data Infrastructure (IDI).

Understanding the Scale-Out Approaches in AI Storage

Scale-up approaches keep adding more resources like CPU, RAM, and storage to existing servers. While traditional data centers rely heavily on such scale-up architecture, modern AI workloads demand both vertical and horizontal scaling capabilities. Scale-out architecture, which adds more nodes to distribute the workload, has become crucial for handling massive AI datasets. Organizations therefore need the flexibility both to scale up individual nodes for performance-intensive workloads and to scale out their infrastructure to handle growing data volumes.

Comparing Scale Up vs Scale Out Approaches

Scale-up architecture focuses on increasing the capacity of existing nodes by adding more CPU, memory, and storage resources. This vertical scaling approach offers benefits like:

  • Simpler management of fewer, more powerful nodes
  • Lower network overhead and latency
  • Better performance for single-threaded workloads

In contrast, scale-out architecture distributes workloads across multiple nodes, offering advantages such as:

  • Linear performance scaling by adding nodes
  • Better fault tolerance through redundancy
  • More cost-effective growth using commodity hardware

For AI workloads, organizations often need both approaches—scaling up nodes for compute-intensive tasks like model training and scaling out storage and processing capacity for massive datasets. The key is finding the right balance based on specific workload requirements.

Why AI Storage for Unstructured Data?

One of the defining characteristics of the AI era is the exponential growth of unstructured data. It is estimated that even up to 95% of today’s data is unstructured. That means it is not considered “data” in the context of current data infrastructures. These are images, videos, text documents, social media feeds, and other types of “data” that aren’t used as a base for data-driven decision-making today. AI is changing that with its ability to convert unstructured data into structured data. AI models feed themselves with diverse data types that are invaluable for their training, yet they also pose a significant challenge in terms of storage, processing, and retrieval. All the data that nobody cared about in cold storage as of yesterday is now at the core of data infrastructure today.

Unstructured data, such as images and videos, tends to be larger in size than structured data. This exponential growth in data volumes strains traditional data infrastructure, necessitating more scalable solutions. Unstructured data also comes in a myriad of formats and structures, and managing this complexity becomes critical for organizations aiming to harness the insights buried within unstructured datasets. A data infrastructure's adaptability is indispensable in handling this variety and complexity.

The Scale of Training Data

AI models that leverage unstructured data, especially in tasks like image recognition or natural language processing, require significant computational power. The demand for scalable compute resources becomes paramount, and the ability to dynamically allocate resources between storage and compute is key for efficiency at scale. Distinct from traditional Machine Learning (ML) datasets, these AI-scale datasets, in the realm of image recognition, natural language processing, and complex simulations, reach massive scales and often come with storage requirements in the hundreds of terabytes. Data infrastructure must be tailor-made for such workloads, enabling dynamic resource allocation and efficient management of these vast datasets.

Introducing Intelligent Data Infrastructure (IDI)

Definition of Intelligent Data Infrastructure (IDI): decoupled storage and compute, metadata-driven architecture, API-based connectivity, orchestration and automation, portability, tiering, and containerization
Figure 1: Definition of Intelligent Data Infrastructure (IDI)

Intelligent Data Infrastructure (IDI) is a novel concept that reimagines how organizations handle and utilize their data. At its core, it involves decomposing traditional monolithic data systems into modular components that can be dynamically orchestrated to meet specific requirements. IDI can be built on public or private clouds, on-premises, or hybrid cloud scenarios. This modular, containerized, and fully portable approach enables organizations to build a data infrastructure that is not only scalable but also adaptable to the evolving needs of AI applications and businesses. The Intelligent Data Infrastructure (IDI) requires a few components to deliver its full potential.

Decoupled Storage and Compute

Intelligent Data Infrastructure (IDI) separates storage and compute resources, allowing organizations to scale each independently. This decoupling is particularly beneficial for AI workloads, where computational demands vary significantly. By allocating resources dynamically, organizations can optimize performance and cost-effectiveness.

Metadata-Driven Architecture

A metadata-driven architecture is crucial to Intelligent Data Infrastructures (IDI). Metadata provides essential information about the data, making it easier to discover, understand, and process. In the context of AI, where diverse datasets with varying structures are common, a metadata-driven approach enhances flexibility and facilitates efficient data handling. Storing and accessing large amounts of metadata may require the ability to scale IOPS without limitations to accommodate unpredictable workloads. Today, IOPS limitations are a common problem faced by users of public clouds.
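As a toy illustration of metadata-driven discovery, consider a catalog that describes datasets by attributes so pipelines can find and route data without inspecting the bytes themselves. The dataset names, fields, and locations below are entirely hypothetical:

```python
# A minimal, hypothetical metadata catalog: each entry describes a dataset
# (format, kind, location, size) so tooling can discover data by attribute
# instead of scanning storage.
catalog = {
    "imgs-2024": {"format": "jpeg", "kind": "image", "location": "s3://raw/imgs/", "size_tb": 120},
    "chat-logs": {"format": "jsonl", "kind": "text", "location": "s3://raw/chat/", "size_tb": 4},
    "sensor-ts": {"format": "parquet", "kind": "tabular", "location": "nvme:///hot/sensor/", "size_tb": 9},
}

def find_datasets(**filters):
    """Return dataset ids whose metadata matches all given attributes."""
    return sorted(
        ds_id for ds_id, meta in catalog.items()
        if all(meta.get(k) == v for k, v in filters.items())
    )

print(find_datasets(kind="image"))      # ['imgs-2024']
print(find_datasets(format="parquet"))  # ['sensor-ts']
```

Every lookup here touches only metadata, which is why metadata access patterns (and their IOPS demands) become a scaling concern of their own.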

API-Based Connectivity

Intelligent Data Infrastructure (IDI) relies on APIs (Application Programming Interfaces) for seamless connectivity between different components. This API-centric approach enables interoperability and integration with various tools and platforms. This fosters a collaborative ecosystem for AI development.

Orchestration and Automation

Orchestration and automation are pivotal in Intelligent Data Infrastructure (IDI). By automating tasks such as data ingestion, processing, and model deployment, organizations can streamline their AI workflows and reduce the time-to-value for AI projects. Automation on the storage layer is key to cater to these requirements.

Portability, Tiering, and Containerization

Workload portability has never been better, but data portability (and that of data infrastructures) has yet to catch up. Kubernetes made it easy to orchestrate the movement of workloads; however, due to storage and data gravity, the most common use cases for Kubernetes are still stateless workloads. The shift of stateful workloads into Kubernetes is consistent with the rise of Intelligent Data Infrastructure. Intelligent storage tiering further allows us to build data infrastructures in the most efficient and agnostic way.

Building Storage for the AI Era

Unlike traditional systems, Intelligent Data Infrastructure (IDI) is architected to handle the massive scale of AI datasets. It offers the flexibility both to scale up individual nodes for performance-intensive workloads and to scale out storage across multiple nodes. This ability to scale storage horizontally and vertically, coupled with dynamic resource allocation, ensures optimal performance for AI workloads. Future-proofing data platforms is crucial in the fast-paced AI era. With its modular and adaptable design, Intelligent Data Infrastructure (IDI) enables organizations to stay ahead by easily integrating new technologies and methodologies as they emerge, ensuring longevity and relevance.

As AI becomes a driving force across industries, every business is poised to become a data and AI business. Intelligent Data Infrastructure (IDI) facilitates this transition by providing the flexibility and scalability needed by businesses to leverage data as a strategic asset. The modular nature of Intelligent Data Infrastructure (IDI) empowers organizations to adapt to evolving AI requirements. Whether integrating new data sources or accommodating changes in processing algorithms, a flexible infrastructure ensures agility in the face of dynamic AI landscapes.

Organizations can optimize infrastructure costs by decoupling storage and computing resources and dynamically allocating them as needed. This cost efficiency is particularly valuable in AI, where resource requirements can vary widely depending on the nature of the tasks at hand. While cloud services are becoming commoditized, the edge lies in how businesses build and optimize their data infrastructure. A unique approach to data management, storage, and processing can provide a competitive advantage, making businesses more agile, innovative, and responsive to the demands of the AI era.

How can Organizations adopt Intelligent Data Infrastructure as AI Storage?

In the era of AI, where unstructured data reigns supreme and businesses are transitioning to become data-first, the role of Intelligent Data Infrastructure (IDI) cannot be overstated. It addresses the challenges posed by the sheer volumes of unstructured data and provides a forward-looking foundation for businesses to thrive in AI. As businesses strive to differentiate themselves, a strategic focus on building a unique and scalable data infrastructure will undoubtedly be the key to gaining a competitive edge in the evolving world of artificial intelligence. The first step of adopting IDI in your organization should be identifying bottlenecks and challenges with the current data infrastructure. Some of the questions one should ask are:

  1. Is your current data infrastructure horizontally scalable?
  2. Do you face IOPS limits?
  3. Are you resorting to the use of sub-optimal storage services to save costs? (e.g., using object storage because it’s “cheap”)
  4. Can you scale compute resources without scaling storage, or vice versa?
  5. Based on workload demands, can your infrastructure scale up (add resources to existing nodes) and scale out (add new nodes)?
  6. Can you easily and efficiently migrate data and workloads between clouds and environments?
  7. What is the level of automation in your data infrastructure?
  8. Do you use intelligent data services (such as deduplication and automatic resource balancing) to decrease your organization’s data storage requirements?

Organizations following the traditional approaches to data infrastructure would not be able to answer these questions easily, which by itself is a warning sign that they are far from adopting IDI. As always, awareness of the problem needs to come first. At simplyblock, we help you adopt an intelligent data infrastructure without the burden of re-architecting everything, providing drop-in solutions to boost your data infrastructure with an eye toward the AI era.

Benefits of an Intelligent Data Infrastructure implementation with Simplyblock: minimal access latency, efficient CPU utilization, dynamic K8s deployments, scalability, fault tolerance and erasure coding, copy-on-write snapshots and clones, thin provisioning, compression
Figure 2: Benefits of an Intelligent Data Infrastructure implementation with Simplyblock

How can Simplyblock help to build a Storage System for IDI?

Simplyblock’s high-performance scale-out storage clusters are built upon EC2 instances with local NVMe disks. Our technology uses NVMe over TCP for minimal access latency, high IOPS/GB, and efficient CPU core utilization, surpassing local NVMe disks and Amazon EBS in cost/performance ratio at scale. Ideal for high-performance Kubernetes environments, simplyblock combines the benefits of local-like latency with the scalability and flexibility necessary for dynamic AWS EKS deployments, ensuring optimal performance for I/O-sensitive workloads like databases. Using erasure coding (a better RAID) instead of replicas helps to minimize storage overhead without sacrificing data safety and fault tolerance.
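To see why erasure coding reduces overhead compared to full replicas, here is a toy single-parity sketch (RAID-5-style: k data chunks plus one XOR parity chunk can rebuild any one lost chunk). This is only an illustration of the principle; production erasure-coding schemes such as Reed-Solomon k+m tolerate multiple simultaneous losses.

```python
# Toy XOR parity: k data chunks + 1 parity chunk survive any single loss.
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(chunks):
    parity = chunks[0]
    for c in chunks[1:]:
        parity = xor_bytes(parity, c)
    return parity

def recover(surviving_chunks, parity):
    """Rebuild the single missing chunk from the survivors plus parity."""
    missing = parity
    for c in surviving_chunks:
        missing = xor_bytes(missing, c)
    return missing

data = [b"AAAA", b"BBBB", b"CCCC"]  # 3 data chunks
parity = make_parity(data)          # 1 parity chunk: ~33% overhead,
                                    # versus 200% for 3x replication
print(recover([data[0], data[2]], parity))  # b'BBBB' (lost chunk rebuilt)
```

The same data safety that replication buys with 2-3x extra capacity is achieved here with a fraction of the overhead, which is exactly the storage-efficiency argument made above.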

Additional features such as instant snapshots (full and incremental), copy-on-write clones, thin provisioning, compression, encryption, and many more, make simplyblock the perfect solution that meets your requirements before you set them. Get started using simplyblock right now, or learn more about our feature set.

Coding the Cloud: A Dive into Data Streaming with Gunnar Morling from Decodable (video + interview)
https://www.simplyblock.io/blog/coding-the-cloud-a-dive-into-data-streaming-with-gunnar-morling-video/ (Fri, 26 Apr 2024)

The post Coding the Cloud: A Dive into Data Streaming with Gunnar Morling from Decodable (video + interview) appeared first on simplyblock.

This interview is part of simplyblock's Cloud Commute Podcast, available on YouTube, Spotify, iTunes/Apple Podcasts, Pandora, Samsung Podcasts, and our show site.

In this installment of the podcast, we're joined by Gunnar Morling (X/Twitter) from Decodable, a cloud-native stream processing platform that makes it easier to build real-time applications and services. He highlights the challenges and opportunities in stream processing, as well as the evolving trends in database and cloud technologies.


Chris Engelbert: Hello everyone. Welcome back to the next episode of simplyblock’s Cloud Commute podcast. Today I have a really good guest, and a really good friend with me. We know each other for quite a while. I don’t know, many, many, many years. Another fellow German. And I guess a lot of, at least when you’re in the Java world, you must have heard of him. You must have heard him. Gunnar, welcome. Happy to have you.

Gunnar Morling: Chris, hello, everybody. Thank you so much, family. Super excited. Yes, I don’t know, to be honest, for how long we have known each other. Yes, definitely quite a few years, you know, always running into each other in the Java community.

Chris Engelbert: Right. I think the German Java community is very encapsulated. There’s a good chance, you know, a good chunk of them.

Gunnar Morling: I mean, you would actively have to try and avoid each other, I guess, if you really don’t want to meet somebody.

Chris Engelbert: That is very, very true. So, well, we already heard who you are, but maybe you can give a little bit of a deeper introduction of yourself.

Gunnar Morling: Sure. So, I’m Gunnar. I work as a software engineer right now at a company called Decodable. We are a small startup in the data streaming space, essentially moving and processing your data. And I think we will talk more about what that means. So, that’s my current role. And I have, you know, a bit of a mixed role between engineering and then also doing outreach work, like doing blog posts, podcasts, maybe sometimes, going to conferences, talking about things. So, that’s what I’m currently doing. Before that, I’ve been for exactly up to the day, exactly for 10 years at Red Hat, where I worked on several projects. So, I started working on different projects from the Hibernate umbrella. Yes, it’s still a thing. I still like it. So, I was doing that for roughly five years working on Bean Validation. I was the spec lead for Bean Validation 2.0, for instance, which I think is also how we met or I believe we interacted somehow with in the context of Bean Validation. I remember something there. And then, well, I worked on a project which is called Debezium. It’s a tool and a platform for change data capture. And again, we will dive into that. But I guess that’s what people might know me for. I’m also a Java champion as you are, Chris. And I did this challenge. I need to mention it. I did this kind of viral challenge in the Java space. Some people might also have come across my name in that context.

Chris Engelbert: All right. Let’s get back to the challenge in a moment. Maybe say a couple of words about Decodable.

Gunnar Morling: Yes. So, essentially, we built a SaaS, a software as a service for stream processing. This means, essentially, it connects to all kinds of data systems, let’s say databases like Postgres or MySQL, streaming platforms like Kafka, Apache Pulsar. It takes data from those kinds of systems. And in the simplest case, it just takes this data and puts it into something like Snowflake, like a search index, maybe another database, maybe S3, maybe something like Apache Pino or Clickhouse. So, it’s about data movement in the simplest case, taking data from one place to another. And very importantly, all this happens in real time. So, it’s not batch driven, like, you know, running once per hour, once per day or whatever. But this happens in near real time. So, not in the hard, you know, computer science sense of the word, with a fixed SLA, but with a very low latency, like seconds, typically. But then, you know, going beyond data movement, there’s also what we would call data processing. So, it’s about filtering your data, transforming it, routing it, joining multiple of those real time data streams, doing things like groupings, real time analytics of this data, so you could gain insight into your data. So, this is what we do. It’s based on Apache Flink as a stream processing engine. It’s based on Debezium as a CDC tool. So, this gives you a source connectivity with all kinds of databases. And yeah, people use it for, as I mentioned, for taking data from one place to another, but then also for, I don’t know, doing fraud detection, gaining insight into their purchase orders or customers, you know, all those kinds of things, really.

Chris Engelbert: All right, cool. Let’s talk about your challenge real quick, because you already mentioned stream processing. Before we go on with, like, the other stuff, like, let’s talk about the challenge. What was that about?

Gunnar Morling: What was that about? Yes, this was, to be honest, it was kind of a random thing, which I started over the holidays between, you know, Christmas and New Year’s Eve. So, this had been on my mind for quite some time, doing something like processing one billion rows, because that’s what it was, a one billion row challenge. And this had been on my mind for a while. And I know somehow, then I had this idea, okay, let me just put it out into the community, and let’s make a challenge out of it and essentially ask people, so how fast can you be with Java to process one billion rows of a CSV file, essentially? And the task was, you know, to take temperature measurements, which were given in that file, and aggregate them per weather station. So, the measurements or the rows in this file were essentially always like, you know, a weather station name and then a temperature value. And you had to aggregate them per station, which means you had to get the minimum, the maximum and the mean value per station. So, this was the task. And then it kind of took off. So, like, you know, many people from the community entered this challenge and also like really big names like Aleksey Shipilëv, Cliff Click, Thomas Wuerthinger, the leads of GraalVm at Oracle and many, many others, they started to work on this and they kept working on it for the entire month of January. And like really bringing down those execution times, essentially, in the end, it was like less than two seconds for processing this file, which I had with 13 gigabytes of size on an eight core CPU configuration.
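[Editor's note: the aggregation Gunnar describes, min, mean, and max temperature per station from "station;temperature" rows, can be sketched in a few lines. This plain-Python version shows only the semantics of the task; the actual challenge entries were heavily optimized Java using memory-mapped I/O, custom parsing, and hand-rolled hash tables.]

```python
# Semantic sketch of the One Billion Row Challenge task: aggregate
# min / mean / max per weather station from "station;temperature" lines.
from collections import defaultdict

def aggregate(lines):
    # per station: [min, running sum, max, count]
    stats = defaultdict(lambda: [float("inf"), 0.0, float("-inf"), 0])
    for line in lines:
        station, temp_s = line.split(";")
        t = float(temp_s)
        s = stats[station]
        s[0] = min(s[0], t)
        s[1] += t
        s[2] = max(s[2], t)
        s[3] += 1
    return {
        name: (mn, round(total / n, 1), mx)
        for name, (mn, total, mx, n) in sorted(stats.items())
    }

rows = ["Hamburg;12.0", "Hamburg;8.4", "Bulawayo;25.1", "Hamburg;3.2"]
print(aggregate(rows))
# {'Bulawayo': (25.1, 25.1, 25.1), 'Hamburg': (3.2, 7.9, 12.0)}
```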

Chris Engelbert: I think the important thing is he said less than a second, which is already impressive because a lot of people think Java is slow and everything. Right. We know those terms and those claims.

Gunnar Morling: By the way, I should clarify. So, you know, I mean, this is highly parallelizable, right? So, the less than a second number, I think like 350 milliseconds or so this was an old 32 cores I had in this machine with hyperthreading, with turbo boost. So, this was the best I could get.

Chris Engelbert: But it also included reading those, like 13 gigs, right? And I think that is impressive.

Gunnar Morling: Yes. But again, then reading from memory. So, essentially, I wanted to make sure that disk IO is not part of the equation because it would be super hard to measure for me anyway. So, that’s why I said, okay, I will have everything in a RAM disk. And, you know, so everything comes or came out of memory for that context.

Chris Engelbert: Ok. Got it. But still, it got pretty viral. I’ve seen it from the start and I was kind of blown away by who joined that discussion. It was really cool to look after and to just follow up. I didn’t have time to jump into that myself, but by the numbers and the results I’ve seen, I would have not won anyway. That was me not wasting time.

Gunnar Morling: Absolutely. I mean, people pulled off like really crazy tricks to get there. And by the way, if you’re at JavaLand in a few weeks, I will do a talk about some of those things in Java land.

Chris Engelbert: I think by the time this comes out, it was a few weeks ago. But we’ll see.

Gunnar Morling: Ok. I made the mistake for every recording. I made the temporal reference.

Chris Engelbert: That’s totally fine. I think a lot of the JavaLand talks are now recorded these days and they will show up on YouTube. So when this comes out and the talks are already available, I’ll just put it in the show notes.

Gunnar Morling: Perfect.

Chris Engelbert: All right. So that was the challenge. Let’s get back to Decodable. You mentioned Apache Flink being like the underlying technology build on. So how does that work?

Gunnar Morling: So Apache Flink, essentially, that’s an open source project which concerns itself with real-time data processing. So it’s essentially an engine for processing either bounded or unbounded streams of events. So there’s also a way where you could use it in a batch mode. But this is not what we are too interested in so far. It’s always about unbounded data streams coming from a Kafka topic, so it takes those event streams, it defines semantics on those event streams. Like what’s an event time? What does it mean if an event arrives late or out of order? So you have the building blocks for all those kinds of things. Then you have a stack, a layer of APIs, which allow you to implement stream processing applications. So there’s more imperative APIs, which in particular is called the data streaming API. So there you really program in Java, typically, or Scala, I guess, your flow in an imperative way. Yeah Scala, I don’t know who does it, but that may be some people. And then there’s more and more abstract APIs. So there’s a table API, which essentially gives you like a relational programming paradigm. And finally, there’s Flink SQL, which also is what Decodable employs heavily in the product. So there you reason about your data streams in terms of SQL. So let’s say, you know, you want to take the data from an external system, you would express this as a create table statement, and then this table would be backed by a Kafka topic. And you can do a select then from such a table. And then of course you can do, you know, projections by massaging your select clause. You can do filterings by adding where clauses, you can join multiple streams by well using the join operator and you can do windowed aggregations. So I would feel that’s the most accessible way for doing stream processing, because there’s of course, a large number of people who can implement a SQL, right?
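[Editor's note: the windowed aggregations Gunnar mentions are written declaratively in Flink SQL; the following toy Python sketch only illustrates what a tumbling-window count computes, bucketing events by event time into fixed-size, non-overlapping windows. The window size and event stream are made up for the example.]

```python
# Toy model of a tumbling-window aggregation: events fall into fixed,
# non-overlapping windows by event time; here we count events per window.
from collections import defaultdict

WINDOW_SECONDS = 10  # hypothetical window size

def tumbling_counts(events):
    """events: iterable of (event_time_seconds, payload) pairs.
    Returns a dict mapping each window start to its event count."""
    counts = defaultdict(int)
    for ts, _payload in events:
        window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
        counts[window_start] += 1
    return dict(sorted(counts.items()))

stream = [(1, "a"), (4, "b"), (12, "c"), (19, "d"), (21, "e")]
print(tumbling_counts(stream))  # {0: 2, 10: 2, 20: 1}
```

A real stream processor layers the semantics Gunnar describes on top of this, watermarks to decide when a window is complete and policies for late or out-of-order events, which is exactly what the batch-style sketch above glosses over.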

Chris Engelbert: Right. And I just wanted to say, and it’s all like a SQL dialect, it’s pretty close as far as I’ve seen to the original like standard SQL.

Gunnar Morling: Yes, exactly. And then there are a few extensions, because you need this notion of event time — how do you express how much lateness you would be willing to accept for an aggregation? So there are a few extensions like that, but overall it’s SQL. For my demos, oftentimes I can start on Postgres, develop some queries there, and then just paste them into the Flink SQL client, and they might run as is, or they may need a little bit of adjustment — but it’s pretty much standard SQL.
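The main such extension is the WATERMARK clause, which declares per table how much out-of-orderness Flink should tolerate before considering a point in event time closed. A sketch, with hypothetical column names:

```sql
CREATE TABLE orders (
    order_id   BIGINT,
    order_time TIMESTAMP(3),
    -- accept events that arrive up to 5 seconds late
    WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic'     = 'orders',
    'format'    = 'json'
);
```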

Chris Engelbert: All right, cool. The other thing you mentioned was the Debezium. And I know you, I think you originally started Debezium. Is that true?

Gunnar Morling: It’s not true. No, I did not start it. It was somebody else at Red Hat, Randall Hauch — he’s now at Confluent. But I took over the project quite early on. So Randall started it, and I came in after a few months, I believe. And I think this is when it really took off — I went to many conferences, I spoke about it, and of course others did as well. The team grew at Red Hat. So yeah, I was the lead for quite a few years.

Chris Engelbert: So for the people that don’t know, maybe just give a few words about what Debezium is, what it does, and why it is so cool.

Gunnar Morling: Right. Yes. Oh man, where should I start? In a nutshell, it’s a tool for what’s called change data capture. This means it taps into the transaction log of your database, and then whenever there’s an insert, an update, or a delete, it will capture this event and propagate it to consumers. Essentially, you could think of it like the observer pattern for your database. Whenever there’s a data change — a new customer record gets created, a purchase order gets updated, those kinds of things — you can react, extract this change event from the database, and push it to consumers, either via Kafka, or via APIs in a pull-based way, or via Google Cloud Pub/Sub, Kinesis, all those kinds of things. And then you can take those events, and that enables a ton of use cases. In the simplest case, it’s just about replication: taking data from your operational database to your cloud data warehouse, or to your search index, or maybe to a cache. But then people also use change data capture for things like microservices data exchange — because you want microservices to be self-contained, but still, they need to exchange data, right? They don’t exist in isolation, and change data capture can help with that, in particular with what’s called the outbox pattern. As a side note, people use it for splitting up monolithic systems into microservices. You can use this change event stream as an audit log — if you think about it, if you just keep all those events, all the updates to a purchase order, and put them into a database, it’s kind of like an audit log already; maybe you want to enrich it with a bit of metadata. You can do streaming queries — say you want to spot specific patterns in your data as it changes and then trigger some sort of alert. That’s another use case, and there are many, many more. Really, it’s a super versatile tool, I would say.
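For illustration, a Debezium change event for a newly inserted customer row looks roughly like this. The field values are made up; what matters is the envelope shape — `before`, `after`, `source`, and `op` (`c` for create, `u` for update, `d` for delete):

```json
{
  "before": null,
  "after": {
    "id": 1001,
    "first_name": "Anne",
    "email": "anne@example.com"
  },
  "source": {
    "connector": "postgresql",
    "db": "inventory",
    "table": "customers"
  },
  "op": "c",
  "ts_ms": 1713123456789
}
```

Because `before` is `null` and `op` is `c`, a consumer knows this event represents a brand-new row rather than an update.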

Chris Engelbert: Yeah, and I also have a couple of talks in that area. I think my favorite example — something that everyone understands — is that you have an order coming in, and now you want to send out invoices. The invoice doesn’t need to be sent in the same operation, but you want to make sure you only send it out if the order was actually persisted in the database. That is where the outbox pattern comes in — or just looking at the order table in general and filtering out all the new orders.
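The outbox pattern Chris describes can be sketched as a single transaction that writes both the business row and an event row; Debezium then captures inserts on the outbox table and pushes them to consumers. Table and column names here are illustrative (this variant assumes Postgres, hence `gen_random_uuid()`):

```sql
BEGIN;

INSERT INTO purchase_orders (id, customer_id, total)
VALUES (123, 42, 99.90);

-- The event row is part of the same transaction, so it becomes
-- visible to downstream consumers only if the order itself commits
INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload)
VALUES (gen_random_uuid(), 'order', '123', 'OrderCreated',
        '{"orderId": 123, "total": 99.90}');

COMMIT;
```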

Gunnar Morling: Yes.

Chris Engelbert: So yeah, absolutely a great tool. Love it. It supports many, many databases. Any idea how many so far?

Gunnar Morling: It keeps growing. Certainly ten or so, or more. The interesting thing is, there is no standardized way to implement something like Debezium. Each database has its own APIs and formats, its own way of extracting those change events, which means there needs to be a dedicated Debezium connector for each database we want to support. The core team added support for MySQL, Postgres, SQL Server, Oracle, Cassandra, MongoDB, and so on. But then what happened is that other companies and organizations picked up the Debezium framework as well. For instance, something like Google Cloud Spanner is now also supported via Debezium, because the team at Google decided they wanted to expose change events based on the Debezium event format and infrastructure. Or ScyllaDB: they maintain their own CDC connector, but it’s based on Debezium. And the nice thing about that is that it gives you, as a user, one unified change event format. You don’t have to care which particular source database it is — does it come from Cloud Spanner, or does it come from Postgres? You can process those events in a unified way, which I think is just great to see: it has established itself as a sort of de facto standard, I would say.

Chris Engelbert: Yeah, I think that is important. That is a very, very good point. Debezium basically defined a JSON and, I think, Avro standard.

Gunnar Morling: Right. So I mean, you know, it defines the, let’s say, the semantic structure, like, you know, what are the fields, what are the types, how are they organized, and then how you serialize it as Avro, JSON, or protocol buffers. That’s essentially like a pluggable concern.

Chris Engelbert: Right. So we said earlier, Decodable is a cloud platform. So, to put it in simple terms, you have Apache Flink on steroids, ready to use, plus a couple of things on top of that. So maybe talk a little bit about that.

Gunnar Morling: Right. So yes, that’s the underlying tech, I would say. And then, of course, if you want to put those things into production, there are so many things you need to consider. How do you go about developing and versioning those SQL statements? If you iterate on a statement, you want to have a preview, to get a feeling for it, or maybe just validation. So we have this whole editing experience with preview. Then maybe you don’t want all the users in your organization to be able to access all the streaming pipelines you have, so you want something like role-based access control. You want managed connectors. You want automatic provisioning and sizing of your infrastructure — you don’t want to think too much about “hey, do I need to keep five machines sitting around for this dataflow? And what happens if I don’t need them? Do I need to remove them and then scale back up again?” So all this auto-scaling and auto-provisioning is something we do. We primarily let you use SQL to define your queries, but we also let you run your own custom Flink jobs, if that’s something you want to do. And we are very close to Python support via PyFlink — by the time this is released, it should be live already — and many, many more things. So really, it’s a managed experience for those dataflows.

Chris Engelbert: Right. That makes a lot of sense. So let me see. From a user’s perspective, I’m mostly working with SQL. I’m writing my jobs, I’m deploying those. Those jobs are everything from simple ETL — extract, transform… what’s the L again?

Gunnar Morling: Load.

Chris Engelbert: There you go. Nobody needs to load data. They just magically appear. But you can also do data enrichment. You said that earlier. You can do joins. Right. So is there anything I have to be aware of that is very complicated compared to just using a standard database?

Gunnar Morling: Yeah. I mean, I think this entire notion of event time definitely is something which can be challenging. Let’s say you want to do some sort of windowed analysis — how many purchase orders do I have per category and hour, this kind of thing. Now, depending on the source of your data, those events might arrive out of order. It might be that your hour has closed, but then, five minutes later, because some event was stuck in some queue, you still get an event for that past hour. And now there’s a tradeoff: how accurate do you want your data to be — essentially, how long do you want to wait for those late events — versus, what is your latency? Do you want to get out this updated count right at the top of the hour, or can you afford to wait those five minutes? So there’s a bit of a tradeoff. This entire complex of event time is certainly something where people often need at least some time to learn and grasp the concepts.
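The hourly-count example Gunnar mentions would look roughly like this in Flink SQL, using a tumbling-window table function over an event-time column. The names are hypothetical, and the table’s WATERMARK definition on `order_time` is what decides how long Flink waits for late events before emitting a window’s result:

```sql
SELECT category,
       window_start,
       window_end,
       COUNT(*) AS order_count
FROM TABLE(
    TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '1' HOUR))
GROUP BY category, window_start, window_end;
```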

Chris Engelbert: Yeah, that’s a very good one. In a previous episode, we had a discussion about connected cars, and connected cars may or may not have an internet connection all the time. So you get super, super late events sometimes. All right — because we’re almost running out of time…

Gunnar Morling: Wow. Ok.

Chris Engelbert: Yeah. 20 minutes is like nothing. What is the biggest trend you see right now in terms of database, in terms of cloud, in terms of whatever you like?

Gunnar Morling: Right. I mean, that’s a tough one. Well, I guess there can only be one answer, right? It has to be AI. I know it’s boring — well, the trend is not boring, but saying it is kind of boring. The way I could see this impact things like what we do: it could help you with scaling, of course — we could make intelligent predictions about your workload; maybe we can take a look at the data and sense, okay, it might make sense to scale out some more compute already, because we know with a certain likelihood that it will be needed very shortly. And then, of course, it could help you with authoring those flows — with all those LLMs, it might be doable to give you some sort of guided experience there. So that’s a big trend for sure. Then another one I would see is more technical: I feel like there’s a unification happening of systems and categories of systems. Right now we have databases here and stream processing engines there, and I feel those things might come more closely together. You would have real-time streaming capabilities in something like Postgres itself, and who knows, maybe you would expose Postgres as a Kafka broker, in a sense. So I could see some closer integration of those different kinds of tools.

Chris Engelbert: That is interesting, because I also think there is a general movement here — I mean, in the past we had the idea of moving to different, very specialized databases. And now all of the big databases — Oracle, Postgres, well, even MySQL — are starting to integrate all of those multi-model features, with Postgres at the forefront, having this super extensibility. So yeah, that will be interesting.

Gunnar Morling: Right. I mean, it always goes in cycles, I feel. First there’s this trend towards decomposition — it gives you all those good building blocks, which you can then put together to create a more cohesive, integrated experience. And then, I guess, in five years we’ll want to tear it apart again and let people integrate everything themselves.

Chris Engelbert: In 5 to 10 years, we’ll have the next iteration of microservices. We called it SOA, we called it SOAP, whatever — now we call it microservices. Who knows what we will call it in the future. All right, thank you very much. That was a good chat. As always, I love talking to you.

Gunnar Morling: Yeah, thank you so much for having me. This was great. I enjoyed the conversation. Let’s talk soon.

Chris Engelbert: Absolutely. And for everyone else, come back next week. A new episode, a new guest. And thank you very much. See you.

The post Coding the Cloud: A Dive into Data Streaming with Gunnar Morling from Decodable (video + interview) appeared first on simplyblock.
