Machine Learning driven Database Optimization with Luigi Nardi from DBtune (interview)
https://www.simplyblock.io/blog/machine-learning-driven-database-optimization-with-luigi-nardi-from-dbtune-video/

Introduction

This interview is part of the simplyblock Cloud Commute Podcast, available on Youtube, Spotify, iTunes/Apple Podcasts, and our show site.

In this insightful episode of the Cloud Commute podcast, we explore the cutting-edge field of machine learning-driven database optimization with Luigi Nardi, founder and CEO of DBtune.

Key Takeaways

Q: Can machine learning improve database performance? Yes, machine learning can significantly improve database performance. DBtune uses machine learning algorithms to automate the tuning of database parameters, such as CPU, RAM, and disk usage. This not only enhances the efficiency of query execution but also reduces the need for manual intervention, allowing database administrators to focus on more critical tasks. The result is a more responsive and cost-effective database system.

Q: How do machine learning models predict query performance in databases? DBtune employs probabilistic models to predict query performance. These models analyze various metrics, such as CPU usage, memory allocation, and disk activity, to forecast how queries will perform under different conditions. The system then provides recommendations to optimize these parameters, ensuring that the database operates at peak efficiency. This predictive capability is crucial for maintaining performance in dynamic environments.

Q: What are the main challenges in integrating AI-driven optimization with legacy database systems? Integrating AI-driven optimization into legacy systems presents several challenges. Compatibility issues are a primary concern, as older systems may not easily support modern optimization techniques. Additionally, there’s the need to gather sufficient data to train machine learning models effectively. Luigi also mentions the importance of addressing security concerns, especially when sensitive data is involved, and ensuring that the integration process does not disrupt existing workflows.

Q: Can you provide examples of successful AI-driven query optimization in real-world applications? DBtune has successfully applied its technology across various database systems, including Postgres, MySQL, and SAP HANA. For instance, in a project with a major telecom company, DBtune’s optimization algorithms reduced query execution times by up to 80%, leading to significant cost savings and improved system responsiveness. These real-world applications demonstrate the practical benefits of AI-driven query optimization in diverse environments.

In addition to highlighting the key takeaways, it’s essential to provide deeper context and insights that enrich the listener’s understanding of the episode. By offering this added layer of information, we ensure that when you tune in, you’ll have a clearer grasp of the nuances behind the discussion. This approach enhances your engagement with the content and helps shed light on the reasoning and perspective behind the thoughtful questions posed by our host, Chris Engelbert. Ultimately, this allows for a more immersive and insightful listening experience.

Key Learnings

Q: Can machine learning be used for optimization?

Yes, machine learning can be highly effective in optimizing complex systems by analyzing large datasets and identifying patterns that might not be apparent through traditional methods. It can automatically adjust system configurations, predict resource needs, and streamline operations to enhance performance.

simplyblock Insight: While simplyblock does not directly use machine learning for optimization, it provides advanced infrastructure solutions that are designed to integrate seamlessly with AI-driven tools. This allows organizations to leverage machine learning capabilities within a robust and flexible environment, ensuring that their optimization processes are supported by reliable and scalable infrastructure.

Q: How does AI-driven query optimization improve database performance?

AI-driven query optimization improves database performance by analyzing system metrics in real-time and adjusting configurations to enhance data processing speed and efficiency. This leads to faster query execution and better resource utilization.

simplyblock Insight: simplyblock’s platform enhances database performance through efficient storage management and high availability features. By ensuring that storage is optimized and consistently available, simplyblock allows databases to maintain high performance levels, even as AI-driven processes place increasing demands on the system.

Q: What are the main challenges in integrating AI-driven optimization with legacy database systems?

Integrating AI-driven optimization with legacy systems can be challenging due to compatibility issues, the complexity of existing configurations, and the risk of disrupting current operations.

simplyblock Insight: simplyblock addresses these challenges by offering flexible deployment options that are compatible with legacy systems. Whether through hyper-converged or disaggregated setups, simplyblock enables seamless integration with existing infrastructure, minimizing the risk of disruption and ensuring that AI-driven optimizations can be effectively implemented.

Q: What is the relationship between machine learning and databases?

The relationship between machine learning and databases is integral, as machine learning algorithms rely on large datasets stored in databases to train and improve, while databases benefit from machine learning’s ability to optimize their performance and efficiency.

simplyblock Insight: simplyblock enhances this relationship by providing a scalable and reliable infrastructure that supports large datasets and high-performance demands. This allows databases to efficiently manage the data required for machine learning, ensuring that the training and inference processes are both fast and reliable.

Additional Nugget of Information

Q: How is the rise of vector databases impacting the future of machine learning and databases? The rise of vector databases is revolutionizing how large language models and AI systems operate by enabling more efficient storage and retrieval of vector embeddings. These databases, such as pgvector for Postgres, are becoming essential as AI applications demand more from traditional databases. The trend indicates a future where databases are increasingly specialized to handle the unique demands of AI, which could lead to even greater integration between machine learning and database management systems. This development is likely to play a crucial role in the ongoing evolution of both AI and database technologies.

Conclusion

Luigi Nardi showcases how machine learning is transforming database optimization. As DBtune’s founder, he highlights the power of AI to boost performance, cut costs, and enhance sustainability in database management. The discussion also touches on emerging trends like vector databases and DBaaS, making it a must-listen for anyone keen on the future of database technology. Stay tuned for more videos on cutting-edge technologies and their applications.

Full Episode Transcript

Chris Engelbert: Hello, everyone. Welcome back to this week’s episode of simplyblock’s Cloud Commute podcast. This week I have Luigi with me. Luigi, obviously, from Italy. I don’t think he has anything to do with Super Mario, but he can tell us about that himself. So welcome, Luigi. Sorry for the really bad joke.

Luigi Nardi: Glad to be here, Chris.

Chris Engelbert: So maybe you start with introducing yourself. Who are you? We already know where you’re from, but I’m not sure if you’re actually residing in Italy. So maybe just tell us a little bit about you.

Luigi Nardi: Sure. Yes, I’m originally Italian. I left the country to explore and study abroad a little while ago. So in 2006, I moved to France and studied there for a little while. I spent almost seven years in total in France eventually. I did my PhD program there in Paris and worked in a company as a software engineer as well. Then I moved to the UK for a few years, did a postdoc at Imperial College London in downtown London, and then moved to the US. So I lived in California, Palo Alto more precisely, for a few years. Then in 2019, I came back to Europe and established my residency in Sweden.

Chris Engelbert: Right. Okay. So you’re in Sweden right now.

Luigi Nardi: That’s correct.

Chris Engelbert: Oh, nice. Nice. How’s the weather? Is it still cold?

Luigi Nardi: It’s great. Everybody thinks that Sweden has very bad weather, but Sweden is a very, very long country. So if you reside in the south, actually, the weather is pretty decent. It doesn’t snow very much.

Chris Engelbert: That is very true. I actually love Stockholm, a very beautiful city. All right. One thing you haven’t mentioned: you’re actually the founder and CEO of DBtune. So you left out the best part, I guess. Maybe tell us a little bit about DBtune now.

Luigi Nardi: Sure. DBtune is a company that is about four years old now. It’s a spinoff from Stanford University and the commercialization of about a decade of research and development in academia. We were working on the intersection between machine learning and computer systems, specifically the use of machine learning to optimize computer systems. This is an area that in around 2018 or 2019 received a new name, which is MLSys, machine learning and systems. This new area is quite prominent these days, and you can do very beautiful things with the combination of these two pieces. DBtune is specifically focusing on using machine learning to optimize computer systems — databases, or more specifically, database management systems. The idea is that you can automate the process of tuning databases. We are focusing on the optimization of the parameters of the database management systems, the parameters that govern the runtime system. This means the way the disk, the RAM, and the CPU interact with each other. You take the von Neumann model and try to make it as efficient as possible through optimizing the parameters that govern that interaction. By doing that, you automate the process, which means that database engineers and database administrators can focus on other tasks that are equally important or even more important. At the same time, you get great performance, and you can reduce your cloud costs as well. If you’re running in the cloud in an efficient way, you can optimize the cloud costs. Additionally, you get a check on your greenops, meaning the sustainability aspect of it. So this is one of the examples I really like of how you can be an engineer and provide quite a big contribution in terms of sustainability as well, because you can connect these two things by making your software run more efficiently and then scaling down your operations.

Chris Engelbert: That is true. And it’s, yeah, I’ve never thought about that, but sure. I mean, if I get my queries to run more efficiently and use less compute time and compute power, huh, that is actually a good thing. Now I’m feeling much better.

Luigi Nardi: I’m feeling much better too. Since we started talking a little bit more about this, we have a blog post that will be released pretty soon about this very specific topic. This connection between making software run efficiently and the downstream effects of that efficiency — both on your infrastructure cost and on the efficiency of your operations — is often underestimated, I would say.

Chris Engelbert: Yeah, that’s fair. It would be nice if you, when it’s published, just send me the link and I’ll put it into the show notes, because I think that will be really interesting to a lot of people — as you said, specifically for developers that would otherwise have a hard time contributing anything in terms of sustainability. You mentioned database systems, but I think DBtune specifically is focused on Postgres, isn’t it?

Luigi Nardi: Right. Today we are focusing on Postgres. As a proof of concept, though, we have applied similar technology to five different database management systems, including relational and non-relational systems as well. A little while ago, we wanted to show that this technology can be used across the board. And so we played around with MySQL, with FoundationDB, which is the system behind iCloud, for example, and many of the VMware products. And then we have RocksDB, which is behind your Instagram and Facebook and so on — Facebook is really pushing that open-source storage system. And things like SAP HANA as well; we’ve been focusing on that a little bit too, just as a proof of concept to show that basically the same methodology can apply to very different database management systems in general.

Chris Engelbert: Right. You want to look into Oracle and take a chunk of their money, I guess. But you’re on the right track with SAP HANA. It’s kind of on the same level. So how does that work? I think you have to have some kind of an agent inside of your database. For Postgres, you’re probably using the stats tables, but I guess you’re doing more, right?

Luigi Nardi: Right. This is the idea of, you know, observability and monitoring companies. They mainly focus on gathering all these metrics from the machine and then getting you a very nice visualization on your dashboard. As a user, you would look at these metrics and how they evolve over time, and then they help you guide the next step, which is some sort of manual optimization of your system. We are moving one step forward and trying to use those metrics automatically instead of just giving them back to the user. So we move from a passive monitoring approach to an active approach, where the metrics are collected and then the algorithm helps you automatically change the configuration of the system in a way that it gets faster over time. And so the metrics that we look at — well, the algorithm itself will gather a number of metrics to help it improve over time. These metrics are related to, you know, your system usage — CPU, memory, and disk usage — and other things, for example, latency and throughput from your Postgres database management system. So using things like pg_stat_statements, for example, for people that are a little more familiar with Postgres. And by design, we refrain from looking inside your tables or looking specifically at your metadata or your queries, for example. We refrain from that because it’s easier to, you know, deploy our system in a way that it’s not dangerous for your data and for your privacy concerns and things like that.
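
For readers who want to see what this kind of statement-level telemetry looks like in practice, here is a minimal Python sketch. It is not DBtune’s actual agent — the connection string, the row limit, and the column selection are illustrative assumptions — but it shows the sort of non-sensitive, aggregate metrics pg_stat_statements exposes.

```python
# Minimal sketch: sampling statement-level telemetry from pg_stat_statements.
# Assumes PostgreSQL 13+ (the timing columns were renamed from total_time /
# mean_time to total_exec_time / mean_exec_time in v13) and that the
# pg_stat_statements extension is enabled.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")  # illustrative DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT queryid, calls, total_exec_time, mean_exec_time, rows
        FROM pg_stat_statements
        ORDER BY total_exec_time DESC
        LIMIT 10
    """)
    for queryid, calls, total_ms, mean_ms, rows in cur.fetchall():
        # Only aggregate timing numbers are read -- never table contents.
        print(f"query {queryid}: {calls} calls, {mean_ms:.2f} ms avg")
```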

Chris Engelbert: Right. Okay. And then you send that to a cloud instance that visualizes the data — just the simple stuff — but there’s also machine learning that actually looks at all the collected data and, I guess, tries to find patterns. And how does that work? I mean, you probably have a version of the query parser, the Postgres query parser, in the backend to actually make sense of this information, see what the execution plan would be. That is just me guessing. I don’t want to spoil your product.

Luigi Nardi: No, that’s okay. So the agent is open source and it gets installed in your environment. Anyone fluent in Python can read it in probably 20 minutes. So it’s pretty small — it’s not massive. That’s what gets connected with our backend system, which is running in our cloud. And the two things connect and communicate back and forth. The agent reports the metrics and requests the next recommendation from the optimizer that runs in our backend. The optimizer responds with a recommendation, which is then enabled in the system through the agent. And then the agent also starts to measure what’s going on on the machine before reporting these metrics back to the backend. So this is a feedback loop, and the optimizer gets better and better at predicting what’s going on on the other side. This is based on machine learning technology, and specifically probabilistic models, which I think is the interesting part here. By using probabilistic models, the system is able to predict the performance for a new guess, but also predict the uncertainty around that estimate. And that’s, I think, very powerful: to be able to combine some sort of prediction with how confident you are with respect to that prediction. Those things are important because when you’re optimizing a computer system, of course, you’re running this in production, and you want to make sure that this stays safe for the system that is running. You’re changing the system in real time. So you want to make sure that these things are done in a safe way. And these models are built in a way that they can take into account all these unpredictable things that may otherwise trip up an engineered system.
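
To make the feedback loop Luigi describes concrete, below is a heavily simplified sketch of a probabilistic tuning loop. It uses a Gaussian process (via scikit-learn) to predict performance and its uncertainty for candidate configurations. The parameter ranges, the toy throughput function, and the acquisition rule are all invented for illustration — this is the general shape of such an optimizer, not DBtune’s actual algorithm.

```python
# Sketch of a probabilistic tuning loop (not DBtune's implementation):
# a Gaussian process predicts throughput and its uncertainty for candidate
# configurations, and the next recommendation balances both.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def measure_throughput(config):
    # Stand-in for the agent measuring the live system; here a toy function
    # whose optimum is at shared_buffers=4 GB, work_mem=64 MB.
    shared_buffers_gb, work_mem_mb = config
    return -(shared_buffers_gb - 4.0) ** 2 - (work_mem_mb - 64.0) ** 2 / 100

rng = np.random.default_rng(0)
X = rng.uniform([1, 4], [16, 256], size=(5, 2))   # initial random configs
y = np.array([measure_throughput(x) for x in X])

gp = GaussianProcessRegressor(normalize_y=True)
for _ in range(20):                               # feedback-loop iterations
    gp.fit(X, y)
    candidates = rng.uniform([1, 4], [16, 256], size=(200, 2))
    mean, std = gp.predict(candidates, return_std=True)
    nxt = candidates[np.argmax(mean + std)]       # explore where uncertain
    X = np.vstack([X, nxt])
    y = np.append(y, measure_throughput(nxt))

print("best config found:", X[np.argmax(y)])
```

A real system would add the safety constraints Luigi mentions, for example rejecting candidates whose predicted downside risk is too high rather than purely maximizing expected gain.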

Chris Engelbert: Right. And you mentioned earlier that you’re looking at the pg_stat_statements table, can’t come up with the name right now. But that means you’re not looking at the actual data. So the data is secure and it’s not going to be sent to your backend, which I think could be a valid fear from a lot of people like, okay, what is actually being sent, right?

Luigi Nardi: Exactly. So Chris, when we talk with large telcos and big banks, the first thing that they ask is: what are you doing with my data? So you need to sit down and meet their infosec teams and explain to them that we’re not transferring any of that data. It’s literally just telemetry. And that telemetry usually is not sensitive in terms of privacy and so on. So usually there is a meeting that happens with their infosec teams, especially for big banks and telcos, where you clarify what is being sent, and then they look at the source code, because the agent is open source. So you can look at the source and just verify that nothing sensitive is being sent to the internet.

Chris Engelbert: Right.

Luigi Nardi: And perhaps to add one more element there. For the most conservative of our clients, we also provide a way to deploy this technology in a completely offline manner. So while everybody is, of course, excited about digital transformations and moving to the cloud and so on, we actually went kind of backwards and provided a way of deploying this by shipping standalone software that runs in your environment and doesn’t communicate to the internet at all. So we have that as an option as well for our users. That one is a little harder for us to support, because we don’t have direct access to it anymore. So it’s easier for us to deploy the cloud-based version. But in some cases, there is not much you can do — some environments simply won’t allow you to go through the internet. There are companies that don’t buy Salesforce for that reason. And if you don’t buy Salesforce, you’re probably not buying from anybody else on the planet. So for those scenarios, that’s what we do.

Chris Engelbert: Right. So how does it work afterwards? So the machine learning looks into the data, tries to find patterns, has some optimization or some … Is it only queries, or does it also give me recommendations on how to optimize the Postgres configuration itself? And how does it present those? I guess they’re going to be shown in the UI.

Luigi Nardi: So we’re specifically focusing on that aspect, the optimization of the configuration of Postgres. That’s our focus. And so, if you’re familiar with Postgres, things like shared buffers — the buffer that contains a copy of the table data from disk and keeps it as a local copy in RAM. That data is useful to keep warm in RAM, because when you interact with the CPU, you don’t need to go all the way back to disk. And if you do go all the way back to disk, there is an order of magnitude more delay and latency and slowdown because of that. So you try to keep the data close to where it’s processed — trying to keep the data in cache as much as possible, and shared buffers are a form of cache where the cache in this case is a piece of RAM. So sizing the shared buffers, for example, is important for performance. And then there are a number of other things similar to that, but slightly different. For example, in Postgres there is an allocation of a buffer for each query. Each query has a buffer which can be used as operating memory for the query to be processed. So if you’re doing some sort of sorting, for example, in the query, that small piece of memory is used. And you want to keep that memory close to the CPU, and specifically the work_mem parameter, for example, is what helps with that specific thing. So we optimize all these things in a way that the flow of data from disk to the registers of the CPU is very, very smooth and optimized. We optimize the locality of the data — both spatial and temporal locality, if you want to use the technical terms for that.
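
As a concrete illustration of the two parameters Luigi mentions, here is a small Python sketch that inspects and adjusts them by hand. The values shown are purely illustrative — reasonable settings depend on machine size and workload, which is exactly the argument for automating this.

```python
# Sketch: inspecting and changing shared_buffers and work_mem manually.
# ALTER SYSTEM cannot run inside a transaction block, hence autocommit.
# The values below are illustrative placeholders, not recommendations.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")  # illustrative DSN
conn.autocommit = True
cur = conn.cursor()

for param in ("shared_buffers", "work_mem"):
    cur.execute(f"SHOW {param}")
    print(param, "=", cur.fetchone()[0])

cur.execute("ALTER SYSTEM SET work_mem = '64MB'")  # per-operation sort memory
cur.execute("SELECT pg_reload_conf()")             # applies reloadable params

# Note: shared_buffers changes only take effect after a full server
# restart, which is one reason tuning it is done more carefully.
cur.execute("ALTER SYSTEM SET shared_buffers = '4GB'")
```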

Chris Engelbert: Right. Okay. So it doesn’t help me specifically with my stupid queries. I still have to find a consultant to fix that or find somebody else in the team.

Luigi Nardi: Yeah, for now, that’s correct. We will probably focus on that in the future. But for now, the way you usually optimize your queries is that you optimize your queries, and then if you want to see the actual benefit, you should also optimize your parameters. And if you want to do it really well, you optimize your queries, then you optimize your parameters, then you go back and optimize your queries again, then the parameters, and kind of converge in this process. So now that one of the two is fully automated, you can focus on the queries and, you know, speed up the process of optimizing the queries by a large margin. In terms of benefits: of course, if you rewrite your queries, you can get two or three orders of magnitude of performance improvement, which is really, really great. If you optimize the configuration of your system, you can get an order of magnitude in terms of performance improvement. And that’s still very, very significant. Despite what many people say, it’s possible to get an order of magnitude improvement in performance if your system’s baseline is fairly basic, let’s say. And the interesting fact is that, by the nature of Postgres, for example, the default configuration needs to be pretty conservative, because Postgres needs to be able to run on big server machines but also on smaller machines. So the form factor needs to be taken into account when you define the default configuration of Postgres, and by that fact, it needs to be pretty conservative. And what you can observe out there is that this problem is so complex that people don’t really change the default configuration of Postgres when they run on a much bigger instance. So there is a lot of performance improvement that can be obtained by changing that configuration to a better-suited one. And the point of doing this through automation and through things like DBtune is that you can then refine the configuration of your system specifically for the use case that you have — your application, your workload, the machine size. All these things are considered together to give you the best outcome for your use case, which is, I think, the novelty of this approach, right? Because if you’re doing this through some sort of heuristics, they usually don’t really cover all these different things, and they will always be inferior with respect to what you can do with an observability loop, right?

Chris Engelbert: Yeah, and I think you mentioned that a lot of people don’t touch the configuration. I think the problem is that the Postgres configuration is very complex. A lot of parameters depend on each other. And, I mean, I’m coming from a Java background, and we have the same thing with garbage collectors. Optimizing a garbage collector — for every single algorithm you have like 20 or 30 parameters, all of them depend on each other, and changing one may completely disrupt all the other ones. And I think that is what a lot of people shy away from. And then you Google it, and there’s the big Postgres community telling you, “No, you really don’t want to change that parameter until you really know what you’re doing” — and you don’t, so you leave it alone. So in this case, I think something like DBtune will be, or rather is, absolutely amazing.

Luigi Nardi: Exactly. And, you know, if you spend some time on blog posts learning about the Postgres parameters, you get that type of feedback, and it takes a lot of time to learn it in a way that you can feel confident and comfortable changing your production system, especially if you’re working in a big corporation. And the idea here is that at DBtune we are partnered with leading Postgres experts as well. Magnus Hagander, for example, who is president of the Postgres Europe organization, has been doing this manual tuning for about two decades, and we worked very closely with him to be able to really do this in a very safe manner. You can basically trust our system to be doing the right thing because it’s engineered in a way that incorporates a lot of domain expertise. So it’s not just machine learning; it’s also the specific Postgres domain expertise that you need to do this well and safely.

Chris Engelbert: Oh, cool. All right. We’re almost out of time. Last question: what do you think is the next big thing in Postgres and databases, in cloud, in database tuning?

Luigi Nardi: That’s a huge question. So we’ve seen all sorts of things happening recently with, of course, AI stuff, but, you know, I think it’s too easy to talk about that once more; I think you guys have covered those topics a lot. I think what’s interesting is that a lot has been done to support those types of models — for example, the rise of vector databases, which I think is quite interesting. The vector extension for Postgres, pgvector, was around for a little while, but in the last year you really saw huge adoption, driven by all the large language models that use these vector embeddings. That’s a trend we’ll see for a little while, I think. For example, our lead investor 42CAP recently invested in another company that does this type of thing as well — Qdrant, for example. And there are a number of companies that focus on that: Milvus, Chroma, Zilliz, you know, and pg_vectorize as well by the Tembo friends. So this is certainly a trend that will stay, and for a fairly long time. In terms of database systems, I am personally very excited about the huge shift left that is happening in the industry. Shift left meaning all the database-as-a-service offerings, you know, from Azure Flexible Server to Amazon RDS and Google Cloud SQL — those are the big ones, but there are a number of other companies doing the same, and there are very interesting ideas, things that are really, you know, shaping that whole area. I can mention a few, for example Tembo, even EnterpriseDB, and so on — there’s so much going on in that space, and in some sense DBtune is really going in that specific direction, right? Helping to automate more and more of what you need to do when you’re operating a database. From a machine learning perspective — and then I will stop, Chris, I think we’re running out of time — I’m really interested in something we’ve been studying for a few years now in my academic team, with my PhD students: pushing the boundaries of what we can do in terms of using machine learning for computer systems, specifically computer systems that have hundreds, if not thousands, of parameters and variables to be optimized at the same time, jointly. We have recently published a few pieces of work on that specific topic that you can find on my Google Scholar. They’re a little math-y, you know, maybe a little hard to read in parts, but it’s quite rewarding to see these new pieces of technology becoming available to practitioners and people who work on applications as well. So perhaps the attention will move away at some point from just LLMs to other areas in machine learning and AI that are equally interesting, in my opinion.

Chris Engelbert: Perfect. That’s beautiful. Just send me the link; I’m happy to put it into the show notes. I bet there are quite a few people that would be really, really into reading those things. I’m not big on mathematics — that’s probably way over my head — but that’s fine. Yeah, that was a pleasure. Thank you for being here. And I hope we see each other somewhere at a Postgres conference; we briefly talked about that before the recording started. So yeah, thank you for being here. And for the audience: you’ll hear me next week with the next episode. Thank you for being here as well.

Luigi Nardi: Awesome. For the audience: we will be at the Postgres Switzerland conference as sponsors, and we will be giving talks there. So if you come by, feel free to say hi and we can grab a coffee together. Thank you very much.

Chris Engelbert: Perfect. Yes. Thank you. Bye bye.

How I designed PostgreSQL High Availability with Shaun Thomas from Tembo (video + interview)
https://www.simplyblock.io/blog/how-i-designed-postgresql-high-availability-with-shaun-thomas-from-tembo-video/

This interview is part of simplyblock’s Cloud Commute Podcast, available on Youtube, Spotify, iTunes/Apple Podcasts, Pandora, Samsung Podcasts, and our show site.

In this installment, we’re talking to Shaun Thomas (Twitter/X, personal blog), affectionately known as “Mr. High Availability” in the Postgres community, to discuss his journey from a standard DBA to a leading expert in high availability solutions for Postgres databases. Shaun shares his experiences working in financial services, where he redefined high availability using tools like Pacemaker and DRBD, and the path that led him to authoring a comprehensive book on the subject. Shaun also talks about his current work at Tembo, an organization dedicated to advancing open-source Postgres, and their innovative approaches to high availability, including the use of Kubernetes and containerized deployments.

EP17 - How I designed PostgreSQL High Availability with Shaun Thomas from Tembo

Chris Engelbert: Hello, welcome back to this week’s episode of simplyblock’s Cloud Commute podcast. This week I have – no, I’m not saying that. I’m not saying I have another incredible guest, even though I have. He’s already shaking his head. Nah, I’m not incredible. He’s just known as Mr. High Availability in the Postgres space for a very specific reason. I bet he’ll talk about that in a second.

So hello, Shaun. Shaun Thomas, thank you for being here. And maybe just introduce yourself real quick. Who are you? Well, where are you from? How did you become Mr. High Availability?

Shaun Thomas: Yeah, so glad to be here. Kind of hang out with you. We talked a little bit. It’s kind of fun. My background is I was just a standard DBA, kind of working on programming stuff at a company I was at and our DBA quit, so I kind of had to pick it up to make sure we kept going. And that was back in the Oracle days. So I just kind of read a bunch of Oracle books to kind of get ready for it. And then they had some layoffs, so our whole division got cut. And then my next job was as a DBA. And I just kind of latched onto it from there.

And as far as how I got into high availability and where I kind of made that my calling card was around 2010, I started working for a company that was in financial services. And they had to keep their systems online at all times because every second they were down, they were losing millions of dollars.

So they actually already had a high availability stack, but it was using a bunch of proprietary tools. So when I started working there, I basically reworked everything. We ended up using the standard stack at the time, which was Pacemaker with Corosync and DRBD — the Distributed Replicated Block Device — because we didn’t really trust replication back then; it was still too new.

We were also running Enterprise DB at the time, so there were a bunch of beta features they had kind of pushed into 9.2 at the time, I think. Because of that whole process and not really having any kind of guide to follow, since there were not a lot of high availability tools back in 2010, 2011, I basically wrote up our stack and the process I used. I presented it at the second Postgres Open that was in Chicago. I did a live demo of the entire stack, and that video is probably online somewhere. My slides, I think, are also on the Postgres Wiki. But after that, I was approached by Packt, the publisher. They wanted me to write a book on it. So I did. I did it mainly because I didn’t have a book to follow. Somebody else in this position really needs to have some kind of series or a book or some kind of step-by-step thing because high availability in Postgres is really important. You don’t want your database to go down in a lot of situations. Until there’s a lot more tools out there to cover your bases, being able to do it is important. Now there’s tons of tools for it, so it’s not a big problem. But back then, man, oof.

Chris Engelbert: Yeah, yeah. I mean, you just mentioned Pacemaker. I’m not sure when I heard that thing the last time. Is that even still a thing?

Shaun Thomas: There’s still a couple of companies using it. Yeah, you would be surprised. I think DFW does in a couple of spots.

Chris Engelbert: All right. I haven’t heard about that in at least a decade, I think. Everything I’ve worked with had different– or let’s say other tools, not different tools. Wow. Yeah, cool. So you wrote that book. And you said you came from an Oracle world, right? So how did the transition to Postgres happen? Was that a choice?

Shaun Thomas: For me, it wasn’t really much of a transition because, like I said, our DBA quit at the company I was at. And it was right before a bunch of layoffs that took out that entire division. But at the time, I was like, ooh, Oracle. I should learn all this stuff. So the company just had a bunch of old training materials lying around. And there were like three or four of the huge Oracle books lying around. So I spent the next three or four weeks just reading all of them back to back.

I was testing in a cluster that we had available, and I set the local version up on my computer just to see if it worked and to learn all the stuff I was trying to understand at the time. But then the layoffs hit, so I was like, what do I do now?

I got another job at a company that needed a DBA. And that was MySQL and Postgres. But that was back when Postgres was still 6.5 — back when it crashed if you looked at it funny. So I got kind of mad at it, and I basically stopped using it from, sorry, 2001 to 2005. In 2005, I switched to a company that was all Postgres. So I got the purple Postgres book — the one that everyone used back then, I think it was 8.1 or 8.2. And then I revised their entire stack too, because they were having problems with vacuum. Back then, the settings were all wrong, so you would end up bloating yourself out of your disk space. I ended up vacuuming their systems down from, I think, 20 gigs to like 5. And back then, that was a lot of disk space.

Chris Engelbert: I was just about to say that in 2005, 20 gigabytes of disk space was a lot.

Shaun Thomas: But back then, the problem with vacuum was you actually had to set the size of the free space map. And the default was way too small. So what would happen is vacuum would only keep track of a limited number of reusable rows — by default, just the first 200,000.

So if you had more than that, even if you were vacuuming constantly, it would still bloat like a little bit every day until your whole disk was used. So I actually had to clean all that up or their system was going to crash. They were days away from going down when I joined. They had already added all the disks they could. And back then, you couldn’t just add virtual disk space.
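
The free space map sizing problem is long gone (the FSM has been managed automatically since Postgres 8.4), but bloat from dead tuples is still worth watching. As a hedged illustration of how you would spot it today, here is a small Python query against the standard statistics views — the 20% threshold is an arbitrary assumption for the example.

```python
# Sketch: spotting tables where dead tuples pile up despite vacuuming.
# Uses pg_stat_user_tables; the 20% threshold is an illustrative choice.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")  # illustrative DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
        FROM pg_stat_user_tables
        WHERE n_dead_tup > 0.2 * GREATEST(n_live_tup, 1)
        ORDER BY n_dead_tup DESC
    """)
    for name, live, dead, last_vac in cur.fetchall():
        print(f"{name}: {dead} dead vs {live} live tuples "
              f"(last autovacuum: {last_vac})")
```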

Chris Engelbert: I know those situations, not in the Postgres or database space, but in the software development space where– same thing, I literally joined days before it all would fall apart. Let’s say those are not the best days to join.

Shaun Thomas: Hey, that’s why they hired you, right?

Chris Engelbert: Exactly. All right. So let’s talk a little bit about these days. Right now, you’re with Tembo. And you just have this very nice blog post that blew up on Hacker News for all the wrong reasons.

Shaun Thomas: Well, I mean, we created it for all the right reasons. And so let me just start on Tembo a little bit. So Tembo is like — they are all in on Postgres. We are ridiculously all in. Basically, everything we do is all open sourced. You can go to Tembo.io on GitHub, and basically our entire stack is there. And we even just released our on-prem. So you can actually use our stack on your local system and basically have a Kubernetes cloud management thing for all the clusters you want to manage. And it’ll just be our stack of tools. And the main calling card of Tembo is probably our– if you go to trunk, I think it’s called pgt.dev. We just keep track of a bunch of extensions. And it’s got a command line tool to install them, kind of like a PGXN. And we’re so into this that we actually hired the guy who basically maintained PGXN, David Wheeler. Because we were like, we need to kind of hit the extension drum. And we’re very glad he’s re-standardizing PGXN 2. He’s starting a whole initiative. And he’s got a lot of buy-in from tons of different committers and devs and people who are really pushing it. Maybe we’ll create the gold standard of extension networks. Because the idea is to get it all so that it’s packaged, right? Kind of like a Debian or an RPM or whatever package system you want to use. It’ll just install the package on your Postgres wherever it is — whether it’s a source install, a package install, or something on your Mac, whatever.

So he’s working on that really. And he’s done some demos that are very impressive. And it looks like it’ll actually be a great advancement. But Tembo is – it’s all about open source Postgres. And our tools kind of show that. Like if you’ve ever heard of Adam Hendel, he goes by Chuck. But if you heard of PGMQ or PG Vectorize, which kind of makes PG Vector a little easier to use, those tools are all coming from us, basically. So we’re putting our money where our mouth is, right?

All right, that’s why I joined them. Because I kept seeing them pop up on Twitter. And I’m like, man, these guys really– they’re really dedicated to this whole thing.

Chris Engelbert: Yeah, cool. So back to PG and high availability. Why would I need that? I mean, I know. But maybe just give the audience a little bit of a clue.

Shaun Thomas: So high availability– and I kind of implied this when I was talking about the financial company, right? The whole idea is to make sure Postgres never goes down. But there’s so much more to it. I’ve done conferences. And I’ve done webinars. And I’ve done trainings. And I’ve done the book. Just covering that topic is it’s essentially an infinite font of just all the different ways you can do it, all the different prerequisites you need to fulfill, all the different things you need to set up to make it work properly. But the whole point is keep your Postgres up. But you also have to define what that means. Where do you put your Postgres instances? Where do you put your replicas? How do you get to them? Do you need an intermediate abstraction layer so that you can connect to that? And it’ll kind of decide where to send you afterwards so you don’t have any outages as far as routing is concerned?

It’s a very deep topic. And it’s easy to get wrong. And a lot of the tools out there, they don’t necessarily get it wrong. But they expect the user to get it right. One of the reasons my book did so well in certain circles is because if you want to set up EFM or repmgr or Patroni or some other tool, you have to follow very closely and know how the tool works extremely well. You have to be very familiar with the documentation. You can’t just follow step by step and then expect it to work in a lot of cases.

Now, there’s a lot of edge cases you have to account for. You have to know why and the theories behind the high availability and how it works a certain way to really deploy it properly.

So even as a consultant when I was working at EDB and a second quadrant, it’s easy to give a stack to a customer and they can implement it with your recommendations. And you can even set it up for them. There’s always some kind of edge case that you didn’t think of.

So the issue with Postgres, in kind of my opinion, is it gives you a lot of tools to build it yourself, but it expects you to build it yourself. And even the other stack tools, like I had mentioned earlier — repmgr or EFM or Patroni, or pg_auto_failover, another one that came out recently — they work, but you’ve got to install them. And you really do need access to an expert who can come in if something goes wrong. Because if something goes wrong, you’re kind of on your own in a lot of ways.

Postgres doesn’t really have an inherent integral way of managing itself as a cluster. It’s more of like a database that just happens to be able to talk to other nodes to keep them up to date with sync and whatnot. So it’s important, but it’s also hard to do right.

Chris Engelbert: I think you mentioned one important thing. It is important to upfront define your goals. How much uptime do you really need? Because one thing that not only with Postgres, but in general, whenever we talk about failure tolerance systems, high availability, all those kinds of things, what a lot of people seem to forget is that high availability or fault tolerance is a trade-off between how much time and money do I invest and how much money do I lose if something really, well, you could say, s***t hits the fan, right?

Shaun Thomas: Exactly. And that’s the thing. Companies like the financial company I worked at, they took high availability to a fault. They had two systems in their main data center and two more in their disaster recovery data center, all fully synced and up to date. They maintained daily backups on local systems, with copies sent to another system locally holding seven days’ worth. Additionally, backups were sent to tape, which was then sent to Glacier for seven years as per SEC rules.

So, someone could come into our systems and maliciously erase everything, and we’d be back up in an hour. It was very resilient — a result of our design and the amount of money we dedicated toward it, because that was a very expensive deployment. That’s at least 10 servers right there.

Chris Engelbert: But then, when you say you could be back up in an hour, the question is, how much money do you lose in that hour?

Shaun Thomas: Well, like I said, that scenario is like someone walking in and literally smashing all the servers. We’d have to rebuild everything from scratch. In most cases, we’d be up – and this is where your RTO and RPO come in, the recovery time objective and your recovery point objective. Basically, how much do you want to spend to say I want to be down for one minute or less? Or if I am down for that one minute, how much data will I lose? Because the amount of money you spend or the amount of resources you dedicate toward that thing will determine the end result of how much data you might lose or how much money you’ll need to spend to ensure you’re down for less than a minute.
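
To put rough numbers on the RTO trade-off Shaun describes, here is a toy back-of-the-envelope calculation in Python. Every figure is invented for illustration; the point is simply that the tolerable downtime budget, not the technology, should drive how much you spend.

```python
# Toy RTO cost model: compare yearly HA spend against expected outage losses.
# Every number below is an invented assumption for illustration.
revenue_per_minute = 10_000          # dollars lost per minute of downtime
outages_per_year = 4                 # expected incident count

options = {
    "single node":      {"ha_cost": 0,       "rto_minutes": 240},
    "warm standby":     {"ha_cost": 50_000,  "rto_minutes": 15},
    "multi-DC cluster": {"ha_cost": 400_000, "rto_minutes": 1},
}

for name, opt in options.items():
    downtime_loss = outages_per_year * opt["rto_minutes"] * revenue_per_minute
    total = opt["ha_cost"] + downtime_loss
    print(f"{name:>16}: HA spend ${opt['ha_cost']:,} + "
          f"expected loss ${downtime_loss:,} = ${total:,}/year")
```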

Chris Engelbert: Exactly, that kind of thing. I think that becomes more important in the cloud age. So perfect bridge to cloud, Postgres and cloud, perfect. You said setting up HA is complicated because you have to install the tools, you have to configure them. These days, when you go and deploy Postgres on something like Kubernetes, you would have an operator claiming at least doing all the magic for you. What is your opinion on the magic?

Shaun Thomas: Yeah, so my opinion on that has evolved a lot. Back when I first started seeing containerized systems like Docker and that kind of thing, my opinion was, I don’t know if I’d run a production system in a container, right? Because it just seemed a little shady. But that was 10 years ago or more. Now that Kubernetes tools and that kind of thing have matured a lot, what you get out of this is a level of automation that just is not possible using pretty much anything else. And I think what really sold it to me was – so you may have heard of Gabriele Bartolini. He basically heads up the team that writes and maintains Cloud Native Postgres, the CloudNativePG operator. We’ll talk about operators probably a bit later. But the point of that was back when — this was at 2ndQuadrant, before they were bought by EDB — we were selling our BDR tool for bi-directional replication for Postgres, right? So multi-master. And we needed a way to put that in a cloud service for obvious purposes, so we could sell it to customers. And that meant we needed an operator. Well, before Cloud Native Postgres existed, there was the BDR operator that we were circulating internally for customers.

And one day while we were in Italy—because every employee who worked at 2ndQuadrant got sent to Italy for a couple of weeks to get oriented with the team, that kind of thing. During that time when I was there in 2020, I think I was there for February, for the first two weeks of February. He demoed that, and it kind of blew me away. We were using other tools to deploy containers. And it was basically Ansible to automate the deployment with Terraform. And then you kind of set everything up and then deploy everything. It takes minutes to set up all the packages and get everything deployed and reconfigure everything. Then you have to wait for syncs and whatnot to make sure everything’s proper.

On someone’s laptop, they set up a Kubernetes-in-Docker deployment — kind, I think, is what we were using at that point. And in less than a minute, he had set up on his laptop a full Kubernetes cluster of three bidirectionally replicating, so multi-master, Postgres nodes. And I was just like, my mind was blown. And the thing is, basically, it’s a new concept: the data is what matters. The nodes themselves are completely unimportant. And that’s why, to kind of bring this back around, it mattered when Cloud Native Postgres was released by EnterpriseDB as an open-source operator for plain Postgres — not the bidirectional replication stuff, just Postgres.

The reason that was important was because it’s an ethos. The point is your compute nodes—throw them away. They don’t matter. If one goes down, you provision a new one. If you need to upgrade your tooling or the packages, you throw away the old container image, you bring up a new one. The important part is your data. And as long as your data is on your persistent volume claim or whatever you provision that as, the container itself, the version of Postgres you’re running, those aren’t nearly as important. So it complicates debugging to a certain extent. And we can kind of talk about that maybe later. But the important part is it brings high availability to a level that can’t really be described using the old methods. Because the old method was you create two or three replicas. And if one goes down, you’ve got a monitoring system that switches over to one of the alternates. And then the other one might come back or might not. And then you rebuild it if it does, that kind of thing.

With the Kubernetes approach or the container approach, as long as your storage wasn’t corrupted, you can just bring up a new container to represent that storage. And you can actually have a situation where the primary goes down because maybe it got OOM-killed for some reason. It can actually go down, get a new container provisioned, and come back up before the monitors even notice that there was an outage and switch over to a replica and promote it. There’s a whole mechanism of systems in there to reduce the amount of timeline switches and other complications behind the scenes. So you have a cohesive, stable timeline. You maximize your uptime. They’ve got layers to redirect connections from the outside world through either Traefik or some other kind of proxy to get into your actual cluster. You always get an endpoint somehow — unless something has gone horribly wrong, but that’s true for anything. But the ethos of “your machines aren’t important” spoke to me a little bit. Sure, bare hardware is great, and I actually prefer it — I’ve got servers in my basement specifically for testing clusters and Postgres and whatnot. But if you have the luxury of provisioning what you need at the time — if I want more compute nodes, like I said, throw away my image, bring up a new one that’s got more resources allocated to it — suddenly I’ve grown vertically. And that’s something you can’t really do with bare hardware, at least not very easily.

So then I was like, well, maybe this whole container thing isn’t really a problem, right? So yeah, it’s all because of my time in 2ndQuadrant and Gabriele’s team that high availability does belong in the cloud. And you can run production in the cloud on Kubernetes and containers. And in fact, I encourage it.
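
For readers who have not seen the operator approach Shaun describes, here is roughly what the declarative side looks like. This is a minimal, hedged sketch that creates a CloudNativePG Cluster resource from Python via the Kubernetes client; the cluster name, namespace, and sizes are placeholders, it assumes the CloudNativePG operator is already installed, and production deployments involve many more settings (backups, resources, pooling) than shown.

```python
# Sketch: declaring a three-instance HA Postgres cluster with CloudNativePG.
# One primary plus two replicas; failover and replica provisioning are
# handled by the operator. Assumes the operator is installed and kubeconfig
# is available; names and sizes are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

cluster = {
    "apiVersion": "postgresql.cnpg.io/v1",
    "kind": "Cluster",
    "metadata": {"name": "pg-ha-demo"},
    "spec": {
        "instances": 3,                  # 1 primary + 2 replicas
        "storage": {"size": "10Gi"},     # data outlives any one container
    },
}

api.create_namespaced_custom_object(
    group="postgresql.cnpg.io", version="v1",
    namespace="default", plural="clusters", body=cluster,
)
```

Delete one of the resulting pods and the operator simply recreates it from the persistent volume — which is exactly the “throw the node away, keep the data” ethos Shaun describes.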

Chris Engelbert: I love that. I love that. I also think high availability in cloud, and especially cloud native, are concepts that are perfectly in line and perfectly in sync. Unfortunately, we’re out of time. I didn’t want to stop you, but I think we have to invite you again and keep talking about that. But one last question. One last question. By the way, I love that you said containers were a new thing like 10 years ago — except you came from the Solaris or BSD world, where those things were –

Shaun Thomas: Jails!

Chris Engelbert: But it’s still different, right? You didn’t have this orchestration layer on top. The whole ecosystem evolved very differently in the Linux space. Anyway, last question. What do you think is the next big thing? What is upcoming in the Postgres, the Linux, the container world, what do you think is amazing on the horizon?

Shaun Thomas: I mean, I hate to be cliche here, but it’s got to be AI. If you look at pgvector, it’s basically allowing you to do vectorized similarity searches right in Postgres. And I think Timescale even released pgvectorscale, which is an extension that makes pgvector even better. It makes it apparently faster than dedicated vector databases like Pinecone. And it’s just an area that, if you’re going to do any kind of retrieval-augmented generation, like RAG searches, or if you’re doing any LLM work at all, if you’re building chatbots, or if you’re just doing, like I said, augmented searches — any of that kind of work — you’re going to want your data that’s in Postgres already, right? You’re going to want to make that available to your AI. And the easiest way to do that is with pgvector.

Tembo even wrote an extension we call pg_vectorize, which automatically maintains your embeddings — which is how you interface your searches with the text. And then you can feed that back into an LLM; it also has the ability to do that for you. It can send messages directly to OpenAI, and it can also interface with arbitrary endpoints, so you can set up an Ollama or something on a server or locally and set that as the end target. That way you can even keep your messages from hitting external resources like Microsoft or OpenAI and just do it all locally. And that’s all very important. So that, I think, is where not everyone, but a lot of people are focusing. And a lot of people find it annoying — it’s another AI thing, right? But I wrote two blog posts on this where I built a RAG app using some Python and pgvector. And then I wrote a second one where I used pg_vectorize, and I cut my Python code by like 90%. It just basically talks to Postgres; Postgres is doing it all. And that’s because of the extension ecosystem, right? That’s one of the reasons Postgres is on top of everyone’s mind right now — because it’s leading the charge. And it’s bringing in a lot of people who may not have been interested before.
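
To make the pgvector part tangible, here is a small hedged sketch of a similarity search in Python. It assumes the pgvector extension is available and uses its documented vector type and <-> (L2 distance) operator; the table, data, and three-dimensional embeddings are toy placeholders — real embedding models produce hundreds or thousands of dimensions.

```python
# Sketch: storing embeddings and running a similarity search with pgvector.
# Assumes the pgvector extension is installed; all data here is toy data.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")  # illustrative DSN
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS docs (
            id bigserial PRIMARY KEY,
            body text,
            embedding vector(3)   -- toy dimension; real models use 100s+
        )
    """)
    cur.execute("INSERT INTO docs (body, embedding) VALUES (%s, %s)",
                ("hello world", "[0.1, 0.2, 0.3]"))
    # '<->' is pgvector's L2 distance operator; nearest neighbors first.
    cur.execute("""
        SELECT body FROM docs
        ORDER BY embedding <-> '[0.1, 0.2, 0.25]'
        LIMIT 5
    """)
    print(cur.fetchall())
```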

Chris Engelbert: I love that. And I think that’s a perfect sentence to end the show. The Postgres ecosystem or extension system is just incredible. And there’s so much stuff that we’ve seen so far and so much more stuff to come. I couldn’t agree more.

Shaun Thomas: Yeah, it’s just the beginning, man.

Chris Engelbert: Yeah, let’s hope that AI is not going to try to build our HA systems. And I’m happy.

Shaun Thomas: Maybe not yet, yeah.

Chris Engelbert: Yeah, not yet at least. Exactly. All right, thank you for being here. It was a pleasure. As I said, I think I have to invite you again somewhere in the future.

Shaun Thomas: More than willing.

Chris Engelbert: And to the audience, thank you for listening in again. I hope you come back next week. And thank you very much. Take care.

Policy Management at Cloud-Scale with Anders Eknert from Styra (video + interview)
https://www.simplyblock.io/blog/policy-management-at-cloud-scale-with-anders-eknert-from-styra-video/

This interview is part of simplyblock’s Cloud Commute Podcast, available on Youtube, Spotify, iTunes/Apple Podcasts, Pandora, Samsung Podcasts, and our show site.

In this installment of the podcast, we’re joined by Anders Eknert (Twitter/X, Personal Blog), a Developer Advocate for Styra, who talks about the functionality of OPA, the Open Policy Agent project at Styra, from a developer’s perspective, explaining how it integrates with services to enforce policies. The discussion touches on the broader benefits of a unified policy management system and how OPA and Styra DAS (Declarative Authorization Service) facilitate this at scale, ensuring consistency and control across complex environments. See below for more information on what the Open Policy Agent project is, what ‘Policy as Code’ is, and what tools are available, as well as how OPA can help make simplyblock more secure. Also see the interview transcript section at the end.

EP15: Policy Management at Cloud-Scale with Anders Eknert from Styra

Key Learnings

What is the Open Policy Agent (OPA) Project?

The Open Policy Agent (OPA) is a framework designed for defining and running policies as code, decoupled from applications, for use cases like authorization, or policy for infrastructure. It allows organizations to maintain a unified approach to policy management across their entire technology stack. Styra, the company behind OPA, enhances its capabilities with two key products: Styra DAS and an enterprise distribution of OPA. Styra DAS is a commercial control plane for managing OPA at scale, handling the entire policy lifecycle. The enterprise distribution of OPA features a different runtime that consumes less memory, evaluates faster, and can connect to various data sources, providing more efficient and scalable policy management solutions.

What is Policy as Code?

Policy as code is a practice where policies and rules are defined, managed, and executed using code rather than through manual processes. This approach allows policies to be versioned, tested, and automated, similar to software development practices. By treating policies as code, organizations can ensure consistency, repeatability, and transparency in their policy enforcement, making it easier to manage and audit policies across complex environments. Tools like Open Policy Agent (OPA) (see above) facilitate policy as code by providing a framework to write, manage, and enforce policies programmatically.
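
As a concrete illustration, here is a minimal, hypothetical Rego policy (Rego is OPA’s policy language; the package, rule, and input field names are invented for this example, written in the classic pre-1.0 OPA syntax):

```rego
# Hypothetical HTTP API authorization policy.
package httpapi.authz

# Deny by default; a request is only allowed if a rule below matches.
default allow := false

# Employees may read their own salary record.
allow {
    input.method == "GET"
    input.path == ["salary", input.user]
}

# Admins may read anyone's salary record.
allow {
    input.user == "admin"
}
```

Because the policy is plain text in a repository, it can be versioned, reviewed, and tested like any other code.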

What are the available Policy as Code Tools?

Several tools are available for implementing Policy as Code. Some of the prominent ones include:

  1. Open Policy Agent (OPA) : An open-source framework to write, manage, test, and enforce policies for infrastructure modifications, service communication, and access permissions. See our podcast episode with Anders Eknert from Styra.
  2. HashiCorp Sentinel : A policy as code framework deeply integrated with HashiCorp products like Terraform, Vault, and Consul.
  3. Kubernetes Policy Controller (Kyverno) : A Kubernetes-native policy management tool that allows you to define, validate, and enforce policies for Kubernetes resources.
  4. Azure Policy : A service in Microsoft Azure for enforcing organizational standards and assessing compliance.

These tools help ensure that policies are codified, version-controlled, and easily integrated into CI/CD pipelines, providing greater consistency and efficiency in policy management.

How will OPA help to Make Simplyblock even more Secure?

Integrating Open Policy Agent (OPA) with simplyblock and Kubernetes can enhance security in several ways:

  • Centralized Policy Management: OPA allows defining and enforcing policies centrally, ensuring consistent security policies across all services and environments.
  • Fine-Grained Access Control: OPA provides detailed control over who can access what, reducing the risk of unauthorized access. Policies can, for example, be used to limit access to simplyblock block devices or prevent unauthorized write mounts.
  • Compliance and Auditing: OPA’s policies can be versioned and audited, helping simplyblock to meet your compliance requirements. Using simplyblock and OPA, you have proof of who was authorized to access your data storage at any point in time.
  • Dynamic Policy Enforcement: OPA can enforce policies in real-time, responding to changes quickly and preventing security breaches.
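
As a hedged sketch of the “prevent unauthorized mounts” idea, the hypothetical admission policy below denies Pods outside an approved set of namespaces the ability to mount any PersistentVolumeClaim. It is deliberately simplified: deciding whether a claim is actually backed by the simplyblock CSI driver would additionally require the PersistentVolume objects to be loaded into OPA as data.

```rego
# Hypothetical Kubernetes admission policy: only approved namespaces
# may mount PersistentVolumeClaims (e.g. simplyblock-backed volumes).
package kubernetes.admission

approved_namespaces := {"databases", "storage-workloads"}

deny[msg] {
    input.request.kind.kind == "Pod"
    not approved_namespaces[input.request.namespace]
    volume := input.request.object.spec.volumes[_]
    volume.persistentVolumeClaim
    msg := sprintf("namespace %q may not mount PVC %q",
        [input.request.namespace, volume.persistentVolumeClaim.claimName])
}
```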

Transcript

Chris Engelbert: Hello everyone, welcome back to this week’s episode of simplyblock’s Cloud Commute Podcast. Today I have a guest with me that I actually never met in person as far as I know. I don’t think we have. No. Maybe just say a few words about you, where you’re from, who you are, and what you’re working with.

Anders Eknert: Sure, I’m Anders. I live here and work in Stockholm, Sweden. I work as a developer advocate, or a DevRel lead even, for Styra, the company behind the Open Policy Agent (OPA) project.

I’ve been here for, I think it’s three and a half years or so. Before that, I was at another company where I got involved in the OPA project. We had a need for a solution to do access control or authorization across a very diverse and complex environment. We had development teams in 12 different countries, seven different programming languages in our cluster, and it was just a big mess. Our challenge was how to do authorization in that kind of environment without having to go out to all of these teams and try to coordinate development work with each change we needed to do.

So that’s basically how I got involved in the OPA project. OPA emerged as a good solution to our problem at that time and, yeah, all these years later I’m still here and I’m having a lot of fun.

Chris Engelbert: All right, cool. So you mentioned Styra, I always thought it was Styra [Steera] to be honest, but okay, fair enough.

Anders Eknert: Yeah, no, the Swedish pronunciation would be ‘Steera’. So you’re definitely right. It is a Swedish word, which means to steer or to navigate.

Chris Engelbert: Oh, okay, yeah.

Anders Eknert: So you’re absolutely right. I’m just using the Americanized, the bastardized pronunciation.

Chris Engelbert: That’s fair, probably because I’m German that would be my initial thought. And it kind of makes sense. So in German it would probably be “steuern” or something.

All right, so tell us a little bit about Styra. You already mentioned the OPA project. I guess we’re coming back to that in a second, but maybe a little bit about the company itself.

Anders Eknert: Yeah, sure. Styra was founded by the creators of OPA and the idea, I think, is like the main thing. Everything at Styra revolves around OPA and I think it always has and I’m pretty sure it always will to some extent.

So what Styra does is we created and maintain the OPA project. We created and maintain a lot of the things you’ll find in the ecosystem around OPA and Styra. And also of course we’re a commercial company. So there are two products that are both based around OPA. One is Styra DAS, which is a commercial control plane, which allows you to manage OPA at scale. So like from the whole kind of policy lifecycle. And then there’s an enterprise distribution of OPA as well, which has basically a whole different runtime, which allows it to consume much less memory, evaluate faster, connect to various data sources and so on. So basically both the distributed component and the centralized component.

Chris Engelbert: Right, okay. You mentioned OPA a few times, I think you already mentioned what it really means, but maybe we need to dig into that a little bit deeper. So I think OPA is the Open Policy Agent. And if I’m not mistaken, it’s a framework to actually build policy as we call it policy as code.

Anders Eknert: That’s right, that’s right. So yeah, the idea behind OPA is basically that you define your policies as code, but not just code as like any other code running or which is kind of coupled to your applications, but rather that you try and decouple that part of your code and move it outside of your application so you can work with that in isolation.

And some common use cases could be things like authorization. And I mentioned before this need where you have like a complex environment, you have a whole bunch of services and you need to control authorization. How do we do authorization here? How do we make changes to this at runtime? How do we know what authorization decisions got logged or what people did in our systems? So how do we do auditing of this? So that is one type of policy and it’s a very common one.

But it doesn’t stop there. Basically anywhere you can define rules, you can define policy. So other common use cases are policy for infrastructure where you want to say like, I don’t want to allow pods to run in my Kubernetes cluster unless they have a well-defined security context or if they don’t allow mounts of certain types and so on. So you basically define the rules for your infrastructure. And this could be things like Terraform plans, Kubernetes resource manifests, or simply just JSON and YAML files on disk. So there are many ways to, and many places where you might want to enforce policy. And the whole idea behind OPA is that you have one way of doing it and it’s a unified way of doing it. So there are many policy engines out there and most of them do this for one particular use case. So there might be a policy engine that does authorization and many others that do infrastructure and so on. But that all means that you’re still going to end up with this problem where policy is scattered all over the place, it looks different, it logs different and so on. While with OPA, you have one unified way of doing this and to work with policy across your whole stack and organization. So that is kind of the idea behind OPA.

Chris Engelbert: So that means if I’m thinking about something like simplyblock being a cloud native block storage, I could prevent services from mounting our block devices through the policies, right? So something like, okay, cool.

Anders Eknert: Right

Chris Engelbert: You mentioned authorization, I guess that is probably the most common thing when people think about policy management in general. What I kind of find interesting is, in the past, when you did those things, the actual policies or the rules were often just permission configuration or something. It was already like a configuration file, but with OPA, you kind of made this a first-class concept. Like, it shouldn’t be in your code. Here’s the framework that you can just drop in or put in front of your application, I think, right? It’s not even in the application itself.

Anders Eknert: No, I guess it depends, but most commonly you’ll have like a separate policy repo where that goes. And of course, a benefit of that is like, we’re not giving up on code. Like we still want to treat policy as code. We want to be able to test it. We want to be able to review it. We want to work with all of these things like lint it or what not. We want to work with all these good tools and processes that we kind of established for any development. We want to kind of piggyback on that for policy just as we do for anything else. So if you want to change something in a policy, the way you do that is you submit a pull request. It’s not like you need to call a manager or you need to submit a form or something. That is how it used to be, right? But we want to, as developers, we want to work with these kinds of things like we work with any other type of code.
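
To illustrate the “treat policy like code” workflow Anders describes, a hypothetical unit test for the earlier authz sketch could look like this; OPA ships a test runner, so a CI step would run it with `opa test .`:

```rego
# Hypothetical tests for the httpapi.authz policy (run with: opa test .)
package httpapi.authz_test

import data.httpapi.authz

# Admins should be allowed to read anyone's salary.
test_admin_allowed {
    authz.allow with input as {"method": "GET", "path": ["salary", "bob"], "user": "admin"}
}

# A regular user must not read someone else's salary.
test_other_user_denied {
    not authz.allow with input as {"method": "GET", "path": ["salary", "bob"], "user": "alice"}
}
```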

Chris Engelbert: Right. So what does it look like from a developer’s point of view? I mean, you can use it to, I think, automatically create credentials for something like Postgres. Or is that the DAS tool? Do you need one of the enterprise tools for that?

Anders Eknert: No, yeah, creating credentials, I guess you could definitely use OPA for that. But I think in most cases, what you use OPA for is basically to make decisions that are, most commonly, yes or no. ‘So should we allow these credentials?’ would probably be a better use case for OPA. ‘No, we should not allow them because they’re not sufficiently secure,’ or what have you. But yeah, you can use OPA and Rego, the policy language, for a whole lot of things, and a whole lot of things that we might not have designed it for initially. So as an example, there’s this linter for Rego, which is called Regal, that I have been working on for the past year or so. And that linter itself is written mostly in Rego. So we kind of use Rego to define the rules of what you can do in Rego.

Chris Engelbert: Like a small inception.

Anders Eknert: Yeah, yeah. There’s a lot of that.

Chris Engelbert: All right. I mean, you know that your language is good when you can build your own stuff in your own language, right?

Anders Eknert: Exactly.

Chris Engelbert: So coming back to the original question, like what does it look like from a developer’s point of view if I want to access, for example, a Postgres database?

Anders Eknert: Right. So the way OPA works, it basically acts as a layer in between. So you probably have a service between your database and your user or another service. So rather than having that user or service go right to the database, they’d query that service for access. And in that service, you’d have an integration with OPA, either with OPA running as another service or running embedded inside of that service. And that OPA would determine whether access should be allowed or not based on policy and data that it has been provided.
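
When OPA runs as a separate service, the service in front of the database typically asks it for a decision over OPA’s REST API. A minimal sketch, reusing the hypothetical policy path from the example above (OPA listens on port 8181 by default):

```bash
# Ask a locally running OPA whether this request should be allowed.
curl -s -X POST http://localhost:8181/v1/data/httpapi/authz/allow \
  -H 'Content-Type: application/json' \
  -d '{"input": {"method": "GET", "path": ["salary", "alice"], "user": "alice"}}'
# Expected response: {"result": true}
```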

Chris Engelbert: Right. Okay, got it, got it. I actually thought that, but maybe I’m wrong because I’m thinking of one of the enterprise features or enterprise products. I thought it was its own service that handles all of that automatically, but maybe I misunderstood, to be honest. So there are, as you said, OPA Enterprise and DAS, the Declarative Authorization Service.

Anders Eknert: Yeah, yeah, that’s right. You got it right. I remembered right.

Chris Engelbert: So maybe tell us a little bit about those. Maybe I’m mixing things up here.

Anders Eknert: Sure. So I talked a bit about OPA, and OPA acts as the distributed component or the decision point. So that’s where the decisions are made. So OPA is going to tell the user or another service, should we allow this or not. And once you start to have tens or hundreds or thousands of these OPAs running in your cluster, and if you have a distributed environment and you want to do like zero trust, microservice authorization or whatnot, you’re going to have hundreds or thousands of OPAs. So the problem that Styra DAS solves is essentially like, how do we manage this at scale? How do I know what version or which policy is deployed in all these environments? How do I manage policy changes between like dev, test, prod, and so on? It basically handles the whole policy lifecycle. We talked about testing before. We talked about things like auditing. How are these things logged? How can I search these logs? Can I use these logs to replay a decision and see, like, if I did change this, would it have an impact on the outcome and so on?

So it’s basically the centralized component. If OPA is the distributed component, Styra DAS provides a centralized component which allows things like a security team or even a policy team to kind of gain this level of control that would previously be missing when you just let any developer team handle this on their own.

Chris Engelbert: So it’s a little bit like fleet management for your policies.

Anders Eknert: Yes, that is right.

Chris Engelbert: Okay, that makes sense. And the DAS specifically, that is the management control or the management tool?

Anders Eknert: Yeah, that it is.

Chris Engelbert: Okay.

Anders Eknert: And then the enterprise OPA is a drop-in replacement for OPA adding a whole bunch of features on top of it, like reduced memory usage, direct integrations with data sources, things like Kafka streaming data from Kafka and so on and so forth. So we provide commercial solutions both for the centralized part and the distributed part.

Chris Engelbert: Right, okay. I think now I remember where my confusion comes from. I think I saw OPA Enterprise and saw all the services which are basically source connectors. So I think you already mentioned Kubernetes before, but how does that work in the Kubernetes environment? I think you can, as you said, deploy it as its own service or run it embedded in microservices. How would that apply together somehow? I mean, we’re a cloud podcast.

Anders Eknert: Yeah, of course, of course. So in the context of Kubernetes, there’s basically two use cases. Like the first one we kind of covered, it’s authorization at the application level, like inside of the workloads. Our applications need to know that the user trying to do something is authorized to do so. In that context, you’d normally have OPA running as a sidecar or in a gateway or as part of an Envoy proxy or something like that. So it basically provides a layer on top of, or before, any request hitting an actual application.

Chris Engelbert: In the sense of user operated.

Anders Eknert: Yeah, exactly. So the next use case for OPA and Kubernetes is commonly admission control, where Kubernetes itself, or the Kubernetes API, is protected by OPA. So whenever you try to make a modification to Kubernetes, or the database etcd, the Kubernetes API reaches out to OPA to ask, like, should this be allowed or not? So if you try to deploy a pod or a deployment or, I don’t know, whatever kind of resource, OPA will be provided that resource. Again, it’s just JSON or YAML. So anything that’s JSON or YAML is basically what OPA has to work with. OPA doesn’t even know what a Kubernetes resource is. It just sees, like, here’s a YAML document or here’s a JSON document. Is this or that property that I expect in this JSON blob? And does it have the values that I need? If it doesn’t, it’s not approved. So we’re going to deny that. So it basically just tells the Kubernetes API, no, this should not be allowed, and the Kubernetes API will enforce that. So the user will see this was denied because of this or that reason.
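
For illustration, here is a minimal, hypothetical admission rule along these lines, denying Pods whose containers do not disable privilege escalation:

```rego
# Hypothetical admission rule evaluated by the Kubernetes API server
# (via a validating webhook that points at OPA).
package kubernetes.admission

deny[msg] {
    input.request.kind.kind == "Pod"
    container := input.request.object.spec.containers[_]
    not container.securityContext.allowPrivilegeEscalation == false
    msg := sprintf("container %q must set allowPrivilegeEscalation: false",
        [container.name])
}
```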

Chris Engelbert: So that means I can also use it in between any Kubernetes services, everything or anything deployed into Kubernetes, I guess, not just the Kubernetes API.

Anders Eknert: Yeah, anything you try and deploy, like for modifications, is going to have to pass through the Kubernetes API.

Chris Engelbert: That’s a really interesting thing. So I guess going back to the simplyblock use case, that would probably be where our authorization layer or approval layer would sit, basically either approving or denying the CSI deployment.

Anders Eknert: Yeah.

Chris Engelbert: Okay, that makes sense. So because we’re already running out of time, do you think that, or well, I think the answer is yes, but maybe you can elaborate a little bit on that. Do you think that authorization policies or policies in general became more important with the move to cloud? Probably more people have access to services because they have to, something like that.

Anders Eknert: Yeah, I’d say they were probably just as important back in the day. What really changed with the advent of cloud and this kind of automation is the level of scale that any individual engineer can work with. Like in the past, an infra engineer would perhaps manage 20 machines or something like that. While today they could manage thousands of machines or virtual machines in cloud instances or whatnot.

And once you reach that level of scale, there’s basically no way that you can do policy like manually, that you have a PDF document somewhere where it says like, you cannot deploy things unless these conditions are met. And then have engineers sit and try and make an inventory of what do we have here? And are we all compliant? That doesn’t work.

So that is basically the difference today from how policy was handled in the past. We need to automate every kind of policy check, just as we automated infrastructure with the cloud.

Chris Engelbert: Yeah, that makes sense. I think the scale is a good point about that. It was not something I thought about. My thought was more in the sense of: you have probably much bigger teams than you had in the past, which also makes it much more complicated to manage policies or make sure that just the right people have access. And many times somebody gets access because somebody else is on vacation, and it never gets removed again. We all know how it worked in the past.

Anders Eknert: Yeah, yeah. And another difference today compared to 20 years ago is, at least when I started working in tech, if you got to any larger company, it was like, ‘Hey, we’re doing Java here, or we’re doing .NET.’ But if you go to those companies today, they’re like, ‘There’s going to be Python. There’s going to be Erlang. There’s going to be some Clojure running somewhere. There’s going to be, like, so many different things.’

This idea of team autonomy and like teams and deciding for themselves what the best solution for any given problem is. And that is, I love that. It’s like, it makes it so much more interesting to work in tech, but it also provides like a huge challenge for anything that is security related because in anything anywhere where you need to kind of centralize or have some form of control, it’s really, really hard. How do you audit something if it’s like in eight different programming languages? Like I can barely understand two of them. Like how would I do that?

Chris Engelbert: How to make sure that all the policies are implemented? If policy change happens, yeah, you’re right. You have to implement it in multiple languages. The descriptor language for the rules isn’t the same. Yeah, that’s a good point. That’s a very good point actually. And just because time, I think I would have like a million more questions, but there’s one thing that I always have to ask. What do you think is like the next big thing in terms of cloud, in your case, authorization policies, but also in the broader scheme of everything?

Anders Eknert: Yeah, sure. So I’d say like, first of all, I think both identity and access control, they are kind of slow moving and for good reasons. There’s not like there’s going to be a revolutionary thing or disruptive event that turns everything around. I think that’s basically where we have to be. We can rely on things to not change or to update too frequently or too dramatically.

So yeah, as for what the next big thing is, I still think it’s this area where we decouple policy and work with it consistently across large organizations and so on. It’s still the next revolutionary thing. There’s definitely a lot of adopters already, but we’re just at the start of this. And again, organizations don’t just swap out how they do authorization or identity; that could take a decade or so. So I still think this policy as code, while it’s starting to be an established concept, is still the next big thing. And that’s why it’s also so exciting to work in this space.

Chris Engelbert: All right, fair enough. At least you didn’t say automatic AI generation.

Anders Eknert: No, God no.

Chris Engelbert: That would have been really the next big thing. Now we’re talking. No, seriously. Thank you very much. That was very informative. I loved that. Yeah, thank you for being here.

Anders Eknert: Thanks for having me.

Chris Engelbert: And for the audience, next week, same time, same podcast channel, whatever you want to call that. Hope to hear you again or you hear me again. And thank you very much.

Key Takeaways

In this episode of simplyblock’s Cloud Commute Podcast, host Chris Engelbert welcomes Anders Eknert, a developer advocate and DevRel lead at Styra, the company behind the Open Policy Agent (OPA) project. The conversation dives into Anders’ background, Styra’s mission, and the significance of OPA in managing policies at scale.

Anders Eknert works as a Developer Advocate/DevRel at Styra, the company responsible for the Open Policy Agent (OPA) Project. He’s been with the company for 3.5 years and was previously involved in the OPA project at another company.

Styra created and maintains the OPA project with two key products around OPA: 1) Styra DAS, a commercial control plane for managing OPA at scale, handling the entire policy lifecycle, and 2) an enterprise distribution of OPA, which has a different runtime, allowing it to consume less memory, evaluate faster, connect to various data sources, etc. If OPA is the distributed component, Styra DAS is the centralized component.

OPA is a framework to build and run policies – a project for defining policies as code, decoupled from applications, for use cases like authorization, or policy for infrastructure. The idea behind OPA is that it allows a unified way of working with policy across your whole stack and organization.

In the context of Kubernetes, there are 2 key use cases: 1) authorization inside of the workloads where OPA can be deployed as a sidecar or in a gateway or as part of an envoy proxy; 2) admission control where Kubernetes API is protected by OPA.

Anders also talks about the advent of the cloud and how policy management and automation have become essential due to the scale at which engineers operate today. He also discusses the use of diverse programming environments today and team autonomy, both of which necessitate a unified approach to policy management, making tools like OPA crucial.

Anders predicts that policy as code will continue to gain traction, offering a consistent and automated way to manage policies across organizations.

The post Policy Management at Cloud-Scale with Anders Eknert from Styra (video + interview) appeared first on simplyblock.

Your CI/CD Pipeline is a Production system with Stefan Prodan from ControlPlane (video + interview) https://www.simplyblock.io/blog/your-ci-cd-pipeline-is-a-production-system-with-stefan-prodan-from-controlplane-video/ Fri, 31 May 2024 12:10:32 +0000 https://www.simplyblock.io/?p=262

This interview is part of simplyblock’s Cloud Commute Podcast, available on Youtube, Spotify, iTunes/Apple Podcasts, Pandora, Samsung Podcasts, and our show site.

In this installment of the podcast, we’re joined by Stefan Prodan (Twitter/X, Personal Blog), a Principal Engineer at ControlPlane, who talks about the importance of recognizing that a deployment pipeline is basically a cluster admin and needs to be handled securely, as a production system.

EP14: Your CI/CD Pipeline is a Production system with Stefan Prodan from ControlPlane

Chris Engelbert: Hello, everyone. Welcome back to the next episode of simplyblock’s Cloud Commute podcast. Today with me I have Stefan. Please pronounce the last name yourself in a second. I’m not going to try to do that myself. But Stefan joins us from ControlPlane. Before that, I guess he’s also talking a little bit about his own background. So, Stefan, welcome. And maybe say a few words about yourself. Who are you? Why are you here?

Stefan Prodan: Thanks, Chris, for inviting me. I’m Stefan Prodan. I’ve been a software engineer for some time now. And in the last seven years, I’ve been focusing exclusively on open-source engineering. I’ve been involved with the CNCF FluxCD project for all this time. And I’ve developed some of my own sub-projects inside FluxCD, like Flagger, for example, which covers the continuous delivery and progressive delivery side of that. And yeah, I helped architect and shape the current version of Flux, which is version two. And yeah, my passion is around working with the cloud-native ecosystem, with Kubernetes, and building solutions on top of that.

Chris Engelbert: All right. Yeah, you said FluxCD, and I think we’re going to come back to that in a second. Right now, you’re working with a company called ControlPlane. And from what I see on the website, that is a security consultant, a cloud-native security consultancy. Let’s put it that way. So maybe say a few words about the company itself.

Stefan Prodan: Yeah, so the company is from London, but we are distributed around the globe. It’s a security company which focuses on threat modelling and pen testing for Kubernetes environments. We do architectural designs for your continuous integration and delivery pipelines with a focus on security, of course, and compliance. So yeah, our services are more around helping organizations evolve in their cloud-native journey. And while they are doing that, doing it in a safe way. You know, if you are migrating to the cloud, you should also gain a better security posture out of it. And that’s one of our main focuses.

Chris Engelbert: Right, right. So, from a customer point of view, what would a typical customer look like? Is it the big company that, as you said, is just moving into the cloud game? And what are the challenges they face, and where do you help them?

Stefan Prodan: So there is a wide range of customers. It isn’t only banks and financial institutions, but those are usually the organizations that are looking for, you know, answers to the questions: ‘are we really secure? Are we doing the right thing here?’ And I mean, most banks moved part of their infrastructure to the cloud a long time ago, right? So it’s not about getting them started, but more about, you know, how the hybrid cloud looks for you and which challenges come with it. And when we go in, usually we do an architectural review, we try to understand the system there. And then, you know, through pen testing, threat modelling, and other practices like that, and training, we try to first make the customer’s employees more security conscious in their day-to-day operations, and then come up with recommendations for how they can improve. Also, with pen testing, we discover all sorts of, you know, let’s say, misconfigurations, and we also propose solutions for those, but it’s up to the customer to actually take that knowledge and make their security posture better.

So it’s a mix, ranging from, you know, consultancy to red team types of analysis where you poke around and see what you find, but it’s also about looking at the architecture of the whole thing and how that can be improved. And usually, improving it, especially in the cloud-native world, also means simplification.

Like, me as a Flux maintainer, I’ve talked to so many users of Flux, which are, I don’t know, thousands, tens of thousands. When you get started with Kubernetes and the cloud-native landscape, it’s very easy: it’s very, you know, ‘how do I solve this? Oh, I add this component, and I add this other component, and I add this other component,’ and then you have, like, ten-something controllers with hundreds of configurations and so on, right?

So if you do this in a, let’s say, rushed way, or you do it as a proof of concept, and that proof of concept ends up being the thing that you are running in production, you may want to go over it and ask, ‘how can I simplify this? Can I take advantage of this component and maybe eliminate other things?’ Simplifying things usually means you have a better understanding of your system, and that makes the system more secure. So, yeah, in the cloud-native world we tend to deal with massive complexity, and trying to reduce complexity and reduce the noise is a good way forward.

Chris Engelbert: Right. I think one of the interesting things you mentioned is pen testing, and pen testing is always something that is dear to my heart, because I did not do it in a professional way in the past, mostly for online games and stuff. But I think it’s a really important process of actively trying to break into systems or break systems and to find those issues before, well, hopefully, the hostile actors find those. So, I think this is really interesting. That is something, I don’t know, maybe you have a different feeling about that, but I think it’s still something that is not really actively used by a lot of companies, maybe the big ones, but a lot of the smaller companies still seem to miss that, like, where they don’t really get the importance of pen testing. What do you think about that?

Stefan Prodan: Yeah, I mean, I first came to FOSDEM some years ago, and Andy, who is the CEO of ControlPlane, we had worked on something together, but I joined ControlPlane this year, so I’m quite new to the company. He had a talk on how to hack Kubernetes, and he was on stage hacking Kubernetes from the root container on the node. ‘Okay, now I’m on the node. How can I get control of the whole control plane of Kubernetes?’ And then, ‘yay, I’m cluster admin, and from here, I can do whatever I want.’

And yeah, I think we should educate Kubernetes users more through things like that, you know, great talks. We, at ControlPlane, also do professional training where we actually teach people how to hack their own Kubernetes. We have a product called Kubesim, which is a Kubernetes simulator. Everybody gets a cluster, you deploy a container, then shell-exec into it, and from there, you can go sideways and do all sorts of things. And I think that kind of mentality is important to, you know, promote more.

Every time there is some way of getting around security constraints, that should be one of the things you have in mind. So poking around can be fun, and it also teaches you a lot about the system itself; you learn Kubernetes better if you try to, you know, exploit it from this perspective.

Chris Engelbert: That’s very true. It’s kind of the same thing. In the past, I advocated a lot for how to build resilient and fault-tolerant systems. It’s kind of the same thing, from my perspective, with security. There is no way to build a 100% secure system, except to not build it at all. So embrace the idea that there are security issues, and in the worst case, pay somebody to find them for you.

It’s kind of the same thing with resiliency, right? A resilient system is nice, and you can probably build like a 100% resilient system, but nobody will pay the money for that. So it’s a trade-off. Like, how much money do I have in my bank, and how much is this problem worth solving?

Stefan Prodan: Yeah, vulnerabilities at this point come at you from all directions, right? It’s what we’ve seen in the last years with, you know, exploiting the continuous integration and continuous delivery pipeline. And you don’t even have to touch the production system. Maybe that’s bulletproof, but you can get into some Jenkins server, which is out there on the internet with a hard-coded admin password that everyone can guess very easily. And once you’re into the CI system, you can, you know, poison those binaries or deploy your own container on the production cluster, even if the production cluster is great. You’re there through the pipeline, right?

Chris Engelbert: And even worse, you’re gaining the trust of a maintainer over the years of contribution just to sneak in something into the CI/CD pipeline. Which is like, totally mind-blowing to me. Someone would invest so much time up front just to—anyway.

But you made a good bridge to FluxCD, right? You mentioned one of the important things now is that a lot of attack vectors are going towards the deployment pipeline or the CI/CD pipeline, trying to inject something at build time and getting it signed or whatever you want to call it. It looks totally fine, but it’s still a perfect attack vector. That is where ControlPlane also comes into play, with the enterprise offering, ControlPlane Enterprise for FluxCD. Is that it?

Stefan Prodan: Yes. FluxCD being a CNCF project, you as a company, even if you hire maintainers, are not allowed to say ‘Enterprise FluxCD’ because FluxCD is a brand of the CNCF. So it’s ControlPlane Enterprise for FluxCD; some other company tomorrow can offer the same thing, and that would be their enterprise offering for this particular project. So that’s the meaning there.

Basically what Flux does is give you a way to rip the CD part out of your CI/CD. I truly think CI/CD shouldn’t happen in one tool or be this huge monolith that builds all the code, has access to all the sources, produces artifacts, then also connects to all your production systems and deploys to those.

Having this kind of monolith may sound easier to get started with. But if you look at it from a security perspective, and also from a scaling perspective, it becomes a single point of failure and a major vulnerability in your infrastructure. Also, there is this mentality where, you know, people don’t think of CI systems as part of production, right? So everybody has access to the Jenkins cluster or whatever, but production is secure, only SRE people have access. Well, if the CI system has a kubeconfig with cluster admin, because it needs to deploy all the things on the cluster, then you either think of it as part of your production system, or you adopt a pattern like GitOps, for example, which FluxCD implements, where you move the continuous delivery side inside your production, where the thing that deploys on the cluster is running in the cluster, and it’s subject to Kubernetes RBAC, security constraints, and network policies. You apply the same, you know, security mindset to your continuous delivery tool as you apply to the whole production system itself.

So the shift with FluxCD and all the other GitOps tools in the ecosystem is the fact that it runs there in production, and you don’t connect from outside, from Jenkins or your GitHub Actions. You don’t have to open your clusters to the internet, you don’t have to give some external system your cluster admin configuration and authentication. Instead, the cluster itself goes somewhere, looks there, and says, ‘oh, this is what I have to deploy, let me deploy it,’ and that somewhere is the Git repo, which can be different, and in most cases should be different, from where you store your source code.

So you can apply constraints on who has access to the Git repository where your production system is defined. You can have different groups of people and different ways to drive changes there. You can enforce all sorts of good practices that you can enforce on any Git repo, like main branches being protected, and every time you modify something on a cluster, you have to open a pull request, and someone from the SRE team has to approve it: ‘oh yeah, it’s okay to change this network policy,’ right? So you basically apply all the good practices that you have for your code to your production systems. You can keep these things in a separate repository or repositories, and then the production system comes to the repository and sees, ‘oh, there is a new version of this app, let me now deploy it for you.’

So you don’t go to the system, the production system comes to you and decides how the new version should be deployed. You can basically think of FluxCD as a proxy between, you know, the desired state, which is a Git repo, and the production system where it runs. So you no longer go to the system and control it yourself. You tell Flux, ‘hey, I would like my cluster to look like this,’ and Flux can tell you, ‘hey, this is not possible, I have Kyverno or OPA in here and they are blocking this change, now go and figure out the fix for it.’ So Flux can integrate with admission controllers, which can enforce good practices and better security constraints on top of your continuous delivery pipeline.
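
As a rough sketch of what that looks like on the cluster side (the repository URL and paths are invented for the example), Flux is configured with a source to watch and a reconciliation object that applies it:

```yaml
# Hypothetical Flux configuration: the cluster pulls its desired state.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: prod-config
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/example-org/prod-config
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: prod-apps
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: prod-config
  path: ./clusters/prod
  prune: true  # remove cluster objects that were deleted from the repo
```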

So there is a continuous delivery pipeline here in the cluster, and a CI thing which is completely separate. Just having this separation, you know, improves your security posture, and you have a more reliable way of deploying. Because let’s say you start with one production cluster, one region, then your business grows, right? Maybe you move beyond the US and you open a shop in Europe as well. You want the European customers to not have huge latency, right, not go to the cluster in the US, so you’ll probably create a new cluster in the European region. So the more your business expands, the more clusters you have. And if you have everything running from a single CI/CD tool, every time you add a new cluster you have to, you know, onboard it into your CI system, like setting up certificates, how you connect to it, and all of that.

With something like Flux, when you add a new cluster, you bootstrap Flux: after the cluster gets created, the first thing that gets deployed there is Flux itself, and then you tell Flux, ‘hey, configure this whole cluster, this whole region, according to that repository where you have defined your production system,’ and it automatically does it. So it’s easier to, you know, expand your production system over regions and so on when you adopt something like GitOps in your pipeline.
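
Bootstrapping a new cluster then reduces to roughly a single command; a sketch with invented organization, repository, and path names:

```bash
# Install Flux on the new cluster and point it at the existing config repo.
# Assumes a GITHUB_TOKEN with repository access is exported in the shell.
flux bootstrap github \
  --owner=example-org \
  --repository=prod-config \
  --branch=main \
  --path=clusters/eu-west-1
```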

Chris Engelbert: That was amazing. I had so many questions, and you just answered all of them in one go. That was absolutely incredible.

Just one quick question because I think a lot of the audience may use something like ArgoCD, in that sense it’s kind of similar, right? It’s kind of a similar idea that you separate out like your build pipeline which would be probably like Jenkins and then you have Flux or Argo CD or something on the cluster side installing or deploying all the artifacts.

Stefan Prodan: Yeah, yeah. So there are two main projects that implement the GitOps pattern: Flux and Argo CD. There is also the Continuous Delivery Foundation (CDF) under the Linux Foundation, where they host Jenkins X, which is a rewrite of Jenkins that has GitOps features. There is also Tekton in the CDF, a project which does continuous integration, but you can also configure Tekton to do continuous delivery. It also runs in your cluster, and there are other projects out there which have begun implementing GitOps features. So GitOps is quite mature as a way of doing continuous delivery right now.

It’s far, far away from when I started with it seven years ago, which felt like, ‘well, what is this GitOps thing?’ Right now people actually get it. And the idea behind GitOps is not new; it’s not something that we invented in the cloud-native space. It’s an old idea: Puppet did the same exact thing way before Kubernetes, with its agents and everything. The idea that you have some kind of agent in your production system that pulls the desired state from outside and tries to change the system to fit what you have described is over 12 years old. Puppet did it really well back then.

Chris Engelbert: Yeah, I agree. The whole GitOps thing is one of those things that has been around for a while and never had a real name, but people have been doing it for quite some time. So yeah, I agree.

For the sake of time, because we’re already past the 20 minutes, I really want to ask you: what is your personal view on the future? What do you think is the next big thing? Is there something you see coming as, like, the next innovation in GitOps or CD pipeline security, whatever you think?

Stefan Prodan: So for me, what I am trying to promote inside the FluxCD organization, and through the Flux project and all the Flux maintainers, is that we try to take Flux in a direction where we offer a different way of doing GitOps: without Git in production, but with Git still as the tool that you use for collaboration. So what we are shifting to in Flux, and it’s already in there, we have production users using it, is using the container registry as the thing that holds your whole desired state, relying on the Open Container Initiative specification, which, since two or three years ago, has this concept of an OCI artifact.

So an OCI artifact is something you are already using: if you use a container image, that’s an OCI artifact. It’s a tarball which has some metadata and is stored in the container registry. Those are your app images. With Flux, what we’ve done is offer tools, in the CLI and in the controllers, where you can do a Flux push, which is the same as a Docker push, but instead of pushing your binaries, with Flux push you push the configuration of your cluster, which can be all the Kubernetes YAMLs, custom resources, Helm charts, the whole definition of your production cluster. It’s stored in the container registry, which by design is closer to the cluster. It’s HA and can live inside your private VPC next to the cluster, whereas Git usually sits outside of that trust zone, because developers have to have access to it and so on.

So you still push the configuration to your Git repo, but instead of Flux coming from the cluster to Git and basically crossing the security trust zone, your CI pushes the configuration as well: when you do the Docker build and Docker push, right after that it does a Flux push of the configuration of that application to the same container registry. You sign it in the same way, with Cosign or Notation, and when Flux deploys the new version of the app, instead of going to Git, it goes to the container registry, pulls the definition there, verifies that the definition is correct, and only then deploys it.
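
A sketch of that CI step, with invented registry and repository names (flux push artifact and Cosign are the tools Stefan names; exact flags may differ between Flux versions, so treat this as an outline rather than a definitive recipe):

```bash
# After docker build/push, publish the app's config as an OCI artifact...
flux push artifact oci://registry.example.com/configs/my-app:v1.0.0 \
  --path=./deploy \
  --source="https://github.com/example-org/my-app" \
  --revision="main@sha1:$(git rev-parse HEAD)"

# ...and sign it, so the cluster can verify it before deploying.
cosign sign registry.example.com/configs/my-app:v1.0.0
```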

So it fits the security model, and we are also promoting this through our ControlPlane offering for Flux, the Enterprise Edition. We want to ensure that ControlPlane customers relying on Flux can adopt a more secure and better way of doing continuous delivery, not only from a security perspective but also from a reliability perspective, right? Because you no longer have to involve Git in your production system; you can rely on the container registry, which you should already have in there and have figured out how to operate. And if you are using a cloud vendor, you already have all these things, but no cloud vendor out there will give you the same SLAs and the same assurances for a Git offering the way they do for a container registry, right? So that’s where we are moving with Flux.

Chris Engelbert: That is a really interesting approach. I never thought of it like that; the container registry to me was always basically an image registry. That’s an interesting approach, to just reuse the same system and say, okay, now you basically push your desired state, and on the other side you just pull it and apply it. That’s interesting. Unfortunately, we’re out of time. I would have a few more questions, but the 20 minutes are over. So thank you very much, it was a pleasure. I hope people learned something. I certainly learned something new; I only used ArgoCD in the past, so Flux is new to me. And thank you for being here.

Stefan Prodan: Thank you for inviting me, yeah please try Flux, you’ll love it.

Chris Engelbert: Yes, please try Flux. And until next week, when you come back and listen to the next episode, you have one week to try Flux now. Thank you very much for being here, and hear you guys next week, or you hear me, whatever. See ya!

The post Your CI/CD Pipeline is a Production system with Stefan Prodan from ControlPlane (video + interview) appeared first on simplyblock.

Continuous vulnerability scanning in production with Oshrat Nir from ARMO https://www.simplyblock.io/blog/continuous-vulnerability-scanning-in-production-with-video/ Fri, 24 May 2024 12:11:05 +0000 https://www.simplyblock.io/?p=266

This interview is part of simplyblock’s Cloud Commute Podcast, available on Youtube, Spotify, iTunes/Apple Podcasts, Pandora, Samsung Podcasts, and our show site.

In this installment of the podcast, we’re joined by Oshrat Nir (Twitter/X, Personal Blog), a Developer Advocate from ARMO, who talks about the importance of runtime vulnerability scanning. See below for more information on what vulnerability scanning is, what vulnerability scanning tools exist, and how simplyblock uses vulnerability scanning. Also see the interview transcript at the end.

EP13: Continuous vulnerability scanning in production with Oshrat Nir from ARMO

Key Learnings

What is Vulnerability Scanning?

Vulnerability scanning is a security process that involves using automated tools to identify and evaluate security weaknesses in a computer system, network, or application. The main objective of vulnerability scanning is to find vulnerabilities that could potentially be exploited by attackers to gain unauthorized access, cause disruptions, or steal data. Here’s a more detailed breakdown of what vulnerability scanning entails:

Identification:

  • Asset Discovery: The process begins with identifying all the assets (servers, networks, applications, etc.) within the scope of the scan.
  • Cataloging: Creating a comprehensive list of these assets, including software versions, configurations, and open ports.

Scanning:

  • Automated Tools: Using specialized software tools that automatically scan the identified assets for known vulnerabilities. These tools often maintain a database of known vulnerabilities, which is regularly updated.
  • Types of Scans:
      • Network Scans: Focus on identifying vulnerabilities in network devices and configurations.
      • Host Scans: Target individual computers and servers to find vulnerabilities in operating systems and installed software.
      • Application Scans: Look for security weaknesses in web applications, APIs, and other software applications.

Analysis:

  • Vulnerability Database: Comparing scan results against a database of known vulnerabilities to identify matches.
  • Severity Assessment: Evaluating the severity of identified vulnerabilities based on factors like potential impact, exploitability, and exposure.

Reporting:

  • Detailed Reports: Generating reports that detail the vulnerabilities found, their severity, and recommendations for remediation.
  • Prioritization: Providing a prioritized list of vulnerabilities to address based on their potential impact on the organization.

Remediation:

  • Patch Management: Applying software updates and patches to fix the vulnerabilities.
  • Configuration Changes: Adjusting system and network configurations to eliminate vulnerabilities.
  • Mitigation Strategies: Implementing additional security measures, such as firewalls or intrusion detection systems, to mitigate the risk of exploitation.

Rescanning:

  • Verification: Conducting follow-up scans to ensure that previously identified vulnerabilities have been successfully addressed.
  • Continuous Monitoring: Implementing ongoing scanning to detect new vulnerabilities as they emerge.

What are some Vulnerability Scanning Tools?

There are various vulnerability scanning tools available, each with its own focus and strengths. Some of the main types include:

  • Network Vulnerability Scanners
  • Web Application Scanners
  • Database Scanners
  • Cloud Vulnerability Scanners

Some of the most widely-used tools include:

  • Tenable Nessus : Comprehensive vulnerability scanner for identifying and assessing vulnerabilities, misconfigurations, and malware across various systems.
  • OpenVAS : An open-source tool for vulnerability scanning and management, derived from Nessus.
  • Enterprise TruRisk™ Platform : A cloud-based service that offers continuous vulnerability scanning and compliance management, previously known as QualysGuard.
  • Rapid7 Nexpose : A real-time vulnerability management solution that helps in identifying, prioritizing, and remediating vulnerabilities.
  • Acunetix : Focused on web application security, it identifies vulnerabilities such as SQL injection, cross-site scripting, and other web-related issues.
  • IBM Security QRadar : A security information and event management (SIEM) solution that integrates vulnerability scanning and management.
  • OWASP ZAP (Zed Attack Proxy) : An open-source tool aimed at finding vulnerabilities in web applications.
  • Nikto : An open-source web server scanner that checks for dangerous files, outdated server components, and other security issues.
  • ARMO Kubescape : An open-source Kubernetes security platform offering vulnerability and misconfiguration scanning, risk assessment, as well as reporting on security compliance. See our podcast episode with Oshrat Nir from ARMO.
  • Snyk : A platform that detects vulnerabilities, misconfigurations, and code security flaws throughout the development process. See our podcast episode with Brian Vermeer from Snyk.
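
For example, a cluster scan with Kubescape, the subject of this episode, against the NSA/CISA hardening framework mentioned in the interview looks roughly like this:

```bash
# Scan the current cluster against the NSA/CISA Kubernetes hardening guidance.
kubescape scan framework nsa --exclude-namespaces kube-system
```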

How does Simplyblock use Vulnerability Scanning?

Simplyblock employs vulnerability scanning to ensure the security and integrity of the cloud-based aspects of its storage solutions. For the storage clusters, simplyblock seamlessly works with industry-standard vulnerability scanning solutions. That means that storage clusters running the simplyblock storage system inside the customer’s AWS account can be discovered, catalogued, and monitored for outdated software, misconfigurations, and other security risks. This involves using automated tools to identify, assess, and mitigate potential security threats across the infrastructure.

Transcript

Chris Engelbert: Welcome back to the next episode of simplyblock’s Cloud Commute podcast. This week, I have another guest from the security space, something that really is close to my heart. So thank you for being here, Oshrat. Is that actually correct? I forgot to ask up front.

Oshrat Nir: It’s hard to say my name correctly if you’re not a native Hebrew speaker, but Oshrat is close enough.

Chris Engelbert: Okay. It’s close enough. All right. I forgot to ask that. So maybe you do a quick introduction. Who are you? Where are you from? What do you do? And we’ll take it from there.

Oshrat Nir: So thanks, Chris, for having me. This is a great opportunity. My name is Oshrat Nir. I am currently the developer advocate for ARMO and Kubescape, which is our CNCF sandbox project. We have an enterprise and an open-source platform that I look after. I’ve been at ARMO for a year and a half. I’ve been in cloud native for about 5.5 years, and before that, I worked in telco. And a fun fact about me is that I lived on 3 continents before I was 9 years old.

Chris Engelbert: All right. We’ll come back to that. Or maybe just now. What do you mean you lived on 3 continents?

Oshrat Nir: I was born in Germany, which is Europe. Then I left Germany when I was nearly 3 and moved to the States. I lived in Philadelphia for 6 years. When I was 8.5 years old, I moved to Israel, and that’s where I’ve been living since.

Chris Engelbert: All right. So you’re not-

Oshrat Nir: I don’t speak German.

Chris Engelbert: Fair enough.

Oshrat Nir: I tried to learn German when I was working for a German company. My friend at Giant, shout out to Giant Swarm. But no, they did a lot of good things for me, like introducing me to cloud native, but German was not one of them.

Chris Engelbert: I agree. I feel sad for everyone who has to learn German. The grammar is such a pain. Anyway, you said you work for ARMO. So tell us a little bit about ARMO, a little bit more than it’s open source or enterprise.

Oshrat Nir: Okay. So ARMO is a cybersecurity company. The co-founders are Shauli Rozen, who is now our CEO, and Ben Hirschberg, who is our CTO, and another co-founder who’s now on the board, called Leonid Sandler. Originally, Leonid and Ben come from cybersecurity. They’ve been doing it since the 90s. They built out a really, really good product that required installing an agent in a cluster. It was highly intrusive and very resource intensive. It might’ve been a good idea, but it was maybe, I don’t know, maybe five years ahead of its time, because that was in the days when agent-less was the thing, and it kind of became the thing. Then what happened was that the NSA and CISA came out with the guidelines for hardening Kubernetes. That was in August of 2021. They grabbed that idea and built an open-source misconfiguration scanner based on that framework, and that’s Kubescape.

They built it out, and it went crazy within days. The star chart was nearly perpendicular. It got to thousands of stars very quickly. By the way, we are waiting to get to 10,000 stars. So if anybody uses and likes us, please, we really, really want to celebrate that 10K milestone. We reached 1,000, 3,000, 5,000 stars very quickly. Then we added more frameworks to the misconfiguration scanner, which include CIS benchmarks. I mean, everybody uses the benchmark. These were all things that allowed people to easily adhere to these frameworks and help with continuous compliance. But, you know, it’s like Alice in Wonderland, to borrow from Lewis Carroll: ‘You need to run in order to stay in place,’ said the Red Queen to Alice.

So we had to continue to develop the product into a platform, because the misconfiguration scanner is not enough. Then we went into CVE scanning, image scanning. So there’s image scanning, repository scanning, scanning the cluster. We also have an agent-less flavor, which was the original way we worked. Then we decided, even though past experience showed what the market thought of that, to also develop an agent, an operator that you put on your cluster. Because things that you can see from inside the cluster are not the same as things you can see from outside the cluster. That’s really important in terms of security, because you don’t want blind spots. You want to have all your bases covered, if I were to use an American sports analogy. So you want to have everything covered. That’s how Kubescape continued to develop.

At the end of 2023, or yeah, it was December of 2023, no, sorry, December of 2022. We were accepted, Kubescape was accepted by the CNCF as a sandbox project. The first misconfiguration scanner in the CNCF. And we’re still there, happy, growing, and we’re in a bid for incubation. So if I do another plug here now: if you’re using Kubescape and you love it, please add yourself to the adopters list, because we want to get to incubation in 2024. We only have 7 months to go, so yeah, please help us with that.

What happened when Kubescape was accepted into the CNCF was that we had to break it out of our enterprise offering, out of our commercial offering. So we broke it out, and now we have two offerings. We have ARMO Platform, which is the enterprise offering. It’s either SaaS or a private installation, whatever works. And of course, Kubescape, which is open source, free for all, anybody can use or contribute. It seems that people really know and love Kubescape. This is the impression I got when I came back from KubeCon in Paris. I mean, people stopped at the ARMO booth and said, “Oh, you’re Kubescape.” So yeah, Kubescape is very well known. It’s a known brand, and people seem to like it, which is great.

Chris Engelbert: Right, right. So as I said, we just had a guest, like, I think 2 weeks ago, Brian Vermeer from Snyk. I just learned it’s actually pronounced Snyk [sneak]. And they’re also in the security space. But from my understanding, ARMO is slightly different. So Snyk mostly looks at the developer and the build pipeline, trying to make sure that all possible vulnerabilities are found before you actually deploy. Common coding mistakes, like the typical SQL injection, all that kind of stuff is caught before it actually can get into production. But with the onsite or continuous online scanning, whatever you want to call it, ARMO is on the other side of these things, right? So why would you need that? Why would you want that continuous scanning? I mean, if there was no security issue, why would there be one in production at some point?

Oshrat Nir: Okay, so first, let’s dial this back a little. Snyk talks about themselves as an AppSec company, and they look at things from the workload or application point of view, and then they work their way down. And they get informed by information from cloud providers, etc. ARMO is the other way around. We start from the infrastructure. Kubernetes infrastructure is something that has never existed before. I mean, Kubernetes is different. You can’t use legacy processes and tools to scan your Kubernetes, because you just don’t get everything that you need. Kubernetes is ephemeral, it scales up, it scales down. Containers don’t last as long, so you don’t have time to test them. There are a lot of things that you could do in the past that you can’t do with Kubernetes.

So the way we look at securing Kubernetes, and by extension the applications or the workloads running on it, is that we start from the infrastructure. We work off of those frameworks and best practices that we talked about, and we use runtime to inform our security. One of the main problems that people securing Kubernetes have is that if they work strictly according to best practices, their applications break or may break. What you need to do is understand application behavior and then secure the infrastructure informed by that.

So it’s sort of a different perspective. We kind of do bottom up and Snyk does top down, and we kind of meet at the application, I would say, because I don’t think Snyk goes all the way down to Kubernetes, and we don’t go all the way up to SAST and all of those four-letter acronyms that aren’t exactly in the Kubernetes world, but above Kubernetes.

Chris Engelbert: So as a company, I actually want both tools, right? I want the development side, the bottom up, to make sure that I catch as much as possible before even going into production. And I want the top down approach in production to make sure that nothing happens at runtime, because I think ARMO also does compliance testing, making sure that my policies are correct. It looks for misconfigurations. So it looks much more at the operational side, stuff that a lot of the other tools, I think, will not necessarily catch easily.

Oshrat Nir: Correct. ARMO, again, is there throughout the software development lifecycle, from the beginning, even to the point where you can do registry scanning and repo scanning and image scanning up front. And then as you write things and as you build out your pipelines, you put security gateways in the pipelines using ARMO.

And an interesting thing: we have started to leverage eBPF a lot for many of the things that we do, in order to improve the signal-to-noise ratio. One of the problems in the world of DevOps and operations is alert fatigue, a lot of false positives. And people are so overwhelmed. And there’s also a missing piece, because even in the world of CVEs, when you’re judging things only by their CVSS, only by the severity and the score of the CVE, you might not be as efficient as you need to be. Because sometimes you have a high severity vulnerability somewhere that doesn’t even get loaded into memory. So it’s not a problem that you have to deal with now. You can deal with it somewhere in the future when you have time, which is never, because nobody ever has time.

But the idea is, again, having production inform what happens in operations by saying, ‘Okay, this is the way the application or the workload needs to work, and this is why I care about this vulnerability and not that vulnerability.’

Chris Engelbert: Right, right.

Oshrat Nir: Now, speaking of that, ARMO is introducing cloud-native detection and response for runtime. We already have this in beta in Kubescape, but it’s coming out in ARMO as well. Since we’ve been working with the workload, using eBPF to see how applications are supposed to act so that we can secure the infrastructure without breaking the application, what we’re doing now is saying, ‘Okay, so now we know how the application needs to act, so I can actually alert you when it’s acting abnormally.’ And then we have anomaly detection. I can actually detect the fingerprints of malware, and then I can flag that and say, ‘Look, this might be problematic. You might need to look at this because you might have a virus.’ And sorry for the 90s reference, but I’m a Gen X-er. People might be scanning for CVEs, but they’re not looking for viruses on images. And that’s just a problem waiting to happen.

Chris Engelbert: Especially with something like the XZ issue just recently.

Oshrat Nir: There you go.

Chris Engelbert: And I think that probably opened the eyes of a lot of people, that to what extent or to what length people go to inject stuff into your application and take over either your build pipeline or your eventual production. I think in the XZ situation, it was like a backdoor that would eventually make it into production, so you have access to production systems.

Yeah, I agree. And you said another important thing, and I’m coming from a strong Java background. It’s about dynamically loading libraries or dependencies. And Java was like the prime example in the past. Not everything you had in your classpath was necessarily loaded into RAM or into memory. But you have the same thing for JavaScript, for PHP, for Python. Especially JavaScript, TypeScript, and Python are the big up-and-comers in terms of dynamic languages. So yeah, I get that. That is really interesting in the sense that you look at runtime, and just because something is in your image doesn’t necessarily mean it’s bad. It only becomes a problem the second it’s loaded into memory and is available to the application. That makes a lot of sense. So you said ARMO runs inside the Kubernetes cluster, right? There’s an operator, I guess.

Oshrat Nir: Yeah.

Chris Engelbert: So do I need to be prepared for anything? Is there anything special I need to think about, or is it literally: you drop it in, and because it’s eBPF and agent-less, it does all the magic for me and I don’t have to think about it at all? Like magic.

Oshrat Nir: Yeah, the idea is for you not to think about it. However, we do give users tools. Again, we’re very cognizant of alert fatigue because what happens is people are overwhelmed. So they’ll either work themselves to burnout or start ignoring things. Neither is a good option.

Okay, so what we want to do is think about the usability of the processes, not just the UX, but the processes that are involved. So we have configurable security controls. You can quiet alerts for specific things, either forever, because this is a risk you’re willing to take, or because that’s just the way the app works and you can’t change it, or you’re not changing it for now.

So you can configure the controls, you can silence alerts for a configurable period of time or forever. And all of these things are there to bring you to the point where you really, really focus on the things that you need, and you increase the efficiency of your security work. You only fix what needs fixing. A good example here is an attack path. I mean, it’s called an attack chain, an attack vector, kill chain, there’s lots of terminology around the same thing. But basically what it says is that there’s a step-by-step path that an attacker would use in order to compromise your entity. There are different entry points that are caused by either misconfigurations or viruses or vulnerabilities, etc. So what we do is we provide a visualization of a possible attack path. I’m hesitant to use the word node because Kubernetes, but it’s kind of like a subway map, where you can check for each node what you need to fix. Sometimes there’s one node where you need to fix one misconfiguration and you’re done, and you’ve immediately hardened your infrastructure to the point where the attack path is blocked. Of course, you need to fix everything around that. But the first thing you need to do is to make sure that you’re secure now. And that really helps, and it increases the efficiency.

Chris Engelbert: Right. So you’re basically cutting off the chain of possibilities, so that even if a person gets to that point, they’re stopped in their tracks, basically. All right. That’s interesting. That sounds very useful.

Oshrat Nir: Yeah, I think that’s an important thing, because that’s basically our North Star. We know that security work is hard. We know that it’s been delegated to DevOps people that don’t necessarily like it or want to do it, and are overwhelmed with other things and want to do things that they find more interesting, which is great. Although, you know, security people, don’t take this personally, I work for a security company, I think it’s interesting. But my point is, and I’m sorry, this is a Snyk tagline, sorry, Brian, you want security tools that DevOps people will use. And that’s basically what we’re doing at ARMO. We want to create a security tool that DevOps people will use and security people will love. And again, sorry, Snyk. It’s basically the same thing, but we’re coming from the bottom, you’re coming from the top.

Chris Engelbert: To be honest, I think that is perfectly fine. They probably appreciate the call-out, to be honest.

Right. So, because we’re pretty much running out of time right now: what is your feeling about how security is thought of at companies? Do they neglect it a little bit? Do they see it as important as it should be? What is your feeling? Is there headroom?

Oshrat Nir: Well, I spend a lot of time on subreddits of security people. These people are very unhappy. I mean, some of them are really great professionals that want to do a good job, and they feel they’re being discounted. Again, there’s this problem where there are tools that they want to use, but the DevOps people that they serve them to don’t want to use them. So there needs to be a conversation. Security is important. Okay, F-16s run on Kubernetes. Water plants, sewage plants, a lot of important infrastructure runs on Kubernetes. So securing Kubernetes is very important. Now, in order for that to happen, everybody needs to get on board with that. And the only way to get on board with that is to have that conversation and to say, ‘Okay, this is what needs to be done. This is how we think you need to do it. Are you on board? And if not, how do we get you on board?’ And one of the ways to get you on board is: okay, look, you can put this in the CI/CD pipeline and forget about it until it reminds you. You can scan a repository every time you pull from it, or an image every time you pull it. You have a VS Code plugin or a GitHub action. And all of these things are there to have that conversation and say: look, security is important, but we don’t want to distract you from things that you find important. And that’s a conversation that has to happen, has to happen all the time. Security doesn’t end.

Chris Engelbert: Right, right. Ok, last question. Any predictions or any thoughts on the future of security? Anything you see on the horizon that is upcoming or that needs to happen from your perspective?

Oshrat Nir: Runtime is upcoming. Even two years ago, what was the thing? Nobody was talking about anything else except shift-left security. You shift left, DevOps should do it, we’re done, we shifted left. And we found that even if one thing gets through our shift left, our production workloads are in danger. So the next thing on the menu is runtime security.

Chris Engelbert: That’s a beautiful last sentence. Very, very nice. Thank you for being here. It was a pleasure having you. And I hope we’re going to meet. I think we never met in person, which is really weird. But since we’re both in the Kubernetes space, there is a good chance we do. And I hope we really do. So thank you very much for being here.

Oshrat Nir: Thanks so much for having me, Chris.

Chris Engelbert: Great. For the audience next week, next episode. I hope you’re listening again. And thank you very much for being here as well. Thank you very much. See ya.

Key Takeaways

Oshrat Nir has been with ARMO for 1.5 years, bringing 5.5 years of experience in cloud native technologies. ARMO, the company behind Kubescape, specializes in open source-based CI/CD and Kubernetes security, allowing organizations to be fully compliant with frameworks like the NSA guidelines or CIS benchmarks, as well as secure from code to production.

The founders of ARMO built a great product that required installing an agent in a cluster, which was highly intrusive and resource intensive. It was around five years ahead of its time, according to Oshrat. After the NSA and CISA came out with guidelines for hardening Kubernetes, the founders built an open source misconfiguration scanner based on that framework, which became Kubescape.

Kubescape quickly gained popularity, amassing thousands of stars on GitHub, and was accepted by the CNCF (Cloud Native Computing Foundation) as a sandbox project – the first misconfiguration scanner in the CNCF. They’re still growing and are aiming to reach incubation in 2024.

Currently they have two offerings: the ARMO Platform, which is the enterprise offering, and Kubescape, which is open source.

Oshrat also speaks about Snyk, which focuses on application security from a top-down approach, identifying vulnerabilities during development to prevent issues before deployment. ARMO takes a bottom-up approach, starting from the infrastructure and working upward, “We kind of do bottom up and Snyk does top down, and we kind of meet at the application.”

Oshrat also mentions how they have started to leverage eBPF to improve their scanning without changing the applications or infrastructure, which helps their users, particularly by decreasing alert fatigue and the number of false positives.

ARMO is also introducing cloud-native detection and response for runtime. Because they already use eBPF, they are able to add anomaly detection on top.

Oshrat also spoke about the importance of the usability of the processes, which is why they have configurable security controls where you can quiet down or configure alerts for a period of time so you can focus on what you need, which greatly increases the efficiency of your security work.

Oshrat underscores the need for dialogue and consensus between security and DevOps teams to prioritize security without overwhelming developers.

Looking ahead, Oshrat predicts that runtime security will be a critical focus, just as shift left security was in the past. ARMO has you covered already.

The post Continuous vulnerability scanning in production with Oshrat Nir from ARMO appeared first on simplyblock.

]]>
EP13: Continuous vulnerability scanning in production with Oshrat Nir from ARMO
How Oracle transforms its operation into a cloud business with Gerald Venzl from Oracle https://www.simplyblock.io/blog/how-oracle-transforms-its-operation-into-a-cloud-business-with-gerald-venzl-from-oracle-video/ Fri, 17 May 2024 12:11:50 +0000 https://www.simplyblock.io/?p=270 This interview is part of simplyblock’s Cloud Commute Podcast, available on Youtube , Spotify , iTunes/Apple Podcasts , Pandora , Samsung Podcasts, and our show site. In this installment of the podcast, we’re joined by Gerald Venzl ( Twitter/X , Personal Blog ), a Product Manager from Oracle Database , who talks about the shift […]

The post How Oracle transforms its operation into a cloud business with Gerald Venzl from Oracle appeared first on simplyblock.

]]>
This interview is part of simplyblock’s Cloud Commute Podcast, available on Youtube , Spotify , iTunes/Apple Podcasts , Pandora , Samsung Podcasts, and our show site.

In this installment of the podcast, we’re joined by Gerald Venzl ( Twitter/X , Personal Blog ), a Product Manager for Oracle Database , who talks about the shift of focus away from on-premises databases towards the cloud. It’s a big change for a company like Oracle, but a necessary one. Learn more about the challenges and why Oracle believes multi-cloud is the future.

EP12: How Oracle transforms its operation into a cloud business with Gerald Venzl from Oracle

Chris Engelbert: Welcome back to the next episode of simplyblock’s Cloud Commute podcast. Today I have a very special guest, like always. I mean, I never have non-special guests. But today he’s very special because he’s from a very different background. Gerald, welcome. And maybe you can introduce yourself. Who are you? And how did you happen to be here?

Gerald Venzl: Yeah, thank you very much, Chris. Well, how, I really don’t know, but I’m Gerald, I’m a database product manager for Oracle Database, working for Oracle a bit over 12 years now in California, originally from Austria. And yeah, I kind of had an interesting path that led me into database product management. Essentially, I was a developer who developed a lot of PL/SQL alongside other programming languages, building ERP systems with databases in the background, the Oracle database. And eventually that’s how I ended up in product management for Oracle. The ‘how I’m here’, I think you found me. We had a fun conversation about 5 years ago, as we know, when we met first at a conference, as it so often happens. And you reached out, and I think today is all about talking about cloud native, databases, and everything else we can come up with.

Chris Engelbert: Exactly. Is it 5 years ago that we last saw each other, or that we met at all?

Gerald Venzl: No, that we’ve met 5 years ago.

Chris Engelbert: Seriously?

Gerald Venzl: Yeah.

Chris Engelbert: Are you sure it wasn’t JavaOne somewhere way before that?

Gerald Venzl: Well, we probably crossed paths, right? But I think it was the conference there where we both had the speaker dinner and got to exchange some, I mean, more than just like, “Hello, I’m so-and-so.”

Chris Engelbert: All right, that’s fair. Well, you said you’re working for Oracle. I think Oracle doesn’t really need any introduction. Probably everyone listening in knows what Oracle is. But you said you’re a product manager in the database department. So what is that like? I mean, it’s special, or it’s different from the typical audience or from the typical guest that I have. So how is the life of a product manager?

Gerald Venzl: Yeah, so what I particularly like about Oracle product management, or especially in database (obviously different lines of business inside Oracle may operate differently), is that it’s a job with a lot of facets. The typical kind of product management job, the way it was described to me, was: well, you gather customer requirements, you bring them back to development, then they get implemented, and then you do go-to-market campaigns. So basically, you’re responsible for the collateral, the message to advocate these new features to customers, to the world. That’s not quite how it works at Oracle. I think one of the things that really excites me in the database world is that this goes back to the late 70s. I mean, other than Larry, not that many people are around from that era anymore. But Oracle back then did a lot of things that were either before their time, or there simply was no other choice or way of doing it, no conventional wisdom yet, I would say. So one of the nice things at Oracle is that coming up with new features is really a nice collaboration between development and product management.

So development just as much has their own ideas of what we need to do or should be doing, like the PMs, and we really get together and discuss it out. And of course, sometimes there are features that you may or may not agree with personally or don’t see the need for. And often, actually, and much more so, you get quite amazed by what we come up with. We have a lot of really smart people in the org. And one thing that, yeah, not to go too much into a rabbit hole, but a couple of things that I really like: believe it or not, database development feels a lot like a startup. There are no fixed hierarchies as such, no ‘you can only do this, you must only do this’ or anything like that. You can very openly approach the development leads, even up to the SVP levels. And actually, just as we started now, one of those guys was like, “Hey, let’s talk while I’m driving into work.” I was like, “Sorry, I’m busy right now.” So you have that going. And then the product management work has a lot of facets to it. So it’s not just ‘define the product’ or anything like that. That is obviously part of it, but it’s also evangelizing, as I’m doing right now. I speak to people on a thought leadership front for data management, if you like, or how to organize data and so forth.

And as I said before, one other thing that I really enjoy about working in this team is that there are actually quite a lot of really smart people in the org that go back to the 90s, and some of them even to the 80s. So I’ve got one guy who can explain exactly how you would lay out some bytes on disk for the fastest read, etc. This is stuff that I never really touched in school anymore; we were already too abstract. It’s like, “Yeah, yeah, yeah, whatever. There’s some disk and it stores some stuff.” But you still get these low-level guys, and one of them is like, “Yeah, I helped on the C compiler back then with Kernighan.” There was one of the guys who was involved in that. And so anyway, as you know, in this industry people get around quite a bit. So there’s a lot going on there.

Chris Engelbert: So from the other perspective, I mean, Oracle is known for big database servers. I think everyone remembers the database clusters. These days, it’s mostly Sun SPARC, I guess. But there’s also the Oracle Cloud and the database in the cloud. So how does that play into each other?

Gerald Venzl: Oh, yeah. Now things have changed drastically. I mean, traditionally, this started as database software in the good old 80s, where you didn’t even have client-server or whatever; the first version was apparently terminal-based or something like that.

It’s like, again, I never saw this, but there was a big client-server push. And obviously now there’s a big push into the cloud, and a lot of cloud really means distributed systems. And so how does it play into each other? All the database cloud services in Oracle Cloud, all the Oracle database cloud services, are owned by us in development as well.

So we have gone into this mode of building cloud services for Oracle Database. And of course, that’s really nice, because that gives us visibility into the requirements of distributed storage or distributed workloads, and that in turn feeds back into the product. For example, we are still one of the very few relational databases that offers you sharding on a relational model, which is, of course, much harder than a self-contained hierarchical model such as JSON, which you can shard much more nicely. But once you actually split up your data across a bunch of different tables and have relations between those, sharding becomes quite a bit more complicated.
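As a rough illustration of what sharding on a relational model looks like, here is a sketch using the sharded-table DDL from Oracle’s sharding feature; the table, columns, and tablespace-set name are made up, and the exact syntax may vary between database versions:

```sql
-- Hypothetical sharded table: rows are distributed across shards
-- by a consistent hash of the sharding key (cust_id).
CREATE SHARDED TABLE customers (
  cust_id  NUMBER NOT NULL,
  name     VARCHAR2(100),
  region   VARCHAR2(30),
  CONSTRAINT customers_pk PRIMARY KEY (cust_id)
)
PARTITION BY CONSISTENT HASH (cust_id)
PARTITIONS AUTO
TABLESPACE SET ts_set_1;
```

Related tables that share the same key can then be co-located on the same shard, which is exactly the hard part Gerald alludes to: keeping relational joins local once the data is split up.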

And then of course, we have a lot of database know-how. We also have MySQL; they do their own thing, with good collaboration going on with them. So we have quite a lot of, I want to say, brainpower, intellectual power in the company when it comes to accessing data and writing data. You mentioned SPARC before; there’s, of course, a lot of that going on. And quite frankly, I will say that even way before cloud, there was this idea of accessing data that doesn’t necessarily sit in a database, but analyzing it or querying it with SQL. You literally go back 10, 12 years, and everybody said Hadoop will kill every database and big data is the way forward. And I’m sure there was the same thing going on in the mid-2000s; I was not in the industry yet. So this notion that you have data sitting somewhere else and you just want to analyze it has been around for a long time, actually much longer than people see now with object store buckets and data lakes and all the good stuff.

Chris Engelbert: So what does that look like for customers? I mean, I can see that smaller customers won’t have an issue with the cloud, but I could imagine that banks or insurance companies or the like may actually have one. What does the typical cloud customer for Oracle look like? I think it may be very different from a lot of other people using Postgres or something.

Gerald Venzl: Yeah. I mean, you kind of mentioned it before. I think it’s ‘are you small or are you large?’, right? The SMB, small and medium business customers, the smaller ones, obviously, are very much attracted by cloud, by the fact that they don’t have to stand up servers and data centers themselves to just get their product or their services to their customers. The big guys are much more about consolidation, and for the biggest customers we work with, their data center costs are massive because they are massive data centers. So they are looking at more of a cost-saving exercise. Okay, if we can lift and shift this all to cloud, not only can I close down my data centers or a large portion of them, but most of them are actually re-leveraging their workforce. People, especially the ops guys, are often very scared that cloud will take their jobs away. But actually most customers are just thinking, ‘Rather than looking after the servers running this stuff, maybe in 2024 we can leverage your time for something that’s more important to the business, more tangible to the business.’ So they’re not necessarily looking to just get rid of that workforce, but to transform it to take care of other tasks.

A couple of years ago, when we did a big push to cloud for Oracle Database and our premier database cloud service, Autonomous Database, came out, there was quite a big push for the DBAs to transform into something more like a data governance person. All the data privacy laws have crept in quite heavily in the last 5 to 10 years. I mean, they were always there, but with GDPR and all these sorts of laws, they are quite different in what they are asking compared to the data privacy laws before. And this is getting more and more complex, quite frankly. So there was obviously a lot of, ‘Hey, you are the guys who look after these databases storing terabytes and terabytes of data. Now we have these regulatory requirements for where this needs to be stored, how this needs to be accessed, et cetera. And I’d rather have you figure that out than figure out whether the backup was successfully taken or something like that.’ So you’re looking at that angle.

But yeah, the big guys also, I think to some extent, very quickly get concerned about whether data is stored in a public cloud or not. Oracle was actually, I want to say, either the first or definitely a forerunner of what we call Cloud@Customer. So basically you can have an Oracle cloud at your site; Oracle Cloud gets installed in your data center. That’s for those customers who say, “This data is really, really precious.” You always have a spectrum. There’s a lot of data you don’t care about, a lot of public data that you may or may not store, reference data and so forth, that you have to have for your operations. And then there’s the really sensitive data, your customer confidential information and so forth. So there’s always a spectrum of stuff that ‘I don’t care about, it can move quicker to cloud’ or whatever. And then of course the highly confidential data or competitive confidential data: ‘I really don’t want anybody else to get a hold of this’ or ‘it’s not allowed for regulatory reasons.’

Those customers then look into a similar model where they say, ‘Well, we like this sort of subscription-based model where we just pay a monthly or yearly fee per use, and still all the automation is there. We still don’t have to have people checking whether the backup is successful or something. But we want it in our data center. We want to be in full control. We want to be able to basically pull out the cable if we have to, and the data resides in our data center and you guys can no longer access it.’ Sort of that sense. I mean, that is obviously very extreme. And so this is what we call Cloud@Customer. You can have an Oracle cloud environment installed in your data center. Our guys will go in there and set everything up like it is in public cloud.

Chris Engelbert: That is interesting. I didn’t know that thing existed.

Gerald Venzl: Yeah, it’s actually gotten much bigger now. Just to finish up on that: governments are the next level, right? Governments come back and they say, “We’re not going to store our data in another country’s data center.” So this kind of exploded into what we call government regions. And there are some public references out there where some governments actually have a government region of Oracle cloud in their country.

Chris Engelbert: That’s interesting. I didn’t know that Oracle Cloud@Customer existed. Is that maybe how AWS handles it, the, what is it called, Oracle at AWS or something?

Gerald Venzl: No, AWS is different. AWS came out with Outposts, but that was actually years later, and when you do your research, you see that Oracle had this way longer. But now I think every provider has some sort of ‘Cloud@Customer’ derivative. AWS does have Oracle databases in what they call RDS, the relational database service. But I think what you’re thinking of is the Microsoft Azure partnership that we did.

So there’s an Oracle database at Microsoft Azure. And even that has a precursor to it. A couple of years ago, Microsoft and Oracle partnered up and put a fast interconnect between the two clouds, so that you don’t go out over the public net. You could interconnect them from cloud data center to cloud data center; they were essentially co-located in the same kind of data center buildings. I mean, factories is really what they look like these days. So that’s how you got this fast interconnect, kind of like buildings next to each other. And that was the beginning of the partnership. And yeah, by now, there was a big announcement. Satya Nadella and Larry Ellison were up in Redmond at Microsoft, I want to say it was last fall, around September, something like that, when they had this joint announcement that you can now have an Oracle database in Azure. But the Oracle database happens to still run on Oracle Cloud Infrastructure, and that’s why this fast interconnect is exposed via Azure.

Now the important thing is, all the billing, all the provisioning, all the connectivity, everything you do, is going through Azure. So you actually don’t have to know Oracle cloud; the fact that it runs in Oracle cloud is all taken care of. And that caters to the customers. We have lots and lots of customers who have applications that run on a Microsoft stack, take any Windows-based application that runs in Azure, and it’s a natural fit if it happens to have an Oracle database backend. And I think that in general is something that we see in the industry right now: these clouds in the beginning became massive monolithic islands, where you can go into the cloud and they provide you all these services, but it was very hard to actually talk to different services between clouds.

And our founder and CTO Larry Ellison thinks very highly of what he calls multi-cloud, or what we call multi-cloud. You should not have to put all your eggs in one basket. It’s literally the good old story of vendor lock-in again, just in the cloud world. So yeah, you should not have to have one cloud provider and that’s it. And we have already seen government regulations that actually say you have to be able to run on at least two clouds, so if one cloud provider goes out of business, or goes down or whatever, you don’t completely go out of business either. I mean, it’s unlikely, but you know how government regulations happen, right?

Chris Engelbert: Right. So two very important questions. First, super, super important: how do I get an interconnect from the Azure data centers to my home?

Gerald Venzl: Yeah, that I don’t know. They are really expensive. There are some big pipes.

Chris Engelbert: The other one, I mean, sure, that’s a partnership between, you said Microsoft and Oracle, so maybe I was off, but are other cloud providers on the roadmap? Are there talks? If you can talk about that.

Gerald Venzl: Yeah. I mean, I’m too far away to know what exactly is happening. I do know for a fact that we get that question from customers all the time. And, you know, against common belief, I want to say it’s not so much us that isn’t willing to play ball; it’s more the other cloud vendors. We are definitely interested in exposing our services, especially Oracle database services, on other clouds, and we actively pursue that. But yeah, it basically needs a big corporate partnership. There are many people that look at that and want to have a say in it. But I hope that in some time we reach a point where all of these clouds perhaps become interconnected, or at least it’s easier to exchange information. I mean, even this ingress/egress thing is already ridiculous, I find. This was another thing that Oracle did from the very early days: we didn’t charge for egress, right? ‘If data goes out of your cloud, well, we don’t charge you for it.’ And now you see other cloud vendors dropping their egress prices, either constantly going lower or dropping them altogether. Customer demand will push it eventually, right?

Chris Engelbert: Right. I think that is true. I mean, for a lot of bigger companies, it becomes very important to not be on just a single cloud provider, but to be failure-safe, fault-tolerant, whatever you want to call it. And that means sometimes you actually have to go to separate clouds, but keeping state or synchronizing state between those clouds is, as you said, very, very expensive, or it gets very expensive very fast. Let’s say it that way. So, because we’re pretty much running out of time already: is there any secret on the roadmap you really want to share?

Gerald Venzl: Regarding cloud or in general? I mean, one thing that I should say is, with Oracle Database, a lot of people may say, ‘This is old, this is legacy, what can I do with it, etc.’ That’s all not true, right? We just announced our vector support and got quite heavily involved with that lately. So that’s new and exciting. And you will soon see a new version of Oracle Database, we announced this already at CloudWorld, that has this vector support in it. So we’re definitely top-notch there.
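As a rough sketch of what that vector support looks like, assuming the VECTOR data type and VECTOR_DISTANCE function announced for the new database release; the table and values below are hypothetical:

```sql
-- Hypothetical table storing three-dimensional embeddings
CREATE TABLE docs (
  id        NUMBER PRIMARY KEY,
  content   VARCHAR2(4000),
  embedding VECTOR(3, FLOAT32)
);

INSERT INTO docs VALUES (1, 'hello world', TO_VECTOR('[0.1, 0.9, 0.0]'));

-- Similarity search: order by cosine distance to a query vector
SELECT id, content
FROM docs
ORDER BY VECTOR_DISTANCE(embedding, TO_VECTOR('[0.1, 0.8, 0.1]'), COSINE)
FETCH FIRST 5 ROWS ONLY;
```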

And the ‘how do I get started with Oracle Database’ side, this is also something that often people haven’t looked at for a long time. These days, you can get an Oracle database via Docker image, or there is also this new database variation called Oracle Database Free. You can literally just Google ‘Oracle Database Free’. It’s the successor of the good old Express Edition, for those people who happen to have heard of that. But too many people didn’t know that there was a free variant of Oracle Database. And so that’s why we literally put it in the name, ‘Oracle Database Free.’ So that’s your self-contained, free-to-use Oracle Database. It has certain storage restrictions, basically, so you can’t grow the database too big, and it doesn’t come with commercial support. So you can think of it a little bit like, in the open source world, Community Edition and Enterprise Edition. Oracle Database Free is the free thing that doesn’t come with support and essentially restricts itself to a certain size. It’s really meant for you to tinker around with, develop on, run small apps on, etc. But yeah, just Google that or go to Oracle.com/database/free. You will find it there. And just give Oracle Database a go. I think you will find that we have kept up with the times. As mentioned before, we’re one of the very few relational databases that can shard on a relational model, not only on JSON or whatever. So certainly a lot of good things in there.

Chris Engelbert: Right. So, last question: what do you think is like the next big thing or the next cool thing? Or maybe it’s already here?

Gerald Venzl: I mean, I’m looking at the whole AI thing that’s obviously pushing heavily. And I’m like old enough to have seen some hype cycles go completely facepalm, and I’m still young enough to be very excited. So I’m somewhere on the fence: AI could be the next big thing, or it could just, you know, once everybody realizes…

Chris Engelbert: The next not-big-thing.

Gerald Venzl: Exactly. I think right now there’s nothing else on the horizon. I mean, there’s always something coming. But I think everybody’s so laser-focused on AI right now that we probably don’t even care to look anywhere else. So we’ll see how that goes. But yeah, I think there’s something to it. We shall see.

Chris Engelbert: That’s fair. I think that is probably true as well. I mean, I asked a question to everyone, and I always would have a hard time answering myself. So I’m asking all the people to get some good answer if somebody asks me that someday.

Gerald Venzl: Yes. Smart, actually.

Chris Engelbert: I know, I know. That’s what I tried to be. I wanted to say that I am, but I’m not sure I’m actually smart. All right. That was a pleasure. It was nice. Thank you very much for being here. I hope to see you somewhere at a conference soon again.

Gerald Venzl: Yeah, thanks for having me. It was really fun.

Chris Engelbert: Oh no, my pleasure. And for the audience, hear you next week or you hear me next week. Next episode, next week. See you. Thanks.

The post How Oracle transforms its operation into a cloud business with Gerald Venzl from Oracle appeared first on simplyblock.

]]>
EP12: How Oracle transforms its operation into a cloud business with Gerald Venzl from Oracle
PostgreSQL mistakes and how to avoid them with Jimmy Angelakos https://www.simplyblock.io/blog/postgresql-mistakes-and-how-to-avoid-them-video/ Thu, 02 May 2024 12:12:35 +0000 https://www.simplyblock.io/?p=279 This interview is part of simplyblock’s Cloud Commute Podcast, available on Youtube , Spotify , iTunes/Apple Podcasts , Pandora , Samsung Podcasts, and our show site . In this installment of the podcast, we’re joined by Jimmy Angelakos (X/Twitter) , a freelance consultant, who talks about his experiences with customers running PostgreSQL on bare-metal, in the […]

The post PostgreSQL mistakes and how to avoid them with Jimmy Angelakos appeared first on simplyblock.

]]>
This interview is part of simplyblock’s Cloud Commute Podcast, available on Youtube , Spotify , iTunes/Apple Podcasts , Pandora , Samsung Podcasts, and our show site .

In this installment of the podcast, we’re joined by Jimmy Angelakos (X/Twitter) , a freelance consultant, who talks about his experiences with customers running PostgreSQL on bare-metal, in the cloud, and on Kubernetes. He also talks about his new book “PostgreSQL Mistakes and How to Avoid Them”.

EP10: PostgreSQL mistakes and how to avoid them with Jimmy Angelakos

Chris Engelbert: Welcome back everyone. Welcome to the next episode of simplyblock’s Cloud Commute podcast. Today I have a very interesting guest, very different from the other ones before, because he’s actually an author, writing a book right now. Well, I think he already published one or two at least. But he’ll talk about that himself. Welcome, Jimmy.

Jimmy Angelakos: Hi, very nice to be here.

Chris Engelbert: Very nice. Thank you for being here. Maybe we just start simple with the basic stuff. Who are you? Where are you from? What do you do for a living? Except for writing a book.

Jimmy Angelakos: My name is Jimmy Angelakos, which is obviously a Greek name. I live in Edinburgh in Scotland. I’ve been working with Postgres for maybe around 16 years now, exclusively. I haven’t used any other database in 16 years in a professional capacity. Naturally, the time came to share my experiences and I wrote a couple of books on this. Well, I actually co-wrote the “PostgreSQL 16 Administration Cookbook” with my lovely co-authors Boris Mejías, Gianni Ciolli, Vibhor Kumar, and the sadly departed Simon Riggs, who was an awesome fellow. I’d like to pay a little tribute to him as a person, as a mentor to the entire Postgres community. He will be greatly missed.

Chris Engelbert: Thank you very much. I appreciate you sharing that because I think it was last week at the time of recording. It is a sad story for the Postgres community as a whole. Thank you for sharing that. From your professional life, for the last couple of years next to writing books, I think you’re mostly working as a consultant with a couple of different companies and customers. What do you think is the most common task? I mean, you’re probably coming in to help them optimize Postgres, optimize queries.

Jimmy Angelakos: Right. I’ve done all sorts of things in the past few years, like training customers to use Postgres in general, training them to use Postgres in a specific way that is suited to their needs. I have provided support to customers who ran Postgres, and also professional services like consulting. I can’t really say what the thing they request the most is, but I can tell you a few of the things. Some customers come in and say, “My queries aren’t running well. What can I do?” It’s like the most frequent thing you hear. Some other people say, “Tell me what hardware to buy for Postgres.” You tell them, “I can’t really give you a response, because it really depends on your workload,” which is the most important factor, I think, with databases. Everyone uses them differently. And a database as widely used as Postgres has so many use cases and so many different ways to use it. You can do analytics on it to an extent, you can use it for transaction processing (OLTP), you can use it as a document database with JSONB. There’s all sorts of things you can do. There’s no good answer to the things that people ask, like “Give me the best tuning parameters for Postgres” or “How do I write a query the right way?” It really depends on the amount of data you have, the type of data you have, and the sort of queries you’re going to be running.
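To make that concrete: even the most commonly asked-about parameters have no universal answer. A minimal sketch of how tuning is applied in Postgres; the values are purely illustrative, not recommendations:

```sql
SHOW shared_buffers;   -- inspect the current setting

-- Illustrative values only; the right numbers depend on RAM, storage, and workload
ALTER SYSTEM SET shared_buffers = '4GB';    -- takes effect only after a server restart
ALTER SYSTEM SET random_page_cost = 1.1;    -- often lowered on SSDs; reloadable

SELECT pg_reload_conf();  -- applies reloadable settings such as random_page_cost
```

ALTER SYSTEM writes to postgresql.auto.conf rather than editing the main configuration file, which is one reason it has become the usual way to persist such changes.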

Chris Engelbert: Yeah, that makes a lot of sense. It’s not only for the Postgres community or for Postgres. That is very true for a lot of things. From my own personal background, with a lot of programming languages or runtime environments, people ask, “What is the optimized or the optimal way of configuring it?” And they’re like, “I don’t know. Can’t give you the answer.” So, yeah, I hear where you’re coming from. All right, so… Sorry, I’m still having a little bit of a flu. So, from your personal background, you said you’ve co-written one book, but I also hinted on the fact that you’re writing another book right now, and I looked a little bit into it because it’s on Manning and it has Early Access, which is nice. But maybe you can give us a little bit of an insight of what you’re writing about.

Jimmy Angelakos: Right. So, the book that is under construction is called “PostgreSQL Mistakes and How to Avoid Them”. It’s a bit of an anti-how-to. For people that are used to how-to books, like, “How do I partition? How do I do this? How do I do that?”, it’s a bit of the other way around: I was trying to do this, but things went wrong. So it’s experiences that I’ve collected from the things I’ve seen our customers do or the things I’ve done in the past.

Chris Engelbert: Right.

Jimmy Angelakos: And it’s really important to learn from mistakes. Everyone makes mistakes. And Postgres is very particular in how it wants things done. So if you get it right, the database is fantastic. It works very well, with excellent performance. And when you start to do things a different way, you can see different results. And that’s basically the whole idea. There are three chapters up on the web now, and there’s a huge fourth chapter that’s being published as we speak. That has anti-patterns that are not really restricted to Postgres. It’s things like: don’t improvise, don’t create your own distributed systems. There are people that have spent hundreds of thousands of hours working on these problems, and you don’t need to reinvent the wheel.

Chris Engelbert: I hear you. As you said, there’s three chapters out right now. I haven’t seen the fourth one yet, so I think I have to look into that right after the recording.

Jimmy Angelakos: Manning are in the process of publishing it as we speak.

Chris Engelbert: All right, cool. But so far, I really like the second chapter and you bringing up all of the SQL code examples and showing the execution plans. And I think just by saying the word execution plan or the term execution plan, I probably lost half of the audience right now. So maybe you can give them a little bit of a feeling of what is an execution plan? Why is it so important to understand those things?

Jimmy Angelakos: Yeah, so Postgres has a quasi-intelligent query planner, which basically examines the way your query is written and produces a plan for how it’s going to get executed by the database server. It’s like: oh, they wrote this WHERE clause, this and that, and it looks like a join, so I’m going to perform a join of these tables and then I’m going to order the results like this. That’s the execution plan. It’s basically telling you how the database is going to execute your SQL query. Now, the planner takes into account things such as how much memory you have or how fast your disks are, which you’ve already specified in the Postgres configuration. It also takes into account things like the nature of the data. What’s the cardinality, let’s say, in your tables? And these are things that are updated automatically by Postgres itself in its statistics tables. So it produces, most of the time, a really good plan. And what is a good plan? It’s the cheapest plan in terms of arbitrary cost. And arbitrary cost is calculated using those factors that I just mentioned. It iterates through many plans for the execution and chooses the cheapest one, which will probably end up being the fastest one to execute in real-world terms. And seeing the execution plans is key to understanding why your queries are running well or why they’re running slowly. Because then you can see: ah, this is what Postgres was trying to do. So maybe I should force its hand by writing this slightly differently.
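For readers who have never looked at one, here is a minimal sketch of asking Postgres for an execution plan; the table and data are hypothetical:

```sql
CREATE TABLE orders (
  id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  customer_id bigint  NOT NULL,
  total       numeric NOT NULL
);

-- ANALYZE actually executes the query and reports real times and row counts;
-- BUFFERS adds I/O statistics for each plan node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, sum(total)
FROM orders
WHERE customer_id = 42
GROUP BY customer_id;
```

Without an index on customer_id the plan will typically show a sequential scan; after `CREATE INDEX ON orders (customer_id);` the planner usually switches to an index scan, which is exactly the kind of difference the plan makes visible.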

Chris Engelbert: Yeah, that’s true. I think my personal favorite example is a common table expression, which ends up being a join because the query planner understands now a join is actually better. I don’t need to do the temporary heap table to store the intermediate result. So we kind of hinted where people can find the early access version. It’s at Manning. Do you want to add anything more to that? Maybe have a specific simple URL or something where people can find it.

Jimmy Angelakos: I can share the URL, but I certainly cannot spell it out.

Chris Engelbert: Ok, that’s fair enough. We’re going to put it in the show notes. That’s totally fine.

Jimmy Angelakos: Thanks very much. Yeah, I think it’s going to be an interesting book because it’s real-world use cases. And where it isn’t a real-world use case, it’s close enough. And I will tell you so in the text.

Chris Engelbert: That is true. And I agree. As I said, I’ve kind of read through the first three chapters; I read as much as I had time for, but I really enjoyed it. And with many of those code examples you brought up, especially in the second chapter, it was like: yes, either I’ve been there myself or I’ve helped people with exactly that. I’ve worked for a Postgres-based startup in the past, and we had people asking pretty much the same questions over and over again. So yes, for everyone using Postgres or starting to use Postgres, it’s probably a pretty, pretty good pick.

Jimmy Angelakos: Thank you. I appreciate that. Yeah, as you know, people familiar with other databases are coming over, because Postgres has recently exploded in popularity. It was kind of a niche database for a few years, and now it looks like all the enterprises are using it, and all the hyperscalers are offering it, like AWS, Google, Azure. This means that they have recognized the value that Postgres brings to the table.

Chris Engelbert: Yeah, I agree. And I think it’s kind of interesting, because you hinted at that earlier: you can do a lot of things with Postgres. There is a lot of stuff in Postgres itself. If you want a document database, you have XML and JSON. If you want key-value, you have hstore. But there is also really good extensibility in Postgres, giving you the chance to plug everything else in: time series, graph databases, I don’t know what else. You could probably define Postgres as the world’s only truly multimodal database.
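A small sketch of that multimodal flavor, mixing relational columns with a JSONB document in one table; the schema and data are made up:

```sql
CREATE TABLE events (
  id      bigserial PRIMARY KEY,
  kind    text  NOT NULL,
  payload jsonb NOT NULL
);

INSERT INTO events (kind, payload)
VALUES ('login', '{"user": "alice", "ip": "10.0.0.5"}');

-- A GIN index speeds up containment queries on the document column
CREATE INDEX ON events USING gin (payload);

-- Relational predicate and document predicate in the same query
SELECT payload->>'user' AS username
FROM events
WHERE kind = 'login'
  AND payload @> '{"ip": "10.0.0.5"}';
```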

Jimmy Angelakos: Right, yeah. And we were actually considering changing the description of Postgres on the website, where you go in and it says it’s an object-relational database, which is kind of a formal, traditional way to put it. But nowadays, you’re right, I think it’s more of a multimodal database. And I think that is also the term that Simon Riggs preferred. Because it does all of these things, and it can also let you do things that the developers of Postgres hadn’t even thought of, because of the extension system. A very famous extension is PostGIS, which adds GIS (geospatial) capabilities to Postgres and is now considered the gold standard in geographical databases.

Chris Engelbert: True.

Jimmy Angelakos: From an open-source extension to an open-source database. And there’s like thousands of people that are professionally employed to use this extension in their day jobs, which is amazing.
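For a flavor of what that extension adds, a minimal sketch assuming PostGIS is installed; the table and coordinates are invented:

```sql
CREATE EXTENSION IF NOT EXISTS postgis;

CREATE TABLE shops (
  id   bigserial PRIMARY KEY,
  name text NOT NULL,
  geom geometry(Point, 4326) NOT NULL  -- WGS84 longitude/latitude
);

INSERT INTO shops (name, geom)
VALUES ('Example Shop', ST_SetSRID(ST_MakePoint(-3.1883, 55.9533), 4326));

-- Shops within roughly 1 km of a point; casting to geography makes the unit meters
SELECT name
FROM shops
WHERE ST_DWithin(geom::geography,
                 ST_SetSRID(ST_MakePoint(-3.19, 55.95), 4326)::geography,
                 1000);
```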

Chris Engelbert: True. I agree. So let me see. Let me flip back a little bit. I mean, we’re officially a cloud podcast. We talked a lot about the cool Postgres world. And I was part of a Postgres world. I was part of the Java world. So that is mostly the guests I had so far. But because we’re a cloud podcast, what do you think, like working with all the different customers, what is your feeling? Like how many people are actually deploying Postgres in the cloud, in Kubernetes, in EC2, or anything like that?

Jimmy Angelakos: Well, the company I’m working with right now are using it on RDS. They’re using RDS Postgres because it suits their use case better in the sense that they don’t have a team that wants to worry about replication and backups and things like that. And availability zones, they want that handled as a service. And that fits their use case quite well. When you want more flexibility, you can still use the cloud. You can run, for example, Postgres on Azure boxes or EC2 boxes or whatever you want. But then you have to take care of these things yourself.

Chris Engelbert: Right.

Jimmy Angelakos: But it still gives you freedom from having to worry about hard drives and hardware and purchase orders and things like that. You just send off a check every month and you’re done. Now, Kubernetes is an interesting case. There are a couple of operators for Postgres. The most recent one is CloudNativePG, which is starting to get supported and getting traction from the Cloud Native Computing Foundation, which is great. They are trying to do things in a different way that is totally cloud-native. So everything is defined as a resource in Kubernetes, but the resources map to things that are well known in Postgres, like clusters and nodes and backups, actual things, so that you don’t have to perform black magic like running it in a pod but also having to configure the pod manually to talk to another pod that is your replica, things like that. And there are other operators that have evolved over time to approximate this ease of use. The Crunchy Data operator comes to mind. It started off being very imperative; they had a command-line utility that created clusters and so on, and now they’ve turned it into something declarative, which is more cloud-native, more preferred by the Kubernetes world. I think these two are the major Postgres things that I’ve seen in Kubernetes, at least that I’ve seen in use the past few years. There are still things that haven’t been sorted because, as we said, Postgres is super flexible, and this flexibility and the ease of use of Kubernetes, where everything is taken care of automatically, comes at a cost. You have reduced flexibility when you’re on Kubernetes. So there are things that haven’t been totally worked out yet, like: how do you one-click migrate from a cluster that is outside Kubernetes to something that is running in Kubernetes? Or can you take a backup that was produced elsewhere and create a Postgres cluster in Kubernetes from that backup? Now, once they have these things sorted, and hardware support is very important when you’re talking about databases, I think we’ll see many more people going to Postgres on Kubernetes in production. Specifically hardware, and specifically disk performance and throughput and latency: you have to get into the hardware nitty-gritty of Kubernetes to take maximum advantage of Postgres, because as a database, it loves fast disks. Generally speaking, the faster your disk, the faster Postgres will go.
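To illustrate the declarative style Jimmy describes, here is a minimal sketch of a CloudNativePG cluster resource; the names and sizes are hypothetical, and the operator’s documentation is the authority on the current API:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-pg
spec:
  instances: 3              # one primary plus two replicas, managed by the operator
  storage:
    size: 20Gi
    storageClass: fast-ssd  # hypothetical storage class; disk speed matters a lot for Postgres
```

Applying this single manifest is enough for the operator to create the pods, wire up replication, and handle failover, which is exactly the contrast with the older, manually configured pod setups mentioned above.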

Chris Engelbert: That is true. And just as a shameless plug, we’re working on something. But because we’re running out of time already, 20 minutes is always so super short: what do you think is going to be the next thing for Postgres, the database world, the cloud world, whatever you like? What do you think is the next thing?

Jimmy Angelakos: I can’t give you an answer, but you can go search on YouTube and you can find Simon’s last contribution to the Postgres community. He gave a talk at PostgreSQL Conference Europe last December where he said “Postgres, the next 20 years” or something to that effect. And he predicted how things will go for Postgres and its future directions. That’s a very interesting talk for anyone who wants to watch it. I wouldn’t want to hazard a guess because I’ve seen people just blindly accept that AI is the next big thing, and that everything in Postgres and databases and Java and Python is going to revolve around AI in the future. That remains to be seen.

Chris Engelbert: I like that because normally I start to say, please don’t say AI. Everyone says that. And I think AI will be a big part of the future, but I agree with you. It remains to be seen how exactly. Yeah, thank you very much. We’re going to put the video link in the show notes as well for everyone interested. And yeah, Jimmy, thank you very much. It was a pleasure having you.

Jimmy Angelakos: Thanks very much. I appreciate the invitation.

Chris Engelbert: My pleasure. And for the audience, we’re going to see or hear us next week. And thank you very much for being here.

Improve Security with API Gateways, Nicolas Fränkel https://www.simplyblock.io/blog/how-api-gateways-help-to-improve-your-security-with-nicolas-frankel-from-api7-ai-video/ Fri, 19 Apr 2024 12:13:28 +0000 https://www.simplyblock.io/?p=287 In this installment of the podcast, we talked to Nicolas Fränkel (X/Twitter) from API7.ai, the creator of Apache APISIX, a high-performance open-source API gateway, discusses the significance of choosing tools that fit your needs, and emphasizes making choices based on what works best for your requirements. This interview is part of the simplyblock Cloud Commute […]

In this installment of the podcast, we talked to Nicolas Fränkel (X/Twitter) from API7.ai, the company behind Apache APISIX, a high-performance open-source API gateway. He discusses the significance of choosing tools that fit your needs and emphasizes making choices based on what works best for your requirements.

This interview is part of the simplyblock Cloud Commute Podcast, available on Youtube, Spotify, iTunes/Apple Podcasts, Pandora, Samsung Podcasts, and our show site.


Chris Engelbert: Hello, everyone. Welcome back to the next episode of simplyblock’s Cloud Commute podcast, your weekly 20-minute podcast show about cloud, cloud security, cloud storage, cloud Kubernetes. Today I have Nicolas with me, Nicolas Frankel. I think it’s a German last name, right?

Nicolas Fränkel: It’s a German last name. I’m French, and it’s mostly pronounced by English speakers, so I don’t care anymore.

Chris Engelbert: All right, fair enough. You can jump right into that. Tell us a little bit about you, where you’re from, why you have a German last name, and being French, and everything else.

Nicolas Fränkel: I’m Nicolas Frankel. Yeah, I’m French. I was born in France. For a long time, I was a consultant in different roles: developer, architect, cloud architect, solution architect, whatever. I worked on projects with crazy deadlines, sometimes stupid management, changing requirements and stuff. And so I got very dissatisfied with it, and for a couple of years now I’ve been doing developer advocacy.

Chris Engelbert: Right, right. And we know each other from the Java world, so you’ve been a lot around the Java community for a long, long while.

Nicolas Fränkel: Yeah, I think we first met at conferences. I don’t remember which one, because it was quite long ago, but my main focus at the time was Java and the JVM.

Chris Engelbert: I think the first time was actually still JavaOne or something. So people that know a little bit of the Java space and remember JavaOne can guess how long ago this must have been. Right, so right now you’re working for a company called API7.

Nicolas Fränkel: So API7 is a company that is working on the Apache APISIX. Yeah, I agree. That’s funny. That was probably designed by engineers with no billboard marketing, but it’s still good, because 7 is better than 6, right? So Apache APISIX is an API gateway, and it’s an Apache project, obviously.

Chris Engelbert: All right, so you mentioned APISIX, and you obviously have the merch on you. So API7 is like the Python version, right? It’s one-based. APISIX is the zero-based version. We can argue which one is better.

Nicolas Fränkel: It’s a bit more complicated. So API7 is the company. APISIX is the Apache project, but API7 also has an offering called API7. So either you have an API7 on-premise version or an API7 cloud version. You can think about it just like Confluent and Kafka. Of course, again, API7, APISIX, it’s a bit confusing. But just forget about the numbering. It’s just Confluent and Kafka. Confluent contributes to Kafka, but still they have their own offering. They do support for their own products, and they also have an on-premise and cloud version.

Chris Engelbert: All right, so that means that API7 as a company basically has probably the majority of engineers working on APISIX, which itself is a project in the Apache Foundation, right?

Nicolas Fränkel: I wouldn’t say they have the majority. To be honest, I didn’t check. But regarding the Apache Foundation, in order for a project to be promoted to top level, you must uphold a certain number of conditions. So the process goes like this. You go to the Apache Foundation, you give them the project, and then you become part of the incubator. And in order to be promoted, you need to, as I mentioned, uphold a certain number of conditions that I didn’t check. But one of them is that you must have enough committers from different companies, so that one company is not the only driving force behind the product, which in my opinion is a good thing. Whereas in the CNCF, a project is managed by a company or several companies. In the Apache Foundation, the granularity is the contributor. So a contributor can afterwards, of course, change company or whatever. But in order to actually graduate from the incubator, you must have a certain number of people from different companies.

Chris Engelbert: Yeah, Ok. That makes sense. It’s supposed to be more of a community thing. I think that is the big thing with the Apache Foundation.

Nicolas Fränkel: That’s the whole point.

Chris Engelbert: Also, I think, in contrast to the Eclipse Foundation, where a lot of the projects are basically company driven.

Nicolas Fränkel: I don’t know about Eclipse. I know about the CNCF. I heard that in order to give your project to the CNCF, you need to pay them money, which is a bit weird. Again, I didn’t proof-check that. But it’s company driven. You talk to companies. The CNCF talks to companies. Whereas the Apache Foundation talks to people.

Chris Engelbert: Yeah, OK. Fair enough. All right. Let’s see. You said it’s an API gateway. So for the people that have not used an API gateway and have no idea what that means– and I think APISIX is a little bit more than just a standard gateway. So maybe can you elaborate a little bit?

Nicolas Fränkel: You can think about an API gateway as a reverse proxy on steroids that allows you to do stuff that is focused on APIs. I always use the same example of rate limiting. Rate limiting has been a feature of any reverse proxy since the 80s, because you want to protect your information system from distributed denial of service attacks. The thing is, it works very well, but it treats every one of your clients the same, so you rate limit them exactly the same. Now imagine you are providing APIs. There is a huge chance that you will want to create offerings so that a couple of customers can get a higher limit than others. You can probably do that in a reverse proxy, but you would now need to add business logic into the reverse proxy. And as I mentioned, reverse proxies were designed at a time when they were completely, purely technical. They don’t like business logic so much. Nothing would prevent you from creating a C module and putting it in NGINX to do that. But then you encounter a couple of issues.

The first one is with the open source version of NGINX: if you need to change the configuration, you need to switch it off and on again. If it sits at the entrance of your information system, that’s not great. And the business logic might change every now and then, probably quite often, which makes it even worse. That’s why those technical components, in general, are not happy about business logic. You want to move the business logic away from those components. API gateways, in my definition, because you will find plenty of definitions, first let you change the configuration dynamically. You don’t need to switch them off and on again. And although you still don’t want too much business logic, they are not unfriendly to business logic, meaning that, for example, in Apache APISIX, you would create your plugin in Lua, and then you can change the Lua code. And then it’s fine.
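As a rough illustration of that dynamic configuration, here is a sketch that creates a route with APISIX’s built-in limit-count rate-limiting plugin through the Admin API, using Python’s requests library. The Admin API address, the admin key, and the upstream host are placeholders for your own deployment; the plugin fields follow the APISIX documentation as I understand it.

import requests

ADMIN = "http://127.0.0.1:9180/apisix/admin"       # default Admin API address
HEADERS = {"X-API-KEY": "replace-with-your-key"}   # your admin key here

route = {
    "uri": "/api/*",
    "upstream": {"type": "roundrobin", "nodes": {"backend.internal:80": 1}},
    "plugins": {
        "limit-count": {
            "count": 100,          # allow 100 requests...
            "time_window": 60,     # ...per 60-second window
            "rejected_code": 429,  # status code once the limit is hit
            "key": "remote_addr",  # count separately per client address
        }
    },
}

# A plain HTTP call reconfigures the running gateway; no restart needed.
resp = requests.put(f"{ADMIN}/routes/1", json=route, headers=HEADERS, timeout=5)
resp.raise_for_status()

Swapping the key for something like a consumer name is how different customers can get different limits, which is exactly the kind of per-client logic a plain reverse proxy struggles with.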

Chris Engelbert: Right. Ok, so APISIX also uses Lua. That seems to be pretty much a staple across a lot of the implementations.

Nicolas Fränkel: Not really. I mean, regarding the architecture, it’s based on NGINX. But as I mentioned, NGINX is not great for that. So on top of that, you have something called OpenResty. And OpenResty is actually Lua code that allows you to change the configuration of NGINX dynamically. The thing is, the configuration of OpenResty itself maps only one-to-one to the configuration of NGINX. So if you are doing it at scale, it’s not the best maintainability ever. So Apache APISIX provides you with abstractions. What is an upstream? What is a route? Then you can reuse an upstream across several routes. What is a service? And everything is plugin-based. So it’s easy for routes to add a plugin, to remove a plugin, to change the configuration of a plugin, and so on and so forth.

Chris Engelbert: Right. So from an application perspective, or an application developer’s perspective, do I need to be aware of that? Or does that all happen transparently to me?

Nicolas Fränkel: That’s the thing. It’s an infrastructure component. So normally, you shouldn’t care about it. You mostly don’t care about it. Even better, in general, a lot of stuff that you would do with frameworks or libraries like Spring or whatever, you can remove from every individual app that you create and put in these entry points at a very specific place. So your applications themselves don’t need to protect against DDoS because the API gateway will do it for you. And you can also have authentication, authorization, caching, whatever. You can mostly move all those features away from your app, focus on your business logic, and just use the plugins that you need.

Chris Engelbert: Right, so you mentioned authentication. So I think it will hand me a JWT token or whatever kind of thing.

Nicolas Fränkel: For that we have multiple plugins. So yes, we have a JWT plugin. We have a Keycloak integration with a plugin. We have OpenID Connect. We have lots and lots of plugins. And since it’s plugin-based, nothing prevents you from creating your own plugin, either to interface with one of your own proprietary authentication systems, or, if it’s something generic that you want, you can always contribute it back to the Apache Foundation, and then it becomes part of the product. And I mean, that’s the beauty of open source.

Chris Engelbert: Yeah, I agree. And I mean, we have known each other for a long time. You know that I’m a big fan of open source for exactly all those reasons. Also from a company perspective, like a backing company, in this case API7, I think it makes a lot of sense. Because you get, I don’t want to say free help, but you get people that love your project, your product, and that are willing and happy to contribute as well.

Nicolas Fränkel: Exactly. I mean, we both worked for Hazelcast, although at different periods. And that was open source. But for me, this is the next step. The product is not only open source, and open source is at a very interesting moment right now, because some companies are afraid that their product will be shrink-wrapped by the cloud providers, so they switch to a license which is not truly open source according to the accepted definition. But the Apache Foundation is fully open source. So even if, for whatever reason, API7 decides not to work on the project anymore, you can still have the project somewhere. And if you find a couple of maintainers, it means it’s still maintained.

Chris Engelbert: So from a deployment perspective, I guess I deploy that into Kubernetes, or?

Nicolas Fränkel: That’s the thing. It’s not focused on Kubernetes. So you can deploy it on any cloud provider, or even directly on the machine you choose. You have basically two modes. The first mode is the one that you would like to play with at first. So you deploy your nodes, and then you deploy etcd, the same one used by Kubernetes to store its configuration. It’s a distributed key-value store. And then you can change the configuration of APISIX through an API call, and it will store its configuration in etcd. And then it’s very dynamic. If you have more maturity in GitOps, if you have more maturity in DevOps in general, perhaps you will notice: oh, now where is my configuration? Well, in etcd. But now I need to back it up. How do I migrate? I need to move the data from etcd to another cluster. So it’s perhaps not the best production-grade way. So another way is to have everything static in YAML files. I hate YAML.

But at the moment, everybody is using YAML, and that’s the configuration. At least Ops understand how to operate that. And so every node has its own set of YAML files, and those YAML files are synchronized, GitOps-style, to a GitHub repository. The GitHub repository is then the source of truth, and it can be read, it can be audited, whatever. Whereas if you store everything in etcd, it still works the same way, but it’s opaque. You don’t know what happens, right?

Chris Engelbert: I mean, the last thing you said with the GitHub repository being basically infrastructure as code, source of truth, that would probably then play into something like ArgoCD to deploy the updated version.

Nicolas Fränkel: Right, Ok. That makes sense. We don’t enforce any products. And actually, we just provide a way to statically configure Apache APISIX, and then you use whatever product you want. We are not partisan. We just allow you to do it.

Chris Engelbert: So from your own feeling, what do you think is the most common use case why people would use API gateways? Is that, as you said, rate limiting? I can see that as a very common thing, not only for companies like X or Twitter or whatever you want to call those these days, but also GitHub. I think every meaningful API has some kind of a rate limit. But I could also see DDoS attack, whereas I think people would probably use Cloudflare or any of these providers. What do you think is the biggest typical use case for that?

Nicolas Fränkel: If you are using APIs, you probably need something more than just a traditional reverse proxy. If you are using a reverse proxy, you are happy with it, and you didn’t hit any of its limits, just keep using your reverse proxy. As I mentioned, once you start to dip your feet into the API world, you will notice where the reverse proxy falls short. It has some of the features that you want, but perhaps not the ease or the flexibility of configuration that you want. Say you want to treat different clients in different ways. In that case, that’s probably the time to think about migrating to an API gateway.

But contexts are so different that it’s very hard to provide a simple solution that caters to everybody’s needs. You could have a reverse proxy at the entrance of your whole information system, and at the second level, you would have the API gateway. Or you could have an API gateway for each, I don’t know, domain of your organization, because your organization has different teams for every domain. And then, though it would be possible to have one gateway that is managed by different teams, it makes a lot of sense to have each team managing its own configuration on its own component. It’s like with microservices: everybody manages their own stuff, and you are sure that nobody will step on anyone else’s toes. But again, it depends a lot on the size, on how well you’re organized, on the maturity, on many different things. There are probably as many architectures as there are organizations.

Chris Engelbert: Just quickly, hinting back at Kubernetes, I think, and I may be wrong here, if I use APISIX, I do not need any other ingress system, because APISIX can be the ingress provider for Kubernetes, right?

Nicolas Fränkel: So getting back to Kubernetes, yes, we have an ingress controller. And we have a Helm chart. You can install APISIX inside your Kubernetes cluster, and it will serve as an ingress controller. So you will have the ingress controller itself, and it will configure Apache APISIX according to your manifests.

Chris Engelbert: All right, cool. So just looking at the time. Yeah, 20 minutes is not a lot. So when I want to use APISIX, should I call you guys at API7? Or should I go with the Apache project? Or should I do something else?

Nicolas Fränkel: It depends. I would always encourage people, if you are a tech person, to just take the project, use the Docker container, for example, play with it, check if it’s exactly what you need, and try to understand the limits and the benefits in your own organization. Then if you’ve got questions, we’ve got a Slack that I can send you the reference to, and you can start to ask questions like, “In this case I tried to do that, and it works like this, but I wanted it to do that.” Then, when you think that Apache APISIX is the right solution, check if the open source version is enough. I believe if you are running a company, you will need to have some kind of support at some point. Up until that point, of course, just use the open source version and be happy with it. If you want to use it in a production-grade environment with support, with guarantees, and stuff, of course, please call us. It also pays my salary, so that’s also great. You’re welcome to play with the open source version and to check if it suits your requirements.

Chris Engelbert: Before we come to the last question, which is something I always have to ask people, maybe a quick comparison to other products. There are a lot of API gateways, at least in air quotes, on the market. Why is APISIX special?

Nicolas Fränkel: First, every cloud provider comes with its own API gateway. My feeling is that all of them are much better integrated, but much more limited in features. Again, if it suits you, then use them. That’s fine. If you find yourself at some point needing workarounds, then perhaps it’s time to move away from them. Then, about the comparison, the only really in-depth comparison I’ve done so far is with the Spring Cloud API gateway. I have written a blog post, but in short, if you are a developer team using Spring, knowing Spring, then use the Spring Cloud API gateway. It will be fine. If you want an Ops team to operate it, then it probably won’t be that great. At the basic level, you can do a lot with YAML, but then you find yourself needing to write Java code. Ops people, I’m sorry, but they are not experts in writing Java code. You don’t want to have a compile phase.

Anyway, as I mentioned before, if you are a team that manages its own domain, you have only developers or DevOps people, you are familiar with Java, you are expert in Spring, and you only want to manage your own stuff, then perhaps it could be a very good gateway for your needs. Otherwise, I’m not sure it’s a great idea. Regarding the others, I honestly have no clue what the pros and cons are compared to Apache APISIX, but I know that Apache APISIX is the only truly open source project, the only one managed by the Apache Foundation. If you care about open source, not because you love open source so much, but because you care about the future of the project and its long-term maintainability, then that’s our main benefit. I won’t talk about performance or whatever, because, again, I didn’t do any benchmark myself, and every benchmark that is provided by any vendor can probably be discarded out of the box, because you should do your own benchmark on your own infrastructure.

Chris Engelbert: Yeah. I couldn’t have said that any better. It’s something that I keep telling people. Whatever company you work for, there are always people asking for benchmarks, and it’s always like, don’t believe benchmarks. Even if a vendor is really honest and tries to do meaningful benchmarks, it’s always an artificial dataset or whatever. Run your own benchmarks, do it with your own datasets and operational behavior, and figure it out yourself. We can help you, but you just shouldn’t take a vendor’s benchmarks on faith.

Nicolas Fränkel: Right. Exactly.

Chris Engelbert: Alright. Ok. So we’re coming to the end of our episode. And something that I always ask everybody is if there’s one thing that you think people should take away from our conversation today, what would that be?

Nicolas Fränkel: I think the most important thing is that regardless of the project or the tool that you choose, that you choose it for the right reasons. As I mentioned, if you’re using a cloud provider and if it suits your needs, then use it. If it doesn’t suit your needs, if it’s too limited, then don’t hesitate to move away. The good thing with the cloud is that you’re not stuck, right? And if you want a product that is focused on open source, and if you are in the open source space, I think Apache APISIX is a very good solution. And yeah, that’s it. Always make choices that fit your needs. And it’s good that you don’t have just one choice, right? You have a couple of them.

Chris Engelbert: That’s really well said. All right, so thank you very much, Nicolas, for being on the show. It was great having you.

Nicolas Fränkel: Thank you.

Building a Time Series Database in the Cloud with Steven Sklar from QuestDB (video + interview) https://www.simplyblock.io/blog/building-a-time-series-database-in-the-cloud-with-steven-sklar-from-questdb/ Fri, 12 Apr 2024 12:13:27 +0000 https://www.simplyblock.io/?p=293 This interview is part of the simplyblock Cloud Commute Podcast, available on Youtube , Spotify , iTunes/Apple Podcasts , Pandora , Samsung Podcasts, and our show site . In this installment of the podcast, we talked to Steven Sklar ( his private blog , X/Twitter ) from QuestDB , a company producing a time series […]

This interview is part of the simplyblock Cloud Commute Podcast, available on Youtube, Spotify, iTunes/Apple Podcasts, Pandora, Samsung Podcasts, and our show site.

In this installment of the podcast, we talked to Steven Sklar (his private blog, X/Twitter) from QuestDB, a company producing a time series database for large IoT, metrics, observability, and other time-component data sets. He talks about how they implemented their database offering, from building their own operator to how storage is handled.

Chris Engelbert: Hello, everyone. Welcome back to another episode of simplyblock’s Cloud Commute Podcast. Today, I’m joined by Steven Sklar from QuestDB. He was recommended by a really good friend and an old coworker who’s also at QuestDB. So hello, Steven, and good to have you.

Steven Sklar: Thank you. It’s really a pleasure to be here, and I’m looking forward to our chat.

Chris Engelbert: All right, cool. So maybe just start with a quick introduction. I mean, we already know your name, and I hope I pronounced that correctly. But what else is to talk about you?

Steven Sklar: Sure. So I kind of have a nontraditional background. I started with a degree in economics and finance and worked on Wall Street for a little bit. I like to say in most of my conference talks on the first slide that my first programming language was actually Excel VBA, which I do still have a soft spot for. And I found myself on a bond trading desk and kind of reached the boundaries of Excel, and started working in C# and SQL Server, and realized I liked that a lot more than just talking to people on the phone and negotiating over various mortgage bonds and things. So I moved into the IT realm and software development and have been developing software ever since. I moved on from C# into the Python world, moved on from finance into the startup world, and I’m currently at QuestDB, as you mentioned earlier.

Chris Engelbert: Right. So maybe you can say a few words about QuestDB. What is it? What does it do? And why do people want to use it?

Steven Sklar: Sure. QuestDB is a time series database with a focus on high performance and, I like to think, ease of use. We can ingest up to millions of rows per second on some benchmarks, which is just completely mind-blowing to me. It’s actually written primarily in Java, which doesn’t necessarily go hand in hand with high performance, but we’ve rewritten most of the standard library to avoid memory allocation, so it truly is high performance. We’ve also been introducing Rust into the code base. You can query the database using plain old SQL. And it really fits several use cases, like financial tick-by-tick data and sensor data. I have one running in my house right now, collecting all of my smart home stuff from Home Assistant. And I mean, yes, I’ve been here for around a year and a half, I want to say. And it’s been a great ride.

Chris Engelbert: Right. So you mentioned time series. And I’m aware of what time series are because I’ve been at a competitor before. Jaromir and I went slightly different directions, but we both ended up in the time series world. But the audience may not be perfectly aware of what time series are. You already mentioned tick data from the financial background. You also mentioned Home Assistant and IoT data, which is great because I’m doing the same thing myself. For me, it’s mostly energy consumption and stuff. But maybe you have some more examples.

Steven Sklar: Sure. Kind of a canonical one is monitoring and metrics. Any kind of data, I think, has a time component. And I think you need a specialized database. A lot of people ask, well, why not just use Postgres or any of the common databases? And you could, but you’re probably not going to scale, and you’re going to hit a point where your queries are just not performing. Time series databases in many cases, and ours in particular, I can speak to, are columnar databases. They store data in a different format than you would normally see in a traditional database, and that makes querying and ingesting data from a wide range of sources much more efficient. I don’t want to put myself on the spot and do mental math, but imagine you have 10,000 devices that are sending information to your database every second. It’s not that big of a deal. But maybe, let’s say, you scale and you end up with a million devices. All of a sudden, you’re dealing with tremendous amounts of data going into your database that you need to manage. And that’s a different problem, I think, than your typical relational database handles.
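For a sense of what that ingestion path can look like in practice, here is a minimal sketch using QuestDB’s official Python client (the questdb package) and its line protocol; the endpoint, table, and column names are illustrative.

from questdb.ingress import Sender, TimestampNanos

# Stream rows over the line protocol; the client batches internally.
with Sender.from_conf("http::addr=localhost:9000;") as sender:
    sender.row(
        "device_metrics",                   # target table
        symbols={"device_id": "dev-0001"},  # indexed, repetitive values
        columns={"temperature": 21.5, "battery": 0.87},
        at=TimestampNanos.now(),            # the time component
    )
    sender.flush()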

Chris Engelbert: Right. And I think you brought up a good example. Most of the time when we talk about devices, as I said, I’m coming from a kind of similar background, it’s not like a device just sends you a single data point. When we talk about connected cars, they actually send thousands to hundreds of thousands of data points: position information, all kinds of metrics about the car itself, the electronics, and all that kind of stuff. And that adds up to quite a massive amount of data. So yeah, I agree with you. An actual time series database is super important. You mentioned columnar storage. Maybe you can say a few words about how that is different from, I guess, your Excel sheet.

Steven Sklar: Sure. Well, I guess I don’t know if I can necessarily compare it to my Excel spreadsheet, since that’s its own weird XML format, of course. But columnar data, I guess, is different from, let’s say, tabular data in your typical database. Tabular data is generally stored in table format, where all of your columns and rows are kind of stored together, whereas in a columnar data store, each column is its own separate file. And that makes it more efficient when you’re working with a time component, because time is generally your index. You’re not really indexing on things like primary keys. You’re mostly indexing on time: what happened at this point in time or over this time period. Because of that, we’re able to optimize the storage model to allow faster querying and ingestion as well. And just for clarity, I’m not a core developer. I’m more of a cloud guy, so I hope I got those details right.

Chris Engelbert: I think you get the gist of it. But for QuestDB, that means it still looks like a tabular kind of database. So you still have your typical tables, but the individual columns are stored separately. Is that correct?

Steven Sklar: Correct.

Chris Engelbert: Ok, cool. So you said you’re a cloud guy. But as far as I know, you can install QuestDB locally, on-prem. You can install it into your own private cloud. I think there is the QuestDB cloud, which is the hosted platform. Well, not “I think”, I know that it is. So maybe what is special about that? Does it have special features? Or is that mostly about the convenience of getting a managed database and getting rid of all the work you have to do when you run your own database, which can be complicated?

Steven Sklar: Absolutely. So actually, both. Obviously, you don’t have to manage it, and that’s great. You can leave it to the experts. That’s already worth the price of admission, I think. Additionally, we have QuestDB Enterprise, which has extra features. And all of those features, like role-based authentication, the replication that’s coming soon, and compression of your data on disk, are things that you get automatically through the cloud.

Chris Engelbert: Ok, so that means I have to buy QuestDB enterprise when I want to have those features on prem, but I get them on the cloud right away.

Steven Sklar: Correct.

Chris Engelbert: Ok, cool. And correct me if I’m wrong, but I think from a client perspective, it uses the Postgres protocol. So any Postgres client is a QuestDB client, basically.

Steven Sklar: Absolutely, 100%.
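That compatibility is easy to demonstrate. A minimal sketch with psycopg2, assuming QuestDB’s documented defaults (wire-protocol port 8812, user admin, password quest) and an illustrative sensors table; SAMPLE BY is QuestDB’s SQL extension for time-bucketed aggregation.

import psycopg2

# QuestDB listens for the Postgres wire protocol on port 8812 by default.
conn = psycopg2.connect(
    host="localhost", port=8812,
    user="admin", password="quest", dbname="qdb",
)
with conn.cursor() as cur:
    # Hourly averages over a time series table.
    cur.execute("SELECT timestamp, avg(temperature) FROM sensors SAMPLE BY 1h")
    for ts, avg_temp in cur.fetchall():
        print(ts, avg_temp)
conn.close()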

Chris Engelbert: All right, so that means as an application developer, it’s super simple. I’ll basically drop in QuestDB instead of Postgres or anything else. So yeah, let’s talk a little bit about the cloud then. Maybe you can elaborate a little bit on the stack you’re running on. I’m not sure how much you can actually say, but anything you can share will probably be beneficial to everyone.

Steven Sklar: Oh, yeah, no problem. So we run on AWS. We run on Kubernetes. And, I guess, one thing that I’m particularly proud of is an operator that I wrote to orchestrate all these databases. Our model, which is not necessarily your bread-and-butter Kubernetes deployment, is actually a single-tenant model. So we have one database per instance. And when you’re running Kubernetes, you kind of think, why do you care about what nodes you’re running on? Shouldn’t all that be abstracted away? And I would agree. We primarily use Kubernetes for its orchestration. But we want to avoid the noisy neighbor problem. We want to make it easy for users to change instances and instance types quickly. We want users to be able to shut down their database and still keep the volume. All these things we could orchestrate directly through Kubernetes, but we decided to use single-tenant nodes for that.

Chris Engelbert: Right. So let me see. So that means you’re using Kubernetes, as you said, mostly for orchestration, which means it’s more like if the database for some reason goes down or you have to have maintenance or you want to upgrade. It’s more about the convenience of having something managing that instead of doing it manually, right?

Steven Sklar: Exactly. And so I think we really thought, ok, and this is a little bit before my time, but you could always roll your own cluster. But there are so many things that are baked into Kubernetes these days, like monitoring and logs and metrics and networking and DNS, all these things that I don’t necessarily want to spend all my time on. I want to build a product. And by using Kubernetes and leveraging those components, we were able to build the cloud incredibly quickly, get up and running, and now expand upon it in the future. And that’s why, again, I mentioned the operator earlier. That was not originally part of the cloud. The cloud still has, in a more limited capacity, what we call a provisioner. So basically, if you’re interacting with the cloud and you make a new database, you basically send a message to a queue, and that message will be picked up by a provisioner. And previously, that provisioner would say, ok, you want a database. Let’s make a stateful set. Let’s make a persistent volume. Let’s make these networking policies. Let’s do all of these things. If there’s an error, we can roll back. And we have retries. So it’s fairly sophisticated. But we ended up moving towards this operator model, where instead of the provisioner managing each of these individual components, it just manages one QuestDB resource, and our operator now handles all of those other little things. So I think that’s much more flexible for us in terms of, A, simplifying the provisioner code, and, B, adding new features: instead of having to work in this ever-growing web of Python, it’s really just adding a snippet here and there to our reconciliation loop.
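The reconciliation idea can be sketched in a few lines. This is not QuestDB’s actual operator code, just an illustration of the pattern in Python with the kopf framework; the CRD group, plural, and spec fields are hypothetical.

import kopf

# React to one high-level resource; the handler fans out the low-level
# objects (stateful set, volume, network policies) that the provisioner
# used to create one by one.
@kopf.on.create("example.questdb.io", "v1", "questdbs")
def create_database(spec, name, namespace, **kwargs):
    volume_size = spec.get("volumeSize", "100Gi")  # hypothetical spec field
    # ... create the StatefulSet, PersistentVolumeClaim, NetworkPolicy ...
    # kopf records the returned dict under the resource's status.
    return {"phase": "Provisioning", "volumeSize": volume_size}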

Chris Engelbert: Right. You mentioned that the database is mostly written in Java. Most operators are written in Go. So what about your operator? Is it Java?

Steven Sklar: It’s Go.

Chris Engelbert: That’s fair. To be honest, I think the vast majority is. So you mentioned AWS. But I think that you are mostly talking about QuestDB Cloud, right? From a user’s perspective, do I use a Helm chart or do I also use the operator to install it myself?

Steven Sklar: Yes. So the operator is actually limited to the cloud, because it’s built specifically to manage our own infrastructure with our own assumptions. We do have a Helm chart and an open source image on Docker Hub. I’ve used those more times than I can count.

Chris Engelbert: Ok, fair enough. So you basically support all cloud environments, all on-premise. But when you go for QuestDB Cloud, that is AWS, which I think is a fair decision. It is the biggest environment by far. So from a storage engine perspective, how much can you share? Can you talk about some fancy details? Like what kind of storage do you use? Do you use the local NVMe storage attached to the virtual machine or EBS volumes?

Steven Sklar: Yeah. So in our cloud, we actually have both NVMe and EBS. Most customers end up using EBS. And honestly, EBS is simpler to provision. But I do want to talk about some cool stuff that we’ve done with compression, because we actually never implemented our own compression algorithm. We’re running on top of ZFS and using its compression algorithm. And there’s an issue about potential data corruption when using mmap on ZFS, or rather a combination of mmap and traditional sys calls, the pwrites and preads. So what we do is identify when we’re running on ZFS and then decide to only use mmap calls to avoid this issue. And I think what we’ve done on the storage side of orchestrating this whole thing is pretty cool, too. ZFS has its own notion of snapshots, its own notion of replication, its own notion of ZPools. And to simplify things, again, because we’re running this kind of, I don’t necessarily want to say antiquated, but single-tenant model, which might not be in vogue these days, what we actually do is create one ZPool per volume and put QuestDB on that ZPool with compression enabled. And we’ve written our own CSI storage driver that sits between Kubernetes and the cloud providers, so that we’re able to pass calls on to the cloud provider if, let’s say, we need to create or delete a volume using the cloud provider API. But when it comes to mounting ZFS and running ZFS-related commands, we take control of that and perform it in our own driver. I don’t know when this is going to be released, but I’m actually talking about this in Atlanta next week.
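Roughly the shape of what such a driver executes when preparing a volume, sketched with Python’s subprocess module; the device path and pool name are illustrative, and a real CSI driver would run this inside its own mount and provisioning hooks.

import subprocess

def run(cmd):
    # Fail loudly if any ZFS command errors out.
    subprocess.run(cmd, check=True)

device = "/dev/nvme1n1"     # block device backing the volume (illustrative)
pool = "questdb-data-0001"  # one zpool per volume

run(["zpool", "create", pool, device])
run(["zfs", "set", "compression=lz4", pool])  # let ZFS compress on write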

Chris Engelbert: No, next week is a little bit early. Currently, I’m doing a couple of recordings, building a little bit of a pipeline because of conferences; the same goes for Paris with KubeCon next week. So there is a little bit of delay. I don’t know the exact date; I think it’s three or four weeks out. But I guess your talk may be recorded and public by then. If that is the case, I’m happy if you drop it over and I’ll put it into the show notes; people will love that. So you said when you detect that you run on ZFS, you use mmap. So you basically map the file into memory, change the memory positions directly, and then fsync it? Or how does it work? How do I have to think about that?

Steven Sklar: Oh, boy. Ok. This is getting a little out of my depth. So you always use mmap regardless. But the issue is when you combine mmap with traditional sys calls on ZFS. So what we do is basically turn off those other sys calls and only use mmap when we’re writing to our files. In terms of the specifics of when we sync and things like that, I wish I could answer that right off the bat.

Chris Engelbert: That’s totally fine. So just to sneak in a little shameless plug here, we should totally look into getting QuestDB running on simplyblock. I think that could be a really interesting thing, because you mentioned ZFS, and simplyblock is basically ZFS on steroids. ZFS, from my perspective, I mean, I’m running a ZFS file server in the basement, and it saved me a couple of times with a broken hard disk. It’s just an incredible piece of technology. I agree with that. And it’s interesting because I’ve seen a lot of people running databases on ZFS. And ZFS is all about reliability; it’s not necessarily about the highest performance. So it’s interesting that you chose ZFS and say it’s perfect and works great for you. Because we’re almost running out of time, as I said earlier, 20 minutes is super short: when you look at cloud and databases and the world as a whole, whatever you want to talk about, what do you think is the next big trend or the current big trend? What is coming? What do you think would be really cool?

Steven Sklar: Yeah. So I guess I’m not going to talk about the existential crisis I’m having with Devin and the AI bots, because it’s just a little depressing for me right now. But one thing that I’ve been seeing over the past few years that I find very interesting is this move away from cloud and back into your own data center. I think having control over your data is something that’s incredibly important to basically everyone now. And I think, as a DevOps engineer, it’s about finding a happy medium between all the wonderful cloud APIs that you can use and going into the server room and hooking things up. There’s probably a happy medium there somewhere, and I think that’s an area that is going to start growing in the future. You see a lot of on-prem Kubernetes type things, Kubernetes at the edge maybe. And for me, it presents a lot of interesting challenges, because I spent most of my career in startups working on the cloud and understanding the fundamentals of not just the cloud APIs but also operating systems and hardware a little bit. So figuring out where to draw that line in terms of what knowledge is transferable to this new paradigm will be interesting. And I think that’s a trend that I’ve been focused on, at least over the past couple of months.

Chris Engelbert: It is interesting that you mention that, because it is kind of like that. When the cloud became big, everyone wanted to move to the cloud because it was “cheaper”, in air quotes. And then, well, the next step was serverless, because it is yet cheaper, which we all know is not necessarily true. And I see kind of the same thing now. People realize that not every workload actually works perfectly or is a great fit for the cloud, and they slowly start moving back, or at least going back, not necessarily to cloud instances, but to co-located servers or plain virtual machines, and just taking those for the workloads that do not need to be super scalable or super elastic. Well, thank you very much. That was very delightful. It was a pleasure having you.

Steven Sklar: Thank you.

Chris Engelbert: Thank you for being here. And to the audience: I hope to, well, not see you, but hear you next time, next week. Thank you very much.

Steven Sklar: Thank you. Take care.

Production-grade Kubernetes PostgreSQL, Álvaro Hernández https://www.simplyblock.io/blog/production-grade-postgresql-on-kubernetes-with-alvaro-hernandez-tortosa-from-ongres/ Fri, 05 Apr 2024 12:13:27 +0000 https://www.simplyblock.io/?p=298 In this episode of the Cloud Commute podcast, Chris Engelbert is joined by Álvaro Hernández Tortosa, a prominent figure in the PostgreSQL community and CEO of OnGres. Álvaro shares his deep insights into running production-grade PostgreSQL on Kubernetes, a complex yet rewarding endeavor. The discussion covers the challenges, best practices, and innovations that make PostgreSQL […]

In this episode of the Cloud Commute podcast, Chris Engelbert is joined by Álvaro Hernández Tortosa, a prominent figure in the PostgreSQL community and CEO of OnGres. Álvaro shares his deep insights into running production-grade PostgreSQL on Kubernetes, a complex yet rewarding endeavor. The discussion covers the challenges, best practices, and innovations that make PostgreSQL a powerful database choice in cloud-native environments.

This interview is part of the simplyblock Cloud Commute Podcast, available on Youtube, Spotify, iTunes/Apple Podcasts, Pandora, Samsung Podcasts, and our show site.

Key Takeaways

Q: Should you deploy PostgreSQL in Kubernetes?

Deploying PostgreSQL in Kubernetes is a strategic move for organizations aiming for flexibility and scalability. Álvaro emphasizes that Kubernetes abstracts the underlying infrastructure, allowing PostgreSQL to run consistently across various environments—whether on-premise or in the cloud. This approach not only simplifies deployments but also ensures that the database is resilient and highly available.

Q: What are the main challenges of running PostgreSQL on Kubernetes?

Running PostgreSQL on Kubernetes presents unique challenges, particularly around storage and network performance. Network disks, commonly used in cloud environments, often lag behind local disks in performance, impacting database operations. However, these challenges can be mitigated by carefully choosing storage solutions and configuring Kubernetes to optimize performance. Furthermore, managing PostgreSQL’s ecosystem—such as backups, monitoring, and high availability—requires robust tooling and expertise, which can be streamlined with solutions like StackGres.

Q: Why should you use Kubernetes for PostgreSQL?

Kubernetes offers a powerful platform for running PostgreSQL due to its ability to abstract infrastructure details, automate deployments, and provide built-in scaling capabilities. Kubernetes facilitates the management of complex PostgreSQL environments, making it easier to achieve high availability and resilience without being locked into a specific vendor’s ecosystem.

Q: Can I use PostgreSQL on Kubernetes with PGO?

Yes, you can. Tools like the PostgreSQL Operator (PGO) for Kubernetes simplify the management of PostgreSQL clusters by automating routine tasks such as backups, scaling, and updates. These operators are essential for ensuring that PostgreSQL runs efficiently on Kubernetes while reducing the operational burden on database administrators.


In addition to highlighting the key takeaways, it’s essential to provide deeper context and insights that enrich the listener’s understanding of the episode. By offering this added layer of information, we ensure that when you tune in, you’ll have a clearer grasp of the nuances behind the discussion. This approach enhances your engagement with the content and helps shed light on the reasoning and perspective behind the thoughtful questions posed by our host, Chris Engelbert. Ultimately, this allows for a more immersive and insightful listening experience.

Key Learnings

Q: How does Kubernetes scheduler work with PostgreSQL?

Kubernetes uses its scheduler to manage how and where PostgreSQL instances are deployed, ensuring optimal resource utilization. However, understanding the nuances of Kubernetes’ scheduling can help optimize PostgreSQL performance, especially in environments with fluctuating workloads.

simplyblock Insight: Leveraging simplyblock’s solution, users can integrate sophisticated monitoring and management tools with Kubernetes, allowing them to automate the scaling and scheduling of PostgreSQL workloads, thereby ensuring that database resources are efficiently utilized and downtime is minimized.

Q: What is the best experience of running PostgreSQL in Kubernetes?

The best experience comes from utilizing a Kubernetes operator like StackGres, which simplifies the deployment and management of PostgreSQL clusters. StackGres handles critical functions such as backups, monitoring, and high availability out of the box, providing a seamless experience for both seasoned DBAs and those new to PostgreSQL on Kubernetes.

simplyblock Insight: By using simplyblock’s Kubernetes-based solutions, you can further enhance your PostgreSQL deployments with features like dynamic scaling and automated failover, ensuring that your database remains resilient and performs optimally under varying loads.

Q: How does disk access latency impact PostgreSQL performance in Kubernetes?

Disk access latency is a significant factor in PostgreSQL performance, especially in Kubernetes environments where network storage is commonly used. While network storage offers flexibility, it typically has higher latency compared to local storage, which can slow down database operations. Optimizing storage configurations in Kubernetes is crucial to minimizing latency and maintaining high performance.

simplyblock Insight: simplyblock’s advanced storage solutions for Kubernetes can help mitigate these latency issues by providing optimized, low-latency storage options tailored specifically for PostgreSQL workloads, ensuring your database runs at peak efficiency.

Q: What are the advantages of clustering in PostgreSQL on Kubernetes?

Clustering PostgreSQL in Kubernetes offers several advantages, including improved fault tolerance, load balancing, and easier scaling. Kubernetes operators like StackGres enable automated clustering, which simplifies the process of setting up and managing a highly available PostgreSQL cluster.

simplyblock Insight: With simplyblock, you can easily deploy clustered PostgreSQL environments that automatically adjust to your workload demands, ensuring continuous availability and optimal performance across all nodes in your cluster.

Additional Nugget of Information

Q: What are the advantages of clustering in Postgres?

A: Clustering in PostgreSQL provides several benefits, including improved performance, high availability, and better fault tolerance. Clustering allows multiple database instances to work together, distributing the load and ensuring that if one node fails, others can take over without downtime. This setup is particularly advantageous for large-scale applications that require high availability and resilience. Clustering also enables better scalability, as you can add more nodes to handle increasing workloads, ensuring consistent performance as demand grows.

Conclusion

Deploying PostgreSQL on Kubernetes offers powerful capabilities but comes with challenges. Álvaro Hernández Tortosa highlights how StackGres simplifies this process, enhancing performance, ensuring high availability, and making PostgreSQL more accessible. With the right tools and insights, you can confidently manage PostgreSQL in a cloud-native environment.

Full Video Transcript

Chris Engelbert: Welcome to this week’s episode of Cloud Commute podcast by simplyblock. Today, I have another incredible guest, a really good friend, Álvaro Hernández from OnGres. He’s very big in the Postgres community. So hello, and welcome, Álvaro.

Álvaro Hernández Tortosa: Thank you very much, first of all, for having me here. It’s an honor.

Chris Engelbert: Maybe just start by introducing yourself, who you are, what you’ve done in the past, how you got here. Well, except me inviting you.

Álvaro Hernández Tortosa: OK, well, I don’t know how to describe myself, but I would say, first of all, I’m a big nerd, big fan of open source. And I’ve been working with Postgres, I don’t know, for more than 20 years, 24 years now. So I’m a big Postgres person. There’s someone out there in the community that says that if you say Postgres three times, I will pop up there. It’s kind of like Superman or Batman or these superheroes. No, I’m not a superhero. But anyway, professionally, I’m the founder and CEO of a company called OnGres. Guess what it means: On Postgres. So it’s pretty obvious what we do. Everything revolves around Postgres, but in reality, I love all kinds of technology. I’ve been working a lot with many other technologies. I know you from being a Java programmer, which is kind of my hobby. I love programming in my free time, which almost doesn’t exist, but I try to get some from time to time. And with everything related to technology in general, I’m also a big fan and supporter of open source. I have contributed and keep contributing a lot to open source. I also founded some open source communities. For example, I’m a Spaniard, I live in Spain, and I founded Debian Spain, an association, like, I don’t know, 20 years ago. More recently, I also founded a nonprofit foundation, also in Spain, called Fundación PostgreSQL. Again, guess what it does? And I try to engage a lot with the open source communities. We, by the way, are organizing a conference for those who are interested in Postgres on the magnificent island of Ibiza in the Mediterranean Sea, 9th to 11th of September this year, for those who want to join. So yeah, that’s probably a brief intro about myself.

Chris Engelbert: All right. So you are basically the Beetlejuice of Postgres. That’s what you’re saying.

Álvaro Hernández Tortosa: Beetlejuice, right. That’s a better fit than superheroes. You’re absolutely right.

Chris Engelbert: I’m not sure if he is a superhero, but he’s different at least. Anyway, you mentioned OnGres. And I know OnGres isn’t really like the first company. There were quite a few before, I think, El Toro, a database company.

Álvaro Hernández Tortosa: Yes, Toro DB.

Chris Engelbert: Oh, Toro DB. Sorry, close, close, very close. So what is up with that? You’re trying to do a lot of different things and seem to love trying new things, right?

Álvaro Hernández Tortosa: Yes. So I sometimes define myself as a 0.x serial entrepreneur, meaning that I’ve tried several ventures and sold none of them. But I’m still trying. I like to try to be resilient, and I keep pushing the ideas that I have in the back of my head. So yes, I’ve done several ventures, all of them around certain patterns. So for example, you’re asking about Toro DB. Toro DB is essentially open source software that is meant to replace MongoDB with, you guessed it, Postgres, right? There’s a certain pattern in my professional life. And Toro DB was. I’m speaking in the past because unfortunately it’s no longer a maintained open source project. We moved on to something else, which is OnGres. But the idea of Toro DB was to essentially replicate these documents live from MongoDB and, in the process, in real time, transform them into a set of relational tables that got stored inside a Postgres database. So it enabled you to do SQL queries on your documents that were in MongoDB. So think of it as a MongoDB replica. You can keep your MongoDB cluster if you want, and then you have all the data in SQL. This was great for analytics. You could get great speedups by normalizing data automatically and then doing queries with the power of SQL, which obviously is much broader and richer than MongoDB’s query language, especially for analytics. We got like 100 times faster on most queries. So it was an interesting project.

Chris Engelbert: So that means you basically generated the schema on the fly and then generated the table for that schema specifically? Interesting.

Álvaro Hernández Tortosa: Yeah, it was generating tables and columns on the fly.

OnGres StackGres: Operator for Production-Grade PostgreSQL on Kubernetes

Chris Engelbert: Right. Ok, interesting. So now you’re doing the OnGres thing. And OnGres has, I think, the main product, StackGres, as far as I know. Can you tell a little bit about that?

Álvaro Hernández Tortosa: Yes. So OnGres, as I said, means On Postgres. And one of our core beliefs at OnGres is that Postgres is a fantastic database. I don’t need to explain that to you, right? But it’s kind of like the Linux kernel, if I may use this parallel. It’s a bit bare bones. You need something around it. You need a distribution, right? Postgres is a little bit the same thing. The core is small, it’s fantastic, it’s very featureful, it’s reliable, it’s trustable. But it needs tools around it. So our vision at OnGres is to develop this ecosystem around the Postgres core, right? And one of the things that we’ve experienced during our professional lifetime is that Postgres requires a lot of tools around it. It needs monitoring, it needs backups, it needs high availability, it needs connection pooling.

By the way, do not use Postgres without connection pooling, right? So you need a lot of tools around it. And none of these tools come from the core. You need to look into the ecosystem. And actually, this is good and bad. It’s good because there are a lot of options. It’s bad because there are a lot of options: which one to choose, which one is good, which one is bad, which one goes well with a good backup solution or a good monitoring solution, and how you configure them all. So this is a problem that we coined the stack problem: when you really want to run Postgres in production, you need a stack on top of Postgres, right? To orchestrate all these components.
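On the pooling point: a server-side pooler such as PgBouncer is the usual production answer, but even a client-side pool shows the idea. A minimal sketch with psycopg2; the host and credentials are placeholders.

from psycopg2 import pool

# Reuse a small set of connections instead of paying Postgres'
# per-connection startup cost on every request.
pg_pool = pool.SimpleConnectionPool(
    minconn=2, maxconn=10,
    host="localhost", port=5432,
    user="app", password="secret", dbname="app",
)

conn = pg_pool.getconn()
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
finally:
    pg_pool.putconn(conn)  # return the connection for reuse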

Now, the problem is that we had been doing this for a long time for our customers. Typically, we love infrastructure as code, right? And everything was done with Terraform for the infrastructure and Ansible and similar tools for orchestrating these components. But the reality is that every environment we looked into was slightly different. We couldn’t just take our Ansible code, run it, and you’ve got the stack: the storage is different, your networking is different, your entry point is different. Here, one is using virtual IPs, that one is using DNS, that one is using proxies. And then the compute is also somehow different. It was not reusable. We were doing a lot of copy, paste, modify, which was not very sustainable. At some point, we started thinking: is there a way we can pack this stack into a single deployable unit that we can take essentially anywhere? And the answer was Kubernetes. Kubernetes provides the abstraction where we can abstract away the compute, the storage, the networking, and code against a programmable API, so that we can indeed create this package. So that’s StackGres.

StackGres is the stack of components you need to run production Postgres, packaged in a way that is uniform across any environment where you want to run it, cloud or on-prem, it doesn’t matter. And it is production ready! It’s packaged at a very, very high level, so basically you barely need, I would even say you don’t need, Postgres knowledge to run a production-ready, enterprise-quality Postgres cluster in production. And that’s the main goal of StackGres.

Chris Engelbert: Right, right. And as far as I know, I think it’s implemented as a Kubernetes operator, right?

Álvaro Hernández Tortosa: Yes, exactly.

Chris Engelbert: And there’s quite a few other operators as well. But I know that StackGres has some things which are done slightly differently. Can you talk a little bit about that? I don’t know how much you wanna actually make this public right now.

Álvaro Hernández Tortosa: No, actually everything is open source. Our roadmap is open source, our issues are open source. I’m happy to share everything. Well, first of all, what I would say is that the operator pattern is essentially the controllers that take actions on your cluster, plus the CRDs. We gave a lot of thought to these CRDs. For a lot of operators, the CRDs are kind of a byproduct, an afterthought: “I have my objects, and then some script generates the CRDs.” No. We said: the CRDs are our user-facing API, the CRDs are our extended API. And the goal of an operator is to abstract away and package business logic, right? And expose it with a simple user interface.

So we designed our CRDs to be very, very high level, very amenable to the user, so that, again, you don’t require any Postgres expertise. If you look at the CRDs, in practical terms the YAMLs, right, the YAMLs that you write to deploy something on StackGres, they should be so simple that you could explain them to your five-year-old kid, and your five-year-old kid should be able to deploy Postgres as a production-quality cluster. That’s our goal, and if we didn’t fulfill this goal, please raise an issue on our public issue tracker on GitLab, because we have definitely failed if that’s not true. Instead of focusing on the usual Postgres user, very knowledgeable, very high level, the way most operators do with low-level CRDs that require a lot of Postgres expertise, we want to make Postgres more mainstream than ever, right? Postgres increases in popularity every year and is being adopted by more and more organizations, but not everybody is a Postgres expert. We want to make Postgres universally accessible for everyone. So one of the things we did is put a lot of effort into this design. And instead of one big, gigantic CRD, we have multiple CRDs that can be related to each other, like in an ER diagram, so you understand the relationships: you create one and then reference it many times, and you don’t need to repeat or reconfigure the same configuration over and over.
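As a flavor of how high-level these CRDs are, a minimal cluster definition might look like the following sketch. The field names are recalled from the StackGres documentation and may differ between operator versions:

```yaml
# A minimal sketch of a StackGres cluster resource. Field names follow the
# StackGres docs as best recalled and may vary by operator version.
apiVersion: stackgres.io/v1
kind: SGCluster
metadata:
  name: demo
spec:
  instances: 3          # one primary plus two replicas, wired automatically
  postgres:
    version: '16'       # the Postgres major version to run
  pods:
    persistentVolume:
      size: '10Gi'      # storage request per instance
```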

Another area where I would say we have tried to do something special is extensions. Postgres extensions are one of the most loved, if not the most loved, Postgres feature, right? And StackGres is the operator that arguably supports the largest number of extensions, over 200 as of now and growing. We did this by developing a custom solution, which is also open source as part of StackGres, to load extensions dynamically into the cluster. So we don’t need to build you a fat container with 200 extensions and a lot of security issues, right? Rather, we deploy a container with no extensions, and then you say, “I want this, this, this and that,” and they appear in your cluster automatically. And this is done via simple YAML. So we have a very powerful extension mechanism. The other thing is that we not only expose the usual CRD YAML interface for interacting with StackGres, which is more than fine and I love it, but StackGres also comes with a fully fledged web console. Not everybody likes the command line or the GitOps approach; we do, but not everybody does. It’s a fully fledged web console which also supports single sign-on, so you can integrate it with your AD, with your OIDC provider, anything you want. It has detailed, fine-grained permissions based on Kubernetes RBAC, so you can say who can create clusters, who can view configurations, who can do what. And last but not least, there’s a REST API. If you prefer to automate and integrate with another kind of solution, you can also create and manage clusters via the REST API. And these three mechanisms, the YAML files or CRDs, the REST API, and the web console, are fully interchangeable: you can use one for one operation and another for the next, and everything goes back to the same state.
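Coming back to the extension mechanism for a moment, the declarative loading described above might look roughly like this inside the cluster spec; again a sketch based on the StackGres documentation, and the exact names and versions may differ:

```yaml
# A sketch of declaring extensions in the cluster spec; the operator then
# loads them dynamically into the running containers.
spec:
  postgres:
    version: '16'
    extensions:
      - name: timescaledb   # extensions appear in the cluster automatically
      - name: postgis
```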

And lately we have also added sharding. Sharding scales Postgres out with solutions like Citus, but we also support native Postgres partitioning combined with foreign data wrappers, as well as Apache ShardingSphere. Our approach is to create a cluster of multiple instances: not just one primary and one replica, but a coordinator layer and then the shards, where each shard is itself a small cluster with its own replica. So you end up with typically dozens of instances, and you can create them with a simple YAML file and a very high-level description; it requires minimal knowledge and wires everything up for you. It’s very, very convenient for keeping things simple.
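A sharded cluster follows the same high-level style. Here is a sketch of what such a resource might look like, with field names recalled from the StackGres documentation; the exact apiVersion and schema may differ by version:

```yaml
# A sketch of a StackGres sharded cluster: a coordinator layer plus shards,
# each shard itself a primary with replicas. Field names as best recalled.
apiVersion: stackgres.io/v1alpha1
kind: SGShardedCluster
metadata:
  name: demo-sharded
spec:
  type: citus               # the sharding technology backing the cluster
  database: demo
  postgres:
    version: '16'
  coordinator:
    instances: 2            # coordinator primary plus replica
  shards:
    clusters: 4             # four shards...
    instancesPerCluster: 2  # ...each with a primary and a replica
```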

Chris Engelbert: Right. So the plugin mechanism, or the extension mechanism, that was exactly what I was hinting at. It was mind-blowing; I’d never seen anything like it when you showed it last year in Ibiza. The other thing that is always a bit of a head-scratcher, I think, for a lot of people, is when they hear that a Kubernetes operator is actually written in Java. I think Red Hat built the original framework, and it was a Go library, so Java would probably not be the first choice for that. So how did that happen?

Álvaro Hernández Tortosa: Well, first of all, you’re right: the operator framework is written in Go, and there was nothing other than Go at the time. So we looked at that, but we had a team of very, very senior Java programmers, and none of them were Go programmers, right? I’ve seen in the Postgres community, and in other communities, that for people who are more in the DevOps world, switching to Go is a bit more natural, but at the same time they are not senior from a Go programming perspective. The same would have happened with our team: they would have switched from Java to Go, but they would not have been senior in Go, obviously, right? So it would have taken some time to develop those skills. On the other hand, we looked at what the technology behind an operator really is. An operator is essentially no more than an HTTP server that receives callbacks from Kubernetes, plus an HTTP client that makes calls to Kubernetes. And HTTP clients and servers can be written in any language. So we looked at the core, at how complicated it is and how much the operator framework actually brings you, and we saw that it was not that much.
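To illustrate the point that an operator’s callback side is just an HTTP server, here is a minimal, self-contained Java sketch using only the JDK. The port, the endpoint path, and the stub response are hypothetical simplifications; a real operator would deserialize the AdmissionReview payload and answer with a full AdmissionReview verdict:

```java
// A minimal sketch of the point above: the webhook side of an operator is
// just an HTTP server. Uses only the JDK; the port, path, and stub response
// are hypothetical simplifications of a real admission webhook.
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class MinimalOperatorEndpoint {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        // Kubernetes POSTs AdmissionReview JSON to a registered webhook path.
        server.createContext("/validate", exchange -> {
            // A real operator would deserialize the AdmissionReview here,
            // inspect the object, and reply with a full AdmissionReview
            // verdict; this stub just reads the body and allows everything.
            exchange.getRequestBody().readAllBytes();
            byte[] response = "{\"allowed\": true}".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, response.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(response);
            }
        });
        // The client half, calls to the Kubernetes API, is equally plain
        // HTTPS and can live in the same process.
        server.start();
    }
}
```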

And actually, something I mentioned before: with the operator framework, the CRDs are kind of generated from your structures, and we really wanted to do it the opposite way. It’s like with a database: you can use an ORM to map objects onto an existing schema that you designed with all your SQL capabilities, or you can just create objects and let the ORM generate the database. I prefer the former. So we did the same thing with the CRDs: we designed them first. So Java was more than okay for developing a Kubernetes operator, and our team was expert in Java. By doing it in Java, we were able to be very efficient and deliver a lot of value and a lot of features very, very fast, without having to retrain anyone or learn a new language and new skills. On top of this, there’s sometimes the concern that Java requires a JVM, which is kind of a heavy environment, right? It consumes memory, resources, and disk. But by default, StackGres uses a compilation technology called GraalVM, a whole project built around it, which generates native images that are indistinguishable from any other Linux binary on your system. We ship StackGres as native images. You can switch to JVM images if you prefer, we expose both, but the default is native images. So at the end of the day, StackGres is a several-megabyte Linux binary in a container, and that’s it.

Chris Engelbert: That makes sense. And I like that you pointed out that the efficiency of the existing developers was much more important than being cool and switching to a new language just because everyone does. So, we talked about the operator quite a bit. What are your general thoughts on databases in the cloud, or specifically in Kubernetes? What are the issues you see, the problems of running a database in such an environment?

Álvaro Hernández Tortosa: Well, it’s a wide topic, right? And I think one of the most interesting topics we’re seeing lately is the concern about cost and performance. So there’s kind of a trade-off, as usual: a trade-off between the convenience of “I want to run a database and almost forget about it,” which is why you switch to a cloud managed service. Which is not always true, by the way, because forgetting about it means that nobody’s going to look after your database: repack your tables, optimize your queries, check whether you have unused indexes. If you’re very small, that’s more than okay; you can assume you don’t need to touch your database. But if you grow over a certain level, you’re going to need DBAs anyway, at least for everything beyond the basic operations of the database, which are monitoring, high availability, and backups. Those are the three main areas that a managed service provides to you.

So there’s convenience, but then there’s an additional cost, and this additional cost is sometimes quite notable. It’s typically around an 80% premium on an (N+1)/N number of instances, because for many cloud services you need an extra instance, right? Multiply that by 1.8 and you end up paying two point something times as much in the usual case. So you’re overpaying, and you need to analyze whether the convenience is worth it for you, or whether you want something else. On the other hand, almost all cloud services use network disks. These network disks are very good and have improved a lot in performance over the last years, but they are still far from the performance of a local drive. Running databases on local drives has its own challenges, but they can be addressed, and you can really move the needle with this trend of, I don’t know if that’s the right term, self-hosting, if we could marry the simplicity and the convenience of managed services, right?

With the ability to run in any environment, and at a much higher performance, I think that’s an interesting trend right now and a good sweet spot. And Kubernetes, to tie together all the terms you mentioned in the question, is actually one driver towards this goal, because it gives us infrastructure independence and supports network disks and local disks equally well. It’s an enabler for this pattern, which I see trending more and more now, and one that we are definitely looking forward to.
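To make the premium arithmetic above concrete, with purely illustrative numbers: a self-managed setup of N = 4 instances becomes N + 1 = 5 billed instances on a managed service, each at a 1.8× unit price, so the total is (5/4) × 1.8 = 2.25 times the self-managed cost, the “two point something” of the usual case.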

Chris Engelbert: Right, I like that you pointed out that there are ways to address the local storage issues. And, shameless plug, we’re actually working on something.

Álvaro Hernández Tortosa: I heard something.

The Biggest Trend in Containers?

Chris Engelbert: Oh, you heard something. (laughing) All right, last question because we’re also running out of time. What do you see as the biggest trend right now in containers, cloud, whatever? What do you think is like the next big thing? And don’t say AI, everyone says that.

Álvaro Hernández Tortosa: Oh, no. Well, you know what? Let me do a shameless plug here, right?

Chris Engelbert: All right. I did one. (laughing)

Álvaro Hernández Tortosa: So there’s a technology we’re working on right now that works for our use case, but will also work for many other use cases, which is what we’re calling dynamic containers. Containers are conceived as something static, right? You build a container image with your Dockerfile, or whatever you use, and that image is static. It is what it is. It contains the layers that you specified, and that’s all. But look at any repository on Docker Hub: there are plenty of tags. Take Postgres, for example. There’s Postgres based on Debian, there’s Postgres based on Alpine, there’s Postgres with this option, then you want this extension, then you want this other extension. There’s a whole variety of images, and each of those images needs to be built, maintained, and updated independently, right? But they’re very orthogonal: upgrading the Debian base OS has nothing to do with the Postgres layer, which has nothing to do with the Timescale extension, which has nothing to do with whether I want the debug symbols or not. So we’re working on technology with the goal of letting a user express any combination of items they want in their container and get that container image, without having to build and maintain an image for the specific parameters they want.

Chris Engelbert: Right, and let me guess, that is how the Postgres extension stuff works.

Álvaro Hernández Tortosa: It is meant to be, at first, a solution for the Postgres extensions, but it’s actually quite broad and quite general, right? For example, I was recently discussing with some folks from the OpenTelemetry community about the OpenTelemetry Collector, which is the router for signals in the OpenTelemetry world. It has the same architecture, with around 200 plugins, right? And you don’t want a container image with those 200 plugins, many of which come from third parties and potentially have security vulnerabilities. And even when there’s an update, you don’t want to update all of them and restart your containers and all that, right? So why don’t you get a container image with the OpenTelemetry Collector with just this receiver and this exporter? So that’s probably even more applicable there.

Chris Engelbert: Yeah, I think that makes sense. I think that is a really good idea, especially because the original idea behind static containers was that being static gives you some consistency and some security about what the container looks like, but we figured out over time that that is not the best solution. So I’m really looking forward to this becoming a more general thing.

Álvaro Hernández Tortosa: To be honest, I call them dynamic containers, but in reality, from a user perspective, they’re as static as before. They are dynamic from the registry’s perspective.

Chris Engelbert: Right, okay, fair enough. All right, thank you very much. It was a pleasure, as always, talking to you. And to everyone else: see, hear, or read you next week with my next guest. And thank you, Álvaro, for being here. It was appreciated, as always.

Álvaro Hernández Tortosa: Thank you very much.

The post Production-grade Kubernetes PostgreSQL, Álvaro Hernández appeared first on simplyblock.
