Database Archives | simplyblock
https://www.simplyblock.io/blog/tags/database/

NVMe & Kubernetes: Future-Proof Infrastructure
https://www.simplyblock.io/blog/nvme-kubernetes-future-proof-infrastructure/ | Wed, 27 Nov 2024

The marriage of NVMe storage and Kubernetes persistent volumes represents a perfect union of high-performance storage and modern container orchestration. As organizations increasingly move performance-critical workloads to Kubernetes, understanding how to leverage NVMe technology becomes crucial for achieving optimal performance and efficiency.

The Evolution of Storage in Kubernetes

When Kubernetes was created over 10 years ago, its only purpose was to schedule and orchestrate stateless workloads. Since then, a lot has changed, and Kubernetes is increasingly used for stateful workloads: not just basic ones, but mission-critical workloads such as a company’s primary databases. The promise of workload orchestration in increasingly complex infrastructures is simply too significant to ignore.

Traditional Kubernetes storage solutions, however, relied (and often still rely) on old network-attached storage protocols like iSCSI. Released in 2000, iSCSI was built on the SCSI protocol, itself first introduced in the 1980s. Both protocols were inherently designed for spinning disks with much higher seek times and access latencies; measured against modern expectations of low latency and low complexity, they simply can’t keep up.

While these solutions worked well for basic containerized applications, they fall short for high-performance workloads like databases, AI/ML training, and real-time analytics. Let’s look at the NVMe standard, particularly NVMe over TCP, which has transformed our thinking about storage in containerized environments, not just Kubernetes.

Why NVMe and Kubernetes Work So Well Together

The beauty of this combination lies in their complementary architectures. The NVMe protocol and command set were designed from the ground up for parallel, low-latency operations–precisely what modern containerized applications demand. When you combine NVMe’s parallelism with Kubernetes’ orchestration capabilities, you get a system that can efficiently distribute I/O-intensive workloads while maintaining microsecond-level latency. Furthermore, comparing NVMe over TCP with iSCSI shows significant improvements in both IOPS and latency.

Consider a typical database workload on Kubernetes. Traditional storage might introduce latencies of 2-4ms for read operations. With NVMe over TCP, these same operations complete in under 200 microseconds–a 10-20x improvement. This isn’t just about raw speed; it’s about enabling new classes of applications to run effectively in containerized environments.

The Technical Symphony

The integration of NVMe with Kubernetes is particularly elegant through persistent volumes and the Container Storage Interface (CSI). Modern storage orchestrators like simplyblock leverage this interface to provide seamless NVMe storage provisioning while maintaining Kubernetes’ declarative model. This means development teams can request high-performance storage using familiar Kubernetes constructs while the underlying system handles the complexity of NVMe management, providing fully reliable shared storage.
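
As a minimal sketch of what this looks like in practice, the snippet below defines a storage class backed by an NVMe-capable CSI driver and a persistent volume claim against it, using the official Kubernetes Python client. The provisioner name, parameters, and sizes are placeholders for illustration, not simplyblock’s actual CSI identifiers.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

# A storage class backed by a hypothetical NVMe-capable CSI driver.
nvme_class = client.V1StorageClass(
    metadata=client.V1ObjectMeta(name="nvme-fast"),
    provisioner="nvme.csi.example.com",             # placeholder provisioner name
    parameters={"tier": "nvme", "fsType": "ext4"},  # driver-specific, assumed
    allow_volume_expansion=True,
    volume_binding_mode="WaitForFirstConsumer",
)
client.StorageV1Api().create_storage_class(nvme_class)

# Developers then request high-performance storage declaratively via a PVC.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="postgres-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="nvme-fast",
        resources=client.V1ResourceRequirements(requests={"storage": "100Gi"}),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim("default", pvc)
```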

The NVMe Impact: A Real-World Example

But what does that mean for actual workloads? Our friends over at Percona found in their MongoDB Performance on Kubernetes report that Kubernetes implies no performance penalty. Hence, we can look at the disks’ actual raw performance.

A team of researchers from the University of Southern California, San Jose State University, and Samsung Semiconductor took on the challenge of measuring the implications of NVMe SSDs (over SATA SSD and SATA HDD) for real-world database performance.

The general performance characteristics of their test hardware:

| Metric            | NVMe SSD | SATA SSD | SATA HDD |
| Access latency    | 113µs    | 125µs    | 14,295µs |
| Maximum IOPS      | 750,000  | 70,000   | 190      |
| Maximum bandwidth | 3GB/s    | 278MB/s  | 791KB/s  |
Table 1: General performance characteristics of the different storage types

Their summary states, “scale-out systems are driving the need for high-performance storage solutions with high available bandwidth and lower access latencies. To address this need, newer standards are being developed exclusively for non-volatile storage devices like SSDs,” and “NVMe’s hardware and software redesign of the storage subsystem translates into real-world benefits.”

They close with direct comparisons claiming an 8x performance improvement for NVMe-based SSDs over a single SATA-based SSD, and still a 5x improvement over a [Hardware] RAID-0 of four SATA-based SSDs.

Transforming Database Operations

Perhaps the most compelling use case for NVMe in Kubernetes is database operations. Typical modern databases process queries significantly faster when storage isn’t the bottleneck. This becomes particularly important in microservices architectures where concurrent database requests and high-load scenarios are the norm.

Traditionally, running stateful services in Kubernetes meant accepting significant performance overhead. With NVMe storage, organizations can now run high-performance databases, caches, and messaging systems with bare-metal-like performance in their Kubernetes clusters.

Dynamic Resource Allocation

One of Kubernetes’ central promises is dynamic resource allocation. That means assigning CPU and memory according to actual application requirements. Furthermore, it also means dynamically allocating storage for stateful workloads. With storage classes, Kubernetes provides the option to assign different types of storage backends to different types of applications. While not strictly necessary, this can be a great application of the “best tool for the job” principle.

That said, for IO-intensive workloads, such as databases, a storage backend providing NVMe storage is essential. NVMe’s ability to handle massive I/O parallelism aligns perfectly with Kubernetes’ scheduling capabilities. Storage resources can be dynamically allocated and deallocated based on workload demands, ensuring optimal resource utilization while maintaining performance guarantees.

Simplified High Availability

The low latency of NVMe over TCP enables new approaches to high availability. Instead of complex database replication schemes, organizations can leverage storage-level replication (or more storage-efficient erasure coding, like in the case of simplyblock) with a negligible performance impact. This significantly simplifies application architecture while improving reliability.

Furthermore, NVMe over TCP utilizes multipathing as an automatic fail-over implementation to protect against network connection issues and sudden connection drops, increasing the high availability of persistent volumes in Kubernetes.

The Physics Behind NVMe Performance

Many teams don’t realize how profoundly storage physics impacts database operations. Traditional storage solutions averaging 2-4ms latency might seem fast, but this translates to a hard limit of about 80 consistent transactions per second, even before considering CPU or network overhead. Each transaction requires multiple storage operations: reading data pages, writing to WAL, updating indexes, and performing one or more fsync() operations. At 3ms per operation, these quickly stack up into significant delays. Many teams spend weeks optimizing queries or adding memory when their real bottleneck is fundamental storage latency.

This is where the NVMe and Kubernetes combination truly shines. With NVMe as your Kubernetes persistent volume storage backend, providing sub-200μs latency, the same database operations can theoretically support over 1,200 transactions per second–a 15x improvement. More importantly, this dramatic reduction in storage latency changes how databases behave under load. Connection pools remain healthy longer, buffer cache decisions become more efficient, and query planners can make better optimization choices. With the storage bottleneck removed, databases can finally operate closer to their theoretical performance limits.
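
A quick back-of-the-envelope model shows where these figures come from. The assumption of four serialized storage operations per transaction is ours, chosen to reproduce the numbers above; real transaction profiles vary.

```python
def max_serial_tps(op_latency_s: float, ops_per_txn: int = 4) -> float:
    """Upper bound on transactions per second when each transaction performs
    ops_per_txn storage operations one after another."""
    return 1.0 / (op_latency_s * ops_per_txn)

print(round(max_serial_tps(0.003)))    # ~83 TPS at 3 ms per storage operation
print(round(max_serial_tps(0.0002)))   # ~1250 TPS at 200 µs per storage operation
```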

Looking Ahead

The combination of NVMe and Kubernetes is just beginning to show its potential. As more organizations move performance-critical workloads to Kubernetes, we’ll likely see new patterns and use cases that fully take advantage of this powerful combination.

Some areas to watch:

  • AI/ML workload optimization through intelligent data placement
  • Real-time analytics platforms leveraging NVMe’s parallel access capabilities
  • Next-generation database architectures built specifically for NVMe on Kubernetes Persistent Volumes

The marriage of NVMe-based storage and Kubernetes Persistent Volumes represents more than just a performance improvement. It’s a fundamental shift in how we think about storage for containerized environments. Organizations that understand and leverage this combination effectively gain a significant competitive advantage through improved performance, reduced complexity, and better resource utilization.

For a deeper dive into implementing NVMe storage in Kubernetes, visit our guide on optimizing Kubernetes storage performance.

Best Open Source Tools for Oracle Database
https://www.simplyblock.io/blog/best-open-source-tools-for-oracle-database/ | Thu, 24 Oct 2024

What are the best open-source tools for your Oracle Database setup?

Oracle Database is a robust and widely-used relational database management system that powers critical applications across industries. Despite being a commercial product, the Oracle ecosystem has grown to include numerous open-source tools that can enhance the performance, management, and development of Oracle databases. These tools provide a cost-effective way to optimize your Oracle environment, ensuring efficiency and reliability. As Oracle Database continues to be a preferred choice for enterprise solutions, the demand for complementary tools has increased. In this post, we will explore nine essential open-source tools that can help you get the most out of your Oracle Database.

1. Oracle SQL Developer

Oracle SQL Developer is a free integrated development environment (IDE) that simplifies the development and management of Oracle databases. It provides tools for SQL and PL/SQL development, database administration, and data modeling. With its intuitive interface, SQL Developer allows you to query, script, and manage your Oracle databases with ease.

2. Liquibase

Liquibase is an open-source database schema change management tool that supports Oracle Database. It helps automate database deployments, making it easier to track, version, and deploy database changes. Liquibase integrates seamlessly into CI/CD pipelines, ensuring that your Oracle databases are always in sync with your application code.

3. ODAT (Oracle Database Attack Tool)

ODAT is an open-source penetration testing tool designed specifically for Oracle databases. It allows security professionals to audit and assess the security of their Oracle environments by simulating attacks. ODAT is essential for identifying vulnerabilities and ensuring that your Oracle databases are secure from potential threats.

4. Ora2Pg

Ora2Pg is an open-source tool that helps migrate Oracle databases to PostgreSQL. It is highly customizable and supports the conversion of complex Oracle database structures to PostgreSQL. Ora2Pg simplifies the migration process, reducing the time and effort required to move from Oracle to an open-source environment.

5. DBeaver

DBeaver is an open-source database management tool that supports a wide range of databases, including Oracle. It offers a powerful SQL editor, visual query builder, and data modeling tools, all within a single interface. DBeaver is ideal for developers and database administrators who need to manage and analyze their Oracle databases efficiently.

6. Flyway

Flyway is an open-source database migration tool that supports Oracle Database. It provides version control for your database schema changes, making it easier to track and deploy updates. Flyway is lightweight and integrates with various CI/CD tools, ensuring smooth database migrations in your development workflow.

7. Ora2ELK

Ora2ELK is an open-source tool that integrates Oracle Database with the ELK stack (Elasticsearch, Logstash, Kibana) for real-time log analysis and monitoring. It helps you extract, transform, and load (ETL) data from Oracle to Elasticsearch, enabling advanced analytics and visualization with Kibana. Ora2ELK is essential for gaining insights into your Oracle database performance and usage.

8. TOra

TOra is an open-source GUI tool for managing Oracle databases. It provides a rich set of features, including database browsing, SQL scripting, and performance monitoring. TOra’s user-friendly interface makes it easy for developers and DBAs to interact with their Oracle databases, improving productivity and reducing the learning curve.

9. Allround Automations PL/SQL Developer

While not entirely open-source, PL/SQL Developer offers a free version that provides essential tools for Oracle database development. It supports PL/SQL programming with features like syntax highlighting, debugging, and integrated documentation. PL/SQL Developer is widely used by Oracle developers to streamline their coding and testing processes.

Figure: Key facts about the Oracle Database ecosystem and the best open-source tools for Oracle Database

How to Optimize Oracle Storage with Open-source Tools

This guide explored nine essential open-source tools for Oracle Database, from SQL Developer’s IDE capabilities to TOra’s management features. While these tools excel at different aspects – Liquibase for schema management, Flyway for migrations, and DBeaver for querying – proper implementation is crucial. Tools like ODAT provide security testing, while Ora2ELK enables advanced monitoring. Each tool offers unique approaches to managing and optimizing Oracle deployments.

Why Choose simplyblock for Oracle Database?

While Oracle Database provides robust data management capabilities, optimizing storage costs and ensuring efficient management of tablespaces and redo logs is crucial. This is where simplyblock’s intelligent storage optimization creates unique value:

Optimized Oracle Storage Management:

Simplyblock enhances Oracle’s storage efficiency through sophisticated volume management. Using thin provisioning and automatic tiering, simplyblock optimizes storage utilization for Oracle’s tablespaces, redo logs, and archive logs. Frequently accessed tablespaces benefit from ultra-low latency NVMe storage, while less active segments automatically move to cost-effective S3 storage. This is particularly valuable for managing Oracle’s archived redo logs and historical tablespaces, where storage costs can accumulate rapidly.

Oracle Performance Enhancement:

Simplyblock streamlines Oracle’s I/O operations through its unified storage pool approach. By leveraging NVMe over TCP and local instance storage caching, simplyblock provides exceptional performance for Oracle’s write-ahead logging and random read operations. The platform’s ability to pool EBS volumes ensures consistent I/O performance across all database files, while its multi-attach capabilities enable seamless Oracle RAC operations. This architecture is especially beneficial for organizations running demanding OLTP workloads that require high IOPS and low latency.

Enterprise-Grade Oracle Protection:

Simplyblock strengthens Oracle’s data protection through advanced backup and disaster recovery features. The platform’s consistent snapshot capability ensures that backups maintain consistency across all database files, including control files, data files, and redo logs. By streaming write-ahead logs to S3, simplyblock provides near-zero RPO disaster recovery without impacting database performance. This approach is particularly valuable for organizations requiring rapid recovery capabilities while maintaining strict data consistency, as simplyblock can coordinate recoveries across multiple database instances and auxiliary services.

If you’re looking to further streamline your Oracle operations, simplyblock offers comprehensive solutions that integrate seamlessly with these tools, helping you get the most out of your Oracle Database environment.

Ready to take your Oracle Database management to the next level? Contact simplyblock today to learn how we can help you simplify and enhance your Oracle journey.

NVMe Storage for Database Optimization: Lessons from Tech Giants
https://www.simplyblock.io/blog/nvme-database-optimization/ | Thu, 17 Oct 2024

Leveraging NVMe-based storage for databases brings a whole new set of capabilities and performance optimization opportunities. In this blog, we explore how you can adopt NVMe storage for your database workloads, with case studies from tech giants such as Pinterest and Discord.

Database Scalability Challenges in the Age of NVMe

In 2024, data-driven organizations increasingly recognize the crucial importance of adopting NVMe storage solutions to stay competitive. With NVMe adoption still below 30%, there’s significant room for growth as companies seek to optimize their database performance and storage efficiency. We’ve looked at how major tech companies have tackled database optimization and scalability challenges, often turning to self-hosted database solutions and NVMe storage.

While it’s interesting to see what Netflix or Pinterest engineers are investing their efforts into, it is also essential to ask yourself how your organization is adopting new technologies. As companies grow and their data needs expand, traditional database setups often struggle to keep up. Let’s look at some examples of how some of the major tech players have addressed these challenges.

Pinterest’s Journey to Horizontal Database Scalability with TiDB

Pinterest, which handles billions of pins and user interactions, faced significant challenges with its HBase setup as it scaled. As their business grew, HBase struggled to keep up with evolving needs, prompting a search for a more scalable database solution. They eventually decided to go with TiDB as it provided the best performance under load.

Selection Process:

  • Evaluated multiple options, including RocksDB, ShardDB, Vitess, VoltDB, Phoenix, Spanner, CosmosDB, Aurora, TiDB, YugabyteDB, and DB-X.
  • Narrowed down to TiDB, YugabyteDB, and DB-X for final testing.

Evaluation:

  • Conducted shadow traffic testing with production workloads.
  • TiDB performed well after tuning, providing sustained performance under load.

TiDB Adoption:

  • Deployed 20+ TiDB clusters in production.
  • Stores over 200+ TB of data across 400+ nodes.
  • Primarily uses TiDB 2.1 in production, with plans to migrate to 3.0.

Key Benefits:

  • Improved query performance, with 2-10x improvements in p99 latency.
  • More predictable performance with fewer spikes.
  • Reduced infrastructure costs by about 50%.
  • Enabled new product use cases due to improved database performance.

Challenges and Learnings:

  • Encountered issues like TiCDC throughput limitations and slow data movement during backups.
  • Worked closely with PingCAP to address these issues and improve the product.

Future Plans:

  • Exploring multi-region setups.
  • Considering removing Envoy as a proxy to the SQL layer for better connection control.
  • Exploring migrating to Graviton instance types for a better price-performance ratio and EBS for faster data movement (and, in turn, shorter MTTR on node failures).

Uber’s Approach to Scaling Datastores with NVMe

Uber, facing exponential growth in active users and ride volumes, needed a robust solution for their datastore “Docstore” challenges.

Hosting Environment and Limitations:

  • Initially on AWS, later migrated to hybrid cloud and on-premises infrastructure
  • Uber’s massive scale and need for customization exceeded the capabilities of managed database services

Uber’s Solution: Schemaless and MySQL with NVMe

  • Schemaless: A custom solution built on top of MySQL
  • Sharding: Implemented application-level sharding for horizontal scalability
  • Replication: Used MySQL replication for high availability
  • NVMe storage: Leveraged NVMe disks for improved I/O performance

Results:

  • Able to handle over 100 billion queries per day
  • Significantly reduced latency for read and write operations
  • Improved operational simplicity compared to Cassandra

Discord’s Storage Evolution and NVMe Adoption

Discord, facing rapid growth in user base and message volume, needed a scalable and performant storage solution.

Hosting Environment and Limitations:

  • Google Cloud Platform (GCP)
  • Discord’s specific performance requirements and need for customization led them to self-manage their database infrastructure

Discord’s storage evolution:

  1. MongoDB: Initially used for its flexibility, but faced scalability issues
  2. Cassandra: Adopted for better scalability but encountered performance and maintenance challenges
  3. ScyllaDB: Finally settled on ScyllaDB for its performance and compatibility with Cassandra

Discord also created their own solution, “superdisk”: a RAID0 on top of the local SSDs, and a RAID1 between the Persistent Disk and the RAID0 array. This let them configure the database with a disk drive offering low-latency reads while still benefiting from the best properties of Persistent Disks. One can think of it as a “simplyblock v0.1”.

Figure 1: Discord’s “superdisk” architecture

Key improvements with ScyllaDB:

  • Reduced P99 latencies from 40-125ms to 15ms for read operations
  • Improved write performance, with P99 latencies dropping from 5-70ms to a consistent 5ms
  • Better resource utilization, allowing Discord to reduce their cluster size from 177 Cassandra nodes to just 72 ScyllaDB nodes

Summary of Case Studies

In the table below, we can see a summary of the key initiatives taken by these tech giants and their respective outcomes. Notably, all of the companies self-host their databases (on Kubernetes or on bare-metal servers) and have leveraged local SSD (NVMe) storage for improved read/write performance and lower latency. At the same time, they all had to work around the data protection and scalability limitations of local disks. Discord, for example, uses RAID to mirror the disk, which causes significant storage overhead. Such an approach also doesn’t offer a logical management layer (i.e. “storage/disk virtualization”). In the next paragraphs, let’s explore how simplyblock adds even more performance, scalability, and resource efficiency to such setups.

| Company   | Database | Hosting environment                               | Key initiative                                                  |
| Pinterest | TiDB     | AWS EC2 & Kubernetes, local NVMe disk             | Improved performance & scalability                              |
| Uber      | MySQL    | Bare metal, NVMe storage                          | Reduced read/write latency, improved scalability                |
| Discord   | ScyllaDB | Google Cloud, local NVMe disk with RAID mirroring | Reduced latency, improved performance and resource utilization  |

The Role of Intelligent Storage Optimization in NVMe-Based Systems

While these case studies demonstrate the power of NVMe and optimized database solutions, there’s still room for improvement. This is where intelligent storage optimization solutions like simplyblock are spearheading market changes.

Simplyblock vs. Local NVMe SSD: Enhancing Database Scalability

While local NVMe disks offer impressive performance, simplyblock provides several critical advantages for database scalability. Simplyblock builds a persistent layer out of local NVMe disks, which means it is neither just a cache nor just ephemeral storage. Let’s explore the benefits of simplyblock over local NVMe disks:

  1. Scalability: Unlike local NVMe storage, simplyblock offers dynamic scalability, allowing storage to grow or shrink as needed. Simplyblock can scale performance and capacity beyond the local node’s disk size, significantly improving tail latency.
  2. Reliability: Data on local NVMe is lost if an instance is stopped or terminated. Simplyblock provides advanced data protection that survives instance outages.
  3. High Availability: Local NVMe loses data availability during the node outage. Simplyblock ensures storage remains fully available even if a compute instance fails.
  4. Data Protection Efficiency: Simplyblock uses erasure coding (parity information) instead of triple replication, reducing network load and improving effective-to-raw storage ratios by about 150% (for a given amount of NVMe disk, there is 150% more usable storage with simplyblock); see the capacity sketch after this list.
  5. Predictable Performance: As IOPS demand increases, local NVMe access latency rises, often causing a significant increase in tail latencies (p99 latency). Simplyblock maintains constant access latencies at scale, improving both median and p99 access latency. Simplyblock also allows for much faster writes at high IOPS because it doesn’t use the NVMe layer as a write-through cache, so its performance isn’t dependent on a backing persistent storage layer (e.g., S3).
  6. Maintainability: Upgrading compute instances impacts local NVMe storage. With simplyblock, compute instances can be maintained without affecting storage.
  7. Data Services: Simplyblock provides advanced data services like snapshots, cloning, resizing, and compression without significant overhead on CPU performance or access latency.
  8. Intelligent Tiering: Simplyblock automatically moves infrequently accessed data to cheaper S3 storage, a feature unavailable with local NVMe.
  9. Thin Provisioning: This allows for more efficient use of storage resources, reducing overprovisioning common in cloud environments.
  10. Multi-attach Capability: Simplyblock enables multiple nodes to access the same volume, which is useful for high-availability setups without data duplication. Additionally, multi-attach can decrease the complexity of volume management and data synchronization.
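
To illustrate the data protection efficiency point (item 4), here is a small capacity sketch. The concrete erasure-coding layout (five data chunks plus one parity chunk) is an assumption chosen for illustration, not simplyblock’s published scheme.

```python
def usable_with_replication(raw_tb: float, replicas: int = 3) -> float:
    """Usable capacity when every block is stored `replicas` times."""
    return raw_tb / replicas

def usable_with_erasure_coding(raw_tb: float, data_chunks: int, parity_chunks: int) -> float:
    """Usable capacity when blocks are split into data + parity chunks."""
    return raw_tb * data_chunks / (data_chunks + parity_chunks)

raw = 12.0                                              # TB of raw NVMe across the cluster
replicated = usable_with_replication(raw)               # 4.0 TB usable with 3x replication
erasure_coded = usable_with_erasure_coding(raw, 5, 1)   # 10.0 TB usable with an assumed 5+1 scheme

print(f"{(erasure_coded / replicated - 1):.0%} more usable storage")  # 150% more
```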

Technical Deep Dive: Simplyblock’s Architecture

Simplyblock’s architecture is designed to maximize the benefits of NVMe while addressing common cloud storage challenges:

  1. NVMe-oF (NVMe over Fabrics) Interface: Exposes storage as NVMe volumes, allowing for seamless integration with existing systems while providing the low-latency benefits of NVMe.
  2. Distributed Data Plane: Uses a statistical placement algorithm to distribute data across nodes, balancing performance and reliability.
  3. Logical Volume Management: Supports thin provisioning, instant resizing, and copy-on-write clones, providing flexibility for database operations.
  4. Asynchronous Replication: Utilizes a block-storage-level write-ahead log (WAL) that’s asynchronously replicated to object storage, enabling disaster recovery with near-zero RPO (Recovery Point Objective); a conceptual sketch follows this list.
  5. CSI Driver: Provides seamless integration with Kubernetes, allowing for dynamic provisioning and lifecycle management of volumes.
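
The following is a conceptual sketch of the asynchronous write-ahead-log idea from item 4, not simplyblock’s actual implementation: block writes are appended to a local log, and closed segments are shipped to object storage by a background worker, so acknowledgments never wait for the slower tier. The bucket name and segment size are assumptions.

```python
import queue
import threading

import boto3  # assumes AWS credentials are configured in the environment

BUCKET = "example-wal-archive"    # hypothetical bucket name
SEGMENT_SIZE = 16 * 1024 * 1024   # 16 MiB segments, an arbitrary choice

s3 = boto3.client("s3")
closed_segments: "queue.Queue[tuple[str, bytes]]" = queue.Queue()

def uploader() -> None:
    """Background worker: replicate closed WAL segments to object storage."""
    while True:
        key, payload = closed_segments.get()
        s3.put_object(Bucket=BUCKET, Key=f"wal/{key}", Body=payload)
        closed_segments.task_done()

threading.Thread(target=uploader, daemon=True).start()

wal = bytearray()
segment_no = 0

def append_write(offset: int, data: bytes) -> None:
    """Record a block write in the local WAL; hand off the segment once full.
    The acknowledgment to the client does not wait for the upload."""
    global segment_no
    wal.extend(offset.to_bytes(8, "big") + len(data).to_bytes(4, "big") + data)
    if len(wal) >= SEGMENT_SIZE:
        closed_segments.put((f"{segment_no:016d}.seg", bytes(wal)))
        segment_no += 1
        wal.clear()
```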

Below is a short overview of simplyblock’s high-level architecture in the context of PostgreSQL, MySQL, or Redis instances hosted in Kubernetes. Simplyblock creates a clustered shared pool out of local NVMe storage attached to Kubernetes compute worker nodes (storage is persistent, protected by erasure coding), serving database instances with the performance of local disk but with an option to scale out into other nodes (which can be either other compute nodes or separate, disaggregated, storage nodes). Further, the “colder” data is tiered into cheaper storage pools, such as HDD pools or object storage.

Figure 2: Simplified simplyblock architecture

Applying Simplyblock to Real-World Scenarios

Let’s explore how simplyblock could enhance the setups of the companies we’ve discussed:

Pinterest and TiDB with simplyblock

While TiDB solved Pinterest’s scalability issues, and they are exploring Graviton instances and EBS for a better price-performance ratio and faster data movement, simplyblock could potentially offer additional benefits:

  1. Price/Performance Enhancement: Simplyblock’s storage orchestration could complement Pinterest’s move to Graviton instances, potentially amplifying the price-performance benefits. By intelligently managing storage across different tiers (including EBS and local NVMe), simplyblock could help optimize storage costs while maintaining or even improving performance.
  2. MTTR Improvement & Faster Data Movements: In line with Pinterest’s goal of faster data movement and reduced Mean Time To Recovery (MTTR), simplyblock’s advanced data management capabilities could further accelerate these processes. Its efficient data protection with erasure coding and multi-attach capabilities helps with smooth failovers or node failures without performance degradation. If a node fails, simplyblock can quickly and autonomously rebuild the data on another node using parity information provided by erasure coding, eliminating downtime.
  3. Better Scalability through Disaggregation: Simplyblock’s architecture allows for the disaggregation of storage and compute, which aligns well with Pinterest’s exploration of different instance types and storage options. This separation would provide Pinterest with greater flexibility in scaling their storage and compute resources independently, potentially leading to more efficient resource utilization and easier capacity planning.
Figure 3: Simplyblock’s multi-attach functionality visualized

Uber’s Schemaless

While Uber’s custom Schemaless solution on MySQL with NVMe storage is highly optimized, simplyblock could still offer benefits:

  1. Unified Storage Interface: Simplyblock could provide a consistent interface across Uber’s diverse storage needs, simplifying operations.
  2. Intelligent Data Placement: For Uber’s time-series data (like ride information), simplyblock’s tiering could automatically optimize data placement based on age and access patterns.
  3. Enhanced Disaster Recovery: Simplyblock’s asynchronous replication to S3 could complement Uber’s existing replication strategies, potentially improving RPO.

Discord and ScyllaDB

Discord’s move to ScyllaDB already provided significant performance improvements, but simplyblock could further enhance their setup:

  1. NVMe Resource Pooling: By pooling NVMe resources across nodes, simplyblock would allow Discord to further reduce their node count while maintaining performance.
  2. Cost-Efficient Scaling: For Discord’s rapidly growing data needs, simplyblock’s intelligent tiering could help manage costs as data volumes expand.
  3. Simplified Cloning for Testing: Simplyblock’s instant cloning feature could be valuable for Discord’s development and testing processes. It allows for quick replication of production data without additional storage overhead.

What’s next in the NVMe Storage Landscape?

The case studies from Pinterest, Uber, and Discord highlight the importance of continuous innovation in database and storage technologies. These companies have pushed beyond the limitations of managed services like Amazon RDS to create custom, high-performance solutions often built on NVMe storage.

However, the introduction of intelligent storage optimization solutions like simplyblock represents the next frontier in this evolution. By providing an innovative layer of abstraction over diverse storage types, implementing smart data placement strategies, and offering features like thin provisioning and instant cloning alongside tight integration with Kubernetes, simplyblock spearheads market changes in how companies approach storage optimization.

As data continues to grow exponentially and performance demands increase, the ability to intelligently manage and optimize NVMe storage will become ever more critical. Solutions that can seamlessly integrate with existing infrastructure while providing advanced features for performance, cost optimization, and disaster recovery will be key to helping companies navigate the challenges of the data-driven future.

The trend towards NVMe adoption, coupled with intelligent storage solutions like simplyblock, is set to reshape the database infrastructure landscape. Companies that embrace these technologies early will be well-positioned to handle the data challenges of tomorrow, gaining a significant competitive advantage in their respective markets.

RDS vs. EKS: The True Cost of Database Management
https://www.simplyblock.io/blog/rds-vs-eks/ | Thu, 12 Sep 2024

Databases can make up a significant portion of the costs for a variety of businesses and enterprises, particularly in the SaaS, Fintech, and E-commerce & Retail verticals. Choosing the right database management solution can make or break your business margins. But have you ever wondered about the true cost of your database management? Is your current solution really as cost-effective as you think? Let’s dive deep into the world of database management and uncover the hidden expenses that might be eating away at your bottom line.

The Database Dilemma: Managed Services or Self-Managed?

The first crucial decision comes when choosing the operating model for your databases: should you opt for managed services like AWS RDS or take the reins yourself with a self-managed solution on Kubernetes? It’s not just about the upfront costs – there’s a whole iceberg of expenses lurking beneath the surface.

The Allure of Managed Services

At first glance, managed services like AWS RDS seem to be a no-brainer. They promise hassle-free management, automatic updates, and round-the-clock support. But is it really as rosy as it seems?

The Visible Costs

  1. Subscription Fees: You’re paying for the convenience, and it doesn’t come cheap.
  2. Storage Costs: Every gigabyte counts, and it adds up quickly.
  3. Data Transfer Fees: Moving data in and out? Be prepared to open your wallet.

The Hidden Expenses

  1. Overprovisioning: Are you paying for more than you are actually using?
  2. Personnel Costs: Using RDS and assuming that you don’t need to understand databases anymore? Surprise! You still need a team to configure the database and set it up for your requirements.
  3. Performance Limitations: When you hit a ceiling, scaling up can be costly.
  4. Vendor Lock-in: Switching providers? That’ll cost you in time and money.
  5. Data Migration: Moving data between services can cost a fortune.
  6. Backup and Storage: Those “convenient” backups? They’re not free. In addition, AWS RDS doesn’t let you plug in any storage solution other than AWS-native EBS volumes, which can get quite expensive if your database is IO-intensive.

The Power of Self-Managed Kubernetes Databases

On the flip side, managing your databases on Kubernetes might seem daunting at first. But let’s break it down and see where you could be saving big.

Initial Investment

  1. Learning Curve: Yes, there’s an upfront cost in time and training. You need engineers on your team who are comfortable with Kubernetes or Amazon EKS.
  2. Setup and Configuration: Getting things right takes effort, but it pays off.

Long-term Savings

  1. Flexibility: Scale up or down as needed, without overpaying.
  2. Multi-Cloud Freedom: Avoid vendor lock-in and negotiate better rates.
  3. Resource Optimization: Use your hardware efficiently across workloads.
  4. Resource Sharing: Kubernetes lets you efficiently allocate resources.
  5. Open-Source Tools: Leverage free, powerful tools for monitoring and management.
  6. Customization: Tailor your setup to your exact needs, no compromise.

Where are the Savings Coming from when using Kubernetes for your Database Management?

In a self-managed Kubernetes environment, you have greater control over resource allocation, leading to improved utilization and efficiency. Here’s why:

a) Dynamic Resource Allocation : Kubernetes allows for fine-grained control over CPU and memory allocation. You can set resource limits and requests at the pod level, ensuring databases only use what they need. Example: During off-peak hours, you can automatically scale down resources, whereas in managed services, you often pay for fixed resources 24/7.

b) Bin Packing : Kubernetes scheduler efficiently packs containers onto nodes, maximizing resource usage. This means you can run more workloads on the same hardware, reducing overall infrastructure costs. Example: You might be able to run both your database and application containers on the same node, optimizing server usage.

c) Avoid Overprovisioning : With managed services, you often need to provision for peak load at all times. In Kubernetes, you can use Horizontal Pod Autoscaling to add resources only when needed. Example: During a traffic spike, you can automatically add more database replicas, then scale down when the spike ends.

d) Resource Quotas : Kubernetes allows setting resource quotas at the namespace level, preventing any single team or application from monopolizing cluster resources. This leads to more efficient resource sharing across your organization.
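
As a concrete sketch of points a), c), and d), the snippet below shows pod-level resource requests and limits, a namespace quota, and a simple CPU-based autoscaler, created with the official Kubernetes Python client. All names and numbers are placeholders.

```python
from kubernetes import client, config

config.load_kube_config()
ns = "databases"  # hypothetical namespace

# a) Per-pod requests/limits, as embedded in a database StatefulSet's pod template.
db_container = client.V1Container(
    name="postgres",
    image="postgres:16",
    resources=client.V1ResourceRequirements(
        requests={"cpu": "2", "memory": "8Gi"},
        limits={"cpu": "4", "memory": "16Gi"},
    ),
)

# d) A namespace-level quota so one team cannot monopolize the cluster.
quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="db-team-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={"requests.cpu": "24", "requests.memory": "96Gi", "persistentvolumeclaims": "20"},
    ),
)
client.CoreV1Api().create_namespaced_resource_quota(ns, quota)

# c) A simple CPU-based autoscaler for read replicas (autoscaling/v1).
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="postgres-replicas"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="StatefulSet", name="postgres-replica",
        ),
        min_replicas=1,
        max_replicas=5,
        target_cpu_utilization_percentage=70,
    ),
)
client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(ns, hpa)
```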

Self-managed Kubernetes databases can also significantly reduce data transfer costs compared to managed services. Here’s how:

a) Co-location of Services : In Kubernetes, you can deploy your databases and application services in the same cluster. This reduces or eliminates data transfer between zones or regions, which is often charged in managed services. Example: If your app and database are in the same Kubernetes cluster, inter-service communication doesn’t incur data transfer fees.

b) Efficient Data Replication : Kubernetes allows for more control over how and when data is replicated. You can optimize replication strategies to reduce unnecessary data movement. Example: You might replicate data during off-peak hours or use differential backups to minimize data transfer.

c) Avoid Provider Lock-in : Managed services often charge for data egress, especially when moving to another provider. With self-managed databases, you have the flexibility to choose the most cost-effective data transfer methods. Example: You could use direct connectivity options or content delivery networks to reduce data transfer costs between regions or clouds.

d) Optimized Backup Strategies: Self-managed solutions allow for more control over backup processes. You can implement incremental backups or use deduplication techniques to reduce the amount of data transferred for backups. Example: Instead of full daily backups (common in managed services), you might do weekly full backups with daily incrementals, significantly reducing data transfer (see the worked numbers after this list).

e) Multi-Cloud Flexibility : Self-managed Kubernetes databases allow you to strategically place data closer to where it’s consumed. This can reduce long-distance data transfer costs, which are often higher. Example: You could have a primary database in one cloud and read replicas in another, optimizing for both performance and cost.
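
To put rough numbers on the backup example in point d), assume an illustrative 3 TB database with a 2% daily change rate; both figures are assumptions, not measurements.

```python
db_tb = 3.0          # database size, matching the scenario later in this post
daily_change = 0.02  # assumed fraction of data that changes per day

daily_fulls = 7 * db_tb                                    # 21.0 TB transferred per week
weekly_full_plus_incr = db_tb + 6 * db_tb * daily_change   # 3.36 TB transferred per week

print(f"daily fulls: {daily_fulls} TB/week")
print(f"weekly full + incrementals: {weekly_full_plus_incr:.2f} TB/week")
```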

By leveraging these strategies in a self-managed Kubernetes environment, organizations can significantly optimize their resource usage and reduce data transfer costs, leading to substantial savings compared to typical managed database services.

Breaking down the Numbers: a Cost Comparison between PostgreSQL on RDS vs EKS

Let’s get down to brass tacks. How do the costs really stack up? We’ve crunched the numbers for a small Postgres database, comparing the managed RDS service with self-hosting on Kubernetes. For Kubernetes, we are using EC2 instances with local NVMe disks, managed by EKS, with simplyblock as the storage orchestration layer.

Scenario: 3TB Postgres Database with High Availability (3 nodes) and Single AZ Deployment

Managed Service (AWS RDS) using three db.m4.2xlarge on demand with gp3 volumes

Available resources:

  • Available vCPU: 8
  • Available memory: 32 GiB
  • Available storage: 3 TB
  • Available IOPS: 20,000 per volume
  • Storage latency: 1-2 milliseconds

Costs:

  • Monthly total cost: $2,511.18
  • 3-year total: $2,511.18 x 36 months = $90,402

Editorial: See the pricing calculator for Amazon RDS for PostgreSQL.

Self-Managed on Kubernetes (EKS) using three i3en.xlarge instances on demand

Available resources:

  • Available vCPU: 12
  • Available memory: 96 GiB
  • Available storage: 3.75 TB (7.5 TB raw storage with an assumed 50% data protection overhead for simplyblock)
  • Available IOPS: 200,000 per volume (10x more than with RDS)
  • Storage latency: below 200 microseconds (local NVMe disks orchestrated by simplyblock)

Costs:

  • Monthly instance cost: $989.88
  • Monthly storage orchestration cost (e.g., simplyblock): $90 (3 TB x $30/TB)
  • Monthly EKS cost: $219 ($73 per cluster x 3)
  • Monthly total cost: $1,298.88
  • 3-year total: $1,298.88 x 36 months = $46,759

Base savings: $90,402 – $46,759 = $43,643 (48% over 3 years)

That’s a whopping 48% saving over three years! But wait, there’s more to consider. We have made some simplistic assumptions to estimate additional benefits of self-hosting to showcase the real potential of savings. While the actual efficiencies may vary from company to company, it should at least give a good understanding of where the hidden benefits might lie.
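
For transparency, the base-savings arithmetic can be reproduced in a few lines; every input below is a figure from the comparison above, not new data.

```python
months = 36

# Managed AWS RDS scenario
rds_monthly = 2511.18
rds_total = rds_monthly * months                    # ≈ $90,402

# Self-managed EKS scenario
instances = 989.88                                  # three i3en.xlarge, on demand
storage_orchestration = 3 * 30                      # 3 TB x $30/TB
eks_control_plane = 3 * 73                          # $73 per cluster x 3
eks_monthly = instances + storage_orchestration + eks_control_plane   # $1,298.88
eks_total = eks_monthly * months                    # ≈ $46,759

savings = rds_total - eks_total                     # ≈ $43,643
print(f"${savings:,.0f} saved ({savings / rds_total:.0%} over 3 years)")
```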

Additional Benefits of Self-Hosting (Estimated Annual Savings)

  1. Resource optimization/sharing: Assumption: 20% better resource utilization (assuming existing Kubernetes clusters). Estimated Annual Saving: 20% x $989.88 x 12 = $2,375
  2. Reduced Data Transfer Costs: Assumption: 50% reduction in data transfer fees. Estimated Annual Saving: $2,000
  3. Flexible Scaling: Avoid over-provisioning during non-peak times. Estimated Annual Saving: $3,000
  4. Multi-Cloud Strategy: Ability to negotiate better rates across providers. Estimated Annual Saving: $5,000
  5. Open-Source Tools: Reduced licensing costs for management tools. Estimated Annual Saving: $4,000

Disaster Recovery Insights

  • RTO (Recovery Time Objective) Improvement: Self-managed: potential for 40% faster recovery. Estimated value: $10,000 per hour of downtime prevented
  • RPO (Recovery Point Objective) Enhancement: Self-managed: achieve near-zero data loss. Estimated annual value: $20,000 in potential data loss prevention

Total Estimated Annual Benefit of Self-Hosting

Self-hosting pays off. Here is the summary of benefits:

  • Base Savings: $8,400/year
  • Additional Benefits: $15,920/year
  • Disaster Recovery Improvement: $30,000/year (conservative estimate)

Total Estimated Annual Additional Benefit: $54,695

Total Estimated Additional Benefits over 3 Years: $164,085

Note: These figures are estimates and can vary based on specific use cases, implementation efficiency, and negotiated rates with cloud providers.

Beyond the Dollar Signs: the Real Value Proposition

Money talks, but it’s not the only factor in play. Let’s look at the broader picture.

Performance and Scalability

With self-managed Kubernetes databases, you’re in the driver’s seat. Need to scale up for a traffic spike? Done. Want to optimize for a specific workload? You’ve got the power.

Security and Compliance

Think managed services have the upper hand in security? Think again. With self-managed solutions, you have granular control over your security measures. Plus, you’re not sharing infrastructure with unknown entities.

Innovation and Agility

In the fast-paced tech world, agility is king. Self-managed solutions on Kubernetes allow you to adopt cutting-edge technologies and practices without waiting for your provider to catch up.

Is the Database on Kubernetes for Everyone?

Definitely not. While self-managed databases on Kubernetes offer significant benefits in terms of cost savings, flexibility, and control, they’re not a one-size-fits-all solution. Here’s why:

  • Expertise: Managing databases on Kubernetes demands a high level of expertise in both database administration and Kubernetes orchestration. Not all organizations have this skill set readily available. Self-management means taking on responsibilities like security patching, performance tuning, and disaster recovery planning. For smaller teams or those with limited DevOps resources, this can be overwhelming.
  • Scale of operations : For simple applications with predictable, low-to-moderate database requirements, the advanced features and flexibility of Kubernetes might be overkill. Managed services could be more cost-effective in these scenarios. Same applies for very small operations or startups in early stages – the cost benefits of self-managed databases on Kubernetes might not outweigh the added complexity and resource requirements.

While database management on Kubernetes offers compelling advantages, organizations must carefully assess their specific needs, resources, and constraints before making the switch. For many, especially larger enterprises or those with complex, dynamic database requirements, the benefits can be substantial. However, others might find that managed services better suit their current needs and capabilities.

Bonus: Simplyblock

There is one more bonus benefit that you get when running your databases in Kubernetes: you can add simplyblock as your storage orchestration layer behind a single CSI driver that automatically and intelligently serves the storage service of your choice. Do you need a fast NVMe cache for some hot transactional data with random IO but don’t want to keep it hot forever? We’ve got you covered!

Simplyblock is an innovative cloud-native storage product, which runs on AWS, as well as other major cloud platforms. Simplyblock virtualizes, optimizes, and orchestrates existing cloud storage services (such as Amazon EBS or Amazon S3) behind an NVMe storage interface and a Kubernetes CSI driver. As such, it provides storage for compute instances (VMs) and containers. We have optimized for IO-heavy database workloads, including OLTP relational databases, graph databases, non-relational document databases, analytical databases, fast key-value stores, vector databases, and similar solutions.

This optimization has been built from the ground up to orchestrate a wide range of database storage needs, such as reliable and fast (high write-IOPS) storage for write-ahead logs and support for ultra-low latency, as well as high IOPS for random read operations. Simplyblock is highly configurable to optimally serve the different database query engines.

Some of the key benefits of using simplyblock alongside your stateful Kubernetes workloads are:

  • Cost Reduction, Margin Increase: Thin provisioning, compression, deduplication of hot-standby nodes, and storage virtualization with multiple tenants increases storage usage while enabling gradual storage increase.
  • Easy Scalability of Storage: Single-node databases require highly scalable storage (IOPS, throughput, capacity) since data cannot be distributed to scale. Simplyblock pools either Amazon EBS volumes or local instance storage from EC2 virtual machines and provides a scalable and cost-effective storage solution for single-node databases.
  • Enables Database Branching Features: Using instant snapshots and clones, databases can be quickly branched out and provided to customers (see the sketch after this list). Due to copy-on-write, the storage usage doesn’t increase unless the data is changed on either the primary or branch. Customers could be charged for “additional storage” though.
  • Enhances Security: Using an S3-based streaming of a recovery journal, the database can be quickly recovered from full AZ and even region outages. It also provides protection against typical ransomware attacks where data gets encrypted by enabling Point-in-Time-Recovery down to a few hundred milliseconds granularity.
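
As a sketch of the database branching idea referenced above, the snippet below takes a CSI volume snapshot of a database volume and provisions a copy-on-write clone from it using the Kubernetes Python client. The snapshot class, storage class, and PVC names are placeholders rather than simplyblock’s actual identifiers.

```python
from kubernetes import client, config

config.load_kube_config()
ns = "default"

# Snapshot the primary database volume via the CSI snapshot CRD.
snapshot = {
    "apiVersion": "snapshot.storage.k8s.io/v1",
    "kind": "VolumeSnapshot",
    "metadata": {"name": "orders-db-snap"},
    "spec": {
        "volumeSnapshotClassName": "example-snapclass",   # hypothetical snapshot class
        "source": {"persistentVolumeClaimName": "orders-db-data"},
    },
}
client.CustomObjectsApi().create_namespaced_custom_object(
    group="snapshot.storage.k8s.io", version="v1",
    namespace=ns, plural="volumesnapshots", body=snapshot,
)

# Branch the database by provisioning a new volume from the snapshot.
clone = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="orders-db-branch"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="example-nvme",                # hypothetical storage class
        resources=client.V1ResourceRequirements(requests={"storage": "100Gi"}),
        data_source=client.V1TypedLocalObjectReference(
            api_group="snapshot.storage.k8s.io",
            kind="VolumeSnapshot",
            name="orders-db-snap",
        ),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(ns, clone)
```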

Conclusion: the True Cost Revealed

When it comes to database management, the true cost goes far beyond the monthly bill. By choosing a self-managed Kubernetes solution, you’re not just saving money – you’re investing in flexibility, performance, and future-readiness. The savings and benefits will always be use-case- and company-specific, but the general conclusion remains unchanged. While operating databases in Kubernetes is not for everyone, for those who have the privilege of such choice, it should be a no-brainer kind of decision.

Is managing databases on Kubernetes complex?

While there is a learning curve, modern tools and platforms like simplyblock significantly simplify the process, often making it more straightforward than dealing with the limitations of managed services. Moreover, the knowledge acquired in the process can be reused across deployments in different clouds.

How can I ensure high availability with self-managed databases?

Kubernetes offers robust features for high availability, including automatic failover and load balancing. With proper configuration, you can achieve even higher availability than many managed services offer, meeting any possible SLA out there. You are in full control of the SLAs.

How difficult is it to migrate from a managed database service to Kubernetes?

While migration requires careful planning, tools and services exist to streamline the process. Many companies find that the long-term benefits far outweigh the short-term effort of migration.

How does simplyblock handle database backups and point-in-time recovery in Kubernetes?

Simplyblock provides automated, space-efficient backup solutions that integrate seamlessly with Kubernetes. Our point-in-time recovery feature allows you to restore your database to any specific moment, offering protection against data loss and ransomware attacks.

Does simplyblock offer support for multiple database types?

Yes, simplyblock supports a wide range of database types including relational databases like PostgreSQL and MySQL, as well as NoSQL databases like MongoDB and Cassandra. Check out our “Supported Technologies” page for a full list of supported databases and their specific features.

Ransomware Attack Recovery with Simplyblock
https://www.simplyblock.io/blog/ransomware-attack-recovery-with-simplyblock/ | Tue, 10 Sep 2024

In 2023, the number of victims of Ransomware attacks more than doubled, with 2024 off to an even stronger start. A Ransomware attack encrypts your local data, and the attackers demand a ransom to be paid. To increase the pressure on companies to pay, data is additionally copied to remote locations, which raises the risk of the data being leaked to the internet even if the ransom is paid. Strong Ransomware protection and mitigation are now more important than ever.

Simplyblock provides sophisticated block storage-level Ransomware protection and mitigation. Together with recovery options, simplyblock enables Point-in-Time Recovery (PITR) for any service or solution storing data.

What is Ransomware?

Ransomware is a type of malicious software (also known as malware) designed to block access to a computer system and/or encrypt data until a ransom is paid to the attacker. Cybercriminals typically carry out this type of attack by demanding payment, often in cryptocurrency, in exchange for providing a decryption key to restore access to the data or system.

Statistics show a significant rise in ransomware cyber attacks: ransomware cases more than doubled in 2023, and the amount of ransom paid reached more than a billion dollars—and these are only official numbers. Many organizations prefer not to report breaches and payments, as those are illegal in many jurisdictions.

Figure 1: Number of quarterly Ransomware victims between Q1 2021 and Q1 2024

The Danger of Ransomware Increases

The number and sophistication of attack tools have also increased significantly. They are becoming increasingly commoditized and easy to use, drastically reducing the skills cyber criminals require to deploy them.

There are many best practices and tools to protect against successful attacks. However, little can be done once an account, particularly a privileged one, has been compromised. Even if the breach is detected, it is most often too late. Attackers may only need minutes to encrypt important data.

Storage, particularly backups, serves as a last line of defense. After a successful attack, they provide a means to recover. However, there are certain downsides to using backups to recover from a successful attack:

  • The latest backup does not contain all of the data: Data written between the last backup and the time of the attack is unrecoverably lost. Even the loss of one hour of data written to a database can be critical for many enterprises.
  • Backups are not consistent with each other: The backup of one database may not fit the backup of another database or a file repository, so the systems will not be able to integrate correctly after restoration.
  • The latest backups may already contain encrypted data. It may be necessary to go back in time to find an older backup that is still “clean.” This backup, if available at all, may be linked to substantial data loss.
  • Backups must be protected from writes and delete operations; otherwise, they can be destroyed or damaged by attackers. Attackers may also damage the backup inventory management system, making it hard or impossible to locate specific backups.
  • Human error in Backup Management may lead to missing backups.

Simplyblock for Ransomware Protection and Mitigation

Simplyblock provides a smart solution to recover data after a ransomware attack, complementing classical backups.

In addition to writing data to hot-tier storage, simplyblock creates an asynchronously replicated write-ahead log (WAL) of all data written. This log is optimized for high throughput to secondary (low-IOPS) storage, such as Amazon S3 or HDD pools like AWS’ EBS st1 volumes. If this secondary storage supports write and deletion protection for pre-defined retention periods, as with S3, it is possible to “rewind” the storage to the point immediately before the attack. This performs a data recovery with near-zero RPO (Recovery Point Objective).
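
As an illustration of such write and deletion protection, S3 Object Lock can be applied per object when a WAL segment is uploaded. This is a generic sketch: the bucket (which must have versioning and Object Lock enabled), key, and retention period are assumptions, not simplyblock’s configuration.

```python
from datetime import datetime, timedelta, timezone

import boto3  # assumes AWS credentials are configured in the environment

s3 = boto3.client("s3")
retain_until = datetime.now(timezone.utc) + timedelta(days=30)  # assumed retention window

# Once written, the segment cannot be overwritten or deleted until the
# retention date passes, even with the credentials that created it.
s3.put_object(
    Bucket="example-wal-archive",      # hypothetical bucket with Object Lock enabled
    Key="wal/0000000000000042.seg",    # hypothetical segment name
    Body=b"...",                       # segment payload
    ObjectLockMode="COMPLIANCE",
    ObjectLockRetainUntilDate=retain_until,
)
```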

A recovery mechanism like this is particularly useful in combination with databases. Before attackers can start encrypting, they typically have to stop the database system, because its data and WAL files are otherwise in use. This clean shutdown makes it possible to automatically identify a consistent recovery point with no data loss.

Timeline of a Ransomware attack

In the future, simplyblock plans to enhance this functionality further. A multi-stage attack detection mechanism will be integrated into the storage. Additional plans include deletion protection that is only lifted once a historical time window has been cleared of attack activity, and precise, automatic identification of the attack launch point to locate recovery points.

Furthermore, simplyblock will support partial restores of recovery points so that different services’ data on the same logical volumes can be restored from individual points in time. This is important because encryption of one service might have started earlier or later than for others, hence the point in time to rewind to may differ per service.

Conclusion

Simplyblock provides a complementary recovery solution to classical backups. Backups support long-term storage of full recovery snapshots. Write-ahead log-based recovery, in contrast, is specifically designed for near-zero RPO recovery right after a ransomware attack starts and enables a quick and easy restore.

While many databases and data-storing services, such as PostgreSQL, provide Point-in-Time Recovery themselves, the WAL segments need to be archived outside the system as soon as they are closed. Even then, the RPO comes down to the size of a WAL segment, whereas with simplyblock, due to its copy-on-write nature, the RPO can be as small as a single committed write.
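
For comparison, this is roughly how PostgreSQL's own PITR archiving is wired up, shipping each WAL segment off the host as soon as it is closed. The sketch below uses a placeholder connection string and bucket; consult the PostgreSQL documentation before using anything like it in production.

  import psycopg2

  conn = psycopg2.connect("dbname=postgres user=postgres")  # placeholder DSN
  conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
  cur = conn.cursor()

  # Archive every completed WAL segment to object storage as soon as it is closed.
  # %p is the path of the segment, %f its file name (expanded by PostgreSQL).
  cur.execute("ALTER SYSTEM SET archive_mode = 'on'")   # takes effect after a server restart
  cur.execute("ALTER SYSTEM SET wal_level = 'replica'")
  cur.execute(
      "ALTER SYSTEM SET archive_command = "
      "'aws s3 cp %p s3://example-wal-archive/%f'"
  )
  cur.execute("SELECT pg_reload_conf()")

  # On restore, a recovery.signal file plus restore_command and recovery_target_time
  # (for example, the minute before the attack) replay the archived segments up to that point.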

Learn more about simplyblock and its other features like thin-provisioning, immediate clones and branches, encryption, compression, deduplication, and more. Or just get started right away and find the best Ransomware attack protection and mitigation to date.

The post Ransomware Attack Recovery with Simplyblock appeared first on simplyblock.

]]>
Disaster Recovery with Simplyblock in AWS https://www.simplyblock.io/blog/disaster-recovery-with-simplyblock-in-aws/ Fri, 06 Sep 2024 23:41:03 +0000 https://www.simplyblock.io/?p=1656 When disaster strikes, a great recovery strategy is required. Oftentimes, deficiencies are only discovered when it’s already too late. Simplyblock provides comprehensive disaster recovery support for databases, file storages, and whole infrastructures, enabling the restore from ground up in a different availability zone with minimal RTO (Recovery Time Objective) and near-zero RPO (Recovery Point Objective). […]

The post Disaster Recovery with Simplyblock in AWS appeared first on simplyblock.

]]>
When disaster strikes, a great recovery strategy is required. Oftentimes, deficiencies are only discovered when it’s already too late. Simplyblock provides comprehensive disaster recovery support for databases, file storage, and whole infrastructures, enabling a restore from the ground up in a different availability zone with minimal RTO (Recovery Time Objective) and near-zero RPO (Recovery Point Objective).

Amazon EBS, Amazon S3, and Local Instance Storage

AWS’ cloud block storage (Amazon EBS) is a great product, providing a multitude of volume types depending on your performance requirements (random IOPS, access latency). However, the provided durability is limited: depending on the EBS volume type, AWS indicates a durability between 99.8% and 99.999%. The bigger issue, though, is that in case of a disaster in your availability zone (AZ), storage becomes unavailable in its entirety and, depending on the type of disaster, data may actually be lost (partially or in full).

The durability is even worse with local instance storage. Local instance storage consists of NVMe disks physically located in the virtual machine host that runs your workload. Consequently, all data stored on local instance storage is immediately lost once the instance is turned off or a failure occurs on the physical host.

Amazon S3 storage, on the other hand, is considered extremely durable, offering 99.999999999% durability. In addition, it is replicated across availability zones. Therefore, the probability of data loss from any kind of disaster is close to zero; to our knowledge, and as of the time of writing, it has never actually happened. In terms of durability, Amazon S3 is king. We do, however, trade durability for latency.

Data Protection for Amazon EBS

As shown, all persistent (meaning non-ephemeral) data stored in Amazon EBS requires additional means of protection. The most common way to protect your data is taking a snapshot of your EBS volume and backing it up to Amazon S3.
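
For reference, the snapshot workflow described above boils down to something like the following sketch. The volume ID and tags are made up, and in practice this is usually driven by AWS Backup or Data Lifecycle Manager rather than hand-rolled scripts.

  import boto3

  ec2 = boto3.client("ec2")

  # Create a point-in-time snapshot of an EBS volume; snapshots are stored in S3 behind the scenes.
  snapshot = ec2.create_snapshot(
      VolumeId="vol-0123456789abcdef0",  # hypothetical volume ID
      Description="Nightly backup",
      TagSpecifications=[{
          "ResourceType": "snapshot",
          "Tags": [{"Key": "retention", "Value": "30d"}],
      }],
  )
  print("Started snapshot:", snapshot["SnapshotId"])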

Those S3 backups have a number of important drawbacks though:

  • A snapshot-based backup always implicitly means data loss of some kind. Data written between the last backup and the time of the failure is irrecoverably lost; no restore procedure will be able to recover it. For low-velocity data (data which is rarely changed), such as media files or archived documents, that may be a minor issue. For other types of data, such as transactional systems, the data loss can be catastrophic.
  • Multiple backups of different systems aren’t consistent with each other. The backup of one database may not fit the backup of another database or a file repository. As a result, the systems may have inconsistent data states after restoration and will not integrate correctly. Bringing a collection of systems with backups taken at different times back into a working state can be a massive manual effort; sometimes it is even impossible.
  • Backup management is a significant effort. To free up disk space, snapshots have to be removed from EBS after moving them to S3. Furthermore, backups have to be configured with retention policies, the backup operations must be monitored, and backups have to be tested regularly to make sure they can be restored successfully.

Last but not least, human error in backup management may lead to missing or corrupted backups.

Data Protection for Amazon EBS with Simplyblock

Simplyblock provides a smart solution to the consistent recovery of hot data after a major incident or even a zone-level disaster.

First and foremost, simplyblock logical volumes store data synchronously in the hot-tier storage backend. In addition, data is also written into an asynchronously replicated write-ahead log (WAL). Writing this log is optimized for high throughput to secondary (low-IOPS) storage such as S3 or HDD pools (e.g., Amazon EBS st1 volumes). Last but not least, the WAL is efficiently compacted at regular intervals to limit storage growth and optimize recovery times.

Simplyblock’s logical volumes inherently support snapshots. Due to the copy-on-write nature of simplyblock, snapshots are taken immediately and, together with the WAL, asynchronously replicated to S3.

Data recovery, on the other hand, restores all live volumes and snapshots in a fully consistent manner. Because the replication is asynchronous, potential data loss is limited to the replication lag, typically a few hundred milliseconds.

Disaster Recovery with Near-Zero RPO

The solution stores all “hot” data either in distributed instance storage or within gp3 pools, providing the necessary online performance of storage. At the same time, all data is also asynchronously replicated into S3.

In case of a loss of the entire infrastructure in an availability zone (including the gp3 volumes and local instance storage) it is possible to consistently bootstrap the entire environment in a new AZ.

Simplyblock architecture with consistent disaster recovery across different services

If a customer uses simplyblock to store not only the databases but also bootstrap and deployment information (like ArgoCD configuration, Terraform data, or similar), a recovery operation can consistently restore the entire infrastructure from the ground up. Using this strategy, infrastructures supported by simplyblock can be recovered consistently and fully automatically with near-zero RPO and a low RTO.

For this purpose, the “primary” simplyblock storage pod, which contains all data required for bootstrapping, has to be restarted in a new zone and connected to the control plane. Afterwards, all storage is consistently accessible.

First, infrastructure templates and configurations for the environment are retrieved, after which the deployment scripts are run and the infrastructure is redeployed. In this process, databases, documents, and other file stores can already be connected to their corresponding volumes, which contain all of the data in a crash-consistent manner.

At a later stage, “secondary” storage plane pods can be restarted within the new availability zone and data will be recovered.

The recovery time depends largely on the amount of data and the instance network bandwidth. Reads from S3 are highly optimized, using large, parallel reads wherever possible to pre-fetch hot data as quickly as possible.
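
As an illustration of the general idea (not simplyblock's actual code), large objects can be fetched from S3 with parallel ranged reads, which typically utilizes the instance's network bandwidth far better than a single sequential download. The bucket and key below are placeholders.

  from concurrent.futures import ThreadPoolExecutor
  import boto3

  s3 = boto3.client("s3")
  BUCKET, KEY = "example-recovery-data", "volumes/vol-1/segment-000001"  # hypothetical object
  CHUNK = 64 * 1024 * 1024  # 64 MiB ranges

  def fetch_range(offset: int) -> bytes:
      # A ranged GET downloads only the requested byte window of the object.
      resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={offset}-{offset + CHUNK - 1}")
      return resp["Body"].read()

  size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
  offsets = range(0, size, CHUNK)
  with ThreadPoolExecutor(max_workers=16) as pool:
      data = b"".join(pool.map(fetch_range, offsets))  # map() preserves chunk order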

Conclusion

All that said, simplyblock, the intelligent storage orchestrator, provides a powerful, feature-rich, crash-consistent, yet performant storage solution.

Built upon well-known storage solutions, such as local instance storage, Amazon EBS, and Amazon S3, simplyblock combines the ultra-low latency access of NVMe volumes (pooled or unpooled) with the extreme durability of Amazon S3. Simplyblock’s write-ahead log and disaster recovery support enable the lowest RPO and minimal downtime, even in case of the loss of a full availability zone.

Get started with simplyblock today and learn all about the other amazing features simplyblock brings right to you.

The post Disaster Recovery with Simplyblock in AWS appeared first on simplyblock.

]]>
How to reduce AWS cloud costs with AWS marketplace products? https://www.simplyblock.io/blog/how-to-reduce-aws-cloud-costs-with-aws-marketplace-products/ Fri, 28 Jun 2024 02:19:03 +0000 https://www.simplyblock.io/?p=1793 The AWS Marketplace is a comprehensive catalog consisting of thousands of offerings that help organizations find, purchase, deploy and manage third-party software and services to optimize their cloud operations. It’s also a great place to find numerous tools specifically designed to help you optimize your AWS cloud costs. These tools can help you monitor your […]

The post How to reduce AWS cloud costs with AWS marketplace products? appeared first on simplyblock.

]]>
The AWS Marketplace is a comprehensive catalog consisting of thousands of offerings that help organizations find, purchase, deploy and manage third-party software and services to optimize their cloud operations. It’s also a great place to find numerous tools specifically designed to help you optimize your AWS cloud costs. These tools can help you monitor your cloud usage, right-size resources, leverage cost-effective pricing models, and implement automated management practices to reduce waste and improve efficiency.

In this blog post you will learn more about the key cost drivers in the AWS Cloud, what cloud cost optimization is, why you need to think about it, and what tools are at your disposal, particularly in the AWS Marketplace.

What are the Fundamental Drivers of Cost with AWS Cloud?

Industry studies show that almost 70% of organizations experience higher-than-anticipated cloud costs. Understanding the key factors that drive costs in the AWS Cloud is essential for effective cost management. Below is a breakdown of the key drivers of cloud costs: compute resources and storage, which together make up almost 60-70% of the total spend, plus data transfer, networking, database services, the support plans you opt for, additional costs for licensing and marketplace products, and serverless services such as API calls.

Based on the Vantage Cloud Cost Report for Q1 2024, we can see that the most used services in public clouds are by far compute instances (EC2 on AWS, Compute Engine on Google Cloud, and Virtual Machines on Microsoft Azure), followed by storage and databases. Optimizing the costs of compute, storage, and databases will therefore have the highest impact on cloud bill reduction.

Top 10 services by spend on AWS, Google Cloud and Azure Q1 2024

Looking more granularly at AWS, here are the key services to look into when optimizing cloud costs:

Compute Resources

  • EC2 Instances: The cost depends on the type, size, and number of EC2 instances you run. Different instance types have varying performance and pricing.
  • Lambda Functions: Pricing is based on the number of requests and the duration of execution.

Cloud Storage

  • S3 Buckets: Costs vary depending on the amount of data stored, the frequency of access (standard, infrequent access, or Glacier), and the number of requests made.
  • EBS Volumes: Pricing is based on the type and size of the volume, provisioned IOPS, and snapshots. Cloud block storage prices can be very high if used for highly transactional workloads such as relational, NoSQL, or vector databases.
  • EFS and FSx: Pricing is based on the service type, IOPS, and other requested services. Prices of file systems in the cloud can become very expensive with extensive usage.

Data Transfer

  • Data Ingress and Egress: Inbound data transfer is generally free, but outbound data transfer (data leaving AWS) incurs charges. Costs can add up, especially with high-volume transfers across regions or to the internet.

Networking

  • VPC: Costs associated with using features like VPN connections, VPC peering, and data transfer between VPCs.
  • Load Balancers: Costs for using ELB (Elastic Load Balancers) vary based on the type (Application, Network, or Classic) and usage.

Database Services

  • RDS: Charges depend on the database engine, instance type, storage, and backup storage.
  • DynamoDB: Pricing is based on read and write throughput, data storage, and optional features like backups and data transfer.

Understanding these drivers helps you identify areas where you can cut costs without sacrificing performance, allowing for better budgeting, more efficiency in operations and better scalability as demand increases.
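
Before optimizing anything, it helps to see where the money actually goes. The following is a small, illustrative example using the AWS Cost Explorer API; the date range is arbitrary, and note that Cost Explorer API calls incur a small per-request charge.

  import boto3

  ce = boto3.client("ce")  # Cost Explorer

  response = ce.get_cost_and_usage(
      TimePeriod={"Start": "2024-05-01", "End": "2024-06-01"},
      Granularity="MONTHLY",
      Metrics=["UnblendedCost"],
      GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
  )

  # Print per-service spend, largest first, to spot the biggest optimization targets.
  groups = response["ResultsByTime"][0]["Groups"]
  for group in sorted(groups, key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]), reverse=True):
      service = group["Keys"][0]
      amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
      print(f"{service}: ${amount:,.2f}")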

What is Cloud Cost Optimization?

Cloud cost optimization involves using various strategies, techniques, best practices, and tools to lower cloud expenses. It aims to find the most economical way to operate your applications in the cloud, ensuring you get the highest business value from your investment. It may involve tactics like monitoring your cloud usage, identifying waste, and making adjustments to use resources more effectively without compromising performance or reliability, as well as using marketplace solutions instead of some cloud-provider-native offerings.

Why do you need Cloud Cost Optimization?

Organizations waste approximately 32% of their cloud spending, which is a significant amount whether you’re a small business or a large one spending millions on cloud services. Cloud cost optimization helps you minimize this waste and avoid overspending. It goes beyond just cost-cutting; it also focuses on a thorough analysis of current usage, identifying inefficiencies and eliminating waste to maximize value.

More than just cutting costs, it’s also about ensuring your spending aligns with your business goals. Cloud cost optimization means understanding your cloud expenses and making smart adjustments to control costs without sacrificing performance. Also see our blog post on AWS and cloud cost optimization.

What is the AWS Marketplace?

The AWS Marketplace is a “curated digital catalog that customers can use to find, buy, deploy, and manage third-party software, data, and services to build solutions and run their businesses.” It features thousands of software solutions, including but not limited to security, networking, storage, machine learning, and business applications, from independent software vendors (ISVs). These offerings are easy to use and can be quickly deployed directly to an AWS environment, making it easy to integrate new solutions into your existing cloud infrastructure.

AWS Marketplace also offers various flexible pricing options, including hourly, monthly, annual, and BYOL (Bring Your Own License). And lastly, many of the software products available in the Marketplace have undergone rigorous security assessments and comply with industry standards and regulations. Also note that purchases from the AWS Marketplace can count towards AWS Enterprise Discount Program (EDP) commitments. See our blog post on the EDP .

Cloud Cost Optimization Tools on AWS Marketplace you can use to Optimize your Cloud Costs

In addition to its thousands of software products, AWS Marketplace also offers many products and services that can help you optimize your cloud costs. Here are some tools and ways in which you can use AWS Marketplace to do so effectively.

Cloud Cost Management Tools

AWS Marketplace hosts a variety of cost management tools that provide insights into your cloud spending. Products like CloudHealth and CloudCheckr offer comprehensive dashboards and reports that help you understand where your money is going. These tools can identify underutilized resources, recommend rightsizing opportunities, and alert you to unexpected cost spikes, enabling proactive management of your AWS expenses.

Optimizing Compute Costs: Reserved Instances and Savings Plans

One of the most effective ways to reduce AWS costs is by purchasing Reserved Instances (RIs) and Savings Plans, as mentioned above. However, understanding the best mix and commitment level can be challenging. Tools like Spot.io and Cloudability, available on AWS Marketplace, can analyze your usage patterns and recommend the optimal RI or Savings Plan purchases. These products ensure you get the best return on your investment while maintaining the flexibility to adapt to changing workloads.

Optimizing Cloud Storage Costs

Data storage can quickly become one of the largest expenses in your AWS bill. Simplyblock, available on AWS Marketplace, is the next generation of software-defined storage, meeting the storage requirements of the most demanding workloads. High IOPS per gigabyte density, low predictable latency, and high throughput are enabled by pooled storage and our distributed data placement algorithm. Using erasure coding (a better RAID) instead of replicas helps to minimize storage overhead without sacrificing data safety and fault tolerance.

Automate Resource Management

Automated resource management tools can help you scale your resources up or down based on demand, ensuring you only pay for what you use. Products like ParkMyCloud and Scalr can automate the scheduling of non-production environments to shut down during off-hours, significantly reducing costs. These tools also help in identifying and terminating idle resources, ensuring no wastage of your cloud budget.
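
The core of what such schedulers automate can be sketched in a few lines: stop every running instance carrying a particular non-production tag outside business hours. The tag key and value below are made-up conventions; dedicated tools add calendars, approvals, and reporting on top.

  import boto3

  ec2 = boto3.client("ec2")

  # Find running instances tagged env=dev (hypothetical tagging convention) ...
  reservations = ec2.describe_instances(
      Filters=[
          {"Name": "tag:env", "Values": ["dev"]},
          {"Name": "instance-state-name", "Values": ["running"]},
      ]
  )["Reservations"]

  instance_ids = [
      instance["InstanceId"]
      for reservation in reservations
      for instance in reservation["Instances"]
  ]

  # ... and stop them; a scheduled Lambda or cron job would run this every evening.
  if instance_ids:
      ec2.stop_instances(InstanceIds=instance_ids)
      print("Stopped:", instance_ids)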

Enhance Security and Compliance

Security and compliance are critical but can also be cost-intensive. Utilizing AWS Marketplace products like Trend Micro and Alert Logic can enhance your security posture without the need for a large in-house team. These services provide continuous monitoring and automated compliance checks, helping you avoid costly breaches and fines while optimizing the allocation of your security budget.

Consolidate Billing and Reporting

For organizations managing multiple AWS accounts, consolidated billing and reporting tools can simplify cost management. AWS Marketplace offers solutions like CloudBolt and Turbonomic that provide a unified view of your cloud costs across all accounts. These tools offer detailed reporting and chargeback capabilities, ensuring each department or project is accountable for their cloud usage, promoting cost-conscious behavior throughout the organization.

By leveraging the diverse range of products available on AWS Marketplace, organizations can gain better control over their AWS spending, optimize resource usage, and enhance operational efficiency. Whether it’s through cost management tools, automated resource management, or enhanced security solutions, AWS Marketplace products provide the necessary tools to reduce cloud costs effectively.

How to Reduce EBS Cost in AWS?

AWS Marketplace storage solutions such as simplyblock can help reduce Amazon EBS costs and AWS database costs by up to 80%. Simplyblock offers high-performance cloud block storage that enhances the performance of your databases and applications. This ensures you get better value and efficiency from your cloud resources.

Simplyblock software provides a seamless bridge between local EC2 NVMe disk, Amazon EBS, and Amazon S3, integrating these storage options into a single, cohesive system designed for ultimate scale and performance of IO-intensive stateful workloads. By combining the high performance of local NVMe storage with the reliability and cost-efficiency of EBS and S3 respectively, simplyblock enables enterprises to optimize their storage infrastructure for stateful applications, ensuring scalability, cost savings, and enhanced performance. With simplyblock, you can save up to 80% on your EBS costs on AWS.

Our technology uses NVMe over TCP for minimal access latency, high IOPS/GB, and efficient CPU core utilization, outperforming local NVMe disks and Amazon EBS in cost/performance ratio at scale. Ideal for high-performance Kubernetes environments, simplyblock combines the benefits of local-like latency with the scalability and flexibility necessary for dynamic AWS EKS deployments , ensuring optimal performance for I/O-sensitive workloads like databases. By using erasure coding (a better RAID) instead of replicas, simplyblock minimizes storage overhead while maintaining data safety and fault tolerance. This approach reduces storage costs without compromising reliability.
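
To see why erasure coding matters for cost, here is a back-of-the-envelope comparison. The 4+2 scheme and the capacity figure are illustrative examples only; actual schemes and overheads depend on the configuration.

  usable_tb = 100                      # usable capacity you actually need (example figure)

  # Triple replication: every byte is stored three times.
  replication_raw = usable_tb * 3

  # 4+2 erasure coding: 4 data chunks + 2 parity chunks -> 1.5x raw per usable byte,
  # while still tolerating the loss of any two chunks.
  erasure_raw = usable_tb * (4 + 2) / 4

  print(f"3x replication: {replication_raw:.0f} TB raw")   # 300 TB
  print(f"4+2 erasure   : {erasure_raw:.0f} TB raw")        # 150 TB
  print(f"raw capacity saved: {(1 - erasure_raw / replication_raw):.0%}")  # 50%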

Simplyblock also includes additional features such as instant snapshots (full and incremental), copy-on-write clones, thin provisioning, compression, encryption, and many more – in short, there are many ways in which simplyblock can help you optimize your cloud costs. Get started using simplyblock right now and see how simplyblock can help you on the AWS Marketplace .

To save on your cloud costs, you can also take advantage of discounts provided by various platforms. You can visit here to grab a discount on your AWS credits.

The post How to reduce AWS cloud costs with AWS marketplace products? appeared first on simplyblock.

]]>
Machine Learning driven Database Optimization with Luigi Nardi from DBtune (interview) https://www.simplyblock.io/blog/machine-learning-driven-database-optimization-with-luigi-nardi-from-dbtune-video/ Thu, 27 Jun 2024 12:09:00 +0000 https://www.simplyblock.io/?p=249 Introduction This interview is part of the simplyblock Cloud Commute Podcast, available on Youtube , Spotify , iTunes/Apple Podcasts , and our show site . In this insightful video, we explore the cutting-edge field of machine learning-driven database optimization with Luigi Nardi In this episode of the Cloud Commute podcast. Key Takeaways Q: Can machine […]

The post Machine Learning driven Database Optimization with Luigi Nardi from DBtune (interview) appeared first on simplyblock.

]]>
Introduction

This interview is part of the simplyblock Cloud Commute Podcast, available on Youtube, Spotify, iTunes/Apple Podcasts, and our show site.

In this insightful episode of the Cloud Commute podcast, we explore the cutting-edge field of machine learning-driven database optimization with Luigi Nardi.

Key Takeaways

Q: Can machine learning improve database performance?

Yes, machine learning can significantly improve database performance. DBtune uses machine learning algorithms to automate the tuning of database parameters that govern CPU, RAM, and disk usage. This not only enhances the efficiency of query execution but also reduces the need for manual intervention, allowing database administrators to focus on more critical tasks. The result is a more responsive and cost-effective database system.

Q: How do machine learning models predict query performance in databases?

DBtune employs probabilistic models to predict query performance. These models analyze various metrics, such as CPU usage, memory allocation, and disk activity, to forecast how queries will perform under different conditions. The system then provides recommendations to optimize these parameters, ensuring that the database operates at peak efficiency. This predictive capability is crucial for maintaining performance in dynamic environments.

Q: What are the main challenges in integrating AI-driven optimization with legacy database systems?

Integrating AI-driven optimization into legacy systems presents several challenges. Compatibility issues are a primary concern, as older systems may not easily support modern optimization techniques. Additionally, there’s the need to gather sufficient data to train machine learning models effectively. Luigi also mentions the importance of addressing security concerns, especially when sensitive data is involved, and ensuring that the integration process does not disrupt existing workflows.

Q: Can you provide examples of successful AI-driven query optimization in real-world applications?

DBtune has successfully applied its technology across various database systems, including Postgres, MySQL, and SAP HANA. For instance, in a project with a major telecom company, DBtune’s optimization algorithms reduced query execution times by up to 80%, leading to significant cost savings and improved system responsiveness. These real-world applications demonstrate the practical benefits of AI-driven query optimization in diverse environments.
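
For readers less familiar with what tuning database parameters means in practice, here is a minimal, hand-rolled illustration of adjusting two classic PostgreSQL knobs. This is not DBtune's API or algorithm, merely the kind of change such a system automates; the values shown are arbitrary examples, not recommendations.

  import psycopg2

  conn = psycopg2.connect("dbname=postgres user=postgres")  # placeholder DSN
  conn.autocommit = True  # ALTER SYSTEM must run outside a transaction block
  cur = conn.cursor()

  # Two of the parameters discussed in the episode:
  # shared_buffers caches table data in RAM (a change requires a restart),
  # work_mem is the per-operation sort/hash memory used while processing queries.
  cur.execute("ALTER SYSTEM SET shared_buffers = '8GB'")
  cur.execute("ALTER SYSTEM SET work_mem = '64MB'")
  cur.execute("SELECT pg_reload_conf()")  # reloadable settings apply now; shared_buffers after restart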


In addition to highlighting the key takeaways, it’s essential to provide deeper context and insights that enrich the listener’s understanding of the episode. By offering this added layer of information, we ensure that when you tune in, you’ll have a clearer grasp of the nuances behind the discussion. This approach enhances your engagement with the content and helps shed light on the reasoning and perspective behind the thoughtful questions posed by our host, Chris Engelbert. Ultimately, this allows for a more immersive and insightful listening experience.

Key Learnings

Q: Can machine learning be used for optimization?

Yes, machine learning can be highly effective in optimizing complex systems by analyzing large datasets and identifying patterns that might not be apparent through traditional methods. It can automatically adjust system configurations, predict resource needs, and streamline operations to enhance performance.

simplyblock Insight: While simplyblock does not directly use machine learning for optimization, it provides advanced infrastructure solutions that are designed to seamlessly integrate with AI-driven tools. This allows organizations to leverage machine learning capabilities within a robust and flexible environment, ensuring that their optimization processes are supported by reliable and scalable infrastructure.

Q: How does AI-driven query optimization improve database performance?

AI-driven query optimization improves database performance by analyzing system metrics in real-time and adjusting configurations to enhance data processing speed and efficiency. This leads to faster query execution and better resource utilization.

simplyblock Insight: simplyblock’s platform enhances database performance through efficient storage management and high availability features. By ensuring that storage is optimized and consistently available, simplyblock allows databases to maintain high performance levels, even as AI-driven processes place increasing demands on the system.

Q: What are the main challenges in integrating AI-driven optimization with legacy database systems?

Integrating AI-driven optimization with legacy systems can be challenging due to compatibility issues, the complexity of existing configurations, and the risk of disrupting current operations.

simplyblock Insight: simplyblock addresses these challenges by offering flexible deployment options that are compatible with legacy systems. Whether through hyper-converged or disaggregated setups, simplyblock enables seamless integration with existing infrastructure, minimizing the risk of disruption and ensuring that AI-driven optimizations can be effectively implemented.

Q: What is the relationship between machine learning and databases?

The relationship between machine learning and databases is integral, as machine learning algorithms rely on large datasets stored in databases to train and improve, while databases benefit from machine learning’s ability to optimize their performance and efficiency.

simplyblock Insight: simplyblock enhances this relationship by providing a scalable and reliable infrastructure that supports large datasets and high-performance demands. This allows databases to efficiently manage the data required for machine learning, ensuring that the training and inference processes are both fast and reliable.

Additional Nugget of Information

Q: How is the rise of vector databases impacting the future of machine learning and databases?

The rise of vector databases is revolutionizing how large language models and AI systems operate by enabling more efficient storage and retrieval of vector embeddings. These databases, such as pgvector for Postgres, are becoming essential as AI applications demand more from traditional databases. The trend indicates a future where databases are increasingly specialized to handle the unique demands of AI, which could lead to even greater integration between machine learning and database management systems. This development is likely to play a crucial role in the ongoing evolution of both AI and database technologies.
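
As a small, illustrative example of what working with pgvector looks like (the table name, vector dimensionality, and connection string are made up, and the extension must be installed on the server first):

  import psycopg2

  conn = psycopg2.connect("dbname=postgres user=postgres")  # placeholder DSN
  cur = conn.cursor()

  cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
  cur.execute("CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(3))")
  cur.execute("INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]')")
  conn.commit()

  # Nearest-neighbor search: <-> is pgvector's Euclidean distance operator.
  cur.execute("SELECT id FROM items ORDER BY embedding <-> '[2,3,4]' LIMIT 1")
  print(cur.fetchone())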

Conclusion

Luigi Nardi showcases how machine learning is transforming database optimization. As DBtune’s founder, he highlights the power of AI to boost performance, cut costs, and enhance sustainability in database management. The discussion also touches on emerging trends like vector databases and DBaaS, making it a must-listen for anyone keen on the future of database technology. Stay tuned for more videos on cutting-edge technologies and their applications.

Full Episode Transcript

Chris Engelbert: Hello, everyone. Welcome back to this week’s episode of simplyblock’s Cloud Commute podcast. This week I have Luigi with me. Luigi, obviously, from Italy. I don’t think he has anything to do with Super Mario, but he can tell us about that himself. So welcome, Luigi. Sorry for the really bad joke.

Luigi Nardi: Glad to be here, Chris.

Chris Engelbert: So maybe you start with introducing yourself. Who are you? We already know where you’re from, but I’m not sure if you’re actually residing in Italy. So maybe just tell us a little bit about you.

Luigi Nardi: Sure. Yes, I’m originally Italian. I left the country to explore and study abroad a little while ago. So in 2006, I moved to France and studied there for a little while. I spent almost seven years in total in France eventually. I did my PhD program there in Paris and worked in a company as a software engineer as well. Then I moved to the UK for a few years, did a postdoc at Imperial College London in downtown London, and then moved to the US. So I lived in California, Palo Alto more precisely, for a few years. Then in 2019, I came back to Europe and established my residency in Sweden.

Chris Engelbert: Right. Okay. So you’re in Sweden right now.

Luigi Nardi: That’s correct.

Chris Engelbert: Oh, nice. Nice. How’s the weather? Is it still cold?

Luigi Nardi: It’s great. Everybody thinks that Sweden has very bad weather, but Sweden is a very, very long country. So if you reside in the south, actually, the weather is pretty decent. It doesn’t snow very much.

Chris Engelbert: That is very true. I actually love Stockholm, a very beautiful city. All right. One thing you haven't mentioned: you're actually the founder and CEO of DBtune. So you left out the best part, I guess. Maybe tell us a little bit about DBtune now.

Luigi Nardi: Sure. DBtune is a company that is about four years old now. It’s a spinoff from Stanford University and the commercialization of about a decade of research and development in academia. We were working on the intersection between machine learning and computer systems, specifically the use of machine learning to optimize computer systems. This is an area that in around 2018 or 2019 received a new name, which is MLSys, machine learning and systems. This new area is quite prominent these days, and you can do very beautiful things with the combination of these two pieces. DBtune is specifically focusing on using machine learning to optimize computer systems, particularly in the computer system area. We are optimizing databases, the database management systems more specifically. The idea is that you can automate the process of tuning databases. We are focusing on the optimization of the parameters of the database management systems, the parameters that govern the runtime system. This means the way the disk, the RAM, and the CPU interact with each other. You take the von Neumann model and try to make it as efficient as possible through optimizing the parameters that govern that interaction. By doing that, you automate the process, which means that database engineers and database administrators can focus on other tasks that are equally important or even more important. At the same time, you get great performance, you can reduce your cloud costs as well. If you’re running in the cloud in an efficient way, you can optimize the cloud costs. Additionally, you get a check on your greenops, meaning the sustainability aspect of it. So this is one of the examples I really like of how you can be an engineer and provide quite a big contribution in terms of sustainability as well because you can connect these two things by making your software run more efficiently and then scaling down your operations.

Chris Engelbert: That is true. And it’s, yeah, I’ve never thought about that, but sure. I mean, if I get my queries to run more efficient and use less compute time and compute power, huh, that is actually a good thing. Now I’m feeling much better.

Luigi Nardi: I’m feeling much better too. Since we started talking a little bit more about this, we have a blog post that will be released pretty soon about this very specific topic. I think this connection between making software run efficiently and the downstream effects of that efficiency, both on your cost, infrastructure cost, but also on the efficiency of your operations. It’s often underestimated, I would say.

Chris Engelbert: Yeah, that’s fair. It would be nice if you, when it’s published, just send me over the link and I’m putting it into the show notes because I think that will be really interesting to a lot of people. As he said specifically for developers that would otherwise have a hard time having anything in terms of sustainability. You mentioned database systems, but I think DBtune specifically is focused on Postgres, isn’t it?

Luigi Nardi: Right. Today we are focusing on Postgres. As a proof of concept, though, we have applied similar technology to five different database management systems, including relational and non-relational systems as well. So we were, a little while ago, we wanted to show that this technology can be used across the board. And so we play around with MySQL, with FoundationDB, which is the system behind iCloud, for example, and many of the VMware products. And then we have RocksDB, which is behind your Instagram and Facebook and so on. Facebook, very pressing that open source storage system. And things like SAP HANA as well, we’ve been focusing on that a little bit as well, just as a proof of concept to show that basically the same methodology can apply to very different database management systems in general.

Chris Engelbert: Right. You want to look into Oracle and take a chunk of their money, I guess. But you’re on the right track with SAP HANA. It’s kind of on the same level. So how does that work? I think you have to have some kind of an agent inside of your database. For Postgres, you’re probably using the stats tables, but I guess you’re doing more, right?

Luigi Nardi: Right. This is the idea of, you know, observability and monitoring companies. They mainly focus on gathering all this metrics from the machine and then getting you a very nice visualization on your dashboard. As a user, you would look at these metrics and how they evolve over time, and then they help you guide the next step, which is some sort of manual optimization of your system. We are moving one step forward and we’re trying to use those metrics automatically instead of just giving them back to the user. So we move from a passive monitoring approach to an active approach where the metrics are collected and then the algorithm will help you also to automatically change the configuration of the system in a way that it gets faster over time. And so the metrics that we look at usually are, well, the algorithm itself will gather a number of metrics to help it to improve over time. And this type of metrics are related to, you know, your system usage, you know, CPU memory and disk usage. And other things, for example, latency and throughput as well from your Postgres database management system. So using things like pg_stat_statements, for example, for people that are a little more familiar with Postgres. And by design, we refrain from looking inside your tables or looking specifically at your metadata, at your queries, for example, we refrain from that because it’s easier to basically, you know, deploy our system in a way that it’s not dangerous for your data and for your privacy concerns and things like that.

Chris Engelbert: Right. Okay. And then you send that to a cloud instance that visualizes the data, just the simple stuff, but there’s also machine learning that actually looks at all the collected data and I guess try to find pattern. And how does that work? I mean, you probably have a version of the query parser, the Postgres query parser in the backend to actually make sense of this information, see what the execution plan would be. That is just me guessing. I don’t want to spoil your product.

Luigi Nardi: No, that’s okay. So the agent is open source and it gets installed on your environment. And anyone fluent in Python can read that in probably 20 minutes. So it’s pretty, it’s not massive. It’s not very big. That’s what gets connected with our backend system, which is running in our cloud. And the two things connect and communicate back and forth. The agent reports the metrics and requests what’s the next recommendation from the optimizer that runs in our backend. The optimizer responds with a recommendation, which is then enabled in the system through the agent. And then the agent also starts to measure what’s going on on the machine before reporting these metrics back to the backend. And so this is a feedback loop and the optimizer gets better and better at predicting what’s going on on the other side. So this is based on machine learning technology and specifically probabilistic models, which I think is the interesting part here. By using probabilistic models, the system is able to predict the performance for a new guess, but also predict the uncertainty around that estimate. And that’s, I think, very powerful to be able to combine some sort of prediction, but also how confident you are with respect to that prediction. And those things are important because when you’re optimizing a computer system, of course, you’re running this in production and you want to make sure that this stays safe for the system that is running. You’re changing the system in real time. So you want to make sure that these things are done in a safe way. And these models are built in a way that they can take into account all these unpredictable things that may otherwise book in the engineer system.

Chris Engelbert: Right. And you mentioned earlier that you’re looking at the pg_stat_statements table, can’t come up with the name right now. But that means you’re not looking at the actual data. So the data is secure and it’s not going to be sent to your backend, which I think could be a valid fear from a lot of people like, okay, what is actually being sent, right?

Luigi Nardi: Exactly. So Chris, when we talk with large telcos and big banks, the first thing that they say, what are you doing to my data? So you need to sit down and meet their infosec teams and explain to them that we’re not transferring any of that data. And it’s literally just telemetrics. And those telemetrics usually are not sensitive in terms of privacy and so on. And so usually there is a meeting that happens with their infosec teams, especially for big banks and telcos, where you clarify what is being sent and then they look at the source code because the agent is open source. So you can look at the open source and just realize that nothing sensitive is being sent to the internet.

Chris Engelbert: Right.

Luigi Nardi: And perhaps to add one more element there. So for the most conservative of our clients, we also provide a way to deploy this technology in a completely offline manner. So when everybody’s of course excited about digital transformations and moving to the cloud and so on, we actually went kind of backwards and provided a way of deploying this, which is sending a standalone software that runs in your environment and doesn’t communicate at all to the internet. So we have that as an option as well for our users. And that supports a little harder for us to deploy because we don’t have direct access to that anymore. So it’s easy for us to deploy the cloud-based version. But if you, you know, in some cases, you know, there is not very much you can do that will not allow you to go through the internet. There are companies that don’t buy Salesforce for that reason. So if you don’t buy Salesforce, you probably not buy from anybody else on the planet. So for those scenarios, that’s what we do.

Chris Engelbert: Right. So how does it work afterwards? So the machine learning looks into the data, tries to find patterns, has some optimization or some … Is it only queries or does it also give me like recommendations on how to optimize the Postgres configuration itself? And how does that present those? I guess they’re going to be shown in the UI.

Luigi Nardi: So we're specifically focusing on that aspect, the optimization of the configuration of Postgres. So that's our focus. And so the things like, if you're familiar with Postgres, things like the shared buffers, which is this buffer, which contains the copy of the data from tables from the disk and keep it a local copy on RAM. And that data is useful to keep it warm in RAM, because when you interact with the CPU, then you don't need to go all the way back to disk. And so if you go all the way back to disk, there is an order of magnitude more like delay and latency and slow down based on that. So you try to keep the data close to where it's processed. So trying to keep the data in cache as much as possible and share buffer is a form of cache where the cache used in this case is a piece of RAM. And so sizing these shared buffers, for example, is important for performance. And then there are a number of other things similar to that, but slightly different. For example, in Postgres, there is an allocation of a buffer for each query. So each query has a buffer which can be used as an operating memory for the query to be processed. So if you're doing some sort of like sorting, for example, in the query that small memory is used again. And you want to keep that memory close to the CPU and specifically the work_mem parameter, for example, is what helps with that specific thing. And so we optimize all this, all these things in a way that the flow of data from disk to the registers of the CPU, it's very, very smooth and it's optimized. So we optimize the locality of the data, both spatial and temporal locality if you want to use the technical terms for that.

Chris Engelbert: Right. Okay. So it doesn’t help me specifically with my stupid queries. I still have to find a consultant to fix that or find somebody else in the team.

Luigi Nardi: Yeah, for now, that’s correct. We will probably focus on that in the future. But for now, the way you usually optimize your queries is that you optimize your queries and then if you want to see what’s the actual benefit, you should also optimize your parameters. And so if you want to do it really well, you should optimize your queries, then you go optimize your parameters and go back optimize again your queries, parameters and kind of converge into this process. So now that one of the two is fully automated, you can focus on the queries and, you know, speed up the process of optimizing the queries by a large margin. So to in terms of like benefits, of course, if you optimize your queries, you will write your queries, you can get, you know, two or three order of magnitude performance improvement, which is really, really great. If you optimize the configuration of your system, you can get, you know, an order of magnitude in terms of performance improvement. And that’s, that’s still very, very significant. Despite what many people say, it’s possible to get an order of magnitude improvement in performance. If your system by baseline, it’s fairly, it’s fairly basic, let’s say. And the interesting fact is that by the nature of Postgres, for example, the default configuration of Postgres needs to be pretty conservative because Postgres needs to be able to run on big server machines, but also on smaller machines. So the form factor needs to be taken into account when you define the default configuration of Postgres. And so by that fact, it needs to be pretty conservative. And so what you can observe out there is that this problem is so complex that people don’t really change the default configuration of Postgres when they run on a much bigger instance. And so there is a lot of performance improvement that can be obtained by changing that configuration to a better-suited configuration. And you have the point of doing this through automation and through things like DBtune is that you can then refine the configuration of your system specifically for the specific use case that you have, like your application, your workload, the machine size, and all these things are considered together to give you the best outcome for your use case, which is, I think, the new part, the novelty of this approach, right? Because if you’re doing this through some sort of heuristics, they usually don’t really get to cover all these different things. And there will always be kind of super respect to what you can do with an observability loop, right?

Chris Engelbert: Yeah, and I think you mentioned that a lot of people don’t touch the configuration. I think there is the problem that the Postgres configuration is very complex. A lot of parameters depend on each other. And it’s, I mean, I’m coming from a Java background, and we have the same thing with garbage collectors. Optimizing a garbage collector, for every single algorithm you have like 20 or 30 parameters, all of them depend on each other. Changing one may completely disrupt all the other ones. And I think that is what a lot of people kind of fear away from. And then you Google, and then there’s like the big Postgres community telling you, “No, you really don’t want to change that parameter until you really know what you’re doing,” and you don’t know, so you leave it alone. So in this case, I think something like Dbtune will be or is absolutely amazing.

Luigi Nardi: Exactly. And, you know, if you spend some time on blog posts learning about the Postgres parameters you get that type of feedback and takes a lot of time to learn it in a way that you can feel confident and comfortable in changes in your production system, especially if you’re working in a big corporation. And the idea here is that at DBtune we are partnered with leading Postgres experts as well. Magnus Hagander, for example, we see present of the Postgres Europe organization, for example, it’s been doing this manual tuning for about two decades and we worked very closely with him to be able to really do this in a very safe manner, right. You should basically trust our system to be doing the right thing because it’s engineering a way that incorporates a lot of domain expertise so it’s not just machine learning it’s also about the specific Postgres domain expertise that you need to do this well and safely.

Chris Engelbert: Oh, cool. All right. We’re almost out of time. Last question. What do you think it’s like the next big thing in Postgres and databases, in cloud, in db tuning.

Luigi Nardi: That’s a huge question. So we’ve seen all sorts of things happening recently with, of course, AI stuff but, you know, I think it’s, it’s too simple to talk about that once more I think you guys covered those type of topics a lot. I think what’s interesting is that there is there is a lot that has been done to support those type of models and using for example the rise of vector databases for example, which was I think quite interesting vector databases like for example the extension for Postgres, the pgvector was around for a little while but in last year you really saw a huge adoption and that’s driven by all sort of large language models that use this vector embeddings and that’s I think a trend that will see for a little while. For example, our lead investor 42CAP, they recently invested in another company that does this type of things as well, Qdrant for example, and there are a number of companies that focus on that Milvus and Chroma, Zilliz, you know, there are a number of companies, pg_vectorize as well by the Tembo friends. So this is certainly a trend that will stay and for a fairly long time. In terms of database systems, I am personally very excited about the huge shift left that is happening in the industry. Shift left the meaning all the databases of service, you know, from Azure flexible server Amazon RDS, Google Cloud SQL, those are the big ones, but there are a number of other companies that are doing the same and they’re very interesting ideas, things that are really, you know, shaping that whole area, so I can mention a few for example, Tembo, even EnterpriseDB and so on that there’s so much going on in that space and in some sort, the DBtune is really in that specific direction, right? So helping to automate more and more of what you need to do in a database when you’re operating at database. From a machine learning perspective, and then I will stop that Chris, I think we’re running out of time. From machine learning perspective, I’m really interested in, and that’s something that we’ve been studying for a few years now in my academic team, with my PhD students. The, you know, pushing the boundaries of what we can do in terms of using machine learning for computer systems and specifically when you get computer systems that have hundreds, if not thousands of parameters and variables to be optimized at the same time jointly. And we have recently published a few pieces of work that you can find on my Google Scholar on that specific topic. So it’s a little math-y, you know, it’s a little hard to maybe read them parts, but it’s quite rewarding to see that these new pieces of technology are becoming available to practitioners and people that work on applications as well. So that perhaps the attention will move away at some point from full LLMs to also other areas in machine learning and AI that are also equally interesting in my opinion.

Chris Engelbert: Perfect. That’s, that’s beautiful. Just send me the link. I’m happy to put it into the show note. I bet there’s quite a few people that would be really, really into reading those things. I’m not big on mathematics that’s probably way over my head, but that’s, that’s fine. Yeah, I was that was a pleasure. Thank you for being here. And I hope we. Yeah, I hope we see each other somewhere at a Postgres conference we just briefly talked about that before the recording started. So yeah, thank you for being here. And for the audience, I see you, I hear you next week or you hear me next week with the next episode. And thank you for being here as well.

Luigi Nardi: Awesome for the audience will be at the Postgres Switzerland conference as sponsors and we will be giving talks there so if you come by, feel free to say hi, and we can grab coffee together. Thank you very much.

Chris Engelbert: Perfect. Yes. Thank you. Bye bye.

The post Machine Learning driven Database Optimization with Luigi Nardi from DBtune (interview) appeared first on simplyblock.

]]>
How Oracle transforms its operation into a cloud business with Gerald Venzl from Oracle https://www.simplyblock.io/blog/how-oracle-transforms-its-operation-into-a-cloud-business-with-gerald-venzl-from-oracle-video/ Fri, 17 May 2024 12:11:50 +0000 https://www.simplyblock.io/?p=270 This interview is part of the simplyblock’s Cloud Commute Podcast, available on Youtube , Spotify , iTunes/Apple Podcasts , Pandora , Samsung Podcasts, and our show site. In this installment of podcast, we’re joined by Gerald Venzl ( Twitter/X , Personal Blog ), a Product Manager from Oracle Database , who talks about the shift […]

The post How Oracle transforms its operation into a cloud business with Gerald Venzl from Oracle appeared first on simplyblock.

]]>
This interview is part of simplyblock’s Cloud Commute Podcast, available on Youtube, Spotify, iTunes/Apple Podcasts, Pandora, Samsung Podcasts, and our show site.

In this installment of podcast, we’re joined by Gerald Venzl ( Twitter/X , Personal Blog ), a Product Manager from Oracle Database , who talks about the shift of focus away from on-premise databases towards the cloud. It’s a big change for a company like Oracle, but a necessary one. Learn more about the challenges and why Oracle believes multi-cloud is the future.

EP12: How Oracle transforms its operation into a cloud business with Gerald Venzl from Oracle

Chris Engelbert: Welcome back to the next episode of simplyblock's Cloud Commute podcast. Today I have a very special guest, like always. I mean, I never have non-special guests. But today he's very special because he's from a very different background. Gerald, welcome. And maybe you can introduce yourself. Who are you? And how did you happen to be here?

Gerald Venzl: Yeah, thank you very much, Chris. Well, how I really don't know, but I'm Gerald, I'm a database product manager for Oracle Database, working for Oracle a bit over 12 years now in California, originally from Austria. And yeah, kind of had an interesting path that set me into database product management. Essentially, I was a developer who developed a lot of PL/SQL alongside other programming languages, building ERP systems with databases in the background, the Oracle database. And eventually that's how I ended up in product management for Oracle. The 'how I'm here', I think you found me. We had a fun conversation about 5 years ago, as we know, when we met first at a conference, as it so often happens. And you reached out and I think today is all about talking about Cloud Native, databases and everything else we can come up with.

Chris Engelbert: Exactly. Is it 5 years ago that we’ve seen last time or that we’ve seen at all?

Gerald Venzl: No, that we’ve met 5 years ago.

Chris Engelbert: Seriously?

Gerald Venzl: Yeah.

Chris Engelbert: Are you sure it wasn’t JavaOne somewhere way before that?

Gerald Venzl: Well, we probably crossed paths, right? But I think it was the conference there where we both had the speaker dinner and got to exchange some, I mean, more than just like, “Hello, I’m so-and-so.”

Chris Engelbert: All right, that’s fair. Well, you said you’re working for Oracle. I think Oracle doesn’t really need any introduction. Probably everyone listening in knows what Oracle is. But maybe you said you’re a product manager from the database department. So what is that like? I mean, it’s special or it’s different from the typical audience or from the typical guest that I have. So how is the life of a product manager?

Gerald Venzl: Yeah, so what I particularly like about Oracle product management, or especially in database, is that obviously different lines of business inside Oracle may operate differently. It's a job with a lot of facets. So the typical kind of product management job, the way it was described to me, was: well, you gather customer requirements, you bring them back to development, then they get implemented, and then you do go-to-market campaigns. So basically, you're responsible for the collateral, the message to advocate these new features to customers, to the world. And that's not so true for Oracle. I think one of the things that really excites me in the database world is that this goes back to the late 70s. I mean, other than Larry, not that many people are around from that era anymore. But Oracle back then did a lot of things that were either before their time or where there simply was no other choice or established way of doing it yet, I would say. So one of the nice things at Oracle is that coming up with new features is really a nice collaboration between development and product management.

So development has just as many ideas of their own about what we need to do or should be doing as the PMs, and we really get together and discuss it out. And of course, sometimes there are features that you may or may not agree with personally or don't see the need for. And often, actually, and much more so, you get quite amazed by what we've come up with. We have a lot of really smart people at work. And one thing that, yeah, not to go too much into a rabbit hole, but a couple of things that I really like; believe it or not, database development feels a lot like a startup. There are no fixed hierarchies as such, no 'you can only do this, you must only do this,' or anything like that. You can very openly approach the development leads, even up to the SVP levels. And actually, just as we started now, one of those guys was like, "Hey, let's talk while I'm driving into work." I was like, "Sorry, I'm busy right now." So you have that going. And then also, there's a lot of the product management work that has a lot of facets to it. So it's not just 'define the product' or anything like that. That is obviously part of it, but it's also evangelizing, as I'm doing right now. I speak to people on a thought leadership front for data management, if you like, or how to organize data and so forth.

And as I said before, one other thing that I really enjoy about working in this team is that there are actually quite a lot of really smart people in the org who go back to the 90s, and some of them even to the 80s. So I've got one guy who can explain exactly how you would lay out some bytes on disk for the fastest read, etc. This is stuff that I never really touched anymore in school. We were already too abstract. It's like, "Yeah, yeah, yeah, whatever. There's some disk and it stores some stuff." But you still get these low-level guys, and one of them is like, "Yeah, I helped on the C compiler back then with Kernighan." He was one of the guys who was involved in it. And anyway, as you know, in this industry people get around quite a bit. So there's a lot going on there.

Chris Engelbert: So from the other perspective, I mean, Oracle is known for big database servers. I think everyone remembers the database clusters. These days, it’s mostly like SUN, SPARC, I guess. But there’s also the Oracle Cloud and the database in the cloud. So how does that play into each other?

Gerald Venzl: Oh, yeah. Now things have changed drastically. I mean, traditionally this started as database software in the good old 80s, where you didn't even have terminal servers or whatever, or client-server. So the first version was apparently terminal-based or something like that.

It’s like, again, I never saw this. But there was a big client server push.And obviously now there’s a big push into what’s cloud and a lot of cloud means really distributed systems. And so how does it play into each other? So all the database cloud services in Oracle Cloud, all the Oracle database cloud services are owned by us in development as well.

So we have gone into this mode of building cloud services for Oracle Database. And of course, that's really nice because it gives us visibility into the requirements of distributed storage or distributed workloads, and that in turn feeds back into the product. So for example, we are still one of the very few relational databases that offers you sharding on a relational model, which is, of course, much harder than a self-contained hierarchical model such as JSON, which you can shard much more nicely. But once you actually split up your data across a bunch of different tables and have relations between those, sharding becomes quite a bit more complicated.
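To illustrate why sharding a relational model is harder, here is a small conceptual sketch (generic DDL with made-up table names, not any specific Oracle sharding syntax): related tables have to share the sharding key so that a customer and all of their orders land on the same shard and joins stay local, whereas a self-contained JSON document can be placed on any shard on its own.

```sql
-- Parent table; customer_id doubles as the sharding key
CREATE TABLE customers (
  customer_id NUMBER        NOT NULL,
  name        VARCHAR2(200) NOT NULL,
  PRIMARY KEY (customer_id)
);

-- Child table; it carries the same sharding key so each order is
-- co-located with its customer and the join does not cross shards
CREATE TABLE orders (
  order_id    NUMBER        NOT NULL,
  customer_id NUMBER        NOT NULL,
  total       NUMBER(10, 2),
  PRIMARY KEY (customer_id, order_id),
  FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
);
```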

And then of course, we have a lot of database know-how. We also have MySQL; they do their own thing as well, with good collaboration going on with them. So we have quite a lot of, I want to say, brainpower, intellectual power in the company when it comes to accessing data and to writing data. You mentioned SPARC before. There's, of course, a lot of that going on. And quite frankly, I would say even way before cloud, there was this idea of accessing data that doesn't necessarily sit in a database but analyzing it or querying it with SQL. You literally go back like 10, 12 years ago and everybody said Hadoop would kill every database and big data was the way forward. And I'm sure there was the same thing going on in the mid-2000s; I was not in the industry yet. So yeah, this notion that you have data sitting somewhere else and you just want to analyze it has been around for a long time, actually much longer than people see now with object store buckets and data lakes and all the good stuff.

Chris Engelbert: So how does that look for customers? I mean, I can see that smaller customers won't have an issue with the cloud, but I could imagine that banks or insurance companies or the like may actually have one. What does the typical cloud customer for Oracle look like? I think it may be very different from a lot of other people using Postgres or something.

Gerald Venzl: Yeah. I mean, you kind of mentioned it before. I think the question is, 'are you small or are you large?' Right. And the SMB, the small and medium business customers, the smaller ones, obviously, they're very much attracted by cloud, the fact that they don't have to stand up servers and data centers themselves to just get their product or their services to their customers. The big guys are much more about consolidation, and for the biggest customers we work with, it's really that their data center costs are massive because they are massive data centers. So they are looking at it more as a cost-saving exercise. Okay, if we can lift and shift this all to cloud, not only can I close down my data centers, or a large portion of them, but of course most of them are actually also re-leveraging their workforce. So people, especially the Ops guys, are always very scared of cloud, or often very scared that cloud will take their job away. But actually most customers are just thinking, 'rather than looking after the servers running this good stuff, maybe in 2024 we can leverage your time for something that's more important to the business, more tangible to the business.' So they're not necessarily looking so much to just get rid of that workforce, but to transform it to take care of other tasks.

A couple of years ago, when we did a big push to cloud for Oracle Database and our premier database cloud service, Autonomous Database, came out, there was quite a big push for the DBAs to transform into something more like a data governance person. So all the data privacy laws have crept in quite heavily in the last 5 to 10 years. I mean, they were always there, but with GDPR and all these sorts of laws, they are quite different in what they are asking compared to the data privacy laws before. And this is getting more and more complex, quite frankly. So there was obviously a lot of, 'hey, you are the guys who look after these databases storing these terabytes and terabytes of data. Now we have these regulatory requirements, where this needs to be stored, how this needs to be accessed, et cetera. And we'd rather have you figure that out than figure out whether the backup was taken successfully or something like that.' So you're looking at that angle.

But yeah, the big guys, I think to some extent, also very quickly get concerned about whether data is stored in a public cloud or not. Oracle was actually, I want to say, either the first or definitely a forerunner of what we call Cloud@Customer. So basically you can have an Oracle cloud at your site. So Oracle Cloud gets installed in your data center. That's for those customers who say, "This data is really, really precious." You always have a spectrum. There's a lot of data you don't care about, a lot of public data that you may or may not store, reference data and so forth, that you have to have for your operations. And then there's the really sensitive data, your customer confidential information and so forth. So there's always a spectrum of stuff where 'I don't care, this can move quicker to cloud' or whatever. And then of course, the highly confidential data or competitively confidential data: 'I really don't want anybody else to get a hold of this,' or 'it's not allowed for regulatory reasons.'

For those systems, they then look into a similar model where they say, 'well, we like this sort of subscription-based model where we just pay a monthly or yearly fee per use, and still all the automation is there. We still don't have to have people checking whether the backup was successful or something. But we want it in our data center. We want to be in full control. We want to be able to basically pull out the cable if we have to, and the data resides in our data center and you guys can no longer access it.' Sort of in that sense. I mean, that is obviously very extreme. And so this is what we call Cloud@Customer. You can have an Oracle cloud environment installed in your data center. Our guys will go in there and set everything up like it is in public cloud.

Chris Engelbert: That is interesting. I didn’t know that thing existed.

Gerald Venzl: Yeah, it's actually gotten much bigger now. So just to finish up on that: now we have these, I mean, even governments as the next level, right? So governments come back and they say, "We're not going to store our data in another country's data center." So this kind of exploded into even what we call government regions. And there are some public references out there where some governments actually have a government region of Oracle cloud in their country.

Chris Engelbert: So it's interesting. I didn't know that Oracle Cloud@Customer existed. Is that maybe how AWS handles the, what is it called, Oracle at AWS or something?

Gerald Venzl: No, AWS is different. AWS came out with Outposts, but that was actually years later, and when you do your research, you see that Oracle had this way longer. But now I think every provider has some sort of 'Cloud@Customer' derivative. And nowadays AWS does offer Oracle databases in what they call RDS, the relational database service. But I think what you're thinking of is the Microsoft Azure partnership that we did.

So there's Oracle Database at Microsoft Azure. And even that has a precursor to it. So a couple of years ago, basically Microsoft and Oracle partnered up and put a fast interconnect between the two clouds, so that you kind of don't go out over the public net. You could interconnect them from cloud data center to cloud data center; they were essentially co-located in the same kind of data center buildings. I mean, factories are really what they look like these days. So that's how you got this fast interconnect, with the buildings kind of next to each other. And that was the beginning of the partnership. And yeah, by now, there was a big announcement, you know, Satya Nadella and Larry Ellison were up in Redmond at Microsoft, I want to say it was last fall, around September, something like that, where they had this joint announcement that, yeah, you can now have Oracle Database in Azure. But you know, the Oracle database happens to still run on Oracle Cloud Infrastructure, and that's why this fast interconnect is exposed via Azure.

Now the important thing is, all the billing, all the provisioning, all the connectivity, everything you do is going through Azure. So you actually don't have to know Oracle Cloud; the fact that it runs in Oracle Cloud is all taken care of. And that caters to the customers we have. You know, lots and lots and lots of customers have applications that run on a Microsoft stack, or take any Windows-based application that is in Azure; it's a natural fit if it happens to have an Oracle database backend. And I think that in general is something that we see in the industry right now: these clouds in the beginning became these massive monolithic islands, where you can go into the cloud and they provide you all these services, but it was very hard to actually talk to different services between clouds.

And our founder and CTO Larry Ellison thinks very highly of what he calls multi-cloud, or what we call multi-cloud. You know, you should not have to kind of put all your eggs in one basket. It's literally the good old story of vendor lock-in again, just in the cloud world. So yeah, you should not have to have just one cloud provider and that's it. And even there, we have already seen government regulations that actually say you have to be able to run on at least two clouds. So if one cloud provider goes out of business or goes down or whatever, you don't completely go out of business either. I mean, it's unlikely, but you know how government regulations happen, right?

Chris Engelbert: Right. So two very important questions. First, super, super important: how do I get an interconnect from the Azure data centers to my home?

Gerald Venzl: Yeah, that I don’t know. They are really expensive. There are some big pipes.

Chris Engelbert: The other one, I mean, sure, that’s a partnership between, you said Microsoft and Oracle, so maybe I was off, but are other cloud providers on the roadmap? Are there talks? If you can talk about that.

Gerald Venzl: Yeah. I mean, I'm too far away to know what exactly is happening. I do know for a fact that we get the question from customers as well, all the time. And, you know, against common belief, I want to say, it's not so much us who isn't willing to play ball. It's more the other cloud vendors. So, we are definitely interested in exposing our services, especially Oracle database services, on other clouds, and we actively pursue that. But yeah, it basically needs a big corporate partnership. There are many people that look at that and want to have a say in it. But I hope that in some time we reach a point where all of these clouds perhaps become interconnected, or at least it's easier to exchange information. I mean, even this ingress/egress thing is already ridiculous, I find. So this was another thing that Oracle did from the very early days. We didn't charge for egress, right? 'If data goes out of your cloud, well, we don't charge you for it.' And now you see other cloud vendors dropping their egress prices, either constantly going lower or dropping them altogether. But you know, customer demand will push it eventually, right?

Chris Engelbert: Right. I think that is true. I mean, for a lot of bigger companies, it becomes very important to not be on just a single cloud provider, but to be failure-safe, fault-tolerant, whatever you want to call it. And that means sometimes you actually have to go to separate clouds, but keeping state or synchronizing state between those clouds is, as you said, very, very expensive, or it gets very expensive very fast. Let's say it that way. So, because we're pretty much running out of time already, is there any secret on the roadmap you really want to share?

Gerald Venzl: Regarding cloud or in general? I mean, one thing that I should say is, with Oracle Database, a lot of people may say, 'this is old, this is legacy, what can I do with it,' etc. That's all not true, right? We just announced our vector support and got quite heavily involved with that lately. So that's new and exciting. And you will soon see a new version of Oracle Database, we announced this already at CloudWorld, that has this vector support in it. So we're definitely top-notch there.

And the 'how do I get started with Oracle Database' part, this is also something that many people haven't looked at in a long time. So these days, you can get an Oracle database via a Docker image, or you have this new database variation called Oracle Database Free. So you can literally just Google 'Oracle Database Free'. It's the successor of the good old Express Edition, for those people who happen to have heard of that. But too many people didn't know that there was a free variant of Oracle Database. And so that's why we literally put it in the name, 'Oracle Database Free.' So that's your self-contained, free-to-use Oracle Database. It has certain storage restrictions, basically, so the database can't grow too big. And the big item is that it doesn't come with commercial support. So you can think of it a little bit like the Community Edition and Enterprise Edition in the open source world. Oracle Database Free is the free thing that doesn't come with support; it essentially restricts itself to a certain size. And it's really meant for you to tinker around, develop, run small apps on, etc. But yeah, just Google that or go to oracle.com/database/free. You will find it there. And just give Oracle Database a go. I think you will find that we have kept up with the times. As mentioned before, we are one of the very few relational databases that can shard on a relational model, not only on JSON or whatever. So certainly a lot of good things in there.

Chris Engelbert: Right. So, last question, what do you think is like the next big thing or the next cool thing, or even maybe it's already here?

Gerald Venzl: I mean, I'm looking at the whole AI thing that's obviously pushing heavily. And I'm like old enough to have seen some hype cycles, you know, kind of completely facepalm. And I'm still young enough to be very excited. So somewhere on the fence there to be like, AI could be the next big thing, or it could just, you know, kind of once everybody realizes…

Chris Engelbert: The next not-big-thing.

Gerald Venzl: Exactly. I think right now there's nothing else on the horizon. I mean, there's always something coming. But I think everybody's so laser-focused on AI right now that we probably don't even care to look anywhere else. So we'll see how that goes. But yeah, I think there's something to it. We shall see.

Chris Engelbert: That's fair. I think that is probably true as well. I mean, I ask that question to everyone, and I would always have a hard time answering it myself. So I'm asking all these people to have some good answers ready if somebody asks me that someday.

Gerald Venzl: Yes. Smart, actually.

Chris Engelbert: I know, I know. That's what I tried to be. I wanted to say that I am, but I'm not sure I'm actually smart. All right. That was a pleasure. It was nice. Thank you very much for being here. I hope to see you somewhere at a conference soon again.

Gerald Venzl: Yeah, thanks for having me. It was really fun.

Chris Engelbert: Oh no, my pleasure. And for the audience, hear you next week or you hear me next week. Next episode, next week. See you. Thanks.

The post How Oracle transforms its operation into a cloud business with Gerald Venzl from Oracle appeared first on simplyblock.

Coding the Cloud: A Dive into Data Streaming with Gunnar Morling from Decodable (video + interview) https://www.simplyblock.io/blog/coding-the-cloud-a-dive-into-data-streaming-with-gunnar-morling-video/ Fri, 26 Apr 2024 12:13:28 +0000 https://www.simplyblock.io/?p=283 This interview is part of the simplyblock’s Cloud Commute Podcast, available on Youtube , Spotify , iTunes/Apple Podcasts , Pandora , Samsung Podcasts, and our show site . In this installment of podcast, we’re joined by Gunnar Morling (X/Twitter) , from Decodable , a cloud-native stream processing platform that makes it easier to build real-time […]

This interview is part of simplyblock's Cloud Commute Podcast, available on YouTube, Spotify, iTunes/Apple Podcasts, Pandora, Samsung Podcasts, and our show site.

In this installment of the podcast, we're joined by Gunnar Morling (X/Twitter) from Decodable, a cloud-native stream processing platform that makes it easier to build real-time applications and services. He highlights the challenges and opportunities in stream processing, as well as the evolving trends in database and cloud technologies.


Chris Engelbert: Hello everyone. Welcome back to the next episode of simplyblock's Cloud Commute podcast. Today I have a really good guest, and a really good friend, with me. We've known each other for quite a while. I don't know, many, many, many years. Another fellow German. And I guess a lot of, at least when you're in the Java world, you must have heard of him. You must have heard him. Gunnar, welcome. Happy to have you.

Gunnar Morling: Chris, hello, everybody. Thank you so much. Super excited. Yes, I don't know, to be honest, for how long we have known each other. Definitely quite a few years, you know, always running into each other in the Java community.

Chris Engelbert: Right. I think the German Java community is very encapsulated. There's a good chance you know a good chunk of them.

Gunnar Morling: I mean, you would actively have to try and avoid each other, I guess, if you really don’t want to meet somebody.

Chris Engelbert: That is very, very true. So, well, we already heard who you are, but maybe you can give a little bit of a deeper introduction of yourself.

Gunnar Morling: Sure. So, I'm Gunnar. I work as a software engineer right now at a company called Decodable. We are a small startup in the data streaming space, essentially moving and processing your data. And I think we will talk more about what that means. So, that's my current role. And I have, you know, a bit of a mixed role between engineering and then also doing outreach work, like doing blog posts, podcasts maybe, sometimes going to conferences, talking about things. So, that's what I'm currently doing. Before that, I had been, exactly to the day, for 10 years at Red Hat, where I worked on several projects. So, I started working on different projects from the Hibernate umbrella. Yes, it's still a thing. I still like it. So, I was doing that for roughly five years, working on Bean Validation. I was the spec lead for Bean Validation 2.0, for instance, which I think is also how we met, or at least I believe we interacted somehow in the context of Bean Validation. I remember something there. And then, well, I worked on a project which is called Debezium. It's a tool and a platform for change data capture. And again, we will dive into that. But I guess that's what people might know me for. I'm also a Java champion, as you are, Chris. And I did this challenge. I need to mention it. I did this kind of viral challenge in the Java space. Some people might also have come across my name in that context.

Chris Engelbert: All right. Let’s get back to the challenge in a moment. Maybe say a couple of words about Decodable.

Gunnar Morling: Yes. So, essentially, we built a SaaS, a software as a service for stream processing. This means, essentially, it connects to all kinds of data systems, let's say databases like Postgres or MySQL, streaming platforms like Kafka or Apache Pulsar. It takes data from those kinds of systems. And in the simplest case, it just takes this data and puts it into something like Snowflake, like a search index, maybe another database, maybe S3, maybe something like Apache Pinot or ClickHouse. So, it's about data movement in the simplest case, taking data from one place to another. And very importantly, all this happens in real time. So, it's not batch driven, like, you know, running once per hour, once per day or whatever. But this happens in near real time. So, not in the hard, you know, computer science sense of the word, with a fixed SLA, but with a very low latency, like seconds, typically. But then, you know, going beyond data movement, there's also what we would call data processing. So, it's about filtering your data, transforming it, routing it, joining multiple of those real-time data streams, doing things like groupings, real-time analytics of this data, so you can gain insight into your data. So, this is what we do. It's based on Apache Flink as a stream processing engine. It's based on Debezium as a CDC tool. So, this gives you source connectivity with all kinds of databases. And yeah, people use it for, as I mentioned, taking data from one place to another, but then also for, I don't know, doing fraud detection, gaining insight into their purchase orders or customers, you know, all those kinds of things, really.

Chris Engelbert: All right, cool. Let’s talk about your challenge real quick, because you already mentioned stream processing. Before we go on with, like, the other stuff, like, let’s talk about the challenge. What was that about?

Gunnar Morling: What was that about? Yes, this was, to be honest, kind of a random thing, which I started over the holidays between, you know, Christmas and New Year's Eve. So, this had been on my mind for quite some time, doing something like processing one billion rows, because that's what it was, a one billion row challenge. And this had been on my mind for a while. And somehow, then, I had this idea, okay, let me just put it out into the community, and let's make a challenge out of it and essentially ask people: how fast can you be with Java to process one billion rows of a CSV file? And the task was, you know, to take temperature measurements, which were given in that file, and aggregate them per weather station. So, the measurements or the rows in this file were essentially always, you know, a weather station name and then a temperature value. And you had to aggregate them per station, which means you had to get the minimum, the maximum, and the mean value per station. So, this was the task. And then it kind of took off. So, like, you know, many people from the community entered this challenge, and also really big names like Aleksey Shipilëv, Cliff Click, Thomas Wuerthinger, the lead of GraalVM at Oracle, and many, many others. They started to work on this and they kept working on it for the entire month of January, really bringing down those execution times. Essentially, in the end, it was like less than two seconds for processing this file, which I had with 13 gigabytes of size, on an eight-core CPU configuration.

Chris Engelbert: I think the important thing is he said less than a second, which is already impressive because a lot of people think Java is slow and everything. Right. We know those terms and those claims.

Gunnar Morling: By the way, I should clarify. So, you know, I mean, this is highly parallelizable, right? So, the less-than-a-second number, I think it was like 350 milliseconds or so, that was on all 32 cores I had in this machine, with hyperthreading, with turbo boost. So, this was the best I could get.

Chris Engelbert: But it also included reading those, like 13 gigs, right? And I think that is impressive.

Gunnar Morling: Yes. But again, then reading from memory. So, essentially, I wanted to make sure that disk IO is not part of the equation because it would be super hard to measure for me anyway. So, that’s why I said, okay, I will have everything in a RAM disk. And, you know, so everything comes or came out of memory for that context.

Chris Engelbert: Ok. Got it. But still, it got pretty viral. I've seen it from the start and I was kind of blown away by who joined that discussion. It was really cool to watch and to just follow along. I didn't have time to jump into it myself, but by the numbers and the results I've seen, I would not have won anyway. So that was me not wasting time.

Gunnar Morling: Absolutely. I mean, people pulled off really crazy tricks to get there. And by the way, if you're at JavaLand in a few weeks, I will do a talk about some of those things there.

Chris Engelbert: I think by the time this comes out, it was a few weeks ago. But we’ll see.

Gunnar Morling: Ok. I made the mistake for every recording. I made the temporal reference.

Chris Engelbert: That’s totally fine. I think a lot of the JavaLand talks are now recorded these days and they will show up on YouTube. So when this comes out and the talks are already available, I’ll just put it in the show notes.

Gunnar Morling: Perfect.

Chris Engelbert: All right. So that was the challenge. Let’s get back to Decodable. You mentioned Apache Flink being like the underlying technology build on. So how does that work?

Gunnar Morling: So Apache Flink, essentially, is an open source project which concerns itself with real-time data processing. So it's essentially an engine for processing either bounded or unbounded streams of events. There's also a way where you could use it in a batch mode, but this is not what we are too interested in so far. It's always about unbounded data streams coming from a Kafka topic. So it takes those event streams and it defines semantics on those event streams. Like, what's an event time? What does it mean if an event arrives late or out of order? So you have the building blocks for all those kinds of things. Then you have a stack, a layer of APIs, which allow you to implement stream processing applications. So there are more imperative APIs, in particular the DataStream API. There you really program your flow in an imperative way, in Java, typically, or Scala, I guess. Yeah, Scala, I don't know who does that, but there may be some people. And then there are more and more abstract APIs. So there's a Table API, which essentially gives you a relational programming paradigm. And finally, there's Flink SQL, which is also what Decodable employs heavily in the product. So there you reason about your data streams in terms of SQL. Let's say, you know, you want to take the data from an external system; you would express this as a create table statement, and then this table would be backed by a Kafka topic. And you can then do a select from such a table. And then of course you can do, you know, projections by massaging your select clause. You can do filtering by adding where clauses, you can join multiple streams by, well, using the join operator, and you can do windowed aggregations. So I would feel that's the most accessible way of doing stream processing, because there's, of course, a large number of people who can write SQL, right?

Chris Engelbert: Right. And I just wanted to say, it's all a SQL dialect, and as far as I've seen, it's pretty close to the original standard SQL.

Gunnar Morling: Yes, exactly. And then there are a few extensions, you know, because you need to have this notion of event time, or what does it mean, how do you express how much lateness you would be willing to accept for an aggregation? So there are a few extensions like that. But overall, it's SQL. For my demos, oftentimes I can start working on Postgres, develop some queries on Postgres, and then I just take them, paste them into the Flink SQL client, and they might just run as is, or they may need a little bit of adjustment, but it's pretty much standard SQL.
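To make this concrete, here is a minimal Flink SQL sketch of the kind of pipeline Gunnar describes (the topic names, field names, and connector options are illustrative assumptions, not taken from the conversation): one table backed by a Kafka topic as the source, one as the sink, and a continuous query that filters and projects between them.

```sql
-- Source table backed by a Kafka topic (connector options are illustrative)
CREATE TABLE purchase_orders (
  order_id   BIGINT,
  category   STRING,
  amount     DECIMAL(10, 2),
  order_time TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'purchase-orders',
  'properties.bootstrap.servers' = 'broker:9092',
  'format' = 'json',
  'scan.startup.mode' = 'earliest-offset'
);

-- Sink table, here simply another Kafka topic
CREATE TABLE large_orders (
  order_id BIGINT,
  category STRING,
  amount   DECIMAL(10, 2)
) WITH (
  'connector' = 'kafka',
  'topic' = 'large-orders',
  'properties.bootstrap.servers' = 'broker:9092',
  'format' = 'json'
);

-- A continuous query: filter and project the unbounded stream into the sink
INSERT INTO large_orders
SELECT order_id, category, amount
FROM purchase_orders
WHERE amount > 100;
```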

Chris Engelbert: All right, cool. The other thing you mentioned was the Debezium. And I know you, I think you originally started Debezium. Is that true?

Gunnar Morling: It's not true. No, I did not start it. It was somebody else at Red Hat, Randall Hauch; he's now at Confluent. But I took over the project quite early on. So Randall started it, and I came in after a few months, I believe. And yeah, I think this is when it really took off, right? So, you know, I went to many conferences, I spoke about it. And of course, others as well. The team grew at Red Hat. So yeah, I was the lead for quite a few years.

Chris Engelbert: So for the people that don’t know, maybe just give a few words about what Debezium is, what it does, and why it is so cool.

Gunnar Morling: Right. Yes. Oh, man, where should I start? In a nutshell, it's a tool for what's called change data capture. So this means it taps into the transaction log of your database. And then whenever there's an insert or an update or a delete, it will capture this event, and it will propagate it to consumers. So essentially, you could think about it like the observer pattern for your database. So whenever there's a data change, like a new customer record gets created, or a purchase order gets updated, those kinds of things, you can, you know, react and extract this change event from the database, push it to consumers, either via Kafka, or via pullbacks in an API way, or via, you know, Google Cloud Pub/Sub, Kinesis, all those kinds of things. And then, well, you can take those events, and it enables a ton of use cases. So, you know, in the simplest case, it's just about replication. So taking data from your operational database to your cloud data warehouse, or to your search index, or maybe to a cache. But then people also use change data capture for doing things like microservices data exchange, because, I mean, microservices, you want to have them independent, but still, they need to exchange data, right? So they don't exist in isolation, and change data capture can help with that, in particular with what's called the outbox pattern. Just as a side note, people use it for splitting up monolithic systems into microservices. You can use this change event stream as an audit log. I mean, if you kind of think about it, if you just keep those events, all the updates to a purchase order, and put them into a database, it's kind of like an audit log, right? Maybe you want to enrich it with a bit of metadata. You can do streaming queries. So maybe you want to spot specific patterns in your data as it changes, and then trigger some sort of alert. That's another use case, and there are many, many more. But really, it's a super versatile tool, I would say.

Chris Engelbert: Yeah, and I also have a couple of talks on that area. And I think my favorite example, that’s something that everyone understands is that you have some order coming in, and now you want to send out invoices. Invoices don’t need to be sent like, in the same operation, but you want to make sure that you only send out the invoice if the invoice was, or if the order was actually generated in the database. So that is where the outbox pattern comes in, or just looking at the order table in general, and filtering out all the new orders.

Gunnar Morling: Yes.
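A minimal, Postgres-flavored sketch of the outbox pattern Chris describes (table names, columns, and the payload are illustrative assumptions): the order and its outbox record are written in the same transaction, so the event only ever exists if the order was actually committed, and a CDC tool like Debezium then picks up the outbox inserts.

```sql
BEGIN;

-- The actual business data
INSERT INTO purchase_orders (id, customer_id, total)
VALUES (1001, 42, 99.90);

-- The event for downstream consumers (e.g. the invoicing service), written
-- in the same transaction so it becomes visible only if the order commits
INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload)
VALUES (gen_random_uuid(), 'order', '1001', 'OrderCreated',
        '{"orderId": 1001, "customerId": 42, "total": 99.90}');

COMMIT;
```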

Chris Engelbert: So yeah, absolutely a great tool. Love it. It supports many, many databases. Any idea how many so far?

Gunnar Morling: It keeps growing. I know of certainly 10 or so, or more. The interesting thing there is, well, you know, there is no standardized way you could implement something like Debezium. So each of the databases has its own APIs and formats, its own way of extracting those change events, which means there needs to be a dedicated Debezium connector for each database which we want to support. And then the core team, you know, added support for MySQL, Postgres, SQL Server, Oracle, Cassandra, MongoDB, and so on. But then what happened is that other companies and other organizations also picked up the Debezium framework. So for instance, something like Google Cloud Spanner is now also supported via Debezium, because the team at Google decided that they want to expose change events based on the Debezium event format and infrastructure. Or ScyllaDB: they maintain their own CDC connector, but it's based on Debezium. And the nice thing about that is that it gives you as a user one unified change event format, right? So you don't have to care which is the particular source database, does it come from Cloud Spanner, or does it come from Postgres? You can process those events in a unified way, which I think is just great to see, that it establishes itself as sort of a de facto standard, I would say.

Chris Engelbert: Yeah, I think that is important. That is a very, very good point. Debezium basically defined a JSON and I think Avro standard.

Gunnar Morling: Right. So I mean, you know, it defines the, let’s say, the semantic structure, like, you know, what are the fields, what are the types, how are they organized, and then how you serialize it as Avro, JSON, or protocol buffers. That’s essentially like a pluggable concern.
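Roughly, a Debezium change event for an update looks like the following when serialized as JSON (the field values are made up, and the exact envelope depends on the connector and configuration): the old and new row images, some source metadata, and an operation code.

```json
{
  "before": { "id": 1001, "status": "OPEN",    "total": 99.90 },
  "after":  { "id": 1001, "status": "SHIPPED", "total": 99.90 },
  "source": { "connector": "postgresql", "db": "shop", "table": "purchase_orders" },
  "op": "u",
  "ts_ms": 1714130000000
}
```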

Chris Engelbert: Right. So we said earlier, Decodable is a cloud platform. So, to put it in a slightly simplified way, you basically have Apache Flink on steroids, ready to use, plus a couple of things on top of that. So maybe talk a little bit about that.

Gunnar Morling: Right. So yes, that’s the underlying tech, I would say. And then of course, if you want to put those things into production, there’s so many things you need to consider. Right. So how do you just go about developing and versioning those SQL statements? If you iterate on a statement, you want to have maybe like a preview and get a feeling or maybe just validation of this. So we have all this editing experience, preview. Then maybe you don’t want that all of your users in your organization can access all those streaming pipelines, which you have. Right. So you want to have something like role-based access control. You want to have managed connectors. You want to have automatic provisioning and sizing of your infrastructure. So you don’t want to think too much, “hey, do I need to keep like five machines for this dataflow sitting around?” And what happens if I don’t need them? Do I need to remove them and then scale them back up again? So all this auto scaling, auto provisioning, this is something which we do. Then we will primarily allow you to use SQL to define your queries, but then also we actually let you run your own custom Flink jobs. If that’s something which you want to do, you can do this. We are very close. And again, by the time this will be released, it should be live already. We will have Python, PyFlink support, and yeah, many, many more things. Right. So really it’s a managed experience for those dataflows.

Chris Engelbert: Right. That makes a lot of sense. So let me see. From a user’s perspective, I’m mostly working with SQL. I’m writing my jobs. I’m deploying those. Those jobs are everything from simple ETL to extract, translate, load. What’s the L again?

Gunnar Morling: Load.

Chris Engelbert: There you go. Nobody needs to load data. They just magically appear. But you can also do data enrichment. You said that earlier. You can do joins. Right. So is there anything I have to be aware of that is very complicated compared to just using a standard database?

Gunnar Morling: Yeah. I mean, I think this entire notion of event time, this definitely is something which can be challenging. So let's say you want to do some sort of windowed analysis, like, you know, how many purchase orders do I have per category and hour, this kind of thing. And now, depending on what the source of your data is, those events might arrive out of order. Right. So it might be that your hour has closed, but then, like, five minutes later, because some event was stuck in some queue, you still get an event for that past hour. Right. And of course, now there's this tradeoff between, okay, how accurate do you want your data to be, essentially, how long do you want to wait for those late events, versus, well, what is your latency? Right. Do you want to get out this updated count at the top of the hour? Or can you afford to wait for those five minutes? So there's a bit of a tradeoff. I think, you know, this entire complex of event time, that's certainly something where people often need at least some time to learn and grasp the concepts.
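A hedged Flink SQL sketch of the tradeoff Gunnar describes (table, field, and connector details are again illustrative): the WATERMARK declaration states how much lateness you are willing to wait for, and the tumbling window produces the per-category, per-hour counts once that wait is over.

```sql
CREATE TABLE orders_with_event_time (
  order_id   BIGINT,
  category   STRING,
  order_time TIMESTAMP(3),
  -- Accept events arriving up to 5 minutes late or out of order
  WATERMARK FOR order_time AS order_time - INTERVAL '5' MINUTE
) WITH (
  'connector' = 'kafka',
  'topic' = 'purchase-orders',
  'properties.bootstrap.servers' = 'broker:9092',
  'format' = 'json'
);

-- Hourly counts per category; a result is emitted once the watermark
-- passes the end of the hour, i.e. after the configured 5 minutes of waiting
SELECT
  category,
  TUMBLE_START(order_time, INTERVAL '1' HOUR) AS window_start,
  COUNT(*) AS order_count
FROM orders_with_event_time
GROUP BY category, TUMBLE(order_time, INTERVAL '1' HOUR);
```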

Chris Engelbert: Yeah, that's a very good one. In a previous episode, we had a discussion about connected cars. And connected cars may or may not have an internet connection all the time. So you get super, super late events sometimes. All right. Because we're almost running out of time.

Gunnar Morling: Wow. Ok.

Chris Engelbert: Yeah. 20 minutes is like nothing. What is the biggest trend you see right now in terms of database, in terms of cloud, in terms of whatever you like?

Gunnar Morling: Right. I mean, that's a tough one. Well, I guess there can only be one answer, right? It has to be AI. I feel it's like, I know it's boring. Well, the trend is not boring, but saying it is kind of boring. But I mean, that's what I would see. The way I could see this impacting things like what we do, I mean, it could help you just with scaling, of course. Like, you know, we could make intelligent predictions about what your workload is like. Maybe we can take a look at the data and we can sense, okay, you know, it might make sense to scale out some more compute already, because we will know with a certain likelihood that it may be needed very shortly. I could see that. Then, of course, I mean, it could just help you with authoring those flows, right? I mean, with all those LLMs, it might be doable to give you some sort of guided experience there. So that's a big trend for sure. Then I guess another one, a more technical one: I feel like there's a unification happening, right, of systems and categories of systems. So right now we have, you know, databases here, stream processing engines there. And I feel those things might come more closely together. And you would have real-time streaming capabilities also in something like Postgres itself. And, I don't know, maybe you would expose Postgres as a Kafka broker, in a sense. So I could also see some closer integration of those different kinds of tools.

Chris Engelbert: That is interesting, because I also think that there is a general like movement to, I mean, in the past we had the idea of moving to different databases, because all of them were very specific. And now all of the big databases, Oracle, Postgres, well, even MySQL, they all start to integrate all of those like multi-model features. And Postgres, being at the forefront, having this like super extensibility. So yeah, that would be interesting.

Gunnar Morling: Right. I mean, it's always going in cycles, I feel. And even this trend to decomposition, it gives you all those good building blocks, which you then can put together and, you know, create a more cohesive, integrated experience. And then I guess in five years, we want to tear it apart again and let people integrate everything themselves.

Chris Engelbert: In 5 to 10 years, we have the next iteration of microservices. We called it SOAP, we called it whatever. Now we call it microservices. Who knows what we will call it in the future. All right. Thank you very much. That was a good chat. Like always, I love talking.

Gunnar Morling: Yeah, thank you so much for having me. This was great. Enjoyed the conversation. And let's talk soon.

Chris Engelbert: Absolutely. And for everyone else, come back next week. A new episode, a new guest. And thank you very much. See you.

The post Coding the Cloud: A Dive into Data Streaming with Gunnar Morling from Decodable (video + interview) appeared first on simplyblock.
