Database Performance: Impact of Storage Limitations
https://www.simplyblock.io/blog/database-performance-storage-limitations/

TLDR: Storage and storage limitations have a fundamental impact on database performance, with access latency creating a hard physical limitation on IOPS, queries per second (QPS), and transactions per second (TPS).

With the rise of the cloud-native world of microservices, event-driven architectures, and distributed systems, understanding storage physics has never been more critical. As organizations deploy hundreds of database instances across their infrastructure, the multiplicative impact of storage performance becomes a defining factor in system behavior and database performance metrics, such as queries per second (QPS) and transactions per second (TPS).

While developers obsess over query optimization and index tuning, a more fundamental constraint silently shapes every database operation: the raw physical limits of storage access.

These limits aren’t just academic concerns; they’re affecting your systems right now. Every microservice with its own database, every Kubernetes StatefulSet, and every cloud-native application wrestles with these physical boundaries, often without realizing it. When your system spans multiple availability zones, involves event sourcing, or requires real-time data processing, storage physics becomes the hidden multiplier that can either enable or cripple your entire architecture.

In this deep dive, we’ll explain how storage latency and IOPS create performance ceilings that no amount of application-level optimization can break through. More importantly, we’ll explore how understanding these physical boundaries is crucial for building truly high-performance, cloud-native systems that can scale reliably and cost-effectively.

The Latency-IOPS-QPS-TPS Connection

When we look at database and storage performance, there are four essential metrics to understand.

Figure 1: Core metrics for database performance: Access latency, IOPS, QPS (Queries per Second), TPS (Transactions per Second)

Latency (or access latency) measures how long it takes to complete a single I/O operation, from issue to completion. IOPS (Input/Output Operations Per Second), on the other hand, represents how many operations can be performed per second. Hence, IOPS measures the raw storage throughput for read/write operations.

On the database side, QPS (Queries Per Second) represents the number of query operations that can be executed per second – essentially the higher-level application throughput. Lastly, TPS (Transactions Per Second) defines how many database transactions can be executed per second. A single transaction may contain one or more queries.

These metrics have key dependencies:

  • Each query typically requires multiple I/O operations.
  • As IOPS increases, latency increases due to queuing and resource contention.
  • Higher latency constrains the maximum achievable IOPS and QPS.
  • The ratio between QPS and IOPS varies based on query complexity and access patterns.
  • TPS is the higher-level counterpart of QPS; the two are directly related.

Consider a simple example:
If your storage system has a latency of 1 millisecond per I/O operation, the theoretical maximum IOPS would be 1,000 (assuming perfect conditions). However, increase that latency to 10 milliseconds, and your maximum theoretical IOPS drops to 100. Suppose each query requires an average of 2 I/O operations. In that case, your maximum QPS would be 500 at 1 ms latency but only 50 at 10 ms latency – demonstrating how latency impacts both IOPS and QPS in a cascading fashion.

1 second = 1000ms

1 I/O operation = 10ms
IOPS = 1000 / 10 = 100

1 query = 2 I/O ops
QPS = 100 / 2 = 50

The above is a simplified example. Modern storage devices have parallelism built in and can serve multiple I/O operations simultaneously. However, the database’s storage engine has to be able to exploit that parallelism, and even then it only raises the ceiling rather than removing it.
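
To make this relationship concrete, here is a minimal Python sketch of the back-of-the-envelope math above. The latency values, I/O operations per query, and queue depth are illustrative assumptions rather than measurements.

def storage_ceiling(latency_ms, io_per_query=2, queries_per_tx=1, queue_depth=1):
    """Rough upper bounds derived purely from storage access latency.

    queue_depth crudely models how many I/O operations the device can
    serve in parallel; real devices and storage engines are more complex.
    """
    iops = (1000.0 / latency_ms) * queue_depth  # operations per second
    qps = iops / io_per_query                   # queries per second
    tps = qps / queries_per_tx                  # transactions per second
    return iops, qps, tps

# 10 ms access latency, no parallelism: 100 IOPS and 50 QPS, as in the example above
print(storage_ceiling(10.0))

# 1 ms access latency: the same workload's ceiling rises tenfold to 1,000 IOPS and 500 QPS
print(storage_ceiling(1.0))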

Impact on Database Performance

For database workloads, the relationship between latency and IOPS becomes even more critical. Here’s why:

  1. Query Processing Speed: Lower latency means faster individual query execution for data read from storage devices.
  2. Concurrent Operations: Higher IOPS enables more simultaneous database operations.
  3. Transaction Processing: The combination affects how many transactions per second (TPS) your database can handle.

The Hidden Cost of Latency

Storage latency impacts database operations in subtle but profound ways. Consider a typical PostgreSQL instance running on AWS EBS gp3 storage, which averages 2-4ms latency for read-write operations. While this might seem negligible, let’s break down its real impact:

Transaction Example:

  • Single read operation: 3ms
  • Write to WAL: 3ms
  • Page write: 3ms
  • fsync(): 3ms

Total latency: 12ms minimum per transaction
Maximum theoretical transactions per second: ~83

This means even before considering CPU time, memory access, or network latency, storage alone limits your database to fewer than 100 truly consistent transactions per second. Many teams don’t realize they’re hitting this physical limit until they’ve spent weeks optimizing application code with diminishing returns.

The IOPS Dance

IOPS limitations create another subtle challenge. Traditional cloud block storage solutions like Amazon EBS often struggle to simultaneously deliver low latency and high IOPS. This limitation can force organizations to over-provision storage resources, leading to unnecessary costs. For example, when running databases on AWS, many organizations provision multiple high-performance EBS volumes to achieve their required IOPS targets. However, this approach significantly underutilizes storage capacity while still not achieving optimal latency.

A typical gp3 volume provides a baseline of 3,000 IOPS. Let’s see how this plays out in real scenarios:

Common Database Operations IOPS Cost:

  • Index scan: 2-5 IOPS per page
  • Sequential scan: 1 IOPS per page
  • Write operation: 2-4 IOPS (data + WAL)
  • Vacuum operation: 10-20 IOPS sustained

With just 20 concurrent users performing moderate-complexity queries, you could easily exceed your IOPS budget without realizing it. The database doesn’t stop – it just starts queueing requests, creating a cascading effect of increasing latency.
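
The arithmetic behind “exceeding your IOPS budget” is easy to sketch. The per-operation costs below follow the rough figures listed above; the workload mix (queries per user per second, pages touched per query) is an illustrative assumption, not a benchmark.

IOPS_BUDGET = 3000  # gp3 baseline

def iops_demand(concurrent_users, queries_per_user_per_s=5,
                index_pages_per_query=10, writes_per_query=1):
    index_cost = index_pages_per_query * 3  # ~2-5 IOPS per index page, assume 3
    write_cost = writes_per_query * 3       # ~2-4 IOPS per write (data + WAL), assume 3
    return concurrent_users * queries_per_user_per_s * (index_cost + write_cost)

demand = iops_demand(20)
print(f"Estimated demand: {demand} IOPS against a budget of {IOPS_BUDGET}")
# With these assumptions, 20 users already need ~3,300 IOPS and the volume starts queueing.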

Real-World Database Performance Implications

Here’s a scenario many teams encounter:
A database server handling 1,000 transactions per minute seems to be performing well, with CPU usage at 40% and plenty of available memory. Yet response times occasionally spike inexplicably. The hidden culprit? Storage queuing:

Storage Queue Analysis:

  • Average queue depth: 4
  • Peak queue depth: 32
  • Additional latency per queued operation: 1ms
  • Effective latency during peaks: 35ms

Impact:

  • 3x increase in transaction time
  • Timeout errors in the application layer
  • Connection pool exhaustion
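
A minimal sketch of how queueing inflates effective latency, using the figures from the analysis above (3 ms base latency, roughly 1 ms of extra wait per queued operation). The linear model is a simplification of real queueing behavior.

def effective_latency(base_ms, queue_depth, per_queued_op_ms=1.0):
    # Each operation waits behind queue_depth others before being served.
    return base_ms + queue_depth * per_queued_op_ms

for depth in (0, 4, 32):
    print(f"queue depth {depth:>2}: ~{effective_latency(3.0, depth):.0f} ms per I/O")
# queue depth  0: ~3 ms, queue depth 4: ~7 ms, queue depth 32: ~35 ms (the peak value above)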

The Ripple Effect

Storage performance limitations create unexpected ripple effects throughout the database system:

Connection Pool Behavior

When storage latency increases, transactions take longer to complete. This leads to connection pool exhaustion, not because of too many users, but because each connection holds onto resources longer than necessary.
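
Little’s Law makes this effect easy to quantify: the number of connections that are busy at any moment is roughly the transaction arrival rate multiplied by how long each transaction holds its connection. The transaction rate below is illustrative; the latencies reuse the earlier examples.

def busy_connections(tx_per_second, tx_duration_ms):
    # Little's Law: L = lambda * W
    return tx_per_second * (tx_duration_ms / 1000.0)

# Illustrative: 500 transactions per second arriving at the pool
print(busy_connections(500, 12))  # ~6 connections busy at 12 ms per transaction
print(busy_connections(500, 35))  # ~17.5 connections busy once storage latency spikes to 35 ms
# The workload didn't change, yet the pool needs roughly 3x the connections.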

Buffer Cache Efficiency

Higher storage latency makes buffer cache misses more expensive. This can cause databases to maintain larger buffer caches than necessary, consuming memory that could be better used elsewhere.

Query Planner Decisions

Most query planners don’t factor in current storage performance when making decisions. A plan that’s optimal under normal conditions might become significantly suboptimal during storage congestion periods.

Breaking Free from Storage Constraints

Figure 2: Impact of access latency and IOPS on query performance, queries per second, transactions per second, and query concurrency.

Modern storage solutions, such as simplyblock, are transforming this landscape. NVMe storage offers sub-200μs latency and millions of IOPS. Hence, databases operate closer to their theoretical limits:

Same Transaction on NVMe:

  • Single read operation: 0.2ms
  • Write to WAL: 0.2ms
  • Page write: 0.2ms
  • fsync(): 0.2ms

Total latency: 0.8ms
Theoretical transactions per second: ~1,250

This 15x improvement in theoretical throughput isn’t just about speed – it fundamentally changes how databases can be architected and operated.
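
Running the same storage-bound transaction model for both latencies reproduces the roughly 15x figure. As before, this ignores CPU, memory, and network time and is only a lower-bound sketch.

def storage_bound_tps(per_io_latency_ms, ios_per_tx=4):
    # read + WAL write + page write + fsync = 4 serial I/Os per transaction
    return 1000.0 / (per_io_latency_ms * ios_per_tx)

ebs_tps = storage_bound_tps(3.0)   # ~83 TPS at 3 ms per I/O
nvme_tps = storage_bound_tps(0.2)  # ~1,250 TPS at 0.2 ms per I/O
print(f"{ebs_tps:.0f} TPS -> {nvme_tps:.0f} TPS (~{nvme_tps / ebs_tps:.0f}x)")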

New Architectural Possibilities

Understanding these storage physics opens new possibilities for database architecture:

Rethinking Write-Ahead Logging

With sub-millisecond storage latency, the traditional WAL design might be unnecessarily conservative. Some databases are exploring new durability models that take advantage of faster storage.

Dynamic Resource Management

Modern storage orchestrators can provide insights into actual storage performance, enabling databases to adapt their behavior based on current conditions rather than static assumptions.

Query Planning Evolution

Next-generation query planners could incorporate real-time storage performance metrics, making decisions that optimize for current system conditions rather than theoretical models.

What does the future of database performance optimization look like?

Understanding storage physics fundamentally changes how we approach database architecture and optimization. While traditional focus areas like query optimization and indexing remain essential, the emergence of next-generation storage solutions enables paradigm shifts in database design and operation. Modern storage architectures that deliver consistent sub-200μs latency and high IOPS aren’t just incrementally faster – they unlock entirely new possibilities for database architecture:

  • True Horizontal Scalability: With storage no longer being the bottleneck, databases can scale more effectively across distributed systems while maintaining consistent performance.
  • Predictable Performance: By eliminating storage queuing and latency variation, databases can deliver more consistent response times, even under heavy load.
  • Simplified Operations: When storage is no longer a constraint, many traditional database optimization techniques and workarounds become unnecessary, reducing operational complexity.

For example, simplyblock’s NVMe-first architecture delivers consistent sub-200μs latency while maintaining enterprise-grade durability through distributed erasure coding. This enables databases to operate much closer to their theoretical performance limits while reducing complexity and cost through intelligent storage optimization.

As more organizations recognize that storage physics ultimately governs database behavior, we’ll likely see continued innovation in storage architectures and database designs that leverage these capabilities. The future of database performance isn’t just about faster storage – it’s about fundamentally rethinking how databases interact with their storage layer to deliver better performance, reliability, and cost-effectiveness at scale.

FAQ

What are queries per second?

Queries per second (QPS) in a database context measures how many read or write operations (queries) a database can handle per second.

What are transactions per second?

Transactions per second (TPS) in a database context measures the number of complete, durable operations (involving one or more queries) successfully processed and committed to storage per second.

How to improve database performance?

Improving database performance involves optimizing query execution, indexing data effectively, scaling hardware resources, and fine-tuning storage configurations to reduce latency and maximize throughput.

What is database performance?

Database performance refers to how efficiently a database processes queries and transactions, delivering fast response times, high throughput, and optimal resource utilization. Many factors, such as query complexity, data model, underlying storage performance, and more, influence database performance.

How is database performance affected by storage?

Storage directly influences database performance. Factors like read/write speed, latency, IOPS capacity, and storage architecture (e.g., SSDs vs. HDDs) directly impact database throughput and query execution times.

How to Build Scalable and Reliable PostgreSQL Systems on Kubernetes
https://www.simplyblock.io/blog/postgresql-on-kubernetes/

This is a guest post by Sanskar Gurdasani, DevOps Engineer, from CloudRaft.

Maintaining highly available and resilient PostgreSQL databases is crucial for business continuity in today’s cloud-native landscape. The Cloud Native PostgreSQL Operator provides robust capabilities for managing PostgreSQL clusters in Kubernetes environments, particularly in handling failover scenarios and implementing disaster recovery strategies.

In this blog post, we’ll explore the key features of the Cloud Native PostgreSQL Operator for managing failover and disaster recovery. We’ll discuss how it ensures high availability, implements automatic failover, and facilitates disaster recovery processes. Additionally, we’ll look at best practices for configuring and managing PostgreSQL clusters using this operator in Kubernetes environments.

Why run Postgres on Kubernetes?

Running PostgreSQL on Kubernetes offers several advantages for modern, cloud-native applications:

  1. Stateful Workload Readiness: Contrary to old beliefs, Kubernetes is now ready for stateful workloads like databases. A 2021 survey by the Data on Kubernetes Community revealed that 90% of respondents believe Kubernetes is suitable for stateful workloads, with 70% already running databases in production.
  2. Immutable Application Containers: CloudNativePG leverages immutable application containers, enhancing deployment safety and repeatability. This approach aligns with microservice architecture principles and simplifies updates and patching.
  3. Cloud-Native Benefits: Running PostgreSQL on Kubernetes embraces cloud-native principles, fostering a DevOps culture, enabling microservice architectures, and providing robust container orchestration.
  4. Automated Management: Kubernetes operators like CloudNativePG extend Kubernetes controllers to manage complex applications like PostgreSQL, handling deployments, failovers, and other critical operations automatically.
  5. Declarative Configuration: CloudNativePG allows for declarative configuration of PostgreSQL clusters, simplifying change management and enabling Infrastructure as Code practices.
  6. Resource Optimization: Kubernetes provides efficient resource management, allowing for better utilization of infrastructure and easier scaling of database workloads.
  7. High Availability and Disaster Recovery: Kubernetes facilitates the implementation of high availability architectures across availability zones and enables efficient disaster recovery strategies.
  8. Streamlined Operations with Operators: Using operators like CloudNativePG automates all the tasks mentioned above, significantly reducing operational complexity. These operators act as PostgreSQL experts in code form, handling intricate database management tasks such as failovers, backups, and scaling with minimal human intervention. This not only increases reliability but also frees up DBAs and DevOps teams to focus on higher-value activities, ultimately leading to more robust and efficient database operations in Kubernetes environments.

By leveraging Kubernetes for PostgreSQL deployments, organizations can benefit from increased automation, improved scalability, and enhanced resilience for their database infrastructure, with operators like CloudNativePG further simplifying and optimizing these processes.

List of Postgres Operators

Kubernetes operators represent an innovative approach to managing applications within a Kubernetes environment by encapsulating operational knowledge and best practices. These extensions automate the deployment and maintenance of complex applications, such as databases, ensuring smooth operation in a Kubernetes setup.

The Cloud Native PostgreSQL Operator is a prime example of this concept, specifically designed to manage PostgreSQL clusters on Kubernetes. This operator automates various database management tasks, providing a seamless experience for users. Some key features include direct integration with the Kubernetes API server for high availability without relying on external tools, self-healing capabilities through automated failover and replica recreation, and planned switchover of the primary instance to maintain data integrity during maintenance or upgrades.

Additionally, the operator supports scalable architecture with the ability to manage multiple instances, declarative management of PostgreSQL configuration and roles, and compatibility with Local Persistent Volumes and separate volumes for WAL files. It also offers continuous backup solutions to object stores like AWS S3, Azure Blob Storage, and Google Cloud Storage, ensuring data safety and recoverability. Furthermore, the operator provides full recovery and point-in-time recovery options from existing backups, TLS support with client certificate authentication, rolling updates for PostgreSQL minor versions and operator upgrades, and support for synchronous replicas and HA physical replication slots. It also offers replica clusters for multi-cluster PostgreSQL deployments, connection pooling through PgBouncer, a native customizable Prometheus metrics exporter, and LDAP authentication support.

By leveraging the Cloud Native PostgreSQL Operator, organizations can streamline their database management processes on Kubernetes, reducing manual intervention and ensuring high availability, scalability, and security in their PostgreSQL deployments. This operator showcases how Kubernetes operators can significantly enhance application management within a cloud-native ecosystem.

Here are the most popular PostgreSQL operators:

  1. CloudNativePG (formerly known as Cloud Native PostgreSQL Operator)
  2. Crunchy Data Postgres Operator (first released in 2017)
  3. Zalando Postgres Operator (first released in 2017)
  4. StackGres (released in 2020)
  5. Percona Operator for PostgreSQL (released in 2021)
  6. Kubegres (released in 2021)
  7. Patroni (for HA PostgreSQL solutions, written in Python)

Understanding Failover in PostgreSQL

Primary-Replica Architecture

In a PostgreSQL cluster, the primary-replica (formerly master-slave) architecture consists of:

  • Primary Node: Handles all write operations and read operations
  • Replica Nodes: Maintain synchronized copies of the primary node’s data
Simplyblock architecture diagram of a PostgreSQL cluster running on Kubernetes with local persistent volumes

Automatic Failover Process

When the primary node becomes unavailable, the operator initiates the following process:

  1. Detection: Continuous health monitoring identifies primary node failure
  2. Election: A replica is selected to become the new primary
  3. Promotion: The chosen replica is promoted to primary status
  4. Reconfiguration: Other replicas are reconfigured to follow the new primary
  5. Service Updates: Kubernetes services are updated to point to the new primary

Implementing Disaster Recovery

Backup Strategies

The operator supports multiple backup approaches:

1. Volume Snapshots

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgresql-cluster
spec:
  instances: 3
  backup:
    volumeSnapshot:
      className: csi-hostpath-snapclass
      enabled: true
      snapshotOwnerReference: true

2. Barman Integration

spec:
  backup:
    barmanObjectStore:
      destinationPath: 's3://backup-bucket/postgres'
      endpointURL: 'https://s3.amazonaws.com'
      s3Credentials:
        accessKeyId:
          name: aws-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: aws-creds
          key: ACCESS_SECRET_KEY

Disaster Recovery Procedures

  1. Point-in-Time Recovery (PITR)
    • Enables recovery to any specific point in time
    • Uses WAL (Write-Ahead Logging) archives
    • Minimizes data loss
  2. Cross-Region Recovery
    • Maintains backup copies in different geographical regions
    • Enables recovery in case of regional failures

Demo

This section provides a step-by-step guide to setting up a CloudNative PostgreSQL cluster, testing failover, and performing disaster recovery.

Architecture of a PostgreSQL cluster with primary and replica

1. Installation

Method 1: Direct Installation

kubectl apply --server-side -f \
https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/main/releases/cnpg-1.24.0.yaml

Method 2: Helm Installation

helm repo add cnpg https://cloudnative-pg.github.io/charts
helm upgrade --install cnpg \
  --namespace cnpg-system \
  --create-namespace \
  cnpg/cloudnative-pg

Verify the Installation

kubectl get deployment -n cnpg-system cnpg-controller-manager

Install CloudNativePG Plugin

CloudNativePG provides a plugin for kubectl to manage a cluster in Kubernetes. You can install the cnpg plugin using a variety of methods.

Via the installation script

curl -sSfL \
  https://github.com/cloudnative-pg/cloudnative-pg/raw/main/hack/install-cnpg-plugin.sh | \
  sudo sh -s -- -b /usr/local/bin

If you already have Krew installed, you can simply run:

kubectl krew install cnpg

2. Create S3 Credentials Secret

First, create an S3 bucket and an IAM user with S3 access. Then, create a Kubernetes secret with the IAM credentials:

kubectl create secret generic s3-creds \
  --from-literal=ACCESS_KEY_ID=your_access_key_id \
  --from-literal=ACCESS_SECRET_KEY=your_secret_access_key

3. Create PostgreSQL Cluster

Create a file named cluster.yaml with the following content:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example
spec:
  backup:
    barmanObjectStore:
      destinationPath: 's3://your-bucket-name/retail-master-db'
      s3Credentials:
        accessKeyId:
          name: s3-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: s3-creds
          key: ACCESS_SECRET_KEY
  instances: 2
  imageName: ghcr.io/clevyr/cloudnativepg-timescale:16-ts2
  postgresql:
    shared_preload_libraries:
      - timescaledb
  bootstrap:
    initdb:
      postInitTemplateSQL:
        - CREATE EXTENSION IF NOT EXISTS timescaledb;
  storage:
    size: 20Gi

Apply the configuration to create the cluster:

kubectl apply -f cluster.yaml

Verify the cluster status:

kubectl cnpg status example

4. Getting Access

Deploying a cluster is one thing; actually accessing it is entirely another. CloudNativePG creates three services for every cluster, named after the cluster name. In our case, these are:

kubectl get service

  • example-rw: Always points to the Primary node
  • example-ro: Points only to Replica nodes (round-robin)
  • example-r: Points to any node in the cluster (round-robin)
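
To illustrate how an application inside the cluster might use these services, here is a hedged psycopg2 sketch. The service names come from the cluster above; the database name, user, and the idea of reading the password from the example-app secret follow CloudNativePG’s defaults for this example, so adjust them to your setup.

import psycopg2

# Writes must go to the primary via the -rw service; reads can use the -ro service.
write_conn = psycopg2.connect(
    host="example-rw", dbname="app", user="app",
    password="<password from the example-app secret>",
)
read_conn = psycopg2.connect(
    host="example-ro", dbname="app", user="app",
    password="<password from the example-app secret>",
)

with read_conn.cursor() as cur:
    # pg_is_in_recovery() returns true on replicas, false on the primary
    cur.execute("SELECT pg_is_in_recovery()")
    print("connected to a replica:", cur.fetchone()[0])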

5. Insert Data

Create a PostgreSQL client pod:

kubectl run pgclient --image=postgres:13 --command -- sleep infinity

Connect to the database:

kubectl exec -ti example-1 -- psql app

Create a table and insert data:

CREATE TABLE stocks_real_time (
  time TIMESTAMPTZ NOT NULL,
  symbol TEXT NOT NULL,
  price DOUBLE PRECISION NULL,
  day_volume INT NULL
);

SELECT create_hypertable('stocks_real_time', by_range('time'));
CREATE INDEX ix_symbol_time ON stocks_real_time (symbol, time DESC);
GRANT ALL PRIVILEGES ON TABLE stocks_real_time TO app;

INSERT INTO stocks_real_time (time, symbol, price, day_volume)
VALUES
  (NOW(), 'AAPL', 150.50, 1000000),
  (NOW(), 'GOOGL', 2800.75, 500000),
  (NOW(), 'MSFT', 300.25, 750000);

6. Failover Test

Force a backup:

kubectl cnpg backup example

Initiate failover by deleting the primary pod:

kubectl delete pod example-1

Monitor the cluster status:

kubectl cnpg status example

Key observations during failover:

  1. Initial status: “Switchover in progress”
  2. After approximately 2 minutes 15 seconds: “Waiting for instances to become active”
  3. After approximately 3 minutes 30 seconds: Complete failover with new primary

Verify data integrity after failover through service:

Retrieve the database password:

kubectl get secret example-app -o \
  jsonpath="{.data.password}" | base64 --decode

Connect to the database using the password:

kubectl exec -it pgclient -- psql -h example-rw -U app

Execute the following SQL queries:

-- Confirm the count matches the number of rows inserted earlier. It should show 3.
SELECT COUNT(*) FROM stocks_real_time;

-- Insert new data to test write capability after failover:
INSERT INTO stocks_real_time (time, symbol, price, day_volume)
VALUES (NOW(), 'NFLX', 500.75, 300000);


SELECT * FROM stocks_real_time ORDER BY time DESC LIMIT 1;

Check read-only service:

kubectl exec -it pgclient -- psql -h example-ro -U app

Once connected, execute:

SELECT COUNT(*) FROM stocks_real_time;

Review logs of both pods:

kubectl logs example-1
kubectl logs example-2

Examine the logs for relevant failover information.

Perform a final cluster status check:

kubectl cnpg status example

Confirm both instances are running and roles are as expected.

7. Backup and Restore Test

First, check the current status of your cluster:

kubectl cnpg status example

Note the current state, number of instances, and any important details.

Promote the example-1 node to Primary:

kubectl cnpg promote example example-1

Monitor the promotion process, which typically takes about 3 minutes to complete.

Check the updated status of your cluster, then create a new backup:

kubectl cnpg backup example --backup-name=example-backup-1

Verify the backup status:

kubectl get backups
NAME               AGE   CLUSTER   METHOD              PHASE       ERROR
example-backup-1   38m   example   barmanObjectStore   completed

Delete the original cluster, then prepare for the recovery test:

kubectl delete cluster example

There are two ways to bootstrap a cluster recovery from another cluster in CloudNativePG (for further details, please refer to the documentation):

  • Using a recovery object store, that is a backup of another cluster created by Barman Cloud and defined via the barmanObjectStore option in the externalClusters section (recommended)
  • Using an existing Backup object in the same namespace (this was the only option available before version 1.8.0).

Method 1: Recovery from an Object Store

You can recover from a backup created by Barman Cloud and stored on supported object storage. Once you have defined the external cluster, including all the required configurations in the barmanObjectStore section, you must reference it in the .spec.recovery.source option.

Create a file named example-object-restored.yaml with the following content:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-object-restored
spec:
  instances: 2
  imageName: ghcr.io/clevyr/cloudnativepg-timescale:16-ts2
  postgresql:
    shared_preload_libraries:
      - timescaledb
  storage:
    size: 1Gi
  bootstrap:
    recovery:
      source: example
  externalClusters:
    - name: example
      barmanObjectStore:
        destinationPath: 's3://your-bucket-name'
        s3Credentials:
          accessKeyId:
            name: s3-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: s3-creds
            key: ACCESS_SECRET_KEY

Apply the restored cluster configuration:

kubectl apply -f example-object-restored.yaml

Monitor the restored cluster status:

kubectl cnpg status example-object-restored

Retrieve the database password:

kubectl get secret example-object-restored-app \
  -o jsonpath="{.data.password}" | base64 --decode

Connect to the restored database:

kubectl exec -it pgclient -- psql -h example-object-restored-rw -U app

Verify the restored data by executing the following SQL queries:

-- It should show 4
SELECT COUNT(*) FROM stocks_real_time;
SELECT * FROM stocks_real_time;

The successful execution of these steps to recover from an object store confirms the effectiveness of the backup and restore process.

Delete the example-object-restored cluster, then prepare for the backup object restore test:

kubectl delete cluster example-object-restored

Method 2: Recovery from Backup Object

In case a Backup resource is already available in the namespace in which the cluster should be created, you can specify its name through the .spec.bootstrap.recovery.backup.name option.

Create a file named example-restored.yaml:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-restored
spec:
  instances: 2
  imageName: ghcr.io/clevyr/cloudnativepg-timescale:16-ts2
  postgresql:
    shared_preload_libraries:
      - timescaledb
  storage:
    size: 1Gi
  bootstrap:
    recovery:
      backup:
        name: example-backup-1

Apply the restored cluster configuration:

kubectl apply -f example-restored.yaml

Monitor the restored cluster status:

kubectl cnpg status example-restored

Retrieve the database password:

kubectl get secret example-restored-app \
  -o jsonpath="{.data.password}" | base64 --decode

Connect to the restored database:

kubectl exec -it pgclient -- psql -h example-restored-rw -U app

Verify the restored data by executing the following SQL queries:

SELECT COUNT(*) FROM stocks_real_time;
SELECT * FROM stocks_real_time;

The successful execution of these steps confirms the effectiveness of the backup and restore process.

Kubernetes Events and Logs

1. Failover Events

Monitor events using:

# Watch cluster events
kubectl get events --watch | grep postgresql

# Get specific cluster events
kubectl describe cluster example | grep -A 10 Events

  • Key events to monitor:
    • Primary selection process
    • Replica promotion events
    • Connection switching events
    • Replication status changes

2. Backup Status

Monitor backup progress:

# Check backup status
kubectl get backups

# Get detailed backup info
kubectl describe backup example-backup-1

  • Key metrics:
    • Backup duration
    • Backup size
    • Compression ratio
    • Success/failure status

3. Recovery Process

Monitor recovery status:

# Watch recovery progress
kubectl cnpg status example-restored

# Check recovery logs
kubectl logs example-restored-1 -c postgres

  • Important recovery indicators:
    • WAL replay progress
    • Timeline switches
    • Recovery target status

Conclusion

The Cloud Native PostgreSQL Operator significantly simplifies the management of PostgreSQL clusters in Kubernetes environments. By following these practices for failover and disaster recovery, organizations can maintain highly available database systems that recover quickly from failures while minimizing data loss. Remember to regularly test your failover and disaster recovery procedures to ensure they work as expected when needed. Continuous monitoring and proactive maintenance are key to maintaining a robust PostgreSQL infrastructure.

Everything fails, all the time. ~ Werner Vogels, CTO, Amazon Web Services

Editorial: And if you are looking for distributed, scalable, reliable, and durable storage for your PostgreSQL cluster in Kubernetes, or for any other Kubernetes storage need, simplyblock is the solution you’re looking for.

NVMe Storage for Database Optimization: Lessons from Tech Giants
https://www.simplyblock.io/blog/nvme-database-optimization/

Leveraging NVMe-based storage for databases brings a whole new set of capabilities and performance optimization opportunities. In this blog, we explore how you can adopt NVMe storage for your database workloads, with case studies from tech giants such as Pinterest and Discord.

Database Scalability Challenges in the Age of NVMe

In 2024, data-driven organizations increasingly recognize the crucial importance of adopting NVMe storage solutions to stay competitive. With NVMe adoption still below 30%, there’s significant room for growth as companies seek to optimize their database performance and storage efficiency. We’ve looked at how major tech companies have tackled database optimization and scalability challenges, often turning to self-hosted database solutions and NVMe storage.

While it’s interesting to see what Netflix or Pinterest engineers are investing their efforts into, it is also essential to ask yourself how your organization is adopting new technologies. As companies grow and their data needs expand, traditional database setups often struggle to keep up. Let’s look at some examples of how some of the major tech players have addressed these challenges.

Pinterest’s Journey to Horizontal Database Scalability with TiDB

Pinterest, which handles billions of pins and user interactions, faced significant challenges with its HBase setup as it scaled. As their business grew, HBase struggled to keep up with evolving needs, prompting a search for a more scalable database solution. They eventually decided to go with TiDB as it provided the best performance under load.

Selection Process:

  • Evaluated multiple options, including RocksDB, ShardDB, Vitess, VoltDB, Phoenix, Spanner, CosmosDB, Aurora, TiDB, YugabyteDB, and DB-X.
  • Narrowed down to TiDB, YugabyteDB, and DB-X for final testing.

Evaluation:

  • Conducted shadow traffic testing with production workloads.
  • TiDB performed well after tuning, providing sustained performance under load.

TiDB Adoption:

  • Deployed 20+ TiDB clusters in production.
  • Stores 200+ TB of data across 400+ nodes.
  • Primarily uses TiDB 2.1 in production, with plans to migrate to 3.0.

Key Benefits:

  • Improved query performance, with 2-10x improvements in p99 latency.
  • More predictable performance with fewer spikes.
  • Reduced infrastructure costs by about 50%.
  • Enabled new product use cases due to improved database performance.

Challenges and Learnings:

  • Encountered issues like TiCDC throughput limitations and slow data movement during backups.
  • Worked closely with PingCAP to address these issues and improve the product.

Future Plans:

  • Exploring multi-region setups.
  • Considering removing Envoy as a proxy to the SQL layer for better connection control.
  • Exploring migrating to Graviton instance types for a better price-performance ratio and EBS for faster data movement (and, in turn, shorter MTTR on node failures).

Uber’s Approach to Scaling Datastores with NVMe

Uber, facing exponential growth in active users and ride volumes, needed a robust solution for their datastore “Docstore” challenges.

Hosting Environment and Limitations:

  • Initially on AWS, later migrated to hybrid cloud and on-premises infrastructure
  • Uber’s massive scale and need for customization exceeded the capabilities of managed database services

Uber’s Solution: Schemaless and MySQL with NVMe

  • Schemaless: A custom solution built on top of MySQL
  • Sharding: Implemented application-level sharding for horizontal scalability
  • Replication: Used MySQL replication for high availability
  • NVMe storage: Leveraged NVMe disks for improved I/O performance
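
The post doesn’t detail Uber’s implementation, but the general idea behind application-level sharding can be sketched in a few lines: the application derives a shard from the row’s key and routes the query to the matching MySQL instance. The code below is a conceptual illustration with made-up shard names, not Uber’s actual design.

import hashlib

SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]  # hypothetical hosts

def shard_for(key: str) -> str:
    # Stable hash so the same key always routes to the same shard.
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("trip:12345"))  # the application opens a connection to this shard's MySQL instance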

Results:

  • Able to handle over 100 billion queries per day
  • Significantly reduced latency for read and write operations
  • Improved operational simplicity compared to Cassandra

Discord’s Storage Evolution and NVMe Adoption

Discord, facing rapid growth in user base and message volume, needed a scalable and performant storage solution.

Hosting Environment and Limitations:

  • Google Cloud Platform (GCP)
  • Discord’s specific performance requirements and need for customization led them to self-manage their database infrastructure

Discord’s storage evolution:

  1. MongoDB: Initially used for its flexibility, but faced scalability issues
  2. Cassandra: Adopted for better scalability but encountered performance and maintenance challenges
  3. ScyllaDB: Finally settled on ScyllaDB for its performance and compatibility with Cassandra

Discord also created a solution called “superdisk,” with RAID0 on top of the local SSDs and RAID1 between the Persistent Disk and the RAID0 array. This let them configure the database with a disk drive offering low-latency reads while still benefiting from the best properties of Persistent Disks. One can think of it as a “simplyblock v0.1”.

Figure 1: Discord’s “superdisk” architecture

Key improvements with ScyllaDB:

  • Reduced P99 latencies from 40-125ms to 15ms for read operations
  • Improved write performance, with P99 latencies dropping from 5-70ms to a consistent 5ms
  • Better resource utilization, allowing Discord to reduce their cluster size from 177 Cassandra nodes to just 72 ScyllaDB nodes

Summary of Case Studies

In the table below, we can see a summary of the key initiatives taken by tech giants and their respective outcomes. Notably, all of these companies were self-hosting their databases (on Kubernetes or on bare-metal servers) and leveraged local SSD (NVMe) storage for improved read/write performance and lower latency. At the same time, they all had to work around the data protection and scalability limitations of the local disk. Discord, for example, uses RAID to mirror the disk, which causes significant storage overhead. Such an approach also doesn’t offer a logical management layer (i.e., “storage/disk virtualization”). In the next paragraphs, let’s explore how simplyblock adds even more performance, scalability, and resource efficiency to such setups.

Company   | Database | Hosting environment                               | Key Initiative
Pinterest | TiDB     | AWS EC2 & Kubernetes, local NVMe disk             | Improved performance & scalability
Uber      | MySQL    | Bare-metal, NVMe storage                          | Reduced read/write latency, improved scalability
Discord   | ScyllaDB | Google Cloud, local NVMe disk with RAID mirroring | Reduced latency, improved performance and resource utilization

The Role of Intelligent Storage Optimization in NVMe-Based Systems

While these case studies demonstrate the power of NVMe and optimized database solutions, there’s still room for improvement. This is where intelligent storage optimization solutions like simplyblock are spearheading market changes.

Simplyblock vs. Local NVMe SSD: Enhancing Database Scalability

While local NVMe disks offer impressive performance, simplyblock provides several critical advantages for database scalability. Simplyblock builds a persistent layer out of local NVMe disks, which means it is not just a cache and not just ephemeral storage. Let’s explore the benefits of simplyblock over local NVMe disk:

  1. Scalability: Unlike local NVMe storage, simplyblock offers dynamic scalability, allowing storage to grow or shrink as needed. Simplyblock can scale performance and capacity beyond the local node’s disk size, significantly improving tail latency.
  2. Reliability: Data on local NVMe is lost if an instance is stopped or terminated. Simplyblock provides advanced data protection that survives instance outages.
  3. High Availability: Local NVMe loses data availability during the node outage. Simplyblock ensures storage remains fully available even if a compute instance fails.
  4. Data Protection Efficiency: Simplyblock uses erasure coding (parity information) instead of triple replication, reducing network load and improving effective-to-raw storage ratios by about 150% (for a given amount of NVMe disk, there is 150% more usable storage with simplyblock).
  5. Predictable Performance: As IOPS demand increases, local NVMe access latency rises, often causing a significant increase in tail latencies (p99 latency). Simplyblock maintains constant access latencies at scale, improving both median and p99 access latency. Simplyblock also allows for much faster writes at high IOPS because it does not use the NVMe layer as a write-through cache; hence, its performance isn’t dependent on a backing persistent storage layer (e.g., S3).
  6. Maintainability: Upgrading compute instances impacts local NVMe storage. With simplyblock, compute instances can be maintained without affecting storage.
  7. Data Services: Simplyblock provides advanced data services like snapshots, cloning, resizing, and compression without significant overhead on CPU performance or access latency.
  8. Intelligent Tiering: Simplyblock automatically moves infrequently accessed data to cheaper S3 storage, a feature unavailable with local NVMe.
  9. Thin Provisioning: This allows for more efficient use of storage resources, reducing overprovisioning common in cloud environments.
  10. Multi-attach Capability: Simplyblock enables multiple nodes to access the same volume, which is useful for high-availability setups without data duplication. Additionally, multi-attach can decrease the complexity of volume management and data synchronization.
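
The efficiency argument from point 4 can be checked with simple arithmetic. The erasure-coding layout below (four data chunks plus two parity chunks) is an assumed example for illustration; the post doesn’t state simplyblock’s exact scheme, and the precise gain depends on the chosen layout.

def usable_ratio_replication(copies=3):
    return 1.0 / copies  # triple replication: ~33% of raw capacity is usable

def usable_ratio_erasure(data_chunks=4, parity_chunks=2):
    return data_chunks / (data_chunks + parity_chunks)  # e.g. 4+2: ~67% usable

rep, ec = usable_ratio_replication(), usable_ratio_erasure()
print(f"replication: {rep:.0%} usable, erasure coding: {ec:.0%} usable (~{ec / rep:.1f}x)")
# Wider layouts (more data chunks per parity chunk) push the ratio further toward the ~150% gain cited above.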

Technical Deep Dive: Simplyblock’s Architecture

Simplyblock’s architecture is designed to maximize the benefits of NVMe while addressing common cloud storage challenges:

  1. NVMe-oF (NVMe over Fabrics) Interface: Exposes storage as NVMe volumes, allowing for seamless integration with existing systems while providing the low-latency benefits of NVMe.
  2. Distributed Data Plane: Uses a statistical placement algorithm to distribute data across nodes, balancing performance and reliability.
  3. Logical Volume Management: Supports thin provisioning, instant resizing, and copy-on-write clones, providing flexibility for database operations.
  4. Asynchronous Replication: Utilizes a block-storage-level write-ahead log (WAL) that’s asynchronously replicated to object storage, enabling disaster recovery with near-zero RPO (Recovery Point Objective).
  5. CSI Driver: Provides seamless integration with Kubernetes, allowing for dynamic provisioning and lifecycle management of volumes.
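
To illustrate the general pattern behind point 4, asynchronously shipping a write-ahead log to object storage, here is a conceptual Python sketch using boto3. This is not simplyblock’s implementation; the bucket name, directory, and segment naming are made up for the example.

import glob
import boto3

s3 = boto3.client("s3")
BUCKET = "dr-wal-archive"   # hypothetical bucket
WAL_DIR = "/var/lib/wal"    # hypothetical directory of finished WAL segments

def ship_closed_wal_segments(already_shipped):
    """Copy finished WAL segments to object storage after the fact.

    Because shipping happens after local writes are acknowledged, the write
    path stays fast, while the object store lags only by the segments not
    yet shipped (near-zero RPO, not zero).
    """
    for segment in sorted(glob.glob(f"{WAL_DIR}/*.closed")):
        name = segment.rsplit("/", 1)[-1]
        if name not in already_shipped:
            s3.upload_file(segment, BUCKET, name)  # boto3's managed upload
            already_shipped.add(name)

# Called periodically, e.g. from a background loop or scheduler
shipped = set()
ship_closed_wal_segments(shipped)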

Below is a short overview of simplyblock’s high-level architecture in the context of PostgreSQL, MySQL, or Redis instances hosted in Kubernetes. Simplyblock creates a clustered shared pool out of local NVMe storage attached to Kubernetes compute worker nodes (storage is persistent, protected by erasure coding), serving database instances with the performance of local disk but with an option to scale out into other nodes (which can be either other compute nodes or separate, disaggregated, storage nodes). Further, the “colder” data is tiered into cheaper storage pools, such as HDD pools or object storage.

Figure 2: Simplified simplyblock architecture

Applying Simplyblock to Real-World Scenarios

Let’s explore how simplyblock could enhance the setups of the companies we’ve discussed:

Pinterest and TiDB with simplyblock

While TiDB solved Pinterest’s scalability issues, and they are exploring Graviton instances and EBS for a better price-performance ratio and faster data movement, simplyblock could potentially offer additional benefits:

  1. Price/Performance Enhancement: Simplyblock’s storage orchestration could complement Pinterest’s move to Graviton instances, potentially amplifying the price-performance benefits. By intelligently managing storage across different tiers (including EBS and local NVMe), simplyblock could help optimize storage costs while maintaining or even improving performance.
  2. MTTR Improvement & Faster Data Movements: In line with Pinterest’s goal of faster data movement and reduced Mean Time To Recovery (MTTR), simplyblock’s advanced data management capabilities could further accelerate these processes. Its efficient data protection with erasure coding and multi-attach capabilities helps with smooth failovers or node failures without performance degradation. If a node fails, simplyblock can quickly and autonomously rebuild the data on another node using parity information provided by erasure coding, eliminating downtime.
  3. Better Scalability through Disaggregation: Simplyblock’s architecture allows for the disaggregation of storage and compute, which aligns well with Pinterest’s exploration of different instance types and storage options. This separation would provide Pinterest with greater flexibility in scaling their storage and compute resources independently, potentially leading to more efficient resource utilization and easier capacity planning.
Figure 3: Simplyblock’s multi-attach functionality visualized

Uber’s Schemaless

While Uber’s custom Schemaless solution on MySQL with NVMe storage is highly optimized, simplyblock could still offer benefits:

  1. Unified Storage Interface: Simplyblock could provide a consistent interface across Uber’s diverse storage needs, simplifying operations.
  2. Intelligent Data Placement: For Uber’s time-series data (like ride information), simplyblock’s tiering could automatically optimize data placement based on age and access patterns.
  3. Enhanced Disaster Recovery: Simplyblock’s asynchronous replication to S3 could complement Uber’s existing replication strategies, potentially improving RPO.

Discord and ScyllaDB

Discord’s move to ScyllaDB already provided significant performance improvements, but simplyblock could further enhance their setup:

  1. NVMe Resource Pooling: By pooling NVMe resources across nodes, simplyblock would allow Discord to further reduce their node count while maintaining performance.
  2. Cost-Efficient Scaling: For Discord’s rapidly growing data needs, simplyblock’s intelligent tiering could help manage costs as data volumes expand.
  3. Simplified Cloning for Testing: Simplyblock’s instant cloning feature could be valuable for Discord’s development and testing processes. It allows for quick replication of production data without additional storage overhead.

What’s next in the NVMe Storage Landscape?

The case studies from Pinterest, Uber, and Discord highlight the importance of continuous innovation in database and storage technologies. These companies have pushed beyond the limitations of managed services like Amazon RDS to create custom, high-performance solutions often built on NVMe storage.

However, the introduction of intelligent storage optimization solutions like simplyblock represents the next frontier in this evolution. By providing an innovative layer of abstraction over diverse storage types, implementing smart data placement strategies, and offering features like thin provisioning and instant cloning alongside tight integration with Kubernetes, simplyblock spearheads market changes in how companies approach storage optimization.

As data continues to grow exponentially and performance demands increase, the ability to intelligently manage and optimize NVMe storage will become ever more critical. Solutions that can seamlessly integrate with existing infrastructure while providing advanced features for performance, cost optimization, and disaster recovery will be key to helping companies navigate the challenges of the data-driven future.

The trend towards NVMe adoption, coupled with intelligent storage solutions like simplyblock, is set to reshape the database infrastructure landscape. Companies that embrace these technologies early will be well-positioned to handle the data challenges of tomorrow, gaining a significant competitive advantage in their respective markets.

Why would you run PostgreSQL in Kubernetes, and how?
https://www.simplyblock.io/blog/why-would-you-run-postgresql-in-kubernetes-and-how/

Running PostgreSQL in Kubernetes

When you need a PostgreSQL service in the cloud, there are two common ways to achieve this. The initial thought is going for one of the many hosted databases, such as Amazon RDS or Aurora, Google’s CloudSQL, Azure Database for Postgres, and others. The alternative is to self-host a database, something that was far more common in the era of virtual machines but fell out of favor with containerization. Why? Many believe containers (and Kubernetes specifically) aren’t a good fit for running databases. I firmly believe that cloud databases, while seemingly convenient at first sight, aren’t a great way to scale and that the assumed benefits are not what you think they are. Now, let’s explore deeper strategies for running PostgreSQL effectively in Kubernetes.

Many people still think running a database in Kubernetes is a bad idea. To understand their reasoning, I did the only meaningful thing: I asked X (formerly Twitter) why you should not run a database in Kubernetes. With the important addition of “asking for a friend.” Never forget that bit. You can thank me later 🤣

The answers were very different. Some expected, some not.

K8s is not designed with Databases in Mind!

When Kubernetes was created, it was designed as an orchestration layer for stateless workloads, such as web servers and stateless microservices. That said, it initially wasn’t intended for workloads like databases or any other workload that needs to hold any state across restarts or migration.

So while this answer had some initial merit, it isn’t true today. People from the DoK (Data on Kubernetes) Community and the SIG Storage (Special Interest Group), which is responsible for the CSI (Container Storage Interface) driver interface, as well as the community as a whole, made a tremendous effort to bring stateful workloads to the Kubernetes world.

Never run Stateful Workloads in Kubernetes!

From my perspective, this one is directly related to the claim that Kubernetes isn’t made for stateful workloads. As mentioned before, this was true in the past. However, these days, it isn’t much of a problem. There are a few things to be careful about, but we’ll discuss some later.

Persistent Data will kill you! Too slow!

When containers became popular in the Linux world, primarily due to the rise of Docker, storage was commonly implemented through overlay filesystems. These filesystems had to do quite the magic to combine the read-only container image with some potential (ephemeral) read-write storage. Doing anything IO-heavy on those filesystems was a pain. I’ve built Embedded Linux kernels inside Docker, and while it was convenient to have the build environment set up automatically, IO speed was awful.

These days, though, the CSI driver interface enables direct mounting of all kinds of storage into the container. Raw block storage, file storage, FUSE filesystems, and others are readily available and often offer immediate access to functionality such as snapshotting, backups, resizing, and more. We’ll dive a bit deeper into storage later in the blog post.

Nobody understands Kubernetes!

This is my favorite one, especially since I’m all against the claim that Kubernetes is easy. If you’ve never used Kubernetes before, a database isn’t the way to start. Not … at … all. Just don’t do it.

What’s the Benefit? Databases don’t need Autoscaling!

That one was fascinating. Unfortunately, nobody from this group responded to the question about their commonly administered database size. It would’ve been interesting. Obviously, there are perfect use cases for a database to be scaled—maybe not storage-wise but certainly compute-wise.

The simplest example is an online shop handling the Americas only. It’ll mostly go idle overnight. The database compute could be scaled down close to zero, whereas, during the day, you have to scale it up again.
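
As a hedged illustration of what such schedule-based scaling could look like, the sketch below derives a read-replica count from the hour of day and applies it with kubectl. The hours, replica counts, and the shop-db-replicas StatefulSet name are made-up examples; in practice you would drive this from a CronJob, KEDA, or an operator rather than an ad-hoc script, and you would keep the primary running.

from datetime import datetime, timezone
import subprocess

def desired_read_replicas(hour_utc: int) -> int:
    # Hypothetical schedule: busy American daytime, quiet overnight.
    return 3 if 12 <= hour_utc or hour_utc < 4 else 0

replicas = desired_read_replicas(datetime.now(timezone.utc).hour)
subprocess.run(
    ["kubectl", "scale", "statefulset/shop-db-replicas", f"--replicas={replicas}"],
    check=True,
)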

Databases and Applications should be separated!

I couldn’t agree more. That’s what node groups are for. It probably goes back to the fact that “nobody understands Kubernetes,” so you wouldn’t know about this feature.

Simply speaking, node groups are groups of Kubernetes worker nodes, commonly grouped by hardware specifications. You can label and taint those nodes to specify which workloads are supposed to run on them. This is super useful!

Not another Layer of Indirection / Abstraction!

Last but not least is the abstraction layer argument. And this is undoubtedly a valid one. If everything works, the world couldn’t be better, but if something goes wrong, good luck finding the root cause. And it only worsens the more abstraction you add, such as service meshes or others. Abstraction layers are always a double-edged sword.

Why run PostgreSQL in Kubernetes?

If there are so many reasons not to run my database in Kubernetes, why do I still think it’s not only a good idea but should be the standard?

No Vendor Lock-in

First and foremost, I firmly believe that vendor lock-in is dangerous. While many cloud databases offer standard protocols (such as Postgres or MySQL compatible), their internal behavior or implementation isn’t. That means that over time, your application will be bound to a specific database’s behavior, making it an actual migration whenever you need to move your application or, worse, make it cross-cloud or hybrid-compatible.

Kubernetes abstracts away almost all elements of the underlying infrastructure, offering a unified interface. This makes it easy to move workloads and deployments from AWS to Google, from Azure to on-premise, from everywhere to anywhere.

Unified Deployment Architecture

Furthermore, the deployment landscape will look similar—there will be no special handling by cloud providers or hyperscalers. You have an ingress controller, a CSI driver for storage, and the Cert Manager to provide certificates—it’s all the same.

This simplifies development, simplifies deployment, and, ultimately, decreases the time to market for new products or features and the actual development cost.

Automation

Last, the most crucial factor is that Kubernetes is an orchestration platform. As such, it is all about automating deployments, provisioning, operation, and more.

Kubernetes comes with loads of features that simplify daily operations. These include managing the TLS certificates and up- and down-scaling services, ensuring multiple instances are distributed across the Kubernetes cluster as evenly as possible, restarting failed services, and the list could go on forever. Basically, anything you’d build for your infrastructure to make it work with minimal manual intervention, Kubernetes has your back.

Best Practices when running PostgreSQL on Kubernetes

With those things out of the way, we should be ready to understand what we should do to make our Postgres on K8s experience successful.

While many of the following thoughts aren’t exclusively related to running PostgreSQL on Kubernetes, there are often small bits and pieces that we should be aware of or that make our lives easier than implementing them separately.

That said, let’s dive in.

Enable Security Features

Let’s get the elephant out of the room first. Use security. Use it wherever possible, meaning you want TLS encryption between your database server and your services or clients. But that’s not all. If you use a remote or distributed storage technology, make sure all traffic from and to the storage system is also encrypted.

Kubernetes has excellent support for TLS using Cert Manager. It can create self-signed certificates or sign them using existing certificate authorities (either internal or external, such as Let’s Encrypt).
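For illustration, here is a minimal sketch of requesting a server certificate through Cert Manager from Python. The namespace, issuer, and secret names are assumptions, and the referenced Issuer or ClusterIssuer must already exist (for example, backed by Let’s Encrypt or an internal CA).

```python
# Sketch: ask Cert Manager for a TLS certificate for the PostgreSQL service.
from kubernetes import client, config

config.load_kube_config()

certificate = {
    "apiVersion": "cert-manager.io/v1",
    "kind": "Certificate",
    "metadata": {"name": "postgres-tls", "namespace": "databases"},
    "spec": {
        "secretName": "postgres-tls",  # Secret the database server mounts
        "dnsNames": ["postgres.databases.svc.cluster.local"],  # assumed service name
        "issuerRef": {"name": "internal-ca", "kind": "ClusterIssuer"},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="cert-manager.io",
    version="v1",
    namespace="databases",
    plural="certificates",
    body=certificate,
)
```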

You should also ensure that your stored data is encrypted as much as possible. At least enable data-at-rest encryption. You must make sure that your underlying storage solution supports meaningful encryption options for your business. What I mean by that is that a serverless or shared infrastructure might need an encryption key per mounted volume (for strong isolation between customers). At the same time, a dedicated installation can be much simpler using a single encryption key for the whole machine.

You may also want to consider extended Kubernetes distributions such as Edgeless Systems’ Constellation, which supports fully encrypted memory regions based on hardware support in CPUs and GPUs. It’s probably the highest level of isolation you can get. If you need that level of confidence, here you go. I talked to Moritz from Edgeless Systems in an early episode of my Cloud Commute podcast. You should watch it. It’s really interesting technology!

Backups and Recovery

At conferences, I love to ask the audience questions. One of them is, “Who creates regular backups?” Most commonly, almost the whole room has their hands up. If you add a second question about storing backups off-site (different data center, S3, whatever), about 25% to 30% of the hands already go down. That, in itself, is bad enough.

Adding a third question on regularly testing their backups by playing them back, most hands are down. It always hurts my soul. We all know we should do it, but testing backups at a regular cadence isn’t easy. Let’s face it: It’s tough to restore a backup, especially if it requires multiple systems to be restored in tandem.

Kubernetes can make this process less painful. When I was at my own startup just a few years ago, we tested our backups once a week. You’d say this is excessive? Maybe it was, but it was pretty painless to do. In our case, we specifically restored our PostgreSQL + Timescale database. For the day-to-day operations, we used a 3-node Postgres cluster: one primary and two replicas.

Running a Backup-Restore every week, thanks to Kubernetes

Every week (no, not Fridays 🤣), we kicked off a third replica. Patroni (an HA manager for Postgres) managed the cluster and restored the last full backup. Afterward, it would replay as much of the Write-Ahead Log (WAL) as was available in our Minio/S3 bucket and have the new replica join the cluster. Now here was the exciting part: would the node be able to join, replay the remaining WAL, and become a full-blown cluster member? If yes, the world was all happy. If not, we’d stop everything else and try to figure out what had happened. Let me add that it didn’t fail very often, but we always had a good feeling that the backups worked.
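A hedged sketch of the “did the restored replica actually join?” check we effectively ran each week is shown below. The connection details and the replica’s application_name are assumptions; the check simply asks the primary whether the freshly restored node shows up as a streaming replica.

```python
# Sketch: verify that a restored replica rejoined the cluster and is streaming.
import psycopg2


def replica_is_streaming(replica_name: str = "restore-test") -> bool:
    conn = psycopg2.connect(
        host="postgres-primary", dbname="postgres", user="monitor", password="..."
    )
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT state FROM pg_stat_replication WHERE application_name = %s",
                (replica_name,),
            )
            row = cur.fetchone()
            # 'streaming' means the replica replayed the backup + WAL and caught up.
            return row is not None and row[0] == "streaming"
    finally:
        conn.close()


if __name__ == "__main__":
    if not replica_is_streaming():
        raise SystemExit("Restore test failed: replica did not rejoin the cluster")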

The story above contains one more interesting bit. It uses continuous backups, sometimes also referred to as point-in-time recovery (PITR). If your database supports it (PostgreSQL does), make use of it! If not, a solution like simplyblock may be of help. Simplyblock implements PITR on a block storage level, meaning that it supports all applications on top that implement a consistent write pattern (which are hopefully all databases).

Don’t roll your own Backup Tool

Finally, use existing and tested backup tools. Do not roll your own. You want your backup tool to be industry-proven. A backup is one of the most critical elements of your setup, so don’t take it lightly. Or would you just let anybody build your house?

However, when you have to back up and restore multiple databases or stateful services at once for a consistent but split data set, you need to look into a solution that is more than just a database backup. In this case, simplyblock may be a good solution. Simplyblock can snapshot and back up multiple logical volumes at the same time, creating a consistent view of the world at that point in time and enabling a consistent restore across all services.

Do you use Extensions?

While not all databases are as extensible as PostgreSQL, quite a few have an extension mechanism.

If you need extensions that aren’t part of the standard database container images, remember that you have to build your own image layers. Depending on your company, that can be a challenge. Many companies want signed and certified container images, sometimes for regulatory reasons.

If you have that need, talk to whoever is responsible for compliance (SOC2, PCI DSS, ISO 27000 series, …) as early as possible. You’ll need the time. Compliance is crucial but also a complication for us as engineers or DevOps folks.

In general, I’d recommend that you try to stay with the default images as long as possible. Maybe your database has the option to side-load extensions from volume mounts. That way, you can get extensions validated and certified independently of the actual database image.

For PostgreSQL specifically, OnGres’ StackGres has a magic solution that spins up virtual image layers at runtime. They work on this technology independently from StackGres, so we might see this idea come to other solutions as well.

Think about Updates of your PostgreSQL and Kubernetes

Updates and upgrades are one of the most hated topics around. We all have been in a situation where an update went off the rails or failed in more than one way. Still, they are crucial.

Sometimes, updates bring new features that we need, and sometimes, they bring performance improvement, but they’ll always bring bug fixes. Just because our database isn’t publicly accessible (it isn’t, is it? 🤨) doesn’t mean we don’t have to ensure that essential updates (often critical security bug fixes) are applied. If you don’t believe me, you’d be surprised by how many data breaches or cyber-attacks come from employees. And I’m not talking about accidental leaks or social engineering.

Depending on your database, Kubernetes will not make your life easier. This is especially true for PostgreSQL whenever you have to run pg_upgrade. For those not deep into PG, pg_upgrade will upgrade the database data files from one Postgres version to another. For that to happen, it needs the current and the new Postgres installation, as well as double the storage since it’s not an in-place upgrade but rather a copy-everything-to-the-new-place upgrade.
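For those who haven’t run it, the invocation below is a hedged sketch of what pg_upgrade expects: both binary directories and both data directories, plus the --check dry run first. All paths and versions are hypothetical, and both clusters must be stopped for the actual upgrade (and, as mentioned, you need roughly double the storage).

```python
# Sketch: the two pg_upgrade invocations, wrapped in Python for readability.
import subprocess

OLD_BIN = "/usr/lib/postgresql/15/bin"    # hypothetical paths
NEW_BIN = "/usr/lib/postgresql/16/bin"
OLD_DATA = "/var/lib/postgresql/15/data"
NEW_DATA = "/var/lib/postgresql/16/data"

common_args = [
    f"{NEW_BIN}/pg_upgrade",
    "--old-bindir", OLD_BIN,
    "--new-bindir", NEW_BIN,
    "--old-datadir", OLD_DATA,
    "--new-datadir", NEW_DATA,
]

# Dry run: verify the clusters are compatible before touching any data.
subprocess.run(common_args + ["--check"], check=True)

# The actual upgrade (both clusters must be stopped at this point).
subprocess.run(common_args, check=True)
```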

While not every Postgres update requires you to run pg_upgrade, the ones that do hurt a lot. I bet there are similar issues with other databases.

The development cycles of Kubernetes are fast. It is a fast-moving target that adds new functionality, promotes functionality, and deprecates or changes functionality while still in Alpha or Beta. That’s why many cloud providers and hyperscalers only support the last 3 to 5 versions “for free.” Some providers, e.g., AWS, have implemented an extended support scheme that provides an additional 12 months of support for older Kubernetes versions for only six times the price. For that price difference, maybe hire someone to ensure that your clusters are updated.

Find the right Storage Provider

When you think back to the beginning of the blog post, people were arguing that Kubernetes isn’t made for stateful workloads and that storage is slow.

To prove them wrong, select a storage provider (with a CSI driver) that best fits your needs. Databases love high IOPS and low latency, at least most of them. Hence, you should run benchmarks with your data set, your specific data access patterns and queries, and your storage provider of choice.
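A benchmark doesn’t have to be fancy to be useful. The sketch below measures latency percentiles for one representative query with psycopg2; the DSN, table, and query are placeholders you would replace with your own data set and access patterns, and dedicated tools such as pgbench or fio complement this kind of quick check.

```python
# Sketch: measure query latency percentiles against the storage under test.
import statistics
import time

import psycopg2

DSN = "host=postgres-test dbname=app user=bench password=..."   # placeholder
QUERY = "SELECT * FROM orders WHERE customer_id = %s"            # representative query

latencies = []
with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    for customer_id in range(1, 1001):
        start = time.perf_counter()
        cur.execute(QUERY, (customer_id,))
        cur.fetchall()
        latencies.append((time.perf_counter() - start) * 1000)  # milliseconds

latencies.sort()
print(f"p50: {statistics.median(latencies):.2f} ms")
print(f"p99: {latencies[int(len(latencies) * 0.99) - 1]:.2f} ms")
```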

Try snapshotting and rolling back (if supported), try taking a backup and restoring it, try resizing and re-attaching, try a failover, and measure how long the volume will be blocked before you can re-attach it to another container. All of these elements aren’t even about speed, but about your Recovery Point Objective (RPO) and Recovery Time Objective (RTO). They need to fit your requirements. If they exceed them, that’s awesome, but if you find out you’ll have to breach them due to your storage solution, you’re in a bad spot. Migrations aren’t always fun; actually, they never are.

Last but not least, consider how the data is stored. Remember to check for data-at-rest encryption options. Potentially, you’ll need cross-data center or cross-availability zone replication. There are many things to think about upfront. Know your requirements.

How to find the best Storage Provider?

To help select a meaningful Kubernetes storage provider, I created a tool available at https://storageclass.info/csidrivers. It is an extensive collection of CSI driver implementations searchable by features and characteristics. The list contains over 150 providers. If you see any mistakes, please feel free to open a pull request. Most of the data is extracted manually by looking at the source code.

storageclass.info tool to search CSI drivers by features

Requests, Limits, and Quotas

This one is important for any containerization solution. Most databases are designed with the belief that they can utilize all available resources, and PostgreSQL is no exception.

That should be the case for any database with a good amount of load. I’d always recommend having your central databases run on their own worker nodes, meaning that apart from the database and the essential Kubernetes services, nothing else should be running on the node. Give the database the freedom to do its job.

If you run multiple smaller databases, for example, in a shared hosting environment or a free tier service, sharing Kubernetes worker nodes is most probably fine. In this case, make sure you set the requests, limits, and quotas correctly. Feel free to overcommit, but keep the noisy neighbor problem in the back of your mind when designing the overcommitment.
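As a small, hedged illustration (all numbers are made up), an overcommitted container resources section simply requests the typical load and allows bursting up to the limit:

```python
# Illustrative only: requests reflect typical load, limits allow bursting.
# The gap between requests and limits is what you overcommit across the node.
db_container_resources = {
    "requests": {"cpu": "500m", "memory": "1Gi"},  # guaranteed baseline
    "limits": {"cpu": "2", "memory": "2Gi"},       # burst ceiling
}
```

For memory-hungry databases, it is often safer to set the memory request equal to the limit; a pod bursting above its memory request is an early candidate for eviction when the node comes under memory pressure.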

Apart from that, there isn’t much to say, and a full explanation of how to configure these is out of the scope of this blog post. There is, however, a great beginner write-up by Ashen Gunawardena on Kubernetes resource configuration.

One side note, though: most databases (including PostgreSQL) love huge pages. Remember that huge pages must be enabled in two places: on the host operating system (and I recommend reserving memory to get a large contiguous chunk) and in the database deployment descriptor. Again, Nickolay Ihalainen already has an excellent write-up. While this article is about huge pages with Kubernetes and PostgreSQL, much of the basics are the same for other databases, too.

Make your Database Resilient

One of the main reasons to run your database in the cloud is increased availability and the chance to scale up and down according to your current needs.

Many databases ship high-availability tooling with their distributions. For others, it is part of their ecosystem, just as it is for PostgreSQL. Like with backup tools, I’d strongly discourage you from building your own cluster manager. If you feel like you have to, collect a good list of reasons first. Do not just jump in. High availability is one of the key features. We don’t want it to fail.

Another resiliency consideration is automatic failover. What happens when a node in my database cluster dies? How will my client failover to a new primary or the newly elected leader?

For PostgreSQL, you want to look at the “obvious choices,” such as Patroni, repmgr, and pg_auto_failover. There are more, but those seem to be the ones to use, with Patroni most probably leading the pack.

Connection Pool: Proxying Database Connections

In most cases, a database proxy will transparently handle those issues for your application. They typically handle features such as retrying and transparent failover. In addition, they often handle load balancing (if the database supports it).

This most commonly works by the proxy accepting and terminating the database connection, which itself has a set of open connections to the underlying database nodes. Now the proxy will forward the query to one of the database instances (in case of primary-secondary database setups, it’ll also make sure to send mutating operations to the primary database), wait for the result, and return it. If an operation fails because the underlying database instance is gone, the proxy can retry it against a new leader or other instance.
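The toy sketch below mimics that behavior in a few lines of Python, purely to illustrate the mechanism. Hostnames and credentials are made up, and a real proxy such as PgBouncer, PgPool-II, or PgCat handles this far more carefully (pooling modes, prepared statements, health checks, and so on).

```python
# Toy sketch of proxy-style behavior: find the current primary, forward the
# query, and retry against a freshly resolved primary after a failover.
import time

import psycopg2

PRIMARY_CANDIDATES = ["pg-0.db.svc", "pg-1.db.svc", "pg-2.db.svc"]  # hypothetical


def connect_to_primary():
    for host in PRIMARY_CANDIDATES:
        try:
            # target_session_attrs lets libpq itself reject non-primary nodes.
            return psycopg2.connect(
                host=host, dbname="app", user="app", password="...",
                target_session_attrs="read-write", connect_timeout=3,
            )
        except psycopg2.OperationalError:
            continue
    raise RuntimeError("no primary reachable")


def execute_with_failover(query, params=None, retries=3):
    last_error = None
    for attempt in range(retries):
        try:
            conn = connect_to_primary()
            try:
                with conn, conn.cursor() as cur:
                    cur.execute(query, params)
                    return cur.fetchall()
            finally:
                conn.close()
        except (psycopg2.OperationalError, RuntimeError) as exc:
            last_error = exc
            time.sleep(2 ** attempt)  # back off while the cluster fails over
    raise RuntimeError(f"query failed after {retries} attempts") from last_error
```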

In PostgreSQL, you want to look into tools such as PgBouncer, PgPool-II, and PgCat, with PgBouncer being the most famous choice.

Observability and Monitoring

In the beginning, we established the idea that additional abstraction doesn’t always make things easier. Especially if something goes wrong, more abstraction layers make it harder to get to the bottom of the problem.

That is why I strongly recommend using an observability tool, not just a monitoring tool. There are a bunch of great observability tools available. Some of them are DataDog, Instana, DynaTrace, Grafana, Sumologic, Dash0 (from the original creators of Instana), and many more. Make sure they support your application stack and database stack as wholeheartedly as possible.

A great observability tool that understands the layers and can trace through them is priceless when something goes wrong. They often help to pinpoint the actual root cause and help understand how applications, services, databases, and abstraction layers work together.

Use a Kubernetes Operator

Ok, that was a lot, and I promise we’re almost done. So far, you might wonder how I can claim that any of this is easier than just running bare-metal or on virtual machines. That’s where Kubernetes Operators enter the stage.

Kubernetes Operators are active components inside your Kubernetes environment that deploy, monitor, and operate your database (or other service) for you. They ask for your specifications, like “give me a 3-node PostgreSQL cluster” and set it all up. Usually, including high availability, backup, failover, connection proxy, security, storage, and whatnot.

Operators make your life easy. Think of them as your database operations person or administrator.

For most databases, one or more Kubernetes Operators are available. I’ve written about how to select a PostgreSQL Kubernetes Operator for your setup. For other databases, look at their official documentation or search the Operator Hub.

Anyhow, if you run a database in Kubernetes, make sure you have an Operator at hand. Running a database is more complicated than the initial deployment. I’d even claim that day-to-day operations are more important than deployment.

Actually, for PostgreSQL (and other databases will follow), the Data on Kubernetes Community started a project to create a comparable feature list of PostgreSQL Kubernetes Operators. So far, there isn’t a searchable website yet (as there is for storage providers), but maybe somebody wants to take that on.

PostgreSQL in Kubernetes is not Cloud SQL

If you read to this point, thank you and congratulations. I know there is a lot of stuff here, but I doubt it’s actually complete. I bet if I’d dig deeper, I would find many more pieces to the game.

As I mentioned before, I strongly believe that if you have Kubernetes experience, your initial thought should be to run your database on Kubernetes, taking in all of the benefits of automation, orchestration, and operation.

One thing we shouldn’t forget, though: running Postgres on Kubernetes won’t turn it into Cloud SQL, as Kelsey Hightower once said. However, using a cloud database will also not free you of the burden of understanding query patterns, cleaning up the database, configuring the correct indexes, or all the other elements of managing a database. They literally only take away the operations, and there you have to trust that they do the right thing.

Running your database on Kubernetes won’t turn it into Cloud SQL

Anyhow, being slightly biased, I also believe that your database should use simplyblock’s database storage orchestration. We unify access to pooled Amazon EBS volumes, local instance storage, and Amazon S3, using a virtual block storage device that looks like any regular NVMe/SSD hard disk. Simplyblock enables automatic resizing of the storage pool, hence overcommitting the storage backends, snapshots, instant copy-on-write clones, S3-backed cross-availability zone backups, and many more. I recommend you try it out and see all the benefits for yourself.

The post Why would you run PostgreSQL in Kubernetes, and how? appeared first on simplyblock.

]]>
RDS vs. EKS: The True Cost of Database Management https://www.simplyblock.io/blog/rds-vs-eks/ Thu, 12 Sep 2024 23:21:23 +0000 https://www.simplyblock.io/?p=1641 Databases can make up a significant portion of the costs for a variety of businesses and enterprises, and in particular for SaaS, Fintech, or E-commerce & Retail verticals. Choosing the right database management solution can make or break your business margins. But have you ever wondered about the true cost of your database management? Is […]

The post RDS vs. EKS: The True Cost of Database Management appeared first on simplyblock.

]]>
Databases can make up a significant portion of the costs for a variety of businesses and enterprises, and in particular for SaaS, Fintech, or E-commerce & Retail verticals. Choosing the right database management solution can make or break your business margins. But have you ever wondered about the true cost of your database management? Is your current solution really as cost-effective as you think? Let’s dive deep into the world of database management and uncover the hidden expenses that might be eating away at your bottom line.

The Database Dilemma: Managed Services or Self-Managed?

The first crucial decision comes when choosing the operating model for your databases: should you opt for managed services like AWS RDS or take the reins yourself with a self-managed solution on Kubernetes? It’s not just about the upfront costs – there’s a whole iceberg of expenses lurking beneath the surface.

The Allure of Managed Services

At first glance, managed services like AWS RDS seem to be a no-brainer. They promise hassle-free management, automatic updates, and round-the-clock support. But is it really as rosy as it seems?

The Visible Costs

  1. Subscription Fees : You’re paying for the convenience, and it doesn’t come cheap.
  2. Storage Costs : Every gigabyte counts, and it adds up quickly.
  3. Data Transfer Fees : Moving data in and out? Be prepared to open your wallet.

The Hidden Expenses

  1. Overprovisioning : Are you paying for more than you are actually using?
  2. Personnel costs : Using RDS and assuming that you don’t need to understand databases anymore? Surprise! You still need a team that will configure the database and set it up for your requirements.
  3. Performance Limitations : When you hit a ceiling, scaling up can be costly.
  4. Vendor Lock-in : Switching providers? That’ll cost you in time and money.
  5. Data Migration : Moving data between services can cost a fortune.
  6. Backup and Storage : Those “convenient” backups? They’re not free. In addition, AWS RDS does not let you plug in any storage solution other than AWS-native EBS volumes, which can get quite expensive if your database is IO-intensive.

The Power of Self-Managed Kubernetes Databases

On the flip side, managing your databases on Kubernetes might seem daunting at first. But let’s break it down and see where you could be saving big.

Initial Investment

  1. Learning Curve : Yes, there’s an upfront cost in time and training. You need engineers on your team who are comfortable with Kubernetes or Amazon EKS.
  2. Setup and Configuration : Getting things right takes effort, but it pays off.

Long-term Savings

  1. Flexibility : Scale up or down as needed, without overpaying.
  2. Multi-Cloud Freedom : Avoid vendor lock-in and negotiate better rates.
  3. Resource Optimization : Use your hardware efficiently across workloads.
  4. Resource Sharing : Kubernetes lets you efficiently allocate resources.
  5. Open-Source Tools : Leverage free, powerful tools for monitoring and management.
  6. Customization : Tailor your setup to your exact needs, no compromise.

Where are the Savings Coming from when using Kubernetes for your Database Management?

In a self-managed Kubernetes environment, you have greater control over resource allocation, leading to improved utilization and efficiency. Here’s why:

a) Dynamic Resource Allocation : Kubernetes allows for fine-grained control over CPU and memory allocation. You can set resource limits and requests at the pod level, ensuring databases only use what they need. Example: During off-peak hours, you can automatically scale down resources, whereas in managed services, you often pay for fixed resources 24/7.

b) Bin Packing : Kubernetes scheduler efficiently packs containers onto nodes, maximizing resource usage. This means you can run more workloads on the same hardware, reducing overall infrastructure costs. Example: You might be able to run both your database and application containers on the same node, optimizing server usage.

c) Avoid Overprovisioning : With managed services, you often need to provision for peak load at all times. In Kubernetes, you can use Horizontal Pod Autoscaling to add resources only when needed. Example: During a traffic spike, you can automatically add more database replicas, then scale down when the spike ends.

d) Resource Quotas : Kubernetes allows setting resource quotas at the namespace level, preventing any single team or application from monopolizing cluster resources. This leads to more efficient resource sharing across your organization.
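To illustrate point d), the hedged sketch below creates a namespace-level ResourceQuota with the official Kubernetes Python client; the namespace name and all numbers are assumptions.

```python
# Sketch: cap what one team's databases may request in their namespace.
from kubernetes import client, config

config.load_kube_config()

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="team-a-db-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.cpu": "8",
            "requests.memory": "32Gi",
            "limits.cpu": "16",
            "limits.memory": "64Gi",
            "persistentvolumeclaims": "10",
        }
    ),
)

client.CoreV1Api().create_namespaced_resource_quota(namespace="team-a", body=quota)
```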

Self-managed Kubernetes databases can also significantly reduce data transfer costs compared to managed services. Here’s how:

a) Co-location of Services : In Kubernetes, you can deploy your databases and application services in the same cluster. This reduces or eliminates data transfer between zones or regions, which is often charged in managed services. Example: If your app and database are in the same Kubernetes cluster, inter-service communication doesn’t incur data transfer fees.

b) Efficient Data Replication : Kubernetes allows for more control over how and when data is replicated. You can optimize replication strategies to reduce unnecessary data movement. Example: You might replicate data during off-peak hours or use differential backups to minimize data transfer.

c) Avoid Provider Lock-in : Managed services often charge for data egress, especially when moving to another provider. With self-managed databases, you have the flexibility to choose the most cost-effective data transfer methods. Example: You could use direct connectivity options or content delivery networks to reduce data transfer costs between regions or clouds.

d) Optimized Backup Strategies : Self-managed solutions allow for more control over backup processes. You can implement incremental backups or use deduplication techniques to reduce the amount of data transferred for backups. Example: Instead of full daily backups (common in managed services), you might do weekly full backups with daily incrementals, significantly reducing data transfer.

e) Multi-Cloud Flexibility : Self-managed Kubernetes databases allow you to strategically place data closer to where it’s consumed. This can reduce long-distance data transfer costs, which are often higher. Example: You could have a primary database in one cloud and read replicas in another, optimizing for both performance and cost.

By leveraging these strategies in a self-managed Kubernetes environment, organizations can significantly optimize their resource usage and reduce data transfer costs, leading to substantial savings compared to typical managed database services.

Breaking down the Numbers: a Cost Comparison between PostgreSQL on RDS vs EKS

Let’s get down to brass tacks. How do the costs really stack up? We’ve crunched the numbers for a small Postgres database, comparing the managed RDS service with self-hosting on Kubernetes. For Kubernetes, we are using EC2 instances with local NVMe disks, managed by EKS, with simplyblock as the storage orchestration layer.

Scenario: 3TB Postgres Database with High Availability (3 nodes) and Single AZ Deployment

Managed Service (AWS RDS) using three db.m4.2xlarge On-Demand instances with gp3 volumes

Available resources:

  • Available vCPU: 8
  • Available Memory: 32 GiB
  • Available Storage: 3TB
  • Available IOPS: 20,000 per volume
  • Storage latency: 1-2 milliseconds

Costs:

  • Monthly Total Cost: $2,511.18
  • 3-Year Total: $2,511.18 x 36 months = $90,402

Editorial: See the pricing calculator for Amazon RDS for PostgreSQL

Self-Managed on Kubernetes (EKS) using three i3en.xlarge On-Demand instances

Available resources:

  • Available vCPU: 12
  • Available Memory: 96 GiB
  • Available Storage: 3.75TB (7.5TB raw storage with an assumed 50% data protection overhead for simplyblock)
  • Available IOPS: 200,000 per volume (10x more than with RDS)
  • Storage latency: below 200 microseconds (local NVMe disks orchestrated by simplyblock)

Costs:

  • Monthly instance cost: $989.88
  • Monthly storage orchestration cost (e.g., simplyblock): $90 (3TB x $30/TB)
  • Monthly EKS cost: $219 ($73 per cluster x 3)
  • Monthly Total Cost: $1,298.88
  • 3-Year Total: $1,298.88 x 36 months = $46,759

Base Savings: $90,402 – $46,759 = $43,643 (48% over 3 years)
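For transparency, the totals above are plain arithmetic; the short sketch below reproduces them from the scenario’s figures (rounding may differ by a dollar from the truncated numbers in the text).

```python
# Reproducing the scenario's totals; figures are taken from the tables above.
rds_monthly = 2511.18
eks_monthly = 989.88 + 90.00 + 219.00  # instances + storage orchestration + EKS

rds_3y = rds_monthly * 36
eks_3y = eks_monthly * 36
savings = rds_3y - eks_3y

print(f"RDS 3-year total: ${rds_3y:,.0f}")   # ~ $90,402
print(f"EKS 3-year total: ${eks_3y:,.0f}")   # ~ $46,760
print(f"Savings:          ${savings:,.0f} ({savings / rds_3y:.0%} over 3 years)")
```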

That’s a whopping 48% saving over three years! But wait, there’s more to consider. We have made some simplistic assumptions to estimate the additional benefits of self-hosting and to showcase the real potential for savings. While the actual efficiencies may vary from company to company, it should at least give a good understanding of where the hidden benefits might lie.

Additional Benefits of Self-Hosting (Estimated Annual Savings)

  1. Resource optimization/sharing : Assumption: 20% better resource utilization (assuming existing Kubernetes clusters). Estimated annual saving: 20% x $989.88 x 12 = $2,375
  2. Reduced Data Transfer Costs : Assumption: 50% reduction in data transfer fees. Estimated annual saving: $2,000
  3. Flexible Scaling : Avoid over-provisioning during non-peak times. Estimated annual saving: $3,000
  4. Multi-Cloud Strategy : Ability to negotiate better rates across providers. Estimated annual saving: $5,000
  5. Open-Source Tools : Reduced licensing costs for management tools. Estimated annual saving: $4,000

Disaster Recovery Insights

  • RTO (Recovery Time Objective) Improvement : Self-managed: potential for 40% faster recovery. Estimated value: $10,000 per hour of downtime prevented
  • RPO (Recovery Point Objective) Enhancement : Self-managed: achieve near-zero data loss. Estimated annual value: $20,000 in potential data loss prevention

Total Estimated Annual Benefit of Self-Hosting

Self-hosting pays off. Here is the summary of benefits:

  • Base Savings: $8,400/year
  • Additional Benefits: $15,920/year
  • Disaster Recovery Improvement: $30,000/year (conservative estimate)

Total Estimated Annual Additional Benefit: $54,695

Total Estimated Additional Benefits over 3 Years: $164,085

Note: These figures are estimates and can vary based on specific use cases, implementation efficiency, and negotiated rates with cloud providers.

Beyond the Dollar Signs: the Real Value Proposition

Money talks, but it’s not the only factor in play. Let’s look at the broader picture.

Performance and Scalability

With self-managed Kubernetes databases, you’re in the driver’s seat. Need to scale up for a traffic spike? Done. Want to optimize for a specific workload? You’ve got the power.

Security and Compliance

Think managed services have the upper hand in security? Think again. With self-managed solutions, you have granular control over your security measures. Plus, you’re not sharing infrastructure with unknown entities.

Innovation and Agility

In the fast-paced tech world, agility is king. Self-managed solutions on Kubernetes allow you to adopt cutting-edge technologies and practices without waiting for your provider to catch up.

Is the Database on Kubernetes for Everyone?

Definitely not. While self-managed databases on Kubernetes offer significant benefits in terms of cost savings, flexibility, and control, they’re not a one-size-fits-all solution. Here’s why:

  • Expertise: Managing databases on Kubernetes demands a high level of expertise in both database administration and Kubernetes orchestration. Not all organizations have this skill set readily available. Self-management means taking on responsibilities like security patching, performance tuning, and disaster recovery planning. For smaller teams or those with limited DevOps resources, this can be overwhelming.
  • Scale of operations : For simple applications with predictable, low-to-moderate database requirements, the advanced features and flexibility of Kubernetes might be overkill. Managed services could be more cost-effective in these scenarios. The same applies to very small operations or early-stage startups – the cost benefits of self-managed databases on Kubernetes might not outweigh the added complexity and resource requirements.

While database management on Kubernetes offers compelling advantages, organizations must carefully assess their specific needs, resources, and constraints before making the switch. For many, especially larger enterprises or those with complex, dynamic database requirements, the benefits can be substantial. However, others might find that managed services better suit their current needs and capabilities.

Bonus: Simplyblock

There is one more bonus benefit that you get when running your databases in Kubernetes – you can add simplyblock as your storage orchestration layer behind a single CSI driver that will automatically and intelligently serve the storage service of your choice. Do you need a fast NVMe cache for some hot transactional data with random IO but don’t want to keep it hot forever? We’ve got you covered!

Simplyblock is an innovative cloud-native storage product, which runs on AWS, as well as other major cloud platforms. Simplyblock virtualizes, optimizes, and orchestrates existing cloud storage services (such as Amazon EBS or Amazon S3) behind an NVMe storage interface and a Kubernetes CSI driver. As such, it provides storage for compute instances (VMs) and containers. We have optimized for IO-heavy database workloads, including OLTP relational databases, graph databases, non-relational document databases, analytical databases, fast key-value stores, vector databases, and similar solutions.

Simplyblock database storage optimization

This optimization has been built from the ground up to orchestrate a wide range of database storage needs, such as reliable and fast (high write-IOPS) storage for write-ahead logs and support for ultra-low latency, as well as high IOPS for random read operations. Simplyblock is highly configurable to optimally serve the different database query engines.

Some of the key benefits of using simplyblock alongside your stateful Kubernetes workloads are:

  • Cost Reduction, Margin Increase: Thin provisioning, compression, deduplication of hot-standby nodes, and storage virtualization with multiple tenants increases storage usage while enabling gradual storage increase.
  • Easy Scalability of Storage: Single node databases require highly scalable storage (IOPS, throughput, capacity) since data cannot be distributed to scale. Simplyblock pools either Amazon EBS volumes or local instance storage from EC2 virtual machines and provides a scalable and cost effective storage solution for single node databases.
  • Enables Database Branching Features: Using instant snapshots and clones, databases can be quickly branched out and provided to customers. Due to copy-on-write, the storage usage doesn’t increase unless the data is changed on either the primary or branch. Customers could be charged for “additional storage” though.
  • Enhances Security: Using an S3-based streaming of a recovery journal, the database can be quickly recovered from full AZ and even region outages. It also provides protection against typical ransomware attacks where data gets encrypted by enabling Point-in-Time-Recovery down to a few hundred milliseconds granularity.

Conclusion: the True Cost Revealed

When it comes to database management, the true cost goes far beyond the monthly bill. By choosing a self-managed Kubernetes solution, you’re not just saving money – you’re investing in flexibility, performance, and future-readiness. The savings and benefits will always be use-case and company-specific, but the general conclusion remains unchanged. While operating databases in Kubernetes is not for everyone, for those who have the privilege of such a choice, it should be a no-brainer kind of decision.

Is managing databases on Kubernetes complex?

While there is a learning curve, modern tools and platforms like simplyblock significantly simplify the process, often making it more straightforward than dealing with the limitations of managed services. Moreover, the knowledge acquired in the process can be reused across deployments in different clouds.

How can I ensure high availability with self-managed databases?

Kubernetes offers robust features for high availability, including automatic failover and load balancing. With proper configuration, you can achieve even higher availability than many managed services offer, meeting any possible SLA out there. You are in full control of the SLAs.

How difficult is it to migrate from a managed database service to Kubernetes?

While migration requires careful planning, tools and services exist to streamline the process. Many companies find that the long-term benefits far outweigh the short-term effort of migration.

How does simplyblock handle database backups and point-in-time recovery in Kubernetes?

Simplyblock provides automated, space-efficient backup solutions that integrate seamlessly with Kubernetes. Our point-in-time recovery feature allows you to restore your database to any specific moment, offering protection against data loss and ransomware attacks.

Does simplyblock offer support for multiple database types?

Yes, simplyblock supports a wide range of database types including relational databases like PostgreSQL and MySQL, as well as NoSQL databases like MongoDB and Cassandra. Check out our “Supported Technologies” page for a full list of supported databases and their specific features.

The post RDS vs. EKS: The True Cost of Database Management appeared first on simplyblock.

]]>
Ransomware Attack Recovery with Simplyblock https://www.simplyblock.io/blog/ransomware-attack-recovery-with-simplyblock/ Tue, 10 Sep 2024 23:26:57 +0000 https://www.simplyblock.io/?p=1645 In 2023, the number of victims of Ransomware attacks more than doubled, with 2024 off to an even stronger start. A Ransomware attack encrypts your local data. Additionally, the attackers demand a ransom be paid. Therefore, data is copied to remote locations to increase pressure on companies to pay the ransom. This increases the risk […]

The post Ransomware Attack Recovery with Simplyblock appeared first on simplyblock.

]]>
In 2023, the number of victims of Ransomware attacks more than doubled, with 2024 off to an even stronger start. A Ransomware attack encrypts your local data, and the attackers demand a ransom to be paid. Increasingly, data is also copied to remote locations to raise the pressure on companies to pay the ransom. This adds the risk of the data being leaked to the internet even if the ransom is paid. Strong Ransomware protection and mitigation are now more important than ever.

Simplyblock provides sophisticated block storage-level Ransomware protection and mitigation. Together with recovery options, simplyblock enables Point-in-Time Recovery (PITR) for any service or solution storing data.

What is Ransomware?

Ransomware is a type of malicious software (also known as malware) designed to block access to a computer system and/or encrypt data until a ransom is paid to the attacker. Cybercriminals typically carry out this type of attack by demanding payment, often in cryptocurrency, in exchange for providing a decryption key to restore access to the data or system.

Statistics show a significant rise in ransomware cyber attacks: ransomware cases more than doubled in 2023, and the amount of ransom paid reached more than a billion dollars—and these are only the official numbers. Many organizations prefer not to report breaches and payments, as ransom payments are illegal in many jurisdictions.

Number of quarterly Ransomware victims between Q1 2021 and Q1 2024

The Danger of Ransomware Increases

The number and sophistication of attack tools have also increased significantly. They are becoming increasingly commoditized and easy to use, drastically reducing the skills cyber criminals require to deploy them.

There are many best practices and tools to protect against successful attacks. However, little can be done once an account, particularly a privileged one, has been compromised. Even if the breach is detected, it is most often too late. Attackers may only need minutes to encrypt important data.

Storage, particularly backups, serves as a last line of defense. After a successful attack, they provide a means to recover. However, there are certain downsides to using backups to recover from a successful attack:

  • The latest backup does not contain all of the data: Data written between the last backup and the time of the attack is unrecoverably lost. Even the loss of one hour of data written to a database can be critical for many enterprises.
  • Backups are not consistent with each other: The backup of one database may not fit the backup of another database or a file repository, so the systems will not be able to integrate correctly after restoration.
  • The latest backups may already contain encrypted data. It may be necessary to go back in time to find an older backup that is still “clean.” This backup, if available at all, may be linked to substantial data loss.
  • Backups must be protected from writes and delete operations; otherwise, they can be destroyed or damaged by attackers. Attackers may also damage the backup inventory management system, making it hard or impossible to locate specific backups.
  • Human error in Backup Management may lead to missing backups.

Simplyblock for Ransomware Protection and Mitigation

Simplyblock provides a smart solution to recover data after a ransomware attack, complementing classical backups.

In addition to writing data to hot-tier storage, simplyblock creates an asynchronously replicated write-ahead log (WAL) of all data written. This log is optimized for high throughput to secondary (low IOPS) storage, such as Amazon S3 or HDD pools like AWS’ EBS st1 volumes. If this secondary storage supports write and deletion protection for pre-defined retention periods, as with S3, it is possible to “rewind” the storage to the point immediately before the attack. This performs a data recovery with near-zero RPO (Recovery Point Objective).

A recovery mechanism like this is particularly useful in combination with databases. Before the encryption can start, the attackers typically have to stop the database system, because all data and WAL files are in use by the running database. This, in turn, makes it possible to automatically identify a consistent recovery point with no data loss.

Timeline of a Ransomware attack

In the future, simplyblock plans to enhance this functionality further. A multi-stage attack detection mechanism will be integrated into the storage. Additionally, deletion protection will only be lifted after a historical time window has been cleared of attacks, and attack launch points will be identified automatically and precisely to locate recovery points.

Furthermore, simplyblock will support partial restores of recovery points to enable different services’ data on the same logical volumes to be restored from individual points in time. This is important since the encryption of one service might have started earlier or later than that of others, so the point in time to rewind to may differ per service.

Conclusion

Simplyblock provides a complementary recovery solution to classical backups. Backups support long-term storage of full recovery snapshots. In contrast, write-ahead log-based recovery is specifically designed for near-zero RPO recovery right after a Ransomware attack starts and enables quick and easy recovery for data protection.

While many databases and data-storing services, such as PostgreSQL, may provide the possibility of Point-in-Time Recovery, the WAL segments need to be stored outside the system as soon as they are closed. That means the RPO comes down to the size of a WAL segment, whereas with simplyblock, due to its copy-on-write nature, the RPO can be as small as one committed write.
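For context, shipping closed WAL segments off the system is a matter of PostgreSQL configuration. The hedged sketch below enables WAL archiving via ALTER SYSTEM; the S3 bucket and the use of the AWS CLI in archive_command are assumptions, and dedicated tools such as pgBackRest or WAL-G are the more robust choice in practice.

```python
# Sketch: enable WAL archiving so closed segments leave the system immediately.
import psycopg2

conn = psycopg2.connect(host="postgres-primary", dbname="postgres", user="postgres")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
with conn.cursor() as cur:
    cur.execute("ALTER SYSTEM SET archive_mode = 'on'")
    cur.execute(
        "ALTER SYSTEM SET archive_command = "
        "'aws s3 cp %p s3://my-wal-archive/%f'"   # hypothetical bucket
    )
    cur.execute("SELECT pg_reload_conf()")  # archive_mode still requires a restart
conn.close()
```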

Learn more about simplyblock and its other features like thin-provisioning, immediate clones and branches, encryption, compression, deduplication, and more. Or just get started right away and find the best Ransomware attack protection and mitigation to date.

The post Ransomware Attack Recovery with Simplyblock appeared first on simplyblock.

]]>
Rockset alternatives: migrate with simplyblock https://www.simplyblock.io/blog/rockset-alternatives-migrate-with-simplyblock/ Wed, 24 Jul 2024 01:52:57 +0000 https://www.simplyblock.io/?p=1768 The Rockset Transition: what you need to know On June 21, 2024, Rockset announced its acquisition by OpenAI , setting off a countdown for many organizations using their database. If you’re a Rockset user, you’re now facing a critical deadline: September 30, 2024, at 17:00 PDT . By this date, all existing Rockset customers must […]

The post Rockset alternatives: migrate with simplyblock appeared first on simplyblock.

]]>
The Rockset Transition: what you need to know

On June 21, 2024, Rockset announced its acquisition by OpenAI, setting off a countdown for many organizations using their database. If you’re a Rockset user, you’re now facing a critical deadline: September 30, 2024, at 17:00 PDT. By this date, all existing Rockset customers must transition off the platform. This sudden shift has left many companies scrambling to find suitable alternatives that can match Rockset’s performance in real-time analytics, business intelligence, and machine learning applications. At simplyblock, we understand the urgency and complexity of this situation, and we’re here to guide you through this transition.

Key Transition Points:

Top Rockset Alternatives: Finding your Perfect Match

As you navigate this transition, it’s crucial to find a solution that not only matches Rockset’s capabilities but also aligns with your specific use cases. Here are some top alternatives, along with how simplyblock can enhance their performance:

rockset alternatives for migration

1. ClickHouse : the OLAP Powerhouse

ClickHouse is an open-source, column-oriented DBMS that excels in online analytical processing (OLAP) and real-time analytics.

simplyblock benefit : Our NVMe-based block storage significantly boosts ClickHouse’s already impressive query performance on large datasets, making it even more suitable for high-throughput analytics workloads.

2. StarTree : Real-Time Analytics at Scale

Built on Apache Pinot, StarTree is designed for real-time analytics at scale, making it a strong contender for Rockset users.

simplyblock benefit: StarTree’s distributed architecture pairs perfectly with our block storage, allowing for faster data ingestion and query processing across nodes.

3. Qdrant : Vector Similarity Search Engine

Qdrant is a vector similarity search engine designed for production environments, ideal for machine learning applications.

simplyblock benefit: Our storage solution dramatically reduces I/O wait times for Qdrant, enabling even faster vector searches on large datasets.

4. CrateDB : Distributed SQL with a Twist

CrateDB is an open-source distributed SQL database with built-in support for geospatial and full-text search.

simplyblock benefit: simplyblock enhances CrateDB’s distributed nature, allowing for faster data distribution and replication across nodes.

5. Weaviate : the Versatile Vector Database

Weaviate is an open-source vector database that supports various machine learning models and offers GraphQL-based queries.

simplyblock benefit: Our block storage solution enhances Weaviate’s performance, especially for write-intensive operations and real-time updates to vector indexes.

Additional Alternatives Worth considering

While the above options are our top picks, several other databases deserve mention:

  • MongoDB Atlas : Flexible document database with analytics capabilities
  • Redis Enterprise Cloud : In-memory data structure store, ideal for caching and real-time data processing
  • CockroachDB : Distributed SQL database offering strong consistency
  • Neo4j Graph Database : Specialized for handling complex, interconnected data
  • Databricks Data Intelligence Platform : Comprehensive solution for big data analytics and ML
  • YugabyteDB : Combines SQL and NoSQL capabilities in a distributed database

How Simplyblock Enhances your new Data Setup

Regardless of which alternative you choose, simplyblock’s NVMe-based block storage engine can significantly enhance your new data infrastructure:

  1. Reduced I/O Wait Times : Cut down on data access latency, crucial for real-time analytics.
  2. Optimized for High Concurrency : Handle numerous concurrent queries efficiently, perfect for busy BI dashboards.
  3. AWS Compatibility : Seamlessly integrate with your existing AWS infrastructure.
  4. Scalability : Maintain high performance as your data volumes grow.
  5. Cost-Efficiency : Improve storage performance to potentially reduce overall resource needs and lower cloud costs.

Use Cases we Support

Our solution is versatile and can support a wide range of data-intensive applications, including:

  • Real-time dashboards and analytics
  • Business intelligence (BI)
  • Data warehouse speed layer
  • Logging and metrics analysis
  • Machine learning (ML) and data science applications
  • Vector similarity search

Conclusion: Turning Challenge into Opportunity

While the Rockset transition poses significant challenges, it also presents an opportunity to optimize your data infrastructure. By pairing your chosen alternative with simplyblock’s high-performance block storage, you can create a robust, efficient, and future-proof analytics solution.

As you evaluate these options, our team is ready to provide insights on leveraging simplyblock with your new data platform. We’re committed to helping you not just migrate, but upgrade your data capabilities in the process.

Remember, the September 30th deadline is approaching rapidly. Start your migration journey today, and let simplyblock help you build a data setup that outperforms your expectations.

How can Simplyblock be used on AWS?

simplyblock offers high-performance cloud block storage that not only enhances the performance of your databases and applications but also brings cost efficiency. Most importantly, the simplyblock storage cluster is based on EC2 instances with local NVMe disks, which qualify for AWS Savings Plans and Reservation Discounts. This means you can leverage simplyblock’s technology while also fulfilling your compute commitment to AWS. Such a solution effectively extends AWS compute savings plans to storage. It’s a win-win situation for AWS users seeking performance, scalability, and cost-effectiveness for fast NVMe-based storage.

Simplyblock uses NVMe over TCP for minimal access latency, high IOPS/GB, and efficient CPU core utilization, surpassing local NVMe disks and Amazon EBS in cost/performance ratio at scale. Moreover, simplyblock can be used alongside AWS EDP or AWS Savings Plans.

Ideal for high-performance Kubernetes environments, simplyblock combines the benefits of local-like latency with the scalability and flexibility necessary for dynamic AWS EKS deployments, ensuring optimal performance for I/O-sensitive workloads like databases. Using erasure coding (a better RAID) instead of replicas helps to minimize storage overhead without sacrificing data safety and fault tolerance.

With additional features such as instant snapshots (full and incremental), copy-on-write clones, thin provisioning, compression, encryption, and many more, simplyblock meets your requirements before you even set them. Get started using simplyblock right now or learn more about our feature set. Simplyblock is available on the AWS Marketplace.

The post Rockset alternatives: migrate with simplyblock appeared first on simplyblock.

]]>
Neo4j in Cloud and Kubernetes: Advantages, Cypher Queries, and Use Cases https://www.simplyblock.io/blog/exploring-neo4j-advantages-query-simplification-and-practical-use-cases-in-cloud-and-kubernetes-e/ Mon, 15 Jul 2024 02:09:36 +0000 https://www.simplyblock.io/?p=1782 Introduction: In the era of big data, managing complex relationships is crucial. Neo4j , a leading graph database, excels at handling intricate data connections, making it indispensable for modern applications. This post explores Neo4j’s advantages, Cypher query language, and real-world applications in cloud and Kubernetes environments. Why Choose Neo4j over Traditional Relational Databases? Optimized for […]

The post Neo4j in Cloud and Kubernetes: Advantages, Cypher Queries, and Use Cases appeared first on simplyblock.

]]>
Introduction:

In the era of big data, managing complex relationships is crucial. Neo4j, a leading graph database, excels at handling intricate data connections, making it indispensable for modern applications. This post explores Neo4j’s advantages, Cypher query language, and real-world applications in cloud and Kubernetes environments.

Why Choose Neo4j over Traditional Relational Databases?

  • Optimized for complex relationships
  • Greater schema flexibility
  • Intuitive data modeling
  • Efficient query performance
  • Natural data exploration

What are the Main Differences between Graph Databases and Relational Databases?

Data model difference between graph and relational databases

Graph databases and relational databases differ primarily in their structure and data retrieval methods. Graph databases excel at managing and querying relationships between data points, using nodes to represent entities and edges to illustrate connections. This structure is particularly useful for applications like social networks, fraud detection, and recommendation engines where relationships are key. In contrast, relational databases organize data into tables with rows and columns, focusing on structured data and using SQL (Structured Query Language) for CRUD operations (Create, Read, Update, Delete). Relational databases are ideal for applications requiring complex queries and transactions, such as financial systems and enterprise resource planning (ERP) solutions. Understanding these differences helps in selecting the appropriate database type based on specific application needs and data complexities.

How does Neo4j Handle Data Relationships Compared to SQL Databases?

Neo4j handles data relationships by using a graph-based model that directly connects data points (nodes) through relationships (edges). This allows for highly efficient querying and traversal of complex relationships without the need for complex JOIN-like operations (merges). Each relationship in Neo4j is stored as a first-class entity, making it easy to navigate and query intricate connections with minimal latency. In contrast, SQL databases manage relationships using foreign keys and JOIN operations across tables. While SQL databases are efficient for structured data and predefined queries, handling deeply nested or highly interconnected data often requires complex JOIN statements, which can be resource-intensive and slower. Neo4j’s graph model is specifically optimized for queries involving relationships, providing significant performance advantages in scenarios where understanding and traversing connections between data points is crucial.
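To make the contrast tangible, here is a minimal sketch using the official Neo4j Python driver. The connection details, labels, and relationship types are assumptions, and the commented SQL is only a rough equivalent for one fixed traversal depth.

```python
# Sketch: traversing up to three levels of FRIEND relationships is one Cypher
# pattern, where SQL would need repeated self-joins.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

cypher = """
MATCH (p:Person {name: $name})-[:FRIEND*1..3]-(friend:Person)
RETURN DISTINCT friend.name AS name
"""

with driver.session() as session:
    result = session.run(cypher, name="Alice")
    print([record["name"] for record in result])

driver.close()

# Rough SQL equivalent (one self-join per hop, shown for a single fixed depth):
#   SELECT DISTINCT p3.name
#   FROM people p1
#   JOIN friendships f1 ON f1.person_a = p1.id
#   JOIN people p2      ON p2.id = f1.person_b
#   JOIN friendships f2 ON f2.person_a = p2.id
#   JOIN people p3      ON p3.id = f2.person_b
#   WHERE p1.name = 'Alice';
```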

When should Developers Choose Neo4j over a Traditional Database?

Developers should choose Neo4j over a traditional database when their application involves complex and dynamic relationships between data points. Neo4j’s graph-based model excels in scenarios such as social networking, recommendation systems, fraud detection, network and IT operations, and knowledge graphs, where understanding and querying intricate connections is critical. If the use case demands real-time querying and analysis of data relationships, such as finding the shortest path between nodes or traversing multi-level hierarchies efficiently, Neo4j provides superior performance and scalability compared to traditional relational databases. Additionally, Neo4j is advantageous when the data structure is flexible and evolves over time, as its schema-free nature allows for easy adaptation to changing requirements without significant reworking of the database schema. Choosing Neo4j can greatly enhance performance and simplify development in applications heavily reliant on interconnected data.

Simplifying Complex Data with Cypher Query Language

Cypher, Neo4j’s query language, streamlines data relationship management through:

  1. Intuitive syntax
  2. Declarative nature
  3. Powerful pattern matching
  4. Efficient recursion handling
  5. Built-in graph functions
  6. Advanced aggregation and filtering
  7. Seamless integration with graph algorithms

How does Cypher Differ from SQL in Querying Complex Relationships?

Cypher, the query language for Neo4j, differs from SQL in its intuitive approach to querying complex relationships. Cypher uses pattern matching to navigate through nodes and relationships, making it naturally suited for graph traversal and relationship-focused queries. For example, finding connections between nodes in Cypher involves specifying patterns that resemble the graph structure, making the queries concise and easier to understand.

In contrast, SQL relies on JOIN operations to link tables based on foreign keys, which can become cumbersome and less efficient for deeply nested or highly interconnected data. Complex relationships in SQL require multiple JOINs and subqueries, often leading to more verbose and harder-to-maintain queries.

Cypher’s declarative syntax allows developers to describe what they want to retrieve without specifying how to retrieve it, optimizing the underlying traversal and execution. This makes Cypher particularly powerful for applications needing to query and analyze intricate data relationships, such as social networks, recommendation engines, and network analysis, providing a clear advantage over SQL in these scenarios.

What are the Key Features of Cypher that Make it Ideal for Graph Databases?

Cypher is ideal for graph databases due to several key features:

  1. Pattern Matching : Allows intuitive querying by describing graph structures.
  2. Declarative Syntax : Simplifies complex queries, letting the engine optimize execution.
  3. Traversal Efficiency : Excels at navigating and exploring interconnected data.
  4. Flexible Relationships : Easily handles various types and attributes of relationships.
  5. Readability : Shorter, more readable queries compared to SQL’s JOIN operations.
  6. Aggregation and Transformation : Supports advanced data analysis functions.
  7. Schema-Free : Works well with dynamic, evolving data models.
  8. Graph Algorithms : Integrates with Neo4j’s built-in algorithms for advanced analytics.

These features make Cypher a powerful language for managing and querying complex relationships in graph databases.

Figure: Key features of Cypher

Neo4j Use Cases in Cloud and Kubernetes Environments

  1. Microservices Management : Neo4j helps manage and visualize microservice architectures by tracking service dependencies and interactions, storing the relationships between every service to enhance troubleshooting and system optimization.
  2. Fraud Detection : It identifies patterns and anomalies in transactional data, enabling real-time detection and prevention of fraudulent activities through relationship analysis.
  3. Identity and Access Management : Neo4j efficiently maps user permissions and roles, ensuring secure and scalable identity management and access control.
  4. IT Operations and Network Management : It monitors and optimizes IT infrastructure by mapping and analyzing network topologies, dependencies, and configurations.
  5. Recommendation Engines : Leveraging graph algorithms, Neo4j provides personalized recommendations by analyzing user preferences and relationships between items.
  6. Supply Chain Optimization : Neo4j optimizes supply chain processes by mapping product flows, identifying bottlenecks, and enhancing logistics management through relationship analysis.
  7. Healthcare Data Management : It manages complex healthcare data by integrating patient records, treatments, and outcomes, improving patient care and operational efficiency.
  8. Social Network Analysis : Neo4j uncovers insights into social networks by analyzing connections and interactions, supporting marketing, and user engagement strategies.
  9. Knowledge Graph Construction : It constructs and manages knowledge graphs, linking diverse data sources to provide a unified view and advanced search capabilities.
  10. Compliance and Regulatory Reporting : Neo4j ensures compliance by tracking data lineage, managing regulatory requirements, and generating comprehensive reports for audits and governance.

How does Neo4j Enhance Microservices Management in Kubernetes?

Neo4j enhances microservices management in Kubernetes by providing a clear visualization of service dependencies and interactions. It helps track the relationships between microservices, enabling efficient monitoring and troubleshooting. By mapping the complex network of services, Neo4j allows for a better understanding and management of service communications and dependencies, making it easier to identify issues, optimize performance, and ensure seamless integration within a dynamic Kubernetes environment.
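
A minimal sketch of that idea, assuming an illustrative Service label, a DEPENDS_ON relationship type, and a local Neo4j instance: dependency edges are registered as they are discovered, and a variable-length traversal answers “which services are impacted if this one degrades?”

```python
# Illustrative sketch: modeling Kubernetes service dependencies in Neo4j.
# Labels, relationship types, connection details, and service names are assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Register a dependency edge, e.g. emitted by a deployment pipeline or a service mesh export.
    session.run(
        "MERGE (a:Service {name: $caller}) "
        "MERGE (b:Service {name: $callee}) "
        "MERGE (a)-[:DEPENDS_ON]->(b)",
        caller="checkout", callee="payments",
    )

    # Walk DEPENDS_ON edges of any depth to find every upstream service
    # that would be impacted by an outage of 'payments'.
    impacted = session.run(
        "MATCH (s:Service {name: $name})<-[:DEPENDS_ON*1..]-(caller:Service) "
        "RETURN DISTINCT caller.name AS impacted_service",
        name="payments",
    )
    print([record["impacted_service"] for record in impacted])

driver.close()
```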

What Advantages does Neo4j Offer for Fraud Detection in Cloud Environments?

In cloud environments, Neo4j offers several advantages for fraud detection (a minimal query sketch follows the list):

  1. Real-Time Analysis : Neo4j’s graph model allows for rapid querying and analysis of transactional data, enabling real-time detection of fraudulent activities by identifying unusual patterns and connections.
  2. Pattern Recognition : Its ability to model and analyze complex relationships helps in recognizing sophisticated fraud patterns that might be missed by traditional methods.
  3. Anomaly Detection : By examining relationships and behaviors across multiple dimensions, Neo4j can quickly spot anomalies and irregularities in transaction data.
  4. Scalability : Neo4j scales efficiently in cloud environments, handling large volumes of data and complex queries required for comprehensive fraud detection.
  5. Flexibility : The schema-free nature of Neo4j allows for easy adaptation to evolving fraud strategies and data models, ensuring ongoing effectiveness in detecting new types of fraud.
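
Building on the pattern-recognition point above, here is a hedged sketch of a relationship-based fraud check: accounts that share a device with an already-flagged account are surfaced for review. The schema (Account, Device, USED_DEVICE, the flagged property) and connection details are assumptions for illustration only.

```python
# Hedged sketch of a relationship-based fraud check in Neo4j.
# Schema and connection details are illustrative assumptions.
from neo4j import GraphDatabase

FRAUD_RING_QUERY = """
MATCH (bad:Account {flagged: true})-[:USED_DEVICE]->(d:Device)<-[:USED_DEVICE]-(other:Account)
WHERE other.flagged = false
RETURN other.id AS suspicious_account, d.id AS shared_device, count(*) AS shared_events
ORDER BY shared_events DESC
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(FRAUD_RING_QUERY):
        # Accounts sharing a device with a flagged account are candidates for review.
        print(record["suspicious_account"], record["shared_device"], record["shared_events"])
driver.close()
```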

How can Neo4j Improve Supply Chain Management and Optimization?

Neo4j improves supply chain management by:

  1. Providing End-to-End Visibility : Maps relationships across the supply chain to identify bottlenecks and inefficiencies.
  2. Optimizing Demand and Inventory : Analyzes patterns to balance stock levels and prevent overstock or stockouts.
  3. Managing Risks : Identifies vulnerabilities and potential risks within the supply chain.
  4. Enhancing Logistics : Optimizes routes and distribution strategies for efficiency.
  5. Facilitating Collaboration : Improves coordination and decision-making among supply chain partners.

How Simplyblock Enhances Neo4j Performance in Kubernetes

Simplyblock optimizes Neo4j in Kubernetes environments through:

  1. High-performance block storage
  2. Scalable storage with zero-downtime scaling
  3. High availability and durability
  4. Cost-effective solutions
  5. Seamless Kubernetes integration
  6. Enhanced data mobility
  7. Advanced data management features

What Specific Performance Improvements can Neo4j Expect with Simplyblock?

Neo4j can benefit from simplyblock’s high-performance block storage, which enhances data access and processing speeds. This leads to improved query performance and faster response times. Additionally, simplyblock’s scalable storage options ensure that performance remains consistent even as data volumes grow.

How does Simplyblock Ensure Data Integrity for Neo4j in Kubernetes?

Simplyblock ensures data integrity for Neo4j by providing high availability and durability through features like automatic erasure coding, sync and async cluster replication, as well as backups. These capabilities safeguard data against loss and ensure that it remains accessible and intact, which is crucial for maintaining data integrity in Kubernetes environments.

Furthermore, simplyblock provides immediate snapshots and copy-on-write clones, enabling instant database forks (or clones) for development and staging environments straight from production.

Can Simplyblock help Reduce Storage Costs for Neo4j Deployments?

Yes, simplyblock helps reduce storage costs for Neo4j deployments by offering cost-effective storage solutions. Its cost optimization strategies allow organizations to manage their storage expenses efficiently, making it a practical choice for controlling storage costs in cloud environments.

Neo4j, coupled with simplyblock’s advanced storage solutions, offers unparalleled performance, scalability, and reliability for graph databases in cloud and Kubernetes environments. By leveraging these technologies, organizations can unlock the full potential of their complex data relationships and drive innovation across various industries.

The post Neo4j in Cloud and Kubernetes: Advantages, Cypher Queries, and Use Cases appeared first on simplyblock.

How to reduce AWS cloud costs with AWS marketplace products? https://www.simplyblock.io/blog/how-to-reduce-aws-cloud-costs-with-aws-marketplace-products/ Fri, 28 Jun 2024 02:19:03 +0000 https://www.simplyblock.io/?p=1793 The AWS Marketplace is a comprehensive catalog consisting of thousands of offerings that help organizations find, purchase, deploy and manage third-party software and services to optimize their cloud operations. It’s also a great place to find numerous tools specifically designed to help you optimize your AWS cloud costs. These tools can help you monitor your […]

The post How to reduce AWS cloud costs with AWS marketplace products? appeared first on simplyblock.

The AWS Marketplace is a comprehensive catalog consisting of thousands of offerings that help organizations find, purchase, deploy and manage third-party software and services to optimize their cloud operations. It’s also a great place to find numerous tools specifically designed to help you optimize your AWS cloud costs. These tools can help you monitor your cloud usage, right-size resources, leverage cost-effective pricing models, and implement automated management practices to reduce waste and improve efficiency.

In this blog post you will learn more about the key drivers behind AWS Cloud costs, what cloud cost optimization is, why you need to think about it, and what tools are at your disposal, particularly in the AWS Marketplace.

What are the Fundamental Drivers of Cost with AWS Cloud?

Industry studies show that almost 70% of organizations experience higher-than-anticipated cloud costs. Understanding the key factors that drive costs in AWS Cloud is essential for effective cost management. Below is a breakdown of the key cloud cost drivers: compute resources and storage, which together make up roughly 60-70% of total spend, followed by data transfer, networking, database services, the support plan you opt for, additional licensing and marketplace product costs, and serverless charges such as API calls.

Based on the Vantage Cloud Cost Report for Q1 2024, we can see that the most used services in public clouds are by far compute instances (EC2 on AWS, Compute Engine on Google Cloud, and Virtual Machines on Microsoft Azure), followed by storage and databases. Optimizing the costs of compute, storage, and databases will therefore have the highest impact on reducing the cloud bill.

Figure: Top 10 services by spend on AWS, Google Cloud, and Azure, Q1 2024

Looking more granularly on AWS, here are key services to look into when optimizing cloud costs:

Compute Resources

  • EC2 Instances : The cost depends on the type, size, and number of EC2 instances you run. Different instance types have varying performance and pricing.
  • Lambda Functions : Pricing is based on the number of requests and the duration of execution.

Cloud Storage

  • S3 Buckets : Costs vary depending on the amount of data stored, the frequency of access (standard, infrequent access, or Glacier), and the number of requests made.
  • EBS Volumes : Pricing is based on the type and size of the volume, provisioned IOPS and snapshots. Cloud block storage prices can be very high if used for highly transaction workloads such as relational, NoSQL or vector databases.
  • EFS and FSx : Pricing is based on the service type, IOPS and other requested services. Prices of file systems in the cloud can become very expensive with extensive usage.

Data Transfer

  • Data Ingress and Egress : Inbound data transfer is generally free, but outbound data transfer (data leaving AWS) incurs charges. Costs can add up, especially with high-volume transfers across regions or to the internet.

Networking

  • VPC : Costs associated with using features like VPN connections, VPC peering, and data transfer between VPCs.
  • Load Balancers : Costs for using ELB (Elastic Load Balancers) vary based on the type (Application, Network, or Classic) and usage.

Database Services

  • RDS : Charges depend on the database engine, instance type, storage, and backup storage.
  • DynamoDB : Pricing is based on read and write throughput, data storage, and optional features like backups and data transfer.

Understanding these drivers helps you identify areas where you can cut costs without sacrificing performance, allowing for better budgeting, more efficiency in operations and better scalability as demand increases.
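
As a starting point for that analysis, the sketch below uses the Cost Explorer API via boto3 to break down last month’s spend by service, which usually confirms that compute, storage, and databases dominate. It assumes boto3 is installed, AWS credentials and a default region are configured, and Cost Explorer is enabled for the account.

```python
# Sketch: break down last month's AWS spend by service with the Cost Explorer API.
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")

end = date.today().replace(day=1)                 # first day of the current month
start = (end - timedelta(days=1)).replace(day=1)  # first day of the previous month

response = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

groups = response["ResultsByTime"][0]["Groups"]
top = sorted(groups, key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]), reverse=True)
for group in top[:10]:
    service = group["Keys"][0]
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{service}: ${amount:,.2f}")
```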

What is Cloud Cost Optimization?

Cloud cost optimization involves using various strategies, techniques, best practices, and tools to lower cloud expenses. It aims to find the most economical way to operate your applications in the cloud, ensuring you get the highest business value from your investment. It may involve tactics like monitoring your cloud usage, identifying waste, and making adjustments to use resources more effectively without compromising performance or reliability and using marketplace solutions instead of some cloud-provider-native offerings.

Why do you need Cloud Cost Optimization?

Organizations waste approximately 32% of their cloud spending, which is a significant amount whether you’re a small business or a large one spending millions on cloud services. Cloud cost optimization helps you minimize this waste and avoid overspending. It goes beyond pure cost-cutting: it is a thorough analysis of current usage, identifying inefficiencies and eliminating waste to maximize the value you get from the cloud.

More than just cutting costs, it’s also about ensuring your spending aligns with your business goals. Cloud cost optimization means understanding your cloud expenses and making smart adjustments to control costs without sacrificing performance. Also see our blog post on AWS and cloud cost optimization .

What is the AWS Marketplace?

The AWS Marketplace is a “curated digital catalog that customers can use to find, buy, deploy, and manage third-party software, data, and services to build solutions and run their businesses.” It features thousands of software solutions, including but not limited to security, networking, storage, machine learning, and business applications, from independent software vendors (ISVs). These offerings are easy to use and can be quickly deployed directly to an AWS environment, making it easy to integrate new solutions into your existing cloud infrastructure.

AWS Marketplace also offers various flexible pricing options, including hourly, monthly, annual, and BYOL (Bring Your Own License). And lastly, many of the software products available in the Marketplace have undergone rigorous security assessments and comply with industry standards and regulations. Also note that purchases from the AWS Marketplace can count towards AWS Enterprise Discount Program (EDP) commitments. See our blog post on the EDP .

Cloud Cost Optimization Tools on AWS Marketplace you can use to Optimize your Cloud Costs

In addition to its thousands of software products, AWS Marketplace also offers many products and services that can help you optimize your cloud costs. Here are some tools and ways in which you can use AWS Marketplace to do so effectively.

Cloud Cost Management Tools

AWS Marketplace hosts a variety of cost management tools that provide insights into your cloud spending. Products like CloudHealth and CloudCheckr offer comprehensive dashboards and reports that help you understand where your money is going. These tools can identify underutilized resources, recommend rightsizing opportunities, and alert you to unexpected cost spikes, enabling proactive management of your AWS expenses.

Optimizing Compute Costs: Reserved Instances and Savings Plans

One of the most effective ways to reduce AWS costs is by purchasing Reserved Instances (RIs) and Savings Plans. However, understanding the best mix and commitment level can be challenging. Tools like Spot.io and Cloudability available on AWS Marketplace can analyze your usage patterns and recommend the optimal RI or Savings Plan purchases. These products ensure you get the best return on your investment while maintaining the flexibility to adapt to changing workloads.

Optimizing Cloud Storage Costs

Data storage can quickly become one of the largest expenses in your AWS bill. Simplyblock, available on AWS Marketplace, is the next generation of software-defined storage, meeting the storage requirements of the most demanding workloads. High IOPS-per-gigabyte density, low predictable latency, and high throughput are enabled by pooled storage and our distributed data placement algorithm. Using erasure coding (a better RAID) instead of replicas helps to minimize storage overhead without sacrificing data safety and fault tolerance.

Automate Resource Management

Automated resource management tools can help you scale your resources up or down based on demand, ensuring you only pay for what you use. Products like ParkMyCloud and Scalr can automate the scheduling of non-production environments to shut down during off-hours, significantly reducing costs. These tools also help in identifying and terminating idle resources, ensuring no wastage of your cloud budget.
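
A minimal sketch of that parking pattern, assuming non-production instances carry an environment tag with the values dev or staging, and that the script runs from a scheduled job (for example a cron or an EventBridge-triggered Lambda); the tag key and values are assumptions for illustration.

```python
# Sketch: stop all running EC2 instances tagged as non-production (environment=dev/staging).
import boto3

ec2 = boto3.client("ec2")

paginator = ec2.get_paginator("describe_instances")
instance_ids = []
for page in paginator.paginate(
    Filters=[
        {"Name": "tag:environment", "Values": ["dev", "staging"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            instance_ids.append(instance["InstanceId"])

if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)
    print(f"Stopped {len(instance_ids)} non-production instances")
else:
    print("No running non-production instances found")
```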

Enhance Security and Compliance

Security and compliance are critical but can also be cost-intensive. Utilizing AWS Marketplace products like Trend Micro and Alert Logic can enhance your security posture without the need for a large in-house team. These services provide continuous monitoring and automated compliance checks, helping you avoid costly breaches and fines while optimizing the allocation of your security budget.

Consolidate Billing and Reporting

For organizations managing multiple AWS accounts, consolidated billing and reporting tools can simplify cost management. AWS Marketplace offers solutions like CloudBolt and Turbonomic that provide a unified view of your cloud costs across all accounts. These tools offer detailed reporting and chargeback capabilities, ensuring each department or project is accountable for their cloud usage, promoting cost-conscious behavior throughout the organization.

By leveraging the diverse range of products available on AWS Marketplace, organizations can gain better control over their AWS spending, optimize resource usage, and enhance operational efficiency. Whether it’s through cost management tools, automated resource management, or enhanced security solutions, AWS Marketplace products provide the necessary tools to reduce cloud costs effectively.

How to Reduce EBS Cost in AWS?

AWS Marketplace storage solutions such as simplyblock can help reduce Amazon EBS costs and AWS database costs by up to 80%. Simplyblock offers high-performance cloud block storage that enhances the performance of your databases and applications. This ensures you get better value and efficiency from your cloud resources.

Simplyblock software provides a seamless bridge between local EC2 NVMe disk, Amazon EBS, and Amazon S3, integrating these storage options into a single, cohesive system designed for ultimate scale and performance of IO-intensive stateful workloads. By combining the high performance of local NVMe storage with the reliability and cost-efficiency of EBS and S3 respectively, simplyblock enables enterprises to optimize their storage infrastructure for stateful applications, ensuring scalability, cost savings, and enhanced performance. With simplyblock, you can save up to 80% on your EBS costs on AWS.

Our technology uses NVMe over TCP for minimal access latency, high IOPS/GB, and efficient CPU core utilization, outperforming local NVMe disks and Amazon EBS in cost/performance ratio at scale. Ideal for high-performance Kubernetes environments, simplyblock combines the benefits of local-like latency with the scalability and flexibility necessary for dynamic AWS EKS deployments , ensuring optimal performance for I/O-sensitive workloads like databases. By using erasure coding (a better RAID) instead of replicas, simplyblock minimizes storage overhead while maintaining data safety and fault tolerance. This approach reduces storage costs without compromising reliability.

Simplyblock also includes additional features such as instant snapshots (full and incremental), copy-on-write clones, thin provisioning, compression, encryption, and many more – in short, there are many ways in which simplyblock can help you optimize your cloud costs. Get started using simplyblock right now and see how simplyblock can help you on the AWS Marketplace .

To save on your cloud costs, you can also take advantage of discounts provided by various platforms. You can visit here to grab a discount on your AWS credits.

The post How to reduce AWS cloud costs with AWS marketplace products? appeared first on simplyblock.

Machine Learning driven Database Optimization with Luigi Nardi from DBtune (interview) https://www.simplyblock.io/blog/machine-learning-driven-database-optimization-with-luigi-nardi-from-dbtune-video/ Thu, 27 Jun 2024 12:09:00 +0000 https://www.simplyblock.io/?p=249 Introduction This interview is part of the simplyblock Cloud Commute Podcast, available on Youtube , Spotify , iTunes/Apple Podcasts , and our show site . In this insightful video, we explore the cutting-edge field of machine learning-driven database optimization with Luigi Nardi In this episode of the Cloud Commute podcast. Key Takeaways Q: Can machine […]

The post Machine Learning driven Database Optimization with Luigi Nardi from DBtune (interview) appeared first on simplyblock.

Introduction

This interview is part of the simplyblock Cloud Commute Podcast, available on Youtube , Spotify , iTunes/Apple Podcasts , and our show site .

In this episode of the Cloud Commute podcast, we explore the cutting-edge field of machine learning-driven database optimization with Luigi Nardi.

Key Takeaways

Q: Can machine learning improve database performance? Yes, machine learning can significantly improve database performance. DBtune uses machine learning algorithms to automate the tuning of database parameters, such as CPU, RAM, and disk usage. This not only enhances the efficiency of query execution but also reduces the need for manual intervention, allowing database administrators to focus on more critical tasks. The result is a more responsive and cost-effective database system.

Q: How do machine learning models predict query performance in databases? DBtune employs probabilistic models to predict query performance. These models analyze various metrics, such as CPU usage, memory allocation, and disk activity, to forecast how queries will perform under different conditions. The system then provides recommendations to optimize these parameters, ensuring that the database operates at peak efficiency. This predictive capability is crucial for maintaining performance in dynamic environments.
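
To make the general idea tangible (this is not DBtune’s implementation, just a generic sketch of model-based tuning), the example below uses scikit-optimize’s Gaussian-process optimizer to search over two hypothetical knobs against a synthetic objective; in a real setup the objective would apply a configuration and measure workload latency.

```python
# Illustrative sketch of probabilistic (Gaussian-process) configuration tuning.
# Knob ranges and the synthetic objective are assumptions, not DBtune's algorithm.
from skopt import gp_minimize
from skopt.space import Integer

search_space = [
    Integer(128, 8192, name="shared_buffers_mb"),
    Integer(4, 512, name="work_mem_mb"),
]

def objective(params):
    shared_buffers_mb, work_mem_mb = params
    # In a real system this would apply the configuration, observe the workload,
    # and return a measured cost such as average query latency. Here we fake a
    # response surface so the example runs standalone.
    return ((shared_buffers_mb - 4096) ** 2) / 1e6 + ((work_mem_mb - 64) ** 2) / 1e3

result = gp_minimize(objective, search_space, n_calls=20, random_state=42)
print("best configuration:", dict(zip(["shared_buffers_mb", "work_mem_mb"], result.x)))
print("predicted cost:", result.fun)
```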

Q: What are the main challenges in integrating AI-driven optimization with legacy database systems? Integrating AI-driven optimization into legacy systems presents several challenges. Compatibility issues are a primary concern, as older systems may not easily support modern optimization techniques. Additionally, there’s the need to gather sufficient data to train machine learning models effectively. Luigi also mentions the importance of addressing security concerns, especially when sensitive data is involved, and ensuring that the integration process does not disrupt existing workflows.

Q: Can you provide examples of successful AI-driven query optimization in real-world applications? DBtune has successfully applied its technology across various database systems, including Postgres, MySQL, and SAP HANA. For instance, in a project with a major telecom company, DBtune’s optimization algorithms reduced query execution times by up to 80%, leading to significant cost savings and improved system responsiveness. These real-world applications demonstrate the practical benefits of AI-driven query optimization in diverse environments.


In addition to highlighting the key takeaways, it’s essential to provide deeper context and insights that enrich the listener’s understanding of the episode. By offering this added layer of information, we ensure that when you tune in, you’ll have a clearer grasp of the nuances behind the discussion. This approach enhances your engagement with the content and helps shed light on the reasoning and perspective behind the thoughtful questions posed by our host, Chris Engelbert. Ultimately, this allows for a more immersive and insightful listening experience.

Key Learnings

Q: Can machine learning be used for optimization?

Yes, machine learning can be highly effective in optimizing complex systems by analyzing large datasets and identifying patterns that might not be apparent through traditional methods. It can automatically adjust system configurations, predict resource needs, and streamline operations to enhance performance.

simplyblock Insight: While simplyblock does not directly use machine learning for optimization, it provides advanced infrastructure solutions that are designed to seamlessly integrate with AI-driven tools. This allows organizations to leverage machine learning capabilities within a robust and flexible environment, ensuring that their optimization processes are supported by reliable and scalable infrastructure.

Q: How does AI-driven query optimization improve database performance?

AI-driven query optimization improves database performance by analyzing system metrics in real-time and adjusting configurations to enhance data processing speed and efficiency. This leads to faster query execution and better resource utilization.

simplyblock Insight: simplyblock’s platform enhances database performance through efficient storage management and high availability features. By ensuring that storage is optimized and consistently available, simplyblock allows databases to maintain high performance levels, even as AI-driven processes place increasing demands on the system.

Q: What are the main challenges in integrating AI-driven optimization with legacy database systems?

Integrating AI-driven optimization with legacy systems can be challenging due to compatibility issues, the complexity of existing configurations, and the risk of disrupting current operations.

simplyblock Insight: simplyblock addresses these challenges by offering flexible deployment options that are compatible with legacy systems. Whether through hyper-converged or disaggregated setups, simplyblock enables seamless integration with existing infrastructure, minimizing the risk of disruption and ensuring that AI-driven optimizations can be effectively implemented.

Q: What is the relationship between machine learning and databases?

The relationship between machine learning and databases is integral, as machine learning algorithms rely on large datasets stored in databases to train and improve, while databases benefit from machine learning’s ability to optimize their performance and efficiency.

simplyblock Insight: simplyblock enhances this relationship by providing a scalable and reliable infrastructure that supports large datasets and high-performance demands. This allows databases to efficiently manage the data required for machine learning, ensuring that the training and inference processes are both fast and reliable.

Additional Nugget of Information

Q: How is the rise of vector databases impacting the future of machine learning and databases?

The rise of vector databases is revolutionizing how large language models and AI systems operate by enabling more efficient storage and retrieval of vector embeddings. These databases, such as pgvector for Postgres, are becoming essential as AI applications demand more from traditional databases. The trend indicates a future where databases are increasingly specialized to handle the unique demands of AI, which could lead to even greater integration between machine learning and database management systems. This development is likely to play a crucial role in the ongoing evolution of both AI and database technologies.
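
For readers who want to see the pgvector pattern in practice, here is a hedged sketch using psycopg2: embeddings are stored in a vector column and queried with the <-> distance operator. The connection string, table name, and tiny 3-dimensional toy vectors are assumptions for illustration.

```python
# Hedged sketch of storing and searching embeddings with pgvector in Postgres.
import psycopg2

conn = psycopg2.connect("dbname=app user=postgres password=postgres host=localhost")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("CREATE TABLE IF NOT EXISTS docs (id bigserial PRIMARY KEY, embedding vector(3))")
cur.execute(
    "INSERT INTO docs (embedding) VALUES (%s::vector), (%s::vector)",
    ("[1, 0, 0]", "[0.9, 0.1, 0]"),
)

# Find the stored embeddings closest to a query vector (L2 distance via <->).
cur.execute(
    "SELECT id, embedding <-> %s::vector AS distance "
    "FROM docs ORDER BY distance LIMIT 5",
    ("[1, 0, 0.05]",),
)
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()
```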

Conclusion

Luigi Nardi showcases how machine learning is transforming database optimization. As DBtune’s founder, he highlights the power of AI to boost performance, cut costs, and enhance sustainability in database management. The discussion also touches on emerging trends like vector databases and DBaaS, making it a must-listen for anyone keen on the future of database technology. Stay tuned for more videos on cutting-edge technologies and their applications.

Full Episode Transcript

Chris Engelbert: Hello, everyone. Welcome back to this week’s episode of simplyblock’s Cloud Commute podcast. This week I have Luigi with me. Luigi, obviously, from Italy. I don’t think he has anything to do with Super Mario, but he can tell us about that himself. So welcome, Luigi. Sorry for the really bad joke.

Luigi Nardi: Glad to be here, Chris.

Chris Engelbert: So maybe you start with introducing yourself. Who are you? We already know where you’re from, but I’m not sure if you’re actually residing in Italy. So maybe just tell us a little bit about you.

Luigi Nardi: Sure. Yes, I’m originally Italian. I left the country to explore and study abroad a little while ago. So in 2006, I moved to France and studied there for a little while. I spent almost seven years in total in France eventually. I did my PhD program there in Paris and worked in a company as a software engineer as well. Then I moved to the UK for a few years, did a postdoc at Imperial College London in downtown London, and then moved to the US. So I lived in California, Palo Alto more precisely, for a few years. Then in 2019, I came back to Europe and established my residency in Sweden.

Chris Engelbert: Right. Okay. So you’re in Sweden right now.

Luigi Nardi: That’s correct.

Chris Engelbert: Oh, nice. Nice. How’s the weather? Is it still cold?

Luigi Nardi: It’s great. Everybody thinks that Sweden has very bad weather, but Sweden is a very, very long country. So if you reside in the south, actually, the weather is pretty decent. It doesn’t snow very much.

Chris Engelbert: That is very true. I actually love Stockholm, a very beautiful city. All right. One thing you haven’t mentioned, you’re actually the founder and CEO of DBtune. So you left out the best part, I guess. Maybe tell us a little bit about DBtune now.

Luigi Nardi: Sure. DBtune is a company that is about four years old now. It’s a spinoff from Stanford University and the commercialization of about a decade of research and development in academia. We were working on the intersection between machine learning and computer systems, specifically the use of machine learning to optimize computer systems. This is an area that in around 2018 or 2019 received a new name, which is MLSys, machine learning and systems. This new area is quite prominent these days, and you can do very beautiful things with the combination of these two pieces. DBtune is specifically focusing on using machine learning to optimize computer systems, particularly in the computer system area. We are optimizing databases, the database management systems more specifically. The idea is that you can automate the process of tuning databases. We are focusing on the optimization of the parameters of the database management systems, the parameters that govern the runtime system. This means the way the disk, the RAM, and the CPU interact with each other. You take the von Neumann model and try to make it as efficient as possible through optimizing the parameters that govern that interaction. By doing that, you automate the process, which means that database engineers and database administrators can focus on other tasks that are equally important or even more important. At the same time, you get great performance, you can reduce your cloud costs as well. If you’re running in the cloud in an efficient way, you can optimize the cloud costs. Additionally, you get a check on your greenops, meaning the sustainability aspect of it. So this is one of the examples I really like of how you can be an engineer and provide quite a big contribution in terms of sustainability as well because you can connect these two things by making your software run more efficiently and then scaling down your operations.

Chris Engelbert: That is true. And it’s, yeah, I’ve never thought about that, but sure. I mean, if I get my queries to run more efficient and use less compute time and compute power, huh, that is actually a good thing. Now I’m feeling much better.

Luigi Nardi: I’m feeling much better too. Since we started talking a little bit more about this, we have a blog post that will be released pretty soon about this very specific topic. I think this connection between making software run efficiently and the downstream effects of that efficiency, both on your cost, infrastructure cost, but also on the efficiency of your operations. It’s often underestimated, I would say.

Chris Engelbert: Yeah, that’s fair. It would be nice if you, when it’s published, just send me over the link and I’m putting it into the show notes because I think that will be really interesting to a lot of people. As he said specifically for developers that would otherwise have a hard time having anything in terms of sustainability. You mentioned database systems, but I think DBtune specifically is focused on Postgres, isn’t it?

Luigi Nardi: Right. Today we are focusing on Postgres. As a proof of concept, though, we have applied similar technology to five different database management systems, including relational and non-relational systems as well. So we were, a little while ago, we wanted to show that this technology can be used across the board. And so we play around with MySQL, with FoundationDB, which is the system behind iCloud, for example, and many of the VMware products. And then we have RocksDB, which is behind your Instagram and Facebook and so on. Facebook is really pushing that open source storage system. And things like SAP HANA as well, we’ve been focusing on that a little bit as well, just as a proof of concept to show that basically the same methodology can apply to very different database management systems in general.

Chris Engelbert: Right. You want to look into Oracle and take a chunk of their money, I guess. But you’re on the right track with SAP HANA. It’s kind of on the same level. So how does that work? I think you have to have some kind of an agent inside of your database. For Postgres, you’re probably using the stats tables, but I guess you’re doing more, right?

Luigi Nardi: Right. This is the idea of, you know, observability and monitoring companies. They mainly focus on gathering all this metrics from the machine and then getting you a very nice visualization on your dashboard. As a user, you would look at these metrics and how they evolve over time, and then they help you guide the next step, which is some sort of manual optimization of your system. We are moving one step forward and we’re trying to use those metrics automatically instead of just giving them back to the user. So we move from a passive monitoring approach to an active approach where the metrics are collected and then the algorithm will help you also to automatically change the configuration of the system in a way that it gets faster over time. And so the metrics that we look at usually are, well, the algorithm itself will gather a number of metrics to help it to improve over time. And this type of metrics are related to, you know, your system usage, you know, CPU memory and disk usage. And other things, for example, latency and throughput as well from your Postgres database management system. So using things like pg_stat_statements, for example, for people that are a little more familiar with Postgres. And by design, we refrain from looking inside your tables or looking specifically at your metadata, at your queries, for example, we refrain from that because it’s easier to basically, you know, deploy our system in a way that it’s not dangerous for your data and for your privacy concerns and things like that.
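
To illustrate the kind of passive telemetry described here (this is not the DBtune agent, just a sketch of the idea), the snippet below pulls aggregate statement metrics from pg_stat_statements without touching query text or table data. It assumes the extension is enabled and a Postgres 13+ column layout (total_exec_time/mean_exec_time; older versions expose total_time/mean_time instead).

```python
# Illustrative sketch: collect aggregate latency/throughput metrics from pg_stat_statements.
# Connection details are assumptions; column names assume Postgres 13+.
import psycopg2

conn = psycopg2.connect("dbname=app user=postgres password=postgres host=localhost")
cur = conn.cursor()

cur.execute("""
    SELECT queryid, calls, total_exec_time, mean_exec_time, rows
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10
""")
for queryid, calls, total_ms, mean_ms, rows in cur.fetchall():
    # Only aggregate runtime metrics are read -- no query text or table data.
    print(f"queryid={queryid} calls={calls} total={total_ms:.1f}ms mean={mean_ms:.1f}ms rows={rows}")

cur.close()
conn.close()
```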

Chris Engelbert: Right. Okay. And then you send that to a cloud instance that visualizes the data, just the simple stuff, but there’s also machine learning that actually looks at all the collected data and I guess try to find pattern. And how does that work? I mean, you probably have a version of the query parser, the Postgres query parser in the backend to actually make sense of this information, see what the execution plan would be. That is just me guessing. I don’t want to spoil your product.

Luigi Nardi: No, that’s okay. So the agent is open source and it gets installed on your environment. And anyone fluent in Python can read that in probably 20 minutes. So it’s pretty, it’s not massive. It’s not very big. That’s what gets connected with our backend system, which is running in our cloud. And the two things connect and communicate back and forth. The agent reports the metrics and requests what’s the next recommendation from the optimizer that runs in our backend. The optimizer responds with a recommendation, which is then enabled in the system through the agent. And then the agent also starts to measure what’s going on on the machine before reporting these metrics back to the backend. And so this is a feedback loop and the optimizer gets better and better at predicting what’s going on on the other side. So this is based on machine learning technology and specifically probabilistic models, which I think is the interesting part here. By using probabilistic models, the system is able to predict the performance for a new guess, but also predict the uncertainty around that estimate. And that’s, I think, very powerful to be able to combine some sort of prediction, but also how confident you are with respect to that prediction. And those things are important because when you’re optimizing a computer system, of course, you’re running this in production and you want to make sure that this stays safe for the system that is running. You’re changing the system in real time. So you want to make sure that these things are done in a safe way. And these models are built in a way that they can take into account all these unpredictable things that may otherwise book in the engineer system.

Chris Engelbert: Right. And you mentioned earlier that you’re looking at the pg_stat_statements table, can’t come up with the name right now. But that means you’re not looking at the actual data. So the data is secure and it’s not going to be sent to your backend, which I think could be a valid fear from a lot of people like, okay, what is actually being sent, right?

Luigi Nardi: Exactly. So Chris, when we talk with large telcos and big banks, the first thing that they say, what are you doing to my data? So you need to sit down and meet their infosec teams and explain to them that we’re not transferring any of that data. And it’s literally just telemetrics. And those telemetrics usually are not sensitive in terms of privacy and so on. And so usually there is a meeting that happens with their infosec teams, especially for big banks and telcos, where you clarify what is being sent and then they look at the source code because the agent is open source. So you can look at the open source and just realize that nothing sensitive is being sent to the internet.

Chris Engelbert: Right.

Luigi Nardi: And perhaps to add one more element there. So for the most conservative of our clients, we also provide a way to deploy this technology in a completely offline manner. So when everybody’s of course excited about digital transformations and moving to the cloud and so on, we actually went kind of backwards and provided a way of deploying this, which is sending a standalone software that runs in your environment and doesn’t communicate at all to the internet. So we have that as an option as well for our users. And that supports a little harder for us to deploy because we don’t have direct access to that anymore. So it’s easy for us to deploy the cloud-based version. But if you, you know, in some cases, you know, there is not very much you can do that will not allow you to go through the internet. There are companies that don’t buy Salesforce for that reason. So if you don’t buy Salesforce, you probably not buy from anybody else on the planet. So for those scenarios, that’s what we do.

Chris Engelbert: Right. So how does it work afterwards? So the machine learning looks into the data, tries to find patterns, has some optimization or some … Is it only queries or does it also give me like recommendations on how to optimize the Postgres configuration itself? And how does that present those? I guess they’re going to be shown in the UI.

Luigi Nardi: So we’re specifically focusing on that aspect, the optimization of the configuration of Postgres. So that’s our focus. And so the things like, if you’re familiar with Postgres, things like the shared buffers, which is this buffer, which contains the copy of the data from tables from the disk and keep it a local copy on RAM. And that data is useful to keep it warm in RAM, because when you interact with the CPU, then you don’t need to go all the way back to disk. And so if you go all the way back to disk, there is an order of magnitude more like delay and latency and slow down based on that. So you try to keep the data close to where it’s processed. So trying to keep the data in cache as much as possible and share buffer is a form of cache where the cache used in this case is a piece of RAM. And so sizing these shared buffers, for example, is important for performance. And then there are a number of other things similar to that, but slightly different. For example, in Postgres, there is an allocation of a buffer for each query. So each query has a buffer which can be used as an operating memory for the query to be processed. So if you’re doing some sort of like sorting, for example, in the query that small memory is used again. And you want to keep that memory close to the CPU and specifically the work_mem parameter, for example, is what helps with that specific thing. And so we optimize all this, all these things in a way that the flow of data from disk to the registers of the CPU, it’s very, very smooth and it’s optimized. So we optimize the locality of the data, both spatial and temporal locality if you want to use the technical terms for that.
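
As a hedged sketch of how such parameters can be applied in practice (the values and connection details are assumptions, not a DBtune recommendation): ALTER SYSTEM persists the settings to postgresql.auto.conf, work_mem takes effect after a configuration reload, and shared_buffers additionally requires a server restart.

```python
# Sketch: apply a recommended configuration via ALTER SYSTEM.
# Values and connection details are illustrative assumptions.
import psycopg2

recommendation = {"shared_buffers": "4GB", "work_mem": "64MB"}

conn = psycopg2.connect("dbname=app user=postgres password=postgres host=localhost")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
cur = conn.cursor()

for parameter, value in recommendation.items():
    # Parameter names come from our own trusted dict above.
    cur.execute(f"ALTER SYSTEM SET {parameter} = %s", (value,))

cur.execute("SELECT pg_reload_conf()")  # picks up reloadable settings like work_mem
cur.close()
conn.close()
```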

Chris Engelbert: Right. Okay. So it doesn’t help me specifically with my stupid queries. I still have to find a consultant to fix that or find somebody else in the team.

Luigi Nardi: Yeah, for now, that’s correct. We will probably focus on that in the future. But for now, the way you usually optimize your queries is that you optimize your queries and then if you want to see what’s the actual benefit, you should also optimize your parameters. And so if you want to do it really well, you should optimize your queries, then you go optimize your parameters and go back optimize again your queries, parameters and kind of converge into this process. So now that one of the two is fully automated, you can focus on the queries and, you know, speed up the process of optimizing the queries by a large margin. So to in terms of like benefits, of course, if you optimize your queries, you will write your queries, you can get, you know, two or three order of magnitude performance improvement, which is really, really great. If you optimize the configuration of your system, you can get, you know, an order of magnitude in terms of performance improvement. And that’s, that’s still very, very significant. Despite what many people say, it’s possible to get an order of magnitude improvement in performance. If your system by baseline, it’s fairly, it’s fairly basic, let’s say. And the interesting fact is that by the nature of Postgres, for example, the default configuration of Postgres needs to be pretty conservative because Postgres needs to be able to run on big server machines, but also on smaller machines. So the form factor needs to be taken into account when you define the default configuration of Postgres. And so by that fact, it needs to be pretty conservative. And so what you can observe out there is that this problem is so complex that people don’t really change the default configuration of Postgres when they run on a much bigger instance. And so there is a lot of performance improvement that can be obtained by changing that configuration to a better-suited configuration. And you have the point of doing this through automation and through things like DBtune is that you can then refine the configuration of your system specifically for the specific use case that you have, like your application, your workload, the machine size, and all these things are considered together to give you the best outcome for your use case, which is, I think, the new part, the novelty of this approach, right? Because if you’re doing this through some sort of heuristics, they usually don’t really get to cover all these different things. And there will always be kind of super respect to what you can do with an observability loop, right?

Chris Engelbert: Yeah, and I think you mentioned that a lot of people don’t touch the configuration. I think there is the problem that the Postgres configuration is very complex. A lot of parameters depend on each other. And it’s, I mean, I’m coming from a Java background, and we have the same thing with garbage collectors. Optimizing a garbage collector, for every single algorithm you have like 20 or 30 parameters, all of them depend on each other. Changing one may completely disrupt all the other ones. And I think that is what a lot of people kind of fear away from. And then you Google, and then there’s like the big Postgres community telling you, “No, you really don’t want to change that parameter until you really know what you’re doing,” and you don’t know, so you leave it alone. So in this case, I think something like Dbtune will be or is absolutely amazing.

Luigi Nardi: Exactly. And, you know, if you spend some time on blog posts learning about the Postgres parameters you get that type of feedback and it takes a lot of time to learn it in a way that you can feel confident and comfortable in changes in your production system, especially if you’re working in a big corporation. And the idea here is that at DBtune we are partnered with leading Postgres experts as well. Magnus Hagander, for example, who is president of the Postgres Europe organization, has been doing this manual tuning for about two decades and we worked very closely with him to be able to really do this in a very safe manner, right. You should basically trust our system to be doing the right thing because it’s engineered in a way that incorporates a lot of domain expertise so it’s not just machine learning it’s also about the specific Postgres domain expertise that you need to do this well and safely.

Chris Engelbert: Oh, cool. All right. We’re almost out of time. Last question. What do you think it’s like the next big thing in Postgres and databases, in cloud, in db tuning.

Luigi Nardi: That’s a huge question. So we’ve seen all sorts of things happening recently with, of course, AI stuff but, you know, I think it’s, it’s too simple to talk about that once more I think you guys covered those type of topics a lot. I think what’s interesting is that there is there is a lot that has been done to support those type of models and using for example the rise of vector databases for example, which was I think quite interesting vector databases like for example the extension for Postgres, the pgvector was around for a little while but in last year you really saw a huge adoption and that’s driven by all sort of large language models that use this vector embeddings and that’s I think a trend that will see for a little while. For example, our lead investor 42CAP, they recently invested in another company that does this type of things as well, Qdrant for example, and there are a number of companies that focus on that Milvus and Chroma, Zilliz, you know, there are a number of companies, pg_vectorize as well by the Tembo friends. So this is certainly a trend that will stay and for a fairly long time. In terms of database systems, I am personally very excited about the huge shift left that is happening in the industry. Shift left the meaning all the databases of service, you know, from Azure flexible server Amazon RDS, Google Cloud SQL, those are the big ones, but there are a number of other companies that are doing the same and they’re very interesting ideas, things that are really, you know, shaping that whole area, so I can mention a few for example, Tembo, even EnterpriseDB and so on that there’s so much going on in that space and in some sort, the DBtune is really in that specific direction, right? So helping to automate more and more of what you need to do in a database when you’re operating at database. From a machine learning perspective, and then I will stop that Chris, I think we’re running out of time. From machine learning perspective, I’m really interested in, and that’s something that we’ve been studying for a few years now in my academic team, with my PhD students. The, you know, pushing the boundaries of what we can do in terms of using machine learning for computer systems and specifically when you get computer systems that have hundreds, if not thousands of parameters and variables to be optimized at the same time jointly. And we have recently published a few pieces of work that you can find on my Google Scholar on that specific topic. So it’s a little math-y, you know, it’s a little hard to maybe read them parts, but it’s quite rewarding to see that these new pieces of technology are becoming available to practitioners and people that work on applications as well. So that perhaps the attention will move away at some point from full LLMs to also other areas in machine learning and AI that are also equally interesting in my opinion.

Chris Engelbert: Perfect. That’s, that’s beautiful. Just send me the link. I’m happy to put it into the show note. I bet there’s quite a few people that would be really, really into reading those things. I’m not big on mathematics that’s probably way over my head, but that’s, that’s fine. Yeah, I was that was a pleasure. Thank you for being here. And I hope we. Yeah, I hope we see each other somewhere at a Postgres conference we just briefly talked about that before the recording started. So yeah, thank you for being here. And for the audience, I see you, I hear you next week or you hear me next week with the next episode. And thank you for being here as well.

Luigi Nardi: Awesome. For the audience: we will be at the Postgres Switzerland conference as sponsors and we will be giving talks there, so if you come by, feel free to say hi, and we can grab coffee together. Thank you very much.

Chris Engelbert: Perfect. Yes. Thank you. Bye bye.

The post Machine Learning driven Database Optimization with Luigi Nardi from DBtune (interview) appeared first on simplyblock.
