Simplyblock for AWS: Environments with many gp2 or gp3 Volumes

When operating your stateful workloads in Amazon EC2 and Amazon EKS, data is commonly stored on Amazon’s EBS volumes. AWS offers a set of different volume types with different performance characteristics. The most commonly used ones are gp2 and gp3 volumes, providing a good combination of performance, capacity, and cost efficiency. So why would someone need an alternative?

For environments with high-performance requirements such as transactional databases, where low-latency access and optimized storage costs are key, alternative solutions are essential. This is where simplyblock steps in, offering a new way to manage storage that addresses common pain points in traditional EBS or local NVMe disk usage—such as limited scalability, complex resizing processes, and the cost of underutilized storage capacity.

What is Simplyblock?

Simplyblock is known for providing top performance based on distributed (clustered) NVMe instance storage at low cost with great data availability and durability. Simplyblock provides storage to Linux instances and Kubernetes environments via the NVMe block storage and NVMe over Fabrics (using TCP/IP as the underlying transport layer) protocols and the simplyblock CSI Driver.

Simplyblock’s storage orchestration technology is fast. The service provides access latency between 100 us and 500 us, depending on the IO access pattern and deployment topology. That means simplyblock’s access latency is comparable to, or even lower than, that of Amazon EBS io2 volumes, which typically provide between 200 us and 300 us.

To make sure we only provide storage which will keep up, we test simplyblock extensively. With simplyblock you can easily achieve more than 1 million IOPS at a 4KiB block size on single EC2 compute instances. This is several times higher than io2 Block Express, the most scalable Amazon EBS volume type. At the same time, simplyblock’s cost of capacity is comparable to io2. However, with simplyblock, IOPS come for free – at absolutely no extra charge. Therefore, depending on the capacity-to-IOPS ratio of your io2 volumes, cost advantages of up to 10x are possible.

For customers requiring very low storage access latency and high IOPS per TiB, simplyblock provides the best cost efficiency available today.

Why Simplyblock over Simple Amazon EBS?

Many customers are generally satisfied with the performance of their gp3 EBS volumes. Access latency of 6 to 10 ms is fine for them, and they never need to go beyond the included 3,000 IOPS (on gp2 and gp3). They should still care about simplyblock, because there is more. Much more.

Simplyblock provides multiple angles to save on storage: true thin provisioning, storage tiering, multi-attach, and snapshot storage!

Benefits of Thin Provisioning

With gp3, customers pay for provisioned rather than utilized capacity (~USD 80 per TiB provisioned). According to our research, the average utilization of Amazon EBS gp3 volumes is only around 30%. This means customers are effectively paying more than three times the price per TiB of storage they actually use: with utilization below one-third, the effective price rises to roughly USD 250 per utilized TiB. The higher the utilization, the closer a customer gets to the nominal USD 80 per TiB.
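
As a quick sanity check, the arithmetic behind that effective price can be sketched in a few lines of Python. The USD 80 and 30% figures are the ones quoted above; the exact result depends on the utilization you plug in.

```python
# Effective price per *utilized* TiB when paying for provisioned gp3 capacity.
provisioned_price_per_tib = 80.0   # ~USD 80 per provisioned TiB (figure quoted above)
utilization = 0.30                 # ~30% average utilization per the research above

effective_price_per_utilized_tib = provisioned_price_per_tib / utilization
print(round(effective_price_per_utilized_tib, 2))  # 266.67 – in the ballpark of the ~USD 250 quoted
```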

In addition to the price inefficiency, customers also have to manage the resizing of gp3 volumes when utilization approaches the current capacity limit. However, resizing has its own limitations in EBS: it is only possible once every six hours. To mitigate potential issues during that window, volumes are commonly doubled in size.

On the other hand, simplyblock provides thin provisioned logical volumes. This means that you can provision your volumes nearly without any restriction in size. Think of growable partitions that are sliced out of the storage pool. Logical volumes can also be over-provisioned, meaning you can set the requested storage capacity to exceed the storage pool’s current size. There is no charge for the over-provisioned capacity as long as you do not use it.

A thinly provisioned logical volume requires only the amount of storage actually used

That said, simplyblock thinly provisions NVMe volumes from a storage pool which is either made up of distributed local instance storage or gp3 volumes. The underlying pool is resized before it runs out of storage capacity.
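
To illustrate the kind of automation that thin provisioning makes unnecessary, here is a minimal, hypothetical sketch of threshold-based pool expansion. The helper functions get_pool_usage() and expand_pool() are illustrative stand-ins, not simplyblock APIs.

```python
EXPANSION_THRESHOLD = 0.80   # expand once 80% of the pool capacity is in use
EXPANSION_STEP_TIB = 2       # grow the pool in 2 TiB increments

def maybe_expand_pool(get_pool_usage, expand_pool):
    """Grow the backing pool before utilization crosses the safety threshold."""
    used_tib, total_tib = get_pool_usage()
    if used_tib / total_tib >= EXPANSION_THRESHOLD:
        expand_pool(additional_tib=EXPANSION_STEP_TIB)
```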

These mechanisms enable you to save massively on storage while also simplifying your operations. No more manual or script-based resizing! No more custom alerts before running out of storage.

Benefits of Storage Tiering

But if you feel there should be even more potential to save on storage, you are absolutely right!

The data stored on a single EBS volume typically has very different access patterns. Let’s explore what an average database setup looks like. A typical corporate transactional database easily qualifies as “hot” storage. It is commonly stored on SSD-based EBS volumes; nobody would think of putting this database on slow file storage backed by HDD or Amazon S3.

Simplyblock tiers infrequently used data blocks automatically to cheaper storage backends

In reality, however, data that belongs to a database is never homogeneous when it comes to performance requirements. There is, for example, the so-called database transaction log, often referred to as write-ahead log (WAL) or simply a database journal. The WAL is quite sensitive to access latency and requires a high IOPS rate for writes. On the other hand, the log is relatively small compared to the entire dataset in the database.

Furthermore, other data files store tablespaces and index spaces. Many of them are read so frequently that they are permanently kept in memory and hardly depend on storage performance at all. Others are accessed less frequently and have to be loaded from storage every time they are needed; these require solid read performance from the storage.

Last but not least, there are large tables which are commonly used for archiving or document storage. They are written or read infrequently and typically in large IO sizes (batches). While throughput speed is relevant for accessing this data, access latency is not.

To support all of the above use cases, simplyblock supports automatic tiering. Our tiering places less frequently accessed data on either Amazon EBS (st1) or Amazon S3, which we call warm storage. The tiering implementation is optimized for throughput, so large amounts of data can be written or read in parallel. Simplyblock automatically identifies individual segments of data that qualify for tiering, moves them to secondary storage, and cleans them up on the “hot” tier only after tiering has completed successfully. This reduces the storage demand in the hot pool.

The AWS cost ratio between hot and warm storage is about 5:1, cutting cost to about 20% for tiered data. Tiering is completely transparent to you and data is automatically read from tiered storage when requested.

Based on our observations, we often see that up to 75% of all stored data can be tiered to warm storage. This creates another massive potential in storage costs savings.
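
Taken together, the two figures above give a rough idea of the blended storage price with tiering enabled. The following sketch reuses the ~USD 80 per TiB gp3 figure from earlier and assumes the ~5:1 hot-to-warm cost ratio and the 75% tiered share mentioned above; your actual numbers will vary.

```python
# Illustrative blended cost per TiB with automatic tiering enabled.
hot_price_per_tib = 80.0                     # ~USD 80/TiB (gp3 figure from above)
warm_price_per_tib = hot_price_per_tib / 5   # ~5:1 hot-to-warm cost ratio
tiered_fraction = 0.75                       # up to 75% of data qualifies for tiering

blended = (1 - tiered_fraction) * hot_price_per_tib + tiered_fraction * warm_price_per_tib
print(blended)  # 32.0 -> roughly 40% of the all-hot price
```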

How to Prevent Data Duplication

But there is yet more to come.

AWS’ gp3 volumes do not allow multi-attach, meaning the same volume cannot be attached to multiple virtual machines or containers at the same time. Furthermore, their durability is relatively low (indicated at 99.8% – 99.9%) compared to Amazon S3.

That means neither a loss of availability nor a loss of data can be ruled out in case of an incident.

Therefore, additional steps need to be taken to increase the availability of the storage-consuming service, as well as the reliability of the storage itself. The common measure is to employ storage replication (RAID-1 or application-level replication). However, this leads to additional operational complexity, consumes network bandwidth, and duplicates the storage demand (doubling storage capacity and cost).

Simplyblock mitigates the requirement to replicate storage. First, the same thinly provisioned volume can be attached to more than one Amazon EC2 instance (or container) and, second, the reliability of each individual volume is higher (99.9999%) due to the internal use of erasure coding (parity data) to protect the data.

Multi-attach helps to cut the storage cost by half.

The Cost of Backup

Last but not least, backups. Yes, there is even more.

A snapshot taken from an Amazon EBS volume is stored in S3-backed snapshot storage. However, AWS charges significantly more per TiB for it than for the same capacity stored directly on S3 – about 3.5 times as much.

Snapshots taken from simplyblock logical volumes, however, are stored in a standard Amazon S3 bucket and billed at standard S3 pricing, giving you yet another nice cost reduction.

Near-Zero RPO Disaster Recovery

There is one more feature we really want to talk about: disaster recovery. It is optional, comes with a minimal RPO, and can be deployed without any redundancy on either the block storage or the compute layer between zones. Additionally, no data transfers between zones are needed.

Simplyblock employs asynchronous replication to store any change on the storage pool to an S3 bucket. This enables a fully crash-consistent and near-real-time option for disaster recovery. You can bootstrap and restart your entire environment after a disaster. This works in the same or a different availability zone and without having to take care of backup management yourself.

And if something does happen, be it an accidental deletion or a successful ransomware attack that encrypted your data, simplyblock is here to help. Our asynchronous replication journal provides full Point-in-Time Recovery functionality on the block storage layer. There is no need for your service or database to support it. Just rewind the storage to any point in time in the past.

The journal also utilizes write and deletion protection on its S3 bucket, making it resilient to ransomware attacks. That said, simplyblock provides a sophisticated solution to disaster recovery and cybersecurity breaches without the need for manual backup management.

Simplyblock is Storage Optimization – just for you

Simplyblock provides a number of advantages for environments that utilize a large number of Amazon EBS gp2 or gp3 volumes. Thin provisioning enables you to consolidate unused storage capacity and minimize spend. Due to the automatic pool enlargement (increasing the pool with additional EBS volumes or storage nodes), you’ll never run out of storage space while provisioning only the capacity you actually need.

Together with automatic tiering, you can move infrequently used data blocks to warm or even cold storage, fully transparent to the application. The same is true for our disaster recovery: built into the storage layer, every application can benefit from point-in-time recovery, bringing the RPO (Recovery Point Objective) of your whole infrastructure close to zero. And with consistent snapshots across volumes, you can perform a full-blown infrastructure recovery in case of an availability zone outage, right from the ground up.

With simplyblock you get more features than mentioned here. Get started right away and learn about our other features and benefits.

Ransomware Attack Recovery with Simplyblock

In 2023, the number of ransomware victims more than doubled, with 2024 off to an even stronger start. A ransomware attack encrypts your local data, and the attackers demand a ransom to restore access. Increasingly, the data is also copied to remote locations to put additional pressure on companies to pay the ransom. This increases the risk of the data being leaked to the internet even if the ransom is paid. Strong ransomware protection and mitigation are now more important than ever.

Simplyblock provides sophisticated block storage-level Ransomware protection and mitigation. Together with recovery options, simplyblock enables Point-in-Time Recovery (PITR) for any service or solution storing data.

What is Ransomware?

Ransomware is a type of malicious software (also known as malware) designed to block access to a computer system and/or encrypt data until a ransom is paid to the attacker. Cybercriminals typically carry out this type of attack by demanding payment, often in cryptocurrency, in exchange for providing a decryption key to restore access to the data or system.

Statistics show a significant rise in ransomware attacks: ransomware cases more than doubled in 2023, and the amount of ransom paid reached more than a billion dollars – and these are only the official numbers. Many organizations prefer not to report breaches and payments, as ransom payments are illegal in many jurisdictions.

Number of quarterly Ransomware victims between Q1 2021 and Q1 2024

The Danger of Ransomware Increases

The number and sophistication of attack tools have also increased significantly. They are becoming increasingly commoditized and easy to use, drastically reducing the skills cyber criminals require to deploy them.

There are many best practices and tools to protect against successful attacks. However, little can be done once an account, particularly a privileged one, has been compromised. Even if the breach is detected, it is most often too late. Attackers may only need minutes to encrypt important data.

Storage, particularly backups, serves as a last line of defense. After a successful attack, they provide a means to recover. However, there are certain downsides to using backups to recover from a successful attack:

  • The latest backup does not contain all of the data: Data written between the last backup and the time of the attack is unrecoverably lost. Even the loss of one hour of data written to a database can be critical for many enterprises.
  • Backups are not consistent with each other: The backup of one database may not fit the backup of another database or a file repository, so the systems will not be able to integrate correctly after restoration.
  • The latest backups may already contain encrypted data. It may be necessary to go back in time to find an older backup that is still “clean.” This backup, if available at all, may be linked to substantial data loss.
  • Backups must be protected from writes and delete operations; otherwise, they can be destroyed or damaged by attackers. Attackers may also damage the backup inventory management system, making it hard or impossible to locate specific backups.
  • Human error in Backup Management may lead to missing backups.

Simplyblock for Ransomware Protection and Mitigation

Simplyblock provides a smart solution to recover data after a ransomware attack, complementing classical backups.

In addition to writing data to hot-tier storage, simplyblock creates an asynchronously replicated write-ahead log (WAL) of all data written. This log is optimized for high throughput to secondary (low-IOPS) storage, such as Amazon S3 or HDD pools like AWS’ EBS st1 service. If this secondary storage supports write and deletion protection for pre-defined retention periods, as S3 does, it is possible to “rewind” the storage to the point immediately before the attack. This enables data recovery with near-zero RPO (Recovery Point Objective).
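
For illustration, this is roughly how a write- and deletion-protected journal bucket could be set up with boto3 using S3 Object Lock. The bucket name, region, and retention period are placeholders, and this is a generic S3 configuration sketch rather than simplyblock’s actual setup procedure.

```python
import boto3

s3 = boto3.client("s3")
bucket = "example-replication-journal"  # placeholder name

# Object Lock can only be enabled when the bucket is created.
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": "eu-central-1"},  # placeholder region
    ObjectLockEnabledForBucket=True,
)

# Default retention: objects cannot be overwritten or deleted for 30 days,
# not even by the account root user (COMPLIANCE mode).
s3.put_object_lock_configuration(
    Bucket=bucket,
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}},
    },
)
```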

A recovery mechanism like this is particularly useful in combination with databases. Before the encryption can start, the database typically has to be stopped, because its data and WAL files are otherwise locked by the running process. This clean stop makes it possible to automatically identify a consistent recovery point with no data loss.

Timeline of a Ransomware attack

In the future, simplyblock plans to enhance this functionality further. A multi-stage attack detection mechanism will be integrated into the storage. In addition, deletion protection will only be lifted once a historical time window has been cleared of attacks, and attack launch points will be identified automatically and precisely to locate suitable recovery points.

Furthermore, simplyblock will support partial restores of recovery points, enabling different services’ data on the same logical volumes to be restored from individual points in time. This is important since the encryption of one service’s data might have started earlier or later than that of others, so the point in time to rewind to must differ.

Conclusion

Simplyblock provides a recovery solution complementary to classical backups. Backups support long-term storage of full recovery snapshots. In contrast, write-ahead-log-based recovery is specifically designed for near-zero RPO recovery right after a ransomware attack starts and enables quick and easy restoration of the data.

While many databases and data-storing services, such as PostgreSQL, provide their own Point-in-Time Recovery, their WAL segments can only be shipped out of the system once they are closed. That means the RPO comes down to the size of a WAL segment, whereas with simplyblock, due to its copy-on-write nature, the RPO can be as small as a single committed write.
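
To make the difference concrete, a rough back-of-the-envelope calculation: PostgreSQL’s default WAL segment size is 16 MiB, and the write rate below is an illustrative assumption.

```python
# Worst-case RPO when PITR relies on shipping closed WAL segments.
wal_segment_mib = 16          # PostgreSQL default WAL segment size
wal_write_rate_mib_s = 2.0    # assumed steady WAL write rate (illustrative)

worst_case_rpo_seconds = wal_segment_mib / wal_write_rate_mib_s
print(worst_case_rpo_seconds)  # 8.0 -> up to 8 seconds of committed writes at risk
```

With journal-level replication on the block storage layer, the exposure is bounded by a single committed write instead of a whole segment.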

Learn more about simplyblock and its other features like thin-provisioning, immediate clones and branches, encryption, compression, deduplication, and more. Or just get started right away and find the best Ransomware attack protection and mitigation to date.

Disaster Recovery with Simplyblock in AWS

When disaster strikes, a great recovery strategy is required. Oftentimes, deficiencies are only discovered when it’s already too late. Simplyblock provides comprehensive disaster recovery support for databases, file storage, and whole infrastructures, enabling a restore from the ground up in a different availability zone with minimal RTO (Recovery Time Objective) and near-zero RPO (Recovery Point Objective).

Amazon EBS, Amazon S3, and Local Instance Storage

AWS’ cloud block storage (Amazon EBS) is a great product, providing a multitude of different volume types depending on your performance requirements (random IOPS, access latency). However, the provided durability is limited. Depending on the EBS volume type, AWS indicates a durability between 99.8% and 99.999%. The bigger issue, though: in case of a disaster in your availability zone (AZ), storage will become unavailable in its entirety and, depending on the type of the disaster, data may actually be lost (partially or in full).

The durability is even worse with local instance storage. Local instance storage consists of NVMe disks that are physically located in the virtual machine host running your workload. That means all data stored on local instance storage is immediately lost once the instance is stopped or the physical host fails.

Amazon S3 storage, on the other hand, is considered extremely durable, offering 99.999999999% durability. In addition, it is replicated across availability zones. Therefore, the probability of data loss from any kind of disaster is close to zero. To our knowledge, and as of the time of writing, it has never actually happened. In terms of durability, Amazon S3 is king. We trade durability for latency, however.

Data Protection for Amazon EBS

As shown, all persistent (meaning non-ephemeral) data stored in Amazon EBS requires additional means of protection. The most common way to protect your data is to take a snapshot of your EBS volume and back it up to Amazon S3.
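
For reference, the conventional approach looks roughly like the boto3 sketch below. The volume ID, description, and tag are placeholders; in practice, this would run on a schedule (for example via Amazon Data Lifecycle Manager or a cron job).

```python
import boto3

ec2 = boto3.client("ec2")

response = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",   # placeholder volume ID
    Description="nightly backup",
    TagSpecifications=[
        {"ResourceType": "snapshot", "Tags": [{"Key": "retention", "Value": "30d"}]}
    ],
)
print(response["SnapshotId"], response["State"])  # e.g. snap-..., 'pending'
```

Everything written to the volume after this call is not covered until the next snapshot completes, which leads directly to the drawbacks below.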

Those S3 backups have a number of important drawbacks though:

  • A snapshot-based backup always implicitly means data loss of some kind. Data written between the last backup and the time of the failure is irrecoverably lost; no restore procedure will be able to recover it. For low-velocity data (data which rarely changes), such as media files or archived documents, this may be a minor issue. For other types of data, such as transactional systems, the loss can be catastrophic.
  • Multiple backups of different systems aren’t consistent with each other. The backup of one database may not fit the backup of another database or a file repository. After restoration, the systems may have inconsistent data states and will not integrate correctly. Bringing a collection of systems with backups taken at different times back into a working state can be a massive manual effort; sometimes it is even impossible.
  • Backup management is a significant effort. To free up disk space, snapshots have to be removed from EBS after moving them to S3. Furthermore, backups have to be configured with retention policies, the backup jobs must be monitored, and backups have to be tested regularly to make sure they can actually be restored successfully.

Last but not least, human error in backup management may lead to missing or corrupted backups.

Data Protection for Amazon EBS with Simplyblock

Simplyblock provides a smart solution to the consistent recovery of hot data after a major incident or even a zone-level disaster.

First and foremost, simplyblock logical volumes store data synchronously in the hot-tier storage backend. In addition, data is also written into an asynchronously replicated write-ahead log (WAL). Writing this log is optimized for high throughput to secondary (low-IOPS) storage such as S3 or HDD pools (e.g., the Amazon EBS st1 service). Last but not least, the WAL is compacted efficiently at regular intervals to limit storage growth and optimize recovery times.

Simplyblock’s logical volumes inherently support snapshots. Due to the copy-on-write nature of simplyblock, snapshots are taken immediately and, together with the WAL, asynchronously replicated to S3.

Data recovery, on the other hand, restores all live volumes and snapshots in a fully consistent manner. The asynchronicity of the replication limits data loss to a few hundred milliseconds.

Disaster Recovery with Near-Zero RPO

The solution stores all “hot” data either in distributed instance storage or within gp3 pools, providing the necessary online performance of storage. At the same time, all data is also asynchronously replicated into S3.

In case of a loss of the entire infrastructure in an availability zone (including the gp3 volumes and local instance storage) it is possible to consistently bootstrap the entire environment in a new AZ.

Simplyblock architecture with consistent disaster recovery across different services

If a customer uses simplyblock to store not only the databases but also bootstrap and deployment information (like ArgoCD configuration, Terraform data, or similar), a recovery operation can consistently restore the entire infrastructure from the ground up. Using this strategy, infrastructures supported by simplyblock can be recovered consistently and fully automatically with near-zero RPO and a low RTO.

For this purpose, the “primary” simplyblock storage pod, which contains all data required for bootstrapping, has to be restarted in a new zone and connected to the control plane. Afterwards, all storage is consistently accessible.

First, infrastructure templates and configurations for the environment are retrieved, after which the deployment scripts are run and the infrastructure is redeployed. In this process, databases, documents, and other file stores can already be connected to their corresponding volumes, which contain all of the data in a crash-consistent manner.

At a later stage, “secondary” storage plane pods can be restarted within the new availability zone and data will be recovered.

The recovery time depends largely on the amount of data and the instance network bandwidth. The read path from S3 is highly optimized, using large, parallel reads wherever possible to pre-fetch hot data as quickly as possible.
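
The idea of large, parallel ranged reads can be illustrated with a short boto3 sketch. This is not simplyblock’s actual implementation; the bucket, key, chunk size, and worker count are illustrative assumptions.

```python
import concurrent.futures
import boto3

s3 = boto3.client("s3")
BUCKET = "example-dr-journal"             # placeholder bucket
KEY = "volumes/vuid-42/segment-000042"    # placeholder object key
CHUNK = 64 * 1024 * 1024                  # 64 MiB ranged reads

def read_range(offset, size):
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={offset}-{offset + size - 1}")
    return offset, resp["Body"].read()

def parallel_fetch(total_size, workers=16):
    """Fetch one large object as many parallel ranged GETs instead of one sequential read."""
    offsets = range(0, total_size, CHUNK)
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(read_range, off, min(CHUNK, total_size - off)) for off in offsets]
        return [f.result() for f in futures]
```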

Conclusion

All that said, simplyblock, the intelligent storage orchestrator, provides a powerful and feature-rich solution for crash-consistent, yet performant storage.

Built upon well-known storage solutions, such as local instance storage, Amazon EBS, and Amazon S3, simplyblock combines the ultra-low-latency access of NVMe volumes (pooled or unpooled) with the extreme durability of Amazon S3. Simplyblock’s write-ahead log and disaster recovery support enable the lowest RPO and minimal downtime, even in case of the loss of a full availability zone.

Get started with simplyblock today and learn all about the other amazing features simplyblock brings right to you.

How We Built Our Distributed Data Placement Algorithm

Modern cloud applications demand more from their storage than ever before – ultra-low latency, predictable performance, and bulletproof reliability. Simplyblock’s software-defined storage cluster technology, built upon its distributed data placement algorithm, reimagines how we utilize NVMe devices in public cloud environments.

This article deep dives into how we’ve improved upon traditional distributed data placement algorithms to create a high-performance I/O processing environment that meets modern enterprise storage requirements.

Design Of Simplyblock’s Storage Cluster

Simplyblock storage cluster technology is designed to utilize NVMe storage devices in public cloud environments for use cases that require predictable and ultra-low access latency (sub-millisecond) and the highest performance density (high IOPS per GiB).

To combine high performance with a high degree of data durability, high availability, and fault tolerance, as well as zero downtime scalability, the known distributed data placement algorithms had to be improved, re-combined, and implemented into a high-performance IO processing environment.

Our innovative approach combines:

  • Predictable, ultra-low latency performance (<1ms)
  • Maximum IOPS density optimization
  • Enterprise-grade durability and availability
  • Zero-downtime scalability
  • Advanced failure domain management

Modern Storage Requirements

Use cases such as high-load databases, time-series databases with high-velocity data, Artificial Intelligence (AI), Machine Learning (ML), and many others require fast and predictable storage solutions.

Anyhow, performance isn’t everything. The fastest storage is writing to /dev/null, but that only works if you don’t need durability. That said, the main goals for a modern storage solution are:

  • High Performance Density, meaning a high amount of IOPS per Gigabyte (at an affordable price).
  • Predictable, low Latency, especially for use cases that require consistent response times.
  • High degree of Data Durability, to distribute the data across failure domains, enabling it to survive multiple failure scenarios.
  • High Availability and Fault Tolerance, for the data to remain accessible in case of node outage. Clusters are automatically re-balanced in the case of element failures.
  • Zero Downtime Scalability, meaning that clusters can grow in real-time and online and are automatically re-balanced.

Distributed Data Placement

Data placement in storage clusters commonly uses pseudo-randomization. Additionally, features such as weighted distribution of storage across the cluster (based on the capacity and performance of available data buckets) are introduced to handle failure domains and cluster rebalancing – for scaling, downsizing, or removal of failed elements – at minimal cost. A prominent example of such an algorithm is CRUSH (Controlled, Scalable, Decentralized Placement of Replicated Data), which is used in Ceph, an open-source software-defined storage platform designed to provide object storage, block storage, and file storage in a unified system.

Simplyblock uses a different algorithm to achieve the following characteristics for its distributed data placement feature:

  • High storage efficiency (raw to effective storage ratio) with minimal performance overhead. Instead of using three data replicas, which is the standard mechanism to protect data from storage device failure in software-defined storage clusters, simplyblock uses erasure coding algorithms with a raw-to-effective ratio of about 1.33 (instead of 3). The arithmetic behind this ratio is sketched after this list.
  • Very low access latency below 100 microseconds for read and write. Possible write amplification below 2.
  • Ultra-high IOPS density with more than 200,000 IOPS per CPU core.
  • Performant re-distribution of storage in the cluster in the case of cluster scaling and removal of failed storage devices. Simplyblock’s algorithm will only re-distribute the amount of data that is close to the theoretical minimum to rebalance the cluster.
  • Support for volume high-availability based on the NVMe industry standard. Support for simple failure domains as they are available in cloud environments (device, node, rack, availability zone).
  • Performance efficiency, aiming to address the typical performance bottlenecks in cloud environments.
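
The raw-to-effective ratio mentioned in the first bullet follows directly from the stripe geometry: n data chunks protected by p parity chunks consume (n + p)/n units of raw storage per unit of effective storage. The sketch below simply evaluates that formula for the chunk counts used later in this article; which configuration yields the quoted ~1.33 depends on the concrete cluster setup.

```python
# Raw-to-effective storage ratio for n data chunks protected by p parity chunks.
def raw_to_effective(n_data, n_parity=2):
    return (n_data + n_parity) / n_data

for n in (1, 2, 4, 8):                       # chunk counts mentioned in the mid-level pipeline below
    print(n, round(raw_to_effective(n), 2))  # 1: 3.0, 2: 2.0, 4: 1.5, 8: 1.25
```

Compare this with triple replication, whose raw-to-effective ratio is a flat 3.0.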

Implementing a storage solution to keep up with current trends required us to think out of the box. The technical design consists of several elements.

Low-level I/O Processing Pipeline

On the lower level, simplyblock uses a fixed-size page mapping algorithm implemented in a virtual block device (a virtual block device implements a filter or transformation step in the IO processing pipeline).

For that purpose, IO is organized into “pages” (2^m blocks, with m in the range of 8 to 12). Cross-page IO has to be split before processing. This is done in the mid-level processing pipeline. We’ll get to that in a second.

This algorithm can place data received via IO-write requests from multiple virtual block devices on a single physical block device. Each virtual block device has its own logical block address space though. The algorithm is designed to read, write, and unmap (deallocate) data with minimal write amplification for metadata updates (about 2%) and minimal increase in latency (on average in the sub-microseconds range). Furthermore, it is optimized for sudden power cuts (crash-consistent) by storing all metadata inline of the storage blocks on the underlying device.

Like all block device IO requests, each request contains an LBA (logical block address) and a length (in blocks). The 64-bit LBA is internally organized into a 24-bit VUID (a cluster-wide unique identifier of the logical volume) and a 39-bit virtual LBA. The starting LBA of the page on the physical device is identified by the key (VUID, LPA), where LPA is the logical page address (LBA / (2^m)), and the address offset within the page is determined by (LBA modulo 2^m).
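
The mapping arithmetic can be illustrated with a few lines of Python. The exact bit layout (VUID in the upper bits) and the choice of m = 12 are assumptions made for the sake of the example.

```python
M = 12                          # page size exponent: 2^12 blocks per page (within the 8..12 range above)
VUID_BITS, VLBA_BITS = 24, 39   # 24-bit VUID plus 39-bit virtual LBA, as described above

def split_lba(lba_64):
    """Decompose a 64-bit LBA into (VUID, logical page address, offset within the page)."""
    vuid = (lba_64 >> VLBA_BITS) & ((1 << VUID_BITS) - 1)  # assumed: VUID occupies the upper bits
    virtual_lba = lba_64 & ((1 << VLBA_BITS) - 1)
    lpa = virtual_lba >> M                                 # LBA / 2^m
    offset = virtual_lba & ((1 << M) - 1)                  # LBA modulo 2^m
    return vuid, lpa, offset

print(split_lba((42 << VLBA_BITS) | 123_456))  # (42, 30, 576): volume 42, page 30, offset 576
```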

The IO processing services of this virtual device work entirely asynchronously on CPU-pinned threads with entirely private data (no synchronization mechanisms between IO threads required).

They are placed on top of an entirely asynchronous NVMe driver, which receives IO and submits responses via IO-queue pairs sitting at the bottom of the IO processing stack.

Mid-level IO Processing Pipeline

On top of the low-level mapping device, a virtual block device, which implements distributed data placement, has access to the entire cluster topology. This topology is maintained by a centralized multi-tenant control plane, which knows about the state of each node and device in the attached clusters and manages changes to cluster topology (adding or removing devices and nodes).

It uses multiple mechanisms to determine the calculated and factual location of each data page and then issues asynchronous IO to this mapping device in the cluster locally (NVMe) or remotely using NVMe over Fabrics (NVMe-oF):

  1. All IO is received and forwarded on private IO threads, with no inter-thread communication, and entirely asynchronously, both inbound and outbound.
  2. First, IO is split at page boundaries so that single requests can be processed within a single page.
  3. Data is then striped into n chunks, and (double) parity is calculated from the chunks. Double parity is calculated using the RDP algorithm. The data is, therefore, organized in 2-dimensional arrays. n equals 1, 2, 4, or 8. This way, 4 KiB blocks can be mapped into 512-byte device blocks, and expensive partial stripe writes can be avoided.
  4. To determine a primary target for each combination of (VUID, page, chunk-index), a flat list of devices is fed into the “list bucket” algorithm (see …) with (VUID, page, chunk-index) being the key.
  5. Each of the data and parity chunks in a stripe has to be placed on a different device. In addition, further failure domains, such as nodes and racks, can be considered for placement anti-affinity rules. In case of a collision, the algorithm repeats recursively with an adjusted chunk-index (chunk-index + i x p, where p is the next prime number larger than the maximum chunk index and i is the iteration). A simplified sketch of this selection loop follows after this list.
  6. In case a selected device is (currently) not available, the algorithm repeats recursively to find an alternative and also stores the temporary placement data for each chunk in the IO. This temporary placement data is now also journaled as metadata. Metadata journaling is an important and complex part of the algorithm. It is described separately below.
  7. On read, the process is reversed: the chunks to read from a determined placement location are determined by the same algorithm.
  8. In case of single or dual device failure at read, the missing data will be reconstructed on the fly from parity chunks.
  9. The algorithm pushes any information on IO errors straight to the control plane, and the control plane may update the cluster map (status of nodes and devices) and push the updated cluster map back to all nodes and virtual devices.
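
The following Python sketch shows the collision-resolution rule from steps 4 to 6 in a strongly simplified form. The hash-based pick is only a stand-in for the actual “list bucket” selection, and the sketch ignores weighting, failure domains above the device level, and temporary placements.

```python
import hashlib

def next_prime(n):
    """Smallest prime strictly greater than n (sufficient for tiny chunk counts)."""
    candidate = n + 1
    while candidate < 2 or any(candidate % d == 0 for d in range(2, int(candidate ** 0.5) + 1)):
        candidate += 1
    return candidate

def pick_device(devices, vuid, page, chunk_index):
    # Stand-in for the "list bucket" selection: a stable pseudo-random pick
    # keyed by (VUID, page, chunk-index).
    key = f"{vuid}:{page}:{chunk_index}".encode()
    digest = int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")
    return devices[digest % len(devices)]

def place_stripe(devices, vuid, page, num_chunks):
    """Assign every data and parity chunk of one stripe to a distinct device.

    Assumes len(devices) >= num_chunks.
    """
    p = next_prime(num_chunks - 1)            # next prime above the maximum chunk index
    placement, used = [], set()
    for chunk_index in range(num_chunks):
        i, effective = 0, chunk_index
        device = pick_device(devices, vuid, page, effective)
        while device in used:                 # collision: retry with an adjusted chunk index
            i += 1
            effective = chunk_index + i * p
            device = pick_device(devices, vuid, page, effective)
        used.add(device)
        placement.append((chunk_index, device))
    return placement

print(place_stripe([f"dev-{k}" for k in range(12)], vuid=42, page=7, num_chunks=10))
```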

Top-level IO Processing Pipeline

On top of the stack of virtual block devices, simplyblock includes multiple optional virtual block devices, including a snapshot device – the device can take instant snapshots of volumes, supports snapshot chains, and instant (copy-on-write) cloning of volumes. Additionally, there is a virtual block device layer, which supports synchronous and asynchronous replication of block storage volumes across availability zones.

The highest virtual block device in the stack is then published to the fabric as a separate NVMe-oF volume with its own unique NVMe identifier (NQN) via the control plane.

High-Availability Support

The algorithm supports highly available volumes based on NVMe multipathing and ANA (asynchronous namespace access). This means that a transparent fail-over of IO for a single volume in case of a node outage is realized without having to add any additional software to clients.

Due to the features of the high-, mid-, and low-level IO pipeline, this is easy to realize: identical stacks of virtual block devices with an identical VUID are created on multiple nodes and published to the fabric using “asynchronous namespace access” (which prefers one volume over others and essentially implements an active/passive/passive mechanism).

Metadata Journaling

Metadata journaling persists non-primary placement locations so that data can be located in the cluster. It has the following important features:

  • It has to persist every change in location for the block range addressed in an IO request to be consistent in “sudden power-off” situations (node outages)
  • It has to minimize write amplification – this is achieved by smartly “batching” multiple metadata write requests into single IO operations in the form of “high-priority” IO
  • It has to be fast – not delaying data IO – this is achieved by high-priority NVMe queues
  • It has to be reliable – this is achieved by replicating metadata writes to three nodes using NVMe over Fabrics remote connections
  • Its storage footprint has to be small and remain constant over time; it cannot grow forever with new IO – this is achieved by introducing a regular compression mechanism, which replaces the transactional journal with a “snapshot” of placement metadata at a certain moment

Data Migrations

Data migrations run as background processes, which take care of the movement of data in cases of failed devices (rebuild and re-distribution of data in the cluster), cluster scaling (to reduce the load on utilized elements and rebalance the cluster), and temporary element outage (to migrate data back to its primary locations). Running data migrations keeps a cluster in a state of transition and has to be coordinated to not conflict with any ongoing IO.

Conclusion

Building an architecture for a fast, scalable, fault-tolerant distributed storage solution isn’t easy. To be fair, I don’t think anyone expected it to be. Distributed systems are always complicated, and a lot of brain power goes into their design.

Simplyblock separates itself by rethinking data placement in distributed storage environments. Part of it is the fundamentally different way of using erasure coding for parity information. We don’t just use it on a single node, between the local drives; simplyblock applies erasure coding throughout the cluster, distributing parity information from each disk onto other disks on other nodes, thereby increasing fault tolerance.

To test simplyblock, get started right away. If you want to learn more about the features simplyblock offers you, see our feature overview.
