How We Built Our Distributed Data Placement Algorithm

Modern cloud applications demand more from their storage than ever before – ultra-low latency, predictable performance, and bulletproof reliability. Simplyblock’s software-defined storage cluster technology, built upon its distributed data placement algorithm, reimagines how we utilize NVMe devices in public cloud environments.

This article deep dives into how we’ve improved upon traditional distributed data placement algorithms to create a high-performance I/O processing environment that meets modern enterprise storage requirements.

Design Of Simplyblock’s Storage Cluster

Simplyblock storage cluster technology is designed to utilize NVMe storage devices in public cloud environments for use cases that require predictable and ultra-low access latency (sub-millisecond) and the highest performance density (high IOPS per GiB).

To combine high performance with a high degree of data durability, high availability, and fault tolerance, as well as zero-downtime scalability, existing distributed data placement algorithms had to be improved, recombined, and implemented in a high-performance IO processing environment.

Our innovative approach combines:

  • Predictable, ultra-low latency performance (<1ms)
  • Maximum IOPS density optimization
  • Enterprise-grade durability and availability
  • Zero-downtime scalability
  • Advanced failure domain management

Modern Storage Requirements

Use cases such as high-load databases, time-series databases with high-velocity data, Artificial Intelligence (AI), Machine Learning (ML), and many others require fast and predictable storage solutions.

However, performance isn’t everything. The fastest storage is writing to /dev/null, but only as long as you never need the data back. With that in mind, the main goals for a modern storage solution are:

  • High Performance Density, meaning a high number of IOPS per gigabyte (at an affordable price).
  • Predictable, Low Latency, especially for use cases that require consistent response times.
  • A high degree of Data Durability, achieved by distributing data across failure domains so it can survive multiple failure scenarios.
  • High Availability and Fault Tolerance, so that data remains accessible in case of a node outage; clusters are automatically rebalanced when elements fail.
  • Zero Downtime Scalability, meaning that clusters can grow online, in real time, and are automatically rebalanced.

Distributed Data Placement

Data placement in storage clusters commonly uses pseudo-randomization. Additionally, features such as weighted distribution of storage across the cluster (based on the capacity and performance of available data buckets) are introduced to handle failure domains and cluster rebalancing – for scaling, downsizing, or removal of failed elements – at minimal cost. A prominent example of such an algorithm is CRUSH (Controlled, Scalable, Decentralized Placement of Replicated Data), which is used in Ceph, an open-source software-defined storage platform designed to provide object storage, block storage, and file storage in a unified system.

Simplyblock uses a different algorithm to achieve the following characteristics for its distributed data placement feature:

  • High storage efficiency (raw to effective storage ratio) with minimal performance overhead. Instead of using three data replicas, which is the standard mechanism to protect data from storage device failure in software-defined storage clusters, simplyblock uses erasure coding algorithms with a raw-to-effective ratio of about 1.33 (instead of 3); a worked example follows this list.
  • Very low access latency, below 100 microseconds for read and write, with achievable write amplification below 2.
  • Ultra-high IOPS density with more than 200,000 IOPS per CPU core.
  • Performant redistribution of storage in the cluster in the case of cluster scaling and removal of failed storage devices. Simplyblock’s algorithm redistributes only an amount of data close to the theoretical minimum required to rebalance the cluster.
  • Support for volume high-availability based on the NVMe industry standard. Support for simple failure domains as they are available in cloud environments (device, node, rack, availability zone).
  • Performance efficiency aims to address the typical performance bottlenecks in cloud environments.
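
To make the raw-to-effective ratio concrete, here is a minimal sketch of the arithmetic, assuming, purely for illustration, a hypothetical stripe layout of n data chunks plus two parity chunks; the exact chunk counts simplyblock uses internally may differ.

```python
# Illustrative only: raw-to-effective storage ratio of an erasure-coded stripe
# with n data chunks and p parity chunks. The chunk counts below are
# hypothetical examples, not simplyblock's exact internal layout.
def raw_to_effective_ratio(data_chunks: int, parity_chunks: int) -> float:
    """Raw capacity consumed per unit of effective (user-visible) capacity."""
    return (data_chunks + parity_chunks) / data_chunks

print(raw_to_effective_ratio(1, 2))  # 3.0  -> equivalent overhead to three full replicas
print(raw_to_effective_ratio(6, 2))  # ~1.33 -> tolerates two failures at far lower raw cost
```

In other words, a 6+2-style layout protects against two simultaneous device failures while consuming roughly 1.33 bytes of raw capacity per byte of user data, instead of the 3 bytes required by triple replication.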

Implementing a storage solution to keep up with current trends required us to think out of the box. The technical design consists of several elements.

Low-level I/O Processing Pipeline

On the lower level, simplyblock uses a fixed-size page mapping algorithm implemented in a virtual block device (a virtual block device implements a filter or transformation step in the IO processing pipeline).

For that purpose, IO is organized into “pages” (2^m blocks, with m in the range of 8 to 12). Cross-page IO has to be split before processing. This is done on the mid-level processing pipeline. We’ll get to that in a second.

This algorithm can place data received via IO write requests from multiple virtual block devices on a single physical block device. Each virtual block device has its own logical block address space, though. The algorithm is designed to read, write, and unmap (deallocate) data with minimal write amplification for metadata updates (about 2%) and a minimal increase in latency (on average in the sub-microsecond range). Furthermore, it is optimized for sudden power cuts (crash-consistent) by storing all metadata inline with the storage blocks on the underlying device.

Like all block device IO requests, each request contains an LBA (logical block address) and a length (in blocks). The 64-bit LBA is internally organized into a 24-bit VUID (a cluster-wide unique identifier of the logical volume) and a 39-bit virtual LBA. The starting LBA of the page on the physical device is identified by the key (VUID, LPA), where LPA is the logical page address (LBA / (2^m)), and the address offset within the page is determined by (LBA modulo 2^m).
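
As an illustration of the address arithmetic described above, here is a minimal sketch. The field widths follow the description in this article, while the function and constant names are invented for the example, and m is fixed to 10 for simplicity.

```python
# Illustrative decomposition of a 64-bit LBA into (VUID, LPA, page offset),
# following the field widths described above. Names and m=10 are assumptions.
PAGE_SHIFT = 10                    # m = 10 -> 2^10 blocks per page (m ranges from 8 to 12)
PAGE_BLOCKS = 1 << PAGE_SHIFT
VLBA_BITS = 39                     # 39-bit virtual LBA; the 24-bit VUID sits above it

def split_lba(lba: int) -> tuple[int, int, int]:
    vuid = lba >> VLBA_BITS                     # cluster-wide unique volume identifier
    virtual_lba = lba & ((1 << VLBA_BITS) - 1)  # block address within that volume
    lpa = virtual_lba >> PAGE_SHIFT             # logical page address: LBA / 2^m
    offset = virtual_lba & (PAGE_BLOCKS - 1)    # offset within the page: LBA mod 2^m
    return vuid, lpa, offset

# Example: volume VUID=7, logical page 5, 12 blocks into that page.
example_lba = (7 << VLBA_BITS) | (5 * PAGE_BLOCKS) | 12
print(split_lba(example_lba))                   # -> (7, 5, 12)
```

The pair (VUID, LPA) then serves as the lookup key for the page on the physical device, while the offset addresses blocks inside that page.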

The IO processing services of this virtual device work entirely asynchronously on CPU-pinned threads with entirely private data (no synchronization mechanisms between IO threads required).

They are placed on top of an entirely asynchronous NVMe driver, which receives IO and submits responses via IO-queue pairs sitting at the bottom of the IO processing stack.

Mid-level IO Processing Pipeline

On top of the low-level mapping device, a virtual block device, which implements distributed data placement, has access to the entire cluster topology. This topology is maintained by a centralized multi-tenant control plane, which knows about the state of each node and device in the attached clusters and manages changes to cluster topology (adding or removing devices and nodes).

It uses multiple mechanisms to determine the calculated and factual location of each data page and then issues asynchronous IO to this mapping device in the cluster locally (NVMe) or remotely using NVMe over Fabrics (NVMe-oF):

  1. All IO is received and forwarded on private IO threads, with no inter-thread communication, and entirely asynchronously, both inbound and outbound.
  2. First, IO is split at page boundaries so that single requests can be processed within a single page.
  3. Data is then striped into n chunks, and (double) parity is calculated from the chunks. Double parity is calculated using the RDP algorithm. The data is, therefore, organized in 2-dimensional arrays. n equals 1, 2, 4, or 8. This way, 4 KiB blocks can be mapped into 512-byte device blocks, and expensive partial stripe writes can be avoided.
  4. To determine a primary target for each combination of (VUID, page, chunk-index), a flat list of devices is fed into the “list bucket” algorithm (see …) with (VUID, page, chunk-index) being the key.
  5. Each of the data and parity chunks in a stripe has to be placed on a different device. In addition, more failure domains, such as nodes and racks, can be considered for placement anti-affinity rules. In case of a collision, the algorithm repeats recursively with an adjusted chunk-index (chunk-index + i x p, where p is the next prime number larger than the maximum chunk index and i is the iteration). A simplified placement sketch follows this list.
  6. In case a selected device is (currently) not available, the algorithm repeats recursively to find an alternative and also stores the temporary placement data for each chunk in the IO. This temporary placement data is now also journaled as metadata. Metadata journaling is an important and complex part of the algorithm. It is described separately below.
  7. On read, the process is reversed: the placement location of each chunk to read is determined by the same algorithm.
  8. In case of single or dual device failure at read, the missing data will be reconstructed on the fly from parity chunks.
  9. The algorithm pushes any information on IO errors straight to the control plane, and the control plane may update the cluster map (status of nodes and devices) and push the updated cluster map back to all nodes and virtual devices.
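
The following sketch illustrates the placement idea from steps 4 and 5 in simplified form: a deterministic hash over (VUID, page, chunk-index) picks a device from a flat list, and a collision on an already-used device is resolved by retrying with the adjusted chunk index described above. It is a conceptual illustration only; the hash function, device model, and names are assumptions, not simplyblock’s actual “list bucket” implementation.

```python
# Conceptual sketch of deterministic chunk placement with collision handling.
# Hash choice, device model, and names are illustrative assumptions.
import hashlib

def _is_prime(x: int) -> bool:
    return x > 1 and all(x % d for d in range(2, int(x ** 0.5) + 1))

def next_prime(x: int) -> int:
    x += 1
    while not _is_prime(x):
        x += 1
    return x

def place_stripe(devices: list[str], vuid: int, page: int, stripe_width: int) -> list[str]:
    """Assign each chunk (data + parity) of one stripe to a distinct device."""
    p = next_prime(stripe_width - 1)            # next prime above the maximum chunk index
    placement: list[str] = []
    for chunk_index in range(stripe_width):
        index = chunk_index
        for iteration in range(1, len(devices) + 1):
            key = f"{vuid}:{page}:{index}".encode()
            digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
            device = devices[digest % len(devices)]
            if device not in placement:         # anti-affinity: one chunk per device
                placement.append(device)
                break
            index = chunk_index + iteration * p # collision: adjust the chunk index and retry
        else:
            raise RuntimeError("not enough devices to satisfy anti-affinity")
    return placement

devices = [f"dev-{i}" for i in range(12)]
print(place_stripe(devices, vuid=7, page=5, stripe_width=6))  # e.g. 4 data + 2 parity chunks
```

The appeal of this kind of approach is that placement is a pure function of the key and the current cluster map, so it can be recomputed deterministically wherever it is needed, without a per-page lookup table.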

Top-level IO Processing Pipeline

On top of the stack of virtual block devices, simplyblock includes multiple optional virtual block devices, including a snapshot device that can take instant snapshots of volumes, supports snapshot chains, and provides instant (copy-on-write) cloning of volumes. Additionally, there is a virtual block device layer that supports synchronous and asynchronous replication of block storage volumes across availability zones.

The highest virtual block device in the stack is then published to the fabric as a separate NVMe-oF volume with its own NVMe Qualified Name (NQN) via the control plane.

High-Availability Support

The algorithm supports highly available volumes based on NVMe multipathing and ANA (Asymmetric Namespace Access). This means that transparent fail-over of IO for a single volume in case of a node outage is realized without having to add any additional software to clients.

Due to the features of the high-, mid-, and low-level IO pipeline, this is easy to realize: identical stacks of virtual block devices with an identical VUID are created on multiple nodes and published to the fabric using Asymmetric Namespace Access (which prefers one path over the others and essentially implements an active/passive/passive mechanism).

Metadata Journaling

Metadata journaling persists non-primary placement locations, which are needed to locate data in the cluster. It has the following important features:

  • It has to persist every change in location for the block range addressed in an IO request to be consistent in “sudden power-off” situations (node outages)
  • It has to minimize write amplification – this is achieved by smartly batching multiple metadata write requests into single IO operations in the form of “high-priority” IO (see the batching sketch after this list)
  • It has to be fast – not delaying data IO – this is achieved by high-priority NVMe queues
  • It has to be reliable – this is achieved by replicating metadata writes to three nodes using NVMe over Fabrics remote connections
  • Its storage footprint has to be small and remain constant over time; it cannot grow forever with new IO – this is achieved by a regular compaction mechanism, which replaces the transactional journal with a “snapshot” of placement metadata at a certain moment in time
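
To illustrate the batching idea from the list above, here is a toy sketch that accumulates placement-metadata records and flushes them as one journal write once a batch fills up (or is flushed explicitly). It is conceptual only; the record format, batch size, and write callback are invented for the example and do not reflect simplyblock’s actual journal layout.

```python
# Toy illustration of batching metadata updates into single journal writes.
# Record format, batch size, and the write callback are invented for this sketch.
class MetadataJournal:
    def __init__(self, write_fn, max_batch: int = 32):
        self.write_fn = write_fn       # e.g. submits one high-priority NVMe write
        self.max_batch = max_batch
        self.pending = []

    def record_placement(self, vuid: int, page: int, chunk_index: int, device_id: str) -> None:
        self.pending.append((vuid, page, chunk_index, device_id))
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        # Many records amortized over one journal IO keeps write amplification low.
        payload = b"".join(f"{v}:{p}:{c}:{d}\n".encode() for v, p, c, d in self.pending)
        self.write_fn(payload)
        self.pending.clear()

journal = MetadataJournal(write_fn=lambda buf: print(f"journal write: {len(buf)} bytes"))
for page in range(40):               # 40 updates -> one batched write of 32 records
    journal.record_placement(vuid=7, page=page, chunk_index=0, device_id="dev-3")
journal.flush()                      # flush the remaining 8 records
```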

Data Migrations

Data migrations run as background processes, which take care of the movement of data in cases of failed devices (rebuild and redistribution of data in the cluster), cluster scaling (to reduce the load on utilized elements and rebalance the cluster), and temporary element outages (to migrate data back to its primary locations). While data migrations are running, the cluster is in a state of transition, and migrations have to be coordinated so they do not conflict with ongoing IO.

Conclusion

Building an architecture for a fast, scalable, fault-tolerant distributed storage solution isn’t easy. To be fair, I don’t think anyone expected that. Distributed systems are always complicated, and a lot of brain power goes into their design.

Simplyblock separates itself by rethinking data placement in distributed storage environments. Part of that is the fundamentally different way of using erasure coding for parity information. We don’t just use it within a single node, across the local drives; simplyblock applies erasure coding throughout the cluster, distributing the parity information of each disk onto disks on other nodes, thereby increasing fault tolerance.

To test simplyblock, get started right away. If you want to learn more about the features simplyblock offers you, see our feature overview.

What is Erasure Coding: A Shield Against Data Loss

Erasure Coding (erasure code) is a data protection mechanism that protects against data loss by breaking data items, such as files, into fragments, calculating additional data pieces (parity information), and storing them across a set of independent locations or storage media.

For decades, traditional methods like replication have been the go-to solution for protecting against data loss or corruption. In recent years, however, a more efficient and resource-friendly technique has become more prevalent—erasure coding. This innovative approach ensures data integrity, optimizes storage capacity, and reduces the chances of catastrophic data loss. Let us delve into the elements of erasure codes, exploring what they are and how they revolutionize how we protect our digital assets.

What is Erasure Coding?

Like many more commonly known technologies, such as RAID or replication/mirroring, erasure coding is a data protection method. It is a class of high-performance Forward Error Correction (FEC). A simplified explanation would say that it breaks down data into smaller pieces, does some mathematical magic, and writes the pieces to different disks. It doesn’t sound too complicated.

What that really means is slightly more involved, though. Erasure code schemes break down pieces of information, such as files, into fragments (sometimes called chunks), which are enriched with redundancy information (meaning fragments are extended with results of multiple mathematical equations) and eventually distributed across multiple storage nodes and disks.

Unlike traditional replication, which duplicates the entire data, erasure coding allows for more efficient storage utilization. This method employs advanced mathematical algorithms to create parity fragments, which can later be used to reconstruct the original data even if some fragments are lost or corrupted.

The Core Principles of Erasure Coding

While it may sound like erasure coding is the new kid on the block, it was actually invented in 1960 by Irving Reed and Gustave Solomon. Together, they created a new encoding mechanism known as the Reed-Solomon code. Today, this algorithm is widely used in a wide variety of systems, including distributed storage solutions, communication services, and aerospace systems.

These days, while there are many more erasure coding schemes, the three most common ones are:

  • The Reed-Solomon code is simple and efficient, can be applied to a wide range of applications, and is very common for simple data storage solutions, such as DVD and Blu-Ray disks.
  • The low-density parity-check code (LDPC, or Gallager code) is more complex but shows better performance in certain use cases, such as 10GBASE-T (10 Gbit/s Ethernet).
  • Turbo codes, originally invented by Claude Berrou in 1991, are more complex than LDPC but provide the best ratio of data protection to efficiency, and are widely used in mobile communication technologies such as UMTS and LTE.

However, all the different implementations share a common set of principles. Storage solutions that utilize erasure coding for data protection most commonly use either a Reed-Solomon or an LDPC algorithm.

Data Fragmentation

Erasure coding begins by breaking down the original data into smaller fragments. These fragments are the building blocks that will be distributed across the storage nodes. The size and number of these fragments depend on the specific erasure coding scheme being used.

Parity Creation

Parity fragments (sometimes called coding chunks) are generated using mathematical functions that operate on the original data fragments. These parity fragments are calculated so that the original data can be reconstructed from any sufficiently large subset of data and parity fragments. This redundancy is the key to erasure coding’s ability to tolerate the loss of information pieces without actual data loss.
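
As a minimal illustration of the principle, the sketch below derives a single parity fragment from K data fragments using XOR. This is a deliberate simplification: production systems typically use schemes such as Reed-Solomon, which can produce multiple parity fragments and tolerate multiple simultaneous losses.

```python
# Toy single-parity example: one XOR parity fragment over K data fragments.
# Real deployments use codes such as Reed-Solomon for multiple parity fragments.
def make_parity(fragments: list[bytes]) -> bytes:
    parity = bytearray(len(fragments[0]))
    for fragment in fragments:
        for i, byte in enumerate(fragment):
            parity[i] ^= byte
    return bytes(parity)

data = b"hello erasure coding!".ljust(24)                    # pad to a multiple of the fragment size
fragments = [data[i:i + 8] for i in range(0, len(data), 8)]  # K = 3 data fragments of 8 bytes each
parity = make_parity(fragments)                              # one parity fragment
```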

Distribution across Nodes

Once the data and parity fragments are created, they are distributed across different storage nodes and disks. This distribution ensures that a failure in one node does not result in the loss of the entire dataset. Each node stores a unique combination of data and parity fragments.

Reconstruction Mechanism

In the event of a node failure or data loss, the erasure coding system can reconstruct the missing or corrupted fragments using the available fragments stored on other nodes. The mathematical relationships established during the parity creation phase facilitate this reconstruction process.
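
Continuing the toy XOR example from the Parity Creation section (which is where `fragments` and `parity` come from), a single lost fragment can be rebuilt by XOR-ing the surviving fragments with the parity fragment; multi-parity codes generalize the same idea to multiple simultaneous losses.

```python
# Rebuild one lost fragment from the survivors plus the XOR parity.
def reconstruct(survivors: list[bytes], parity: bytes) -> bytes:
    missing = bytearray(parity)
    for fragment in survivors:
        for i, byte in enumerate(fragment):
            missing[i] ^= byte
    return bytes(missing)

lost_index = 1                                                   # pretend the node holding fragment 1 failed
survivors = [f for i, f in enumerate(fragments) if i != lost_index]
assert reconstruct(survivors, parity) == fragments[lost_index]   # data recovered without any full copy
```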

Erasure Coding Profile

Common to all erasure coding algorithms are two specific numbers, called K and M. K defines the number of fragments the original piece of information is split into, meaning that K=3 says to split the original object, say a file, into three fragments. M, on the other hand, defines how many parity fragments are calculated and distributed. An M=2 means that parity information is stored on two different systems. In a configuration of K=3, M=2, a storage cluster would need five servers to store the data fragments and parity fragments.
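
A quick sketch of the arithmetic behind such a profile, using the K=3, M=2 example from the text (the helper itself is purely illustrative):

```python
# Basic characteristics of an erasure coding profile with K data fragments
# and M parity fragments. Illustrative helper; K=3, M=2 as in the text above.
def ec_profile(k: int, m: int) -> dict:
    return {
        "fragments_total": k + m,          # independent locations needed
        "tolerated_losses": m,             # fragments that may fail simultaneously
        "raw_to_effective": (k + m) / k,   # storage overhead factor
        "usable_capacity_pct": 100 * k / (k + m),
    }

print(ec_profile(3, 2))
# -> 5 fragments in total, 2 tolerated losses, ~1.67x raw usage, 60% usable capacity
```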

Advantages of Erasure Coding

Erasure coding provides several advantages over more traditional data protection mechanisms, such as RAID or replication.

Optimized Storage Utilization

Erasure coding significantly reduces the amount of storage space required compared to traditional replication methods. While replication duplicates data in its entirety, erasure coding introduces redundancy at the fragment level, allowing for more efficient use of storage resources.

Editor’s Note: Our Erasure Coding Calculator can help you determine the erasure coding overhead.

Fault Tolerance

The distributed nature of erasure coding ensures that the failure of a single storage node does not result in data loss. The original data can be reconstructed as long as the required number of fragments is available across the surviving nodes. This fault tolerance is crucial for systems requiring high availability and reliability.

Cost-Effective Scalability

Traditional replication can become prohibitively expensive as data volumes grow. Erasure coding provides a cost-effective alternative, allowing organizations to scale their storage infrastructure without a linear cost increase.

Reduced Bandwidth Requirements

Transmitting and storing parity fragments instead of full data copies reduces the bandwidth and storage requirements. This is particularly advantageous in scenarios where network bandwidth or storage capacity is a limiting factor.

Use Cases of Erasure Coding

Erasure coding has many use cases, not only in the storage ecosystem but also in communication. UMTS, LTE, and certain satellite communication systems use different erasure code schemes to implement forward error correction.

In terms of storage solutions, beyond consumer storage media such as DVDs, there are three main storage categories that benefit heavily from erasure codes, both in terms of reliability and durability as well as storage efficiency.

Cloud Storage

Erasure coding is widely adopted in cloud storage environments, where cost efficiency and fault tolerance are paramount. These solutions leverage erasure codes to ensure data durability and availability across many storage nodes or data centers.

Distributed File Systems

Systems like Hadoop Distributed File System (HDFS) and Ceph rely on erasure coding for fault tolerance and efficient storage utilization. It enables these systems to handle large-scale data processing and storage requirements.

Object Storage

Erasure coding optimizes storage space without compromising data integrity, making it an ideal choice for long-term data retention. Object storage platforms, commonly used for archival and backup purposes, benefit from these storage space savings.

Challenges and Considerations of Erasure Coding

While erasure coding offers numerous advantages, there are trade-offs to consider. It has some characteristics that need to be understood.

Computational Overhead

The encoding and decoding processes involve more or less complex mathematical calculations, which can introduce computational overhead. However, advancements in hardware and algorithm optimization have mitigated this concern to a great extent.

Latency

The additional steps involved in the encoding and decoding processes can introduce latency. Organizations must carefully evaluate their performance requirements and select an erasure code scheme that aligns with their needs. However, the typically distributed nature of erasure-coded storage commonly mitigates this issue by parallelizing storage requests.

Algorithm and Scheme Selection

Different erasure coding algorithms and schemes offer varying levels of efficiency, fault tolerance, and complexity. Choosing the right scheme requires thoroughly understanding the specific use case and performance considerations.

Erasure Coding and Simplyblock

Erasure coding is a powerful and efficient way to protect data, offering a great balance between storage efficiency, fault tolerance, and cost-effectiveness. As organizations grapple with the ever-growing volumes of data, adopting erasure coding becomes not just a choice but a strategic imperative. That said, it reshapes the data storage landscape, ensuring that our digital assets remain secure, resilient, and accessible in the face of evolving challenges.

That’s why simplyblock utilizes erasure coding for data protection and fault tolerance in our clustered storage solution, enabling “more bang for the buck” in terms of storage efficiency combined with the industry-standard NVMe over TCP protocol. Simplyblock enables logical devices that are high-performance, predictable, low-latency, and cost-effective, available as Kubernetes Persistent Volumes, for the easiest access possible. Not to mention all the other cool features such as compression, deduplication, encryption, thin provisioning, and more; learn more now.
