NVMe over TCP vs iSCSI: Evolution of Network Storage
https://www.simplyblock.io/blog/nvme-over-tcp-vs-iscsi/ (January 8, 2025)

TLDR: In a direct comparison of NVMe over TCP vs iSCSI, we see that NVMe over TCP outranks iSCSI in all categories with IOPS improvements of up to 50% (and more) and latency improvements by up to 34%.

When data grows, storage needs to grow, too. That’s when remotely attached SAN (Storage Area Network) systems come in. So far, these have commonly been connected through one of three protocols: Fibre Channel, Infiniband, or iSCSI, with the latter being the “low end” option since it doesn’t require special hardware to operate. NVMe over Fabrics (NVMe-oF), and specifically NVMe over TCP (NVMe/TCP) as the successor of iSCSI, is on the rise to replace these legacy protocols, bringing immediate improvements in latency, throughput, and IOPS.

iSCSI: The Quick History Lesson

Figure 1: Nokia 3310, released September 2000 (Source: Wikipedia)

iSCSI is a protocol that connects remote storage solutions (commonly hardware storage appliances) to storage clients. The latter are typically servers without (or with minimal) local storage, as well as virtual machines. In recent years, we have also seen iSCSI being used as a backend for container storage.

iSCSI stands for Internet Small Computer System Interface and encapsulates the standard SCSI commands within TCP/IP packets. That means iSCSI works over commodity Ethernet networks, removing the need for specialized hardware such as dedicated host bus adapters and switches.

The iSCSI standard was first released in early 2000. A world that was very different from today. Do you remember what a phone looked like in 2000?

That said, while there was access to the first flash-based systems, prices were still outrageous, and storage systems were designed with spinning disks in mind. Remember that. We’ll come back to it later.

What is SCSI?

SCSI, or, you guessed it, the Small Computer System Interface, is a set of standards for connecting and transferring data between computers and peripheral devices. Originally developed in the 1980s, SCSI has been a foundational technology for data storage interfaces, supporting various device types, primarily hard drives, optical drives, and scanners.

While SCSI kept improving and gaining new commands over the years, its foundation is still rooted in the early 1980s. Nevertheless, the SCSI command set lives on in many standards: SAS (servers) and iSCSI use it directly, and SATA (home computers) is commonly bridged to it through translation layers.

What is NVMe?

Non-Volatile Memory Express (NVMe) is a modern PCI Express-based (PCIe) storage interface. With the original specification dating back to 2011, NVMe is engineered specifically for solid-state drives (SSDs) connected via the PCIe bus. NVMe devices are attached directly to the CPU and other devices on the bus, which increases throughput and reduces latency. NVMe dramatically reduces latency and increases input/output operations per second (IOPS) compared to traditional storage interfaces.

As part of the NVMe standard, additional specifications are being developed, such as the transport specifications, which define how NVMe commands are transported (e.g., via the PCI Express bus, but also over networking protocols like TCP/IP).

The Fundamental Difference of Spinning Disks and NVMe

Traditional spinning hard disk drives (HDDs) rely on physical spinning platters and movable read/write heads to write or access data. When data is requested, the read/write heads must physically move to the correct location on the platter stack, resulting in significant access latencies, typically in the range of 10–14 milliseconds.

Flash storage, including NVMe devices, eliminates the mechanical parts, utilizing NAND flash chips instead. NAND stores data purely electronically and achieves access latencies as low as 20 microseconds (and even lower on super high-end gear). That makes it hundreds of times faster than its HDD counterparts.

For a long time, flash storage had the massive disadvantage of limited storage capacity. However, this disadvantage is slowly fading away as companies introduce higher-capacity devices. For example, Toshiba just announced a 180TB flash storage device.

Cost, the second significant disadvantage, also keeps falling with improvements in development and production. Technologies like QLC NAND offer incredible storage density for an affordable price.

Anyhow, why am I bringing up the mechanical vs electrical storage principle? The reason is simple: the access latency. SCSI and iSCSI were never designed for super low access latency devices because they didn’t really exist at the time of their development. And, while some adjustments were made to the protocol over the years, their fundamental design is outdated and can’t be changed for backward compatibility reasons.

NVMe over Fabrics: Flash Storage on the Network

NVMe over Fabrics (also known as NVMe-oF) is an extension to the NVMe base specification. It allows NVMe storage to be accessed over a network fabric while maintaining the low-latency, high-performance characteristics of local NVMe devices.

NVMe over Fabrics itself is a collection of multiple sub-specifications, defining multiple transport layer protocols.

  • NVMe over TCP: NVMe/TCP utilizes the common internet standard protocol TCP/IP. It deploys on commodity Ethernet networks and can run in parallel with existing network traffic. That makes NVMe over TCP the modern successor to iSCSI, taking over where iSCSI left off. It is also the natural fit for public cloud-based storage solutions, which typically only provide TCP/IP networking.
  • NVMe over Fibre Channel: NVMe/FC builds upon the existing Fibre Channel network fabric. It tunnels NVMe commands through Fibre Channel frames and enables reusing available Fibre Channel hardware. I wouldn’t recommend it for new deployments due to the high entry cost of Fibre Channel equipment.
  • NVMe over Infiniband: Like NVMe over Fibre Channel, NVMe/IB utilizes existing Infiniband networks to tunnel the NVMe protocol. If you have existing Infiniband equipment, NVMe over Infiniband might be your way to go. For new deployments, the initial entry cost is too high.
  • NVMe over RoCE: NVMe over RDMA over Converged Ethernet is a transport layer that uses an Ethernet fabric for remote direct memory access (RDMA). To use NVMe over RoCE, you need RDMA-capable NICs. RoCE comes in two versions: RoCEv1, which is a layer-2 protocol and not routable, and RoCEv2, which uses UDP/IP and can be routed across complex networks. NVMe over RoCE doesn’t scale as easily as NVMe over TCP but provides even lower latencies.

NVMe over TCP vs iSCSI: The Comparison

When comparing NVMe over TCP vs iSCSI, we see considerable improvements in all three primary metrics: latency, throughput, and IOPS.

Figure 2: Medium queue-depth workload at 4KB blocksize I/O (Source: Blockbridge)

The folks over at Blockbridge ran an extensive comparison of the two technologies, which shows that NVMe over TCP outperformed iSCSI, regardless of the benchmark.

I’ll provide the most critical benchmarks here, but I recommend you read through the full benchmark article right after finishing here.

Anyhow, let’s dive a little deeper into the actual facts on the NVMe over TCP vs iSCSI benchmark.

Editor’s Note: Our Developer Advocate, Chris Engelbert, recently gave a talk at SREcon in Dublin comparing the performance of NVMe over TCP and iSCSI, which led to this blog post. Find the full presentation, “NVMe/TCP makes iSCSI look like Fortran.”

Benchmarking Network Storage

Evaluating storage performance involves comparing four major performance indicators.

  1. IOPS: Number of input/output operations processed per second
  2. Latency: Time required to complete a single input/output operation
  3. Throughput: Total data transferred per unit of time
  4. Protocol Overhead: Additional processing required by the communication protocol

Editor’s note: For latency, throughput, and IOPS, we have an exhaustive blog post that digs deeper into what these metrics mean, how they relate to each other, and how to calculate them.

Comprehensive performance testing involves simulated workloads that mirror real-world scenarios. To simplify this process, benchmarks use tools like FIO (Flexible I/O Tester) to generate consistent, reproducible test data and results across different storage configurations and systems.
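To make the relationship between these indicators tangible, here is a minimal Python sketch (illustrative only, not part of any benchmark suite): it derives throughput from IOPS and block size and estimates average latency from queue depth and IOPS via Little’s Law. The numbers are placeholders, not Blockbridge results.

```python
# Rough relationships between the primary storage metrics.
# Illustrative only: real benchmarks (e.g., FIO) measure these directly.

def throughput_gbit_s(iops: float, block_size_bytes: int) -> float:
    """Throughput = IOPS x block size (converted to Gbit/s)."""
    return iops * block_size_bytes * 8 / 1e9

def avg_latency_us(queue_depth: int, iops: float) -> float:
    """Little's Law: average latency = outstanding I/Os / IOPS."""
    return queue_depth / iops * 1e6

if __name__ == "__main__":
    iops = 200_000        # placeholder value
    block_size = 4096     # 4 KiB blocks
    queue_depth = 32

    print(f"Throughput: {throughput_gbit_s(iops, block_size):.2f} Gbit/s")
    print(f"Average latency: {avg_latency_us(queue_depth, iops):.1f} us")
```

Protocol overhead is what separates two transports that move the same blocks over the same wire, and that is exactly where NVMe over TCP gains its edge over iSCSI.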

IOPS Improvements of NVMe over TCP vs iSCSI

For IOPS-intensive applications, the number of available IOPS in a storage system is critical. IOPS-intensive applications include databases, analytics platforms, asset servers, and similar systems.

Improving IOPS simply by exchanging the storage network protocol is an immediate win, both for the database and for us.

Using NVMe over TCP instead of iSCSI shows a dramatic increase in IOPS, especially for smaller block sizes. At a 512-byte block size, Blockbridge found an average 35.4% increase in IOPS. At the more common 4KiB block size, the average increase was 34.8%.

That means the same hardware can provide over one-third more IOPS using NVMe over TCP vs iSCSI at no additional cost.

Figure 3: Average IOPS improvement of NVMe over TCP vs iSCSI by blocksize (Source: Blockbridge)

Latency Improvements of NVMe over TCP vs iSCSI

While IOPS-hungry use cases, such as compaction events in databases (Cassandra), benefit from the immense increase in IOPS, latency-sensitive applications love low access latencies. Latency is the primary reason people choose local NVMe storage over remotely attached storage, despite knowing about its drawbacks.

Latency-sensitive applications range from high-frequency trading systems, where milliseconds translate directly into money, through telecommunication systems, where latency can introduce synchronization issues, to cybersecurity and threat detection solutions that need to react as fast as possible.

Therefore, decreasing latency is a significant benefit for many industries and solutions. Apart from that, a lower access latency always speeds up data access, even if your system isn’t necessarily latency-sensitive. You will feel the difference.

Blockbridge found the most significant reduction in access latency at a block size of 16KiB with a queue depth of 128 (which I/O-demanding solutions can easily hit). The average latency for iSCSI was 5,871μs, compared to 5,089μs for NVMe over TCP: a 782μs (~13%) decrease in access latency, just by exchanging the storage protocol.

Figure 4: Average access latency comparison, NVMe over TCP vs iSCSI, for 4, 8, 16 KiB (Source: Blockbridge)

Throughput Improvement of NVMe over TCP vs iSCSI

As the third primary metric of storage performance, throughput describes how much data is actually pumped from the disk into your workload.

Throughput is the major factor for applications such as video encoding or streaming platforms, large analytical systems, and game servers streaming massive worlds into memory. Other examples are time-series storage, data lakes, and historian databases.

Throughput-heavy systems benefit from higher throughput to get the “job done faster.” Oftentimes, increasing the throughput isn’t easy. You’re either bound by the throughput provided by the disk or, in the case of a network-attached system, the network bandwidth. To achieve high throughput and capacity, remote network storage utilizes high bandwidth networking or specialized networking systems such as Fibre Channel or Infiniband.

Blockbridge ran their tests on a dual-port 100Gbit/s network card, limited by the PCI Express x16 Gen3 bus to a maximum throughput of around 126Gbit/s. Newer PCIe generations achieve much higher throughput, so NVMe devices and NICs are far less constrained by the PCIe bus these days.

With a 16KiB block size and a queue depth of 32, their benchmark saw a whopping 2.3GB/s increase in throughput for NVMe over TCP vs iSCSI. The throughput increased from 10.387GB/s on iSCSI to 12.665GB/s, roughly 20% on top, again using the same hardware. That’s how you save money.
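To put those numbers into perspective, here is a small Python calculation (a sketch based on the figures above and publicly known PCIe characteristics, not part of the benchmark tooling) showing where the ~126Gbit/s PCIe Gen3 x16 ceiling comes from and how large the measured gain is.

```python
# PCIe Gen3: 8 GT/s per lane with 128b/130b encoding.
lanes = 16
per_lane_gbyte_s = 8.0 * (128 / 130) / 8          # GT/s -> usable GB/s per lane

pcie_limit_gbyte_s = lanes * per_lane_gbyte_s
print(f"PCIe Gen3 x16 ceiling: ~{pcie_limit_gbyte_s:.2f} GB/s "
      f"(~{pcie_limit_gbyte_s * 8:.0f} Gbit/s)")   # ~15.75 GB/s, ~126 Gbit/s

# Measured throughput at 16 KiB block size, queue depth 32 (numbers from the article).
iscsi_gbyte_s = 10.387
nvme_tcp_gbyte_s = 12.665
gain = nvme_tcp_gbyte_s - iscsi_gbyte_s
print(f"NVMe/TCP gain: {gain:.3f} GB/s ({gain / iscsi_gbyte_s * 100:.1f}% more)")
```

In other words, NVMe over TCP pushes the same hardware noticeably closer to its physical limits.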

Figure 5: Average throughput of NVMe over TCP vs iSCSI for different queue depths of 1, 2, 4, 8, 16, 32, 64, 128 (Source: Blockbridge)

The Compelling Case for NVMe over TCP

We’ve seen that NVMe over TCP has significant performance advantages over iSCSI in all three primary storage performance metrics. Nevertheless, there are more advantages to NVMe over TCP vs iSCSI.

  • Standard Ethernet: NVMe over TCP’s most significant advantage is its ability to operate over standard Ethernet networks. Unlike specialized networking technologies (Infiniband, Fibre Channel), NVMe/TCP requires no additional hardware investments or complex configuration, making it remarkably accessible for organizations of all sizes.
  • Performance Characteristics: NVMe over TCP delivers exceptional performance by minimizing protocol overhead and leveraging the efficiency of NVMe’s design. It can achieve latencies comparable to local storage while providing the flexibility of network-attached resources. Modern implementations can sustain throughput rates exceeding traditional storage protocols by significant margins.
  • Ease of Deployment: NVMe over TCP integrates seamlessly with Linux and Windows (Server 2025 and later) since the necessary drivers are already part of the kernel. That makes NVMe/TCP straightforward to implement and manage, and it reduces the learning curve and integration challenges typically associated with new storage technologies. A quick way to verify what your Linux host sees is shown below.
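As a sanity check that the in-kernel driver picked up your NVMe devices (local or fabric-attached), a short Python sketch like the following lists controllers and their transports via sysfs. The sysfs attributes used here are exposed by recent Linux kernels, but availability can vary by kernel version, so treat the paths as an assumption to verify on your system.

```python
# List NVMe controllers and their transport (pcie, tcp, rdma, fc) via sysfs.
# Assumes a recent Linux kernel; run on the host that attaches the volumes.
from pathlib import Path

SYSFS_NVME = Path("/sys/class/nvme")

def read_attr(path: Path) -> str:
    try:
        return path.read_text().strip()
    except OSError:
        return "n/a"

controllers = sorted(SYSFS_NVME.glob("nvme*")) if SYSFS_NVME.is_dir() else []
for ctrl in controllers:
    transport = read_attr(ctrl / "transport")   # "pcie" for local, "tcp" for NVMe/TCP
    model = read_attr(ctrl / "model")
    address = read_attr(ctrl / "address")       # PCI address or fabric target address
    print(f"{ctrl.name}: transport={transport} model={model} address={address}")
```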

Choosing Between NVMe over TCP and iSCSI

Deciding between two technologies isn’t always easy. In the case of NVMe over TCP vs iSCSI, it isn’t that hard, though. The use cases for new iSCSI deployments are very sparse. From my perspective, the only valid one is the integration of pre-existing legacy systems that don’t yet support NVMe over TCP.

That’s why simplyblock, as an NVMe over TCP-first solution, still provides iSCSI if you really need it. We offer it precisely because migrations don’t happen overnight. Still, you want to leverage the benefits of newer technologies, such as NVMe over TCP, wherever possible. With simplyblock, logical volumes can easily be provisioned as NVMe over TCP or iSCSI devices, and you can even switch over from iSCSI to NVMe over TCP later on.

In any case, you should go with NVMe over TCP when:

  • You operate high-performance computing environments
  • You have modern data centers with significant bandwidth
  • You deploy workloads requiring low-latency, high IOPS, or throughput storage access
  • You find yourself in scenarios that demand scalable, flexible storage solutions
  • You are in any other situation where you need remotely attached storage

You should stay on iSCSI (or slowly migrate away) when:

  • You have legacy infrastructure with limited upgrade paths

You see, there aren’t a lot of reasons. Given that, it’s just a matter of selecting your new storage solution. Personally, these days, I would always recommend software-defined storage solutions such as simplyblock, but I’m biased. Anyhow, an SDS provides the best of both worlds: commodity storage hardware (with the option to go all in with your 96-bay storage server) and performance.

Simplyblock: Embracing Versatility

Simplyblock demonstrates forward-thinking storage design by supporting both NVMe over TCP and iSCSI, providing customers with the best performance where available and a gradual migration path for existing legacy clients.

Furthermore, simplyblock offers features known from traditional SAN storage systems or “filesystems” such as ZFS. This includes a full copy-on-write backend with instant snapshots and clones. It includes synchronous and asynchronous replication between storage clusters. Finally, simplyblock is your modern storage solution, providing storage to dedicated hosts, virtual machines, and containers. Regardless of the client, simplyblock offers the most seamless integration with your existing and upcoming environments.

The Future of NVMe over TCP

As enterprise and cloud computing continue to evolve, NVMe over TCP stands as the technology of choice for remotely attached storage. Firstly, it combines simplicity, performance, and broad compatibility. Secondly, it provides a cost-efficient and scalable solution utilizing commodity network gear.

The protocol’s ongoing development (last specification update in May 2024) and increasing adoption promise continued improvements in efficiency, latency, and scalability.

NVMe over TCP represents a significant step forward in storage networking technology. By combining the raw performance of NVMe with the ubiquity of Ethernet networking, it offers a compelling solution for modern computing environments. While iSCSI remains relevant for specific use cases and during migration phases, NVMe over TCP represents the future and should be adopted as soon as possible.

We, at simplyblock, are happy to be part of this important step in the history of storage.

Questions and Answers

Is NVMe over TCP better than iSCSI?

Yes, NVMe over TCP is superior to iSCSI in almost every way. NVMe over TCP provides lower protocol overhead, better throughput, lower latency, and higher IOPS compared to iSCSI. It is recommended not to use iSCSI for newly designed infrastructures and to migrate old infrastructures away from it wherever possible.

How much faster is NVMe over TCP compared to iSCSI?

NVMe over TCP is superior in all primary storage metrics, meaning IOPS, latency, and throughput. NVMe over TCP shows up to 35% higher IOPS, 25% lower latency, and 20% increased throughput compared to iSCSI using the same network fabric and storage.

What is NVMe over TCP?

NVMe/TCP is a storage networking protocol that utilizes the common internet standard protocol TCP/IP as its transport layer. It is deployed over standard Ethernet fabrics and can run in parallel with existing network traffic, although separation through VLANs or physically separate networks is recommended. NVMe over TCP is considered the successor of the iSCSI protocol.

What is iSCSI?

iSCSI is a storage networking protocol that utilizes the common internet standard protocol TCP/IP as its transport layer. It connects remote storage solutions (commonly hardware storage appliances) to storage clients through a standard Ethernet fabric. iSCSI was initially standardized in 2000. Many companies replace iSCSI with the superior NVMe over TCP protocol.

What is SCSI?

SCSI (Small Computer System Interface) is a set of standards for connecting computers and peripheral devices and transferring data between them. Initially developed in the 1980s, SCSI has been a foundational technology for data storage interfaces, supporting various device types such as hard drives, optical drives, and scanners.

What is NVMe?

NVMe (Non-Volatile Memory Express) is a specification that defines the connection and transmission of data between storage devices and computers. The initial specification was released in 2011. NVMe is designed specifically for solid-state drives (SSDs) connected via the PCIe bus. NVMe devices offer improved latency and performance compared to older standards such as SCSI, SATA, and SAS.

NVMe Storage for Database Optimization: Lessons from Tech Giants
https://www.simplyblock.io/blog/nvme-database-optimization/ (October 17, 2024)

Leveraging NVMe-based storage for databases brings a whole new set of capabilities and performance optimization opportunities. In this blog, we explore how you can adopt NVMe storage for your database workloads, with case studies from tech giants such as Pinterest or Discord.

Database Scalability Challenges in the Age of NVMe

In 2024, data-driven organizations increasingly recognize the crucial importance of adopting NVMe storage solutions to stay competitive. With NVMe adoption still below 30%, there’s significant room for growth as companies seek to optimize their database performance and storage efficiency. We’ve looked at how major tech companies have tackled database optimization and scalability challenges, often turning to self-hosted database solutions and NVMe storage.

While it’s interesting to see what Netflix or Pinterest engineers are investing their efforts into, it is also essential to ask yourself how your organization is adopting new technologies. As companies grow and their data needs expand, traditional database setups often struggle to keep up. Let’s look at some examples of how some of the major tech players have addressed these challenges.

Pinterest’s Journey to Horizontal Database Scalability with TiDB

Pinterest, which handles billions of pins and user interactions, faced significant challenges with its HBase setup as it scaled. As their business grew, HBase struggled to keep up with evolving needs, prompting a search for a more scalable database solution. They eventually decided to go with TiDB as it provided the best performance under load.

Selection Process:

  • Evaluated multiple options, including RocksDB, ShardDB, Vitess, VoltDB, Phoenix, Spanner, CosmosDB, Aurora, TiDB, YugabyteDB, and DB-X.
  • Narrowed down to TiDB, YugabyteDB, and DB-X for final testing.

Evaluation:

  • Conducted shadow traffic testing with production workloads.
  • TiDB performed well after tuning, providing sustained performance under load.

TiDB Adoption:

  • Deployed 20+ TiDB clusters in production.
  • Stores more than 200 TB of data across 400+ nodes.
  • Primarily uses TiDB 2.1 in production, with plans to migrate to 3.0.

Key Benefits:

  • Improved query performance, with 2-10x improvements in p99 latency.
  • More predictable performance with fewer spikes.
  • Reduced infrastructure costs by about 50%.
  • Enabled new product use cases due to improved database performance.

Challenges and Learnings:

  • Encountered issues like TiCDC throughput limitations and slow data movement during backups.
  • Worked closely with PingCAP to address these issues and improve the product.

Future Plans:

  • Exploring multi-region setups.
  • Considering removing Envoy as a proxy to the SQL layer for better connection control.
  • Exploring migrating to Graviton instance types for a better price-performance ratio and EBS for faster data movement (and, in turn, shorter MTTR on node failures).

Uber’s Approach to Scaling Datastores with NVMe

Uber, facing exponential growth in active users and ride volumes, needed a robust solution for their datastore “Docstore” challenges.

Hosting Environment and Limitations:

  • Initially on AWS, later migrated to hybrid cloud and on-premises infrastructure
  • Uber’s massive scale and need for customization exceeded the capabilities of managed database services

Uber’s Solution: Schemaless and MySQL with NVMe

  • Schemaless: A custom solution built on top of MySQL
  • Sharding: Implemented application-level sharding for horizontal scalability
  • Replication: Used MySQL replication for high availability
  • NVMe storage: Leveraged NVMe disks for improved I/O performance

Results:

  • Able to handle over 100 billion queries per day
  • Significantly reduced latency for read and write operations
  • Improved operational simplicity compared to Cassandra

Discord’s Storage Evolution and NVMe Adoption

Discord, facing rapid growth in user base and message volume, needed a scalable and performant storage solution.

Hosting Environment and Limitations:

  • Google Cloud Platform (GCP)
  • Discord’s specific performance requirements and need for customization led them to self-manage their database infrastructure

Discord’s storage evolution:

  1. MongoDB: Initially used for its flexibility, but faced scalability issues
  2. Cassandra: Adopted for better scalability but encountered performance and maintenance challenges
  3. ScyllaDB: Finally settled on ScyllaDB for its performance and compatibility with Cassandra

Discord also created a solution called “superdisk,” with a RAID0 on top of the local SSDs and a RAID1 between the Persistent Disk and the RAID0 array. This let them configure the database with a disk drive that offers low-latency reads while still benefiting from the best properties of Persistent Disks. One can think of it as a “simplyblock v0.1”.

Figure 1: Discord’s “superdisk” architecture

Key improvements with ScyllaDB:

  • Reduced P99 latencies from 40-125ms to 15ms for read operations
  • Improved write performance, with P99 latencies dropping from 5-70ms to a consistent 5ms
  • Better resource utilization, allowing Discord to reduce their cluster size from 177 Cassandra nodes to just 72 ScyllaDB nodes

Summary of Case Studies

In the table below, we can see a summary of the key initiatives taken by these tech giants and their respective outcomes. Notably, all of the companies self-host their databases (on Kubernetes or on bare-metal servers) and have leveraged local SSDs (NVMe) for improved read/write performance and lower latency. At the same time, however, they all had to work around the data protection and scalability limitations of local disks. Discord, for example, uses RAID to mirror the disk, which causes significant storage overhead. Such an approach also doesn’t offer a logical management layer (i.e., “storage/disk virtualization”). In the next paragraphs, let’s explore how simplyblock adds even more performance, scalability, and resource efficiency to such setups.

Company   | Database | Hosting environment                               | Key initiative
Pinterest | TiDB     | AWS EC2 & Kubernetes, local NVMe disk             | Improved performance & scalability
Uber      | MySQL    | Bare-metal, NVMe storage                          | Reduced read/write latency, improved scalability
Discord   | ScyllaDB | Google Cloud, local NVMe disk with RAID mirroring | Reduced latency, improved performance and resource utilization

The Role of Intelligent Storage Optimization in NVMe-Based Systems

While these case studies demonstrate the power of NVMe and optimized database solutions, there’s still room for improvement. This is where intelligent storage optimization solutions like simplyblock are spearheading market changes.

Simplyblock vs. Local NVMe SSD: Enhancing Database Scalability

While local NVMe disks offer impressive performance, simplyblock provides several critical advantages for database scalability. Simplyblock builds a persistent layer out of local NVMe disks, which means it is neither just a cache nor just ephemeral storage. Let’s explore the benefits of simplyblock over local NVMe disks:

  1. Scalability: Unlike local NVMe storage, simplyblock offers dynamic scalability, allowing storage to grow or shrink as needed. Simplyblock can scale performance and capacity beyond the local node’s disk size, significantly improving tail latency.
  2. Reliability: Data on local NVMe is lost if an instance is stopped or terminated. Simplyblock provides advanced data protection that survives instance outages.
  3. High Availability: Local NVMe loses data availability during the node outage. Simplyblock ensures storage remains fully available even if a compute instance fails.
  4. Data Protection Efficiency: Simplyblock uses erasure coding (parity information) instead of triple replication, reducing network load and improving effective-to-raw storage ratios by about 150% (for a given amount of NVMe disk, there is 150% more usable storage with simplyblock).
  5. Predictable Performance: As IOPS demand increases, local NVMe access latency rises, often causing a significant increase in tail latencies (p99 latency). Simplyblock maintains constant access latencies at scale, improving both median and p99 access latency. Simplyblock also allows for much faster writes at high IOPS because it doesn’t use the NVMe layer as a write-through cache, so its performance doesn’t depend on a backing persistent storage layer (e.g., S3).
  6. Maintainability: Upgrading compute instances impacts local NVMe storage. With simplyblock, compute instances can be maintained without affecting storage.
  7. Data Services: Simplyblock provides advanced data services like snapshots, cloning, resizing, and compression without significant overhead on CPU performance or access latency.
  8. Intelligent Tiering: Simplyblock automatically moves infrequently accessed data to cheaper S3 storage, a feature unavailable with local NVMe.
  9. Thin Provisioning: This allows for more efficient use of storage resources, reducing overprovisioning common in cloud environments.
  10. Multi-attach Capability: Simplyblock enables multiple nodes to access the same volume, which is useful for high-availability setups without data duplication. Additionally, multi-attach can decrease the complexity of volume management and data synchronization.

Technical Deep Dive: Simplyblock’s Architecture

Simplyblock’s architecture is designed to maximize the benefits of NVMe while addressing common cloud storage challenges:

  1. NVMe-oF (NVMe over Fabrics) Interface: Exposes storage as NVMe volumes, allowing for seamless integration with existing systems while providing the low-latency benefits of NVMe.
  2. Distributed Data Plane: Uses a statistical placement algorithm to distribute data across nodes, balancing performance and reliability.
  3. Logical Volume Management: Supports thin provisioning, instant resizing, and copy-on-write clones, providing flexibility for database operations.
  4. Asynchronous Replication: Utilizes a block-storage-level write-ahead log (WAL) that’s asynchronously replicated to object storage, enabling disaster recovery with near-zero RPO (Recovery Point Objective). A conceptual sketch of this pattern follows after this list.
  5. CSI Driver: Provides seamless integration with Kubernetes, allowing for dynamic provisioning and lifecycle management of volumes.
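The sketch below illustrates the general idea of asynchronous WAL shipping: writes are acknowledged once they are persisted locally, while a background thread batches log records and uploads them to object storage. It is a conceptual Python illustration with made-up names (WalShipper, upload), not simplyblock’s actual implementation.

```python
# Conceptual illustration of asynchronous WAL shipping to object storage.
# Hypothetical names; not simplyblock's API or code.
import queue
import threading
import time

class WalShipper:
    """Batches WAL records locally and uploads them asynchronously."""

    def __init__(self, upload, batch_size: int = 64):
        self._pending = queue.Queue()
        self._upload = upload            # callable writing one segment to object storage
        self._batch_size = batch_size
        threading.Thread(target=self._ship_loop, daemon=True).start()

    def append(self, record: bytes) -> None:
        # The write is acknowledged after local persistence; replication
        # to object storage happens in the background (hence "asynchronous").
        self._pending.put(record)

    def _ship_loop(self) -> None:
        while True:
            batch = [self._pending.get()]            # block until there is work
            while len(batch) < self._batch_size:
                try:
                    batch.append(self._pending.get_nowait())
                except queue.Empty:
                    break
            self._upload(b"".join(batch))            # one object per shipped batch

# Usage with a dummy uploader:
shipper = WalShipper(upload=lambda seg: print(f"uploaded {len(seg)} bytes"))
shipper.append(b"block-write-1")
shipper.append(b"block-write-2")
time.sleep(0.1)                                      # give the shipper a moment
```

The gap between local persistence and the last shipped batch is what determines the RPO; the shorter the shipping cadence, the closer it gets to zero.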

Below is a short overview of simplyblock’s high-level architecture in the context of PostgreSQL, MySQL, or Redis instances hosted in Kubernetes. Simplyblock creates a clustered shared pool out of local NVMe storage attached to Kubernetes compute worker nodes (storage is persistent, protected by erasure coding), serving database instances with the performance of local disk but with an option to scale out into other nodes (which can be either other compute nodes or separate, disaggregated, storage nodes). Further, the “colder” data is tiered into cheaper storage pools, such as HDD pools or object storage.

Figure 2: Simplified simplyblock architecture

Applying Simplyblock to Real-World Scenarios

Let’s explore how simplyblock could enhance the setups of the companies we’ve discussed:

Pinterest and TiDB with simplyblock

While TiDB solved Pinterest’s scalability issues, and they are exploring Graviton instances and EBS for a better price-performance ratio and faster data movement, simplyblock could potentially offer additional benefits:

  1. Price/Performance Enhancement: Simplyblock’s storage orchestration could complement Pinterest’s move to Graviton instances, potentially amplifying the price-performance benefits. By intelligently managing storage across different tiers (including EBS and local NVMe), simplyblock could help optimize storage costs while maintaining or even improving performance.
  2. MTTR Improvement & Faster Data Movements: In line with Pinterest’s goal of faster data movement and reduced Mean Time To Recovery (MTTR), simplyblock’s advanced data management capabilities could further accelerate these processes. Its efficient data protection with erasure coding and multi-attach capabilities helps with smooth failovers or node failures without performance degradation. If a node fails, simplyblock can quickly and autonomously rebuild the data on another node using parity information provided by erasure coding, eliminating downtime.
  3. Better Scalability through Disaggregation: Simplyblock’s architecture allows for the disaggregation of storage and compute, which aligns well with Pinterest’s exploration of different instance types and storage options. This separation would provide Pinterest with greater flexibility in scaling their storage and compute resources independently, potentially leading to more efficient resource utilization and easier capacity planning.
Figure 3: Simplyblock’s multi-attach functionality visualized

Uber’s Schemaless

While Uber’s custom Schemaless solution on MySQL with NVMe storage is highly optimized, simplyblock could still offer benefits:

  1. Unified Storage Interface: Simplyblock could provide a consistent interface across Uber’s diverse storage needs, simplifying operations.
  2. Intelligent Data Placement: For Uber’s time-series data (like ride information), simplyblock’s tiering could automatically optimize data placement based on age and access patterns.
  3. Enhanced Disaster Recovery: Simplyblock’s asynchronous replication to S3 could complement Uber’s existing replication strategies, potentially improving RPO.

Discord and ScyllaDB

Discord’s move to ScyllaDB already provided significant performance improvements, but simplyblock could further enhance their setup:

  1. NVMe Resource Pooling: By pooling NVMe resources across nodes, simplyblock would allow Discord to further reduce their node count while maintaining performance.
  2. Cost-Efficient Scaling: For Discord’s rapidly growing data needs, simplyblock’s intelligent tiering could help manage costs as data volumes expand.
  3. Simplified Cloning for Testing: Simplyblock’s instant cloning feature could be valuable for Discord’s development and testing processes. It allows for quick replication of production data without additional storage overhead.

What’s next in the NVMe Storage Landscape?

The case studies from Pinterest, Uber, and Discord highlight the importance of continuous innovation in database and storage technologies. These companies have pushed beyond the limitations of managed services like Amazon RDS to create custom, high-performance solutions often built on NVMe storage.

However, the introduction of intelligent storage optimization solutions like simplyblock represents the next frontier in this evolution. By providing an innovative layer of abstraction over diverse storage types, implementing smart data placement strategies, and offering features like thin provisioning and instant cloning alongside tight integration with Kubernetes, simplyblock spearheads market changes in how companies approach storage optimization.

As data continues to grow exponentially and performance demands increase, the ability to intelligently manage and optimize NVMe storage will become ever more critical. Solutions that can seamlessly integrate with existing infrastructure while providing advanced features for performance, cost optimization, and disaster recovery will be key to helping companies navigate the challenges of the data-driven future.

The trend towards NVMe adoption, coupled with intelligent storage solutions like simplyblock is set to reshape the database infrastructure landscape. Companies that embrace these technologies early will be well-positioned to handle the data challenges of tomorrow, gaining a significant competitive advantage in their respective markets.

How We Built Our Distributed Data Placement Algorithm
https://www.simplyblock.io/blog/how-we-build-our-distributed-data-placement-storage-algorithm/ (May 22, 2024)

Modern cloud applications demand more from their storage than ever before – ultra-low latency, predictable performance, and bulletproof reliability. Simplyblock’s software-defined storage cluster technology, built upon its distributed data placement algorithm, reimagines how we utilize NVMe devices in public cloud environments.

This article deep dives into how we’ve improved upon traditional distributed data placement algorithms to create a high-performance I/O processing environment that meets modern enterprise storage requirements.

Design Of Simplyblock’s Storage Cluster

Simplyblock storage cluster technology is designed to utilize NVMe storage devices in public cloud environments for use cases that require predictable and ultra-low access latency (sub-millisecond) and the highest performance density (high IOPS per GiB).

To combine high performance with a high degree of data durability, high availability, and fault tolerance, as well as zero downtime scalability, the known distributed data placement algorithms had to be improved, re-combined, and implemented into a high-performance IO processing environment.

Our innovative approach combines:

  • Predictable, ultra-low latency performance (<1ms)
  • Maximum IOPS density optimization
  • Enterprise-grade durability and availability
  • Zero-downtime scalability
  • Advanced failure domain management

Modern Storage Requirements

Use cases such as high-load databases, time-series databases with high-velocity data, Artificial Intelligence (AI), Machine Learning (ML), and many others require fast and predictable storage solutions.

Anyhow, performance isn’t everything. The fastest storage is writing to /dev/null, but only if you don’t need the data durability. That said, the main goals for a modern storage solution are:

  • High Performance Density, meaning a high amount of IOPS per Gigabyte (at an affordable price).
  • Predictable, low Latency, especially for use cases that require consistent response times.
  • High degree of Data Durability, to distribute the data across failure domains, enabling it to survive multiple failure scenarios.
  • High Availability and Fault Tolerance, for the data to remain accessible in case of node outage. Clusters are automatically re-balanced in the case of element failures.
  • Zero Downtime Scalability, meaning that clusters can grow in real-time and online and are automatically re-balanced.

Distributed Data Placement

Data placement in storage clusters commonly uses pseudo-randomization. Additionally, features such as weighted distribution of storage across the cluster (based on the capacity and performance of available data buckets) are introduced to handle failure domains and cluster rebalancing – for scaling, downsizing, or removal of failed elements – at minimal cost. A prominent example of such an algorithm is CRUSH (Controlled, Scalable, Decentralized Placement of Replicated Data), which is used in Ceph, an open-source software-defined storage platform designed to provide object storage, block storage, and file storage in a unified system.

Simplyblock uses a different algorithm to achieve the following characteristics for its distributed data placement feature:

  • High storage efficiency (raw to effective storage ratio) with minimal performance overhead. Instead of using three data replicas, which is the standard mechanism to protect data from storage device failure in software-defined storage clusters, simplyblock uses erasure coding algorithms with a raw-to-effective ratio of about 1.33 (instead of 3).
  • Very low access latency below 100 microseconds for read and write. Possible write amplification below 2.
  • Ultra-high IOPS density with more than 200,000 IOPS per CPU core.
  • Performant re-distribution of storage in the cluster in the case of cluster scaling and removal of failed storage devices. Simplyblock’s algorithm will only re-distribute the amount of data that is close to the theoretical minimum to rebalance the cluster.
  • Support for volume high-availability based on the NVMe industry standard. Support for simple failure domains as they are available in cloud environments (device, node, rack, availability zone).
  • Performance efficiency aims to address the typical performance bottlenecks in cloud environments.

Implementing a storage solution to keep up with current trends required us to think out of the box. The technical design consists of several elements.

Low-level I/O Processing Pipeline

On the lower level, simplyblock uses a fixed-size page mapping algorithm implemented in a virtual block device (a virtual block device implements a filter or transformation step in the IO processing pipeline).

For that purpose, IO is organized into “pages” ( 2^m blocks, with m in the range of 8 to 12). Cross-page IO has to be split before processing. This is done on the mid-level processing pipeline. We’ll get to that in a second.

This algorithm can place data received via IO-write requests from multiple virtual block devices on a single physical block device. Each virtual block device has its own logical block address space though. The algorithm is designed to read, write, and unmap (deallocate) data with minimal write amplification for metadata updates (about 2%) and minimal increase in latency (on average in the sub-microseconds range). Furthermore, it is optimized for sudden power cuts (crash-consistent) by storing all metadata inline of the storage blocks on the underlying device.

Like all block device IO requests, each request contains an LBA (logical block address) and a length (in blocks). The 64-bit LBA is internally organized into a 24-bit VUID (a cluster-wide unique identifier of the logical volume) and a 39-bit virtual LBA. The starting LBA of the page on the physical device is identified by the key (VUID, LPA), where LPA is the logical page address (LBA / (2^m)), and the address offset within the page is determined by (LBA modulo 2^m).
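To make the address arithmetic concrete, here is a small Python sketch assuming m = 12 (4,096 blocks per page) and the 24-bit VUID / 39-bit virtual LBA split described above. The helper is purely illustrative and not simplyblock’s code.

```python
# Decompose a 64-bit LBA into (VUID, logical page address, in-page offset),
# following the 24-bit VUID / 39-bit virtual LBA layout described above.
# Illustrative sketch, not simplyblock's implementation.

M = 12                       # 2^12 = 4,096 blocks per page (assumed value)
VLBA_BITS = 39

def decompose(lba: int):
    vuid = (lba >> VLBA_BITS) & ((1 << 24) - 1)   # cluster-wide volume identifier
    vlba = lba & ((1 << VLBA_BITS) - 1)           # virtual LBA within the volume
    lpa = vlba >> M                               # logical page address (vlba / 2^m)
    offset = vlba & ((1 << M) - 1)                # block offset within the page
    return vuid, lpa, offset

# Example: volume 7, virtual LBA 1,000,000
lba = (7 << VLBA_BITS) | 1_000_000
print(decompose(lba))   # -> (7, 244, 576)
```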

The IO processing services of this virtual device work entirely asynchronously on CPU-pinned threads with entirely private data (no synchronization mechanisms between IO threads required).

They are placed on top of an entirely asynchronous NVMe driver, which receives IO and submits responses via IO-queue pairs sitting at the bottom of the IO processing stack.

Mid-level IO Processing Pipeline

On top of the low-level mapping device, a virtual block device, which implements distributed data placement, has access to the entire cluster topology. This topology is maintained by a centralized multi-tenant control plane, which knows about the state of each node and device in the attached clusters and manages changes to cluster topology (adding or removing devices and nodes).

It uses multiple mechanisms to determine the calculated and factual location of each data page and then issues asynchronous IO to this mapping device in the cluster locally (NVMe) or remotely using NVMe over Fabrics (NVMe-oF):

  1. All IO is received and forwarded on private IO threads, with no inter-thread communication, and entirely asynchronously, both inbound and outbound.
  2. First, IO is split at page boundaries so that single requests can be processed within a single page.
  3. Data is then striped into n chunks, and (double) parity is calculated from the chunks. Double parity is calculated using the RDP algorithm. The data is, therefore, organized in 2-dimensional arrays. n equals 1, 2, 4, or 8. This way, 4 KiB blocks can be mapped into 512-byte device blocks, and expensive partial stripe writes can be avoided.
  4. To determine a primary target for each combination of (VUID, page, chunk-index), a flat list of devices is fed into the “list bucket” algorithm (see …) with (VUID, page, chunk-index) being the key.
  5. Each of the data and parity chunks in a stripe has to be placed on a different device. In addition, more failure domains, such as nodes and racks, can be considered for placement anti-affinity rules. In case of a collision, the algorithm repeats recursively with an adjusted chunk-index (chunk-index + i x p, where p is the next prime number larger than the maximum chunk index and i is the iteration). A simplified sketch of this placement-with-retry loop follows after this list.
  6. In case a selected device is (currently) not available, the algorithm repeats recursively to find an alternative and also stores the temporary placement data for each chunk in the IO. This temporary placement data is now also journaled as metadata. Metadata journaling is an important and complex part of the algorithm. It is described separately below.
  7. On read, the process is reversed: the chunks to read from a determined placement location are determined by the same algorithm.
  8. In case of single or dual device failure at read, the missing data will be reconstructed on the fly from parity chunks.
  9. The algorithm pushes any information on IO errors straight to the control plane, and the control plane may update the cluster map (status of nodes and devices) and push the updated cluster map back to all nodes and virtual devices.
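The following Python sketch illustrates the placement idea from steps 4 and 5: a deterministic mapping from (VUID, page, chunk-index) to a device, with retries on anti-affinity collisions. It uses a plain hash in place of the actual “list bucket” algorithm and made-up device names, so treat it as an illustration of the technique rather than simplyblock’s implementation.

```python
# Simplified chunk placement with anti-affinity retries (chunk_index + i * p).
# A plain hash stands in for the real "list bucket" algorithm; devices are made up.
import hashlib

DEVICES = [("node-a", "dev-0"), ("node-a", "dev-1"),
           ("node-b", "dev-0"), ("node-b", "dev-1"),
           ("node-c", "dev-0"), ("node-c", "dev-1")]

def pick_device(vuid: int, page: int, chunk_index: int):
    key = f"{vuid}:{page}:{chunk_index}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return DEVICES[digest % len(DEVICES)]

def place_stripe(vuid: int, page: int, chunks: int, prime: int = 11):
    placement, used_devices, used_nodes = [], set(), set()
    for chunk_index in range(chunks):
        node, dev = pick_device(vuid, page, chunk_index)
        for i in range(1, 4 * len(DEVICES)):          # bounded retries on collision
            if (node, dev) not in used_devices and node not in used_nodes:
                break
            node, dev = pick_device(vuid, page, chunk_index + i * prime)
        placement.append((chunk_index, node, dev))    # falls back to last candidate
        used_devices.add((node, dev))
        used_nodes.add(node)
    return placement

print(place_stripe(vuid=7, page=244, chunks=3))
```

In the real implementation, the node- and rack-level anti-affinity rules, device weighting, and the handling of unavailable devices (temporary placements plus metadata journaling) make this considerably more involved.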

Top-level IO Processing Pipeline

On top of the stack of virtual block devices, simplyblock includes multiple optional virtual block devices, including a snapshot device – the device can take instant snapshots of volumes, supports snapshot chains, and instant (copy-on-write) cloning of volumes. Additionally, there is a virtual block device layer, which supports synchronous and asynchronous replication of block storage volumes across availability zones.

The highest virtual block device in the stack is then published to the fabric as a separate NVMe-oF volume with its own unique NVMe identifier (NQN) via the control plane.

High-Availability Support

The algorithm supports highly available volumes based on NVMe multipathing and ANA (Asymmetric Namespace Access). This means that a transparent fail-over of IO for a single volume in case of a node outage is realized without having to add any additional software to clients.

Due to the features of the high-, mid-, and low-level IO pipeline, this is easy to realize: identical stacks of virtual block devices with an identical VUID are created on multiple nodes and published to the fabric using Asymmetric Namespace Access (which prefers one path over the others and essentially implements an active/passive/passive mechanism).

Metadata Journaling

Metadata journaling persists non-primary placement locations so that data can be located in the cluster. It has the following important features:

  • It has to persist every change in location for the block range addressed in an IO request to be consistent in “sudden power-off” situations (node outages)
  • It has to minimize write amplification – this is achieved by smartly “batching” multiple metadata write requests into single IO operations in the form of “high-priority” IO
  • It has to be fast – not delaying data IO – this is achieved by high-priority NVMe queues
  • It has to be reliable – this is achieved by replicating metadata writes to three nodes using NVMe over Fabrics remote connections
  • Its storage footprint has to be small and remain constant over time; it cannot grow forever with new IO – this is achieved by introducing a regular compression mechanism, which replaces the transactional journal with a “snapshot” of placement metadata at a certain moment

Data Migrations

Data Migrations run as background processes, which take care of the movement of data in cases of failed devices (re-rebuild and re-distribution of data in the cluster), cluster scaling (to reduce the load on utilized elements and rebalance the cluster), and temporary element outage (to migrate data back to its primary locations). Running data migrations keeps a cluster in a state of transition and has to be coordinated to not conflict with any ongoing IO.

Conclusion

Building an architecture for a fast, scalable, fault-tolerant distributed storage solution isn’t easy. To be fair, I don’t think anyone expected that. Distributed systems are always complicated, and a lot of brain power goes into their design.

Simplyblock separates itself by rethinking data placement in distributed storage environments. Part of that is the fundamentally different way of using erasure coding for parity information. We don’t just use it on a single node, between the local drives; simplyblock applies erasure coding throughout the cluster, distributing parity information from each disk onto disks on other nodes, thereby increasing fault tolerance.

To test simplyblock, get started right away. If you want to learn more about the features simplyblock offers you, see our feature overview.

What is NVMe Storage?
https://www.simplyblock.io/blog/what-is-nvme-storage/ (May 8, 2024)

NVMe, or Non-Volatile Memory Express, is a modern access and storage protocol for flash-based solid-state storage. Designed for low overhead, latency, and response times, it aims for the highest achievable throughput. With NVMe over TCP, NVMe has its own successor to the familiar iSCSI.

While commonly found in home computers and laptops (M.2 form factor), it is designed from the ground up for all types of commodity and enterprise workloads. It guarantees fast load times and response times, even in demanding application scenarios.

The main intention in developing the NVMe storage protocol was to transfer data through the PCIe (PCI Express) bus. Since then, thanks to the protocol’s low overhead, more use cases have been addressed through NVMe specification extensions managed by the NVM Express group. Those extensions include additional transport layers, such as Fibre Channel, Infiniband, and TCP (collectively known as NVMe-oF or NVMe over Fabrics).

How does NVMe Work?

Traditionally, computers used SATA or SAS (and before that, ATA, IDE, SCSI, …) as their main protocols for data transfers from the disk to the rest of the system. Those protocols were all developed when spinning disks were the prevalent type of high-capacity storage media.

NVMe, on the other hand, was developed as a standard protocol to communicate with modern solid-state drives (SSDs). Unlike traditional protocols, NVMe fully takes advantage of SSDs’ capabilities. It also provides support for much lower latency due to the missing repositioning of read-write heads and rotating spindles.

The main reason for developing the NVMe protocol was that SSDs were starting to experience throughput limitations due to the traditional protocols SAS and SATA.

Anyhow, NVMe communicates through a high-speed Peripheral Component Interconnect Express bus (better known as PCIe). The logic for NVMe resides inside the controller chip on the storage adapter board, which is physically located inside the NVMe-capable device. This board is often co-located with controllers for other features, such as wear leveling. When accessing or writing data, the NVMe controller talks directly to the CPU through the PCIe bus.

The NVMe standard defines registers (basically special memory locations) to control the protocol, a command set of possible operations to be executed, and additional features to improve performance for specific operations.

What are the Benefits of NVMe Storage?

Compared to traditional storage protocols, NVMe has much lower overhead and is better optimized for high-speed and low-latency data access.

Additionally, the PCI Express bus can transfer data faster than SATA or SAS links. That means NVMe-based SSDs provide a latency of a few tens of microseconds, compared to the 40–100 microseconds typical for SATA-based ones.

Furthermore, NVMe storage comes in many different packages, depending on the use case. Many people know the M.2 form factor from home use; however, it is limited in bandwidth due to the fewer PCIe lanes available on consumer-grade CPUs. Enterprise NVMe form factors, such as U.2, provide more and faster uplinks as well as higher capacities. These enterprise types are specifically designed to sustain high throughput for demanding data center workloads, such as high-load databases or ML/AI applications.

Last but not least, NVMe commands can be streamlined, queued, and multipathed for more efficient parsing and execution. Due to the non-rotational nature of solid-state drives, multiple operations can be executed in parallel. This makes NVMe a perfect candidate for tunneling the protocol over high-speed communication links.

What is NVMe over Fabrics (NVMe-oF)?

NVMe over Fabrics is a tunneling mechanism for access to remote NVMe devices. It extends high-performance access to solid-state drives across the network, improving upon traditional tunneling protocols such as iSCSI.

NVMe over Fabrics is directly supported by the NVMe driver stacks of common operating systems, such as Linux and Windows (Server), and doesn’t require additional software on the client side.

At the time of writing, the NVM Express group has standardized the tunneling of NVMe commands through the NVMe-friendly protocols Fibre Channel, Infiniband, and Ethernet, or more precisely, over TCP.

NVMe over Fibre Channel (NVMe/FC)

NVMe over Fibre Channel is a high-speed transport that connects NVMe storage solutions to client devices. Fibre Channel, initially designed to transport SCSI commands, had to translate NVMe commands into SCSI commands and back to communicate with newer solid-state hardware. To mitigate that overhead, the Fibre Channel protocol was enhanced to natively support the transport of NVMe commands. Today, it supports native, in-order transfers between NVMe storage devices across the network.

Because Fibre Channel is its own networking stack, cloud providers (at least none that I know of) don’t offer support for NVMe/FC.

NVMe over TCP (NVMe/TCP)

NVMe over TCP provides an alternative way of transferring NVMe communication through a network. In the case of NVMe/TCP, the underlying network layer is the TCP/IP protocol, hence an Ethernet-based network. That increases the availability and commodity of such a transport layer beyond separate and expensive enterprise networks running Fibre Channel.

NVMe/TCP is rising to become the next protocol for mainstream enterprise storage, offering the best combination of performance, ease of deployment, and cost efficiency.

Due to its reliance on TCP/IP, NVMe/TCP can be utilized without additional modifications in all standard Ethernet network gear, such as NICs, switches, and copper or fiber transports. It also works across virtual private networks, making it extremely interesting in cloud, private cloud, and on-premises environments, specifically with public clouds with limited network connectivity options.

NVMe over RDMA (NVMe/RDMA)

A special version of NVMe over Fabrics is NVMe over RDMA (or NVMe/RDMA). It implements a direct communication channel between a storage controller and a remote memory region (RDMA = Remote Direct Memory Access). This lowers the CPU overhead for remote access to storage (and other peripheral devices). To achieve that, NVMe/RDMA bypasses the kernel network stack, avoiding memory copies between the driver stack, the kernel, and application memory.

NVMe over RDMA has two sub-protocols: NVMe over Infiniband and NVMe over RoCE (Remote Direct Memory Access over Converged Ethernet). Some cloud providers offer NVMe over RDMA access through their virtual networks.

How does NVMe/TCP Compare to ISCSI?

NVMe over TCP provides performance and latency benefits over the older iSCSI protocol. The improvements include about 25% lower protocol overhead, meaning more actual data can be transferred with every TCP/IP packet, increasing the protocol’s throughput.

Furthermore, NVMe/TCP enables native transfer of the NVMe protocol, removing multiple translation layers between the older SCSI protocol (which is used in iSCSI, hence the name) and NVMe.

That said, the difference is measurable. Blockbridge Networks, a provider of all-flash storage hardware, ran a performance benchmark of both protocols and found a general access latency improvement of up to 20% and an IOPS improvement of up to 35% when using NVMe/TCP instead of iSCSI to access the remote block storage.

Use Cases for NVMe Storage?

NVMe storage’s benefits and ability to be tunneled through different types of networks (including virtual private networks in cloud environments through NVMe/TCP) open up a wide range of high-performance, latency-sensitive, or IOPS-hungry use cases.

Typical examples are relational databases with high load or high-velocity data, as well as:

  • Time-series databases for IoT or Observability data
  • Big Data
  • Data Warehouses
  • Analytical databases
  • Artificial Intelligence (AI) and Machine Learning (ML)
  • Blockchain storage and other Crypto use cases
  • Large-scale data center storage solutions
  • Graphics editing storage servers

The Future is NVMe Storage

No matter how we look at it, the amount of data we need to transfer (quickly) from and to storage devices won’t shrink. NVMe is the current gold standard for high-performance and low-latency storage. Making NVMe available throughout a network and accessing the data remotely is becoming increasingly popular over the still prevalent iSCSI protocol. The benefits are imminent whenever NVMe-oF is deployed.

The storage solution by simplyblock is designed around the idea that NVMe is the better way to access your data. Built from the ground up to support NVMe throughout the stack, it combines NVMe solid-state drives into a massive storage pool and creates logical volumes, with data spread across all connected storage devices and simplyblock cluster nodes. Simplyblock provides these logical volumes as NVMe over TCP devices, which are directly accessible from Linux and Windows. Additional features such as copy-on-write clones, thin provisioning, compression, encryption, and more are included.

If you want to learn more about simplyblock, read our feature deep dive. You want to test it out, then get started right away.
