Local NVMe Storage on AWS – Pros and Cons
What is the Best Storage Solution on AWS?

The debate over the optimal storage solution on AWS has been ongoing. Local instance storage (i.e., ephemeral NVMe disks attached to an EC2 instance) brings a remarkable cost-performance ratio: it offers 20 times better performance and 10 times lower access latency than EBS. It’s a powerhouse for quick, ephemeral storage needs. In simple words, a local NVMe disk is very fast and relatively cheap, but neither scalable nor persistent.

Recently, Vantage posted an article titled “Don’t use EBS for Cloud Native Services“. We agree with the problem statement, but we also strongly believe there is a better solution than using local NVMe SSD storage on AWS as an alternative to EBS. Comparing local NVMe to EBS isn’t comparing apples to apples; it’s more like apples to oranges.

The Local Instance NVMe Storage Advantage

Local storage on AWS excels in speed and cost-efficiency, delivering performance that’s 20 times better and latency that’s 10 times lower compared to EBS. For certain workloads with temporary storage needs, it’s a clear winner. But, let’s acknowledge the reasons why data centers have traditionally separated storage and compute.

Overcoming Traditional Challenges of Local Storage

| Challenge | Local Storage | simplyblock |
| --- | --- | --- |
| Scalability | Limited capacity, unable to resize dynamically | Dynamic scalability |
| Reliability | Data loss if the instance is stopped or terminated | Advanced data protection; data survives instance outages |
| High Availability | Inconsistent access in case of a compute instance outage | Storage remains fully available during a compute instance outage |
| Data Protection Efficiency | N/A | Erasure coding instead of three replicas, reducing network load and the raw-to-effective storage ratio by a factor of about 2.5x (sketched below) |
| Predictability/Consistency | Access latency increases with rising IOPS demand | Constant access latencies |
| Maintainability | Storage impacted by compute instance upgrades | Compute instances can be upgraded and maintained without impact on storage |
| Data Services Offloading | N/A | Data services such as volume snapshots, copy-on-write cloning, instant volume resizing, erasure coding, encryption, and data compression run with no impact on local CPU, performance, or access latency |
| Intelligent Storage Tiering | N/A | Infrequently accessed data chunks are automatically moved from more expensive, fast storage to cheap S3 buckets |
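
The “about 2.5x” efficiency figure in the table can be sanity-checked with a quick calculation. Below is a minimal sketch, assuming a hypothetical erasure-coding scheme with k data chunks and m parity chunks; the actual parameters used by simplyblock are not stated here.

```python
def replication_raw_per_effective(replicas: int = 3) -> float:
    # three full copies: 3 bytes of raw capacity per byte of payload
    return float(replicas)

def ec_raw_per_effective(k: int = 8, m: int = 2) -> float:
    # k data chunks plus m parity chunks stored per k chunks of payload
    return (k + m) / k

# assumed k=8, m=2 scheme: 3.0 / 1.25 = 2.4x less raw capacity needed
print(replication_raw_per_effective() / ec_raw_per_effective())
```

With these assumed parameters, erasure coding needs 2.4 times less raw capacity than three-way replication, in the ballpark of the “about 2.5x” above.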

Simplyblock provides an innovative approach that marries the cost and performance advantages of local instance storage with the benefits of pooled cloud storage. It offers the best of both worlds: high-speed, low-latency performance close to that of local storage, coupled with the robustness and flexibility of pooled cloud storage.

Why Choose simplyblock on AWS?

  1. Performance and Cost Efficiency: Enjoy the benefits of local storage without compromising on scalability, reliability, and high availability.
  2. Data Protection: simplyblock employs advanced data protection mechanisms, ensuring that your data survives any instance outage.
  3. Seamless Operations: Upgrade and maintain compute instances without impacting storage, ensuring continuous operations.
  4. Data Services Galore: Unlock the potential of various data services without affecting local CPU performance.

While local instance storage has its merits, the future lies in a harmonious blend of the speed of local storage and the resilience of cloud-pooled storage. With simplyblock, we transcend the limitations of local NVMe disks, providing you with a storage solution that’s not just powerful but also versatile, scalable, and intelligently designed for the complexities of the cloud era.

What is NVMe Storage?
NVMe, or Non-Volatile Memory Express, is a modern access and storage protocol for flash-based solid-state storage. Designed for low overhead, latency, and response times, it aims for the highest achievable throughput. With NVMe over TCP, NVMe also brings its own successor to the familiar iSCSI.

While commonly found in home computers and laptops (M.2 form factor), it is designed from the ground up for all types of commodity and enterprise workloads. It guarantees fast load times and response times, even in demanding application scenarios.

The main intention in developing the NVMe storage protocol was to transfer data through the PCIe (PCI Express) bus. Thanks to the protocol’s low overhead, more use cases have since been found through NVMe specification extensions managed by the NVM Express group. Those extensions include additional transport layers, such as Fibre Channel, Infiniband, and TCP (collectively known as NVMe-oF or NVMe over Fabrics).

How does NVMe Work?

Traditionally, computers used SATA or SAS (and before that, ATA, IDE, SCSI, …) as their main protocols for data transfers from the disk to the rest of the system. Those protocols were all developed when spinning disks were the prevalent type of high-capacity storage media.

NVMe, on the other hand, was developed as a standard protocol to communicate with modern solid-state drives (SSDs). Unlike traditional protocols, NVMe takes full advantage of SSDs’ capabilities. It also delivers much lower latency, since there are no read-write heads to reposition and no rotating spindles to wait for.

The main reason for developing the NVMe protocol was that SSDs were starting to experience throughput limitations due to the traditional protocols SAS and SATA.

Anyhow, NVMe communicates through a high-speed Peripheral Component Interconnect Express bus (better known as PCIe). The logic for NVMe resides inside the controller chip on the storage adapter board, which is physically located inside the NVMe-capable device. This board is often co-located with controllers for other features, such as wear leveling. When accessing or writing data, the NVMe controller talks directly to the CPU through the PCIe bus.

The NVMe standard defines registers (basically special memory locations) to control the protocol, a command set of possible operations to be executed, and additional features to improve performance for specific operations.
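
On Linux, you can see the controllers and the transport they use through sysfs. The following is a minimal sketch, assuming a Linux system with the NVMe driver loaded; the /sys/class/nvme layout is provided by the kernel, and device names will vary.

```python
from pathlib import Path

def list_nvme_controllers() -> None:
    # each NVMe controller shows up as /sys/class/nvme/nvme0, nvme1, ...
    for ctrl in sorted(Path("/sys/class/nvme").glob("nvme*")):
        model = (ctrl / "model").read_text().strip()
        transport = (ctrl / "transport").read_text().strip()  # e.g. pcie, tcp, rdma, fc
        print(f"{ctrl.name}: {model} via {transport}")

if __name__ == "__main__":
    list_nvme_controllers()
```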

What are the Benefits of NVMe Storage?

Compared to traditional storage protocols, NVMe has much lower overhead and is better optimized for high-speed and low-latency data access.

Additionally, the PCI Express bus can transfer data faster than SATA or SAS links. That means NVMe-based SSDs provide latencies of a few microseconds, compared to roughly 40-100 microseconds for SATA-based ones.

Furthermore, NVMe storage comes in many different packages, depending on the use case. Many people know the M.2 form factor from home use; however, it is limited in bandwidth due to the much smaller number of PCIe lanes available on consumer-grade CPUs. Enterprise NVMe form factors, such as U.2, provide more and higher-capacity uplinks. These enterprise types are specifically designed to sustain high throughput for demanding data center workloads, such as high-load databases or ML / AI applications.

Last but not least, NVMe commands can be streamlined, queued, and multipathed for more efficient parsing and execution. Due to the non-rotational nature of solid-state drives, multiple operations can be executed in parallel. This makes NVMe a perfect candidate for tunneling the protocol over high-speed communication links.
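
To illustrate that parallelism from the application side, here is a minimal sketch issuing many positional reads concurrently. The device path is a placeholder (reading a raw device requires elevated privileges; any large file works just as well):

```python
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4096
DEVICE = "/dev/nvme0n1"  # placeholder; substitute any large file

def read_block(fd: int, offset: int) -> bytes:
    # os.pread is positional and thread-safe: no shared file offset to contend on
    return os.pread(fd, BLOCK_SIZE, offset)

fd = os.open(DEVICE, os.O_RDONLY)
try:
    # 32 reads in flight at once; an NVMe device services them from parallel queues
    with ThreadPoolExecutor(max_workers=32) as pool:
        blocks = list(pool.map(lambda off: read_block(fd, off),
                               range(0, 32 * BLOCK_SIZE, BLOCK_SIZE)))
finally:
    os.close(fd)
```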

What is NVMe over Fabrics (NVMe-oF)?

NVMe over Fabrics is a tunneling mechanism for accessing remote NVMe devices. It extends the performance of solid-state drive access beyond what traditional tunneling protocols, such as iSCSI, can offer.

NVMe over Fabrics is directly supported by the NVMe driver stacks of common operating systems, such as Linux and Windows (Server), and doesn’t require additional software on the client side.

At the time of writing, the NVM Express group has standardized the tunneling of NVMe commands through the NVMe-friendly protocols Fibre Channel, Infiniband, and Ethernet, or more precisely, over TCP.

NVMe over Fibre Channel (NVMe/FC)

NVMe over Fibre Channel is a high-speed transport that connects NVMe storage solutions to client devices. Fibre Channel, initially designed to transport SCSI commands, originally had to translate NVMe commands into SCSI commands and back to communicate with newer solid-state hardware. To mitigate that overhead, the Fibre Channel protocol was enhanced to natively support the transport of NVMe commands. Today, it supports native, in-order transfers between NVMe storage devices across the network.

Because Fibre Channel is its own networking stack, cloud providers (at least to my knowledge) don’t offer support for NVMe/FC.

NVMe over TCP (NVMe/TCP)

NVMe over TCP provides an alternative way of transferring NVMe communication through a network. In the case of NVMe/TCP, the underlying network layer is the TCP/IP protocol, hence an Ethernet-based network. That increases the availability and commodity of such a transport layer beyond separate and expensive enterprise networks running Fibre Channel.

NVMe/TCP is rising to become the next protocol for mainstream enterprise storage, offering the best combination of performance, ease of deployment, and cost efficiency.

Due to its reliance on TCP/IP, NVMe/TCP can be used without modification on all standard Ethernet network gear, such as NICs, switches, and copper or fiber transports. It also works across virtual private networks, making it extremely interesting for cloud, private cloud, and on-premises environments, especially in public clouds with limited network connectivity options.
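
As a rough sketch of how a Linux client attaches an NVMe/TCP volume with nvme-cli: load the transport module, discover, then connect. The address and NQN below are placeholders; 4420 is the IANA-assigned NVMe/TCP port.

```python
import subprocess

TARGET_ADDR = "192.0.2.10"                     # placeholder target IP
TARGET_NQN = "nqn.2024-01.io.example:volume1"  # placeholder subsystem NQN

def sh(*args: str) -> None:
    # run a command, echoing it first; raises if the command fails
    print("+", " ".join(args))
    subprocess.run(args, check=True)

sh("modprobe", "nvme-tcp")  # make sure the NVMe/TCP transport is available
sh("nvme", "discover", "-t", "tcp", "-a", TARGET_ADDR, "-s", "4420")
sh("nvme", "connect", "-t", "tcp", "-a", TARGET_ADDR, "-s", "4420", "-n", TARGET_NQN)
# the volume then appears as a regular local block device, e.g. /dev/nvme1n1
```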

NVMe over RDMA (NVMe/RDMA)

A special version of NVMe over Fabrics is NVMe over RDMA (or NVMe/RDMA). It implements a direct communication channel between a storage controller and a remote memory region (RDMA = Remote Direct Memory Access). This lowers the CPU overhead for remote access to storage (and other peripheral devices). To achieve that, NVMe/RDMA bypasses the kernel stack, avoiding memory copies between the driver stack, the kernel stack, and the application memory.

NVMe over RDMA has two sub-protocols: NVMe over Infiniband and NVMe over RoCE (Remote Direct Memory Access over Converged Ethernet). Some cloud providers offer NVMe over RDMA access through their virtual networks.

How does NVMe/TCP Compare to iSCSI?

NVMe over TCP provides performance and latency benefits over the older iSCSI protocol. The improvements include about 25% lower protocol overhead, meaning more actual data can be transferred with every TCP/IP packet, increasing the protocol’s throughput.

Furthermore, NVMe/TCP enables native transfer of the NVMe protocol, removing multiple translation layers between the older SCSI protocol (which is used in iSCSI, hence the name) and NVMe.

That said, the difference is measurable. Blockbridge Networks, a provider of all-flash storage hardware, benchmarked both protocols and found a general access latency improvement of up to 20% and an IOPS improvement of up to 35% when using NVMe/TCP instead of iSCSI to access remote block storage.

Use Cases for NVMe Storage

NVMe storage’s benefits and ability to be tunneled through different types of networks (including virtual private networks in cloud environments through NVMe/TCP) open up a wide range of high-performance, latency-sensitive, or IOPS-hungry use cases.

Typical use cases include:

  • Relational databases with high load or high-velocity data
  • Time-series databases for IoT or observability data
  • Big data
  • Data warehouses
  • Analytical databases
  • Artificial intelligence (AI) and machine learning (ML)
  • Blockchain storage and other crypto use cases
  • Large-scale data center storage solutions
  • Graphics editing storage servers

The Future is NVMe Storage

No matter how we look at it, the amount of data we need to transfer (quickly) from and to storage devices won’t shrink. NVMe is the current gold standard for high-performance, low-latency storage. Making NVMe available throughout a network and accessing the data remotely is becoming increasingly popular compared to the still prevalent iSCSI protocol. The benefits are evident wherever NVMe-oF is deployed.

The storage solution by simplyblock is designed around the idea that NVMe is the better way to access your data. Built from the ground up to support NVMe throughout the stack, it combines NVMe solid-state drives into a massive storage pool. It creates logical volumes, with data spread across all connected storage devices and simplyblock cluster nodes. Simplyblock provides these logical volumes as NVMe over TCP devices, which are directly accessible from Linux and Windows. Additional features such as copy-on-write clones, thin provisioning, compression, encryption, and more are included.

If you want to learn more about simplyblock, read our feature deep dive. If you want to test it out, get started right away.

IOPS vs Throughput vs Latency – Storage Performance Metrics
IOPS, throughput, and latency are interrelated metrics that provide insight into the read, write, and access performance of storage entities and network interconnects.

Measuring the performance of a storage solution isn’t hard, but understanding the measured values is. IOPS, throughput, and latency are related to each other, but how? In this blog post I try to explain how they are related and what you should know to really get the most out of your performance test.

IOPS, throughput, and latency are all very important metrics. HDDs (spinning disks), SSDs (meaning SATA- or SAS-connected ones), and NVMe devices (connected through PCIe one way or another) have very different performance profiles.

Simplyblock’s storage engine uses NVMe disks and the NVMe protocol to provide its virtual (or logical) storage volumes to the outside world. Enough of simplyblock, though, let’s get into the performance metrics!

What is IOPS?

IOPS describes how many read and/or write operations can be executed per second (random or sequential).

IOPS (spoken as eye-ops) means input/output operations per second. The IOPS value says how many read or write operations a storage device can perform (and sustain) in a single second.

Reads are operations that access information stored on the storage device, while writes add new or update existing information on the disk.

As a good rule of thumb, higher IOPS means better performance. However, it is important to understand how IOPS relates to throughput and block size. A measured or calculated amount of IOPS for one specific block size doesn’t necessarily translate to the same amount of IOPS with another block size. We’ll come back to block size later.

How to Calculate IOPS?

Calculating IOPS is fairly simple. We take the total number of read and write operations and divide it by the number of seconds we measured:

IOPS = (ReadOperations + WriteOperations) / TimeInSeconds

Alternatively, a single measurement of read and write throughput gives us the same answer: since every operation transfers exactly one block, dividing the combined throughput (bytes per second) by the block size (bytes per operation) yields operations per second.

IOPS = (ReadThroughput + WriteThroughput) / BlockSize
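
As a quick worked example of the formula (the numbers are made up for illustration):

```python
def iops(read_throughput: float, write_throughput: float, block_size: int) -> float:
    # bytes per second divided by bytes per operation gives operations per second
    return (read_throughput + write_throughput) / block_size

# e.g. 400 MB/s read plus 100 MB/s write at a 4 KiB block size
print(iops(400e6, 100e6, 4096))  # ~122,000 IOPS
```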

With spinning disks it is a bit more complicated because of additional seek times. Since spinning disks aren’t used for high-performance storage anymore, I’ll leave that out.

Anyhow, while this sounds easy enough, write operations sometimes require additional read operations to succeed. That can happen in a RAID setup (or similar technologies like erasure coding), where multiple pieces of information are “mathematically” combined into recovery information (often called parity data). Meaning, we take two (or more) pieces of input information, combine them through some mathematical formula, and write the result to different disks.

As a result, we can lose one piece of the input information and still use the calculation result together with the other piece to reverse the mathematical operation and recover the lost data. The consequence, however, is 1 additional read and 1 additional write for every write operation:

  • Write: New information to write
  • Read: Second information input for mathematical calculation
  • Write: Recovery calculation result

For reading data from such a setup, as long as the information to be read isn’t lost, it’s a single-read operation. In the case of recovering the information due to disk failure (or any other issue), there will be two read operations (parity and the second information input), as well as the reversed mathematical operation.

When calculating how many IOPS a storage system can sustain, we need to take those cases into account. Finally, some storage solutions may have slightly asymmetric read and write performance, meaning reads and writes aren’t equally fast. To find the maximum IOPS value, we may therefore have to test with a 60:40 read-write ratio, or any other one. This is plain trial and error. It’s always best to test with your specific use case and read-write ratio.
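
Here is a small sketch of how such write amplification eats into the raw IOPS budget, using the simplified parity scheme described above (one extra read and one extra write per logical write; real RAID and erasure-coding layouts differ):

```python
def effective_write_iops(raw_iops: float,
                         extra_reads: int = 1,
                         extra_writes: int = 1) -> float:
    # each logical write costs 1 write + extra_reads + extra_writes physical ops
    physical_ops_per_write = 1 + extra_reads + extra_writes
    return raw_iops / physical_ops_per_write

# a device rated at 300k raw IOPS sustains ~100k logical writes/s in this scheme
print(effective_write_iops(300_000))
```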

What is Throughput?

Throughput is the measure of how much data can be transferred (read or written) in a given amount of time; for storage solutions it’s typically measured in bytes per second. Just be careful, since some metrics use bits per second instead, especially when your storage is connected via a network link (like Ethernet or Fibre Channel).

A higher throughput normally means better performance, as in faster data transfers. In comparison to IOPS, throughput tells us a specific amount of data that can be read or written in a given amount of time. This metric can be used directly to understand if a storage solution is “fast enough.” It will not tell us if you can sustain a certain pattern of read or write operations though.

How to Calculate Throughput?

To calculate the throughput without measuring it, we take the amount of IOPS (read and write) and multiply it by the configured block size.

Throughput = (Read IOPS + Write IOPS) x BlockSize

We differentiate between read and write IOPS because the maximum amount of IOPS for either direction may not necessarily equal the other direction. See the How to Calculate IOPS section for more information.
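
A worked example, including the bytes-versus-bits conversion mentioned earlier (network links are usually quoted in bits per second):

```python
def throughput_bytes_per_sec(read_iops: float, write_iops: float,
                             block_size: int) -> float:
    # operations per second times bytes per operation gives bytes per second
    return (read_iops + write_iops) * block_size

bps = throughput_bytes_per_sec(read_iops=80_000, write_iops=20_000, block_size=4096)
print(f"{bps / 1e6:.0f} MB/s")        # ~410 MB/s
print(f"{bps * 8 / 1e9:.2f} Gbit/s")  # ~3.28 Gbit/s on the wire
```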

IOPS vs Throughput

IOPS and throughput are related to each other. That means both values are required to paint a full picture of storage performance.

Throughput gives us a feeling of how much actual data we can read or write in a given amount of time. If we know that we have to write a 1GB file in less than 10 seconds, we know what write throughput is required.

On the other hand, if we know that our database makes many small read operations, we want to know the potential read IOPS limit of our storage.

Anyhow, there is more to performance than those two metrics: latency.

Latency is defined as the time needed for a data packet to make a full round-trip, while throughput is the quantity of data that can be sent and received within a given unit of time.
Figure 1: How are latency and throughput defined?

What is Latency?

Latency, in storage systems, describes the amount of time a storage entity needs to process a single data request. With spinning disks, that included seek time (the time to find a specific location, position the read-write heads, and wait for the platter to rotate to the correct position), but it also includes computation time inside the disk controller.

The latter still holds true for flash storage devices. Due to features such as wear leveling, the controller has to calculate where a piece of information is stored at the current point in time.

That said, latency directly affects how many IOPS can be performed, hence the throughput. Thanks to flash technology, the latency these days is fairly consistent (per device) and doesn’t depend on the physical location anymore. With spinning disks, the latency was different depending on the current and new position of the heads and if the information was stored on the inside or outside of the platter. Luckily, we can ignore this information today. Measuring the latency once per disk should be enough.

IOPS, throughput, and latency are directly related to each other but provide different information.

While IOPS tells us how many actual operations we can perform per second, throughput tells us how much data can be transferred at the same time. Latency, however, tells us how long we have to wait for this operation to be performed (or finished, depending on what type of latency you’re looking for).

There is no one best metric, though. When you want to understand the performance characteristics of a storage entity, all 3 metrics are equally important. Though, for a specific question, one may carry more weight than the others.

Reads and Writes

So far, we have always talked about reading and writing. We also mentioned that their ratio may impact the overall performance. In addition, we briefly talked about RAID and erasure coding (and now I’ll throw mirroring and replication into the mix, too).

Diverging read and write speeds mostly come from hardware implications, such as flash cells needing to be erased before being written. That, inherently, makes writes slower.

For other situations, such as RAID and similar technologies, additional reads or writes have to be executed, which means the effective write IOPS value may differ from the physically possible one.

Generally, it’s a valid approach to try to find the absolute maximum read and write values (in terms of throughput and IOPS) by trying out different ratios of writing and reading in the same test. I’d recommend starting with a 50:50 ratio and then going 60:40 or 40:60, seeing which one has the higher result, and then making smaller steps (1 or 5 increments or decrements) until you seem to get the highest numbers.
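
That ratio sweep is easy to automate. The following sketch drives fio and parses its JSON output; the target path is a placeholder (writes are destructive, so point it at a scratch file or disposable volume), and the JSON field layout matches current fio releases but may vary.

```python
import json
import subprocess

TARGET = "/tmp/fio-testfile"  # placeholder; never point at a disk holding data

def total_iops(read_pct: int) -> float:
    # run a 30-second random read/write mix and return combined IOPS
    out = subprocess.run(
        ["fio", "--name=mixsweep", f"--filename={TARGET}", "--size=1G",
         "--rw=randrw", f"--rwmixread={read_pct}", "--bs=4k",
         "--iodepth=32", "--direct=1", "--runtime=30", "--time_based",
         "--output-format=json"],
        capture_output=True, text=True, check=True)
    job = json.loads(out.stdout)["jobs"][0]
    return job["read"]["iops"] + job["write"]["iops"]

# coarse sweep first (40:60 up to 60:40), then narrow the step size by hand
best = max(range(40, 65, 5), key=total_iops)
print(f"highest combined IOPS at a {best}:{100 - best} read:write mix")
```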

This doesn’t necessarily reflect your real-world pattern, though, because depending on the workload you run, you may have a different ratio requirement.

Why do I have more Reads than Writes?

Workloads like static asset web servers (found at CDN providers) generally have a much higher read than write rate. Their whole purpose is to cache data from other systems and serve it as fast and as often as possible. In this case, the storage should provide the fastest read throughput as well as the highest read IOPS.

While those systems need to store and update existing data in the cache, the extreme imbalance of reads over writes makes write performance (almost) negligible.

Another example of such a system may be an analytics database, which is filled with updated data irregularly but is used for analytical purposes most of the time. That means that while a lot of data is probably written in a short amount of time (ingress burst), the data is mostly read for running queries and analytics requests.

Why do I have more Writes than Reads?

Data lakes or IoT databases often see far more writes than reads. In both situations we want to optimize the storage for faster writes.

Data lakes commonly collect data for later analytics. The collected data is often pre-evaluated and used as a training set for algorithms such as fraud detection.

With IoT databases, the amount of data being delivered from IoT devices is often much higher than what is necessary to present to the user. For that reason, many databases optimized for IoT data (or time series data in general) have features to pre-aggregate data for dashboards, hence increasing the write rates even higher. Anyhow, many such databases are optimized to hold the active data set (the data commonly presented to users) in RAM to prevent it from being read from disk, decreasing the typical read rates further.

Why do I have Writes Only?

Finally, there are systems that see practically only writes. These include mirrors or hot standby systems. In this case, changes on a primary system are replicated to the mirror and written down. While the system is “waiting” to become the primary, basically no read operations happen.

Reads and Writes Distribution

Depending on your workload, your read-write ratio will change. Before trying to find out if a storage solution fits your needs, you should figure out what your workload’s real-world distribution looks like. Only afterward is there a way to really make an educated guess.

Generally speaking, the read-write distribution will change the outcome of your tests. While a storage solution may advertise a massive number of IOPS, the real questions are: can it sustain those IOPS, at what block size can I expect them, and how will the IOPS values change when I change the block size?

Different workloads have different requirements. Just keep that in mind.

Why Block Size Matters

Finally, block size. We have talked about it over and over again. The block size defines how much data (in bytes) is read or written per operation. That said, 1,000 operations per second (IOPS) at 8KB each means that we read 8,000KB (or 8MB) per second. We immediately see our throughput.

Does that mean that with 64KB block size, I get 8 times the speed? It depends. We all hate this answer, but it does.

It mostly depends on how the storage solution is built and how it is connected. Say you have a 1Gbit/s network interconnect to the storage solution; that gets you roughly 125MB/s through the wire. 64KB x 1000 IOPS results in 64MB/s, so it fits. If you increase the block size to 128KB, the network bandwidth becomes the limiting factor.

Anyhow, the network isn’t your only limiting factor. Transferring larger blocks takes more time, meaning your potential maximum IOPS will drop. What I didn’t tell you before: when running your benchmark and trying to find the perfect read-write ratio, you can also play with the block size. There is a good chance that the default block size isn’t the perfect one.
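
The back-of-the-envelope math above fits in a few lines. A sketch, assuming a 1Gbit/s link and a device that sustains 1,000 IOPS regardless of block size (a simplification, since real devices lose IOPS as blocks grow):

```python
LINK_LIMIT = 125e6  # 1 Gbit/s Ethernet carries roughly 125 MB/s of payload

def achievable_throughput(max_iops: float, block_size: int) -> float:
    # throughput is capped by whichever is lower: the device or the wire
    return min(max_iops * block_size, LINK_LIMIT)

for bs in (8_192, 65_536, 131_072):
    mb = achievable_throughput(1_000, bs) / 1e6
    print(f"{bs // 1024:>4} KB blocks -> {mb:.0f} MB/s")
# 8 KB and 64 KB blocks fit; 128 KB blocks hit the network limit at 125 MB/s
```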

Conclusion

That was a lot of information. Your two main metrics are IOPS and throughput. Latency should be measured but can be considered stable for most flash-based systems today. Block size is something to experiment with. It directly impacts both IOPS and throughput, at least until you reach another limiting factor (most likely some bandwidth limitation).

Up to that point, you can try and increase the block size for best performance. Remember, you may also have to make changes to your workload configuration or implementation to make use of the increased block size.

The following table should help a bit when judging both IOPS and throughput.

|  | IOPS | Throughput |
| --- | --- | --- |
| Measurement | Input/output operations per second | (Mega-)bytes per second |
| Meaning | The number of input/output (read or write) operations per second. | The amount of data (in bytes) that can be transferred through the storage connection per second. |
| Difficulty | Measurement requires specific software, and results depend directly on the selected block size. For measurements, I’d recommend the command line tool fio. | Easy to measure. Most operating systems have built-in tools or provide them in their repositories (such as iostat). A specialized tool such as fio can help with benchmarking. |
| Where it helps | Good measurement for random and small operations, where latency and request queuing make throughput misleading. | Good measurement for large, sequential transfers and for judging whether a storage solution is “fast enough” for a given amount of data. |
| Where it doesn’t help | Doesn’t say much about the amount of data that can be transferred. Also very much dependent on the chosen block size. | If you have a lot of random I/O, throughput will not be a good measurement due to the influence of latency and queuing of storage requests. |

That said, both metrics, IOPS and throughput, are useful for assessing the performance of a storage system. They shouldn’t be used in isolation, though.

Simplyblock provides a disaggregated storage solution for latency-sensitive and high-IO workloads in the cloud. To achieve that, simplyblock creates a unified storage pool of NVMe disks connected to the simplyblock cluster nodes (virtual machines) and offers logical volumes through NVMe over TCP. Those logical volumes are distributed across all connected cluster nodes and disks and secured using erasure coding, delivering the highest performance without sacrificing fault tolerance. Learn more about simplyblock now, or get started right away.
