How a Simple Command Transforms SSD Performance and Endurance - TRIM and WAF
When new interns came into our group at Intel in the Non-Volatile Memory Solutions Group (NSG), the first thing I would do was take them to a whiteboard and explain how SSDs work, starting with WAF. WAF (Write Amplification Factor) is arguably the most critical concept in SSDs, and I still see even senior storage software developers make fundamental SSD mistakes because they don't understand it.

NAND flash memory must be erased before it can be programmed. NAND is erased in blocks (tens to hundreds of megabytes) but programmed in pages (commonly 16 kilobytes). Because of this mismatch, SSDs keep spare area, often referred to as overprovisioning (OP): NAND capacity that sits outside the user-accessible LBA range. Garbage collection (GC) algorithms move the valid data off an erase block onto a fresh one before the old block is erased to make room for new writes. If free space is already available, none of this is needed; even on random writes, the SSD's FTL (Flash Translation Layer) just writes the data to the next available free space. But when the drive fills up, performance falls off a cliff as GC kicks in, and it keeps degrading until the drive reaches steady state, where WAF is at its worst.
The SSD Write Cliff
I will illustrate this with the most basic workload: a 4k random write fio script run for 20 minutes. I'm performing this test on a Phison E18 that has been programmed to run in pSLC mode only, a 320GB drive with extremely high performance and endurance. I'm using a small drive here because doing this on a large-capacity SSD takes a considerable amount of time.
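Here is roughly what that job looks like; a minimal sketch, where the device path, queue depth, and job count are assumptions (and it destroys any data on the target):

```
# 4k random write across the whole device for 20 minutes,
# logging bandwidth once per second so the write cliff is visible.
fio --name=write_cliff --filename=/dev/nvme0n1 --ioengine=libaio \
    --direct=1 --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
    --time_based --runtime=1200 \
    --write_bw_log=write_cliff --log_avg_msec=1000
```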
If you take the area under the throughput curve, you will see that performance drops off right as cumulative writes hit 320GB: that is where the GC kicks in. As the available OP gets consumed, performance drops even further. So how do we measure the efficiency of the garbage collection algorithms? WAF!
The formula for WAF is…
WAF = NAND writes / host writes.
It is one of the most straightforward formulas in the world, yet it is widely misunderstood. In other words: how much data actually had to be written to the NAND to absorb the host workload. WAF ultimately governs both random write performance (through GC efficiency) and endurance.
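You can measure it yourself with nvme-cli. Host writes come from the standard SMART log; NAND writes are vendor-specific, so this sketch assumes a drive that exposes them through a vendor plugin (Intel's additional SMART log is one example):

```
# Host bytes written: data_units_written is in units of 512,000 bytes
nvme smart-log /dev/nvme0n1 | grep -i 'data_units_written'
# NAND bytes written: vendor-specific; Intel drives expose it via a plugin
nvme intel smart-log-add /dev/nvme0n1 | grep -i 'nand'
# Sample both before and after a workload, then:
#   WAF = delta(NAND bytes written) / delta(host bytes written)
```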
SSD Endurance in TBW = (NAND P/E Cycles * Capacity in TB) / WAF
If you have a WAF of 2, it will reduce the real endurance of the drive by 50%.
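A worked example with illustrative numbers: 3,000 P/E cycle TLC NAND on a 1.92TB drive gives 3,000 * 1.92 = 5,760 TBW at WAF=1, but only 2,880 TBW at WAF=2 and 1,440 TBW at WAF=4.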
What people miss is that this also impacts performance. The NAND can only sustain a fixed amount of write bandwidth, and every byte of GC traffic is bandwidth stolen from the host: a workload that can do 2000MB/s at WAF=1 will do 1000MB/s at WAF=2 and 500MB/s at WAF=4. We can demonstrate this with a simple fio script that runs a random write but only targets part of the LBA span. We start at 10% of the drive (90% of the user LBA span untouched) and run a random write for 5 minutes, then move to 20%, 30%, and so on; a sketch of that sweep follows below.
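Here is one way to script that sweep, assuming a raw block device and using fio's ability to size a job as a percentage of the device (destructive to any data on the drive):

```
# Random write over a growing slice of the LBA span, 5 minutes per step.
# Everything outside the written slice acts as extra spare area for the FTL.
nvme format /dev/nvme0n1 -s 1   # reset to factory state first (destroys all data!)
for pct in 10 20 30 40 50 60 70 80 90 100; do
    fio --name=span_${pct} --filename=/dev/nvme0n1 --ioengine=libaio \
        --direct=1 --rw=randwrite --bs=4k --iodepth=32 \
        --size=${pct}% --time_based --runtime=300
done
```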
SSDs will use any LBA that hasn't been written to as spare area! This lowers the WAF and behaves exactly like additional drive overprovisioning. It is also why TRIM is so important. More on that below…
This means you can get the same overprovisioning effect in several ways. An NVMe format is the easiest: it sets the drive back to a "factory state" of performance by deallocating (trimming) every LBA. Remember, SSDs talk in LBAs, so the only way for them to know which data the host still needs for real objects or files is for the host operating system to send a TRIM when data is no longer needed. You can also create a partition smaller than the total LBA span, or an NVMe namespace smaller than the original one; a few examples follow below.
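A few ways to do this from Linux, sketched with assumed device paths (all of these destroy data):

```
# Deallocate every LBA and return the drive to factory-fresh performance
nvme format /dev/nvme0n1 -s 1
# Or TRIM the entire device from the block layer
blkdiscard /dev/nvme0n1
# Or partition only part of the LBA span and never touch the rest
parted -s /dev/nvme0n1 mklabel gpt mkpart primary 0% 80%
```

(Shrinking a namespace is sketched in the DWPD section below.)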
Busting Myths About TRIM
This brings me to why I wrote this post. I keep getting questions like, “should I enable TRIM? I heard it isn’t good or required on some SSDs.”
The answer, 99% of the time, is yes; for the love of god, please enable TRIM. TRIM lowers WAF and increases SSD performance and endurance, especially over time as the drive fills up. TRIM goes by many names – Deallocate in NVMe, discard in Linux, UNMAP in SCSI, and the TRIM bit of the Data Set Management command in SATA.
Modern filesystems in Linux STILL don't enable discard on mount by default. This goes for ext4, xfs, btrfs, and many more. Even Windows has enabled TRIM by default for over a decade: it trims when you empty the Recycle Bin or permanently delete a file. The argument is that discard can be done asynchronously as a background operation. If you have a workload like an OS boot drive that doesn't write much data, that is totally fine. But for any storage workload that writes a meaningful amount of data, deferring discards can be catastrophic for WAF over time, and for the long-term performance and endurance of the drive. Enabling it is a one-line change, as sketched below.
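A minimal sketch, with the device and mount point as assumptions:

```
# Inline discard on ext4 (xfs and btrfs accept the same mount option)
mount -o discard /dev/nvme0n1p1 /mnt
# Or batch TRIM in the background; most distros ship a weekly timer
systemctl enable --now fstrim.timer
# Or run it by hand and see how much was trimmed
fstrim -v /mnt
```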
TRIM in SATA was not originally a queued command, so issuing a TRIM was costly from an application latency standpoint. This is not an issue in NVMe, and hasn't been for years. I tested a variety of workloads on drives as old as the Intel P3600, and those drives still performed better with TRIM enabled. There are some workloads, like hyperscale, that are extremely sensitive to tail latencies, but modern drives can TRIM an entire 7.68TB drive in under 1 second. There is no reason on earth to disable TRIM when drives have that sort of TRIM bandwidth. I can go into how to measure the impact of TRIM on a workload in another post, but fio handles this well; a starting point is sketched below.
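fio can issue trims directly against a Linux block device, so you can compare latency percentiles with and without trims in flight. A minimal sketch; the engine, block size, and percentile list are assumptions:

```
# Random trims against the raw device; watch the completion
# latency percentiles, then rerun alongside your real workload.
fio --name=trim_lat --filename=/dev/nvme0n1 --ioengine=libaio \
    --direct=1 --rw=randtrim --bs=1m --iodepth=1 \
    --time_based --runtime=60 --percentile_list=99:99.9:99.99
```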
Are 3 DWPD drives a scam?
Short answer: yes. Most SSD vendors put the exact same NAND on their 1 DWPD and 3 DWPD drives; the models just ship with different levels of OP. You can accomplish a lower WAF in many different ways, and user-defined OP is always better because you keep the flexibility to reclaim the extra capacity if you ever need it. One way to set it yourself is sketched below.
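For example, on drives that support namespace management, you can hand capacity back to the FTL by recreating the namespace smaller. A hedged sketch: the sizes are placeholders, and not every drive (especially consumer models) supports this:

```
# Delete the existing namespace, recreate it at e.g. 80% of max capacity,
# and reattach it. The unallocated 20% becomes extra OP for the FTL.
nvme delete-ns /dev/nvme0 -n 1
nvme create-ns /dev/nvme0 --nsze=<blocks> --ncap=<blocks> --flbas=0
nvme attach-ns /dev/nvme0 -n 1 -c 0
```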
Now that you understand WAF, the SNIA SSD endurance calculator spreadsheet should make total sense. Read the whitepaper I wrote a while back that explains how SSDs are rated for TBW and DWPD with the JEDEC enterprise workload. SSD endurance also seems like a mystery to a lot of people, but it is 100% predictable, which probably warrants a future post.
Thanks, I needed to rant a bit about WAF and why it is so important. If this was helpful, let me know what other SSD topics you want to see a deep dive into!