#9 An Empirical Guide to the Behavior and Use of Scalable Persistent Memory

18 Jun 2021

Link https://arxiv.org/abs/1908.03583
Venue FAST 2020

Big Ideas

Details

Background

On the Intel Cascade Lake platform, each processor contains one or two processor dies (I'm not too sure how this works; recall that a die is simply a piece of silicon). As shown in Figure 1 (a), each processor die contains two integrated memory controllers (iMC), and each iMC supports three memory channels. In total, each processor die can connect to up to six 3D XPoint DIMMs. The iMC maintains read and write queues (RPQ and WPQ) for each of the DIMMs, as shown in Figure 1 (b). Recall that under ADR (Asynchronous DRAM Refresh), the WPQ is included in the persistence domain. The iMC talks to the NVDIMMs over DDR-T at cache-line (64B) granularity; DDR-T is physically the same interface as DDR4 but uses a different protocol.

An NVM Access

An NVM access can be broken down into the following steps:

  1. The request first arrives at the on-DIMM controller (XPController), which coordinates access to the storage media.
  2. The XPController performs internal address translation for wear-leveling, using the address indirection table (AIT).
  3. The physical media’s access granularity is 256B (an XPLine). The XPController translates smaller requests into 256B accesses, which causes write amplification (an undesirable phenomenon where the amount of data actually written to the media exceeds the logical amount). The XPController has a small write-combining buffer (XPBuffer) that merges 64B writes into 256B internal writes. A write smaller than 256B turns into a more expensive read-modify-write; a full 256B write can simply overwrite the line.

It seems the XPBuffer is also used as a cache: e.g., a 64B read brings the entire 256B XPLine into the XPBuffer, so subsequent reads within that line can be served directly from the buffer.
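
To make the read-modify-write cost concrete, here is a tiny sketch of the write-amplification arithmetic for sub-XPLine stores. This is my own illustration, not from the paper, and it assumes the store is not merged with neighboring writes in the XPBuffer:

```c
#include <stdio.h>

#define XPLINE 256  /* physical media access granularity, in bytes */

/* Bytes that hit the media for one store of `size` bytes at `offset`,
 * assuming no merging with other writes in the XPBuffer. */
static unsigned media_bytes_written(unsigned size, unsigned offset) {
    unsigned first = offset / XPLINE;
    unsigned last  = (offset + size - 1) / XPLINE;
    return (last - first + 1) * XPLINE;
}

int main(void) {
    /* A 64B store touching one XPLine: 256B hit the media, a 4x amplification. */
    unsigned w = media_bytes_written(64, 0);
    printf("64B store  -> %uB written to media (%.1fx amplification)\n", w, (double)w / 64);

    /* A full, aligned 256B store simply overwrites the line: no amplification. */
    w = media_bytes_written(256, 0);
    printf("256B store -> %uB written to media (%.1fx amplification)\n", w, (double)w / 256);
    return 0;
}
```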

The NVDIMMs can also be interleaved across channels and DIMMs, as shown in Figure 1 (c). I think this means the physical address space is partitioned in an interleaved fashion for better aggregate performance. The paper notes that the only interleave size supported by their platform is 4KB, which ensures a single page falls into a single DIMM. With 6 DIMMs, accessing a contiguous region larger than 6 × 4KB = 24KB touches all of the DIMMs.
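
A rough sketch of how that interleaving works out. The constants (4KB chunks, 6 DIMMs) are from the paper; the simple round-robin mapping is my own assumption for illustration:

```c
#include <stdio.h>

#define INTERLEAVE_SIZE (4 * 1024)  /* 4KB, the only interleave size on their platform */
#define NUM_DIMMS 6

/* Which DIMM a given offset in the interleaved region lands on,
 * assuming consecutive 4KB chunks round-robin across the DIMMs. */
static int dimm_for_offset(unsigned long offset) {
    return (int)((offset / INTERLEAVE_SIZE) % NUM_DIMMS);
}

int main(void) {
    /* A contiguous 32KB scan crosses 8 chunks and therefore touches all 6 DIMMs. */
    for (unsigned long off = 0; off < 32 * 1024; off += INTERLEAVE_SIZE)
        printf("offset %2luKB -> DIMM %d\n", off / 1024, dimm_for_offset(off));
    return 0;
}
```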

Figure 1: Overview of Intel 3D XPoint NVMDIMM [1].

A Quick Word on App Direct Mode

The CPU caches can change the order in which PM stores reach the media, which makes recovery complicated. To enforce PM write ordering (to the physical media), programmers can use clflush or clflushopt to flush a cache line to PM; clwb writes the line back to PM but does not evict it from the cache; ntstore lets the write bypass the caches entirely.
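
A minimal sketch of the two persistence paths, assuming `dst` points into a DAX-mapped PM region and compiling on x86-64 with -mclwb. The function names are mine, not the paper's:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <immintrin.h>

#define CACHELINE 64

/* Path 1: regular stores, then clwb each dirty cache line, then sfence.
 * clwb writes the lines back to PM without evicting them from the cache. */
static void persist_with_clwb(void *dst, const void *src, size_t len) {
    memcpy(dst, src, len);
    for (uintptr_t p = (uintptr_t)dst & ~(uintptr_t)(CACHELINE - 1);
         p < (uintptr_t)dst + len; p += CACHELINE)
        _mm_clwb((void *)p);
    _mm_sfence();  /* order the write-backs before any later stores */
}

/* Path 2: non-temporal stores bypass the cache hierarchy entirely;
 * a trailing sfence is still required to order them. Assumes dst, src,
 * and len are 8B-aligned/multiples to keep the sketch short. */
static void persist_with_ntstore(void *dst, const void *src, size_t len) {
    long long *d = (long long *)dst;
    const long long *s = (const long long *)src;
    for (size_t i = 0; i < len / sizeof(long long); i++)
        _mm_stream_si64(&d[i], s[i]);
    _mm_sfence();
}
```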

Performance Characterization

The authors first develop a microbenchmark toolkit called LATTester to replace prior benchmarks designed for DRAM or disk.

Typical latency (latency seen by the software, not by the device):

Key numbers to remember (for 8B accesses): ~300ns for a random read, roughly 2x better if sequential, and ~100ns for a write.

Tail latency:

Summary: tail-latency events are rare but roughly two orders of magnitude slower than the typical latency.

Bandwidth:

Summary:

Sequential 256B accesses:

Random accesses:

Latency under load: when the maximum bandwidth is reached, latency degrades significantly due to queuing effects (likely in the WPQ). The same is true for DRAM.

Best Practices for PM (i.e., what do these mean for the application?)

  1. Avoid random accesses smaller than 256B.
    • Small random reads are obviously inefficient, since they do not take advantage of the XPBuffer cache.
    • Small random writes are even worse: writes smaller than 256B cause write amplification plus a read-modify-write, and write bandwidth is scarce.
    • But if the small writes can be combined into 256B writes, performance is not so bad. The authors show that the XPBuffer has 64 lines, i.e., 64 × 256B = 16KB per DIMM.
    • The paper suggests that we “avoid small stores, but if that is not possible, limit the working set to 16 KB.” Why does a working set under 16KB help with small accesses? The authors seem to be saying that the internal XPBuffer can reorder writes: even if the individual writes are small, the XPBuffer can wait (buffer) for more writes to the same 256B line before writing it back. A write-combining sketch appears after this list.
  2. Use non-temporal stores when possible for large transfers, and control cache evictions.
    • A non-temporal store, or ntstore, is an alternative to clwb or clflush that bypasses the cache hierarchy entirely. Note that we still need an sfence after it.
    • Experimental results show that non-temporal stores achieve higher bandwidth than cache-line flushes for accesses over 256B. The authors suspect this is because going through the cache requires the CPU to first load the cache line from NVM, which uses up the NVM’s bandwidth.
  3. Limit the number of concurrent threads accessing a 3D XPoint DIMM.
    • The paper identifies two factors behind NVM’s poor performance under high concurrency: 1) more threads mean more XPBuffer contention and evictions, and 2) contention in the iMC queues. The authors suspect that if the WPQ’s injection rate is much higher than its service rate, the queue will block.
  4. Avoid NUMA accesses (especially read-modify-write sequences).
    • The authors note that NUMA effects for NVM are much larger than for DRAM.
    • I do not know how NUMA works yet. I should definitely learn this.
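
As mentioned in practice #1, here is a write-combining sketch. It is my own illustration built around a hypothetical append-only log, where `pm_log` is assumed to be a 256B-aligned, DAX-mapped PM region and the build uses -mclwb. Small appends are staged in DRAM and only full 256B XPLines are ever written to PM, so the DIMM never has to do a sub-XPLine read-modify-write:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <immintrin.h>

#define XPLINE 256
#define CACHELINE 64

/* Hypothetical append-only log (illustration only). */
struct xp_log {
    char   *pm_log;         /* 256B-aligned PM destination */
    size_t  pm_off;         /* offset of the next XPLine to write */
    char    stage[XPLINE];  /* DRAM staging buffer for one XPLine */
    size_t  used;           /* bytes currently buffered in stage */
};

/* Write one full 256B XPLine to PM: stores + clwb per cache line + sfence. */
static void flush_xpline(struct xp_log *log) {
    char *dst = log->pm_log + log->pm_off;
    memcpy(dst, log->stage, XPLINE);
    for (size_t i = 0; i < XPLINE; i += CACHELINE)
        _mm_clwb(dst + i);
    _mm_sfence();
    log->pm_off += XPLINE;
    log->used = 0;
}

/* Append a small record; PM is only ever written in 256B units. */
static void log_append(struct xp_log *log, const void *rec, size_t len) {
    const char *src = rec;
    while (len > 0) {
        size_t n = XPLINE - log->used;
        if (n > len)
            n = len;
        memcpy(log->stage + log->used, src, n);
        log->used += n;
        src += n;
        len -= n;
        if (log->used == XPLINE)
            flush_xpline(log);
    }
}
```

Whether the DRAM staging buffer itself needs to be made crash-consistent is a separate concern not covered in this sketch; the point is only that the media sees whole XPLines.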

Thoughts and Comments

Questions

  1. “It is important to note that all updates that reach the XPBuffer are already persistent since XPBuffer resides within the ADR. Consequently, the NVDIMM can buffer and merge updates regardless of ordering requirements that the program specifies with memory fences.” Why does persistence mean the NVM can ignore ordering? E.g., what about two persists to the same location?
  2. Why does NVM have more pattern dependency than DRAM (a larger random vs. sequential latency gap)? The authors believe this is caused by the XPBuffer, but do not elaborate further.
    • Recall that the XPBuffer performs write combining from 64B to 256B internal writes. It makes sense for random writes to be slower, but the experiments show read latency. How does the XPBuffer affect read latency?
    • Ah, this is because the XPBuffer is used as a cache. The first read brings an entire XPLine into the XPBuffer and the following reads can read from the buffer.
    • But why does DRAM not have this problem? Is DRAM’s internal access size much smaller, and thus the penalty for a “cache” miss lower? It seems the typical DRAM row (page) size is 8KB.
    • Then I think this is because DRAM has inherently lower “cache” miss penalty, as “…the cost of opening a page of DRAM is much lower than accessing a new page of 3D XPoint.”

Sources

[1] Jian Yang, Juno Kim, Morteza Hoseinzadeh, Joseph Izraelevitz, and Steve Swanson. An empirical guide to the behavior and use of scalable persistent memory. In 18th USENIX Conference on File and Storage Technologies (FAST 20), 2020.