#7 Finding and Fixing Performance Pathologies in Persistent Memory Software Stacks

17 Jun 2021

Link https://dl.acm.org/doi/10.1145/3297858.3304077
Year ASPLOS 2019

Background

File Systems for NVM

The paper distinguishes between native NVM fs, which are designed specifically for NVM, and adapted fs, which are block-based fs extended with NVM features. The Linux kernel provides xfs-DAX and ext4-DAX, both of which are adapted fs. Adapted fs offer limited NVM optimizations because they still need to stay optimized for disk operations (yes, this is kind of vague).

SQLite (Mine) [2] [3]

SQLite uses either rollback journals (undo logging) or write-ahead logs (WAL) to implement atomic transactions. With a rollback journal in DELETE mode, the journal file (one per database) is created at the start of each transaction and deleted at the end. In PERSIST mode, the journal file is marked invalid instead of being deleted. A rollback journal saves a copy of the original, unchanged database content and then writes the changes directly in place. WAL is the opposite: the original content is left untouched while modifications are appended to the log file. Multiple transactions can be appended to a single WAL file (the log does not need to be reset at commit). Eventually the changes in the log have to be written back to the database. This is called checkpointing. By default, SQLite checkpoints when the WAL file reaches 1000 pages.
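As a rough illustration of the modes described above, the sketch below uses Python's standard `sqlite3` module to switch a scratch database to WAL, set the auto-checkpoint threshold, and force a checkpoint by hand. The file name `demo.db`, the table `kv`, and the helper `wal_demo` are made up for this example; the pragmas (`journal_mode`, `wal_autocheckpoint`, `wal_checkpoint`) are standard SQLite ones.

```python
import os
import sqlite3
import tempfile


def wal_demo(rows=100):
    """Switch a scratch database to WAL mode and checkpoint it by hand."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")  # hypothetical file name
        # isolation_level=None means autocommit: each statement below is
        # its own transaction.
        conn = sqlite3.connect(path, isolation_level=None)
        try:
            mode = conn.execute("PRAGMA journal_mode=WAL").fetchall()[0][0]
            # Checkpoint automatically once the WAL holds 1000 pages.
            conn.execute("PRAGMA wal_autocheckpoint=1000").fetchall()
            threshold = conn.execute("PRAGMA wal_autocheckpoint").fetchall()[0][0]

            conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
            for i in range(rows):
                conn.execute("INSERT INTO kv VALUES (?, ?)", (f"key{i}", "x" * 64))

            wal_path = path + "-wal"  # SQLite keeps the log next to the database
            wal_before = os.path.getsize(wal_path)
            # Copy the log into the database file, then truncate the log.
            busy = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchall()[0][0]
            wal_after = os.path.getsize(wal_path)
        finally:
            conn.close()
    return {
        "mode": mode,
        "threshold": threshold,
        "busy": busy,
        "wal_before": wal_before,
        "wal_after": wal_after,
    }


if __name__ == "__main__":
    print(wal_demo())
```

The log grows with every insert and only shrinks at the explicit checkpoint, which is the "delayed second write" behaviour in miniature.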

WAL is faster than rollback journaling because its writes are all sequential (to the log) and a commit only writes the content once, to the WAL file. I guess the second write, to the actual database, is deferred until checkpoint, whereas rollback journaling writes twice up front: once to the journal and once to the database. WAL also lets the user delay syncing to disk at the cost of durability. On the other hand, WAL’s read performance degrades as the log grows, since a reader needs to check the log to find the latest value.
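To make the write-count argument concrete, here is a toy model of the two schemes. It is my own simplification and not how SQLite is implemented: the classes `UndoStore` and `RedoStore`, the `writes` counter, and the dictionary standing in for the database file are all hypothetical.

```python
_MISSING = object()  # marks "key did not exist before this transaction"


class UndoStore:
    """Toy rollback-journal store: log the old value, then update in place."""

    def __init__(self):
        self.db, self.journal, self.writes = {}, [], 0

    def set(self, key, value):
        self.journal.append((key, self.db.get(key, _MISSING)))  # write 1: old value
        self.db[key] = value  # write 2: new value, in place
        self.writes += 2

    def rollback(self):
        # Undo newest-first so the oldest saved value is restored last.
        for key, old in reversed(self.journal):
            if old is _MISSING:
                self.db.pop(key, None)
            else:
                self.db[key] = old
        self.journal.clear()

    def commit(self):
        self.journal.clear()  # the journal is only needed until commit


class RedoStore:
    """Toy WAL store: append new values to the log, leave the database alone."""

    def __init__(self):
        self.db, self.wal, self.writes = {}, [], 0

    def set(self, key, value):
        self.wal.append((key, value))  # the only write on the commit path
        self.writes += 1

    def get(self, key):
        # The newest entry wins, so scan the log backwards before the database.
        for k, v in reversed(self.wal):
            if k == key:
                return v
        return self.db.get(key)

    def checkpoint(self):
        for k, v in self.wal:  # the deferred second write
            self.db[k] = v
            self.writes += 1
        self.wal.clear()
```

In this model `RedoStore.get` gets slower as `wal` grows, and `checkpoint` is what brings reads back to a direct lookup.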

In-Memory Database (Mine) [4]

An in-memory database stores its data, well, in memory. In contrast to dbs that store data on disk, in-memory databases have lower response time. To persist data, in-memory dbs store each operation in a log residing on disk. Redis is an in-memory database.

Details

Optimally Adapting Applications to NVM

This part only focuses on legacy apps built for block-based storage (so not the DRAM use case). The authors draw on first-hand experience porting five databases and key-value stores to better utilize NVM, and identify how to do this for the best performance.

SQLite

SQLite offers four logging modes: DELETE, TRUNCATE, PERSIST, and WAL. WAL uses redo logging and the rest use undo logging. Figure 2 shows the SET throughput of each mode under different file systems. xfs-DAX and ext4-DAX are adapted NVM fs while NOVA is a native NVM fs. In the WAL column, we observe that even with WAL, the two adapted fs perform about the same as they do in the three undo logging modes, while NOVA performs significantly better. This is because xfs-DAX and ext4-DAX keep their allocator state (I think this is basically which blocks are used by which files) in NVM, so the allocations triggered by WAL appends are expensive (each one has to go to NVM). Block-based fs keep allocator state on disk to avoid an expensive media scan after a crash (why do the DAX extensions still do this with NVM? See question 1). To combat this, the authors use fallocate to preallocate the WAL file, which closes the gap between the two adapted fs and NOVA. To get even better performance, the paper proposes FiLe Emulation with DAX (FLEX), which lets the application access the WAL file through a DAX memory mapping (instead of through fs calls). The key takeaway here is that FLEX enables adapted NVM fs to achieve performance similar to that of a native NVM fs.

Figure 2: SQLite SET throughput with different journaling modes [1].
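The two SQLite fixes above, preallocating the log and then accessing it through a mapping rather than through write calls, can be sketched roughly as follows. This is only an approximation of the idea: it assumes Linux (for `os.posix_fallocate`), uses an ordinary `mmap` instead of a DAX mapping, and persists with `msync` via `mmap.flush()`, whereas I believe real FLEX relies on cache-line flushes. The class `MappedLog` and its in-memory `tail` pointer are made up for the example.

```python
import mmap
import os
import tempfile


class MappedLog:
    """Append-only log that lives in a preallocated, memory-mapped file."""

    def __init__(self, path, capacity):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        # Reserve all blocks up front so appends never hit the fs allocator.
        os.posix_fallocate(self.fd, 0, capacity)
        self.map = mmap.mmap(self.fd, capacity)
        self.tail = 0  # next free byte; kept only in memory in this sketch

    def append(self, record: bytes) -> int:
        end = self.tail + len(record)
        if end > len(self.map):
            raise ValueError("log full")
        self.map[self.tail:end] = record  # plain memory copy, no write() call
        self.map.flush()  # msync here; an NVM port would flush cache lines
        offset, self.tail = self.tail, end
        return offset

    def read(self, offset, length):
        return bytes(self.map[offset:offset + length])

    def close(self):
        self.map.close()
        os.close(self.fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = MappedLog(os.path.join(tmp, "wal.log"), 1 << 16)
        where = log.append(b"SET k v")
        print(where, log.read(where, 7))
        log.close()
```

A real log would also have to persist the tail pointer (or make records self-describing) so that recovery can find the end of the log after a crash.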

Kyoto Cabinet and LMDB

We won’t go into too much detail here. Both of these databases/libraries already use mmap to access files on traditional disk storage media. This makes them a natural match for porting to DAX systems, but we can still optimize them further. The key takeaway here is that rather than using msync (as on traditional media), we should use CLWB and SFENCE on NVM platforms for better performance. This is because CLWB writes back at cache-line granularity while msync operates on whole pages.
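The granularity difference is easy to put numbers on. The sketch below computes how many bytes get written back for a small update under page-sized and cache-line-sized flushing, and shows the msync path through Python's `mmap.flush()`. Python cannot issue CLWB or SFENCE, so the cache-line side is arithmetic only; the 64-byte `CACHE_LINE` is an assumption (typical on x86), and the helper names are hypothetical.

```python
import mmap
import os
import tempfile

CACHE_LINE = 64  # bytes; typical on x86, assumed here


def written_back(offset, length, granularity):
    """Bytes written back when [offset, offset+length) is flushed in whole units."""
    start = (offset // granularity) * granularity
    end = -(-(offset + length) // granularity) * granularity  # round up
    return end - start


def update_with_msync(path, offset, payload):
    """Update a mapped file in place and msync only the touched pages."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        mm[offset:offset + len(payload)] = payload
        page = mmap.PAGESIZE
        start = (offset // page) * page
        span = min(written_back(offset, len(payload), page), len(mm) - start)
        mm.flush(start, span)  # msync on Unix; the offset must be page-aligned
        return span
```

For an 8-byte update at offset 100, `written_back(100, 8, 4096)` gives 4096 while `written_back(100, 8, CACHE_LINE)` gives 64, so the page-based path pushes 64 times more data for the same change.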

Redis and RocksDB

These two applications keep data in DRAM and flush to disk only when necessary. Redis uses an Append Only File (AOF) to log all write operations. The user may change the AOF flush frequency to trade performance for recoverability. Naively porting Redis to NVM is slow, since the AOF append operations are expensive (see the SQLite section). The proposed solution is to make Redis’ internal hash table (I am guessing this is how Redis implements its key-value store) persistent. This eliminates the need for the AOF, because the authors adopt an atomic scheme to update key-value pairs (I think this might mean no transaction support? See question 2). Eliminating the AOF improves performance significantly. However, the authors note that making the hash table persistent requires significant programming effort.
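To picture what the AOF costs, here is a toy append-only log with a tunable sync frequency. It is not Redis' actual file format or its fsync policy options: the class `ToyAOF`, the JSON-lines encoding, and the count-based `fsync_every` knob are all simplifications I made up for illustration.

```python
import json
import os
import tempfile


class ToyAOF:
    """In-memory key-value store that logs every SET to an append-only file."""

    def __init__(self, path, fsync_every=1):
        self.data = {}
        self.fsync_every = fsync_every  # 1 = sync on every write (safest, slowest)
        self.pending = 0  # writes not yet forced to storage
        self.fsyncs = 0
        self.log = open(path, "ab")

    def set(self, key, value):
        self.data[key] = value
        line = json.dumps({"op": "SET", "key": key, "value": value}) + "\n"
        self.log.write(line.encode())  # every update also costs a log append
        self.pending += 1
        if self.pending >= self.fsync_every:
            self.sync()

    def sync(self):
        self.log.flush()
        os.fsync(self.log.fileno())
        self.pending = 0
        self.fsyncs += 1

    def close(self):
        self.sync()
        self.log.close()


def replay(path):
    """Rebuild the in-memory table by re-applying the log from the start."""
    data = {}
    with open(path, "rb") as f:
        for line in f:
            entry = json.loads(line)
            data[entry["key"]] = entry["value"]
    return data
```

A larger `fsync_every` means fewer syncs but more writes lost on a crash, which is the performance versus recoverability trade-off mentioned above. Persisting the table itself would remove both the append and the replay step.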

Below is a summary of best practices to exploit NVM performance benefits (when porting applications from disk to NVM)

More Notes on FLEX

The authors note that applying FLEX to legacy programs could be simple:

File System Scalability

Coming soon…

Thoughts and Comments

Questions

  1. In the SQLite evaluation, the authors mention that WAL is slow under xfs-DAX and ext4-DAX because these two fs keep their allocator state in NVM (while NOVA keeps it in DRAM). The stated reason is that block-based fs keep allocator state on persistent media to avoid an expensive media scan after a crash. Why do the NVM extensions still do this then?

  2. The authors claim that by making Redis’ internal top-level hash table persistent, we can eliminate the need for the AOF. What does this mean for Redis transaction support? Do transactions use a different log to achieve atomicity?

  3. The paper’s strategy to optimize Redis on NVM is to persist the internal hash table, thus eliminating the AOF. But doesn’t this defeat the purpose of an in-memory database, since the database is not in memory anymore (I am assuming the internal hash table is a significant portion of the db)? We eliminated the AOF from NVM, but we persisted another (potentially larger) structure. How does this improve performance?

Sources

[1] Jian Xu, Juno Kim, Amirsaman Memaripour, and Steven Swanson. 2019. Finding and Fixing Performance Pathologies in Persistent Memory Software Stacks. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’19). Association for Computing Machinery, New York, NY, USA, 427-439. DOI: https://doi.org/10.1145/3297858.3304077

[2] https://www.sqlite.org/tempfiles.html

[3] https://www.sqlite.org/wal.html

[4] https://aws.amazon.com/nosql/in-memory/