Computer Architecture: A Quantitative Approach, Appendix B: Memory Hierarchy Review

24 Jun 2021

Oh boy do I need a refresher on caches and memory. I guess it's only natural since I have not touched this material since the undergrad computer architecture course.

Cache Basics

Caches operate at block (or line) granularity. A block contains multiple words, which exploits spatial locality.

Block placement: where should blocks be placed in the cache? Three basic options: fully associative, direct mapped, and set associative. A set is a group of contiguous blocks in the cache. Figure 1 is a pretty good illustration.

Figure 1: Cache organization illustration [1].

Block identification: how can we find a block in the cache?

Figure 2: Cache address breakdown [1].
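To make the Figure 2 breakdown concrete, here is a minimal sketch that splits a byte address into tag, set index, and block offset. The cache geometry (32 KiB, 4-way, 64-byte blocks) is an assumption for illustration, not a configuration from the appendix:

```python
CACHE_SIZE = 32 * 1024   # 32 KiB total capacity (assumed)
BLOCK_SIZE = 64          # 64-byte blocks (assumed)
ASSOC = 4                # 4-way set associative (assumed)

NUM_SETS = CACHE_SIZE // (BLOCK_SIZE * ASSOC)   # 128 sets
OFFSET_BITS = (BLOCK_SIZE - 1).bit_length()     # 6 bits to pick a byte in a block
INDEX_BITS = (NUM_SETS - 1).bit_length()        # 7 bits to pick a set

def split_address(addr: int):
    """Split a byte address into (tag, set index, block offset)."""
    offset = addr & (BLOCK_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
print(hex(tag), hex(index), hex(offset))  # 0x91a2 0x59 0x38
```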

Block replacement: which block should we evict on a miss? Three primary strategies: random, least recently used (LRU), and FIFO.
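Here is a minimal sketch of LRU for a single cache set, using an OrderedDict to track recency; real hardware tracks recency with a few bits per block rather than a full ordering structure:

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with LRU replacement; capacity = associativity."""
    def __init__(self, ways: int):
        self.ways = ways
        self.blocks = OrderedDict()   # tag -> data, ordered oldest -> newest

    def access(self, tag):
        if tag in self.blocks:                # hit: mark as most recently used
            self.blocks.move_to_end(tag)
            return "hit"
        if len(self.blocks) >= self.ways:     # miss in a full set: evict the LRU block
            self.blocks.popitem(last=False)
        self.blocks[tag] = None               # fill the block
        return "miss"

s = LRUSet(ways=2)
print([s.access(t) for t in [1, 2, 1, 3, 2]])
# ['miss', 'miss', 'hit', 'miss', 'miss'] -- tag 2 was evicted to make room for tag 3
```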

Write strategy: what happens on a write? Two basic options: write-through and write-back.
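A minimal sketch of the difference, assuming the written block is already resident in the cache: write-through updates memory on every write, while write-back only sets a dirty bit and defers the memory update until eviction:

```python
class Block:
    def __init__(self):
        self.data = 0
        self.dirty = False   # only meaningful under write-back

def write(block, addr, value, memory, policy):
    """Write a value to a cached block under the given policy."""
    block.data = value
    if policy == "write-through":
        memory[addr] = value     # memory is updated on every write
    else:                        # write-back
        block.dirty = True       # memory update deferred until eviction

def evict(block, addr, memory):
    """On eviction, a write-back cache must flush dirty blocks."""
    if block.dirty:
        memory[addr] = block.data
        block.dirty = False

memory = {}
b = Block()
write(b, 0x40, 7, memory, policy="write-back")
print(memory.get(0x40))   # None -- memory is stale until eviction
evict(b, 0x40, memory)
print(memory.get(0x40))   # 7
```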

Six Basic Cache Optimizations

The average memory access time (AMAT) = hit time + miss rate × miss penalty. Thus, to improve cache performance, we can attack each of the three variables: reduce the miss rate, reduce the miss penalty, and reduce the hit time.
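As a quick worked example with assumed numbers (1-cycle hit time, 2% miss rate, 100-cycle miss penalty; these figures are illustrative, not from the appendix):

```python
hit_time = 1          # cycles
miss_rate = 0.02      # 2% of accesses miss
miss_penalty = 100    # cycles to fetch the block from the next level

amat = hit_time + miss_rate * miss_penalty
print(amat)  # 3.0 cycles -- a 2% miss rate triples the average access time here
```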

To understand misses better, recall the 3 Cs:

  • Compulsory: the very first access to a block must miss, since the block cannot be in the cache yet.
  • Capacity: the cache cannot hold all the blocks the program needs, so blocks are evicted and later re-fetched.
  • Conflict: in direct-mapped or set-associative caches, too many blocks map to the same set, forcing evictions even when other sets have room.

The six basic optimizations are:

  1. Larger block size to reduce miss rate. Pros: lower miss rate due to higher spatial locality. Cons: increased miss penalty (a bigger block takes longer to fetch) and potentially more conflict misses, since a fixed-size cache holds fewer blocks.
  2. Larger caches to reduce miss rate. Pros: fewer capacity misses. Cons: longer hit time and higher cost and power.
  3. Higher associativity to reduce miss rate. Two observations from the data: 1) eight-way set associative is practically as effective as fully associative in reducing cache misses (for a particular cache size); 2) the 2:1 cache rule of thumb, where a direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2.
  4. Multilevel caches to reduce miss penalty. One thing to consider is whether L1 data should also be in L2, which is multilevel inclusion. On the other hand, we may want multilevel exclusion for small L2 caches.
  5. Give priority to read misses over writes to reduce miss penalty. With write buffers, we need to be mindful of read-after-write data dependencies. When a write is sitting in the write buffer, a following dependent read can check the contents of the write buffer on a read miss instead of waiting for the entire buffer to finish writing.
  6. Avoid address translation during indexing of the cache to reduce hit time.
    • Since processors use virtual addresses and DRAM uses physical addresses, the virtual -> physical conversion needs to happen somewhere. We could make the cache operate entirely on virtual addresses. This means that on a cache hit, no virtual -> physical translation is needed.
    • Although pure virtual caches do not require address translations, the cache must be flushed on a process context switch (since each process has its own virtual addresses). It also suffers from aliasing, where a program may have multiple virtual addresses mapped to the same physical address (e.g., mmap-ing the same file twice at different addresses). The cache will then contain two copies of the same data, causing coherency issues. The third drawback is protection. Page-level protection is enforced during the virtual -> physical translation. If the cache sits before this translation, one process may access another process’ protected data.
    • We should distinguish between two parts of the address: index (find the set) and tag (compare addresses within the set).
    • Recall that the major drawback of a physical cache is its hit latency. If we can somehow hide this translation latency, then we can use a physical cache and avoid the virtual-cache drawbacks above. One solution is a virtually indexed, physically tagged (VIPT) cache. In this scheme, we use part of the page offset, which is the same for virtual and physical addresses (and thus requires no translation), as the cache index. While we search the cache set with this index, we translate the virtual page number (using the TLB) to a physical page number. We then use this physical page number as the tag to find the cache block within the set. An example is shown in Figure 6, and a sketch of the index-size constraint this imposes follows Figure 3.

Figure 3: Virtual cache vs physical cache (page table not shown) [2].
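The VIPT scheme only works if the index and block-offset bits fit entirely within the page offset; otherwise the index would depend on bits that change under translation. Equivalently, cache size must be at most page size × associativity. A minimal sketch with assumed sizes (4 KiB pages, 32 KiB 8-way cache with 64-byte blocks):

```python
import math

PAGE_SIZE = 4096         # 4 KiB pages -> 12 untranslated page-offset bits (assumed)
CACHE_SIZE = 32 * 1024   # geometry assumed for illustration
BLOCK_SIZE = 64
ASSOC = 8

offset_bits = int(math.log2(BLOCK_SIZE))                          # 6
index_bits = int(math.log2(CACHE_SIZE // (BLOCK_SIZE * ASSOC)))   # 6
page_offset_bits = int(math.log2(PAGE_SIZE))                      # 12

# The set index can be taken from the virtual address before translation
# only if it lies entirely within the page offset:
print(offset_bits + index_bits <= page_offset_bits)   # True
# Equivalent capacity form: cache size <= page size * associativity.
print(CACHE_SIZE <= PAGE_SIZE * ASSOC)                # True (32 KiB == 32 KiB)
```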

Takeaway: many of these optimizations improve one aspect at the expense of another.

Virtual Memory

Each process is assigned its own virtual address space and page table. When a process accesses memory, the virtual address must be translated into a physical address. The virtual address and the physical address share the same page offset; only the page number needs to be translated from virtual to physical (since virtual pages are mapped to physical pages).
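A minimal sketch of that split, assuming 4 KiB pages (12 offset bits); the address and constants are illustrative:

```python
PAGE_SIZE = 4096
OFFSET_BITS = 12                       # log2(4096)

va = 0x00402ABC                        # an arbitrary virtual address
vpn = va >> OFFSET_BITS                # virtual page number: goes through translation
page_offset = va & (PAGE_SIZE - 1)     # carried over unchanged to the physical address
print(hex(vpn), hex(page_offset))      # 0x402 0xabc
```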

Page placement: pages can be placed anywhere in memory, because the miss penalty (a disk access) is so high that minimizing the miss rate is worth the extra placement flexibility. This is analogous to a fully associative cache.

Page identification (how to find a page in main memory): look up the page table, which maps a virtual page number to a physical page number, as shown in Figure 5. A translation lookaside buffer (TLB) caches page table entries to speed up this translation.
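Here is a minimal sketch of that lookup order, with both the TLB and the page table modeled as plain dicts and the mapping entry invented for illustration; a real TLB is a small set-associative hardware structure:

```python
page_table = {0x402: 0x7f3}   # virtual page -> physical page (invented entry)
tlb = {}                      # caches recently used translations

def translate(vpn):
    if vpn in tlb:                # TLB hit: skip the page table entirely
        return tlb[vpn]
    ppn = page_table[vpn]         # TLB miss: fall back to the page table
    tlb[vpn] = ppn                # cache the translation for next time
    return ppn

print(hex(translate(0x402)))  # 0x7f3 (TLB miss, translation cached)
print(hex(translate(0x402)))  # 0x7f3 (TLB hit)
```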

Page replacement: almost all OSes approximate LRU. This is usually implemented with a use bit, which is set whenever the page is accessed; the OS periodically clears the use bits, so a set bit means the page was used recently.
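A minimal sketch of this approximation, with victim selection simplified to "prefer any page whose use bit is still clear":

```python
import random

use_bit = {}                      # page -> use bit

def touch(page):
    use_bit[page] = True          # hardware sets the bit on every access

def clear_use_bits():
    for page in use_bit:          # OS periodically clears all bits
        use_bit[page] = False

def pick_victim():
    # Prefer a page that has not been used since the last sweep.
    cold = [p for p, used in use_bit.items() if not used]
    return random.choice(cold) if cold else random.choice(list(use_bit))

touch("A"); touch("B")
clear_use_bits()
touch("A")
print(pick_victim())              # "B" -- its use bit is still clear
```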

Write strategy: since disks are very slow, existing designs always use write-back and never write-through.

Figure 4: Virtual memory. The program consists of four pages, each 4KB [1].

Figure 5: Virtual address mapping [1].

An example that includes both virtual memory and caches is shown in Figure 6. First, we try to find the data in the L1 cache by performing the virtual -> physical page number translation and the cache index lookup in parallel. To perform the translation, the virtual page number is divided into a tag and an index for the TLB, just like with a cache. If a hit occurs in the L1 cache, we are done. If not, we look up the data in L2. Note that L2 operates with physical addresses only.

Figure 6: Virtual memory and cache example [1].

Thoughts and Comments

Sources

[1] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 6th ed. Cambridge, MA: Morgan Kaufmann, 2019.

[2] Arvind, "Modern Virtual Memory Systems," MIT 6.823 Computer System Architecture lecture notes, 2005. [Online]. Available: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-823-computer-system-architecture-fall-2005/lecture-notes/l10_vrtl_mem.pdf