#26 Profiling a warehouse-scale computer

15 Sep 2022

Link https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44271.pdf
Year ISCA 2015

The Big Ideas

Backend Boundedness: Memory Bottleneck

Figure 6 is, in my opinion, the most important result in this paper, because it shows that the dominant bottleneck in Google workloads is memory. Unfortunately, the analysis of this major bottleneck is lacking. The only explanations seem to be the title of Section 7 (dependent accesses) and the bandwidth experiments in Figure 13. Perhaps the student did not have access to the workloads themselves and was only given execution traces, and thus could only infer the cause of the memory bottleneck.

Figure 13 shows that BW utilization is low, so the memory bottleneck must be caused by high data access latency rather than by bandwidth saturation. Based on the formula access latency = hit% * hit time + miss% * miss time, and the fact that the hit and miss times do not change across workloads, the high data access latency must come from a high cache miss rate. There can be multiple causes for the high miss rate; I can currently think of two: independent random accesses and dependent random accesses. Here, by random, I mean any access pattern that the prefetcher cannot identify. Note the difference: random accesses do not have to be dependent on each other, and dependent accesses do not have to be random. If dependent accesses end up being sequential, the miss rate will be low, as in the following example.

// arr[i] holds i + 1, so the value loaded in one iteration determines the
// address read in the next iteration (a loop-carried dependence).
int arr[16] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
int sum = 0, prev = 0;
for (int i = 0; i < 16; i++) {
    sum += arr[prev];   // this load's address depends on a previously loaded value
    prev = arr[i];      // feeds the address of the next iteration's load
}

Here, the reads of arr are dependent (loop carried): the address of each read comes from a previously loaded value. However, the accesses end up being sequential, and (assuming 4-byte ints and 64-byte cache lines, so the whole array fits in one line) there will only be one cache miss across all 16 loop iterations.
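To see why the miss rate, rather than the hit or miss times, is what moves the needle in the latency formula above, here is a rough numeric sketch. The 4-cycle hit time and 200-cycle miss time are assumed round numbers for illustration, not measurements from the paper.

#include <stdio.h>

// Rough numeric sketch of access latency = hit% * hit_time + miss% * miss_time.
// The hit/miss costs below are assumptions for illustration only.
int main(void) {
    const double hit_time = 4.0, miss_time = 200.0;
    const double miss_rates[] = {0.01, 0.05, 0.20};
    for (int i = 0; i < 3; i++) {
        double m = miss_rates[i];
        double avg = (1.0 - m) * hit_time + m * miss_time;
        printf("miss rate %4.0f%% -> average access latency %6.1f cycles\n", m * 100, avg);
    }
    return 0;
}

With these numbers, going from a 1% to a 20% miss rate raises the average access latency from roughly 6 to roughly 43 cycles, about a 7x swing, which is exactly the kind of effect that would show up as backend boundedness.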

So there are two potential causes of a high cache miss rate: independent random accesses and dependent random accesses. If the high miss rate were caused by the former, the memory BW utilization should be high, since [1] shows that DRAM BW can be saturated even under random accesses as long as they are independent and can be kept in flight concurrently. So I agree with the authors' hypothesis that the memory bottleneck is caused by dependent accesses that cannot be effectively prefetched.

However, a distinction should be made between dependent accesses that cannot be effectively prefetched by automatic solutions (e.g. the hardware prefetcher, gcc software prefetching) and accesses for which prefetching is fundamentally impossible (e.g. the next address cannot be known ahead of time). Given that these Google workloads are likely already highly optimized, I imagine there are already explicit prefetching optimizations in the code. Or perhaps doing so is too costly in engineering terms, and a better programmer-assisted prefetcher is required.
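For concreteness, here is a minimal sketch (my own, not from the paper) of what explicit, programmer-assisted prefetching looks like on a dependent pointer-chasing traversal, using GCC/Clang's __builtin_prefetch. It only helps when the next address is known at least a little ahead of when it is needed, which is exactly what is in question for these workloads.

#include <stddef.h>

struct node {
    struct node *next;
    long payload;
};

long walk(const struct node *head) {
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        if (n->next != NULL)
            // Hint the next node into the cache while we work on the current one.
            __builtin_prefetch(n->next, 0 /* read */, 1 /* low temporal locality */);
        sum += n->payload;  // "work" per node; with too little work the prefetch cannot hide the miss
    }
    return sum;
}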

Memory Bandwidth Underutilization

Back to Figure 13. The fact that 95% of the sample traces show memory BW utilization of less than 31% is unexpected. This means the memory BW is overprovisioned. So what can we do about this?

Note that DRAM latency increases as BW utilization increases (loaded latency), so running memory close to peak bandwidth would hurt exactly the latency-bound accesses identified above. Perhaps 31% BW utilization is not terrible considering the overall performance.

This abundance of DRAM BW may be good news for multi-tiered memory systems. The application can perform more aggressive prefetching from lower-tier memory (e.g. CXL-attached memory) into DRAM, trading spare bandwidth for latency, although in this case the best achievable latency is still DRAM latency.
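As a hedged sketch of that idea (my own, assuming the lower tier is exposed to the application as an ordinary mapped region), spare bandwidth could be spent on bulk, ahead-of-time staging hints such as madvise(MADV_WILLNEED); how far ahead and how much to stage per hint is the bandwidth-for-latency knob.

#include <stddef.h>
#include <sys/mman.h>

// Illustration only: hint the kernel to start bringing the next chunk of a
// lower-tier (e.g. CXL-backed) mapping toward near memory before it is touched.
// This spends bandwidth now so that later demand accesses see lower latency.
static void stage_next_chunk(char *lower_tier_base, size_t offset, size_t chunk_bytes) {
    (void)madvise(lower_tier_base + offset, chunk_bytes, MADV_WILLNEED);
}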

Workload Diversity

The main observation of this section is that there is no clear hotspot in terms of the profiled binaries and the functions within each binary. The hottest binary "only" consumes 9.9% of total cycles in the data center. I would imagine that ~10% of Google's data center cycles is still a huge number of cycles and worth optimizing for. In addition, one dimension missing from Figures 1 and 3 is whether similar computation and memory-access code recurs across binaries (beyond the data center taxes). For example, perhaps half of the 50 hottest binaries heavily use breadth-first search. Although, given the large number of leaf functions in each binary, this dimension may or may not be significant.

One interesting observation from Figure 3 is that the distribution follows the 80-20 rule quite closely: 18% (353/2000) of the leaf functions do 80% of the work (total cycles). The authors mention that this contrasts with the data analytics workload from CloudSuite, where 3 functions contribute 65% of the runtime. Well, how many functions does this data analytics workload have in total? Furthermore, perhaps we should be looking at the number of lines of code instead of the number of functions. Within a hot function, there could also be hot instructions (e.g. repeatedly executed loops). Perhaps a lines-of-code vs. distribution-of-cycles graph would tell a different story.

Datacenter Tax

Since there are few opportunities to optimize WSC workloads via Amdahl's law in the binary/function dimension, the authors instead look for dominant common denominators across binaries, shown in Figure 4. While the result of 22-27% of all WSC cycles seems to suggest that there is a dominant bottleneck in the data center taxes, note that this figure is six categories of taxes combined.

The most cycle-consuming tax is memory allocation, consuming a bit less than 10% of total cycles. This is surprising; I did not know memory allocation was such a cycle-intensive kernel. Does the cycle count include time spent waiting for memory? The authors mention that memory allocation involves recursive data structures, which I assume are also stored in DRAM. If these accesses to recursive data structures exhibit dependent patterns that are difficult to prefetch, then this bottleneck makes sense. Unfortunately, as with the dependent data accesses, the analysis of allocation is again somewhat shallow.
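To make the dependent-access connection concrete, here is a toy free-list pop (my illustration, not how tcmalloc or the allocators in the paper actually work): allocation becomes a short pointer chase, so a cold free list turns each allocation into one or more dependent, hard-to-prefetch DRAM accesses.

#include <stddef.h>

struct free_block {
    struct free_block *next;   // freed blocks are linked through their own storage
};

static struct free_block *free_list_head;

void *toy_alloc(void) {
    struct free_block *block = free_list_head;   // load #1: the list head
    if (block != NULL)
        free_list_head = block->next;            // load #2: its address depends on load #1
    return block;
}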

Note that Figure 1 shows that the hottest binary also consumes ~10% of total cycles. It is not clear to me why the conclusion drawn from Figure 1 is that "there is no killer application to optimize for," while the conclusion here is that the data center taxes are "prime candidates for hardware acceleration."

Another question: can all allocations be treated the same? Are certain allocations more cycle-intensive than others? Are there hotspots within memory allocation (i.e., are there code regions within the allocator that are hot)?

Aside: the idea of a data center tax has been echoed by other works such as [2], although the definition of "datacenter tax" seems to differ somewhat between the two. In this work, taxes are "common building blocks" shared across workloads, some of which are integral parts of the program: the workloads cannot run without them (e.g. memory allocation), even on a single node. In my opinion, Meta's "datacenter tax" is closer to being a true "tax", covering things like logging and routing; on a single machine, the workloads can still run without them. This is a small distinction that does not affect the contributions of this paper from Google, but it is interesting that there does not yet seem to be a standard definition of the data center tax.

Thoughts and Questions

Sources

[1] https://arxiv.org/pdf/1908.03583.pdf

[2] https://www.pdl.cmu.edu/ftp/NVM/tmo_asplos22.pdf