The Tail at Scale

20 Mar 2022

This post is my notes while going through [1]. See the Sources section for all sources used in this post. I do not own any of the materials.

The Big Ideas

Why Are Latency Variabilities Important?

Results need to be returned to the user within tens of milliseconds. Prior studies have reported that for the user to feel as if the system is reacting instantaneously, the response time needs to be lower than 100ms.

Why Do Latency Variabilities Exist?

Applications contend for shared resources in unpredictable ways; maintenance activities such as periodic log compaction and garbage collection run in the background; CPUs may throttle when they overheat; SSDs may need to perform their own garbage collection.

Queuing delays on a server before a request begins execution are a major source of variability. It turns out that once a request begins execution, the latency variation drops significantly.

Latency variability of individual components is magnified at the system/service level. This is similar to fault tolerance. Based on [3], if components are connected in “series”, i.e. a single failure causes system failure, then system availability is the product of each component’s availability. E.g. ten 99.999% components: 99.999%^10 ≈ 99.99% (ten 5-nines components ≈ 4 nines).

If components are connected in “parallel”, e.g. replicated, then system availability is 1 - the system failure rate, where the system failure rate is the product of the failure rates of the components. E.g. two 95% components: 1 - 5%×5% = 99.75%.
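The series/parallel arithmetic above is easy to check directly. A minimal sketch (function names are mine, not from the paper or [3]):

```python
# Series vs. parallel availability, illustrating the arithmetic above.

def series_availability(component_availability: float, n: int) -> float:
    """All n components must be up: multiply the availabilities."""
    return component_availability ** n

def parallel_availability(component_availability: float, n: int) -> float:
    """The system fails only if every replica fails simultaneously."""
    return 1 - (1 - component_availability) ** n

print(f"{series_availability(0.99999, 10):.6f}")  # ten 5-nines components -> ~0.999900 (4 nines)
print(f"{parallel_availability(0.95, 2):.4f}")    # two 95% replicas -> 0.9975
```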

To illustrate the effect of scale on latency variability, the author gives the following example: if a single server is slow for only 1 in 100 requests, a request that fans out to 100 such servers and waits for all of them will be slow 63% of the time (1 - 0.99^100 ≈ 63%).
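The paper’s fan-out example can be reproduced in a couple of lines (assuming the servers’ slow responses are independent):

```python
# Probability that a fan-out request is slow, given each of n servers
# is independently slow with probability p and we must wait for all of them.

def p_request_slow(p_server_slow: float, n_servers: int) -> float:
    # The request is slow iff at least one server is slow.
    return 1 - (1 - p_server_slow) ** n_servers

print(f"{p_request_slow(0.01, 100):.0%}")  # ~63%
```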

Try to Eliminate Latency Variability

Live with Latency Variability

Just like fault tolerance, it is infeasible to completely eliminate all latency variation in a system. The paper categorizes coping techniques into 1) short-term, reacting within milliseconds, and 2) long-term, operating over tens of seconds.

Short term

Assume we have multiple replicas of each data item (kept for higher throughput/availability). We can use these replicas to improve tail latency as well. This also assumes that the cause of the latency variation does not affect all replicas equally.

Long term

Some Google workloads allow “good-enough” responses instead of requiring the perfect answer every time. Occasionally returning a good-enough response helps reduce tail latency.
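One way this plays out in a fan-out system: wait until a deadline, then answer with whatever fraction of leaf servers has responded rather than waiting for the slowest stragglers. A sketch under that assumption (names and numbers are mine):

```python
import asyncio
import random

async def query_leaf(leaf_id: int) -> str:
    # Simulated leaf latency: usually fast, sometimes a straggler.
    await asyncio.sleep(random.choice([0.01, 0.01, 0.3]))
    return f"partial result {leaf_id}"

async def good_enough_search(n_leaves: int = 20, deadline: float = 0.05) -> list[str]:
    tasks = [asyncio.create_task(query_leaf(i)) for i in range(n_leaves)]
    # Collect whatever has finished by the deadline.
    done, pending = await asyncio.wait(tasks, timeout=deadline)
    for task in pending:
        task.cancel()  # skip the stragglers; the partial answer is good enough
    return [task.result() for task in done]

results = asyncio.run(good_enough_search())
print(f"answered with {len(results)}/20 leaves")
```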

Sources

[1] https://research.google/pubs/pub40801/

[2] https://en.wikipedia.org/wiki/Head-of-line_blocking

[3] http://www.webtorials.com/main/eduweb/manfr99/tutorial/five-nines/five-nines.pdf