#24 The Google File System

22 Jan 2022

Link https://dl.acm.org/doi/10.1145/1165389.945450
Year SOSP 2003

The Big Ideas

Background

The fundamental goal of a large-scale file system (or of distributed systems in general) is to offer high aggregate performance using clusters of machines. But horizontal scaling is not easy: high performance usually means replication -> replication requires extra work to ensure consistency and fault tolerance -> that extra work lowers performance.

^ widely accepted prior wisdom

This is a loop, or rather a tradeoff, that distributed systems must face.

Open questions: how does file append atomicity work on Linux, and what guarantees does it provide? What are file namespaces?

Details

Assumptions

  1. Hardware components fail on a regular basis.
  2. Files are large.
  3. Heavy large sequential reads + small random reads + large sequential appends.
  4. Heavy multi-client concurrent appends.

Architecture and Operation

GFS uses a single master, multiple chunk servers, and multiple clients. Each file is divided into fixed-size 64 MB chunks, which are replicated and stored across multiple chunk servers.

The master maintains only the system metadata while the chunk servers interact directly with the clients to handle data transfers. GFS is designed to minimize master interaction with the client to avoid performance bottlenecks on the master server. Separating metadata and data operations is a key insight of GFS, as these two types of data have very different access patterns. Google workloads typically require large and sequential data accesses while metadata operations are random and small.

Below is a sketch, written as Go-style structs, of the data structures maintained in the master node, which will help us understand GFS’s operations [1].

import "time"

type Master struct {
    files map[string]*File // file namespace, indexed by filename
    opLog OperationLog     // metadata operation log, persisted to disk
}

type File struct {
    name   string
    chunks []*ChunkInfo // metadata for each 64 MB chunk, in file order
}

// ChunkInfo is the master's metadata for one chunk, looked up through the
// chunk's 64-bit globally unique handle.
type ChunkInfo struct {
    handle   ChunkHandle
    servers  []ChunkServerAddr // chunk servers that hold a replica of this chunk
    version  uint64            // chunk version number
    primary  ChunkServerAddr   // primary server for this chunk under the current lease
    leaseExp time.Time         // when the current lease expires
}

type ChunkHandle uint64     // 64-bit globally unique chunk ID
type ChunkServerAddr string // e.g. host:port of a chunk server
type OperationLog struct{}  // placeholder; the real log is replicated and checkpointed
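
Given these structures, the master’s side of a metadata lookup is little more than a map lookup plus an array index. A hypothetical sketch (locking and error reporting omitted; getChunkLocations is my name, not the paper’s):

// getChunkLocations returns the master's metadata for one chunk of a file,
// or nil if the file or chunk index is unknown.
func (m *Master) getChunkLocations(filename string, chunkIndex int) *ChunkInfo {
    f, ok := m.files[filename]
    if !ok || chunkIndex >= len(f.chunks) {
        return nil // unknown file or chunk index out of range
    }
    // The reply carries only metadata (chunk handle, replica locations);
    // file data itself never flows through the master.
    return f.chunks[chunkIndex]
}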

Read Operation

  1. The application issues a read request with the desired file name and byte offset.
  2. The GFS client translates the offset into a chunk index within that file and sends the (file name, chunk index) pair to the GFS master.
  3. The GFS master returns the chunk handle (a 64-bit globally unique chunk ID) and the chunk server locations to the client. Why does the client need the chunk handle? See the next step.
  4. The GFS client then uses a chunk server location (likely an IP address) to send the chunk read request to that chunk server. Since each chunk server stores many chunks, the client must include the chunk handle. Notice that the master is no longer involved.
  5. The chunk server returns the requested data directly to the client.
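
To make this concrete, here is a minimal client-side sketch of the read path, reusing the ChunkHandle type from above. The stub interfaces and method names (GetChunkLocations, ReadChunk) are my own invention; the paper describes the protocol, not a concrete API. The sketch also ignores location caching and reads that span chunk boundaries.

const chunkSize = 64 << 20 // 64 MB

type MasterStub interface {
    // Step 3: return the chunk handle and replica locations for the chunk
    // at chunkIndex within the named file.
    GetChunkLocations(filename string, chunkIndex int64) (ChunkHandle, []ChunkServerStub, error)
}

type ChunkServerStub interface {
    // Steps 4-5: read length bytes starting at offset within the chunk.
    ReadChunk(h ChunkHandle, offset, length int64) ([]byte, error)
}

type Client struct {
    master MasterStub
}

// Read returns length bytes of filename starting at byte offset.
func (c *Client) Read(filename string, offset, length int64) ([]byte, error) {
    // Step 2: translate the byte offset into a chunk index within the file.
    chunkIndex := offset / chunkSize

    // Step 3: ask the master for the chunk handle and chunk server locations.
    handle, servers, err := c.master.GetChunkLocations(filename, chunkIndex)
    if err != nil {
        return nil, err
    }

    // Steps 4-5: read directly from one of the chunk servers, passing the
    // handle so the server knows which of its chunks we mean; the master
    // is not involved anymore.
    return servers[0].ReadChunk(handle, offset%chunkSize, length)
}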

Write Operation and Dataflow

Writing is more complex than reading because it changes data state. Since GFS replicates each chunk, it needs to somehow keep the replicas consistent (later we will see that while GFS does provide some level of consistency, it is not very strong; however, the authors believe it is strong ENOUGH if applications are written in specific ways).

GFS provides consistency by selecting a primary replica, which can be viewed as a leader-based (primary/backup) approach. For each chunk, every so often, the master selects one of the replica chunk servers as that chunk’s primary and grants it a lease. The primary dictates the order of write operations for the chunk, and the secondary replicas apply them in the same order.
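
A rough sketch of what granting a lease could look like on the master, using the ChunkInfo struct from above. The 60-second initial lease timeout and the version-number bump are from the paper; the function itself and the replica-picking policy are illustrative.

const leaseDuration = 60 * time.Second // initial lease timeout given in the paper

// grantLease picks a primary for a chunk whose lease has expired (or was
// never granted). Sketch only: the real master also extends leases via
// heartbeats, revokes them, and re-replicates lost chunks.
func (m *Master) grantLease(ci *ChunkInfo, now time.Time) {
    if now.Before(ci.leaseExp) {
        return // a primary already holds a live lease for this chunk
    }
    // Bump the version number so replicas that miss this round of
    // mutations can later be detected as stale.
    ci.version++
    ci.primary = ci.servers[0] // illustrative policy: just pick the first replica
    ci.leaseExp = now.Add(leaseDuration)
    // The new version number must reach the operation log (and the
    // replicas) before the master replies to any client.
}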

Now we are ready to take a look at the actions required for write operations. I found this to be one of the most confusing parts of the paper. Let us focus on append operations, which are the common case for GFS:
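
Roughly, Section 3.1 of the paper describes the following control flow, sketched below from the client’s side. It assumes the read sketch’s stub interfaces are extended with FindLeaseHolder, PushData, and Write methods; the RPC names are made up, but the flow is the paper’s.

// writeChunk sketches the control flow for one mutation (a write or a
// record append) to a single chunk, following Section 3.1 of the paper.
func (c *Client) writeChunk(handle ChunkHandle, data []byte) error {
    // 1-2. Ask the master which replica holds the lease (the primary) and
    //      where the secondaries are; the master grants a fresh lease if
    //      none is outstanding. Clients cache this answer.
    primary, secondaries, err := c.master.FindLeaseHolder(handle)
    if err != nil {
        return err
    }

    // 3. Push the data to every replica. The sketch does this one by one;
    //    GFS actually pipelines the data along a chain of chunk servers,
    //    and this data flow is decoupled from the control flow below.
    for _, s := range append([]ChunkServerStub{primary}, secondaries...) {
        if err := s.PushData(handle, data); err != nil {
            return err
        }
    }

    // 4-7. Send the write request to the primary. The primary assigns the
    //      mutation a serial number (for a record append it also picks the
    //      offset, padding the chunk if the record would not fit), applies
    //      it locally, forwards the request to the secondaries in that
    //      order, waits for their acks, and only then replies. On any
    //      error the client retries, which is why duplicates and padding
    //      can end up in the file.
    return primary.Write(handle, data)
}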

There is a lot more stuff going on in GFS, including Atomic Append, File Management, and Fault Tolerance. We might come back to study them in the future.

Further Readings

Sources

[1] https://www.youtube.com/watch?v=EpIgvowZr00