Understanding Chubby: Google's Distributed Lock Service

Coordination between multiple machines is one of the trickiest challenges in building distributed systems: How do you elect a leader? How do you ensure only one server is handling critical operations? How do you manage configuration across your entire fleet?
Google faced these problems at massive scale. Critical systems such as the Google File System, Bigtable, and MapReduce needed to elect leaders, partition work between servers, and perform other coordination tasks. Each team had developed its own ad-hoc solution, duplicating effort and sometimes requiring engineers to intervene manually during failovers. At Google's scale, this approach was unsustainable.
To address these recurring challenges, Google developed Chubby, a distributed lock service for coordination and synchronization. This article explores Chubby's system design and the key technical decisions that made it successful. We'll examine why Google chose a lock service over alternatives, how the system works internally, and the clever innovations that solved fundamental distributed systems problems.
Design Rationale
Consensus algorithms like Paxos and Raft can achieve distributed coordination. So why build a centralized lock service instead of providing a consensus library that each service could use directly? After all, a centralized service introduces an additional network hop and creates a dependency for all client services.
The paper presents four key arguments for choosing a lock service approach:
Incremental Development: Most systems start as simple prototypes without considerations for consensus protocols. As they mature and gain importance, adding consensus becomes necessary. Retrofitting an existing system to use a consensus library often requires significant architectural changes, while integrating with a lock service typically requires minimal code modifications.
Advertising Result: Services often need to communicate coordination results to other components. For example, when a database elects a new master, clients must discover the new master's location. This pattern naturally led to Chubby providing small file storage capabilities, allowing elected leaders to advertise their identity and location.
Developer Familiarity: Lock-based interfaces are more intuitive to most programmers than consensus protocols. This familiarity reduces the adoption barrier and makes it easier to convince development teams to use a reliable coordination mechanism.
Simplified Client Requirements: While consensus algorithms require each application to run a quorum (a majority of replicas) to achieve high availability, a lock service allows even a single client process to make progress safely. The lock service acts as a "shared electorate", eliminating the need for each client application to maintain its own quorum of replicas.
System Architecture
Core Components
Chubby consists of two main components that communicate via RPC:
Chubby Cell: A distributed service instance, typically consisting of five server replicas. One replica acts as the master, handling all client requests, while the others act as followers.
Client Library: Integrated directly into client applications, this library handles all communication with the Chubby cell, including master discovery, session management, and caching.
The Chubby Cell
Master Election: Replicas use the Paxos consensus protocol to elect a master. The master handles all read and write operations while other replicas maintain synchronized database copies.
Database Replication: The master maintains a simple database storing file content, metadata, session information, and lock state. Write operations are propagated through Paxos to ensure all replicas remain consistent. Reads are served directly by the master. For backup purposes, the master periodically writes database snapshots to GFS file servers.
Master Discovery: Clients locate the current master by querying any replica listed in DNS. Non-master replicas respond with the master's IP address, after which clients direct all requests to the master.
Fault Tolerance
Master Failure: When a master fails, the remaining replicas automatically elect a new master within seconds (typically 4-6 seconds) using Paxos consensus.
Replica Failure: Failed replicas that don't recover within a few hours are automatically replaced. The replacement process updates DNS entries and allows the new replica to catch up using database backups and incremental updates from the master.
Usage Example
Consider a distributed service running across 10 hosts that needs to elect a leader.
Each host attempts to acquire a lock on the same Chubby file (e.g., /ls/datacenter1/myservice/master). The host that successfully acquires the lock becomes the leader.
The leader can then write its IP address to the lock file, allowing followers and clients to discover the current leader's location.
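A minimal sketch of this flow in Python is shown below. The chubby_open helper, the handle methods, and the serve_as_leader/follow callbacks are all hypothetical stand-ins for a real client library; Chubby's actual API is a C++ library with different signatures.

```python
import socket
import time

LOCK_PATH = "/ls/datacenter1/myservice/master"

class ChubbyHandle:
    """Hypothetical handle; stands in for the real client library."""
    def try_acquire(self, mode: str = "exclusive") -> bool: ...
    def set_contents(self, data: str) -> None: ...
    def get_contents(self) -> str: ...

def chubby_open(path: str) -> ChubbyHandle:
    raise NotImplementedError("stand-in for the real client library")

def serve_as_leader() -> None: ...          # application-specific leader work
def follow(leader_addr: str) -> None: ...   # application-specific follower work

def participate_in_election() -> None:
    handle = chubby_open(LOCK_PATH)
    while True:
        if handle.try_acquire(mode="exclusive"):   # exactly one host succeeds
            my_addr = socket.gethostbyname(socket.gethostname())
            handle.set_contents(my_addr)           # leader advertises its address
            serve_as_leader()                      # keep the lock while leading
        else:
            leader_addr = handle.get_contents()    # followers discover the leader
            follow(leader_addr)
        time.sleep(5)                              # simple retry interval
```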
Key Design Choices
Coarse-Grained Locking
Chubby is explicitly designed for coarse-grained locking, where locks are held for extended periods (hours or days) rather than seconds or milliseconds. The paper outlines the following reasons for choosing a coarse-grained strategy:
Fine-grained locking would create an unacceptable dependency relationship between client applications and the Chubby service. If clients needed to acquire and release locks frequently, they would become highly sensitive to Chubby's availability and performance characteristics.
With coarse-grained locks, brief Chubby outages (typically under 30 seconds) have minimal impact on client applications. Services can continue operating with their existing coordination state while Chubby recovers.
Coarse-grained usage keeps Chubby's traffic low and predictable. The lock acquisition rate becomes weakly related to the overall traffic of client applications, making Chubby's load more manageable and allowing a modest number of servers to support many client services.
Advisory Lock Model
Every file and directory in Chubby can function as a reader-writer lock, supporting both exclusive (writer) mode for a single client and shared (reader) mode for multiple clients. Locks can be of two types:
Mandatory locks prevent other clients from accessing the file when a client has acquired a lock on the file.
Advisory locks allow other clients to access the file even if a client has acquired a lock on the file.
Chubby chose advisory locks over mandatory locks for the following reasons:
Chubby locks typically protect resources implemented by other services rather than the Chubby files themselves. Making a Chubby file inaccessible wouldn't meaningfully protect a MySQL database or distributed file system running on separate servers.
Mandatory locks would force administrators to shut down applications when they needed to access locked files for debugging or administrative purposes.
Developers perform error checking through assertions such as "verify lock X is held before proceeding". Since applications must cooperate for correctness anyway, mandatory locks provide little additional safety while reducing operational flexibility.
Centralized Master Architecture
Chubby uses a single master per cell that processes all read and write operations, creating an apparent single point of failure. But this approach has several advantages:
Since clients don't frequently interact with Chubby, most services remain unaffected by brief master outages.
Master elections typically complete within 4-6 seconds and almost always in under 30 seconds.
A centralized master eliminates complex distributed coordination for reads and writes, making the system easier to understand, debug, and operate.
File Interface
Chubby presents a familiar UNIX-like file system interface to make adoption easier for developers. The data is organized in a strict hierarchical tree using slash-separated paths such as /ls/datacenter1/myservice/master, /ls/datacenter1/myservice/config, and /ls/global/policies/access-control.
Path Components:
/ls: Universal prefix standing for "lock service".
Cell name: Identifies which Chubby cell (e.g., datacenter1, global).
Service path: Application-specific hierarchy for organization (e.g., myservice/config).
Each path (directory or file) in Chubby is known as a node. There are two types of nodes:
Permanent nodes persist until explicitly deleted. They are used for configuration files, access control lists (ACLs), leader-election coordination files, etc.
Ephemeral nodes are automatically deleted when no clients have them open. They are used for temporary coordination files such as client presence indicators and session-based data.
Every node (i.e., file or directory) in Chubby can function as an advisory lock. Locks can be of two types:
Exclusive locks: One client can hold writer access.
Shared locks: Multiple clients can hold reader access.
Clients interact with Chubby files through handles - references obtained by opening a node. These handles support standard operations like reading, writing, and lock acquisition, as well as Chubby-specific operations like sequencer generation for distributed coordination.
Key API Operations
Open(): Get a handle to a Chubby file/directory.
Acquire(): Obtain a lock on a node.
GetContentsAndStat(): Read file content and metadata.
SetContents(): Write data (used by leaders to advertise themselves).
GetSequencer(): Get proof of lock ownership for external services.
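The sketch below renders these operations as a hypothetical Python interface, purely to show how they hang together around handles and sequencers; the real client library is C++ and its exact signatures differ.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Tuple

@dataclass(frozen=True)
class Sequencer:
    """Proof of lock ownership (explained in the next section)."""
    lock_name: str
    mode: str          # "exclusive" or "shared"
    generation: int

class Handle(Protocol):
    """Hypothetical handle returned by Open(); method names mirror the list above."""
    def acquire(self, mode: str = "exclusive") -> bool: ...
    def get_contents_and_stat(self) -> Tuple[bytes, dict]: ...
    def set_contents(self, data: bytes) -> None: ...
    def get_sequencer(self) -> Optional[Sequencer]: ...

class Cell(Protocol):
    """Hypothetical entry point standing in for the client library."""
    def open(self, path: str, ephemeral: bool = False) -> Handle: ...
```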
Locks and Sequencers
The Delayed Message Problem
Distributed systems face a fundamental coordination challenge known as the delayed message problem:
Client A acquires lock L and sends request R1 to a file server.
Client A crashes before R1 arrives at the file server.
Client B acquires lock L (now available) and sends request R2.
Due to network delays, R2 arrives and is processed first.
R1 arrives later as a stale request from a client no longer holding the lock.
Processing R1 after R2 could cause data corruption since it represents an operation from a previous lock holder.
Chubby's Solution: Sequencers
Chubby solves this with sequencers: opaque tokens that prove current lock ownership. A sequencer contains the following information:
Lock name (e.g., /ls/datacenter1/database/master)
Lock mode (exclusive or shared)
Generation number (increments with each lock acquisition)
How External Services Use Sequencers
When a client sends a request to an external service (like a database or file server), it includes the sequencer. The external service has two validation options:
1. Real-time Validation: The service checks whether the sequencer is still valid before acting on a request, ensuring that only messages from the current lock holder are processed.
2. Cached Validation: The service compares the sequencer against the most recent one it has observed for that lock, ensuring that once a message from the current lock holder has been processed, no messages from a previous lock holder will be processed.
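As an illustration of the cached-validation option, the sketch below has a downstream service remember the highest generation number it has seen for each lock and reject requests carrying an older sequencer. The field names and the SequencerChecker class are assumptions for this example, not part of Chubby's interface.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass(frozen=True)
class Sequencer:
    lock_name: str     # e.g., /ls/datacenter1/database/master
    mode: str          # "exclusive" or "shared"
    generation: int    # increments with each acquisition of the lock

class SequencerChecker:
    """Cached validation: once generation G is accepted, anything older is stale."""
    def __init__(self) -> None:
        self._latest_seen: Dict[str, int] = {}

    def admit(self, seq: Sequencer) -> bool:
        latest = self._latest_seen.get(seq.lock_name, -1)
        if seq.generation < latest:
            return False                                  # request from a previous holder
        self._latest_seen[seq.lock_name] = seq.generation
        return True

# The delayed-message scenario: R2 (generation 8) is processed before a stale R1 (generation 7).
checker = SequencerChecker()
r2 = Sequencer("/ls/datacenter1/database/master", "exclusive", 8)
r1 = Sequencer("/ls/datacenter1/database/master", "exclusive", 7)
assert checker.admit(r2)          # current holder's request is accepted
assert not checker.admit(r1)      # stale request from the previous holder is rejected
```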
Alternative: Lock-Delay
For services that cannot use sequencers, Chubby provides lock-delay: when a lock holder crashes, other clients are prevented from acquiring the lock for a configurable period (up to 1 minute). This gives in-flight requests time to complete, though it delays failover.
Trade-off: Lock-delay provides safety for legacy services but sacrifices availability, while sequencers enable both safety and fast failover.
Session Management and Master Failover
The Session Model
A Chubby session represents the relationship between a client and a Chubby cell. Sessions serve two critical purposes: maintaining the client's lease with the Chubby service and providing a reliable channel for event delivery.
Session Lifecycle:
Creation: Established when the client first contacts the master.
Maintenance: Kept alive through periodic KeepAlive handshakes.
Termination: Ends when the client crashes, becomes idle, or explicitly closes the session.
Lease-Based Fault Tolerance
Each session has an associated lease - a time interval during which the master promises not to terminate the session unilaterally.
Key Timeouts:
Master lease timeout: Authoritative timer on the master (typically 12 seconds).
Local lease timeout: Client's conservative estimate accounting for network delays and clock skew.
Grace period: Extended timeout (45 seconds), allowing sessions to survive master failovers.
The KeepAlive Protocol
KeepAlive serves a dual purpose: session renewal and event delivery.
Normal Operation:
The client sends KeepAlive to the master.
The master blocks the request for most of the lease period.
The master responds with a new lease timeout. The master can also allow the KeepAlive to be returned early when an event or invalidation is to be delivered.
The client immediately sends the subsequent KeepAlive request.
This design ensures there's always a KeepAlive request pending at the master, enabling efficient event delivery.
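The client's side of the protocol can be pictured as a loop that always has one KeepAlive outstanding. The sketch below is only illustrative; the send_keepalive RPC, the reply fields, and the fixed skew margin are assumptions, and the real library's lease bookkeeping is more careful.

```python
import time
from dataclasses import dataclass, field
from typing import List

SKEW_MARGIN = 1.0   # assumed allowance for network delay and clock skew, in seconds

@dataclass
class KeepAliveReply:           # hypothetical shape of the RPC reply
    lease_seconds: float
    events: List[object] = field(default_factory=list)

def keepalive_loop(session, master) -> None:
    """Keep exactly one KeepAlive pending at the master (objects are hypothetical)."""
    while session.active:
        sent_at = time.monotonic()
        # The master holds this call for most of the lease period, returning early
        # only when it has an event or a cache invalidation to deliver.
        reply: KeepAliveReply = master.send_keepalive(session.session_id)
        # Advance the client's conservative local lease estimate.
        session.local_lease_deadline = sent_at + reply.lease_seconds - SKEW_MARGIN
        for event in reply.events:          # piggybacked events and invalidations
            session.handle_event(event)
        # Loop immediately so the next KeepAlive is already pending at the master.
```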
Master Failover and Grace Period
When a master failover occurs, clients lose connection with the master during the failover period. During this time, clients must determine whether the disconnection is due to a network partition or master failure.
When a client cannot connect with the master, it takes the following steps:
Continues serving requests from the cache until the local lease timeout expires.
Once the local lease timeout expires:
The client cache is disabled, and requests from the application return errors.
The client session enters a JEOPARDY state, indicating uncertainty about session validity.
The grace period timer starts (default 45 seconds), during which the client continues trying to contact any available master.
If a new master is elected before the grace period expires, the client contacts the new master, re-establishes the session, re-enables the cache, and the application resumes normal operation.
If the grace period expires before the master election, the session permanently expires, and all handles are marked invalid. The application must restart coordination completely.
The grace period allows Chubby to achieve three objectives:
Availability: Sessions survive master elections for up to 45 seconds.
Consistency: Disabling the cache during uncertainty prevents applications from reading stale data.
Operational Resilience: The Chubby library and master cooperate so that applications may not even realize a failure occurred; from the application's perspective, there was a brief delay rather than an outage.
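One way to picture the client-side behaviour is as a small state machine driven by the local lease timer and the grace-period timer. The state names, timer value, and callbacks below are assumptions made for illustration; the real library's behaviour is richer than this sketch.

```python
import time
from enum import Enum, auto

GRACE_PERIOD_SECONDS = 45.0      # default grace period mentioned above

class SessionState(Enum):
    VALID = auto()      # lease believed current; cache enabled
    JEOPARDY = auto()   # local lease expired; cache disabled, application calls fail
    EXPIRED = auto()    # grace period elapsed; all handles invalid

def on_local_lease_expired(session) -> None:
    session.state = SessionState.JEOPARDY
    session.cache_enabled = False                            # never serve possibly-stale data
    session.grace_deadline = time.monotonic() + GRACE_PERIOD_SECONDS

def on_session_reestablished(session) -> None:
    if session.state is SessionState.JEOPARDY:
        session.state = SessionState.VALID                   # recovered within the grace period
        session.cache_enabled = True

def on_grace_period_expired(session) -> None:
    session.state = SessionState.EXPIRED                     # the session is gone for good
    for handle in session.handles:
        handle.valid = False                                 # application must re-coordinate
```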
New Master Recovery Process
When a new Chubby master is elected, it takes the following steps to reconstruct the in-memory state:
Reads persistent data from database snapshots stored on GFS file servers.
Uses conservative lease assumptions about what the previous master may have granted to clients.
Builds in-memory structures for sessions, locks, and ephemeral files from the database.
Initializes a new epoch number to distinguish requests from the previous master.
Allows KeepAlive operations but initially blocks other session-related operations.
Sends fail-over events to each reconnecting session to notify clients of potential missed events.
Waits for acknowledgment of fail-over events from all sessions.
Recreates handles on demand when clients present handles from the previous master.
Resumes normal operations once state reconstruction is complete.
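The epoch number mentioned above is worth a small illustration: by tagging itself with a fresh epoch, the new master can tell apart and reject calls that were addressed to its predecessor, forcing those clients to resynchronize. The sketch below is an assumed illustration, not Chubby's actual server code.

```python
class StaleEpochError(Exception):
    """Raised when a request was issued against a previous master's epoch."""

class Master:
    def __init__(self, epoch: int) -> None:
        self.epoch = epoch                       # bumped on every master election

    def handle_request(self, request_epoch: int, operation, *args):
        if request_epoch != self.epoch:
            # The caller was talking to an old master; make it refresh its state.
            raise StaleEpochError(f"expected epoch {self.epoch}, got {request_epoch}")
        return operation(*args)
```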
Caching Strategy
To reduce read traffic and improve performance, Chubby clients maintain a consistent, write-through cache in memory that stores file data, node metadata, and even file absence information (negative caching). This caching approach enables Chubby to support tens of thousands of clients while maintaining strong consistency guarantees.
Cache Invalidation Protocol
When file data or metadata needs to be modified, Chubby follows a strict invalidation protocol to maintain consistency:
Client requests file or metadata modification (write, delete, ACL change)
Master blocks the modification and identifies all clients that may have cached the affected data
Master sends invalidations to every client with potentially cached data via their KeepAlive connections
Clients acknowledge invalidations by flushing their cache and responding via their subsequent KeepAlive request
Master proceeds with modification only after all clients have acknowledged the invalidation
During the invalidation process, the master marks the node as temporarily uncacheable. This prevents new clients from caching the data while existing clients are clearing their caches.
Only one round of invalidations is needed because the master treats the node as uncacheable until all invalidations are acknowledged. This allows reads to continue without delay while ensuring consistency.
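The write path can be sketched as: mark the node uncacheable, fan invalidations out over the pending KeepAlives, wait for acknowledgements, then apply the change. The object and method names below are illustrative only, not Chubby's actual code.

```python
def write_node(master, path: str, new_data: bytes) -> None:
    """One round of invalidation before a modification (illustrative sketch)."""
    node = master.nodes[path]
    node.cacheable = False                        # reads continue, but nobody may cache
    holders = [s for s in master.sessions if path in s.cached_paths]
    for session in holders:
        session.queue_invalidation(path)          # delivered via the pending KeepAlive
    for session in holders:
        session.wait_for_invalidation_ack(path)   # client flushed and acknowledged
    node.data = new_data                          # safe: no stale copies remain
    node.cacheable = True                         # caching may resume
```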
The caching protocol invalidates cached data on changes rather than sending updates to clients. Chubby could have chosen to send updated file contents to all caching clients, but this would be inefficient for several reasons:
Unnecessary traffic: A client that reads a file once but never again would receive every subsequent update.
Bandwidth waste: A file updated 100 times would generate 100 update messages to every caching client.
Arbitrary inefficiency: Update-only protocols can become arbitrarily expensive (i.e., the network traffic grows linearly with no upper bound) as update frequency increases.
Additional Caching Optimizations
Handle Caching:
Clients cache open file handles to avoid repeated Open() operations.
Handles capable of acquiring locks cannot be shared between multiple application threads concurrently to maintain thread safety.
Lock Caching:
Clients can hold locks longer than strictly necessary, hoping to reuse them.
When another client requests a conflicting lock, the current holder receives an event.
At that point, the lock holder can release the lock precisely when another client needs it.
This reduces lock acquisition overhead for frequently contested locks.
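A sketch of that hand-off: the holder keeps the lock cached and only releases it when Chubby delivers a conflicting-lock-request event. The event name and handle fields are assumptions for illustration.

```python
def on_conflicting_lock_request(handle) -> None:
    """Called when another client wants a lock this client is still caching."""
    if handle.lock_held and not handle.in_critical_section:
        handle.release()                      # give up the cached lock only when needed
    else:
        handle.release_when_idle = True       # defer until the current work finishes
```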
Scaling Strategies
As Chubby's usage grew at Google, scalability became a critical concern. Chubby employs two complementary scaling approaches: proxies and partitioning.
Proxy Architecture
93% of all requests to Chubby masters are KeepAlive requests for session maintenance. This creates a fundamental scaling bottleneck - even if you could distribute file operations, the session management load remains concentrated on individual masters.
This is where proxies come into the picture. A Chubby proxy server sits between clients and the master, aggregating multiple client sessions into a single master session.
How Proxies Work:
Session aggregation: Thousands of clients connect to the proxy instead of directly to the master.
Single upstream session: The proxy maintains one session with the Chubby master on behalf of all its clients.
KeepAlive consolidation: Instead of 10,000 individual KeepAlive requests, the master receives only one from the proxy.
Cache sharing: Multiple clients can benefit from the same cached data at the proxy.
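A simplified sketch of the aggregation idea follows: the proxy answers client KeepAlives locally and maintains a single upstream KeepAlive loop with the master. All class and method names here are assumptions, not the real proxy implementation.

```python
class ChubbyProxy:
    """Illustrative proxy: many client sessions, one upstream session with the master."""
    def __init__(self, upstream_session) -> None:
        self.upstream = upstream_session     # hypothetical session with the real master
        self.local_sessions = {}             # client_id -> per-client session state

    def handle_client_keepalive(self, client_id):
        # Answered locally; no per-client KeepAlive is forwarded to the master.
        session = self.local_sessions[client_id]
        session.lease_deadline = self.upstream.lease_deadline
        return session.lease_deadline

    def upstream_keepalive_loop(self) -> None:
        while True:
            reply = self.upstream.send_keepalive()         # one KeepAlive for all clients
            self.upstream.lease_deadline = reply.lease_deadline
            for session in self.local_sessions.values():   # fan events out locally
                session.deliver(reply.events)
```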
Benefits:
Dramatic KeepAlive reduction: If a proxy serves N clients, KeepAlive traffic to the master reduces by a factor of N.
Read traffic reduction: The proxy maintains its own cache, reducing read requests to the master.
Scalability multiplication: A single master can effectively serve hundreds of thousands of clients through proxies.
Limitations:
Write traffic unchanged: Write operations must still pass through to the master for consistency. This limitation is acceptable since writes are rare (<1% of traffic).
Additional latency: Adds one extra network hop for writes and cache misses.
Operational complexity: Proxy failures require sophisticated failover mechanisms.
Namespace Partitioning
Another approach for scaling is to distribute the traffic across multiple masters using hash-based partitioning.
How Partitioning Works:
Partition assignment: P(path) = hash(directory) mod N, where N is the number of partitions.
Example with 3 partitions:
/ls/cell/serviceA/master → Partition 1
/ls/cell/serviceB/config → Partition 2
/ls/cell/serviceC/locks → Partition 0
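The mapping can be written down directly. The hash function below and the resulting partition numbers are illustrative (the partition indices in the example above are just an example); the important property is that the partition is a pure function of a node's parent directory, so siblings land on the same partition.

```python
import hashlib

NUM_PARTITIONS = 3

def partition_for(path: str, n: int = NUM_PARTITIONS) -> int:
    """P(path) = hash(directory) mod N, using the node's parent directory."""
    directory = path.rsplit("/", 1)[0]                    # e.g. "/ls/cell/serviceA"
    digest = hashlib.sha1(directory.encode()).digest()    # stable across processes
    return int.from_bytes(digest[:8], "big") % n

# Children of the same directory always map to the same partition:
assert partition_for("/ls/cell/serviceA/master") == partition_for("/ls/cell/serviceA/config")
print(partition_for("/ls/cell/serviceA/master"),
      partition_for("/ls/cell/serviceB/config"),
      partition_for("/ls/cell/serviceC/locks"))
```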
Benefits:
Read/write distribution: File operations spread across N masters, reducing load by a factor of N per partition.
Independent scaling: Each partition can handle its subset of the namespace independently.
Fault isolation: Failure of one partition doesn't affect others.
Limitations:
KeepAlive traffic unchanged: Session management still requires clients to contact multiple partitions.
Cross-partition operations: Some operations (like directory deletion) require coordination between partitions.
Client complexity: Clients must discover and communicate with multiple masters.
Combined Strategy
Chubby's scaling strategy combines both approaches to address different bottlenecks. Proxies address KeepAlive traffic, session management overhead, and client connection limits, while partitioning addresses read/write operation distribution, namespace size limits, and storage capacity scaling.
Evolution: From Lock Service to Name Service
Despite being designed as a distributed lock service for coordination, Chubby's most popular use case at Google became entirely different: name service and service discovery. This unexpected evolution highlights how systems often grow beyond their original design intent when deployed at scale.
The DNS Scaling Problem
To understand why Chubby became Google's primary name service, it's important to examine the limitations of traditional DNS at Google's scale.
DNS operates on a time-to-live (TTL) model where entries expire after a fixed period and must be refreshed regardless of whether the underlying data has changed. While this works well for typical internet usage, it creates significant problems in Google's high-scale computing environment.
Google frequently runs large distributed jobs involving thousands of processes that need to communicate with each other. This creates a quadratic communication pattern where each process must discover and communicate with every other process.
For example, consider a MapReduce job with 3,000 worker processes where each worker needs to communicate with all others:
Total communication pairs: 3,000 × 3,000 = 9,000,000 connections needed
With DNS TTL of 60 seconds: 9,000,000 ÷ 60 = 150,000 DNS lookups per second
This massive lookup rate could easily overwhelm DNS servers.
Chubby's Approach
Unlike DNS's time-based expiration, Chubby uses explicit cache invalidation. When a service's location changes, Chubby immediately notifies all interested clients rather than waiting for TTL expiration.
With this approach, the load on the master remains constant regardless of the number of cached entries a client maintains. A single KeepAlive request can maintain thousands of cached name entries indefinitely, as long as they don't change.
Advantages
Precision Over Approximation: Chubby's caching strategy provides exact invalidation notifications, compared to DNS's time-based approximations. Developers no longer need to choose between fast failover (short TTL, high load) and low DNS load (long TTL, slow failover).
Immediate Consistency: When a service fails over to a new host, Chubby clients learn about the change immediately through cache invalidation events rather than waiting up to TTL seconds for DNS cache expiration.
Conclusion
Chubby's evolution from a distributed lock service to Google's primary name service demonstrates how thoughtful system design can solve problems beyond original intentions. Chubby proved that prioritizing developer adoption and operational simplicity over theoretical performance creates a lasting impact.
Chubby's design principles — coarse-grained locking, advisory coordination, and event-driven caching — continue to influence modern systems like etcd, Consul, and ZooKeeper. For anyone building distributed systems, Chubby remains essential reading, offering both technical depth and practical wisdom about coordination at scale.