Scaling Caching at Etsy: Handling Millions of Lookups per Second
How Etsy optimised trillions of cached data accesses at millions of lookups per second.
A story of consistent hashing with a secret sauce.
Etsy uses memcached (to cache database queries, expensive calculations, etc.) and Varnish (to cache internal HTTP requests) to improve performance and reduce load.
They run a pool of caching servers, each running memcached or Varnish.
How to distribute Cached data across Servers?
1. Modulo Hashing
It works as follows:
- Hash the key to a number
- Take that number modulo the size of the cache pool; the result picks the server that stores the key-value pair
Example
We need to store `some_key` and `some_value`. We have 3 host servers, numbered 0, 1, and 2.
We map `some_key` to a number via a hash function; let's say the number is 4.
Taking the modulo, `4 % 3 == 1`, so we store `some_key` on server 1.
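The steps above can be sketched in Python (illustrative only, not Etsy's actual client code; the host names are made up):

```python
import hashlib

HOSTS = ["cache0", "cache1", "cache2"]  # a pool of 3 cache servers (hypothetical names)

def stable_hash(key: str) -> int:
    # Python's built-in hash() is randomized per process, so use MD5
    # to get a deterministic number for the key.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def host_for(key: str) -> str:
    # Modulo by the pool size picks the server for this key.
    return HOSTS[stable_hash(key) % len(HOSTS)]
```

`host_for("some_key")` always returns the same server, as long as the pool size stays unchanged.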
Pros
- Each cached value is stored on a single host
Cons
- Changing the size of the cache pool causes most cache keys to hash to a new server, leading to tons of cache misses.
In the above example, suppose we add a fourth server.
Taking the modulo, `4 % 4 == 0`, so `some_key` will now be cached on server 0, worsening the cache hit rate.
2. Consistent hashing
It works as follows:
- Fix the hash space and divide it into large contiguous buckets
- Assign one (or more) buckets to each cache host
- Keys that hash into a bucket's range are looked up on the matching host
Example
Let's say the hash space = 1,000
We divide it into 3 buckets: [0, 333), [333, 666), and [666, 1000)
We have two hosts: `host 0` is assigned two buckets, [0, 333) and [333, 666); `host 1` is assigned one bucket, [666, 1000)
We need to store `some_key` and `some_value`. We map `some_key` to a number via a hash function; let's say the number is 44. Since 44 falls in the first bucket's range, it is stored on `host 0`.
Pros
- Changing the pool size means moving a few buckets from existing servers to a new server, so only a small number of keys move.
In the above example, if we add another server `host 2` and move bucket [333, 666) to it, only 333 keys move to the new server, i.e. about 33% of keys.
We can tune the bucket sizes to further decrease key redistribution.
Etsy leverages Ketama to implement consistent hashing.
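Ketama itself is a C library with memcached client bindings; a hedged Python sketch of the same idea (a ring of many virtual points per host, with details like the exact MD5-over-"host:port" point layout simplified) might look like:

```python
import bisect
import hashlib

def _point(s: str) -> int:
    # Hash a string to a point on the ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    """Each host contributes many virtual points on a hash ring; a key is
    served by the owner of the first point at or after the key's hash,
    wrapping around the ring."""

    def __init__(self, hosts, points_per_host=100):
        self.ring = sorted((_point(f"{host}#{i}"), host)
                           for host in hosts for i in range(points_per_host))
        self.points = [p for p, _ in self.ring]

    def host_for(self, key: str) -> str:
        idx = bisect.bisect_right(self.points, _point(key)) % len(self.ring)
        return self.ring[idx][1]
```

Adding a host only inserts that host's points into the ring, so only the keys falling just before those new points (roughly 1/n of all keys) move; everything else stays put.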
How to serve Hot keys?
Certain cache items are read and written more than others.
Ideally, a good hash function should distribute these "hot keys" across many servers.
"Celebrity" Key
At Etsy, they've seen keys that are hit often enough, and store a large enough value, that they saturate the network interface of their cache host.
A large number of clients collectively generate a higher read rate than the cache server can provide.
To fix this, they explored the following solutions.
Solution 1: Horizontal Scaling
Adding more cache hosts doesn't help, because it only changes the distribution of keys to hosts: it would merely move the problem key and saturate a different host.
Solution 2: Faster network cards
It would come at a substantial hardware cost.
Solution 3: Cache Smearing
They used consistent hashing and added a small amount of entropy to certain hot cache keys for all reads and writes.
This effectively turns one key into several, each storing the same data and letting the hash function distribute the read and write volume for the new keys among several hosts.
Example
Let's say `celebrity_key` is a hot key.
Cache smearing appends a random number in a small range (say, [0, 5)) to the key before each read or write. So, for example, `celebrity_key` is transformed into `celebrity_key_1`, `celebrity_key_3`, or `celebrity_key_0`.
Since `celebrity_key_1`, `celebrity_key_3`, and `celebrity_key_0` are different keys, they hash to different hosts, sharing the load across the pool.
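A minimal sketch of that read/write path (my own illustration; the in-process dict stands in for the memcached pool, and all names are made up):

```python
import random

SMEAR_RANGE = 5  # number of copies of the hot key

def smeared(key: str) -> str:
    # Append a random number in [0, SMEAR_RANGE) to the hot key.
    return f"{key}_{random.randrange(SMEAR_RANGE)}"

cache = {}  # toy in-process stand-in for the memcached pool

def smeared_set(key: str, value) -> None:
    # Each write lands on one randomly chosen copy; over many writes
    # every copy ends up holding the same data.
    cache[smeared(key)] = value

def smeared_get(key: str):
    # Reads are spread across the copies the same way.
    return cache.get(smeared(key))
```

Because the suffix is re-drawn on every operation, the traffic for one hot key is spread evenly over all of its copies, and the hash function spreads those copies over the pool.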
Tradeoffs in choosing the range of the random number
A too-large range duplicates the cached data more than is necessary to reduce the load, and can lead to data inconsistencies.
A too-small range doesn't spread the load effectively across the cache pool (and could even leave all the smeared keys hashed to the same pool member).
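A back-of-the-envelope way (my own, assuming the hash spreads keys uniformly over the hosts, which consistent hashing only approximates) to reason about the too-small-range risk:

```python
def same_host_probability(n_hosts: int, n_copies: int) -> float:
    # With n_copies smeared keys hashed uniformly over n_hosts servers, the
    # chance that every copy lands on the same host is (1/n_hosts)^(n_copies-1).
    return (1 / n_hosts) ** (n_copies - 1)

print(same_host_probability(10, 2))  # 0.1: two copies collide 10% of the time
print(same_host_probability(10, 5))  # ~0.0001: five copies almost never all collide
```

This is why even a modest range like [0, 5) already makes "all copies on one host" very unlikely, while keeping duplication low.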
Conclusion
In summary, consistent hashing plus cache smearing has been a great combination of efficiency and practicality.
Allowing multiple hosts to serve requests for a hot key prevents any one host from being overloaded and increases the overall read rate of the cache pool.
At Etsy, they manually add cache smearing to the hottest keys, with entropy in a small range like 0 to 8 or 0 to 16.
Understanding the foundational principles is the key to effective learning!
Follow along to improve your system design skills.
Reference and Image Credit
Written by Kamran Ali