Distributed Cache System Design
This blog describes how to design a distributed cache like Redis or Memcached.
Today we will be talking about how to design a distributed cache. But before we dive into the design itself, we need to understand caching in some detail. Hop on for an exciting journey.
Caching is used in multiple places to reduce latency and enable fast retrieval of data. Let’s explore common places where caching is used.
- CPU Cache — The CPU stores frequently accessed data in the L1, L2, or L3 cache. Retrieval of data from these caches is faster than from RAM.
We need to understand the validity of the data that is cached. Effective caching strategies ensure a high cache hit rate (meaning most requests are satisfied by the cache). If the cache cannot respond to the query, it is a cache miss. Now there would be a question in your mind as to what data will reside in the cache and for how long. TTL stands for time to live and is measured in seconds or milliseconds. Each entry in the cache has an associated TTL. When the TTL of a record expires, the record is deleted from the cache, and subsequent requests for it result in a cache miss.
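Here is a minimal sketch of how TTL-based expiry might look, assuming a simple in-process map (the class and method names are illustrative, not a real library’s API):

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative TTL-aware cache: entries expire lazily when read after their deadline.
class TtlCache<K, V> {
    private record Entry<T>(T value, long expiresAtMillis) {}
    private final ConcurrentHashMap<K, Entry<V>> store = new ConcurrentHashMap<>();

    void put(K key, V value, long ttlMillis) {
        store.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
    }

    // Returns null on a cache miss, including entries whose TTL has expired.
    V get(K key) {
        Entry<V> e = store.get(key);
        if (e == null || System.currentTimeMillis() > e.expiresAtMillis()) {
            store.remove(key);  // lazily evict the expired entry
            return null;        // cache miss
        }
        return e.value();       // cache hit
    }
}
```

Real caches like Redis combine this lazy check with periodic background sweeps so that expired entries do not linger unread.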
Below are the requirements that we are considering for our design:
- The storage capacity of the cache should be in terabytes
- Number of requests/sec ~ 1 million
- Latency for retrieval ~ 1 millisecond
- High Availability
- High Scalability
Let’s explore cache access strategies. There are three main strategies: Write-Through, Write-Around, and Write-Back.
Write-Through
In this strategy, all operations happen through the cache. When a write operation is performed, data is first written to the cache and then to the database. This ensures strong consistency between the cache and the DB. A write is considered successful only when the data is written to both the cache and the DB. This strategy is employed when we have a read-heavy system.
Pros: Strong consistency is maintained. Preferred in read-heavy systems.
Cons:
1. Write latency is higher as writes happen to both the cache and the DB.
2. Load on the DB is higher as every write is performed on the DB.
Real-world System Use Case:
Consider an online banking application where users frequently check their balance and perform transactions, and data is written to both the cache and the DB. Here the write-through policy is used to ensure that there is no discrepancy between the cache and the persisted data. Why is write-through used here? Because we need a strongly consistent system. If a user deposits money in the bank and then checks the balance, and the deposit is not reflected immediately, the user will lose their mind.
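A minimal sketch of write-through, assuming a generic `Store` interface for both the cache and the database (the names here are illustrative, not a specific library’s API):

```java
// Generic store abstraction used by the strategy sketches in this post.
interface Store { void put(String key, String value); String get(String key); }

// Write-through: a write succeeds only after both the cache and the DB accept it.
class WriteThroughCache {
    private final Store cache, db;
    WriteThroughCache(Store cache, Store db) { this.cache = cache; this.db = db; }

    void write(String key, String value) {
        cache.put(key, value);  // 1. update the cache
        db.put(key, value);     // 2. then persist; only now is the write acknowledged
    }

    String read(String key) {
        String v = cache.get(key);
        return (v != null) ? v : db.get(key);  // cache hit, otherwise fall back to the DB
    }
}
```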
Write-Around
In the write-around strategy, write operations completely bypass the cache and are performed directly on the database. The cache is populated from the DB only when there is a cache miss. This strategy is often employed when the workload is write-heavy but the data is not frequently read.
Pros: Write latency is reduced as writes only happen to the DB.
Cons: Read latency is increased for cache-miss data.
Real-world use cases:
- Archival Systems
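A sketch of write-around, reusing the assumed `Store` interface from the write-through sketch:

```java
// Write-around: writes bypass the cache; the cache is filled lazily on a miss.
class WriteAroundCache {
    private final Store cache, db;  // Store is the assumed interface from the write-through sketch
    WriteAroundCache(Store cache, Store db) { this.cache = cache; this.db = db; }

    void write(String key, String value) {
        db.put(key, value);  // the cache is not touched on writes
    }

    String read(String key) {
        String v = cache.get(key);
        if (v == null) {                       // cache miss
            v = db.get(key);
            if (v != null) cache.put(key, v);  // populate the cache for future reads
        }
        return v;
    }
}
```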
Write-Back
In the write-back strategy, data is written to the cache only. As soon as the data is successfully written to the cache, a successful response is sent. Data from the cache is then periodically synced with the database asynchronously. This strategy is used when both read-heavy and write-heavy requests are present and some tolerance for data loss exists.
Pros: Low latency for both read and write operations, as all operations happen in the cache.
Cons:
1. Data is inconsistent, as there is a delay in synchronizing the cache and the DB.
2. There is a possibility of data loss if the cache goes down before the data is synced with the DB.
Real-world use cases:
The write-back strategy is used in HFT (high-frequency trading) systems. These systems demand extremely low latency, so whenever a write operation is performed, it is immediately acknowledged, improving the responsiveness of the trading system. Here an eventual consistency model is followed, where data is asynchronously written to the database. In HFT systems, speed and responsiveness are prioritized over consistency with the underlying data store.
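A sketch of write-back, again assuming the `Store` interface from the write-through sketch; a background thread drains dirty entries to the database:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Write-back: acknowledge as soon as the cache is updated; flush to the DB asynchronously.
class WriteBackCache {
    private final Store cache, db;  // Store is the assumed interface from the write-through sketch
    private final BlockingQueue<String[]> dirty = new LinkedBlockingQueue<>();

    WriteBackCache(Store cache, Store db) {
        this.cache = cache;
        this.db = db;
        Thread flusher = new Thread(() -> {
            try {
                while (true) {
                    String[] kv = dirty.take();  // blocks until a write is queued
                    db.put(kv[0], kv[1]);        // if the process dies before this, the write is lost
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        flusher.setDaemon(true);
        flusher.start();
    }

    void write(String key, String value) {
        cache.put(key, value);
        dirty.offer(new String[]{key, value});  // acknowledged without waiting for the DB
    }
}
```

The gap between `cache.put` and the background `db.put` is exactly the window in which data can be lost, which is why this strategy suits systems that tolerate some loss.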
The data structure used to implement a cache is a hash table. We need a key, a value, and a hash function that maps the key to the appropriate bucket.
The working of a hash table is simple and intuitive. We calculate the hash value of the element using the hashCode() method, which gives us the bucketId of the bucket where the element should be stored. Once the bucket is located, we store the element in that bucket. Here there is a possibility of collision. There are several strategies for collision handling, such as separate chaining and open addressing. Let’s discuss both of them in detail.
Separate Chaining — In this strategy, we calculate the hash index and then append the element to the linked list at that index. The drawback of this approach is that the linked lists become long if there are many collisions, so periodic re-hashing should be performed to distribute the elements uniformly. In Java’s HashMap implementation, elements are initially appended to a linked list, but if a list grows beyond a threshold it is converted to a balanced binary search tree, which is more efficient for insertion and retrieval, providing logarithmic time complexity. Separate chaining is also used in database indexing to handle collisions.
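A minimal sketch of separate chaining, with illustrative names and no resizing logic:

```java
// Each bucket holds a singly linked list of (key, value) nodes.
class ChainedHashTable<K, V> {
    private static class Node<K, V> {
        final K key; V value; Node<K, V> next;
        Node(K key, V value, Node<K, V> next) { this.key = key; this.value = value; this.next = next; }
    }

    private final Node<K, V>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int capacity) { buckets = (Node<K, V>[]) new Node[capacity]; }

    private int indexFor(K key) { return Math.floorMod(key.hashCode(), buckets.length); }

    void put(K key, V value) {
        int i = indexFor(key);
        for (Node<K, V> n = buckets[i]; n != null; n = n.next) {
            if (n.key.equals(key)) { n.value = value; return; }  // update an existing key
        }
        buckets[i] = new Node<>(key, value, buckets[i]);  // prepend on collision
    }

    V get(K key) {
        for (Node<K, V> n = buckets[indexFor(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) return n.value;
        }
        return null;  // not present
    }
}
```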
Open Addressing — In this strategy, we calculate the hash index. If that slot is already occupied, we iterate forward and place the element in the first bucket that is empty. During retrieval we follow the same technique: we iterate over the slots in a linear fashion until we find the desired element. This process is called linear probing because we are probing linearly. The drawback of this approach is clustering. What do I mean by that? Clustering means that elements get bunched together, so you may need to iterate over a long run of slots to fetch an item, which increases read latency. Therefore the choice of probing strategy is crucial to mitigate these concerns.
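A sketch of open addressing with linear probing; for simplicity it assumes the table never fills up and omits deletion, which needs extra care (tombstones) under probing:

```java
// Linear probing: on collision, scan forward (wrapping around) to the next empty slot.
class LinearProbingTable<K, V> {
    private final Object[] keys, values;

    LinearProbingTable(int capacity) {
        keys = new Object[capacity];
        values = new Object[capacity];
    }

    void put(K key, V value) {
        int i = Math.floorMod(key.hashCode(), keys.length);
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) % keys.length;  // probe the next slot; assumes the table is never full
        }
        keys[i] = key;
        values[i] = value;
    }

    @SuppressWarnings("unchecked")
    V get(K key) {
        int i = Math.floorMod(key.hashCode(), keys.length);
        while (keys[i] != null) {
            if (keys[i].equals(key)) return (V) values[i];
            i = (i + 1) % keys.length;  // keep scanning the cluster
        }
        return null;  // hit an empty slot: the key is absent
    }
}
```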
Distributed caching is done using a consistent hashing algorithm, which is very efficient at distributing the load as well as minimizing data movement. Let’s dive deep into how consistent hashing works.
Imagine a circular ring onto which the entire hash space is distributed. Suppose we have five servers on the ring. Each position on the ring represents a point in the hash space. The position of a server on the ring is determined by calculating the hash value of its server identifier. Once the servers are mapped onto the ring, we calculate the position of a key-value pair on the ring using the same hash function. We move clockwise from that position and put the key-value pair in the first server that we encounter. Retrieval of data works similarly. Now consider a scenario where one of the servers (say Server 4) goes down. In this case, the keys that mapped to Server 4 will now be mapped to Server 5. Notice that data movement is minimal: we only needed to change the position of a small subset of keys. This property is very useful when designing scalable systems, as nodes can be added or removed with minimal data movement.
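A minimal consistent-hash ring can be built on a sorted map; the hash function below is a toy stand-in (real implementations typically use MD5 or MurmurHash, plus virtual nodes for smoother balancing):

```java
import java.util.Map;
import java.util.TreeMap;

// Consistent hashing: servers and keys share one hash space laid out on a ring.
class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addServer(String serverId) { ring.put(hash(serverId), serverId); }
    void removeServer(String serverId) { ring.remove(hash(serverId)); }

    // Walk clockwise from the key's position to the first server; wrap around if needed.
    String serverFor(String key) {
        Map.Entry<Integer, String> e = ring.ceilingEntry(hash(key));
        return (e != null) ? e.getValue() : ring.firstEntry().getValue();  // assumes at least one server
    }

    private int hash(String s) { return s.hashCode() & 0x7fffffff; }  // toy hash for illustration
}
```

With this structure, removing a server only remaps the keys that fell between its predecessor and itself; every other key keeps its server.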
Cache eviction policy
We cannot keep all the data in the cache, since cache memory is very expensive, so we need to prioritize which data should be present in the cache and which data to remove. Generally, the LRU (least recently used) policy is used to evict elements from the cache.
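In Java, a compact LRU cache falls out of java.util.LinkedHashMap when it is constructed in access order:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache: accessOrder = true moves every accessed entry to the end,
// so the eldest entry is always the least recently used one.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true);  // third argument enables access-order iteration
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;  // evict the LRU entry once capacity is exceeded
    }
}
```

Production caches implement the same idea with a hash map plus a doubly linked list, which gives O(1) get and put.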
Let’s explore the internal working of the cache. Even though read and write operations on RAM are fast, the calls are blocking. So what do we do when n requests arrive simultaneously? We can do one of two things:
1. Create a new thread for each request, which processes the request and returns a response. This method is generally not preferred for several reasons: a large number of threads is created, especially if the requests are short-lived, and each thread creation has associated overhead. Also, since the capacity for creating new threads is limited, this approach does not scale.
2. Put the requests in a queue and process them asynchronously. A single-threaded event loop continuously polls the queue and assigns each request to one of the worker threads. After assigning a request to a worker thread, it polls the next one, and the process continues. When a worker thread finishes execution, it returns the response to the event loop, which then returns it to the user. A sketch of this model follows.
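A minimal sketch of the queue-plus-event-loop model, with illustrative names; requests are modeled as plain Runnable tasks:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

class CacheServer {
    private final BlockingQueue<Runnable> requests = new LinkedBlockingQueue<>();
    private final ExecutorService workers = Executors.newFixedThreadPool(8);  // pool size is arbitrary here

    void submit(Runnable request) { requests.offer(request); }  // enqueue an incoming request

    void startEventLoop() {
        Thread loop = new Thread(() -> {
            try {
                while (true) {
                    Runnable req = requests.take();  // poll the next request (blocks when idle)
                    workers.submit(req);             // hand it to a worker, then poll again
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        loop.setDaemon(true);
        loop.start();  // a single thread drives the loop, as described above
    }
}
```

Redis famously goes further: one event-loop thread both polls and executes commands, which sidesteps locking entirely.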
How do we make our system fault-tolerant? There are two approaches:
- Periodic disk snapshots — periodically sync the cache data to disk. When the cache goes down, the data on disk is used to reconstruct the cache.
- Log reconstruction — each operation is saved to a log file in an append-only manner, and when the cache goes down, these operations are replayed in sequence to reconstruct the cache.
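A sketch of the append-only log approach; the file format (one operation per line) and path are assumptions for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.function.Consumer;

class AppendOnlyLog {
    private final Path logFile = Paths.get("cache-ops.log");  // hypothetical location

    // Record each write as one line, e.g. "SET key value".
    void append(String op) throws IOException {
        Files.writeString(logFile, op + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // On restart, replay every logged operation in order to rebuild the cache state.
    void replay(Consumer<String> apply) throws IOException {
        if (!Files.exists(logFile)) return;
        Files.readAllLines(logFile).forEach(apply);
    }
}
```

This mirrors the two durability mechanisms Redis offers: RDB snapshots and the AOF (append-only file).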
How do we ensure High availability?
We can have multiple copies of the cache so that each replica can serve requests. But then the challenge is maintaining consistency between those replicas. To avoid that problem, we can have one master node that handles read/write operations while the other nodes serve only read operations. We will have a ZooKeeper server which handles the responsibility of electing a new master node in case the master goes down.
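As one possible implementation, Apache Curator’s LeaderLatch recipe builds master election on top of ZooKeeper; the connection string and znode path below are placeholders:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class MasterElection {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble (placeholder address).
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // All cache nodes race on the same znode path; one becomes leader.
        LeaderLatch latch = new LeaderLatch(client, "/cache/leader");
        latch.start();
        latch.await();  // blocks until this node is elected master

        System.out.println("This node is now the master and can serve writes.");
    }
}
```

If the master dies, its ephemeral znode disappears and one of the remaining nodes is promoted, so clients regain a writable master automatically.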
That’s it from today's tutorial. Happy learning.