
Partitioning In Distributed Systems

· Manthan Gupta

Yeah, I have been away for half a year now. A lot has happened, and that’s for some other time or blog. But here we are to discuss partitioning in distributed systems. We will discuss different ways to partition data, how to relieve hot spots, strategies to rebalance partitions, etc.

Partition & Replication


We have already discussed replication in depth in my 3-part series. If you missed it, you can find the first part here.

Our goal with partitioning is to spread the data and the query load evenly across nodes. The main reason for wanting to partition data is scalability. If the partitioning is unfair, such that some partitions have more data or query load than others, we call it skewed. The presence of skew makes partitioning much less effective. A partition with a disproportionately high load is called a hot spot.

To avoid hot spots, the simplest approach is to assign records to nodes randomly. But this isn't a silver bullet: the catch is that there is no way of knowing which node a particular item is on, so when reading we have to query all nodes in parallel.
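To make that tradeoff concrete, here is a minimal sketch in Python (the node IDs and record keys are made up for illustration): writes are cheap because any node will do, but a read for a single key has to fan out to every node, because nothing records where the key went.

```python
import random

# Three hypothetical nodes, each modeled as a tiny in-memory key-value store.
nodes = {0: {}, 1: {}, 2: {}}

def write(key, value):
    # Any node will do: pick one at random and store the record there.
    node_id = random.choice(list(nodes))
    nodes[node_id][key] = value

def read(key):
    # Nothing tells us which node holds the key, so we must ask all of them.
    for store in nodes.values():
        if key in store:
            return store[key]
    return None

write("user:42", {"name": "Ada"})
print(read("user:42"))  # found only after (potentially) checking every node
```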

Partitioning by Key Range

One way of partitioning is to assign a contiguous range of keys to each partition. If the boundaries between ranges are known, you can easily determine which partition contains a given key. If we also know which partition is assigned to which node, we can send requests directly to the appropriate node. Within each partition, we can keep keys in sorted order, which has the advantage that range scans are easy. The downside is that certain access patterns can lead to hot spots: if the key is a timestamp, then, unfortunately, all the writes for a particular day go to the same partition (assuming one partition per day).
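Here is a minimal sketch of how the lookup works once the boundaries are known; the boundary values and node names below are made up for illustration.

```python
import bisect

# Exclusive upper key boundary of each partition, kept in sorted order.
# The last boundary is a sentinel that covers the rest of the keyspace.
boundaries = ["g", "n", "t", "~"]
partition_to_node = {0: "node-a", 1: "node-b", 2: "node-c", 3: "node-d"}

def partition_for_key(key):
    # Binary search over the sorted boundaries finds the owning partition.
    return bisect.bisect_right(boundaries, key)

def node_for_key(key):
    return partition_to_node[partition_for_key(key)]

# Range scans are easy: keys between "h" and "m" all live on the same node.
print(node_for_key("harry"), node_for_key("mallory"))  # node-b node-b
```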

Partitioning by Hash of Key

As seen with key-range partitioning, hot spots can occur, so because of the risk of skew and hot spots, many distributed data stores use another approach: a hash function determines the partition for a given key. A good hash function distributes keys fairly among the partitions. The partition boundaries can be evenly spaced or chosen pseudo-randomly, in which case the technique is sometimes known as consistent hashing. The tradeoff is that range queries are no longer efficient.
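A minimal sketch of hash partitioning follows; the partition count is made up, and I use MD5 only because it is a stable hash that every node computes identically (Python's built-in `hash()` is randomized per process, which would break routing).

```python
import hashlib

NUM_PARTITIONS = 8  # hypothetical

def partition_for_key(key: str) -> int:
    # A stable hash of the key, reduced modulo the number of partitions.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# Adjacent keys scatter pseudo-randomly across partitions, which evens out load...
print(partition_for_key("2024-05-01T10:00:00"), partition_for_key("2024-05-01T10:00:01"))
# ...but a range query like "all keys for 2024-05-01" now has to hit every partition.
```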

Skewed Workloads & Relieving Hot Spots

Partitioning by hashing a key can help reduce hot spots, but we can't avoid them entirely. In the extreme case where all reads and writes are for the same key, we still end up with all requests routed to the same partition. For example, a celebrity user with millions of followers on a social media site may cause a storm of activity when they do something, resulting in a large volume of writes to the same key.

A simple technique to avoid this is to add a random number to the beginning or end of the key. A 2-digit decimal random number would split the writes to the key evenly across 100 different keys. The downside is that reads now have to do additional work: they have to read from all 100 keys and combine the results. If we only do this for a small number of hot keys, some bookkeeping is required to keep track of which keys have been split.
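A minimal sketch of this splitting trick, using a plain dict as a stand-in for the data store and a made-up hot key:

```python
import random

SPLITS = 100  # a two-digit decimal suffix gives 100 sub-keys

def write_hot_key(store, key, value):
    # Append a random two-digit suffix so writes spread over 100 different keys.
    suffix = random.randrange(SPLITS)
    store.setdefault(f"{key}#{suffix:02d}", []).append(value)

def read_hot_key(store, key):
    # Reads pay the price: fetch all 100 sub-keys and combine the results.
    combined = []
    for suffix in range(SPLITS):
        combined.extend(store.get(f"{key}#{suffix:02d}", []))
    return combined

store = {}
for i in range(1000):
    write_hot_key(store, "celebrity:likes", i)
print(len(read_hot_key(store, "celebrity:likes")))  # 1000, gathered from up to 100 keys
```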

Rebalancing Partitions

Over time, things change in the database. Query throughput increases, so we add more CPUs to handle the load. Dataset size increases, so we add more disks and RAM to store it. A machine can fail, and other machines need to take over the failed machine’s responsibilities. All these changes call for data and requests to be moved from one node to another. The process of moving load from one node to another in a cluster is known as rebalancing.

After rebalancing, we expect a few things:

  • The load should be fairly balanced between the nodes in the cluster.
  • The database should continue accepting reads and writes while it is rebalancing.
  • No more data than necessary should be moved between nodes, to keep rebalancing fast and to minimize network and disk I/O load.

Strategies for Rebalancing

  1. Fixed number of partitions: Create many more partitions than there are nodes, and assign several partitions to each node. For example, a database running on a cluster of 10 nodes may be split into 1000 partitions so that approximately 100 partitions are assigned to each node. If a node is added to the cluster, the new node can steal a few partitions from every existing node until partitions are fairly distributed once again (see the sketch after this list). Only entire partitions are moved between nodes. The number of partitions doesn't change, nor does the assignment of keys to partitions; the only thing that changes is the assignment of partitions to nodes. In this configuration, the number of partitions is usually fixed when the database is first set up and not changed afterward. If partitions are very large, rebalancing and recovery from node failures become expensive, but if partitions are too small, they incur too much overhead.

  2. Dynamic partitioning: For databases that use key-range partitioning, a fixed number of partitions with fixed boundaries would be very inconvenient. If you got the boundaries wrong, you could end up with all the data in one partition and the others empty. When a partition grows to exceed a configured size, it is split into two. Conversely, if lots of data is deleted and a partition shrinks below some threshold, it can be merged with an adjacent partition. An advantage of dynamic partitioning is that the number of partitions adapts to the total data volume. A caveat is that an empty database starts with a single partition, so all writes are processed by a single node while the other nodes sit idle.

  3. Partitioning proportionally to nodes: The number of partitions is made proportional to the number of nodes, i.e., a fixed number of partitions per node. The size of each partition grows proportionally to the dataset size while the number of nodes remains unchanged. When a new node joins the cluster, it randomly chooses a fixed number of existing partitions to split, and then takes ownership of one half of each of those split partitions while leaving the other half of each partition in place.
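To illustrate the first strategy, here is a rough sketch of rebalancing with a fixed number of partitions. The function names (`initial_assignment`, `add_node`) and node names are hypothetical; the point is that only the partition-to-node mapping changes, and only whole partitions move.

```python
def initial_assignment(num_partitions, nodes):
    # Assign partitions round-robin. This mapping is the only thing rebalancing
    # ever changes; the number of partitions and the key-to-partition mapping stay fixed.
    return {p: nodes[p % len(nodes)] for p in range(num_partitions)}

def add_node(assignment, new_node):
    # The new node steals whole partitions from existing nodes until it
    # holds roughly its fair share of the cluster.
    all_nodes = sorted(set(assignment.values())) + [new_node]
    target = len(assignment) // len(all_nodes)
    counts = {}
    for owner in assignment.values():
        counts[owner] = counts.get(owner, 0) + 1
    moved = []
    for p, owner in sorted(assignment.items()):
        if len(moved) >= target:
            break
        if counts[owner] > target:
            assignment[p] = new_node
            counts[owner] -= 1
            moved.append(p)
    return moved

assignment = initial_assignment(1000, ["node-0", "node-1", "node-2", "node-3"])  # ~250 each
moved = add_node(assignment, "node-4")
print(len(moved))  # 200 partitions move to the new node; everything else stays put
```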

Request Routing

After partitioning the dataset across multiple nodes, one question remains: when a client wants to make a request, how does it know which node to connect to? This is an instance of a more general problem called service discovery, which isn't limited to databases. A few high-level approaches to this problem are:

  • Allow clients to contact any node, for example, via a round-robin load balancer. If that node coincidentally owns the partition to which the request applies, it can handle the request directly; otherwise, it forwards the request to the appropriate node.
  • Send all requests from clients to a routing tier first, which determines the node that should handle each request and forwards it accordingly. This routing tier doesn't serve data itself; it acts only as a partition-aware load balancer (see the sketch after this list).
  • Make clients aware of the partitioning and of the assignment of partitions to nodes, so they can connect directly to the appropriate node without any intermediary.
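As a rough sketch of the routing-tier approach, the code below combines the hash-partitioning function from earlier with a partition-to-node table; the table contents and node names are made up, and in practice the tier would learn this mapping from a coordination service (see below).

```python
import hashlib

NUM_PARTITIONS = 8
# The routing tier's view of which node owns which partition (hypothetical).
partition_to_node = {p: f"node-{p % 3}" for p in range(NUM_PARTITIONS)}

def partition_for_key(key: str) -> int:
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

def route(request_key: str) -> str:
    # The routing tier never serves data itself; it only decides which
    # node should handle the request and forwards it there.
    return partition_to_node[partition_for_key(request_key)]

print(route("user:42"))  # e.g. "node-1"
```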

But all these approaches have the same problem: how does the component making the routing decision learn about changes in the assignment of partitions to nodes? Many distributed systems rely on a separate coordination service such as ZooKeeper to keep track of this cluster metadata. Each node registers itself in ZooKeeper, and ZooKeeper maintains the authoritative mapping of partitions to nodes.
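As a rough illustration of that pattern, the sketch below uses the kazoo client library for ZooKeeper. The znode paths, node names, and addresses are made up, and a real system would store a richer assignment structure than a single blob.

```python
from kazoo.client import KazooClient

# Assumes a ZooKeeper ensemble is reachable at this (hypothetical) address.
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Each node registers itself under an ephemeral znode, so its entry
# disappears automatically if the node dies and its session expires.
zk.ensure_path("/cluster/nodes")
zk.create("/cluster/nodes/node-1", b"10.0.0.1:6379", ephemeral=True)

# The routing tier watches the partition assignment and refreshes its
# routing table whenever the mapping of partitions to nodes changes.
@zk.DataWatch("/cluster/partition-assignment")
def on_assignment_change(data, stat):
    if data is not None:
        print("new partition assignment:", data.decode("utf-8"))
```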

Parting Words

This is it for this one, and I hope to see you in the next one (hoping that I become consistent again). If you liked the blog, don't forget to share it on your socials and follow me on X/Twitter.

References

  • Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann