Database Sharding: Designing for Scalability to User Growth

Tipps on how to design a database and application such that it can keep up with an almost infinite growth in users with nearly linear scalability.

Database sharding means that parts of the database a broken up into smaller pieces that reside on different databases.

Why Sharding

It is the only way to be able to support an endless growth in users.

A cluster environment, like MySQL Cluster or anything similar, can only get the application scalable up to a certain point. Once that point is reached: BOOM, game over for easy growth, a redesign of the database and re-coding of application parts is necessary.

While a master slave setup provides nearly infinity scalability for reading operations by simply adding more copy-slaves, it can reach a bottleneck with writing.

In the end, it is sharding.

How to Do It

Database Layout

Sharding means splitting up the database. If the growth driver of an application is number of users, then that is probably a good choice to shard.

That means that there will be multiple databases that contain user data. For simplicity, it is probably best to have all data that belongs to a single user on only one shard. It is also good practice from a high availability standpoint: if a shard with user data goes down, users who reside on other shards can still continue to log in.

This means that when a new user registers, the application has to decide on which user shard to save the new user. Once a user is registered and wants to log in, the application has to know on which shard the user resides. There are basically two ways to make this decision:

One can use simple rules for that (e.g. user id 1-100 is on database userShard0, 101-200 on user userShard1 etc.).
Or one can use a central database where the relationship of username/id to userShard is recorded. The application should, of course, employ caching of that information, so the central database has to be queried only once. If that central database still gets too heavy traffic, master slave replication or again sharding can be employed. This method is preferable, as it permits to move users to different shards at free will.

It may also be useful to define the single point of truth for user data to be the userShard and define every other place where user data is saved as a copy. This way there will be no headaches as to which data to trust in case of discrepancies (as you obviously would update the data in the userShard first and then the copies in case the process terminates before finishing).

If there is cross interaction between users on different shards, then a solution must be found specific to each such case. For example, if there is an geo-search where user can find users in their proximity, then one could implement a separate table where the users are listed with their longitude and latitude data. If the application has extremely many users, then this table could be sharded and broken up on longitude and latitude.

Application

The database queries of the application have to be directed to the correct shard. It is useful to make these decisions in a central place, that the application has to ask before making any query.

pShardDecider = new ShardDecider();

And before a query is executed:

//e.g. values of stringDatabaseNameToUse userShard0, userShard1, ...
stringDatabaseNameToUse = pShardDecider->getUserShardName(userID);

//Parameters to connect to the database
//e.g. values of stringRealDatabaseName: 'user' for 'userShard0', 'user' for 'userShard1', etc.
stringRealDatabaseIp = pShardDecider->getIpFromShardName(stringDatabaseNameToUse);
stringRealDatabaseName = pShardDecider->getRealDataBaseName(stringDatabaseNameToUse);

Bonus: Separate Reads from Writes

For a high scalability environment, it is also important to separate reads from writes. If that is implemented from the beginning, then adding a master slave environment later on becomes much easier.

//’r’ for read and ‘w’ for write
stringDatabaseNameToUse = pShardDecider->getUserShardName(userID,’r’);

and some more examples

stringDatabaseNameToUse = pShardDecider->getCentralShardName(’w’);
stringDatabaseNameToUse = pShardDecider->getGeoSearchShardNameList(latitude, longitude, ’w’);

//always read
arrayStringDatabaseNameToUse = pShardDecider->getGeoSearchShardNameList(fromLatitude, toLatitude, fromLongitude, toLongitude);

Implementing Sharing Ability from the very Beginning into the Application

It is possible to implement the basic sharding capability into an application from the very beginning, even if it is not used yet. If everything is in one database, then the shard decider class in the application will simply always return the same database name regardless of arguments given.

Published: October 13, 2019
Last modification: November 21, 2019