Starting a simple sharded environment of two shards
In this recipe, we will set up a simple sharded environment made up of two data shards. There will be no replication configured, as this is the most basic sharded setup there is, meant to demonstrate the concept. We won't be going deep into the internals of sharding here; we will explore them more in the administration section.
Here is a bit of theory before we proceed. Scalability and availability are two important cornerstones for building any mission-critical application. Availability was taken care of by replica sets, which we discussed in previous recipes in this chapter. Let's look at scalability now. Simply put, scalability is the ease with which a system copes with an increasing data and request load. Consider an e-commerce platform. On regular days, the number of hits to the site is fairly modest, and the system's response times and error rates stay low (what counts as low is, of course, subjective). Now, consider the days when the system load becomes two, three, or even more times that of an average day, say on Thanksgiving, Christmas, and so on. If the platform delivers a similar level of service on these high-load days as on any other day, the system is said to have scaled up well to the sudden increase in the number of requests.
Now, consider an archiving application that needs to store the details of all the requests that hit a particular website over the past decade. For each request hitting the website, we create a new record in the underlying data store. Suppose that each record is 250 bytes and the average load is three million requests per day; that works out to roughly 750 MB of new data per day, or about 270 GB per year, so we will cross the 1 TB mark in under four years. This data would be used for various analytics purposes and might be queried frequently. The query performance should not be drastically affected as the data size increases. If the system is able to cope with this increasing data volume and still give performance comparable to that on low data volumes, the system is said to have scaled up well.
Now that we have seen in brief what scalability is, let me tell you that sharding is the mechanism that lets a system scale to such increasing demands. The crux lies in the fact that the entire data set is partitioned into smaller segments and distributed across various nodes called shards. Suppose that we have a total of 10 million documents in a MongoDB collection. If we shard this collection across 10 shards, then we will ideally have 10,000,000/10 = 1,000,000 documents on each shard. At a given point in time, each document will reside on exactly one shard (which by itself will be a replica set in a production system). However, there is some magic involved that keeps this detail hidden from the developer querying the collection, who gets one unified view of the collection irrespective of the number of shards. Based on the query, it is MongoDB that decides which shard to query for the data and returns the entire result set.
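As a quick preview of how a collection is actually declared as sharded, the following mongos shell commands enable sharding on a database and then shard one of its collections on a chosen field. The shardtest database, the requests collection, and the requestId field here are purely hypothetical, and we will cover shard keys properly in the next recipe:

mongos> sh.enableSharding("shardtest")
mongos> sh.shardCollection("shardtest.requests", { requestId : 1 })

With this background, let's set up a simple shard and take a closer look at it.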
Getting ready
Apart from the MongoDB server, which should already be installed, there are no software prerequisites. We will be creating one data directory for each of the two shards, one for the configuration server, and one directory for the logs.
How to do it…
- We start by creating directories for the logs and data. Create the following directories: /data/s1/db, /data/s2/db, and /logs. On Windows, we can have c:\data\s1\db and so on for the data and log directories. There is also a configuration server that is used in the sharded environment to store some metadata. We will use /data/con1/db as the data directory for the configuration server.
- Start the following mongod processes, one for each of the two shards, one for the configuration database, and one mongos process. For the Windows platform, skip the --fork parameter as it is not supported.

$ mongod --shardsvr --dbpath /data/s1/db --port 27000 --logpath /logs/s1.log --smallfiles --oplogSize 128 --fork
$ mongod --shardsvr --dbpath /data/s2/db --port 27001 --logpath /logs/s2.log --smallfiles --oplogSize 128 --fork
$ mongod --configsvr --dbpath /data/con1/db --port 25000 --logpath /logs/config.log --fork
$ mongos --configdb localhost:25000 --logpath /logs/mongos.log --fork
- From the command prompt, execute the following command. This should show a mongos prompt as follows:
$ mongo
MongoDB shell version: 3.0.2
connecting to: test
mongos>
- Finally, we set up the shard. From the mongos shell, execute the following two commands:
mongos> sh.addShard("localhost:27000")
mongos> sh.addShard("localhost:27001")
- On each addition of a shard, we should get an ok reply. The following JSON message should be seen, giving the unique ID of each shard added:
{ "shardAdded" : "shard0000", "ok" : 1 }
Note
We used localhost everywhere to refer to the locally running servers. This is not a recommended approach and is discouraged; the better approach is to use hostnames, even if they are local processes.
How it works…
Let's review what we did in the process. We created three directories for data (two for the shards and one for the configuration database) and one directory for logs. We can have a shell script or batch file create the directories as well. In fact, in large production deployments, setting up shards manually is not only time-consuming but also error-prone.
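For instance, a minimal shell script for the directory creation on Linux could look like the following (a Windows batch file with the equivalent md commands would be the analogous approach):

#!/bin/sh
# Data directories for the two shards and the config server, plus the log directory
mkdir -p /data/s1/db /data/s2/db /data/con1/db /logs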
Let's try to get a picture of what exactly we have done and what we are trying to achieve. The following image shows the shard setup that we just completed:
If we look at the preceding image and the servers started in step 2, we have shard servers that store the actual data in the collections. These were the first two of the four processes that we started, listening on ports 27000 and 27001. Next, we started a configuration server, which is seen on the left side of this image. It is the third of the four servers started in step 2, and it listens on port 25000 for incoming connections. The sole purpose of this database is to maintain the metadata about the shard servers. Ideally, only the mongos process or the drivers connect to this server for the shard details/metadata and the shard key information. We will see what a shard key is in the next recipe, where we play around with a sharded collection and see the shards we have created in action.
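Once the recipe's steps are complete, this metadata is easy to inspect ourselves. The shards we registered live in the shards collection of the config database, which we can query from the mongos shell (the exact document format may vary slightly across versions):

mongos> use config
switched to db config
mongos> db.shards.find()
{ "_id" : "shard0000", "host" : "localhost:27000" }
{ "_id" : "shard0001", "host" : "localhost:27001" }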
Finally, we have the mongos process. This is a lightweight process that doesn't persist any data and only accepts connections from clients. This is the layer that acts as a gatekeeper and abstracts the client from the concept of shards. For now, we can view it as a router that consults the configuration server and routes the client's query to the appropriate shard server for execution. It then aggregates the results from the various shards, if applicable, and returns them to the client. It is safe to say that no client connects directly to the configuration or shard servers; in fact, ideally no one should connect to these processes directly, except for some administrative operations. Clients simply connect to the mongos process and execute their queries, inserts, or updates.
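Since clients talk to a mongos exactly as they would talk to a mongod, it is sometimes handy to confirm which of the two we are connected to. One quick check from the shell is the isMaster command, whose response on a mongos carries the isdbgrid marker:

mongos> db.isMaster().msg
isdbgrid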
Just starting the shard servers, configuration server, and mongos process doesn't create a sharded environment. On startup, the mongos process was provided with the details of the configuration server. What about the two shards that will store the actual data? The two mongod processes started as shard servers are not yet declared anywhere as shard servers in the configuration. This is exactly what we do in the final step by invoking sh.addShard() for both the shard servers. Adding shards from the shell stores the metadata about the shards in the configuration database, and the mongos processes then query this config database for the shard information. On executing all the steps of the recipe, we have an operational shard as follows:
Before we conclude, note that the shard we have set up here is far from ideal and not how it would be done in a production environment. The preceding image gives us an idea of what a typical shard setup looks like in production. The number of shards would not be two but many more. Additionally, each shard would be a replica set to ensure high availability. There would be three configuration servers to ensure the availability of the configuration service as well. Similarly, any number of mongos processes may be created for the cluster, listening for client connections; in some cases, a mongos might even be started on a client application's server.
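As an illustration of that production layout, registering a replica set as a shard uses the seed list form of sh.addShard(); the replica set name (rs1) and the hostnames below are hypothetical:

mongos> sh.addShard("rs1/node1.example.com:27017,node2.example.com:27017")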
There's more…
What good is a shard unless we put it into action and see from the shell what happens when we insert and query data? In the next recipe, we will make use of the shard we set up here, add some data, and see it in action.