Connecting to a shard in the shell and performing operations
In this recipe, we will connect to a shard from a command prompt, see how to shard a collection, and observe the data splitting in action on some test data.
Getting ready
Obviously, we need a sharded mongo server setup up and running. See the previous recipe, Starting a simple sharded environment of two shards, for more details on how to set up a simple shard. The mongos process, as in the previous recipe, should be listening to port number 27017
. We have got some names in a JavaScript file called names.js
. This file needs to be downloaded from the Packt website and kept on the local filesystem. The file contains a variable called names
and the value is an array with some JSON documents as the values, each one representing a person. The contents look as follows:
names = [ {name:'James Smith', age:30}, {name:'Robert Johnson', age:22}, … ]
How to do it…
- Start the mongo shell and connect to the default port on localhost as follows. This will ensure that the names will be available in the current shell:
mongo --shell names.js MongoDB shell version: 3.0.2 connecting to: test mongos>
- Switch to the database that would be used to test the sharding; we call it
shardDB
:mongos> use shardDB
- Enable sharding at the database level as follows:
mongos> sh.enableSharding("shardDB")
- Shard a collection called
person
as follows:mongos>sh.shardCollection("shardDB.person", {name: "hashed"}, false)
- Add the test data to the sharded collection:
mongos> for(i = 1; i <= 300000 ; i++) { ... person = names[Math.round(Math.random() * 100) % 20] ... doc = {_id:i, name:person.name, age:person.age} ... db.person.insert(doc) }
- Execute the following to get a query plan and the number of documents on each shard:
mongos> db.person.getShardDistribution()
How it works…
This recipe demands some explanation. We downloaded a JavaScript file that defines an array of 20 people. Each element of the array is a JSON object with the name
and age
attributes. We start the shell connecting to the mongos process loaded with this JavaScript file. We then switch to shardDB
, which we use for the purpose of sharding.
For a collection to be sharded, the database in which it will be created needs to be enabled for the sharding first. We do this using sh.enableSharding()
.
The next step is to enable the collection to be sharded. By default, all the data will be kept on one shard and not split across different shards. Think about it; how will Mongo be able to split the data meaningfully? The whole intention is to split it meaningfully and as evenly as possible so that whenever we query based on the shard key, Mongo would easily be able to determine which shard(s) to query. If a query doesn't contain the shard key, the execution of the query will happen on all the shards and the data would then be collated by the mongos process before returning it to the client. Thus, choosing the right shard key is very crucial.
Let's now see how to shard the collection. We do this by invoking sh.shardCollection("shardDB.person", {name: "hashed"}, false)
. There are three parameters here:
- The fully qualified name of the collection in the
<db name>.<collection name>
format is the first parameter of theshardCollection
method. - The second parameter is the field name to shard on in the collection. This is the field that would be used to split the documents on the shards. One of the requirements of a good shard key is that it should have high cardinality. (The number of possible values should be high.) In our test data, the name value has very low cardinality and thus is not a good choice as a shard key. We hash this key when using this as a shard key. We do so by mentioning the key as
{name: "hashed"}
. - The last parameter specifies whether the value used as the shard key is unique or not. The name field is definitely not unique and thus it will be false. If the field was, say, the person's social security number, it could have been set as true. Additionally, SSN is a good choice for a shard key due to its high cardinality. Remember that the shard key has to be present for the query to be efficient.
The last step is to see the execution plan for the finding of all the data. The intent of this operation is to see how the data is being split across two shards. With 300,000 documents, we expect something around 150,000 documents on each shard. However, from the distribution statistics, we can observe that shard0000
has 1,49,715
documents whereas shard0001
has 150285
:
Shard shard0000 at localhost:27000 data : 15.99MiB docs : 149715 chunks : 2 estimated data per chunk : 7.99MiB estimated docs per chunk : 74857 Shard shard0001 at localhost:27001 data : 16.05MiB docs : 150285 chunks : 2 estimated data per chunk : 8.02MiB estimated docs per chunk : 75142 Totals data : 32.04MiB docs : 300000 chunks : 4 Shard shard0000 contains 49.9% data, 49.9% docs in cluster, avg obj size on shard : 112B Shard shard0001 contains 50.09% data, 50.09% docs in cluster, avg obj size on shard : 112B
There are a couple of additional suggestions that I would recommend you to do.
Connect to the individual shard from the mongo shell and execute queries on the person collection. See that the counts in these collections are similar to what we see in the preceding plan. Additionally, one can find out that no document exists on both the shards at the same time.
We discussed in brief about how cardinality affects the way the data is split across shards. Let's do a simple exercise. We first drop the person collection and execute the shardCollection operation again but, this time, with the {name: 1}
shard key instead of {name: "hashed"}
. This ensures that the shard key is not hashed and stored as is. Now, load the data using the JavaScript function we used earlier in step number 5, and then execute the explain()
command on the collection once the data is loaded. Observe how the data is now split (or not) across the shards.
There's more…
A lot of questions must now be coming up such as what are the best practices? What are some tips and tricks? How is the sharding thing pulled off by MongoDB behind the scenes in a way that is transparent to the end user?
This recipe here only explained the basics. In the administration section, all such questions will be answered.