Running a MapReduce job on Amazon EMR
This recipe involves running the MapReduce job on the cloud using AWS. You will need an AWS account in order to proceed. Register with AWS at http://aws.amazon.com/. We will see how to run a MapReduce job on the cloud using Amazon Elastic Map Reduce (Amazon EMR). Amazon EMR is a managed MapReduce service provided by Amazon on the cloud. Refer to https://aws.amazon.com/elasticmapreduce/ for more details. Amazon EMR consumes data, binaries/JARs, and so on from AWS S3 bucket, processes them and writes the results back to S3 bucket. Amazon Simple Storage Service (Amazon S3) is another service by AWS for data storage on the cloud. Refer to http://aws.amazon.com/s3/ for more details on Amazon S3. Though we will use the mongo-hadoop connector, an interesting fact is that we won't require a MongoDB instance to be up and running. We will use the MongoDB data dump stored in an S3 bucket for our data analysis. The MapReduce program will run on the input BSON dump...