Running MapReduce jobs on Hadoop using streaming
In our previous recipe, we implemented a simple MapReduce job using Hadoop's Java API. The use case was the same as the one in the recipes in Chapter 3, Programming Language Drivers, where we implemented MapReduce using the Mongo client APIs in Python and Java. In this recipe, we will use Hadoop streaming to implement MapReduce jobs.
Hadoop streaming works by communicating with the mapper and reducer processes over stdin and stdout. You can get more information on Hadoop streaming and how it works at http://hadoop.apache.org/docs/r1.2.1/streaming.html.
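To make the stdin/stdout contract concrete, here is a minimal sketch of a streaming-style mapper and reducer in Python. The word-count logic is purely illustrative (it is not the job this recipe builds), but the plumbing is what Hadoop streaming expects: the mapper reads raw lines from stdin and writes tab-separated key/value pairs to stdout, and the reducer receives those pairs sorted by key on its own stdin.

```python
#!/usr/bin/env python
# Illustrative Hadoop-streaming mapper/reducer (word count, not the
# recipe's actual job). Hadoop streaming pipes each input split to the
# mapper on stdin and reads "key<TAB>value" lines back on stdout; the
# shuffle phase then feeds the reducer the same pairs, sorted by key.
import sys


def map_lines(lines):
    """Mapper: emit a ('word', 1) pair per word as 'word\\t1' lines."""
    for line in lines:
        for word in line.split():
            yield "%s\t%d" % (word, 1)


def reduce_pairs(lines):
    """Reducer: sum counts per key; assumes key-sorted input,
    which is what Hadoop's shuffle guarantees."""
    current_key, total = None, 0
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current_key:
            if current_key is not None:
                yield "%s\t%d" % (current_key, total)
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        yield "%s\t%d" % (current_key, total)


if __name__ == "__main__":
    # Run the same script as either stage, e.g.
    #   -mapper  "stream_wc.py map"
    #   -reducer "stream_wc.py reduce"
    # (the file name and argument convention here are assumptions).
    stage = sys.argv[1] if len(sys.argv) > 1 else "map"
    emit = map_lines if stage == "map" else reduce_pairs
    for out in emit(sys.stdin):
        print(out)
```

Because each stage is just a process reading stdin and writing stdout, you can test the pipeline locally with ordinary shell pipes (`cat input.txt | ./stream_wc.py map | sort | ./stream_wc.py reduce`) before submitting it to Hadoop.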
Getting ready…
Refer to the Executing our first sample MapReduce job using the mongo-hadoop connector recipe in this chapter to see how to set up Hadoop for development purposes and build the mongo-hadoop project using Gradle. As far as the Python libraries are concerned, we will install the required library from source; however, you can use pip (Python's package manager) to set it up if you do not wish to build...