Installing a single-node cluster - HDFS components
Usually the term cluster means a group of machines, but in this recipe, we will be installing various Hadoop daemons on a single node. The single machine will act as both the master and slave for the storage and processing layer.
Getting ready
You will need some information before stepping through this recipe.
Although Hadoop can be configured to run as the root user, it is good practice to run it as a non-privileged user. In this recipe, we are using the node name nn1.cluster1.com, preinstalled with CentOS 6.5.
Tip
Create a system user named hadoop and set a password for that user.
Install a JDK, which will be used by the Hadoop services. The minimum recommended JDK version is 1.7; OpenJDK can also be used.
How to do it...
- Log into the machine/host as the root user and install the JDK:
# yum install jdk -y
Alternatively, it can be installed from the RPM package:
# rpm -ivh jdk-1.7u45.rpm
- Once Java is installed, make sure Java is in PATH for execution. This can be done by setting JAVA_HOME and exporting it as an environment variable. [Screenshot: contents of the directory where Java is installed]
# export JAVA_HOME=/usr/java/latest
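Setting JAVA_HOME on its own does not change where the shell looks for the java binary; the JDK's bin directory also has to be on PATH. A minimal sketch, assuming the JDK is installed under /usr/java/latest as in this recipe:

```shell
# Assumption: the JDK lives under /usr/java/latest, as in this recipe.
export JAVA_HOME=/usr/java/latest

# Prepend the JDK's bin directory so that `java` resolves from PATH.
export PATH="$JAVA_HOME/bin:$PATH"

# Show which java binary the shell would run now, or say that none was found.
command -v java || echo "java not found in PATH; check JAVA_HOME"
```

These exports only affect the current shell session.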
- Now we need to copy the tarball hadoop-2.7.3.tar.gz, which was built in the Build Hadoop section earlier in this chapter, to the home directory of the user root. To do this, log in to the node where Hadoop was built and execute the following command:
# scp -r hadoop-2.7.3.tar.gz [email protected]:~/
- Create a directory named /opt/cluster to be used for Hadoop:
# mkdir -p /opt/cluster
- Then untar hadoop-2.7.3.tar.gz into the directory created in the preceding step:
# tar -xzvf hadoop-2.7.3.tar.gz -C /opt/cluster/
- Create a user named hadoop, if you haven't already, and set the password as hadoop:
# useradd hadoop
# echo hadoop | passwd --stdin hadoop
- As the tarball was extracted by the root user, the directories and files under /opt/cluster will be owned by root. Change the ownership to the hadoop user:
# chown -R hadoop:hadoop /opt/cluster/
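To confirm the ownership change, the owner of the directory can be checked. This is only a sketch: check_owner is a hypothetical helper introduced for illustration, it relies on the GNU stat -c option available on CentOS, and it inspects only the path it is given rather than the whole tree:

```shell
# check_owner PATH USER (hypothetical helper): succeed only if PATH is owned by USER.
check_owner() {
    owner=$(stat -c '%U' "$1") || return 1   # GNU stat: %U prints the owner's user name
    [ "$owner" = "$2" ]
}

# Only run the check where the directory from this recipe actually exists.
if [ -d /opt/cluster ]; then
    if check_owner /opt/cluster hadoop; then
        echo "/opt/cluster is owned by hadoop"
    else
        echo "/opt/cluster is not owned by hadoop; rerun chown as root"
    fi
fi
```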
- If the user lists the directory structure under /opt/cluster, it will contain the extracted hadoop-2.7.3 directory.
- The directory structure under /opt/cluster/hadoop-2.7.3 will look as follows: [Screenshot: directory listing of /opt/cluster/hadoop-2.7.3]
- The listing shows etc, bin, sbin, and other directories.
- The etc/hadoop directory is the one that contains the configuration files for the various Hadoop daemons. Some of the key files are core-site.xml, hdfs-site.xml, hadoop-env.sh, and mapred-site.xml, among others, which will be explained in later sections.
- The directories bin and sbin contain the executables that are used to start and stop Hadoop daemons and to perform other operations such as filesystem listing, copying, deleting, and so on.
- To execute a command such as /opt/cluster/hadoop-2.7.3/bin/hadoop, the complete path to the command needs to be specified. This can be cumbersome, and can be avoided by setting the environment variable HADOOP_HOME and adding its bin and sbin directories to PATH.
- Similarly, there are other variables that need to be set that point to the binaries and the configuration file locations.
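These variables are usually collected in a single environment file. A minimal sketch of what the hadoopenv.sh file mentioned in the next step could contain is shown here. The variable list, the HADOOP_HOME value /opt/cluster/hadoop (which assumes a symlink of that name pointing at hadoop-2.7.3), and the /etc/profile.d location in the final comment are assumptions for illustration, not taken from this recipe. The script writes the file into the current directory so that it can be reviewed first:

```shell
# Write a candidate environment file into the current directory for review.
# Assumption: /opt/cluster/hadoop is a symlink to /opt/cluster/hadoop-2.7.3.
cat > hadoopenv.sh <<'EOF'
export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=/opt/cluster/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
EOF

# Load the variables into the current shell to try them out.
. ./hadoopenv.sh
echo "HADOOP_HOME=$HADOOP_HOME"
echo "HADOOP_CONF_DIR=$HADOOP_CONF_DIR"

# To make it system-wide (assumed location), copy it as root:
#   cp hadoopenv.sh /etc/profile.d/hadoopenv.sh
```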
- The environment file is set up system-wide so that any user can use the commands. Once the hadoopenv.sh file is in place, execute the command to export the variables defined in it.
- Change to the hadoop user using the command su - hadoop.
- Change to the /opt/cluster directory and create a symlink.
- To verify that the preceding changes are in place, the user can execute either the which hadoop or which java commands, or execute the command hadoop directly without specifying the complete path.
- In addition to setting the environment as discussed, the user has to add the JAVA_HOME variable in the hadoop-env.sh file.
- The next thing is to set up the Namenode address, which specifies the host:port address on which it will listen. This is done using the file core-site.xml.
- The important property to keep in mind here is fs.defaultFS.
- The next thing that the user needs to configure is the location where the Namenode will store its metadata. This can be any location, but it is recommended that you always have a dedicated disk for it. This is configured in the file hdfs-site.xml.
- The next step is to format the Namenode. This will create an HDFS file system:
$ hdfs namenode -format
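The two configuration steps that precede the format command can be sketched as follows. The port 9000 and the metadata directory /data/namenode are assumptions chosen for illustration; the host name is the one used in this recipe. The files are written to a scratch directory named conf-sketch (a hypothetical name) so that they can be reviewed before being copied into etc/hadoop:

```shell
# Scratch directory for the generated files (hypothetical name).
CONF_DIR=conf-sketch
mkdir -p "$CONF_DIR"

# core-site.xml: fs.defaultFS carries the Namenode's host:port address.
# Assumption: port 9000.
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nn1.cluster1.com:9000</value>
  </property>
</configuration>
EOF

# hdfs-site.xml: where the Namenode keeps its metadata.
# Assumption: /data/namenode exists and is writable by the hadoop user.
cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/namenode</value>
  </property>
</configuration>
EOF

echo "Wrote $CONF_DIR/core-site.xml and $CONF_DIR/hdfs-site.xml"
```

The Datanode directory would go into the same hdfs-site.xml as a further property element, using dfs.datanode.data.dir.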
- Similarly, we have to add the property for the Datanode directory in hdfs-site.xml. Nothing needs to be done to the core-site.xml file.
- Then the services need to be started for the Namenode and Datanode:
$ hadoop-daemon.sh start namenode
$ hadoop-daemon.sh start datanode
- The command jps can be used to check for running daemons.
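A small check over the jps output can confirm that both daemons are up. This is a sketch: check_daemons is a hypothetical helper, and it assumes that jps prints the daemon class names NameNode and DataNode, one process per line:

```shell
# check_daemons (hypothetical helper): read `jps` output on stdin and report
# whether each expected daemon name appears as a whole word.
check_daemons() {
    out=$(cat)
    for daemon in NameNode DataNode; do
        if printf '%s\n' "$out" | grep -qw "$daemon"; then
            echo "$daemon: running"
        else
            echo "$daemon: NOT running"
        fi
    done
}

# Run the check only where the JDK's jps tool is available.
if command -v jps >/dev/null 2>&1; then
    jps | check_daemons
fi
```

Matching whole words keeps a process such as SecondaryNameNode from being counted as the NameNode.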
How it works...
The master Namenode stores metadata and the slave node Datanode stores the blocks. When the Namenode is formatted, it creates a data structure that contains fsimage, edits, and VERSION. These are very important for the functioning of the cluster.
The parameters dfs.data.dir and dfs.datanode.data.dir serve the same purpose, but belong to different Hadoop versions. The older parameters are deprecated in favor of the newer ones, but they will still work. Likewise, the parameter dfs.name.dir has been deprecated in favor of dfs.namenode.name.dir in Hadoop 2.x. Both versions of the parameters are shown to make the point that parameters keep evolving, so the release notes for each Hadoop version should be consulted.
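One way to catch the older names is to scan a configuration file for them. The sketch below covers only the two old-to-new pairs named above; warn_deprecated is a hypothetical helper, and it matches the exact text <name>old.name</name> with no whitespace inside the element:

```shell
# warn_deprecated FILE (hypothetical helper): print a hint for each deprecated
# property name found in FILE.
warn_deprecated() {
    file=$1
    # old:new pairs, limited to the two mentioned in the text above
    for pair in dfs.name.dir:dfs.namenode.name.dir dfs.data.dir:dfs.datanode.data.dir; do
        old=${pair%%:*}   # text before the colon
        new=${pair#*:}    # text after the colon
        if grep -qF "<name>$old</name>" "$file"; then
            echo "$file: $old is deprecated, use $new"
        fi
    done
}

# Scan the live hdfs-site.xml only if HADOOP_CONF_DIR points at one.
if [ -f "${HADOOP_CONF_DIR:-}/hdfs-site.xml" ]; then
    warn_deprecated "$HADOOP_CONF_DIR/hdfs-site.xml"
fi
```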
There's more...
Setting up ResourceManager and NodeManager
In the preceding recipe, we set up the storage layer (that is, HDFS for storing data), but what about the processing layer? The data on HDFS needs to be processed, using MapReduce, Tez, Spark, or any other tool, before meaningful decisions can be made from it. To run MapReduce, Spark, or another processing framework, we need the ResourceManager and NodeManager.