HBase is an Apache project, and the current version, 0.98.7, is available as a stable release. HBase version 0.98.7 supersedes versions 0.94.x and 0.96.x.
Note
This book focuses only on HBase version 0.98.7, as this version is fully supported and tested with Hadoop 2.x and deprecates the use of Hadoop 1.x.
Hadoop 2.x is much faster than Hadoop 1.x and includes important bug fixes that improve overall HBase performance.
The older 0.96.x versions of HBase, which are now obsolete, supported both Hadoop 1.x and 2.x. HBase versions prior to 0.96.x supported only Hadoop 1.x.
HBase is written in Java, works on top of Hadoop, and relies on ZooKeeper. An HBase cluster can be set up in either local or distributed mode. Distributed mode can be further classified into pseudo-distributed and fully distributed mode.
Note
HBase is designed and developed to work on Linux-based operating systems; hence, the commands referred to in this book are only for a Linux-based OS, for example, CentOS. On Windows, it is recommended that you use a CentOS-based virtual machine to work with HBase.
An HBase cluster requires only Oracle Java to be installed on all the machines that are part of the cluster. If any other flavor of Java, such as OpenJDK, was installed with the operating system, it needs to be uninstalled before installing Oracle Java. HBase and other components such as Hadoop and ZooKeeper require Java 6 or later.
Perform the following steps for installing Java 1.7 or later:
- Download the jdk-7u55-linux-x64.rpm kit from Oracle's website at http://www.oracle.com/technetwork/java/javase/downloads/index.html.
- Make sure that the root user has all the permissions on the file before installation.
- Install the RPM.
- Finally, add the JAVA_HOME environment variable by writing it to the /etc/profile file, which contains the system-wide environment configuration.
- Once JAVA_HOME is added to the profile, either close the command window and reopen it or reload the profile in the current session. This step is required to pick up the latest profile settings for the user.
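The commands for these steps are not reproduced here, so the following is only a minimal shell sketch of what they might look like on an RPM-based system such as CentOS. The helper names install_jdk_rpm and add_profile_var are hypothetical, and the JAVA_HOME path /usr/java/jdk1.7.0_55 is an assumption about where the Oracle RPM places the JDK; check the actual location on your machine.

```shell
#!/bin/sh
# Sketch of the Java installation steps; meant to be run as root.
# Assumption: the Oracle JDK 7u55 RPM installs under /usr/java/jdk1.7.0_55.

JDK_RPM="jdk-7u55-linux-x64.rpm"
JDK_HOME_GUESS="/usr/java/jdk1.7.0_55"

# Grant the owner (root) full permissions on the kit, then install it.
install_jdk_rpm() {
  chmod u+rwx "$1" && rpm -ivh "$1"
}

# Append "export NAME=VALUE" to a profile file unless the line is already there.
# $1 = variable name, $2 = value, $3 = profile file (defaults to /etc/profile).
add_profile_var() {
  profile="${3:-/etc/profile}"
  line="export $1=$2"
  grep -qxF -e "$line" "$profile" 2>/dev/null || printf '%s\n' "$line" >> "$profile"
}

# Typical use (commented out so that pasting this block changes nothing):
#   install_jdk_rpm "$JDK_RPM"
#   add_profile_var JAVA_HOME "$JDK_HOME_GUESS"
#   . /etc/profile    # reload the profile in the current shell
```

The duplicate check in add_profile_var is a design choice of this sketch, so that running the step twice does not leave two identical export lines in /etc/profile.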
The local or standalone mode means running all HBase services in just one Java process. Setting up HBase in the local mode is the easiest way to get started with HBase and can be used to explore further or for local development. The only step required is to download a recent release of HBase and unpack the archive (.tar) in some directory such as /opt. Perform the following steps to set up HBase in the local mode:
- Create the hbase directory.
- Download the hbase binaries as an archive (.tar) file. The mirror URL, for example http://mirrors.sonic.net/apache/hbase/, differs from user to user depending on the user's location. Check the suggested mirror site at http://www.apache.org/dyn/closer.cgi/hbase/ for the URL that applies to you.
Note
HBase version 0.98.7 is available for Hadoop 1 and 2 as hbase-0.98.7-hadoop1-bin.tar.gz and hbase-0.98.7-hadoop2-bin.tar.gz. It is recommended that you use only Hadoop 2 with HBase 0.98.7; Hadoop 1 support is deprecated. In the local mode, a Hadoop cluster is not required, as HBase can use the Hadoop binaries provided in its lib directory. Other versions of HBase can also be found at http://www.apache.org/dyn/closer.cgi/hbase/.
- Once the HBase binaries are downloaded, extract them.
- Add the HBASE_HOME environment variable by writing it to the /etc/profile file, which contains the system-wide environment configuration.
- Once HBASE_HOME is added to the profile, either close the command window and reopen it or reload the profile in the current session; this step is required to pick up the latest profile settings for the user.
- Edit the configuration file, conf/hbase-site.xml, and set the data directories for HBase by assigning values to the property keys named hbase.rootdir and hbase.zookeeper.property.dataDir. The default base directory for the hbase.rootdir and hbase.zookeeper.property.dataDir properties is /tmp/hbase-${user.name}, that is, /tmp/hbase-root for the "root" user, which may lead to data loss when the server reboots. Hence, it is always advisable to set these properties explicitly to avoid a data-loss scenario.
- Start HBase and verify the output of the start command.
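The local-mode steps above can be sketched in shell as follows. This is an illustration under stated assumptions, not the exact commands from the original: the helper names hbase_tarball_url and write_local_site are hypothetical, the install location /opt/hbase and the data directory /var/lib/hbase-data are placeholders, and the mirror layout (a hbase-<version> subdirectory holding the tarball) is assumed.

```shell
#!/bin/sh
# Sketch of the local-mode setup. Paths and helper names are assumptions.

HBASE_VERSION="0.98.7"
MIRROR="http://mirrors.sonic.net/apache/hbase"   # replace with your suggested mirror
INSTALL_DIR="/opt/hbase"                          # assumed install location

# Build the tarball URL for a Hadoop flavor ("hadoop1" or "hadoop2").
# Assumed layout: <mirror>/hbase-<version>/hbase-<version>-<flavor>-bin.tar.gz
hbase_tarball_url() {
  echo "$MIRROR/hbase-$HBASE_VERSION/hbase-$HBASE_VERSION-$1-bin.tar.gz"
}

# Write a minimal hbase-site.xml that moves both data directories out of /tmp.
# $1 = conf directory, $2 = absolute base directory for HBase and ZooKeeper data.
write_local_site() {
  cat > "$1/hbase-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file://$2/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>$2/zookeeper</value>
  </property>
</configuration>
EOF
}

# Typical use (commented out so that pasting this block changes nothing):
#   mkdir -p "$INSTALL_DIR" && cd "$INSTALL_DIR"
#   wget "$(hbase_tarball_url hadoop2)"
#   tar -xzf "hbase-$HBASE_VERSION-hadoop2-bin.tar.gz"
#   export HBASE_HOME="$INSTALL_DIR/hbase-$HBASE_VERSION-hadoop2"
#   write_local_site "$HBASE_HOME/conf" /var/lib/hbase-data
#   "$HBASE_HOME/bin/start-hbase.sh"
```

Any directory that survives a reboot will do for the second argument of write_local_site; the point is only to avoid the /tmp default described above.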
HBase also comes with a preinstalled web-based management console that can be accessed using http://localhost:60010. By default, it is deployed on HBase's Master host at port 60010. This UI provides information about various components such as region servers, tables, running tasks, logs, and so on, as shown in the following screenshot:
The HBase tables and monitored tasks are shown in the following screenshot:
The following screenshot displays information about the HBase attributes, provided by the UI:
Once the HBase setup is done correctly, the following directories are created in a local filesystem, as shown in the following screenshot:
The pseudo-distributed mode
The standalone/local mode is only useful for basic operations and is not at all suitable for real-world workloads. In the pseudo-distributed mode, all HBase services (HMaster, HRegionServer, and ZooKeeper) run as separate Java processes on a single machine. This mode can be useful during the testing phase.
In the pseudo-distributed mode, an HDFS setup is another prerequisite (the HDFS setup also needs to be in pseudo-distributed mode). After setting up Hadoop and downloading the HBase binary, edit the conf/hbase-site.xml configuration file. Set HBase to run in distributed mode by assigning a value to the property key named hbase.cluster.distributed, and point the data storage to the running Hadoop HDFS instance by assigning a value to the property key named hbase.rootdir.
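As a hedged illustration of these two settings, the following shell sketch writes a pseudo-distributed hbase-site.xml. The function name write_pseudo_site is hypothetical, and the default NameNode address hdfs://localhost:9000 is an assumption; it must match whatever address your Hadoop installation is actually configured with.

```shell
#!/bin/sh
# Write a pseudo-distributed hbase-site.xml.
# $1 = conf directory
# $2 = NameNode URI (assumed default: hdfs://localhost:9000)
write_pseudo_site() {
  namenode="${2:-hdfs://localhost:9000}"
  cat > "$1/hbase-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>$namenode/hbase</value>
  </property>
</configuration>
EOF
}

# Typical use (commented out so that pasting this block changes nothing):
#   write_pseudo_site "$HBASE_HOME/conf"
```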
Once the settings are done, we can start HBase using the bin/start-hbase.sh script.
Note
Before starting HBase, make sure that the Hadoop services are running and working fine.
Once HBase is configured correctly, the jps command should show the HMaster and HRegionServer processes running along with the Hadoop processes. Use the hadoop fs command in Hadoop's bin/ directory to list the directories created in HDFS.
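The verification described above could be scripted roughly as follows. The function name missing_hbase_daemons is hypothetical, and the /hbase path in the usage comment assumes that hbase.rootdir ends in /hbase.

```shell
#!/bin/sh
# Read a jps listing ("<pid> <name>" per line) on stdin and print the name of
# each expected HBase daemon that does not appear in it.
missing_hbase_daemons() {
  listing=$(cat)
  for daemon in HMaster HRegionServer; do
    echo "$listing" | awk '{print $2}' | grep -qxF -e "$daemon" || echo "$daemon"
  done
}

# Typical use (commented out so that pasting this block changes nothing):
#   "$HBASE_HOME/bin/start-hbase.sh"
#   jps | missing_hbase_daemons    # prints nothing when both daemons are up
#   hadoop fs -ls /hbase           # lists the directories HBase created in HDFS
```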
The fully distributed mode
The pseudo-distributed mode, where all the HBase services (HMaster, HRegionServer, and ZooKeeper) run as separate Java processes on a single machine, is preferred for a local development or test environment. However, for a production environment, the fully distributed mode is a must. In the fully distributed mode, an HBase cluster is set up on multiple nodes and the HBase services run on these different cluster nodes. To enable the fully distributed mode, add the hbase.cluster.distributed property to conf/hbase-site.xml and set it to true; also point the hbase.rootdir property to the HDFS node.
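The following shell sketch checks an existing hbase-site.xml for the two settings just described. It is a rough aid, not a real XML parser: the function names site_value and check_fully_distributed are hypothetical, the parsing assumes one tag per line, and the rejection of localhost addresses is an added assumption (in a multi-node cluster, region servers on other machines could not reach a NameNode addressed as localhost).

```shell
#!/bin/sh
# Print the <value> that follows a given <name> in a Hadoop-style site file.
# $1 = site file, $2 = property name. Assumes one tag per line.
site_value() {
  awk -v key="<name>$2</name>" '
    index($0, key) { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, ""); sub(/<\/value>.*/, ""); print; exit
    }' "$1"
}

# Return 0 if the file enables distributed mode and roots HBase in a
# non-local HDFS URI; otherwise print the reason and return 1.
check_fully_distributed() {
  [ "$(site_value "$1" hbase.cluster.distributed)" = "true" ] || {
    echo "hbase.cluster.distributed is not true"; return 1; }
  case "$(site_value "$1" hbase.rootdir)" in
    hdfs://localhost*|hdfs://127.*)
      echo "hbase.rootdir points at the local host"; return 1 ;;
    hdfs://*)
      return 0 ;;
    *)
      echo "hbase.rootdir is not an hdfs:// URI"; return 1 ;;
  esac
}

# Typical use (commented out so that pasting this block changes nothing):
#   check_fully_distributed "$HBASE_HOME/conf/hbase-site.xml"
```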
Note
This book does not cover building a fully distributed HBase cluster, nor does it discuss hardware considerations, such as server configurations and network settings, or software considerations, such as server OS settings and Hadoop settings. For this book, it is recommended that you use either the local mode or the pseudo-distributed mode.
To understand this mode in depth, you need to understand the building blocks that play a vital role in a fully distributed HBase cluster. The next section gives you a glimpse of these components.