Hadoop Beginner's Guide

Hadoop Beginner's Guide: Get your mountain of data under control with Hadoop. This guide requires no prior knowledge of the software or cloud services – just a willingness to learn the basics from this practical step-by-step tutorial.


Hadoop Beginner's Guide

Chapter 2. Getting Hadoop Up and Running

Now that we have explored the opportunities and challenges presented by large-scale data processing and why Hadoop is a compelling choice, it's time to get things set up and running.

In this chapter, we will do the following:

  • Learn how to install and run Hadoop on a local Ubuntu host

  • Run some example Hadoop programs and get familiar with the system

  • Set up the accounts required to use Amazon Web Services products such as EMR

  • Create an on-demand Hadoop cluster on Elastic MapReduce

  • Explore the key differences between a local and hosted Hadoop cluster

Hadoop on a local Ubuntu host


For our exploration of Hadoop outside the cloud, we shall give examples using one or more Ubuntu hosts. A single machine (be it a physical computer or a virtual machine) is sufficient to run all the parts of Hadoop and explore MapReduce. Production clusters, however, will most likely involve many more machines, so having even a development Hadoop cluster deployed across multiple hosts is good experience; for getting started, though, a single host will suffice.

Nothing we discuss will be unique to Ubuntu, and Hadoop should run on any Linux distribution. Obviously, you may have to alter how the environment is configured if you use a distribution other than Ubuntu, but the differences should be slight.

Other operating systems

Hadoop does run well on other platforms, and Windows and Mac OS X are popular choices for developers. However, Windows is supported only as a development platform, and Mac OS X is not formally supported at all.

If you choose to use such a platform, the...

Time for action – checking the prerequisites


Hadoop is written in Java, so you will need a recent Java Development Kit (JDK) installed on the Ubuntu host. Perform the following steps to check the prerequisites:

  1. First, check what's already available by opening up a terminal and typing the following:

    $ javac
    $ java -version
    
  2. If either of these commands gives a "no such file or directory" or similar error, or if the latter mentions "OpenJDK", it's likely that you need to download the full JDK. Grab this from the Oracle download page at http://www.oracle.com/technetwork/java/javase/downloads/index.html; you should get the latest release.

  3. Once Java is installed, add the JDK/bin directory to your path and set the JAVA_HOME environment variable with commands such as the following, modified for your specific Java version:

    $ export JAVA_HOME=/opt/jdk1.6.0_24
    $ export PATH=$JAVA_HOME/bin:${PATH}
    

What just happened?

These steps ensure the right version of Java is installed and available from the command line...
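
Note that variables set with export in this way last only for the current shell session. To make them permanent you could append the same lines to your shell startup file; a minimal sketch, assuming bash and the example JDK path shown above:

$ echo 'export JAVA_HOME=/opt/jdk1.6.0_24' >> ~/.bashrc
$ echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
$ source ~/.bashrc
$ java -version    # confirm the expected JDK is now the one found on the path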

Time for action – downloading Hadoop


Carry out the following steps to download Hadoop:

  1. Go to the Hadoop download page at http://hadoop.apache.org/common/releases.html and retrieve the latest stable version of the 1.0.x branch; at the time of this writing, it was 1.0.4.

  2. You'll be asked to select a local mirror; after that you need to download the file with a name such as hadoop-1.0.4-bin.tar.gz.

  3. Copy this file to the directory where you want Hadoop to be installed (for example, /usr/local), using the following command:

    $ cp hadoop-1.0.4-bin.tar.gz /usr/local
    
  4. Decompress the file by using the following command:

    $ cd /usr/local
    $ tar -xf hadoop-1.0.4-bin.tar.gz
    
  5. Add a convenient symlink to the Hadoop installation directory.

    $ ln -s /usr/local/hadoop-1.0.4 /usr/local/hadoop
    
  6. Now you need to add the Hadoop binary directory to your path and set the HADOOP_HOME environment variable, just as we did earlier with Java.

    $ export HADOOP_HOME=/usr/local/hadoop
    $ export PATH=$HADOOP_HOME/bin:$PATH
    
  7. Go into the conf directory within...
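
Before moving on to configuration, it is worth a quick check that the shell can now find the Hadoop launcher. A small verification sketch, assuming the environment variables set in the previous step:

$ which hadoop     # should print a path under your Hadoop installation's bin directory
$ hadoop version   # should report the 1.0.4 release downloaded above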

Time for action – setting up SSH


Carry out the following steps to set up SSH:

  1. Create a new SSH key pair with the following commands:

    $ ssh-keygen
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): 
    Created directory '/home/hadoop/.ssh'.
    Enter passphrase (empty for no passphrase): 
    Enter same passphrase again: 
    Your identification has been saved in /home/hadoop/.ssh/id_rsa.
    Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
    
    
  2. Copy the new public key to the list of authorized keys by using the following command:

    $ cp .ssh/id_rsa.pub .ssh/authorized_keys
    
  3. Connect to the local host.

    $ ssh localhost
    The authenticity of host 'localhost (127.0.0.1)' can't be established.
    RSA key fingerprint is b6:0c:bd:57:32:b6:66:7c:33:7b:62:92:61:fd:ca:2a.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
    
  4. Confirm that the password-less SSH is working.

    $ ssh localhost...
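
If the connection still prompts for a password, the most common cause is over-permissive file permissions, which sshd rejects. A fix-up sketch, assuming the default key locations used above:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys
$ ssh localhost    # should now log in without asking for a password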

Time for action – using Hadoop to calculate Pi


We will now use a sample Hadoop program to calculate the value of Pi. Right now, this is primarily to validate the installation and to show how quickly you can get a MapReduce job to execute. Assuming the HADOOP_HOME/bin directory is in your path, type the following commands:

$ hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar pi 4 1000
Number of Maps  = 4
Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Starting Job
12/10/26 22:56:11 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/10/26 22:56:11 INFO mapred.FileInputFormat: Total input paths to process : 4
12/10/26 22:56:12 INFO mapred.JobClient: Running job: job_local_0001
12/10/26 22:56:12 INFO mapred.FileInputFormat: Total input paths to process : 4
12/10/26 22:56:12 INFO mapred.MapTask: numReduceTasks: 1

12/10/26 22:56:14 INFO mapred.JobClient:  map 100% reduce 100%
12/10/26 22:56:14...
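
The two numbers passed to the pi example are the number of map tasks and the number of samples generated per map; more samples give a more accurate estimate at the cost of a longer run. For instance, to use 16 maps of 100,000 samples each (an illustrative choice, not a value from the book), you could run:

$ hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar pi 16 100000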

Time for action – configuring the pseudo-distributed mode


Take a look in the conf directory within the Hadoop distribution. There are many configuration files, but the ones we need to modify are core-site.xml, hdfs-site.xml and mapred-site.xml.

  1. Modify core-site.xml to look like the following code:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <!-- Put site-specific property overrides in this file. -->
    
    <configuration>
    <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    </property>
    </configuration>
  2. Modify hdfs-site.xml to look like the following code:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <!-- Put site-specific property overrides in this file. -->
    
    <configuration>
    <property>
    <name>dfs.replication</name>
    <value>1</value>
    </property>
    </configuration>
  3. Modify mapred...
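
The listing for this last step is cut off above. For reference, a typical Hadoop 1.x pseudo-distributed setup points the JobTracker at the local host in mapred-site.xml; a sketch using the conventional port 9001 (verify the values against the book's full listing):

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- Put site-specific property overrides in this file. -->

    <configuration>
    <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
    </property>
    </configuration>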

Time for action – changing the base HDFS directory


Let's first set the base directory that specifies the location on the local filesystem under which Hadoop will keep all its data. Carry out the following steps:

  1. Create a directory into which Hadoop will store its data:

    $ mkdir /var/lib/hadoop
    
  2. Ensure the directory is writeable by any user:

    $ chmod 777 /var/lib/hadoop
    
  3. Modify core-site.xml once again to add the following property:

    <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/lib/hadoop</value>
    </property>
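
Putting this together with the earlier change, core-site.xml now carries both properties. A sketch of the complete file, assuming the values used in this chapter:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- Put site-specific property overrides in this file. -->

    <configuration>
    <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    </property>
    <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/lib/hadoop</value>
    </property>
    </configuration>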

What just happened?

As we will be storing data in Hadoop and all the various components are running on our local host, this data will need to be stored on our local filesystem somewhere. Regardless of the mode, Hadoop by default uses the hadoop.tmp.dir property as the base directory under which all files and data are written.

MapReduce, for example, uses a /mapred directory under this base directory; HDFS uses /dfs. The danger is that the default value...

Time for action – formatting the NameNode


Before starting Hadoop in either pseudo-distributed or fully distributed mode for the first time, we need to format the HDFS filesystem that it will use. Type the following:

$  hadoop namenode -format

The output of this should look like the following:

$ hadoop namenode -format
12/10/26 22:45:25 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = vm193/10.0.0.193
STARTUP_MSG:   args = [-format]

12/10/26 22:45:25 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop
12/10/26 22:45:25 INFO namenode.FSNamesystem: supergroup=supergroup
12/10/26 22:45:25 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/10/26 22:45:25 INFO common.Storage: Image file of size 96 saved in 0 seconds.
12/10/26 22:45:25 INFO common.Storage: Storage directory /var/lib/hadoop-hadoop/dfs/name has been successfully formatted.
12/10/26 22:45:26 INFO namenode.NameNode: SHUTDOWN_MSG...

Time for action – starting Hadoop


Unlike the local mode of Hadoop, where all the components run only for the lifetime of the submitted job, with the pseudo-distributed or fully distributed mode of Hadoop, the cluster components exist as long-running processes. Before we use HDFS or MapReduce, we need to start up the needed components. Type the following commands; the output should look as shown next, where the commands are included on the lines prefixed by $:

  1. Type in the first command:

    $ start-dfs.sh
    starting namenode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-namenode-vm193.out
    localhost: starting datanode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-datanode-vm193.out
    localhost: starting secondarynamenode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-vm193.out
    
  2. Type in the second command:

    $ jps
    9550 DataNode
    9687 Jps
    9638 SecondaryNameNode
    9471 NameNode
    
  3. Type in the third command:

    $ hadoop dfs -ls /
    Found 2 items
    drwxr-xr-x   - hadoop...
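
Note that start-dfs.sh brings up only the HDFS daemons. In Hadoop 1.x the MapReduce daemons are started by a separate script, and they need to be running before jobs are submitted to the pseudo-distributed cluster; a sketch, assuming the same installation:

$ start-mapred.sh
$ jps    # the list should now also include JobTracker and TaskTracker

Alternatively, start-all.sh starts both sets of daemons in one go, and stop-all.sh shuts everything down when you are finished.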

Time for action – using HDFS


As the hadoop dfs -ls example in the previous section shows, HDFS provides a familiar-looking interface that allows us to use commands similar to those in Unix to manipulate files and directories on the filesystem. Let's try it out by typing the following commands:

$ hadoop fs -mkdir /user
$ hadoop fs -mkdir /user/hadoop
$ hadoop fs -ls /user
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2012-10-26 23:09 /user/hadoop
$ echo "This is a test." >> test.txt
$ cat test.txt
This is a test.
$ hadoop dfs -copyFromLocal test.txt  .
$ hadoop dfs -ls
Found 1 items
-rw-r--r--   1 hadoop supergroup         16 2012-10-26 23:19 /user/hadoop/test.txt
$ hadoop dfs -cat test.txt
This is a test.
$ rm test.txt 
$ hadoop dfs -cat test.txt
This is a test.
$ hadoop fs -copyToLocal test.txt
$ cat test.txt
This is a test.

What just happened?

This example shows the use of the fs subcommand to the hadoop utility. (Note that the dfs and fs commands are equivalent.) Like...
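
Beyond the commands used above, the fs subcommand supports most of the usual file-manipulation verbs. A few common ones are shown here as a quick reference; the paths are purely illustrative:

$ hadoop fs -put localfile.txt /user/hadoop/     # copy a local file into HDFS (similar to -copyFromLocal)
$ hadoop fs -get /user/hadoop/test.txt copy.txt  # copy a file out of HDFS (similar to -copyToLocal)
$ hadoop fs -rm /user/hadoop/test.txt            # delete a file stored in HDFS
$ hadoop fs -du /user/hadoop                     # report the space used beneath a directory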

Time for action – WordCount, the Hello World of MapReduce


Many applications, over time, acquire a canonical example that no beginner's guide should be without. For Hadoop, this is WordCount – an example bundled with Hadoop that counts the frequency of words in an input text file.

  1. First execute the following commands:

    $ hadoop dfs -mkdir data
    $ hadoop dfs -cp test.txt data
    $ hadoop dfs -ls data
    Found 1 items
    -rw-r--r--   1 hadoop supergroup         16 2012-10-26 23:20 /user/hadoop/data/test.txt
    
  2. Now execute these commands:

    $ hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar wordcount data out
    12/10/26 23:22:49 INFO input.FileInputFormat: Total input paths to process : 1
    12/10/26 23:22:50 INFO mapred.JobClient: Running job: job_201210262315_0002
    12/10/26 23:22:51 INFO mapred.JobClient:  map 0% reduce 0%
    12/10/26 23:23:03 INFO mapred.JobClient:  map 100% reduce 0%
    12/10/26 23:23:15 INFO mapred.JobClient:  map 100% reduce 100%
    12/10/26 23:23:17 INFO mapred.JobClient: Job complete: job_201210262315_0002...
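
Once the job completes, the results live in the out directory on HDFS rather than on the local filesystem. A sketch of how you might inspect them; part-r-00000 is the usual name for a single-reducer job's output, so check the actual listing on your cluster:

$ hadoop fs -ls out
$ hadoop fs -cat out/part-r-00000    # each line holds a word and the number of times it was seen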

Using Elastic MapReduce


We will now turn to Hadoop in the cloud, the Elastic MapReduce service offered by Amazon Web Services. There are multiple ways to access EMR, but for now we will focus on the provided web console to contrast a full point-and-click approach to Hadoop with the previous command-line-driven examples.

Setting up an account in Amazon Web Services

Before using Elastic MapReduce, we need to set up an Amazon Web Services account and register it with the necessary services.

Creating an AWS account

Amazon has integrated their general accounts with AWS, meaning that if you already have an account for any of the Amazon retail websites, this is the only account you will need to use AWS services.

Note that AWS services have a cost; you will need an active credit card associated with the account to which charges can be made.

If you require a new Amazon account, go to http://aws.amazon.com, select create a new AWS account, and follow the prompts. Amazon has added a free tier for some services...

Time for action – WordCount on EMR using the management console


Let's jump straight into an example on EMR using some provided example code. Carry out the following steps:

  1. Browse to http://aws.amazon.com, go to Developers | AWS Management Console, and then click on the Sign in to the AWS Console button. The default view should look like the following screenshot. If it does not, click on Amazon S3 from within the console.

  2. As shown in the preceding screenshot, click on the Create bucket button and enter a name for the new bucket. Bucket names must be globally unique across all AWS users, so do not expect obvious bucket names such as mybucket or s3test to be available.

  3. Click on the Region drop-down menu and select the geographic area nearest to you.

  4. Click on the Elastic MapReduce link and click on the Create a new Job Flow button. You should see a screen like the following screenshot:

  5. You should now see a screen like the preceding screenshot. Select the Run a sample application radio button and...

Comparison of local versus EMR Hadoop


After our first experience of both a local Hadoop cluster and its equivalent in EMR, this is a good point at which to consider the differences between the two approaches.

As may be apparent, the key differences are not really about capability; if all we want is an environment in which to run MapReduce jobs, either approach is completely suitable. Instead, the distinguishing characteristics revolve around a topic we touched on in Chapter 1, What It's All About: whether you prefer a cost model with upfront infrastructure costs and ongoing maintenance effort, or a pay-as-you-go model with a lower maintenance burden along with rapid and conceptually infinite scalability. Other than the cost decisions, there are a few things to keep in mind:

  • EMR supports specific versions of Hadoop and has a policy of upgrading over time. If you have a need for a specific version, in particular if you need the latest and greatest versions immediately after...

Summary


We covered a lot of ground in this chapter with regard to getting a Hadoop cluster up and running and executing MapReduce programs on it.

Specifically, we covered the prerequisites for running Hadoop on local Ubuntu hosts. We also saw how to install and configure a local Hadoop cluster in either standalone or pseudo-distributed mode. Then we looked at how to access the HDFS filesystem and submit MapReduce jobs, before moving on to learn what accounts are needed to access Elastic MapReduce and other AWS services.

We saw how to browse and create S3 buckets and objects using the AWS management console, and also how to create a job flow and use it to execute a MapReduce job on an EMR-hosted Hadoop cluster. We also discussed other ways of accessing AWS services and studied the differences between local and EMR-hosted Hadoop.

Now that we have learned about running Hadoop locally or on EMR, we are ready to start writing our own MapReduce programs, which is the topic of the next chapter...


Key benefits

  • Learn tools and techniques that let you approach big data with relish and not fear
  • Learn how to build a complete infrastructure to handle your needs as your data grows
  • Hands-on examples in each chapter give the big picture while also providing direct experience

Description

Data is arriving faster than you can process it, and the overall volumes keep growing at a rate that keeps you awake at night. Hadoop can help you tame the data beast. Effective use of Hadoop, however, requires a mixture of programming, design, and system administration skills.

"Hadoop Beginner's Guide" removes the mystery from Hadoop, presenting Hadoop and related technologies with a focus on building working systems and getting the job done, using cloud services to do so when it makes sense. From basic concepts and initial setup through developing applications and keeping the system running as the data grows, the book gives the understanding needed to effectively use Hadoop to solve real-world problems.

Starting with the basics of installing and configuring Hadoop, the book explains how to develop applications, maintain the system, and use additional products to integrate with other systems. While covering different ways to develop applications that run on Hadoop, the book also introduces tools such as Hive, Sqoop, and Flume that show how Hadoop can be integrated with relational databases and log collection. In addition to examples on Hadoop clusters on Ubuntu, uses of cloud services such as Amazon EC2 and Elastic MapReduce are covered.

Who is this book for?

This book assumes no existing experience with Hadoop or cloud services. It assumes you have familiarity with a programming language such as Java or Ruby but gives you the needed background on the other topics.

What you will learn

  • The trends that led to Hadoop and cloud services, giving the background to know when to use the technology
  • Best practices for setup and configuration of Hadoop clusters, tailoring the system to the problem at hand
  • Developing applications to run on Hadoop with examples in Java and Ruby
  • How Amazon Web Services can be used to deliver a hosted Hadoop solution and how this differs from directly-managed environments
  • Integration with relational databases, using Hive for SQL queries and Sqoop for data transfer
  • How Flume can collect data from multiple sources and deliver it to Hadoop for processing
  • What other projects and tools make up the broader Hadoop ecosystem and where to go next

Product Details

Publication date: Feb 22, 2013
Length: 398 pages
Edition: 1st
Language: English
ISBN-13: 9781849517317
Vendor: Apache




Table of Contents

11 Chapters

  1. What It's All About
  2. Getting Hadoop Up and Running
  3. Understanding MapReduce
  4. Developing MapReduce Programs
  5. Advanced MapReduce Techniques
  6. When Things Break
  7. Keeping Things Running
  8. A Relational View on Data with Hive
  9. Working with Relational Databases
  10. Data Collection with Flume
  11. Where to Go Next

Customer reviews

Rating distribution: 3.7 out of 5 (13 ratings)

  • 5 star: 15.4%
  • 4 star: 46.2%
  • 3 star: 30.8%
  • 2 star: 7.7%
  • 1 star: 0%

MarcTC, Feb 05, 2014 (5/5 stars)
I am very interested on Big Data and have read a lot of books about the subject. It was time for me to get hands. So I was looking for a beginner book for Hadoop. I already had the knowledge on Big Data as a subject and I also knew the type of problems we are solving. So I looked for a beginner book for Hadoop as I didn't have any prior experience with the platform. I am using Mac OSX machine. I know my way around linux/Unix shell.I started working through the book. The first chapter is a great short and sweet introduction to Hadoop, MapReduce and HDFS. Its not too lengthy. Just the right length for a person like me who gets bored pretty quickly with a lot of information. From Chapter 2, the real hands on work started. Although the instructions are mainly for Ubuntu linux, it worked like a charm in Mac. I didnt have any issues that other reviewers mentioned. when you run the first example, i couldn't use the exact command given in the book to run the first example of Pi calculation simply because i was using the latest version of Hadoop so the Jar name was different. Thats not something that the author can fix. once you give the correct jar name, it worked like a charm. Im now moving alone to the other chapters very quickly. Its a really good book if you just wanna familiarize yourself with the product, which is exactly the book is about. its well written. easy to follow. Author doesnt try to bore you to death.I recommend reading the below book if you first wanna understand what the Big Data is all about about and see what type of issues its trying solvehttp://www.amazon.com/gp/product/B009N08NKW/ref=cm_cr_ryp_prd_ttl_sol_39
Amazon Verified review

Cheng, Sep 24, 2014 (5/5 stars)
A very good book for the beginner
Amazon Verified review

Amazon Customer, Nov 08, 2014 (4/5 stars)
I like this book, it is very clear and has good examples to try out. I don't give it 5 stars because many of the examples have mistakes, many of them very obvious.
Amazon Verified review

Varun Muriyanat, Sep 29, 2013 (4/5 stars)
Very good for beginners. Gives both the big picture, as well as the start up implementation for newbies. Execellent choice.
Amazon Verified review

Alexander Tarnowski, Jun 03, 2013 (4/5 stars)
I read this book "out of context", meaning that I didn't have an interesting problem solvable by MapReduce at hand and a dire need to learn Hadoop at the time of reading. Instead, I took time to read this book with the purpose of determining whether it's a good beginner book or not. All in all, I'd say that it is. The author really succeeds in creating a context for Hadoop and its ecosystem.From the second chapter and onwards, Hadoop is gradually introduced using very detailed instructions. The general format for doing this is by listing every single command the user needs to type and its output, so the book is full of terminal session listings. All such listings are followed by sections called "What just happened?" that explain in detail the purpose of the commands and their output. This is actually quite helpful for readers who understand what's happening from just looking at the session listing; such readers can safely skip these sections.The above approach should enable any reader, regardless of level of experience, to follow along and do the exercises or labs, which is a good thing for a beginner book. I have a remark about this though: the session dumps could have been proofread better! I can't say that I read them through a magnifying glass, but still I found quite a few errors.As for the contents, the book never shows the monster! In my opinion, the introductory chapter fails to actually establish a case for Hadoop and MapReduce. Yes, it's about big data, scaling and problems and so on, but I couldn't find a logical transition to Hadoop as a solution to these problems. Instead, chapter two illustrates the framework with a distributed calculation of pi and the word counting program (Hadoop's version of the "Hello world" program).In a later chapter, Hadoop is used to process a dataset with UFO sightings, and then, in a chapter on advanced techniques a graph problem is solved. Not until that chapter did I start getting a feeling for what kind of problems Hadoop and MapReduce should be used for. This is what I mean by "never showing the monster". Being an introductory text, I'd prefer the first or second chapter to describe some problems that are good candidates for the MapReduce paradigm, illustrate one of them, and then show how a distributed computation would help.That said, I may be off track here. This is a book on Hadoop, and not MapReduce in general, and I did say that I read it without having intricate MapReduce problems at hand. This is pretty much my only criticism. If a reader doesn't perceive this as a problem, then there's nothing to complain about. After reading the book, I feel that I have a very good feeling for what Hadoop does and what building blocks in its ecosystem to use.One more thing... Here and there the book contains examples of how to use Amazon's EMR. This didn't feel awfully important to me, but it provides an even more solid explanation of how to apply the framework to bigger problems and how it can be used in a cloud environment.To sum up: a good and comprehensive book on Hadoop that covers the framework and its ecosystem, verbose and easy to follow examples, and a structure that leaves the reader with a sense of getting the big picture.Minuses: could devote some more pages to the MapReduce paradigm and have its sample listings proofread better.
Amazon Verified review

FAQs

How do I buy and download an eBook?

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing

When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the eBook to be usable for you, the reader, with our need to protect the rights of us as publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website?

If you want to purchase a video course, eBook, or Bundle (Print+eBook), please follow the steps below:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment can be made using Credit Card, Debit Card, or PayPal).
Where can I access support around an eBook?
  • If you experience a problem with using or installing Adobe Reader, then contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats does Packt support?

Our eBooks are currently available in a variety of formats, such as PDF and ePub. In the future this may well change with trends and developments in technology, but please note that our PDFs are not in the Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks?
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower in price than print
  • They save resources and space
What is an eBook?

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply log in to your account and click on the link in Your Download Area. We recommend saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.