
Friday, October 19, 2012

Hadoop Installation for Beginners

Well folks,
Here I will give you a step-by-step procedure to install and configure Hadoop (version 1.1.0) on Linux (a Debian-based distro) as a single-node cluster. This guide is for beginners, and you need to log into your Linux machine as the root user.

Step 1: First you need to download the Hadoop 1.1.0 release tarball (hadoop-1.1.0.tar.gz) from the Apache Hadoop download page

Open a terminal

# cd <to directory where you downloaded hadoop>
# mv hadoop-1.1.0.tar.gz /usr/local/
# cd /usr/local/
# tar zxvf hadoop-1.1.0.tar.gz

With the above commands, you have moved the Hadoop tarball to /usr/local and uncompressed it there.
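Before extracting, it is worth making sure the download is not corrupt. A small sketch (assuming the tarball path used above):

```shell
# Listing an archive end to end with `tar tzf` fails on a truncated
# or corrupt download, without extracting anything.
archive_ok() {
    tar tzf "$1" >/dev/null 2>&1
}

if archive_ok /usr/local/hadoop-1.1.0.tar.gz; then
    echo "archive looks intact - safe to extract"
else
    echo "archive is missing or corrupt - re-download it" >&2
fi
```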

Step 2: Hadoop is a Java-based application, so it requires Java 1.6 as a dependency, which you must install yourself (if it is not already installed).
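To confirm the Java requirement is met, a quick sketch that parses the version string (`java -version` writes to stderr, hence the redirect; the helper name is my own):

```shell
# Check that the installed Java is 1.6 or newer before installing Hadoop.
java16_minor() {
    # Pull the x out of a version string like: java version "1.6.0_45"
    echo "$1" | sed -n 's/.*"1\.\([0-9][0-9]*\)\..*/\1/p'
}

version_line=$(java -version 2>&1 | head -n 1)
minor=$(java16_minor "$version_line")
if [ -n "$minor" ] && [ "$minor" -ge 6 ]; then
    echo "Java 1.$minor found - OK for Hadoop 1.1.0"
else
    echo "Java 1.6 or newer not found - install it first" >&2
fi
```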

Step 3: Next you need to add a dedicated user for Hadoop

# adduser hadoop

It prompts you to enter a password and a few other details:

             Adding user `hadoop' ...
             Adding new group `hadoop' (1001) ...
             Adding new user `hadoop' (1001) with group `hadoop' ...
             Creating home directory `/home/hadoop' ...
             Copying files from `/etc/skel' ...
             Enter new UNIX password:
             Retype new UNIX password:
             passwd: password updated successfully
             Changing the user information for hadoop
             Enter the new value, or press ENTER for the default
                   Full Name []:
                   Room Number []:
                   Work Phone []:
                   Home Phone []:
                   Other []:
             Is the information correct? [Y/n] Y

Step 4: Change the configuration files

Before we configure anything, type the following to identify your Java home

# which java

if, for example, the output is

/usr/bin/java

your JAVA_HOME is /usr


# cd /usr/local/hadoop-1.1.0/
# cd conf/
# vi hadoop-env.sh

Find the following line

                  # export JAVA_HOME=/usr/lib/j2sdk1.5-sun

and replace it as

                  export JAVA_HOME=/usr/
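If you are unsure how /usr was derived from the `which java` output, it can be done mechanically: the java binary lives at $JAVA_HOME/bin/java, so strip two path components (the helper name below is my own; `readlink -f` is used because on many distros /usr/bin/java is a symlink, and the resolved JDK directory also works as JAVA_HOME):

```shell
# JAVA_HOME is the directory two levels above the java binary
# ($JAVA_HOME/bin/java), so strip two path components.
java_home_from_binary() {
    dirname "$(dirname "$1")"
}

# Resolve symlinks first so the real JDK directory is reported,
# then print the suggested JAVA_HOME.
if command -v java >/dev/null 2>&1; then
    java_bin=$(readlink -f "$(command -v java)")
    echo "JAVA_HOME should be: $(java_home_from_binary "$java_bin")"
fi
```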

Next paste the following content into the file core-site.xml (the property below is the standard single-node value; port 9000 is the conventional choice and can be changed if it clashes with something else on your machine)

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://localhost:9000</value>
     </property>
</configuration>


Next paste the following content into the file hdfs-site.xml (a replication factor of 1 is the usual choice for a single-node cluster, since there is only one DataNode to hold each block)

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
</configuration>




Next paste the following content into the file mapred-site.xml (localhost:9001 is the standard single-node JobTracker address)

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
     <property>
         <name>mapred.job.tracker</name>
         <value>localhost:9001</value>
     </property>
</configuration>


Next check whether the file /etc/hosts contains the following as its first line, and add it if it does not:

          127.0.0.1       localhost <your host name>

          <your host name> is the hostname of your machine.

You can find the hostname with

# hostname
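The check above can be scripted. A hedged sketch (the helper name is my own) that looks for the hostname on a 127.0.0.1 line and suggests the entry to add when it is missing:

```shell
# Check whether a hosts file maps the given name on a 127.0.0.1 line.
has_host_entry() {
    # $1 = hosts file, $2 = hostname to look for
    grep "^127\.0\.0\.1[[:space:]]" "$1" 2>/dev/null | grep -w "$2" >/dev/null
}

name=$(hostname)
if has_host_entry /etc/hosts "$name"; then
    echo "$name already maps to 127.0.0.1"
else
    echo "add '127.0.0.1       localhost $name' to /etc/hosts"
fi
```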

Step 5: Give the hadoop user ownership of the Hadoop folder

# cd /usr/local/
# chown -R hadoop hadoop-1.1.0

Step 6: Format the HDFS file system (NameNode)

# cd /usr/local/hadoop-1.1.0/bin
# su hadoop
# ./hadoop namenode -format

It provides information like

12/10/19 12:00:20 INFO namenode.NameNode: STARTUP_MSG: 
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = vignesh: vignesh
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.1.0
STARTUP_MSG:   build = -r 1394289; compiled by 'hortonfo' on Thu Oct  4 22:06:49 UTC 2012
12/10/19 12:00:20 INFO util.GSet: VM type       = 64-bit
12/10/19 12:00:20 INFO util.GSet: 2% max memory = 17.77875 MB
12/10/19 12:00:20 INFO util.GSet: capacity      = 2^21 = 2097152 entries
12/10/19 12:00:20 INFO util.GSet: recommended=2097152, actual=2097152
12/10/19 12:00:21 INFO namenode.FSNamesystem: fsOwner=hadoop
12/10/19 12:00:21 INFO namenode.FSNamesystem: supergroup=supergroup
12/10/19 12:00:21 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/10/19 12:00:21 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/10/19 12:00:21 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/10/19 12:00:21 INFO namenode.NameNode: Caching file names occuring more than 10 times 
12/10/19 12:00:21 INFO common.Storage: Image file of size 112 saved in 0 seconds.
12/10/19 12:00:21 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/tmp/hadoop-hadoop/dfs/name/current/edits
12/10/19 12:00:21 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/tmp/hadoop-hadoop/dfs/name/current/edits
12/10/19 12:00:21 INFO common.Storage: Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted.
12/10/19 12:00:21 INFO namenode.NameNode: SHUTDOWN_MSG: 
SHUTDOWN_MSG: Shutting down NameNode at vignesh: vignesh

Note that only the NameNode needs formatting. There is no separate format step for the DataNode; its storage directory is initialized automatically the first time the daemon starts.

Step 7: Set up passwordless SSH for the hadoop user

# ssh-keygen -t rsa -P ""

Press Enter when it prompts

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): 

and it generates the key as

Created directory '/home/hadoop/.ssh'.
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
f7:e3:1d:e6:2d:7d:23:2f:64:ea:1c:77:99:26:af:e0 hadoop@vignesh
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|                 |
|                 |
|                 |
|        S .      |
|         . . o  o|
|           o*oo* |
|          oo+B*+o|
|          .E..B++|
+-----------------+

# cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
# ssh hadoop@localhost

type "yes" if it prompts as below

The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is 7e:4a:40:b5:57:06:0d:83:34:58:80:80:c3:e7:18:20.
Are you sure you want to continue connecting (yes/no)? 

After this it logs you in as the hadoop user over SSH without asking for a password, which means passwordless SSH is configured successfully.

Now type

# exit

Use exit only once: it closes the nested SSH session, so you are still logged in as the hadoop user.
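One common pitfall worth guarding against: sshd silently ignores authorized_keys when ~/.ssh or the file itself is group- or world-writable, which shows up as a password prompt that refuses to go away. A sketch that tightens the permissions (the helper name is my own):

```shell
# Lock down the .ssh tree so sshd will accept the key.
secure_ssh_dir() {
    # $1 = home directory whose .ssh tree should be tightened
    chmod 700 "$1/.ssh"
    chmod 600 "$1/.ssh/authorized_keys"
}

if [ -f /home/hadoop/.ssh/authorized_keys ]; then
    secure_ssh_dir /home/hadoop
fi
```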

Step 8: Start Hadoop services

# ./start-all.sh

It starts five services: NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker.


You can check if the services are running by

# jps

You should see something like this; if not, one or more services failed to start

26207 TaskTracker
26427 Jps
25847 DataNode
25986 SecondaryNameNode
26089 JobTracker
25738 NameNode
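If you want to automate that check, a small sketch (the function name is my own) that takes jps output and reports which of the five daemons are missing:

```shell
# Given the output of `jps`, print any of the five Hadoop 1.x
# daemons that are NOT running (prints nothing when all are up).
missing_daemons() {
    for d in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        echo "$1" | grep -w "$d" >/dev/null || echo "$d"
    done
}

# Usage on a live machine:
#   missing_daemons "$(jps)"
```

Note the `-w` (whole-word) flag: without it, checking for NameNode would also match the SecondaryNameNode line.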

Log into http://localhost:50030/ (the default JobTracker web port) for Hadoop Map/Reduce administration (optional)

Log into http://localhost:50070/ (the default NameNode web port) for browsing the HDFS file system (optional)

Step 9: Try out the following commands

# ./hadoop dfsadmin -report

This command gives you information about your HDFS system

# ./hadoop fs -mkdir test

This command creates a directory "test" in your hdfs file system

# vi test_input

 In the text editor, type the line

 hi all hello all

 then save and exit the file

# ./hadoop fs -put test_input test/input

This command copies the file (test_input) that we just created into the HDFS file system (inside the test folder)

# ./hadoop fs -ls test

This command lists all files in the folder "test" of the HDFS file system.

# ./hadoop jar ../hadoop-examples-1.1.0.jar wordcount test/input test/output

This command runs a mapreduce program (word count) for your input and generates output in "test/output" of hdfs file system.
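If you want to sanity-check the result before looking at the job output, the same counting can be previewed for our one-line input with plain shell tools (the real part-r-00000 file is tab-separated as word, then count):

```shell
# One word per line, sort so duplicates are adjacent, then count them.
echo "hi all hello all" | tr ' ' '\n' | sort | uniq -c | awk '{print $2, $1}'
# prints:
#   all 2
#   hello 1
#   hi 1
```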

You can check the output at http://localhost:50070/ by following

Browse the filesystem -> user -> hadoop -> test -> output -> part-r-00000

Step 10: To stop hadoop (optional)

# ./stop-all.sh

Here ends our step-by-step guide to working with Hadoop (for beginners).

1 comment:

  1. How do I create a jar file of my own MapReduce program?
    For example, I have a java file named Spatial and I need to create the jar file.

    In your example, hadoop-examples.jar is used for running...
    For my program, how can I create the jar, and how should I execute the program?

    Could you please help me?