Install Hadoop Multinode Cluster using CDH4 in RHEL/CentOS 6.5

Hadoop is an open-source programming framework developed by Apache to process big data. It uses HDFS (Hadoop Distributed File System) to store data across all the DataNodes in the cluster in a distributed manner, and the MapReduce model to process that data.

The NameNode (NN) is the master daemon that controls HDFS, and the JobTracker (JT) is the master daemon for the MapReduce engine.

Requirements

In this tutorial I’m using two CentOS 6.3 VMs with the hostnames ‘master’ and ‘node’. The IP address of ‘master’ is 172.21.17.175 and the IP address of ‘node’ is 172.21.17.188. The following instructions also work on other RHEL/CentOS 6.x versions.

On Master
[root@master ~]# hostname
master
[root@master ~]# ifconfig|grep 'inet addr'|head -1
inet addr:172.21.17.175  Bcast:172.21.19.255  Mask:255.255.252.0

On Node
[root@node ~]# hostname
node
[root@node ~]# ifconfig|grep 'inet addr'|head -1
inet addr:172.21.17.188  Bcast:172.21.19.255  Mask:255.255.252.0

First, if you do not have DNS set up, make sure that all the cluster hosts are listed in the ‘/etc/hosts’ file on each node.

On Master
[root@master ~]# cat /etc/hosts
172.21.17.175 master
172.21.17.188 node
On Node
[root@node ~]# cat /etc/hosts

172.21.17.175 master
172.21.17.188 node
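
If a cluster hostname is missing from the hosts file, later steps such as scp to ‘node’ will fail to resolve it. The following is a minimal sketch of a check you could run on each machine. The function name missing_hosts is my own (it is not part of CDH), and it assumes the usual hosts-file layout of an address followed by one or more names.

```shell
# Hypothetical helper (not part of CDH): print each hostname that has
# no entry in the given hosts file.
# Usage: missing_hosts /etc/hosts master node
missing_hosts() {
    hosts_file="$1"
    shift
    for h in "$@"; do
        # Skip comment lines; every field after the address is a name or alias.
        if ! awk -v h="$h" '
            /^[[:space:]]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
            END { exit !found }
        ' "$hosts_file"; then
            echo "$h"
        fi
    done
}
```

An empty result means every name you passed has an entry in the file.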

Installing Hadoop Multinode Cluster in CentOS

We use the official CDH repository to install CDH4 on all the hosts (Master and Node) in the cluster.

Step 1: Download and Install the CDH Repository

Go to the official CDH download page and grab the CDH4 (i.e. 4.6) version, or use the following wget command to download the repository package and then install it with yum.

On RHEL/CentOS 32-bit
# wget http://archive.cloudera.com/cdh4/one-click-install/redhat/6/i386/cloudera-cdh-4-0.i386.rpm
# yum --nogpgcheck localinstall cloudera-cdh-4-0.i386.rpm
On RHEL/CentOS 64-bit
# wget http://archive.cloudera.com/cdh4/one-click-install/redhat/6/x86_64/cloudera-cdh-4-0.x86_64.rpm
# yum --nogpgcheck localinstall cloudera-cdh-4-0.x86_64.rpm

Before installing the Hadoop multinode cluster, add the Cloudera public GPG key to your system by running one of the following commands, according to your system architecture.

## on 32-bit System ##

# rpm --import http://archive.cloudera.com/cdh4/redhat/6/i386/cdh/RPM-GPG-KEY-cloudera
## on 64-bit System ##

# rpm --import http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
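
If you script this step, you can pick the repository package from the machine architecture instead of choosing by hand. This is only a sketch: the function name cdh4_repo_rpm_url is hypothetical, the URLs are the two given above, and the mapping of i486/i586/i686 to the i386 package is my assumption.

```shell
# Hypothetical helper: map an architecture string (as printed by `uname -m`)
# to the CDH4 one-click-install RPM URL used in this tutorial.
cdh4_repo_rpm_url() {
    case "$1" in
        x86_64) arch=x86_64 ;;
        i386|i486|i586|i686) arch=i386 ;;   # assumption: all 32-bit x86 use the i386 package
        *) echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
    echo "http://archive.cloudera.com/cdh4/one-click-install/redhat/6/${arch}/cloudera-cdh-4-0.${arch}.rpm"
}

# Example (not executed here):
#   wget "$(cdh4_repo_rpm_url "$(uname -m)")"
```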

Step 2: Setup JobTracker & NameNode

Next, run the following commands to install and set up the JobTracker and NameNode on the Master server.

[root@master ~]# yum clean all
[root@master ~]# yum install hadoop-0.20-mapreduce-jobtracker
[root@master ~]# yum clean all
[root@master ~]# yum install hadoop-hdfs-namenode

Step 3: Setup Secondary Name Node

Again, run the following commands on the Master server to set up the Secondary NameNode.

[root@master ~]# yum clean all
[root@master ~]# yum install hadoop-hdfs-secondarynamenode

Step 4: Setup Tasktracker & Datanode

Next, set up the TaskTracker and DataNode on all cluster hosts except the JobTracker, NameNode, and Secondary (or Standby) NameNode hosts (in this case, on ‘node’).

[root@node ~]# yum clean all
[root@node ~]# yum install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode

Step 5: Setup Hadoop Client

You can install the Hadoop client on a separate machine (in this case I have installed it on the DataNode, but you can install it on any machine).

[root@node ~]# yum install hadoop-client
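
Steps 2 to 5 boil down to a fixed set of packages per machine role. As a rough summary, here is a sketch that maps a role to its packages. The role names (master, worker, client) and the function name packages_for_role are my own labels, and the Secondary NameNode package is written out in full as hadoop-hdfs-secondarynamenode.

```shell
# Hypothetical helper: print the CDH4 MRv1 packages installed in Steps 2-5
# for a given role. Role names are illustrative, not CDH terminology.
packages_for_role() {
    case "$1" in
        master) echo "hadoop-0.20-mapreduce-jobtracker hadoop-hdfs-namenode hadoop-hdfs-secondarynamenode" ;;
        worker) echo "hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode" ;;
        client) echo "hadoop-client" ;;
        *) echo "unknown role: $1" >&2; return 1 ;;
    esac
}

# Example (run as root, not executed here):
#   yum install -y $(packages_for_role worker)
```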

Step 6: Deploy HDFS on Nodes

With the above steps done, let’s move on to deploying HDFS (to be done on all the nodes).

Copy the default configuration to a new directory under /etc/hadoop (on each node in the cluster).

[root@master ~]# cp -r /etc/hadoop/conf.dist /etc/hadoop/conf.my_cluster
[root@node ~]# cp -r /etc/hadoop/conf.dist /etc/hadoop/conf.my_cluster

Use the alternatives command to point Hadoop at your custom configuration directory, as follows (on each node in the cluster).

[root@master ~]# alternatives --verbose --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
reading /var/lib/alternatives/hadoop-conf

[root@master ~]# alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster
[root@node ~]# alternatives --verbose --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
reading /var/lib/alternatives/hadoop-conf

[root@node ~]# alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster

Step 7: Customizing Configuration Files

Now open the ‘core-site.xml’ file and update “fs.defaultFS” on each node in the cluster.

[root@master conf]# cat /etc/hadoop/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
 <name>fs.defaultFS</name>
 <value>hdfs://master/</value>
</property>
</configuration>
[root@node conf]# cat /etc/hadoop/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
 <name>fs.defaultFS</name>
 <value>hdfs://master/</value>
</property>
</configuration>

Next, update “dfs.permissions.superusergroup” in ‘hdfs-site.xml’ on each node in the cluster.

[root@master conf]# cat /etc/hadoop/conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
     <name>dfs.name.dir</name>
     <value>/var/lib/hadoop-hdfs/cache/hdfs/dfs/name</value>
  </property>
  <property>
     <name>dfs.permissions.superusergroup</name>
     <value>hadoop</value>
  </property>
</configuration>
[root@node conf]# cat /etc/hadoop/conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
     <name>dfs.name.dir</name>
     <value>/var/lib/hadoop-hdfs/cache/hdfs/dfs/name</value>
  </property>
  <property>
     <name>dfs.permissions.superusergroup</name>
     <value>hadoop</value>
  </property>
</configuration>

Note: Please make sure that the above configuration is present on all the nodes (make the changes on one node, then use scp to copy the files to the rest of the nodes).
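
To confirm that a property really has the same value on every node, it helps to read it back from the file rather than eyeball the XML. The sketch below is a hypothetical helper, get_property, written with awk. It assumes the simple layout used in this tutorial (each <name> and <value> element on its own line, name before value) and is not a general XML parser.

```shell
# Hypothetical helper: print the value of one property from a Hadoop
# *-site.xml file. Assumes <name> and <value> each sit on a single line,
# with <name> appearing before its <value>.
get_property() {
    awk -v want="$2" '
        /<name>/  { n = $0; sub(/.*<name>/, "", n);  sub(/<\/name>.*/, "", n) }
        /<value>/ { v = $0; sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v)
                    if (n == want) { print v; exit } }
    ' "$1"
}

# Example (not executed here):
#   get_property /etc/hadoop/conf/core-site.xml fs.defaultFS
```

Running the same call on master and on node and comparing the two results is a quick consistency check. The function prints nothing if the property is absent.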

Step 8: Configuring Local Storage Directories

Update “dfs.name.dir” (or “dfs.namenode.name.dir”) in ‘hdfs-site.xml’ on the NameNode (Master), and “dfs.datanode.data.dir” on the DataNode (Node). Change the values to match your own storage directories, as shown.

[root@master conf]# cat /etc/hadoop/conf/hdfs-site.xml
<property>
 <name>dfs.namenode.name.dir</name>
 <value>file:///data/1/dfs/nn,/nfsmount/dfs/nn</value>
</property>
[root@node conf]# cat /etc/hadoop/conf/hdfs-site.xml
<property>
 <name>dfs.datanode.data.dir</name>
 <value>file:///data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn</value>
</property>

Step 9: Create Directories & Manage Permissions

Run the commands below to create the directory structure and manage user permissions on the NameNode (Master) and DataNode (Node) machines.

[root@master]# mkdir -p /data/1/dfs/nn /nfsmount/dfs/nn
[root@master]# chmod 700 /data/1/dfs/nn /nfsmount/dfs/nn
[root@master]# chown -R hdfs:hdfs /data/1/dfs/nn /nfsmount/dfs/nn
[root@node]# mkdir -p /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn
[root@node]# chown -R hdfs:hdfs /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn
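
The directories you create must match the ones listed in ‘hdfs-site.xml’, so it can be safer to derive them from the configured value than to type them twice. This sketch uses a hypothetical helper, dirs_from_value, that splits a comma-separated value and strips a leading file:// prefix. It assumes the paths contain no commas or spaces.

```shell
# Hypothetical helper: turn a comma-separated Hadoop directory list into
# one local path per line, removing any leading "file://" scheme.
dirs_from_value() {
    printf '%s\n' "$1" | tr ',' '\n' | sed 's|^file://||'
}

# Example (run as root on the DataNode, not executed here):
#   dirs_from_value 'file:///data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn' | xargs mkdir -p
```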

Format the NameNode (on Master) by issuing the following command.

[root@master conf]# sudo -u hdfs hdfs namenode -format

Step 10: Configuring the Secondary NameNode

Add the following property to the ‘hdfs-site.xml’ file on Master, replacing the value as shown.

<property>
  <name>dfs.namenode.http-address</name>
  <value>172.21.17.175:50070</value>
  <description>
    The address and port on which the NameNode UI will listen.
  </description>
</property>

Note: In our case the value should be the IP address of the master VM.

Now let’s deploy MRv1 (MapReduce version 1). Open the ‘mapred-site.xml’ file and set the following values as shown.

[root@master conf]# cp hdfs-site.xml mapred-site.xml
[root@master conf]# vi mapred-site.xml
[root@master conf]# cat mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
 <name>mapred.job.tracker</name>
 <value>master:8021</value>
</property>
</configuration>

Next, copy the ‘mapred-site.xml’ file to the node machine using the following scp command.

[root@master conf]# scp /etc/hadoop/conf/mapred-site.xml node:/etc/hadoop/conf/
mapred-site.xml                                                                      100%  200     0.2KB/s   00:00

Now configure the local storage directories to be used by the MRv1 daemons. Open the ‘mapred-site.xml’ file again and make the changes shown below on each TaskTracker.

<property>
 <name>mapred.local.dir</name>
 <value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local</value>
</property>

After specifying these directories in the ‘mapred-site.xml‘ file, you must create the directories and assign the correct file permissions to them on each node in your cluster.

mkdir -p /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local
chown -R mapred:hadoop /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local

Step 10 : Start HDFS

Now run the following command to start HDFS on every node in the cluster.

[root@master conf]# for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
[root@node conf]# for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
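
The loop above starts every init script whose name begins with hadoop-hdfs-. If you want to see which services it would touch first, the following sketch lists them using a shell glob instead of parsing ls output. The function name hdfs_services and its optional directory argument are my own additions for illustration.

```shell
# Hypothetical helper: list the HDFS init scripts found in a directory
# (default /etc/init.d), one service name per line.
hdfs_services() {
    dir="${1:-/etc/init.d}"
    for f in "$dir"/hadoop-hdfs-*; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$f" ] || continue
        basename "$f"
    done
}

# Example (not executed here):
#   for svc in $(hdfs_services); do sudo service "$svc" start; done
```

On master this should list the namenode and secondarynamenode services, and on node the datanode service, assuming the packages from the earlier steps are installed.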

Step 11 : Create HDFS /tmp and MapReduce /var Directories

The /tmp directory must be created with the proper permissions, exactly as shown below.

[root@master conf]# sudo -u hdfs hadoop fs -mkdir /tmp
[root@master conf]# sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
[root@master conf]# sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
[root@master conf]# sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
[root@master conf]# sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred

Now verify the HDFS file structure.

[root@master conf]# sudo -u hdfs hadoop fs -ls -R /

drwxrwxrwt   - hdfs   hadoop          0 2014-05-29 09:58 /tmp
drwxr-xr-x   - hdfs   hadoop          0 2014-05-29 09:59 /var
drwxr-xr-x   - hdfs   hadoop          0 2014-05-29 09:59 /var/lib
drwxr-xr-x   - hdfs   hadoop          0 2014-05-29 09:59 /var/lib/hadoop-hdfs
drwxr-xr-x   - hdfs   hadoop          0 2014-05-29 09:59 /var/lib/hadoop-hdfs/cache
drwxr-xr-x   - mapred hadoop          0 2014-05-29 09:59 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x   - mapred hadoop          0 2014-05-29 09:59 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt   - mapred hadoop          0 2014-05-29 09:59 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
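
The two entries that matter most in this listing are /tmp and the staging directory, which must both show drwxrwxrwt. Rather than checking by eye, you could pipe the listing through a small filter. The sketch below defines a hypothetical helper, has_mode, that reads the listing on standard input. It assumes the mode is the first field and the path is the last field of each line, as in the output above.

```shell
# Hypothetical helper: succeed if the listing on stdin contains the given
# path with exactly the given mode string.
# Usage: sudo -u hdfs hadoop fs -ls -R / | has_mode /tmp drwxrwxrwt
has_mode() {
    awk -v p="$1" -v m="$2" '
        $NF == p && $1 == m { ok = 1 }
        END { exit !ok }
    '
}
```

The function prints nothing and reports the result through its exit status, so it can be used directly in an if statement.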

After you start HDFS and create ‘/tmp’, but before you start the JobTracker, create the HDFS directory specified by the ‘mapred.system.dir’ parameter (by default ${hadoop.tmp.dir}/mapred/system) and change its owner to mapred.

[root@master conf]# sudo -u hdfs hadoop fs -mkdir /tmp/mapred/system
[root@master conf]# sudo -u hdfs hadoop fs -chown mapred:hadoop /tmp/mapred/system

Step 12: Start MapReduce

To start MapReduce, start the TaskTracker (TT) and JobTracker (JT) services.

On each TaskTracker system
[root@node conf]# service hadoop-0.20-mapreduce-tasktracker start
Starting Tasktracker:                               [  OK  ]
starting tasktracker, logging to /var/log/hadoop-0.20-mapreduce/hadoop-hadoop-tasktracker-node.out

On the JobTracker system
[root@master conf]# service hadoop-0.20-mapreduce-jobtracker start
Starting Jobtracker:                                [  OK  ]
starting jobtracker, logging to /var/log/hadoop-0.20-mapreduce/hadoop-hadoop-jobtracker-master.out

Next, create a home directory for each Hadoop user. It is recommended that you do this on the NameNode, for example:

[root@master conf]# sudo -u hdfs hadoop fs -mkdir  /user/<user>
[root@master conf]# sudo -u hdfs hadoop fs -chown <user> /user/<user>

Note: where <user> is the Linux username of each user.

Alternatively, you can create the home directory as follows.

[root@master conf]# sudo -u hdfs hadoop fs -mkdir /user/$USER
[root@master conf]# sudo -u hdfs hadoop fs -chown $USER /user/$USER
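
When there are several users, repeating the two commands by hand gets tedious. As a sketch, the hypothetical helper below only prints the commands for each username so you can review them before running anything. The usernames alice and bob in the example are placeholders, not accounts from this tutorial.

```shell
# Hypothetical helper: print (do not run) the HDFS home-directory commands
# for each Linux username given as an argument.
home_dir_commands() {
    for u in "$@"; do
        echo "sudo -u hdfs hadoop fs -mkdir /user/$u"
        echo "sudo -u hdfs hadoop fs -chown $u /user/$u"
    done
}

# Example (review the output first, then pipe it to sh on the NameNode):
#   home_dir_commands alice bob
```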

Step 13: Open JT, NN UI from Browser

Open your browser and enter the URL http://ip_address_of_namenode:50070 to access the NameNode.

Open another tab in your browser and enter the URL http://ip_address_of_jobtracker:50030 to access the JobTracker.
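
If you are working over SSH without a browser, you can still check that both web interfaces answer. The sketch below only builds the URLs from the ports mentioned above. The function name ui_url is hypothetical, and the curl call in the comment is one possible way to probe the address.

```shell
# Hypothetical helper: build the web UI address for a daemon, using the
# ports from this tutorial (NameNode 50070, JobTracker 50030).
ui_url() {
    case "$1" in
        namenode)   port=50070 ;;
        jobtracker) port=50030 ;;
        *) echo "unknown daemon: $1" >&2; return 1 ;;
    esac
    echo "http://$2:$port"
}

# Example (prints the HTTP status code; not executed here):
#   curl -s -o /dev/null -w '%{http_code}\n' "$(ui_url namenode 172.21.17.175)"
```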

This procedure has been successfully tested on RHEL/CentOS 5.X/6.X. Please comment below if you face any issues with the installation, and I will help you out with solutions.

Kuldeep Kulkarni

I'm Kuldeep Kulkarni - Crazy about Linux, Hadoop etc open-source technologies!! By profession I'm Senior system engineer and hadoop administrator in well known IT industry since 2011. Always enthusiastic about sharing my knowledge via blogs :)

46 Responses

  1. suresh kumar says:

    All the daemons started properly but slaves are not communicating with master one throwing an error port no 22 connection refused

    how can resolve this problem

  2. Rohit says:

    How about writing a blog on how to install pig using hadoop single node cluster

  3. Adiputra says:

    need tutorial using nosql and hadoop please

  4. Nitish says:

    Getting Error in Step 11 : Create HDFS /tmp and MapReduce /var Directories

    sudo -u hdfs hadoop fs -mkdir /tmp
    Error is:-
    mkdir: Call From master/192.168.56.101 to master:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

    The hdfs-site.xml is configured as:-

    dfs.namenode.name.dir
    file:///data/1/dfs/nn,/nfsmount/dfs/nn

    dfs.permissions.superusergroup
    hadoop

    dfs.namenode.http-address
    192.168.56.101:50070

    The address and port on which the NameNode UI will listen.

    Can you please assist me in understanding what I missed.

    Telnet isn’t working as no process is listening on 8020 port.

    Regards
    Nitish

  5. harini says:

    @Lyle Gilbert I have same issue as u had.Can u please help me fix it?

  6. Govind says:

    HI sir, can we also have this setup using Ansible.? please.

  7. Lyle Gilbert says:

    I apologize for my ignorance in advance, but when i follow your instructions i can get to
    sudo -u hdfs hadoop fs -mkdir /tmp
    and all i get is the following error
    mkdir: Call From master/172.21.181.108 to master:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    I do a netstat -tulpn and do not see anything listening on that port. what config should i modify?

    • Lyle Gilbert says:

      I have found the issue, it was an error in my hdfs-site.xml
      again thank you for this wonderful tutorial

      • HAIFA BEN AOUICHA says:

        Hello Lyle,

        May you please give me further details about the modifications you brought to the HDFS-site.xml.

        I am facing the same issue as you : connection refused in port 8020 and telnet / netstat commands don’t fetch any process listening in port 8020.

        Thanks a lot in advance for your feedback, best regards

  8. Ravi says:

    Helle Kuldeep,
    Thanks for this article. I’m trying to install on Centos7, few of the commands are outdated :(
    Do you have an video of making the cluster ?

    Thanks
    Ravi

  9. manish says:

    hello Kuldeep.
    Thanks for the instructions above. I am new in hadoop + linux world. It worked very well for me untill this step.

    Next, copy ‘mapred-site.xml‘ file to node machine using the following scp command.
    [root@master conf]# scp /etc/hadoop/conf/mapred-site.xml node:/etc/hadoop/conf/
    mapred-site.xml 100% 200 0.2KB/s 00:00

    I didnt have mapred-site.xml file and thus i created it using vi . I copied the xml you provided.
    However when i am trying to SCP this file to the node: i get an error :
    ssh: Could not resolve hostname node: name of service not known. Lost connection.

    What should i do in order to fix this?

    Thanks
    Manish
