Best Practices for Deploying Hadoop Server on CentOS/RHEL 7 – Part 1

In this series of articles, we are going to cover the entire Cloudera Hadoop Cluster Building building with Vendor and Industrial recommended best practices.

Part 4: How to Install CDH and Configure Service Placements on CentOS/RHEL 7
Part 5: How to Set Up High Availability for Namenode
Part 6: How to Set Up High Availability for Resource Manager
Part 7: How to Install and Configure Hive with High Availability
Part 8: How to Install and Configure Sentry (Authorization Tool)
Part 9: How to Install Kerberos (Kerberising the Cluster) for Hadoop Authentication
Part 10: How to Tune Cluster (Yarn Tuning) on CentOS/RHEL 7

OS installation and doing OS level Pre-requisites are the first steps to build a Hadoop Cluster. Hadoop can run on the various flavor of Linux platform: CentOS, RedHat, Ubuntu, Debian, SUSE etc., In real-time production, most of the Hadoop Clusters are built on top of RHEL/CentOS, we will use CentOS 7 for demonstration in this series of tutorials.

In an Organization, OS installation can be done using kickstart. If it is a 3 to 4 node cluster, manual installation is possible but if we build a big cluster with more than 10 nodes, it’s tedious to install OS one by one. In this scenario, the Kickstart method comes into the picture, we can proceed with the mass installation using kickstart.

Achieving good performance from a Hadoop Environment is depends on provisioning the correct Hardware & Software. So, building a production Hadoop cluster involves a lot of consideration regarding Hardware and Software.

In this article, we will go through various Benchmarks about OS installation and some best practices for deploying Cloudera Hadoop Cluster Server on CentOS/RHEL 7.

Important Consideration and Best Practices for Deploying Hadoop Server

The following are the best practices for setting up deploying Cloudera Hadoop Cluster Server on CentOS/RHEL 7.

  • Hadoop servers do not require enterprise standard servers to build a cluster, it requires commodity hardware.
  • In the production cluster, having 8 to 12 data disks are recommended. According to the nature of the workload, we need to decide on this. If the cluster is for compute-intensive applications, having 4 to 6 drives is best practice to avoid I/O issues.
  • Data drives should be partitioned individually, for example – starting from /data01 to /data10.
  • RAID configuration is not recommended for worker nodes, because Hadoop itself providing fault-tolerance on data by replicating the blocks into 3 by default. So JBOD is best for worker nodes.
  • For Master Servers, RAID 1 is the best practice.
  • The default filesystem on CentOS/RHEL 7.x is XFS. Hadoop supports XFS, ext3, and ext4. The recommended file-system is ext3 as it is tested for good performance.
  • All the servers should be having the same OS version, at-least same minor release.
  • It is best practice to have homogeneous hardware (all worker nodes should have the same hardware characteristics (RAM, disk space & Core etc).
  • According to the cluster workload (Balanced Workload, Compute Intensive, I/O Intensive) and size, resource (RAM, CPU) planning per server will get differ.

Find the below Example for Disk Partitioning of the servers of 24TB storage.

Disk Partitioning
Disk Partitioning

Installing CentOS 7 for Hadoop Server Deployment

Things you need to know before installing CentOS 7 server for Hadoop Server.

  • Minimal installation is enough for Hadoop Servers (worker nodes), in some cases, GUI can be installed only for Master servers or Management servers where we can use browsers for Web UIs of Management tools.
  • Configuring networks, hostname, and other OS-related settings can be done after OS installation.
  • In real-time, server vendors will be having their own console to interact and manage the servers, for example – Dell servers are having iDRAC which is a device, embedded with servers. Using that iDRAC interface we can install OS with having an OS image in our local system.

In this article, we have installed OS (CentOS 7) in VMware virtual machine. Here, we will not be having multiple disks to perform partitions. CentOS is similar to RHEL (same functionality), so we will see the steps to install CentOS.

1. Begin by downloading the CentOS 7.x ISO image in your local windows system and select it while booting the virtual machine. Select ‘Install CentOS 7‘ as shown.

Install CentOS 7 Boot Menu
Install CentOS 7 Boot Menu

2. Select the Language, default will be English, and click continue.

Select CentOS 7 Language
Select CentOS 7 Language

3. Software Selection – Select the ‘Minimal Installation‘ and click ‘Done‘.

CentOS Software Selection
CentOS Software Selection
CentOS 7 Minimal Installation
CentOS 7 Minimal Installation

4. Set the root password as it will prompt us to set.

Set Root Password
Set Root Password

5. Installation Destination – This is the important step to be cautious. We need to select the disk where the OS has to be installed, dedicated disk should be selected for OS. Click the ‘Installation Destination‘ and select the Disk, in real-time multiple disks will be there, we need to select, preferable ‘sda‘.

Select Installation Destination
Select Installation Destination
Select Disk for CentOS Installation
Select Disk for CentOS Installation

6. Other Storage Options – Choose the second option (I will configure partitioning) to configure OS related partitioning like /var, /var/log, /home, /tmp, /opt, /swap.

Manual CentOS Partitioning
Manual CentOS Partitioning

7. Once done, begin the installation.

Begin CentOS Installation
Begin CentOS Installation
CentOS 7 Installation
CentOS 7 Installation

8. Once the Installation completed, reboot the server.

CentOS 7 Installation Complete
CentOS 7 Installation Complete

9. Login into the server and set the hostname.

# hostnamectl status
# hostnamectl set-hostname tecmint
# hostnamectl status
Set Hostname on CentOS
Set Hostname on CentOS
Summary

In this article, we have gone through OS installation steps and best practices for filesystem partitioning. These are all general guideline, according to the nature of the workload, we may need to concentrate on more nuances to achieve the best performance of the cluster. Cluster planning is art for the Hadoop administrator. We will have deep dive into OS level pre-requisites and security Hardening in the next article.

If you liked this article, then do subscribe to email alerts for Linux tutorials. If you have any questions or doubts? do ask for help in the comments section.

If You Appreciate What We Do Here On TecMint, You Should Consider:

TecMint is the fastest growing and most trusted community site for any kind of Linux Articles, Guides and Books on the web. Millions of people visit TecMint! to search or browse the thousands of published articles available FREELY to all.

If you like what you are reading, please consider buying us a coffee ( or 2 ) as a token of appreciation.

Support Us

We are thankful for your never ending support.

Got something to say? Join the discussion.

Have a question or suggestion? Please leave a comment to start the discussion. Please keep in mind that all comments are moderated and your email address will NOT be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.