
How to Setup Hadoop 2.8.0 (Single Node Cluster) on CentOS

  • By Saravana
  • September 13, 2017

Introduction

Apache Hadoop 2.8.0 is a minor release in the 2.x.y release line, building upon the previous stable release 2.7.3.

According to the Apache release notes, Hadoop 2.8.0 brings the following features and improvements:

  • Common
    • Support async call retry and failover which can be used in async DFS implementation with retry effort.
    • Cross Frame Scripting (XFS) prevention for UIs can be provided through a common servlet filter.
    • S3A improvements: add the ability to plug in any AWSCredentialsProvider, support reading S3A credentials from the Hadoop credential provider API in addition to XML configuration files, and support Amazon STS temporary credentials.
    • WASB improvements: adding append API support
    • Build enhancements: replace dev-support with wrappers to Yetus, provide a docker based solution to setup a build environment, remove CHANGES.txt and rework the change log and release notes.
    • Add posixGroups support for LDAP groups mapping service.
    • Support integration with Azure Data Lake (ADL) as an alternative Hadoop-compatible file system.
  • HDFS
    • WebHDFS enhancements: integrate CSRF prevention filter in WebHDFS, support OAuth2 in WebHDFS, disallow/allow snapshots via WebHDFS
    • Allow long-running Balancer to log in with keytab
    • Add ReverseXML processor which reconstructs an fsimage from an XML file. This will make it easy to create fsimages for testing, and manually edit fsimages when there is corruption
    • Support nested encryption zones
    • DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness. This can prevent the NameNode from incorrectly marking DataNodes as stale or dead in highly overloaded clusters where heartbeat processing is suffering delays.
    • Logging HDFS operation’s caller context into audit logs
    • A new datanode command for evicting writers which is useful when data node decommissioning is blocked by slow writers.
  • YARN
    • NodeManager CPU resource monitoring in Windows.
    • More graceful NM shutdown: the NM unregisters from the RM immediately rather than waiting for the timeout to be marked LOST (if NM work preserving is not enabled).
    • Add the ability to fail a specific AM attempt when that attempt gets stuck.
    • CallerContext support in YARN audit log.
    • ATS versioning support: a new configuration to indicate timeline service version.
  • MAPREDUCE
    • Allow node labels to be specified when submitting MR jobs.
    • Add a new tool to combine aggregated logs into a HAR file.

Reference: hadoop.apache.org

This post walks through installing Hadoop 2.8.0 on the CentOS operating system, including the basic configuration required to start working with Hadoop. The whole process is broken down into simple steps.


Step 1 – Installing Java

Hadoop needs Java to run, so before installing Hadoop, make sure Java is installed on your system:

$ java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

 

If Java is not installed, install OpenJDK 8. The -devel package provides the full JDK, including the jps tool used later in this guide:

$ sudo yum install java-1.8.0-openjdk-devel

 

After installing Java, set the Java environment variables in /etc/profile.d/java.sh:

export JAVA_HOME=/usr/lib/jvm/java-openjdk

export JAVA_PATH=$JAVA_HOME

export PATH=$PATH:$JAVA_HOME/bin
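The right JAVA_HOME value depends on which Java package is installed. As a rough sketch, the path can also be derived from the java binary itself. The function name java_home_from is made up for this example, and it assumes GNU readlink -f, which CentOS ships by default.

```shell
# Hypothetical helper: derive a JAVA_HOME candidate from a java binary path.
# Assumes GNU coreutils (readlink -f).
java_home_from() {
    local java_bin real
    java_bin="$1"
    # Follow the symlink chain (e.g. /etc/alternatives) to the real binary.
    real="$(readlink -f "$java_bin")" || return 1
    # Strip the trailing /bin/java, then a trailing /jre if there is one.
    real="${real%/bin/java}"
    real="${real%/jre}"
    printf '%s\n' "$real"
}

# Example (not run automatically):
#   export JAVA_HOME="$(java_home_from "$(command -v java)")"
```

Compare the printed path with the one written to java.sh before relying on either.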

Step 2 – Setup Hadoop user account

It is recommended to create a non-root user account for the Hadoop environment. Run these commands as root:

$ adduser hadoop
$ passwd hadoop

 

Set up key-based SSH login for the hadoop account:

$ su - hadoop
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

 

Check that key-based login works, then exit the SSH session:

$ ssh localhost
$ exit
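If the setup is ever re-run, the cat >> step appends the same key again. A minimal sketch of an idempotent variant follows; add_key_once is a hypothetical helper name, not part of OpenSSH.

```shell
# Hypothetical helper: add a public key to authorized_keys only if it is
# missing, so repeated runs do not pile up duplicate entries.
add_key_once() {
    local pub auth
    pub="$1"
    auth="$2"
    touch "$auth"
    # -q quiet, -x match the whole line, -F treat the key as a fixed string
    grep -qxF -e "$(cat "$pub")" "$auth" || cat "$pub" >> "$auth"
    chmod 0600 "$auth"
}

# Example (not run automatically):
#   ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa     # empty passphrase, no prompts
#   add_key_once ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
#   ssh -o BatchMode=yes localhost true && echo "key login works"
```

BatchMode makes ssh fail instead of asking for a password, which is handy for checking that the key is really being used.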

 

Step 3 – Download Hadoop

Download the Hadoop 2.8.0 binary release. For other versions, refer to http://hadoop.apache.org. If the mirror used below no longer carries 2.8.0, older releases are kept on archive.apache.org.

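# Run as root: /usr/local is not writable by the hadoop user
$ cd /usr/local
$ wget http://apache.claz.org/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz
$ tar xzf hadoop-2.8.0.tar.gz
$ mv hadoop-2.8.0 hadoop
# Hand the tree over to the hadoop user so the daemons can write logs
$ chown -R hadoop:hadoop /usr/local/hadoop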
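Before unpacking a tarball fetched from a mirror, it is worth comparing its checksum with the one published by Apache. The sketch below assumes sha256sum is available; verify_sha256 is a hypothetical helper, and the expected digest has to be copied from the official download page since none is hard-coded here.

```shell
# Hypothetical helper: compare a file's SHA-256 digest with an expected value.
verify_sha256() {
    local file expected actual
    file="$1"
    # Normalise the expected digest: drop spaces/newlines, lower-case hex.
    expected="$(printf '%s' "$2" | tr -d ' \n' | tr 'A-F' 'a-f')"
    actual="$(sha256sum "$file" | awk '{print $1}')"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file" >&2
        return 1
    fi
}

# Example (not run automatically):
#   verify_sha256 hadoop-2.8.0.tar.gz "<digest from the Apache download page>"
```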

 

Step 4 – Configure Hadoop Pseudo-Distributed Mode

  1. Set Up Environment Variables

Edit the ~/.bashrc file and append the following lines at the end of the file.

export HADOOP_HOME=/usr/local/hadoop

export HADOOP_INSTALL=$HADOOP_HOME

export HADOOP_MAPRED_HOME=$HADOOP_HOME

export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME

export HADOOP_YARN_HOME=$HADOOP_HOME

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Now apply the changes to the current shell session:

$ source ~/.bashrc
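After sourcing ~/.bashrc, a quick sanity check can confirm that HADOOP_HOME points at a real installation. This is only a sketch: check_hadoop_home is a hypothetical helper, and the four files it looks for are simply the scripts this guide uses later.

```shell
# Hypothetical helper: confirm that a directory looks like a Hadoop install
# by checking for the scripts used later in this guide.
check_hadoop_home() {
    local home f
    home="$1"
    for f in bin/hadoop bin/hdfs sbin/start-dfs.sh sbin/start-yarn.sh; do
        if [ ! -e "$home/$f" ]; then
            echo "missing: $home/$f" >&2
            return 1
        fi
    done
    echo "looks good: $home"
}

# Example (not run automatically):
#   check_hadoop_home "$HADOOP_HOME" && hadoop version
```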

 

Edit $HADOOP_HOME/etc/hadoop/hadoop-env.sh and set JAVA_HOME

# Change Java home path as per java installed on your system

export JAVA_HOME=/usr/lib/jvm/java-openjdk

  2. Edit Configuration Files

Hadoop has several configuration files, which must be adjusted to the requirements of your Hadoop environment. Start by changing to the configuration directory:

$ cd $HADOOP_HOME/etc/hadoop

 

i) Edit core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
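For scripted installs, the same file can be produced from a here-document. The sketch below is one possible way to do it; write_core_site is a hypothetical helper, and it uses fs.defaultFS, the current name of the property (fs.default.name is its deprecated alias).

```shell
# Hypothetical helper: write a minimal core-site.xml for a given host and port.
# Usage: write_core_site <conf_dir> [host] [port]
write_core_site() {
    local conf_dir host port
    conf_dir="$1"
    host="${2:-localhost}"
    port="${3:-9000}"
    cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://${host}:${port}</value>
  </property>
</configuration>
EOF
}

# Example (not run automatically):
#   write_core_site "$HADOOP_HOME/etc/hadoop" localhost 9000
```

Note that this overwrites any existing core-site.xml in the target directory.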

 

ii) Edit hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
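The two storage paths above live under /home/hadoop/hadoopdata. Creating them up front, as the hadoop user, avoids ownership surprises later. A minimal sketch, with make_hdfs_dirs as a hypothetical helper name:

```shell
# Hypothetical helper: create the NameNode and DataNode storage directories
# under a base directory such as /home/hadoop/hadoopdata.
make_hdfs_dirs() {
    local base
    base="$1"
    mkdir -p "$base/hdfs/namenode" "$base/hdfs/datanode" || return 1
    # Keep the block data private to the owning user.
    chmod 700 "$base/hdfs/datanode"
}

# Example (not run automatically, run as the hadoop user):
#   make_hdfs_dirs /home/hadoop/hadoopdata
```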

 

iii) Edit mapred-site.xml

This file ships only as a template, so copy it first and then edit the copy:

$ cp mapred-site.xml.template mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

 

iv) Edit yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

 

  3. Format the Hadoop NameNode

Once the single-node cluster is configured, initialize the HDFS file system by formatting the NameNode:

$ hdfs namenode -format

 

Sample output:

17/02/14 08:13:20 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ip-172-31-10-127.us-west-2.compute.internal/172.31.10.127
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.8.0
...
...
17/02/14 08:13:30 INFO namenode.FSImage: Allocated new BlockPoolId: BP-415680745-172.31.10.127-1487060010110
17/02/14 08:13:30 INFO common.Storage: Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully formatted.
17/02/14 08:13:30 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
17/02/14 08:13:30 INFO util.ExitUtil: Exiting with status 0
17/02/14 08:13:30 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ip-172-31-10-127.us-west-2.compute.internal/172.31.10.127
************************************************************/
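Formatting wipes the NameNode metadata, so it should happen only once. A formatted NameNode directory contains a current/VERSION file, which makes a simple guard possible. The function names below are hypothetical, and the directory argument is assumed to be the NameNode path from hdfs-site.xml.

```shell
# Hypothetical guard: a formatted NameNode directory contains current/VERSION.
namenode_is_formatted() {
    [ -f "$1/current/VERSION" ]
}

# Format only when the directory has not been formatted before.
format_if_needed() {
    local nn_dir
    nn_dir="$1"
    if namenode_is_formatted "$nn_dir"; then
        echo "already formatted: $nn_dir"
    else
        hdfs namenode -format
    fi
}

# Example (not run automatically):
#   format_if_needed /home/hadoop/hadoopdata/hdfs/namenode
```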

 

Step 5 – Start Hadoop Cluster

Start your Hadoop cluster using the scripts provided by Hadoop. Navigate to the Hadoop sbin directory and run the scripts one by one.

$ cd $HADOOP_HOME/sbin/

 

Run start-dfs.sh to start the NameNode, DataNode and SecondaryNameNode:

$ start-dfs.sh

 

Sample output:

17/02/14 08:16:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-namenode-ip-172-31-10-127.out
localhost: starting datanode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-datanode-ip-172-31-10-127.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
RSA key fingerprint is a2:9b:7c:8f:21:43:6e:ce:18:5e:85:5b:a1:57:d2:99.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (RSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-secondarynamenode-ip-172-31-10-127.out
17/02/14 08:16:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Run start-yarn.sh to start the YARN daemons (ResourceManager and NodeManager):

$ start-yarn.sh

 

Sample output:

starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-resourcemanager-ip-172-31-10-127.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-nodemanager-ip-172-31-10-127.out

To check the status of the services, run the jps command:

$ jps

 

Sample output:

12544 NameNode
13001 ResourceManager
13104 NodeManager
12672 DataNode
13993 Jps
12843 SecondaryNameNode
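Reading the jps listing by eye works for a single node, but the check is easy to script. The sketch below reads jps output on standard input and prints any of the five daemons that is missing. The function name missing_daemons is made up for this example.

```shell
# Hypothetical helper: read `jps` output on stdin and print every expected
# daemon that is not listed. Prints nothing when all five are running.
missing_daemons() {
    local out d
    out="$(cat)"
    for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # Compare against the whole second column, so "NameNode" does not
        # accidentally match "SecondaryNameNode".
        printf '%s\n' "$out" \
            | awk -v n="$d" '$2 == n { found = 1 } END { exit !found }' \
            || echo "$d"
    done
}

# Example (not run automatically):
#   jps | missing_daemons
```

An empty result means every daemon from the sample output is present. Any name printed points to a log file under $HADOOP_HOME/logs worth inspecting.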

 

Step 6 – Check Hadoop Services

Each daemon serves a web interface. In the URLs below, replace HOST_NAME with the hostname or IP address of your server.

Open port 50070 for information about the NameNode:

http://HOST_NAME:50070/

Open port 8088 for information about the cluster and its applications:

http://HOST_NAME:8088/

Open port 50090 for information about the SecondaryNameNode:

http://HOST_NAME:50090/

Open port 50075 for information about the DataNode:

http://HOST_NAME:50075/
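On a server without a browser, the same four pages can be probed with curl. This is a sketch under the assumption that curl is installed. Both function names are hypothetical, and the ports are the ones listed above.

```shell
# Hypothetical helper: print the four web UI URLs for a host.
hadoop_ui_urls() {
    local host port
    host="${1:-localhost}"
    for port in 50070 8088 50090 50075; do
        echo "http://${host}:${port}/"
    done
}

# Probe each URL and print its HTTP status code.
# curl reports 000 when no connection could be made.
check_hadoop_uis() {
    local url code
    for url in $(hadoop_ui_urls "$1"); do
        code="$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$url")"
        echo "$url -> $code"
    done
}

# Example (not run automatically):
#   check_hadoop_uis localhost
```

If a page answers locally but not from another machine, the host firewall is a likely suspect.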

Step 7 – Test Hadoop Setup

i) Make the HDFS directories

Because $HADOOP_HOME/bin is on the PATH, hdfs can be called from any directory:

$ hdfs dfs -mkdir /user

$ hdfs dfs -mkdir /user/hadoop
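With the home directory in place, a fuller test is to run one of the MapReduce examples bundled with the release. The sketch below follows the grep example from the Apache single-node setup guide. The function names are hypothetical, and the jar path assumes the /usr/local/hadoop layout used in this post.

```shell
# Hypothetical helper: path of the bundled MapReduce examples jar.
examples_jar() {
    local home version
    home="${1:-/usr/local/hadoop}"
    version="${2:-2.8.0}"
    echo "$home/share/hadoop/mapreduce/hadoop-mapreduce-examples-$version.jar"
}

# Copy the config files into HDFS, run the bundled grep job over them,
# then print the job's result. Relative HDFS paths resolve under /user/hadoop.
run_smoke_test() {
    hdfs dfs -mkdir -p input &&
    hdfs dfs -put "$HADOOP_HOME"/etc/hadoop/*.xml input &&
    hadoop jar "$(examples_jar "$HADOOP_HOME")" grep input output 'dfs[a-z.]+' &&
    hdfs dfs -cat 'output/*'
}

# Example (not run automatically, needs a running cluster):
#   run_smoke_test
```

The job refuses to run if the output directory already exists, so remove it with hdfs dfs -rm -r output before a second run.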

Manage Hadoop Services

To start all Hadoop daemons, run the commands below:

$ start-dfs.sh
$ start-yarn.sh

 

To stop all Hadoop daemons, run the commands below:

$ stop-yarn.sh
$ stop-dfs.sh
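The order matters: HDFS comes up before YARN and goes down after it. A small wrapper can encode that so the order never has to be remembered. This is only a sketch; service_order and hadoop_ctl are hypothetical names, and HADOOP_HOME is assumed to be set as in Step 4.

```shell
# Hypothetical helper: print the scripts for an action in a safe order.
# HDFS starts before YARN and stops after it.
service_order() {
    case "$1" in
        start) echo "start-dfs.sh start-yarn.sh" ;;
        stop)  echo "stop-yarn.sh stop-dfs.sh" ;;
        *)     echo "usage: service_order start|stop" >&2; return 1 ;;
    esac
}

# Run the scripts from $HADOOP_HOME/sbin in that order, stopping on failure.
hadoop_ctl() {
    local order script
    order="$(service_order "$1")" || return 1
    for script in $order; do
        "$HADOOP_HOME/sbin/$script" || return 1
    done
}

# Example (not run automatically):
#   hadoop_ctl start
#   hadoop_ctl stop
```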

 

Hope this article helped you set up Hadoop 2.8.0 (Single Node Cluster) on CentOS. If you have any doubts or queries, please comment below. For updates follow agiratechnologies.
