Hadoop 2.6.4 Pseudo Distributed Mode Installation on Ubuntu 14.04

posted on Nov 20th, 2016

Apache Hadoop

Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.

The Hadoop framework works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Prerequisites

1) A machine with Ubuntu 14.04 LTS operating system installed.

2) Apache Hadoop 2.6.4 software (available from the Apache Hadoop release archive)

Pseudo Distributed Mode (Single Node Cluster)

The Hadoop daemons run on a local machine, thus simulating a cluster on a small scale. Each Hadoop daemon runs in its own JVM instance, but all on a single machine. HDFS is used instead of the local file system.

Hadoop Pseudo Distributed Mode Installation on Ubuntu 14.04

Step 1 - Update the package index. Open a terminal (CTRL + ALT + T) and run the following command. It is advisable to do this before installing any package, and it is necessary for picking up the latest updates even if you have not added or removed any software sources.

$ sudo apt-get update

Step 2 - Install Java 7. Hadoop runs on the JVM, so a JDK is required; OpenJDK 7 is used here.

$ sudo apt-get install openjdk-7-jdk
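
Once the installation finishes, it is worth confirming that the JDK is on the PATH and locating its install directory, since the same path is needed for JAVA_HOME in steps 12 and 19. The readlink trick below resolves the symlinks behind the javac command:

$ java -version
$ readlink -f $(which javac)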

Step 3 - Install the OpenSSH server. SSH is a cryptographic network protocol for operating network services securely over an unsecured network; its best-known application is remote login to computer systems. Hadoop's start and stop scripts use SSH to manage the daemons, even on a single node.

$ sudo apt-get install openssh-server
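
The server normally starts automatically after installation; on Ubuntu 14.04, which uses Upstart, you can verify that it is running with:

$ sudo service ssh status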

Step 4 - Create a group and user. We will create a group, add a new user to it, and then (in the next step) give that user sudo permissions. Here 'hadoop' is the group name and 'hduser' is the user that will run Hadoop.

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
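
You can confirm that the account was created with 'hadoop' as its primary group:

$ id hduser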

Step 5 - Configure the sudo permissions for 'hduser'.

$ sudo visudo

visudo opens the sudoers file in the system's default editor, which on Ubuntu is nano. Add the following line to grant 'hduser' full sudo privileges:

hduser ALL=(ALL) ALL

Press CTRL + O and then Enter to save the file, then CTRL + X to exit nano.

Step 6 - Create the Hadoop installation directory.

$ sudo mkdir /usr/local/hadoop

Step 7 - Change the ownership and permissions of the directory /usr/local/hadoop. Here 'hduser' is an Ubuntu username.

$ sudo chown -R hduser /usr/local/hadoop
$ sudo chmod -R 755 /usr/local/hadoop
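
A quick listing confirms the new owner and permissions before we continue:

$ ls -ld /usr/local/hadoop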

Step 8 - Switch to the 'hduser' account. The su (switch user) command executes commands with the privileges of another user account.

$ su hduser

Step 9 - Change the directory to /home/hduser/Desktop. In my case the downloaded hadoop-2.6.4.tar.gz file is in the /home/hduser/Desktop folder; yours might be in the Downloads folder, so check its location first.

$ cd /home/hduser/Desktop/

Step 10 - Untar the hadoop-2.6.4.tar.gz file.

$ tar xzf hadoop-2.6.4.tar.gz

Step 11 - Move the contents of the hadoop-2.6.4 folder to /usr/local/hadoop.

$ mv hadoop-2.6.4/* /usr/local/hadoop

Step 12 - Edit the $HOME/.bashrc file, adding the Java and Hadoop paths. Since the file belongs to 'hduser', sudo is not needed here.

$ gedit $HOME/.bashrc

Add the following lines to the end of the $HOME/.bashrc file:

# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native"

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

Step 13 - Reload your changed $HOME/.bashrc settings

$ source $HOME/.bashrc
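
If the file was sourced correctly, the Hadoop variables are set and the hadoop binary is on the PATH:

$ echo $HADOOP_HOME
$ hadoop version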

Step 14 - Generating a new SSH public and private key pair on your local computer is the first step towards authenticating with a remote server without a password. Unless there is a good reason not to, you should always authenticate using SSH keys.

$ ssh-keygen -t rsa -P ""

Step 15 - Now you can append the public key to the authorized_keys file.

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
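
SSH is strict about key file permissions and falls back to password authentication when they are too open; if the next step still prompts for a password, tightening the permissions usually fixes it:

$ chmod 700 $HOME/.ssh
$ chmod 600 $HOME/.ssh/authorized_keys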

Step 16 - Add localhost to the list of known hosts. Logging in once is a quick way of making sure that 'localhost' is added to the list of known hosts, so that script execution later doesn't get interrupted by a question about trusting localhost's authenticity. Answer 'yes' when prompted, then type 'exit' to return to your own shell.

$ ssh localhost 

Step 17 - Change the directory to /usr/local/hadoop/etc/hadoop

$ cd $HADOOP_HOME/etc/hadoop

Step 18 - Edit hadoop-env.sh file.

$ sudo gedit hadoop-env.sh

Step 19 - Add the below lines to hadoop-env.sh file. Save and Close.

# Uncomment/replace the default JAVA_HOME line with the actual JDK path
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

Step 20 - Edit core-site.xml file.

$ sudo gedit core-site.xml

Step 21 - Add the following property between the <configuration> and </configuration> tags in the core-site.xml file. Save and close.

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

Step 22 - Edit hdfs-site.xml file.

$ sudo gedit hdfs-site.xml

Step 23 - Add the following properties between the <configuration> and </configuration> tags in the hdfs-site.xml file. Save and close.

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/app/hadoop/tmp/namenode</value>
</property>

<property>
  <name>dfs.datanode.data.dir</name>
  <value>/app/hadoop/tmp/datanode</value>
</property>

Step 24 - Edit yarn-site.xml file.

$ sudo gedit yarn-site.xml

Step 25 - Add the following property between the <configuration> and </configuration> tags in the yarn-site.xml file. Save and close.

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>

Step 26 - Copy the default mapred-site.xml.template to mapred-site.xml

$ cp mapred-site.xml.template mapred-site.xml

Step 27 - Edit mapred-site.xml file.

$ sudo gedit mapred-site.xml

Step 28 - Add the following property between the <configuration> and </configuration> tags in the mapred-site.xml file. Save and close.

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

Step 29 - Edit slaves file.

$ sudo gedit slaves

Step 30 - Add the following line to the slaves file. Save and close.

localhost

Step 31 - Create the /app/hadoop/tmp directory that hdfs-site.xml points the NameNode and DataNode at. The -p flag creates the intermediate directories as well.

$ sudo mkdir -p /app/hadoop/tmp

Step 32 - Change the ownership and permissions of the directory /app/hadoop/tmp. Here 'hduser' is an Ubuntu username.

$ sudo chown -R hduser /app/hadoop/tmp
$ sudo chmod -R 755 /app/hadoop/tmp

Step 33 - Change the directory to /usr/local/hadoop/sbin

$ cd /usr/local/hadoop/sbin

Step 34 - Format the NameNode. This initializes the HDFS metadata storage and should be done only once, before the cluster is started for the first time.

$ hdfs namenode -format

Step 35 - Start NameNode daemon and DataNode daemon.

$ start-dfs.sh
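
Each daemon writes a log file under $HADOOP_HOME/logs; if one of the daemons turns out to be missing in step 37 below, its log is the first place to look:

$ ls $HADOOP_HOME/logs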

Step 36 - Start YARN daemons (ResourceManager and NodeManager).

$ start-yarn.sh

OR

Instead of steps 35 and 36 you can use the command below, although it is now deprecated.

$ start-all.sh

Step 37 - Verify the daemons with jps. The jps (Java Virtual Machine Process Status) tool is limited to reporting information on JVMs for which it has access permissions.

$ jps
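
If everything started cleanly, jps should list the five Hadoop daemons along with itself. The process IDs below are only illustrative:

$ jps
2683 NameNode
2806 DataNode
3021 SecondaryNameNode
3174 ResourceManager
3298 NodeManager
3620 Jps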


Once the Hadoop cluster is up and running, check the web UIs of the components as described below.

NameNode - browse the web interface for the NameNode; by default it is available at:

http://localhost:50070/


ResourceManager - browse the web interface for the ResourceManager; by default it is available at:

http://localhost:8088/

Step 38 - Make the HDFS directories required to execute MapReduce jobs.

$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/hduser
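
You can confirm the home directory exists before loading any data into it:

$ hdfs dfs -ls /user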

Step 39 - Copy the input files into the distributed filesystem.

$ hdfs dfs -put /usr/local/hadoop/etc/hadoop /user/hduser/input

Step 40 - Run some of the examples provided. Note that the job will fail if the output directory already exists, so remove /user/hduser/output first if you are re-running it.


$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.4.jar grep /user/hduser/input /user/hduser/output 'dfs[a-z.]+'

Step 41 - Examine the output files.

$ hdfs dfs -cat /user/hduser/output/*
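
The output directory should also contain an empty _SUCCESS marker alongside one part file per reducer:

$ hdfs dfs -ls /user/hduser/output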

Step 42 - Stop NameNode daemon and DataNode daemon.

$ stop-dfs.sh

Step 43 - Stop YARN daemons.

$ stop-yarn.sh

OR

Instead of steps 42 and 43 you can use the command below, although it is now deprecated.

$ stop-all.sh
