
Monday, 4 November 2013

Cloudera Hadoop Installation




1)    Summary

This CDH4 Cluster Installation Guide is for Hadoop developers and system administrators interested in Hadoop cluster installation. The following sections describe how to install and configure version 4 of Cloudera's Distribution Including Apache Hadoop (CDH4).

This document helps you configure a multi-node Hadoop cluster and tune the cluster for better performance. It also contains best practices for HBase and MapReduce configuration.


2)    Prerequisites

A.     Supported Operating System

CDH4 supports the following operating systems:


  •   Red Hat-compatible systems

·         Red Hat Enterprise Linux 5.7 and CentOS 5.7
·         Red Hat Enterprise Linux 6.2 and CentOS 6.2
·         Oracle Enterprise Linux 5.6 with Unbreakable Enterprise Kernel
  •   SLES systems

·         SUSE Linux Enterprise Server 11. Service Pack 1 or later is required.
  •     Debian systems

·         Debian 6.0 (Squeeze)

  •   Ubuntu systems

·         Ubuntu 10.04
·         Ubuntu 12.04


Cloudera Manager supports only 64-bit operating systems.

B.     Software

o   Perl
o   SSH
o   openssh-server
o   openssh-clients
o   Cloudera Manager

C.     Unique Host name

o   Each host name must be unique on your network, e.g. node1.example.com



3)    Introduction to Cloudera Manager Installation

Cloudera Manager automates the installation and configuration of CDH on an entire cluster, requiring only that you have root SSH access to your cluster's machines, and access to the internet or a local repository with installation files for all these machines.
It consists of the following components:

·         Cloudera Manager Server
·         Cloudera Manager Agent
·         PostgreSQL database


·         How Cloudera Manager works. During installation it will:

o   Using SSH, discover the cluster hosts you specify via IP address ranges or hostnames
o   Configure the package repositories for Cloudera Manager, CDH, and the Oracle JDK
o   Install the Cloudera Manager Agent and CDH (including Hue) on the cluster hosts
o   Install the Oracle JDK if it's not already installed on the cluster hosts
o   Determine the mapping of services to hosts
o   Suggest a Hadoop configuration and start the Hadoop services

You can also add nodes to the cluster and remove nodes from it.


4)     Preparation for installation

A.     Networking

o   Check the internet settings on every host
o   All cluster hosts must use the same DNS, with forward and reverse DNS properly configured
o   Check that each hostname follows the standard fully qualified format, e.g. node1.example.com
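
The naming and DNS requirements above can be checked before the installer is started. The following is a minimal sketch and not part of Cloudera's tooling: the function names is_fqdn and check_dns are made up for illustration, and it assumes getent, grep, and awk are available, as they are on typical Linux systems.

```shell
# Return success if the argument looks like a fully qualified
# host name such as node1.example.com (letters, digits, hyphens, dots).
is_fqdn() {
    printf '%s\n' "$1" |
        grep -Eqi '^([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}$'
}

# Resolve a host name to an address and back again, and report
# whether the reverse lookup returns the same name.
check_dns() {
    host=$1
    ip=$(getent hosts "$host" | awk '{ print $1; exit }')
    if [ -z "$ip" ]; then
        echo "$host: forward lookup failed"
        return 1
    fi
    back=$(getent hosts "$ip" | awk '{ print $2; exit }')
    if [ "$back" != "$host" ]; then
        echo "$host: $ip resolves back to '$back'"
        return 1
    fi
    echo "$host: OK ($ip)"
}

# Example (host names are placeholders):
#   for h in master.example.com node01.example.com; do
#       is_fqdn "$h" && check_dns "$h"
#   done
```

Running the loop on every host, rather than on the master only, helps confirm that all hosts see the same DNS answers.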

B.     Firewall and Security

o   The Cloudera Manager Server must have SSH access to the cluster hosts when you run the installation wizard.
Note
You must log in using a root account or an account that has password-less sudo permission. For authentication during the installation and upgrade procedures, you will need to either enter the password or upload a public and private key pair for the root or sudo user account.

Cloudera Manager uses SSH only during the initial install or upgrade. Once your cluster is set up, you can safely disable root SSH access or change the root password. Cloudera Manager does not save SSH credentials and all credential information is discarded once the installation is complete.

I.            Configure SSH

Steps:
a)      ssh-keygen
b)      cd /root/.ssh/
c)      ls
d)      cp id_rsa.pub authorized_keys
e)      Append the id_rsa.pub of every other host to authorized_keys and store the combined file in the same location on each machine, so that every machine can SSH to the others.
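
The last step, combining the public keys of all hosts into one file, could be scripted roughly as follows. This is a sketch under assumptions: merge_authorized_keys is a hypothetical helper name, the host names in the example are placeholders, and the keys are assumed to be standard OpenSSH public key files named id_rsa.pub.

```shell
# Merge one or more public key files into a single authorized_keys
# file, dropping blank lines and duplicate keys.
merge_authorized_keys() {
    out=$1
    shift
    tmp="$out.tmp.$$"
    cat "$@" | grep -v '^[[:space:]]*$' | sort -u > "$tmp"
    chmod 600 "$tmp"          # sshd ignores key files with loose permissions
    mv "$tmp" "$out"
}

# Example, run as root on one host after every host has run ssh-keygen:
#   scp root@node01.example.com:/root/.ssh/id_rsa.pub /tmp/node01.pub
#   scp root@node02.example.com:/root/.ssh/id_rsa.pub /tmp/node02.pub
#   merge_authorized_keys /root/.ssh/authorized_keys \
#       /root/.ssh/id_rsa.pub /tmp/node01.pub /tmp/node02.pub
#   scp /root/.ssh/authorized_keys root@node01.example.com:/root/.ssh/
```

Writing to a temporary file first means the existing authorized_keys can safely be passed as one of the inputs.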

II.            Check the ssh_config file settings

When working with multiple systems, ssh is the indispensable tool. With ssh you can log in to remote systems and work on them as if you were sitting in front of them. Even systems behind firewalls can be reached with ssh, but doing so can require a number of command-line options, and the more systems you have, the harder they are to remember. You only have to work them out once, though: enter them in ssh's config file and be done with it.
Steps
1)      vi /etc/ssh/ssh_config
Change the StrictHostKeyChecking value from ask to no, i.e. StrictHostKeyChecking no
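
The same change can be made without an editor. The helper below is a sketch only: set_ssh_option is a hypothetical name, and it assumes the option appears at most once in the file, either active or commented out as in a stock ssh_config. Note that turning off host key checking removes a protection against connecting to an impostor host, so it is best limited to a trusted cluster network.

```shell
# Set "key value" in an ssh_config-style file. An existing line for the
# key (commented out or not) is replaced; otherwise the line is appended.
set_ssh_option() {
    file=$1
    key=$2
    value=$3
    if grep -Eq "^[#[:space:]]*${key}[[:space:]]" "$file"; then
        tmp=$(mktemp)
        sed -E "s|^[#[:space:]]*${key}[[:space:]].*|${key} ${value}|" "$file" > "$tmp" &&
            cat "$tmp" > "$file"      # keep the original file's owner and mode
        rm -f "$tmp"
    else
        printf '%s %s\n' "$key" "$value" >> "$file"
    fi
}

# Example, as root:
#   set_ssh_option /etc/ssh/ssh_config StrictHostKeyChecking no
```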

III.            Turn off iptables

iptables is the administration tool for IPv4 packet filtering and NAT. You need the following tools:
[a] service runs a System V init script. It is used to save, stop, and start the firewall service.
[b] chkconfig updates and queries runlevel information for system services. It is a system tool for maintaining the /etc/rc*.d hierarchy. Use it to disable the firewall service at boot time.
Steps
a)      service iptables save
b)      service iptables stop
c)      chkconfig iptables off
d)      /etc/init.d/network restart

C.     Disable SELinux

Steps to disable SELinux:

o   vi /etc/selinux/config

SELINUX=disabled

Reboot the host for the change to take effect.
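
For many hosts, editing the file by hand gets tedious. The following sketch makes the same edit non-interactively; disable_selinux_in is a hypothetical helper name, and the file is assumed to have the usual SELINUX=... line found in /etc/selinux/config.

```shell
# Set SELINUX=disabled in the given SELinux config file.
# Other lines, such as SELINUXTYPE=..., are left untouched.
disable_selinux_in() {
    file=$1
    tmp=$(mktemp)
    sed 's/^SELINUX=.*/SELINUX=disabled/' "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example, as root:
#   disable_selinux_in /etc/selinux/config
#   setenforce 0    # running system becomes permissive until the next reboot
#   getenforce      # shows the current mode
```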

D.    Proxy Setting

In the network proxy settings, check "Use this proxy server for all protocols".

In a terminal:

sudo gedit /etc/yum.conf
# The proxy server, in the form proxy server:port number
proxy=http://proxy1.xx.com:8080
# The account details for yum connections
proxy_username=<username>
proxy_password=<your password>
Then save the file and run:
sudo yum clean all
E.      Host Entry

This snippet describes the format of the /etc/hosts file. This file is a simple text file that associates IP addresses with hostnames, one line per IP address. You should have some subset of all hostnames in /etc/hosts. You should have some sort of name resolution, even when no network interfaces are running, for example, during boot time. This is not only a matter of convenience, but it allows you to use symbolic hostnames in your network RC scripts. Thus, when changing IP addresses, you only have to copy an updated hosts file to all machines and reboot, rather than edit a large number of RC files separately. Usually you put all local hostnames and addresses in hosts, adding those of any gateways and NIS servers used. For each host a single line should be present with the following information:
 sudo gedit /etc/hosts
127.0.0.1              localhost              localhost
10.xxx.xxx.xxx  master.example.com     master
10.xxx.xxx.xxx  node01.example.com    node01
10.xxx.xxx.xxx  node02.example.com    node02
10.xxx.xxx.xxx  node03.example.com    node03
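
Before copying the hosts file to every machine, it can be sanity-checked. The sketch below is an illustration, not a complete validator: check_hosts_file is a hypothetical name, only IPv4 entries are examined (an assumption that fits the example above), and it checks just two things, that 127.0.0.1 maps to localhost and that no name is mapped to two different addresses.

```shell
# Check a hosts file: there must be a "127.0.0.1 ... localhost" entry,
# and no host name may be mapped to two different addresses.
# Prints each problem found and returns non-zero if there is any.
check_hosts_file() {
    awk '
        BEGIN { bad = 0; has_local = 0 }
        /^[ \t]*(#|$)/ { next }     # skip comments and blank lines
        $1 ~ /:/       { next }     # skip IPv6 entries (assumption: IPv4 cluster)
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break            # rest of the line is a comment
                if ($1 == "127.0.0.1" && $i == "localhost") has_local = 1
                if (($i in seen) && seen[$i] != $1) {
                    printf "%s maps to both %s and %s\n", $i, seen[$i], $1
                    bad = 1
                }
                seen[$i] = $1
            }
        }
        END {
            if (!has_local) { print "no 127.0.0.1 localhost entry"; bad = 1 }
            exit bad
        }
    ' "$1"
}

# Example:
#   check_hosts_file /etc/hosts && echo "hosts file looks consistent"
```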

F.      Update OS Setup

Updating the OS is not necessary, but it is better to use the latest stable version.
Enable the fastest-mirror plugin for the update:
vi /etc/yum/pluginconf.d/fastestmirror.conf
Set enabled=1
yum -y update








G.    Download Cloudera Manager

If you have curl installed, use it to download the installer, or click the download link.
If the link does not work, go to the Cloudera website and download the latest stable free version.
Make the installer executable and run it:
chmod +x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin

Accept the license agreements and click OK.
After some time, Cloudera Manager will be installed. The Admin Console is then available at:
http://master.example.com:7180

To start the Cloudera Manager Admin Console:
1.       In a web browser, enter the URL, including the port, for the Cloudera Server. The login screen for Cloudera Manager appears.
2.        Log into Cloudera Manager. The default credentials are:
Username: admin
Password: admin

5)    Setting up Cloudera Manager


Follow these steps:

Step 1: Registration Document

·         You can register with Cloudera by clicking Submit Registration.
·         Otherwise, just click Proceed.

 








Figure 1:- Registration Document

Step 2: Specify Host for installation

To enable Cloudera Manager to automatically discover your cluster hosts where you want to install CDH, enter the cluster hostnames or IP addresses and click Search. You can also specify hostname and IP address ranges:


Use This Expansion Range      To Specify These Hosts
10.1.1.[1-4]                  10.1.1.1, 10.1.1.2, 10.1.1.3, 10.1.1.4
host[1-3].network.com         host1.network.com, host2.network.com, host3.network.com
host[07-10].network.com       host07.network.com, host08.network.com, host09.network.com, host10.network.com

You can specify multiple addresses and address ranges by separating them by commas, semicolons, tabs, or blank spaces, or by placing them on separate lines. Use this technique to make more specific searches instead of searching overly wide ranges.
The scan results will include all addresses scanned, but only scans that reach hosts running SSH will be selected for inclusion in your cluster by default.
Note: If you don't know the IP addresses of all of the hosts, you can enter an address range that spans unused addresses. Larger ranges take more time to scan.
You can abort any actively running scan by clicking Abort Scan. To find additional hosts after scanning completes, add or modify the hostname or IP address ranges and click Search again.
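
To preview which hosts a range pattern covers before entering it in the wizard, the expansion can be imitated in a few lines. This is a sketch of the idea, not Cloudera Manager's own implementation: expand_range is a hypothetical name, it handles a single [first-last] range per pattern, and it pads numbers to the width of the first bound, which reproduces the zero-padded example in the table above.

```shell
# Expand one bracketed numeric range in a host pattern, one host per line.
# Patterns without a range are printed unchanged.
expand_range() {
    pattern=$1
    case $pattern in
        *\[*-*\]*) ;;
        *) printf '%s\n' "$pattern"; return 0 ;;
    esac
    prefix=${pattern%%\[*}      # text before "["
    rest=${pattern#*\[}
    range=${rest%%\]*}          # e.g. "07-10"
    suffix=${rest#*\]}          # text after "]"
    first=${range%-*}
    last=${range#*-}
    width=${#first}             # pad output to the width of the first bound
    # Strip leading zeros so the shell does not read the numbers as octal.
    i=$(printf '%s' "$first" | sed 's/^0*\([0-9]\)/\1/')
    end=$(printf '%s' "$last" | sed 's/^0*\([0-9]\)/\1/')
    while [ "$i" -le "$end" ]; do
        printf "%s%0${width}d%s\n" "$prefix" "$i" "$suffix"
        i=$((i + 1))
    done
}

# Example:
#   expand_range 'host[07-10].network.com'
#   prints host07.network.com through host10.network.com, one per line
```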


Figure 2: Specify Host for Installation



Step 3: Connecting Specified hosts With SSH

This step only checks whether the specified hosts are reachable. Reachability is tested over SSH from the node where the scan was started. If a host cannot be reached from that node, it is shown as not reachable. If SSH is not running on a host, the message “Could not connect to host” is shown. The screen also shows how long each host took to respond.




Figure 3: Connecting Specified hosts With SSH

Step 4: Choose CDH Version

This screen shows the available versions of CDH. It also supports offline installation from a custom repository.


Figure 4: Choose CDH Version

Step 5: Provide SSH Login Credentials

To authenticate with the hosts, you must either use a root account that is on all of your cluster hosts, or use an account that has password-less sudo permissions. Select root or enter the user name for an account that has password-less sudo permissions. You can either use a shared password for the account, or use a public and private key pair.
To enter a password, select All hosts accept same password and enter the account password. To use a public and private key pair, select All hosts accept same public key, then specify or browse for the location of the public and private keys. If your keys are protected by a passphrase, enter it.





 Figure 5: SSH/Login Credentials

Step 6: Installation Done


The wizard runs a maximum of 10 installations in parallel to avoid excessive network load. The status of installation on each host is displayed on the page that appears after you click Start Installation. You can also click the Details link for individual hosts to view detailed information about the installation and error messages if installation fails on any hosts.
If you click the Abort Installation button while installation is in progress, it will halt any pending or in-progress installations and roll back any in-progress installations to a clean state. The Abort Installation button does not affect host installations that have already completed successfully or already failed.
If installation fails on a host, you can click the Retry link next to the failed host to try installation on that host again. To retry installation on all failed hosts, click Retry Failed Hosts at the bottom of the screen.
When the Continue button appears at the bottom of the screen, the installation process is complete.
If the installation has completed successfully on some hosts but failed on others, you can click Continue if you want to skip installation on the failed hosts and continue to the next screen to start installing the Cloudera Management services on the successful hosts.


 Figure 6: Installation on nodes

Step 7: Inspect hosts for correctness


When you continue, the Host Inspector runs to validate the installation, and provides a summary of what it finds, including all the versions of the installed components. If the validation is successful, click Continue.

Error 1: Clock is not synchronized. Synchronize the host's clock with the Cloudera Manager Server node.
Error 2: /etc/hosts file error. The hosts file contains an incorrect entry.
Error 3: Localhost error. The localhost line is missing from the /etc/hosts file.

 


 Figure 7: Check hosts for correctness.



Step 8: Choose services to install on the cluster

Choose the services you want to start on your cluster.

·         Choose which version of CDH to use.
·         Choose the combination of services to install: Core Hadoop, HBase Services, All Services, or Custom Services.

Some services depend on others; for example, HBase requires HDFS and ZooKeeper. Most of the combinations install MapReduce v1. Choose the custom option to install MapReduce v2 (YARN) or use the Add Service functionality to add YARN after installation completes.


 Figure 8: Choose services to Install.




Step 9: Inspect Role Assignments

Click Inspect Role Assignments to see how the wizard will assign roles for the services you have chosen, and change them if you need to. These assignments are typically acceptable, but you can reassign services to nodes of your choosing, if desired. The wizard evaluates the hardware configurations of the cluster hosts to determine the best machines for each role. For example, the wizard assigns the NameNode role to the machine that best meets the NameNode requirements. The wizard also configures other options, such as the number of map and reduce slots for TaskTracker, on the basis of the size of the cluster and the physical characteristics of each machine, such as the number of CPUs, amount of RAM, and disk space.




 Figure 9: Role Assignments.

Click Continue when you are satisfied with the assignments.



Step 10: Review the Configuration

Review the configuration changes to be applied.

·         Confirm the settings entered for file system paths. The file paths required vary based on the services to be installed. For example, you might confirm the NameNode Data Directory and the DataNode Data Directory for HDFS or confirm the TaskTracker Local Data Directory List or JobTracker Local Data Directory for MapReduce.


 Figure 10: Configuration for your Cluster
·         Click Continue.

·         The wizard starts the services on your cluster.
·         When all of the services are started, click Continue.
·         Click Continue.

Step 11: Change the Default Administrator Password

·         As soon as possible after running the wizard and beginning to use Cloudera Manager, you should change the default administrator password.
·         To change the administrator password:
·         Click the gear icon (top-right corner) to display the Administration page.
·         Click the Users tab.
·         Click the Change Password button next to the admin account.
·         Enter a new password twice and then click Submit.




Step 12: Test the Installation





 Figure 11:   All Services