Friday 25 January 2013

This blog is intended to give new users some guidelines for installing Hadoop on their local machines. It provides detailed installation steps. All the steps have been tested, but anyone can still reach out to me if they need any assistance.

CDH3 Pseudo Installation on Ubuntu (Single Node)

Apache Hadoop is an implementation of the MapReduce platform and a distributed file system (HDFS), written in Java. It can be considered a software framework that supports data-intensive distributed applications under a free license. In this blog I have tried to put together all the steps that will help you install Hadoop on your Windows machine by installing a virtual machine and then running Ubuntu inside it. Since Hadoop is written in Java, we will need the JDK (version 1.6 or above) installed. Let's get started............

0) Install VMware
a> Download VMware Workstation 8 --> https://my.vmware.com/web/vmware/info/slug/desktop_end_user_computing/vmware_workstation/8_0
b> Install VMware-workstation-full-8.0.0-471780 (click Enter).
c> Provide the serial number.
d> While installing, it will ask for the 32/64-bit image location (provide the path of ubuntu-10.04.3-desktop-amd64 or the 32-bit image).
Fig 1: Screen you see after the VM is installed.

1) Create a user other than 'hadoop'
During the Ubuntu installation, create a user, e.g. 'Master' (or your name), enter a password (e.g. 123456) and confirm it, then confirm the username on the next page. (A command-line alternative is sketched below.)
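If you prefer to create the additional user from a terminal after Ubuntu is installed (instead of during the installer), here is a minimal sketch; the username 'master' is just an example:
$ sudo adduser master          # prompts for a password and user details
$ sudo adduser master admin    # optional: grant sudo rights via the admin group (Ubuntu 10.04)
$ su - master                  # switch to the new user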
Fig 2: Ubuntu screen for user 'Master'. Similarly, create the Slave VM.

2) Install Java
a> Download jdk-6u30-linux-x64.bin and save it on your Ubuntu Desktop.
b> Open a terminal (Ctrl+Alt+T).
c> Go to the Desktop and copy the file to /usr/local.
d> Extract the Java file (go to /usr/local; you can see the .bin file there): ./jdk-6u30-linux-x64.bin
A new directory "jdk1.6.0_30/" will be generated (see the sketch below).
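Putting step 2 together as terminal commands, a minimal sketch (it assumes the installer was saved on the Desktop and unpacks to jdk1.6.0_30):
$ cd ~/Desktop
$ sudo cp jdk-6u30-linux-x64.bin /usr/local/   # copy the installer to /usr/local
$ cd /usr/local
$ sudo chmod +x jdk-6u30-linux-x64.bin         # make the self-extracting installer executable
$ sudo ./jdk-6u30-linux-x64.bin                # unpacks into /usr/local/jdk1.6.0_30/
$ ls -d /usr/local/jdk1.6.0_30                 # confirm the new directory exists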
Fig 3: Java Installed

3) Install the CDH3 package
Go to: https://ccp.cloudera.com/display/CDHDOC/CDH3+Installation
Click on "Installing CDH3 on Ubuntu and Debian Systems", then click on "this link for a Maverick system" on the CDH3 installation page.
Install using the GDebi package installer, or issue the commands below. You will see "cdh3-repository_1.0_all.deb" get downloaded (keep it in the Downloads folder).
Execute the commands below (these are mentioned on the Cloudera site):
$ sudo dpkg -i Downloads/cdh3-repository_1.0_all.deb
$ sudo apt-get update

4) Install Hadoop
$ apt-cache search hadoop
$ sudo apt-get install hadoop-0.20 hadoop-0.20-native
Install all the daemons with sudo apt-get install hadoop-0.20-<daemon type>:
$ sudo apt-get install hadoop-0.20-namenode
$ sudo apt-get install hadoop-0.20-datanode
$ sudo apt-get install hadoop-0.20-secondarynamenode
$ sudo apt-get install hadoop-0.20-jobtracker
$ sudo apt-get install hadoop-0.20-tasktracker

5) Set the Java and Hadoop home variables
Open ~/.bashrc (command: gedit ~/.bashrc) and add:
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/lib/hadoop
export PATH=$PATH:/usr/lib/hadoop/bin
# Set JAVA_HOME
export JAVA_HOME=/usr/local/jdk1.6.0_30
export PATH=$PATH:/usr/local/jdk1.6.0_30/bin
Close all terminals, open a new one, and test JAVA_HOME and HADOOP_HOME (a quick check is sketched after step 7).

6) Configuration
Set JAVA_HOME in ./conf/hadoop-env.sh:
$ sudo gedit hadoop-env.sh
export JAVA_HOME=/usr/local/jdk1.6.0_30

7) Test the Hadoop and Java versions
$ hadoop version
$ java -version
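As mentioned in step 5, open a new terminal and confirm that both home variables and the PATH were picked up; the expected values below assume the paths used in this blog:
$ echo $JAVA_HOME    # expected: /usr/local/jdk1.6.0_30
$ echo $HADOOP_HOME  # expected: /usr/lib/hadoop
$ which hadoop       # expected: /usr/lib/hadoop/bin/hadoop
$ hadoop version
$ java -version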
Fig 4: Verify Java and Hadoop versions.

8) Add the dedicated users to the hadoop group
$ sudo gpasswd -a hdfs hadoop
$ sudo gpasswd -a mapred hadoop
In steps 9, 10 and 11 we will configure Hadoop using three files under ./conf: core-site.xml, hdfs-site.xml and mapred-site.xml.

9) core-site.xml
Add the properties below to core-site.xml. core-site.xml contains configuration information that overrides the default values for core Hadoop properties. (A complete example file is sketched after step 15.)
<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/lib/hadoop/tmp</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:8020</value>
</property>
Create the temporary directory and set its permissions and ownership:
$ sudo mkdir /usr/lib/hadoop/tmp
$ sudo chmod 750 /usr/lib/hadoop/tmp
$ sudo chown hdfs:hadoop /usr/lib/hadoop/tmp

10) hdfs-site.xml
Add the properties below to hdfs-site.xml. Here we specify permission checking, the storage directories and the replication factor.
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/storage/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/storage/data</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
$ sudo mkdir /storage
$ sudo chmod 775 /storage/
$ sudo chown hdfs:hadoop /storage/

11) mapred-site.xml
Add the properties below to mapred-site.xml. It specifies the JobTracker address and the MapReduce working directories.
<property>
  <name>mapred.job.tracker</name>
  <value>hdfs://localhost:8021</value>
</property>
<property>
  <name>mapred.system.dir</name>
  <value>/home/your user name here/mapred/system</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/home/your user name here/mapred/local</value>
</property>
<property>
  <name>mapred.temp.dir</name>
  <value>/home/your user name here/mapred/temp</value>
</property>
$ sudo mkdir /home/your user name here/mapred
$ sudo chmod 775 /home/your user name here/mapred
$ sudo chown mapred:hadoop /home/your user name here/mapred

12) User assignment
export HADOOP_NAMENODE_USER=hdfs
export HADOOP_SECONDARYNAMENODE_USER=hdfs
export HADOOP_DATANODE_USER=hdfs
export HADOOP_JOBTRACKER_USER=mapred
export HADOOP_TASKTRACKER_USER=mapred

13) Format the namenode
Go to the directory below and format:
$ cd /usr/lib/hadoop/bin/
$ sudo -u hdfs hadoop namenode -format

14) Start the daemons
$ sudo /etc/init.d/hadoop-0.20-namenode start
$ sudo /etc/init.d/hadoop-0.20-secondarynamenode start
$ sudo /etc/init.d/hadoop-0.20-jobtracker start
$ sudo /etc/init.d/hadoop-0.20-datanode start
$ sudo /etc/init.d/hadoop-0.20-tasktracker start
Check /var/log/hadoop-0.20 for errors from each daemon, and check that all ports are open using:
$ netstat -ptlen

15) Check the web UIs
localhost:50070 -> Hadoop Admin (NameNode)
localhost:50030 -> MapReduce (JobTracker)
(A command-line smoke test is sketched below.)
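For reference (as noted in step 9), every Hadoop configuration file wraps its properties in a single <configuration> root element. Here is a minimal sketch of the complete /usr/lib/hadoop/conf/core-site.xml using the same values as step 9; hdfs-site.xml and mapred-site.xml follow the same pattern:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- core-site.xml: core settings for the pseudo-distributed single node -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/lib/hadoop/tmp</value>   <!-- base for Hadoop's temporary files -->
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value> <!-- HDFS NameNode address -->
  </property>
</configuration>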
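Once all five daemons are up, a quick smoke test from the terminal. This is only a sketch: the exact name and location of the examples jar may differ in your CDH3 build (look under /usr/lib/hadoop or /usr/lib/hadoop-0.20), and the directory name /smoketest is just an example.
$ hadoop fs -ls /                            # HDFS should answer without connection errors
$ sudo -u hdfs hadoop fs -mkdir /smoketest   # create a test directory as the hdfs user
$ sudo -u hdfs hadoop fs -ls /               # the new directory should appear
$ hadoop jar /usr/lib/hadoop/hadoop-examples.jar pi 2 100   # jar path/name is an assumption; adjust to your install
If the example job complains about permissions, try running it as the mapred or hdfs user with sudo -u.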
Fig 5: Hadoop Admin


Fig 6: MapReduce

WELCOME TO THE WORLD OF BIG DATA..............

Note: The contents of this blog are simply for learning purposes. This blog was created keeping beginners in mind. For more information, please visit the official Cloudera site: http://www.cloudera.com