
HADOOP and MapReduce CONCEPTS
 
HDFS works well with:

1.       Very large files

2.       Streaming Data access

3.       Commodity hardware

 

HDFS is not good for:

1.       Very fast random access

2.       Lots of small files

3.       Multiple writers

4.       Arbitrary writes (HDFS files are write-once / append-only)

 

NAME NODE

Machine List 1      FILE-1 -> split1, split2, split3, ..., splitn
   .                              .
   .                              .
Machine List n      FILE-n -> split1, split2, split3, ..., splitn

(For every file, the NameNode keeps the list of splits and, for each split, the list of machines that hold it.)

 

 


DATA NODE

Data Node 1, Data Node 2, Data Node 3, Data Node 4, Data Node 5 (the machines that actually store the splits/blocks)
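
Putting the NameNode and DataNode pictures above together: the metadata on the NameNode is essentially a mapping from each file to its splits (blocks), and from each split to the DataNodes that hold a replica of it. Here is a small sketch of that idea in Python; the file names, split IDs and node names are made-up example values, not real cluster data:

# Illustrative only: the NameNode's metadata modeled as a plain dict.
# Each file maps to its splits, and each split maps to the DataNodes
# that hold a replica of it (replication factor 3 in this example).
namenode_metadata = {
    "FILE-1": {
        "split1": ["Data Node 1", "Data Node 2", "Data Node 3"],
        "split2": ["Data Node 2", "Data Node 4", "Data Node 5"],
    },
    "FILE-n": {
        "split1": ["Data Node 1", "Data Node 3", "Data Node 5"],
    },
}

def datanodes_for(file_name, split_name):
    # The question a client asks the NameNode: where does this split live?
    return namenode_metadata[file_name][split_name]

print(datanodes_for("FILE-1", "split2"))   # ['Data Node 2', 'Data Node 4', 'Data Node 5']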

 

 

 

·         Redundancy and load balancing in NameNodes

·         The NameNode holds the metadata of the DataNodes, i.e. which blocks of which file live where (metadata is data about data)

·         DataNodes constantly send heartbeats to the NameNode

·         1 block is 64 MB by default (a worked example follows after this list)

·         The NameNode knows how many replicas each data block has

·         Split-brain situation in distributed systems: a split-brain condition is the result of a cluster partition, where each side believes the other is dead and then proceeds to take over resources

·         The minimum replication factor has to be checked

·         If a failure occurs, the block ID is renamed

·         You can set the network topology in Hadoop, so that machines are picked according to locality

·         Hadoop is used for:

o   Log analysis

o   Advertisement targeting that depends on logs

·         Structured data and unstructured data analysis
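
To make the 64 MB block size and the replication bullets above concrete, here is a small back-of-the-envelope calculation in Python. The 1 GB file size and the replication factor of 3 are assumed example values; only the 64 MB default block size comes from the notes above.

import math

block_size_mb = 64         # default block size mentioned above
replication_factor = 3     # assumed example value
file_size_mb = 1024        # assumed example: a 1 GB file

num_blocks = math.ceil(file_size_mb / block_size_mb)      # 16 blocks
total_block_copies = num_blocks * replication_factor      # 48 block replicas spread over the DataNodes
raw_storage_mb = file_size_mb * replication_factor        # 3072 MB of raw disk actually used

print(num_blocks, total_block_copies, raw_storage_mb)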

 

 

MAPREDUCE JOB

1.       Job -> JobTracker (port 9001) -> get an ID from the NameNode

2.       Transfer the jar files to all nodes

3.       Define the input split logic, and tell the job how to use the split (the InputSplit class)

4.       Issue a command to HDFS to split

5.       Submit the job to the JobTracker (it has a scheduler)

6.       The job gets initialized

7.       The JobTracker finds how many splits there are, to define how many map tasks it should start

8.       TaskTrackers (map tasks and reduce tasks)

9.       If a free slot is available, the task is started

10.   The JobTracker assigns one split to each task, and defines the map tasks and reduce tasks

11.   The TaskTracker fetches the jar files (it contacts the NameNode to find out where the jar file is)

12.   The TaskTracker starts another process -> a JVM running the map task or reduce task

13.   The map task runs on the closest DataNode that has split 1 (data locality)
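
The flow above is easier to see with a tiny local simulation of the map -> shuffle/sort -> reduce phases. This is a plain-Python sketch of the MapReduce model itself (a word count), not Hadoop API code, and the input splits are made-up example data:

from collections import defaultdict

# Pretend these are the input splits handed to separate map tasks.
splits = [
    ["the quick brown fox", "jumps over the lazy dog"],   # split 1
    ["the dog barks"],                                    # split 2
]

def map_task(lines):
    # Map phase: one map task per split, emitting (word, 1) pairs.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle/sort phase: group all values by key across all map outputs.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_task(key, values):
    # Reduce phase: sum the counts for each word.
    return key, sum(values)

map_output = [pair for split in splits for pair in map_task(split)]
result = dict(reduce_task(k, v) for k, v in shuffle(map_output).items())
print(result)   # {'the': 3, 'dog': 2, 'quick': 1, ...}

On a real cluster the JobTracker and TaskTrackers do the scheduling and the data moves between machines, but the three phases are the same.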

 

 

=======================================================================================

 

The following steps are needed to start Hadoop:

 

Set HADOOP PATH:

echo  $HADOOP_PREFIX

export HADOOP_PREFIX=/usr/local/lib/hadoop/

 

Set JAVA PATH:

echo $JAVA_HOME

export JAVA_HOME=/usr/lib/jvm/default-java

 

set PATH:

echo $PATH

export PATH=$PATH:$HADOOP_PREFIX/bin

 

go to your working directory:

cd /home/vagrant/hadoop-training/

Check Path again:

echo  $HADOOP_PREFIX

echo $JAVA_HOME

echo $PATH

 

 

Check the 3 files in the conf directory:

1.       core-site.xml

2.       hdfs-site.xml

3.       mapred-site.xml

 

 

Now there are two modes of Hadoop execution:

1.       Pseudo-distributed mode: populate all 3 files properly (see the example configuration after this list)

a.       Apache Hadoop pseudo-distributed mode helps you simulate a multi-node installation on a single node. Instead of installing Hadoop on different servers, you can simulate it on a single server.

2.       Standalone mode: all 3 files should be empty

a.       In single-node standalone mode, you don’t need to start any Hadoop background processes. Instead, just call ~/../bin/hadoop, which will execute Hadoop as a single Java process for testing purposes.
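
For pseudo-distributed mode, a minimal set of values for the 3 files could look like the sketch below. This is only an illustration for Hadoop 1.x: the hdfs://localhost:9000 NameNode address and the 9001 JobTracker port are the ones used elsewhere in these notes, and dfs.replication is set to 1 because a single node has only one DataNode.

core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>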

 

 

To start Hadoop, you first have to format the file system, because Hadoop creates its own data structures on disk.

 

bash$> hadoop namenode -format

 

Now the HDFS file system is created. Start Hadoop by using the following commands:

 

bash$> start-dfs.sh    or

bash$> start-all.sh

bash$> stop-all.sh     (to stop all the daemons later)

 

It will start the following daemons:

1.    NameNode

2.    DataNode

3.    SecondaryNameNode

4.    JobTracker

5.    TaskTracker

(Map tasks and reduce tasks are not daemons; the TaskTracker launches them per job, as described in the MapReduce job section above.)

 

Check which processes are running on Hadoop:

bash$>jps

 

Following are the commands which you can run on Hadoop:

                NOTE: All HDFS file system commands start with “hadoop fs”.

 

bash$> hadoop

bash$> hadoop fs

bash$> hadoop help

bash$> hadoop fs -help

 

 

bash$> hadoop fs -ls

bash$> hadoop fs -cat

bash$> hadoop fs -copyFromLocal hello.txt hello.txt      (this will create a /user/<your username> directory in HDFS, e.g. /user/vagrant, inside which you can find this file now)

 

bash$> hadoop fs -cat hello.txt

bash$> hadoop fs -cat hdfs://localhost:9000/user/vagrant/hello.txt

 

To run a MapReduce job:

 

bash$>hadoop jar workspace/WordCount/wordcount.jar WordCount Gatsby.txt out/

 

The job will run for a while and then finish.

 

hadoop fs -ls

hadoop fs -ls out

hadoop fs -cat out/part-r-00000

 

Streaming command:

 

bash$> hadoop jar $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-1.0.4.jar \

> -mapper workspace/streaming/mapper.py -file  workspace/streaming/mapper.py \

> -reducer workspace/streaming/reducer.py -file  workspace/streaming/reducer.py \

> -input testing \

> -output pythonstreamingoutput

 

            bash$> hadoop jar $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-1.0.4.jar \

             > -mapper workspace/streaming/mapper.pl -file  workspace/streaming/mapper.pl \

             > -reducer workspace/streaming/reducer.pl -file  workspace/streaming/reducer.pl \

             > -input testing \

             > -output perlstreamingoutput
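
The mapper.py and reducer.py scripts referenced above are not shown in these notes, so the following is only a guess at what a minimal word-count pair for Hadoop Streaming could look like (it is not the actual workspace/streaming code). Streaming feeds the scripts through stdin/stdout, and the reducer receives its input sorted by key, so all lines for one word arrive together.

mapper.py:

#!/usr/bin/env python
# Emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%s" % (word, 1))

reducer.py:

#!/usr/bin/env python
# Sum the counts per word; input arrives sorted by word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%s" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%s" % (current_word, current_count))

Both scripts have to be executable (chmod +x); the -file options in the commands above ship them to the nodes along with the job.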

 

 
