HDFS works well with:

1.       Very large files

2.       Streaming data access

3.       Commodity hardware


HDFS is not good for:

1.       Very fast random access

2.       Lots of small files

3.       Multiple writers

4.       Arbitrary writes (modifying a file at arbitrary offsets)



Name Node
Machine List 1      FILE-1 - split1, split2, split3 ... splitn
  .                   .
  .                   .
Machine List n      FILE-n - split1, split2, split3 ... splitn




Data Node 1 

Data Node  2

Data Node  3

Data Node  4

Data Node  5




·         Redundancy and Load Balancing in Name nodes

·         The name node holds the metadata (data about data) for the files stored on the data nodes.

·         Each data node constantly sends heartbeats to the name node.

·         One block is 64 MB (the default block size in Hadoop 1.x).

·         The name node knows how many replicas each data block has and where they are stored.

·         Split-brain situation in a distributed system: a split-brain condition is the result of a cluster partition, where each side believes the other is dead and then proceeds to take over resources.

·         The name node checks that each block meets the minimum replication factor.

·         If there is a failure, then the block ID is renamed.

·         You can configure the network topology in Hadoop, so that machines are chosen according to locality.

·         Hadoop is used for:

o   Log analysis

o   Advertising that depends on logs

·         Structured data and unstructured data analysis
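The block and replica bookkeeping described in the bullets above can be illustrated with a minimal sketch. The 64 MB block size comes from the notes; the replication factor of 3, the node names, and the round-robin placement are assumptions made only for illustration (real HDFS placement is rack-aware and more involved).

```python
import math

BLOCK_SIZE_MB = 64   # block size stated in the notes
REPLICATION = 3      # assumed replication factor, not given in the notes


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to store a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Toy round-robin placement: each block goes to `replication` distinct nodes.

    This only mimics the kind of table a name node keeps; it is not
    the placement policy HDFS actually uses.
    """
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the replication factor")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            data_nodes[(block + r) % len(data_nodes)] for r in range(replication)
        ]
    return placement


# Hypothetical cluster of five data nodes, as in the diagram above.
nodes = ["datanode1", "datanode2", "datanode3", "datanode4", "datanode5"]
blocks = block_count(200)            # a 200 MB file needs 4 blocks of 64 MB
table = place_replicas(blocks, nodes)
for block_id, holders in table.items():
    print(block_id, holders)
```

If a data node stops sending heartbeats, a table like this is what lets the name node see which blocks have fallen below the required number of replicas.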





1.       The job client contacts the JobTracker (port 9001) and gets a job ID.

2.       The job JAR files are copied to HDFS so that all nodes can fetch them.

3.       Define the input split logic, and define how the job uses the splits (InputSplit class).

4.       The input splits are computed for the input files in HDFS.

5.       Submit the job to the JobTracker (it has a scheduler).

6.       The job gets initialized.

7.       The JobTracker finds how many splits there are, to define how many map tasks it should start.

8.       TaskTrackers run the map tasks and reduce tasks.

9.       If 51 free I will start a job

10.   The JobTracker assigns one split to each map task, and defines the map tasks and reduce tasks.

11.   The TaskTracker fetches the JAR files (it contacts the name node to find where the JAR file is).

12.   The TaskTracker starts another process -> a JVM that runs the map task or reduce task.

13.     The task reads from the closest data node that holds split 1.
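The assignment of splits to task trackers in the steps above can be sketched as follows. This is a simplified illustration, not the JobTracker's real scheduler: the `Split` type, the node names, and the "one free slot per tracker" rule are all assumptions. It shows only the idea of one map task per split, with a preference for a tracker on a node that already holds the data.

```python
from collections import namedtuple

# Hypothetical record: a split and the data nodes that hold a copy of it.
Split = namedtuple("Split", ["split_id", "hosts"])


def assign_map_tasks(splits, free_trackers):
    """Assign one map task per split, preferring a data-local task tracker."""
    assignments = {}
    available = list(free_trackers)
    for split in splits:
        if not available:
            break  # no free tracker left; remaining splits have to wait
        local = [t for t in available if t in split.hosts]
        chosen = local[0] if local else available[0]
        available.remove(chosen)
        assignments[split.split_id] = chosen
    return assignments


splits = [
    Split("split1", ["node1", "node2"]),
    Split("split2", ["node3"]),
    Split("split3", ["node5"]),
]
result = assign_map_tasks(splits, ["node3", "node2", "node4"])
print(result)
```

Here split1 and split2 land on nodes that hold their data, while split3 falls back to a free tracker that would have to read the data over the network.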





The following steps are needed to start Hadoop:




export HADOOP_PREFIX=/usr/local/lib/hadoop/



echo $JAVA_HOME

export JAVA_HOME=/usr/lib/jvm/default-java


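Set PATH (for example, so that it includes the Hadoop bin directory) and check it:

export PATH=$PATH:$HADOOP_PREFIX/bin

echo $PATH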



Go to your working directory:

cd /home/vagrant/hadoop-training/

Check the paths again:


echo $JAVA_HOME

echo $PATH



Check the three files in the conf directory:

1.       core-site.xml

2.       hdfs-site.xml

3.       mapred-site.xml



Hadoop can now be run in one of two modes:

1.       Pseudo-distributed mode: populate all three files properly.

a.       A pseudo-distributed installation lets you simulate a multi-node installation on a single node. Instead of installing Hadoop on different servers, you simulate the cluster on a single server.

2.       Standalone mode: all three files should be left empty.

a.       In a single-node standalone setup, you don’t need to start any Hadoop background processes. Instead, just call ~/../bin/hadoop, which will execute Hadoop as a single Java process for testing purposes.



To start Hadoop, you first have to format the HDFS file system, because the name node creates its own data structures.


bash$> hadoop namenode -format


Now the HDFS file system is created. Start Hadoop using the following command:


bash$>start-dfs.sh or 




It will start the following daemons:

1.    NameNode

2.    DataNode

3.    JobTracker

4.    TaskTracker

The TaskTrackers in turn launch the map tasks and reduce tasks when a job runs.


Check which processes are running on Hadoop (for example, with the JDK's jps command).



The following are commands which you can run on Hadoop:

                NOTE: All Hadoop file system commands start with “hadoop fs”.


bash$> hadoop

bash$> hadoop fs

bash$> hadoop help

bash$> hadoop fs -help



bash$> hadoop fs -ls

bash$> hadoop fs -cat

bash$> hadoop fs -copyFromLocal hello.txt hello.txt      (copies the file into your HDFS home directory, /user/vagrant, where you can now find it)


bash$> hadoop fs -cat hello.txt

bash$> hadoop fs -cat hdfs://localhost:9000/user/vagrant/hello.txt
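The last two commands read the same file, because a relative HDFS path is resolved against the user's home directory, /user/<username>. The sketch below mimics that resolution for illustration only. The function name is hypothetical, and the default file system URI and user name are taken from the example command above.

```python
import posixpath


def resolve_hdfs_path(path, user, default_fs="hdfs://localhost:9000"):
    """Illustrative only: expand a path the way the commands above imply."""
    if path.startswith("hdfs://"):
        return path                      # already a full URI
    if not path.startswith("/"):
        # Relative paths are taken to live under the user's home directory.
        path = posixpath.join("/user", user, path)
    return default_fs + path


print(resolve_hdfs_path("hello.txt", "vagrant"))
```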


To run a MapReduce job:


bash$>hadoop jar workspace/WordCount/wordcount.jar WordCount Gatsby.txt out/


The job will run ... and finish.


hadoop fs -ls

hadoop fs -ls out

hadoop fs -cat out/part-r-00000


Streaming command:


bash$> hadoop jar $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-1.0.4.jar \

> -mapper workspace/streaming/mapper.py -file workspace/streaming/mapper.py \

> -reducer workspace/streaming/reducer.py -file workspace/streaming/reducer.py \

> -input testing \

> -output pythonstreamingoutput


bash$> hadoop jar $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-1.0.4.jar \

> -mapper workspace/streaming/mapper.pl -file workspace/streaming/mapper.pl \

> -reducer workspace/streaming/reducer.pl -file workspace/streaming/reducer.pl \

> -input testing \

> -output perlstreamingoutput
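The notes do not show the contents of mapper.py and reducer.py, so the following is only a hypothetical word-count pair, written as plain functions so that the map, sort, and reduce flow can be tried locally without a cluster. In a real streaming job, the mapper and reducer would be separate scripts that read lines from standard input and write tab-separated key/value lines to standard output; Hadoop sorts the mapper output by key before it reaches the reducer.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same word.

    Relies on the input being sorted by key, as the streaming framework
    guarantees between the map and reduce phases.
    """
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Simulate the job locally: map, then sort, then reduce."""
    return list(reducer(sorted(mapper(lines))))


output = run_local(["the cat sat", "the dog sat down"])
for line in output:
    print(line)
```

The same chain can be checked from a shell before submitting a job by piping a sample file through the mapper script, `sort`, and the reducer script.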



