


HDFS works well with:

1.       Very large files

2.       Streaming data access

3.       Commodity hardware


HDFS is not good for:

1.       Very fast random access

2.       Lots of small files

3.       Multiple writers

4.       Arbitrary writes



Name Node (keeps, for each file, the list of splits and the machines holding them):

FILE-1 -> split1, split2, split3 .... splitn
   .
   .
FILE-n -> split1, split2, split3 .... splitn


Data Node 1

Data Node 2

Data Node 3

Data Node 4

Data Node 5
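The name-node layout above can be sketched as a plain mapping: each file maps to an ordered list of splits, and each split maps to the data nodes holding a replica. A toy illustration only — the names below are made up, not Hadoop's actual data structures:

```python
# Toy sketch of the name node's metadata (illustrative only --
# real HDFS metadata is far more involved).
namenode_metadata = {
    "FILE-1": {
        "split1": ["DataNode1", "DataNode2", "DataNode3"],
        "split2": ["DataNode2", "DataNode4", "DataNode5"],
    },
    "FILE-n": {
        "split1": ["DataNode1", "DataNode3", "DataNode5"],
    },
}

def locate(filename):
    """Return the data nodes holding each split of a file."""
    return namenode_metadata[filename]

print(locate("FILE-1")["split1"])  # ['DataNode1', 'DataNode2', 'DataNode3']
```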




·         Redundancy and load balancing in name nodes

·         The name node holds the metadata of the data nodes. (Metadata is data about data.)

·         Data nodes constantly send heartbeats to the name node.

·         One block is 64 MB (the default block size).

·         The name node knows to how many blocks each data block is replicated.

·         Split-brain situation in distributed systems: a split-brain condition is the result of a cluster partition, where each side believes the other is dead and then proceeds to take over resources.

·         The minimum replication factor must be checked.

·         On failure, the block ID is renamed.

·         You can set the network topology in Hadoop, so that machines are chosen according to locality.

·         Hadoop is used for:

o   Log analysis

o   Advertisement (which depends on logs)

·         Structured and unstructured data analysis
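The 64 MB block size and the replication-factor check above are simple arithmetic; a minimal sketch (the function names are my own, not Hadoop's API):

```python
import math

BLOCK_SIZE_MB = 64  # default HDFS block size in Hadoop 1.x

def num_blocks(file_size_mb):
    """How many 64 MB blocks a file occupies (the last block may be partial)."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

def meets_min_replication(replica_counts, min_factor=3):
    """Check that every block has at least the minimum replication factor."""
    return all(count >= min_factor for count in replica_counts)

print(num_blocks(200))                   # 4 -- three full blocks plus one partial
print(meets_min_replication([3, 3, 2]))  # False -- one block is under-replicated
```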





1.       Job -> JobTracker (port 9001) -> gets an ID from the name node

2.       Transfer the jar files to all nodes.

3.       Define the input-split logic, and tell the job how to use the splits (the InputSplit class).

4.       Issue a command to HDFS to split.

5.       Submit the job to the JobTracker. (It has a scheduler.)

6.       The job gets initialized.

7.       The JobTracker finds how many splits there are, to define how many map tasks it should start.

8.       Task trackers run the tasks (map tasks and reduce tasks).

9.       If a task slot is free, a task is started.

10.   The JobTracker assigns one split to each task, and defines the map tasks and reduce tasks.

11.   The task tracker fetches the jar files. (It contacts the name node to find out where the jar file is.)

12.   The task tracker starts another process, a JVM, as a map task or reduce task.

13.   The closest data node having split 1 is preferred (data locality).
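Steps 7–10 above boil down to: one map task per input split, limited by the free task-tracker slots. A toy sketch (the names are mine, not Hadoop's API):

```python
def plan_map_tasks(splits, free_slots):
    """One map task per split, limited by the free task-tracker slots.
    Returns (assigned, pending) lists of split names."""
    assigned = splits[:free_slots]
    pending = splits[free_slots:]
    return assigned, pending

splits = ["split1", "split2", "split3", "split4"]
assigned, pending = plan_map_tasks(splits, free_slots=3)
print(assigned)  # ['split1', 'split2', 'split3']
print(pending)   # ['split4'] -- waits for a slot to free up
```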





The following steps are needed to start Hadoop:




export HADOOP_PREFIX=/usr/local/lib/hadoop/



echo $JAVA_HOME

export JAVA_HOME=/usr/lib/jvm/default-java


Check PATH:

echo $PATH



go to your working directory:

cd /home/vagrant/hadoop-training/

Check the paths again:


echo $JAVA_HOME

echo $PATH



Check 3 files in conf directory:

1.       core-site.xml

2.       hdfs-site.xml

3.       mapred-site.xml



Now there are two modes of Hadoop execution:

1.       Pseudo-distributed mode:         Populate all 3 files properly.

a.       Apache Hadoop pseudo-distributed mode helps you simulate a multi-node installation on a single node. Instead of installing Hadoop on different servers, you can simulate it on a single server.

2.       Standalone mode:  All files should be empty.

a.       In single-node standalone mode, you don't need to start any Hadoop background process. Instead, just call ~/../bin/hadoop, which will execute Hadoop as a single Java process for your testing purposes.
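For pseudo-distributed mode, the conf files typically point the default filesystem and the JobTracker at localhost, which matches the hdfs://localhost:9000 URI and JobTracker port 9001 used elsewhere on this page. A minimal Hadoop 1.x sketch (values are the conventional ones, adjust for your setup):

```xml
<!-- conf/core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```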



To start Hadoop, you first have to format the namenode, because HDFS will create its own on-disk data structures:


bash$> hadoop namenode -format


Now the HDFS file system is created. Start Hadoop using the following command:


bash$> start-dfs.sh   (or start-all.sh, which also starts the MapReduce daemons)




It will start the following daemons:

1.    NameNode

2.    DataNode

3.    JobTracker

4.    TaskTracker

The map tasks and reduce tasks are then started per job by the task trackers.


Check which processes are running on Hadoop (for example with jps).



Following are the commands which you can run on Hadoop:

                NOTE: All Hadoop filesystem commands start with "hadoop fs".


bash$> hadoop

bash$> hadoop fs

bash$> hadoop help

bash$> hadoop fs -help



bash$> hadoop fs -ls

bash$> hadoop fs -cat

bash$> hadoop fs -copyFromLocal hello.txt hello.txt      (this creates your HDFS home directory, e.g. /user/vagrant, inside which you can find this file now)


bash$> hadoop fs -cat hello.txt

bash$> hadoop fs -cat hdfs://localhost:9000/user/vagrant/hello.txt


To run a MapReduce job:


bash$>hadoop jar workspace/WordCount/wordcount.jar WordCount Gatsby.txt out/


The job will run ... and finish.


hadoop fs -ls

hadoop fs -ls out

hadoop fs -cat out/part-r-00000


Streaming command:


bash$> hadoop jar $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-1.0.4.jar \

> -mapper workspace/streaming/mapper.py -file workspace/streaming/mapper.py \

> -reducer workspace/streaming/reducer.py -file workspace/streaming/reducer.py \

> -input testing \

> -output pythonstreamingoutput


bash$> hadoop jar $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-1.0.4.jar \

> -mapper workspace/streaming/mapper.pl -file workspace/streaming/mapper.pl \

> -reducer workspace/streaming/reducer.pl -file workspace/streaming/reducer.pl \

> -input testing \

> -output perlstreamingoutput
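The streaming jar simply pipes each input split through your mapper's stdin/stdout, sorts the mapper output by key, and pipes it through your reducer. A minimal word-count mapper.py/reducer.py pair might look like the sketch below (both halves are shown in one file for illustration; in a real job each lives in its own script reading sys.stdin):

```python
#!/usr/bin/env python
# Word-count logic for Hadoop Streaming (illustrative sketch).

def map_words(lines):
    """Mapper: emit 'word<TAB>1' for every word on stdin."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reduce_counts(lines):
    """Reducer: sum counts per word. Streaming delivers the mapper
    output sorted by key, so equal words arrive adjacent."""
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)

if __name__ == "__main__":
    # Simulate the shuffle's sort between map and reduce.
    mapped = sorted(map_words(["the cat sat", "the mat"]))
    print(list(reduce_counts(mapped)))  # ['cat\t1', 'mat\t1', 'sat\t1', 'the\t2']
```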



Copyright 2009 Kunal Saxena Inc. All rights reserved