Saturday, January 05, 2008

First trials with Hadoop


I followed the Hadoop Quickstart guide and the whole process is described below.

This post can be used as a reference for other people installing Hadoop.

My system is openSUSE 10.3 and the Java version is 1.6.0_03.

After downloading and installing the package, I ran the standalone operation test:

$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
$ cat output/*

Here is the output:

gtz2:/home/telzur/downloads/hadoop-0.14.4 # bin/hadoop jar hadoop-0.14.4-examples.jar grep input output 'dfs[a-z.]+'
08/01/05 15:47:13 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
08/01/05 15:47:13 INFO mapred.FileInputFormat: Total input paths to process : 3
08/01/05 15:47:13 INFO mapred.JobClient: Running job: job_local_1
08/01/05 15:47:13 INFO mapred.MapTask: numReduceTasks: 1
08/01/05 15:47:13 INFO mapred.LocalJobRunner: file:/home/telzur/downloads/hadoop-0.14.4/input/mapred-default.xml:0+180
08/01/05 15:47:13 INFO mapred.MapTask: numReduceTasks: 1
08/01/05 15:47:13 INFO mapred.LocalJobRunner: file:/home/telzur/downloads/hadoop-0.14.4/input/hadoop-default.xml:0+27489
08/01/05 15:47:13 INFO mapred.LocalJobRunner: file:/home/telzur/downloads/hadoop-0.14.4/input/hadoop-default.xml:0+27489
08/01/05 15:47:13 INFO mapred.MapTask: numReduceTasks: 1
08/01/05 15:47:13 INFO mapred.LocalJobRunner: file:/home/telzur/downloads/hadoop-0.14.4/input/hadoop-site.xml:0+178
08/01/05 15:47:13 INFO mapred.LocalJobRunner: file:/home/telzur/downloads/hadoop-0.14.4/input/hadoop-site.xml:0+178
08/01/05 15:47:14 INFO mapred.LocalJobRunner: reduce > reduce
08/01/05 15:47:14 INFO mapred.TaskRunner: Saved output of task 'reduce_3r6jh8' to file:/home/telzur/downloads/hadoop-0.14.4/grep-temp-346467784
08/01/05 15:47:14 INFO mapred.JobClient: Job complete: job_local_1
08/01/05 15:47:14 INFO mapred.JobClient: Counters: 9
08/01/05 15:47:14 INFO mapred.JobClient: Map-Reduce Framework
08/01/05 15:47:14 INFO mapred.JobClient: Map input records=940
08/01/05 15:47:14 INFO mapred.JobClient: Map output records=34
08/01/05 15:47:14 INFO mapred.JobClient: Map input bytes=27847
08/01/05 15:47:14 INFO mapred.JobClient: Map output bytes=942
08/01/05 15:47:14 INFO mapred.JobClient: Combine input records=34
08/01/05 15:47:14 INFO mapred.JobClient: Combine output records=33
08/01/05 15:47:14 INFO mapred.JobClient: Reduce input groups=33
08/01/05 15:47:14 INFO mapred.JobClient: Reduce input records=33
08/01/05 15:47:14 INFO mapred.JobClient: Reduce output records=33
08/01/05 15:47:14 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
08/01/05 15:47:14 INFO mapred.FileInputFormat: Total input paths to process : 1
08/01/05 15:47:14 INFO mapred.LocalJobRunner: file:/home/telzur/downloads/hadoop-0.14.4/input/mapred-default.xml:0+180
08/01/05 15:47:14 INFO mapred.JobClient: Running job: job_local_1
08/01/05 15:47:14 INFO mapred.MapTask: numReduceTasks: 1
08/01/05 15:47:14 INFO mapred.LocalJobRunner: file:/home/telzur/downloads/hadoop-0.14.4/grep-temp-346467784/part-00000:0+1279
08/01/05 15:47:14 INFO mapred.LocalJobRunner: file:/home/telzur/downloads/hadoop-0.14.4/grep-temp-346467784/part-00000:0+1279
08/01/05 15:47:14 INFO mapred.LocalJobRunner: reduce > reduce
08/01/05 15:47:14 INFO mapred.TaskRunner: Saved output of task 'reduce_h016y4' to file:/home/telzur/downloads/hadoop-0.14.4/output
08/01/05 15:47:14 INFO mapred.LocalJobRunner: reduce > reduce
08/01/05 15:47:14 INFO mapred.LocalJobRunner: file:/home/telzur/downloads/hadoop-0.14.4/input/hadoop-default.xml:0+27489
08/01/05 15:47:14 INFO mapred.LocalJobRunner: file:/home/telzur/downloads/hadoop-0.14.4/input/hadoop-site.xml:0+178
08/01/05 15:47:15 INFO mapred.LocalJobRunner: reduce > reduce
08/01/05 15:47:15 INFO mapred.JobClient: Job complete: job_local_1
08/01/05 15:47:15 INFO mapred.JobClient: Counters: 7
08/01/05 15:47:15 INFO mapred.JobClient: Map-Reduce Framework
08/01/05 15:47:15 INFO mapred.JobClient: Map input records=33
08/01/05 15:47:15 INFO mapred.JobClient: Map output records=33
08/01/05 15:47:15 INFO mapred.JobClient: Map input bytes=1193
08/01/05 15:47:15 INFO mapred.JobClient: Map output bytes=929
08/01/05 15:47:15 INFO mapred.JobClient: Reduce input groups=4
08/01/05 15:47:15 INFO mapred.JobClient: Reduce input records=66
08/01/05 15:47:15 INFO mapred.JobClient: Reduce output records=66
08/01/05 15:47:15 INFO mapred.LocalJobRunner: file:/home/telzur/downloads/hadoop-0.14.4/grep-temp-346467784/part-00000:0+1279
gtz2:/home/telzur/downloads/hadoop-0.14.4 # cat output/*
2 dfs.
1 dfs.block.size
1 dfs.blockreport.interval
1 dfs.client.block.write.retries
1 dfs.client.buffer.dir
1 dfs.data.dir
1 dfs.datanode.bind
1 dfs.datanode.dns.interface
1 dfs.datanode.dns.nameserver
1 dfs.datanode.du.pct
1 dfs.datanode.du.reserved
1 dfs.datanode.port
1 dfs.default.chunk.view.size
1 dfs.df.interval
1 dfs.heartbeat.interval
1 dfs.hosts
1 dfs.hosts.exclude
1 dfs.impl
1 dfs.info.bind
1 dfs.info.port
1 dfs.name.dir
1 dfs.namenode.handler.count
1 dfs.namenode.logging.level
1 dfs.network.script
1 dfs.replication
1 dfs.replication.consider
1 dfs.replication.max
1 dfs.replication.min
1 dfs.replication.min.
1 dfs.safemode.extension
1 dfs.safemode.threshold.pct
1 dfs.secondary.info.bind
1 dfs.secondary.info.port
gtz2:/home/telzur/downloads/hadoop-0.14.4 #
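
To make it clearer what the grep example computes, here is a minimal Python sketch of the same idea. It is not Hadoop's implementation, only a local approximation: it collects every match of the regular expression in the files of a directory, counts how often each distinct match occurs, and lists the matches by descending count. The function name count_matches and the alphabetical tie-breaking are my own choices for illustration.

```python
import re
from collections import Counter
from pathlib import Path


def count_matches(input_dir, pattern):
    """Count every regex match across the regular files in input_dir."""
    regex = re.compile(pattern)
    counts = Counter()
    for path in sorted(Path(input_dir).iterdir()):
        if path.is_file():
            # Roughly the "map" side: find each match in the file.
            # Roughly the "reduce" side: Counter sums occurrences per match.
            counts.update(regex.findall(path.read_text(errors="replace")))
    # Second stage: order by descending count (ties alphabetically,
    # which is an assumption made for this sketch).
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Same directory and pattern as in the commands above.
    for match, count in count_matches("input", r"dfs[a-z.]+"):
        print(count, match)
```

The two stages correspond loosely to the two consecutive jobs visible in the log: the first counts the matches, the second sorts them by count.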


Next step: Pseudo-Distributed Operation

Format a new distributed-filesystem:
gtz2:/home/telzur/downloads/hadoop-0.14.4 # bin/hadoop namenode -format
08/01/05 17:06:57 INFO dfs.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = gtz2/127.0.0.1
STARTUP_MSG: args = [-format]
************************************************************/
08/01/05 17:06:58 INFO dfs.Storage: Storage directory /tmp/hadoop-root/dfs/name has been successfully formatted.
08/01/05 17:06:58 INFO dfs.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at gtz2/127.0.0.1
************************************************************/
gtz2:/home/telzur/downloads/hadoop-0.14.4 #
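
The storage directory /tmp/hadoop-root/dfs/name in the log above comes from nested configuration properties. The following Python sketch shows how such a path can be derived by expanding ${...} references. The two default values are what I believe hadoop-default.xml contains in this release, so treat them as assumptions; the helper names expand and storage_dir are made up for this post.

```python
import re

# Assumed defaults (check hadoop-default.xml in your own release).
DEFAULTS = {
    "hadoop.tmp.dir": "/tmp/hadoop-${user.name}",
    "dfs.name.dir": "${hadoop.tmp.dir}/dfs/name",
}

VARIABLE = re.compile(r"\$\{([^}]+)\}")


def expand(value, props, max_depth=20):
    """Repeatedly substitute ${name} references using props."""
    for _ in range(max_depth):
        # Unknown names are left untouched.
        expanded = VARIABLE.sub(lambda m: props.get(m.group(1), m.group(0)), value)
        if expanded == value:
            break
        value = expanded
    return value


def storage_dir(user):
    """Return the expanded dfs.name.dir for the given user name."""
    props = {**DEFAULTS, "user.name": user}
    return expand(props["dfs.name.dir"], props)


if __name__ == "__main__":
    print(storage_dir("root"))  # /tmp/hadoop-root/dfs/name
```

Since I ran the format command as root, the directory ends up under /tmp/hadoop-root; under another user name the path would differ accordingly.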

Start the Hadoop daemons:
gtz2:/home/telzur/downloads/hadoop-0.14.4 # bin/start-all.sh
starting namenode, logging to /home/telzur/downloads/hadoop-0.14.4/bin/../logs/hadoop-telzur-namenode-gtz2.out
localhost: starting datanode, logging to /home/telzur/downloads/hadoop-0.14.4/bin/../logs/hadoop-root-datanode-gtz2.out
localhost: starting secondarynamenode, logging to /home/telzur/downloads/hadoop-0.14.4/bin/../logs/hadoop-root-secondarynamenode-gtz2.out
starting jobtracker, logging to /home/telzur/downloads/hadoop-0.14.4/bin/../logs/hadoop-telzur-jobtracker-gtz2.out
localhost: starting tasktracker, logging to /home/telzur/downloads/hadoop-0.14.4/bin/../logs/hadoop-root-tasktracker-gtz2.out
gtz2:/home/telzur/downloads/hadoop-0.14.4 #


Browsing the web interface:
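
Before opening a browser, it can be handy to check that the daemons are actually listening. Here is a small Python sketch that probes the web interface ports. The ports 50070 (NameNode) and 50030 (JobTracker) are the defaults given in the Quickstart guide; if your configuration overrides them, adjust the dictionary. The function names are my own.

```python
import socket

# Default web interface ports from the Quickstart guide (assumed unchanged).
WEB_UIS = {"NameNode": 50070, "JobTracker": 50030}


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def web_ui_status(host="localhost", uis=WEB_UIS):
    """Map each daemon name to its URL and whether the port answers."""
    return {
        name: (f"http://{host}:{port}/", is_listening(host, port))
        for name, port in uis.items()
    }


if __name__ == "__main__":
    for name, (url, up) in web_ui_status().items():
        print(name, url, "up" if up else "down")
```

This only checks that something accepts connections on the port; it does not verify that the page is served by Hadoop.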






Run the examples:

gtz2:/home/telzur/downloads/hadoop-0.14.4 # bin/hadoop dfs -put conf input
gtz2:/home/telzur/downloads/hadoop-0.14.4 # bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
08/01/05 17:21:39 INFO mapred.FileInputFormat: Total input paths to process : 10
08/01/05 17:21:40 INFO mapred.JobClient: Running job: job_200801051712_0001
08/01/05 17:21:41 INFO mapred.JobClient: map 0% reduce 0%
08/01/05 17:21:51 INFO mapred.JobClient: map 18% reduce 0%
08/01/05 17:21:52 INFO mapred.JobClient: map 36% reduce 0%
08/01/05 17:21:53 INFO mapred.JobClient: map 54% reduce 0%
08/01/05 17:21:54 INFO mapred.JobClient: map 63% reduce 0%
08/01/05 17:21:55 INFO mapred.JobClient: map 72% reduce 0%
08/01/05 17:21:56 INFO mapred.JobClient: map 90% reduce 0%
08/01/05 17:21:57 INFO mapred.JobClient: map 100% reduce 0%
08/01/05 17:22:06 INFO mapred.JobClient: map 100% reduce 27%
08/01/05 17:22:07 INFO mapred.JobClient: map 100% reduce 100%
08/01/05 17:22:08 INFO mapred.JobClient: Job complete: job_200801051712_0001
08/01/05 17:22:08 INFO mapred.JobClient: Counters: 12
08/01/05 17:22:08 INFO mapred.JobClient: Job Counters
08/01/05 17:22:08 INFO mapred.JobClient: Launched map tasks=11
08/01/05 17:22:08 INFO mapred.JobClient: Launched reduce tasks=1
08/01/05 17:22:08 INFO mapred.JobClient: Data-local map tasks=11
08/01/05 17:22:08 INFO mapred.JobClient: Map-Reduce Framework
08/01/05 17:22:08 INFO mapred.JobClient: Map input records=1153
08/01/05 17:22:08 INFO mapred.JobClient: Map output records=43
08/01/05 17:22:08 INFO mapred.JobClient: Map input bytes=34316
08/01/05 17:22:08 INFO mapred.JobClient: Map output bytes=1118
08/01/05 17:22:08 INFO mapred.JobClient: Combine input records=43
08/01/05 17:22:08 INFO mapred.JobClient: Combine output records=39
08/01/05 17:22:08 INFO mapred.JobClient: Reduce input groups=38
08/01/05 17:22:08 INFO mapred.JobClient: Reduce input records=39
08/01/05 17:22:08 INFO mapred.JobClient: Reduce output records=38
08/01/05 17:22:08 INFO mapred.FileInputFormat: Total input paths to process : 1
08/01/05 17:22:09 INFO mapred.JobClient: Running job: job_200801051712_0002
08/01/05 17:22:10 INFO mapred.JobClient: map 0% reduce 0%
08/01/05 17:22:18 INFO mapred.JobClient: map 100% reduce 0%
08/01/05 17:22:24 INFO mapred.JobClient: map 100% reduce 100%
08/01/05 17:22:25 INFO mapred.JobClient: Job complete: job_200801051712_0002
08/01/05 17:22:25 INFO mapred.JobClient: Counters: 10
08/01/05 17:22:25 INFO mapred.JobClient: Job Counters
08/01/05 17:22:25 INFO mapred.JobClient: Launched map tasks=1
08/01/05 17:22:25 INFO mapred.JobClient: Launched reduce tasks=1
08/01/05 17:22:25 INFO mapred.JobClient: Data-local map tasks=1
08/01/05 17:22:25 INFO mapred.JobClient: Map-Reduce Framework
08/01/05 17:22:25 INFO mapred.JobClient: Map input records=38
08/01/05 17:22:25 INFO mapred.JobClient: Map output records=38
08/01/05 17:22:25 INFO mapred.JobClient: Map input bytes=1330
08/01/05 17:22:25 INFO mapred.JobClient: Map output bytes=1026
08/01/05 17:22:25 INFO mapred.JobClient: Reduce input groups=3
08/01/05 17:22:25 INFO mapred.JobClient: Reduce input records=38
08/01/05 17:22:25 INFO mapred.JobClient: Reduce output records=38
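
The job log mixes progress messages with counters for two separate jobs, which is hard to read at a glance. Here is a rough Python sketch that groups the counters by job ID. It assumes the log has one entry per line, as Hadoop writes it to the terminal, and that counter lines look like "name=number" after the "mapred.JobClient:" prefix. The regular expressions and the function name parse_jobs are my own, not part of Hadoop.

```python
import re

# A line announcing a new job, e.g. "Running job: job_200801051712_0001".
JOB = re.compile(r"Running job: (\S+)")
# A counter line, e.g. "INFO mapred.JobClient: Map input records=1153".
COUNTER = re.compile(r"INFO mapred\.JobClient:\s+([^=]+?)=(\d+)\s*$")


def parse_jobs(lines):
    """Return {job_id: {counter_name: value}} from JobClient log lines."""
    jobs = {}
    current = None
    for line in lines:
        started = JOB.search(line)
        if started:
            # Counters that follow belong to this job until the next one starts.
            current = jobs.setdefault(started.group(1), {})
            continue
        counter = COUNTER.search(line)
        if counter and current is not None:
            current[counter.group(1)] = int(counter.group(2))
    return jobs


if __name__ == "__main__":
    import sys
    # Usage: pipe the saved log into the script on standard input.
    for job_id, counters in parse_jobs(sys.stdin).items():
        print(job_id)
        for name, value in counters.items():
            print("   ", name, "=", value)
```

Progress lines such as "map 18% reduce 0%" contain no "=" and are simply skipped.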

Examine the output files:
gtz2:/home/telzur/downloads/hadoop-0.14.4 # bin/hadoop dfs -get output output
gtz2:/home/telzur/downloads/hadoop-0.14.4 # cat output/*
cat: output/output: Is a directory
2 dfs.
1 dfs.block.size
1 dfs.blockreport.interval
1 dfs.client.block.write.retries
1 dfs.client.buffer.dir
1 dfs.data.dir
1 dfs.datanode.bind
1 dfs.datanode.dns.interface
1 dfs.datanode.dns.nameserver
1 dfs.datanode.du.pct
1 dfs.datanode.du.reserved
1 dfs.datanode.port
1 dfs.default.chunk.view.size
1 dfs.df.interval
1 dfs.heartbeat.interval
1 dfs.hosts
1 dfs.hosts.exclude
1 dfs.impl
1 dfs.info.bind
1 dfs.info.port
1 dfs.name.dir
1 dfs.namenode.handler.count
1 dfs.namenode.logging.level
1 dfs.network.script
1 dfs.replication
1 dfs.replication.consider
1 dfs.replication.max
1 dfs.replication.min
1 dfs.replication.min.
1 dfs.safemode.extension
1 dfs.safemode.threshold.pct
1 dfs.secondary.info.bind
1 dfs.secondary.info.port
gtz2:/home/telzur/downloads/hadoop-0.14.4 #

Note that the local output directory from the standalone test still existed, so dfs -get copied the results of this run into output/output (hence the "Is a directory" message). The listing above is therefore still the 33-line result of the standalone test. The 38 records produced by this run can be viewed with cat output/output/*.

Re-check the web interfaces after the job ended:







Finally, stop the daemons when we are done:
gtz2:/home/telzur/downloads/hadoop-0.14.4 # bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenod

And that concludes the Hadoop Quickstart tutorial.


