Friday, October 19, 2007

Writing A Hadoop MapReduce Program In PHP

Today I came across an article by Michael Noll about how to write a Hadoop MapReduce program in Python. I’ve always been a big fan of MapReduce, and since we develop mainly in PHP at my employer, I thought it would be nice to port his example to PHP.

I recommend that you read Michael Noll’s tutorial first to understand a little more about what we’re doing.

First you’ll need to have Hadoop installed and running. Setting up a Hadoop cluster can be an all-day job. However, if you want to experiment with the platform right now, there is a virtual machine image with a preconfigured single-node instance of Hadoop. While this doesn't have the power of a full cluster, it does allow you to use the resources on your local machine to explore the Hadoop platform and run simple MapReduce jobs.
The virtual machine image is designed to be used with the free VMware Player.

Download the VMware image here: http://code.google.com/edu/tools/hadoopvm/

Log in as user ‘root’ (password ‘root’) and run ‘apt-get install php5-cli’ to install PHP5.
Now switch to user ‘guest’ (password ‘guest’).

Map: mapper.php
Save the following code in the file /home/guest/mapper.php:

#!/usr/bin/php
<?php

$word2count = array();

// input comes from STDIN (standard input)
while (($line = fgets(STDIN)) !== false) {
    // remove leading and trailing whitespace and lowercase
    $line = strtolower(trim($line));
    // split the line into words while removing any empty string
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    // increase counters, initializing each word the first time
    // we see it to avoid an "undefined index" notice
    foreach ($words as $word) {
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word] += 1;
    }
}

// write the results to STDOUT (standard output)
// what we output here will be the input for the
// Reduce step, i.e. the input for reducer.php
foreach ($word2count as $word => $count) {
    // tab-delimited
    echo $word, chr(9), $count, PHP_EOL;
}

?>
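
To see what the mapper emits without starting Hadoop, the tokenizing and counting step can be tried on a single line. This is only a sketch: the function name count_words and the sample sentence are made up for illustration and are not part of mapper.php.

```php
<?php
// Hypothetical helper (not part of mapper.php): counts the words in one line
// using the same lowercase + preg_split approach as the mapper.
function count_words($line) {
    $counts = array();
    $words = preg_split('/\W/', strtolower(trim($line)), 0, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        if (!isset($counts[$word])) {
            $counts[$word] = 0;
        }
        $counts[$word] += 1;
    }
    return $counts;
}

// Made-up sample input; prints one "word<TAB>count" pair per line.
foreach (count_words("The quick fox, the lazy dog.") as $word => $count) {
    echo $word, "\t", $count, PHP_EOL;
}
```

For the sample sentence, ‘the’ is counted twice and every other word once. The punctuation disappears because \W splits on any non-word character.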

Reduce: reducer.php
Save the following code in the file /home/guest/reducer.php:

#!/usr/bin/php
<?php

$word2count = array();

// input comes from STDIN
while (($line = fgets(STDIN)) !== false) {
    // remove leading and trailing whitespace
    $line = trim($line);
    // parse the input we got from mapper.php, skipping malformed lines
    $parts = explode(chr(9), $line);
    if (count($parts) !== 2) {
        continue;
    }
    list($word, $count) = $parts;
    // convert count (currently a string) to int
    $count = intval($count);
    // sum counts, initializing each word the first time we see it
    if ($count > 0) {
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word] += $count;
    }
}

// sort the words lexicographically
//
// this step is NOT required, we just do it so that our
// final output will look more like the official Hadoop
// word count examples
ksort($word2count);

// write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
    echo $word, chr(9), $count, PHP_EOL;
}

?>
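
Hadoop Streaming sorts the mapper output by key before handing it to the reducer, so all lines for one word arrive next to each other. A reducer can rely on that and keep a single running total in memory instead of a whole array of words. A minimal sketch of that idea, assuming sorted input; the function name reduce_sorted and the sample lines are hypothetical and not part of reducer.php:

```php
<?php
// Hypothetical helper: sums counts over "word<TAB>count" lines that are
// already sorted by word, keeping only one running total at a time.
function reduce_sorted(array $lines) {
    $out = array();
    $current = null; // word whose counts are being summed
    $total = 0;
    foreach ($lines as $line) {
        $parts = explode("\t", trim($line));
        if (count($parts) !== 2) {
            continue; // skip blank or malformed lines
        }
        list($word, $count) = $parts;
        if ($word !== $current) {
            // the word changed, so the previous word is complete
            if ($current !== null) {
                $out[] = $current . "\t" . $total;
            }
            $current = $word;
            $total = 0;
        }
        $total += intval($count);
    }
    if ($current !== null) {
        $out[] = $current . "\t" . $total; // flush the last word
    }
    return $out;
}

// Made-up, already sorted sample input.
$sample = array("dog\t1", "the\t2", "the\t3");
echo implode(PHP_EOL, reduce_sorted($sample)), PHP_EOL;
```

The reducer above does not need this, because it collects everything in $word2count first. The sorted variant mainly matters when the number of distinct words is too large to hold in memory.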


Don’t forget to set execution rights for these files:
chmod +x /home/guest/mapper.php /home/guest/reducer.php

Running the PHP code on Hadoop:
Download example input data
Like Michael, we will use three ebooks from Project Gutenberg for this example (his tutorial lists them).

Download each ebook and store it in a temporary directory of your choice, for example /tmp/gutenberg.

Copy local example data to HDFS
Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop’s HDFS:
bin/hadoop dfs -copyFromLocal /tmp/gutenberg gutenberg


Run the MapReduce job
We’re all set and ready to run our PHP MapReduce job on the Hadoop cluster. We use HadoopStreaming to pass data between our Map and Reduce code via STDIN and STDOUT:
bin/hadoop jar contrib/hadoop-streaming.jar -mapper /home/guest/mapper.php -reducer /home/guest/reducer.php -input gutenberg/* -output gutenberg-output

The job will read all the files in the HDFS directory gutenberg, process them, and store the results in a single result file in the HDFS directory gutenberg-output.

You can track the status of the job using Hadoop’s web interface. Go to http://localhost:50030/

When the job has finished, check whether the result was successfully stored in the HDFS directory gutenberg-output:
bin/hadoop dfs -ls gutenberg-output

You can then inspect the contents of the file with the dfs -cat command:
bin/hadoop dfs -cat gutenberg-output/part-00000
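
The result file uses the same tab-separated word and count format that the reducer prints, so it is easy to post-process. As an illustration only, this sketch picks the most frequent words from a few made-up output lines; top_words and the sample data are hypothetical and not part of the job:

```php
<?php
// Hypothetical helper: returns the $n most frequent words from
// "word<TAB>count" lines, highest count first.
function top_words(array $lines, $n) {
    $counts = array();
    foreach ($lines as $line) {
        $parts = explode("\t", trim($line));
        if (count($parts) !== 2) {
            continue; // skip blank or malformed lines
        }
        $counts[$parts[0]] = intval($parts[1]);
    }
    arsort($counts); // sort by count, descending, keeping the word keys
    return array_slice($counts, 0, $n, true);
}

// Made-up sample lines standing in for the contents of part-00000.
$sample = array("and\t7", "hadoop\t3", "the\t12", "zebra\t1");
foreach (top_words($sample, 2) as $word => $count) {
    echo $word, "\t", $count, PHP_EOL;
}
```

To try it on real output, the sample array could be replaced with file('php://stdin', FILE_IGNORE_NEW_LINES) and the output of the dfs -cat command piped into the script.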

That’s all! Have fun with Hadoop.
