An easy way to browse HDFS clusters

After spending the better part of a day trying to get HDFS to mount on my Mac, I finally gave up. Luckily, I found muCommander, a cross-platform Java port of the old Norton Commander, and as luck would have it, the latest version supports HDFS! Very handy for quickly browsing your HDFS clusters when you can’t mount them or don’t have the Hadoop toolset installed. ...

December 30, 2010 · 1 min · chetan

Distributing JARs for Map/Reduce jobs via HDFS

Hadoop has a built-in feature for easily distributing JARs to your worker nodes via HDFS but, unfortunately, it’s broken. There are a couple of tickets open with patches against 0.18 and 0.21 (trunk), but for some reason they still haven’t been committed. We’re currently running 0.20, so those patches do me no good anyway. So here’s my simple solution: I essentially copied the technique ToolRunner uses when you pass a “libjars” argument on the command line. You simply pass the function the HDFS paths to the JAR files you want included and it’ll take care of the rest. ...
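The full post is truncated above, but the libjars-style technique it describes can be sketched with Hadoop 0.20’s `DistributedCache.addFileToClassPath`, which is what `GenericOptionsParser` does under the hood for `-libjars`. This is a minimal sketch, not the post’s exact code; the helper name `addJarsToJobClasspath` is my own.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class JarDistributor {

    // Hypothetical helper: given paths to JARs that already live on HDFS,
    // register each one in the distributed cache and append it to the
    // task classpath, mirroring what ToolRunner/-libjars sets up.
    public static void addJarsToJobClasspath(Configuration conf, String... hdfsJarPaths)
            throws IOException {
        for (String jar : hdfsJarPaths) {
            // The JAR is pulled to each worker node and added to the
            // classpath of the map and reduce tasks.
            DistributedCache.addFileToClassPath(new Path(jar), conf);
        }
    }
}
```

You would call this while building the job configuration, e.g. `JarDistributor.addJarsToJobClasspath(conf, "hdfs://namenode:9000/lib/mylib.jar")` (path shown for illustration), before submitting the job.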

December 30, 2010 · 1 min · chetan

Using Hadoop's DistributedCache

Using Hadoop’s DistributedCache mechanism is fairly straightforward, but as I’m finding is common with everything-Hadoop, not very well documented.

Adding files

When setting up your Job configuration:

```java
// Create symlinks in the job's working directory using the link name
// provided below
DistributedCache.createSymlink(conf);

// Add a file to the cache. It must already exist on HDFS. The text
// after the hash is the link name.
DistributedCache.addCacheFile(
    new URI("hdfs://localhost:9000/foo/bar/baz.txt#baz.txt"), conf);
```

Accessing files

Now that we’ve cached our file, let’s access it: ...
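The “Accessing files” half of the excerpt is cut off above. One way it can work, given that `createSymlink` was enabled: the cached file shows up as `baz.txt` in the task’s working directory, so a task can read it with ordinary `java.io`. This is a hedged sketch using the old `org.apache.hadoop.mapred` API, not necessarily the post’s own code; the class name `CacheReadingMapper` is my invention.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class CacheReadingMapper extends MapReduceBase {

    @Override
    public void configure(JobConf job) {
        // "baz.txt" is the symlink name given after the '#' when the
        // file was added to the cache, materialized in the task's
        // working directory.
        try (BufferedReader reader = new BufferedReader(new FileReader("baz.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Use each line of the cached file, e.g. to populate an
                // in-memory lookup table before map() runs.
            }
        } catch (IOException e) {
            throw new RuntimeException("failed to read cached file", e);
        }
    }
}
```

Alternatively, `DistributedCache.getLocalCacheFiles(job)` returns the local `Path`s of all cached files if you would rather not rely on symlinks.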

December 28, 2010 · 1 min · chetan