Using Hadoop’s DistributedCache mechanism is fairly straightforward but, as I’m finding is common with all things Hadoop, not very well documented.

Adding files

When setting up your Job configuration:

// Create symlinks in the job's working directory using the link name
// provided below
DistributedCache.createSymlink(conf);

// Add a file to the cache. It must already exist on HDFS. The text
// after the hash is the link name.
DistributedCache.addCacheFile(
    new URI("hdfs://localhost:9000/foo/bar/baz.txt#baz.txt"), conf);
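Note that the file has to be sitting on HDFS before you submit the job. If it isn’t there yet, you can copy it up first; here’s a minimal sketch, assuming the same localhost namenode and a made-up local path:

// Copy a local file up to HDFS so it can be added to the cache.
// The local path below is just a placeholder.
FileSystem fs = FileSystem.get(conf);
fs.copyFromLocalFile(new Path("/tmp/baz.txt"),
        new Path("hdfs://localhost:9000/foo/bar/baz.txt"));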

Accessing files

Now that we’ve cached our file, let’s access it from inside a map or reduce task (that’s where the symlink shows up in the working directory):

// Direct access by name
File baz = new File("baz.txt");
// prints "true" since the file was found in the working directory
System.out.println(baz.exists());

// We can also get a list of all cached files
Path[] cached = DistributedCache.getLocalCacheFiles(conf);
for (Path path : cached) {
    // Each entry is the file's local path on the task node
    String filename = path.toString();
}
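Once you have the symlinked name (or one of the local paths from getLocalCacheFiles), reading the file is just ordinary Java I/O. A quick sketch of the file-reading part, as you might write it in a mapper’s setup() method:

// Read the cached file line by line inside a task.
// Plain java.io works because the symlink lives in the task's working directory.
BufferedReader reader = new BufferedReader(new FileReader("baz.txt"));
String line;
while ((line = reader.readLine()) != null) {
    // use each line of the cached file here
}
reader.close();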