Using Hadoop’s DistributedCache

Posted by chetan on December 28, 2010 in development

Using Hadoop’s DistributedCache mechanism is fairly straightforward but, as I’m finding is common with all things Hadoop, not very well documented.

Adding files

When setting up your Job configuration:

// Create symlinks in the job's working directory using the link name 
// provided below
DistributedCache.createSymlink(conf);
 
// Add a file to the cache. It must already exist on HDFS. The text
// after the hash is the link name.
DistributedCache.addCacheFile(
    new URI("hdfs://localhost:9000/foo/bar/baz.txt#baz.txt"), conf);

Accessing files

Now that we’ve cached our file, let’s access it:

// Direct access by name
File baz = new File("baz.txt");
// Prints "true" since the symlink was created in the task's working directory
System.out.println(baz.exists());

// We can also get a list of all locally cached files
Path[] cached = DistributedCache.getLocalCacheFiles(conf);
for (Path path : cached) {
    // Each Path points at the file's copy on the task node's local disk;
    // path.getName() would give just the file name, e.g. "baz.txt"
    System.out.println(path.toString());
}
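On the task side, the usual place to read the cached file is the mapper's setup() method, since it runs once per task before any records are processed. Below is a minimal sketch using the new mapreduce API; the class name CacheExampleMapper and the idea of filtering input lines against the contents of baz.txt are just illustrative.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExampleMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Set<String> terms = new HashSet<String>();

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // "baz.txt" is the symlink created in the task's working directory
        BufferedReader reader = new BufferedReader(new FileReader("baz.txt"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                terms.add(line.trim());
            }
        } finally {
            reader.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Example use of the cached data: emit lines that appear in baz.txt
        if (terms.contains(value.toString().trim())) {
            context.write(value, new IntWritable(1));
        }
    }
}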
Comments


  1. Srivatsan Nallazhagappan Thu, 19 Jun 2014 06:06:14 EDT

    Hi, I am trying to read the distributed cache file directly, just like you mentioned. I am also using MRUnit to unit-test my map-reduce program. Unfortunately, the MRUnit framework does not seem to support createSymlink. I am thinking of using getLocalCacheFiles and programmatically maintaining a map (from file name to path), and then having my client code use the file name for access. Is there a better approach?
