- How do you configure a MapReduce job so that a single map task processes each input file regardless of how many blocks the input file occupies?
- Write a custom FileInputFormat and override its isSplitable() method to always return false.
- The isSplitable() method in your InputFormat is passed each filename; if it returns true, the file can be broken up and processed by multiple Mappers. If it returns false, the file is considered 'not splittable', meaning the entire file must be processed by a single Mapper. A sketch is shown below.
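Here is a minimal sketch of such an InputFormat, assuming the new (org.apache.hadoop.mapreduce) API and plain text input; the class name WholeFileTextInputFormat is illustrative:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Illustrative subclass: inherits TextInputFormat's record reading
// but refuses to split any input file.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Never split: each input file becomes exactly one input split,
        // so a single map task processes the whole file no matter how
        // many HDFS blocks it spans.
        return false;
    }
}

In the driver, wire it in with job.setInputFormatClass(WholeFileTextInputFormat.class);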
- To make sure jar files other than the one containing the Driver class get distributed to all nodes in the cluster, the hadoop command should be:
% hadoop jar job.jar MyDriver -libjars ex1.jar,ex2.jar
- Remember that <hadoop jar job.jar MyDriver> is the main part of the command; -libjars takes a comma-separated jar list and is one of the generic options parsed by GenericOptionsParser, which only runs if the driver goes through ToolRunner (see the sketch below).
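A minimal driver sketch that supports -libjars, assuming the new API; the class name MyDriver, job name, and argument positions are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// ToolRunner invokes GenericOptionsParser before run() is called,
// which is what actually interprets -libjars and ships the extra
// jars to the cluster.
public class MyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "my job");
        job.setJarByClass(MyDriver.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}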