12.5.15

Need to remember


  • How do you configure a MapReduce job so that a single map task processes each input file regardless of how many blocks the input file occupies?
    • Write a custom FileInputFormat and override its isSplitable() method to always return false.
      • The isSplitable() method in your InputFormat is passed each input file's path; if it returns true, the file can be broken into splits and processed by multiple Mappers. If it returns false, the file is treated as not splittable, and the entire file is processed by a single Mapper (see the sketch after this list).
  • To make sure that jar files other than the one containing the Driver class get distributed to all nodes in the cluster, the hadoop command should be:
    • % hadoop jar job.jar MyDriver -libjars ex1.jar,ex2.jar
    • Note the flag is -libjars (plural) and the jars are comma-separated. The part to remember is <hadoop jar job.jar MyDriver>; the extra jars are appended after it. For -libjars to take effect, the driver must run through ToolRunner (see the driver sketch below).
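A minimal sketch of such an InputFormat, assuming the new (org.apache.hadoop.mapreduce) API; the class name WholeFileTextInputFormat is illustrative, not from the notes above:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Treats every input file as a single split, so one map task processes
// the whole file no matter how many HDFS blocks it occupies.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split: one Mapper per file
    }
}

Wire it into a job with job.setInputFormatClass(WholeFileTextInputFormat.class). Extending TextInputFormat (itself a FileInputFormat) keeps the stock line-oriented RecordReader while disabling splitting.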

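The -libjars option is handled by GenericOptionsParser, which only runs when the driver goes through ToolRunner. A minimal sketch of such a driver, again assuming the new API (Mapper/Reducer setup omitted; names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already reflects generic options such as -libjars.
        Job job = Job.getInstance(getConf(), "my job");
        job.setJarByClass(MyDriver.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options (e.g. -libjars) before
        // passing the remaining args to run().
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}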