13.5.15

Interfaces

FSDataInputStream, Seekable, PositionedReadable


  • The open() method of FileSystem returns an FSDataInputStream
    • rather than a standard java.io class
  • FSDataInputStream is a specialization of java.io.DataInputStream
    • with support for random access
    • so you can read from any part of the stream
      • The Seekable and PositionedReadable interfaces provide this random access.
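A hedged sketch of what these two interfaces enable (assuming an existing FileSystem instance fs; the path is just a placeholder):

```java
FSDataInputStream in = fs.open(new Path("/user/tom/quangle.txt")); // placeholder path

in.seek(3);                    // Seekable: move to an absolute offset
int b = in.read();             // subsequent reads start from that position

byte[] buf = new byte[16];
in.readFully(0, buf);          // PositionedReadable: read at a given offset
                               // without moving the current file position
in.close();
```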

FSDataOutputStream, Progressable
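How these fit together, as a hedged fragment (assuming an existing FileSystem fs; the output path is a placeholder): FileSystem.create() returns an FSDataOutputStream and can take a Progressable whose progress() callback Hadoop invokes periodically as data is written out.

```java
// Sketch, not a complete program: fs is an existing FileSystem instance.
FSDataOutputStream out = fs.create(new Path("/user/tom/output.txt"),
    new Progressable() {
      public void progress() {
        System.out.print("."); // called periodically as data is written
      }
    });
```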

Glob patterns and PathFilter

  • Hadoop supports the same set of glob characters as Unix bash
  • When glob patterns are not powerful enough to describe a set of files you want to access, you can use PathFilter.
public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws
  IOException
Here is the PathFilter interface:
  package org.apache.hadoop.fs;

  public interface PathFilter {
    boolean accept(Path path);
  }
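For example, a PathFilter that excludes paths matching a regular expression might look like this (an illustrative sketch; the class name is mine):

```java
public class RegexExcludePathFilter implements PathFilter {
  private final String regex;

  public RegexExcludePathFilter(String regex) {
    this.regex = regex;
  }

  public boolean accept(Path path) {
    return !path.toString().matches(regex); // keep paths that do NOT match
  }
}
```

It can then be passed to the second globStatus() overload above, e.g. fs.globStatus(new Path("/2007/*/*"), new RegexExcludePathFilter("^.*/2007/12/31$")).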


Writable


package org.apache.hadoop.io;

import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;

public interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

The Writable interface defines two methods:

  • One for writing its state to a DataOutput binary stream and
  • One for reading its state from a DataInput binary stream

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.util.StringUtils;

public class WritableDemo {
    public static void main(String[] args) throws Exception {
        IntWritable iw = new IntWritable(1024);

        // Serialize: write the IntWritable's state to a byte array
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        iw.write(dos);
        dos.flush();
        byte[] data = baos.toByteArray();

        System.out.println(StringUtils.byteToHexString(data)); // 00000400

        // Deserialize: populate a fresh IntWritable from the same bytes
        ByteArrayInputStream bais = new ByteArrayInputStream(data);
        DataInputStream dis = new DataInputStream(bais);
        IntWritable iw2 = new IntWritable();
        iw2.readFields(dis);
        System.out.println(iw2.get()); // 1024
    }
}

WritableComparable and comparators


  • IntWritable implements the WritableComparable interface, which is just a subinterface of the Writable and java.lang.Comparable interfaces:
package org.apache.hadoop.io;

public interface WritableComparable<T> extends Writable, Comparable<T> {
}


  • Comparison of types is crucial for MapReduce
    • where there is a sorting phase during which keys are compared with one another.
  • RawComparator is an optimization that Hadoop provides
    • an extension of Java's Comparator
    • allows implementors to compare records read from a stream without deserializing them into objects
      • Serializing keys in big-endian byte order helps here, since the byte order then matches the numeric order (for non-negative values).
  • WritableComparator is a general-purpose implementation of RawComparator for WritableComparable classes.
    • It provides two main functions.
      • A default implementation of the raw compare() method that deserializes the objects to be compared from the stream and invokes the object compare() method.
      • It acts as a factory for RawComparator instances.
RawComparator comparator = WritableComparator.get(IntWritable.class);

IntWritable w1 = new IntWritable(163);
IntWritable w2 = new IntWritable(67);
assertThat(comparator.compare(w1, w2), greaterThan(0));
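The big-endian point can be demonstrated with plain JDK classes (a hedged illustration, not Hadoop code: compareBytes below mirrors the idea behind WritableComparator.compareBytes(), i.e. lexicographic comparison of unsigned bytes):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class BigEndianCompare {
  // Serialize an int the way DataOutputStream (and hence IntWritable) does:
  // 4 bytes, most significant byte first (big-endian).
  static byte[] toBytes(int v) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    DataOutputStream dos = new DataOutputStream(baos);
    dos.writeInt(v);
    return baos.toByteArray();
  }

  // Lexicographic comparison of byte arrays, treating bytes as unsigned.
  static int compareBytes(byte[] a, byte[] b) {
    for (int i = 0; i < Math.min(a.length, b.length); i++) {
      int x = a[i] & 0xff, y = b[i] & 0xff;
      if (x != y) return x - y;
    }
    return a.length - b.length;
  }

  public static void main(String[] args) throws IOException {
    // For non-negative ints, comparing the big-endian bytes gives the same
    // order as comparing the values -- no deserialization needed.
    int[] vals = {0, 67, 163, 1024, Integer.MAX_VALUE};
    for (int i = 0; i < vals.length - 1; i++) {
      boolean sameOrder =
          Integer.compare(vals[i], vals[i + 1]) < 0
              && compareBytes(toBytes(vals[i]), toBytes(vals[i + 1])) < 0;
      System.out.println(sameOrder);
    }
  }
}
```

(Note that this only works this cleanly for non-negative values; two's-complement negative ints have the high bit set, which is why IntWritable registers its own comparator rather than relying on raw byte order alone.)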

GenericOptionsParser, Tool interface and ToolRunner

  • GenericOptionsParser is a class that interprets common Hadoop command-line options and sets them on a Configuration object for your application to use as desired.
  • You don't usually use GenericOptionsParser directly, as it's more convenient to implement the Tool interface and run your application with the ToolRunner
    • which uses GenericOptionsParser internally
  • You can have your App class derive from Configured, which is an implementation of the Configurable interface.
    • All implementations of Tool need to implement Configurable
    • and subclassing Configured is often the easiest way to achieve this.
  • ToolRunner.run() method takes care of creating a Configuration object for the Tool before calling its run() method.
  • ToolRunner also uses a GenericOptionsParser to pick up any standard options specified on the command line and to set them on the Configuration instance. 
    • -conf <conf file>
  • GenericOptionsParser also allows you to set individual properties.
    • hadoop ConfigurationPrinter -D color=yellow | grep color
    • The -D option is used to set the configuration property with key color to the value yellow.
    • Options specified with -D take priority over properties from the configuration files.
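The ConfigurationPrinter used above can be sketched as follows (a minimal version; setup details such as adding extra config resources are omitted):

```java
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ConfigurationPrinter extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // getConf() returns the Configuration that ToolRunner populated,
    // including any -conf / -D options parsed by GenericOptionsParser
    Configuration conf = getConf();
    for (Map.Entry<String, String> entry : conf) {
      System.out.printf("%s=%s%n", entry.getKey(), entry.getValue());
    }
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new ConfigurationPrinter(), args));
  }
}
```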

InputSampler, Sampler



  • The InputSampler class defines a nested Sampler interface whose implementations return a sample of keys given an InputFormat and Job
  • This interface usually is not called directly by clients. Instead, the static writePartitionFile() method on InputSampler is used, which creates a sequence file storing the keys that define the partitions.
  • The sequence file is used by TotalOrderPartitioner to create partitions for the sort job.
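Wiring this up might look roughly like the following (a hedged sketch assuming a configured Job named job with IntWritable keys and Text values; the sampling parameters are arbitrary):

```java
// Sample ~0.1% of input keys (up to 10,000, from at most 10 splits),
// write the partition file, and have TotalOrderPartitioner use it.
job.setPartitionerClass(TotalOrderPartitioner.class);

InputSampler.Sampler<IntWritable, Text> sampler =
    new InputSampler.RandomSampler<IntWritable, Text>(0.001, 10000, 10);
InputSampler.writePartitionFile(job, sampler);

// Make the partition file available to the tasks
String partitionFile = TotalOrderPartitioner.getPartitionFile(job.getConfiguration());
job.addCacheFile(new URI(partitionFile));
```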
