13.5.15

Interfaces

FSDataInputStream, Seekable, PositionedReadable


  • The open() method of FileSystem returns an FSDataInputStream
    • rather than a standard java.io class
  • FSDataInputStream is a specialization of java.io.DataInputStream
    • with support for random access
    • so you can read from any part of the stream
      • The Seekable and PositionedReadable interfaces provide this random access.
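A hedged sketch of what these two interfaces enable (assuming an existing FileSystem instance fs; the path is just a placeholder):

```java
FSDataInputStream in = fs.open(new Path("/user/tom/quangle.txt")); // placeholder path

in.seek(3);                    // Seekable: move to an absolute offset
int b = in.read();             // subsequent reads start from that position

byte[] buf = new byte[16];
in.readFully(0, buf);          // PositionedReadable: read at a given offset
                               // without moving the current file position
in.close();
```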

FSDataOutputStream, Progressable
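How these fit together, as a hedged fragment (assuming an existing FileSystem fs; the output path is a placeholder): FileSystem.create() returns an FSDataOutputStream and can take a Progressable whose progress() callback Hadoop invokes periodically as data is written out.

```java
// Sketch, not a complete program: fs is an existing FileSystem instance.
FSDataOutputStream out = fs.create(new Path("/user/tom/output.txt"),
    new Progressable() {
      public void progress() {
        System.out.print("."); // called periodically as data is written
      }
    });
```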

Glob patterns and PathFilter

  • Hadoop supports the same set of glob characters as Unix bash
  • When glob patterns are not powerful enough to describe a set of files you want to access, you can use PathFilter.
public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws
  IOException
Here is the PathFilter interface:
  package org.apache.hadoop.fs;

  public interface PathFilter {
    boolean accept(Path path);
  }
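For example, a PathFilter that excludes paths matching a regular expression might look like this (an illustrative sketch; the class name is mine):

```java
public class RegexExcludePathFilter implements PathFilter {
  private final String regex;

  public RegexExcludePathFilter(String regex) {
    this.regex = regex;
  }

  public boolean accept(Path path) {
    return !path.toString().matches(regex); // keep paths that do NOT match
  }
}
```

It can then be passed to the second globStatus() overload above, e.g. fs.globStatus(new Path("/2007/*/*"), new RegexExcludePathFilter("^.*/2007/12/31$")).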


Writable


package org.apache.hadoop.io;

import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;

public interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

The Writable interface defines two methods:

  • One for writing its state to a DataOutput binary stream and
  • One for reading its state from a DataInput binary stream

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.util.StringUtils;

public class WritableDemo {
    public static void main(String[] args) throws Exception {
        IntWritable iw = new IntWritable(1024);

        // Serialize: write the IntWritable's state to a byte array
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        iw.write(dos);
        dos.flush();
        byte[] data = baos.toByteArray();

        System.out.println(StringUtils.byteToHexString(data)); // 00000400

        // Deserialize: populate a fresh IntWritable from the same bytes
        ByteArrayInputStream bais = new ByteArrayInputStream(data);
        DataInputStream dis = new DataInputStream(bais);
        IntWritable iw2 = new IntWritable();
        iw2.readFields(dis);
        System.out.println(iw2.get()); // 1024
    }
}

WritableComparable and comparators


  • IntWritable implements the WritableComparable interface, which is just a subinterface of the Writable and java.lang.Comparable interfaces:
package org.apache.hadoop.io;

public interface WritableComparable<T> extends Writable, Comparable<T> {
}


  • Comparison of types is crucial for MapReduce
    • where there is a sorting phase during which keys are compared with one another.
  • RawComparator is an optimization that Hadoop provides
    • an extension of Java's Comparator
    • allows implementors to compare records read from a stream without deserializing them into objects
      • Serializing keys in big-endian byte order helps here, since the byte order then matches the numeric order (for non-negative values).
  • WritableComparator is a general-purpose implementation of RawComparator for WritableComparable classes.
    • It provides two main functions.
      • A default implementation of the raw compare() method that deserializes the objects to be compared from the stream and invokes the object compare() method.
      • It acts as a factory for RawComparator instances.
RawComparator comparator = WritableComparator.get(IntWritable.class);

IntWritable w1 = new IntWritable(163);
IntWritable w2 = new IntWritable(67);
assertThat(comparator.compare(w1, w2), greaterThan(0));
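The big-endian point can be demonstrated with plain JDK classes (a hedged illustration, not Hadoop code: compareBytes below mirrors the idea behind WritableComparator.compareBytes(), i.e. lexicographic comparison of unsigned bytes):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class BigEndianCompare {
  // Serialize an int the way DataOutputStream (and hence IntWritable) does:
  // 4 bytes, most significant byte first (big-endian).
  static byte[] toBytes(int v) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    DataOutputStream dos = new DataOutputStream(baos);
    dos.writeInt(v);
    return baos.toByteArray();
  }

  // Lexicographic comparison of byte arrays, treating bytes as unsigned.
  static int compareBytes(byte[] a, byte[] b) {
    for (int i = 0; i < Math.min(a.length, b.length); i++) {
      int x = a[i] & 0xff, y = b[i] & 0xff;
      if (x != y) return x - y;
    }
    return a.length - b.length;
  }

  public static void main(String[] args) throws IOException {
    // For non-negative ints, comparing the big-endian bytes gives the same
    // order as comparing the values -- no deserialization needed.
    int[] vals = {0, 67, 163, 1024, Integer.MAX_VALUE};
    for (int i = 0; i < vals.length - 1; i++) {
      boolean sameOrder =
          Integer.compare(vals[i], vals[i + 1]) < 0
              && compareBytes(toBytes(vals[i]), toBytes(vals[i + 1])) < 0;
      System.out.println(sameOrder);
    }
  }
}
```

(Note that this only works this cleanly for non-negative values; two's-complement negative ints have the high bit set, which is why IntWritable registers its own comparator rather than relying on raw byte order alone.)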

GenericOptionsParser, Tool interface and ToolRunner

  • GenericOptionsParser is a class that interprets common Hadoop command-line options and sets them on a Configuration object for your application to use as desired.
  • You don't usually use GenericOptionsParser directly, as it's more convenient to implement the Tool interface and run your application with the ToolRunner
    • which uses GenericOptionsParser internally
  • You can have your App class derive from Configured, which is an implementation of the Configurable interface.
    • All implementations of Tool need to implement Configurable
    • and subclassing Configured is often the easiest way to achieve this.
  • ToolRunner.run() method takes care of creating a Configuration object for the Tool before calling its run() method.
  • ToolRunner also uses a GenericOptionsParser to pick up any standard options specified on the command line and to set them on the Configuration instance. 
    • -conf <conf file>
  • GenericOptionsParser also allows you to set individual properties.
    • hadoop ConfigurationPrinter -D color=yellow | grep color
    • The -D option is used to set the configuration property with key color to the value yellow.
    • Options specified with -D take priority over properties from the configuration files.
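The ConfigurationPrinter used above can be sketched as follows (a minimal version; setup details such as adding extra config resources are omitted):

```java
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ConfigurationPrinter extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // getConf() returns the Configuration that ToolRunner populated,
    // including any -conf / -D options parsed by GenericOptionsParser
    Configuration conf = getConf();
    for (Map.Entry<String, String> entry : conf) {
      System.out.printf("%s=%s%n", entry.getKey(), entry.getValue());
    }
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new ConfigurationPrinter(), args));
  }
}
```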

InputSampler, Sampler



  • The InputSampler class defines a nested Sampler interface whose implementations return a sample of keys given an InputFormat and Job
  • This interface usually is not called directly by clients. Instead, the static writePartitionFile() method on InputSampler is used, which creates a sequence file storing the keys that define the partitions.
  • The sequence file is used by TotalOrderPartitioner to create partitions for the sort job.
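Wiring this up might look roughly like the following (a hedged sketch assuming a configured Job named job with IntWritable keys and Text values; the sampling parameters are arbitrary):

```java
// Sample ~0.1% of input keys (up to 10,000, from at most 10 splits),
// write the partition file, and have TotalOrderPartitioner use it.
job.setPartitionerClass(TotalOrderPartitioner.class);

InputSampler.Sampler<IntWritable, Text> sampler =
    new InputSampler.RandomSampler<IntWritable, Text>(0.001, 10000, 10);
InputSampler.writePartitionFile(job, sampler);

// Make the partition file available to the tasks
String partitionFile = TotalOrderPartitioner.getPartitionFile(job.getConfiguration());
job.addCacheFile(new URI(partitionFile));
```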
