FSDataInputStream, Seekable, PositionedReadable
- The open() method of FileSystem returns an FSDataInputStream
- rather than a standard java.io class
- FSDataInputStream is a specialization of java.io.DataInputStream
- with support for random access
- so you can read from any part of the stream
- The Seekable and PositionedReadable interfaces serve this purpose.
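A minimal sketch of what random access looks like, assuming a file already exists at the hypothetical path /tmp/quangle.txt on the default filesystem:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeekDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/quangle.txt"); // hypothetical example path

    try (FSDataInputStream in = fs.open(path)) {
      in.seek(3);            // Seekable: move to absolute offset 3
      int b = in.read();     // reads the byte at offset 3

      byte[] buf = new byte[4];
      in.readFully(0, buf);  // PositionedReadable: read 4 bytes at offset 0
                             // without disturbing the current position
    }
  }
}
```

seek() moves the stream's current position, while the positioned reads (readFully, read with an offset) leave it untouched, which is what makes them safe to use concurrently.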
FSDataOutputStream, Progressable
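As a hedged sketch of how these two fit together: FileSystem.create() can take a Progressable whose progress() method is called back as data is written (the destination path below is a made-up example):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.Progressable;

public class WriteDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dst = new Path("/tmp/output.txt"); // hypothetical destination

    // The Progressable callback is invoked by the framework as bytes are
    // written, which is useful for reporting progress during long writes.
    try (FSDataOutputStream out = fs.create(dst, new Progressable() {
      public void progress() {
        System.out.print(".");
      }
    })) {
      out.writeBytes("some data");
      System.out.println(out.getPos()); // FSDataOutputStream tracks the write position
    }
  }
}
```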
Glob patterns and PathFilter
- Hadoop supports the same set of glob characters as Unix bash
- When glob patterns are not powerful enough to describe a set of files you want to access, you can use PathFilter.
public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException
package org.apache.hadoop.fs;

public interface PathFilter {
  boolean accept(Path path);
}
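For example, a sketch of a filter that excludes paths matching a regular expression (the class name and regex here are illustrative, not from any library):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Rejects any path whose string form matches the given regex.
public class RegexExcludePathFilter implements PathFilter {
  private final String regex;

  public RegexExcludePathFilter(String regex) {
    this.regex = regex;
  }

  @Override
  public boolean accept(Path path) {
    return !path.toString().matches(regex);
  }
}
```

It can then be combined with a glob, e.g. fs.globStatus(new Path("/2007/*/*"), new RegexExcludePathFilter("^.*/2007/12/31$")) to match everything under /2007 except December 31.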
Writable
package org.apache.hadoop.io;

import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;

public interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}
The Writable interface defines two methods:
- One for writing its state to a DataOutput binary stream and
- One for reading its state from a DataInput binary stream
public static void main(String[] args) throws Exception {
  IntWritable iw = new IntWritable(1024);
  ByteArrayOutputStream baos = new ByteArrayOutputStream();
  DataOutputStream dos = new DataOutputStream(baos);
  iw.write(dos);
  dos.flush();
  byte[] data = baos.toByteArray();
  System.out.println(StringUtils.byteToHexString(data));

  ByteArrayInputStream bais = new ByteArrayInputStream(data);
  DataInputStream dis = new DataInputStream(bais);
  IntWritable iw2 = new IntWritable();
  iw2.readFields(dis);
  System.out.println(iw2.get());
}
WritableComparable and comparators
- IntWritable implements the WritableComparable interface, which is just a subinterface of the Writable and java.lang.Comparable interfaces:
package org.apache.hadoop.io;

public interface WritableComparable<T> extends Writable, Comparable<T> {
}
- Comparison of types is crucial for MapReduce
- where there is a sorting phase during which keys are compared with one another.
- RawComparator is an optimization that Hadoop provides
- extension of Java's Comparator
- allows implementors to compare records read from a stream without deserializing them into objects
- Storing values in big-endian byte order can help here, since the serialized bytes can then often be compared lexicographically in a way that matches the natural ordering.
- WritableComparator is a general-purpose implementation of RawComparator for WritableComparable classes.
- It provides two main functions.
- A default implementation of the raw compare() method that deserializes the objects to be compared from the stream and invokes the object compare() method.
- acts as a factory for RawComparator instances
RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);
IntWritable w1 = new IntWritable(163);
IntWritable w2 = new IntWritable(67);
assertThat(comparator.compare(w1, w2), greaterThan(0));
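The same comparator can also compare serialized representations directly, which is the point of RawComparator. A sketch, serializing two IntWritables by hand first (the serialize helper is just for illustration):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.WritableComparator;

public class RawCompareDemo {
  // Illustrative helper: serialize a Writable to a byte array.
  static byte[] serialize(IntWritable w) throws Exception {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    DataOutputStream dos = new DataOutputStream(baos);
    w.write(dos);
    dos.flush();
    return baos.toByteArray();
  }

  public static void main(String[] args) throws Exception {
    RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);
    byte[] b1 = serialize(new IntWritable(163));
    byte[] b2 = serialize(new IntWritable(67));
    // Compare the raw bytes without deserializing back into objects.
    int result = comparator.compare(b1, 0, b1.length, b2, 0, b2.length);
    System.out.println(result > 0); // 163 compares greater than 67
  }
}
```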
GenericOptionsParser, Tool interface and ToolRunner
- GenericOptionsParser is a class that interprets common Hadoop command-line options and sets them on a Configuration object for your application to use as desired.
- You don't usually use GenericOptionsParser directly, as it's more convenient to implement the Tool interface and run your application with the ToolRunner
- which uses GenericOptionsParser internally
- You can have your App class derive from Configured, which is an implementation of the Configurable interface.
- All implementations of Tool need to implement Configurable (since Tool extends it)
- and subclassing Configured is often the easiest way to achieve this.
- ToolRunner.run() method takes care of creating a Configuration object for the Tool before calling its run() method.
- ToolRunner also uses a GenericOptionsParser to pick up any standard options specified on the command line and to set them on the Configuration instance.
- e.g., -conf <conf file> adds the given file to the list of configuration resources.
- GenericOptionsParser also allows you to set individual properties.
- hadoop ConfigurationPrinter -D color=yellow | grep color
- The -D option is used to set the configuration property with key color to the value yellow.
- Options specified with -D take priority over properties from the configuration files.
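A minimal sketch of the whole pattern, using a ConfigurationPrinter reduced to printing a single property (the exact output format is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Subclassing Configured satisfies Tool's Configurable requirement.
public class ConfigurationPrinter extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // getConf() returns the Configuration that ToolRunner created and
    // that GenericOptionsParser populated from -conf/-D options.
    Configuration conf = getConf();
    System.out.println("color=" + conf.get("color", "unset"));
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses the generic options before calling run().
    int exitCode = ToolRunner.run(new ConfigurationPrinter(), args);
    System.exit(exitCode);
  }
}
```

Invoked as `hadoop ConfigurationPrinter -D color=yellow`, the -D option lands in the Configuration before run() is called.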
InputSampler, Sampler
- The InputSampler class defines a nested Sampler interface whose implementations return a sample of keys given an InputFormat and Job
- This interface usually is not called directly by clients. Instead, the writePartitionFile() static method on InputSampler is used, which creates a sequence file to store the keys that define the partitions.
- The sequence file is used by TotalOrderPartitioner to create partitions for the sort job.
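A hedged sketch of how these pieces are wired into a total-sort job's configuration (the sampler parameters and key type are illustrative):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSortSketch {
  public static void configure(Job job) throws Exception {
    job.setPartitionerClass(TotalOrderPartitioner.class);

    // Sample ~10% of records, up to 10000 keys from at most 10 splits
    // (illustrative parameters).
    InputSampler.Sampler<IntWritable, Object> sampler =
        new InputSampler.RandomSampler<>(0.1, 10000, 10);

    // Write the sampled partition boundaries to a sequence file...
    InputSampler.writePartitionFile(job, sampler);

    // ...and ship that file to the tasks so TotalOrderPartitioner can read it.
    Path partitionFile = new Path(
        TotalOrderPartitioner.getPartitionFile(job.getConfiguration()));
    job.addCacheFile(partitionFile.toUri());
  }
}
```

This is a job-configuration fragment: it assumes the job's input format, paths, and key type are already set elsewhere before configure() runs.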