Travel of Software Developer

12.5.15

Four-step Strategy for Incremental Updates in Apache Hive

Sqoop Commands

$ sqoop tool-name [tool-arguments]
$ sqoop help
$ HADOOP_HOME=/path/to/some/hadoop sqoop import --arguments...
$ export HADOOP_HOME=/some/path/to/hadoop
$ sqoop import --arguments...
$ sqoop help import
$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST
$ sqoop --options-file /users/homer/work/import.txt --table TEST
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)
$ sqoop import --connect jdbc:mysql://database.example.com/employees
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
--username aaron --password 12345
$ sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver \
--connect <connect-string> ...
$ sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
--split-by a.id --target-dir /user/foo/joinresults
$ sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
-m 1 --target-dir /user/foo/joinresults
$ sqoop import --connnect <connect-str> --table foo --warehouse-dir /shared \
...
$ sqoop import --connnect <connect-str> --table foo --target-dir /dest \
...
$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
--direct -- --default-character-set=latin1
$ sqoop import ... --map-column-java id=String,value=Integer
$ sqoop import --fields-terminated-by , --escaped-by \\ --enclosed-by '\"' ...
$ sqoop import --optionally-enclosed-by '\"' (the rest as above)...
$ sqoop import --connect <connect-str> --table SomeTable --package-name com.foocorp
$ sqoop import --table SomeTable --jar-file mydatatypes.jar \
--class-name SomeTableType
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--username SomeUser -P
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--columns "employee_id,first_name,last_name,job_title"
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
-m 8
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--direct
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--class-name com.foocorp.Employee --as-sequencefile
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--fields-terminated-by '\t' --lines-terminated-by '\n' \
--optionally-enclosed-by '\"'
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--hive-import
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--where "start_date > '2010-01-01'"
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--split-by dept_id
$ hadoop fs -ls EMPLOYEES
$ hadoop fs -cat EMPLOYEES/part-m-00000 | head -n 10
$ sqoop import --connect jdbc:mysql://db.foo.com/somedb --table sometable \
--where "id > 100000" --target-dir /incremental_dataset --append
$ sqoop import-all-tables (generic-args) (import-args)
$ sqoop-import-all-tables (generic-args) (import-args)
$ sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp
$ hadoop fs -ls
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
--export-dir /results/bar_data
$ sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
$ sqoop-job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
$ sqoop job --create myjob -- import --connect jdbc:mysql://example.com/db \
--table mytable
$ sqoop job --list
$ sqoop job --show myjob
$ sqoop job --exec myjob
$ sqoop job --exec myjob -- --username someuser -P
$ sqoop metastore (generic-args) (metastore-args)
$ sqoop-metastore (generic-args) (metastore-args)
$ sqoop merge (generic-args) (merge-args)
$ sqoop-merge (generic-args) (merge-args)
$ sqoop merge --new-data newer --onto older --target-dir merged \
--jar-file datatypes.jar --class-name Foo --merge-key id
$ sqoop codegen (generic-args) (codegen-args)
$ sqoop-codegen (generic-args) (codegen-args)
$ sqoop codegen --connect jdbc:mysql://db.example.com/corp \
--table employees
$ sqoop create-hive-table (generic-args) (create-hive-table-args)
$ sqoop-create-hive-table (generic-args) (create-hive-table-args)
$ sqoop create-hive-table --connect jdbc:mysql://db.example.com/corp \
--table employees --hive-table emps
$ sqoop eval (generic-args) (eval-args)
$ sqoop-eval (generic-args) (eval-args)
$ sqoop eval --connect jdbc:mysql://db.example.com/corp \
$ sqoop eval --connect jdbc:mysql://db.example.com/corp \
-e "INSERT INTO foo VALUES(42, 'bar')"
$ sqoop list-databases (generic-args) (list-databases-args)
$ sqoop-list-databases (generic-args) (list-databases-args)
$ sqoop list-databases --connect jdbc:mysql://database.example.com/
$ sqoop list-tables (generic-args) (list-tables-args)
$ sqoop-list-tables (generic-args) (list-tables-args)
$ sqoop list-tables --connect jdbc:mysql://database.example.com/corp
$ sqoop help [tool-name]
$ sqoop-help [tool-name]
$ sqoop help
$ bin/sqoop help import
$ sqoop version
$ sqoop-version
$ sqoop version
$ sqoop import --table foo \
--connect jdbc:mysql://db.example.com/someDb?zeroDateTimeBehavior=round
$ sqoop import -D oracle.sessionTimeZone=America/Los_Angeles \
--connect jdbc:oracle:thin:@//db.example.com/foo --table bar

Efficiency Comparison of All Possible Ways to Dump Database (mysql) into HDFS

sqoop + mysql connector

It use mysqldump command to do the job, which makes it the most efficient solution.

sqoop + mysql jdbc driver
MapReduce + jdbc driver

It is good but less efficient than sqoop + mysql jdbc driver because sqoop will optimize the job.

mysqldump to local file system than copy to hdfs

overhead of storing the data twice.

Implicities in Hadoop Jobs

1. If no mapper is set, IdentityMapper will be used, or Mapper will be used since version <version?>.
2. If IdentityMapper is used and the job doesn't set up mapper output classes, classes used for output of reducers will be reused for mapper's output, which can cause problem -> IOException instead of ClassCastException:

MapTask.class (2.6.0)

    public synchronized void collect(K key, V value, final int partition
                                     ) throws IOException {
      reporter.progress();
      if (key.getClass() != keyClass) {
        throw new IOException("Type mismatch in key from map: expected "
                              + keyClass.getName() + ", received "
                              + key.getClass().getName());
      }
      if (value.getClass() != valClass) {
        throw new IOException("Type mismatch in value from map: expected "
                              + valClass.getName() + ", received "
                              + value.getClass().getName());
      }
      if (partition < 0 || partition >= partitions) {
        throw new IOException("Illegal partition for " + key + " (" +
            partition + ")");
      }

3. Default file location for Hive DB and Tables:

/user/hive/warehouse/<db>.db/<table>

Need to remember

How do you configure a MapReduce job so that a single map task processes each input file regardless of how many blocks the input file occupies?

Write a custom FileInputFormat and override the method isSplitable to always return false.

The isSplitable() method in your InputFormat is passed each filename; if it returns true then the file can be broken up and processed by multiple Mappers. If it returns false then the file is considered to be 'not splittable' - that is, the entire file must be processed by a single Mapper.

To make sure jar files other than the one with the Driver Class gets distributed to all nodes in the cluster, the hadoop command should be:

% hadoop jar job.jar MyDriver -libjar ex1.jar:ex2.jar
Just to remember <hadoop jar job.jar MyDriver> is the main part.

InputFormat

SequenceFileInputFormat
SequenceFileInputFormat
TextInputFormat
ObjectPositionInputFormat
FileInputFormat

ToolRunner
LocalJobRunner
job.setNumReduceTasks(8);job.setPartitionerClass(MyPartitioner.class);
Write a custom FileInputFormat and override the method isSplitable to always return false
MRUnit
hadoop fs -setrep 4 f1

hadoop fs -Ddfs.replication=4 -cp f1 f1.tmp; hadoop fs -rm f1; hadoop fs -mv f1.tmp f1

How are keys and values presented and passed to the reduce() method during a standard shuffle and sort phase of MapReduce?
Does the MapReduce programming model provide a way for reduce tasks to communicate with each other?
Speculative Execution
WritableComparable
Workflow of Oozie
A single map task processes <>?
Hive table field delimiters
When testing a Reducer using MRUnit, you should only pass the Reducer a single key and list of values. In this case, we use the withInput() method twice, but only the second call will actually be used -- the first will be overridden by the second. If you want to test the Reducer with two inputs, you would have to write two tests.