12.5.15

Four-step Strategy for Incremental Updates in Apache Hive

Four-step Strategy for Incremental Updates in Apache Hive

Sqoop Commands

$ sqoop tool-name [tool-arguments]
$ sqoop help
$ HADOOP_HOME=/path/to/some/hadoop sqoop import --arguments...
$ export HADOOP_HOME=/some/path/to/hadoop
$ sqoop import --arguments...
$ sqoop help import
$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST
$ sqoop --options-file /users/homer/work/import.txt --table TEST
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)
$ sqoop import --connect jdbc:mysql://database.example.com/employees
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --username aaron --password 12345
$ sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver \
    --connect <connect-string> ...
$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  --split-by a.id --target-dir /user/foo/joinresults
$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  -m 1 --target-dir /user/foo/joinresults
$ sqoop import --connnect <connect-str> --table foo --warehouse-dir /shared \
    ...
$ sqoop import --connnect <connect-str> --table foo --target-dir /dest \
    ...
$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
    --direct -- --default-character-set=latin1
$ sqoop import ... --map-column-java id=String,value=Integer
$ sqoop import --fields-terminated-by , --escaped-by \\ --enclosed-by '\"' ...
$ sqoop import --optionally-enclosed-by '\"' (the rest as above)...
$ sqoop import --connect <connect-str> --table SomeTable --package-name com.foocorp
$ sqoop import --table SomeTable --jar-file mydatatypes.jar \
    --class-name SomeTableType
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --username SomeUser -P
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --columns "employee_id,first_name,last_name,job_title"
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    -m 8
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --direct
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --class-name com.foocorp.Employee --as-sequencefile
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --fields-terminated-by '\t' --lines-terminated-by '\n' \
    --optionally-enclosed-by '\"'
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hive-import
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --where "start_date > '2010-01-01'"
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --split-by dept_id
$ hadoop fs -ls EMPLOYEES
$ hadoop fs -cat EMPLOYEES/part-m-00000 | head -n 10
$ sqoop import --connect jdbc:mysql://db.foo.com/somedb --table sometable \
    --where "id > 100000" --target-dir /incremental_dataset --append
$ sqoop import-all-tables (generic-args) (import-args)
$ sqoop-import-all-tables (generic-args) (import-args)
$ sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp
$ hadoop fs -ls
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar  \
    --export-dir /results/bar_data
$ sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
$ sqoop-job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
$ sqoop job --create myjob -- import --connect jdbc:mysql://example.com/db \
    --table mytable
$ sqoop job --list
$ sqoop job --show myjob
$ sqoop job --exec myjob
$ sqoop job --exec myjob -- --username someuser -P
$ sqoop metastore (generic-args) (metastore-args)
$ sqoop-metastore (generic-args) (metastore-args)
$ sqoop merge (generic-args) (merge-args)
$ sqoop-merge (generic-args) (merge-args)
$ sqoop merge --new-data newer --onto older --target-dir merged \
    --jar-file datatypes.jar --class-name Foo --merge-key id
$ sqoop codegen (generic-args) (codegen-args)
$ sqoop-codegen (generic-args) (codegen-args)
$ sqoop codegen --connect jdbc:mysql://db.example.com/corp \
    --table employees
$ sqoop create-hive-table (generic-args) (create-hive-table-args)
$ sqoop-create-hive-table (generic-args) (create-hive-table-args)
$ sqoop create-hive-table --connect jdbc:mysql://db.example.com/corp \
    --table employees --hive-table emps
$ sqoop eval (generic-args) (eval-args)
$ sqoop-eval (generic-args) (eval-args)
$ sqoop eval --connect jdbc:mysql://db.example.com/corp \
$ sqoop eval --connect jdbc:mysql://db.example.com/corp \
    -e "INSERT INTO foo VALUES(42, 'bar')"
$ sqoop list-databases (generic-args) (list-databases-args)
$ sqoop-list-databases (generic-args) (list-databases-args)
$ sqoop list-databases --connect jdbc:mysql://database.example.com/
$ sqoop list-tables (generic-args) (list-tables-args)
$ sqoop-list-tables (generic-args) (list-tables-args)
$ sqoop list-tables --connect jdbc:mysql://database.example.com/corp
$ sqoop help [tool-name]
$ sqoop-help [tool-name]
$ sqoop help
$ bin/sqoop help import
$ sqoop version
$ sqoop-version
$ sqoop version
$ sqoop import --table foo \
    --connect jdbc:mysql://db.example.com/someDb?zeroDateTimeBehavior=round
$ sqoop import -D oracle.sessionTimeZone=America/Los_Angeles \
    --connect jdbc:oracle:thin:@//db.example.com/foo --table bar

Efficiency Comparison of All Possible Ways to Dump Database (mysql) into HDFS


  • sqoop + mysql connector
    • It use mysqldump command to do the job, which makes it the most efficient solution.
  • sqoop + mysql jdbc driver
  • MapReduce + jdbc driver
    • It is good but less efficient than sqoop + mysql jdbc driver because sqoop will optimize the job.
  • mysqldump to local file system than copy to hdfs
    • overhead of storing the data twice.

Implicities in Hadoop Jobs

1. If no mapper is set, IdentityMapper will be used, or Mapper will be used since version <version?>.
2. If IdentityMapper is used and the job doesn't set up mapper output classes, classes used for output of reducers will be reused for mapper's output, which can cause problem -> IOException instead of ClassCastException:

MapTask.class (2.6.0)


    public synchronized void collect(K key, V value, final int partition
                                     ) throws IOException {
      reporter.progress();
      if (key.getClass() != keyClass) {
        throw new IOException("Type mismatch in key from map: expected "
                              + keyClass.getName() + ", received "
                              + key.getClass().getName());
      }
      if (value.getClass() != valClass) {
        throw new IOException("Type mismatch in value from map: expected "
                              + valClass.getName() + ", received "
                              + value.getClass().getName());
      }
      if (partition < 0 || partition >= partitions) {
        throw new IOException("Illegal partition for " + key + " (" +
            partition + ")");
      }



3. Default file location for Hive DB and Tables:

/user/hive/warehouse/<db>.db/<table>

Need to remember


  • How do you configure a MapReduce job so that a single map task processes each input file regardless of how many blocks the input file occupies?
    • Write a custom FileInputFormat and override the method isSplitable to always return false.
      • The isSplitable() method in your InputFormat is passed each filename; if it returns true then the file can be broken up and processed by multiple Mappers. If it returns false then the file is considered to be 'not splittable' - that is, the entire file must be processed by a single Mapper.
  • To make sure jar files other than the one with the Driver Class gets distributed to all nodes in the cluster, the hadoop command should be:
    • % hadoop jar job.jar MyDriver -libjar ex1.jar:ex2.jar
    • Just to remember <hadoop jar job.jar MyDriver> is the main part.

  • InputFormat
    • SequenceFileInputFormat
    • SequenceFileInputFormat
    • TextInputFormat
    • ObjectPositionInputFormat
    • FileInputFormat
  • ToolRunner
  • LocalJobRunner
  • job.setNumReduceTasks(8);job.setPartitionerClass(MyPartitioner.class);
  • Write a custom FileInputFormat and override the method isSplitable to always return false
  • MRUnit 
  • hadoop fs -setrep 4 f1
    • hadoop fs -Ddfs.replication=4 -cp f1 f1.tmp; hadoop fs -rm f1; hadoop fs -mv f1.tmp f1
  • How are keys and values presented and passed to the reduce() method during a standard shuffle and sort phase of MapReduce?
  • Does the MapReduce programming model provide a way for reduce tasks to communicate with each other?
  • Speculative Execution
  • WritableComparable
  • Workflow of Oozie
  • A single map task processes <>?
  • Hive table field delimiters
  • When testing a Reducer using MRUnit, you should only pass the Reducer a single key and list of values. In this case, we use the withInput() method twice, but only the second call will actually be used -- the first will be overridden by the second. If you want to test the Reducer with two inputs, you would have to write two tests.

sqoop 常用命令整理(一) - 岑玉海 - 博客园