River IQ

Sqoop import to Text, Avro, Parquet, Sequence

Ashish Kumar | Sqoop | January 27, 2019

In my previous article I explained how to sqoop data into an Avro file, the kinds of errors that can come up, and how to resolve them.

Here I am going to show how to sqoop import into multiple file formats and build tables on top of the results.

Sqoop can load data into many file formats, but it supports direct import for only four of them:


File Format        Argument             Description
Avro Data Files    --as-avrodatafile    Imports data to Avro data files
SequenceFiles      --as-sequencefile    Imports data to SequenceFiles
Text file          --as-textfile        Imports data as plain text (default)
Parquet Files      --as-parquetfile     Imports data to Parquet files
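These four arguments are easy to drive from a small lookup table when scripting imports. A quick Python sketch (the helper and argument list here are illustrative, not part of Sqoop itself):

```python
# Map each supported file format to its Sqoop import argument.
FORMAT_FLAGS = {
    "text": "--as-textfile",
    "avro": "--as-avrodatafile",
    "sequence": "--as-sequencefile",
    "parquet": "--as-parquetfile",
}

def build_import_args(fmt, target_dir, mappers=4):
    """Assemble the format-specific tail of a sqoop import command line."""
    if fmt not in FORMAT_FLAGS:
        raise ValueError("unsupported format: " + fmt)
    return [
        "import",
        "--num-mappers", str(mappers),
        FORMAT_FLAGS[fmt],
        "--target-dir", target_dir,
    ]

print(build_import_args("avro", "/user/riveriq/sqoop/Employee"))
# ['import', '--num-mappers', '4', '--as-avrodatafile', '--target-dir', '/user/riveriq/sqoop/Employee']
```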


Sqoop data in text file format

sqoop import -Dmapreduce.job.queuename=TEST \
  -Dfs.permissions.umask-mode=007 \
  --connect "jdbc:sqlserver://127.0.0.1:1433;databaseName=riveriq_db" \
  --username riveriq \
  --password ******* \
  --query 'select * from riveriq_db.dbo.Employee where $CONDITIONS' \
  --num-mappers 4 \
  --as-textfile \
  --split-by EmpID \
  --outdir /data/home/riveriq/sqoop/Employee \
  --delete-target-dir \
  --target-dir /user/riveriq/sqoop/Employee \
  --fields-terminated-by '|' \
  --lines-terminated-by '\n' \
  --enclosed-by '"' \
  --null-string 'N' \
  --null-non-string 'N'

 

bash-4.2$ hdfs dfs -ls /user/riveriq/sqoop/Employee
Found 5 items
-rw-rw----+  3 riveriq domian.user          0 2019-01-25 18:16 /user/riveriq/sqoop/Employee/_SUCCESS
-rw-rw----+  3 riveriq domian.user    5325308 2019-01-25 18:16 /user/riveriq/sqoop/Employee/part-m-00000.deflate
-rw-rw----+  3 riveriq domian.user    4154694 2019-01-25 18:16 /user/riveriq/sqoop/Employee/part-m-00001.deflate
-rw-rw----+  3 riveriq domian.user    4592169 2019-01-25 18:16 /user/riveriq/sqoop/Employee/part-m-00002.deflate
-rw-rw----+  3 riveriq domian.user    4420824 2019-01-25 18:16 /user/riveriq/sqoop/Employee/part-m-00003.deflate

 

The part files carry a .deflate extension because output compression is enabled on this cluster, so cat prints raw compressed bytes:

bash-4.2$ hdfs dfs -cat /user/riveriq/sqoop/Employee/part-m-00000.deflate | head
x▒▒}ْ [▒▒1q'▒▒▒׿;▒▒▒▒▒_▒▒▒֟ P2fjVU<▒▒ E▒▒▒▒8A▒▒▒▒▒▒▒▒~▒▒▒▒▒▒%▒▒V▒▒▒▒?

 

hdfs dfs -text, on the other hand, decompresses the records into readable form:

bash-4.2$ hdfs dfs -text /user/riveriq/sqoop/Employee/part-m-00000.deflate | head
19/01/25 18:31:39 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
19/01/25 18:31:39 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
"1"|"1"|"13244"|"2007-05-01 00:00:00.0"|"247735.0000"|"10.000000000"|"52"|"8"|"10.0000"|"0.0000"|"0.0000"|"1"|"0"|"1"|"0"|"0"
"2"|"2"|"13244"|"2007-05-01 00:00:00.0"|"23432.0000"|"10.000000000"|"52"|"223"|"20.9800"|"0.0000"|"0.0000"|"1"|"0"|"1"|"0"|"0"
"3"|"3"|"13244"|"2007-05-01 00:00:00.0"|"324.3600"|"10.000000000"|"52"|"189"|"11.5100"|"0.0000"|"0.0000"|"1"|"0"|"1"|"0"|"0"
"4"|"4"|"13244"|"2007-05-01 00:00:00.0"|"73243.5100"|"10.000000000"|"52"|"82"|"10.0000"|"0.0000"|"0.0000"|"1"|"0"|"1"|"0"|"0"

 

To pull a decompressed copy down to the local file system:

hadoop fs -text hdfs_path > local_file.txt
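Once a decompressed copy is local, the records parse cleanly with Python's standard csv module. The delimiter, quote character, and 'N' null token below mirror the --fields-terminated-by, --enclosed-by, and --null-string options used in the import above (the sample line is shortened for illustration):

```python
import csv
import io

def parse_sqoop_text_line(line):
    """Parse one enclosed, pipe-delimited Sqoop record; map the 'N' null token to None."""
    row = next(csv.reader(io.StringIO(line), delimiter="|", quotechar='"'))
    return [None if field == "N" else field for field in row]

sample = '"1"|"1"|"13244"|"2007-05-01 00:00:00.0"|"N"\n'
print(parse_sqoop_text_line(sample))
# ['1', '1', '13244', '2007-05-01 00:00:00.0', None]
```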

 

Sqoop data in Avro file format

sqoop import -Dmapreduce.job.classloader=true \
  -Dmapreduce.job.user.classpath.first=true \
  -Dmapreduce.job.queuename=TEST \
  -Dfs.permissions.umask-mode=007 \
  --connect "jdbc:sqlserver://127.0.0.1:1433;databaseName=riveriq_db" \
  --username riveriq \
  --password ******* \
  --query 'select * from riveriq_db.dbo.Employee where $CONDITIONS' \
  --num-mappers 4 \
  --as-avrodatafile \
  --split-by EmpID \
  --outdir /data/home/riveriq/sqoop/Employee \
  --delete-target-dir \
  --target-dir /user/riveriq/sqoop/Employee \
  --null-string 'N' \
  --null-non-string 'N'

 

bash-4.2$ cd /data/home/riveriq/sqoop
bash-4.2$ ls -lh
-rw-rw-r-- 1 riveriq domain.users 2.5K Jan 23 20:58 AutoGeneratedSchema.avsc
-rw-rw-r-- 1 riveriq domain.users  47K Jan 23 20:58 QueryResult.java

 

bash-4.2$ hdfs dfs -ls /user/riveriq/sqoop/Employee
Found 5 items
-rw-r--r--   3 riveriq domian.user          0 2019-01-23 21:18 /user/riveriq/sqoop/Employee/_SUCCESS
-rw-r--r--   3 riveriq domian.user    5216053 2019-01-23 21:17 /user/riveriq/sqoop/Employee/part-m-00000.avro
-rw-r--r--   3 riveriq domian.user    4044383 2019-01-23 21:17 /user/riveriq/sqoop/Employee/part-m-00001.avro
-rw-r--r--   3 riveriq domian.user    4509544 2019-01-23 21:17 /user/riveriq/sqoop/Employee/part-m-00002.avro
-rw-r--r--   3 riveriq domian.user    4331419 2019-01-23 21:18 /user/riveriq/sqoop/Employee/part-m-00003.avro

 

bash-4.2$ cat /data/home/riveriq/sqoop/AutoGeneratedSchema.avsc
{
  "type" : "record",
  "name" : "AutoGeneratedSchema",
  "doc" : "Sqoop import of QueryResult",
  "fields" : [ {
    "name" : "EmpID",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "EmpID",
    "sqlType" : "4"
  }, {
    "name" : "NAME",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "NAME",
    "sqlType" : "4"
  },
  ... ]
}

 

bash-4.2$ hdfs dfs -cat /user/riveriq/sqoop/Employee/part-m-00000.avro | head

Objavro.schema"type":"record","name":"AutoGeneratedSchema","doc":"Sqoop import of QueryResult","fields":[{"name":"EmpID","type":["null","int"],"default":null,"columnName":"EmpID","sqlType":"4"},{"name":"NAME","type":["null","string"],"default":null,"columnName":"NAME","sqlType":"4"}],"tableName":"QueryResult"}avro.codecdeflateul>~9"8!,▒▒

▒▒▒}oLdU▒▒5▒▒ɋt▒▒L&/7o▒▒▒F▒▒N:▒▒▒▒▒▒▒e2y▒▒▒W▒▒▒y▒▒▒▒t▒▒▒a>

2

" 25▒▒▒

""H▒▒▒2▒▒▒▒▒e

"b

 

Sqoop data in SequenceFile format

sqoop import -Dmapreduce.job.classloader=true \
  -Dmapreduce.job.user.classpath.first=true \
  -Dmapreduce.job.queuename=TEST \
  --connect "jdbc:sqlserver://127.0.0.1:1433;databaseName=riveriq_db" \
  --username riveriq \
  --password ******* \
  --query 'select * from riveriq_db.dbo.Employee where $CONDITIONS' \
  --num-mappers 4 \
  --as-sequencefile \
  --split-by EmpID \
  --outdir /data/home/riveriq/sqoop \
  --delete-target-dir \
  --target-dir /user/riveriq/sqoop/Employee \
  --fields-terminated-by '|' \
  --lines-terminated-by '\n' \
  --null-string 'N' \
  --null-non-string 'N'

 

bash-4.2$ hdfs dfs -ls /user/riveriq/sqoop/Employee
Found 5 items
-rw-r--r--   3 riveriq domian.user          0 2019-01-23 21:32 /user/riveriq/sqoop/Employee/_SUCCESS
-rw-r--r--   3 riveriq domian.user    6630294 2019-01-23 21:32 /user/riveriq/sqoop/Employee/part-m-00000
-rw-r--r--   3 riveriq domian.user    5128323 2019-01-23 21:32 /user/riveriq/sqoop/Employee/part-m-00001
-rw-r--r--   3 riveriq domian.user    5645723 2019-01-23 21:32 /user/riveriq/sqoop/Employee/part-m-00002
-rw-r--r--   3 riveriq domian.user    5407474 2019-01-23 21:32 /user/riveriq/sqoop/Employee/part-m-00003

 

bash-4.2$ hdfs dfs -cat /user/riveriq/sqoop/Employee/part-m-00000 | head

SEQ!org.apache.hadoop.io.LongWritable

                                     QueryResult*org.apache.hadoop.io.compress.DefaultCodec▒▒f3ͦn▒▒▒4}(▒▒▒▒▒▒▒f3ͦn▒▒▒4}(x▒▒▒! ▒▒▒▒(jxUS$A▒▒m▒▒▒

ض7▒▒m{c▒▒ض7▒▒m;▒▒y▒▒▒▒tW=33▒կ^DS4BK▒▒▒F[נFuY]u]=▒▒[▒▒  ▒▒'▒▒$I▒▒'O▒▒R?~j4i▒▒▒▒ϠQ?~f,Y▒▒gϡS?~n<y▒▒▒▒▒/_P▒▒▒▒/_D

▒▒▒-
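The SEQ prefix in the dump is the SequenceFile magic: bytes 0-2 are the literal 'SEQ' and byte 3 is the format version, followed by the key and value class names. A minimal header sniff in Python (a sanity check only, not a reader):

```python
def sniff_sequencefile(header):
    """Return the SequenceFile format version if the magic matches, else None."""
    if len(header) >= 4 and header[:3] == b"SEQ":
        return header[3]
    return None

# Byte 3 is the version (6 for current Hadoop SequenceFiles); it is a
# non-printable byte, which is why the dump above shows 'SEQ' running
# straight into the key class name.
print(sniff_sequencefile(b"SEQ\x06!org.apache.hadoop.io.LongWritable"))  # 6
```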

 

Sqoop data in Parquet file format

sqoop import -Dmapreduce.job.queuename=TEST \
  -Dfs.permissions.umask-mode=007 \
  --connect "jdbc:sqlserver://127.0.0.1:1433;databaseName=riveriq_db" \
  --username riveriq \
  --password ******* \
  --query 'select * from riveriq_db.dbo.Employee where $CONDITIONS' \
  --num-mappers 4 \
  --as-parquetfile \
  --split-by EmpID \
  --outdir /data/home/riveriq/sqoop/Employee \
  --delete-target-dir \
  --target-dir /user/riveriq/sqoop/Employee \
  --null-string 'N' \
  --null-non-string 'N'

 

bash-4.2$ hdfs dfs -ls /user/riveriq/sqoop/Employee
Found 6 items
drwxrwx---+  - riveriq domian.user          0 2019-01-25 18:00 /user/riveriq/sqoop/Employee/.metadata
drwxrwx---+  - riveriq domian.user          0 2019-01-25 18:01 /user/riveriq/sqoop/Employee/.signals
-rw-rw----+  3 riveriq domian.user    4281551 2019-01-25 18:01 /user/riveriq/sqoop/Employee/01fca637-faef-4b6d-ade5-74d6949ad4dd.parquet
-rw-rw----+  3 riveriq domian.user    4577423 2019-01-25 18:01 /user/riveriq/sqoop/Employee/07d8e6a2-5534-4f6e-a2e2-4268865808c7.parquet
-rw-rw----+  3 riveriq domian.user    4540697 2019-01-25 18:01 /user/riveriq/sqoop/Employee/74bf7849-801c-466e-b3b0-06d3f430f29b.parquet
-rw-rw----+  3 riveriq domian.user    5410166 2019-01-25 18:01 /user/riveriq/sqoop/Employee/c4a094ac-cd97-4b09-8078-01a85fc3a682.parquet

 

hdfs dfs -ls /user/riveriq/sqoop/Employee/.metadata
Found 3 items
-rw-rw----+  3 riveriq domian.user        195 2019-01-27 01:20 /user/riveriq/sqoop/Employee/.metadata/descriptor.properties
-rw-rw----+  3 riveriq domian.user       2486 2019-01-27 01:20 /user/riveriq/sqoop/Employee/.metadata/schema.avsc
drwxrwx---+  - riveriq domian.user          0 2019-01-27 01:20 /user/riveriq/sqoop/Employee/.metadata/schemas

bash-4.2$ hdfs dfs -cat /user/riveriq/sqoop/Employee/01fca637-faef-4b6d-ade5-74d6949ad4dd.parquet | head

▒▒▒▒<▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒opqrstuvwxyz{|}~▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒     

▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒  
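Parquet is likewise a binary container: every file begins and ends with the 4-byte magic PAR1, with the footer metadata sitting just before the trailing magic. A quick structural sniff (a framing check only, not a real reader):

```python
import pathlib
import tempfile

def looks_like_parquet(path):
    """Check the leading and trailing PAR1 magic of a Parquet file."""
    data = pathlib.Path(path).read_bytes()
    return len(data) >= 12 and data[:4] == b"PAR1" and data[-4:] == b"PAR1"

# Stand-in file with Parquet framing (the body here is not real Parquet data).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"PAR1" + b"\x00" * 16 + b"PAR1")
print(looks_like_parquet(f.name))  # True
```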

 

Hive Table on TextFile Data

CREATE EXTERNAL TABLE IF NOT EXISTS Text_Employee(
  EmpID int,
  NAME string,
  JOB_TITLE string,
  ANNUAL_SALARY int,
  CITY string,
  STATE string,
  COUNTRY int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/riveriq/sqoop/Employee';

 

Hive Table on Avro Data

Copy the generated AutoGeneratedSchema.avsc from the --outdir location up to HDFS first, so Hive can resolve avro.schema.url:

CREATE EXTERNAL TABLE Avro_Employee
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/riveriq/sqoop/Employee'
TBLPROPERTIES
('avro.schema.url'='hdfs:///user/riveriq/sqoop/schema/Employee/AutoGeneratedSchema.avsc');

 

Hive Table on Parquet Data

CREATE EXTERNAL TABLE IF NOT EXISTS Prqt_Employee(
  EmpID int,
  NAME string,
  JOB_TITLE string,
  ANNUAL_SALARY int,
  CITY string,
  STATE string,
  COUNTRY int)
STORED AS PARQUET
LOCATION '/user/riveriq/sqoop/Employee';

 

Hive Table on SequenceFile Data

CREATE EXTERNAL TABLE IF NOT EXISTS Seq_Employee(
  EmpID int,
  NAME string,
  JOB_TITLE string,
  ANNUAL_SALARY int,
  CITY string,
  STATE string,
  COUNTRY int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION '/user/riveriq/sqoop/Employee';


