Sqoop import to Text, Avro, Parquet, Sequence
In my previous article I explained how to sqoop data into an Avro file, the errors that can come up along the way, and how to resolve them.
Here I am going to show how to sqoop-import into multiple file formats and build Hive tables on top of the imported data.
Sqoop can load data into many layouts, but it supports direct import for four file formats:
File Format     | Argument           | Description
Avro Data Files | --as-avrodatafile  | Imports data to Avro Data Files
Sequence Files  | --as-sequencefile  | Imports data to SequenceFiles
Text file       | --as-textfile      | Imports data as plain text (default)
Parquet Files   | --as-parquetfile   | Imports data to Parquet Files
Sqoop Data in Text File Format
sqoop import -Dmapreduce.job.queuename=TEST \
  -Dfs.permissions.umask-mode=007 \
  --connect "jdbc:sqlserver://127.0.0.1:1433;databaseName=riveriq_db" \
  --username riveriq \
  --password ******* \
  --query 'select * from riveriq_db.dbo.Employee where $CONDITIONS' \
  --num-mappers 4 \
  --as-textfile \
  --split-by EmpID \
  --outdir /data/home/riveriq/sqoop/Employee \
  --delete-target-dir \
  --target-dir /user/riveriq/sqoop/Employee \
  --fields-terminated-by '|' \
  --lines-terminated-by '\n' \
  --enclosed-by '"' \
  --null-string 'N' \
  --null-non-string 'N'
bash-4.2$ hdfs dfs -ls /user/riveriq/sqoop/Employee
Found 5 items
-rw-rw----+  3 riveriq domain.users        0 2019-01-25 18:16 /user/riveriq/sqoop/Employee/_SUCCESS
-rw-rw----+  3 riveriq domain.users  5325308 2019-01-25 18:16 /user/riveriq/sqoop/Employee/part-m-00000.deflate
-rw-rw----+  3 riveriq domain.users  4154694 2019-01-25 18:16 /user/riveriq/sqoop/Employee/part-m-00001.deflate
-rw-rw----+  3 riveriq domain.users  4592169 2019-01-25 18:16 /user/riveriq/sqoop/Employee/part-m-00002.deflate
-rw-rw----+  3 riveriq domain.users  4420824 2019-01-25 18:16 /user/riveriq/sqoop/Employee/part-m-00003.deflate
bash-4.2$ hdfs dfs -cat /user/riveriq/sqoop/Employee/part-m-00000.deflate | head
x▒▒}ْ [▒▒1q'㇒▒▒▒;▒▒▒▒▒_▒▒▒֟
P▒2▒fj▒VU▒<▒▒
▒E▒▒▒▒8A▒▒▒▒▒▒▒▒~▒▒▒▒▒▒%▒▒핗▒V▒▒▒▒?
bash-4.2$ hdfs dfs -text /user/riveriq/sqoop/Employee/part-m-00000.deflate | head
19/01/25 18:31:39 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
19/01/25 18:31:39 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
"1"|"1"|"13244"|"2007-05-01 00:00:00.0"|"247735.0000"|"10.000000000"|"52"|"8"|"10.0000"|"0.0000"|"0.0000"|"1"|"0"|"1"|"0"|"0"
"2"|"2"|"13244"|"2007-05-01 00:00:00.0"|"23432.0000"|"10.000000000"|"52"|"223"|"20.9800"|"0.0000"|"0.0000"|"1"|"0"|"1"|"0"|"0"
"3"|"3"|"13244"|"2007-05-01 00:00:00.0"|"324.3600"|"10.000000000"|"52"|"189"|"11.5100"|"0.0000"|"0.0000"|"1"|"0"|"1"|"0"|"0"
"4"|"4"|"13244"|"2007-05-01 00:00:00.0"|"73243.5100"|"10.000000000"|"52"|"82"|"10.0000"|"0.0000"|"0.0000"|"1"|"0"|"1"|"0"|"0"
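The rows above are pipe-delimited, quote-enclosed, with a literal N standing in for NULLs, mirroring the --fields-terminated-by, --enclosed-by, --null-string and --null-non-string options. As a minimal sketch (the sample line here is made up for illustration), a downloaded line can be parsed back into fields with the Python standard library:

```python
import csv
import io

# One imported line in the same shape as the output above:
# '|' field terminator, '"' enclosing character, 'N' for NULL.
line = '"1"|"1"|"13244"|"N"'

row = next(csv.reader(io.StringIO(line), delimiter="|", quotechar='"'))

# Map the null placeholder back to a real None.
fields = [None if value == "N" else value for value in row]
print(fields)  # ['1', '1', '13244', None]
```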
To save a decompressed, readable copy on the local filesystem:
hadoop fs -text hdfs_path > local_file.txt
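Hadoop's DefaultCodec, which produced the .deflate parts above, writes standard zlib streams, so a part file copied to the local filesystem can also be decoded with Python's stdlib. A minimal sketch, assuming a single zlib stream per file (the path name is hypothetical; a stream that was reset mid-file may need to be decompressed in segments with zlib.decompressobj):

```python
import zlib

def read_deflate(path):
    """Decode a .deflate part file written by Hadoop's DefaultCodec.

    DefaultCodec emits a standard zlib-wrapped deflate stream, so
    zlib.decompress recovers the original pipe-delimited text.
    """
    with open(path, "rb") as f:
        return zlib.decompress(f.read()).decode("utf-8")
```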
Sqoop Data in Avro File Format
sqoop import -Dmapreduce.job.classloader=true \
  -Dmapreduce.job.user.classpath.first=true \
  -Dmapreduce.job.queuename=TEST \
  -Dfs.permissions.umask-mode=007 \
  --connect "jdbc:sqlserver://127.0.0.1:1433;databaseName=riveriq_db" \
  --username riveriq \
  --password ******* \
  --query 'select * from riveriq_db.dbo.Employee where $CONDITIONS' \
  --num-mappers 4 \
  --as-avrodatafile \
  --split-by EmpID \
  --outdir /data/home/riveriq/sqoop/Employee \
  --delete-target-dir \
  --target-dir /user/riveriq/sqoop/Employee \
  --null-string 'N' \
  --null-non-string 'N'
bash-4.2$ cd /data/home/riveriq/sqoop
bash-4.2$ ls -lh
-rw-rw-r-- 1 riveriq domain.users 2.5K Jan 23 20:58 AutoGeneratedSchema.avsc
-rw-rw-r-- 1 riveriq domain.users 47K Jan 23 20:58 QueryResult.java
bash-4.2$ hdfs dfs -ls /user/riveriq/sqoop/Employee
Found 5 items
-rw-r--r--  3 riveriq domain.users        0 2019-01-23 21:18 /user/riveriq/sqoop/Employee/_SUCCESS
-rw-r--r--  3 riveriq domain.users  5216053 2019-01-23 21:17 /user/riveriq/sqoop/Employee/part-m-00000.avro
-rw-r--r--  3 riveriq domain.users  4044383 2019-01-23 21:17 /user/riveriq/sqoop/Employee/part-m-00001.avro
-rw-r--r--  3 riveriq domain.users  4509544 2019-01-23 21:17 /user/riveriq/sqoop/Employee/part-m-00002.avro
-rw-r--r--  3 riveriq domain.users  4331419 2019-01-23 21:18 /user/riveriq/sqoop/Employee/part-m-00003.avro
bash-4.2$ cat /data/home/riveriq/sqoop/AutoGeneratedSchema.avsc
{
"type" : "record",
"name" : "AutoGeneratedSchema",
"doc" : "Sqoop import of QueryResult",
"fields" : [ {
"name" : "EmpID",
"type" : [ "null", "int" ],
"default" : null,
"columnName" : "EmpID",
"sqlType" : "4"
}, {
"name" : "NAME",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "NAME",
"sqlType" : "4"
}
bash-4.2$ hdfs dfs -cat /user/riveriq/sqoop/Employee/part-m-00000.avro | head
Objavro.schema▒"type":"record","name":"AutoGeneratedSchema","doc":"Sqoop
import of QueryResult","fields":[{"name":"EmpID","type":["null","int"],"default":null,"columnName":"EmpID","sqlType":"4"},{"name":"NAME","type":["null","string"],"default":null,"columnName":"NAME","sqlType":"4"}],"tableName":"QueryResult"}avro.codecdeflate▒u▒l>~▒9"▒8!,▒▒
▒▒▒}oLdU▒▒5▒▒ɋ▒t▒▒L&/7o▒▒▒F▒▒N▒:▒▒▒▒▒▒▒e2y▒▒▒W▒▒▒y▒▒▒▒t▒▒▒a▒>▒
▒
2▒
"▒ 25▒▒▒
""▒H▒▒▒2▒▒▒▒▒e▒
"b▒
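The AutoGeneratedSchema.avsc that Sqoop drops into --outdir (and embeds in each .avro file header, as the cat output above shows) is plain JSON, so its field list can be inspected with the standard library. A sketch using a trimmed-down copy of the schema shown earlier:

```python
import json

# A shortened copy of the schema Sqoop generated above; the real file
# carries one entry per source column.
schema_text = """
{ "type": "record",
  "name": "AutoGeneratedSchema",
  "doc": "Sqoop import of QueryResult",
  "fields": [
    {"name": "EmpID", "type": ["null", "int"],    "default": null},
    {"name": "NAME",  "type": ["null", "string"], "default": null}
  ] }
"""

schema = json.loads(schema_text)

# Every Sqoop column becomes a ["null", <type>] union so NULLs survive.
field_names = [f["name"] for f in schema["fields"]]
print(field_names)  # ['EmpID', 'NAME']
```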
Sqoop Data in Sequence File Format
sqoop import -Dmapreduce.job.classloader=true \
  -Dmapreduce.job.user.classpath.first=true \
  -Dmapreduce.job.queuename=TEST \
  --connect "jdbc:sqlserver://127.0.0.1:1433;databaseName=riveriq_db" \
  --username riveriq \
  --password ******* \
  --query 'select * from riveriq_db.dbo.Employee where $CONDITIONS' \
  --num-mappers 4 \
  --as-sequencefile \
  --split-by EmpID \
  --outdir /data/home/riveriq/sqoop \
  --delete-target-dir \
  --target-dir /user/riveriq/sqoop/Employee \
  --fields-terminated-by '|' \
  --lines-terminated-by '\n' \
  --null-string 'N' \
  --null-non-string 'N'
bash-4.2$ hdfs dfs -ls /user/riveriq/sqoop/Employee
Found 5 items
-rw-r--r--  3 riveriq domain.users        0 2019-01-23 21:32 /user/riveriq/sqoop/Employee/_SUCCESS
-rw-r--r--  3 riveriq domain.users  6630294 2019-01-23 21:32 /user/riveriq/sqoop/Employee/part-m-00000
-rw-r--r--  3 riveriq domain.users  5128323 2019-01-23 21:32 /user/riveriq/sqoop/Employee/part-m-00001
-rw-r--r--  3 riveriq domain.users  5645723 2019-01-23 21:32 /user/riveriq/sqoop/Employee/part-m-00002
-rw-r--r--  3 riveriq domain.users  5407474 2019-01-23 21:32 /user/riveriq/sqoop/Employee/part-m-00003
bash-4.2$ hdfs dfs -cat /user/riveriq/sqoop/Employee/part-m-00000 | head
SEQ!org.apache.hadoop.io.LongWritable
QueryResult*org.apache.hadoop.io.compress.DefaultCodec▒▒f▒3ͦn▒▒▒4}(▒▒▒▒▒▒▒f▒3ͦn▒▒▒4}(▒x▒▒▒!▒
▒▒▒▒(▒j▒^ݩ▒x▒U▒S▒$A▒▒m▒▒▒
▒ض7▒▒m{c▒▒ض7▒▒m;▒▒y▒▒▒▒tW=▒33▒կ^▒DS▒4▒BK▒▒▒F[▒נ▒Fu▒Y]u▒]=▒▒[▒▒ ▒▒'▒▒$▒I▒▒'▒O▒▒R?▒~j▒4▒i▒▒▒▒Ϡ▒Q?▒~f▒,▒Y▒▒g▒ϡ▒S?▒~n▒<▒y▒▒▒▒▒/▒_P▒▒▒▒/▒_D
▒▒▒-
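The SEQ prefix visible in the cat output is the SequenceFile magic number: the three bytes "SEQ" plus a one-byte format version, followed by the key and value class names (LongWritable and the Sqoop-generated QueryResult above). A quick sketch of sniffing a file header for that magic:

```python
def is_sequence_file(header: bytes) -> bool:
    # SequenceFiles begin with the 3-byte magic "SEQ" and a one-byte
    # version; the key/value class names and codec info follow, which
    # is exactly what hdfs dfs -cat printed before the binary payload.
    return len(header) >= 4 and header[:3] == b"SEQ"
```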
Sqoop Data in Parquet File Format
sqoop import -Dmapreduce.job.queuename=TEST \
  -Dfs.permissions.umask-mode=007 \
  --connect "jdbc:sqlserver://127.0.0.1:1433;databaseName=riveriq_db" \
  --username riveriq \
  --password ******* \
  --query 'select * from riveriq_db.dbo.Employee where $CONDITIONS' \
  --num-mappers 4 \
  --as-parquetfile \
  --split-by EmpID \
  --outdir /data/home/riveriq/sqoop/Employee \
  --delete-target-dir \
  --target-dir /user/riveriq/sqoop/Employee \
  --null-string 'N' \
  --null-non-string 'N'
bash-4.2$ hdfs dfs -ls /user/riveriq/sqoop/Employee
Found 6 items
drwxrwx---+  - riveriq domain.users        0 2019-01-25 18:00 /user/riveriq/sqoop/Employee/.metadata
drwxrwx---+  - riveriq domain.users        0 2019-01-25 18:01 /user/riveriq/sqoop/Employee/.signals
-rw-rw----+  3 riveriq domain.users  4281551 2019-01-25 18:01 /user/riveriq/sqoop/Employee/01fca637-faef-4b6d-ade5-74d6949ad4dd.parquet
-rw-rw----+  3 riveriq domain.users  4577423 2019-01-25 18:01 /user/riveriq/sqoop/Employee/07d8e6a2-5534-4f6e-a2e2-4268865808c7.parquet
-rw-rw----+  3 riveriq domain.users  4540697 2019-01-25 18:01 /user/riveriq/sqoop/Employee/74bf7849-801c-466e-b3b0-06d3f430f29b.parquet
-rw-rw----+  3 riveriq domain.users  5410166 2019-01-25 18:01 /user/riveriq/sqoop/Employee/c4a094ac-cd97-4b09-8078-01a85fc3a682.parquet
bash-4.2$ hdfs dfs -ls /user/riveriq/sqoop/Employee/.metadata
Found 3 items
-rw-rw----+  3 riveriq domain.users   195 2019-01-27 01:20 /user/riveriq/sqoop/Employee/.metadata/descriptor.properties
-rw-rw----+  3 riveriq domain.users  2486 2019-01-27 01:20 /user/riveriq/sqoop/Employee/.metadata/schema.avsc
drwxrwx---+  - riveriq domain.users     0 2019-01-27 01:20 /user/riveriq/sqoop/Employee/.metadata/schemas
bash-4.2$ hdfs dfs -cat /user/riveriq/sqoop/Employee/01fca637-faef-4b6d-ade5-74d6949ad4dd.parquet | head
▒▒▒▒<▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒o▒p▒q▒r▒s▒t▒u▒v▒w▒x▒y▒z▒{▒|▒}▒~▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ ▒
▒
▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ ▒!▒"▒#▒$▒%▒&▒'▒(▒)▒*▒+▒,▒-▒.▒/▒0▒1▒2▒3▒4▒5▒6▒7▒8▒9▒:▒;▒<▒=▒>▒?▒@▒A▒B▒C▒D▒E▒F▒G▒H▒I▒J▒K▒L▒M▒N▒O▒P▒Q▒R▒S▒T▒U▒V▒W▒X▒Y▒Z▒[▒▒]▒^▒_▒`▒a▒b▒c▒d▒e▒f▒g▒h▒i▒j▒k▒l▒m▒n▒o▒p▒q▒r▒s▒t▒u▒v▒w▒x▒y▒z▒{▒|▒}▒~▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ ▒
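Parquet files are likewise recognizable by magic bytes: the 4-byte marker PAR1 appears at both the start and the end of the file, with the footer metadata (schema and row-group offsets) stored just ahead of the trailing marker, which is why readers seek to the end first. A small sketch of that check:

```python
def looks_like_parquet(data: bytes) -> bool:
    # A Parquet file starts and ends with the magic "PAR1"; between
    # the trailing magic and the footer itself sits a 4-byte footer
    # length, so any valid file is at least 12 bytes.
    return len(data) >= 12 and data[:4] == b"PAR1" and data[-4:] == b"PAR1"
```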
Hive Table on TextFile Data
CREATE EXTERNAL TABLE if not exists Text_Employee(
EmpID int,
NAME string,
JOB_TITLE string,
ANNUAL_SALARY int,
CITY string,
STATE string,
COUNTRY int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
location '/user/riveriq/sqoop/Employee';
Note: since the text import above used --enclosed-by '"', the stored values carry the surrounding quotes; drop --enclosed-by, or use a CSV SerDe such as OpenCSVSerde, if you need clean column values.
Hive Table on Avro Data
Sqoop writes AutoGeneratedSchema.avsc to the local --outdir, so copy it to a location on HDFS (here /user/riveriq/sqoop/schema/Employee/) before pointing avro.schema.url at it.
CREATE EXTERNAL TABLE Avro_Employee
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/riveriq/sqoop/Employee'
TBLPROPERTIES
('avro.schema.url'='hdfs:///user/riveriq/sqoop/schema/Employee/AutoGeneratedSchema.avsc');
Hive Table on Parquet Data
CREATE EXTERNAL TABLE if not exists Prqt_Employee(
EmpID int,
NAME string,
JOB_TITLE string,
ANNUAL_SALARY int,
CITY string,
STATE string,
COUNTRY int)
STORED AS PARQUET
location '/user/riveriq/sqoop/Employee';
Hive Table on SequenceFile Data
CREATE EXTERNAL TABLE if not exists Seq_Employee(
EmpID int,
NAME string,
JOB_TITLE string,
ANNUAL_SALARY int,
CITY string,
STATE string,
COUNTRY int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION '/user/riveriq/sqoop/Employee';
Note: Sqoop stores each row as an instance of the generated QueryResult class in the sequence-file value (as the header in the cat output shows), so Hive needs that generated class on its classpath to deserialize the records.