Sqoop Import into Avro Files
Today I will show you how we can sqoop data into the Avro file format.
Yeah, we know it sounds very simple: just add --as-avrodatafile to your sqoop import command, as per the Apache documentation. But in real life, do all documented commands work as simply as written?
Definitely not…
And it happened the same way for me… so no worries, here I'm going to show you all the probable issues you can face, how to debug them, and the resolution for each. If you run into a different issue, please comment. We will try to solve it together.
But before getting to those issues, let's see what Avro is and why we would want to sqoop data into Avro…
What is AVRO?
An .avro file is created by the Apache Avro data serialization system.
- It stores the data in a .avro file in a compact binary format and the schema in a .avsc file in JSON format (see the sample schema right after this list).
- It's very useful for exchanging extremely large datasets, as you can split the data into multiple .avro part files.
- It is a language-neutral data serialization system, but it is schema based.
- It is quite useful when you need to deal with a dynamic schema, e.g. if you want to be unaffected by changes in the source schema over time and still want to access older ingested data even after a schema change.
- It can handle schema changes like missing fields, newly added fields and changed fields.
- It can handle line breaks (newline characters embedded in column values), and that's awesome if you ever had to deal with them in a .csv file.
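To make the schema part concrete, here is a minimal sketch of what an .avsc file can look like. The record name matches the emp_details table used later in this post, but the columns and types are just assumptions for illustration:

# create a sample Avro schema file (column names/types are made up for illustration)
cat > emp_details.avsc <<'EOF'
{
  "type": "record",
  "name": "emp_details",
  "fields": [
    {"name": "empid",   "type": ["null", "int"],    "default": null},
    {"name": "empname", "type": ["null", "string"], "default": null}
  ]
}
EOF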
Now I guess you must be thinking: why sqoop to Avro and not to Parquet…
As we know, in open source there are several ways and tools to do a single thing, but we find the best one by considering all the pros and cons together with our requirements…
Here too we must select the best file format based on our needs.
Before comparing Avro and Parquet, here are a few points about Parquet.
- It is columnar storage of data, compared to Avro's row-based format.
- It is self-describing: the schema, including complex nested structures, is stored within the data file.
- It is good for read-intensive use cases, as it supports indexing.
- Parquet and Avro both support schema evolution (changing the schema of the data over time), but comparatively, Avro provides much richer schema evolution.
- Parquet gives better compression ratios compared to Avro.
Now here are the key factors for choosing Avro over Parquet.
- Avro is write-intensive where Parquet is read-intensive. So, for a sqoop import (a write-heavy workload), I would prefer write-intensive Avro.
- Avro provides much richer schema evolution.
- It can handle line breaks (newlines embedded in column values).
Now let's look at the sqoop import and the related issues and resolutions.
Environment
- HDP 3.0.1
- Sqoop version: 1.4.7
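If you want to double-check the versions on your own cluster, these commands should do it (hdp-select is assumed to be present, as it normally is on HDP nodes):

sqoop version          # prints the Sqoop client version
hdp-select versions    # lists the installed HDP stack version(s)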
Sqoop Import Command
Here I will show you the step-wise execution with the related errors and resolutions.
Step 1:
sqoop import -Dmapreduce.job.queuename=xyz \
--connect "jdbc:sqlserver://*******:1433;databaseName=employee" \
--username riveriq \
--password ***** \
--query 'select * from employee.dbo.emp_details where $CONDITIONS' \
--m 4 \
--split-by empid \
--target-dir /user/riveriq/sqoop/emp_details \
--as-avrodatafile
Error :
19/01/20 23:38:18 INFO mapreduce.Job: Running job:
job_1546234110553_0499
19/01/20 23:38:29 INFO mapreduce.Job: Job job_1546234110553_0499
running in uber mode : false
19/01/20 23:38:29 INFO mapreduce.Job: map 0% reduce 0%
19/01/20 23:38:34 INFO mapreduce.Job: Task Id :
attempt_1546234110553_0499_m_000000_0, Status : FAILED
Error: org.apache.avro.reflect.ReflectData.addLogicalTypeConversion(Lorg/apache/avro/Conversion;)V
19/01/20 23:38:39 INFO mapreduce.Job: Task Id :
attempt_1546234110553_0499_m_000000_1, Status : FAILED
Error:
org.apache.avro.reflect.ReflectData.addLogicalTypeConversion(Lorg/apache/avro/Conversion;)V
19/01/20 23:38:44 INFO mapreduce.Job: Task Id :
attempt_1546234110553_0499_m_000000_2, Status : FAILED
Error: org.apache.avro.reflect.ReflectData.addLogicalTypeConversion(Lorg/apache/avro/Conversion;)V
19/01/20 23:38:51 INFO mapreduce.Job: map 100% reduce 0%
19/01/20 23:38:51 INFO mapreduce.Job: Job job_1546234110553_0499
failed with state FAILED due to: Task failed task_1546234110553_0499_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
killedMaps:0 killedReduces: 0
Debug:
yarn logs -applicationId application_1546448110553_0499
java.class.path:
/data/yarn/local/usercache/ashishk/appcache/application_1546234110553_0499/container_e46_1546448235553_0499_01_000004/mr-framework/hadoop/share/hadoop/common/lib/avro-1.7.7.jar
/data/yarn/local/usercache/ashishk/appcache/application_1546448110553_0499/container_e46_1546448110553_0499_01_000004/avro-1.8.1.jar
If you google the above issue (org.apache.avro.reflect.ReflectData.addLogicalTypeConversion), you will find that this is a known Sqoop issue related to the Avro version. Sqoop uses Avro 1.8.0, while other Hadoop components use Avro 1.7.5 or 1.7.4.
But if you look at the above YARN application log, you will find that the job loads avro-1.7.7.jar (from the MapReduce framework) on the classpath before avro-1.8.1.jar (shipped by Sqoop), and that creates the Avro type conversion issue.
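If you want to pull this ordering out of the YARN logs yourself, a rough sketch (with <application_id> as a placeholder for your own job) is:

# split the logged java.class.path on ':' and keep only the avro jars, in load order
yarn logs -applicationId <application_id> | tr ':' '\n' | grep -o 'avro-[0-9.]*\.jar'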
Solution: add -Dmapreduce.job.user.classpath.first=true
https://community.hortonworks.com/questions/60890/sqoop-import-to-avro-failing-which-jars-to-be-used.html
Step 2:
sqoop import -Dmapreduce.job.user.classpath.first=true \
-Dmapreduce.job.queuename=BDSS \
--connect "jdbc:sqlserver://*******:1433;databaseName=employee" \
--username riveriq \
--password ***** \
--query 'select * from employee.dbo.emp_details where $CONDITIONS' \
--m 4 \
--split-by empid \
--target-dir /user/riveriq/sqoop/emp_details \
--as-avrodatafile
I ran it with -Dmapreduce.job.user.classpath.first=true, since the URL above stated that this resolved their issue.
But it was not that easy for me: it resolved the Sqoop Avro version issue, but my MapReduce job started failing.
That made me wonder why my issue is always different from everyone else's, and why I always get trapped in issues that others don't.
But no worries… I feel that if errors are your enemy, then logs are your friend…
Error :
[2019-01-21 00:02:29.593]Container exited with a non-zero
exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
log4j:WARN No appenders could be found for logger
(org.apache.hadoop.mapreduce.v2.app.MRAppMaster).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig
for more info.
Debug:
yarn logs -applicationId application_1546448110553_0500
java.class.path:
/data/yarn/local/usercache/ashishk/appcache/application_1546448110553_0500/container_e46_1546448110553_0500_01_000001/avro-1.8.1.jar
/data/yarn/local/usercache/ashishk/appcache/application_1546448110553_0500/container_e46_1546448110553_0500_01_000001/mr-framework/hadoop/share/hadoop/common/lib/avro-1.7.7.jar
Solution:
If you look at the above log after adding -Dmapreduce.job.user.classpath.first=true, you can see that it resolved the Sqoop Avro version issue: now avro-1.8.1.jar is loaded first and then avro-1.7.7.jar. But now the MapReduce job itself is failing, because MapReduce has a dependency on avro-1.7.7.jar that is not met by avro-1.8.1.jar.
To resolve this, we have to use a separate classloader for each.
Use -Dmapreduce.job.classloader=true instead of -Dmapreduce.job.user.classpath.first=true.
Step 3:
sqoop import -Dmapreduce.job.classloader=true -Dmapreduce.job.queuename=BDSS \
--connect "jdbc:sqlserver://*******:1433;databaseName=employee" \
--username riveriq \
--password ***** \
--query 'select * from employee.dbo.emp_details where $CONDITIONS' \
--m 4 \
--split-by empid \
--target-dir /user/riveriq/sqoop/emp_details \
--as-avrodatafile
Debug:
yarn logs -applicationId application_1546448110553_0500
java.class.path:
/data/yarn/local/usercache/ashishk/appcache/application_1546448110553_0503/container_e46_1546448110553_0504_01_000002/mr-framework/hadoop/share/hadoop/common/lib/avro-1.7.7.jar
INFO [main] org.apache.hadoop.mapreduce.v2.util.MRApps: Creating job classloader
INFO [main] org.apache.hadoop.util.ApplicationClassLoader: classpath: /data/yarn/local/usercache/ashishk/appcache/application_1546448110553_0503/container_e46_1546448110553_0504_01_000002/avro-1.8.1.jar, ...
Output: the data and schema files get created:
/user/riveriq/sqoop/emp_details/part_****.avro
/home/riveriq/emp_details.avsc
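To sanity-check the result, here is a rough sketch of how I would inspect one of the part files with avro-tools (the part file name and the avro-tools jar version/path are assumptions; adjust them to your cluster):

# copy one generated part file from HDFS to the local directory (file name assumed)
hdfs dfs -get /user/riveriq/sqoop/emp_details/part-m-00000.avro .
# print the schema embedded in the Avro data file
java -jar avro-tools-1.8.1.jar getschema part-m-00000.avro
# dump a few records as JSON to verify the data
java -jar avro-tools-1.8.1.jar tojson part-m-00000.avro | head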
Conclusion: for me, -Dmapreduce.job.classloader=true was the final solution that resolved this issue.