River IQ

Sqoop Import in Avro Files

  Ashish Kumar      sqoop January 22, 2019

Today I will show you how to sqoop data into the Avro file format.


Yeah, we know it sounds very simple: just add --as-avrodatafile to your sqoop import command, as the Apache documentation says. But in real life, do documented commands always work as simply as written???


Definitely not…


And it didn’t for me either… so no worries, here I’m going to show you the probable issues you can face, how to debug them, and the resolution for each. If you hit a different issue, please comment and we will try to solve it together.

But before getting to those issues, let’s see what Avro is and why we would want to sqoop data into Avro…


What is Avro?

An .avro file is created by the Apache Avro data serialization system.

  • Data is stored in the .avro file in a compact binary format, and the schema is stored in a .avsc file in JSON format (see the example just after this list).
  • It is very useful for exchanging extremely large datasets, as the data can be split into multiple .avro part files.
  • It is a language-neutral data serialization system, but it is schema-based.
  • It is quite useful when you need to deal with a dynamic schema, e.g. if you want to stay unaffected by changes in the source schema over time and still be able to access older ingested data after a schema change.
  • It can handle schema changes such as missing fields, newly added fields, and changed fields.
  • It can handle embedded line breaks (\n), which is awesome if you have ever had to deal with them in a .csv file.
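
To make the first point concrete, here is a small sketch of how you can peek at both the binary data and the embedded JSON schema using the avro-tools jar (the jar version/location and the part-file name are assumptions, not taken from this post):

# Print the JSON schema embedded in an Avro data file
java -jar avro-tools-1.8.1.jar getschema part-m-00000.avro

# Dump the first few records as JSON to inspect the binary data
java -jar avro-tools-1.8.1.jar tojson part-m-00000.avro | head -5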

Now I guess you must be thinking: why sqoop to Avro and not to Parquet…

As we know, in open source there are several ways and tools to do a single thing, but we find the best one by weighing the pros and cons against our requirements…

Here too we must select the best file format based on our needs.

Before comparing Avro and Parquet, here are a few points about Parquet.

  • It is a columnar storage format, as opposed to a row-based format like Avro.
  • It is self-describing: the schema and structure, including complex nested types, are stored within the data file.
  • It is good for read-intensive use cases as it supports indexing.
  • Parquet and Avro both support schema evolution (changing the schema of the data over time), but comparatively Avro provides much richer schema evolution.
  • Parquet gives better compression ratios compared to Avro.

Now here are the key factors for choosing Avro over Parquet.

  • Avro is suited to write-intensive workloads whereas Parquet is suited to read-intensive workloads, so for a sqoop import I prefer the write-friendly Avro.
  • Avro provides much richer schema evolution.
  • It can handle embedded line breaks (\n).

Now let’s look at the sqoop import and the related issues and resolutions.


Environment

  • HDP 3.0.1
  • Sqoop version: 1.4.7


Sqoop Import Command

Here I will show you the step-wise execution with the related errors and resolutions.

Step 1:

sqoop import -Dmapreduce.job.queuename=xyz \
--connect "jdbc:sqlserver://*******:1433;databaseName=employee" \
--username riveriq \
--password ***** \
--query 'select * from employee.dbo.emp_details where $CONDITIONS' \
--m 4 \
--split-by empid \
--target-dir /user/riveriq/sqoop/emp_details \
--as-avrodatafile

 

Error:

19/01/20 23:38:18 INFO mapreduce.Job: Running job: job_1546234110553_0499

19/01/20 23:38:29 INFO mapreduce.Job: Job job_1546234110553_0499 running in uber mode : false

19/01/20 23:38:29 INFO mapreduce.Job:  map 0% reduce 0%

19/01/20 23:38:34 INFO mapreduce.Job: Task Id : attempt_1546234110553_0499_m_000000_0, Status : FAILED

Error: org.apache.avro.reflect.ReflectData.addLogicalTypeConversion(Lorg/apache/avro/Conversion;)V

19/01/20 23:38:39 INFO mapreduce.Job: Task Id : attempt_1546234110553_0499_m_000000_1, Status : FAILED

Error: org.apache.avro.reflect.ReflectData.addLogicalTypeConversion(Lorg/apache/avro/Conversion;)V

19/01/20 23:38:44 INFO mapreduce.Job: Task Id : attempt_1546234110553_0499_m_000000_2, Status : FAILED

Error: org.apache.avro.reflect.ReflectData.addLogicalTypeConversion(Lorg/apache/avro/Conversion;)V

19/01/20 23:38:51 INFO mapreduce.Job:  map 100% reduce 0%

19/01/20 23:38:51 INFO mapreduce.Job: Job job_1546234110553_0499 failed with state FAILED due to: Task failed task_1546234110553_0499_m_000000

Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0

 

Debug:

yarn logs -applicationId application_1546448110553_0499

java.class.path:

/data/yarn/local/usercache/ashishk/appcache/application_1546234110553_0499/container_e46_1546448235553_0499_01_000004/mr-framework/hadoop/share/hadoop/common/lib/avro-1.7.7.jar/

/data/yarn/local/usercache/ashishk/appcache/application_1546448110553_0499/container_e46_1546448110553_0499_01_000004/avro-1.8.1.jar/
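
The full application log is quite long, so a quick way to isolate the Avro jars and their order on the classpath is to filter the output (the application id below is the one used above; substitute your own):

yarn logs -applicationId application_1546448110553_0499 | grep -i avro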

 

If you google the above error (org.apache.avro.reflect.ReflectData.addLogicalTypeConversion), you will find that this is a known Sqoop issue related to the Avro version.

Sqoop uses Avro 1.8.0, while other Hadoop components use Avro 1.7.5 or 1.7.4.

But if you look at the YARN application log above, you will find that the job loads avro-1.7.7.jar on the classpath before avro-1.8.1.jar, and that creates the Avro type-conversion issue.
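
If you want to confirm the clash on the node itself, a rough check (the paths below are assumptions for a typical HDP 3.x layout, not taken from this post) is to compare the Avro jar shipped with the Sqoop client against the one shipped with the Hadoop libraries used by the MapReduce framework:

# Avro jar bundled with the Sqoop client (assumed HDP path)
ls /usr/hdp/current/sqoop-client/lib/ | grep -i avro

# Avro jar bundled with the Hadoop common libraries (assumed HDP path)
ls /usr/hdp/current/hadoop-client/lib/ | grep -i avro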

 

Solution: add -Dmapreduce.job.user.classpath.first=true to the sqoop import command.

https://community.hortonworks.com/questions/60890/sqoop-import-to-avro-failing-which-jars-to-be-used.html

 

Step 2:

 

sqoop import -Dmapreduce.job.user.classpath.first=true -Dmapreduce.job.queuename=BDSS \
--connect "jdbc:sqlserver://*******:1433;databaseName=employee" \
--username riveriq \
--password ***** \
--query 'select * from employee.dbo.emp_details where $CONDITIONS' \
--m 4 \
--split-by empid \
--target-dir /user/riveriq/sqoop/emp_details \
--as-avrodatafile

 

I ran it with -Dmapreduce.job.user.classpath.first=true because the URL above stated that it resolved their issue.

 

But it was not that easy for me: it resolved the Sqoop Avro version issue, but my MapReduce job started failing.

 

That made me wonder why my issue is always different from everyone else’s, and why I always get trapped in issues that others don’t.

 

But no worries… I feel that if errors are your enemy, then logs are your friend…

 

Error:

[2019-01-21 00:02:29.593]Container exited with a non-zero exit code 1. Error file: prelaunch.err.

Last 4096 bytes of prelaunch.err :

Last 4096 bytes of stderr :

log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster).

log4j:WARN Please initialize the log4j system properly.

log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

 

Debug:

yarn logs -applicationId application_1546448110553_0500

java.class.path:

/data/yarn/local/usercache/ashishk/appcache/application_1546448110553_0500/container_e46_1546448110553_0500_01_000001/avro-1.8.1.jar/

/data/yarn/local/usercache/ashishk/appcache/application_1546448110553_0500/container_e46_1546448110553_0500_01_000001/mr-framework/hadoop/share/hadoop/common/lib/avro-1.7.7.jar/

 

Solution:

If you look at the log above, adding -Dmapreduce.job.user.classpath.first=true resolved the Sqoop Avro version issue: avro-1.8.1.jar is now loaded before avro-1.7.7.jar. But the MapReduce job itself is now failing, because the MapReduce framework depends on avro-1.7.7.jar and that dependency is not satisfied by avro-1.8.1.jar.

 

To resolve this, we have to use separate classloaders for the two: one for the MapReduce framework classes and one for the user (job) classes.

Use -Dmapreduce.job.classloader=true instead of -Dmapreduce.job.user.classpath.first=true
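
After resubmitting with this property, you can confirm that the isolated job classloader is actually being created by searching the new application’s log for the MRApps message that appears in the Step 3 debug output below (replace the application id with your own):

yarn logs -applicationId <your_application_id> | grep "Creating job classloader"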

 

Step 3:

 

sqoop import -Dmapreduce.job.classloader=true -Dmapreduce.job.queuename=BDSS \
--connect "jdbc:sqlserver://*******:1433;databaseName=employee" \
--username riveriq \
--password ***** \
--query 'select * from employee.dbo.emp_details where $CONDITIONS' \
--m 4 \
--split-by empid \
--target-dir /user/riveriq/sqoop/emp_details \
--as-avrodatafile

 

Debug:

yarn logs -applicationId application_1546448110553_0500

java.class.path:

/data/yarn/local/usercache/ashishk/appcache/application_1546448110553_0503/container_e46_1546448110553_0504_01_000002/mr-framework/hadoop/share/hadoop/common/lib/avro-1.7.7.jar/

INFO [main] org.apache.hadoop.mapreduce.v2.util.MRApps: Creating job classloader

 

INFO [main] org.apache.hadoop.util.ApplicationClassLoader: classpath:

/data/yarn/local/usercache/ashishk/appcache/application_1546448110553_0503/container_e46_1546448110553_0504_01_000002/avro-1.8.1.jar,

 

Output: the data files and the schema file get created.

/user/riveriq/sqoop/emp_details/part_****.avro

/home/riveriq/emp_details.avsc
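
To double-check the result, a couple of verification commands look like this (the part-file name and the avro-tools jar are assumptions; substitute whatever hdfs dfs -ls shows and whatever tools jar you have available):

# List the Avro part files written by the import
hdfs dfs -ls /user/riveriq/sqoop/emp_details/

# Pull one part file down and spot-check a few records as JSON
hdfs dfs -get /user/riveriq/sqoop/emp_details/part-m-00000.avro /tmp/
java -jar avro-tools-1.8.1.jar tojson /tmp/part-m-00000.avro | head -5

# The embedded schema should match the .avsc that Sqoop generated locally
java -jar avro-tools-1.8.1.jar getschema /tmp/part-m-00000.avro
cat /home/riveriq/emp_details.avsc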

Conclusion: for me, -Dmapreduce.job.classloader=true was the final solution to this issue.
