River IQ


Spark UDF with withColumn

  Ashish Kumar      Spark February 14, 2020

import org.apache.spark.sql.functions._

val events = Seq(
  (1, 1, 2, 3, 4), (2, 1, 2, 3, 4), (3, 1, 2, 3, 4), (4, 1, 2, 3, 4), (5, 1, 2, 3, 4)
).toDF("ID", "amt1", "amt2", "amt3", "amt4")

var prev_amt5 = 0
var i = 1

def getamt5value(ID: Int, amt1: Int, amt2: Int, amt3: Int, amt4: Int): Int = {
  if (i == 1) {
    i = i + 1
    prev_amt5 = 0
  } else {
    i = i + 1
  }
  if (ID == 0) {
    if (amt1 == 0) {
      val cur_amt5 = 1
      prev_amt5 = cur_amt5
      cur_amt5
    } else {
      val cur_amt5 = 1 * (amt2 + amt3)
      prev_amt5 = cur_amt5
      cur_amt5
    }
  } el...
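The snippet above breaks off before the function is wrapped as a UDF and applied. As a rough sketch of those remaining steps (simplified: it drops the prev_amt5/i bookkeeping and the ID branching, since the full function body is not shown here), a plain Scala function can be wrapped with udf() and the result added as a new column via withColumn:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object Amt5Sketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf-with-withColumn").master("local[*]").getOrCreate()
    import spark.implicits._

    val events = Seq(
      (1, 1, 2, 3, 4), (2, 1, 2, 3, 4), (3, 1, 2, 3, 4), (4, 1, 2, 3, 4), (5, 1, 2, 3, 4)
    ).toDF("ID", "amt1", "amt2", "amt3", "amt4")

    // Simplified stand-in for getamt5value: derive amt5 from the other amount columns.
    val getAmt5 = udf { (amt1: Int, amt2: Int, amt3: Int) =>
      if (amt1 == 0) 1 else amt2 + amt3
    }

    // withColumn appends the column produced by the UDF to the DataFrame.
    val withAmt5 = events.withColumn("amt5", getAmt5($"amt1", $"amt2", $"amt3"))
    withAmt5.show()

    spark.stop()
  }
}

Keeping the UDF a pure function of its column inputs (unlike the driver-side vars above) avoids surprises when Spark evaluates it in parallel across partitions.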

Read more

Hive Integration with Spark

  Ashish Kumar      Spark January 22, 2019

Are you struggling to access Hive using Spark? Is your Hive table not showing up in Spark? No worries, here I am going to show you the key changes made in HDP 3.0 for Hive and how we can access Hive using Spark. In HDP 3.0, Spark and Hive each have their own metastore: Hive uses the "hive" catalog, and Spark uses the "spark" catalog. With HDP 3.0, in Ambari you can find the configuration below for Spark. As we know, we could previously access Hive tables in Spark using HiveContext/SparkSession, but in HDP 3.0 we access Hive using Hive ...
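The excerpt cuts off right at the connector name, but to illustrate the general idea: on HDP 3.x the usual route from Spark into Hive-managed tables is the Hive Warehouse Connector. A minimal sketch from spark-shell, assuming the HWC jar is on the classpath and spark.sql.hive.hiveserver2.jdbc.url is set in the Spark configuration (mydb.my_table is a placeholder table name):

import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session on top of the existing SparkSession (spark is predefined in spark-shell).
val hive = HiveWarehouseSession.session(spark).build()

// Query a Hive-managed (ACID) table in the "hive" catalog and get back a Spark DataFrame.
val df = hive.executeQuery("SELECT * FROM mydb.my_table")
df.show()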

Read more

Spark Performance Tuning

  Ashish Kumar      Spark September 26, 2018

Apache Spark overview: Analytics is increasingly an integral part of day-to-day operations at today's leading businesses, and transformation is also occurring through huge growth in mobile and digital channels. Previously acceptable response times and delays for analytic insight are no longer viable, with more push toward real-time and in-transaction analytics. In addition, data science skills are increasingly in demand. As a result, enterprise organizations are attempting to leverage analytics in new ways and transition existing analy...

Read more

Dynamic Allocation in Spark

  Ashish Kumar      Spark August 26, 2018

Why is Spark faster than MapReduce? Here today I will give you a deep dive into Spark resource allocation (static and dynamic allocation of resources). Whenever this question arises, we come up with the explanation that Spark does in-memory processing of data, or that it makes better or more effective use of YARN resources than MapReduce. How and when does dynamic allocation of resources give faster and more effective utilization of resources? Effective utilization of cluster or YARN memory. What are executors? Before we start talking about stati...
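To make the static-versus-dynamic contrast concrete, below is a minimal sketch of enabling dynamic allocation for a Spark-on-YARN application. It assumes the external shuffle service is running on the NodeManagers, and the executor counts are placeholder values rather than tuning advice:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-sketch")
  .config("spark.dynamicAllocation.enabled", "true")       // let Spark add and remove executors at runtime
  .config("spark.shuffle.service.enabled", "true")         // keep shuffle files available when executors are removed
  .config("spark.dynamicAllocation.minExecutors", "1")     // placeholder lower bound
  .config("spark.dynamicAllocation.initialExecutors", "2") // placeholder starting point
  .config("spark.dynamicAllocation.maxExecutors", "10")    // placeholder upper bound
  .getOrCreate()

With static allocation, by contrast, the executor count fixed at submit time (e.g. --num-executors) stays reserved for the whole application whether it is busy or idle.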

Read more