Monday, 7 November 2016

Performance Tuning in Spark

Below are the different ways we can tune Spark jobs.
1) Increase the resources for the Spark job:

 // Memory assigned to each executor that runs the Spark job
 sparkConf.set("spark.executor.memory", "10g")
 // Number of executors assigned to the Spark job
 sparkConf.set("spark.executor.instances", "12")
 // Cores for each executor
 sparkConf.set("spark.executor.cores", "10")
 (OR)
 sparkConf.set("spark.dynamicAllocation.enabled", "true")
 Dynamic allocation also requires spark.dynamicAllocation.minExecutors, spark.dynamicAllocation.maxExecutors, and spark.dynamicAllocation.initialExecutors to be set.
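
As a minimal sketch, these settings can be wired into a SparkConf when the context is created; the application name and master below are placeholders, not from the original post:

 import org.apache.spark.{SparkConf, SparkContext}

 val sparkConf = new SparkConf()
   .setAppName("tuning-example")            // placeholder application name
   .setMaster("yarn")                       // placeholder master; depends on your cluster
   .set("spark.executor.memory", "10g")     // heap memory per executor
   .set("spark.executor.instances", "12")   // number of executors
   .set("spark.executor.cores", "10")       // cores per executor
 val sc = new SparkContext(sparkConf)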

2) Serialization: efficient serialization reduces the network load and improves execution speed.

 // Use Kryo serialization instead of the default Java serialization
 sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
 sparkConf.registerKryoClasses(Array(classOf[Joiner]))
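
For illustration, a minimal sketch assuming a hypothetical case class Joiner that is shuffled between stages; registering it lets Kryo write a compact class identifier instead of the full class name:

 import org.apache.spark.SparkConf

 // Hypothetical class moved across the network during shuffles
 case class Joiner(id: Long, key: String)

 val sparkConf = new SparkConf()
   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
   .registerKryoClasses(Array(classOf[Joiner]))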

3) Partitioning:

 The more RDD partitions there are, the more parallel processing you get. The number of partitions typically ranges from 100 to 10000, depending on cluster resources and data size. A common rule of thumb is to set the number of partitions to 2 or 3 times spark.executor.instances * spark.executor.cores, as sketched below.
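A minimal sketch of applying this rule of thumb, assuming the executor settings shown above (12 executors with 10 cores each); inputRdd is a placeholder for an existing RDD:

 val totalCores = 12 * 10                          // spark.executor.instances * spark.executor.cores
 val numPartitions = totalCores * 3                // 2-3x the total core count
 val repartitioned = inputRdd.repartition(numPartitions)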

4) Efficient shuffling:

 Operations like groupByKey, leftOuterJoin, rightOuterJoin, repartition, and distinct result in shuffling, since they require all of the output of the previous stage. Use coalesce to reduce the number of partitions, which in turn reduces shuffling over the network. For shuffle-intensive jobs, increase the memory fraction reserved for shuffles:
 sparkConf.set("spark.shuffle.memoryFraction", "0.8")
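
A minimal sketch of cutting shuffle cost; pairRdd is a placeholder pair RDD, and reduceByKey is one commonly suggested alternative to groupByKey because it combines values on the map side before data crosses the network:

 // coalesce shrinks the partition count without a full shuffle
 val fewerPartitions = pairRdd.coalesce(50)
 // map-side combining reduces the amount of data shuffled
 val counts = pairRdd.reduceByKey(_ + _)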

5) Garbage collection:

 Information about GC tuning can be found in the Spark GC tuning documentation.
 sparkConf.set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
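
A minimal sketch that also turns on GC logging for the executors so pause times can be inspected; the logging flags are an assumption added for illustration, not part of the original post:

 sparkConf.set(
   "spark.executor.extraJavaOptions",
   "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")  // G1 collector plus GC logging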
