Below are different ways to tune Spark jobs:
1) Increasing the resources for the Spark job:
// Memory assigned to each executor that runs the Spark job
sparkConf.set("spark.executor.memory", "10g")
// Number of executors assigned to the Spark job
sparkConf.set("spark.executor.instances", "12")
// Cores for each executor
sparkConf.set("spark.executor.cores", "10")
(OR)
sparkConf.set("spark.dynamicAllocation.enabled", "true")
This also requires setting spark.dynamicAllocation.minExecutors, spark.dynamicAllocation.maxExecutors, and spark.dynamicAllocation.initialExecutors, as in the sketch below.
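A minimal sketch of a dynamic-allocation configuration; the executor counts are illustrative values, not recommendations:

import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .setAppName("TunedJob")
  // Let Spark scale executors up and down with the workload
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")      // lower bound kept alive
  .set("spark.dynamicAllocation.initialExecutors", "4")  // starting point
  .set("spark.dynamicAllocation.maxExecutors", "20")     // upper bound under load
  // On YARN, dynamic allocation also needs the external shuffle service
  .set("spark.shuffle.service.enabled", "true")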
2) Serialization: Kryo serialization reduces the amount of data sent over the network and improves execution speed.
// serialization
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.registerKryoClasses(Array(classOf[Joiner]))
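A minimal sketch of enabling Kryo; Joiner here is a placeholder for any application class that gets shuffled or cached, and spark.kryo.registrationRequired is optional but makes unregistered classes fail fast:

import org.apache.spark.SparkConf

// Placeholder for an application class that is shuffled or cached
case class Joiner(leftKey: String, rightKey: String)

val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optional: fail fast if a class is serialized without being registered
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[Joiner]))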
3) Partitioning:
The more RDD partitions there are, the more parallel processing takes place. The number is normally between 100 and 10,000 depending on cluster resources and data size. A common rule of thumb is to set the number of partitions to 2 or 3 times spark.executor.instances * spark.executor.cores, as in the sketch below.
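A minimal sketch of sizing partitions from the executor settings; the multiplier of 3, the fallback values, and the input path are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(sparkConf)  // sparkConf from section 1

// Rule of thumb: 2-3 times the total number of executor cores
val executors = sparkConf.get("spark.executor.instances", "12").toInt
val coresPerExecutor = sparkConf.get("spark.executor.cores", "10").toInt
val numPartitions = 3 * executors * coresPerExecutor

// Read with an explicit partition hint, or reshape an existing RDD
val lines = sc.textFile("hdfs:///data/input", numPartitions)
val repartitioned = lines.repartition(numPartitions)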
4) Efficient shuffling:
Operations like groupByKey, leftOuterJoin, rightOuterJoin, repartition, and distinct result in shuffling, since they require all of the output of the previous stage. Use coalesce to reduce the number of partitions, which in turn reduces shuffling over the network. For shuffle-intensive jobs, increase the memory fraction available for shuffling with sparkConf.set("spark.shuffle.memoryFraction", "0.8"). A sketch of a shuffle-friendly aggregation follows.
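A minimal sketch of reducing shuffle volume; the input path, key layout, and partition counts are illustrative assumptions. reduceByKey combines values on the map side before the shuffle, unlike groupByKey, which ships every value across the network:

// Count per key with map-side combining instead of groupByKey
val pairs = sc.textFile("hdfs:///data/events").map(line => (line.split(",")(0), 1))
val counts = pairs.reduceByKey(_ + _)

// After a heavy filter, shrink the partition count without a full shuffle
val filtered = counts.filter { case (_, n) => n > 100 }
val compacted = filtered.coalesce(50)   // 50 is an illustrative value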
5) Garbage collection:
Information about GC tuning can be found in the Spark GC tuning guide.
sparkConf.set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
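A minimal sketch that switches the executors to G1 and turns on GC logging so GC pressure can actually be observed; these are standard Java 8 JVM flags used for illustration, not a tuned profile:

// G1 collector plus GC logging for the executor JVMs
sparkConf.set("spark.executor.extraJavaOptions",
  "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")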