Monday, 7 November 2016

Performance Tuning in Spark

Below are the different ways we can tune Spark jobs.
1) By increasing the resources for the Spark job:

// Memory assigned to each executor that runs the Spark job
sparkConf.set("spark.executor.memory", "10g")
// Number of executors assigned to the Spark job
sparkConf.set("spark.executor.instances", "12")
// Cores for each executor
sparkConf.set("spark.executor.cores", "10")

(OR)

sparkConf.set("spark.dynamicAllocation.enabled", "true")

This also requires spark.dynamicAllocation.minExecutors, spark.dynamicAllocation.maxExecutors, and spark.dynamicAllocation.initialExecutors to be set.
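A minimal sketch of a dynamic allocation setup; the executor bounds here are arbitrary example values, and on YARN the external shuffle service must also be enabled:

// Sketch only: the min/initial/max values below are illustrative, not recommendations.
sparkConf.set("spark.dynamicAllocation.enabled", "true")
sparkConf.set("spark.dynamicAllocation.minExecutors", "2")
sparkConf.set("spark.dynamicAllocation.initialExecutors", "4")
sparkConf.set("spark.dynamicAllocation.maxExecutors", "20")
// On YARN, dynamic allocation also requires the external shuffle service.
sparkConf.set("spark.shuffle.service.enabled", "true")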

2) Serialization: reduces the network load and improves execution speed.

// Serialization
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.registerKryoClasses(Array(classOf[Joiner]))
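For context, a self-contained sketch; Joiner here is a hypothetical stand-in for any application class whose instances get shuffled or cached:

import org.apache.spark.SparkConf

// Hypothetical application class; replace with your own shuffled/cached types.
case class Joiner(id: Long, key: String)

val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Joiner]))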

3) Partitioning:

The more RDD partitions there are, the more parallel the processing. The partition count normally ranges from 100 to 10,000 depending on cluster resources and data size. A common rule of thumb is to define 2 to 3 times as many partitions as spark.executor.instances * spark.executor.cores; a sketch follows below.
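A minimal sketch of that rule of thumb, reusing the executor settings from point 1 (the input path is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("partitioning-sketch")
  .set("spark.executor.instances", "12")
  .set("spark.executor.cores", "10")
val sc = new SparkContext(conf)

// Rule of thumb: 2-3x the total number of executor cores.
val totalCores = conf.get("spark.executor.instances").toInt *
  conf.get("spark.executor.cores").toInt
val targetPartitions = totalCores * 3 // 360 partitions here

// Hypothetical input; any RDD can be given a partition hint the same way.
val lines = sc.textFile("hdfs:///tmp/input", minPartitions = targetPartitions)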

4) Efficient Shuffling:

Operations like groupByKey, leftOuterJoin, rightOuterJoin, repartition, and distinct result in shuffling, since they require all of the output of the previous stage. Use coalesce to reduce the number of partitions, which in turn reduces shuffling on the network; a sketch follows below. For shuffle-intensive jobs, increase the memory fraction reserved for shuffles with:

sparkConf.set("spark.shuffle.memoryFraction", "0.8")
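A minimal coalesce sketch, assuming hypothetical input and output paths; unlike repartition, coalesce merges partitions without triggering a full shuffle:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("coalesce-sketch"))

// Hypothetical input; assume it loads with many small partitions.
val records = sc.textFile("hdfs:///tmp/wide-input")

// coalesce(40) merges existing partitions locally where possible,
// avoiding the full network shuffle that repartition(40) would cause.
val narrowed = records.coalesce(40)
narrowed.saveAsTextFile("hdfs:///tmp/narrow-output")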

5) Garbage collection:

Information about GC tuning can be found in the Spark GC tuning documentation.

sparkConf.set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
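A slightly fuller sketch: the same option can also carry standard JVM GC-logging flags to help diagnose pauses (the flag choice here is illustrative):

// Enable G1GC and print GC details to the executor logs for diagnosis.
sparkConf.set(
  "spark.executor.extraJavaOptions",
  "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
)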

User Defined Configuration in Spark

Load the user-defined configuration file as below:

import org.apache.hadoop.conf.Configuration

def getConf(confName: String): Configuration = {
  val conf = new Configuration()
  conf.addResource(getClass.getClassLoader.getResource(confName))
  conf
}

// Example use case; the configuration file name is illustrative.
val conf = getConf("spark-app-conf.xml")
val sparkConf = new SparkConf().setAppName(conf.get("appName")).setMaster(conf.get("locality"))
// Memory tuning: spark.executor.memory is defined in the user-defined configuration file.
sparkConf.set("spark.executor.memory", conf.get("spark.executor.memory"))
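For reference, Hadoop's Configuration reads XML resources; a hypothetical spark-app-conf.xml matching the keys above might look like this (all values are example placeholders):

<configuration>
  <property>
    <name>appName</name>
    <value>MySparkApp</value>
  </property>
  <property>
    <name>locality</name>
    <value>local[*]</value>
  </property>
  <property>
    <name>spark.executor.memory</name>
    <value>10g</value>
  </property>
</configuration>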

Saturday, 5 November 2016

Log4j properties for spark application

The log4j.properties file can be used to write application-specific logs on the edge node; a sample follows below. log4j.properties
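A minimal log4j.properties sketch along these lines, assuming a hypothetical log path on the edge node:

# Root logger: INFO to the console and to a file on the edge node
log4j.rootLogger=INFO, console, file

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Hypothetical path; point this wherever the edge node keeps application logs
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=/var/log/spark-apps/my-app.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n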

Creating your first Spark application in Scala

Hello all,
Follow the steps below to create your first Spark application.
  1. Download the sample POM file for the sample Spark project with Maven. POM
  2. Create a Maven project in Scala IDE with the above POM file.
  3. Run a Maven update on the project. It will take a few minutes to download the jars.
  4. Run a Maven install on the project and verify that the build succeeds.
  5. Right-click on the project, click on Configure, and change the nature to Scala. Soon there will be a lot of errors in the Problems view.
  6. To get rid of those errors, go into the project properties, click on Scala Compiler, tick Use Project Settings, and then select Fixed Scala installation 2.10.5 if Spark is built with Scala 2.10.
  7. Now remove the Scala libraries: right-click on the project -> configure build path -> libraries -> remove the Scala libraries.
  8. Now go to the Source tab and make sure the /src/main/scala folder is added there, so that source files are taken from that folder.
  9. Now you can start using this project for Spark development. You can add both Scala (/src/main/scala) and Java (/src/main/java) sources in the same project and run them; a minimal application sketch follows below.
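A minimal Spark application you could drop into /src/main/scala to verify the setup; the object name, app name, and input path are all illustrative:

import org.apache.spark.{SparkConf, SparkContext}

// A classic word count, just to confirm the project builds and runs.
object FirstSparkApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("first-spark-app").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical input file; point this at any text file you have.
    val counts = sc.textFile("/tmp/input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}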