Points to remember while developing a Spark application

A. Resource allocation should be optimised, i.e. the following needs to be considered (see the sketch after this list):
  1. Too many cores per executor (e.g. 15) leads to poor HDFS I/O throughput; 4 to 6 cores per executor is the sweet spot.
  2. max(384 MB, 0.07 * executorMemory) is reserved as off-heap overhead (direct memory), on top of the executor heap.
  3. Don't set executor memory too high; garbage collection pauses grow and parallelism suffers.
  4. Leave at least 1 core and 1 GB of RAM per node for the OS and Hadoop daemons.
  5. Reserve resources for one Application Master as well.
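As a rough illustration, here is a minimal sketch of how these guidelines translate into executor settings, assuming hypothetical 16-core / 64 GB worker nodes; the node size, application name and exact config values are illustrative assumptions, not prescriptions from this post.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical worker node: 16 cores, 64 GB RAM.
// Leave 1 core and 1 GB for the OS/Hadoop daemons -> 15 usable cores, 63 GB.
// With 5 cores per executor (inside the 4-6 sweet spot) that is 3 executors per node.
// 63 GB / 3 executors = ~21 GB per executor slot; carve the overhead
// max(384 MB, 0.07 * executorMemory) out of that slot, so request ~19 GB heap.
// Remember that one container cluster-wide is also taken by the Application Master.
val spark = SparkSession.builder()
  .appName("resource-sizing-sketch")              // illustrative name
  .config("spark.executor.cores", "5")
  .config("spark.executor.memory", "19g")
  .config("spark.executor.memoryOverhead", "2g")  // >= max(384 MB, 0.07 * 19 GB);
                                                  // older YARN setups use
                                                  // spark.yarn.executor.memoryOverhead
  .getOrCreate()                                  // master is supplied via spark-submit
```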

B. The number of partitions should be optimised, i.e. both the initial partitions and the intermediate (shuffle) partitions (see the sketch after this list):
  1. The number of initial partitions equals the number of HDFS blocks in the input, or the value of spark.default.parallelism.
  2. The size of each shuffle partition shouldn't exceed 2 GB, otherwise the job fails with "IllegalArgumentException: Size exceeds Integer.MAX_VALUE".
  3. The number of partitions in a child RDD can be greater than, equal to, or less than the number of partitions in the parent RDD, depending on the transformation.
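A minimal sketch of the usual partition knobs follows; the input path and the partition counts are made-up values for illustration, to be tuned per job.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-tuning-sketch")
  // Default parallelism for RDDs created without an explicit partition count
  // (file-based RDDs still default to roughly one partition per HDFS block).
  .config("spark.default.parallelism", "200")
  // Number of partitions produced by shuffles in Spark SQL / DataFrames.
  .config("spark.sql.shuffle.partitions", "200")
  .getOrCreate()

val rdd = spark.sparkContext.textFile("hdfs:///data/events")  // hypothetical path

// Keep each shuffle partition well under 2 GB: repartition a large dataset
// up (full shuffle) or coalesce it down (no shuffle) as needed.
val widened  = rdd.repartition(400)
val narrowed = widened.coalesce(100)
```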

C. Prefer reduceByKey over groupByKey and treeReduce over reduce wherever possible: reduceByKey combines values on the map side before shuffling, and treeReduce aggregates partial results in a tree so the driver is not a bottleneck.
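For example, a word-count style aggregation and a global sum can be written as below; the sample data, app name and local master are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("agg-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val pairs = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(w => (w, 1))

// Preferred: reduceByKey combines counts on each partition before shuffling.
val counts = pairs.reduceByKey(_ + _)

// Avoid: groupByKey ships every (word, 1) pair across the network, then sums.
// val counts = pairs.groupByKey().mapValues(_.sum)

val numbers = sc.parallelize(1 to 1000000)

// treeReduce aggregates partial sums in multiple levels (depth 2 here)
// instead of sending every partition's result straight to the driver.
val total = numbers.treeReduce(_ + _, depth = 2)
```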

D. Take care of data skew, i.e. situations where some executors are overloaded while others sit mostly idle (see the sketch below).
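One common way to spread a hot key across executors is key salting, sketched below with made-up data; the salt factor, key names and local master are illustrative assumptions.

```scala
import scala.util.Random
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("skew-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Heavily skewed input: almost every record shares the key "hot".
val skewed = sc.parallelize(Seq.fill(100000)(("hot", 1)) ++ Seq(("cold", 1)))

val saltFactor = 10  // illustrative; choose based on the observed skew

// Stage 1: append a random salt so the hot key is split across partitions.
val partial = skewed
  .map { case (k, v) => (s"$k#${Random.nextInt(saltFactor)}", v) }
  .reduceByKey(_ + _)

// Stage 2: strip the salt and combine the partial results per original key.
val totals = partial
  .map { case (saltedKey, v) => (saltedKey.split("#")(0), v) }
  .reduceByKey(_ + _)
```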
