Posts

Showing posts from September, 2016

The points to remember while developing a Spark application

A. The resource allocation should be optimised, i.e. the following needs to be considered (a spark-submit sketch covering A and B follows this list):
- Too many cores per executor (e.g. 15) leads to poor HDFS I/O throughput; 4 to 6 cores per executor is the sweet spot.
- Max(384 MB, 0.07 * executor memory) is required as direct (off-heap) memory, i.e. the memory overhead.
- Don't set the executor memory too high; garbage collection and parallelism would be impacted.
- Reserve at least 1 core and 1 GB per node for the OS/Hadoop daemons.
- Account for the resources needed by one Application Master.
B. The number of partitions should be optimised, i.e. both the initial and the intermediate (shuffle) partitions:
- Number of initial partitions = number of HDFS blocks of the input, or the value of spark.default.parallelism.
- The size of each shuffle partition shouldn't be more than 2 GB, otherwise the job fails with "IllegalArgumentException: Size exceeds Integer.MAX_VALUE".
- The number of child partitions can be greater than, equal to, or less than the number of partitions in the parent RDD.
C. Always use reduceByKey instead of groupByKey and treeReduce instead of reduce wherever possible.
D. Take care of skewness of the data, i.e. some executors are
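
As a rough illustration of points A and B, here is a minimal spark-submit sketch for a hypothetical YARN cluster of 10 nodes with 16 cores and 64 GB RAM each; the class name, jar and exact figures are assumptions for illustration, not values from the post:

# Hypothetical cluster: 10 nodes, 16 cores and 64 GB RAM per node.
# Reserve 1 core and 1 GB per node for the OS/Hadoop daemons; 5 cores per
# executor then gives 3 executors per node, minus one executor to leave room
# for the Application Master. The overhead stays above max(384 MB, 0.07 * 18 GB).
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 29 \
  --executor-cores 5 \
  --executor-memory 18g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --conf spark.default.parallelism=290 \
  --class com.example.MyApp \
  my-app.jar

Here spark.default.parallelism is set to roughly twice the total executor cores (29 * 5 = 145), which keeps the initial partitions small enough to stay well under the 2 GB shuffle-partition limit.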

Setup Hadoop and Spark on Mac

In this article, I'll take you through simple steps to set up Hadoop/Spark and run a Spark job.
Step 1: Set up Java
Run the command: java -version, and if Java is not installed, download it.
After installation, get JAVA_HOME with the command: /usr/libexec/java_home
Update .bashrc with the JAVA_HOME as: export JAVA_HOME=
Step 2: Set up passwordless SSH
Enable remote login in System Preferences => Sharing.
Generate an RSA key: ssh-keygen -t rsa -P ''
Add the RSA key to the authorized keys: cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Check ssh localhost; it shouldn't prompt for the password.
Step 3: Set up Hadoop
Download hadoop-2.7.2.tar.gz: http://www.apache.org/dyn/closer.cgi/hadoop/common/
Extract the tar file and move hadoop-2.7.2 to /usr/local/hadoop.
Set up the configuration files [if a configuration file doesn't exist, copy it from the corresponding template file].
Update /usr/local/hadoop/
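
A minimal sketch of the environment variables touched in Steps 1 and 3, assuming Hadoop ends up in /usr/local/hadoop as above; on macOS these exports usually go in ~/.bash_profile rather than ~/.bashrc:

# ~/.bashrc (or ~/.bash_profile on macOS)
export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin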

A few important Hadoop commands

To check the block size and replication factor of a file:
hadoop fs -stat %o <path>
hadoop fs -stat %r <path>
How to create a file with a different block size and replication factor:
hadoop fs -Ddfs.block.size=67108864 -put <src> <dst>   (64 MB block size)
hadoop fs -Ddfs.replication=2 -put <src> <dst>
How to change the replication factor of an existing file:
hadoop fs -setrep -w 4 -R <path>
To change the block size of existing files there are two ways: either change dfs.block.size in hdfs-site.xml and restart the cluster (this only affects newly written files), or copy the files using distcp with the new block size and delete the old ones:
hadoop distcp -Ddfs.block.size=XX /path/to/old/files /path/to/new/files/with/larger/block/sizes
Get (merge) multiple files under a directory into one local file:
hadoop fs -getmerge <src dir> <local dst>
Start/stop the Hadoop ecosystem: start-dfs.sh / stop-dfs.sh and start-yarn.sh / stop-yarn.sh can be run from the master, or hadoop-daemon.sh start|stop namenode/datanode and yarn-daemon.sh start|stop resourcemanager on the individual nodes.
To view the FSImage as text: hdfs oiv -p XML -i fsimage_0
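
A short worked example tying a few of these commands together; the paths /data/events.log, /data/old and /data/new are purely hypothetical:

hadoop fs -stat %o /data/events.log        # block size in bytes
hadoop fs -stat %r /data/events.log        # replication factor
hadoop fs -setrep -w 2 /data/events.log    # change replication to 2 and wait for it to finish
hadoop distcp -Ddfs.block.size=134217728 /data/old /data/new   # re-copy with 128 MB blocks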