Posts

Showing posts from September, 2016

The points to remember while developing a Spark application

A. The resource allocation should be optimised, i.e. the following needs to be considered (a spark-submit sketch covering A and B follows this list):
- Too many cores per executor (e.g. 15) leads to poor HDFS I/O throughput; 4 to 6 cores per executor is the sweet spot.
- Max(384 MB, 0.07 * executor memory) is required as direct (off-heap) memory, i.e. the memory overhead.
- Don't set the executor memory too high; garbage collection and parallelism would be impacted.
- Reserve at least 1 core and 1 GB per node for the OS/Hadoop daemons.
- Account for the resources needed by one Application Master.
B. The number of partitions should be optimised, i.e. both the initial and the intermediate (shuffle) partitions:
- Number of initial partitions = number of HDFS blocks of the input, or the value of spark.default.parallelism.
- The size of each shuffle partition shouldn't be more than 2 GB, otherwise the job fails with "IllegalArgumentException: Size exceeds Integer.MAX_VALUE".
- The number of child partitions can be greater than, equal to, or less than the number of partitions in the parent RDD.
C. Always use reduceByKey instead of groupByKey and treeReduce instead of reduce wherever possible.
D. Take care of skewness of the data, i.e. some executors are
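
As a rough illustration of points A and B, here is a minimal spark-submit sketch for a hypothetical YARN cluster of 10 nodes with 16 cores and 64 GB RAM each; the class name, jar and exact figures are assumptions for illustration, not values from the post:

# Hypothetical cluster: 10 nodes, 16 cores and 64 GB RAM per node.
# Reserve 1 core and 1 GB per node for the OS/Hadoop daemons; 5 cores per
# executor then gives 3 executors per node, minus one executor to leave room
# for the Application Master. The overhead stays above max(384 MB, 0.07 * 18 GB).
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 29 \
  --executor-cores 5 \
  --executor-memory 18g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --conf spark.default.parallelism=290 \
  --class com.example.MyApp \
  my-app.jar

Here spark.default.parallelism is set to roughly twice the total executor cores (29 * 5 = 145), which keeps the initial partitions small enough to stay well under the 2 GB shuffle-partition limit.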

Setup Hadoop and Spark on Mac

In this article, I'll take you through simple steps to set up Hadoop/Spark and run a Spark job.
Step 1: Set up Java
Run the command: java -version, and if Java is not installed, download it.
After installation, get JAVA_HOME with the command: /usr/libexec/java_home
Update .bashrc with the JAVA_HOME as: export JAVA_HOME=
Step 2: Set up passwordless SSH
Enable remote login in System Preferences => Sharing.
Generate an RSA key: ssh-keygen -t rsa -P ''
Add the RSA key to the authorized keys: cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Check ssh localhost; it shouldn't prompt for the password.
Step 3: Set up Hadoop
Download hadoop-2.7.2.tar.gz: http://www.apache.org/dyn/closer.cgi/hadoop/common/
Extract the tar file and move hadoop-2.7.2 to /usr/local/hadoop.
Set up the configuration files [if a configuration file doesn't exist, copy it from the corresponding template file].
Update /usr/local/hadoop/
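
A minimal sketch of the environment variables touched in Steps 1 and 3, assuming Hadoop ends up in /usr/local/hadoop as above; on macOS these exports usually go in ~/.bash_profile rather than ~/.bashrc:

# ~/.bashrc (or ~/.bash_profile on macOS)
export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin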

A few important Hadoop commands

To check the block size and replication factor of a file:
hadoop fs -stat %o <path>
hadoop fs -stat %r <path>
How to create a file with a different block size and replication factor:
hadoop fs -Ddfs.block.size=67108864 -put <src> <dst>   (64 MB block size)
hadoop fs -Ddfs.replication=2 -put <src> <dst>
How to change the replication factor of an existing file:
hadoop fs -setrep -w 4 -R <path>
To change the block size of existing files there are two ways: either change dfs.block.size in hdfs-site.xml and restart the cluster (this only affects newly written files), or copy the files using distcp with the new block size and delete the old ones:
hadoop distcp -Ddfs.block.size=XX /path/to/old/files /path/to/new/files/with/larger/block/sizes
Get (merge) multiple files under a directory into one local file:
hadoop fs -getmerge <src dir> <local dst>
Start/stop the Hadoop ecosystem: start-dfs.sh / stop-dfs.sh and start-yarn.sh / stop-yarn.sh can be run from the master, or hadoop-daemon.sh start|stop namenode/datanode and yarn-daemon.sh start|stop resourcemanager on the individual nodes.
To view the FSImage as text: hdfs oiv -p XML -i fsimage_0
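
A short worked example tying a few of these commands together; the paths /data/events.log, /data/old and /data/new are purely hypothetical:

hadoop fs -stat %o /data/events.log        # block size in bytes
hadoop fs -stat %r /data/events.log        # replication factor
hadoop fs -setrep -w 2 /data/events.log    # change replication to 2 and wait for it to finish
hadoop distcp -Ddfs.block.size=134217728 /data/old /data/new   # re-copy with 128 MB blocks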