
Showing posts from 2016

Tableau integration with sparkSQL and basic data analysis with Tableau

Steps for Tableau integration with Spark SQL and basic data analysis:
1. Start the Spark SQL Thrift server on the NameNode (the Spark SQL server node) as: /opt/spark/sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001
2. Download & install Tableau 10 from https://www.tableau.com/products [14-day trial version].
3. Download & install the Tableau driver for Spark SQL: https://downloads.tableau.com/drivers/mac/TableauDrivers.dmg
4. Open Tableau & connect to Spark SQL. Provide the server as the NameNode IP & the port as 10001 [as in step 1 above].
5. Select Type as 'SparkThriftServer'.
6. Select Authentication as 'Username and Password'.
7. Provide the username as 'hive' [the same as in hive-site.xml].
8. Provide the password as 'hive@123' [the same as in hive-site.xml].
9. Search & select the database name in the 'Select Schema' dropdown [this is the same Parquet DB the Spark jobs created].
10. Search & select the table names. Drag & drop the table
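Before pointing Tableau at the endpoint, it can help to confirm that the Thrift server from step 1 actually accepts connections. The following is my own minimal Scala sketch (not from the original steps); it assumes the hive-jdbc jar and its dependencies are on the classpath, and <namenode-ip> is a placeholder for the real host:

    // Hedged sketch: verify the Spark Thrift server that Tableau will connect to.
    import java.sql.DriverManager

    object ThriftCheck {
      def main(args: Array[String]): Unit = {
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        // Same host, port and credentials that Tableau is configured with.
        val conn = DriverManager.getConnection(
          "jdbc:hive2://<namenode-ip>:10001/default", "hive", "hive@123")
        val rs = conn.createStatement().executeQuery("SHOW DATABASES")
        while (rs.next()) println(rs.getString(1))   // should list the Parquet DB
        conn.close()
      }
    }

If this lists the expected databases, any Tableau connection failure is most likely on the driver or credentials side rather than the server side.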

How to connect SQL workbench to SparkSQL

Steps to set up SQL Workbench for accessing Spark SQL databases:
1. Start Spark SQL on the NameNode as: /opt/spark/bin/spark-sql --verbose --master yarn --driver-memory 5G --executor-memory 5G --executor-cores 2 --num-executors 5
2. Download SQL Workbench; for macOS download it from: http://www.sql-workbench.net/Workbench-Build117-MacJava7.tgz
3. Extract the downloaded tgz file and launch SQLWorkbenchJ.
4. Copy the jar /opt/spark/lib/spark-assembly-1.2.1-hadoop2.4.0.jar [or the equivalent for your Hadoop version] from the NameNode (the Spark SQL server).
5. In SQL Workbench, go to File -> Manage Drivers from the menu.
6. Click the 'Create new entry' button in the top-left corner.
7. Provide a driver name such as spark-sql_driver.
8. In the Library section, select the jar (needed for the JDBC driver) copied from the NameNode in step 4 above.
9. In the Classname section, click the 'Search' button. From the pop-up window, select the driver 'org.apache.hive.jdbc.HiveDriver' and click 'OK'. From
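Once the driver entry is defined, the connection profile in SQL Workbench needs a JDBC URL. Assuming the Spark Thrift/JDBC server is listening on port 10001 as in the previous post (HiveServer2's default port is 10000), the URL typically looks like:

    jdbc:hive2://<namenode-ip>:10001/default

with the same username/password that hive-site.xml is configured with.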

The points to remember while developing a spark application

A. The resource allocation should be optimised, i.e. the following needs to be considered:
- 15 cores per executor leads to bad HDFS I/O throughput; the best core count per executor is 4 to 6.
- max(384 MB, 0.07 * executor memory) is required for direct memory, i.e. the overhead.
- Don't set the executor memory too high; garbage collection & parallelism would be impacted.
- Reserve at least 1 core & 1 GB for the OS/Hadoop daemons.
- Account for the resources of one Application Master.
B. The number of partitions should be optimised, i.e. both the initial partitions and the intermediate partitions:
- Number of initial partitions = number of blocks in Hadoop, or the value of spark.default.parallelism.
- The size of each shuffle partition shouldn't be more than 2 GB, otherwise the job fails with IllegalArgumentException: Size exceeds Integer.MAX_VALUE.
- The number of child partitions can be greater than, equal to, or less than the number of partitions in the parent RDD.
C. Always use reduceByKey instead of groupByKey, and treeReduce instead of reduce, wherever possible (see the sketch after this list).
D. Take care of skewness of the data, i.e. some executors are
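The following is a minimal spark-shell sketch of point C (my illustration, not code from the post); it assumes the sc provided by spark-shell and a hypothetical CSV input path:

    // reduceByKey combines values map-side before the shuffle,
    // whereas groupByKey ships every raw value across the network.
    val pairs = sc.textFile("hdfs:///data/events").map(line => (line.split(",")(0), 1))

    val countsGood = pairs.reduceByKey(_ + _)             // preferred: partial sums per partition
    val countsBad  = pairs.groupByKey().mapValues(_.sum)  // avoid: full shuffle of raw values

    // treeReduce aggregates in several levels instead of pulling everything to the driver at once.
    val total = countsGood.values.treeReduce(_ + _)

    // Point B: size shuffle partitions so that no single partition approaches 2 GB.
    val repartitioned = countsGood.repartition(200)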

Setup hadoop and spark on MAC

In this article, I'll take you through simple steps to set up Hadoop/Spark and run a Spark job.
Step 1: Set up Java
- Run the command: java -version, and if Java is not installed, download it.
- After installation, get the JAVA_HOME with the command: /usr/libexec/java_home
- Update the .bashrc with the JAVA_HOME as: export JAVA_HOME=
Step 2: Set up keyless SSH
- Enable Remote Login in System Preferences => Sharing.
- Generate an RSA key: ssh-keygen -t rsa -P ''
- Add the RSA key to the authorized keys: cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- Check ssh localhost; it shouldn't prompt for a password.
Step 3: Set up Hadoop
- Download hadoop-2.7.2.tar.gz: http://www.apache.org/dyn/closer.cgi/hadoop/common/
- Extract the tar file and move hadoop-2.7.2 to /usr/local/hadoop
- Set up the configuration files [if a configuration file doesn't exist, copy it from the corresponding template file]. Update /usr/local/hadoop/
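Once the setup is done and the daemons are up, a quick spark-shell smoke test (my own sketch, not part of the original steps; it needs no input files) confirms that jobs actually run:

    // Parallelize a local range, then run a simple transformation + action.
    val nums = sc.parallelize(1 to 1000)
    println(nums.map(_ * 2).reduce(_ + _))   // expected output: 1001000

If this prints the expected sum, the installation is able to schedule and execute tasks.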

Few Important hadoop commands

To check the block size and replication factor of a file:
- hadoop fs -stat %o <file>
- hadoop fs -stat %r <file>
How to create a file with a different block size and replication factor:
- hadoop fs -Ddfs.block.size <64mb>
- hadoop fs -Ddfs.replication.factor 2
How to change the block size and replication factor of an existing file:
- To change the replication factor: hadoop dfs -setrep -w 4 -R <path>
- To change the block size there are two ways: either change it in hdfs-site.xml & restart the cluster, or copy the files using distcp to another path with the new block size & delete the old ones, as: hadoop distcp -Ddfs.block.size=XX /path/to/old/files /path/to/new/files/with/larger/block/sizes
Get multiple files under a directory merged into one local file: hadoop fs -getmerge <src-dir> <local-file>
Start the Hadoop ecosystem:
- start-dfs.sh / stop-dfs.sh and start-yarn.sh / stop-yarn.sh can be run from the master.
- Or, hadoop-daemon.sh start namenode/datanode and yarn-daemon.sh start resourcemanager, which need to be run on the individual nodes.
To view the FSImage as text: hdfs oiv -p XML -i fsimage_0
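As a complement to the hadoop fs -stat commands above, the same block size and replication factor can be read programmatically through the Hadoop FileSystem API. A hedged Scala sketch, e.g. from spark-shell where sc.hadoopConfiguration is already available (the file path is hypothetical):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs     = FileSystem.get(sc.hadoopConfiguration)
    val status = fs.getFileStatus(new Path("/path/to/file"))
    println(s"block size  : ${status.getBlockSize} bytes")
    println(s"replication : ${status.getReplication}")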

Important points on Apache spark

Spark Basics:
- Spark suits iterative programs (e.g. ML).
- Spark's Scala (a functional programming language) gives immutability, lazy transformations (evaluation is deferred until an action runs) and type inference. Because of immutability, we can cache & distribute.
- An RDD is a big, distributed collection of data with these properties: immutable, distributed, lazily evaluated, type-inferred, resilient (fault-tolerant) & cacheable.
- Spark remembers all its transformations; a transformation doesn't apply any action. So it has multiple copies (bad!).
- Scala code runs on top of the JVM. spark-shell is interactive.
- val means immutable, but an immutable value can be mapped to a new value.
- Interactive queries, real-time & batch processing are unified.
- Good use of resources (multi-core), network speed and disk.
- Velocity is as important as volume; real-time processing is as important as batch processing.
- The existing MapReduce is tightly coupled with its API. Spark makes use of Hadoop's distributed storage. In
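A minimal spark-shell sketch of the laziness and caching points above (my illustration, with a hypothetical log path; it assumes the sc provided by spark-shell):

    val lines  = sc.textFile("hdfs:///logs/app.log")    // hypothetical path
    val errors = lines.filter(_.contains("ERROR"))      // transformation: nothing runs yet
    errors.cache()                                       // mark for reuse; still lazy

    val total   = errors.count()                         // action: triggers the actual job
    val first10 = errors.take(10)                        // reuses the cached RDD

    // val is an immutable binding, but mapping it produces a new (also immutable) RDD.
    val upper = errors.map(_.toUpperCase)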

Some Linux Concepts.

How to check the utilisation of each core: mpstat -P ALL 1; lscpu or cat /proc/cpuinfo show the CPU layout.
ps -aeF gives the details of which core ID (the PSR column) a process is running on.
top -H -p <pid> => gives the details of all the threads of that process.
lsof -t <file> => gives the PID(s) of the processes that have the file open.
netstat: -a => all (listening + established); -n => suppress host/port name resolution; -t => only TCP; -p => program name; -r => routing table; -i => interfaces.
Find files smaller than 2K: find . -mindepth 2 -type f -size -2k (files at least two directories deep and smaller than 2 KB).
Find files which are not older than 2 days: find . -type f -mtime -2
stat zzz gives all the statistics of a file; with a format string you can print individual fields: %i (inode), %F (file type), %G (group), %U (owner), %b (blocks allocated).
atime => when the file was last accessed; mtime => when the file's contents were last modified; ctime => when the file was last changed (changed means the file's attributes were changed).
Unix file system: a directory has a name & a number; the number refers to a

Tested & verified powerful python oneliners

1. Reverse the words in the odd positions of a string and maintain title case for the reversed words:
python -c "for p in [word[::-1].title() if index % 2==0 else word for index,word in enumerate('A Quick Lazy Dogs Jumps Over The Fox'.split()) ]: print p,"
2. Print only the lines of a file whose 3rd field is one of '123', '234', '245':
python -c "for p in [ line.strip('\n') for line in open('file.txt').readlines() if int(line.split(',')[2]) in [123,234,245] ]: print p"
--------------OR------------------
cat file.txt | python -c "import sys; [sys.stdout.write(line) for line in sys.stdin if int(line.split(',')[2]) in [ 123,234,245 ] ]"
3. Print all the unique words in a file:
python -c "for p in set(word for line in open('file.txt').readlines() for word in line.strip('\n').split(',')): print p, "
4. Awk equivalent: pr