
Showing posts from 2016

Tableau integration with sparkSQL and basic data analysis with Tableau

Steps for Tableau integration with Spark SQL and basic data analysis:
1. Start the Spark SQL Thrift server on the NameNode (the Spark SQL server node) as: /opt/spark/sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001
2. Download & install Tableau 10 from https://www.tableau.com/products [14-day trial version].
3. Download & install the Tableau driver for Spark SQL: https://downloads.tableau.com/drivers/mac/TableauDrivers.dmg
4. Open Tableau & connect to Spark SQL. Provide the server as the NameNode IP & the port as 10001 [as in step 1 above].
5. Select Type as 'SparkThriftServer'.
6. Select Authentication as 'Username and Password'.
7. Provide the username as 'hive' [the same as in hive-site.xml].
8. Provide the password as 'hive@123' [the same as in hive-site.xml].
9. Search & select the database name in the 'Select Schema' dropdown [this is the same Parquet DB the Spark jobs created].
10. Search & select the table names. Drag & drop the table
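Before pointing Tableau at the endpoint, it can help to confirm that the Thrift server from step 1 actually accepts connections. The following is my own minimal Scala sketch (not from the original steps); it assumes the hive-jdbc jar and its dependencies are on the classpath, and <namenode-ip> is a placeholder for the real host:

    // Hedged sketch: verify the Spark Thrift server that Tableau will connect to.
    import java.sql.DriverManager

    object ThriftCheck {
      def main(args: Array[String]): Unit = {
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        // Same host, port and credentials that Tableau is configured with.
        val conn = DriverManager.getConnection(
          "jdbc:hive2://<namenode-ip>:10001/default", "hive", "hive@123")
        val rs = conn.createStatement().executeQuery("SHOW DATABASES")
        while (rs.next()) println(rs.getString(1))   // should list the Parquet DB
        conn.close()
      }
    }

If this lists the expected databases, any Tableau connection failure is most likely on the driver or credentials side rather than the server side.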

How to connect SQL workbench to SparkSQL

Steps to set up SQL Workbench for accessing Spark SQL databases:
1. Start Spark SQL on the NameNode as: /opt/spark/bin/spark-sql --verbose --master yarn --driver-memory 5G --executor-memory 5G --executor-cores 2 --num-executors 5
2. Download SQL Workbench; for macOS download it from: http://www.sql-workbench.net/Workbench-Build117-MacJava7.tgz
3. Extract the downloaded tgz file and launch SQLWorkbenchJ.
4. Copy the jar /opt/spark/lib/spark-assembly-1.2.1-hadoop2.4.0.jar [or the equivalent for your Hadoop version] from the NameNode (the Spark SQL server).
5. In SQL Workbench, go to File -> Manage Drivers from the menu.
6. Click the 'Create new entry' button in the top-left corner.
7. Provide a driver name such as spark-sql_driver.
8. In the Library section, select the jar (needed for the JDBC driver) copied from the NameNode in step 4 above.
9. In the Classname section, click the 'Search' button. From the pop-up window, select the driver 'org.apache.hive.jdbc.HiveDriver' and click 'OK'. From
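Once the driver entry is defined, the connection profile in SQL Workbench needs a JDBC URL. Assuming the Spark Thrift/JDBC server is listening on port 10001 as in the previous post (HiveServer2's default port is 10000), the URL typically looks like:

    jdbc:hive2://<namenode-ip>:10001/default

with the same username/password that hive-site.xml is configured with.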

The points to remember while developing a spark application

A. The resource allocation should be optimised, i.e. the following needs to be considered:
- 15 cores per executor leads to bad HDFS I/O throughput; the best core count per executor is 4 to 6.
- max(384 MB, 0.07 * executor memory) is required for direct memory, i.e. the overhead.
- Don't set the executor memory too high; garbage collection & parallelism would be impacted.
- Reserve at least 1 core & 1 GB for the OS/Hadoop daemons.
- Account for the resources of one Application Master.
B. The number of partitions should be optimised, i.e. both the initial partitions and the intermediate partitions:
- Number of initial partitions = number of blocks in Hadoop, or the value of spark.default.parallelism.
- The size of each shuffle partition shouldn't be more than 2 GB, otherwise the job fails with IllegalArgumentException: Size exceeds Integer.MAX_VALUE.
- The number of child partitions can be greater than, equal to, or less than the number of partitions in the parent RDD.
C. Always use reduceByKey instead of groupByKey, and treeReduce instead of reduce, wherever possible (see the sketch after this list).
D. Take care of skewness of the data, i.e. some executors are
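The following is a minimal spark-shell sketch of point C (my illustration, not code from the post); it assumes the sc provided by spark-shell and a hypothetical CSV input path:

    // reduceByKey combines values map-side before the shuffle,
    // whereas groupByKey ships every raw value across the network.
    val pairs = sc.textFile("hdfs:///data/events").map(line => (line.split(",")(0), 1))

    val countsGood = pairs.reduceByKey(_ + _)             // preferred: partial sums per partition
    val countsBad  = pairs.groupByKey().mapValues(_.sum)  // avoid: full shuffle of raw values

    // treeReduce aggregates in several levels instead of pulling everything to the driver at once.
    val total = countsGood.values.treeReduce(_ + _)

    // Point B: size shuffle partitions so that no single partition approaches 2 GB.
    val repartitioned = countsGood.repartition(200)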

Setup hadoop and spark on MAC

In this article, I'll take you through simple steps to set up Hadoop/Spark and run a Spark job.
Step 1: Set up Java
- Run the command: java -version, and if Java is not installed, download it.
- After installation, get the JAVA_HOME with the command: /usr/libexec/java_home
- Update the .bashrc with the JAVA_HOME as: export JAVA_HOME=
Step 2: Set up keyless SSH
- Enable Remote Login in System Preferences => Sharing.
- Generate an RSA key: ssh-keygen -t rsa -P ''
- Add the RSA key to the authorized keys: cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- Check ssh localhost; it shouldn't prompt for a password.
Step 3: Set up Hadoop
- Download hadoop-2.7.2.tar.gz: http://www.apache.org/dyn/closer.cgi/hadoop/common/
- Extract the tar file and move hadoop-2.7.2 to /usr/local/hadoop
- Set up the configuration files [if a configuration file doesn't exist, copy it from the corresponding template file]. Update /usr/local/hadoop/
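Once the setup is done and the daemons are up, a quick spark-shell smoke test (my own sketch, not part of the original steps; it needs no input files) confirms that jobs actually run:

    // Parallelize a local range, then run a simple transformation + action.
    val nums = sc.parallelize(1 to 1000)
    println(nums.map(_ * 2).reduce(_ + _))   // expected output: 1001000

If this prints the expected sum, the installation is able to schedule and execute tasks.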

Few Important hadoop commands

To check the block size and replication factor of a file:
- hadoop fs -stat %o <file>
- hadoop fs -stat %r <file>
How to create a file with a different block size and replication factor:
- hadoop fs -Ddfs.block.size <64mb>
- hadoop fs -Ddfs.replication.factor 2
How to change the block size and replication factor of an existing file:
- To change the replication factor: hadoop dfs -setrep -w 4 -R <path>
- To change the block size there are two ways: either change it in hdfs-site.xml & restart the cluster, or copy the files using distcp to another path with the new block size & delete the old ones, as: hadoop distcp -Ddfs.block.size=XX /path/to/old/files /path/to/new/files/with/larger/block/sizes
Get multiple files under a directory merged into one local file: hadoop fs -getmerge <src-dir> <local-file>
Start the Hadoop ecosystem:
- start-dfs.sh / stop-dfs.sh and start-yarn.sh / stop-yarn.sh can be run from the master.
- Or, hadoop-daemon.sh start namenode/datanode and yarn-daemon.sh start resourcemanager, which need to be run on the individual nodes.
To view the FSImage as text: hdfs oiv -p XML -i fsimage_0
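As a complement to the hadoop fs -stat commands above, the same block size and replication factor can be read programmatically through the Hadoop FileSystem API. A hedged Scala sketch, e.g. from spark-shell where sc.hadoopConfiguration is already available (the file path is hypothetical):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs     = FileSystem.get(sc.hadoopConfiguration)
    val status = fs.getFileStatus(new Path("/path/to/file"))
    println(s"block size  : ${status.getBlockSize} bytes")
    println(s"replication : ${status.getReplication}")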

Important points on Apache spark

Spark Basics:
- Spark suits iterative programs (e.g. ML).
- Spark's Scala (a functional programming language) gives immutability, lazy transformations (evaluation is deferred until an action runs) and type inference. Because of immutability, we can cache & distribute.
- An RDD is a big, distributed collection of data with these properties: immutable, distributed, lazily evaluated, type-inferred, resilient (fault-tolerant) & cacheable.
- Spark remembers all its transformations; a transformation doesn't apply any action. So it has multiple copies (bad!).
- Scala code runs on top of the JVM. spark-shell is interactive.
- val means immutable, but an immutable value can be mapped to a new value.
- Interactive queries, real-time & batch processing are unified.
- Good use of resources (multi-core), network speed and disk.
- Velocity is as important as volume; real-time processing is as important as batch processing.
- The existing MapReduce is tightly coupled with its API. Spark makes use of Hadoop's distributed storage. In
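A minimal spark-shell sketch of the laziness and caching points above (my illustration, with a hypothetical log path; it assumes the sc provided by spark-shell):

    val lines  = sc.textFile("hdfs:///logs/app.log")    // hypothetical path
    val errors = lines.filter(_.contains("ERROR"))      // transformation: nothing runs yet
    errors.cache()                                       // mark for reuse; still lazy

    val total   = errors.count()                         // action: triggers the actual job
    val first10 = errors.take(10)                        // reuses the cached RDD

    // val is an immutable binding, but mapping it produces a new (also immutable) RDD.
    val upper = errors.map(_.toUpperCase)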

Some Linux Concepts.

How to check the utilisation of each core: mpstat -P ALL 1; lscpu or cat /proc/cpuinfo show the CPU layout.
ps -aeF gives the details of which core ID (the PSR column) a process is running on.
top -H -p <pid> => gives the details of all the threads of that process.
lsof -t <file> => gives the PID(s) of the processes that have the file open.
netstat: -a => all (listening + established); -n => suppress host/port name resolution; -t => only TCP; -p => program name; -r => routing table; -i => interfaces.
Find files smaller than 2K: find . -mindepth 2 -type f -size -2k (files at least two directories deep and smaller than 2 KB).
Find files which are not older than 2 days: find . -type f -mtime -2
stat zzz gives all the statistics of a file; with a format string you can print individual fields: %i (inode), %F (file type), %G (group), %U (owner), %b (blocks allocated).
atime => when the file was last accessed; mtime => when the file's contents were last modified; ctime => when the file was last changed (changed means the file's attributes were changed).
Unix file system: a directory has a name & a number; the number refers to a

Tested & verified powerful python oneliners

1. Reverse the words in the odd positions of a string and maintain title case for the reversed words:
python -c "for p in [word[::-1].title() if index % 2==0 else word for index,word in enumerate('A Quick Lazy Dogs Jumps Over The Fox'.split()) ]: print p,"
2. Print only the lines of a file whose 3rd field is one of '123', '234', '245':
python -c "for p in [ line.strip('\n') for line in open('file.txt').readlines() if int(line.split(',')[2]) in [123,234,245] ]: print p"
--------------OR------------------
cat file.txt | python -c "import sys; [sys.stdout.write(line) for line in sys.stdin if int(line.split(',')[2]) in [ 123,234,245 ] ]"
3. Print all the unique words in a file:
python -c "for p in set(word for line in open('file.txt').readlines() for word in line.strip('\n').split(',')): print p, "
4. Awk equivalent: pr