How to access & use Spark SQL via PySpark in Spark 1.5?



To access Spark SQL in Spark 1.5, follow these steps:

1. Import SparkContext, SparkConf, and HiveContext (col is imported as well; it is used in the DataFrame example after step 4):

from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.functions import col

2. Set the application name and configuration [this is mandatory only if you are running your code in yarn-client mode]:
appName = "SqlPyspark"
conf = SparkConf().setAppName(appName)
conf.setExecutorEnv('PYTHONPATH', '/opt/spark/python:/opt/spark/python/lib/py4j-0.8.2.1-src.zip')
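
Optionally, the executor resources that step 5 passes on the command line can also be set here through conf.set; the keys below are standard Spark configuration properties, and the values are only examples:

conf.set("spark.executor.memory", "2G")      # mirrors --executor-memory
conf.set("spark.executor.instances", "2")    # mirrors --num-executors (YARN)
conf.set("spark.executor.cores", "2")        # mirrors --executor-cores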

3. Create the Spark and Hive contexts:

sc = SparkContext(conf=conf)
hc = HiveContext(sc)


4. Now use the Hive context to access databases and run queries:

hc.sql("show databases").show()
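
The same Hive context also drives the DataFrame API. Below is a minimal, self-contained sketch that builds a small DataFrame, registers it as a temporary table, and queries it both through hc.sql and through the col function imported in step 1; the table name and rows are made up purely for illustration:

# build a tiny DataFrame from an in-memory RDD (example data only)
rows = sc.parallelize([("alice", 30), ("bob", 25)])
df = hc.createDataFrame(rows, ["name", "age"])

# register it so hc.sql can see it as a table
df.registerTempTable("people")
hc.sql("SELECT name FROM people WHERE age > 26").show()

# the equivalent query through the DataFrame API, using col
df.filter(col("age") > 26).select("name").show()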

5. If you wish to put all of the above into a Python file, run the following command to submit it:


/opt/spark/bin/spark-submit --master yarn --deploy-mode client --py-files [Other py files if any] --executor-memory 2G --num-executors 2 --executor-cores 2 [Python program]
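
For reference, here is everything above collected into one runnable file (the name sql_pyspark.py is just an example) that could be passed to spark-submit as the [Python program]:

# sql_pyspark.py
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.functions import col

# step 2: application name and configuration
appName = "SqlPyspark"
conf = SparkConf().setAppName(appName)
conf.setExecutorEnv('PYTHONPATH', '/opt/spark/python:/opt/spark/python/lib/py4j-0.8.2.1-src.zip')

# step 3: Spark and Hive contexts
sc = SparkContext(conf=conf)
hc = HiveContext(sc)

# step 4: run a query through the Hive context
hc.sql("show databases").show()

sc.stop()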
