How to access & use Spark SQL via PySpark in Spark 1.5?



To access Spark SQL in Spark 1.5, follow these steps:

1. Import SparkContext, SparkConf, and HiveContext (col is imported as well; it is used in the DataFrame example after step 4):

from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.functions import col

2. Set the application name and configuration [this is mandatory only if you are running your code in yarn-client mode]:
appName = "SqlPyspark"
conf = SparkConf().setAppName(appName)
conf.setExecutorEnv('PYTHONPATH', '/opt/spark/python:/opt/spark/python/lib/py4j-0.8.2.1-src.zip')
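
Optionally, the executor resources that step 5 passes on the command line can also be set here through conf.set; the keys below are standard Spark configuration properties, and the values are only examples:

conf.set("spark.executor.memory", "2G")      # mirrors --executor-memory
conf.set("spark.executor.instances", "2")    # mirrors --num-executors (YARN)
conf.set("spark.executor.cores", "2")        # mirrors --executor-cores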

3. Create the Spark and Hive contexts:

sc = SparkContext(conf=conf)
hc = HiveContext(sc)


4. Now use the Hive context to access databases and run queries:

hc.sql("show databases").show()
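
The same Hive context also drives the DataFrame API. Below is a minimal, self-contained sketch that builds a small DataFrame, registers it as a temporary table, and queries it both through hc.sql and through the col function imported in step 1; the table name and rows are made up purely for illustration:

# build a tiny DataFrame from an in-memory RDD (example data only)
rows = sc.parallelize([("alice", 30), ("bob", 25)])
df = hc.createDataFrame(rows, ["name", "age"])

# register it so hc.sql can see it as a table
df.registerTempTable("people")
hc.sql("SELECT name FROM people WHERE age > 26").show()

# the equivalent query through the DataFrame API, using col
df.filter(col("age") > 26).select("name").show()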

5. If you wish to put all of the above into a Python file, run the following command to submit it:


/opt/spark/bin/spark-submit --master yarn --deploy-mode client --py-files [Other py files if any] --executor-memory 2G --num-executors 2 --executor-cores 2 [Python program]
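
For reference, here is everything above collected into one runnable file (the name sql_pyspark.py is just an example) that could be passed to spark-submit as the [Python program]:

# sql_pyspark.py
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.functions import col

# step 2: application name and configuration
appName = "SqlPyspark"
conf = SparkConf().setAppName(appName)
conf.setExecutorEnv('PYTHONPATH', '/opt/spark/python:/opt/spark/python/lib/py4j-0.8.2.1-src.zip')

# step 3: Spark and Hive contexts
sc = SparkContext(conf=conf)
hc = HiveContext(sc)

# step 4: run a query through the Hive context
hc.sql("show databases").show()

sc.stop()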
