How to access and use Spark SQL via PySpark in Spark 1.5?
To access Spark SQL in Spark 1.5, follow these steps:
1. Import the SparkContext and HiveContext classes:
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.functions import col
2. Set the application name and configuration [the PYTHONPATH executor setting is mandatory only if you are running your code in yarn-client mode]:
appName = "SqlPyspark"
conf = SparkConf().setAppName(appName)
conf.setExecutorEnv('PYTHONPATH', '/opt/spark/python:/opt/spark/python/lib/py4j-0.8.2.1-src.zip')
3. Create the Spark and Hive contexts:
sc = SparkContext(conf=conf)
hc = HiveContext(sc)
4. Now use the Hive context to access databases and perform operations:
hc.sql("show databases")
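In Spark 1.5, hc.sql() returns a DataFrame, so you can chain DataFrame operations on the result or read a Hive table directly. As a sketch, using the col function imported in step 1 (the table name "employees" and the columns "name" and "salary" are hypothetical, used only for illustration):

# Print the query result; sql() returns a DataFrame in Spark 1.5
hc.sql("show databases").show()

# Load a Hive table into a DataFrame; 'employees' is an assumed table name
df = hc.table("employees")

# Filter and project columns with col(); 'name' and 'salary' are
# assumed column names for this sketch
df.filter(col("salary") > 50000).select("name", "salary").show()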
5. If you wish to combine all of the above into a single Python file, run the following command to execute it against Spark SQL (a consolidated example script follows below):
/opt/spark/bin/spark-submit --master yarn --deploy-mode client --py-files [Other py files if any] --executor-memory 2G --num-executors 2 --executor-cores 1 [Python program]
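For reference, a minimal consolidated script combining steps 1 through 4 might look like the following (the file name sql_pyspark.py is an assumption for this example):

# sql_pyspark.py -- minimal Spark SQL script for Spark 1.5 (file name assumed)
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext

appName = "SqlPyspark"
conf = SparkConf().setAppName(appName)
conf.setExecutorEnv('PYTHONPATH', '/opt/spark/python:/opt/spark/python/lib/py4j-0.8.2.1-src.zip')

sc = SparkContext(conf=conf)
hc = HiveContext(sc)

# Run any Hive-compatible SQL statement and print the resulting DataFrame
hc.sql("show databases").show()

# Stop the context so the application exits cleanly
sc.stop()

You would then pass sql_pyspark.py as the [Python program] argument to the spark-submit command above.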