
Tuesday, July 19, 2016

How to link PyCharm with PySpark?


I'm new to Apache Spark, and I installed apache-spark with Homebrew on my MacBook:

Last login: Fri Jan  8 12:52:04 on console
user@MacBook-Pro-de-User-2:~$ pyspark
Python 2.7.10 (default, Jul 13 2015, 12:05:58)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/01/08 14:46:44 INFO SparkContext: Running Spark version 1.5.1
16/01/08 14:46:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/08 14:46:47 INFO SecurityManager: Changing view acls to: user
16/01/08 14:46:47 INFO SecurityManager: Changing modify acls to: user
16/01/08 14:46:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); users with modify permissions: Set(user)
16/01/08 14:46:50 INFO Slf4jLogger: Slf4jLogger started
16/01/08 14:46:50 INFO Remoting: Starting remoting
16/01/08 14:46:51 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.64:50199]
16/01/08 14:46:51 INFO Utils: Successfully started service 'sparkDriver' on port 50199.
16/01/08 14:46:51 INFO SparkEnv: Registering MapOutputTracker
16/01/08 14:46:51 INFO SparkEnv: Registering BlockManagerMaster
16/01/08 14:46:51 INFO DiskBlockManager: Created local directory at /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/blockmgr-769e6f91-f0e7-49f9-b45d-1b6382637c95
16/01/08 14:46:51 INFO MemoryStore: MemoryStore started with capacity 530.0 MB
16/01/08 14:46:52 INFO HttpFileServer: HTTP File server directory is /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/spark-8e4749ea-9ae7-4137-a0e1-52e410a8e4c5/httpd-1adcd424-c8e9-4e54-a45a-a735ade00393
16/01/08 14:46:52 INFO HttpServer: Starting HTTP Server
16/01/08 14:46:52 INFO Utils: Successfully started service 'HTTP file server' on port 50200.
16/01/08 14:46:52 INFO SparkEnv: Registering OutputCommitCoordinator
16/01/08 14:46:52 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/01/08 14:46:52 INFO SparkUI: Started SparkUI at http://192.168.1.64:4040
16/01/08 14:46:53 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/01/08 14:46:53 INFO Executor: Starting executor ID driver on host localhost
16/01/08 14:46:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50201.
16/01/08 14:46:53 INFO NettyBlockTransferService: Server created on 50201
16/01/08 14:46:53 INFO BlockManagerMaster: Trying to register BlockManager
16/01/08 14:46:53 INFO BlockManagerMasterEndpoint: Registering block manager localhost:50201 with 530.0 MB RAM, BlockManagerId(driver, localhost, 50201)
16/01/08 14:46:53 INFO BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Python version 2.7.10 (default, Jul 13 2015 12:05:58)
SparkContext available as sc, HiveContext available as sqlContext.
>>>

I would like to start playing with it in order to learn more about MLlib. However, I use PyCharm to write Python scripts. The problem is: when I go to PyCharm and try to import pyspark, PyCharm cannot find the module. I tried adding the path to PyCharm as follows:

(screenshot: PyCharm path settings)

Then from a blog I tried this:

import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME']="/Users/user/Apps/spark-1.5.2-bin-hadoop2.4"

# Append pyspark to Python Path
sys.path.append("/Users/user/Apps/spark-1.5.2-bin-hadoop2.4/python/pyspark")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print ("Successfully imported Spark Modules")

except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)

And I still cannot use PySpark with PyCharm. Any idea of how to "link" PyCharm with apache-pyspark?

Update:

Then I searched for the apache-spark and python paths in order to set the environment variables in PyCharm:

apache-spark path:

user@MacBook-Pro-User-2:~$ brew info apache-spark
apache-spark: stable 1.6.0, HEAD
Engine for large-scale data processing
https://spark.apache.org/
/usr/local/Cellar/apache-spark/1.5.1 (649 files, 302.9M) *
  Poured from bottle
From: https://github.com/Homebrew/homebrew/blob/master/Library/Formula/apache-spark.rb

python path:

user@MacBook-Pro-User-2:~$ brew info python
python: stable 2.7.11 (bottled), HEAD
Interpreted, interactive, object-oriented programming language
https://www.python.org
/usr/local/Cellar/python/2.7.10_2 (4,965 files, 66.9M) *

Then with the above information I tried to set the environment variables as follows:

(screenshot: configuration 1)

Any idea of how to correctly link PyCharm with pyspark?

Then, when I run a python script with the above configuration, I get this exception:

/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/user/PycharmProjects/spark_examples/test_1.py
Traceback (most recent call last):
  File "/Users/user/PycharmProjects/spark_examples/test_1.py", line 1, in <module>
    from pyspark import SparkContext
ImportError: No module named pyspark

UPDATE: Then I tried the configurations proposed by @zero323

Configuration 1:

/usr/local/Cellar/apache-spark/1.5.1/   

(screenshot: configuration 1)

out:

user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1$ ls
CHANGES.txt           NOTICE                libexec/
INSTALL_RECEIPT.json  README.md
LICENSE               bin/

Configuration 2:

/usr/local/Cellar/apache-spark/1.5.1/libexec   

(screenshot: configuration 2)

out:

user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1/libexec$ ls
R/        bin/      data/     examples/ python/
RELEASE   conf/     ec2/      lib/      sbin/

Answer by grc for How to link PyCharm with PySpark?


From the documentation:

To run Spark applications in Python, use the bin/spark-submit script located in the Spark directory. This script will load Spark's Java/Scala libraries and allow you to submit applications to a cluster. You can also use bin/pyspark to launch an interactive Python shell.

You are invoking your script directly with the CPython interpreter, which I think is causing problems.

Try running your script with:

"${SPARK_HOME}"/bin/spark-submit test_1.py  

If that works, you should be able to get it working in PyCharm by setting the project's interpreter to spark-submit.
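
For example, a minimal test_1.py (a sketch; the data and the app name are arbitrary) that could be launched with the command above:

# test_1.py -- minimal check that spark-submit wires up pyspark correctly
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("test_1").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Trivial job: count the elements of a small RDD
print(sc.parallelize([1, 2, 3, 4, 5]).count())

sc.stop()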

Answer by zero323 for How to link PyCharm with PySpark?


Create Run configuration:

  1. Go to Run -> Edit configurations
  2. Add new Python configuration
  3. Set Script path so it points to the script you want to execute
  4. Edit Environment variables field so it contains at least:

    • SPARK_HOME - it should point to the directory with the Spark installation. It should contain directories such as bin (with spark-submit, spark-shell, etc.) and conf (with spark-defaults.conf, spark-env.sh, etc.)
    • PYTHONPATH - it should contain $SPARK_HOME/python and, optionally, $SPARK_HOME/python/lib/py4j-some-version.src.zip if it is not available otherwise. some-version should match the Py4J version used by the given Spark installation (0.8.2.1 for Spark 1.5, 0.9 for Spark 1.6.0); see the example values after this list.


  5. Apply the settings
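
For instance, with the Homebrew install from the question, the two variables could look roughly like this (a sketch assuming Spark 1.5.1 under /usr/local/Cellar; check $SPARK_HOME/python/lib for the exact Py4J zip name on your machine):

SPARK_HOME=/usr/local/Cellar/apache-spark/1.5.1/libexec
PYTHONPATH=/usr/local/Cellar/apache-spark/1.5.1/libexec/python:/usr/local/Cellar/apache-spark/1.5.1/libexec/python/lib/py4j-0.8.2.1-src.zip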

Add PySpark library to the interpreter path (required for code completion):

  1. Go to File -> Settings -> Project Interpreter
  2. Open settings for an interpreter you want to use with Spark
  3. Edit the interpreter paths so they contain the path to $SPARK_HOME/python (and the Py4J zip if required)
  4. Save the settings

Use newly created configuration to run your script.
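
As a quick sanity check of the run configuration (a minimal sketch, assuming SPARK_HOME and PYTHONPATH are set as described above; the file and app names are hypothetical), a script like this should run directly from PyCharm:

# check_pyspark.py -- verifies that pyspark imports and a local context starts
from pyspark import SparkContext

sc = SparkContext("local[2]", "pycharm-check")   # app name is arbitrary
print(sc.parallelize(range(10)).sum())           # should print 45
sc.stop()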

Answer by obug for How to link PyCharm with PySpark?


I used the following page as a reference and was able to get pyspark/Spark 1.6.1 (installed via homebrew) imported in PyCharm 5.

http://renien.com/blog/accessing-pyspark-pycharm/

import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME']="/usr/local/Cellar/apache-spark/1.6.1"

# Append pyspark to Python Path
sys.path.append("/usr/local/Cellar/apache-spark/1.6.1/libexec/python")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print ("Successfully imported Spark Modules")
except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)

With the above, pyspark loads, but I get a gateway error when I try to create a SparkContext. There's some issue with Spark from Homebrew, so I just grabbed Spark from the Spark website (download the pre-built for Hadoop 2.6 and later) and pointed to the spark and py4j directories under that. Here's the code in PyCharm that works!

import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME']="/Users/myUser/Downloads/spark-1.6.1-bin-hadoop2.6"

# Need to explicitly point to python3 if you are using Python 3.x
os.environ['PYSPARK_PYTHON']="/usr/local/Cellar/python3/3.5.1/bin/python3"

# You might need to enter your local IP
# os.environ['SPARK_LOCAL_IP']="192.168.2.138"

# Path for pyspark and py4j
sys.path.append("/Users/myUser/Downloads/spark-1.6.1-bin-hadoop2.6/python")
sys.path.append("/Users/myUser/Downloads/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print ("Successfully imported Spark Modules")
except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)

sc = SparkContext('local')
words = sc.parallelize(["scala", "java", "hadoop", "spark", "akka"])
print(words.count())

I had a lot of help from these instructions, which helped me troubleshoot in PyDev and then get it working in PyCharm - https://enahwe.wordpress.com/2015/11/25/how-to-configure-eclipse-for-developing-with-python-and-spark-on-hadoop/

I'm sure somebody has spent a few hours bashing their head against their monitor trying to get this working, so hopefully this helps save their sanity!

Answer by sthomps for How to link PyCharm with PySpark?


Here's how I solved this on Mac OS X.

  1. brew install apache-spark
  2. Add this to ~/.bash_profile

    export SPARK_VERSION=`ls /usr/local/Cellar/apache-spark/ | sort | tail -1`
    export SPARK_HOME="/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec"
    export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
    export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
  3. Add pyspark and py4j to content root (use the correct Spark version):

    /usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip
    /usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip

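If PyCharm is not launched from a shell that has sourced ~/.bash_profile, it may not see those exports. A small fallback sketch (assuming SPARK_HOME is set somewhere PyCharm can see, e.g. in the run configuration) that derives the same paths at runtime:

import glob
import os
import sys

# Assumes SPARK_HOME points at .../apache-spark/<version>/libexec, as in the exports above
spark_home = os.environ["SPARK_HOME"]

# Mirror the PYTHONPATH entries from ~/.bash_profile
sys.path.insert(0, os.path.join(spark_home, "python"))
# Pick up whatever py4j-*-src.zip ships with this Spark version
py4j = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
if py4j:
    sys.path.insert(0, py4j[0])

from pyspark import SparkContext  # should now import cleanly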

Answer by Jason Wolosonovich for How to link PyCharm with PySpark?


Check out this video.

Assume your spark python directory is: /home/user/spark/python

Assume your Py4j source is: /home/user/spark/python/lib/py4j-0.9-src.zip

Basically, you add the spark python directory and the py4j directory within it to the interpreter paths. I don't have enough reputation to post a screenshot or I would.

In the video, the user creates a virtual environment within PyCharm itself; however, you can create the virtual environment outside of PyCharm (or activate a pre-existing one), start PyCharm with it, and then add those paths to the virtual environment's interpreter paths from within PyCharm.

I used other methods to add Spark via the bash environment variables, which works great outside of PyCharm, but for some reason those variables weren't recognized within PyCharm. This method worked perfectly.
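
To confirm the interpreter paths took effect, a quick check from the PyCharm Python console (a sketch, assuming the spark python directory and the py4j zip were added as described):

import sys
# Both the Spark python directory and the py4j zip should show up here
print([p for p in sys.path if "spark" in p.lower() or "py4j" in p.lower()])

import pyspark  # should succeed without touching sys.path by hand
print(pyspark.__file__)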

Answer by cheeech for How to link PyCharm with PySpark?


I followed the tutorials on-line and added the env variables to .bashrc:

# add pyspark to python
export SPARK_HOME=/home/lolo/spark-1.6.1
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH

I then just echoed the values of SPARK_HOME and PYTHONPATH to copy into PyCharm:

(srz-reco)lolo@K:~$ echo $SPARK_HOME
/home/lolo/spark-1.6.1
(srz-reco)lolo@K:~$ echo $PYTHONPATH
/home/lolo/spark-1.6.1/python/lib/py4j-0.9-src.zip:/home/lolo/spark-1.6.1/python/:/home/lolo/spark-1.6.1/python/lib/py4j-0.9-src.zip:/home/lolo/spark-1.6.1/python/:/python/lib/py4j-0.8.2.1-src.zip:/python/:

Then I copied these values into Run/Debug Configurations -> Environment variables for the script.
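
Concretely, the Environment variables field of the run configuration would then hold something like the following (values taken from the echo output above; adjust the paths and the Py4J version to your installation):

SPARK_HOME=/home/lolo/spark-1.6.1
PYTHONPATH=/home/lolo/spark-1.6.1/python:/home/lolo/spark-1.6.1/python/lib/py4j-0.9-src.zip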


