
Category Archives: Spark

Spark Cluster Installation

=========MASTER SETUP==================
$cd conf/
$cp spark-defaults.conf.template spark-defaults.conf

$vi spark-defaults.conf

ADD THIS LINE

spark.master spark://192.168.2.101:7077

$ cp spark-env.sh.template spark-env.sh

$ vi spark-env.sh
ADD THIS LINE
SPARK_MASTER_HOST='192.168.2.101'

$ cp slaves.template slaves

$vi slaves

ADD THESE LINES (replace the default "localhost" entry)

slave1 IP/name
slave2 IP/name

$ cd sbin
$ ./start-all.sh
  (start-all.sh starts the master and all of the workers listed in conf/slaves; use ./start-master.sh instead if you want to start only the master)

=========SLAVE SETUP==================

$cd conf/

$ cp spark-env.sh.template spark-env.sh

$ vi spark-env.sh
ADD THIS LINE
SPARK_MASTER_HOST='192.168.2.101'

$ cd ../sbin
$ ./start-slave.sh spark://192.168.2.101:7077

===========RUN ON MASTER=============
./bin/pyspark

==========BROWSE MASTER===============
http://192.168.2.101:8080/

======================================
import math
numbers = sc.parallelize([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,232,124,56,78,34,90,76,23,67,123,345,346,789,546,786,343,7864,1212,876,299,1098],6)
log_values = numbers.map(lambda n : math.log10(n))
log_values.collect()

big_list = range(1000000)
rdd = sc.parallelize(big_list, 6)
odds = rdd.filter(lambda x: x % 2 != 0)
odds.take(500)
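The examples above run inside the pyspark shell, which already provides sc. As a rough sketch (not part of the original notes), the same kind of job can also be sent to the cluster from a standalone script that points explicitly at the standalone master configured above; the master URL spark://192.168.2.101:7077 and the app name are assumptions taken from this setup, so adjust them to your own cluster. Run it with bin/spark-submit.

# Minimal standalone sketch: connect to the standalone master set up above.
# Assumes the master URL spark://192.168.2.101:7077 from this post.
import math

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://192.168.2.101:7077")   # master from the setup above
         .appName("cluster-smoke-test")          # illustrative name
         .getOrCreate())
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 1001), 6)      # 6 partitions spread across the workers
print(numbers.map(lambda n: math.log10(n)).sum())
print("partitions:", numbers.getNumPartitions())

spark.stop()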

===========================================================

https://www.tutorialkart.com/apache-spark/how-to-setup-an-apache-spark-cluster/
Apache Spark Cluster Setup

Setup an Apache Spark Cluster
To set up an Apache Spark cluster, we need to do two things:

  1. Set up the master node
  2. Set up the worker node(s)

Setup Spark Master Node
Following is a step-by-step guide to set up the master node for an Apache Spark cluster. Execute the following steps on the node that you want to be the master.

  1. Navigate to the Spark configuration directory
    Go to the SPARK_HOME/conf/ directory.
    SPARK_HOME is the complete path to the root directory of Apache Spark on your computer.
  2. Edit the file spark-env.sh and set SPARK_MASTER_HOST
    Note: if spark-env.sh is not present, spark-env.sh.template will be. Make a copy of spark-env.sh.template named spark-env.sh and add/edit the field SPARK_MASTER_HOST. The relevant part of the file, with the SPARK_MASTER_HOST addition, is shown below:
    Part of spark-env.sh

# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname

SPARK_MASTER_HOST='192.168.0.102'

# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master

Replace the IP with the IP address assigned to the computer that you want to make the master.
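If you are not sure which address to use (the startup log below warns when the hostname resolves to a loopback address such as 127.0.1.1), a small Python snippet can print the LAN IP of the machine. This is only a hedged helper, not part of the original guide:

# Print the LAN IP to use for SPARK_MASTER_HOST. Connecting a UDP socket to a
# routable address sends no traffic, but makes the OS pick the outbound
# interface instead of the loopback address.
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.connect(("8.8.8.8", 80))   # nothing is actually sent
print(s.getsockname()[0])    # e.g. 192.168.0.102
s.close()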

  3. Start Spark as master
    Go to SPARK_HOME/sbin and execute the following command.

$ ./start-master.sh

~$ ./start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/lib/spark/logs/spark-arjun-org.apache.spark.deploy.master.Master-1-arjun-VPCEH26EN.out

  4. Verify the log file
    You should see the following in the log file, specifying the IP address of the master node, the port on which Spark has started, the port number on which the web UI has started, etc.
    Sample Spark startup log

Spark Command: /usr/lib/jvm/default-java/jre/bin/java -cp /usr/lib/spark/conf/:/usr/lib/spark/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host 192.168.0.102 --port 7077 --webui-port 8080

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/08/09 14:09:16 INFO Master: Started daemon with process name: 7715@arjun-VPCEH26EN
17/08/09 14:09:16 INFO SignalUtils: Registered signal handler for TERM
17/08/09 14:09:16 INFO SignalUtils: Registered signal handler for HUP
17/08/09 14:09:16 INFO SignalUtils: Registered signal handler for INT
17/08/09 14:09:16 WARN Utils: Your hostname, arjun-VPCEH26EN resolves to a loopback address: 127.0.1.1; using 192.168.0.102 instead (on interface wlp7s0)
17/08/09 14:09:16 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/08/09 14:09:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/09 14:09:17 INFO SecurityManager: Changing view acls to: arjun
17/08/09 14:09:17 INFO SecurityManager: Changing modify acls to: arjun
17/08/09 14:09:17 INFO SecurityManager: Changing view acls groups to:
17/08/09 14:09:17 INFO SecurityManager: Changing modify acls groups to:
17/08/09 14:09:17 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(arjun); groups with view permissions: Set(); users with modify permissions: Set(arjun); groups with modify permissions: Set()
17/08/09 14:09:17 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
17/08/09 14:09:17 INFO Master: Starting Spark master at spark://192.168.0.102:7077
17/08/09 14:09:17 INFO Master: Running Spark version 2.2.0
17/08/09 14:09:18 WARN Utils: Service 'MasterUI' could not bind on port 8080. Attempting port 8081.
17/08/09 14:09:18 INFO Utils: Successfully started service 'MasterUI' on port 8081.
17/08/09 14:09:18 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://192.168.0.102:8081
17/08/09 14:09:18 INFO Utils: Successfully started service on port 6066.
17/08/09 14:09:18 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
17/08/09 14:09:18 INFO Master: I have been elected leader! New state: ALIVE
Setting up Master Node is complete.
Setup Spark Slave (Worker) Node
Following is a step-by-step guide to set up a slave (worker) node for an Apache Spark cluster. Execute the following steps on every node that you want to be a worker.

  1. Navigate to the Spark configuration directory
    Go to the SPARK_HOME/conf/ directory.
    SPARK_HOME is the complete path to the root directory of Apache Spark on your computer.
  2. Edit the file spark-env.sh and set SPARK_MASTER_HOST
    Note: if spark-env.sh is not present, spark-env.sh.template will be. Make a copy of spark-env.sh.template named spark-env.sh and add/edit the field SPARK_MASTER_HOST. The relevant part of the file, with the SPARK_MASTER_HOST addition, is shown below:
    Part of spark-env.sh

# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname

SPARK_MASTER_HOST='192.168.0.102'

# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master

Replace the IP with the IP address assigned to your master (the one you used when setting up the master node).

  3. Start Spark as slave
    Go to SPARK_HOME/sbin and execute the following command.

$ ./start-slave.sh spark://<master-ip>:7077

apples-MacBook-Pro:sbin John$ ./start-slave.sh spark://192.168.0.102:7077
starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/Cellar/apache-spark/2.2.0/libexec/logs/spark-John-org.apache.spark.deploy.worker.Worker-1-apples-MacBook-Pro.local.out

  4. Verify the log
    You should find in the log that this worker node has been successfully registered with the master running at spark://192.168.0.102:7077 on the network.
    Verifying the worker startup

Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/jre/bin/java -cp /usr/local/Cellar/apache-spark/2.2.0/libexec/conf/:/usr/local/Cellar/apache-spark/2.2.0/libexec/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://192.168.0.102:7077

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/08/09 14:12:55 INFO Worker: Started daemon with process name: 7345@apples-MacBook-Pro.local
17/08/09 14:12:55 INFO SignalUtils: Registered signal handler for TERM
17/08/09 14:12:55 INFO SignalUtils: Registered signal handler for HUP
17/08/09 14:12:55 INFO SignalUtils: Registered signal handler for INT
17/08/09 14:12:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/09 14:12:56 INFO SecurityManager: Changing view acls to: John
17/08/09 14:12:56 INFO SecurityManager: Changing modify acls to: John
17/08/09 14:12:56 INFO SecurityManager: Changing view acls groups to:
17/08/09 14:12:56 INFO SecurityManager: Changing modify acls groups to:
17/08/09 14:12:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(John); groups with view permissions: Set(); users with modify permissions: Set(John); groups with modify permissions: Set()
17/08/09 14:12:56 INFO Utils: Successfully started service 'sparkWorker' on port 58156.
17/08/09 14:12:57 INFO Worker: Starting Spark worker 192.168.0.100:58156 with 4 cores, 7.0 GB RAM
17/08/09 14:12:57 INFO Worker: Running Spark version 2.2.0
17/08/09 14:12:57 INFO Worker: Spark home: /usr/local/Cellar/apache-spark/2.2.0/libexec
17/08/09 14:12:57 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
17/08/09 14:12:57 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://192.168.0.100:8081
17/08/09 14:12:57 INFO Worker: Connecting to master 192.168.0.102:7077...
17/08/09 14:12:57 INFO TransportClientFactory: Successfully created connection to /192.168.0.102:7077 after 57 ms (0 ms spent in bootstraps)
17/08/09 14:12:57 INFO Worker: Successfully registered with master spark://192.168.0.102:7077
The setup of the worker node is complete.
Multiple Spark Worker Nodes
To add more worker nodes to the Apache Spark cluster, just repeat the worker setup on the other nodes as well.
Once you have added some slaves to the cluster, you can view the workers connected to the master via the Master web UI.
Open the URL http://<master-ip>:<web-ui-port>/ (for example, http://192.168.0.102:8080/) in a browser. The connected slaves will be listed under Workers; they can also be read programmatically, as shown below.
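Besides opening the web UI in a browser, the standalone master usually exposes the same information as JSON at the /json path of its web UI. The sketch below is a hedged convenience (the endpoint and the field names are assumptions that may differ between Spark versions); if it is not available, just use the HTML page.

# Hedged sketch: list the registered workers by reading the master web UI's
# JSON view. Assumes the master UI from this tutorial at 192.168.0.102:8080
# and that the /json endpoint is available in your Spark version.
import json
from urllib.request import urlopen

MASTER_UI = "http://192.168.0.102:8080"

with urlopen(MASTER_UI + "/json") as resp:
    status = json.load(resp)

for w in status.get("workers", []):
    print(w.get("host"), w.get("state"), w.get("cores"), "cores")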

Conclusion
In this Apache Spark tutorial, we have successfully set up a master node and multiple worker nodes, that is, an Apache Spark cluster. In the next tutorial we shall learn to configure the Spark ecosystem.

 

Posted on July 25, 2021 in Spark

 

pySpark install in Linux

tar -xvzf spark-2.4.3-bin-hadoop2.7.tgz

sudo gedit .bashrc

export SPARK_HOME=/usr/lib/spark-2.4.3-bin-hadoop2.7
export PATH=$PATH:/usr/lib/spark-2.4.3-bin-hadoop2.7/bin
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
source .bashrc
yum --enablerepo=extras install epel-release
sudo yum install -y --enablerepo="epel" python-pip
pip install --upgrade pip
pip install jupyter
jupyter notebook --allow-root
-----installing weight of evidence----------
pip install mlencoders

jupyter notebook --allow-root

cd $SPARK_HOME
./sbin/start-all.sh
cd
sudo jps

pip install scikit-learn

pip install pandas

pyspark --------(do not run as the root user)
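Once the pyspark shell (or the Jupyter notebook driver configured above) is up, a quick sanity check can confirm that the context works. This is just a sketch, using the sc object that the shell creates for you:

# Run inside the pyspark shell / notebook: `sc` is the SparkContext the shell
# created for you.
print(sc.version)                            # e.g. 2.4.3
print(sc.master)                             # where the driver is connected
print(sc.parallelize(range(100), 4).sum())   # tiny job, should print 4950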

==============================

How to fix "ImportError: PyArrow >= 0.8.0 must be installed; however, it was not found."?
I use PySpark 2.4.0, and here is what happened when I executed the following code in pyspark:
$ ./bin/pyspark
Python 2.7.16 (default, Mar 25 2019, 15:07:04)

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 2.7.16 (default, Mar 25 2019 15:07:04)
SparkSession available as 'spark'.

from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import IntegerType, StringType
slen = pandas_udf(lambda s: s.str.len(), IntegerType())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File “/Users/x/spark/python/pyspark/sql/functions.py”, line 2922, in pandas_udf
return _create_udf(f=f, returnType=return_type, evalType=eval_type)
File “/Users/x/spark/python/pyspark/sql/udf.py”, line 47, in _create_udf
require_minimum_pyarrow_version()
File “/Users/x/spark/python/pyspark/sql/utils.py”, line 149, in require_minimum_pyarrow_version
“it was not found.” % minimum_pyarrow_version)
ImportError: PyArrow >= 0.8.0 must be installed; however, it was not found.
How to fix it?

The error message in this case is misleading; pyarrow wasn't installed.
Following the official documentation (the Spark SQL Guide, which links to Installing PyArrow), you should simply execute one of the following commands:
$ conda install -c conda-forge pyarrow
or
$ pip install pyarrow
It is also important to run it under the proper user and Python version. For example, if one is using Zeppelin under root with Python 3, it might be necessary to execute

pip3 install pyarrow

instead

I have done that but it is not working yet. Could it be related to the folder where it is installed? If I do $ pip list I can see pyarrow 0.16.0 – JOSE DANIEL FERNANDEZ Mar 28 at 0:39
Re-installing pyarrow is what works for me:
$ pip uninstall pyarrow -y
$ pip install pyarrow
and then restart the kernel.

FOR PYSPARK
$ pip install scikit-learn

$ pip install pandas

reboot

$ pip uninstall pyarrow -y

$ pip install pyarrow

reboot
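After pyarrow (and pandas) are installed for the same Python interpreter that the pyspark driver uses, the pandas_udf from the question should build and run. The following is a hedged verification sketch; the app name and sample data are made up for illustration:

# Verify that pandas_udf works once pyarrow is installed.
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("pyarrow-check").getOrCreate()

slen = pandas_udf(lambda s: s.str.len(), IntegerType())   # the UDF from the question
df = spark.createDataFrame([("spark",), ("pyarrow",)], ["word"])
df.select("word", slen("word").alias("length")).show()

spark.stop()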

 

Posted on July 25, 2021 in Spark

 

PySpark – Environment Setup


https://www.tutorialspoint.com/pyspark/pyspark_environment_setup.htm

Let us now download and set up PySpark with the following steps.

Step 1 − Go to the official Apache Spark download page and download the latest version of Apache Spark available there. In this tutorial, we are using spark-2.1.0-bin-hadoop2.7.

Step 2 − Now, extract the downloaded Spark tar file. By default, it will get downloaded in Downloads directory.

tar -xvf Downloads/spark-2.1.0-bin-hadoop2.7.tgz

It will create a directory spark-2.1.0-bin-hadoop2.7. Before starting PySpark, you need to set the following environment variables to set the Spark path and the Py4j path.

export SPARK_HOME=/home/hadoop/spark-2.1.0-bin-hadoop2.7
export PATH=$PATH:/home/hadoop/spark-2.1.0-bin-hadoop2.7/bin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/python:$PATH
Or, to set the above environment variables globally, put them in the .bashrc file. Then run the following command for them to take effect.

source .bashrc

Now that we have all the environment variables set, let us go to the Spark directory and invoke the PySpark shell by running the following command −

./bin/pyspark

This will start your PySpark shell.

Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
Using Python version 2.7.12 (default, Nov 19 2016 06:48:10)
SparkSession available as 'spark'.
>>>
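A few first commands to try in the shell, as a small sketch: spark and sc are the SparkSession and SparkContext that the banner above says are available, and the sample data is made up for illustration.

# Build a tiny DataFrame and run a couple of actions on it.
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
df.show()
print(df.filter(df.id > 1).count())     # -> 1
print(sc.parallelize(range(10)).sum())  # -> 45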

 

Posted on July 25, 2021 in Spark

 

pySpark installation

tar -xzf spark-3.0.0-preview2-bin-hadoop3.2.tgz

vi .bashrc % write the following in the .bashrc file

export SPARK_HOME=/home/joy/spark-3.0.0-preview2-bin-hadoop3.2
export PATH=$PATH:/home/joy/spark-3.0.0-preview2-bin-hadoop3.2/bin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/python:$PATH

source .bashrc

cd spark-3.0.0-preview2-bin-hadoop3.2/

./bin/pyspark

=======Testing program========
from pyspark import SparkContext

sc = SparkContext("local", "First App")
logFile = "file:///home/joy/spark-3.0.0-preview2-bin-hadoop3.2/README.md"
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
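Note that inside the ./bin/pyspark shell a SparkContext already exists as sc, so the program above is best saved to a file and run with bin/spark-submit. As a hedged alternative sketch, the same check can be written with the SparkSession API; the file name and README path below are assumptions based on the install location above.

# Save as e.g. test_app.py and run with: $SPARK_HOME/bin/spark-submit test_app.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("First App").getOrCreate()

log_file = "file:///home/joy/spark-3.0.0-preview2-bin-hadoop3.2/README.md"
log_data = spark.read.text(log_file).cache()

num_as = log_data.filter(log_data.value.contains("a")).count()
num_bs = log_data.filter(log_data.value.contains("b")).count()
print("Lines with a: %i, lines with b: %i" % (num_as, num_bs))

spark.stop()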

 

Posted on July 25, 2021 in Spark