Nitendra Gautam

Installing Apache Spark on Linux

Apache Spark is an open-source cluster-computing framework. This post explains the steps for installing a prebuilt version of Apache Spark 2.1.1 as a standalone cluster on a Linux system. I have used Ubuntu, a Debian-based OS, for this post.

Install the OpenSSH server and client and other prerequisites
sudo apt-get install rsync
sudo apt-get install openssh-client openssh-server
sudo apt-get install telnetd
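
Before moving on, it can help to confirm that the tools installed above are actually on the PATH. This is a minimal sketch, assuming a POSIX shell. `missing_commands` is a hypothetical helper name and is not provided by any of the packages.

```shell
# missing_commands CMD...
# Print every command from the argument list that is not found on PATH.
# (Hypothetical helper for illustration only.)
missing_commands() {
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
        fi
    done
}

# Example: missing_commands ssh rsync
```

An empty result means every listed command was found. Server daemons such as sshd often live outside an ordinary user's PATH, so this check suits client tools best.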
Add a dedicated user for Spark
#Adding hduser
sudo adduser hduser

#Adding hduser to the sudoers list
sudo visudo -f /etc/sudoers

#Paste this in the sudoers file
root    ALL=(ALL:ALL) ALL
hduser  ALL=(ALL:ALL) ALL
Install Java on the Ubuntu machine
sudo apt-get install software-properties-common
sudo apt-get -y install python-software-properties
sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

#Add JAVA_HOME in bashrc file
nano ~/.bashrc 

#Add the Java environment to the last line of the bashrc file
export JAVA_HOME=/usr/lib/jvm/java-8-oracle 
export PATH=$JAVA_HOME/bin:$PATH

#Reload the bashrc file
source ~/.bashrc
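
Editing ~/.bashrc by hand works, but running the setup twice can leave duplicate export lines. As a rough sketch, the same step can be made repeatable with a small function. `append_once` is a hypothetical helper name, not a standard command.

```shell
# append_once FILE LINE
# Append LINE to FILE unless an identical line is already present.
# (Hypothetical helper for illustration only.)
append_once() {
    file=$1
    line=$2
    touch "$file"
    # -x matches the whole line, -F treats the pattern as a literal string.
    if ! grep -qxF -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Example (single quotes keep $JAVA_HOME unexpanded inside the file):
# append_once ~/.bashrc 'export JAVA_HOME=/usr/lib/jvm/java-8-oracle'
# append_once ~/.bashrc 'export PATH=$JAVA_HOME/bin:$PATH'
```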
Install Scala
#Remove any older version of scala
sudo apt-get remove scala-library scala
sudo wget
sudo dpkg -i scala-2.11.8.deb
sudo apt-get update
Install SBT (Scala Build Tool)
#Installation of sbt
sudo wget
sudo dpkg -i sbt-0.13.12.deb
sudo apt-get update
Install git and Apache Maven
#Install Git, which Spark depends on
sudo apt-get install git

#Install Maven
sudo apt-get install maven
Download Apache Spark
#Downloading Spark prebuilt for Hadoop
sudo wget
sudo mv spark-2.1.1-bin-hadoop2.7.tgz /usr/local/
cd /usr/local
sudo tar -xzf spark-2.1.1-bin-hadoop2.7.tgz
sudo mv /usr/local/spark-2.1.1-bin-hadoop2.7 /usr/local/spark

#Changing ownership and permissions on that directory
sudo chown -R hduser /usr/local/spark
sudo chmod 755 /usr/local/spark

cd /usr/local/spark
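
The extract-and-rename steps above can also be wrapped in one function that refuses to overwrite an existing installation. This is only a sketch. `install_tarball` is a hypothetical helper, and it assumes the archive contains a single top-level directory, as the prebuilt Spark tarball does.

```shell
# install_tarball ARCHIVE PREFIX NAME
# Extract ARCHIVE into PREFIX and rename its top-level directory to NAME.
# (Hypothetical helper; assumes one top-level directory in the archive.)
install_tarball() {
    archive=$1
    prefix=$2
    name=$3
    # The first entry in the listing names the top-level directory.
    top=$(tar -tzf "$archive" | head -n 1 | cut -d/ -f1)
    [ -n "$top" ] || return 1
    if [ -e "$prefix/$name" ]; then
        echo "refusing to overwrite $prefix/$name" >&2
        return 1
    fi
    tar -xzf "$archive" -C "$prefix" || return 1
    if [ "$top" != "$name" ]; then
        mv "$prefix/$top" "$prefix/$name"
    fi
}

# Example (run as a user allowed to write to /usr/local):
# install_tarball spark-2.1.1-bin-hadoop2.7.tgz /usr/local spark
```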

#Add SPARK_HOME at the end of the bashrc file as user hduser
nano ~/.bashrc
#Add the following at the end of the bashrc file
export SPARK_HOME=/usr/local/spark/
#Reload the bashrc file
source ~/.bashrc
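
Once SPARK_HOME is set, a quick sanity check is to confirm that it points at a directory that looks like a Spark distribution. A minimal sketch follows. `check_spark_home` is a hypothetical name, and the three scripts it looks for are the launchers shipped with prebuilt Spark 2.x packages.

```shell
# check_spark_home DIR
# Succeed only if DIR contains the launcher scripts of a Spark 2.x build.
# (Hypothetical helper for illustration only.)
check_spark_home() {
    dir=${1%/}   # drop a trailing slash, if any
    for rel in bin/spark-shell bin/spark-submit sbin/start-master.sh; do
        if [ ! -f "$dir/$rel" ]; then
            echo "missing: $dir/$rel" >&2
            return 1
        fi
    done
}

# Example: check_spark_home "$SPARK_HOME" && echo "SPARK_HOME looks fine"
```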
Edit the Spark config files
#Navigate to $SPARK_HOME/conf and copy slaves.template as slaves
cd /usr/local/spark/conf
cp slaves.template ./slaves

#create file using the provided template:
cp $SPARK_HOME/conf/ $SPARK_HOME/conf/

#Append a configuration parameter to the end of the file with the IP address of your machine
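
The file name and the parameter are not shown in the commands above, so the following is only a sketch under explicit assumptions: that the file is conf/spark-env.sh, created from the spark-env.sh.template that ships with Spark, and that the parameter is SPARK_MASTER_HOST. `init_conf` is a hypothetical helper, and the IP address in the example is a placeholder.

```shell
# init_conf CONF_DIR NAME KEY VALUE
# Create CONF_DIR/NAME from CONF_DIR/NAME.template if it does not exist,
# then append "export KEY=VALUE" unless KEY is already exported there.
# (Hypothetical helper for illustration only.)
init_conf() {
    conf_dir=$1
    name=$2
    key=$3
    value=$4
    target=$conf_dir/$name
    if [ ! -f "$target" ]; then
        cp "$target.template" "$target" || return 1
    fi
    if ! grep -q "^export $key=" "$target"; then
        printf 'export %s=%s\n' "$key" "$value" >> "$target"
    fi
}

# Assumed usage; file name, variable name, and address are assumptions:
# init_conf "$SPARK_HOME/conf" spark-env.sh SPARK_MASTER_HOST 192.168.1.10
```

An existing setting is left untouched, so rerunning the function does not overwrite a value edited by hand.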
Passwordless Cluster

The Spark master requires passwordless SSH to connect to its slaves. Since we’re building a standalone Spark cluster, we’ll need to set up passwordless SSH connections to localhost.

#Generate an SSH key and make the cluster passwordless for hduser on hostname localhost
ssh-keygen -t rsa -P ''

#Press Enter

#Copy the RSA public Key to the authorized keys file
cp .ssh/ .ssh/authorized_keys

#Test the passwordless key in cluster
ssh localhost
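
The key-copy step can be sketched as a function that appends a public key only if it is not already authorized and sets the permissions that sshd normally expects. `authorize_key` is a hypothetical helper. The example assumes the default public key name that `ssh-keygen -t rsa` produces, ~/.ssh/id_rsa.pub.

```shell
# authorize_key PUBKEY AUTH_FILE
# Append the public key in PUBKEY to AUTH_FILE unless it is already there.
# (Hypothetical helper for illustration only.)
authorize_key() {
    pubkey=$1
    auth=$2
    if [ ! -f "$pubkey" ]; then
        echo "no such key: $pubkey" >&2
        return 1
    fi
    auth_dir=$(dirname "$auth")
    mkdir -p "$auth_dir" && chmod 700 "$auth_dir" || return 1
    touch "$auth" && chmod 600 "$auth" || return 1
    # -x whole-line match, -F literal match, -f read patterns from the key file.
    grep -qxF -f "$pubkey" "$auth" || cat "$pubkey" >> "$auth"
}

# Example (assumes the default key name):
# authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```

Appending instead of copying keeps any keys that are already in authorized_keys.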
Start the Spark shell to use Spark from the command line
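
As a hedged sketch, the shell can be launched through a small wrapper around bin/spark-shell and its `--master` option. `spark_master_url` and `launch_shell` are hypothetical helper names. The sketch relies on 7077 being the default port of a standalone master and on `local[2]` meaning local mode with two worker threads.

```shell
# spark_master_url HOST [PORT]
# Print the URL of a standalone master; the port defaults to 7077.
spark_master_url() {
    printf 'spark://%s:%s\n' "$1" "${2:-7077}"
}

# launch_shell SPARK_HOME [MASTER]
# Start spark-shell against MASTER, or in local mode if MASTER is omitted.
# (Both functions are hypothetical helpers for illustration only.)
launch_shell() {
    home=${1%/}
    master=${2:-local[2]}
    "$home/bin/spark-shell" --master "$master"
}

# Examples:
# launch_shell "$SPARK_HOME"
# launch_shell "$SPARK_HOME" "$(spark_master_url localhost)"
```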
Deploying a Spark batch application or a Spark Streaming JAR file

To run a Spark batch or streaming application, the Spark master and slave daemons need to be started.

#Start the Spark master on your localhost:

#Start the Spark Slaves
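
The commands for these two steps are not shown above. One possible sketch uses the sbin/start-master.sh and sbin/start-slave.sh scripts that ship with Spark 2.x. `start_cluster` is a hypothetical wrapper, and the default master URL built from `hostname` and port 7077 is an assumption.

```shell
# start_cluster SPARK_HOME [MASTER_URL]
# Start the standalone master, then one slave that registers with it.
# (Hypothetical wrapper; the default MASTER_URL is an assumption.)
start_cluster() {
    home=${1%/}
    master_url=${2:-spark://$(hostname):7077}
    "$home/sbin/start-master.sh" || return 1
    "$home/sbin/start-slave.sh" "$master_url"
}

# Example: start_cluster "$SPARK_HOME" spark://localhost:7077
```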

Connecting Apache Spark with Apache Hive

To use Apache Hive from the Spark shell or Spark applications, Spark needs access to hive-site.xml and the MySQL connector library JAR.

  • Make a symbolic link to hive-site.xml in the Spark conf directory

ln -s /usr/local/hive/conf/hive-site.xml /usr/local/spark/conf/hive-site.xml

  • Copy the MySQL connector JAR to the Spark jars directory

cp mysql-connector-java-5.1.44.jar $SPARK_HOME/jars/
  • Add this property to hive-site.xml
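
The symbolic-link step in the list above fails if the link already exists. A repeatable variant might look like the sketch below. `link_hive_conf` is a hypothetical helper, and the directory arguments in the example are the paths used in this post.

```shell
# link_hive_conf HIVE_CONF_DIR SPARK_CONF_DIR
# Link hive-site.xml from the Hive conf directory into the Spark conf directory.
# (Hypothetical helper for illustration only.)
link_hive_conf() {
    src=$1/hive-site.xml
    if [ ! -f "$src" ]; then
        echo "not found: $src" >&2
        return 1
    fi
    # -s symbolic link, -f replace an existing link, -n do not follow it.
    ln -sfn "$src" "$2/hive-site.xml"
}

# Example: link_hive_conf /usr/local/hive/conf /usr/local/spark/conf
```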

Stopping a Spark Cluster
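
No commands are given for this step, so the following is only a sketch. It assumes the sbin/stop-slave.sh and sbin/stop-master.sh scripts that ship with Spark 2.x, and `stop_cluster` is a hypothetical wrapper name.

```shell
# stop_cluster SPARK_HOME
# Stop the local slave first, then the master it was registered with.
# (Hypothetical wrapper for illustration only.)
stop_cluster() {
    home=${1%/}
    "$home/sbin/stop-slave.sh"
    "$home/sbin/stop-master.sh"
}

# Example: stop_cluster "$SPARK_HOME"
```

Stopping the slave before the master avoids the slave logging connection errors while it is still running against a master that has gone away.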


