Spark on Cloud¶
How to set up and run Spark on Azure or AWS EC2 clusters.
AWS setup is more involved. We will show how to access pyspark via ssh to an EMR cluster, as well as how to set up the Zeppelin browser-based notebook (similar to Jupyter).
Know your AWS public and private access keys¶
These will look something like
- public:
- private:
Know your AWS EC2 key-pair¶
This consists of a name that you give (mine is cliburn-2016) and an associated PEM file (I keep mine at ~/AWS/cliburn-2016.pem).
Set the correct permissions on the PEM file.
chmod 400 xxx.pem
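If you would rather check the permissions from Python, a minimal sketch might look like this. The helper name `restrict_pem` and the throwaway demo file are hypothetical; the function applies the same owner-read-only mode as `chmod 400` and reports the mode the file ends up with (on a POSIX system).

```python
import os
import stat
import tempfile


def restrict_pem(path):
    """Apply owner-read-only permissions (like chmod 400) and return the resulting mode bits."""
    os.chmod(path, stat.S_IRUSR)  # 0o400: read for the owner, nothing for group or others
    return stat.S_IMODE(os.stat(path).st_mode)


# Demo on a throwaway file standing in for a real PEM file (hypothetical).
with tempfile.NamedTemporaryFile(suffix=".pem", delete=False) as tmp:
    demo_path = tmp.name

demo_mode = restrict_pem(demo_path)
print(oct(demo_mode))
os.remove(demo_path)
```

For a real key you would call `restrict_pem` on the path to your own PEM file instead of the temporary one.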
Configure the AWS command line client¶
aws configure
AWS Access Key ID: <<Your public access key>>
AWS Secret Access Key: <<Your private access key>>
Default region name: us-east-1
Default output format: json
Create a cluster¶
Warning: You will be charged for this.
aws emr create-cluster --name "<<NAME-FOR-CLUSTER>>" --release-label emr-4.5.0 --applications Name=Spark Name=Zeppelin-Sandbox --ec2-attributes KeyName=<<Your key-pair>> --instance-type m3.xlarge --instance-count 3 --use-default-roles
For example, I start mine with
aws emr create-cluster --name "spak-2016-d" --release-label emr-4.5.0 --applications Name=Spark Name=Zeppelin-Sandbox --ec2-attributes KeyName="cliburn-2016" --instance-type m3.xlarge --instance-count 3 --use-default-roles
A cluster ID of the form j-XXXXXXXXXXXXXXX should be returned.
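With the JSON output format configured earlier, the ID comes back inside a small JSON document. The sketch below assumes that document has a `ClusterId` field; the helper name `parse_cluster_id` and the sample text are illustrative, not real output.

```python
import json


def parse_cluster_id(create_output):
    """Return the ClusterId field from the JSON text printed by create-cluster."""
    return json.loads(create_output)["ClusterId"]


# Illustrative output only; a real ID replaces the X placeholders.
sample_create_output = '{"ClusterId": "j-XXXXXXXXXXXXXXX"}'
cluster_id = parse_cluster_id(sample_create_output)
print(cluster_id)
```

Keeping the ID in a variable saves retyping it in the commands that follow.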
Get information about the cluster¶
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXXXX
or just inspect the state
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXXXX | grep \"State\"
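The grep approach prints every line that contains "State", which can be more than one if nested objects carry their own state. As a rough alternative, the sketch below reads only the cluster-level state from the JSON. It assumes the state sits at `Cluster.Status.State`; the helper name `cluster_state` is hypothetical and the sample is trimmed and made up.

```python
import json


def cluster_state(describe_output):
    """Return Cluster.Status.State from describe-cluster JSON text."""
    return json.loads(describe_output)["Cluster"]["Status"]["State"]


# Trimmed, illustrative sample; real output has many more fields.
sample_describe_output = """
{
  "Cluster": {
    "Id": "j-XXXXXXXXXXXXXXX",
    "Status": {"State": "WAITING"}
  }
}
"""
state = cluster_state(sample_describe_output)
print(state)
```

In practice you would pass in the text captured from the describe-cluster command rather than the sample string.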
Connect to the cluster via ssh
aws emr ssh --cluster-id j-XXXXXXXXXXXXXXX --key-pair-file cliburn-2016.pem
Note the IP address that is returned¶
It will be something like
Run pyspark
You will then be in a pyspark console where you can issue Spark commands. When you’ve had enough fun playing in pyspark for a while, end the session with Ctrl-D, then type exit to leave the ssh session.
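As a small example of something to try in the console, here is a sketch of a helper that squares a range of numbers in parallel and adds them up. The function name `sum_of_squares` is made up for illustration, and it assumes the console has already created a SparkContext for you under the name `sc`.

```python
def sum_of_squares(sc, n):
    """Distribute 0..n-1 across the cluster, square each element, and add up the results."""
    return sc.parallelize(range(n)).map(lambda x: x * x).sum()
```

After pasting the function into the console, calling `sum_of_squares(sc, 10)` should give 285, the sum of the squares of 0 through 9.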
Run the Zeppelin notebook
Create an SSH tunnel to port 8890
ssh -i xxx.pem -L 8890:localhost:8890 hadoop@<<IP address>> -N -v
Fill in the xxx with the location of your PEM file, and the appropriate IP address.
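To make the pieces of the tunnel command explicit, the sketch below assembles it as an argument list. Several things here are assumptions rather than part of the original notes: the helper name `tunnel_command`, the login user `hadoop`, forwarding to `localhost` on the remote side, and the placeholder `MASTER-ADDRESS` standing in for the address you noted earlier.

```python
import shlex


def tunnel_command(pem_path, host, port=8890, user="hadoop"):
    """Build the argument list for an ssh local port forward to the master node."""
    return [
        "ssh", "-i", pem_path,
        "-N",                              # no remote command, only forwarding
        "-L", f"{port}:localhost:{port}",  # local port -> same port on the remote side
        f"{user}@{host}",
    ]


# Hypothetical values; substitute your PEM path and the address noted earlier.
cmd = tunnel_command("xxx.pem", "MASTER-ADDRESS")
print(" ".join(shlex.quote(part) for part in cmd))
```

The printed line can be pasted into a terminal, or the list can be handed to `subprocess.run` if you prefer to start the tunnel from Python.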
Connect to Zeppelin
Open a browser to http://localhost:8890/ - if it worked you should see this

Zeppelin screenshot
Create notebook and run Spark within it¶
The default cell uses scala. For pyspark just start a cell with
Terminate the cluster¶
When you are done, remember to terminate the cluster!
aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXXXX
and confirm that it is terminating
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXXXX | grep \"State\"
You should see
If you are paranoid, log into the AWS Management Console, click on Services | EMR, and check the status of your cluster.
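If you would rather not re-run the status check by hand, a polling loop is one option. The sketch below is generic. `wait_for_state` and `get_state` are hypothetical names, where `get_state` stands for any callable that returns the current cluster state as a string. The state names in the demo are what I understand EMR to report, and the demo feeds in canned values instead of calling AWS.

```python
import time


def wait_for_state(get_state, targets, interval=30.0, max_checks=40, sleep=time.sleep):
    """Poll get_state() until it returns one of targets; return that state, or None if checks run out."""
    for _ in range(max_checks):
        state = get_state()
        if state in targets:
            return state
        sleep(interval)
    return None


# Demo with canned states instead of real AWS calls (illustrative only).
canned = iter(["TERMINATING", "TERMINATING", "TERMINATED"])
final_state = wait_for_state(
    lambda: next(canned),
    {"TERMINATED", "TERMINATED_WITH_ERRORS"},
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(final_state)
```

The `max_checks` limit keeps the loop from running forever if the cluster never reaches one of the target states.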