Spark on Cloud¶
How to set up and run Spark on an AWS EMR cluster.
AWS¶
AWS setup is more involved. We will show how to access pyspark via ssh to an EMR cluster, as well as how to set up the Zeppelin browser-based notebook (similar to Jupyter).
Know your AWS public and private access keys¶
These will look something like:
- public: AKIAIOSFODNN7EXAMPLE
- private: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Know your AWS EC2 key-pair¶
This is a name that you give (mine is cliburn-2016) and an associated PEM file (I keep mine at ~/AWS/cliburn-2016.pem). Set the correct permissions on the PEM file:
chmod 400 xxx.pem
Configure the AWS command line client¶
aws configure
AWS Access Key ID: <<Your public access key>>
AWS Secret Access Key: <<Your private access key>>
Default region name: us-east-1
Default output format: json
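Behind the scenes, aws configure writes these values to two plain-text files in your home directory. A sketch of what they end up looking like (the paths shown are the CLI defaults):
# ~/.aws/credentials
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
# ~/.aws/config
[default]
region = us-east-1
output = json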
Create a cluster¶
Warning: You will be charged for this.
aws emr create-cluster --name "<<NAME-FOR-CLUSTER>>" --release-label emr-4.5.0 --applications Name=Spark Name=Zeppelin-Sandbox --ec2-attributes KeyName=<<Your key-pair>> --instance-type m3.xlarge --instance-count 3 --use-default-roles
For example, I start mine with
aws emr create-cluster --name "spak-2016-d" --release-label emr-4.5.0 --applications Name=Spark Name=Zeppelin-Sandbox --ec2-attributes KeyName="cliburn-2016" --instance-type m3.xlarge --instance-count 3 --use-default-roles
A cluster-id should be returned
{
"ClusterId": "j-XXXXXXXXXXXXXXX"
}
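Provisioning takes several minutes. Rather than polling by hand, you can block until the cluster is ready with the CLI's wait subcommand (optional; the steps below work either way):
aws emr wait cluster-running --cluster-id j-XXXXXXXXXXXXXXX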
Get information about the cluster¶
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXXXX
or just inspect the state
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXXXX | grep \"State\"
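If you would rather see every cluster that is still up than query one by id, there is also:
aws emr list-clusters --active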
Connect to the cluster via ssh¶
aws emr ssh --cluster-id j-XXXXXXXXXXXXXXX --key-pair-file cliburn-2016.pem
Note the hostname that is returned¶
This is the master node's public DNS name and will be something like ec2-XX-X-XX-XXX.compute-1.amazonaws.com. You will need it for the SSH tunnel below.
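If you missed it, the same hostname is available from the cluster description; --query filters the JSON output with a JMESPath expression:
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXXXX --query Cluster.MasterPublicDnsName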
Run pyspark¶
Run
pyspark
and you will be in a pyspark console where you can issue Spark commands. When you’ve had enough fun playing in pyspark, end the session with Ctrl-D and exit to leave the ssh session.
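As a quick sanity check that the cluster is doing distributed work, here is a minimal sketch to try in the console; it assumes the shell has predefined the SparkContext as sc, which the pyspark shell does:
# Count the even numbers in 0..999, computed across the cluster.
rdd = sc.parallelize(range(1000))
print(rdd.filter(lambda x: x % 2 == 0).count())  # should print 500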
Run the Zeppelin notebook¶
Create an SSH tunnel to port 8890, the port Zeppelin listens on:
ssh -i xxx.pem -L 8890:ec2-xx-xx-xx-xx.compute-1.amazonaws.com:8890 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com -N -v
Fill in the xxx with the location of your PEM file, and use the hostname returned above for both addresses. The -L flag forwards local port 8890 to port 8890 on the master node, -N skips running a remote command (the session just holds the tunnel open), and -v gives verbose output so you can see whether the tunnel came up.
Connect to Zeppelin notebook¶
Open a browser to http://localhost:8890/ - if it worked you should see the Zeppelin start page.
Create notebook and run Spark within it¶
The default cell uses scala. For pyspark, just start a cell with %pyspark.
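For example, a minimal pyspark paragraph might look like this (sc is provided by Zeppelin's Spark interpreter, so no setup is needed):
%pyspark
# Sum the integers 0..99 on the cluster.
nums = sc.parallelize(range(100))
print(nums.sum())  # should print 4950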
Terminate the cluster¶
When you are done, remember to terminate the cluster!
aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXXXX
and confirm that it is terminating
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXXXX | grep \"State\"
You should see
"State": "TERMINATING"
"State": "TERMINATING"
"State": "TERMINATING"
If you are paranoid, log into the AWS Management Console, click on Services | EMR, and check the status of your cluster.