Use this on the vm-manage cluster.

%%spark
[2]:
from pyspark.sql import SparkSession

# Build the session first so the builder's appName and config are applied,
# then take the SparkContext from it.
spark = (
    SparkSession.builder
    .master("local")
    .appName("BIOS-823")
    .config("spark.executor.cores", 4)
    .getOrCreate()
)
sc = spark.sparkContext

1. Low-level API

  • Create an RDD with numbers from 1 to 100

  • Remove all numbers not divisible by 3

  • Square the numbers

  • Find the remainder modulo 7

  • Count the frequency of each remainder from 0 to 6 (a sketch follows the empty cell below)

[ ]:
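A minimal sketch of the steps above, assuming the SparkContext sc from the setup cell; variable names are illustrative.

rdd = sc.parallelize(range(1, 101))        # numbers from 1 to 100
remainders = (
    rdd.filter(lambda x: x % 3 == 0)       # keep only multiples of 3
       .map(lambda x: x * x)               # square each number
       .map(lambda x: x % 7)               # remainder modulo 7
)
freq = remainders.countByValue()           # frequency of each remainder 0-6
print(dict(sorted(freq.items())))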

2. High-level API

[40]:
from sklearn.datasets import load_iris
data, target = load_iris(return_X_y=True)

  • Combine the data and target variables into a pandas DataFrame with columns ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

  • Convert into a Spark DataFrame

  • Find the median and IQR for all measurements by species (a sketch follows the empty cell below)

[ ]:
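A possible sketch of the steps above; using percentile_approx to get per-group medians and quartiles is one implementation choice among several.

import pandas as pd
from pyspark.sql import functions as F

cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
pdf = pd.DataFrame(data, columns=cols)
pdf["species"] = target

sdf = spark.createDataFrame(pdf)

# Median = 0.5 quantile; IQR = 0.75 quantile - 0.25 quantile.
exprs = []
for c in cols:
    exprs.append(F.expr(f"percentile_approx({c}, 0.5)").alias(f"{c}_median"))
    exprs.append(
        (F.expr(f"percentile_approx({c}, 0.75)")
         - F.expr(f"percentile_approx({c}, 0.25)")).alias(f"{c}_iqr")
    )
sdf.groupBy("species").agg(*exprs).show()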

3. Spark ML

In this exercise you will use Spark to build and run a machine learning pipeline to separate ‘ham’ from ‘spam’ in SMS text messages. Then you will use the pipeline to classify SMS texts.

  • Create a Pandas DataFrame from the data in the file SMSSpamCollection, where each line is tab-separated into (label, text). If you find that the read_xxx functions in Pandas do not do the job correctly, read in the file line by line before converting to a DataFrame. Create an index column so that each row has a unique numeric id.

  • Convert to a Spark DataFrame that has two columns (klass, SMS) and split into training and test data sets with proportions 0.8 and 0.2 respectively, using a random seed of 123.

  • Build a Spark ML pipeline consisting of the following

    • StringIndexer: To convert klass into a numeric label column

    • Tokenizer: To convert SMS into a list of tokens

    • StopWordsRemover: To remove “stop words” from the tokens

    • CountVectorizer: To count words (use a vocabulary size of 100 and a minimum of 2 occurrences)

    • LogisticRegression: Use maxIter=10, regParam=0.001

  • Train the model on the training data.

  • Evaluate the precision, recall and accuracy of this model on the test data (a sketch follows the empty cell below).

[ ]:
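A sketch of one possible solution; the file path, the reading of "minimum of 2 occurrences" as CountVectorizer's minDF, and the expectation that StringIndexer maps spam (the rarer class) to label 1 are all assumptions.

import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.classification import LogisticRegression

# Read line by line since each line is "label<TAB>text".
with open("SMSSpamCollection") as f:
    rows = [line.rstrip("\n").split("\t", 1) for line in f]
pdf = pd.DataFrame(rows, columns=["klass", "SMS"]).reset_index()

sdf = spark.createDataFrame(pdf)
train, test = sdf.randomSplit([0.8, 0.2], seed=123)

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="klass", outputCol="label"),
    Tokenizer(inputCol="SMS", outputCol="tokens"),
    StopWordsRemover(inputCol="tokens", outputCol="filtered"),
    CountVectorizer(inputCol="filtered", outputCol="features",
                    vocabSize=100, minDF=2),  # minDF=2 is one reading of "occurrences"
    LogisticRegression(maxIter=10, regParam=0.001),
])
model = pipeline.fit(train)

# StringIndexer orders labels by frequency, so spam (the rarer class) should be 1.
pred = model.transform(test)
tp = pred.filter("label = 1 AND prediction = 1").count()
fp = pred.filter("label = 0 AND prediction = 1").count()
fn = pred.filter("label = 1 AND prediction = 0").count()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = pred.filter("label = prediction").count() / pred.count()
print(precision, recall, accuracy)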

4. Spark Streaming

In this exercise, you will simulate running a machine learning pipeline to classify streaming data.

  • Convert the test DataFrame into a Pandas DataFrame

  • Write each row of the DataFrame to a separate tab-delimited file in a folder called “incoming_sms”

  • Create a Structured Streaming DataFrame using readStream with option("maxFilesPerTrigger", 1) to simulate streaming data

  • Use the fitted pipeline created in Ex. 3 to transform the input stream

  • Write the transformed stream to memory with the name `sms_pred`

  • Sleep 30 seconds

  • Use an SQL query to show the index, label and prediction columns

  • Sleep 30 more seconds

  • Use an SQL query to show the index, label and prediction columns again (a sketch follows the empty cell below)

[ ]:
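A sketch of the streaming steps, reusing test and model from Ex. 3; the per-file naming inside incoming_sms is illustrative.

import os
import time

# One tab-delimited file per row of the test set.
test_pdf = test.toPandas()
os.makedirs("incoming_sms", exist_ok=True)
for i, row in test_pdf.iterrows():
    with open(f"incoming_sms/sms_{i:04d}.txt", "w") as f:
        f.write(f"{row['index']}\t{row['klass']}\t{row['SMS']}\n")

# A file-based streaming source needs an explicit schema.
stream = (
    spark.readStream
    .schema("index LONG, klass STRING, SMS STRING")
    .option("maxFilesPerTrigger", 1)
    .csv("incoming_sms", sep="\t")
)

query = (
    model.transform(stream)
    .writeStream
    .format("memory")
    .queryName("sms_pred")
    .start()
)

time.sleep(30)
spark.sql("SELECT index, label, prediction FROM sms_pred").show()
time.sleep(30)
spark.sql("SELECT index, label, prediction FROM sms_pred").show()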