Use this on the vm-manage cluster.

%%spark
[2]:
from pyspark.sql import SparkSession

# Build the session first so the builder's appName and config are applied,
# then take the SparkContext from it.
spark = (
    SparkSession.builder
    .master("local")
    .appName("BIOS-823")
    .config("spark.executor.cores", 4)
    .getOrCreate()
)
sc = spark.sparkContext

1. Low-level API

  • Create an RDD with numbers from 1 to 100

  • Remove all numbers not divisible by 3

  • Square the numbers

  • Find the remainder modulo 7

  • Count the frequency of each remainder from 0 to 6 (a sketch follows the empty cell below)

[ ]:
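A minimal sketch of the steps above, assuming the SparkContext sc from the setup cell; variable names are illustrative.

rdd = sc.parallelize(range(1, 101))        # numbers from 1 to 100
remainders = (
    rdd.filter(lambda x: x % 3 == 0)       # keep only multiples of 3
       .map(lambda x: x * x)               # square each number
       .map(lambda x: x % 7)               # remainder modulo 7
)
freq = remainders.countByValue()           # frequency of each remainder 0-6
print(dict(sorted(freq.items())))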

2. High-level API

[40]:
from sklearn.datasets import load_iris
data, target = load_iris(return_X_y=True)

  • Combine the data and target variables into a pandas DataFrame with columns ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

  • Convert into a Spark DataFrame

  • Find the median and IQR for all measurements by species (a sketch follows the empty cell below)

[ ]:
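A possible sketch of the steps above; using percentile_approx to get per-group medians and quartiles is one implementation choice among several.

import pandas as pd
from pyspark.sql import functions as F

cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
pdf = pd.DataFrame(data, columns=cols)
pdf["species"] = target

sdf = spark.createDataFrame(pdf)

# Median = 0.5 quantile; IQR = 0.75 quantile - 0.25 quantile.
exprs = []
for c in cols:
    exprs.append(F.expr(f"percentile_approx({c}, 0.5)").alias(f"{c}_median"))
    exprs.append(
        (F.expr(f"percentile_approx({c}, 0.75)")
         - F.expr(f"percentile_approx({c}, 0.25)")).alias(f"{c}_iqr")
    )
sdf.groupBy("species").agg(*exprs).show()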

3. Spark ML

In this exercise you will use Spark to build and run a machine learning pipeline to separate ‘ham’ from ‘spam’ in SMS text messages. Then you will use the pipeline to classify SMS texts.

  • Create a Pandas DataFrame from the data in the file SMSSpamCollection, where each line is tab-separated into (label, text). If you find that the read_xxx functions in Pandas do not do the job correctly, read in the file line by line before converting to a DataFrame. Create an index column so that each row has a unique numeric id.

  • Convert to a Spark DataFrame that has two columns (klass, SMS) and split into training and test data sets with proportions 0.8 and 0.2 respectively, using a random seed of 123.

  • Build a Spark ML pipeline consisting of the following

    • StringIndexer: To convert klass into a numeric label column

    • Tokenizer: To convert SMS into a list of tokens

    • StopWordsRemover: To remove “stop words” from the tokens

    • CountVectorizer: To count words (use a vocabulary size of 100 and a minimum of 2 occurrences)

    • LogisticRegression: Use maxIter=10, regParam=0.001

  • Train the model on the training data.

  • Evaluate the precision, recall and accuracy of this model on the test data (a sketch follows the empty cell below).

[ ]:
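A sketch of one possible solution; the file path, the reading of "minimum of 2 occurrences" as CountVectorizer's minDF, and the expectation that StringIndexer maps spam (the rarer class) to label 1 are all assumptions.

import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.classification import LogisticRegression

# Read line by line since each line is "label<TAB>text".
with open("SMSSpamCollection") as f:
    rows = [line.rstrip("\n").split("\t", 1) for line in f]
pdf = pd.DataFrame(rows, columns=["klass", "SMS"]).reset_index()

sdf = spark.createDataFrame(pdf)
train, test = sdf.randomSplit([0.8, 0.2], seed=123)

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="klass", outputCol="label"),
    Tokenizer(inputCol="SMS", outputCol="tokens"),
    StopWordsRemover(inputCol="tokens", outputCol="filtered"),
    CountVectorizer(inputCol="filtered", outputCol="features",
                    vocabSize=100, minDF=2),  # minDF=2 is one reading of "occurrences"
    LogisticRegression(maxIter=10, regParam=0.001),
])
model = pipeline.fit(train)

# StringIndexer orders labels by frequency, so spam (the rarer class) should be 1.
pred = model.transform(test)
tp = pred.filter("label = 1 AND prediction = 1").count()
fp = pred.filter("label = 0 AND prediction = 1").count()
fn = pred.filter("label = 1 AND prediction = 0").count()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = pred.filter("label = prediction").count() / pred.count()
print(precision, recall, accuracy)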

4. Spark Streaming

In this exercise, you will simulate running a machine learning pipeline to classify streaming data.

  • Convert the test DataFrame into a Pandas DataFrame

  • Write each row of the DataFrame to a separate tab-delimited file in a folder called “incoming_sms”

  • Create a Structured Streaming DataFrame using readStream with option("maxFilesPerTrigger", 1) to simulate streaming data

  • Use the fitted pipeline created in Ex. 3 to transform the input stream

  • Write the transformed stream to memory with the name `sms_pred`

  • Sleep 30 seconds

  • Use an SQL query to show the index, label and prediction columns

  • Sleep 30 more seconds

  • Use an SQL query to show the index, label and prediction columns again (a sketch follows the empty cell below)

[ ]:
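A sketch of the streaming steps, reusing test and model from Ex. 3; the per-file naming inside incoming_sms is illustrative.

import os
import time

# One tab-delimited file per row of the test set.
test_pdf = test.toPandas()
os.makedirs("incoming_sms", exist_ok=True)
for i, row in test_pdf.iterrows():
    with open(f"incoming_sms/sms_{i:04d}.txt", "w") as f:
        f.write(f"{row['index']}\t{row['klass']}\t{row['SMS']}\n")

# A file-based streaming source needs an explicit schema.
stream = (
    spark.readStream
    .schema("index LONG, klass STRING, SMS STRING")
    .option("maxFilesPerTrigger", 1)
    .csv("incoming_sms", sep="\t")
)

query = (
    model.transform(stream)
    .writeStream
    .format("memory")
    .queryName("sms_pred")
    .start()
)

time.sleep(30)
spark.sql("SELECT index, label, prediction FROM sms_pred").show()
time.sleep(30)
spark.sql("SELECT index, label, prediction FROM sms_pred").show()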