Use this on the vm-managed cluster.
%%spark
[2]:
from pyspark.sql import SparkSession
spark = (
SparkSession.builder
.master("local")
.appName("BIOS-823")
.config("spark.executor.cores", 4)
.getOrCreate()
)
# Take the SparkContext from the session rather than creating one first,
# so the builder's master and config settings are not silently ignored
sc = spark.sparkContext
1. Low-level API
Create an RDD with numbers from 1 to 100
Remove all numbers not divisible by 3
Square the numbers
Find the remainder modulo 7
Count the frequency of each digit from 0-6
[ ]:
2. High-level API
[40]:
from sklearn.datasets import load_iris
data, target = load_iris(return_X_y=True)
Combine the data and target variables into a pandas DataFrame with columns ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
Convert into a Spark DataFrame
Find the median and IQR for all measurements by species
[ ]:
3. Spark ML
In this exercise you will use Spark to build and run a machine learning pipeline to separate ‘ham’ from ‘spam’ in SMS text messages. Then you will use the pipeline to classify SMS texts.
Create a Pandas DataFrame from the data in the file SMSSpamCollection, where each line is tab-separated into (label, text). If you find that the read_xxx functions in Pandas do not do the job correctly, read in the file line by line before converting to a DataFrame. Create an index column so that each row has a unique numeric id. Convert to a Spark DataFrame that has two columns (klass, SMS) and split into training and test data sets with proportions 0.8 and 0.2 respectively, using a random seed of 123.
Build a Spark ML pipeline consisting of the following stages:
StringIndexer: to convert klass into a numeric labels column
Tokenizer: to convert SMS into a list of tokens
StopWordsRemover: to remove "stop words" from the tokens
CountVectorizer: to count words (use a vocabulary size of 100 and a minimum word occurrence of 2)
LogisticRegression: use maxIter=10, regParam=0.001
Train the model on the training data.
Evaluate the precision, recall, and accuracy of this model on the test data.
[ ]:
4. Spark Streaming
In this exercise, you will simulate running a machine learning pipeline to classify streaming data.
Convert the test DataFrame into a Pandas DataFrame
Write each row of the DataFrame to a separate tab-delimited file in a folder called “incoming_sms”
Create a Structured Streaming DataFrame using readStream with option("maxFilesPerTrigger", 1) to simulate streaming data
Use the fitted pipeline created in Ex. 3 to transform the input stream
Write the transformed stream to memory with the name sms_pred
Sleep 30 seconds
Use an SQL query to show the index, label and prediction columns
Sleep 30 more seconds
Use an SQL query to show the index, label and prediction columns
[ ]: