Spark MLLib

  • Official documentation: The official documentation is clear, detailed and includes many code examples. You should refer to the official docs for exploration of this rich and rapidly growing library.

MLLib Pipeline

Generally, using MLLib for supervised or unsupervised learning follows some or all of the stages in this template:

  • Get data
  • Pre-process the data
  • Convert data to a form that MLLib functions require (*)
  • Build a model
  • Optimize and fit the model to the data
  • Post-processing and model evaluation

This is often assembled as a pipeline for convenience and reproducibility. This is very similar to what you would do with sklearn, except that MLLib allows you to handle massive datasets by distributing the analysis to multiple computers.
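To make the pipeline idea concrete, here is a minimal plain-Python sketch of stages that are fitted in sequence and then reused on new data. The classes `Center`, `Scale` and `ToyPipeline` are hypothetical names invented for illustration; they are not part of MLLib or sklearn.

```python
class Center:
    """Toy stage: subtracts the mean learned during fit."""
    def fit(self, xs):
        self.mean = sum(xs) / len(xs)
        return self

    def transform(self, xs):
        return [x - self.mean for x in xs]


class Scale:
    """Toy stage: divides by the largest absolute value seen during fit."""
    def fit(self, xs):
        self.scale = max(abs(x) for x in xs) or 1.0
        return self

    def transform(self, xs):
        return [x / self.scale for x in xs]


class ToyPipeline:
    """Fits each stage on the output of the previous one, then replays them in order."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, xs):
        for stage in self.stages:
            xs = stage.fit(xs).transform(xs)
        return self

    def transform(self, xs):
        for stage in self.stages:
            xs = stage.transform(xs)
        return xs


pipe = ToyPipeline([Center(), Scale()]).fit([1.0, 2.0, 3.0])
train_out = pipe.transform([1.0, 2.0, 3.0])
new_out = pipe.transform([4.0])  # reuses the mean and scale learned from the training data
```

The point of the design is that parameters learned during fitting (here a mean and a scale) are stored in the stages, so exactly the same transformation is applied to held-out data.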

Set up Spark and Spark SQL contexts

from pyspark import SparkContext
sc = SparkContext('local[*]')
from pyspark.sql import SQLContext
sqlc = SQLContext(sc)

Spark MLLib imports

The older mllib package works on RDDs. The newer ml package works on DataFrames. We will show examples using both, but it is more convenient to use the ml package.

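import numpy as np
import matplotlib.pyplot as plt

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import PCA
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression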

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.clustering import GaussianMixture
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel

Unsupervised Learning

We saw this machine learning problem previously with sklearn, where the task is to distinguish rocks from mines using 60 numerical sonar features. We will illustrate some of the mechanics of working with MLLib; this is not intended to be a serious attempt at modeling the data.

Obtain data

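df = (sqlc.read.format('com.databricks.spark.csv')
      .options(header='false', inferschema='true')
      .load(sonar_path))  # sonar_path: location of the sonar CSV file (not given in the source)

Pre-process the data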

df = df.withColumnRenamed("C60","label")

Transform the 60 features into MLLib vectors

assembler = VectorAssembler(
    inputCols=['C%d' % i for i in range(60)],
    outputCol="features")
output = assembler.transform(df)

Scale features to have zero mean and unit standard deviation

standardizer = StandardScaler(withMean=True, withStd=True,
                              inputCol="features", outputCol="std_features")
model = standardizer.fit(output)
output = model.transform(output)
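As a rough illustration of what this scaling does to a single column, the sketch below standardizes a small made-up list in plain Python. It assumes the sample standard deviation (dividing by n - 1), which is what Spark's StandardScaler is documented to use; the data values are invented.

```python
from statistics import mean, stdev

# Made-up values standing in for one feature column
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu = mean(column)      # column mean
sigma = stdev(column)  # sample standard deviation (n - 1 in the denominator)

# Subtract the mean, then divide by the standard deviation
scaled = [(x - mu) / sigma for x in column]
```

After scaling, the column has mean 0 and sample standard deviation 1, so features measured on different scales contribute comparably to PCA and to the regression.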

Convert label to numeric index

indexer = StringIndexer(inputCol="label", outputCol="label_idx")
indexed = indexer.fit(output).transform(output)
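By default StringIndexer orders labels by descending frequency, so the most common label receives index 0.0. The plain-Python sketch below mimics that rule on a made-up list of labels; the alphabetical tie-break is an assumption made only to keep the toy example deterministic.

```python
from collections import Counter

# Made-up labels: 'M' (mine) occurs more often than 'R' (rock)
labels = ['R', 'M', 'M', 'R', 'M']

counts = Counter(labels)
# Most frequent label first; ties broken alphabetically (assumption for this sketch)
ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
index = {lab: float(i) for i, lab in enumerate(ordered)}

label_idx = [index[lab] for lab in labels]
```

Under this rule, a label that maps to 1.0 is the less frequent of the two classes in the data that the indexer was fitted on.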

Extract only columns of interest

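sonar = indexed.select(['std_features', 'label', 'label_idx'])
sonar.show(n=3)
+--------------------+-----+---------+
|        std_features|label|label_idx|
+--------------------+-----+---------+
|[-0.3985897356694...|    R|      1.0|
|[0.70184498705605...|    R|      1.0|
|[-0.1289179854363...|    R|      1.0|
+--------------------+-----+---------+
only showing top 3 rows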

Data conversion

We will first fit a Gaussian Mixture Model with 2 components to the first 2 principal components of the data as an example of unsupervised learning. The GaussianMixture model requires an RDD of vectors, not a DataFrame. Note that pyspark converts numpy arrays to Spark vectors.

pca = PCA(k=2, inputCol="std_features", outputCol="pca")
model = pca.fit(sonar)
transformed = model.transform(sonar)
features = transformed.select('pca').rdd.map(lambda x: np.array(x))
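For intuition about the projection step, here is a small numpy sketch of PCA via the singular value decomposition on random stand-in data. It is not Spark's implementation; the array shape and random seed are arbitrary, and the signs of the components can differ between libraries.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # stand-in for the feature matrix
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each column

# Rows of Vt are the principal directions, ordered by decreasing variance
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Project every row onto the first two principal directions
scores = X @ Vt[:2].T
```

Each row of `scores` plays the same role as one entry of the `pca` column: a two-dimensional summary of the original feature vector.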

Build Model

gmm = GaussianMixture.train(features, k=2)

Optimize and fit the model to data

Note that we are looking at optimistic in-sample errors.

predict = gmm.predict(features).collect()
labels = sonar.select('label_idx').rdd.map(lambda r: r[0]).collect()

Post-processing and model evaluation

The GMM is poor at separating rocks from mines based on the first two principal components of the sonar data.

np.corrcoef(predict, labels)
array([[ 1.        ,  0.13825324],
       [ 0.13825324,  1.        ]])
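One caveat when comparing cluster assignments with true labels is that cluster numbers are arbitrary: cluster 0 may correspond to either class. The sketch below, using a hypothetical helper `agreement` and made-up values, reports the match rate under the better of the two possible labelings for a two-cluster problem.

```python
def agreement(predicted, truth):
    """Fraction of matching entries, taking the better of the two labelings of 2 clusters."""
    n = len(truth)
    same = sum(int(p) == int(t) for p, t in zip(predicted, truth)) / n
    # Swapping cluster names turns every match into a mismatch and vice versa
    return max(same, 1.0 - same)


# Made-up cluster assignments and true label indices
pred = [0, 0, 1, 1, 1, 0]
true = [1.0, 1.0, 0.0, 0.0, 1.0, 1.0]

score = agreement(pred, true)
```

For the same reason, the sign of a correlation between cluster assignments and labels carries no meaning; only its magnitude does.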

Plot the discrepancy between predicted clusters and true labels

xs = np.array(features.collect()).squeeze()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(xs[:, 0], xs[:,1], c=predict)
axes[1].scatter(xs[:, 0], xs[:,1], c=labels)

Supervised Learning

We will fit a logistic regression model to the data as an example of supervised learning.

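sonar.show(n=3)
+--------------------+-----+---------+
|        std_features|label|label_idx|
+--------------------+-----+---------+
|[-0.3985897356694...|    R|      1.0|
|[0.70184498705605...|    R|      1.0|
|[-0.1289179854363...|    R|      1.0|
+--------------------+-----+---------+
only showing top 3 rows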

Using mllib and RDDs

Convert to format expected by regression functions in mllib

data = sonar.rdd.map(lambda x: LabeledPoint(x[2], x[0]))

Split into test and train sets

train, test = data.randomSplit([0.7, 0.3])
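Note that randomSplit assigns rows at random, so the split sizes are only approximately 70/30 and change from run to run unless a seed is passed. The plain-Python sketch below imitates that behavior with a hypothetical `random_split` helper; the row count of 208 and the seed are illustrative choices.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to one of two groups with the given weights."""
    rng = random.Random(seed)
    cut = weights[0] / sum(weights)  # probability of landing in the first group
    first, second = [], []
    for row in rows:
        (first if rng.random() < cut else second).append(row)
    return first, second


rows = list(range(208))
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=42)
again_train, again_test = random_split(rows, [0.7, 0.3], seed=42)
```

In Spark the equivalent is to pass a seed, for example `data.randomSplit([0.7, 0.3], seed=42)`, when the same split is needed across runs.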

Fit model to training data

model = LogisticRegressionWithLBFGS.train(train)

Evaluate on test data

y_yhat = test.map(lambda x: (x.label, model.predict(x.features)))
err = y_yhat.filter(lambda x: x[0] != x[1]).count() / float(test.count())
print("Error = " + str(err))
Error = 0.30158730158730157
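A single error rate hides which class is being misclassified. As a sketch, the same (label, prediction) pairs can be tallied into a confusion table; the pairs below are made up and stand in for the result of collecting `y_yhat`.

```python
from collections import Counter

# Made-up (true label, predicted label) pairs
pairs = [(1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

# Count each (true, predicted) combination
confusion = Counter(pairs)

# Off-diagonal counts are the misclassifications
error = sum(n for (y, yhat), n in confusion.items() if y != yhat) / len(pairs)
```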

Using the newer ml pipeline

transformer = VectorAssembler(inputCols=['C%d' % i for i in range(60)],
                              outputCol="features")
standardizer = StandardScaler(withMean=True, withStd=True,
                              inputCol="features", outputCol="std_features")
indexer = StringIndexer(inputCol="C60", outputCol="label_idx")
pca = PCA(k=5, inputCol="std_features", outputCol="pca")
lr = LogisticRegression(featuresCol='std_features', labelCol='label_idx')

pipeline = Pipeline(stages=[transformer, standardizer, indexer, pca, lr])
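df = (sqlc.read.format('com.databricks.spark.csv')
      .options(header='false', inferschema='true')
      .load(sonar_path))  # sonar_path: location of the sonar CSV file (not given in the source)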
train, test = df.randomSplit([0.7, 0.3])
model = pipeline.fit(train)

import warnings

with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    prediction = model.transform(test)

score = prediction.select(['label_idx', 'prediction'])
acc = score.rdd.map(lambda x: x[0] == x[1]).sum() / score.count()

Spark MLLib and sklearn integration

There is a package that you can install with

pip install spark-sklearn

Basically, it provides the same API as sklearn's grid search but uses Spark (passed in via the SparkContext instance) to distribute the fitting of the candidate models across the cluster.

Example taken directly from package website

from sklearn import svm, grid_search, datasets
from spark_sklearn import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svr = svm.SVC()
clf = GridSearchCV(sc, svr, parameters)
clf.fit(iris.data, iris.target)
GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'kernel': ('linear', 'rbf'), 'C': [1, 10]},
       pre_dispatch='2*n_jobs', refit=True,
       sc=<pyspark.context.SparkContext object at 0x11ad38668>,
       scoring=None, verbose=0)
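To see why a grid search distributes well, the sketch below expands the same parameter grid into its individual candidate settings using only the standard library. Each candidate can be fitted independently of the others, which is the kind of work that can be farmed out to separate workers. This shows only the enumeration, not how the package schedules the fits.

```python
from itertools import product

parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

# Fix an order for the parameter names, then take the Cartesian product of their values
names = sorted(parameters)
grid = [dict(zip(names, values))
        for values in product(*(parameters[name] for name in names))]
```

With cross-validation, the total number of model fits is the number of candidates multiplied by the number of folds.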
