![LearnAI Header](https://coursematerial.blob.core.windows.net/assets/LearnAI_header.png)

# Hyperparameter tuning with Azure Databricks

In this lab, we will discuss how to perform hyperparameter tuning, a ubiquitous task in AI and ML projects.

Let's make sure we understand the distinction between hyperparameters and model parameters:
- **Hyperparameters**: These are choices that someone (e.g. a data scientist) has to make when setting up a machine learning pipeline. Examples of hyperparameters are learning rates, regularization strength, and how to handle missing values.
- **Model parameters**: These are parameters that a model *learns* from the data, e.g. to make better predictions. For example, when using [linear regression](https://en.wikipedia.org/wiki/Linear_regression) to predict the weight of a person based on their height, the algorithm learns the [intercept and slope](https://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_line) that give the most accurate predictions.
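
To make the distinction concrete, here is a minimal pure-Python sketch (not part of the lab's Spark pipeline). The function name `fit_line`, the `reg` argument, and the height/weight numbers are all made up for illustration. The regularization strength `reg` is a hyperparameter we choose before fitting, while the intercept and slope are model parameters learned from the data.

```python
def fit_line(xs, ys, reg=0.0):
    """Fit y = intercept + slope * x by least squares.

    `reg` is a hyperparameter: an L2 penalty on the slope, chosen by us.
    The returned intercept and slope are model parameters learned from the data.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + reg)           # a larger reg shrinks the slope toward 0
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data following weight = 0.8 * height - 70 (illustrative only)
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
weights = [50.0, 58.0, 66.0, 74.0, 82.0]

# Same data, two hyperparameter choices, two different sets of learned parameters
for reg in (0.0, 1000.0):
    intercept, slope = fit_line(heights, weights, reg=reg)
    print(f"reg={reg}: intercept={intercept:.2f}, slope={slope:.2f}")
```

Changing `reg` changes which intercept and slope the fit arrives at, which is why hyperparameters need to be tuned rather than learned in the same step.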

## Mount data

In [4]:
%run "../includes/mnt_blob"

## Loading the data

We begin by loading our data, which is stored in the Parquet format.

In [6]:
fileName = "gwDF"

initialDF = (spark.read   # our DataFrameReader
  .parquet(fileName)      # location of our data; Parquet files carry their own schema
  .cache()                # mark the DataFrame as cached
)

initialDF.count()                # materialize the cache

In [7]:
initialDF.printSchema()

## Train/Test Split

As before, we split our dataset into separate training and test sets.

Using the `randomSplit()` function, we split the data such that roughly 80% of the data is reserved for training and the remaining 20% for testing.

For more information see:
* Python: <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.randomSplit" target="_blank">DataFrame.randomSplit()</a>

In [9]:
trainDF, testDF = initialDF.randomSplit(
  [0.8, 0.2],  # 80-20 split
  seed=42)     # For reproducibility

print("We have %d training examples and %d test examples." % (trainDF.count(), testDF.count()))
assert (trainDF.count() == 3346)
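
The training count is not an exact 80% of the rows because a random split assigns rows by chance rather than by position. The sketch below illustrates that idea in plain Python. It is a conceptual stand-in, not Spark's implementation, and `random_split` and the toy `rows` list are hypothetical names used only here.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split using an independent random draw."""
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)           # a fixed seed makes the split repeatable
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any draw left over by rounding
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))                # toy stand-in for DataFrame rows
train, test = random_split(rows, [0.8, 0.2], seed=42)
print(len(train), len(test))            # close to 800 / 200, but not necessarily exact
```

Running it again with the same seed reproduces the same split, which is the role `seed=42` plays in the cell above.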

In [10]:
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, IDF, CountVectorizer
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

tokenizer = (RegexTokenizer()
            .setInputCol("tweet")
            .setOutputCol("tokens")
            .setPattern("\\W+"))

remover = (StopWordsRemover()
          .setInputCol("tokens")
          .setOutputCol("stopWordFree"))

counts = (CountVectorizer()
          .setInputCol("stopWordFree")
          .setOutputCol("counts"))

idf = IDF(inputCol="counts", outputCol="features")

lg = LogisticRegression().setLabelCol('existence').setFeaturesCol("features")

pipeline = Pipeline().setStages([tokenizer, remover, counts, idf, lg])

In [11]:
pipelineModel = pipeline.fit(trainDF)

## Test pipeline on hold-out data

Next, apply the trained pipeline model to the test set.

In [13]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

resultDF = pipelineModel.transform(testDF)

evaluator = (MulticlassClassificationEvaluator()
             .setLabelCol("existence")
             .setMetricName('accuracy'))

init_accuracy = evaluator.evaluate(resultDF)

print("Accuracy: %(result)s" % {"result": init_accuracy})

In [14]:
display(resultDF.select("tweet","existence","prediction"))

tweet,existence,prediction
"""""""@NASA Climate Change"""" is now on #Facebook. Become a fan & keep up w/ the #climate science buzz http://bit.ly/dzKcEq RT @Flipbooks""",1,1.0
"""""""All 30 Major League Baseball Teams Throw Curve to Climate Change Deniers : CleanTechnica"""" http://j.mp/ars7W2 #cleantech #greentech #MLB""",1,1.0
"""""""Any"""" = legitimate efforts by scientists to mislead and missrepresent their global warming findings. I haven't heard any implications yet.""",1,1.0
"""""""Kerry Graham Lieberman Climate Bill - KGL Global Warming Energy Bill - thedailygreen.com"""" http://j.mp/adUkuK""",1,1.0
"""""""Political talk shows discuss global warming. This is science (fiction).""""""",0,1.0
"""..leaders are failing to address the gravest threat our world has ever faced..."""" """"Pressuring politicians on climate change is not working.""",1,1.0
"""@1HotItalian First it was global cooling, then it was global warming, now it's climate change (AKA """"weather""""). Simple! :) #tcot""",0,0.0
"""@1kevgriff it's nature's way os saying """"Global warming my a$$!""""""",0,0.0
"""@CalebHowe So in other words, """"Global Climate Change"""" has its benefits. I could live with this.""",1,1.0
"""@KagroX Plate tectonics is one of those scientific """"theories"""" like global warming and evolution which will destroy families and raise taxes""",0,1.0


## ParamGrid

There are a lot of hyperparameters we could tune, and it would take a long time to configure them manually.

Instead of a manual (ad hoc) approach, let's use Spark's `ParamGridBuilder` to define a grid of candidate values, so we can search for the best hyperparameters more systematically.

For more information see:
* Python: <a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.ParamGridBuilder" target="_blank">ParamGridBuilder</a>

We have the following pipeline stages for which we might want to tune hyperparameters:
- CountVectorizer
- IDF
- LogisticRegression

We can see which parameters we might want to tune by calling `explainParams()` on each of those.

In [16]:
lg.explainParams()

In [17]:
counts.explainParams()

In [18]:
idf.explainParams()

In [19]:
from pyspark.ml.tuning import ParamGridBuilder

param_grid = (ParamGridBuilder()
             .addGrid(idf.minDocFreq, [0, 1, 2])
             .addGrid(lg.elasticNetParam, [0.0, 1.0])
             .addGrid(lg.regParam, [0.0, 0.01, 0.1])
             .build())
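
A grid contains every combination of the listed values, so its size is the product of the list lengths. The sketch below reproduces that enumeration with `itertools.product`, using plain strings in place of the Spark `Param` objects. The dictionary `grid_values` is a hypothetical stand-in for the grid built above.

```python
from itertools import product

# Same candidate values as the ParamGridBuilder call, keyed by plain names
grid_values = {
    "minDocFreq": [0, 1, 2],
    "elasticNetParam": [0.0, 1.0],
    "regParam": [0.0, 0.01, 0.1],
}

names = list(grid_values)
combinations = [dict(zip(names, combo)) for combo in product(*grid_values.values())]

print(len(combinations))                # 3 * 2 * 3 = 18 combinations
print(combinations[0])                  # first combination in the grid

# With 3-fold cross-validation, every combination is fitted once per fold
num_folds = 3
cv_fits = len(combinations) * num_folds
print(cv_fits)                          # 18 * 3 = 54 pipeline fits
```

Because the number of fits grows multiplicatively with each added parameter, it pays to keep the candidate lists short.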

## Setting up Cross-Validation

Now we can use 3-fold cross-validation to identify the best combination of hyperparameters.

With 3-fold cross-validation, we train on 2/3 of the data and evaluate with the remaining (held-out) 1/3. We repeat this process 3 times, so each fold gets the chance to act as the validation set. We then average the results of the three rounds.
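
The fold rotation described above can be sketched in a few lines of plain Python. This is an illustration only: `k_fold_indices` is a hypothetical helper, and it assigns rows to folds by position, whereas Spark's `CrossValidator` assigns them randomly.

```python
def k_fold_indices(n_rows, k):
    """Yield (training, validation) index lists, one pair per fold."""
    # Deal row indices into k folds like cards: 0, k, 2k, ... go to fold 0
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(n_rows=9, k=3))
for round_number, (training, validation) in enumerate(splits, start=1):
    print(f"round {round_number}: train on {training}, validate on {validation}")
```

Each row lands in the validation set exactly once, and the metric reported for a hyperparameter combination is the average over the three rounds.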

We pass in the `estimator` (our original pipeline), an `evaluator`, and an `estimatorParamMaps` to the `CrossValidator` so that it knows:
- Which model to use
- How to evaluate the model
- What hyperparameters to set on the model

We can also set the number of folds we want to split our data into (3), as well as setting a seed so we all have the same split in the data.

For more information see:
* Python: <a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator" target="_blank">CrossValidator</a>

In [22]:
from pyspark.ml.tuning import CrossValidator

evaluator = (MulticlassClassificationEvaluator()
             .setLabelCol("existence")
             .setPredictionCol("prediction")
             .setMetricName('accuracy'))

cv = (CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(param_grid)
  .setNumFolds(3)
  .setSeed(27))

In [23]:
cv_model = cv.fit(trainDF)

Now we can take a look at the average cross-validation accuracy of each hyperparameter configuration:

In [25]:
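# Pair each parameter map with its average cross-validation metric
results = list(zip(cv_model.getEstimatorParamMaps(), cv_model.avgMetrics))

# Print the parameter names and values, followed by the average accuracy
for param_map, avg_acc in results:
  params = {param.name: value for param, value in param_map.items()}
  print(params, avg_acc)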
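
To pick out the best configuration from such a list of (parameters, metric) pairs, one option is to take the maximum by metric. The sketch below uses a small stand-in list so that it runs without Spark. The parameter names mirror the grid, but the metric values are invented purely for illustration and are not results from this lab.

```python
# Stand-in for the list of (parameter map, average metric) pairs.
# The metric values below are invented; they are not output from the lab.
example_results = [
    ({"minDocFreq": 0, "regParam": 0.0}, 0.70),
    ({"minDocFreq": 1, "regParam": 0.01}, 0.75),
    ({"minDocFreq": 2, "regParam": 0.1}, 0.72),
]

# Highest average metric wins (accuracy: larger is better)
best_params, best_metric = max(example_results, key=lambda pair: pair[1])
print("best:", best_params, best_metric)

# Full ranking, best first
ranked = sorted(example_results, key=lambda pair: pair[1], reverse=True)
for params, metric in ranked:
    print(params, metric)
```

For metrics where smaller is better (such as an error rate), `min` would be used instead of `max`.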

## Use fitted pipeline to transform test data

Using our newest model, let's make a final set of predictions:

In [27]:
finalResultDF = cv_model.transform(testDF)

display(finalResultDF.groupBy("existence", "prediction").count())

existence,prediction,count
1,0.0,41
0,0.0,99
1,1.0,553
0,1.0,91
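
The grouped counts above form a confusion matrix, and accuracy can be read directly from it as the share of rows where `existence` equals `prediction`. The sketch below redoes that arithmetic in plain Python using the four counts from the table. The precision and recall lines are an extra illustration that treats label 1 as the positive class, which is an assumption made here.

```python
# (existence, prediction) -> count, copied from the table above
counts = {
    (1, 0.0): 41,
    (0, 0.0): 99,
    (1, 1.0): 553,
    (0, 1.0): 91,
}

total = sum(counts.values())
correct = sum(n for (label, pred), n in counts.items() if label == pred)
accuracy = correct / total
print(f"accuracy = {correct}/{total} = {accuracy:.4f}")

# Assumption: label 1 is treated as the positive class
true_pos = counts[(1, 1.0)]
precision = true_pos / (true_pos + counts[(0, 1.0)])
recall = true_pos / (true_pos + counts[(1, 0.0)])
print(f"precision = {precision:.4f}, recall = {recall:.4f}")
```

The 91 rows predicted as 1 but labeled 0 account for most of the errors, which accuracy alone does not reveal.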


## Evaluating the New Model

Let's see how our latest model does:

In [29]:
acc = evaluator.evaluate(finalResultDF)

print("Test ACC = %f" % acc)

## Save model pipeline

Let's save the best model our cross-validation procedure found, so we can use it in the next lab.

Check out how the `bestModel` attribute extracts the best model when using cross-validation!

In [31]:
fileName = "my_pipeline"

cv_model.bestModel.write().overwrite().save(fileName)
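
As a sketch of how the saved pipeline might be read back later: since the estimator passed to `CrossValidator` was a `Pipeline`, the saved best model should be a `PipelineModel`, which can be restored with `PipelineModel.load`. The helper name `load_saved_pipeline` is hypothetical, and the default path assumes the same `"my_pipeline"` location used above.

```python
def load_saved_pipeline(path="my_pipeline"):
    """Load a fitted pipeline previously saved with write().save(path)."""
    # Imported inside the function so it can be defined before Spark is available
    from pyspark.ml import PipelineModel

    return PipelineModel.load(path)


# Usage inside a notebook with an active Spark session (not executed here):
# reloaded_model = load_saved_pipeline("my_pipeline")
# reloaded_model.transform(testDF).select("tweet", "existence", "prediction")
```

The reloaded model carries the fitted tokenizer, vectorizer, IDF, and logistic regression stages, so it can score new tweets without retraining.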

In [32]:
# You can ignore this code, we use it for testing our notebooks.
assert acc > .82

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.