![LearnAI Header](https://coursematerial.blob.core.windows.net/assets/LearnAI_header.png)

## Introduction to Model Development with Spark

This is the first of three parts of a bootcamp on model development with [MLlib](https://spark.apache.org/docs/latest/ml-guide.html), Spark’s machine learning (ML) library.  You will gain hands-on experience with the essential steps of model development using MLlib, whose goal is to make machine learning scalable and easy.

At a high level, MLlib provides tools such as:
- ML Algorithms: common learning algorithms such as classification, regression, clustering
- Featurization: feature extraction, transformation, dimensionality reduction, and selection
- Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
- Persistence: saving and loading algorithms, models, and Pipelines
- Utilities: linear algebra, statistics, data handling, etc.

In this lab, we will cover:
- Splitting of data for training and testing
- Applying Transformers to data frames
- Fitting Estimators to our data
- Creating and executing an ML Pipeline
- Evaluating the model

## Use case

We will take a break from our usual use case of predictive maintenance and instead learn how to use these features and tools by solving a common task in *Natural Language Processing (NLP): Sentiment Analysis*.  Our dataset contains roughly 6,000 tweets about climate change.  Based on the text of a tweet, we want to predict whether the tweet supports the existence of climate change.

Other applications of Sentiment Analysis:
- Detecting negative affect in customers who are calling an automated customer hotline
- Aggregating reviews of retail products into an overall rating for each product

> The dataset was made available [here](https://www.figure-eight.com/data-for-everyone/) by Kent Cavender-Bares.

## Initialize Notebook Environment

We start by running the `mnt_blob` include notebook, which sets up access to the storage location that holds the data for this lab.

In [4]:
%run "../includes/mnt_blob"

## Load data

In [6]:
gwDF = spark.read.csv("/mnt/data/1377884570_tweet_global_warming.csv", header=True, inferSchema=True)
display(gwDF)

tweet,existence,existence.confidence
"Global warming report urges governments to act|BRUSSELS, Belgium (AP) - The world faces increased hunger and .. [link]",Yes,1
Fighting poverty and global warming in Africa [link],Yes,1
Carbon offsets: How a Vatican forest failed to reduce global warming [link],Yes,0.8786
Carbon offsets: How a Vatican forest failed to reduce global warming [link],Yes,1
URUGUAY: Tools Needed for Those Most Vulnerable to Climate Change [link],Yes,0.8087
RT @sejorg: RT @JaymiHeimbuch: Ocean Saltiness Shows Global Warming Is Intensifying Our Water Cycle [link],Yes,1
Global warming evidence all around us|A message to global warming deniers and doubters: Just look around our .. [link],Yes,1
Migratory Birds' New Climate Change Strategy: Stay Home [link],Yes,1
Southern Africa: Competing for Limpopo Water: Climate change will bring higher temperatures to Southe... [link],Yes,1
"Global warming to impact wheat, rice production in India|Ludhiana, Apr 18 : Scarcity of water will have a serious .. [link]",Yes,1


## Confirm encoding of data

Let's make sure that our data were encoded correctly. We can do this by printing out the schema.

In [8]:
gwDF.printSchema()

What do you think?

It looks like the last two columns should be encoded differently. The `existence` column might be better encoded as `integer`, and the `existence.confidence` column should be encoded as `double`.  We should also rename that last column; otherwise we will run into issues later on, because a dot in a column name clashes with the [dot notation](https://en.wikipedia.org/wiki/Dot_notation) that [object-oriented programming languages](https://en.wikipedia.org/wiki/Object-oriented_programming) such as Python (and Spark's own column expressions) use to access attributes and nested fields.

We can achieve these goals, by explicitly providing the schema for the data when reading the CSV file.

### Hands-on lab

Define the schema for the data, and read it in again. 

A good start to defining the schema is to look at the raw schema of the DataFrame.

In [11]:
gwDF.schema

This tells us that Spark inferred the following schema definition.

~~~~
from pyspark.sql.types import *

schema = StructType([
  StructField("tweet", StringType()), 
  StructField("existence", StringType()), 
  StructField("existence.confidence", StringType())])
~~~~

Rename the third column to `confidence` and change the type to Double.

In [13]:
from pyspark.sql.types import *

# put your solution here

In [14]:
# maximize this cell to see the solution

from pyspark.sql.types import *

schema = StructType([
  StructField("tweet", StringType()), 
  StructField("existence", StringType()), 
  StructField("confidence", DoubleType())])

Encoding the "Yes" and "No" responses as `Integer` can be done by first casting the column to `boolean` and then to `integer`.

In [16]:
gwDF = spark.read.schema(schema).csv("/mnt/data/1377884570_tweet_global_warming.csv", header=True, inferSchema=False)

col_name = 'existence'
# withColumn with an existing name replaces that column in place, so no drop is needed
gwDF = gwDF.withColumn(col_name, gwDF[col_name].cast("boolean").cast("integer"))
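
To make the double cast less magical, here is a plain-Python sketch of what it does to each value. It assumes that Spark's string-to-boolean cast recognizes the usual true/false spellings (such as `yes`/`no`) regardless of case and turns everything else into a null. The helper `yes_no_to_int` and the example labels are made up for illustration; this is not Spark code.

```python
# Plain-Python illustration of cast("boolean").cast("integer"); not Spark code.
# Assumption: these are the spellings treated as true / false.
TRUE_STRINGS = {"t", "true", "y", "yes", "1"}
FALSE_STRINGS = {"f", "false", "n", "no", "0"}

def yes_no_to_int(value):
    """Map a raw label to 1, 0, or None (None stands in for a Spark null)."""
    if value is None:
        return None
    text = value.strip().lower()
    if text in TRUE_STRINGS:
        return 1
    if text in FALSE_STRINGS:
        return 0
    return None  # anything unrecognized (e.g. "N/A") becomes missing

raw_labels = ["Yes", "No", "Y", "N/A", None]  # hypothetical example values
encoded = [yes_no_to_int(v) for v in raw_labels]
print(encoded)
```

Labels that are neither a "yes" nor a "no" end up as missing values, so it is worth checking for nulls after a cast like this.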

Let's confirm that the columns are now encoded correctly.

In [18]:
gwDF.printSchema()

In [19]:
display(gwDF)

tweet,existence,confidence
"Global warming report urges governments to act|BRUSSELS, Belgium (AP) - The world faces increased hunger and .. [link]",1.0,1.0
Fighting poverty and global warming in Africa [link],1.0,1.0
Carbon offsets: How a Vatican forest failed to reduce global warming [link],1.0,0.8786
Carbon offsets: How a Vatican forest failed to reduce global warming [link],1.0,1.0
URUGUAY: Tools Needed for Those Most Vulnerable to Climate Change [link],1.0,0.8087
RT @sejorg: RT @JaymiHeimbuch: Ocean Saltiness Shows Global Warming Is Intensifying Our Water Cycle [link],1.0,1.0
Global warming evidence all around us|A message to global warming deniers and doubters: Just look around our .. [link],1.0,1.0
Migratory Birds' New Climate Change Strategy: Stay Home [link],1.0,1.0
Southern Africa: Competing for Limpopo Water: Climate change will bring higher temperatures to Southe... [link],1.0,1.0
"Global warming to impact wheat, rice production in India|Ludhiana, Apr 18 : Scarcity of water will have a serious .. [link]",1.0,1.0


### End of lab

### Hands-on lab

What does the distribution of tweets that were rated as expressing belief or disbelief in climate change look like?

Let's try this in `SQL`. We begin by creating a temp view of the data frame.

HINT: Use `count()`, group by `existence` and sort.

In [23]:
gwDF.createOrReplaceTempView("gwDF_tempView")

In [24]:
%sql

-- Put your solution here

In [25]:
%sql

-- maximize this cell to see the solution

select existence, count(*) from gwDF_tempView group by existence order by existence;

existence,count(1)
,1960
0.0,1075
1.0,3055
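
The counts are easier to interpret as proportions. The quick calculation below uses only the three numbers from the query result above; the variable names are illustrative.

```python
# Counts copied from the query result above (None = missing label).
counts = {None: 1960, 0: 1075, 1: 3055}

total = sum(counts.values())
labeled = total - counts[None]

missing_share = counts[None] / total   # share of tweets without a label
yes_share = counts[1] / labeled        # share of "yes" among labeled tweets

print(f"total={total}, labeled={labeled}")
print(f"missing: {missing_share:.1%}, yes among labeled: {yes_share:.1%}")
```

Roughly a third of the tweets have no label, and about three quarters of the labeled tweets support the existence of climate change, so the classes are clearly imbalanced.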


### End of lab

One of the insights we got from the hands-on lab above is that there are a lot of missing values. Let's drop those.

In [28]:
gwDF_clean = gwDF.dropna()

In [29]:
display(gwDF_clean)

tweet,existence,confidence
"Global warming report urges governments to act|BRUSSELS, Belgium (AP) - The world faces increased hunger and .. [link]",1,1.0
Fighting poverty and global warming in Africa [link],1,1.0
Carbon offsets: How a Vatican forest failed to reduce global warming [link],1,0.8786
Carbon offsets: How a Vatican forest failed to reduce global warming [link],1,1.0
URUGUAY: Tools Needed for Those Most Vulnerable to Climate Change [link],1,0.8087
RT @sejorg: RT @JaymiHeimbuch: Ocean Saltiness Shows Global Warming Is Intensifying Our Water Cycle [link],1,1.0
Global warming evidence all around us|A message to global warming deniers and doubters: Just look around our .. [link],1,1.0
Migratory Birds' New Climate Change Strategy: Stay Home [link],1,1.0
Southern Africa: Competing for Limpopo Water: Climate change will bring higher temperatures to Southe... [link],1,1.0
"Global warming to impact wheat, rice production in India|Ludhiana, Apr 18 : Scarcity of water will have a serious .. [link]",1,1.0


It looks like we removed all missing values. Let's confirm by printing a summary table that shows the number of NaNs and nulls in each column.

In [31]:
from pyspark.sql.functions import isnan, when, count, col

# For each column, count the rows whose value is NaN or null
gwDF_clean.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in gwDF_clean.columns]).show()
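
If the one-liner above is hard to parse, the same check can be sketched in plain Python. The rows below are hypothetical and `is_missing` is a helper invented for this illustration; the idea is simply to count, per column, the values that are either null (`None`) or NaN, and to drop every row that contains at least one of them.

```python
import math

def is_missing(value):
    """True for None and for float NaN, mirroring isNull() | isnan()."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)

# Hypothetical rows, made up for illustration.
rows = [
    {"tweet": "a", "existence": 1, "confidence": 1.0},
    {"tweet": "b", "existence": None, "confidence": float("nan")},
    {"tweet": None, "existence": 0, "confidence": 0.8},
]

# Number of missing values in each column
missing_per_column = {
    column: sum(is_missing(row[column]) for row in rows)
    for column in rows[0]
}
print(missing_per_column)

# Keep only rows without any missing value, similar in spirit to dropna()
clean_rows = [r for r in rows if not any(is_missing(v) for v in r.values())]
print(len(clean_rows))
```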


Let's save the DataFrame to a Parquet file to make this easier next time.

In [33]:
gwDF_clean.write.parquet("gwDF", mode='overwrite')

## Train-Test Split

Now we can begin with developing a machine learning model. 

First, we'll split our data into training and test samples. We will use 80% for training, and the remaining 20% for testing. We set a seed to reproduce the same results (i.e. if you re-run this notebook, you'll get the same results both times).

In [35]:
(trainDF, testDF) = gwDF_clean.randomSplit([0.8, 0.2], seed=42)
trainDF.cache()
testDF.cache()
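
To see why the seed matters, here is a plain-Python sketch of a seeded random split. It assumes that every row is assigned to the training or test set by an independent random draw, which is why the resulting sizes are only approximately 80/20. The function `random_split` is hypothetical and not Spark's implementation.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))  # stand-in for 1,000 tweets
train_a, test_a = random_split(rows, 0.8, seed=42)
train_b, test_b = random_split(rows, 0.8, seed=42)

print(len(train_a), len(test_a))  # roughly 800 / 200
print(train_a == train_b)         # same seed, same split
```

Using the same seed gives the same split on every run, which keeps the accuracy numbers later in the notebook comparable between runs.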

## Feature Engineering

### Apply Tokenizer to sentences

Using the [RegexTokenizer](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.RegexTokenizer), we convert our tweets into a list of tokens.

In [37]:
from pyspark.ml.feature import RegexTokenizer

tokenizer = (RegexTokenizer()
            .setInputCol("tweet")
            .setOutputCol("tokens")
            .setPattern("\\W+"))

tokenizedDF = tokenizer.transform(gwDF_clean)

display(tokenizedDF.select('tweet','tokens').limit(5)) # Look at a few tokenized tweets

tweet,tokens
"Global warming report urges governments to act|BRUSSELS, Belgium (AP) - The world faces increased hunger and .. [link]","List(global, warming, report, urges, governments, to, act, brussels, belgium, ap, the, world, faces, increased, hunger, and, link)"
Fighting poverty and global warming in Africa [link],"List(fighting, poverty, and, global, warming, in, africa, link)"
Carbon offsets: How a Vatican forest failed to reduce global warming [link],"List(carbon, offsets, how, a, vatican, forest, failed, to, reduce, global, warming, link)"
Carbon offsets: How a Vatican forest failed to reduce global warming [link],"List(carbon, offsets, how, a, vatican, forest, failed, to, reduce, global, warming, link)"
URUGUAY: Tools Needed for Those Most Vulnerable to Climate Change [link],"List(uruguay, tools, needed, for, those, most, vulnerable, to, climate, change, link)"
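
The tokenizer's behavior on these tweets can be approximated with Python's `re` module. This is a minimal sketch, assuming the tokenizer lowercases the text, splits on the pattern `\W+` (one or more non-word characters), and drops empty tokens; `regex_tokenize` is a made-up helper, not part of Spark.

```python
import re

def regex_tokenize(text, pattern=r"\W+"):
    """Lowercase the text, split on the pattern, and drop empty tokens."""
    return [token for token in re.split(pattern, text.lower()) if token]

tokens = regex_tokenize("Fighting poverty and global warming in Africa [link]")
print(tokens)
# ['fighting', 'poverty', 'and', 'global', 'warming', 'in', 'africa', 'link']
```

Note how the brackets around `[link]` disappear: they are non-word characters and therefore act as separators.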


There are a lot of words that do not contain much information about the sentiment of the tweet (e.g. `the`, `a`, etc.). Let's remove these uninformative words using [StopWordsRemover](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.StopWordsRemover).

In [39]:
from pyspark.ml.feature import StopWordsRemover

remover = (StopWordsRemover()
          .setInputCol("tokens")
          .setOutputCol("stopWordFree"))

removedStopWordsDF = remover.transform(tokenizedDF)

display(removedStopWordsDF.select('tokens','stopWordFree').limit(5)) # Look at a few tokenized tweets without stop words

tokens,stopWordFree
"List(global, warming, report, urges, governments, to, act, brussels, belgium, ap, the, world, faces, increased, hunger, and, link)","List(global, warming, report, urges, governments, act, brussels, belgium, ap, world, faces, increased, hunger, link)"
"List(fighting, poverty, and, global, warming, in, africa, link)","List(fighting, poverty, global, warming, africa, link)"
"List(carbon, offsets, how, a, vatican, forest, failed, to, reduce, global, warming, link)","List(carbon, offsets, vatican, forest, failed, reduce, global, warming, link)"
"List(carbon, offsets, how, a, vatican, forest, failed, to, reduce, global, warming, link)","List(carbon, offsets, vatican, forest, failed, reduce, global, warming, link)"
"List(uruguay, tools, needed, for, those, most, vulnerable, to, climate, change, link)","List(uruguay, tools, needed, vulnerable, climate, change, link)"


### Hands-on lab

Where do the stop words actually come from? Spark includes a small English list as a default, which we're implicitly using here.

Look into the Spark documentation, to find out how you can get a list of the stop words.

In [41]:
# your solution goes into this cell
stopwords = []
stopwords # this prints the stopwords.

In [42]:
# maximize this cell to see the solution

stopWords = remover.getStopWords()
stopWords

Now try to remove additional stop words. For example, let's remove each occurrence of `link` from the tweets.

In [44]:
# replace this with your solution
removedStopWordsDF = remover.transform(tokenizedDF)

In [45]:
# maximize this cell to see the solution
# these two lines will remove "link" and all other defined stop words

remover.setStopWords(["link"] + stopWords)
removedStopWordsDF = remover.transform(tokenizedDF)
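
Conceptually, stop-word removal is just a filter against a list of words. The plain-Python sketch below shows the effect of adding `link` to that list; the short stop-word list and the helper `remove_stop_words` are made up for illustration, and Spark's default English list is much longer.

```python
# A tiny, made-up stop-word list; Spark's default English list is much longer.
default_stop_words = ["a", "and", "for", "how", "in", "most", "the", "those", "to"]

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word list."""
    blocked = set(stop_words)
    return [token for token in tokens if token not in blocked]

tokens = ["fighting", "poverty", "and", "global", "warming", "in", "africa", "link"]

before = remove_stop_words(tokens, default_stop_words)
after = remove_stop_words(tokens, ["link"] + default_stop_words)

print(before)  # ['fighting', 'poverty', 'global', 'warming', 'africa', 'link']
print(after)   # ['fighting', 'poverty', 'global', 'warming', 'africa']
```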

In [46]:
display(removedStopWordsDF.select('tokens','stopWordFree').limit(5)) # look at a few tokenized tweets without stop words

tokens,stopWordFree
"List(global, warming, report, urges, governments, to, act, brussels, belgium, ap, the, world, faces, increased, hunger, and, link)","List(global, warming, report, urges, governments, act, brussels, belgium, ap, world, faces, increased, hunger)"
"List(fighting, poverty, and, global, warming, in, africa, link)","List(fighting, poverty, global, warming, africa)"
"List(carbon, offsets, how, a, vatican, forest, failed, to, reduce, global, warming, link)","List(carbon, offsets, vatican, forest, failed, reduce, global, warming)"
"List(carbon, offsets, how, a, vatican, forest, failed, to, reduce, global, warming, link)","List(carbon, offsets, vatican, forest, failed, reduce, global, warming)"
"List(uruguay, tools, needed, for, those, most, vulnerable, to, climate, change, link)","List(uruguay, tools, needed, vulnerable, climate, change)"


### Word-Frequencies

Next, we need to get a numerical representation of the tweets.  A common approach is to count how often any of the words from a given vocabulary appear in each tweet. 

We use an `Estimator` for this purpose.  We first `fit` it to the DataFrame.  This process returns a model (a Transformer), which we can use to transform DataFrames.

Let's fit a [CountVectorizer](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.CountVectorizer) to learn a vocabulary from our tokens, and then use the resulting model to convert each tweet into a vector of word counts.

In [48]:
from pyspark.ml.feature import CountVectorizer

counts = (CountVectorizer()
          .setInputCol("stopWordFree")
          .setOutputCol("counts"))

countModel = counts.fit(removedStopWordsDF)

countsDF = countModel.transform(removedStopWordsDF)

### Hands-on lab

Try to answer these questions:
0. How does the `CountVectorizer` know which vocabulary to use?

### End of lab
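
To answer the question above: the vocabulary is learned during `fit` from the tokens in the DataFrame. The plain-Python sketch below illustrates the idea under simplifying assumptions: terms are ordered by how often they occur in the whole corpus, ties are broken alphabetically (a choice made only for this sketch), and each document becomes a sparse vector of size, indices, and counts. The function names are hypothetical and this is not Spark's implementation.

```python
from collections import Counter

def fit_vocabulary(documents):
    """Learn the vocabulary: terms ordered from most to least frequent."""
    term_counts = Counter(term for doc in documents for term in doc)
    # Assumption: ties are broken alphabetically to keep the sketch deterministic.
    return sorted(term_counts, key=lambda term: (-term_counts[term], term))

def transform(document, vocabulary):
    """Return (size, indices, values) for one document, like a sparse vector."""
    index_of = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index_of[t] for t in document if t in index_of)
    indices = sorted(counts)
    return len(vocabulary), indices, [float(counts[i]) for i in indices]

documents = [
    ["global", "warming", "africa"],
    ["global", "warming", "global"],
    ["climate", "change"],
]
vocabulary = fit_vocabulary(documents)
vector = transform(documents[1], vocabulary)

print(vocabulary)  # ['global', 'warming', 'africa', 'change', 'climate']
print(vector)      # (5, [0, 1], [2.0, 1.0])
```

The `counts` column displayed next can be read in a similar way: each vector contains the vocabulary size (9983), the indices of the words that occur in the tweet, and how often each of them occurs.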

In [50]:
display(countsDF.select("stopWordFree", "counts").limit(10))

stopWordFree,counts
"List(global, warming, report, urges, governments, act, brussels, belgium, ap, world, faces, increased, hunger)","List(0, 9983, List(1, 2, 20, 21, 224, 359, 462, 902, 1000, 1220, 2102, 2561, 2704), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))"
"List(fighting, poverty, global, warming, africa)","List(0, 9983, List(1, 2, 138, 167, 329), List(1.0, 1.0, 1.0, 1.0, 1.0))"
"List(carbon, offsets, vatican, forest, failed, reduce, global, warming)","List(0, 9983, List(1, 2, 39, 129, 466, 549, 702, 743), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))"
"List(carbon, offsets, vatican, forest, failed, reduce, global, warming)","List(0, 9983, List(1, 2, 39, 129, 466, 549, 702, 743), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))"
"List(uruguay, tools, needed, vulnerable, climate, change)","List(0, 9983, List(3, 4, 335, 572, 752, 1023), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0))"
"List(rt, sejorg, rt, jaymiheimbuch, ocean, saltiness, shows, global, warming, intensifying, water, cycle)","List(0, 9983, List(1, 2, 7, 116, 123, 237, 560, 570, 856, 1806, 2657), List(1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))"
"List(global, warming, evidence, around, us, message, global, warming, deniers, doubters, look, around)","List(0, 9983, List(1, 2, 18, 103, 135, 220, 236, 362, 598), List(2.0, 2.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0))"
"List(migratory, birds, new, climate, change, strategy, stay, home)","List(0, 9983, List(3, 4, 11, 204, 418, 885, 1099, 2225), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))"
"List(southern, africa, competing, limpopo, water, climate, change, bring, higher, temperatures, southe)","List(0, 9983, List(3, 4, 116, 138, 460, 494, 656, 1012, 1965, 2322, 2552), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))"
"List(global, warming, impact, wheat, rice, production, india, ludhiana, apr, 18, scarcity, water, serious)","List(0, 9983, List(1, 2, 116, 191, 233, 268, 368, 474, 812, 1488, 1535, 1575, 1662), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))"


## Classification

### Defining a decision tree

Now we define a [DecisionTreeClassifier](https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier), which we will fit to our dataset.

In [52]:
from pyspark.ml.classification import DecisionTreeClassifier

dtc = DecisionTreeClassifier().setLabelCol('existence').setFeaturesCol("counts")

## Pipeline

Let's put all of these stages into a [Pipeline](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.Pipeline). This way, you don't have to remember all of the different steps you applied to the training set, and then apply the same steps to the test dataset. The pipeline takes care of that for you!

In [54]:
from pyspark.ml import Pipeline

pipeline = Pipeline().setStages([tokenizer, remover, counts, dtc])

pipelineModel = pipeline.fit(trainDF)
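
The mechanics behind `fit` can be sketched in a few lines of plain Python. All classes below are toy stand-ins invented for this illustration, not Spark classes. The sketch assumes the following behavior: during `fit`, each stage is applied in order, Estimators are fitted on the data as transformed by the earlier stages, and the fitted pipeline then replays exactly the same steps on new data.

```python
class LowercaseTransformer:
    """A Transformer: only has transform(), there is nothing to learn."""
    def transform(self, rows):
        return [row.lower() for row in rows]

class VocabularyEstimator:
    """An Estimator: fit() learns something and returns a Transformer."""
    def fit(self, rows):
        vocabulary = sorted({word for row in rows for word in row.split()})
        return VocabularyModel(vocabulary)

class VocabularyModel:
    def __init__(self, vocabulary):
        self.vocabulary = vocabulary
    def transform(self, rows):
        known = set(self.vocabulary)
        return [[w for w in row.split() if w in known] for row in rows]

class MiniPipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data as transformed so far.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return MiniPipelineModel(fitted)

class MiniPipelineModel:
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

mini_pipeline = MiniPipeline([LowercaseTransformer(), VocabularyEstimator()])
mini_model = mini_pipeline.fit(["Global Warming", "Climate Change"])
result = mini_model.transform(["GLOBAL cooling", "climate news"])
print(result)  # [['global'], ['climate']]
```

Words that never appeared in the training rows (`cooling`, `news`) are dropped, because everything the fitted pipeline knows was learned from the training data only.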

## Evaluate

We are going to use [MulticlassClassificationEvaluator](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.MulticlassClassificationEvaluator) to evaluate our predictions (we are using the multiclass evaluator because the BinaryClassificationEvaluator does not support accuracy as a metric).

In [56]:
resultDF = pipelineModel.transform(testDF)

In [57]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator().setLabelCol("existence").setMetricName('accuracy')

init_accuracy = evaluator.evaluate(resultDF)

print("Accuracy: %(result)s" % {"result": init_accuracy})

### Feature Importance

We might be curious to know which words contribute the most to determining whether a tweet supports or denies the existence of climate change (a strong opinion one way or the other). For that, we can simply list the "feature importance" values of the trained model. This is a measure of how important a feature is to the model's prediction (positive or negative).

In [60]:
import pandas as pd

tree = pipelineModel.stages[-1]

countModel = pipelineModel.stages[2]

#  Zip the list of features with their scores
scores = zip(countModel.vocabulary, tree.featureImportances.toArray())

scores_df = pd.DataFrame(list(scores), columns=['word', 'importance'])

scores_df = scores_df.sort_values(by='importance', ascending=False)

display(scores_df.head(2**tree.depth))

word,importance
warming,0.4631159987689565
gore,0.1887165996385719
tcot,0.064276485473319
scam,0.0552576799596505
great,0.048724610961966
snow,0.0425605629736951
utah,0.0423145417901834
scandal,0.0307054572411107
news,0.0119636391240982
disprove,0.0109065107597512


### Hands-on lab (advanced!)

The first pipeline was using a very basic approach to feature engineering: It tokenized tweets into individual words, removed stop words, and counted for each word how often it appeared in each tweet.

Let's try something more interesting: we use [Word2Vec](https://en.wikipedia.org/wiki/Word2vec) to create [word embeddings](https://en.wikipedia.org/wiki/Word_embedding). Rather than relying simply on the individual words in each tweet, we will create a semantic representation that tells us which topics were touched on in each tweet. Intuitively, this will tell us a bit about what the person had in mind when writing the tweet.

In the following cell, do the following:
- Define a `Word2Vec` object and fit it to the `removedStopWordsDF` DataFrame. This will return a `Word2VecModel`. Make sure it uses `stopWordFree` as input, and `features` as output.
- Define a `LogisticRegression` model, using `existence` as labels, and `features` as features.
- Update the pipeline!

In [62]:
from pyspark.ml.feature import Word2VecModel, Word2Vec
from pyspark.ml.classification import LogisticRegression

# word2vecModel = 

# lg = 

# pipeline = 

In [63]:
# maximize this cell to see the solution

from pyspark.ml.feature import Word2VecModel, Word2Vec
from pyspark.ml.classification import LogisticRegression

word2vecModel = Word2Vec(inputCol='stopWordFree', outputCol='features', vectorSize=100).fit(removedStopWordsDF)

lg = LogisticRegression().setLabelCol('existence').setFeaturesCol("features")

pipeline = Pipeline().setStages([tokenizer, remover, word2vecModel, lg])
pipelineModel_trained = pipelineModel.copy()
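
How does a list of word vectors become a single `features` vector per tweet? A common approach, and the one this sketch assumes, is to average the vectors of the words in the tweet. The three-dimensional embeddings below are invented for illustration (the model above uses 100 dimensions), and skipping unknown words is a simplification of this sketch.

```python
# Hypothetical 3-dimensional word vectors, made up for illustration.
word_vectors = {
    "global":  [1.0, 0.0, 2.0],
    "warming": [3.0, 2.0, 0.0],
    "hoax":    [-1.0, 4.0, 1.0],
}

def document_vector(tokens, vectors, size=3):
    """Average the vectors of all known tokens; zeros if none are known."""
    # Assumption of this sketch: tokens without a vector are skipped.
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    return [sum(component) / len(known) for component in zip(*known)]

features = document_vector(["global", "warming"], word_vectors)
print(features)  # [2.0, 1.0, 1.0]
```

Unlike the word counts used before, the resulting vector has the same small, dense size for every tweet, which is what the logistic regression receives as input.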

Let's see whether this pipeline performs better than the basic model.

In [65]:
pipelineModel = pipeline.fit(trainDF)

resultDF = pipelineModel.transform(testDF)

accuracy_w2v = evaluator.evaluate(resultDF)

print("Accuracy: %(result)s" % {"result": accuracy_w2v})

OK, there is a small improvement. Can we do better than that?

One problem with our dataset is that it is very small, maybe too small to create quality word embeddings. 

Let's try to use word embeddings that were created with a very large set of tweets. These embeddings are based on 2B tweets and a vocabulary with 1.2M entries.

We got these embeddings from [here](https://nlp.stanford.edu/projects/glove/).

> Note: this takes about **2 minutes** to run.

In [67]:
from pyspark.ml.feature import Word2VecModel, Word2Vec
from pyspark.ml.classification import LogisticRegression

word2vecModel = Word2VecModel.load('dbfs:/mnt/data/myWord2VecModelTwitter')

In [68]:
pipeline = Pipeline().setStages([tokenizer, remover, word2vecModel, lg])

pipelineModel = pipeline.fit(trainDF)

resultDF = pipelineModel.transform(testDF)

accuracy_w2v_twitter = evaluator.evaluate(resultDF)

print("Accuracy: %(result)s" % {"result": accuracy_w2v_twitter})

Now let's compare the performance of our three models against each other and against a baseline accuracy: the accuracy we would get by always guessing that a tweet supports the existence of climate change, i.e. the share of such tweets in the test set.

In [70]:
# calculate baseline accuracy
exists = testDF.filter("existence == 1").count()
totalMeasures = testDF.count()

print("w2v model (twitter): {0:.1f}%, w2v model: {1:.1f}%, initial model: {2:.1f}%, baseline accuracy: {3:.1f}% ({4:d}/{5:d})".format(accuracy_w2v_twitter*100, accuracy_w2v*100, init_accuracy*100, (exists/totalMeasures*100), exists, totalMeasures))

When you are done defining your Word2Vec estimator and updating your pipeline, run *all* the cells below again and see whether the performance of your model has changed.

### End of (optional) hands-on lab.

## Evaluating the predictions

Let's take a closer look at the resulting predictions and compare them to the ground truth.

In [73]:
display(resultDF.select('tweet', 'existence', 'prediction'))

tweet,existence,prediction
"""""""@NASA Climate Change"""" is now on #Facebook. Become a fan & keep up w/ the #climate science buzz http://bit.ly/dzKcEq RT @Flipbooks""",1,1.0
"""""""All 30 Major League Baseball Teams Throw Curve to Climate Change Deniers : CleanTechnica"""" http://j.mp/ars7W2 #cleantech #greentech #MLB""",1,1.0
"""""""Any"""" = legitimate efforts by scientists to mislead and missrepresent their global warming findings. I haven't heard any implications yet.""",1,0.0
"""""""Kerry Graham Lieberman Climate Bill - KGL Global Warming Energy Bill - thedailygreen.com"""" http://j.mp/adUkuK""",1,1.0
"""""""Political talk shows discuss global warming. This is science (fiction).""""""",0,0.0
"""..leaders are failing to address the gravest threat our world has ever faced..."""" """"Pressuring politicians on climate change is not working.""",1,1.0
"""@1HotItalian First it was global cooling, then it was global warming, now it's climate change (AKA """"weather""""). Simple! :) #tcot""",0,1.0
"""@1kevgriff it's nature's way os saying """"Global warming my a$$!""""""",0,1.0
"""@CalebHowe So in other words, """"Global Climate Change"""" has its benefits. I could live with this.""",1,1.0
"""@KagroX Plate tectonics is one of those scientific """"theories"""" like global warming and evolution which will destroy families and raise taxes""",0,0.0


#### Confusion Matrix

Let's see whether we had more false positives or false negatives.

In [75]:
display(resultDF.groupBy("existence", "prediction").count())

existence,prediction,count
1,0.0,50
0,0.0,98
1,1.0,544
0,1.0,92
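
The four counts above are all we need to compute the usual summary metrics by hand. The calculation below treats `existence = 1` as the positive class and uses only the numbers from the table; the variable names are illustrative.

```python
# (existence, prediction) -> count, copied from the table above.
confusion = {(1, 0): 50, (0, 0): 98, (1, 1): 544, (0, 1): 92}

tp = confusion[(1, 1)]  # labeled 1, predicted 1
tn = confusion[(0, 0)]  # labeled 0, predicted 0
fp = confusion[(0, 1)]  # labeled 0, predicted 1
fn = confusion[(1, 0)]  # labeled 1, predicted 0
total = tp + tn + fp + fn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
baseline = max(tp + fn, tn + fp) / total  # always predict the majority class

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} baseline={baseline:.3f}")
# accuracy=0.819 precision=0.855 recall=0.916 baseline=0.758
```

There are more false positives (92) than false negatives (50), so the model errs toward predicting that a tweet supports the existence of climate change. Its accuracy of about 82% is a moderate gain over the majority-class baseline of about 76%.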


## Closing Remarks

We learned how to set up a basic pipeline for sentiment analysis. We then improved our model in several steps: first by using Word2Vec semantic embeddings, then by using semantic embeddings from a GloVe model that was trained on 2B tweets.

In a later lab, we will see how to apply this pipeline to streaming data!

In [77]:
# You can ignore this code, we use it for testing our notebooks.
assert accuracy_w2v_twitter > .80

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.